Tech & Engineering Blog

Katalina: an open-source Android string deobfuscator

Why am I frying my brain with this?

Android malware has long relied on basic string obfuscation techniques to make analysts suffer while deobfuscating it. Right now, there are very few free tools that can deobfuscate Android malware. Because of that, the Android malware ecosystem is severely skewed in favor of the attackers and/or big corporations that can afford either the man-hours or the big bucks needed for commercial tools. Katalina tries to fix that by giving the community a free tool that researchers can extend and enrich.

Katalina takes a novel-ish (at least for Android) approach to deobfuscating Android malware, FOR FREE. Think of it like JEB’s string deobfuscation feature, except that the bytecode is executed in a better-sanitized sandbox (a VM written from scratch). It’s released under the GNU GPL, so you can do whatever you want with it as long as that includes bashing malware!

Katalina is a completely new tool that executes Android bytecode, enabling users to deobfuscate strings found in Android malware. It’s written in Python and emulates each individual bytecode instruction that was used back in the day for Android. Its main purpose is emulating functions that hide strings behind complex logic, the kind that eats up good analyst hours to deobfuscate. The tool works with most modern obfuscator strains.

The cloak of invisibility

The current state of the art in mass string deobfuscation relies on two techniques: the first is manually executing/analyzing the sample and hoping to get some hits on the methods with the interesting strings, while the second is forking over big bucks for some well-known tools in the industry. 

Both the level of effort and the financial impact of these methods can severely hinder an independent researcher's ability to tackle modern Android malware. My solution is simple: build an environment that can execute Android bytecode one instruction at a time. 

While this isn’t a new approach (the Unicorn CPU emulator comes to mind), there is no such tool available for the Android ecosystem. A tool like this lets researchers speed up their reversing efforts and tackle more intricate and advanced malware.

As of 2023, malware authors are using advanced obfuscation techniques that make detection and analysis of malicious code significantly more challenging. Techniques like encryption, reflection, dynamic loading, and anti-debugging are commonly employed to hide the actual functionality and intent of the malicious code. The prevalence of packers, which can further complicate static and dynamic analysis, is also increasing. Despite advances in detection methodologies, the constant arms race between malware authors and security researchers continues to escalate. It's becoming clear that the future of malware detection and prevention will likely require more sophisticated machine learning techniques and even deeper systems knowledge to keep up with these advanced obfuscation methods.

Code obfuscation, which includes renaming packages, classes, methods, and variables to meaningless or misleading names, is the primary tactic. It can significantly impact the readability of decompiled code, making it hard for researchers to understand the malware's functionality.

Dynamic code loading is another common obfuscation method, in which malware may download additional code during runtime, or load encrypted classes from the application's resources, circumventing static analysis. API and SDK-level obfuscation is also widely used in mobile malware. Malicious code could leverage reflective calls to APIs to obscure the program’s control flow or misuse legitimate SDKs, masking malicious behaviors behind regular operations. 

Furthermore, advanced encryption techniques, such as AES or RSA, can be used to encrypt malicious payloads or communication with command and control servers, making it harder to understand the malware's intent. Lastly, anti-analysis techniques like detecting the presence of a debugger, emulator, or sandbox environment are commonly used to inhibit dynamic analysis. If such an environment is detected, the malware might alter its own behavior or even stop executing entirely.

All of these techniques act like a cloak of invisibility, or better yet, Saruman’s twisted clothing (I mean the book version, not the movie one), confusing analysts about the malware's true intentions. However, we only need to pierce one of the layers mentioned above in order to piece together a robust detection technique. I found that the easiest problem to tackle was string obfuscation. Once I could break this, it would give me a pretty solid way of finding new variants of the same sample no matter the obfuscator employed.

Dazed and confused
source: The Many Colors of Saruman by Harold Jig

This is because malware often contains strings that are obfuscated to hide malicious intent, such as C2 server URLs, key strings used in encryption or decryption, and specific identifiers that can be traced back to the malware author or group. Through string deobfuscation, these hidden details can be uncovered, which can provide valuable insights into the malware's behavior and its origin. 

The deobfuscated strings can be used to create a fingerprint of the malware, identifying unique patterns or artifacts that can be used to track the evolution of the malware, its variants, or even link different malware families back to the same author or group. This kind of information can be vital in creating effective detection rules and in threat intelligence efforts to anticipate or understand the broader strategy of the adversaries. 

Usually this is an arduous, manual task that involves either going through each sample by hand or writing decryptors for each individual obfuscator version. This does not scale and can seriously impact an analyst’s workload. Katalina aims to speed this up by executing the bytecode that does the obfuscation.
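To make the problem concrete, here is a toy version of the kind of routine an analyst ends up rewriting by hand, in Python rather than Dalvik bytecode. Everything in it is made up for illustration (the rolling XOR scheme, the function names, the key and the sample URL); real obfuscators vary widely and are usually far more convoluted.

```python
def encode(text, key):
    """Toy obfuscator: XOR each byte with a rolling single-byte key."""
    data = text.encode("utf-8")
    return bytes(b ^ ((key + i) & 0xFF) for i, b in enumerate(data))


def decode(blob, key):
    """Matching hand-written decryptor: XOR is its own inverse."""
    plain = bytes(b ^ ((key + i) & 0xFF) for i, b in enumerate(blob))
    return plain.decode("utf-8")


# Hypothetical C2 URL and key, purely for demonstration.
blob = encode("http://c2.example/gate", 0x5A)
print(blob.hex())          # what the analyst sees in the sample
print(decode(blob, 0x5A))  # what the malware sees at runtime
```

Every new obfuscator version changes the key schedule or the arithmetic a little, which is why a per-version decryptor like this stops being fun very quickly.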

Dalvik - blessing and curse

Dalvik is a discontinued virtual machine (VM) in Google's Android operating system that executes applications written for Android. Named after an Icelandic fishing village, Dalvik was designed and written by Dan Bornstein, who sought to have a reliable and compact virtual machine to run on mobile devices where memory and processing power are limited.

Dalvik was an integral part of Android at its launch in 2008. The bytecode it executed, stored in the Dalvik Executable (.dex) format, was designed to be memory-efficient and optimized for mobile devices with limited processing power and battery life. From Android 2.2 onward, Dalvik also featured a Just-In-Time (JIT) compiler, which provided a balance between the low memory footprint of interpreted code and the speed of native code execution.

However, with the advent of Android 5.0 Lollipop in 2014, Android transitioned from the Dalvik virtual machine to the Android Runtime (ART). ART offers ahead-of-time (AOT) compilation and improved garbage collection, which significantly enhance app performance, along with better development and debugging support.

Still, Dalvik and Android bytecode remain relevant. Despite the transition to ART, applications are still written in Java or Kotlin and compiled into .dex bytecode, which is then translated into native code. The bytecode form of the application can still be examined for reverse engineering or analysis purposes.

In many malware samples, strings are obfuscated to hide malicious behavior or intent, making it challenging to identify what the malware is doing or who it's communicating with. By running the bytecode that is responsible for the obfuscation in a controlled environment, these strings can be deobfuscated at runtime, revealing the concealed information.

Before Katalina
source: 2877b27f1b6c7db466351618dda4f05d6a15e9a26028f3fc064fa144ec3a1850

The ability to execute bytecode and retrieve these deobfuscated strings provides a powerful tool in malware fingerprinting. We can create specific detection patterns based on these unique strings, and thereby track malware families, identify new variants, or even uncover connections between seemingly disparate malware samples.
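As a rough illustration of what fingerprinting on deobfuscated strings could look like, here is a minimal sketch. It is not part of Katalina: the function names, the choice of SHA-256 and Jaccard similarity, and the sample strings are all my own assumptions.

```python
import hashlib


def fingerprint(strings):
    """Order-independent hash over the set of deobfuscated strings."""
    joined = "\n".join(sorted(set(strings))).encode("utf-8")
    return hashlib.sha256(joined).hexdigest()


def similarity(a, b):
    """Jaccard similarity between the string sets of two samples."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical strings recovered from two samples.
sample_a = ["http://c2.example/gate", "AES/CBC/PKCS5Padding", "bot_id"]
sample_b = ["http://c2.example/gate", "AES/CBC/PKCS5Padding", "bot_uid"]

print(fingerprint(sample_a) == fingerprint(sample_b))
print(similarity(sample_a, sample_b))
```

An exact hash only catches identical string sets, while a similarity score is more forgiving when a new variant changes a handful of strings.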

Hopefully Katalina can fill this gap in the ecosystem, giving malware analysts a tool that can automate a previously tedious process and spit out meaningful IoCs.

The innards

Here be dragons

The premise is simple: implement how each Android bytecode instruction works, parse the .dex file using Kaitai Struct, and then piece together the instructions and execute them. Because only the main .dex files will be loaded, Katalina greatly outpaces other tools. In turn, I’m mocking a lot of framework calls (like base64 decode, array classes/operations and many others) via Python’s own implementations.
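The execute-one-instruction-at-a-time idea can be sketched as a tiny register machine. This is a simplified illustration and not Katalina's actual code: the tuple-based instruction format, the handler names and the absolute branch targets are assumptions made to keep the example short (real Dalvik uses relative branch offsets and places arguments in the highest registers).

```python
def to_int32(value):
    """Wrap a Python int to a signed 32-bit value, like a Dalvik int register."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def op_const(regs, dst, value):
    regs[dst] = to_int32(value)


def op_add_int(regs, dst, a, b):
    regs[dst] = to_int32(regs[a] + regs[b])


def op_xor_int(regs, dst, a, b):
    regs[dst] = to_int32(regs[a] ^ regs[b])


# Opcode name -> handler; a real VM needs one entry per Dalvik opcode.
HANDLERS = {"const": op_const, "add-int": op_add_int, "xor-int": op_xor_int}


def run(program, args):
    """Execute a list of (opcode, operands...) tuples and return the result."""
    regs = dict(enumerate(args))  # simplification: arguments land in v0, v1, ...
    pc = 0
    while pc < len(program):
        op, *operands = program[pc]
        if op == "return":
            return regs[operands[0]]
        if op == "if-lt":  # branch target is an absolute index in this sketch
            a, b, target = operands
            pc = target if regs[a] < regs[b] else pc + 1
            continue
        HANDLERS[op](regs, *operands)
        pc += 1
    raise RuntimeError("method ended without a return")


# (v0 + v1) ^ 0x2A
program = [
    ("const", 3, 0x2A),
    ("add-int", 4, 0, 1),
    ("xor-int", 4, 4, 3),
    ("return", 4),
]
print(run(program, [1, 2]))  # 41
```

Keeping each opcode in its own small handler means that supporting a new instruction is just one more function and one more dictionary entry.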

 

High level overview of Katalina

Everything can be found under mocks.py and follows a very simple format: the fully qualified class name also used by Android, with a bit of a twist: replacing the / with _. Just make sure to rename constructor functions (_init_) to 0init0 and you’re set. Follow the other examples and all calls to the original framework function will be redirected to Python.
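The naming scheme can be sketched roughly like this. Treat it as an illustration under assumptions rather than a copy of mocks.py: how the method name is joined to the class name, the stripping of the leading L and trailing ;, and the mock_name, invoke and MOCKS names are all hypothetical, so check the existing mocks for the real convention.

```python
import base64


def mock_name(class_descriptor, method):
    """Map a Dalvik class descriptor and method name to a Python mock name."""
    cls = class_descriptor
    if cls.startswith("L") and cls.endswith(";"):
        cls = cls[1:-1]  # assumption: drop the descriptor's L...; wrapper
    if method == "<init>":  # constructors get the 0init0 name
        method = "0init0"
    return cls.replace("/", "_") + "_" + method


def android_util_Base64_decode(data, flags=0):
    """Hypothetical mock: android.util.Base64.decode backed by Python's base64."""
    return base64.b64decode(data)


# Registry of available mocks, keyed by their mangled name.
MOCKS = {f.__name__: f for f in (android_util_Base64_decode,)}


def invoke(class_descriptor, method, *args):
    """Redirect a framework call to its Python mock, if one exists."""
    handler = MOCKS.get(mock_name(class_descriptor, method))
    if handler is None:
        raise NotImplementedError(mock_name(class_descriptor, method))
    return handler(*args)


print(mock_name("Ljava/lang/String;", "<init>"))
print(invoke("Landroid/util/Base64;", "decode", "aGVsbG8=", 0))
```

The appeal of this approach is that the VM never needs the real framework: a missing mock fails loudly, and adding one is a single Python function.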

No laughing matter for this mocking

How do I use Katalina?

Easy: just extract the .dex file from the sample and run the tool against it. While currently multidex is not supported, it’s a planned feature on the roadmap, so stay tuned for more :)

python3 main.py -xe classes.dex

It should now look for common Android entry points, execute them and spit out all the strings and the functions that obfuscate them.

After Katalina
source: 2877b27f1b6c7db466351618dda4f05d6a15e9a26028f3fc064fa144ec3a1850

If you only want to execute a certain function, you can do so with the -x flag, followed by the parameters you want to call the function with:

python3 main.py classes.dex -x 'Lcom/njzbfugl/lzzhmzl/App;->$(III)' '67,84,1391'
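The method reference in the command above uses the familiar smali-style notation: class descriptor, ->, method name, then the parameter types in parentheses (III means three ints). Here is a small sketch of how such a reference and its argument string could be picked apart. It is not how Katalina necessarily does it; split_method_ref and parse_int_args are hypothetical helpers, and only integer arguments are handled.

```python
import re

# One parameter type: any number of array markers, then a primitive or an object type.
PARAM_TYPE = re.compile(r"\[*(?:[ZBSCIJFD]|L[^;]+;)")


def split_method_ref(ref):
    """Split 'Lpkg/Cls;->name(params)' into (class, method name, [param types])."""
    cls, _, rest = ref.partition("->")
    name, _, tail = rest.partition("(")
    params = tail.split(")", 1)[0]
    return cls, name, PARAM_TYPE.findall(params)


def parse_int_args(raw, param_types):
    """Turn '67,84,1391' into ints and check the count against the descriptor."""
    values = [int(v) for v in raw.split(",")]
    if len(values) != len(param_types):
        raise ValueError("expected %d arguments, got %d" % (len(param_types), len(values)))
    return values


cls, name, types = split_method_ref("Lcom/njzbfugl/lzzhmzl/App;->$(III)")
print(cls, name, types, parse_int_args("67,84,1391", types))
```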

Another useful feature is the ability to specify a denylist. Any class/function that matches the comma-separated denylist strings will be skipped and not executed. This allows us to bypass a lot of boilerplate code and focus on actual obfuscated code. A must-have denylist entry that I use all the time is androidx. By skipping the inner workings of this Android library, we can focus on executing the actual malicious code.
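Conceptually, the denylist check boils down to something like the sketch below. It assumes plain substring matching against the full method reference, which may not be exactly how Katalina matches entries; parse_denylist and is_denied are hypothetical names, and the androidx method reference is made up for the example.

```python
def parse_denylist(raw):
    """Split a comma-separated denylist string into clean entries."""
    return [entry.strip() for entry in raw.split(",") if entry.strip()]


def is_denied(method_ref, denylist):
    """Assumption: an entry matches if it appears anywhere in the method reference."""
    return any(entry in method_ref for entry in denylist)


denylist = parse_denylist("androidx")
methods = [
    "Landroidx/core/app/ComponentActivity;->onCreate(Landroid/os/Bundle;)V",
    "Lcom/njzbfugl/lzzhmzl/App;->$(III)",
]

# Only methods that match no denylist entry get executed.
to_run = [m for m in methods if not is_denied(m, denylist)]
print(to_run)
```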

Credits

This couldn’t have been achieved without Gabor Paller’s amazing work! He went through the painstaking job of documenting each opcode call and also adding some simple examples that served as a quick sanity (or what remained of it) check.

http://pallergabor.uw.hu/androidblog/dalvik_opcodes.html

In addition, Kaitai Struct did an amazing job at almost flawlessly parsing dex files:

https://kaitai.io/

And of course, nothing could have been done without Android’s (albeit a bit lacking at times) documentation:

https://source.android.com/docs/core/runtime/dalvik-bytecode

https://source.android.com/docs/core/runtime/dex-format