Tuesday, August 21, 2012

Automated Static Malware Analysis with Pythonect

About five months ago I released the first version of Pythonect - a new, experimental, general-purpose high-level dataflow programming language based on Python, written in Python.
It aims to combine the intuitive feel of shell scripting (and all of its perks like implicit parallelism) with the flexibility and agility of Python.

Crazy? Most definitely. And yet, strangely enough, it works!

Pythonect, being a dataflow programming language, treats data as something that originates from a source, flows through a number of processing components, and arrives at some final destination.
As such, it is most suitable for creating applications that are themselves focused on the "flow" of data. Perhaps the most readily available example of a dataflow-oriented application comes from the realm of real-time signal processing, e.g. a video signal processor that starts with a video input, modifies it through a number of processing components (video filters), and finally outputs it to a video display.

As with video, malware analysis can be expressed as a network of different components, such as disassemblers, regular expressions, and debuggers, that are connected by a number of communication channels.
The benefits, and perhaps the greatest incentives, of expressing malware analysis this way are scalability and parallelism. The different components in the network can be rearranged to create entirely unique dataflows without requiring the relationships between them to be hardcoded. Also, the design and concept of components make it easier to run on distributed systems and parallel processors.

In this tutorial I will show you how to automate static malware analysis using Pythonect. The examples will be simple enough that you can extend them if you want to.
Before you read this tutorial you should have at least a basic knowledge of x86 Assembly, Python, and Pythonect (I recommend reading the Pythonect Tutorial: Learn By Example).

Note: I have decided to go with static malware analysis because it's easier to demonstrate, and to use open source tools because they are more accessible. This does not mean that Pythonect or dataflow programming cannot be used to automate dynamic malware analysis, or be integrated with commercial software. The only limit is your imagination.

There isn't exactly a "Hello, world" program in the malware analysis realm, so I will start with my equivalent of "Hello, world", an example program that computes an MD5 digest of a file:
"MALWARE.EXE" -> os.system("/usr/bin/md5sum " + _)
The program above uses the md5sum program of GNU coreutils to compute and print MALWARE.EXE's MD5 digest. Let's extend it to compute MALWARE.EXE's SHA1 digest as well:
"MALWARE.EXE" -> [os.system("/usr/bin/md5sum " + _), os.system("/usr/bin/sha1sum " + _)]
The new program above uses the md5sum and sha1sum of GNU coreutils to compute and print MALWARE.EXE's MD5 and SHA1 digests. Let's keep improving it:
sys.argv[1] -> [os.system("/usr/bin/md5sum " + _), os.system("/usr/bin/sha1sum " + _)]
Now, the new program reads the malware filename from a command-line argument. To run the script just save it (e.g. md5_and_sha1_sums) and run the Pythonect interpreter like this:
% /usr/local/bin/pythonect md5_and_sha1_sums /bin/ls
92385e9b8864032488e253ebde0534c3  /bin/ls
8800fee57584ed1c44b638225c2f1eec818a27c2  /bin/ls
Often, the goal is to handle the large volume of malware samples collected each day. Let's change the program to work on all the executables (i.e. *.EXE) in the current directory:
glob.glob('*.EXE') -> [os.system("/usr/bin/md5sum " + _), os.system("/usr/bin/sha1sum " + _)]
Of course, it can be further fine-tuned or customized at will. It's also worth mentioning that the program above is multi-threaded: each file is processed in its own thread.
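For comparison, here is a rough plain-Python sketch of the same idea: every file matching a pattern is hashed as its own task on a thread pool. It is not how Pythonect is implemented. It assumes Python 3, uses hashlib instead of shelling out to md5sum and sha1sum, and the names digests() and hash_all() are made up for illustration.

```python
import glob
import hashlib
from concurrent.futures import ThreadPoolExecutor


def digests(path):
    """Return (path, MD5 hex digest, SHA1 hex digest) for one file."""
    with open(path, "rb") as handle:
        data = handle.read()
    return path, hashlib.md5(data).hexdigest(), hashlib.sha1(data).hexdigest()


def hash_all(pattern):
    """Hash every file matching pattern, one task per file, on a thread pool."""
    paths = sorted(glob.glob(pattern))
    with ThreadPoolExecutor() as pool:
        # pool.map keeps the results in the same order as paths
        return list(pool.map(digests, paths))


if __name__ == "__main__":
    for path, md5, sha1 in hash_all("*.EXE"):
        print(md5, sha1, path)
```

The one-line Pythonect program expresses the same fan-out without any of this plumbing, which is the point of the dataflow notation.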

So far, I have used Python's os.system() function in all of the example programs. os.system() is handy when it comes to writing small scripts: it executes a command in a subshell and returns its exit status.
But since there is little interest in passing the exit status to another component, a more advanced script needs a function that returns the command's output instead: subprocess.check_output().
"MALWARE.EXE" -> subprocess.check_output(['/usr/bin/md5sum', _]) -> print
Much like the original example program, the program above uses the md5sum program of GNU coreutils to compute MALWARE.EXE's MD5 digest, but prints the result using Pythonect's print() function.
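To make the difference concrete, here is a small plain-Python sketch (assuming Python 3) of what subprocess.check_output() hands to the next component. So that it runs without md5sum installed, it uses the current Python interpreter as a stand-in command; first_field(), fake_md5sum and the digest-like string are hypothetical and only for illustration.

```python
import subprocess
import sys


def first_field(argv):
    """Run argv without a shell and return the first field of its output."""
    # check_output returns the command's stdout as bytes and raises
    # CalledProcessError if the command exits with a non-zero status.
    output = subprocess.check_output(argv)
    return output.decode().split()[0]


# Stand-in for md5sum: prints a line shaped like "<digest>  <filename>"
fake_md5sum = [sys.executable, "-c", "print('0123abcd  MALWARE.EXE')"]
digest = first_field(fake_md5sum)

# A failing command surfaces as an exception carrying the exit status
failing = [sys.executable, "-c", "import sys; sys.exit(3)"]
try:
    subprocess.check_output(failing)
    exit_status = 0
except subprocess.CalledProcessError as error:
    exit_status = error.returncode
```

Passing the command as a list, as the Pythonect example does, also avoids the shell, so a filename containing spaces or shell metacharacters is treated as a single argument.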

Moving on. The Python Standard Library is a rich set of libraries (modules and packages) for tackling just about every programming task. For example:
"MALWARE.EXE" -> open(_, 'rb').read() -> hashlib.md5 -> _.hexdigest() -> print
The program above is an alternative to the original example program. It uses hashlib.md5() from Python's hashlib module to compute MALWARE.EXE's MD5 digest and Pythonect's print() to display it. What else?
"MALWARE.EXE" \
    -> open(_, 'rb').read() \
    -> [re.finditer("\xcc", _), re.finditer("\xcd\x03", _)] \
    -> print "Found INT3 between Offset #%d and #%d" % _.span(0)
The program above searches MALWARE.EXE for all occurrences of the two INT 3 encodings (0xCC and 0xCD 0x03), and prints the start and end offsets of each match.
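The same search can be sketched in plain Python. This version assumes Python 3, where file contents are bytes and the pattern therefore has to be a bytes pattern. It folds both encodings into a single regular expression, and find_int3() is a name chosen here for illustration.

```python
import re

# The two INT 3 encodings searched for above: 0xCC and 0xCD 0x03
INT3_PATTERN = re.compile(b"\xcc|\xcd\x03")


def find_int3(data):
    """Return (start, end) byte offsets of every INT 3 byte pattern in data."""
    return [match.span() for match in INT3_PATTERN.finditer(data)]


# Tiny made-up sample: NOP, INT3, NOP, INT 3 (two-byte form), NOP
sample = b"\x90\xcc\x90\xcd\x03\x90"
for start, end in find_int3(sample):
    print("Found INT3 between Offset #%d and #%d" % (start, end))
```

Keep in mind that this is a raw byte search, not a disassembly: 0xCC also shows up as padding between functions and inside operands or data, so a hit is a lead to follow up on rather than proof of a breakpoint.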

Now, for the times when the Python Standard Library doesn't have what you are looking for, you can always implement it yourself in Python:
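import math

def entropy(data):
    """Return the Shannon entropy of data, in bits per byte (0.0 to 8.0)."""
    result = 0.0
    if data:
        for x in range(2**8):
            # Probability of byte value x occurring in data
            p_x = float(data.count(chr(x))) / len(data)
            if p_x > 0:
                result -= p_x * math.log(p_x, 2)
    return result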
The above is an implementation of Shannon's entropy equation in Python. To use it, simply save it (e.g. entropy.py), and reference it in a program:
"MALWARE.EXE" -> open(_, 'rb').read() -> entropy.entropy -> print
The program above uses entropy() of entropy.py to measure and print MALWARE.EXE's entropy. To conclude this tutorial, let's tweak it one more time:
"MALWARE.EXE" -> subprocess.check_output(['/usr/bin/objcopy', '-O', 'binary', '-j', '.text', _, '/dev/stdout']) -> entropy.entropy -> print
Now, the program above uses entropy() of entropy.py to measure and print the entropy of MALWARE.EXE's .text section, which it extracts with objcopy of GNU binutils.
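For readers trying this on Python 3, where reading a file in binary mode yields bytes rather than a str, the entropy calculation can be sketched as follows. It is a variant of the function above, not a drop-in copy: it counts byte values in one pass with collections.Counter instead of calling count() 256 times.

```python
import math
from collections import Counter


def entropy(data):
    """Return the Shannon entropy of a bytes object, in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    # Iterating over bytes in Python 3 yields integers in the range 0..255
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


print(entropy(b"aaaa"))
print(entropy(bytes(range(256))))
```

The result ranges from 0 (every byte identical) to 8 (all 256 byte values equally likely). Values close to 8 usually point to compressed or encrypted data, which is why a high-entropy .text section is a common hint that an executable is packed.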

Pythonect is still under heavy development. There's a ton of unimplemented features and even more bugs. It's not ready for production yet, but you can still start to play with it and have plenty of fun!

That's all for now.