Wednesday, August 7, 2013

Pythonect Has New Graphs, Documentation, Tutorial, and More!

About two weeks ago I released a new version of Pythonect (0.6) with new features, documentation, a tutorial, and a (small, but growing) examples directory.
I’d like to take this opportunity to discuss the past, present and future of the Pythonect Project.

Nearly two years ago I started working on Pythonect with the intention of helping software developers connect the dots and making mashups, rapid prototyping, and the development of scalable distributed applications easy. Pythonect is a new, experimental, general-purpose dataflow programming language based on Python. It aims to combine the intuitive feel of shell scripting (and all of its perks like implicit parallelism) with the flexibility and agility of Python. The Pythonect interpreter (and reference implementation) is free and open source software written completely in Python, and is available under the BSD 3-Clause license.

Why Pythonect? Pythonect, being a dataflow programming language, treats data as something that originates from a source, flows through a number of processing components, and arrives at some final destination. As such, it is most suitable for creating applications that are themselves focused on the "flow" of data. Perhaps the most readily available example of a dataflow-oriented application comes from the realm of real-time signal processing, e.g. a video signal processor which starts with a video input, modifies it through a number of processing components (video filters), and finally outputs it to a video display.

As with video, many applications can be expressed as a network of different components that are connected by a number of communication channels. The benefits, and perhaps the greatest incentives, of expressing an application this way are scalability and parallelism. The different components in the network can be rearranged to create entirely unique dataflows without necessarily requiring the relationships to be hardcoded. Also, the design and concept of components make it easier to run on distributed systems and parallel processors.

Here is the canonical "Hello, world" example program in Pythonect:
"Hello, world" -> print
And here is the canonical "Hello, world" multi-threaded example program in Pythonect:
"Hello, world" -> [print, print]
Not to mention that you can go from multi-threaded to multi-processed as easily as:
"Hello, world" -> [print &, print &]
Or remotely call a procedure using XML-RPC:
"Hello, world" -> print@xmlrpc://localhost:8081
The language couldn't possibly be simpler...
Okay, so what's new, you ask? *I was wrong*, it can be simpler, and it is in Pythonect version 0.6 :-)

In Pythonect 0.6.0 I have rewritten the engine and large parts of the backend. Pythonect now uses a graph (NetworkX DiGraph) as its data structure, and it also supports multiple file formats as input. Currently, Pythonect (since version 0.6) supports 3 file formats:
  • *.P2Y (text-based scripting language that aims to combine the quick and intuitive feel of shell scripting with the power of Python)
  • *.DIA (visual programming language enabled by Dia)
  • *.VDX (visual programming language enabled by Microsoft Visio XML)
In other words:


is equal to:
"Hello, world" -> print
And vice versa. You can launch (almost) any graph/diagram editor, save a graph/diagram in *.VDX or *.DIA format, and Pythonect will be able to parse and run it (even if it's gzipped!). Curious to see what a multi-threading/processing graph looks like? See below!


Yup, it's that simple. One node with two edges. The graph above is equal to:
"Hello, world" -> [print, print]
Which is the canonical "Hello, world" multi-threaded example program. Now, another issue that I have addressed in this release is the reduce functionality.
The famous reduce from big data. Let's say that we want to write a program that will add one to every integer input and eventually sum all the results:
[1,2,3] -> _ + 1 -> sum -> print
The above example won't work because Pythonect maps (think MapReduce) each iterable value to its own thread, so the sum function will actually receive 2, 3, 4 separately and not as a list. A workaround for this would be:
sum(`[1,2,3] -> _+1`) -> print
But with the new reduce functionality in Pythonect 0.6, it is as easy as:
[1,2,3] -> _ + 1 -> sum(_!) -> print
By using the _! identifier, the Pythonect interpreter will automatically join all the values (and threads/processes) into a single list and pass it to the Python function without any prerequisites. The same applies when using a graph:


is equal to:
[1,2,3] -> _ + 1 -> sum(_!) -> print
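For readers more familiar with plain Python, the map-then-join behaviour can be approximated with a thread pool. This is only an analogy, assuming Python 3 and the standard concurrent.futures module; it is not how the Pythonect interpreter is implemented, and add_one and map_then_reduce are names made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def add_one(value):
    # Stands in for the `_ + 1` step; each input is handled independently
    return value + 1


def map_then_reduce(values):
    # Map: run add_one on every value using a pool of worker threads
    with ThreadPoolExecutor(max_workers=max(1, len(values))) as pool:
        mapped = list(pool.map(add_one, values))
    # Reduce: join the per-value results into one list, then sum it,
    # which is roughly what sum(_!) asks Pythonect to do
    return sum(mapped)


if __name__ == "__main__":
    print(map_then_reduce([1, 2, 3]))
```

Without the join step, each worker would only ever see its own value, which is why a plain sum in the pipeline receives 2, 3, and 4 one at a time.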
Now let's talk about the future of Pythonect. Here's a link to the TODO list, where you can find future directions. In a nutshell, more graphs, more Python implementation support, and more Service-oriented architecture (SOA).

Right now, the biggest application of Pythonect (to the best of my knowledge) is my second project, Hackersh. Hacker Shell (hackersh) is a free and open source command-line shell and scripting language designed especially for security testing. It is written in Python and uses Pythonect as its scripting engine. The upcoming release of Hackersh (work in progress!) will also enjoy the Pythonect 0.6 features such as graphs (*.VDX and *.DIA) as scripts and better reduce functionality.

To learn more about Pythonect, please visit its homepage: http://www.pythonect.org and be sure to check out the new documentation at: http://docs.pythonect.org/en/latest/ where you can find an up-to-date tutorial and installation instructions.

That's all for now!

Wednesday, April 3, 2013

Hackersh 0.1 Release Announcement

I am pleased to announce the Official 0.1 launch of Hackersh ("Hacker Shell") - a shell (command interpreter) written in Python with built-in security commands, and out of the box wrappers for various security tools. It uses Pythonect as its scripting engine. Since it's the first release of Hackersh, I'd like to take this opportunity to explain how it works and why you should be using it.

Hackersh is an interactive console for security research and testing. It uses Pythonect as its scripting language. Pythonect is a new, experimental, general-purpose high-level dataflow programming language based on Python. It aims to combine the intuitive feel of shell scripting (and all of its perks like implicit parallelism) with the flexibility and agility of Python. The combination of the two makes:
"http://localhost" -> url -> nmap -> w3af -> print
Return something like this:
+------------------------------------------------------------------------------+-----------------------------------------------------------------+
| VULNERABILITY DESCRIPTION                                                    | URL                                                             |
+------------------------------------------------------------------------------+-----------------------------------------------------------------+
| Cross Site Scripting was found at:                                           | http://localhost:8080/black/vulnerabilities/xss_r/              |
| "http://localhost:8080/black/vulnerabilities/xss_r/", using HTTP method GET. |                                                                 |
| The sent data was:                                                           |                                                                 |
| "name=%3CSCrIPT%3Efake_alert%28%22v3bd%22%29%3C%2FSCrIPT%3E". This           |                                                                 |
| vulnerability affects ALL browsers                                           |                                                                 |
+------------------------------------------------------------------------------+-----------------------------------------------------------------+
| The whole target has no protection (X-Frame-Options header) against          | Undefined                                                       |
| ClickJacking attack                                                          |                                                                 |
+------------------------------------------------------------------------------+-----------------------------------------------------------------+
| "X-Powered-By" header for this HTTP server is: "PHP/5.3.3-7+squeeze3"        | Undefined                                                       |
+------------------------------------------------------------------------------+-----------------------------------------------------------------+
| The server header for the remote web server is: "Apache/2.2.16 (Debian)"     | Undefined                                                       |
+------------------------------------------------------------------------------+-----------------------------------------------------------------+
| An error page sent this Apache version: "addressApache/2.2.16 (Debian)       | http://localhost:8080/black/vulnerabilities/xss_r/_vti_inf.html |
| Server at localhost Port 8080/address"                                       |                                                                 |
+------------------------------------------------------------------------------+-----------------------------------------------------------------+
| The remote Web server sent a strange HTTP response code: "405" with the      | http://localhost:8080/black/vulnerabilities/xss_r/GeBrG         |
| message: "Method Not Allowed", manual inspection is advised                  |                                                                 |
+------------------------------------------------------------------------------+-----------------------------------------------------------------+
| The remote Web server sent a strange HTTP reason message: "The HTTP server   | http://localhost:8080/black/login.php                           |
| returned a redirect error that would lead to an infinite loop. The last 30x  |                                                                 |
| error message was: Found" manual inspection is advised                       |                                                                 |
+------------------------------------------------------------------------------+-----------------------------------------------------------------+
| The remote Web server has a custom configuration, in which any non existent  | http://localhost:8080/black/vulnerabilities/xss_r/              |
| methods that are invoked are defaulted to GET instead of returning a "Not    |                                                                 |
| Implemented" response                                                        |                                                                 |
+------------------------------------------------------------------------------+-----------------------------------------------------------------+
| The URL: "http://localhost:8080/black/vulnerabilities/xss_r/" sent the       | http://localhost:8080/black/vulnerabilities/xss_r/              |
| cookie: "security=low"                                                       |                                                                 |
+------------------------------------------------------------------------------+-----------------------------------------------------------------+
| The URL: "http://localhost:8080/black/index.php" sent the cookie:            | http://localhost:8080/black/index.php                           |
| "PHPSESSID=lut893qvd4gdngp1rud5ei8pc2; path=/"                               |                                                                 |
+------------------------------------------------------------------------------+-----------------------------------------------------------------+
| A cookie matching the cookie fingerprint DB has been found when requesting   | http://localhost:8080/black/index.php                           |
| "http://localhost:8080/black/index.php" . The remote platform is: "PHP"      |                                                                 |
+------------------------------------------------------------------------------+-----------------------------------------------------------------+
So, how does it work? As a dataflow programming language, Pythonect treats data as something that originates from a source - it flows through a number of processing components and arrives at a final destination. As such, it is most suitable for creating applications that are themselves focused on the "flow" of data. Perhaps the most readily available example of a dataflow-oriented application comes from the realm of real-time signal processing, e.g. a video signal processor which starts with a video input, modifies it through a number of processing components (i.e. video filters), and finally outputs it to a video display.

As with video, penetration testing (and other security domains) can be expressed as a network of different components such as targets, network scanners, web security scanners, etc., connected by a number of communication channels. These components (and more) are provided by Hackersh, and can be either internal (e.g. url is an internal component that converts String to URL) or external (e.g. nmap is a wrapper around the Nmap security scanner). Every Hackersh component (except the Hackersh Root Component) is standardized to accept and return a context. A context is a dict (i.e. associative array) that can be piped through different components, just like text can be piped through different Unix tools (e.g. cat, grep, wc, etc.).
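The context-passing idea can be sketched in plain Python. This is a minimal illustration written in Python 3, not Hackersh's actual code: the component names, the keys in the dict, and the pipe helper are all assumptions made up for the example.

```python
def url_component(context):
    # Hypothetical internal component: records the target as a URL
    new_context = dict(context)
    new_context["URL"] = context["TARGET"]
    return new_context


def fake_scanner(context):
    # Hypothetical external component: a real wrapper would run a tool here
    new_context = dict(context)
    new_context["PORTS"] = ["80/tcp"]
    return new_context


def pipe(context, *components):
    # Feed the context through each component in order, like a Unix pipeline
    for component in components:
        context = component(context)
    return context


if __name__ == "__main__":
    print(pipe({"TARGET": "http://localhost"}, url_component, fake_scanner))
```

Because every component takes a dict and returns a dict, components can be reordered or swapped without changing their neighbours.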

Back to real life examples, here is how you can pass command line arguments to an external Hackersh component (e.g. nmap):
"http://localhost" -> url -> nmap("-sS -P0 -T3") -> w3af -> print
Here is how you can debug a Hackersh component:
"http://localhost" -> url -> nmap("-sS -P0 -T3", debug=True) -> w3af -> print
Please note that this is not a component-specific option as almost every Hackersh component can be debugged this way.

Moving on to more advanced options:
"http://localhost" -> url -> nmap("-sS -P0 -T3") -> [_['PORT'] == '8080' and _['SERVICE'] == 'HTTP'] -> w3af -> print
Support for Metadata is a major strength of Hackersh as it enables potential AI applications to fine-tune their service selection strategy based on service-specific characteristics.
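In plain Python terms, the bracketed expression acts like a predicate over the context. The sketch below (Python 3) assumes made-up contexts with the PORT and SERVICE keys used in the example above; how Hackersh really stores this metadata is not shown here, and select is a hypothetical helper.

```python
def is_http_on_8080(context):
    # Same test as the bracketed expression in the Pythonect example
    return context["PORT"] == "8080" and context["SERVICE"] == "HTTP"


def select(contexts, predicate):
    # Only contexts that satisfy the predicate flow on to the next component
    return [context for context in contexts if predicate(context)]


if __name__ == "__main__":
    found = [
        {"PORT": "22", "SERVICE": "SSH"},
        {"PORT": "8080", "SERVICE": "HTTP"},
    ]
    print(select(found, is_http_on_8080))
```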
"http://localhost" -> url -> [nmap, pass] -> amap
The script above is an example of a multithreaded application. It scans http://localhost in two parallel branches: one uses nmap followed by amap, and the other uses amap alone. The output is:
http://localhost
  +-3306/tcp (MYSQL)
  +-25/tcp (SMTP)
  +-25/tcp (NNTP)
  +-902/tcp (VMWARE-AUTHD)
  +-21/tcp (FTP)
  +-21/tcp (SMTP)
  +-22/tcp (SSH)
  +-22/tcp (SSH-OPENSSH)
  +-80/tcp (HTTP)
  +-80/tcp (HTTP-APACHE-2)
  +-80/tcp (HTTP)
  +-80/tcp (HTTP-APACHE-2)
  +-631/tcp (HTTP)
  +-631/tcp (HTTP-APACHE-2)
  +-631/tcp (HTTP-CUPS)
  +-8080/tcp (HTTP)
  +-631/tcp (SSL)
  +-8080/tcp (HTTP)
  +-8080/tcp (HTTP-APACHE-2)
  +-53/tcp (DNS)
  +-8080/tcp (HTTP-APACHE-2)
  +-2222/tcp (SSH)
  +-2222/tcp (SSH-OPENSSH)
  +-3000/tcp (HTTP)
  +-111/tcp (RPC)
  `-111/tcp (RPC-RPCBIND-V4)
To read more about Pythonect's multi-thread and multi-process capabilities, please visit Pythonect Tutorial: Learn By Example.

This version supports external Hackersh components, as well as the following internal Hackersh components (in alphabetical order):
  • Hostname
  • IPv4_Address
  • IPv4_Range (supports CIDR, Netmask Source-IP Notation, IP Range, etc.)
  • Nslookup
  • Stateful programmatic Web Browser (i.e. Browse, Submit, and Iterate_Links)
  • URL
To familiarize yourself with Pythonect, make sure you check out the other Pythonect posts on this blog as well. Good luck, and May the Force be with you!

Sunday, February 10, 2013

Password Policy: You Are Doing It Wrong (When 2^56 Becomes 2^42)

They say the road to hell is paved with good intentions. This is often the case with non-standard password policies. About a month ago I visited my "favorite airplane company" website, and after successfully logging in with my Frequent Flyer credentials, I was redirected to an Update Password page where I was asked to change my password according to the following criteria:

Please insert an 8 characters password
The 4 first characters need to include at least 2 letters (A-Z)
The last 4 characters must be all digits

At first sight this may seem like a good password policy: the password is 8 characters long, must include at least 2 letters (A-Z), and must include at least 4 digits. But is it really going to result in a strong password?

The answer is no, and to understand why, it is necessary to understand how a brute-force attack works. A brute-force attack consists of systematically trying all possible passwords until the correct password is found. In the worst case, this involves traversing the entire search space. The larger the search space, the longer it takes a brute-force attack to cover it. This doesn't guarantee that a given password won't be the 1st or 2nd option in the search space, but statistically speaking, the more options there are, the more password combinations an attacker has to check.
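As a rough worked example, the size of a search space is the product of the number of allowed values at each position, and its base-2 logarithm gives the strength in bits. The sketch below assumes Python 3; search_space and the two sample policies (a four-digit PIN and four lowercase letters) are illustrative choices, not part of any tool mentioned here.

```python
import math


def search_space(choices_per_position):
    # Multiply the number of allowed values at every position
    total = 1
    for choices in choices_per_position:
        total *= choices
    return total


if __name__ == "__main__":
    pin = search_space([10] * 4)          # four digits
    lowercase = search_space([26] * 4)    # four lowercase letters
    print(pin, math.log2(pin))
    print(lowercase, math.log2(lowercase))
```

Both passwords are four characters long, yet the lowercase policy gives the attacker roughly 45 times more candidates to try.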

The password policy defines the search space. Depending on the policy, it can define either a global search space (e.g. 8 ASCII characters password) or an individual search space per character (e.g. 8 ASCII characters password, first character must be a digit). The latter is weaker than the former. To demonstrate this, I have developed a small Python script called alphapasswd.py that calculates the search space of a password policy given its formation and a working sample password.
#!/usr/bin/env  python

# Copyright (C) 2013 Itzik Kotler <ik@ikotler.org>
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA

import sys
import re
import math


def main():

    try:

        print 'Evaluating Password Formation: "%s" with Sample Password "%s"' % (sys.argv[1], sys.argv[2])

        formation = re.compile(sys.argv[1])

        if not formation.match(sys.argv[2]):

            print 'Sample Password "%s" does not match Password Formation "%s"' % (sys.argv[2], sys.argv[1])

            return 0

        sample_passwd = list(formation.match(sys.argv[2]).group(0))

        print "INPUT: %s" % sample_passwd

        values_per_col = []

        exponent = 0

        # Iterate over each character in the sample password

        for col_idx in xrange(0, len(sample_passwd)):

            total_values_per_col = 0

            # Iterate over all 2^8 byte values (0-255)

            for byte in xrange(0, 256):

                old_value = sample_passwd[col_idx]

                sample_passwd[col_idx] = chr(byte)

                # GO / NO GO ?

                if formation.match(''.join(sample_passwd)):

                    total_values_per_col = total_values_per_col + 1

                sample_passwd[col_idx] = old_value

            values_per_col.append(total_values_per_col)

        for col_idx in xrange(0, len(values_per_col)):

            print "PASSWORD BYTE #%d SEARCH SPACE 2^%d (%d)" % (col_idx+1, math.ceil(math.log(values_per_col[col_idx], 2)), values_per_col[col_idx])

            exponent = exponent + math.ceil(math.log(values_per_col[col_idx], 2))

        print "EXPONENT = %d" % exponent

        print "TOTAL = %d = 2^%d" % (2**exponent, exponent)

    except IndexError as e:

        print 'Missing password formation or sample password'
        print 'e.g. %s "[a-zA-Z0-9]{4}" "abcd"' % sys.argv[0]
        print 'Usage: %s <password formation> <sample password>' % sys.argv[0]


if __name__ == "__main__":
    main()
The script above calculates how many bits are needed to represent each character in the password under the given password policy (the password will eventually be stored digitally, and a bit is the smallest unit of information storage in computers). It then sums the bits and outputs the maximum password strength (in bits) that this password policy can yield.
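The per-character counting step can also be isolated as a small function. The following is a sketch of the same idea in Python 3 rather than a port of alphapasswd.py: the function names and the sample formation are assumptions for illustration, it uses re.fullmatch where the script uses match, and it expects a sample password that matches the formation. It also reports the unrounded figure, since rounding every character up to whole bits makes the summed exponent an upper bound.

```python
import math
import re


def per_position_counts(formation, sample):
    # For each position, count how many of the 256 byte values can replace
    # the sample character while the whole password still matches
    pattern = re.compile(formation)
    chars = list(sample)
    counts = []
    for idx in range(len(chars)):
        original = chars[idx]
        matches = 0
        for byte in range(256):
            chars[idx] = chr(byte)
            if pattern.fullmatch("".join(chars)):
                matches += 1
        chars[idx] = original
        counts.append(matches)
    return counts


def rounded_bits(counts):
    # Round each position up to whole bits before summing
    return sum(math.ceil(math.log2(count)) for count in counts)


def exact_bits(counts):
    # Without rounding: log2 of the product of all per-position counts
    return sum(math.log2(count) for count in counts)


if __name__ == "__main__":
    # Sample formation: two letters, two letters/digits/symbols, four digits
    policy = r"[a-zA-Z]{2}[a-zA-Z0-9!@#$%^&*()]{2}[0-9]{4}"
    counts = per_position_counts(policy, "abcd1234")
    print(counts, rounded_bits(counts), exact_bits(counts))
```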

Going back to our original question, now that we have alphapasswd.py, it is possible to compare two or more password policies and see which yields a better theoretical password (remember, this does not test against common passwords or obvious mistakes, only the search space). The second password policy that I will use for comparison is very similar to the one in question, but simpler: an 8 ASCII characters long password with no restrictions. Now that we have competitors, let's start measuring their search space, starting with the airplane company password policy:
./alphapasswd.py "[a-zA-Z][a-zA-Z][a-zA-Z0-9\!\@\#\$\%\^\&\*\(\)]{2}[0-9]{4}" "abcd1234"
The output should be:
Evaluating Password Formation: "[a-zA-Z][a-zA-Z][a-zA-Z0-9\!\@\#$\%\^\&\*\(\)]{2}[0-9]{4}" with Sample Password "abcd1234"
INPUT: ['a', 'b', 'c', 'd', '1', '2', '3', '4']
PASSWORD BYTE #1 SEARCH SPACE 2^6 (52)
PASSWORD BYTE #2 SEARCH SPACE 2^6 (52)
PASSWORD BYTE #3 SEARCH SPACE 2^7 (72)
PASSWORD BYTE #4 SEARCH SPACE 2^7 (72)
PASSWORD BYTE #5 SEARCH SPACE 2^4 (10)
PASSWORD BYTE #6 SEARCH SPACE 2^4 (10)
PASSWORD BYTE #7 SEARCH SPACE 2^4 (10)
PASSWORD BYTE #8 SEARCH SPACE 2^4 (10)
EXPONENT = 42
TOTAL = 4398046511104 = 2^42
From this output it is possible to see that in an 8 characters long password, bytes #1, #2, #5, #6, #7, and #8 have a smaller search space (generally speaking, the upper bound of an ASCII byte search space is 2^7), and as a result, the maximum search space is 2^42. Now, let's try the second password policy (i.e. 8 ASCII characters long password with no restrictions):
./alphapasswd.py "[a-zA-Z0-9\!\@\#\$\%\^\&\*\(\)]{8}" "abcd1234"
The output should be:
Evaluating Password Formation: "[a-zA-Z0-9\!\@\#$\%\^\&\*\(\)]{8}" with Sample Password "abcd1234"
INPUT: ['a', 'b', 'c', 'd', '1', '2', '3', '4']
PASSWORD BYTE #1 SEARCH SPACE 2^7 (72)
PASSWORD BYTE #2 SEARCH SPACE 2^7 (72)
PASSWORD BYTE #3 SEARCH SPACE 2^7 (72)
PASSWORD BYTE #4 SEARCH SPACE 2^7 (72)
PASSWORD BYTE #5 SEARCH SPACE 2^7 (72)
PASSWORD BYTE #6 SEARCH SPACE 2^7 (72)
PASSWORD BYTE #7 SEARCH SPACE 2^7 (72)
PASSWORD BYTE #8 SEARCH SPACE 2^7 (72)
EXPONENT = 56
TOTAL = 72057594037927936 = 2^56
From this output it is possible to see that in an 8 characters long password, all the bytes have the maximum search space possible given the upper bound of an ASCII byte search space (i.e. 2^7), and as a result, the maximum search space is 2^56. In other words, the first password policy is 16384 (i.e. 72057594037927936/4398046511104) times weaker than the second password policy. Reviewing the first password policy again, it's obvious that the 4 digits requirement is what limits the search space the most. If so, why did my "favorite airplane company" request it? On the same Update Password page it says (at the bottom) that:

The last four characters in your password (the digits) will serve as your secret code to identify yourself at the Telephone Service Center

And so the mystery is solved: due to a legacy IVR (Interactive Voice Response) system, and the fact that my "favorite airplane company" did not want to separate their Website and IVR credentials, they composed a password policy that is in fact weaker than any 8 ASCII characters long password policy. Now, come to think of it, if I only need to enter 4 digits to log in to the IVR, how are they storing the passwords? They can't be hashed and compared, as the IVR will only accept 4 digits while the password is 8 characters long. Oh well, I guess that's a story for another day.

Tuesday, December 25, 2012

Scraping LinkedIn Public Profiles for Fun and Profit

Reconnaissance and Information Gathering is a part of almost every penetration testing engagement. Often, the tester will only perform network reconnaissance in an attempt to disclose and learn the company's network infrastructure (i.e. IP addresses, domain names, etc.), but there are other types of reconnaissance to conduct, and no, I'm not talking about dumpster diving. Thanks to social networks like LinkedIn, OSINT/WEBINT is now yielding more information. This information can then be used to help the tester test anything from social engineering to weak passwords.

In this blog post I will show you how to use Pythonect to easily generate potential passwords from LinkedIn public profiles. If you haven't heard about Pythonect yet, it is a new, experimental, general-purpose dataflow programming language based on the Python programming language. Pythonect is most suitable for creating applications that are themselves focused on the "flow" of the data. An application that generates passwords from the public LinkedIn profiles of a given company's employees has a coherent and clear dataflow:

(1) Find all the employees' public LinkedIn profiles
(2) Scrape all the employees' public LinkedIn profiles
(3) Crunch all the data into potential passwords

Now that we have the general concept and high-level overview out of the way, let's dive in to the details.

Finding all the employees' public LinkedIn profiles will be done via Google Custom Search Engine, a free service by Google that allows anyone to create their own search engine. The idea is to create a search engine that, when searching for a given company name, will return all the employees' public LinkedIn profiles. How? When creating a Google Custom Search Engine it's possible to refine the search results to a specific site (i.e. 'Sites to search'), and we're going to limit ours to: linkedin.com. It's also possible to fine-tune the search results even further, e.g. uk.linkedin.com to find only employees from the United Kingdom.

Access to the newly created Google Custom Search Engine will be made using a free API key obtained from the Google API Console. Why go through the Google API? Because it allows automation (no CAPTCHAs), and it also means that the search-result pages will be returned as JSON (as opposed to HTML). The only catch with using the free API key is that it's limited to 100 queries per day, but it's possible to buy an API key that will not be limited.

Scraping the profiles is a matter of iterating over all the hCards in all the search-result pages and extracting the employee name from each hCard. What is an hCard? hCard is a microformat for publishing the contact details of people, companies, organizations, and places. hCard is also supported by social networks such as Facebook, Google+, LinkedIn, etc. for exporting public profiles. Google (when indexing) parses hCards, and when relevant, uses them in search-result pages. In other words, when search-result pages include LinkedIn public profiles, they will appear as hCards and can be easily parsed.

Let's see the implementation of the above:
#!/usr/bin/python
#
# Copyright (C) 2012 Itzik Kotler
#
# scraper.py is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# scraper.py is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with scraper.py.  If not, see <http://www.gnu.org/licenses/>.

"""Simple LinkedIn public profiles scraper that uses Google Custom Search"""

import urllib
import simplejson


BASE_URL = "https://www.googleapis.com/customsearch/v1?key=<YOUR GOOGLE API KEY>&cx=<YOUR GOOGLE SEARCH ENGINE CX>"


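def __get_all_hcards_from_query(query, index=0, hcards=None):

    # Create a fresh dict on each top-level call; a mutable default
    # argument ({}) would be shared and keep results from earlier calls

    if hcards is None:

        hcards = {}

    url = query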

    if index != 0:

        url = url + '&start=%d' % (index)

    json = simplejson.loads(urllib.urlopen(url).read())

    if json.has_key('error'):

        print "Stopping at %s due to Error!" % (url)

        print json

    else:

        for item in json['items']:

            try:

                hcards[item['pagemap']['hcard'][0]['fn']] = item['pagemap']['hcard'][0]['title']

            except KeyError as e:

                pass

        if json['queries'].has_key('nextPage'):

            return __get_all_hcards_from_query(query, json['queries']['nextPage'][0]['startIndex'], hcards)

    return hcards


def get_all_employees_by_company_via_linkedin(company):

    queries = ['"at %s" inurl:"in"', '"at %s" inurl:"pub"']

    result = {}

    for query in queries:

        _query = query % company

        # URL-encode the query, since it contains spaces and double quotes

        result.update(__get_all_hcards_from_query(BASE_URL + '&q=' + urllib.quote(_query)))

    return list(result)
Replace <YOUR GOOGLE API KEY> and <YOUR GOOGLE SEARCH ENGINE CX> in the code above with your Google API Key and Google Search Engine CX respectively, save it to a file called scraper.py, and you're ready!
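The extraction step can be tried offline, without an API key. The sketch below assumes Python 3 and the standard json module, and runs on a made-up sample response that only mimics the fields scraper.py reads (items, pagemap, hcard, fn, title); it is not real output from the Google API, and extract_hcards is a hypothetical helper name.

```python
import json

# Made-up sample shaped like the fields scraper.py reads; not real API output
SAMPLE_RESPONSE = """
{
  "items": [
    {"pagemap": {"hcard": [{"fn": "Jane Doe", "title": "Engineer at Example"}]}},
    {"pagemap": {}}
  ],
  "queries": {}
}
"""


def extract_hcards(response_text):
    # Map each full name (fn) to its title, skipping items without an hCard
    hcards = {}
    for item in json.loads(response_text).get("items", []):
        try:
            card = item["pagemap"]["hcard"][0]
            hcards[card["fn"]] = card["title"]
        except (KeyError, IndexError):
            pass
    return hcards


if __name__ == "__main__":
    print(extract_hcards(SAMPLE_RESPONSE))
```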

To kick-start, here is a simple program in Pythonect (that utilizes the scraper module) that searches for and prints the full names of all the Pythonect company employees:
"Pythonect" -> scraper.get_all_employees_by_company_via_linkedin -> print
The output should be:
Itzik Kotler
In my LinkedIn Profile, I have listed Pythonect as a company that I work for, and since no one else is working there, when searching for all the employees of Pythonect company - only my LinkedIn profile comes up.
For demonstration purposes I will keep using this example (i.e. "Pythonect" company, and "Itzik Kotler" employee), but go ahead and replace Pythonect with other, more popular, company names and see the results.

Now that we have a working skeleton, let's take its output and start crunching it. Keep in mind that every "password generation formula" is merely a guess. The examples below are only a sampling of what can be done. There are, obviously, many more possibilities and you are encouraged to experiment. But first, let's normalize the output so that it is consistent before operations are performed on it:
"Pythonect" -> scraper.get_all_employees_by_company_via_linkedin -> string.lower(''.join(_.split()))
The normalization procedure is short and simple: convert the string to lowercase and remove any spaces, and so the output should be now:
itzikkotler
As for data manipulation, out of the box (thanks to the Python Standard Library) we've got itertools and its combinatoric generators. Let's start by applying itertools.product:
"Pythonect" -> scraper.get_all_employees_by_company_via_linkedin -> string.lower(''.join(_.split())) -> itertools.product(_, repeat=4) -> print
The code above will generate and print every 4-character password from the letters: i, t, z, k, o, l, e, r. However, it won't cover passwords with uppercase letters in them. And so, here's a simple and straightforward implementation of a cycle_uppercase function that cycles through the input letters and, for each position, yields a copy of the input with that letter in uppercase:
def cycle_uppercase(i):
    s = ''.join(i)
    for idx in xrange(0, len(s)):
        yield s[:idx] + s[idx].upper() + s[idx+1:]
To use it, save it to a file called itertools2.py, and then simply add it to the Pythonect program after the itertools.product(_, repeat=4) block, as follows:
"Pythonect" -> scraper.get_all_employees_by_company_via_linkedin \
    -> string.lower(''.join(_.split())) \
        -> itertools.product(_, repeat=4) \
            -> itertools2.cycle_uppercase \
                -> print
Now, the program will also cover passwords that include a single uppercase letter. Moving on with the data manipulation, sometimes the password might contain symbols that are not found within the scraped data. In this case, it is necessary to build a generator that will take the input and add symbols to it. Here is a short and simple one, implemented as a list comprehension:
[_ + postfix for postfix in ['123','!','$']]
To use it, simply add it to the Pythonect program after the itertools2.cycle_uppercase block, as follows:
"Pythonect" -> scraper.get_all_employees_by_company_via_linkedin \
    -> string.lower(''.join(_.split())) \
        -> itertools.product(_, repeat=4) \
            -> itertools2.cycle_uppercase \
                -> [_ + postfix for postfix in ['123','!','$']] \
                    -> print
The result is that the program now appends the strings '123', '!', and '$' to every generated password, which may or may not increase the chances of guessing the user's password, depending on the password :)
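To get a feel for how many guesses the stages above produce, here is a rough plain Python 3 sketch of the same flow (product, then cycle_uppercase, then postfixes). It is not the Pythonect program itself: the candidates function and its parameters are hypothetical names, and unlike the Pythonect version it removes repeated letters from the seed first, so that no guess is generated twice.

```python
import itertools


def cycle_uppercase(chars):
    """Yield copies of the candidate with exactly one letter in uppercase."""
    s = ''.join(chars)
    for idx in range(len(s)):
        yield s[:idx] + s[idx].upper() + s[idx + 1:]


def candidates(seed, length=4, postfixes=('123', '!', '$')):
    """Yield password guesses built from the characters of seed."""
    # Assumption: repeated letters are dropped up front to avoid duplicates.
    alphabet = sorted(set(seed))
    for combo in itertools.product(alphabet, repeat=length):
        for word in cycle_uppercase(combo):
            for postfix in postfixes:
                yield word + postfix


guesses = list(candidates('itzikkotler'))
print(len(guesses))   # 8 letters ** 4 positions * 4 uppercase variants * 3 postfixes
print(guesses[:3])
```

Because candidates is a generator, the guesses can also be consumed one at a time instead of being collected into a list, which matters once the length or the alphabet grows.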

To summarize, it's possible to take OSINT/WEBINT data on a given person or company and use it to generate potential passwords, and it's easy to do with Pythonect. There are, of course, many different ways to manipulate the data into passwords and many programs and filters that can be used. In this aspect, Pythonect being a flow-oriented language makes it easy to experiment and research with different modules and programs in a "plug and play" manner.

Monday, September 17, 2012

Fuzzing Like A Boss with Pythonect

In my previous post, Automated Static Malware Analysis with Pythonect, I wrote about how to use Pythonect to automate static malware analysis. In this post I'll describe how to use Pythonect and all of its perks to fuzz file formats, network protocols, and command-line arguments. The examples provided are only a sampling of what can be done. There are, obviously, many more possibilities and you are encouraged to experiment. Before you read this tutorial you should have at least a basic knowledge of fuzz testing, Python and Pythonect (I recommend reading the Pythonect Tutorial: Learn By Example).

Let's see some code!
['A', 'a', '0', '!', '$', '%', '*', '+', ',', '-', '.', '/', ':', '?', '@', '^', '_'] \
    -> [_ * n for n in [256, 512, 1024, 2048, 4096]] \
        -> os.system('/bin/ping ' + _)
The code above tries to fuzz the command-line arguments of a *nix command-line tool (e.g. /bin/ping). Let's go line by line and explain what's going on with these 3 lines of code.

The first line defines a list of inputs to try (i.e. ['A', 'a', '0', ...]), the second line defines a list of length parameters (i.e. [256, 512, 1024, ...]), and the last line executes the command-line tool with the generated argument as argv[1] (e.g. /bin/ping "AAAAAA ... 256 times"). In addition, this fuzzer is multi-threaded and uses asynchronous communication. What does it mean? It means that it's not waiting for a thread to finish before continuing with the loop, and as a result, it's not guaranteed to fuzz in sorted order (i.e. A * 256, A * 512, A * 1024, ...).

You can easily extend the code above to include testing for format string vulnerabilities:
['%s', '%n', 'A', 'a', '0', '!', '$', '%', '*', '+', ',', '-', '.', '/', ':', '?', '@', '^', '_'] \
    -> [_ * n for n in [256, 512, 1024, 2048, 4096]] \
        -> os.system('/bin/ping ' + _)
If you want the format string testing inputs to run first (i.e. fuzz in sorted order), change the forward pipe operator from asynchronous to synchronous:
['%s', '%n', 'A', 'a', '0', '!', '$', '%', '*', '+', ',', '-', '.', '/', ':', '?', '@', '^', '_'] \
    | [_ * n for n in [256, 512, 1024, 2048, 4096]] \
        -> os.system('/bin/ping ' + _)
If you also want the length parameters to run in sorted order (i.e. '%s' * 256, '%s' * 512, '%s' * 1024, ...), change the 2nd forward pipe operator to synchronous as well:
['%s', '%n', 'A', 'a', '0', '!', '$', '%', '*', '+', ',', '-', '.', '/', ':', '?', '@', '^', '_'] \
    | [_ * n for n in [256, 512, 1024, 2048, 4096]] \
        | os.system('/bin/ping ' + _)
Keep in mind that the latter is no longer multi-threaded (because it waits for both the inputs and the length threads to finish).
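If you are more at home in plain Python, the difference between the asynchronous and the synchronous variants can be approximated with a thread pool. The following Python 3 sketch is only an analogy and not how Pythonect is implemented; build_cases, run_target and fuzz are hypothetical names. It also passes the argument as a list element instead of going through a shell, so characters such as '$' or '*' reach the target unchanged.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed

SEEDS = ['%s', '%n', 'A', 'a', '0', '!', '$', '%', '*', '+', ',', '-', '.',
         '/', ':', '?', '@', '^', '_']
LENGTHS = [256, 512, 1024, 2048, 4096]


def build_cases(seeds=SEEDS, lengths=LENGTHS):
    """Return (seed, n) pairs in sorted (nested loop) order."""
    return [(seed, n) for seed in seeds for n in lengths]


def run_target(command, argument):
    """Run command with argument as argv[1] and return its exit status."""
    return subprocess.call([command, argument],
                           stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL)


def fuzz(command, cases, runner=run_target, ordered=False, workers=8):
    """Return a list of (seed, n, exit status) tuples."""
    if ordered:
        # Synchronous: one case at a time, in list order.
        return [(seed, n, runner(command, seed * n)) for seed, n in cases]
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(runner, command, seed * n): (seed, n)
                   for seed, n in cases}
        # Asynchronous: results are collected in completion order.
        for future in as_completed(futures):
            seed, n = futures[future]
            results.append((seed, n, future.result()))
    return results
```

A call such as fuzz('/bin/ping', build_cases()) would run the unordered variant, and adding ordered=True gives the sorted one. On *nix, a negative exit status from subprocess.call means the target was killed by a signal, which is usually the interesting case when fuzzing.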

Moving on, here is an example of a generic file format fuzzer:
open('dana.jpg', 'r').read() \
    -> itertools.permutations \
        -> open('output_' + hex(_.__hash__()) + '.jpg', 'w').write(''.join(_))
The code above reads the content of dana.jpg and passes it to itertools.permutations, and that in turn returns dana.jpg-length tuples, all possible orderings, no repeated elements.
Each dana.jpg-length tuple is saved into a unique output_ prefixed file. Afterwards, testing the JPEG libraries is as easy as: eog *.jpg or zgv *.jpg

This is another example of a generic file format fuzzer:
open('dana.jpg', 'r').read() \
    -> [list(_) + [os.urandom(1) for n in xrange(0, len(_))]] \
        -> [tuple(random.sample(_, len(_)/2)) for i in xrange(0, len(_)*2)] \
            -> open('output_' + hex(_.__hash__()) + '.jpg', 'w').write(''.join(_))
The code above reads the content of dana.jpg, generates a dana.jpg-length random bytes buffer, joins them, and then randomly samples dana.jpg-length*2 dana.jpg-length chunks.
Each dana.jpg-length chunk is saved into a unique output_ prefixed file. Again, testing the JPEG libraries is as easy as: eog *.jpg or zgv *.jpg
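The same idea (pad the original bytes with random ones, then sample chunks from the pool) can be sketched in plain Python 3 as follows. This is an illustration under my own assumptions and not a translation of the Pythonect program: mutate and write_mutations are hypothetical names, the number of chunks is a parameter, and the output files are numbered with a counter instead of a hash.

```python
import os
import random


def mutate(data, count, rng=random):
    """Yield count chunks of len(data) bytes sampled from data plus random padding."""
    # The pool holds the original bytes followed by as many random bytes.
    pool = list(data) + list(os.urandom(len(data)))
    for _ in range(count):
        # sample() picks len(data) pool positions without replacement.
        yield bytes(rng.sample(pool, len(data)))


def write_mutations(source_path, out_dir, count):
    """Write count mutated copies of source_path into out_dir and return their paths."""
    with open(source_path, 'rb') as handle:
        data = handle.read()
    paths = []
    for index, chunk in enumerate(mutate(data, count)):
        path = os.path.join(out_dir, 'output_%04d.jpg' % index)
        with open(path, 'wb') as handle:
            handle.write(chunk)
        paths.append(path)
    return paths
```

For example, write_mutations('dana.jpg', '.', 100) would leave one hundred output_ prefixed files in the current directory.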

Last but not least, here's a network protocol (FTP) fuzzer:
ftplib.FTP('localhost') \
    -> _.login().startswith('230') \
    -> [_.mkd(s) for s in reduce(lambda x,y: x+y, map(lambda c: [chr(c) * 2**l for l in range(8,13)], xrange(1, 255)))]
The code above uses the ftplib module to connect to an FTP server, logs in as an anonymous user, generates strings by repeating each byte value from 1 to 254 a total of 256, 512, 1024, 2048 and 4096 times, and passes each string as the pathname for MKD.
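The reduce/map/lambda expression is dense, so here is a small Python 3 sketch of just the payload generation, written as a generator. The name mkd_payloads and its parameters are mine, and the sketch deliberately stops short of talking to an FTP server; it yields bytes objects so that every byte value stays exactly one byte long.

```python
def mkd_payloads(first=1, last=254, exponents=range(8, 13)):
    """Yield each byte value repeated 2**8, 2**9, ..., 2**12 times."""
    for value in range(first, last + 1):
        for exponent in exponents:
            yield bytes([value]) * 2 ** exponent


payloads = list(mkd_payloads())
print(len(payloads))   # 254 byte values * 5 lengths
```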

Lastly, if you have suggestions on how we can make Pythonect better, head over to Pythonect's GitHub page and create a new ticket or fork. Enjoy the examples and have fun with Pythonect!

Tuesday, August 21, 2012

Automated Static Malware Analysis with Pythonect

About 5 months ago I released the first version of Pythonect - a new, experimental, general-purpose high-level dataflow programming language based on Python, written in Python.
It aims to combine the intuitive feel of shell scripting (and all of its perks like implicit parallelism) with the flexibility and agility of Python.

Crazy? Most definitely. And yet, strangely enough, it works!

Pythonect, being a dataflow programming language, treats data as something that originates from a source, flows through a number of processing components, and arrives at some final destination.
As such, it is most suitable for creating applications that are themselves focused on the "flow" of data. Perhaps the most readily available example of a dataflow-oriented application comes from the realm of real-time signal processing, e.g. a video signal processor which perhaps starts with a video input, modifies it through a number of processing components (video filters), and finally outputs it to a video display.

As with video, malware analysis can be expressed as a network of different components (disassemblers, regular expressions, debuggers, etc.) that are connected by a number of communication channels.
The benefits, and perhaps the greatest incentives, of expressing malware analysis this way are scalability and parallelism. The different components in the network can be maneuvered to create entirely unique dataflows without necessarily requiring the relationship to be hardcoded. Also, the design and concept of components make it easier to run on distributed systems and parallel processors.

In this tutorial I will show you how to automate static malware analysis using Pythonect. The examples will be simple enough that you can extend them if you want to.
Before you read this tutorial you should have at least a basic knowledge of x86 Assembly, Python, and Pythonect (I recommend reading the Pythonect Tutorial: Learn By Example).

Note: I have decided to go with static malware analysis because it's easier to demonstrate, and to use open source tools because they are more accessible. Nonetheless, this does not mean that Pythonect or dataflow programming cannot be used to automate dynamic malware analysis, or be integrated with commercial software. The only limit is your imagination.

There isn't exactly a "Hello, world" program in the malware analysis realm, so I will start with my equivalent to "Hello, world", an example program that computes an MD5 digest of a file:
"MALWARE.EXE" -> os.system("/usr/bin/md5sum " + _)
The program above uses the md5sum program of GNU coreutils to compute and print MALWARE.EXE's MD5 digest. Let's extend it to compute MALWARE.EXE's SHA1 digest as well:
"MALWARE.EXE" -> [os.system("/usr/bin/md5sum " + _), os.system("/usr/bin/sha1sum " + _)]
The new program above uses the md5sum and sha1sum of GNU coreutils to compute and print MALWARE.EXE's MD5 and SHA1 digests. Let's keep improving it:
sys.argv[1] -> [os.system("/usr/bin/md5sum " + _), os.system("/usr/bin/sha1sum " + _)]
Now, the new program reads the malware filename from a command-line argument. To run the script just save it (e.g. md5_and_sha1_sums) and run the Pythonect interpreter like this:
% /usr/local/bin/pythonect md5_and_sha1_sums /bin/ls
92385e9b8864032488e253ebde0534c3  /bin/ls
8800fee57584ed1c44b638225c2f1eec818a27c2  /bin/ls
Often, the goal is to handle the large volume of malware samples collected each day. Let's change the program to work on all the executables (i.e. *.EXE) in the current directory:
glob.glob('*.EXE') -> [os.system("/usr/bin/md5sum " + _), os.system("/usr/bin/sha1sum " + _)]
Of course, it can be further fine-tuned or customized at will. Also, it's worth mentioning that the program above is multi-threaded, meaning that each file is processed in its own thread.
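To make the "one thread per file" idea concrete, here is a rough plain Python 3 equivalent. It is a sketch under two assumptions of mine: it swaps the md5sum and sha1sum programs for Python's hashlib so that it runs without GNU coreutils, and it uses a bounded thread pool instead of one thread per file. The names digests and hash_all are hypothetical.

```python
import glob
import hashlib
from concurrent.futures import ThreadPoolExecutor


def digests(path, chunk_size=65536):
    """Return (path, MD5 hex digest, SHA1 hex digest) for the file at path."""
    md5, sha1 = hashlib.md5(), hashlib.sha1()
    with open(path, 'rb') as handle:
        # Read in chunks so that large samples do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            md5.update(chunk)
            sha1.update(chunk)
    return path, md5.hexdigest(), sha1.hexdigest()


def hash_all(pattern='*.EXE', workers=4):
    """Hash every file matching pattern, several files at a time."""
    paths = sorted(glob.glob(pattern))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(digests, paths))


for path, md5_hex, sha1_hex in hash_all():
    print(md5_hex, sha1_hex, path)
```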

So far, I have used Python's os.system() function in all of the example programs. os.system() is handy when it comes to writing small scripts: it executes a command in a subshell and returns its exit status.
But since there is little interest in passing the exit status to another component, a different command-executing function is needed when building a more advanced script: subprocess.check_output().
"MALWARE.EXE" -> subprocess.check_output(['/usr/bin/md5sum', _]) -> print
Much like the original example program, the program above uses the md5sum program of GNU coreutils to compute MALWARE.EXE's MD5 digest, but prints the result using Pythonect's print() function.
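The difference between the two kinds of calls is easy to see in plain Python 3. In this small sketch the child process is just the Python interpreter printing a fixed line, which stands in for md5sum so that the example runs anywhere; the printed text is a placeholder and not a real digest.

```python
import subprocess
import sys

# Stand-in child process: any command that prints to stdout would do.
command = [sys.executable, '-c', 'print("hello from the child process")']

# Like os.system(): the caller only learns the exit status.
status = subprocess.call(command, stdout=subprocess.DEVNULL)

# check_output() returns what the command printed (as bytes in Python 3),
# so the text itself can flow on to the next component.
output = subprocess.check_output(command).decode().strip()

print(status)
print(output)
```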

Moving on. The Python Standard Library is a rich set of libraries (modules and packages) for tackling just about every programming task. For example:
"MALWARE.EXE" -> open(_, 'r').read() -> hashlib.md5 -> _.hexdigest() -> print
The program above is an alternative to the original example program: it uses hashlib.md5() from Python's hashlib module to compute MALWARE.EXE's MD5 digest and Pythonect's print() to display it. What else?
"MALWARE.EXE" \
    -> open(_, 'r').read() \
    -> [re.finditer("\xcc", _), re.finditer("\xcd\x03", _)] \
    -> print "Found INT3 between Offset #%d and #%d" % _.span(0)
The program above searches for all occurrences of the INT 3 instruction in the MALWARE.EXE file (both the one-byte 0xCC and the two-byte 0xCD 0x03 encodings), and prints the start and end offsets of each match.
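Here is the same search as a plain Python 3 sketch, where file contents are bytes and therefore the patterns must be bytes as well. The name find_int3 is mine. Keep in mind that this kind of raw byte search cannot tell an actual instruction from a 0xCC byte that happens to sit inside data or inside another instruction's operand.

```python
import re

# The one-byte (0xCC) and two-byte (0xCD 0x03) encodings of INT 3.
INT3_PATTERNS = [re.compile(b'\xcc'), re.compile(b'\xcd\x03')]


def find_int3(data):
    """Return the sorted (start, end) offsets of every INT 3 encoding in data."""
    spans = []
    for pattern in INT3_PATTERNS:
        spans.extend(match.span(0) for match in pattern.finditer(data))
    return sorted(spans)


for start, end in find_int3(b'\x90\xcc\x90\xcd\x03\x90'):
    print("Found INT3 between Offset #%d and #%d" % (start, end))
```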

Now, for the times when the Python Standard Library doesn't have what you're looking for, you can always implement your own in Python:
import math

def entropy(data):
    entropy = 0
    if data:
        for x in range(2**8):
            p_x = float(data.count(chr(x))) / len(data)
            if p_x > 0:
                entropy += - p_x * math.log(p_x, 2)
    return entropy
The above is an implementation of Shannon's entropy equation in Python. To use it, simply save it (e.g. entropy.py), and reference it in a program:
"MALWARE.EXE" -> open(_, 'r').read() -> entropy.entropy -> print
The program above uses entropy() of entropy.py to measure and print MALWARE.EXE's entropy. To conclude this tutorial, let's tweak it one more time:
"MALWARE.EXE" -> subprocess.check_output(['/usr/bin/objcopy', '-O', 'binary', '-j', '.text', _, '/dev/stdout']) -> entropy.entropy -> print
Now, the program above uses entropy() of entropy.py to measure and print the entropy of MALWARE.EXE's .text section (extracted with objcopy of GNU binutils).
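If you want to experiment with the entropy measurement under Python 3, where file contents are bytes, the following is a minimal sketch of the same Shannon entropy calculation. It counts each byte value once with collections.Counter instead of calling data.count() 256 times; the result is in bits per byte, from 0.0 for constant data up to 8.0 for uniformly distributed bytes.

```python
import math
from collections import Counter


def entropy(data):
    """Return the Shannon entropy of data (a bytes object), in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    # Counter maps each byte value that occurs in data to its number of occurrences.
    probabilities = [count / total for count in Counter(data).values()]
    return sum(-p * math.log2(p) for p in probabilities)


print(entropy(b'aaaa'))
print(entropy(bytes(range(256))))
```

As a rule of thumb, packed or encrypted sections tend to score close to 8, while ordinary code and text score noticeably lower.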

Pythonect is still under heavy development; there's a ton of unimplemented features and even more bugs. It's not ready for production yet, but you can still start to play with it and have plenty of fun!

That's all for now.

Sunday, July 8, 2012

Modulation and Data Loss Prevention (DLP) Solutions

Last year, my colleague Iftach (Ian) Amit and I gave a talk called 'Sounds Like Botnets' at the DEFCON 19 and BSides Las Vegas conferences. Here is a link to the slides [PDF].
In the talk, we demonstrated how a combination of modulation and VoIP can be used to bypass enterprise security controllers. Here are the links to PoC #1 and PoC #2.
This year, I won't be able to make it to Las Vegas for any of the conferences. Dwelling on the past, I have decided to revisit the 'Sounds Like Botnets' talk and add some content to it.

Data loss prevention (DLP) solutions are designed to detect and prevent potential data breach incidents. There are many types of DLP systems; the one that I'll address is Endpoint DLP software.
Endpoint DLP software runs on end-user workstations and, among other things, monitors and controls access to physical devices (e.g. mobile devices). But does it monitor the sound card?
It is possible to modulate data into sound, and then to play it out from the workstation (using the sound card) to a 3rd party such as a voice recorder or any mobile device with an external microphone input.
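To show what "modulating data into sound" can look like, here is a minimal Python 3 sketch of a two-tone scheme (binary FSK): every bit becomes a short burst of one of two frequencies, written out as a WAV file. To be clear, this is not the code of data2sound.py or sound2data.py. The scheme, the function names and all the constants (sample rate, bit length, tone frequencies) are assumptions picked for illustration, and the demodulator only handles a clean, perfectly aligned file, not a real recording.

```python
import math
import os
import struct
import tempfile
import wave

SAMPLE_RATE = 8000     # samples per second (assumed)
SAMPLES_PER_BIT = 80   # i.e. 100 bits per second (assumed)
FREQ_ZERO = 1000.0     # tone for a 0 bit, in Hz (assumed)
FREQ_ONE = 2000.0      # tone for a 1 bit, in Hz (assumed)


def modulate(data, wav_path):
    """Write data to wav_path, one tone burst per bit (most significant bit first)."""
    frames = bytearray()
    for byte in data:
        for shift in range(7, -1, -1):
            freq = FREQ_ONE if (byte >> shift) & 1 else FREQ_ZERO
            for n in range(SAMPLES_PER_BIT):
                angle = 2 * math.pi * freq * n / SAMPLE_RATE
                frames += struct.pack('<h', int(26000 * math.sin(angle)))
    with wave.open(wav_path, 'wb') as wav:
        wav.setnchannels(1)           # mono
        wav.setsampwidth(2)           # 16-bit samples
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(bytes(frames))


def tone_energy(window, freq):
    """Return how strongly the samples in window correlate with a tone at freq."""
    re = sum(s * math.cos(2 * math.pi * freq * n / SAMPLE_RATE)
             for n, s in enumerate(window))
    im = sum(s * math.sin(2 * math.pi * freq * n / SAMPLE_RATE)
             for n, s in enumerate(window))
    return re * re + im * im


def demodulate(wav_path):
    """Recover the bytes from a WAV file written by modulate()."""
    with wave.open(wav_path, 'rb') as wav:
        raw = wav.readframes(wav.getnframes())
    samples = struct.unpack('<%dh' % (len(raw) // 2), raw)
    bits = []
    for start in range(0, len(samples) - SAMPLES_PER_BIT + 1, SAMPLES_PER_BIT):
        window = samples[start:start + SAMPLES_PER_BIT]
        is_one = tone_energy(window, FREQ_ONE) > tone_energy(window, FREQ_ZERO)
        bits.append(1 if is_one else 0)
    out = bytearray()
    for i in range(0, len(bits) - 7, 8):
        byte = 0
        for bit in bits[i:i + 8]:
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)


def round_trip(data):
    """Modulate data into a temporary WAV file and demodulate it again."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, 'foobar.wav')
        modulate(data, path)
        return demodulate(path)


print(round_trip(b'secret'))
```

At these settings the transfer rate is only 100 bits per second, which is one reason why smaller payloads are a safer bet. A recording that went through a speaker and a microphone would additionally need synchronization and tolerance to noise, which the sketch leaves out.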

Modulation vs. DLP #1:

Keep in mind that this is a proof of concept, so it's not going to work 100% of the time. If it's not working, try: (a) a smaller document/payload or (b) a different recording device.

To modulate:
  • Download data2sound.py
  • Pick a file
  • Modulate the file:
    $ ./data2sound.py -i secret.txt -o foobar.wav
  • Connect the recording device to the workstation sound card (Headphones output)
  • Start recording on the recording device
  • Play the generated WAV file (i.e. foobar.wav)
  • Stop the recording on the recording device
To demodulate:
  • Download sound2data.py

  • Then, if possible, copy the file "AS IT IS" from the recording device to the computer, and demodulate it:
    $ ./sound2data.py -i foobar.wav -o secret.txt
    If not, try the following steps:
    • Connect the recording device to the workstation sound card (Microphone input)
    • Start recording on the workstation
    • Play the file on the recording device
    • Stop the recording on the workstation
    • Demodulate the file
Try this (at home, and at your own risk) and post a comment with what file and sound card equipment you tried, and whether it worked for you or not. Now, the next method is really more theory than practice.

Modulation vs. DLP #2:

By bridging between the computer soundcard and a smart phone broadband modem, it is possible to upgrade the previous method to be an on-line, or real time one. In other words, Build Your Own Modem.

The setup:
  • Connect the computer headphone output into the smart phone external microphone input. This way, the computer can output signal to the smart phone.
  • Connect the smart phone headphone output into the computer external microphone input. This way, the smart phone can output signal to the computer.
This should (in theory) make sure that a signal can go from side to side. Now, let's see what each side should do.

On the smart phone:
  • Call the remote site
  • (The caller's signal should be sent to the computer via the headphone output; if not, try playing with the settings)
  • (The callee's signal should be received from the computer via the microphone input; if not, try playing with the settings)
There's also the option of pairing (via Bluetooth) the computer and the smart phone: The computer identifies as a headset and gains access to smart phone speaker/microphone. But it's preventable by DLP.

On the computer:
  • Modulate the file you wish to transfer
  • Play the generated WAV file
That's the basic idea. Of course, you can install software on the computer that will modulate-demodulate (i.e. MODEM) on the fly, making it possible to receive transmissions from the remote site and respond to them.

Before wrapping up this post, I'd like to give a big shout out to Mickey Shaktov and Iftach (Ian) Amit; each of them will be presenting this year at Black Hat USA. Go see their talks, you won't be disappointed!