HowTo: Add Intelligence to Analysis Processes


How many times do we launch a tool to parse some data, and then sit there looking at the output, wondering how anyone could see something "suspicious" or "malicious" in it?  How many times do we look at lines of data, wondering how someone else could look at the same data and easily say, "there it is...there's the malware"?  I've done IR engagements where I could look at the output of a couple of tools and identify the "bad" stuff, after someone else had spent several days trying to figure out what was going wrong with their systems.  How do we go about doing this?

The best and most effective way I've found to get to this point is to take what I learned on one engagement and roll it into the next.  If I find something unusual...a file path of interest, something particular within the binary contents of a file, etc...I'll attempt to incorporate that information into my overall analysis process and use it during future engagements.  Anything that's interesting, whether it comes from direct or ancillary analysis, gets incorporated into my analysis process.  Over time, I've found that some things keep coming back, while other artifacts are seen only every now and then.  Those less frequent artifacts are no less important, not simply because of the specific artifacts themselves, but also because of the trends they illustrate over time.

Before too long, the analysis process includes, "access this data, run this tool, and look for these things..."; we can then make this process easier on ourselves by taking the "look for these things" part of the process and automating it.  After all, we're human...we get tired from looking at a lot of data, and we can make mistakes, particularly when there is a LOT of data.  By automating what we look for (or what we've found before), we can speed up those searches and reduce the potential for mistakes.

Okay, I know what you're going to say..."I already do keyword searches, so I'm good".  Great, that's fine...but what I'm talking about goes beyond keyword searches.  Sure, I'll open up a lot of lines of output (RegRipper output, web server logs) in UltraEdit or Notepad++ and search for specific items, based on the information I have about the particular analysis that I'm working on (my goals, etc.).  However, more often than not, I take that keyword search one step further...the keyword itself will indicate items of interest, but will be loose enough that I'm going to have a number of false positives.  Once I locate a hit, I'll look for other items of interest in the same line.

For example, let's take a look at Corey Harrell's recent post regarding locating an injected iframe.  This is an excellent, very detailed post in which Corey walks through his analysis process and, at one point, locates two 'suspicious' process names in the output of a volatile data collection script.  The names of the processes themselves are likely random, and therefore difficult to include in a keyword list when conducting a search.  However, what we can take away from just that section of the blog post is that executable files located in the root of the ProgramData folder would be suspicious, and potentially malicious.  Therefore, a script that parses the file path and looks for that condition would be extremely useful, and written in Perl, might look something like this:

# split the full path into its individual folder components
my @path = split(/\\/,$filepath);
my $len = scalar(@path);
# flag executable files sitting directly in the root of the ProgramData folder
if (lc($path[$len - 2]) eq "programdata" && lc($path[$len - 1]) =~ m/\.exe$/) {
  print "Suspicious path found: ".$filepath."\n";
}

Similar paths of interest might include "AppData\Local\Temp"; both this path and the previous one appear in one of the timeline images Corey posted later in the blog post, specifically in association with the AppCompatCache data output.
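As a rough sketch only (the $filepath variable, the sample file name, and the two patterns below are placeholders...examples of locations that have come up before, not an exhaustive list), the same check could be generalized to match against several folders of interest:

use strict;
use warnings;

my $filepath = 'C:\ProgramData\svchosts.exe';   # hypothetical example path

# example patterns only: an EXE in the root of ProgramData, or in AppData\Local\Temp
my @susp = (
  qr/\\programdata\\[^\\]+\.exe$/i,
  qr/\\appdata\\local\\temp\\[^\\]+\.exe$/i,
);

foreach my $re (@susp) {
  if ($filepath =~ $re) {
    print "Suspicious path found: ".$filepath."\n";
    last;
  }
}

The point isn't the specific patterns; it's that once a path of interest has been seen on one engagement, it costs almost nothing to check for it on every engagement thereafter.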

Java *.idx files
A while back, I posted about parsing Java deployment cache index (*.idx) files, and incorporating the information into a timeline.  One of the items I'd seen during analysis that might indicate something suspicious is the last modified time embedded in the server response being relatively close (in time) to when the file was actually sent to the client (indicated by the "date:" field).  As such, I added a rule to my own code, and had the script generate an alert if the "last modified" field was within 5 days of the "date" field; this value was purely arbitrary, but it would've thrown an alert when parsing the files that Corey ran across and discussed in his blog.
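Purely as an illustration (this isn't the actual code from my *.idx parser, and the variable names and example values are made up), that sort of check boils down to comparing two timestamps once they've been converted to Unix epoch format:

use strict;
use warnings;

# assume the parser has already converted the "last modified" and "date:" values
# from the server response into Unix epoch times; the values below are examples
my $last_mod = 1358726400;
my $date     = 1358899200;

my $window = 5 * 24 * 60 * 60;    # the 5-day window is purely arbitrary
if (abs($date - $last_mod) <= $window) {
  print "ALERT: last modified time is within 5 days of the date: field\n";
}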

Adding intel is generally difficult to do with third-party, closed source tools that we download from someone else's web site, particularly GUI tools.  In such cases, we have to access the data in question, export it to a different format, and then run our analysis process against that data.  This is why I recommend that DFIR analysts develop some modicum of programming skill...you can either modify someone else's open source code, or write your own parsing tool to meet your specific needs.  I tend to do this...many of the tools I've written and use, including those for creating timelines, incorporate some degree of alerting functionality.  For example, RegRipper version 2.8 incorporates alerting functionality directly into the plugins.  This alerting functionality can greatly enhance our analysis processes when it comes to detecting persistence mechanisms, as well as illustrating suspicious artifacts that are the result of program execution.
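To give a general idea of what I mean by building alerting into a parsing tool (this is just a sketch, not RegRipper's actual plugin interface...the value name, data, and check are all made-up examples), alerts can be collected alongside the normal report output as values are parsed, and then printed separately:

use strict;
use warnings;

my @alerts;

# example check only: flag any value whose data points into a Temp folder
sub check_value {
  my ($name, $data) = @_;
  push(@alerts, $name." -> ".$data) if ($data =~ m/\\temp\\/i);
}

check_value('Run\updater', 'C:\Users\test\AppData\Local\Temp\updater.exe');

print "ALERT: ".$_."\n" foreach (@alerts);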

Writing Tools
I tend to write my own tools for two basic reasons:

First, doing so allows me to develop a better understanding of the data being parsed or analyzed.  Prior to writing the first version of RegRipper, I had written a Registry hive file parser; as such, I had a very deep understanding of the data being parsed.  As a result, I'm better able to troubleshoot an issue with any similar tool, rather than simply saying, "it doesn't work", and not being able to describe what that means.  Around the time that Mandiant released their shim cache parsing script, I found that the Perl module used by RegRipper was not able to parse "big data" values; rather than contacting the author and simply saying, "it doesn't work", I was able to determine what about the code wasn't working, and provide a fix.  A side effect of having this level of insight into data structures is that you're able to recognize which tools work correctly, and select the proper tool for the job.

Second, I'm able to update and make changes to the scripts I write in pretty short order, and don't have to rely on someone else's schedule to allow me to get the data that I'm interested in or need.  I've been able to create or update RegRipper plugins in around 10 - 15 minutes, and when needed, create new tools in an hour or so.

We don't always have to get our intelligence just from our own analysis.  For example, this morning on Twitter, I saw a tweet from +Chris Obscuresec indicating that he'd found another DLL search order issue, this one on Windows 8 (an application looked for cryptbase.dll in the ehome folder before looking in system32); as soon as I saw that, I thought, "note to self: add checking for this specific issue to my Win8 analysis process, and incorporate it into my overall DLL search order analysis process".
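A check like that is straightforward to fold into a script.  As a rough sketch (the folder paths here are examples, and this assumes you're looking at a live system or a mounted image), you could flag any DLL in an application folder that shares a name with a DLL in system32:

use strict;
use warnings;
use File::Spec;

# example folders; in practice these would point into the system or image being examined
my $app_dir = 'C:\Windows\ehome';
my $sys_dir = 'C:\Windows\System32';

opendir(my $dh, $app_dir) or die "Cannot open ".$app_dir.": $!";
foreach my $file (readdir($dh)) {
  next unless ($file =~ m/\.dll$/i);
  # a DLL in the application folder that also exists in system32 may indicate
  # a DLL search order issue
  if (-e File::Spec->catfile($sys_dir, $file)) {
    print "Possible DLL search order issue: ".$file." found in ".$app_dir."\n";
  }
}
closedir($dh);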

The key here is that no one of us knows everything, but together, we're smarter than any one of us.

I know that what we've discussed so far in this post sounds a lot like the purpose behind the OpenIOC framework.  I agree that there needs to be a common framework or "language" for representing and sharing this sort of information, but it would appear that some of the available frameworks may be too stringent, may not offer enough flexibility, or are simply internal to some organizations.  Or, the issue may be as Chris Pogue mentioned during the 2012 SANS DFIR Summit..."no one is going to share their secret sauce."  I still believe that this is the case, but I also believe that some fantastic opportunities are being missed because so much is being lumped under the umbrella of "secret sauce"; sometimes, simply sharing that you're seeing something similar to what others are seeing can be a very powerful data point.

Regardless of the reason, we need to overcome our own (possibly self-imposed) roadblocks to sharing the things we learn, as sharing information between analysts has considerable value.  Consider this post...who had heard of the issue with imm32.dll prior to reading that post?  We all become smarter through sharing information and intelligence.  This way, we're able to incorporate not just our own intelligence into our analysis processes, but we're also able to extend our capabilities by adding intelligence derived and shared by others.