Data Structures, Revisited

A while back, I wrote this article about understanding data structures.  The importance of this topic has not diminished with time; if anything, it deserves much more visibility.  Understanding data structures gives analysts insight into the nature and context of artifacts, which in turn provides a better picture of their overall case.

First off, what am I talking about?  When I say "data structures", I'm referring to the stuff that makes up files.  Most of us probably tend to visualize files on a system as being either lines of ASCII text (*.txt files, some log files, etc.) or an amorphous blob of binary data.  We may sometimes even visualize these blobs of binary data as text files, because of how our tools present the information found in those blobs.  However, as we've seen over time, there are parts of these blobs that can be extremely meaningful to us, particularly during an examination.  For example, in some of these blobs there may be an 8-byte sequence that is a FILETIME-format time stamp representing when a file was accessed, or when a device was installed on a system.
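Just to make that concrete, here's a minimal sketch (in Python, purely for illustration) of how one of those 8-byte sequences gets turned into something human-readable; a FILETIME is simply a count of 100-nanosecond intervals since 1 January 1601 (UTC):

import struct
from datetime import datetime, timedelta, timezone

def filetime_to_datetime(raw8):
    # raw8 is the 8-byte, little-endian FILETIME value as read from the artifact
    ft, = struct.unpack("<Q", raw8)
    # FILETIME counts 100-nanosecond intervals since 1601-01-01 00:00:00 UTC
    return datetime(1601, 1, 1, tzinfo=timezone.utc) + timedelta(microseconds=ft // 10)

# example value: the Unix epoch (1 Jan 1970) expressed as a FILETIME
print(filetime_to_datetime(b"\x00\x80\x3e\xd5\xde\xb1\x9d\x01"))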

A while back, as an exercise to learn more about the format of the IE (version 5 - 9) index.dat file, I wrote a script that would parse the file based on the contents of the header, which includes a directory table that points to all of the valid records within the file, per the format documentation available on the ForensicsWiki (thanks to Joachim Metz for documenting the format, the PDF of which can be found here).  Again, this was purely an exercise for me, and not something monumentally astounding...I'm sure that we're all familiar with pasco.  Using what I'd learned, I wrote another script to parse just the header of the index.dat as part of malware detection; the idea is that if a user account such as "Default User", LocalService, or NetworkService has a populated index.dat file, that's an indication that malware on the system is running with System-level privileges and communicating off-system via the WinInet API.  I've not only discussed this technique on this blog and in my books, but I've also used it quite successfully a number of times, most recently to quickly identify a system infected with ZeroAccess.
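To give a rough idea of what that triage check looks like, here's a minimal sketch I put together for this post (an illustration, not the actual script); the "Client UrlCache MMF" file signature and the "URL ", "REDR", and "LEAK" record signatures come straight from the published format documentation:

import sys

def index_dat_is_populated(path):
    # Service-account index.dat files are typically small, so just read the whole thing
    with open(path, "rb") as f:
        data = f.read()
    # Sanity check: MSIE cache files (index.dat) begin with this signature
    if not data.startswith(b"Client UrlCache MMF"):
        return False
    # Blunt triage check: any record signature at all means the cache has been
    # written to, which for "Default User", LocalService, or NetworkService
    # profiles is worth a much closer look
    return any(sig in data for sig in (b"URL ", b"REDR", b"LEAK"))

# usage: python check_index.py <path to index.dat> [...]
for path in sys.argv[1:]:
    print(path, "populated" if index_dat_is_populated(path) else "empty")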

More recently, I was analyzing a user's index.dat, as I'd confirmed that the user was using IE during the time frame in question.  I parsed the index.dat with pasco and did not find any indication of a specific domain in which I was interested.  I tried my script again...exactly the same results.  I then mounted the image as a read-only volume and ran strings across the user's "Temporary Internet Files" subfolders (with the '-o' switch), looking specifically for the domain name...that command looked like this:

C:\tools>strings -o -n 4 -s | find "domain" /i
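If you'd rather not rely on the Sysinternals tools, a few lines of Python will do roughly the same thing...scan a file and report the byte offset of every case-insensitive hit (the file name and search term below are just placeholders):

import sys

def find_offsets(path, needle):
    # Report the byte offset of every case-insensitive hit for 'needle'
    with open(path, "rb") as f:
        data = f.read().lower()
    needle = needle.lower().encode()
    hits, start = [], 0
    while True:
        idx = data.find(needle, start)
        if idx == -1:
            return hits
        hits.append(idx)
        start = idx + 1

# usage: python find_offsets.py index.dat domain
for off in find_offsets(sys.argv[1], sys.argv[2]):
    print(hex(off))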

Interestingly enough, I got 14 hits for the domain name in the index.dat file.  Hhhhmmmm....that got me to thinking.  Since I had used the '-o' switch in the strings command, the output included the offsets within the file to the hits, so I opened the index.dat in a hex editor and manually scrolled down to some of the offsets; at the first, I found full records (based on the format specification that Joachim had published).  At another, there was only a partial record, but the string I was looking for was right there.  So, I wrote another script that parses through the file, from beginning to end, and locates records without using the directory table.  When the script finds a complete record, it parses it and displays the record contents.  If the record is not complete, the script dumps the bytes in a hex dump so that I can see the contents.  In this way, I was able to retrieve 10 complete records that were not listed in the directory table (and were essentially deleted), and 4 partial records, all of which contained the domain that I was looking for.
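For anyone curious what that looks like, here's a stripped-down sketch of the carving approach (again, an illustration rather than the script itself); per the published format, each URL record begins with the 4-byte signature "URL " followed by a 32-bit count of 128-byte blocks, and the last modified and last accessed FILETIME values sit at record offsets 0x08 and 0x10:

import re, struct, sys
from datetime import datetime, timedelta, timezone

BLOCK = 128  # index.dat records are allocated in 128-byte blocks

def ft(raw):
    # Decode an 8-byte FILETIME (100ns intervals since 1601-01-01 UTC)
    val, = struct.unpack("<Q", raw)
    if val == 0:
        return None
    return datetime(1601, 1, 1, tzinfo=timezone.utc) + timedelta(microseconds=val // 10)

def carve(path):
    with open(path, "rb") as f:
        data = f.read()
    # Walk the entire file for record signatures rather than trusting the
    # directory table in the header
    for m in re.finditer(b"URL ", data):
        off = m.start()
        nblocks, = struct.unpack("<I", data[off + 4:off + 8])
        size = nblocks * BLOCK
        if not 0 < size <= 0x10000 or off + size > len(data):
            continue  # implausible size...likely a false hit or a truncated (partial) record
        rec = data[off:off + size]
        modified, accessed = ft(rec[0x08:0x10]), ft(rec[0x10:0x18])
        # Pull a printable URL-looking string out of the record rather than
        # relying on the record's internal string offsets
        url = re.search(rb"[\x20-\x7e]{8,}", rec[0x30:])
        print(hex(off), modified, accessed, url.group().decode() if url else "")

carve(sys.argv[1])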

Microsoft refers to the compound file binary file format as a "file system within a file", and if you dig into the format document just a bit, you'll start to see why...the specification details sectors of two sizes, not all of which are necessarily allocated.  This means that you can have strings and other data buried within the file that are not part of the file when viewed through the appropriate application.
CFB Format
The Compound File Binary Format document available from MS specifies the use of a sector allocation table, as well as a small sector allocation table.  For Jump Lists in particular, these structures specify which sectors are in use; mapping the sectors that are in use, and then targeting just those within the file that are not, can allow you to recover potentially deleted information.
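What that mapping boils down to is pretty simple; here's a minimal sketch that uses just the structures defined in the MS-CFB specification (the sector size field in the header, the 109 DIFAT entries carried in the header, and the FAT itself) to list sectors that are allocated to the file on disk but not in use...FREESECT (0xFFFFFFFF) marks a FAT entry whose sector is free.  Note that larger files chain additional DIFAT sectors, which this sketch ignores:

import struct, sys

FREESECT = 0xFFFFFFFF

def free_sectors(path):
    with open(path, "rb") as f:
        data = f.read()
    assert data[:8] == b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "not a CFB file"
    sector_shift, = struct.unpack_from("<H", data, 0x1E)
    sec_size = 1 << sector_shift        # 512 bytes for version 3 files, 4096 for version 4
    # The header carries the first 109 DIFAT entries; each is the sector number
    # of a FAT sector
    difat = struct.unpack_from("<109I", data, 0x4C)
    fat = []
    for sec in difat:
        if sec == FREESECT:
            continue
        off = (sec + 1) * sec_size      # sector numbering starts after the header sector
        fat.extend(struct.unpack_from("<%dI" % (sec_size // 4), data, off))
    # A FREESECT entry means the sector is present in the file on disk but not
    # used by the compound file...a place to look for residual data
    return [i for i, entry in enumerate(fat) if entry == FREESECT]

for sec in free_sectors(sys.argv[1]):
    print("free sector:", sec)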

MS Office documents no longer use this file format specification, but it is used in *.automaticDestinations-ms Jump Lists on Windows 7 and 8.  The Registry is similar, in that the various "cells" that comprise a hive file can allow for a good bit of unallocated or "deleted" data...either deleted keys and values, or residual information in sectors that were allocated to the hive file as it continued to grow in size.  MS does a very good job of making the Windows XP/2003 Event Log record format available; as such, Event Logs from these systems can be parsed on a binary basis (to locate valid records within the .evt file that are "hidden" by the information in the header), and records can also be recovered from unallocated space and other unstructured data.  MFT records have been shown to contain useful data, particularly as a file moves from being resident to non-resident (specific to the $DATA attribute), and that can be especially true for systems on which MFT records are 4K in size (rather than the 1K that most of us are familiar with).
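As one example of what that binary-basis parsing looks like, Windows XP/2003 Event Log records carry a 32-bit record length, followed by the magic number "LfLe" (0x654c664c), with the same length value repeated as the last 4 bytes of the record; that alone is enough to carve candidate records out of unallocated space or any other blob of data.  A rough sketch (my own illustration, not a replacement for a proper parser):

import re, struct, sys

def carve_evt_records(data):
    # Each EVENTLOGRECORD starts with a 4-byte length, then the "LfLe" magic;
    # the same length value is repeated as the final 4 bytes of the record
    for m in re.finditer(b"LfLe", data):
        start = m.start() - 4
        if start < 0:
            continue
        length, = struct.unpack_from("<I", data, start)
        if length < 0x38 or start + length > len(data):
            continue
        trailer, = struct.unpack_from("<I", data, start + length - 4)
        if trailer == length:
            yield start, data[start:start + length]

with open(sys.argv[1], "rb") as f:
    blob = f.read()
for offset, rec in carve_evt_records(blob):
    # record number and TimeGenerated (Unix epoch seconds) follow the magic
    record_num, time_generated = struct.unpack_from("<II", rec, 8)
    print(hex(offset), "record #%d" % record_num, "generated: %d" % time_generated)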

Understanding data structures can help us develop greater detail and additional context with respect to the available data during an examination.  We can recover data from within files that is not "visible" through the normal application view by going beyond the API.  Several years ago, I was conducting a PCI forensic audit, and found several potential credit card numbers "in" a Registry hive...understanding the structures within the file, and taking a bit of a closer look, revealed that what I was seeing wasn't part of the Registry structure, but instead part of the sectors allocated to the hive file as it grew...they simply hadn't been overwritten with key and value cells yet.  This information had a significant impact on the examination.  In another instance, I was trying to determine which files a user had accessed, and found that the user did not have a RecentDocs key within their NTUSER.DAT; I found this odd, as even a newly-created profile will have a RecentDocs key.  Using regslack.exe, I was able to retrieve the deleted RecentDocs key, as well as several subkeys and values.
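The hive "cell" structure mentioned above is simple enough to walk directly; hive data is organized into 4096-byte "hbin" blocks, and each cell within an hbin starts with a signed 32-bit size...negative means the cell is allocated, positive means it's free.  Here's a minimal sketch that reports the free cells (conceptually what regslack does, though nowhere near as capable):

import struct, sys

HBIN_SIZE = 4096

def free_cells(path):
    with open(path, "rb") as f:
        data = f.read()
    assert data[:4] == b"regf", "not a Registry hive file"
    off = HBIN_SIZE                      # hive bins start after the 4096-byte base block
    while off + 4 <= len(data) and data[off:off + 4] == b"hbin":
        hbin_len, = struct.unpack_from("<I", data, off + 8)
        if hbin_len < 0x20:
            break
        cell = off + 0x20                # cells begin after the 32-byte hbin header
        while cell < off + hbin_len:
            size, = struct.unpack_from("<i", data, cell)
            if size == 0:
                break                    # remainder of the bin is slack
            if size > 0:
                # positive size == free (unallocated/deleted) cell
                yield cell, data[cell + 4:cell + size]
            cell += abs(size)
        off += hbin_len
    # anything in the file beyond the last hbin was allocated to the hive as it
    # grew but never formatted as Registry data; residual data (like the credit
    # card numbers described above) can live there as well

for offset, payload in free_cells(sys.argv[1]):
    print(hex(offset), len(payload), "bytes free")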
Summary
Understanding the nature of the data that we're looking at is critical, as it directs our interpretation of that data.  That interpretation, in turn, directs subsequent analysis and shapes our conclusions.  If we don't understand the nature of the data and the underlying data structures, our interpretation (and everything built on it) can be way off.  Is that credit card number, which we found via a search, actually stored in the Registry as value data?  Just because our search utility located it within the physical sectors associated with a particular file name, do we understand enough about the file's underlying data structures to determine the true nature and context of the data?