Understanding Data Structures
Sometimes at conferences or during a presentation, I'll provide a list of tools for parsing a specific artifact (e.g., the MFT, Prefetch files, etc.), and I'll mention a tool or script that I wrote that presents specific data in a particular format. Invariably, when this happens, someone asks for a copy of the tool/script. Many times, these scripts aren't meant for public consumption, and are only intended to illustrate what data is available within a particular structure. As such, I'll ask why, with all of the other available tools, someone would want a copy of yet another tool, and the response is most often, "...to validate the output of the other tools." So, I'm left wondering...if you don't understand the data structure that is being accessed or parsed, how is having another tool to parse it beneficial?
Tools provide a layer of abstraction over the data, and as such, while they allow us access to information within these data structures (or files) in a much more timely manner than if we were to attempt to do so manually, they also tend to separate us from the data...if we allow this to happen. For many of the more popular data structures or sources available, there are likely multiple tools that can be used to display information from those sources. But the questions then become, (a) do you understand the data source(s) being parsed, and (b) do you know what the tool is doing to parse those data structures? Is the tool using an MS API to parse the data, or is it doing so on a binary level?
A great example of this is what many of us will remember seeing after extracting Windows XP Event Logs from an image and attempting to open them in the Event Viewer on our analysis system. In some cases, we'd see a message telling us that the Event Log was corrupted. However, it was very often the case that the file wasn't actually corrupted, but rather that our analysis system did not have the appropriate message DLLs installed for some of the records. Microsoft does, however, provide very clear and detailed definitions of the Event Log record structures, and as such, tools that do not use the Windows API to parse the Event Log files can be used to much greater effect, including parsing individual records from unallocated space. This could not be done without an understanding of the data structures.
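To illustrate what parsing on a binary level can look like, here is a minimal sketch (not a copy of any particular tool) that carves candidate event records out of raw data by their "LfLe" signature, using only the EVENTLOGRECORD layout that Microsoft documents. The function name and the input file name are hypothetical, and the same routine can be pointed at a blob of unallocated space just as easily as at an intact .evt file.

import struct
from datetime import datetime, timezone

def scan_evt_records(data):
    """Yield header fields for each candidate EVENTLOGRECORD found in data."""
    pos = data.find(b"LfLe")
    while pos != -1:
        start = pos - 4                          # the record length precedes the "LfLe" signature
        if start >= 0 and len(data) - start >= 56:
            (length, _sig, rec_num, time_gen, time_wri, event_id,
             event_type, num_strings, _category) = struct.unpack_from("<6I3H", data, start)
            if 56 <= length <= 0x10000:          # simple sanity check on the record length
                yield {
                    "record_number": rec_num,
                    "time_generated": datetime.fromtimestamp(time_gen, tz=timezone.utc),
                    "time_written": datetime.fromtimestamp(time_wri, tz=timezone.utc),
                    "event_id": event_id & 0xFFFF,
                    "event_type": event_type,
                    "num_strings": num_strings,
                }
        pos = data.find(b"LfLe", pos + 4)

if __name__ == "__main__":
    with open("SecEvent.Evt", "rb") as f:        # hypothetical input; could be unallocated space
        for rec in scan_evt_records(f.read()):
            print(rec)

Nothing about this depends on the Windows API or on message DLLs being installed; it works entirely from the documented record layout, which is exactly why the "corrupted" message from the Event Viewer doesn't apply.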
Not long ago, Francesco contacted me about the format of automaticDestinations Jump List files, because he'd run a text search across an image and found a hit "in" one of these files, but parsing the file with multiple tools gave no indication of the search hit. It turned out that understanding the format of MS compound file binary (CFB) files gives us a clear means of mapping the unallocated 'sectors' within the Jump List file itself, and of determining why he'd seen a search hit 'in' the file even though that hit wasn't part of the output of the commonly used tools for parsing these files.
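As a rough sketch of what that understanding looks like in practice, the following walks the CFB header and FAT and lists the sectors marked free...exactly the 'unallocated' space within the file where a search hit can land without ever appearing in a stream that the common tools parse. The file name and function name are hypothetical, and the code deliberately ignores the extended DIFAT, which is fine for files as small as Jump Lists tend to be.

import struct

FREESECT = 0xFFFFFFFF                            # FAT entry marking an unallocated sector

def free_sectors(path):
    """Return (sector number, file offset) for each free sector in a CFB file."""
    with open(path, "rb") as f:
        data = f.read()
    if data[:8] != b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1":
        raise ValueError("not a compound file binary file")
    sector_shift = struct.unpack_from("<H", data, 30)[0]
    sector_size = 1 << sector_shift              # usually 512 bytes
    difat = struct.unpack_from("<109I", data, 76)    # first 109 DIFAT entries live in the header
    fat = []
    for fat_sect in difat:
        if fat_sect == FREESECT:                 # unused DIFAT slot; no more FAT sectors
            break
        off = (fat_sect + 1) << sector_shift     # sector n starts at (n + 1) * sector_size
        fat.extend(struct.unpack_from("<%dI" % (sector_size // 4), data, off))
    return [(n, (n + 1) << sector_shift) for n, entry in enumerate(fat)
            if entry == FREESECT and (n + 1) << sector_shift < len(data)]

if __name__ == "__main__":
    for sect, offset in free_sectors("example.automaticDestinations-ms"):   # hypothetical file
        print("free sector %d at file offset 0x%x" % (sect, offset))

Once you know which file offsets fall within free sectors, it becomes obvious how a keyword hit can land "in" the file yet never show up in any parsed stream.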
Another great example of this came to my attention this morning via the SQLite: Hidden Data in Plain Sight blog post from the Linuxsleuthing blog. This blog post further illustrates my point; however, in this case, it's not simply a matter of displaying information that is there but not displayed by the available tools. Rather, it is also a matter of correlating the various pieces of information that are available in a manner that is meaningful and valuable to the analyst.
The Linuxsleuthing blog post also asks the question, how do we overcome the shortcomings of common SQLite database analysis techniques? That's an important question to ask, but it should also be expanded to just about any analysis technique available, and not limited simply to SQLite databases. What we need to consider and ask ourselves is, how do we overcome the shortcomings of common analysis techniques?
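To pick just one concrete illustration for the SQLite case: the database header itself records how many freelist pages the file contains, and those pages can still hold deleted records that no SELECT statement will ever return. The sketch below (the file name is hypothetical; the offsets come from the documented SQLite header format) simply surfaces that count, so the analyst knows there is potentially recoverable data beyond what a table-level query shows.

import struct

def freelist_summary(path):
    """Read the 100-byte SQLite header and report freelist page information."""
    with open(path, "rb") as f:
        header = f.read(100)
    if not header.startswith(b"SQLite format 3\x00"):
        raise ValueError("not a SQLite 3 database")
    page_size = struct.unpack_from(">H", header, 16)[0]
    if page_size == 1:                           # per the file format, a stored 1 means 65536
        page_size = 65536
    first_trunk, freelist_count = struct.unpack_from(">II", header, 32)
    return {
        "page_size": page_size,
        "first_freelist_trunk_page": first_trunk,
        "freelist_page_count": freelist_count,
        "potential_unallocated_bytes": freelist_count * page_size,
    }

if __name__ == "__main__":
    print(freelist_summary("mmssms.db"))         # hypothetical example database

A non-zero freelist count doesn't tell you what was deleted, but it does tell you that running queries against the live tables alone will not show you everything the file holds.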
Tools most often provide a layer of abstraction over available data (structures, files, etc.), allowing for a modicum of automation and allowing the work to be done in a much more timely manner than using a hex editor. However, much more is available to us than simply parsing raw data structures and providing some of the information to the analyst. Tools can parse data based on artifact categories, as well as generate alerts for the analyst, based on known-bad or known-suspicious entries or conditions. Tools can also be used to correlate data from multiple sources, but to really understand the nature and context of that data, the analyst needs to have an understanding of the underlying data structures themselves.
Addendum
This concept becomes crystallized when looking at the shell item data structures found on Windows systems. Shell items are not documented by MS, and yet they become more and more prevalent with each successive version of Windows. An analyst who correctly understands these data structures and sees them as more than just "a bunch of hex" will reap the valuable rewards they hold.
Shell items and shell item ID lists are found in the Registry (shellbags, itempos* values, ComDlg32 subkey values on Vista+, etc.), as well as within Windows shortcut artifacts (LNK files, Win7 and 8 Jump Lists, Photos artifacts on Windows 8, etc.). Depending upon the type of shell item, they may contain time stamps in DOSDate format (usually found in file and folder entries), or they may contain time stamps in FILETIME format (found in some variable type entries). Again, tools provide a layer of abstraction over the data itself, and as such, the analyst needs to understand the nature of the time stamp, as well as what that time stamp represents. Not all time stamps are created equal...for example, the DOSDate time stamps within shell items are created by converting the file system metadata time stamps of the file or folder being referred to, reducing the granularity from 100 nanoseconds to 2 seconds (i.e., the seconds value is stored divided by 2, and multiplied by 2 when the time stamp is read back out).
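For illustration, here is a minimal sketch (not any specific tool's implementation) of decoding those 32-bit DOSDate time stamps; the sample values at the end are purely illustrative.

from datetime import datetime

def parse_dosdate(dos_date, dos_time):
    """Decode the two 16-bit values of a DOSDate time stamp as stored in a shell item."""
    if dos_date == 0 and dos_time == 0:
        return None                              # no time stamp was recorded
    day = dos_date & 0x1F
    month = (dos_date >> 5) & 0x0F
    year = ((dos_date >> 9) & 0x7F) + 1980
    seconds = (dos_time & 0x1F) * 2              # stored as seconds / 2, hence the 2-second granularity
    minutes = (dos_time >> 5) & 0x3F
    hours = (dos_time >> 11) & 0x1F
    return datetime(year, month, day, hours, minutes, seconds)

# illustrative values only
print(parse_dosdate(0x3CE1, 0x6A57))             # 2010-07-01 13:18:46

The seconds field only has 5 bits available, which is why the stored value is halved; an analyst comparing a shell item time stamp to the corresponding $MFT entry needs to expect that loss of precision.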
Resources
Windows Shellbag Forensics - Note: the first colorized hex dump includes a SHITEM_FILEENTRY reported as invalid (shown in green); it's not actually invalid, it's simply a different type of shell item.