Structure Adds Context
A while ago, I was talking to Cory Altheide and he mentioned something about timeline analysis that clarified an aspect of the technique for me...he said that creating a timeline from multiple data sources adds context to the data you're looking at. This made a lot of sense to me, because rather than just using file system metadata and displaying only the MACB times of files and directories, if we add Event Log records, Prefetch file metadata, Registry data, etc., we suddenly see more than just that a file was created or accessed. We start to see that, for example, user A logged in, launched an application, and that the file creation or modification we were interested in was the result of those actions.
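A minimal sketch of that idea, with made-up event records (real tools normalize far more fields and sources, but the core step is the same: put everything on one clock and sort):

```python
from datetime import datetime

# Hypothetical, simplified events from three different sources; timestamps
# and descriptions here are invented purely for illustration.
events = [
    ("filesystem", "2011-07-01 10:02:13", "File created: C:\\temp\\out.dat"),
    ("eventlog",   "2011-07-01 10:01:58", "User A logged in"),
    ("prefetch",   "2011-07-01 10:02:05", "badapp.exe executed"),
]

def build_timeline(events):
    """Sort normalized (source, timestamp, description) tuples by time."""
    return sorted(events, key=lambda e: datetime.strptime(e[1], "%Y-%m-%d %H:%M:%S"))

for source, ts, desc in build_timeline(events):
    print(f"{ts}  [{source:<10}]  {desc}")
```

Viewed one source at a time, each of those entries is just an isolated event; sorted together, the login-launch-file-creation sequence becomes visible.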
Lately, I've been looking at a number of data structures used by Windows systems...for example, the DestList stream within Windows 7 jump lists. What this got me thinking about is this...as analysts, we have to understand the structure in which data is stored, and correspondingly, how it's used by the application. We need to understand this because the structure of the data can provide context to that data.
Let's look at an example...once, in a galaxy far, far away, I was working on a PCI forensic assessment, which included scanning every acquired image for potential credit card numbers (CCNs). When the scan had completed, I found that I had a good number of hits in two Registry hive files. So my analysis can't stop there, can it? After all, what does that mean, that I found CCNs in the Registry? In and of itself, that statement is lacking context. So, I need to ask:
Are the hits in key names? In value names? In string value data, or embedded within binary value data? Or are the hits located in unallocated space within the hive files?
The answers to any of these questions would significantly impact my analysis and the findings that I report.
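The scan itself can be sketched simply enough...a regex for candidate digit runs plus a Luhn check to weed out random numbers. This is a deliberate simplification (real CCN scanners handle multiple lengths, separators, and issuer prefixes), and it's exactly the step that produces hits without context; determining where each hit sits inside the hive structure is the follow-up analysis the questions above describe:

```python
import re

def luhn_valid(digits: str) -> bool:
    """Luhn checksum: filters out digit runs that can't be valid CCNs."""
    total = 0
    for i, d in enumerate(reversed(digits)):
        n = int(d)
        if i % 2 == 1:      # double every second digit from the right
            n *= 2
            if n > 9:
                n -= 9
        total += n
    return total % 10 == 0

def scan_for_ccns(data: bytes):
    """Return (offset, candidate) pairs for 16-digit runs that pass Luhn."""
    hits = []
    for m in re.finditer(rb"\d{16}", data):
        s = m.group().decode("ascii")
        if luhn_valid(s):
            hits.append((m.start(), s))
    return hits
```

Note that the output is just an offset and a string...nothing about whether that offset falls within a key name, value data, or unallocated space in the hive.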
Here's another example...I remember talking with someone a while back who'd "analyzed" a Windows PE file by running strings on it, and found the name of a DLL. I don't remember the exact conclusions they'd drawn from this, but what I do remember is thinking that had they done some further analysis, they might have reached different conclusions. After all, finding a string in a 200+ KB file is one thing...but what if that DLL had been listed in the PE file's import table? Wouldn't that have a different impact on the analysis than if the DLL name was instead the name of the file where stolen data was stored before being exfil'd?
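A quick illustration of why a raw strings hit carries so little context on its own...here's a naive, pure-Python version of what strings does (a simplification of the real utility, with fabricated byte buffers standing in for file contents). It reports the same DLL name whether that name lives in an import table or in a chunk of configuration data:

```python
import re

def extract_strings(data: bytes, min_len: int = 4):
    """Naive ASCII string extraction, like `strings` but simplified:
    every printable run is reported, with zero structural context."""
    return [m.group().decode("ascii")
            for m in re.finditer(rb"[\x20-\x7e]{%d,}" % min_len, data)]

# The same DLL name in two very different (invented) contexts; the string
# hit is identical, but its meaning is not.
fake_imports = b"\x00\x00KERNEL32.dll\x00wininet.dll\x00\x00"
fake_config  = b"\x00\x00exfil_target=wininet.dll\x00\x00"

print(extract_strings(fake_imports))
print(extract_strings(fake_config))
```

Parsing the PE structure (e.g., walking the import directory, or using a library such as pefile) is what tells you which of those two situations you're actually in.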
So, much like timeline analysis, understanding the structure in which data is stored, and how that data is used by an application or program, can provide context to the data that will significantly impact your analysis and findings.
Addendum, 7 July
I've been noodling this over a bit more, and another thought I had was that this concept applies not just to DF analysis, but also to the work that often goes on beyond the analysis itself, particularly in the LE field: developing intelligence.
In many cases, and particularly for law enforcement, there's more to DF analysis than simply running keyword searches or finding an image. In many instances, the information found in one examination is used to develop intelligence for a larger investigation, either directly or indirectly. So, it's not just about, "hey, I found an IP address in the web logs", but what verb was used (GET, POST, etc.), what were the contents of the request, who "owns" the IP address, etc.
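As a sketch, pulling that extra context out of a Common Log Format web log entry is straightforward (the log line below is made up for illustration):

```python
import re

# Common Log Format: host ident authuser [timestamp] "request" status bytes
CLF = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<verb>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
)

line = '203.0.113.7 - - [07/Jul/2011:10:22:04 -0400] "POST /upload.php HTTP/1.1" 200 5120'

m = CLF.match(line)
if m:
    # A successful POST to an upload script is a very different finding
    # than a GET for a static image from the same IP address.
    print(m.group("ip"), m.group("verb"), m.group("path"), m.group("status"))
```

The IP address alone is just an indicator; the verb, the requested resource, and the response status are what turn it into something you can act on or build intelligence from.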
So how is something like this implemented? Well, let's say you're using Simson's bulk_extractor, and you find that a particular email address that's popped up in your overall investigation was located in an acquired image. Just the fact that this email address exists within the image may be a significant finding, but at this point, you don't have much in the way of context, beyond the fact that you found it in the image. It could be in an executable, or part of a chat transcript, or in another file. Regardless, where the email address is located within the image (i.e., which file it's located in) will significantly impact your analysis, your findings, and the intel you derive from these.
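One way to make that determination, sketched with made-up numbers: bulk_extractor reports the byte offset of each feature within the image, and filesystem-layer tools can give you each file's data runs, so a simple interval lookup tells you which file (if any) owns that offset. The extents below are entirely hypothetical; in practice they'd come from parsing filesystem metadata, and fragmented files would have multiple runs each:

```python
# Hypothetical data runs: (file name, start offset in image, length).
file_extents = [
    ("inbox.pst",   1048576, 524288),
    ("chat.log",    2097152,  65536),
    ("unalloc_01",  4194304, 131072),
]

def file_for_offset(offset, extents):
    """Return the name of the file whose data run contains the offset."""
    for name, start, length in extents:
        if start <= offset < start + length:
            return name
    return None

# Offset where the feature was reported (again, an invented value)
print(file_for_offset(1310720, file_extents))
```

Whether that lookup lands in a PST, a chat log, or unallocated space is precisely the context the raw feature hit was missing.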
Now, let's say you take this a step further and determine, based on the offset within the image where the email address was located, that the file in which it resides is an email message. Now, this provides you with a bit more context, but if you really think about it, you're not done yet...how is the email-address-of-interest used in the file? Is it in the To:, CC:, or From: fields? Is it in the body of the message? Again, where that data sits within the structure in which it's stored can significantly impact your analysis, and your intel.
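Python's standard library email parser makes that last check easy to sketch (the message below is fabricated for the example):

```python
from email import message_from_string
from email.utils import getaddresses

raw = """\
From: alice@example.com
To: bob@example.com
CC: target@example.com
Subject: test

Contact target@example.com for the files.
"""

msg = message_from_string(raw)

def address_roles(msg, addr):
    """Report which header fields carry the address, and whether it
    also appears in the (non-multipart) message body."""
    roles = [field for field in ("From", "To", "CC")
             if addr in [a for _, a in getaddresses(msg.get_all(field, []))]]
    if addr in (msg.get_payload() or ""):
        roles.append("body")
    return roles

print(address_roles(msg, "target@example.com"))
```

An address in the From: field means something very different from the same address merely mentioned in the body, and the structure of the message is what lets you tell them apart.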
Consider how your examination might be impacted if the email address were found in unallocated space or within the pagefile, as opposed to within an email.