The Tool Validation "Myth-odology"
I posted recently about understanding data structures, and I wanted to continue that line of reasoning into the current state of tool validation.
What we have seen in the community for some time is that a new tool is announced or mentioned, and members of the community begin clamoring for their copy of that tool. Many times, one of the first questions is, "where can I download a copy of the tool?" The reason most give for wanting a copy of the tool is so that they can "test" it, or use it to validate the output of other tools. To that, I would pose this question: if you do not understand what the tool is doing, what it is designed to do, or the underlying data structures being parsed, how can you effectively test the tool, or use it to validate other tools?
As such, the current state of tool validation, for the most part, isn't so much a methodology as it is a myth-odology. Obviously, this isn't associated with testing and validation processes such as those used by NIST and other organizations, and applies more to individual analysts.
There are tools out there right now that are being recommended as THE tool for parsing a particular artifact or set of artifacts. The tools are, in fact, very good at what they do, but some of them do not parse all of the data structures available within those artifacts, nor do they indicate in their output that these structures are missing. I'm aware of analysts who, in some cases, have said that a tool not parsing and displaying specific artifacts isn't an issue for them, because the tool showed them what they were looking for. I think what's happening is that someone will run a tool against a data set, see a lot of data in the output, and deem it "good". They may then run another tool against the same data set, see different output, and deem one of the tools "not good", or at the very least "questionable". What I don't think is happening is analysts testing the tools against the data structures themselves; instead, the data is treated as a 'blob', and the tools are relied upon to provide that layer of abstraction I mentioned in my previous post.
Consider the parsing of shell items and shell item ID lists. These artifacts abound on Windows systems, more so with each successive version of Windows. One place they've existed for some time is in Windows shortcut (LNK) files. Some of the tools we've used for years parse both the headers and LinkInfo blocks of these files, but it's only been in the past 12 - 18 months or so that tools have parsed the shell item ID lists. Why is this important? These blog posts do a great job of explaining why...give them a read. Another reason is that over the past year or so, I've run across several LNK files that consisted solely of the header and the shell item ID list...there was no LinkInfo block to parse. As such, some of the tools available at the time would simply return blank output.
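To give a sense of what parsing the shell item ID list actually involves, here's a minimal Python sketch (not the code of any particular tool) that reads the LinkFlags field from the LNK header, reports whether the shell item ID list and LinkInfo block are even present, and walks the ID list if it exists. The offsets follow the publicly documented shell link header layout; the actual decoding of each shell item is left as a comment.

```python
import struct
import sys

HAS_LINK_TARGET_ID_LIST = 0x00000001   # LinkFlags bit: shell item ID list present
HAS_LINK_INFO           = 0x00000002   # LinkFlags bit: LinkInfo block present

def inspect_lnk(path):
    """Report which optional structures an LNK file actually contains,
    and walk the shell item ID list if there is one."""
    with open(path, "rb") as f:
        header = f.read(76)                           # ShellLinkHeader is 76 bytes
        if len(header) < 76 or struct.unpack("<I", header[:4])[0] != 0x4C:
            raise ValueError("%s: not a shell link (LNK) file" % path)

        flags = struct.unpack("<I", header[20:24])[0]
        print("HasLinkTargetIDList :", bool(flags & HAS_LINK_TARGET_ID_LIST))
        print("HasLinkInfo         :", bool(flags & HAS_LINK_INFO))

        if flags & HAS_LINK_TARGET_ID_LIST:
            id_list_size = struct.unpack("<H", f.read(2))[0]
            consumed = 0
            while consumed < id_list_size:
                item_size = struct.unpack("<H", f.read(2))[0]
                if item_size == 0:                    # terminal ID ends the list
                    break
                item_data = f.read(item_size - 2)
                consumed += item_size
                # A real parser would decode item_data based on its type
                # indicator; this sketch only shows that the structures exist.
                print("  shell item: %d bytes, type indicator 0x%02x"
                      % (item_size, item_data[0]))

if __name__ == "__main__":
    inspect_lnk(sys.argv[1])
```

The point isn't that this is how a tool should be written; it's that once you know where the shell item ID list lives, you can check for yourself whether a given LNK file even has a LinkInfo block for a tool to parse.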
There is also the issue of understanding how a tool performs its function. Let's take a look at the XP Event Log example again. Tools that use the MS API to parse these files are likely going to return the "corrupted file" message we're all used to seeing, but tools that parse the files on a binary level, going record by record, will likely work just fine.
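To illustrate the difference, here's a rough Python sketch of the record-by-record approach: it carves EVT records by scanning the raw file data for the "LfLe" record signature rather than asking the API to open the file. It's a sketch, not a complete parser; the field layout assumed here is the documented EVENTLOGRECORD header.

```python
import datetime
import struct

def carve_evt_records(path):
    """Scan a raw XP/2003 Event Log (.evt) file for the 'LfLe' record
    signature and pull basic fields from each hit, rather than trusting
    the file header the way the Windows API does."""
    with open(path, "rb") as f:
        data = f.read()

    offset = data.find(b"LfLe")
    while offset != -1:
        rec_start = offset - 4                # Length DWORD precedes the signature
        if rec_start >= 0 and rec_start + 16 <= len(data):
            length, _, rec_num, time_gen = struct.unpack_from("<IIII", data, rec_start)
            # TimeGenerated is a 32-bit Unix epoch value in EVT records.
            # Note: the .evt file header shares the 'LfLe' signature, so a
            # real parser would filter it out (and sanity-check 'length').
            generated = datetime.datetime.utcfromtimestamp(time_gen)
            print("record %d at offset %d, %d bytes, generated %s"
                  % (rec_num, rec_start, length, generated))
        offset = data.find(b"LfLe", offset + 4)
```

A tool built this way doesn't care whether the file header says the log is "clean" or not, which is exactly why it keeps working where the API-based tools give up.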
Another myth or misconception seen far too often is that the quality of a tool is determined by how much space its output consumes. This simply is not the case. Again, consider the shell item ID lists in LNK files. Some of the structures that make up these lists contain time stamps, and a number of tools display them. What do these time stamps mean? How are they generated or produced? Perhaps equally important: what format are the time stamps saved in? As it turns out, they're in the 32-bit DOSDate format, which has a 2-second granularity. On NTFS systems, a folder entry (leading to the target file) that appears in the shell item ID list has its 64-bit FILETIME time stamp converted to a 32-bit DOSDate time stamp, with a corresponding loss in granularity. It's important to understand not only the data structure and its various elements, but also the context of those elements. So, if one tool lists all of the elements of the component data structures and another does not, is the second tool any less valid or correct?
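To make the granularity issue concrete, here's a small Python sketch that packs a timestamp into the 16-bit DOS date and time words and then unpacks it again; any odd seconds value cannot survive the round trip.

```python
import datetime

def encode_dosdate(dt):
    """Pack a datetime into the 16-bit DOS date and time words used in
    shell items. Seconds are stored as seconds // 2."""
    d = ((dt.year - 1980) << 9) | (dt.month << 5) | dt.day
    t = (dt.hour << 11) | (dt.minute << 5) | (dt.second // 2)
    return d, t

def decode_dosdate(d, t):
    """Unpack the 16-bit DOS date/time words back into a datetime."""
    return datetime.datetime(
        ((d >> 9) & 0x7F) + 1980, (d >> 5) & 0x0F, d & 0x1F,
        (t >> 11) & 0x1F, (t >> 5) & 0x3F, (t & 0x1F) * 2)

# Round-trip a timestamp with an odd seconds value: the 2-second
# granularity means :37 comes back as :36.
original = datetime.datetime(2013, 2, 7, 14, 22, 37)
print(decode_dosdate(*encode_dosdate(original)))   # 2013-02-07 14:22:36
```

So if a tool shows you a DOSDate time stamp from a shell item that's off by a second from the FILETIME on the file system, that's not a bug in the tool; it's the format doing exactly what it does.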
Returning to the subject of data structures, does this mean that every analyst must know and understand the details for every available data structure on, say, a Windows system? No, not at all...that's simply not realistic. The answer, IMHO, is that analysts need to engage. If you're unclear about something, ask. If you need a reference, ask someone. There are some great structure references posted on the ForensicsWiki, including those posted by Joachim Metz, but I think that far too few analysts use that site as a resource. By sharing what we know, and coupling that with what we need to know, we can approach a better method for validating the tools and methodologies that we use.