Reconstructing Data Structures
I've posted before on the topic of understanding data structures (here, here, and here), and some recent analysis brought this back to me yet again.  I had an opportunity to make use of my understanding of data structures, specifically within the Windows Registry, in order to attempt to gain some information from a file where the tools normally used by analysts had failed.
The situation was that we had a Windows system that had been compromised...the bad guy had accessed the system using stolen credentials, then used it to move laterally to other systems. Between this and the response activities, the system had been infected with malware that overwrites and deletes files. A responder had collected potentially usable files from this system, including the Registry hives from the compromised user account, and had then run some Registry parsing tools against the hives, none of which worked. Yep. All of the tools failed...even the viewers.
I was sent a Registry hive file and a list of "strings of interest", and asked to provide some context to those strings, and if possible, when those strings had been created within the hive file. My first step was to get an idea as to why the tools used to view and parse the hive had reportedly failed...so, I opened the file in a hex editor.
The first thing I noticed was that there was no 'regf' header. In Windows Registry hive files, the first four bytes of the file should read "regf" (or "72 65 67 66" in hexadecimal)...there was no header. Next, I looked for the first 'hbin' section, which is usually found at offset 0x1000 within the file, and starts with 'hbin'. In this case, I didn't find an 'hbin' section until I reached offset 0x10000 in the file...and the entire space up to that point was all zeros. I could tell right away that this wasn't good. Even worse, when I was finally able to locate a key node structure within the hive file, it wasn't the root node. In fairly short order, it was easy to see why all of the parsing tools had failed, as none had been able to discern a recognizable file structure.
Each hbin section is 4096 (0x1000) bytes in size, which means that if the first hbin section was located at offset 0x10000 within the file, I was missing over a dozen complete hbin sections. Not a great way to get started, eh? As I scrolled through the rest of the file quickly, I could see what looked like legitimate key and value nodes, but I could also see other sections of the file...large sections...that were full of zeros, as well as some that were full of binary stuff that made no sense whatsoever. In some cases, I could see sections that contained what appeared be Unicode strings, but there were no discernible structures surrounding those strings.
 When I was reading over this article prior to posting it, I tried to imagine that last paragraph as scary as possible, just for effect. I pictured myself reading this out loud like a scout leader telling a bunch of scouts a ghost story around a campfire, or huddled under a blanket with a flashlight. I don't know if that helps get the point across about how badly damaged this file was, or if it was just funny.   I mean, for me, saying, "...the first hbin in the hive file was found at offset 0x10000..." IS a horror story!  Either way, in attempting to provide anything at all, I had my work cut out for me...this was going to be tough.
When I was reading over this article prior to posting it, I tried to imagine that last paragraph as scary as possible, just for effect. I pictured myself reading this out loud like a scout leader telling a bunch of scouts a ghost story around a campfire, or huddled under a blanket with a flashlight. I don't know if that helps get the point across about how badly damaged this file was, or if it was just funny.   I mean, for me, saying, "...the first hbin in the hive file was found at offset 0x10000..." IS a horror story!  Either way, in attempting to provide anything at all, I had my work cut out for me...this was going to be tough. 
Some background about myself...I was "trained" as an electrical engineer; that is, my undergraduate studies were in EE. One of my professors would constantly say that "electrical engineers are inherently lazy", meaning that rather than making a complicated solution, or worse, making wild, unsupported assumptions about what we were looking at, electrical engineers would always seek the simplest solution. He even told us a story once about how a radar system used across the Air Force had been "fixed" by soldering a resistor in parallel to another resistor on a circuit board, rather than replace the entire circuit board. Seeking a simple solution, I thought it best to seek out some reference material for help. My reference for this work was/is Windows Registry Forensics; in this case, the lower half of pg 26 (in the soft cover edition).
By now I pretty much knew that I had my work cut out for me. I had a list of strings of interest, but just the strings. If I was going to make any sense of these strings at all, I'd have to know where they were located within the file. So, I ran MS/SysInternals strings.exe with the -o switch, so that I could get the offset of the where the string was located within the file. Once I had done that, I noted a couple of the strings of interest, and opened the hive file in a hex editor. I picked one of the strings...the output of strings gives me the offset in decimal, so I converted the offset to hexidecimal...and located it in the file. I did this with several of the strings, and in each case, my findings fell into one of three categories:
1. The string was a value name.
Pp. 29 - 31 of WRF cover the structure of a Registry value. The value node header is 20 bytes long, and starts with a 4-byte (DWORD) value that is the size of the overall structure itself (i.e., the header and the name). An example value is illustrated in figure 1.
In figure 1, the value header starts 8 bytes into the listing, with "D8 FF FF FF".  This value translates to -40, and tells us that the structure is 40 bytes in size.  Next, we see "vk", which is the value node identifier.  The next two bytes, "10 00", tell us that the name of the value is 16 bytes long.  The name starts immediately following the value structure, so there is no offset to the name listed in the value header; in this case, we can see the name, "GroupByDirection", which is clearly 16 bytes in size. 
The remaining elements of the value header can be found in table 1.2 in WRF. The value types can be found in table 1.3.
2. The string was the name of a deleted value.
In several instances, I located the string in question and following the structure of a 'nearby' value, found that the string was indeed the value name. However, when a node (key, value) is deleted in a Registry hive file, the size value (first DWORD) is converted to a positive number.
Consider figure 1 again...the first DWORD is "D8 FF FF FF", which is equal to -40. Had the value been deleted, the DWORD would be "28 00 00 00".
3. The string was value data.
In some cases, I found strings at offsets where there was no 'nearby' value (vk) or key (nk) node. Instead, immediately before the string was what appeared to be a DWORD indicating a negative size value. In figure 2, the value is '88 FF FF FF'.
The string illustrated in figure 2 is an excerpt of a value data entry that I extracted from one of my own systems, but it illustrates the point very well. In this case, the first DWORD translates to -120, indicating that the string value is 120 bytes in length. Again, figure 2 illustrates an excerpt of the Registry value data, not the entire string.
Just to be clear, I wasn't looking for all available strings, and I also did not look at all of the strings of interest. By this point, I had looked at about a dozen of so of the strings of interest, out of several dozen. I mention this because some of the strings of interest could have been key names, but at this point, none of the strings I'd looked at were, in fact, key names. I should mention that some of the strings appeared as Unicode strings in those sections of the file that I mentioned earlier in this post...while the string was clearly visible, and of interest in the context of the overall examination, I could not find any discernible structure (Registry key or value node, shell item, etc.) near or surrounding the string.
Now, the Windows Registry is described as a hierarchical database, but it's also something of a singly-linked list of structures. What I mean by this is that the root key node in a Registry hive file points to other keys (subkeys) and values. Subkeys point to other keys and values, and values only point to data. When I say "point to", I'm referring to the offset within the value or key header structure that tells us where the next element is located. This offset is not measured from the beginning of the file (from 0); rather, it starts at the beginning of the first hbin structure, which is (usually) at offset 0x1000. Let's say that you find an offset value within a value header structure that points to the data, and the offset is "D8 54 01 00". Translating endianness, we would look for that data structure at 0x0154D8 + 0x1000 within the hive file, or at offset 0x0164D8 from the beginning of the file.
Because key nodes point to other other key nodes and value nodes, and value nodes point to data, the Registry can be described as a singly-linked list. By contrast, active processes in memory are maintained as a doubly-linked list...each process points to the next process in the list, as well as the previous process in the list. In the Registry, a value does not point to it's parent key, nor do keys point to their parent key. This can make it difficult to reconstruct a damaged Registry hive file, particularly when strings of interest that may be pertinent to the investigation can be seen within the file.
In the instance I was looking at, just scrolling through the hive file, I could see that a good deal of it had been destroyed. In fact, it looked as if the virus had successfully overwritten entire sectors with zeros, and in other some cases, some of the sectors that made up the rest of the file I was sent did not even contain discernible structures. I knew that I couldn't start at the beginning of the file and reconstruct the hive file...too much was missing. One approach might have been to comb through the file and catalog all of the key and value node structures that I could locate (both allocated and unallocated), and then run consecutive scans to (a) locate associated value data, and (b) correlate the values to the appropriate keys.
What I did instead was pick out a couple of the more interesting strings, and locate them within the hive file. I had the offset to the string from the strings.exe output, so I went to that offset in the file, found the string, and then found the beginning of the structure, of which the string was a member. I noted the structure type (value, data) and recorded the offset for the structure. I then subtracted 0x1000 from the offset, reversed the endianness, and searched for the value in the hive file.
Here's an example...in one instance, I located a string that was a value name, and the value structure began at offset 0x164D8 within the file. Subtracting 0x1000, I got 0x154D8, and reversing endianness, I got "D8 54 01 00". I then searched for that hex string in UltraEdit. What that led to was a structure that maintained a list of offsets to values associated with a key. So, I repeated this process, using the location where this structure started. What that led me to discover was that the key had been obliterated from hive; it wasn't "deleted" in the sense that the first DWORD had been converted to a positive value...it simply no longer existed in the file.
As laborious as it is, this process can be used to gain some modicum of context and value for the investigator.
Process
When confronted with a hive file as badly damaged as the one I was looking at, there are basically two ways to go about collecting some modicum of information and context from the file. The first is to comb through the entire file, cataloging each discernible structure, as if you were putting puzzle pieces out on a table. Each structure would have to include not just the information it contained, but also the offset to where it was located within the file. Once the scanning process had been completed, the pieces could be assembled much like a puzzle, albeit with a lot of missing pieces.
The other way to go about this would be to do something like what I did...find a string of interest, locate it within the file, determine what type of structure it belonged to (if it was part of a structure...), and attempt to reconstruct the path based on knowledge of the structures. I opted for this approach because it gave me some answers quickly.
Take-away
My biggest take away from this exercise was that understanding the structure of what I was looking at allowed me to not only troubleshoot the issue and determine why the tools weren't working, but it also allowed me to provide some information, context, and insight regarding the data when those tools were not able to do so.
The situation was that we had a Windows system that had been compromised...the bad guy had accessed the system using stolen credentials, then used it to move laterally to other systems. Between this and the response activities, the system had been infected with malware that overwrites and deletes files. A responder had collected potentially usable files from this system, including the Registry hives from the compromised user account, and had then run some Registry parsing tools against the hives, none of which worked. Yep. All of the tools failed...even the viewers.
I was sent a Registry hive file and a list of "strings of interest", and asked to provide some context to those strings, and if possible, when those strings had been created within the hive file. My first step was to get an idea as to why the tools used to view and parse the hive had reportedly failed...so, I opened the file in a hex editor.
The first thing I noticed was that there was no 'regf' header. In Windows Registry hive files, the first four bytes of the file should read "regf" (or "72 65 67 66" in hexadecimal)...there was no header. Next, I looked for the first 'hbin' section, which is usually found at offset 0x1000 within the file, and starts with 'hbin'. In this case, I didn't find an 'hbin' section until I reached offset 0x10000 in the file...and the entire space up to that point was all zeros. I could tell right away that this wasn't good. Even worse, when I was finally able to locate a key node structure within the hive file, it wasn't the root node. In fairly short order, it was easy to see why all of the parsing tools had failed, as none had been able to discern a recognizable file structure.
Each hbin section is 4096 (0x1000) bytes in size, which means that if the first hbin section was located at offset 0x10000 within the file, I was missing over a dozen complete hbin sections. Not a great way to get started, eh? As I scrolled through the rest of the file quickly, I could see what looked like legitimate key and value nodes, but I could also see other sections of the file...large sections...that were full of zeros, as well as some that were full of binary stuff that made no sense whatsoever. In some cases, I could see sections that contained what appeared be Unicode strings, but there were no discernible structures surrounding those strings.
Some background about myself...I was "trained" as an electrical engineer; that is, my undergraduate studies were in EE. One of my professors would constantly say that "electrical engineers are inherently lazy", meaning that rather than making a complicated solution, or worse, making wild, unsupported assumptions about what we were looking at, electrical engineers would always seek the simplest solution. He even told us a story once about how a radar system used across the Air Force had been "fixed" by soldering a resistor in parallel to another resistor on a circuit board, rather than replace the entire circuit board. Seeking a simple solution, I thought it best to seek out some reference material for help. My reference for this work was/is Windows Registry Forensics; in this case, the lower half of pg 26 (in the soft cover edition).
By now I pretty much knew that I had my work cut out for me. I had a list of strings of interest, but just the strings. If I was going to make any sense of these strings at all, I'd have to know where they were located within the file. So, I ran MS/SysInternals strings.exe with the -o switch, so that I could get the offset of the where the string was located within the file. Once I had done that, I noted a couple of the strings of interest, and opened the hive file in a hex editor. I picked one of the strings...the output of strings gives me the offset in decimal, so I converted the offset to hexidecimal...and located it in the file. I did this with several of the strings, and in each case, my findings fell into one of three categories:
1. The string was a value name.
Pp. 29 - 31 of WRF cover the structure of a Registry value. The value node header is 20 bytes long, and starts with a 4-byte (DWORD) value that is the size of the overall structure itself (i.e., the header and the name). An example value is illustrated in figure 1.
|  | 
| Fig 1: Registry value structure | 
The remaining elements of the value header can be found in table 1.2 in WRF. The value types can be found in table 1.3.
2. The string was the name of a deleted value.
In several instances, I located the string in question and following the structure of a 'nearby' value, found that the string was indeed the value name. However, when a node (key, value) is deleted in a Registry hive file, the size value (first DWORD) is converted to a positive number.
Consider figure 1 again...the first DWORD is "D8 FF FF FF", which is equal to -40. Had the value been deleted, the DWORD would be "28 00 00 00".
3. The string was value data.
In some cases, I found strings at offsets where there was no 'nearby' value (vk) or key (nk) node. Instead, immediately before the string was what appeared to be a DWORD indicating a negative size value. In figure 2, the value is '88 FF FF FF'.
|  | 
| Fig 2: Value Data | 
The string illustrated in figure 2 is an excerpt of a value data entry that I extracted from one of my own systems, but it illustrates the point very well. In this case, the first DWORD translates to -120, indicating that the string value is 120 bytes in length. Again, figure 2 illustrates an excerpt of the Registry value data, not the entire string.
Just to be clear, I wasn't looking for all available strings, and I also did not look at all of the strings of interest. By this point, I had looked at about a dozen of so of the strings of interest, out of several dozen. I mention this because some of the strings of interest could have been key names, but at this point, none of the strings I'd looked at were, in fact, key names. I should mention that some of the strings appeared as Unicode strings in those sections of the file that I mentioned earlier in this post...while the string was clearly visible, and of interest in the context of the overall examination, I could not find any discernible structure (Registry key or value node, shell item, etc.) near or surrounding the string.
Now, the Windows Registry is described as a hierarchical database, but it's also something of a singly-linked list of structures. What I mean by this is that the root key node in a Registry hive file points to other keys (subkeys) and values. Subkeys point to other keys and values, and values only point to data. When I say "point to", I'm referring to the offset within the value or key header structure that tells us where the next element is located. This offset is not measured from the beginning of the file (from 0); rather, it starts at the beginning of the first hbin structure, which is (usually) at offset 0x1000. Let's say that you find an offset value within a value header structure that points to the data, and the offset is "D8 54 01 00". Translating endianness, we would look for that data structure at 0x0154D8 + 0x1000 within the hive file, or at offset 0x0164D8 from the beginning of the file.
Because key nodes point to other other key nodes and value nodes, and value nodes point to data, the Registry can be described as a singly-linked list. By contrast, active processes in memory are maintained as a doubly-linked list...each process points to the next process in the list, as well as the previous process in the list. In the Registry, a value does not point to it's parent key, nor do keys point to their parent key. This can make it difficult to reconstruct a damaged Registry hive file, particularly when strings of interest that may be pertinent to the investigation can be seen within the file.
In the instance I was looking at, just scrolling through the hive file, I could see that a good deal of it had been destroyed. In fact, it looked as if the virus had successfully overwritten entire sectors with zeros, and in other some cases, some of the sectors that made up the rest of the file I was sent did not even contain discernible structures. I knew that I couldn't start at the beginning of the file and reconstruct the hive file...too much was missing. One approach might have been to comb through the file and catalog all of the key and value node structures that I could locate (both allocated and unallocated), and then run consecutive scans to (a) locate associated value data, and (b) correlate the values to the appropriate keys.
What I did instead was pick out a couple of the more interesting strings, and locate them within the hive file. I had the offset to the string from the strings.exe output, so I went to that offset in the file, found the string, and then found the beginning of the structure, of which the string was a member. I noted the structure type (value, data) and recorded the offset for the structure. I then subtracted 0x1000 from the offset, reversed the endianness, and searched for the value in the hive file.
Here's an example...in one instance, I located a string that was a value name, and the value structure began at offset 0x164D8 within the file. Subtracting 0x1000, I got 0x154D8, and reversing endianness, I got "D8 54 01 00". I then searched for that hex string in UltraEdit. What that led to was a structure that maintained a list of offsets to values associated with a key. So, I repeated this process, using the location where this structure started. What that led me to discover was that the key had been obliterated from hive; it wasn't "deleted" in the sense that the first DWORD had been converted to a positive value...it simply no longer existed in the file.
As laborious as it is, this process can be used to gain some modicum of context and value for the investigator.
Process
When confronted with a hive file as badly damaged as the one I was looking at, there are basically two ways to go about collecting some modicum of information and context from the file. The first is to comb through the entire file, cataloging each discernible structure, as if you were putting puzzle pieces out on a table. Each structure would have to include not just the information it contained, but also the offset to where it was located within the file. Once the scanning process had been completed, the pieces could be assembled much like a puzzle, albeit with a lot of missing pieces.
The other way to go about this would be to do something like what I did...find a string of interest, locate it within the file, determine what type of structure it belonged to (if it was part of a structure...), and attempt to reconstruct the path based on knowledge of the structures. I opted for this approach because it gave me some answers quickly.
Take-away
My biggest take away from this exercise was that understanding the structure of what I was looking at allowed me to not only troubleshoot the issue and determine why the tools weren't working, but it also allowed me to provide some information, context, and insight regarding the data when those tools were not able to do so.
Final Thought
One of the comments to one of my previous posts on this topic included the following question:
Another question might be: With all the data structures out there, is it even possible to truly understand them all?
I would suggest that, no, it's not possible for any single analyst to understand all of the structures available on a Windows system. That's why none of us try to do so...instead, some of us document what we know, in books or by posting to a wiki, and then offer ourselves as resources. That way, if someone has a question, all they have to do is ask. So, if an analyst runs across something that they don't understand, they can continue to not understand it, or they can ask someone who appears to know something about the topic, and in fairly short order, get an understanding.
One of the comments to one of my previous posts on this topic included the following question:
Another question might be: With all the data structures out there, is it even possible to truly understand them all?
I would suggest that, no, it's not possible for any single analyst to understand all of the structures available on a Windows system. That's why none of us try to do so...instead, some of us document what we know, in books or by posting to a wiki, and then offer ourselves as resources. That way, if someone has a question, all they have to do is ask. So, if an analyst runs across something that they don't understand, they can continue to not understand it, or they can ask someone who appears to know something about the topic, and in fairly short order, get an understanding.
