Creating Tools and Solving Problems
I recently received a request via LinkedIn for a blog post on a specific topic; the request looked like this:
Have you blogged or written about creating your own tools, such as you do, from a beginner's standpoint? I am interested in learning more about how to do this.
A bit of follow-up revealed more of the background behind the request:
I teach a DFIR class at a local university and would like to incorporate this into the class.
I don't often get requests, and this one seemed kind of interesting to me anyway, so I thought I'd take a shot at it.
To begin with, I did write a blog post a while back that included a section entitled "Why do I write my own tools?" That post is just a bit more than three years old, and while the comments were brief, they still apply today.
The Why?
Why do I write my own tools? As I mentioned in my previous post, it helps me understand the data itself much better, and as a result, I also develop a much better understanding of the context and usage of that data.
Another reason to write my own tools is to manage the data in a manner that best suits my analysis needs. RegRipper started that way...well, that and the need for automation. Over time, I've continued to put a great deal of thought into my analysis process and why I do the things I do, and why I do them the way I do them. This is, in part, where my five-field TLN format came from, and it still holds up as an extremely successful methodology.
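For reference, a TLN entry is just five pipe-delimited fields...time, source, system, user, description...where the time value is a 32-bit Unix epoch time normalized to UTC. The sample values below are made up purely for illustration:

Time|Source|System|User|Description
1331145600|REG|HOSTNAME|username|Software\Microsoft\Windows\CurrentVersion\Run key LastWrite time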
The timeline creation and analysis methodology has proven to be extremely successful in testing, as well. For example, there was this blog post (not the first on the topic, and it won't be the last) that discusses the BAM key. Speaking with the author of that post recently, face-to-face (albeit through a translator), it was clear that he'd found some discrepancies in previously-posted findings regarding when the key is updated. So, someone is enthusiastically focusing their efforts on determining the nature of the key contents; as such, I've opted to focus my analysis on the context of the data with respect to other process execution data (AmCache.hve, UserAssist, AppCompatCache/ShimCache, etc.). In order to do this, I'd like to see all of the data sources normalized to a common format (TLN) so that I can look at them side-by-side, and the only way I'm going to do that is to write my own tools. In fact, I have...I have a number of RegRipper plugins that I can use to parse this information out into TLN format, add Windows Event Log data to the Registry data, and boo-yah! There it is.
Another advantage of writing my own tools is that I get to deal directly with the data itself, and in most cases, I don't have to go through an API call. This is how I ended up writing the Event Log/*.evt file parser, and from there, went on to write a carving tool to look for individual records. Microsoft has some clear and concise documentation of the various structures associated with EVT records, which makes writing tools for them pretty straightforward. Oh, and if you think that's not useful anymore, remember the NotPetya stuff last year (summer, 2017)? I used the tool I wrote to carve unallocated space for EVT records when a Win2003 server got hit. You never know when something like that is going to be useful.
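As an illustration of how approachable the EVT format is, here's a minimal sketch of that kind of carver in Perl. It scans a blob of raw data for the "LfLe" magic that marks EVENTLOGRECORD structures and unpacks a few of the header fields; the file handling and output format are purely illustrative, and a real carver would also validate the record length and process very large inputs in chunks rather than slurping them whole.

#!/usr/bin/perl
# Sketch: scan raw data (e.g., unallocated space) for EVT record signatures
use strict;
use warnings;

my $file = shift or die "Usage: $0 <raw data file>\n";
open(my $fh, "<", $file) or die "Cannot open $file: $!\n";
binmode($fh);
my $data = do { local $/; <$fh> };    # slurp; fine for a demo, not for huge images
close($fh);

my $pos = 0;
while (($pos = index($data, "LfLe", $pos)) > -1) {
    my $start = $pos - 4;             # the record length sits 4 bytes before the magic
    if ($start < 0 || $start + 26 > length($data)) { $pos += 4; next; }
    # EVENTLOGRECORD header: Length, signature, RecordNumber, TimeGenerated,
    # TimeWritten, EventID (all DWORDs), then EventType (WORD)
    my ($len, $magic, $rec_num, $time_gen, $time_wrt, $event_id, $type) =
        unpack("V6v", substr($data, $start, 26));
    printf "Offset 0x%08x  RecNum %-8u  EventID %-6u  Generated %s UTC\n",
        $start, $rec_num, ($event_id & 0xFFFF), scalar gmtime($time_gen);
    $pos += 4;
}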
The How
How do I write my own tools? That's a good question...is it more about the process itself, or the thought process behind writing a tool? Well, as I learned early in my military career, "it depends".
First, there are some basics you need to understand, of course...such as endianness. There's also learning how to recognize, parse, and translate binary data into something useful. I usually start out with a hex editor, and I've gotten to the point where I not only recognize 64-bit FILETIME time stamps in binary data, but specifically with respect to shellbags, I also recognize the patterns that turn out to be GUIDs. It's like the line from The Matrix..."all I see are blondes, brunettes, and redheads."
I start by understanding the structures within the data, either by following a programming specification (MS has a number of good ones) or some other format definition. Many times, I'll start with a hex editor, or a bit of code to dump arbitrary-length binary data to hex format, print it out, and go nuts with highlighters. For Registry stuff, I started by using the header files from Peter Nordahl-Hagen's offline NT password editing tool to understand the structure of the various cells within the Registry. When the Parse::Win32Registry Perl module came along, I used that for accessing the various cells, and was able to shift my focus to identifying patterns in the binary data types within values, as well as determining the context of the data through testing and timelining. For OLE files, I started with the MS-CFB definition, and like I said, MS maintains some really good info on Event Log structures.
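To give a sense of what I mean by "a bit of code to dump arbitrary-length binary data to hex format", here's a minimal sketch of that kind of helper; the sub name and column widths are arbitrary.

# Print binary data 16 bytes per line: offset, hex bytes, printable ASCII
sub probe {
    my $data = shift;
    foreach my $i (0 .. int((length($data) - 1) / 16)) {
        my $chunk = substr($data, $i * 16, 16);
        my $hex   = join(" ", map { sprintf "%02x", ord($_) } split(//, $chunk));
        (my $ascii = $chunk) =~ s/[^\x20-\x7e]/./g;   # non-printable bytes become dots
        printf "0x%08x  %-47s  %s\n", $i * 16, $hex, $ascii;
    }
}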
The upshot of this is that I have a better understanding than most regarding some of the various data types, particularly those that include or present time stamps. There are a lot of researchers who put effort into understanding the specific actions that cause an artifact to be created or modified, but I think it's also important to understand the time format itself. For example, FILETIME objects are 64-bit time stamps with a granularity of 100 nanoseconds, whereas a DOSDate time stamp (embedded within many shell item artifacts) has a granularity of 2 seconds. The 128-bit SYSTEMTIME structure, as it's most often seen in artifacts, effectively has a granularity of one second, similar to a Unix epoch time stamp.
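To make the difference concrete, here's a minimal sketch of the two conversions in Perl (the sub names are illustrative): a FILETIME becomes a Unix epoch time by reassembling the two little-endian 32-bit halves, scaling the 100-nanosecond ticks to seconds, and shifting the epoch from 1601 to 1970, while the DOSDate conversion shows exactly where the 2-second granularity comes from (the seconds field is stored divided by two).

# FILETIME (as two little-endian DWORDs) -> Unix epoch seconds
sub getTime {
    my ($lo, $hi) = @_;
    my $t = ($hi * 4294967296) + $lo;        # reassemble the 64-bit value
    $t = int($t / 10000000) - 11644473600;   # 100ns ticks -> sec; 1601 -> 1970 epoch
    $t = 0 if ($t < 0);
    return $t;
}

# DOSDate date/time WORD pair -> human-readable string (2-second resolution)
sub convertDOSDate {
    my ($date, $time) = @_;
    my $day   =   $date        & 0x1f;
    my $month =  ($date >> 5)  & 0x0f;
    my $year  = (($date >> 9)  & 0x7f) + 1980;
    my $sec   =  ($time        & 0x1f) * 2;  # stored as seconds / 2
    my $min   =  ($time >> 5)  & 0x3f;
    my $hour  =  ($time >> 11) & 0x1f;
    return sprintf "%04d-%02d-%02d %02d:%02d:%02d", $year, $month, $day, $hour, $min, $sec;
}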
In addition to understanding the time stamp formats, I've also found a good number of time stamps in places where most folks don't know they exist. For example, when analyzing Word .doc files, the 'directories' within the OLE structure have time stamps associated with them, and tying that information to other document metadata, to compile time stamps for executables found in the same campaign, etc., can help develop a much better understanding of the adversary.
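For reference, per the MS-CFB specification each 128-byte directory entry carries creation and modification FILETIME values at offsets 0x64 and 0x6C. Here's a minimal sketch of pulling those out of a single directory entry, relying on the getTime() conversion above; locating the directory stream in the first place (walking the compound file header and FAT) is left out for brevity.

# Extract the two FILETIMEs from one 128-byte OLE/CFB directory entry
sub oleDirEntryTimes {
    my $entry = shift;                       # exactly 128 bytes
    my ($c_lo, $c_hi, $m_lo, $m_hi) = unpack("V4", substr($entry, 0x64, 16));
    return (getTime($c_lo, $c_hi), getTime($m_lo, $m_hi));
}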
Something else that can be valuable, if you understand it, is the metadata available within LNK files sent by the adversary as an attachment. Normally, LNK files are created on a victim system as the result of an installation process, but when the adversary includes an LNK file as an attachment, you've got information about the adversary's system available to you, and all it takes to unlock that information is an understanding of the structure of LNK files, which are composed, in part, of shell items.
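The fixed 76-byte ShellLinkHeader at the front of an LNK file is a good place to start; per MS-SHLLINK, it carries the creation, access, and write FILETIMEs of the link target at offsets 0x1C, 0x24, and 0x2C, and the shell items and extra data blocks that follow carry even more. Here's a minimal sketch that pulls out just the header time stamps, again using the getTime() conversion above (the sub name is illustrative):

# Read the three target FILETIMEs from an LNK file's ShellLinkHeader
sub lnkHeaderTimes {
    my $file = shift;
    open(my $fh, "<", $file) or die "Cannot open $file: $!\n";
    binmode($fh);
    read($fh, my $hdr, 76);                  # fixed-size ShellLinkHeader
    close($fh);
    die "$file does not look like an LNK file\n" unless (unpack("V", $hdr) == 0x4c);
    my %times;
    my @names = ("created", "accessed", "modified");
    foreach my $i (0 .. 2) {
        my ($lo, $hi) = unpack("VV", substr($hdr, 0x1c + ($i * 8), 8));
        $times{$names[$i]} = getTime($lo, $hi);
    }
    return %times;    # these values reflect the target file on the system where the LNK was built
}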
Things To Consider
Don't get hung up on the programming language. I started teaching myself Perl a long time ago, in part to assist some network engineering guys. I later learned that, at the time, Perl was the only language with the capability to access the data I needed to access (i.e., live Windows systems), and for a while it remained just as unique when it came to "dead box" and file analysis. Over time, that changed as Python caught up. Now, you can use languages like Go to parse Windows Event Logs. Oh, and you can still do a lot with batch files. So, don't get hung up on which language is "best"; the simple answer is "the one that works for you", and everything else is just a distraction.
This isn't just about writing tools to get to the data so that I can perform analysis. One of the things I'm particular about is developing intelligence from the work I do...learning new things and incorporating, or "baking", them into my tools and processes. This is why I have the eventmap.txt file as part of my process for parsing Windows Event Logs (*.evtx files); when I see and learn something new (such as the TaskScheduler/706 event), I add it to the file with comments, and then I always have that information available. Further, sharing the file means that others can benefit from that knowledge without having to have had the same experiences themselves.
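I won't claim the layout below is a verbatim copy of my eventmap.txt, but the idea is simply a flat text file that maps an event source and ID to a short descriptive tag, with comments recording where each mapping came from, so that a parsing tool can carry the tag into its output. The entries shown are illustrative:

# Lines beginning with '#' are comments; format: source/eventID,tag
# Add new mappings (e.g., the TaskScheduler/706 event) as you run across them
Microsoft-Windows-Security-Auditing/4624,Successful logon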
Closing Thoughts
Now, is everyone going to write their own tools? No, and that's not the expectation at all. If everyone were writing their own tools, no one would ever get any actual work done. However, understanding data structures to the point of being able to write your own tools can really open up new vistas for the use of available data, and for the intelligence that can be developed from the analysis that we do.
That said, if this is something you're interested in, then once you're able to start recognizing patterns and matching those patterns up to structure definitions, there isn't much you can't do, as the skills are transferable. It doesn't matter where a file comes from...from which device or OS...you'll be able to parse it.