Timeline Analysis...do we need a standard?
Perhaps more appropriately, does anyone want a standard, specifically when it comes to the output format?
Almost a year ago, I came up with a 5-field output format for timeline data. I was looking for something to 'define' events, given the number of data sources on a system. I also needed to include the possibility of using data sources from other systems, outside of the system being examined, such as firewalls, IDS, routers, etc.
Events within a timeline can be concisely described using the following five fields:
Time - A timeline is based on times, so I put this field first, as the timeline is sorted on this field. Now, Windows systems have a number of time formats...32-bit Unix t_time format, 64-bit FILETIME objects, and the 128-bit SYSTEMTIME format. The FILETIME object has granularity to 100 nanoseconds, and the SYSTEMTIME structure has granularity to the millisecond...but is either really necessary? I opted to settle on the Unix t_time format, as the other times could be easily reduced to that format, without loosing significant granularity.
Source - This is the source from which the timeline data originates. For example, using TSK's fls.exe allows the analyst to compile file system metadata. If the analyst parses the MFT using MFTRipper or analyzeMFT, she still has file system metadata. The source remains the same, even though the method of obtaining the data may vary...and as such, should be documented in the analyst's case notes.
Sources can include Event Logs (EVT or EVTX), the Registry (REG), etc. I had thought about restricting this to 8-12 characters...again, the source of the data is independent of the extraction method.
Host - This is the host or name of the system from which the data originated. I included this field, as I considered being able to compile a single timeline using data from multiple systems, and even including network devices, such as firewalls, IDS, etc. This can be extremely helpful in pulling together a timeline for something like SQL injection, including logs from the web server, artifacts from the database server, and data from other systems that had been connected to.
Now, when including other systems, differences in clocks (offsets, skew, etc.) need to be taken into account and dealt with prior to entering the data into the timeline; again, this should be thoroughly documented in the analyst's case notes.
Host ID information can come in a variety of forms...MAC address, IP address, system/NETBios name, DNS name, etc. In a timeline, it's possible to create a legend with a key value or identifier, and have the timeline generation tools automatically translate all of the various identifiers to the key value.
This field can be set to a suitable length (25 characters?) to contain the longest identifier.
User - This is the user associated with the event. In many cases, this may be empty; for example, consider file system or Prefetch file metadata - neither is associated with a specific user. However, for Registry data extracted from the NTUSER.DAT or USRCLASS.DAT hives, the analyst will need to ensure that the user is identified, whereas this field is auto-populated by my tools that parse the Event Logs (.evt files).
Much like the Host field, users can be identified in a variety of means...SID, username, domain\username, email address, chat ID, etc. This field can also have a legend, allowing the analyst to convert all of the various values to a single key identifier.
Usually, a SID will be the longest method of referring to a user, and as such would likely be the maximum length for this field.
Description - This is something of a free-form, variable length field, including enough information to concisely describe the event in question. For Event Log records, I tend to include the event source and ID (so that it can be easily researched on EventID.net) , as well as the event message strings.
Now, for expansion, there may need to be a number of additional, perhaps optional fields. One is a means for grouping individual events into a super-event or a duration event, such as in the case of a search or AV scan. How this identifier is structured still needs to be defined; it can consist of an identifier in an additional column, or it may consist of some other structure.
Another possible optional field can be a notes field of some kind. For example, event records from EVT or EVTX files can be confusing; adding additional information from EventID.net or other credible sources may add context to the timeline, particularly if multiple examiners will be reviewing the final timeline data.
This format allows for flexibility in storing and processing timeline data. For example, I currently use flat ASCII text files for my timelines, as do others. Don has mentioned using Highlighter as a means for analyzing an ASCII text timeline. However, this does not obviate using a database rather than flat text files; in fact, as the amount of data grows and as visualization methods are developed for timelines, using a database may become the standard for storing timeline data.
It is my hope that keeping the structure of the timeline data simple and well-defined will assist in expanding the use of timeline creation and analysis. The structure defined in this post is independent of the raw data itself, as well as the means by which the data is extracted. Further, structure is independent of the storage means, be it a flat ASCII text file, a spreadsheet or a database. I hope that those of us performing timeline analysis can settle/agree upon a common structure for the data; from there, we can move on to visualization methods.
What are your thoughts?
Almost a year ago, I came up with a 5-field output format for timeline data. I was looking for something to 'define' events, given the number of data sources on a system. I also needed to include the possibility of using data sources from other systems, outside of the system being examined, such as firewalls, IDS, routers, etc.
Events within a timeline can be concisely described using the following five fields:
Time - A timeline is based on times, so I put this field first, as the timeline is sorted on this field. Now, Windows systems have a number of time formats...32-bit Unix t_time format, 64-bit FILETIME objects, and the 128-bit SYSTEMTIME format. The FILETIME object has granularity to 100 nanoseconds, and the SYSTEMTIME structure has granularity to the millisecond...but is either really necessary? I opted to settle on the Unix t_time format, as the other times could be easily reduced to that format, without loosing significant granularity.
Source - This is the source from which the timeline data originates. For example, using TSK's fls.exe allows the analyst to compile file system metadata. If the analyst parses the MFT using MFTRipper or analyzeMFT, she still has file system metadata. The source remains the same, even though the method of obtaining the data may vary...and as such, should be documented in the analyst's case notes.
Sources can include Event Logs (EVT or EVTX), the Registry (REG), etc. I had thought about restricting this to 8-12 characters...again, the source of the data is independent of the extraction method.
Host - This is the host or name of the system from which the data originated. I included this field, as I considered being able to compile a single timeline using data from multiple systems, and even including network devices, such as firewalls, IDS, etc. This can be extremely helpful in pulling together a timeline for something like SQL injection, including logs from the web server, artifacts from the database server, and data from other systems that had been connected to.
Now, when including other systems, differences in clocks (offsets, skew, etc.) need to be taken into account and dealt with prior to entering the data into the timeline; again, this should be thoroughly documented in the analyst's case notes.
Host ID information can come in a variety of forms...MAC address, IP address, system/NETBios name, DNS name, etc. In a timeline, it's possible to create a legend with a key value or identifier, and have the timeline generation tools automatically translate all of the various identifiers to the key value.
This field can be set to a suitable length (25 characters?) to contain the longest identifier.
User - This is the user associated with the event. In many cases, this may be empty; for example, consider file system or Prefetch file metadata - neither is associated with a specific user. However, for Registry data extracted from the NTUSER.DAT or USRCLASS.DAT hives, the analyst will need to ensure that the user is identified, whereas this field is auto-populated by my tools that parse the Event Logs (.evt files).
Much like the Host field, users can be identified in a variety of means...SID, username, domain\username, email address, chat ID, etc. This field can also have a legend, allowing the analyst to convert all of the various values to a single key identifier.
Usually, a SID will be the longest method of referring to a user, and as such would likely be the maximum length for this field.
Description - This is something of a free-form, variable length field, including enough information to concisely describe the event in question. For Event Log records, I tend to include the event source and ID (so that it can be easily researched on EventID.net) , as well as the event message strings.
Now, for expansion, there may need to be a number of additional, perhaps optional fields. One is a means for grouping individual events into a super-event or a duration event, such as in the case of a search or AV scan. How this identifier is structured still needs to be defined; it can consist of an identifier in an additional column, or it may consist of some other structure.
Another possible optional field can be a notes field of some kind. For example, event records from EVT or EVTX files can be confusing; adding additional information from EventID.net or other credible sources may add context to the timeline, particularly if multiple examiners will be reviewing the final timeline data.
This format allows for flexibility in storing and processing timeline data. For example, I currently use flat ASCII text files for my timelines, as do others. Don has mentioned using Highlighter as a means for analyzing an ASCII text timeline. However, this does not obviate using a database rather than flat text files; in fact, as the amount of data grows and as visualization methods are developed for timelines, using a database may become the standard for storing timeline data.
It is my hope that keeping the structure of the timeline data simple and well-defined will assist in expanding the use of timeline creation and analysis. The structure defined in this post is independent of the raw data itself, as well as the means by which the data is extracted. Further, structure is independent of the storage means, be it a flat ASCII text file, a spreadsheet or a database. I hope that those of us performing timeline analysis can settle/agree upon a common structure for the data; from there, we can move on to visualization methods.
What are your thoughts?