What It Looks Like: Disassembling A Malicious Document
I recently analyzed a malicious document by opening it on a virtual machine. This was intended to simulate a user opening the document, and the purpose was to determine and document artifacts associated with the system being infected. This dynamic analysis was based on the original analysis posted by Ronnie from PhishMe.com, using a copy of the document that Ronnie graciously provided.
After I had completed the previous analysis, I wanted to take a closer look at the document itself, so I disassembled the document into its component parts. After doing so, I looked around on the Internet to see if there was anything available that would let me take this analysis further. While I found tools that would help me with other document formats, I didn't find a great deal that would help me with this particular format. As such, I decided to share what I'd done and learned.
The first step was to open the file, but not via MS Word...we already know what happens if we do that. Even though the document ends with the ".doc" extension, a quick look at the document with a hex editor shows us that its format is that of the newer MS Office document format; i.e., compressed XML. As such, the first step is to open the file using a compression utility, such as 7Zip, as illustrated in figure 1.
Figure 1: Document open in 7Zip
As you can see in figure 1, we now have something of a file system-style listing that will allow us to traverse through the core contents of the document, without actually having to launch the file. The easiest way to do this is to simply extract the contents visible in 7Zip to the file system.
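For anyone who would rather script this step than click through 7Zip, here is a minimal Python sketch using only the standard library zipfile module. The function names (list_members, read_member, extract_all) are my own for illustration and are not part of any tool mentioned in this post. The archive is only read; nothing in the document is rendered or executed.

```python
import zipfile


def list_members(source):
    """Return (name, uncompressed size) pairs for a ZIP-based Office file.

    `source` may be a path or a binary file-like object.
    """
    if not zipfile.is_zipfile(source):
        # A legacy binary .doc is an OLE file, not a ZIP container.
        raise ValueError("not a ZIP container")
    with zipfile.ZipFile(source) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]


def read_member(source, name):
    """Return the raw bytes of one member, e.g. 'word/vbaProject.bin'."""
    with zipfile.ZipFile(source) as archive:
        return archive.read(name)


def extract_all(source, destination):
    """Write every member below `destination`, mirroring the 7Zip extract."""
    with zipfile.ZipFile(source) as archive:
        archive.extractall(destination)
```

Calling list_members on the path to the sample should produce the same kind of listing as figure 1, and read_member returns a single part as bytes without writing anything to disk.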
Many of the files contained in the extracted document contents are XML files, which can be easily viewed using viewers such as Notepad++. Figure 2 illustrates partial contents for the file "docProps/app.xml".
Figure 2: XML contents
Within the "word" folder, we see a number of files including vbaData.xml and vbaProject.bin. If you remember from the PhishMe.com blog post about the document, there was mention of the string 'vbaProject.bin', and the Yara rule at the end of the post included a reference to the string “word/_rels/vbaProject.bin”. Within the "word/_rels" folder, there are two files...vbaProject.bin.rels and document.xml.rels...both of which are XML-format files. These files describe object relationships within the overall document file, and of the two, document.xml.rels is perhaps the more interesting, as it contains references to image files (specifically, "media/image1.jpg" and "media/image2.jpg"). Locating those images, we can see that they're the actual blurred images that appear in the document, and that there are no other image files within the extracted file system. This supports our finding that clicking the "Enable Content" button in MS Word did nothing to make the blurred documents readable.
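The relationship files are small enough to read by eye, but listing their targets can also be scripted. The following is a rough sketch, assuming the .rels content has already been read into a string; the function names are hypothetical, and the namespace is the standard one used by Open Packaging Conventions relationship parts.

```python
import xml.etree.ElementTree as ET

# Namespace used by relationship parts (.rels files) in OOXML packages.
RELS_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def list_relationships(rels_xml):
    """Return (Id, short type, Target) for each Relationship element.

    The short type is the last path segment of the Type URI,
    for example 'image' or 'vbaProject'.
    """
    root = ET.fromstring(rels_xml)
    rows = []
    for rel in root.iter(RELS_NS + "Relationship"):
        rel_type = rel.get("Type", "")
        short_type = rel_type.rsplit("/", 1)[-1]
        rows.append((rel.get("Id"), short_type, rel.get("Target")))
    return rows


def image_targets(rels_xml):
    """Return only the targets whose relationship type is 'image'."""
    return [target for _rel_id, kind, target in list_relationships(rels_xml)
            if kind == "image"]
```

Run against word/_rels/document.xml.rels, image_targets should return the media paths referenced by the document, which can then be compared against what is actually present under the "word/media" folder.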
Opening the word/vbaProject.bin file in a hex editor, we can see from the 'magic number' that the file uses the structured storage, or OLE, file format. The 'magic number' is illustrated in figure 3.
Figure 3: vbaProject.bin file header
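The same check can be done without a hex editor. Here is a small sketch that compares the first bytes of a file against the well-known OLE signature (D0 CF 11 E0 A1 B1 1A E1) and the ZIP local file header signature ("PK" followed by 03 04). The function name identify_container is just an illustrative choice.

```python
# Signature at the start of every OLE (structured storage) file.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# Signature of a ZIP local file header, as found at the start of OOXML files.
ZIP_MAGIC = b"PK\x03\x04"


def identify_container(header):
    """Classify a file by its leading bytes: 'ole', 'zip', or 'unknown'."""
    if header.startswith(OLE_MAGIC):
        return "ole"
    if header.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"
```

Reading the first eight bytes of a file and passing them to identify_container is enough; for this sample I would expect "zip" for the document itself and "ole" for word/vbaProject.bin.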
Knowing the format of the file, we can use the MiTeC Structured Storage Viewer tool to open this file and view the contents (directories, streams), as illustrated in figure 4.
Figure 4: vbaProject
Figure 5 illustrates another view of the file contents, providing time stamp information from the "VBA" folder.
Figure 5: Time stamp information
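Structured storage directory entries record their creation and modification times as 64-bit FILETIME values, which count 100-nanosecond intervals since 1 January 1601 UTC. A viewer such as the one in figure 5 does the conversion for you; if you are working from raw values instead, a conversion along these lines should do it. The function name is my own.

```python
from datetime import datetime, timedelta, timezone

# FILETIME values count from this date, in 100-nanosecond steps.
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def filetime_to_datetime(filetime):
    """Convert a 64-bit FILETIME integer to a timezone-aware UTC datetime."""
    # Ten 100-ns intervals make one microsecond; sub-microsecond detail is dropped.
    return FILETIME_EPOCH + timedelta(microseconds=filetime // 10)
```

A value of zero usually means the time stamp was never set, so it is worth treating a result of 1 January 1601 as "no time stamp" rather than as a real date.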
Remember that the original PhishMe.com write-up regarding the file stated that the document had originally been seen on 11 Dec 2014. This information can be combined with other time stamp information in order to develop an "intel picture" around the infection itself. For example, according to VirusTotal, the malicious .exe file that was downloaded by this document was first seen by VT on 12 Dec 2014. The embedded PE compile time for the file is 19 June 1992. While time stamps embedded within the document itself, as well as the PE compile time for the 'msgss.exe' file, may be trivial to modify and obfuscate, looking at the overall wealth of information gives analysts a much better view of the file and its distribution than does viewing any single time stamp in isolation.
If we continue navigating through the structure of the document, and go to the VBA\ThisDocument stream (seen in figure 4), we will see references to the files (batch file, Visual Basic script, and PowerShell script) that were created within the file system on the infected system.
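The PE compile time mentioned above lives in the TimeDateStamp field of the COFF file header, stored as seconds since 1 January 1970 UTC. As a sketch of how that value can be pulled from a file, assuming the first few hundred bytes of the executable have been read into memory (the function name is hypothetical):

```python
import struct
from datetime import datetime, timezone


def pe_compile_time(data):
    """Return the COFF TimeDateStamp of a PE file as a UTC datetime.

    `data` is the leading bytes of the file, long enough to include
    the PE signature and the COFF header.
    """
    if data[:2] != b"MZ":
        raise ValueError("missing MZ header")
    # e_lfanew at offset 0x3C points to the 'PE\0\0' signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # COFF header: Machine (2 bytes), NumberOfSections (2), TimeDateStamp (4).
    (stamp,) = struct.unpack_from("<I", data, pe_offset + 8)
    return datetime.fromtimestamp(stamp, tz=timezone.utc)
```

As noted, this field is trivial to alter, so the result is one more data point for the overall picture rather than a reliable build date.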
Summary
My goal in this analysis was to see what else I could learn about this infection by disassembling the malicious document itself. My hope is that the process discussed in this post will serve as an initial roadmap for other analysts, and be extended in the future.
Tools Used
7Zip
Notepad++
Hex Editor (UltraEdit)
MiTeC Structured Storage Viewer
Resources
Lenny Zeltser's blog - Analyzing Malicious Documents Cheat Sheet
Virus Bulletin presentation (from 2009)
Kahu Security blog post - Dissecting a Malicious Word document
Document-Analyzer.net - upload documents for analysis
Python OLETools from Decalage
Trace Evidence Blog: Analyzing Weaponized RTF Documents
Addendum 6 Jan 2015 - Extracting the macro
I received a tip on Twitter from @JPoForenso to take a look at Didier Stevens' tools zipdump.py and oledump.py, as a means for extracting the macro from the malicious document. I first tried oledump.py by itself, and that didn't work, so I started looking around for some hints on how to use the tools together. I eventually found a tweet from Didier that had illustrated how to use these two tools together. From there, I was able to extract the macro from within the malicious file. Below are the steps I followed in sequence to achieve the goal of extracting the macro.
1. "C:\Python27>zipdump.py d:\tips\file.doc" gave me a listing of elements within the document itself. From here, I knew that I wanted to look at "word/vbaProject.bin".
2. "C:\Python27>zipdump.py -d d:\tips\file.doc word/vbaProject.bin" sent a bunch of binary data to the console. Okay, so far, so good.
3. "C:\Python27>zipdump.py -d d:\tips\file.doc word/vbaProject.bin | oledump.py" gave me some output that I could use, specifically:
1: 445 'PROJECT'
2: 41 'PROJECTwm'
3: M 20159 'VBA/ThisDocument'
4: 3432 'VBA/_VBA_PROJECT'
5: 515 'VBA/dir'
Now, I've got something I can use, based on what I'd read about here. At this point, I know that the third item contains a "sophisticated" macro.
4. "C:\Python27>zipdump.py -d d:\tips\file.doc word/vbaProject.bin | oledump.py -s 3 -v" dumps a bunch of stuff to the console, but it's readable. Redirecting this output to a file (i.e., " > vba.txt") lets me view the entire macro.
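The listing from step 3 is what tells you which stream number to pass to "-s" in step 4; the "M" in the second column is oledump.py's indicator for a stream containing macro code. If you are processing more than one document, picking that number out by hand gets tedious, so here is a rough sketch that parses listing text of the shape shown above. The regular expression and function names are my own and assume the output looks like the five lines in step 3; other versions of the tool may format it differently.

```python
import re

# Matches lines such as:  3: M 20159 'VBA/ThisDocument'
# Groups: stream index, optional one-letter indicator, size, stream name.
LINE_RE = re.compile(r"^\s*(\d+):\s+(?:([A-Za-z])\s+)?(\d+)\s+'(.+)'\s*$")


def parse_listing(text):
    """Turn oledump-style listing text into a list of dictionaries."""
    streams = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            index, flag, size, name = match.groups()
            streams.append({
                "index": int(index),
                "flag": flag or "",
                "size": int(size),
                "name": name,
            })
    return streams


def macro_streams(text):
    """Return only the streams flagged 'M' (macro code present)."""
    return [s for s in parse_listing(text) if s["flag"] == "M"]
```

For the listing in step 3, macro_streams returns a single entry with index 3 and name 'VBA/ThisDocument', which is the value used with "-s 3" in step 4.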
Addendum 14 Jan 2015 - More Extracting the Macro
Didier posted the following image to Twitter recently, illustrating the use of oledump.py: