MetaData and eDiscovery

Yesterday's CyberSpeak podcast touched on issues with Office document metadata and eDiscovery. Several commercially available tools came up, and I wanted to point out that there are freeware tools available as well.

First off, let me say that the tool I'll mention is one of my own...I'll be up front about that. It's a Perl module that I posted on CPAN, and it ships with a sample script called "". On Windows, if you're using ActiveState's ActivePerl, installation of the module is simple: download the archive and extract the file to \perl\site\lib\File. To install the modules that this module depends on, use the following commands:

ppm install OLE-Storage
ppm install Startup
ppm install Unicode-Map

The sample script pulls out the data in a crude format...the original script that I based this module on did a better job of presenting the information in a pretty format. As an example, I'll use the Blair document:

C:\Perl> d:\cd\blair.doc
File = d:\cd\blair.doc
Size = 65024 bytes
Magic = 0xa5ec (Word 8.0)
Version = 193
LangID = English (US)

Document was created on Windows.

Magic Created : MS Word 97
Magic Revised : MS Word 97

Last Author(s) Info
1 : cic22 : C:\DOCUME~1\phamill\LOCALS~1\Temp\AutoRecovery save of Iraq - securi
2 : cic22 : C:\DOCUME~1\phamill\LOCALS~1\Temp\AutoRecovery save of Iraq - securi
3 : cic22 : C:\DOCUME~1\phamill\LOCALS~1\Temp\AutoRecovery save of Iraq - securi
4 : JPratt : C:\TEMP\Iraq - security.doc
5 : JPratt : A:\Iraq - security.doc
6 : ablackshaw : C:\ABlackshaw\Iraq - security.doc
7 : ablackshaw : C:\ABlackshaw\A;Iraq - security.doc
8 : ablackshaw : A:\Iraq - security.doc
9 : MKhan : C:\TEMP\Iraq - security.doc
10 : MKhan : C:\WINNT\Profiles\mkhan\Desktop\Iraq.doc

Summary Information
Subject :
Authress : default
LastAuth : MKhan
RevNum : 4
AppName : Microsoft Word 8.0
Created : 03.02.2003, 09:31:00
Last Saved : 03.02.2003, 11:18:00
Last Printed : 30.01.2003, 21:33:00

Document Summary Information
Organization : default

Notice the bolded line above...this is extracted from the binary data of the file.
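
To give a feel for what "extracted from the binary data" means, here is a minimal sketch in Python (not the Perl module itself) that decodes the Magic, Version and LangID fields shown at the top of the output. It assumes you already have the raw bytes of the WordDocument stream, and that the stream starts with four little-endian 16-bit fields (magic, version, a reserved word, language ID) as described in Microsoft's published Word binary format documentation. The function name and the two lookup tables are my own, and the tables only cover the values seen in the output above.

```python
import struct

# Only the values that appear in the sample output are mapped here.
MAGIC_NAMES = {0xA5EC: "Word 8.0"}
LANG_NAMES = {0x0409: "English (US)"}


def parse_fib_header(stream):
    """Decode the first four 16-bit fields of a WordDocument stream."""
    if len(stream) < 8:
        raise ValueError("need at least 8 bytes of the WordDocument stream")
    # Assumed layout: magic, version, reserved word, language ID (little-endian).
    magic, version, _reserved, lang_id = struct.unpack_from("<HHHH", stream, 0)
    return {
        "magic": "0x%04x (%s)" % (magic, MAGIC_NAMES.get(magic, "unknown")),
        "version": version,
        "lang_id": LANG_NAMES.get(lang_id, "0x%04x" % lang_id),
    }
```

Anything not in the lookup tables is reported as "unknown" or as a raw hex value rather than guessed at.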

The module extracts the information; it just needs to be prettied up a bit. Another benefit of the module is that it extracts additional information from the OLE contents of the file. For one, it extracts information about the OLE "trash bins", where useful data could be hidden:

Trash Bin Size
BigBlocks 0
SystemSpace 940
SmallBlocks 0
FileEndSpace 1450

Also, the module collects information about the OLE streams within the file:

Stream : ☺CompObj
Stream : WordDocument
Stream : ♣DocumentSummaryInformation
Stream : ObjectPool
Stream : 1Table
Stream : ♣SummaryInformation
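
The ☺ and ♣ characters in that listing are just how the Windows console renders the bytes 0x01 and 0x05 that prefix some stream names. If you want to see where such a listing comes from, the following is a rough Python sketch (again, not the Perl module) that reads the directory of an OLE compound file directly. It assumes the layout in Microsoft's published compound file specification: a 512-byte header, a FAT reachable through the 109 sector slots in that header, and 128-byte directory entries. It ignores larger files that need extra DIFAT sectors, and it lists entries in on-disk order rather than walking the storage tree. All names in the code are my own.

```python
import struct

OLE_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")
MAX_REGULAR_SECTOR = 0xFFFFFFFA  # larger values are markers, not sector numbers
END_OF_CHAIN = 0xFFFFFFFE
ENTRY_TYPES = {1: "storage", 2: "stream", 5: "root"}


def list_ole_entries(data):
    """Return (name, type, size) for each directory entry in an OLE file."""
    if data[:8] != OLE_SIGNATURE:
        raise ValueError("not an OLE compound file")
    sector_shift, = struct.unpack_from("<H", data, 0x1E)
    sector_size = 1 << sector_shift
    first_dir_sector, = struct.unpack_from("<I", data, 0x30)

    def sector(n):
        # Sector 0 starts right after the header, which occupies one sector.
        start = (n + 1) * sector_size
        return data[start:start + sector_size]

    # Build the FAT from the sectors listed in the header (109 slots only).
    fat = []
    for fat_sector in struct.unpack_from("<109I", data, 0x4C):
        if fat_sector > MAX_REGULAR_SECTOR:
            continue  # unused slot
        fat.extend(struct.unpack("<%dI" % (sector_size // 4), sector(fat_sector)))

    # Follow the directory's sector chain and decode each 128-byte entry.
    entries, seen, current = [], set(), first_dir_sector
    while current <= MAX_REGULAR_SECTOR and current not in seen:
        seen.add(current)
        chunk = sector(current)
        for off in range(0, len(chunk) - 127, 128):
            raw_name, name_len, entry_type = struct.unpack_from("<64sHB", chunk, off)
            if entry_type not in ENTRY_TYPES:
                continue  # unallocated entry
            # name_len counts bytes and includes the 2-byte terminator.
            name = raw_name[:max(name_len - 2, 0)].decode("utf-16-le", "replace")
            size, = struct.unpack_from("<I", chunk, off + 120)  # low 32 bits
            entries.append((name, ENTRY_TYPES[entry_type], size))
        current = fat[current] if current < len(fat) else END_OF_CHAIN
    return entries
```

To use it, read the file in binary mode and pass the bytes in, e.g. `list_ole_entries(open(path, "rb").read())`, then keep the entries whose type is "stream".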

At this point, you're probably thinking, "" Well, there's a freeware utility available called MergeStreams that allows you to merge an Excel spreadsheet into a Word document. The resulting file is slightly smaller than the sum of both file sizes, and the file extension is ".doc". If you double-click the file, it will open in Word and all of the Word data will be visible. However, if you change the file extension to ".xls" and double-click the file, it will open in Excel, with none of the Word data/information visible. It's still there...it's just not being parsed by Excel.
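
This is where a list of stream names earns its keep. As a hedged sketch in Python: assuming a merged file keeps both a "WordDocument" stream and an Excel "Workbook" (or older "Book") stream in the one container, a check like the following would flag it regardless of the file extension. The function names are hypothetical, and the stream names are passed in as plain strings from whatever tool you used to list them.

```python
WORD_STREAMS = {"WordDocument"}
EXCEL_STREAMS = {"Workbook", "Book"}


def classify_streams(stream_names):
    """Return the application types suggested by a file's OLE stream names."""
    names = set(stream_names)
    kinds = []
    if names & WORD_STREAMS:
        kinds.append("word")
    if names & EXCEL_STREAMS:
        kinds.append("excel")
    return kinds


def looks_merged(stream_names):
    """True when one container holds both Word and Excel main streams."""
    return len(classify_streams(stream_names)) > 1
```

A hit is only a reason to look closer, not proof of anything: the check says nothing about how the streams got there.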

Why is this important? Well, if I wanted to smuggle information out of an organization, I might put the information in a spreadsheet for easy access and searching, then merge it into an innocuous Word document and copy it to my thumb drive (or laptop hard drive). If on the off chance anyone were to search me or my devices, they'd see the Word document. If they double-clicked it, they'd see the innocuous, boring content I'd put there...and wave me on my merry way. The same could be true for email attachments.

The example that I use that gets the LEOs sitting up in their seats is to take three illicit images and paste them into a Word document. Merge the document with an Excel spreadsheet that may be widely circulated throughout the forecasts, etc. Only those folks who know that the images are there will know to change the file extension to ".doc" so that they can view the images.

Interesting stuff. Like I said before, if you have a situation like the one mentioned in the podcast (i.e., you have to search a lot of files for specific metadata, such as the last author, or one of the last 10 authors), then something like the Perl module provides the necessary framework. Combine it with any of the usual ways to enumerate the files in question (read the contents of a directory, read the file list from a file, etc.) and Perl's regular expressions, and you can output to any format you like (HTML, XML, spreadsheet, database, text file, etc.).
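
The overall shape of that kind of batch search is simple enough to sketch. The version below is in Python rather than Perl, and it deliberately leaves the metadata extraction as a plug-in: `get_authors` is a hypothetical callable that you supply, taking a file path and returning the list of author names for that file (from the Perl module's output, or any other tool). Everything else (walking a directory, matching a regular expression, writing CSV) uses only the standard library.

```python
import csv
import os
import re


def find_documents(root, extensions=(".doc",)):
    """Yield paths under root whose names end in one of the extensions."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.lower().endswith(extensions):
                yield os.path.join(dirpath, name)


def search_authors(paths, get_authors, pattern):
    """Yield (path, author) for every author matching the regex pattern.

    get_authors is supplied by the caller: path -> list of author names.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    for path in paths:
        for author in get_authors(path):
            if regex.search(author):
                yield path, author


def write_report(rows, out):
    """Write the (path, author) hits to an open text file as CSV."""
    writer = csv.writer(out)
    writer.writerow(["path", "author"])
    writer.writerows(rows)
```

A run might look like `write_report(search_authors(find_documents(r"d:\cd"), get_authors, "mkhan"), sys.stdout)` after an `import sys`. Swapping `write_report` for an HTML or database writer changes the output format without touching the search.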