How to Search a PST file

Discovery Assistant is an eDiscovery application that can identify keyword search terms contained in many file types, including Microsoft Word files, Excel spreadsheets, plain text, web pages, and email messages, among others. In this article, we will examine searching email messages, and PST files in particular. Used properly, keyword search terms are a powerful way to identify relevant documents in a large set of unstructured documents and email folders.

Discovery Assistant Search includes the ability to search through compound nested emails, attachments, scanned documents (PDF), loose electronic documents, zip files, faxed images (TIFF) and metadata.

A matching document may also have parent, sibling, and child relationships to other documents, which together form its family group. It’s important that these family relationships are maintained during the document search and identification process.

Documents that match one or more search terms can be produced either as native documents or formatted as PDF or TIFF images, and exported for further review along with the extracted text and embedded metadata.

Preparing a PST file for Search:

To prepare a PST file for searching, perform the following tasks with Discovery Assistant:

  1. Load the PST file into Discovery Assistant.
    1. This will extract email and attachments from within a PST file.
    2. Additionally, it will extract zip contents and embedded objects (OLE).
  2. Identify and remove duplicate email messages.
  3. Perform Optical Character Recognition (OCR) on any scanned documents.
  4. Extract text from electronic documents.
  5. Index the extracted text.
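
As a rough illustration of the deduplication and indexing steps above, the following Python sketch fingerprints each message with a hash and then builds a simple word index. It is not Discovery Assistant's implementation; the message fields (`id`, `from`, `subject`, `body`) and the function names are assumptions made for the example.

```python
import hashlib
import re
from collections import defaultdict

def message_fingerprint(sender, subject, body):
    """Hash normalized fields so identical messages share a fingerprint."""
    normalized = "\n".join(part.strip().lower() for part in (sender, subject, body))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def deduplicate(messages):
    """Keep only the first message seen for each fingerprint."""
    seen = set()
    unique = []
    for msg in messages:
        fp = message_fingerprint(msg["from"], msg["subject"], msg["body"])
        if fp not in seen:
            seen.add(fp)
            unique.append(msg)
    return unique

def build_index(messages):
    """Map each lowercased word to the set of message ids containing it."""
    index = defaultdict(set)
    for msg in messages:
        for word in re.findall(r"\w+", msg["body"].lower()):
            index[word].add(msg["id"])
    return index

# Hypothetical sample data: message 2 duplicates message 1.
messages = [
    {"id": 1, "from": "a@example.com", "subject": "Fruit", "body": "An apple a day"},
    {"id": 2, "from": "a@example.com", "subject": "Fruit", "body": "An apple a day"},
    {"id": 3, "from": "b@example.com", "subject": "Fruit", "body": "A ripe pear"},
]
unique = deduplicate(messages)
index = build_index(unique)
```

Once such an index exists, looking up a word is a dictionary access, which is why searching after indexing is fast compared with scanning every document.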

Searchable electronic document types include PST, MSG, EML, Lotus NSF, Microsoft Office documents, PDF, HTML, TXT, scanned documents, and other common electronically stored file formats.

Searching a PST file for Keywords:

Discovery Assistant includes a powerful search engine that is capable of indexing and searching terabytes of data.

Once the contents of the PST file have been indexed, searching takes less than a second to return all matching documents.

Supported Features include:

  • Batch Searching (multiple keywords).
  • Compound Boolean search (AND / OR / NOT).
  • Proximity search (e.g. terms must occur within 2 words of each other).
  • Fuzzy search (similar to, but not exactly the same as, the search term).
  • Wildcard search (e.g. MULTI* matches all words that start with MULTI).
  • Metadata Search:
    • Limited to a certain date range.
    • Contents of Subject field.
    • Contents of To/From/CC/BCC field.
    • Many other metadata values extracted at load time.

Example Uses for Keyword Search:

  • Respond to a legal request for files that contain one or more search terms.
  • Respond to a freedom of information request on a specific topic.
  • Initial preparation of relevant documents for litigation review.
  • Context searching, near duplicate identification, email thread identification.
  • Data culling.
  • Pre-discovery review.
  • Search for privileged and responsive documents.
  • Evidence collection.

Example Keyword Search Requests:

Request                  Meaning
apple and pear           both words must be present
apple or pear            either word can be present
apple w/5 pear           apple must occur within 5 words of pear
apple not w/12 pear      apple must occur, but not within 12 words of pear
apple and not pear       apple must be present and pear must not
subject contains smith   the field subject must contain smith
apple w/5 xfirstword     apple must occur in the first five words
apple w/5 xlastword      apple must occur in the last five words
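
To make the Boolean and proximity requests above concrete, here is a minimal Python sketch of how they could be evaluated against a single document's text. It is an illustration only, not the product's search engine. The function names are hypothetical, and it assumes that "within n words" means the word positions differ by at most n.

```python
import re

def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"\w+", text.lower())

def positions(words, term):
    """Return the indexes at which term occurs in the word list."""
    return [i for i, w in enumerate(words) if w == term]

def both(text, a, b):
    """a and b: both words must be present."""
    words = tokenize(text)
    return a in words and b in words

def either(text, a, b):
    """a or b: either word can be present."""
    words = tokenize(text)
    return a in words or b in words

def and_not(text, a, b):
    """a and not b: a must be present and b must be absent."""
    words = tokenize(text)
    return a in words and b not in words

def within(text, a, b, n):
    """a w/n b: some occurrence of a lies within n words of some b."""
    words = tokenize(text)
    return any(abs(i - j) <= n
               for i in positions(words, a)
               for j in positions(words, b))

def not_within(text, a, b, n):
    """a not w/n b: a occurs at least once with no b within n words of it."""
    words = tokenize(text)
    pos_b = positions(words, b)
    return any(all(abs(i - j) > n for j in pos_b)
               for i in positions(words, a))

# Hypothetical document: "apple" and "pear" are exactly 5 words apart.
doc = "I ate an apple and then a ripe pear"
```

With this sample text, `within(doc, "apple", "pear", 5)` holds while `within(doc, "apple", "pear", 4)` does not, because the two words are exactly five positions apart.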

Search terms may include the following special characters:

Character    Meaning
?            Matches any single character
=            Matches any single digit
*            Matches any number of characters
%            Fuzzy search
#            Phonic search
~            Stemming
&            Synonym search
~~           Numeric range
##           Regular expression
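
The three wildcard characters in the table (?, = and *) can be illustrated by translating a search term into a regular expression. The Python sketch below is an assumption about how such matching could work, not the product's own code, and the helper names are hypothetical. The other special characters (fuzzy, phonic, stemming, and so on) are not covered.

```python
import re

# Wildcard characters from the table and their regex equivalents.
WILDCARDS = {
    "?": ".",     # any single character
    "=": r"\d",   # any single digit
    "*": ".*",    # any number of characters, including none
}

def wildcard_to_regex(term):
    """Translate a wildcard search term into a compiled, case-insensitive regex."""
    pattern = "".join(WILDCARDS.get(ch, re.escape(ch)) for ch in term)
    return re.compile(pattern, re.IGNORECASE)

def matches(term, word):
    """True if the whole word matches the wildcard term."""
    return wildcard_to_regex(term).fullmatch(word) is not None
```

For example, `matches("multi*", "MULTIPLE")` is true, while `matches("appl?", "apples")` is false because ? stands for exactly one character.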

Tech Specs:

Speed (on typical hardware):
Loading               1 GB per hour
OCR preparation       1 second per page
OCR                   1 second per page
Building the index    1 GB per hour
Searching             1 second per term
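
These figures can be combined into a rough processing-time estimate. The Python sketch below assumes the stages run one after another on a single machine, which is a simplification, and the function name and sample workload are hypothetical.

```python
SECONDS_PER_HOUR = 3600

def estimate_hours(data_gb, scanned_pages, search_terms):
    """Estimate hours per stage from the published speed figures."""
    load = data_gb / 1.0                              # 1 GB per hour
    ocr = scanned_pages * (1 + 1) / SECONDS_PER_HOUR  # 1 s prep + 1 s OCR per page
    index = data_gb / 1.0                             # 1 GB per hour
    search = search_terms * 1 / SECONDS_PER_HOUR      # 1 s per term
    return {
        "load": load,
        "ocr": ocr,
        "index": index,
        "search": search,
        "total": load + ocr + index + search,
    }

# Hypothetical workload: 10 GB of PST data, 3,600 scanned pages, 20 terms.
estimate = estimate_hours(10, 3600, 20)
```

For this workload, loading and indexing take about 10 hours each and OCR about 2 hours, so the total is a little over 22 hours, with the searching itself contributing only seconds.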

Scalability:
Ability to scale to multiple projects across multiple machines.

Support for Unicode, UTF-8, ANSI:
Text is converted to UTF-8 before its words are indexed. This way, Unicode, ANSI, and UTF-8 formatted files all return the same search results.
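
The effect of converting everything to UTF-8 before indexing can be shown with a short Python sketch. It assumes the source encoding of each file is already known, and it uses UTF-16 and Windows-1252 as stand-ins for "Unicode" and "ANSI" files; both choices, and the function names, are assumptions for the example.

```python
def to_utf8(raw, source_encoding):
    """Decode bytes from their source encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")

def index_words(utf8_bytes):
    """Return the set of lowercased words found in UTF-8 encoded text."""
    return set(utf8_bytes.decode("utf-8").lower().split())

# The same text stored in three different encodings.
text = "Café résumé"
samples = {
    "utf-16": text.encode("utf-16"),
    "cp1252": text.encode("cp1252"),
    "utf-8": text.encode("utf-8"),
}

# After normalization, every copy yields the same indexed words.
indexed = {enc: index_words(to_utf8(raw, enc)) for enc, raw in samples.items()}
```

Although the three byte sequences differ on disk, each one produces the same set of indexed words, so a search for "café" would match all three files.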