ImageMAKER Discovery Assistant Readme

ImageMAKER Discovery Assistant automates the process of preparing documentation for legal discovery. Supported features include document conversion to TIFF and PDF, Bates stamping, extraction of metadata, OCR text extraction, printing, and export to Summation and Concordance case management tools.

Product features include:
· Scalable to millions of conversions per day.
· Simple to install and use.
· Powerful, customizable feature set.
· Custom development and support available.
· Tools to manage the processing of terabytes of data.
· Support for extracting OLE embedded documents.
· Optional support for text-searchable PDF and color TIFF output.
· Simple-to-use database front end, capable of scaling to millions of documents across multiple machines.
· Support for most common document types, including Word, Excel, PowerPoint, PDF, HTML, TXT, JPEG, and RTF.
· Support for converting Microsoft Outlook MSG and PST files, and Lotus Notes email files.
· Flexible built-in Bates stamper that supports writing Bates labels to any of the four corners of the output TIFF file.
· Creates TIFF files, metadata, TXT output, and a conversion log for each file converted.
· Add-on upgrade to convert to scanned PDF.

Quick Overview:
Discovery Assistant can be downloaded from http://www.discoveryassistant.com/Download/Downloads.php

The basic idea is that the product can take any set of computer-generated documents, including email, Word documents, PDF files, spreadsheets, and/or scanned documents (from hard drives and/or CD-ROMs), and convert them to TIFF or PDF with associated metadata, suitable for import into case management software. The product can also directly output Bates-stamped TIFF and PDF for direct production of documents for legal discovery.
Discovery Assistant currently exports to the following case management systems:
· Summation DII - Class I, TIFF and text
· Concordance
· IPRO-Tech - images are loadable into IPRO
· Comma Separated Value (CSV) - currently only supports TAB output
· Ringtail - Ringtail Legal from FTI
· Introspect IDX (Zantaz) - IDX file format

Some distinguishing Discovery Assistant features include:
· conversion of all printable document types to TIFF or PDF.
· emphasis on conversion speed.
· support for removal of duplicates.
· extraction of TEXT and metadata.
· integrated OCR support for extracting text from scanned images.
· proper Date/Time stamped values for macros in Word and Excel.
· integrated Bates stamping.
· blank page removal.
· file management features, including a 'MOVE' capability.
· ability to auto preview the conversion results.
· specialized Excel formatting controls, including a 'fit to pages' feature.
· ability to identify document file types by content rather than file extension.
· upgrade to output in color (JPEG compressed TIFF).
· upgrade to output to PostScript (color PostScript) for conversion to PDF or direct printing.
· support utilities to handle the processing of terabytes of data.
· export to Concordance and Summation.
· export to CSV, Introspect, Ringtail, IPRO.

Our company focus is to provide a high quality, easy to use product. We provide quick turnaround on reported problems and offer custom development services in the event that custom features are required, ensuring that the Discovery Assistant product meets our customers' exact needs. We are the developers of the underlying core technology used in the Discovery Assistant product, and have been in business selling print drivers, viewers, and document conversion products since 1990.

Installation:
1. Take a quick look at the User Manual.
2. Install the Microsoft .NET Framework Version 1.1 Redistributable Package.
Install the Microsoft .NET Framework Version 2.0 Redistributable Package (x86) http://www.microsoft.com/downloads/details.aspx?familyid=0856EACB-4362-4B0D-8EDD-AAB15C5E04F5&displaylang=en
OR
Install the Microsoft .NET Framework Version 2.0 Redistributable Package (x64) http://www.microsoft.com/downloads/details.aspx?familyid=B44A0000-ACF8-4FA1-AFFB-40E78D788B00&displaylang=en
(still trying to determine if we need one or both).
3. Install Microsoft Outlook, and Lotus Notes (if required).
4. Install the latest Microsoft Office updates: http://office.microsoft.com/OfficeUpdate/default.aspx
5. Install the DiscoveryAssistant application (unzip XDCAssistant.zip and run Setup.exe).
6. Call your technical contact (Ken Davies) at ImageMAKER Development for a walkthrough: (604) 525-2108, Pacific Standard Time.

Quick Start Instructions:
1.
Download and install the .NET runtime version from Microsoft. Also, confirm you have the latest Office updates from Microsoft: http://office.microsoft.com/OfficeUpdate/default.aspx
Make sure you have installed the Office Tools \ Microsoft Office Document Imaging product (contains the OCR engine that Discovery Assistant uses). Also, if running Windows 2008, make sure you have downloaded and installed the Desktop Experience (Image Viewer for JPEG). [To install Desktop Experience, from the Start button, go to Administrative Tools, and click on Server Manager. In the Features Summary section of Server Manager, click Add Features. In the Add Features dialog, select the 'Desktop Experience' check box, and then click Next, and Install.]
2. Unzip the Discovery.zip file and run DiscoverySetup.exe. Setup automatically installs the print drivers, Discovery Assistant.exe, and the PST, MSG, Lotus Notes and ZIP crackers.
3. Select the All Files tab in Discovery Assistant, and select Scan. Use this interface to select the directory containing the files to be converted. After the scan has completed, you can sort the files based on filename, full path/filename, date, size, type, or whether each file is convertible or not.
4. Switch tabs to view Files that can be converted. Then use the display buttons to Queue 1, or Queue All files for conversion.
5. Switch tabs again to Queued for conversion, and again, from the buttons choose to convert one, or convert all files. Current suggestion is to first choose only one file to convert.
6.
As the files are converted the first time, one or two dialogs may pop up. If we don't properly time out and shut down the problem application, you can close these dialogs yourself. There is a second application (XDCAdmin.exe) that can be used to auto-close these dialogs. Details on how to train XDCAdmin to auto-close, auto-shutdown, and
auto-kill follow in the detailed notes. 7. Once you have some success, and have identified what problems (if any) there are with the test conversions, set things up to do a full conversion of all queued files. 8. Switch tabs again to Converted and Failed. You can look at the resultant files using the interface provided. Use this interface to investigate any errors that might have come up. We can be useful at this phase to identify what fixes are necessary. 9. After all conversions are done, you can perform the following operations from the Converted Files tab: · Deblank the output files - remove blank pages from selected files. · Assign Bates Numbers. · Assign Document ID's. · OCR any image files (scanned PDF files). · Bates Stamp the resultant TIFF files. 10. Once you are done with conversions, you can export the conversion results to Concordance / Summation / IPRO / or a comma separated value file (CSV file) for inclusion in another database, spreadsheet or document management format. If converting XLS files, we recommend choosing between the following settings: 11.
Under Admin / Configure /
Excel Settings, users have the choice to force output to a specified number of
pages (print to fit), or to print at full size. In either case the complete
spreadsheet is rendered, not just the last saved print range. Default is to print the entire spreadsheet at
full size. 12. If you are planning on converting PST or MSG files, there may be an issue with the Outlook security dialog. We provide a tool to automatically close these dialogs, but if there is still a problem, The Outlook Security Dialog can be turned off permanently by opening Outlook 2007, and choosing: Tools / Trust Center / Programmatic Access / "Never warn me about suspicious activity". 13. Conversion will go quicker if you have the Outlook client open. This reduces the time we take opening and closing Outlook. 14.
There is tremendous variety in PST/MSG files. If you do have a conversion failure, we can likely comment on (and fix) the problem if you send us the log files. Log files can be generated by running
imgLOG.exe before loading files to be converted. To remove date headers and footers from MSG, TXT, HTML: 15.
MSG files are formatted as
TXT, RTF, or HTML. For Notepad and Internet Explorer, under the File/PageSetup dialog, there are header/footer strings. To remove headers/footers from HTML and MSG that are rendered as HTML do the following: 1. Open Internet Explorer 2. Go to File/PageSetup 3. Delete the strings in the Header and Footer edit boxes 4. Click 'ok' 5. Exit Internet Explorer Can remove headers/footers from TXT rendered images the same way as Internet Explorer. To Get Internet Header extraction working: 16. First make sure that Outlook CDO (Collaboration Data Objects) is installed. Pre Office 2007: CDO comes standard with your Office installation, but must be selected ON to be installed. Office 2007: you must download
and install a separate add-on from Microsoft:
Collaboration Data Objects, Version 1.2.1 To check if CDO is already installed, Look in the registry for: HKEY_CLASSES_ROOT\CDO.Message And look in your system folder for: C:\WINDOWS\system32\cdosys.dll To install CDO (Collaboration Data Objects)... 1. Get the Office installation disk. 2. Run Office Installation. As you already have Office installed, choose 'update'. 3. Select advanced... 4. Expand the Outlook distribution list, look for and enable the Collaboration Data Objects' value. Default is to change the 'x' to read 'my computer'. 5. Select OK. 6. Installation should ADD CDO, and not remove any other Office applications. You must also turn 'Internet
Headers' on from the Project Options / Outlook tab Internet headers are extracted to the .MTF file (Metadata File) during conversion. You can open the Metadata file in the Converted tab to review the contents. Internet headers are appended to the bottom of the metadata file. Headers will either be marked as: Note: Only messages that have been received have Internet Headers. Messages that have been sent (and not received) have no headers. To View the Internet Headers in Outlook, open the source MSG file (use the Open Source button in Discovery Assistant), then select 'View Options'. To export Internet Headers, be sure to select the INETHEADER field on (near or at bottom of the list). To convert WordPerfect Office files (WPD and WB3): 17. Quatro Pro, and Word Perfect require some simple setup before doing any conversion of those file formats. Fix is to open the application, and load a sample file (can create a simple file if you want to). Next, make sure that the default printer is the ImageMaker XDC Service1 driver, do a print, and then do a file save. Leave these applications open, but minimized, or reduce the window footprint to a smaller portion of the screen. See comments further on down about using Word to handle WordPerfect files. To get Lotus Notes support working: 18. At startup, Discovery Assistant looks for Notes.exe in the \program files\Lotus\Notes\ directory. It then forces a 'path' change to the local logged in user to make sure that the lotus directory is part of the system path in order for the dll's to work. (No idea why Lotus does not do this as part of its own installation). If you've just recently installed Lotus Notes, then you need to re-install Discovery Assistant to set the path information. Support for OCR: 19. To enable Discovery Assistant to use the Microsoft Office 2003 OCR engine (recommended), first confirm that OCR is working by running the Microsoft Office Tools / Microsoft Office Document Imaging product. 
Open a TIFF file, and then choose OCR to confirm the OCR engine is working. Support for Search: 20. Full text search can be done either before conversion, or after conversion. See notes on dtSearch (end of this file). Requires that you separately download dtSearch from the dtSearch website. Support for Distributed Processing: 21. Discovery Assistant allows multiple machines to be controlled from a master machine to provide faster throughput for job conversions. To set up Distributed Processing, see notes near the end of the file entitled: Setting up Distributed Processing". Support for Searchable PDF: 22. To get searchable PDF working, you need to do the following: 1. Install a Postscript print driver onto your machine. Recommendation is to install an HP LaserJet PS, or Apple LaserWriter (either color or B&W). Can do this by doing an Add Printer from the Printer's dialog. 2. Download and install the Discovery Assistant Postscript update from http://www.discoveryassistant.com/Download/Downloads.php. 3. When installing the postscript update, you will be prompted to download and install GhostScript and GhostView - two open source products that convert Postscript to PDF. 4. Re-start Discovery Assistant and re-queue files for conversion. When converting, choose 'Postscript' as the output file type. 5. review, bates stamp, and export as Searchable PDF. Note: if the input file is a scanned image, the output file will also come out as a scanned image. The only way to get text out of a scanned image is to convert to TIFF, then use Discovery Assistant to OCR it. Switching from Demo to Release: 23. If you have processed documents in DEMO mode, and have now licensed the product, and want to export files, you need to: 1. From the Project menu item, select 'remove temp files'. 2. Re-queue the converted files, and re-convert. This removes the demo stamp. Avoiding memory problems: 24) Make sure the following aren't running: · Google Desktop. 
Turn this off as it consumes vast amounts of CPU. · Microsoft Office tablet service (WISPTIS.EXE). Turn this off as it eats memory like crazy on every file open command. (CiceroUIWndFrame message crash) Supported File Types (quick overview):Discovery Assistant supports file formats based on file extension. To check for an associated application for any given extension, you can: 1. Try to open the file by double clicking on the file icon 2. Try printing the file by dragging the file over onto a printer icon. Discovery Assistant also lists associated files: 1. Use the DA_Sysinfo application to list supported file types. 2. Use the Discovery Assistant / Admin / Configure / Document types to modify supported file types. If you still can't figure out the owner application: Check file extension at http://filext.com. If you want to add support for a new file type, first ensure that the appropriate application is installed. As long as that application registers a 'PrintTo' or 'Print' file association, we should be able to convert the file content to TIFF and TXT. Here are the steps to take to add a new file association: 1. First thing is to check if file extension type has a 'print' or 'printto' association. Can do this by right clicking on the file, and seeing if thereis a 'print' menu item. You can also try dragging the file from Windows Explorer onto a printer icon, and seeing if it prints. Discovery Assistant lists all print and printto associations in the output generated by DA_SysInfo (installed in the ImageMAKER Discovery Assistant program group). We use the registered file associations first before looking for other ways to print. Associations are normally registered as command line strings. You can interactively review and modify review file associations by opening Windows Explorer, and choosing Tools / Folder Options / File Types. 2. Sometimes the owner application supports printing from the command line, but doesn't properly register that fact. 
Applications may require you to activate a 'register' button before it sets the file associations. For example, Internet Explorer requires you to select Internet Options /Programs / Reset Web Settings. 3. If there is no registered application, and you don't know what application opens the file type, then you can search the file type extension database: http://filext.com for the proper application. Acquire the application, and register the file associations (step 1 or 2). 4. If there is no file association for the file type, but you know of an application that supports this file type, then there are a number of things you can do: (all of which are somewhat messy - but permanent). · from Windows Explorer, manually add the file type. · from Discovery Assistant / Admin / Documents, add the file type · run "DA_Sysinfo xyz-txt" where 'xyz' is the new type, and 'txt' is the equivalent file type. Then, stop, and re-start Discovery Assistant, and do a re-check on that file type. If you want a one time solution: · from the Discovery Assistant non-convertible tab, use Assign Type. 5. In some cases, we do custom development to support the file type in question. Custom file types that we've written converters for include zip, pst, msg, eml, doc, xls, ppt, and pdf. 6.
If the application supports Open, but does not have a command line Print capability, you might still be able to get things working using a macro recorder.

Incorrect Document Extensions:
Discovery Assistant will also detect and handle files named with an incorrect extension. For example, if a Word document has the extension ".BAK", Discovery Assistant will detect this and treat the file as ".DOC". File types that we can identify using binary contents are:
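Content-based identification of this kind is commonly done by checking a file's leading "magic" bytes. The following is a minimal sketch in Python, assuming a handful of widely documented signatures. The table and the names `SIGNATURES`, `sniff_type` and `sniff_file` are illustrative only and are not Discovery Assistant's actual detection list.

```python
from typing import Optional

# Illustrative signatures only (assumption): not the product's real table.
SIGNATURES = [
    (b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1", "ole2"),  # OLE2 container: legacy DOC/XLS/PPT/MSG
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),      # ZIP container, also used by DOCX/XLSX/PPTX
    (b"{\\rtf", "rtf"),
    (b"\xFF\xD8\xFF", "jpeg"),
    (b"II*\x00", "tiff"),        # little-endian TIFF
    (b"MM\x00*", "tiff"),        # big-endian TIFF
]


def sniff_type(header: bytes) -> Optional[str]:
    """Return a coarse type name for the given leading bytes, or None."""
    for magic, kind in SIGNATURES:
        if header.startswith(magic):
            return kind
    return None


def sniff_file(path: str) -> Optional[str]:
    """Read the first few bytes of a file and classify them."""
    with open(path, "rb") as fh:
        return sniff_type(fh.read(16))
```

Note that a container signature such as OLE2 only says the file is some legacy Office or MSG document. Telling a Word file from an Excel file requires looking inside the container, which this sketch does not attempt.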
Office12 / Office 2007 support
The URL for the Office 2007 compatibility pack is http://www.microsoft.com/downloads/details.aspx?FamilyId=941b3470-3ae9-4aee-8f43-c6bb74cd1466&displaylang=en. If you add in the compatibility pack, you should be able to open and process Office 12 documents (PPTX, DOCX, XLSX) while running on an Office 2003 or Office 2000 machine.

MSG and PST handling.
Discovery Assistant excels at handling MSG and PST file formats. Some things to do to ensure your system is running efficiently:
1. The Outlook Security Dialog can be turned off permanently by opening Outlook 2007, and choosing: Tools / Trust Center / Programmatic Access / "Never warn me about suspicious activity".
2. If you are having any trouble scanning a PST file, there is a Microsoft validation tool that can be used to repair PST files:
   1. Exit Outlook if it is running.
   2. Double-click Scanpst.exe, located at drive:\Program Files\Microsoft Office\OFFICE12.
   3. In the "Enter the name of the file you want to scan" box, enter the name of the .pst or .ost file that you want to check, or click Browse to search for the file.
   4. To specify the scan log options, click Options, and then click the option that you want.
   5. Click Start.
3. If you've installed Office XP, be sure to disable the Speech and Handwriting Recognition software, as this eats system resources. To disable, go to:
   1. "Control Panel"
   2. "Add/Remove Programs"
   3. "Microsoft Office," click on the "Change" button
   4. Browse to "Office Shared Features," "Alternative User Input," and select "Not available" from the drop-down box for both Speech and Handwriting Recognition.

Very neat trick: To extract multiple MSG files from a PST file for testing, debugging, message ordering etc...
1. Open the Outlook mailbox.
2. Sort the messages.
3. Select multiple messages, and then from the Outlook menu, choose 'copy'.
4. Open Windows Explorer and create a new directory.
5. Select 'paste'.
6.
The files are written to the output directory in the same order as they are listed in Outlook. 7. To then convert these files, 'drag' the msg files from the Explorer interface into the 'All Files' tab of Discovery Assistant. Message order will be maintained. Handling Outlook Security issues.The Outlook Security Dialog can be turned off permanently by opening Outlook 2007, and choosing: Tools / Trust Center / Programmatic Access / "Never warn me about suspicious activity". More details at: http://msdn2.microsoft.com/en-us/library/bb226709.aspx - "Code Security Changes in Outlook 2007 - MSDN Library / Office Development / 2007 Microsoft Office System / Outlook 2007 / Technical Articles". Support for De-DuplicationMany file sets contain multiple copies of the same file. The de-duplication feature is designed to spot these duplicate files, and ensure that only one copy is converted. Duplicates are identified by a unique Hash Value, that is calculated for every file, message, and attachment at time of import. For message files, the hash value is based on the 'text' content of the email message, not the binary contents of the MSG file. The MSG binary file may contains additional unique information related to how it's stored in the PST file. The binary file will also contain all the binary attachments. (ie we don't hash the MSG file, but the extracted TEXT portion only). If two files have the same hash value, then we do a binary comparison just to be sure the files are both equal (one last final check). If the files do not compare, then the hash value is modified to include an extension. Every file in the project is marked 'true' if there is a duplicate. (see 'Local Duplicate' column in AllFiles). At any time before conversion, you can also link projects to a 'global' project, and can identify global duplicates this way. One advantage of 'global' deduplication is it will differentiate between 'primary', and 'secondary' duplicate. 
(Local deduplication will flag primary and duplicate as both being duplicates).

The de-duping feature is controlled from the Options / De-duping tab. Settings are as follows:
· skip local duplicates when converting
· skip global duplicates when converting
· don't skip children unless parent is skipped
· if duplicate is NOT skipped, then copy output files rather than converting.
Normal default is to enable the top three choices. The fourth choice is OFF, and skipped (duplicate) files are not copied to the converted directory.

Explanation of settings:
· Skip Duplicates means that if it is a duplicate, we don't process any further.
· Skip Global Duplicates means that if it is a global duplicate, we don't process any further.
· Don't skip children unless parent skipped means that the parent and all other children must also be duplicates before we bother skipping that whole email chain.
· Copy Duplicates means that we copy the resultant TIFF files from a previously converted copy. This saves the time of converting again, but does not save on drive space.

User Example:
1. User chooses a list of files to convert. Discovery Assistant loads the list into memory, creating a unique hash code for each file scanned. As files are added, they are compared to the list of existing hash codes already generated. If there is a hash-code match, then both the source and the potential duplicate are binary compared (ensuring an exact match).
2. At time of conversion, if the file is a duplicate, and has already been converted, then we ignore (and duplicate again), skip (don't convert), or 'copy' over the duplicated TIFF file rather than do the conversion again.
3. User selects a range of files to Export to one of the common formats. If the selected list contains a duplicate, and the de-duplicate setting is set to 'linked', then we create an entry for the file in the output list, but point backwards in the list to the TIFF and META data of the original file.
entry 1, tiff file 1, meta file 1, text file 1
entry 2, tiff file 2, meta file 2, text file 2
entry 3, tiff file 1, meta file 1, text file 1 <----- duplicate of entry 1
entry 4, tiff file 4, meta file 4, text file 4

The XML files keep track of which files are skipped at time of conversion (these are marked as skipped instead of converted). The XML file can be exported as an MDB or XLS file for documentation purposes.

Support for De-Blanking - removal of blank pages.
Discovery Assistant defaults to printing the entire spreadsheet, not just the last defined range. When printing the entire spreadsheet, it is possible that blank pages will be produced. After conversion, from the Converted tab, users can select 'deBlank' to remove blank pages.
Deblanking goes through each page and looks for black bits in a 10x10 cell grid. If there are more than 200 black bits in any cell, then the page is not blank. If blank pages are discovered, Discovery Assistant updates the metadata to indicate which pages have been removed, and creates a 'cleaned' and a 'removed' output file. The user can then look at the 'cleaned' file and the 'removed pages' file to confirm that we've not made any mistakes. Cleaned + removed = total.

Bates Stamping
To ensure that the Bates stamp does not obscure any important information:
1. Confirm that the ImageMAKER XDC Service1 printer has the proper unprintable region margins set. In most cases it doesn't make much difference, as most business documents do not print right to the edge. However, if you are converting image files (TIFF / fax / JPEG pictures / etc.) the converting application may print right to the border edge. To check/change the printer borders, go to the printers dialog, and select the print properties for the ImageMAKER XDC Service1 printer. In the Device Settings tab, look for and set the unprintable regions. Recommend a border of .25 inches.
It may be useful to set the top margin to 0, and the bottom margin to .5 in order to get more room for the Bates stamp.
2. Convert from the 'Queued' directory tab, to the 'Converted' directory tab. Confirm that the output images have a white space border.
3. Set up the Bates stamp. Margins are defined in the setup area. Bates stamp the output.
4. Review the resulting files.
5. To print the resulting TIFF images to a hard copy printer, you have two choices:
   1. Print to edges
   2. Scale to fit the printable region.
If you are looking to get the Bates labels as tight to the outside printable region as possible, then you can set the imgview.exe application (what we use to print TIFF files) to 'print to edge'. To do this, open one of the TIFF files in imgview.exe by double clicking on the thumbnail image in Discovery Assistant. Then choose menu / Options / Print to edge, and close the imgview application.

Bates stamping images with no image scaling or compression:
If you are looking to Bates stamp TIFF images without any additional scaling, then the conversion from 'queued' to 'converted' must be run through our imgview.exe application AND the setting 'print to edge' must be set on. To make the imgview.exe application the default TIFF print application, from the same imgview.exe options menu, select 'set as default viewer'. Then stop and re-start Discovery Assistant to pick up the new file association. You can confirm what the current default viewer is by doing a 'view source'.

Advanced Excel Spreadsheet Processing:
Under the Admin / Configure / Excel tab, the user can set up the following preferences:
· Orientation: Default / portrait / landscape
· Scale: Default / Fit-To (pages wide / pages high) / Zoom To (% of normal size)
· Show Comments: Default / None / at end of sheet / as displayed on sheet
· Page Order: Default / Down then over / Over then down
· Print Quality: Default / 200/300/400 dpi.
· Paper Size: Default / Standard paper sizes.
· Turn headers/footers off.
Print just the last saved print range, or the whole spreadsheet. Set all worksheets to active before converting Clear print area before converting (print all cells) Scale: To limit the number of pages when printed, suggest setting the default size to Fit to 1 page wide, 10 pages high. Special case printing can then be done based on the thumbnail output images produced. If you have wide varieties of Excel spreadsheets, some with lots of pages, others with only a few pages, our recommendation is to print excel at less than 100% size. Things still look very good at 75% scaling. Can easily go as low as 50% scaling. This reduces the number of pages, and gives you a better chance that you get more meaningful information on each page. Print Area: Default is to print entire spreadsheet, not just the print area. When printing all cells, need to look for and remove blank pages afterwards. Set all Worksheets to Active: Default is to set all sheets to print. Otherwise, print only the active sheets. Disable Macros, Re-calculate: Macros and auto-recalculate are disabled. Currently in testing: Ability to unhide cells, columns, rows, extract formulas as part of the MetaData, set column width. Support for proper DATE and TIME settings in Word and Excel.Word and Excel contain macros and functions that return the 'current' date and time. The expectation is that these date/time values are properly set when the user creates, prints, or saves a work document. When submitting a document into discovery, the date/time printed in the TIFF image has to match the date/time the document was last accessed. Discovery Assistant solution to the date/time problem is to set the system date/time to the document's last saved date/time before doing the conversion. To enable this feature, go to the Admin/Configure screen. There you will see the following: IMPORTANT: Some document headers and footers will render the current date and time. 
If you need this to reflect the LastWrite time of the file being converted, check the box below. Warning: This option may have unpredictable effects on the system and other applications X Reset System Time to file LastWrite Time before conversion. The only 'unpredictable effect' we can currently think of is that the Discovery Assistant application is 'killed' during conversion, and does not re-set the system date/time back to current. This can easily be solved by going into the Control Panel Date/Time applet, and re-setting the system time. The visible indication that we are changing the system time can be seen when the computer time value changes in the bottom right hand corner of the computer's task bar. We always make sure to set the time back to the exact correct value by keeping track of (and accounting for) the elapsed time since changing the system clock value. DataBase Format for the Discovery Assistant project:Quick background on our database structure is as follows: Discovery Assistant uses XML as the data storage format. Records are read into memory, manipulated in memory, then saved every 100 or so conversions, or when the user closes the file. Otherwise, all database activity is done in-memory, using .NET controls. The advantage of running the database completely in memory is speed. Things that traditionally take a long time using a transaction based database run 1000's of times faster in a 'memory loaded' database. Traditional time consuming activities include: · Generating and re-viewing different data views of the same data set. · Changing a status value for each record in the database (queued status, bates number, document ID) · Operations that add 1000's of records at a time.
Other advantages of the XML format are:
1. Universal format that can be converted to any other format with a wide variety of available tools.
2. .NET controls work with XML natively.
3. Very compact way of storing variable length data.
4. Can be read/searched by humans using a simple text editor.
5. Can be repaired if corrupted by an external process/activity (like a power failure during a file save).
6. Can be manually edited in a text editor if users want to do a general search and replace.
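The load-into-memory, save-every-100-or-so pattern described above can be sketched as follows. This is only an illustration: it uses Python's `xml.etree.ElementTree` rather than the .NET controls the product uses, and the class name, the element and attribute names, and the `save_every` parameter are all hypothetical.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


class InMemoryProject:
    """Keeps all records in memory and writes the XML file only periodically."""

    def __init__(self, path, save_every=100):
        self.path = path
        self.save_every = save_every
        self._unsaved = 0
        if os.path.exists(path):
            self.root = ET.parse(path).getroot()
        else:
            self.root = ET.Element("project")

    def add_record(self, name, status="queued"):
        # Adding a record touches memory only; nothing is written to disk here.
        return ET.SubElement(self.root, "file", {"name": name, "status": status})

    def mark_converted(self, name):
        for rec in self.root.iter("file"):
            if rec.get("name") == name:
                rec.set("status", "converted")
                self._unsaved += 1
                break
        if self._unsaved >= self.save_every:
            self.save()

    def save(self):
        # Write to a temporary file, then swap it in, so an interrupted
        # save is less likely to leave a truncated project file behind.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
        os.close(fd)
        ET.ElementTree(self.root).write(tmp, encoding="utf-8", xml_declaration=True)
        os.replace(tmp, self.path)
        self._unsaved = 0
```

Anything changed since the last `save()` exists only in memory, so a crash between saves loses that work.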
The disadvantages of having the data stored in memory are: · takes up memory (best to limit projects to 500,000 record items or less). · if the application crashes, you lose data back to the 'last saved' version. (need to save after major activity). · Can't multiplex access to the same data from more than one machine. (currently not an issue). Our rule of thumb is to limit project sizes to 1 or 2 gigs per project, to a maximum of 100,000 to 200,000 items. If you have data sets bigger than 200,000 items, or larger than 2 gigs in size, then best to break the data down into multiple projects, possibly spread across multiple machines. Another rough rule of thumb: A single machine running Discovery Assistant can process on average 1 gig of data per day. For really large projects (Terabytes in size): We provide an Access Database (MDB) tool we call TeraBite, that enumerates all the files in a given directory tree or server share, then creates a database containing that list of files to process. The database list can then be written out as multiple text based Load List for further processing by Discovery Assistant. Load List contents are defined by a maximum number of files, or maximum cumulative file size. Load Lists are serially processed by the service provider in batches across multiple computers. As loads are completed, they are exported out to a format suitable for review by the customer. This way data flows through the process in chunks, and delivery of the first chunk can happen in a single day (or less) after start of conversion. Support for ScalabilityAdditional conversion machines can be added to improve overall throughput. Large conversion requests can be broken down into a set of smaller jobs, each of which is run on a different computer. 
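Splitting a large file inventory into Load Lists bounded by file count or cumulative size, as described for TeraBite above, can be sketched like this. TeraBite itself is an Access database tool and its actual logic is not shown in this readme, so the function name, the `(path, size)` tuple layout and the default limits (taken from the rule of thumb above) are assumptions for illustration.

```python
def build_load_lists(files, max_files=100_000, max_bytes=1024 ** 3):
    """Group (path, size_in_bytes) pairs into batches.

    A batch is closed when adding the next file would exceed either limit.
    A single file larger than max_bytes still gets a batch of its own.
    """
    batches = []
    current, current_bytes = [], 0
    for path, size in files:
        would_overflow = bool(current) and (
            len(current) >= max_files or current_bytes + size > max_bytes
        )
        if would_overflow:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(path)
        current_bytes += size
    if current:
        batches.append(current)
    return batches
```

Each returned batch could then be written out as one text Load List and handed to a separate conversion machine.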
To ensure that each computer has the same conversion settings, we recommend saving the DiscoveryAssistant.xml file, and the HKLM\Software\Imagemaker registry settings, and then duplicating these two files across the various machines. The discoveryAssistant.xml file (installed in the same directory as the discoveryAssistant.exe file) \program files\imagemaker\discovery assistant\discoveryAssistant.xml Contains all the global project settings (and is text readable). The remaining settings (that control document formatting) are saved in the registry. You can export the hive HKLM\Software\ImageMAKER to a .REG file, and use this as the other settings file. To match a second machine's settings: · install Discovery Assistant · copy over the DiscoveryAssistant.xml file · double-click on the saved imageMAKER.reg file Support for Lotus Notes (NSF)Discovery Assistant supports loading Lotus Notes NSF files natively. Before installing Discovery Assistant, make sure you have Lotus Notes client version installed first. (we've tested using Lotus Notes Domino Designer 6.0.3). If Discovery Assistant is already installed, install Lotus Notes, then re-install Discovery Assistant. At startup, Discovery Assistant looks for Notes.exe in the \program files\Lotus\Notes\ directory. It then forces a 'path' change to the local logged in user to make sure that the lotus directory is part of the system path in order for the dll's to work. (No idea why Lotus does not do this as part of its own installation). If you've just recently installed Lotus Notes, then you need to re-install Discovery Assistant to set the path information. Download site for Lotus Notes client: http://www-128.ibm.com/developerworks/downloads/ Lotus Notes®, Domino Designer, and Domino Administrator clients V8 or later. 
Here is the direct link to the notes client: (these link names change over time) http://www.ibm.com/developerworks/downloads/ls/lsndad/?S_TACT=105AGX28&S_CMP=DLMAIN Lotus Notes Metadata Type: Lotus Notes Document ID:B0DB4E68D9BF457B86256FBA00621AE9 From: CN=Helmuth X Fendel/OU=LAKE/OU=CORP/O=ABBOTT To:CN=Giorgio Martellino/OU=ADDITN11/OU=ADD_ITL_HUB/OU=ADD_EURO_HUB/OU=ADD_HUB/O=ADD/C=US@ABBOTT;"Karrer, Roberto (INT'L)" <Roberto.Karrer@ace-ina.com> Cc:Bryan.Willcox@ace-ina.com;CN=Charles M Santora/OU=LAKE/OU=CORP/O=ABBOTT@ABBOTT Bcc: Subject:Re: Sibutramina - Privileged & Confidential Sent:2002-04-10 10:23:54 Received:2002-04-10 10:23:56 Date Modified:2005-03-04 09:51:34 Date Created:2005-03-04 09:51:34 Date Accessed:2005-03-04 09:51:34 Size:18255 Importance:1 Priority:1 Mood:0 PreventCopying:0 ReturnReceipt:0 IsSentByAgent:0 Number of Attachments:0 Body: Message Contents EndBody: Known Problems: On some NSF files, we have troubles extracting attachments. Fix seems to be to stop, then re-start Discovery Assistant, then re-import the NSF file (or do a re'check if already imported). Alternate solution: Convert Lotus Notes messages to PST http://www.lotus-notes-export.com/XitNotes.asp Problem #2 A user with appropriate Access Control List (ACL) rights receives the following error when attempting to open a local replica of a database: "The database has local access protection and you are not authorized to access it locally." Solution This will occur in cases where a user other than the current user created the local replica. This occurs because the Notes client has a default security setting to encrypt local replicas. This setting is accessed via File -> Security -> User Security -> Notes Data -> Databases. To create local replicas that are not encrypted, select "Do not locally encrypt" rather than the default "Locally encrypt using". 
Alternate Solution: Switch to the user-supplied Notes ID file (filename.id), open the database, and enter the password. Go to File/Access Control and set everything to Manager. Then go to File/Application/Properties/Encryption Settings, check the "do not encrypt" box, and compact the database.

Related information:
How to Determine Which Databases Are Encrypted
Encrypting PAB causes error 'Unable to create location'
Error Accessing Server Database "This Database Has Loca
http://www-1.ibm.com/support/docview.wss?rs=0&uid=swg21088323

Problem #3
If you encounter problems, run the Logger (red button at the top right of the Discovery Assistant application). You can then email us the log contents for further analysis.

Example problem:
[08-06-27 11:35:47 AM DA::OpenNsfDatabase()] System.Runtime.InteropServices.COMException (0x80040154): COM object with CLSID {5FB98ACD-8EAA-4E2D-A980-9B1C678B8C4D} is either not valid or not registered.

Possible resolution:
1. From a DOS prompt, type 'path'. Make sure that the path includes the directory containing the nnotes.dll file (c:\program files\lotus\notes\NNOTES.DLL).
2. Re-register the nsfCracker.dll:
regsvr32 "c:\Program Files\Imagemaker\Discovery Assistant\NSFCRACKER.DLL"
If installing on Vista, make sure the DOS prompt is running in Admin mode (right-click on Command Prompt and choose 'Run as administrator').

Problem #4
Notes error: "You must supply the bulk decryption key in order to extract this file object."
Resolution: This occurs when the Encrypt incoming mail field is set to Yes in the Mail section of the user's Address Book entry. Changing it to No resolves the problem.

Problem #5
You attempt to read newly encrypted mail (i.e. mail encrypted with a new key) using an old backup ID file that does not contain the new key, and the following error occurs:
"Specified Private Key Does Not Exist."
Additionally, if an encrypted message has an attachment and you attempt to open it with an old backup ID, the message above is generated and an empty message with the attachment is displayed. If you then try to launch or detach the attachment, the following error occurs:
"You Must Supply the Bulk Decryption Key in Order to Extract This File Object <path\filename>"
or (in Notes 4.6x):
"The encrypted data has been modified or the wrong key was used to decrypt it: Could not detach to file <path\filename>"
or (in Notes 5.x):
"You cannot access portions of this document because it is encrypted and you do not have any of the keys: Could not detach to file <path\filename>"
This issue only occurs if an old ID is being used. To avoid the issue, use a current ID. The fact that the error messages could be more descriptive has been reported to Lotus Quality Engineering.
Possible Resolution: This issue might occur when a user's ID has been updated with a new Public Key, and the user is using an older version of their ID which contains the old Public Key. A user can initiate the updating of their Public Key by using the menu options: File, Tools, User ID, More Options, New Public Key. The ID will then need to be recertified.

Support for Novell GroupWise
Unfortunately, we currently do not support GroupWise directly. However...
there is a product that might be able to help with migration: http://www.transend.com/products_transend_migrator.asp There is a special Transend Migrator Forensic Edition license for use in forensic environments for eDiscovery. One license per workstation allows the conversion of an unlimited number of data files/mailboxes. Please contact us for more information on Transend Migrator Forensic Edition. Transend Corporation, 225 Emerson Street, Palo Alto, CA 94301 Phone: 650-324-5370 Converts Messages/Folders, Attachments, Archives, Address Books, Calendars and Tasks Between Virtually All Email Systems/Clients. Includes support for: Lotus Notes Outlook/Exchange (server or .pst file) Outlook MSG Files GroupWise (5.5+ for GW Archives) IMAP4 Server HTML Eudora Netscape/Mozilla/Thunderbird AOL CompuServe 2.0+ Outlook Express Pegasus Notework ExpressIT (Native and SMTP) cc:Mail DaVinci 3.0+ MHS/SMF-70 Calipso Archive Transport File (Transend proprietary format) Sun One (via IMAP) Pop Server http://www.transend.com/supported_mail_systems.asp Support for PaperPort .MAX filesIf you have the Paperport application installed, and it supports printing MAX files, then Discovery Assistant supports converting MAX files to TIFF. (tested and works). CommandLine: <D:\Program Files\ScanSoft\PaperPort\PPPAGEVW.EXE /p z:\web_test_files\5pages.max"> Forensics Tools:OST -> MSG Advanced Exchange Recovery. http://www.exchange-recovery.com/. $600 PST -> MSG Aid4Mail http://www.aid4mail.com hard drive usb write protect. Logicube Hard Drive & Media Duplication http://www.logicube.com/logicube/pressreleases/writeprotect.asp Support for OCRDiscovery Assistant now supports a native OCR feature. The default is to use Microsoft Office 2003 MODI control if available, otherwise, uses a SimpleOCR package shipped with Discovery Assistant. Microsoft MODI OCR uses the OmniPage SDK engine from Nuance software (and is our current best of breed recommendation). 
To confirm you have MODI installed, run Microsoft Office Document Imaging application, load a TIFF file, and select 'OCR'. The application will install OCR if not already installed. Discovery Assistant extracts text from documents during the conversion process. Extraction is extremely accurate as the text is generated by the print driver during the print process. If source documents are scanned images though, there is no text extraction when printing. In this case, you can manually OCR those documents you would like the text from by selecting the OCR button in the Conversion Tab. Fixes: OCR was not successful (no text was found) on one or more pages. http://support.microsoft.com/kb/918215/en-us Support for TIFF Pass-throughTo speed up the processing of scanned TIFF documents... if the source documents are already in the proper format (scanned B&W, standard dpi), then processing can be sped up by selecting 'Enable no-Print convert on images'. You can select this flag from the Admin / Configure screen. If this flag is set, then Discovery Assistant does not print the image, but creates an exact copy of the source image ready for further processing. formatted (standard dpi, and scanned in B&W) Support for Scanned TIFF FilesFor Discovery Assistant to properly fill in the Custodian / Box / Folder information at time of export, the scanner operator must save the original TIFF files according to the following rules: 1. Each Custodian gets their own output directory. An example custodian would be John Smith. 2. Within the custodian directories are subdirectories that correspond to each of the boxes. ie: if John Smith's documents come in three boxes, then there are three BOX folders in the John Smith directory. c:\...\John Smith\Box1 c:\...\John Smith\Box2 c:\...\John Smith\Box3
3. Within the box directories are the folder names. If a folder contains sub-folders, then those names are appended. For example, if Box1 contains 3 folders (January, February, March), and the January folder contains two sub-folders (Invoices, Receipts), then the output scanned TIFF files will be placed in the following directories:
c:\...\John Smith\Box1\January\*.tif
c:\...\John Smith\Box1\January-Invoices\*.tif
c:\...\John Smith\Box1\January-Receipts\*.tif
c:\...\John Smith\Box1\February\*.tif
c:\...\John Smith\Box1\March\*.tif

Discovery Assistant then uses the following logic to automatically generate the Custodian / Box / Folder export information:
1. The TIFF filename "c:\...\%1\%2\%3\filename.tif" is broken back into the following sub-directories:
sub-directory %1 is the Custodian name
sub-directory %2 is the Box name
sub-directory %3 is the Folder name
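The path rule above can be illustrated with a short Python sketch. It is not Discovery Assistant's own code: the function name `provenance_from_path`, the `Provenance` type, and the `c:\scans` root used in the usage note are hypothetical, and the sketch assumes that the Custodian, Box, and Folder are always the three directories immediately above the TIFF file.

```python
from pathlib import PureWindowsPath
from typing import NamedTuple


class Provenance(NamedTuple):
    custodian: str
    box: str
    folder: str


def provenance_from_path(tiff_path: str) -> Provenance:
    """Split a scanned TIFF path into Custodian / Box / Folder (%1, %2, %3)."""
    path = PureWindowsPath(tiff_path)
    dirs = path.parent.parts
    # Drop the drive or share root (e.g. 'c:\\') so it is never mistaken
    # for a custodian name.
    if path.anchor:
        dirs = dirs[1:]
    if len(dirs) < 3:
        raise ValueError(f"expected Custodian\\Box\\Folder above the file: {tiff_path}")
    custodian, box, folder = dirs[-3:]
    return Provenance(custodian, box, folder)
```

For example, `provenance_from_path(r"c:\scans\John Smith\Box1\January-Invoices\0001.tif")` yields custodian `John Smith`, box `Box1`, and folder `January-Invoices`. A path with fewer than three directories above the file raises an error rather than guessing.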
This way, no one has to hand-code any information. The Discovery Assistant operator checks that the scanner folder is correct when adding files into the project (checks that the scanner operator has done their job). If there is any confusion as to where documents came from, they can be immediately traced back to the original folder by using the displayed source path name. The source path name is reduced to Custodian / Box / Folder at time of export. If at some later date the Discovery Assistant operator is processing a PST file, or multiple folders of data, then the same rules apply when extracting the Custodian, Box, and Folder. (ie: everything is consistent moving forward). Handling WordPerfect files using Word: (WPD)If installing WPD support for the first time in Word, you may need to install the WPD plug-in. Next, you need to set up a file association for WPD files. Easiest way to do this is to right-click on a WPD file, then associate Microsoft Word with that file type. (Open association). Next, to get print and printto file associations established, easiest way to do this is to run the DA_Sysinfo.exe application and use it to copy across the DOC file associations. \program files\imagemaker\discovery assistant\DA_Sysinfo.exe wpd-doc To switch to using WordPrintTo to handle WordPerfect, go to Discovery Assistant Admin / Configure / Documents, and go to .DOC to get the over-ride settings. The Override setting for WPD will look something like: "C:\Program Files\ImageMaker\Discovery Assistant\WordPrintTo.exe" /pt "%1" "%2" "%3" "%4" Copy these same settings to the WPD entry in the Documents dialog. 
Calculation of MD5 Hash code to detect duplicates:
Under Options / De-Duping, users can set the following values:
Hash Code Sample Size (KB): 100 (set to 0 for entire file)
On conversion:
· Ignore Duplicates (process as usual)
· Skip Duplicates (don't convert)
· Copy Duplicates (copy the TIFF file from previous conversion) <---- recommended
· Link Duplicates (point to the TIFF file from previous conversions)

Hash codes are generated when the file is first entered into the database. To speed things up, users can set hash code generation to just the first K bytes of a file. Default is 100K. Duplicates are detected as files are added to the database. If a duplicate is found, the duplicate file and the file being added are both marked as 'duplicate'.

Email files are binary files with unique index values within them (MessageID). To check whether emails are duplicates, we extract and compare only the text contents of the message.

Algorithm to check for and assign duplicates:
For each new file being processed:
    If Message File, then extract message body (as text).
    Calculate MD5 hash code for first (x) bytes of file (multiple of 1K).
    Convert hash code to a string.
    Loop until Done:
        Search existing database for first matching MD5 hash code.
        If no matching MD5 hash code:
            Add new hash code. Done.
        Else:
            Binary compare the two matching files.
            If files match:
                Mark both as duplicates. Done.
            Else:
                Add a character extension to the hash value to make it unique, and loop.
    End Loop

Handling Terabytes of Data:
We rate our product at a gig per day per machine. 1 gig of data averages out to approximately 70,000 pages, and about 5 gigs of storage space. Actual conversion speeds are rated at 3,500 pages per hour of straight conversion (20 hours a day), plus an additional 4 hours a day to handle the other housekeeping tasks, such as file import, de-duplication, deblanking, Bates labeling, and exporting.
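As a quick check of the ratings above, 3,500 pages per hour over 20 conversion hours is 70,000 pages, which matches the figure of roughly one gig per machine per day. The Python sketch below turns those ratings into a rough planning estimate. The function name `estimate_job`, its parameters, and the assumption that work divides evenly across machines are illustrative only; real jobs also need the setup, quality control, and exception-handling time that is not modelled here.

```python
import math

PAGES_PER_HOUR = 3_500          # rated straight-conversion speed
CONVERSION_HOURS_PER_DAY = 20   # the other 4 hours go to housekeeping tasks
PAGES_PER_GIG = 70_000          # approximate average for 1 gig of source data
STORAGE_GIGS_PER_GIG = 5        # approximate storage needed per source gig


def estimate_job(source_gigs: float, machines: int = 1) -> dict:
    """Rough page count, storage, and elapsed days for a conversion job."""
    pages = source_gigs * PAGES_PER_GIG
    pages_per_machine_day = PAGES_PER_HOUR * CONVERSION_HOURS_PER_DAY
    machine_days = pages / pages_per_machine_day
    return {
        "pages": int(pages),
        "storage_gigs": source_gigs * STORAGE_GIGS_PER_GIG,
        # Assumes the work splits evenly across machines (an idealization).
        "days": math.ceil(machine_days / machines),
    }
```

For example, `estimate_job(10, machines=2)` gives 700,000 pages, 50 gigs of storage, and 5 days.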
In addition to straight conversion is the time to: · Set up the machines and install all appropriate software. · Quality control review of output data. · Exception handling. · Trouble-shooting. The Discover Assistant operating philosophy is that if you want to convert Terabytes of data, you need multiple-Terabytes of storage space and lots and lots of computers. To handle terabytes of data requires: · Enumerating what files are to be converted using our TeraBite application. · Breaking the project down into 1 or 2 gig Batch files. · Optionally run a global 'de-dup' check · Process the Batch files across multiple machines. · Export the resultant files back into a case management system for additional processing Assuming each GIG of data yields approximately 70,000 tiff pages, rough estimates as to time to process the data are as follows: Pages per TeraByte: 70,000 pages per gig * 1000 gigs = 70,000,000 pages Computer time to process a TB (assumes an average of 1 second per page): 70,000,000 * 1/60 = 1,200,000 minutes 20,000 hours or approximately 1000 days. Standard outsourcing prices per TB (3 cents a page): 70,000,000 * .03 = $200,000 If you had 100 computers on-site processing the data, 1TB would take 10 days to process. With two operators running the machines, costs to do a TB would be: $20K amortization of computers (10% of $200,000 worth of equipment) $20K amortization of software (10% of $200,000 worth of ImageMAKER, Office, etc) + $10K for operator costs $50K If you were to outsource the same job to a third party (at preferred rates), expected costs would be: $200K System Requirements:Windows 2000, Windows XP, Windows 2003 (client, server, or WTS). 1 gig of ram. 30+ gigs of hard drive space (for output files). Microsoft Office, Acrobat, IE6, and any other file type application pre-installed. Preferred Computer Configuration: 3 Gig memory Dual 64 bit AMD 2 GHz processor running Windows XP 200 GIG hard drive. 
Gigabyte network cable

Discovery Assistant comfortably handles the conversion of up to 100,000 files per project. For example, if you have one million files to convert, then our recommendation is to break them down into 10 separate projects.

Sample Hard Drive requirements:
A 3.3 Gig NSF file expands to:
15284 files
9 Gigs of source files
13 Gigs of TIFF/Text/Metadata

Handling Parent Child relationships:
When we load in PST / MSG / ZIP files, we keep track of all parent/child relationships between the related files. Specialized parent/child relationship handling is done at:
· Queuing for conversion.
· Assigning doc id's and bates numbers
· Time of export
· User interface can identify parent / child / sibling of any item in the queue.

In addition, there are a number of metadata fields designed explicitly to identify ranges of parent/children.

Current methodology for handling parent/children: we first load the parent, then we enumerate each of the children, assigning sequential FileID numbers as we go along. Next we process the children's attachments, assigning FileIDs as we move along, and so forth until the message is processed. We then go on to the next message.

'Child next' order turned OFF is how we load the files into Discovery Assistant.

                       0(0-6)
                          |
  -------------------------------------------------
  |         |         |         |         |       |
1(0-6)   2(2-9*)    3(0-6)      4         5   6(6-12*)
            |                                     |
    -----------------                   ---------------------
    |       |       |                   |         |         |
    7   8(8-15*)    9                  10        11     12(6-12*)
            |
    -----------------
    |       |       |
   13      14      15

'Child next' order turned ON is how we assign Bates Numbers and Document IDs in preparation for export.

                       0(0-15)
                          |
  -------------------------------------------------
  |         |         |         |         |       |
  1         2         9        10        11      12
            |                                     |
    -----------------                   ---------------------
    |       |       |                   |         |         |
    3       4       8                  13        14        15
            |
    -----------------
    |       |       |
    5       6       7

Assumptions:
1. There is only one parent; everything else is a child.
2. Every child has the same parent (0), and all files have the same range (0-15).
3. When we hit a PST file, that breaks the cycle.
PST files are not parents. If a msg file contains a PST file, then we don't keep track of children. 4. The diagram contains items, and range values in brackets. Any range value with an asterix in it is an incorrect 'child next' grouping. Parent Child relationships are reported in the export files through the following MetaTags: (Summation example) @ATTACHRANGE "filename" based @C GROUPRANGE "filename" based @C BATESGROUPRANGE "bates number" based @C BEGATTACH "filename" based @C ENDATTACH "filename" based Where "filename" can be any combination of DocID, FileID, Title, Bates Number etc. (naming scheme). If your original files are all named by their DOCID's, and you want to preserve that information at time of export, then set "filename" to %TITLE% in the naming scheme, and all exported files, and file ranges (except for BATESGROUPRANGE) will be based on the original filenames. Exported Ranges of DOCID and BATES Numbers:DocID Export fields for Summation: Parent: DOCID_00016 Attachments DOCID_00017-DOCID_00022 @ATTACH DOCID_00017; DOCID_00018; DOCID_00019; DOCID_00020; DOCID_00021; DOCID_00022 @ATTACHRANGE DOCID_00017-DOCID_00022 @C BEGDOC DOCID_000160001 @C ENDDOC DOCID_000160002 @C GROUPRANGE DOCID_00016-DOCID_00022 @C BEGATTACH DOCID_00016 @C ENDATTACH DOCID_00022 Bates export fields for Summation: Test Data: Parent: NTR00033-NTR00034 first attachment NTR00035-NTR00035 last attachment: NTR00041-NTR00053 @BATESBEG NTR00033 @BATESEND NTR00034 @C BATESGROUPRANGE NTR00033-NTR00053 @C BATESGBEG NTR00033 @C BATESGEND NTR00053 Note: @ATTACHRANGE and @ATTACH are the only export field that list JUST the attachments. All other fields include the mail message and attachments within the range. Support for UTC Date/TimeAll dates are UTC based (coordinated universal time), but expressed in the local time zone of the machine that is doing the conversion. Here are the exceptions: · All files have a Created, Modified and Accessed date stored in UTC format. 
These date/times are displayed in local time by the machine that is accessing them. For example, if the files are written to a hard drive in London at 9:00 AM (Local time is UTC-0), and that hard drive is then read in Vancouver (Local Pacific time is UTC -8), the time stamp will be reported as 1:00 AM. · If files are 'copied' to another directory or filename, then the Create and Last accessed dates of the new file will change to today's date and time. · If files are 'zipped', then 'unzipped', the Created Date, and Last Accessed Date will be set to today's date/time. Only modified date will be retained. Same goes for RAR compressed files - when uncompressed, only the 'modified' date will be correct. · Email received/sent date/time values are stored UTC, and reported in local time. · Word / Excel / Powerpoint, Acrobat store the following additional UTC dates in their MetaData: Date Created Date Last Printed Date Last Saved When we extract these date/times, we use these values to over-ride the operating system values for Created/Modified. Note: most simple file types (such as TXT, HTML, JPEG) have operating system date/time values for Created, Modified and Accessed, and do not contain any embedded Date/Time Metadata. Loading in a 'Selection Set'Discovery Assistant supports two different selection sets: Document ID, and FileID. (FileID's are internally assigned numbers. Document ID's are user defined strings). To define a selection set, create a TEXT file containing the FileID's, or DocumentID's, one file per line in the text file, then use the 'select' button to load that file in. Items in file list are marked 'selected'. Practical use: 1. Do a first pass-through to produce converted documents. 2. Assign DocumenID, and export DocumentID with data set. 3. Customer reviews data, and provides you with a list of DocID's to produce. 4. Load selection set in 'converted' tab. 5. Assign Bates Numbers to selection set. Choose 'child recursive' to get children. 6. 
Sort on Bates Numbers, and select only those that have been assigned Bates numbers. Parents of children are identified by 'bates range'.
7. Select which files you want to Bates stamp.
8. Bates Stamp 'selected' files and export, OR export unstamped selected files.

Estimating Page Counts
If you convert to TIFF, and then use the 'summary report' on the all files tab, you get a CSV file containing formatted information about:
· file types
· number of pages per file
· total size of files by type
· number of files converted / passthrough / failed / skipped.

If you don't want to invest time converting the files, but still want an estimated page count (for billing purposes), then queue the data for MetaData conversion only, and convert (metadata only). Discovery Assistant then estimates the page count based on file size if the count is not already included in the metadata.

Values used to estimate bytes per page, broken down by file type, are stored in the registry at:
HKLM\Software\ImageMAKER\DiscoveryAssistant\Settings\PageCountEstimates
If you change these values, you need to stop and re-start Discovery Assistant for them to take effect. These values are also stored in the setup.ini file, and are reset at installation time.

Actual values for TIFF files are calculated. Metadata values for Word and PDF are used if available. All other file types default to 10,000 bytes per page. Some file types (like zip, nsf, pst, msg) are estimated at 1 page per file.
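The estimation rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not the product's code: the function name `estimate_pages` and its parameters are hypothetical, the `bytes_per_page` dictionary merely stands in for the per-type registry values, and rounding up to a minimum of one page is an assumption the text does not specify.

```python
import math
from typing import Dict, Optional

DEFAULT_BYTES_PER_PAGE = 10_000               # default stated in the text
ONE_PAGE_TYPES = {"zip", "nsf", "pst", "msg"}  # container types counted as 1 page


def estimate_pages(
    extension: str,
    size_bytes: int,
    metadata_pages: Optional[int] = None,
    bytes_per_page: Optional[Dict[str, int]] = None,
) -> int:
    """Estimate a page count from file type and size.

    metadata_pages: page count already found in the file's metadata
                    (e.g. Word or PDF), used as-is when present.
    bytes_per_page: optional per-type overrides (stand-in for the registry).
    """
    if metadata_pages is not None:
        return metadata_pages
    ext = extension.lower().lstrip(".")
    if ext in ONE_PAGE_TYPES:
        return 1
    per_page = (bytes_per_page or {}).get(ext, DEFAULT_BYTES_PER_PAGE)
    # Assumption: round up, and never report fewer than one page.
    return max(1, math.ceil(size_bytes / per_page))
```

With the defaults, a 25,000-byte TXT file is estimated at 3 pages, while a PST file of any size is counted as 1 page.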
For more information
ImageMAKER Development Inc.
416 Sixth Street, Suite 102
New Westminster, BC Canada V3L 3B2
http://www.imgmaker.com
Copyright © 2004-2008

To contact us from overseas: Sales: 1.604.525.2170
Local (Pacific) time: GMT-8

ImageMAKER Development Inc.
Sales: toll free (866) 525-2170 or (604) 525-2170
Support: (604) 525-2108
Fax: (604) 520-0029
Email: sales@imgmaker.com support@imgmaker.com