Project Description
This project creates and shares code for recovering PDF documents that were uploaded to and stored in Oracle databases through "Oracle Forms" applications.

Background

The Problem
We needed to extract thousands of Adobe Acrobat PDF files that had been uploaded to "long/raw" fields in an Oracle database through an Oracle Forms application, and move them to BLOB fields in a new table for access by a new web-based application. The PDF files are stored in the database in an unknown format. Simple binary extracts resulted in unusable files that cannot be read with Adobe Acrobat. The PDF files can be accessed individually through the Oracle Forms application, where they open in Adobe Acrobat, but we have no way to automate this process.

Oracle Unable to Provide a Solution
The Oracle Forms application allowed users to upload Microsoft Word, Microsoft Excel and PDF documents, and the files were somehow saved using OLE containers. Oracle provided us with code to recover the Microsoft documents from the database and save them to a file system where they could be accessed normally through their Microsoft applications. Unfortunately, when the same code was used to recover PDF documents, the resulting files could not be opened or viewed with Adobe Acrobat applications.

No Third-Party Solution
We spent a lot of time searching for a third-party solution. We found a few others who were also searching for a solution, a few interesting clues, but no answers. The more interesting and relevant results of our search will be included elsewhere in this project.

Extracting the Documents with VBA
Because we were not familiar with Oracle products or the Oracle Forms application, we attempted to access the files through a Visual Basic for Applications (VBA) script running in a Microsoft Access database. An ODBC connection was established to the database, and our initial attempts simply extracted the PDF files from the database and saved them directly to the file system. The resulting files could not be opened successfully with Adobe Acrobat applications, but examination of the files with TextPad revealed that they were very similar to "normal" PDF files.

Analyzing the Files
We started uploading PDF files through the Oracle Forms application, then extracting them with the VBA application and comparing the results with the originals. The files were nearly identical, except that several bytes of data appeared before the PDF document itself (a prefix or header) and after it (a suffix or footer). After removing this "wrapper" we were able to open and read smaller files without error. Larger files, however, would flash error messages when opening and/or would contain errors within the document itself: unintelligible characters, pages that wouldn't display, and images that were mostly correct but were missing sections.
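The wrapper removal described above can be sketched in a few lines of Python. This is only an illustration, not the VBA utility's actual code: the function name `strip_wrapper` is hypothetical, and the sketch assumes the wrapper can be found by looking for the standard PDF markers (`%PDF-` at the start of a PDF file and `%%EOF` at the end) rather than by knowing the exact layout of the header and footer.

```python
def strip_wrapper(data: bytes) -> bytes:
    """Return the bytes from the first '%PDF-' through the last '%%EOF'.

    Assumption: everything before the first '%PDF-' marker and after the
    last '%%EOF' marker belongs to the storage wrapper, not to the PDF.
    """
    start = data.find(b"%PDF-")
    end = data.rfind(b"%%EOF")
    if start == -1 or end == -1 or end < start:
        raise ValueError("PDF start/end markers not found")
    # Keep the '%%EOF' marker itself; anything after it is dropped.
    return data[start:end + len(b"%%EOF")]
```

Note that this also drops any line break that followed `%%EOF` in the original file, and that it does nothing about data inserted inside the document body.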

Further study turned up 512-byte blocks of data that had been inserted periodically into the extracted PDFs. When these blocks were removed manually along with the headers and footers, the PDF files were identical to the originals and would consistently open in Adobe Acrobat with no noticeable errors.
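When the original file is available, the inserted blocks can be located by comparing the two files byte by byte instead of by eye. The following is a minimal sketch using Python's standard `difflib` module; the function name `inserted_spans` is hypothetical, and the approach is only practical for small test files because `SequenceMatcher` becomes slow on large inputs.

```python
from difflib import SequenceMatcher


def inserted_spans(original: bytes, extracted: bytes):
    """List (offset, length) of byte runs present only in `extracted`.

    Offsets are positions in the extracted file. Both inputs are assumed
    to have had any header and footer removed already.
    """
    # autojunk=False disables the heuristic that ignores frequent bytes.
    matcher = SequenceMatcher(None, original, extracted, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":
            spans.append((j1, j2 - j1))
    return spans
```

Running this on several uploaded test files of different sizes would show whether the inserted blocks always have the same length and where they tend to fall.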

The Extraction Utility
The VBA utility was modified to remove these headers, footers and "delimiters" (the inserted 512-byte blocks), and then refined several times to improve its ability to detect and remove the delimiters. The extracted PDFs seemed to be normal and error free; however, we did not have sufficient resources to test every downloaded file for accuracy and did not have the originals for comparison.
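Without the originals, one inexpensive automated check is to use the byte offset that a PDF file stores about itself. Near its end, a PDF contains the keyword `startxref` followed by the byte offset of its last cross-reference section; if blocks are still inserted (or too much was removed) before that point, the offset will no longer land on the section. The sketch below is not part of the VBA utility. The function name is hypothetical, and it assumes a classic cross-reference table (which begins with the keyword `xref`) and a file that starts at the `%PDF-` header; files that use cross-reference streams instead would need a different test.

```python
import re


def xref_offset_is_consistent(pdf: bytes) -> bool:
    """Check that the last 'startxref' offset points at an 'xref' keyword.

    Assumptions: the file uses a classic cross-reference table and the
    '%PDF-' header sits at byte 0, so stored offsets match file offsets.
    """
    matches = list(re.finditer(rb"startxref\s+(\d+)", pdf))
    if not matches:
        return False
    offset = int(matches[-1].group(1))
    return pdf[offset:offset + 4] == b"xref"
```

A file that passes is not proven to be intact, but a file that fails is a good candidate for manual inspection.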

Conclusion
The extraction utility seems to be working, but has only been tested on an extremely small sample from a single user. Our next improvements to the utility will likely be an attempt to improve the delimiter-detection code and improve the speed of the code.

Purpose

This project was established in CodePlex to:
  • Share what we've learned about the structure of PDF documents stored in "long/raw" fields by our Oracle Forms application,
  • Share the VBA extraction utility with others who may find it useful,
  • Improve the extraction utility by getting feedback from those who have attempted to use it,
  • Serve as a focal point for others to advance this knowledge and share their own PDF recovery code and techniques.

Contents

  • Oracle Forms Code: Fragments of code in our Oracle Forms application used to upload and download PDF files. Also the code used to recover Microsoft Word and Excel files from the database to the file system.
  • PDF File Structure: What we know about the headers, footers and delimiters found in the extracted PDF files.
  • The VBA File Recovery Utility: Download the VBA utility
  • Links to other resources
  • User feedback
Last edited Aug 30 2007 at 3:13 PM by DaddyUnit, version 16

 
