Semi-Structured Documents: Definitions, The Challenges, And The Methods To Manage Unstructured Content – Chapter 2

Our second chapter in the series “Best Practices for Managing Unstructured Data” will focus on the definition of semi-structured documents. We’ll continue to add chapters around the solutions and best practices regarding managing this information.

Axis Technical Group recently exhibited at the AIIM Conference in San Diego. During this event, Axis hosted a roundtable entitled “Best Practices for Managing Unstructured Data.” What became evident is that there are many different interpretations of how to define “Unstructured Data.”

Semi-Structured Documents

These documents are still “forms,” but the data tends to flow more freely around the page. Many of them are documents sent to you with information already on them, rather than ones you ask someone else to complete. There is still some structure; for example, you can expect key fields to be at the top of the page, though their exact layout may change from vendor to vendor.

Examples of this format would be an invoice or a closing statement. In most cases, the top of page one of a closing statement lists “Company, Address, Phone, Buyer/Borrower, Escrow No., Close Date, Proration Date, Preparation Date, and Property Address,” but then comes the tricky part: the line items.

On semi-structured documents, not only do the primary key indexes at the top shift position from client to client, but the line items like “Charges, Adjustments, and Fees” could appear on any line in a table, or even on another page. These documents present some real challenges, but software has come a long way and can do a good job of capturing the key indexes. Many organizations choose not to capture all the information on the page and instead focus on a few indexes, so they can store the file and search for it by those indexes.

Software is trained to look for words like “First Name” or “Escrow No.” and then treat the words next to that term as the index value. In many cases, these items are enough to file a page and associate it with the rest of the mortgage package, which then allows it to be “organized.”
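To make the idea concrete, here is a minimal Python sketch of label-anchored extraction. It is an illustration only, not how any particular capture product works: the function name `extract_indexes`, the label list, and the sample text are all hypothetical, and it assumes the page has already been converted to plain text by OCR, with fields on the same line separated by two or more spaces.

```python
import re

# Hypothetical labels to look for; a real system would configure these per document type.
LABELS = ["Buyer/Borrower", "Escrow No.", "Close Date"]


def extract_indexes(ocr_text, labels=LABELS):
    """Return {label: value} for every label found anywhere in the OCR text.

    The value is the text that follows the label on the same line, up to
    the end of the line or a gap of two or more spaces (assumed here to
    separate neighboring fields).
    """
    found = {}
    for label in labels:
        pattern = (
            re.escape(label)             # the label itself, matched literally
            + r"[ \t]*:?[ \t]*"          # optional colon and surrounding spaces
            + r"(.+?)(?=[ \t]{2,}|$)"    # value: stop at a wide gap or end of line
        )
        match = re.search(pattern, ocr_text, flags=re.IGNORECASE | re.MULTILINE)
        if match:
            found[label] = match.group(1).strip()
    return found


if __name__ == "__main__":
    sample = (
        "ACME Title Company\n"
        "Buyer/Borrower: Jane Doe    Escrow No.: 12345-AB\n"
        "Close Date 01/15/2020\n"
    )
    print(extract_indexes(sample))
```

Because the search is anchored on the label rather than on a fixed page position, the same code works when a different vendor places “Escrow No.” on another line. Labels that do not appear are simply left out of the result, so a downstream step can route those pages for manual review.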

High-volume loan-processing organizations have implemented more advanced software solutions to capture all critical data from a loan package. This takes more training and costs more money, but in an extremely competitive market, it delivers a very attractive return on investment.

In other instances, due to the complexity of the documents, some organizations do simple index extraction and then send the images to a data-entry shop to manually key in the rest of the desired data. These Document Processing Outsourcers (DPOs) have become popular with organizations that can send this work overseas to low-cost processing centers running 24/7, with potential turnaround times of less than a day. Though attractive, the cost can add up when you are paying for every keystroke. In addition, it’s hard to scale up and down as volumes change, which is very typical in this industry.

One critical department where semi-structured documents are processed very successfully is accounting. Invoices are semi-structured, high-volume documents for most organizations, and capturing them automatically can save a company a great deal of time and human effort otherwise spent entering the information into line-of-business and accounting software packages. AP processing is, in fact, the largest use of Document Imaging software, since every company has an accounting department.

In our next chapter, we’ll focus on Unstructured Documents.
