Structured Documents: Definitions, the Challenges, and the Methods to Manage Unstructured Content – Chapter 1 


Paper comes in all kinds of shapes, sizes, and formats. Each variation in a document’s form, content, layout, and complexity brings its own challenges. For those familiar with a loan package, think about all the different document types, page sizes, designs, colors, formats, sources, and file types that exist in one.


Specifically, we’ll describe these document types with regard to the mortgage and title industry, since most people have experience with the documents in these business transactions, and illustrate where the challenges lie and how they are being addressed. There are three main paper or document formats: structured, semi-structured, and unstructured.

Structured or Fixed Form

These are generally the easiest documents to index and store. The documents are created as forms, and then someone fills in the form. The data is always in the same place, and the indexes are clearly defined because the form tells the person filling it in where to enter each piece of information. Think of these fields like fields in an electronic form or database: each field on the page generally maps one-to-one to an index field. For example, fields might include First Name, Last Name, Street Address, Zip Code, Loan Number, and so on. Examples of forms in the mortgage and title market would include HUD-1, tax forms, and loan applications.

For software to know how and what data to extract, a sample document is scanned into the system, and the fields are mapped out as a template. Nothing moves around on these pages, so the software knows to look in the same place every time for each piece of information.
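
The template idea above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular imaging product works: it assumes the scanned page has already been converted to lines of text, and it uses character rows and columns as a stand-in for the pixel regions a real system would map. The names Zone, TEMPLATE, and extract_fields, along with the field positions, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    """A fixed region on the page: one row and a span of columns."""
    row: int
    col_start: int
    col_end: int  # exclusive


# Hypothetical template, mapped once from a sample form.
# Positions are assumptions chosen to match the sample page below.
TEMPLATE = {
    "first_name": Zone(row=0, col_start=12, col_end=32),
    "last_name": Zone(row=1, col_start=12, col_end=32),
    "loan_number": Zone(row=2, col_start=12, col_end=32),
}


def extract_fields(page_lines, template):
    """Read each field from the same fixed position on every page."""
    fields = {}
    for name, zone in template.items():
        # A missing row yields an empty value instead of an error.
        line = page_lines[zone.row] if zone.row < len(page_lines) else ""
        fields[name] = line[zone.col_start:zone.col_end].strip()
    return fields


# A filled-in page, standing in for the recognized text of a scanned form.
page = [
    "First Name: Jane",
    "Last Name:  Doe",
    "Loan No.:   0012345",
]

print(extract_fields(page, TEMPLATE))
# {'first_name': 'Jane', 'last_name': 'Doe', 'loan_number': '0012345'}
```

Because the template is defined once and reused, every page of the same form type is read with the same lookups. The sketch also shows why the approach depends on a fixed layout: if a value shifts to another row or column, the zone no longer covers it.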

You can see an example of how easily this works by using software like Adobe Acrobat Professional. Run an image of a form through this software, and it’ll automatically identify areas it thinks are form data. The imaging industry has had great success with these document types for more than a decade.


In our next chapter, we’ll focus on Semi-Structured Forms.