Templatic documents, such as receipts, bills, insurance quotes, and others, are extremely common and critical in a diverse range of business workflows. Currently, processing these documents is largely a manual effort, and automated systems that do exist are based on brittle and error-prone heuristics. Consider a document type like invoices, which can be laid out in thousands of different ways — invoices from different companies, or even different departments within the same company, may have slightly different formatting. However, there is a common understanding of the structured information that an invoice should contain, such as an invoice number, an invoice date, the amount due, the pay-by date, and the list of items for which the invoice was sent. A system that can automatically extract all this data has the potential to dramatically improve the efficiency of many business workflows by avoiding error-prone, manual work.
This is a novel approach using representation learning to automatically extract structured data from templatic documents. Experiments on two corpora (invoices and receipts) show that we’re able to generalize well to unseen layouts.
This is a difficult challenge because it combines NLP and computer vision. Documents do not contain natural language, but text is found in forms and tables. An understanding of the two-dimensional layout of text on the page is key to understanding such documents.
In contrast to adapting techniques in NLP, computer vision or combinations, we propose using representation learning. Our extraction system uses knowledge of the types of the target fields (invoice_date
, currency
, etc) to generate extraction candidates, and a NN learns a dense representation of each candidate independent of the field it belongs. We also learn a representation for the field itself, based on neighboring words in the document.
The design of our system relies on a few observations about hwo information is often laid out in form-like documents. We can encode these priors into the architecture and its features.
- Each field often corresponds to a well-understood type
For example, the only likely extraction candidates for the invoice_date
field in an invoice are instances of dates. A currency amount of $25 would clearly be incorrect. Since there are orders of magnitude fewer dates on an invoice as there are text tokens, limiting the search space by type dramatically simplifies the problem. Consequently, we use a library of detectors for several common types such as dates, currency amounts, integers, address portals, emails addresses, etc. to generate candidates.
- Each field instance is usually associated with a key phrase that bears an apparent visual relationship to it
Usually there is a visually clear 'Date' sign next to the invoice_date
. We call such indicative words key phrases. Proximity is not the only criterion that defines a key phrase. It is also not the case that the key phrase always occurs on the same line. An effective solution needs to combine the spatial information along with the textual information. Fortunately, in our experience, these spatial relationships exhibit only a small number of variations across templates, and these tend to generalize across fields and domains.
- Key phrases for a field are largely drawn from a small vocabulary of field-specific variants
About 93% of the nearly 8400 invoice date instances were associated with key phrases that included the words “date” or “dated” and about 30% included “invoice”. Only about 7% of invoice dates had neither of these words in their key phrases. The fact that there are only a small number of field-specific key phrases suggests that this problem may be tractable with modest amounts of training data.
Given a document and a target schema, we generate extraction candidates for each field from the document text using the field type. We then score each candidate independently using a neural scoring model. Finally, we assign at most one scored candidate as an extraction result for each field.
Our system can ingest both native digital as well as scanned documents. We render each document to an image and use a cloud OCR service, which allows this to work with native digital documents, such as PDFs, and document images (e.g., scanned documents).
We then run a candidate generator using a could-based entity extraction service that identifies spans of text in the OCR output that might correspond to an instance of a given field. The candidate generator utilizes pre-existing libraries associated with each field type (date, number, phone-number, etc.), which avoids the need to write new code for each candidate generator.
Since the recall of the overall extraction system cannot exceed that of the candidate generators, it is important that their recall be high. Precision is, however, largely the responsibility of the scorer and assigner.
Given a set of candidates from a document for each field in the target schema, we want to identify the correct extraction candidate (if any) for each field. First we compute a score for each candidate independently using a NN, then we assign to each field the scored candidate that is most likely to be the true extraction for it.
This separation of scoring and assignment allows us to learn a representation for each candidate based only on its neighborhood, independently of other candidates and fields.
The scoring module takes as input the target field from the schema and the extraction candidate to produce a prediction score, the likelihood that the candidate is indeed a value one might extract for that field.
We propose an architecture where the model learns separate embeddings for the candidate and the field it belongs to, and where the similarity between the candidate and field embeddings determines the score. We believe that such an architecture allows a single model to learn candidate representations that generalize across fields and document templates.
The essential features of a candidate are the text tokens appearing nearby, and their positions. We define a neighborhood zone around the candidate extending all the way to the left of the page and about 10% of the page height above it. Any text tokens whose bounding boxes overlap by more than half with the neighborhood zone is considered to be a neighbor. We also use the absolute position if the candidate itself as a feature.
We do not incorporate the candidate text into the input. This text was already the basis for generating the candidate in the first place. Withholding this information from the input to the model avoids accidental overfitting to our somewhat-small training datasets.
We embed each of the candidate features separately. See figure 3 in paper.
Regarding invoices, we use 14k single-page invoices. The documents are from different vendors so they do not share any common templates. The second corpus contains 600 documents belonging to different templates, with no templates in common with the first corpus. We have used human annotators to provide ground truth extraction results for the fields of amount_due, due_date, invoice_date, invoice_id, purchase_order, total_amount and total_tax_amount.
The candidate generators have high recall, but the precision varies dramatically from field to field.
We also evaluated the model using receipts, containing 600 receipt images with ground truth extraction results for four fields: address, company, date and total.
We show that the model is able to help the extraction system generalize to unseen templates, and we probe the model to show that it learns meaningful internal representations. See paper section 6 for more.
In this initial foray into this challenging problem, we limited our scope to fields with domain-agnostic types like dates and numbers, and which have only one true value in a document. In future work, we hope to tackle repeated fields and learn domainspecific candidate generators. We are also actively investigating how our learned candidate representations can be used for transfer learning to a new domain and, ultimately, in a few-shot setting.