Wednesday, August 12, 2015

9,098,487: categorization based on word distance

One of the correlates of the social appreciation, or value, of scientists is the Gini coefficient. Indeed, poor people cannot afford technologies that make their lives more comfortable (and would not be able to amortize the investment anyway, because their labor has little value). Rich people also cannot necessarily amortize an investment in new technology, because they can simply hire the poor to do the work for them for low pay. What drives technology is a low Gini coefficient, because it is a broad middle class that can and does invest in new technologies that make their lives more efficient.

Worldwide, over the past two centuries the Gini coefficient has been on the rise, yet there has been incredible technological progress. This means we have to look at a smaller geographical scale. Indeed, for the late 2000s, the United States had the 4th highest measure of income inequality of the 34 OECD countries measured, after taxes and transfers had been taken into account. As it happens, except for the medical sciences, the American science establishment is only a pale shadow of what it was half a century ago.

In this context, we were naive when three years ago we set out to solve an efficiency problem for customers processing medical bills, invoices, and receipts. These artifacts are still mostly on paper, and companies spend a huge amount of time having minimum-wage workers ingest them for processing in their accounting systems. In the datasets we received from the customers, the medical bills were usually in good physical shape, but the spelling checkers in the OCR engines had a hard time cracking the cryptic jargon. Receipts were in worse physical shape: printed on poor printers, crumpled, and often imaged by taking a picture with a smartphone.

Surprisingly, invoices were also difficult to process. They often had beverage stains, and sometimes they had been faxed several times through machines that looked like they had mustard on the platen and mayonnaise under the lid. But the worst problem was that in the datasets we received, the form design of the invoices was very inconsistent: the fields and their labels were all over the place, many gratuitous and inconsistent abbreviations were used, and so on.

Considerable effort was spent in the 80s to solve this problem: Dan Bloomberg developed mathematical morphology methods to clean up the scanned images (Meg Withgott coined the phrase "document dry cleaning"), and Rob Tow, with David Hecht and others, invented the glyph technology to mark forms so that the location and semantics of each field could be looked up. Maybe due to the increasing Gini coefficient, these efforts were not commercially successful. However, because this time we had actual customers, we decided to give it another go.

Note that OCR packages already have pretty good built-in spelling checkers, so we are dealing with the hard cases. The standard approach used in semantic analysis is to predetermine all situations for a lexeme and store them in NoSQL databases. In our application this turned out to be too slow: we needed a response time under 16 µs.

Looking at our datasets, we found a number of different issue types:

  • synonym: a word or phrase that means exactly or nearly the same as another word or phrase in the same language
  • abbreviation: a shortened form of a word or phrase
  • figure of speech: a word or phrase used in a nonliteral sense to add rhetorical force to a written passage
  • misspelling: a word or phrase spelled incorrectly
  • mistyping: like misspelling, but depends on the distance between the keys on a keyboard or the stroke classes in the OCR engine
  • metonym: a word, name, or expression used as a substitute for something else with which it is closely associated
  • synecdoche: a figure of speech in which a part is made to represent the whole or vice versa
  • metalepsis: a figure of speech in which a word or phrase from figurative speech is used in a new context
  • kenning: a compound expression in Old English and Old Norse poetry with metaphorical meaning
  • acronym: an abbreviation formed from the initial letters of other words and pronounced as a word

In addition to computational speed, we wanted to be able to address each of these constructs explicitly. In the following we call a lexeme a "category," because the process used is a categorization more than a linguistic analysis.

We solved the problem by introducing a metric, so that we could deal with everything as distances and intervals. For the metric we chose the edit distance, also known as the Levenshtein distance: the minimum number of single-character insertions, deletions, and substitutions needed to change one word in the category into another word in the category. This metric can be computed very fast. We also tried the Damerau-Levenshtein distance, which additionally counts transpositions of adjacent characters as single edits, but in this application it did not make a difference.
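
For concreteness, here is a minimal Python sketch of the Levenshtein computation with the classic dynamic-programming recurrence (the function name and implementation details are ours; the patent does not prescribe one):

    def levenshtein(a: str, b: str) -> int:
        """Minimum number of single-character insertions, deletions,
        and substitutions needed to turn a into b."""
        if len(a) < len(b):
            a, b = b, a
        previous = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            current = [i]
            for j, cb in enumerate(b, start=1):
                current.append(min(
                    previous[j] + 1,                # deletion
                    current[j - 1] + 1,             # insertion
                    previous[j - 1] + (ca != cb),   # substitution
                ))
            previous = current
        return previous[-1]

For example, levenshtein("invoice", "invoce") returns 1, a plausible OCR drop-out.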

With a metric, everything was simple. We took the lexemes in each category, computed the center of gravity to determine the prototype of the category, and computed the diameter of the category. Each category then received a label: the word or phrase used in the document typing application.
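
Here is a sketch of how these quantities might be computed. A discrete metric space has no true mean, so we read "center of gravity" as the medoid, the lexeme minimizing the total distance to all the others; that reading is our assumption, not a statement of the patented method. It reuses levenshtein() from the sketch above:

    from itertools import combinations

    def prototype_and_diameter(lexemes: list[str]) -> tuple[str, int]:
        """Return the category prototype (medoid) and the diameter,
        i.e. the largest pairwise Levenshtein distance in the category."""
        medoid = min(lexemes,
                     key=lambda w: sum(levenshtein(w, v) for v in lexemes))
        diameter = max(
            (levenshtein(u, v) for u, v in combinations(lexemes, 2)),
            default=0,
        )
        return medoid, diameter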

[Figure: interval of a lexeme]

With misspellings the diameters could be small, but with synonyms and figures of speech the intervals could be quite large and could overlap. The intersections were easy to compute. Because in all three of our datasets the dictionaries were small, we resolved the overlaps visually: we constructed a graph in which each node had its category's diameter as a value and each edge between two nodes had the Levenshtein distance between the two category prototypes as a weight. We then plotted the graph with Gephi and split the overlapping categories into smaller ones with the same label.
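
A sketch of how such a graph could be assembled and handed to Gephi, reusing the helpers above together with networkx and its GEXF export (the category dictionary and the choice of prototype-to-prototype edge weights are our reading of the process, not the patent's wording):

    from itertools import combinations

    import networkx as nx

    def category_graph(categories: dict[str, list[str]]) -> nx.Graph:
        """One node per category, carrying its diameter as a value;
        one edge per pair, weighted by the Levenshtein distance
        between the two category prototypes."""
        graph = nx.Graph()
        prototypes = {}
        for label, lexemes in categories.items():
            prototype, diameter = prototype_and_diameter(lexemes)
            prototypes[label] = prototype
            graph.add_node(label, value=diameter)
        for a, b in combinations(categories, 2):
            graph.add_edge(a, b,
                           weight=levenshtein(prototypes[a], prototypes[b]))
        return graph

    # Gephi opens GEXF files directly (File > Open).
    categories = {"invoice": ["invoice", "inv.", "invoce"],
                  "amount": ["amount", "amt", "total"]}
    nx.write_gexf(category_graph(categories), "categories.gexf")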

With this, document typing became very fast: for each word or phrase we looked up which category it fell into, and within that category we looked for a match. When there was one, we replaced the word with the category's label; when there was not, we added the word to the category and logged it for manual verification.
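
A minimal sketch of that lookup, under the assumption that a category is stored as its prototype, a radius (half its diameter), its label, and its known lexemes; the record layout is hypothetical:

    import logging

    def type_word(word: str, categories: list[dict]) -> str:
        """Map a word to the label of the category it falls into.
        Hypothetical category record:
        {"label": str, "prototype": str, "radius": int, "lexemes": set}."""
        for cat in categories:
            if levenshtein(word, cat["prototype"]) <= cat["radius"]:
                if word not in cat["lexemes"]:
                    # New variant: accept it provisionally and flag it.
                    cat["lexemes"].add(word)
                    logging.info("verify: %r added to category %r",
                                 word, cat["label"])
                return cat["label"]
        return word  # fell outside every category; leave unchanged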

Patent 9,098,487 was filed November 29, 2012 and issued August 4, 2015.