Entity extraction is becoming a mission-critical tool for finding mentions of people, places, organizations, and products in massive quantities of text. In patent searches, law enforcement, voice-of-the-customer analysis, ad targeting, content recommendation, e-discovery, and anti-fraud, entity extraction enables swift analysis of gigabytes of data.
Among named entity recognition systems, those that rely on machine learning to find entities, such as the entity extraction function in Babel Street Text Analytics (formerly Rosette), have a distinct advantage: they can find previously unknown entities. Furthermore, because statistical entity extractors are context sensitive, they can disambiguate between places named Paris and people named Paris.
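To see what context-sensitive disambiguation looks like in practice, here is a minimal sketch using the open-source spaCy library as a stand-in. This is not Babel Street's API, and the exact labels depend on the model you load:

```python
# A toy demonstration of context-sensitive NER with spaCy (a stand-in,
# not Babel Street Text Analytics). Requires: pip install spacy &&
# python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

for text in [
    "Paris is the capital of France.",
    "Paris Hilton arrived at the gala.",
]:
    doc = nlp(text)
    for ent in doc.ents:
        # The same surface form can receive different labels depending
        # on its context, e.g. GPE (geopolitical entity) vs. PERSON.
        print(text, "->", ent.text, ent.label_)
```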
Why entity extraction needs to be flexible
When it comes to entity extraction, not all content is created equal. While most entity extractors are quite accurate out of the box on well-formed text such as news articles, the high degree of content variation in blogs, restaurant reviews, financial documents, electronic medical records, legal contracts, and patent filings can limit the algorithms’ accuracy.
Text Analytics has an advantage in these cases. Its statistical model has been tuned to a wide range of content beyond published news. And for users with particularly quirky data, whether in format, style, or vocabulary, and for those who need every last bit of accuracy, Text Analytics includes robust field training capabilities with multiple mechanisms for adapting to your data’s idiosyncrasies, maximizing the accuracy of entity extraction on your data.
Using field training to improve accuracy
Level 1: Just add data
The easiest level of adaptation, called “unsupervised field training,” is driven almost entirely by your own data. Text Analytics provides access to a state-of-the-art clustering tool chain. You add any quantity of your own data, with no annotation required: just representative documents you already have on hand. Text Analytics then builds a new model adapted to the idiosyncrasies of your data, dramatically increasing entity extraction accuracy.
This unsupervised process allows Text Analytics to more accurately locate entities in the genre, style, and vocabulary of your data, based on the idea of word clusters: similar words tend to appear in similar contexts. Thus it might learn that the word “outturn” is used in financial documents the same way “outcome” is used in news articles, or that the words “Waltham”, “Atiak”, “Loveland”, “Svetogorsk”, “Yeisk”, and “Descoberto” are all likely names of LOCATIONs, even though none were mentioned in the original collection of annotated training data. Consequently, Text Analytics better understands the context surrounding unfamiliar words and, as a result, can extract them by associating them with existing, well-defined clusters.
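As a rough illustration of the word-cluster idea, the toy sketch below trains word vectors with gensim’s Word2Vec on a few hypothetical sentences. Babel Street’s actual clustering tool chain is not public, so every detail here (corpus, parameters) is an assumption made purely for illustration:

```python
# Toy word-cluster sketch with gensim (pip install gensim). This is not
# Babel Street's tool chain; the corpus below is a hypothetical placeholder.
from gensim.models import Word2Vec

# In practice you would stream millions of tokenized sentences from your
# own unannotated documents.
sentences = [
    ["the", "quarterly", "outturn", "exceeded", "expectations"],
    ["the", "quarterly", "outcome", "exceeded", "expectations"],
    ["our", "offices", "in", "Waltham", "and", "Loveland"],
    ["our", "offices", "in", "Paris", "and", "London"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

# Words that share contexts end up with similar vectors; an extractor can
# exploit such clusters to generalize to vocabulary it has never seen.
print(model.wv.most_similar("outturn", topn=3))
```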
Level 2: A little annotation goes a long way
For even greater accuracy, you can annotate a small quantity of your data and actively teach Text Analytics the unique contexts for entities that are common in your documents. A few hundred annotated documents can produce dramatic improvements in accuracy. Adaptation Studio makes adding annotated documents to boost the existing entity extraction model in Text Analytics much faster and more efficient than traditional annotation methods.
It used to be that annotators had no choice but to work “blind”; that is, they could not tell when they had annotated enough documents to reach the desired level of accuracy. Adaptation Studio, a user-friendly web application, coordinates the work of multiple annotators and creates training data far faster than traditional methods.
How? By:
- Leveraging interim models: The training is bootstrapped by tagging a tiny number of documents to build an interim model
- Efficient annotation: Active learning technology prioritizes the untagged documents in which the interim model is least confident, so a greater variety of entities is tagged sooner (see the sketch after this list)
- Computer-assisted tagging: The interim model pre-tags unannotated documents so that human annotators only correct errors, which is faster than tagging every mention by hand
- Iterative model evaluation: The system continuously measures the model’s accuracy, allowing annotation to stop as soon as the target accuracy is reached
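To make the uncertainty-sampling step concrete, here is a small runnable toy using scikit-learn text classification as a stand-in for entity tagging. Adaptation Studio’s internals are not public, so the data, model, and confidence measure below are all invented for illustration:

```python
# Toy uncertainty sampling: rank unlabeled documents so the least-confident
# ones are annotated first. scikit-learn classification stands in for the
# real sequence-tagging model; all data here is made up.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labeled_texts = ["acme corp announced earnings", "he hiked the trail"]
labels = [1, 0]  # 1 = contains an ORG mention, 0 = does not (toy labels)
unlabeled = ["globex posted an outturn", "the river was calm", "visited initech hq"]

vec = TfidfVectorizer().fit(labeled_texts + unlabeled)
model = LogisticRegression().fit(vec.transform(labeled_texts), labels)

# The interim model scores the unlabeled pool; confidence is the highest
# class probability for each document.
confidence = model.predict_proba(vec.transform(unlabeled)).max(axis=1)

# Send the least-confident documents to human annotators first.
for i in np.argsort(confidence):
    print(f"{confidence[i]:.2f}  {unlabeled[i]}")
```

In a full loop, the newly corrected documents are added to the labeled set, the interim model is retrained, and the cycle repeats until the evaluation step reports the target accuracy.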
Customers who have conducted entity extraction field training report a drop in both false positives (increased precision) and false negatives (increased recall) from Text Analytics, along with a noticeable improvement in their overall analytics systems.
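As a reminder of how those two error types map onto the standard metrics, the short sketch below computes precision and recall from counts that are made up purely to illustrate the definitions:

```python
# Precision/recall from extraction counts (illustrative numbers only).
true_positives = 90   # correct entity mentions found
false_positives = 10  # spurious mentions: these hurt precision
false_negatives = 15  # missed mentions: these hurt recall

precision = true_positives / (true_positives + false_positives)  # 0.900
recall = true_positives / (true_positives + false_negatives)     # ~0.857
f1 = 2 * precision * recall / (precision + recall)               # ~0.878

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Fewer false positives push precision up; fewer false negatives push recall up, which is exactly the improvement field training targets.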
Professional support
Given that most of our customers welcome guidance in selecting data, building a new model, and evaluating the results, Babel Street offers professional services to assist with field training. Whether you are just adding raw data to the entity extraction model to improve accuracy or annotating your own data, we are here to assist in your target languages.
Find out how to transform your data into actionable insights.