In our previous blog post, we discussed the importance of defining your requirements for your NLP evaluation. In short, can you describe what “perfectly performing natural language processing (NLP)” would output? And with perfect output, have you shown that you would get the business outcomes you are seeking? This second of three parts looks at the heart of the work, which is annotating a gold standard test dataset, and then scoring the results you get from different NLP libraries processing this test dataset.
Annotate the gold standard test dataset
Annotating data begins with drawing up guidelines. These are the rules by which you will judge correct and incorrect answers. It seems obvious what a person, location, or organization is, but there are always ambiguous cases, and your answer will depend on your use case, which depends on the requirements you defined in step 1.
If you’ve never done this before, you might think, “Well, isn’t a ‘person’ just any person mentioned?”
Here are some ambiguous cases to consider:
- Should fictitious characters ("the Artful Dodger") be tagged as "person"?
- When a location appears within an organization's name, do you tag both the location and the organization, or just the organization ("San Francisco Association of Realtors")?
- Do you tag the name of a person if it is used as a modifier ("Martin Luther King Jr. Day")?
- How do you handle compound names?
- Do you tag "Twitter" in "You could try reaching out to the Twitterverse"?
- Do you tag "Google" in "I googled it, but I couldn't find any relevant results"?
- When do you include "the" in an entity?
- How do you differentiate between an entity that's a company name and a product by the same name ("The Daily Globe was criticized for an article about the Netherlands in the June 4 edition of The Daily Globe")?
How to do the annotation
The web-based annotation tool BRAT is a popular, if manual, open-source option. More sophisticated annotation tools that use active learning can speed up the tagging and minimize the number of documents that need to be tagged to achieve a representative corpus. As you tag, check to see if you have enough entity mentions for each type, and then tag more if you don’t.
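A quick script can help with that coverage check by tallying mention counts per entity type straight from the annotation files. The sketch below is a minimal illustration that assumes BRAT's standoff format (.ann files whose text-bound lines look like "T1&lt;TAB&gt;PERSON 0 12&lt;TAB&gt;John Corncob"); the directory name and type labels are hypothetical.

```python
from collections import Counter
from pathlib import Path

def count_entity_mentions(ann_dir: str) -> Counter:
    """Tally entity mentions per type across BRAT standoff (.ann) files.

    Assumes text-bound annotation lines of the form:
        T1<TAB>PERSON 0 12<TAB>John Corncob
    """
    counts = Counter()
    for ann_file in Path(ann_dir).glob("*.ann"):
        for line in ann_file.read_text(encoding="utf-8").splitlines():
            if line.startswith("T"):                 # text-bound annotation
                entity_type = line.split("\t")[1].split(" ")[0]
                counts[entity_type] += 1
    return counts

if __name__ == "__main__":
    for entity_type, n in count_entity_mentions("gold_corpus/").most_common():
        print(f"{entity_type}: {n}")                 # e.g., PER: 412, ORG: 187, LOC: 95
```

If a type turns up only a handful of times, tag more documents that mention it before moving on.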
Once the guidelines are established, annotation can begin. It is important to review the initial tagging to verify that the guidelines are working as expected. The bare minimum is to have a native speaker read through your guidelines and annotate the test corpus. This hand-annotated test corpus is called your “gold standard.”
Extra credit: Measure inter-annotator agreement. Ask two annotators to tag your test corpus independently, and then compare their tags to see where they agree. Where they don't, have each annotator check for an error on their side. If there's no error, have a discussion; in some cases, a disagreement might reveal a hole in your guidelines.
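One simple way to quantify agreement for a span-based task like NER is to score one annotator's spans against the other's, exactly as you would score a system against the gold standard. The sketch below is a rough illustration that assumes each annotator's output has been reduced to a set of (start, end, type) tuples; the sample annotations are made up.

```python
def span_agreement_f1(annotator_a: set, annotator_b: set) -> float:
    """Exact-match F1 between two annotators' span sets.

    Each element is a (start, end, entity_type) tuple; scoring one
    annotator against the other mirrors scoring a system against gold.
    """
    if not annotator_a or not annotator_b:
        return 0.0
    matched = len(annotator_a & annotator_b)
    precision = matched / len(annotator_b)
    recall = matched / len(annotator_a)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Made-up annotations over the same document: the annotators agree on the
# first span but label the second one differently (ORG vs. LOC).
annotator_a = {(0, 12, "PER"), (30, 43, "ORG")}
annotator_b = {(0, 12, "PER"), (30, 43, "LOC")}
print(f"{span_agreement_f1(annotator_a, annotator_b):.2f}")  # 0.50
```

The spans where the two sets differ are exactly the cases worth discussing against the guidelines.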
Get output from vendors
Give vendors an unannotated copy of your test corpus, and ask them to run it through their system. Any serious NLP vendor should be happy to do this for you. You might also ask to see the vendor’s annotation guidelines, and compare them with your annotation guidelines. If there are significant differences, ask if their system can be adapted to your guidelines and needs.
Evaluate the results
Let’s introduce the metrics used for scoring named entity recognition (NER), and then the steps for performing the evaluation.
Metrics for evaluating NER: F-score, precision, and recall
Most NLP and search engines are evaluated based on their precision and recall. Precision answers "of the results you found, what percentage were correct?" Recall answers "of all the possible correct results, what percentage did you find?" F-score is the harmonic mean of precision and recall, which isn't quite an average of the two: unlike a simple average, it is pulled down toward the lower of the two scores. This behavior makes sense intuitively, because if the system finds 10 answers that are correct (high precision) but misses 1,000 correct answers (low recall), you wouldn't want the F-score to be misleadingly high.
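To make the arithmetic concrete, here is a minimal sketch of the three metrics computed from raw counts; the numbers are purely illustrative and mirror the example above.

```python
def precision_recall_f1(true_positives: int, false_positives: int, false_negatives: int):
    """Precision, recall, and their harmonic mean (F1)."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# 10 correct answers found, none wrong, but 1,000 correct answers missed
p, r, f1 = precision_recall_f1(true_positives=10, false_positives=0, false_negatives=1000)
print(f"precision={p:.3f}  recall={r:.3f}  f1={f1:.3f}")
# precision=1.000  recall=0.010  f1=0.020  <- dragged down by the low recall
```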
In some cases using the F-score as your yardstick doesn't make sense. With voice applications, for example, high precision matters more than recall, because the system can only present the user with a handful of options in a reasonable time frame. In other cases, high recall at the cost of low precision is the goal. Take the case of redacting text to remove personally identifiable information: redacting too much (low precision) is far better than missing even one item that should have been redacted, so you want high recall, even if that means overtagging.
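If you still want a single number, one common way to encode such a preference is a weighted F-beta score: beta below 1 favors precision (the voice-application case), while beta above 1 favors recall (the redaction case). A minimal sketch with illustrative numbers:

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Weighted harmonic mean: beta < 1 emphasizes precision, beta > 1 emphasizes recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# A redaction system that overtags: precision 0.40, recall 0.95 (made-up numbers)
print(f"F1 = {f_beta(0.40, 0.95, beta=1.0):.2f}")  # 0.56 - treats the low precision as a big problem
print(f"F2 = {f_beta(0.40, 0.95, beta=2.0):.2f}")  # 0.75 - closer to how a redaction use case values it
```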
Determining “right” and “wrong”
Determining what is correct is easy. The tricky part is that there are many ways to be wrong. We recommend the guidelines followed by Message Understanding Conference 7 (MUC-7).[1] Entities are scored on a token-by-token basis (i.e., word-by-word for English). For ideographic languages such as Chinese and Japanese, character-by-character scoring may be more appropriate, as there are no spaces between words and a single character frequently represents a whole word or token.
Scoring looks at two things: whether entities extracted were labeled correctly as PER, LOC, etc.; and whether the boundaries of the entity were correct. Thus, an extracted PER, “John,” would only be partially correct if the system missed “Corncob,” as in “John Corncob,” the full entity.
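The sketch below illustrates the idea of token-level scoring with partial credit. It is a simplified illustration, not the official MUC-7 scorer, which also categorizes errors in more detail; the labeling scheme (None for tokens outside any entity) is an assumption made for the example.

```python
def token_level_counts(gold_labels, system_labels):
    """Per-token comparison of entity labels (None = token is not part of any entity).

    Returns (true_positives, false_positives, false_negatives) that can be fed
    into the precision/recall/F1 sketch above. Simplified: the full MUC-7
    scoring also distinguishes error categories such as missing and spurious.
    """
    tp = fp = fn = 0
    for gold, system in zip(gold_labels, system_labels):
        if system is not None and system == gold:
            tp += 1                      # right label on the right token
        elif system is not None:
            fp += 1                      # tagged a token it shouldn't have, or used the wrong label
            if gold is not None:
                fn += 1                  # ...and the gold label on that token was missed
        elif gold is not None:
            fn += 1                      # entity token the system failed to tag
    return tp, fp, fn

# "John Corncob flew to Boston" -- the system tags only "John" as PER
gold_labels   = ["PER", "PER", None, None, "LOC"]
system_labels = ["PER", None,  None, None, "LOC"]
print(token_level_counts(gold_labels, system_labels))  # (2, 0, 1): partial credit for "John"
```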
When evaluating, F-score, precision, and recall are good places to start, but they are not the entire story. Other factors, such as how easily the NLP system can be adapted to do better on your task and the maturity of the solution, must also be considered.
The third and final blog post in this series will walk you through those considerations. Next week, learn how to choose the right NLP package.
End Notes
- The Message Understanding Conferences (MUC) were initiated and financed by DARPA (Defense Advanced Research Projects Agency) to encourage the development of new and better methods of information extraction by having research teams compete against one another. The competitions led MUC to develop evaluation standards, including the adoption of metrics such as precision and recall. MUC-7, the seventh conference, was held in 1997. https://en.wikipedia.org/wiki/Message_Understanding_Conference