Active learning
A machine learning training technique in which the training documents whose predictions the interim statistical model is least confident about are given the highest priority for annotation, thus presenting the model with a more diverse set of training documents up front.
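The prioritization described above can be sketched as a simple sort by confidence. This is a minimal illustration, not any product's implementation: the function name, document ids, and scores are all hypothetical.

```python
def prioritize_for_annotation(confidences, batch_size):
    """Return the ids of the documents the interim model is least sure about."""
    # Sort document ids by ascending confidence so the least confident come first.
    ranked = sorted(confidences, key=confidences.get)
    return ranked[:batch_size]


# Hypothetical confidence scores from an interim model (0 = unsure, 1 = sure).
interim_confidences = {"doc_a": 0.95, "doc_b": 0.41, "doc_c": 0.77, "doc_d": 0.52}

next_batch = prioritize_for_annotation(interim_confidences, batch_size=2)
print(next_batch)
```

Here the two lowest-confidence documents would be sent to annotators first.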
AI (artificial intelligence)
A quality of software or computer systems that mimic the thinking processes of a human. The AI in software is often a model trained using machine learning.
Alert/Alerts/Alerting
Processes that warn someone of a danger or threat detected by technology. For example, if a name matching system detects a match between a name being examined and a name appearing on a watchlist, the system notifies (or "alerts") investigators of a potential match.
Annotation
In NLP, marking up text to show which words refer to named entities, parts of speech, or other attributes that one wants an AI model to learn.
Anti-money laundering (AML) compliance
Financial institution procedures designed to spot and prevent attempts at money laundering, conducted in alignment with the anti-money laundering statutes of the country or region in which the financial institution operates.
Application programming interface (API)
A piece of software that enables different applications to communicate with each other.
Background check
The process of verifying that an individual is who he or she claims to be, often used before allowing entry into a country, granting employment, or providing another privilege. Background checks are often used to verify a person's education and job history, or to examine past criminal history.
Character Encoding
The mapping of characters in a character set (example: the Latin alphabet) to numbers (expressed as bytes) to electronically process text. Examples of character encodings include Unicode, UTF-8, ASCII, and Shift-JIS (used for Japanese).
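As a small illustration of the character-to-bytes mapping, the sketch below uses Python's built-in `str.encode` with three of the encodings named above. The sample strings are arbitrary choices for demonstration.

```python
# The same character can map to different byte sequences in different encodings.
ascii_bytes = "A".encode("ascii")       # one byte: 65
utf8_bytes = "é".encode("utf-8")        # two bytes: 195, 169

japanese = "日本"
sjis_bytes = japanese.encode("shift_jis")   # two bytes per character here
utf8_japanese = japanese.encode("utf-8")    # three bytes per character here

print(list(ascii_bytes), list(utf8_bytes))
print(len(sjis_bytes), len(utf8_japanese))

# Decoding with the matching encoding recovers the original text.
round_trip = sjis_bytes.decode("shift_jis")
print(round_trip == japanese)
```

Decoding bytes with the wrong encoding is a common source of garbled text, which is why the encoding of a document needs to be known or detected before processing.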
Commercially available information (CAI)
Information appearing in databases accessible to the public, typically for a fee.
Confidence Score
A value between 0 and 1, provided by an NLP task, that measures how sure the task is that its answer is correct, with 1 being 100% confidence.
Contextual translation
A translation process in which the word's context is considered to determine its exact meaning. For example, when translating the word "plant" into a non-English language, translation software should consider the context in which the word is used: "I work at a manufacturing plant," "I watered my plant," or "I'm going to plant a bomb."
Continuous vetting
The process of regularly reviewing an entity's status to ensure it continues to meet a set of standards. Continuous vetting is a significant part of anti-money laundering efforts and accompanying Know Your Customer requirements. AML statutes require financial institutions to implement mandatory processes for verifying personal and business identities at account opening, and periodically thereafter.
Coreference Resolution
The linking of named entities mentioned in a text (for example, "Katherine Johnson") to other mentions of the same entity, whether as nouns such as "the mathematician" (called nominal references), pronouns such as "she" or "her" (called pronominal references), or other named entity references such as "Johnson" elsewhere in a document. The linking of the same entity can be within a single document (in-doc coreference resolution) or across different documents (cross-doc coreference resolution).
Corpus
A collection of documents. In the context of machine learning, corpus frequently refers to the documents used to train a model.
Cross-lingual
Refers to functionality that is able to work across languages. For example, cross-lingual name matching can match Natsume Soseki to 夏目漱石.
Cross-Lingual search
A search function that accepts a search query in one language and returns results from records in other languages.
Cybersecurity
The act of defending servers, networks, computers, mobile devices, and data from cyberattacks.
Dark web
The dark web is part of the deep web, accessible only through certain search engines designed to ensure anonymity. The dark web is a hotbed of illegal activity: drugs, weapons, child sexual abuse materials, politically sensitive information, and even human beings have been sold on the dark web. However, because of its anonymous nature, the dark web can also be used for legitimate purposes such as communications among people living in oppressive regimes; communications with journalists; and as a means of circumventing censorship.
Data management
The processes of securely collecting, storing and using information in alignment with business goals.
Deep learning
A type of machine learning that uses deep neural networks, which learn over successive training passes through the data to predict the correct answer.
Deep web
The part of the internet unindexed by standard search engines and, therefore, largely inaccessible to everyday or unauthorized users. By some accounts, up to 99 percent of all the information on the internet appears on the deep web. Emails, electronic health records, bank records, and streaming services are examples: anything behind a paywall or requiring a password is considered part of the deep web.
Edit distance
Edit distance algorithms count the minimum number of single-character edits (insertions, deletions, or substitutions) needed to convert one term to another.
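One common edit distance counting exactly these three operations is the Levenshtein distance. The following is a minimal dynamic-programming sketch of it; the function name is illustrative and no particular product's algorithm is implied.

```python
def edit_distance(source, target):
    """Levenshtein distance: minimum single-character edits to turn source into target."""
    # previous[j] holds the distance between the processed prefix of source
    # and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # deleting all i characters of source reaches the empty prefix
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


print(edit_distance("kitten", "sitting"))  # 3
```

Converting "kitten" to "sitting" takes two substitutions and one insertion, hence a distance of 3.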
Enriched data
Existing information to which relevant data from additional sources has been appended. If a "Robert Smith, Boise, Idaho" already appears in a database, users may be able to enrich that entry with Smith's street address, age, date of birth, profession, marital status, and other information.
Entity (aka, named entity)
Often refers to proper nouns (such as people, locations, and organizations) and other significant things such as date/time, currency, email, URL, nationality.
Entity disambiguation (aka, entity resolution)
Distinguishing between two entities that share the same name, such as Michael Collins the astronaut and Michael Collins the Irish leader.
Entity Extraction (aka, named entity recognition)
An NLP task that identifies entities (most often people, places, organizations) in freeform text.
Entity Linking
An NLP task that links names of people, places, organizations to entries in a knowledge base (such as Wikipedia) and leverages the context in the text to distinguish between similarly named entities such as Michael Collins the astronaut versus the Irish leader.
Entity type
A category of entity such as person, place, and organization.
Entity vs Identity
In the Rosette context, an entity is the mention of a person, place, or organization in text. An identity is an actual person, place, or organization that has been matched to an entry in a knowledge base.
Event extraction
An NLP task to identify an event, such as a travel event (Joe flew from Boston to San Francisco) or an arrest event (Joe was put in custody by the police for disorderly conduct.)
Event keyphrase
The word (often a verb) that defines an event. For a travel event, the keyphrase might be "flew," "is in transit," or "traveled."
F-score
A measure of accuracy for an NLP task, usually between 0 and 1. It is calculated as the harmonic mean of precision and recall. The F-score is sensitive to both false positives and false negatives; a higher value indicates better accuracy.
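The harmonic mean can be written as F = 2PR / (P + R), where P is precision and R is recall. A minimal sketch follows; the function name and the choice to return 0 when both inputs are 0 are assumptions made for illustration.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (the balanced F-score, or F1)."""
    if precision + recall == 0:
        return 0.0  # assumption: define the score as 0 when both inputs are 0
    return 2 * precision * recall / (precision + recall)


balanced = f_score(0.5, 0.5)
lopsided = f_score(0.9, 0.1)
print(balanced, lopsided)
```

Both calls have the same arithmetic mean (0.5), but the lopsided case scores far lower, which is why the harmonic mean penalizes a system that trades away recall for precision or vice versa.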
False negative (aka, a missed match)
In the context of NLP or search, each result that was NOT returned as "correct," but should have been returned as correct.
False positive (aka, an incorrect result)
In the context of NLP or search, each result that was returned as "correct," but was in fact not correct.
Fraud, waste, and abuse
Bad acts regularly committed in industries such as finance, insurance, and healthcare. Fraud is a dishonest and deliberate action undertaken to take money or goods to which the fraudster is not entitled. (Example: Submitting invoices for services never performed.) Abuse is an improper use of resources that does not meet the level of fraud. (Example: An employee using the office printer to produce 200 invitations to his family reunion.) Waste is mismanagement or thoughtless use of money or other resources. (Example: Buying 1,000 reams of paper each week from a long-term supplier who charges $13 per ream, when the same quality of paper is available elsewhere for $9 per ream.) These acts can be more easily detected and mitigated through the use of artificial intelligence.
Fuzzy match
A term used in search, where the application looks for matches that are not exact. Examples of fuzzy matches range from common typos to variations such as nicknames (Robert versus Bob) or matching by meaning (beer versus booze).
Gazetteer
In the context of Rosette Entity Extractor, a gazetteer refers to a list of entities of a given type, such as a nationality gazetteer, which would include entities such as "American," "French," and "Chinese."
Identity Resolution
The process of matching information about real-life individuals from two different data sources. It answers the question: is this John Doe the same person as this other John Doe in this other database?
Information security
Technological and other processes undertaken to prevent unauthorized access to and misuse of data.
Insider threat
A threat, often a cyber threat, to an organization arising from people who work or have worked within the organization. Perpetrators can include employees, former employees, and contractors. Insider threats are particularly prevalent in information technology, when persons with legitimate access to networks use that access to harm the organization, its suppliers, or its customers.
Insider threat detection
Artificial intelligence and other technological processes deployed to detect behaviors indicating the presence of an insider threat.
Intellectual property
Creations of the mind. Intellectual property includes literary and artistic works, design works, inventions, and software. Copyrights, trademarks, and patents are all means of protecting intellectual property.
Know your customer (KYC)
A set of mandatory processes for verifying a customer's identity at account opening and periodically thereafter. Financial institutions must undertake KYC processes to comply with anti-money laundering statutes.
Know your vendor (KYV)
A set of processes undertaken by some companies to verify their vendors' identities; to determine whether vendor capabilities meet the needs of the hiring company; and to ensure the vendor's business activities and processes align with the goals and ethics of the hiring company.
Language of origin
In the context of personal, organizational, and place names, language of origin refers to the original language of the name. For example, the name "Abdul Rashid" is of Arabic origin even though it is written in English.
Large language model (LLM)
LLMs are deep learning algorithms that train on massive data sets to learn how to generate text, summarize information, and perform other text-based tasks. ChatGPT is one of the most prominent examples of an LLM.
Lemma/Lemmatization
The lemma is the form of a word that appears as its entry in a dictionary. For example, for verbs the lemma is often the infinitive form, and for nouns it is often the singular form. The lemma of "drove" is "drive" (as in "to drive"). Lemmatization is the act of finding the lemma of a word.
Linguistics
The study of human languages in terms of their grammar, morphology, and structures.
Location extraction
The technological act of extracting mentions of a location in text, assigning spatial coordinates to that location, then geotagging it.
Machine learning
The process by which a statistical model (aka, AI) is trained on thousands of examples in order to perform a task such as entity extraction. The "learning" refers to the algorithm figuring out, for example, the contexts in which a word that refers to a person will appear. The tasks that the model can complete are considered artificial intelligence because the model can handle the inexact nature of human-generated text.
Managed threat detection
The process of proactively seeking potential cyber threats, and responding to them.
Match Score
In the context of name matching, the match score (often from 0 to 1) indicates how likely it is that two names match, with 1 being a perfect match.
Match threshold score
In name screening, a match threshold score is the predetermined match score (a decimal between 0 and 1) that divides results: matches above and below the threshold are sent into different workflows.
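The routing around a threshold can be sketched as below. The workflow names, the default threshold of 0.8, and the choice to treat a score exactly at the threshold as "above" are all assumptions made for illustration.

```python
def route_match(match_score, threshold=0.8):
    """Send a candidate match to one of two hypothetical workflows."""
    if not 0.0 <= match_score <= 1.0:
        raise ValueError("match_score must be between 0 and 1")
    # Assumption: a score equal to the threshold counts as at-or-above it.
    if match_score >= threshold:
        return "manual_review"
    return "auto_clear"


print(route_match(0.93))  # manual_review
print(route_match(0.42))  # auto_clear
```

Raising the threshold sends fewer candidates to review (fewer false positives, more missed matches); lowering it does the opposite.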
Morphology
The science of how words are formed in each language.
Multilingual search
A search function that can accept a search query in several languages and return search results in the same language as the query.
N-gram
A sequence of n consecutive units (such as characters) taken from text, where "n" is a variable number. With trigrams, text is broken up into overlapping groups of three characters. For example, the trigrams for "book" are "boo" and "ook." This technique is used in language identification. It is also considered a less accurate technique for indexing words for a search engine in a language that does not use spaces between words (such as Chinese and Japanese).
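Generating overlapping character n-grams takes only a sliding window. A minimal sketch, with an illustrative function name:

```python
def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of text, in order."""
    # A window of width n slides one character at a time across the text.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("book"))            # ['boo', 'ook']
print(char_ngrams("夏目漱石", n=2))    # ['夏目', '目漱', '漱石']
```

The second call shows why n-grams are usable for indexing text without spaces: no word boundaries are needed to produce the index units.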
Named entity recognition (NER)
(See Entity Extraction)
National security
The processes that ensure the safety and defense of a sovereign state.
Natural language processing (NLP)
A discipline at the intersection of linguistics and computer science that enables computer programs to process and analyze "natural language," which is the text and speech that humans naturally generate.
NLP pipeline
The ordered progression of natural language tasks for processing human language. In the case of text analysis it could be (in order): language identification, morphological analysis, and entity extraction.
Normalization
The act of taking words and transforming them to a standard form. For example, putting all nouns in the singular form would normalize "children" to "child" and "books" to "book."
Open-source intelligence (OSINT)
Open-source intelligence (OSINT) is intelligence that is produced from publicly available information and is collected, exploited, and disseminated in a timely manner to an appropriate audience for the purpose of addressing a specific intelligence requirement. Information sources include public records, government records, commercial data, social media, newspapers, and magazines, among many others.
Pairwise match
In the context of name matching, comparing just two names and determining the degree to which they match (expressed as a match score).
Part-of-speech
The role of a word in a sentence. Examples include noun, verb, adjective.
Persistent search
A technology that keeps a search operation open whether someone is using it or not, recording updates and changes to the search term. Imagine a search for Robert Smith of Boise, Idaho. A search may initially return information on his date of birth, his street address, his profession, and professional commendations. If persistent search later uncovers his obituary, this information will be automatically appended to Smith's record.
Precision
In the context of search, precision calculates what fraction of the results returned were correct. Mathematically, it is expressed as P = (number of correct results: true positives) / (total number of results returned: true positives + false positives).
Public records search (PRS)
The processes of finding information in documents and other information typically generated by governments and open to the public. Criminal records, birth records, marriage records, city council minutes, court rulings and a vast array of other local, state and federal documents are all examples of public records.
Publicly available information (PAI)
Information that is readily available to the public. Books, magazines, public-facing websites, and social media posts are all examples of publicly available information.
Recall
In the context of search, recall calculates, of all possible correct results, how many were found. Mathematically, it is expressed as R = (number of correct results found: true positives) / (total number of possible correct answers: true positives + false negatives).
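The precision and recall formulas can be computed directly from the set of results returned and the set of results that were actually correct. This is a minimal sketch; the function name, the record ids, and the convention of returning 0 for an empty denominator are assumptions.

```python
def precision_recall(returned, relevant):
    """Compute (precision, recall) from returned results and truly correct results."""
    returned, relevant = set(returned), set(relevant)
    true_positives = len(returned & relevant)   # returned and correct
    false_positives = len(returned - relevant)  # returned but not correct
    false_negatives = len(relevant - returned)  # correct but missed
    # Assumption: report 0.0 when a denominator would be zero.
    precision = true_positives / (true_positives + false_positives) if returned else 0.0
    recall = true_positives / (true_positives + false_negatives) if relevant else 0.0
    return precision, recall


# Hypothetical search: four records returned, three records were actually correct.
p, r = precision_recall(returned=["r1", "r2", "r3", "r4"], relevant=["r1", "r2", "r5"])
print(p, r)
```

In this example two of the four returned records are correct (precision 0.5), and two of the three correct records were found (recall of about 0.67).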
Regular Expression
A regular expression (REGEX) is a string that specifies character patterns to match. Within NLP, regular expressions are often used for extraction of entities that fit a pattern (dates, times, URLs, email addresses, etc.).
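Pattern-based extraction might look like the sketch below, which uses Python's standard `re` module. Both patterns are deliberately simplified assumptions (ISO-style dates only, and a loose email shape) and would miss many real-world variants.

```python
import re

# Simplified, illustrative patterns; production patterns are usually stricter.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
EMAIL_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")

text = "Contact jane.doe@example.com before 2024-03-15 or 2024-04-01."

dates = DATE_PATTERN.findall(text)
emails = EMAIL_PATTERN.findall(text)
print(dates)   # ['2024-03-15', '2024-04-01']
print(emails)  # ['jane.doe@example.com']
```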
Relationship extraction
An NLP task that recognizes a specific type of relationship between two entities. For example, "Maya Angelou was born April 4, 1928" expresses a birthdate relationship between the entities "Maya Angelou" and "April 4, 1928."
Risk decision automation
Also known as "automated risk management," risk decision automation is the process of deploying technology to gain real-time insights into potential and existing risks, then taking steps to mitigate those risks.
Risk mitigation
Strategies and processes for preparing for and ameliorating the effects of threats.
Salience Score
A measure of how relevant a value is to the overall content of the input text. It attempts to answer the question "Does this information matter?" For example, in an article about Watergate issued by the entity "Reuters," "Reuters" has low salience to the content compared to the entity "Richard Nixon."
Script
The writing system used by a language. For example, the script for English is the Latin alphabet.
Semantic similarity
In the NLP context, producing words or text that are similar in meaning to a given word or text.
Semantics
The meaning of a word or text.
Sentiment analysis
The NLP task that determines whether a given text expresses a positive sentiment, a negative sentiment, or no particular sentiment (neutral).
Situational awareness
An organization's ability to anticipate and react to problems.
Statistical model
A model trained by machine learning (aka, AI) that accomplishes an NLP task based on statistical probability. For example, a statistical model for entity extraction will output words that have a high probability of referring to a person.
Stop words
Words that occur very frequently and do not help the NLP task, so they are suppressed ("stopped") from consideration to avoid distorting the results of an NLP task or search. For example, in name matching, if not stopped, the title "Mr." would make "Mr. Tom Jones" a better match to "Mr. Jack Jones" than to "Thomas Jones."
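The distortion in the name matching example can be reproduced with a deliberately naive similarity measure. The sketch below scores two names by the overlap of their words (Jaccard similarity); this measure, the stop word list, and the function names are assumptions for illustration and are far simpler than a real name matcher.

```python
# Hypothetical stop word list of titles.
STOP_WORDS = {"mr", "mrs", "ms", "dr"}


def name_tokens(name, remove_stop_words):
    """Lowercase the words of a name, optionally dropping stop words."""
    words = [word.strip(".").lower() for word in name.split()]
    if remove_stop_words:
        words = [word for word in words if word not in STOP_WORDS]
    return set(words)


def overlap_score(name_a, name_b, remove_stop_words):
    """Jaccard similarity: shared words divided by all distinct words."""
    a = name_tokens(name_a, remove_stop_words)
    b = name_tokens(name_b, remove_stop_words)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


query = "Mr. Tom Jones"
for candidate in ("Mr. Jack Jones", "Thomas Jones"):
    unstopped = overlap_score(query, candidate, remove_stop_words=False)
    stopped = overlap_score(query, candidate, remove_stop_words=True)
    print(candidate, unstopped, stopped)
```

Without stopping, the shared "Mr." makes "Mr. Jack Jones" score twice as high as "Thomas Jones" (0.5 versus 0.25). With the title stopped, the two candidates tie, and the title no longer skews the comparison. Recognizing "Tom" as a variant of "Thomas" would need nickname handling, which this sketch does not attempt.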
Supervised Learning
The use of data annotated (usually by humans) to train statistical models and AI for an NLP task. (See also Unsupervised Learning.)
Supply chain risk management
Technological processes for managing cybersecurity risks, threats and vulnerabilities across the supply chain.
Text analysis
The analysis of text by software to surface patterns, meaning, and insights for a given task.
Threat intelligence
Data that is collected and analyzed to understand threats, including the motive for and target of the threat.
Threat intelligence platforms
Technology that collects and organizes threat intelligence from a variety of sources for analysis and response.
Threshold score (for name matching)
The score above which two names are considered a match. This score is set by the user.
Token
In the context of search, streams of text are broken up into tokens, which are the unit for indexing text. A token is more commonly called a "word."
Transcription vs. Translation vs. Transliteration
Transcription is writing a word or name based on how it sounds. Translation is restating words in a source language in another target language so that the meaning of the words is the same. Transliteration is the mapping of a word in one script to another script, either based on phonetics (pronunciation) or on a mapping of characters in one script to the other.
Transfer Learning
An NLP approach where the results of models trained in one human language can be leveraged for another NLP task or the same task in another language.
True negative
In the context of NLP or search, each incorrect result that was not returned as a correct result.
True positive (aka, a correct result)
In the context of NLP or search, each correct result that was returned as correct.
Unicode (UTF8, UCS2, UTF16 etc.)
This is a universal character encoding that maps characters to values which are in turn used by computer applications to process and display text.
Unsupervised Learning
Training of statistical models and AI for an NLP task using unannotated data. For example, unsupervised training used on existing entity extraction models can increase the accuracy of the model for a particular domain if it is close enough to the domain that the original model was trained on. (See also Supervised Learning.)
Visa Screening
Assessment of an individual's personal information and, sometimes, professional credentials to determine whether he or she should be allowed entry into a country.
Word/Text Embedding
An NLP representation of the meaning of words and text as numerical vectors, which enables applications to compare the meaning (semantics) of words and text. (See also Semantic Similarity.)
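One common way to compare two embedding vectors is cosine similarity, sketched below. The three-dimensional vectors are invented toy values chosen for illustration; real embeddings come from a trained model and typically have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: near 1.0 means similar direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: treat a zero vector as having no similarity
    return dot / (norm_u * norm_v)


# Toy, hand-made vectors (hypothetical values, not output of any real model).
embeddings = {
    "beer": [0.9, 0.1, 0.3],
    "booze": [0.8, 0.2, 0.4],
    "book": [0.1, 0.9, 0.2],
}

beer_booze = cosine_similarity(embeddings["beer"], embeddings["booze"])
beer_book = cosine_similarity(embeddings["beer"], embeddings["book"])
print(beer_booze > beer_book)  # True
```

With these toy values, "beer" sits much closer to "booze" than to "book," which is the kind of comparison that matching by meaning relies on.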