White Paper

Cross-lingual Search Based on Concepts and Meaning

Cross-lingual search is the process of querying in one language to find relevant documents in other languages. Until recently, machine translation was the primary method of searching across languages, either by translating search queries into other languages or by translating searchable records into English. However, both machine and human translation lose valuable nuances and meaning present in the original text.

This white paper explores an approach that aims for better accuracy by matching on semantics (meaning) rather than on translated text. Semantic search goes beyond matching keywords to retrieve the ideas those keywords suggest.

In part 1, we compare the traditional translation-based approach with a newer approach that measures semantic similarity using text embeddings, a natural language processing technique that encodes the meaning of words as mathematical vectors.
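To make the idea of meaning as vectors concrete, here is a minimal sketch of comparing words by cosine similarity. The three-dimensional vectors, the `toy_embeddings` table, and the `rank_by_similarity` helper are all invented for illustration; real embeddings are learned from data and have hundreds of dimensions, and this is not the specific method the white paper evaluates.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical, hand-made vectors (assumption: a shared cross-lingual space
# in which words with similar meanings lie close together).
toy_embeddings = {
    "dog": [0.90, 0.10, 0.00],
    "perro": [0.85, 0.15, 0.05],    # Spanish for "dog"
    "invoice": [0.05, 0.10, 0.95],
}


def rank_by_similarity(query, candidates):
    """Sort candidate words from most to least similar to the query word."""
    query_vec = toy_embeddings[query]
    return sorted(
        candidates,
        key=lambda word: cosine_similarity(query_vec, toy_embeddings[word]),
        reverse=True,
    )


ranking = rank_by_similarity("dog", ["invoice", "perro"])
print(ranking)
```

In this toy setup the Spanish word ranks above the unrelated English word even though no translation step takes place, which is the intuition behind searching by meaning rather than by translated keywords.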

In part 2, we look at implementing semantic search and discuss:

How to retrofit an existing keyword search engine to add cross-lingual and fuzzy search
Ways to overcome issues of speed, especially when searching very large data sets
A specific use case — targeted topic and event extraction
The special case of cross-lingual name matching