Identify the languages contained in any given document
Datasheet
Nstein Text Mining Engine: Best-of-breed enterprise semantic analysis software.
From Arabic to Ukrainian, Language Detector is a godsend for multilingual enterprises
Overview
Nstein's Language Detector identifies the language of a given document. This can be a godsend for high-volume content producers publishing in many languages, adding yet another layer of sorting and indexing in TME 5.
Out of the box, Language Detector detects English, French, German, Arabic, Chinese (simplified), Czech, Danish, Estonian, Finnish, Hebrew, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Thai, Turkish, and Ukrainian. Additional languages can be added upon request.
Applications
Language Detector is primarily a content classification tool. It gives high-volume, multi-language publishers yet another layer of information about their content. It also facilitates more advanced natural language processing, text mining, or any other machine-algorithm technique that relies on language-dependent models.
How it works
The content is first sliced into samples of contiguous characters, which are then analyzed and weighted using a variety of algorithms. Combinations of sequential character sets are compared to the language database. Once pairs are identified, the engine runs a confidence check on the sets of pairs and determines the languages most present in the document.
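The general idea can be illustrated with a minimal sketch. The datasheet does not disclose the algorithms the engine actually uses, so this example assumes a common textbook approach: character trigram counts compared by cosine similarity. The function names, the tiny SAMPLES "language database", and the way confidence is normalized are all hypothetical and are not part of the Nstein product.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Slice text into overlapping runs of n contiguous characters."""
    # Lowercase, collapse whitespace, and pad so word boundaries are captured.
    padded = f" {' '.join(text.lower().split())} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical miniature "language database"; a real one would be built
# from large corpora for every supported language.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and this is the text of the document",
    "french": "le renard brun saute par-dessus le chien paresseux et ceci est le texte du document",
    "german": "der schnelle braune fuchs springt über den faulen hund und dies ist der text des dokuments",
}
PROFILES = {lang: char_ngrams(sample) for lang, sample in SAMPLES.items()}


def detect_languages(text, top=3):
    """Return up to `top` (language, confidence) pairs, best match first."""
    doc_profile = char_ngrams(text)
    scores = {lang: cosine(doc_profile, profile) for lang, profile in PROFILES.items()}
    total = sum(scores.values())
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # Assumed confidence measure: each score as a share of all scores.
    return [(lang, score / total if total else 0.0) for lang, score in ranked[:top]]


if __name__ == "__main__":
    for language, confidence in detect_languages("this is the dog"):
        print(f"{language}: {confidence:.2f}")
```

Calling detect_languages on a document returns the candidate languages ranked by confidence, which mirrors the "languages most present in the document" step described above. A production system would likely need far larger profiles and handling for mixed-language documents.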
Input/Output
Language Detector ingests documents in any format and outputs the identified language with a confidence score.
