Identify the languages contained in any given document

From Arabic to Ukrainian, Language Detector is a godsend for multilingual enterprises

Overview

Nstein's Language Detector identifies the language of a given document. This can be a godsend for high-volume content producers publishing in many languages, as it adds another layer of sorting and indexing in TME 5.

Out of the box, Language Detector detects English, French, German, Arabic, Chinese (Simplified), Czech, Danish, Estonian, Finnish, Hebrew, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Thai, Turkish, and Ukrainian. Additional languages can be added upon request.

Applications

Language Detector is primarily a content classification tool. It gives high-volume, multi-language publishers another layer of information about their content. It also facilitates more advanced natural language processing, text mining, and other algorithmic techniques that rely on language-dependent models.
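As an illustration of how a detected language might feed language-dependent processing, the sketch below groups documents by language and sets aside those with low confidence. It is a hypothetical example in Python, not part of TME: the function name `route_by_language`, the record layout (text, language, confidence), the 0.8 threshold, and the sample records are all assumptions made up for illustration.

```python
from collections import defaultdict

def route_by_language(documents, min_confidence=0.8):
    """Group (text, language, confidence) records by language.

    Records whose confidence falls below min_confidence are placed in an
    "undetermined" bucket instead of a language bucket. The threshold is
    an arbitrary illustrative value.
    """
    buckets = defaultdict(list)
    for text, language, confidence in documents:
        key = language if confidence >= min_confidence else "undetermined"
        buckets[key].append(text)
    return dict(buckets)

# Made-up records standing in for detector output.
docs = [
    ("The market closed higher today.", "English", 0.97),
    ("Le marché a clôturé en hausse.", "French", 0.95),
    ("OK", "English", 0.40),
]
routed = route_by_language(docs)
```

Each bucket could then be handed to a model or index built for that language, while the undetermined bucket is held back for review.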

How it works

The content is first sliced into samples of contiguous characters, which are analyzed and weighted using a variety of algorithms. Combinations of sequential character sets are compared to the language database. Once pairs are identified, the engine runs a confidence check on the sets of pairs and determines the languages most present in the document.
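The general idea of comparing character sequences against per-language profiles can be sketched as follows. This is a minimal illustration in Python, not Nstein's implementation: the toy training sentences, the choice of trigrams, the cosine-similarity scoring, the normalized score used as a crude confidence, and the names `ngram_profile`, `PROFILES`, and `detect_languages` are all assumptions chosen for brevity.

```python
from collections import Counter
from math import sqrt

# Toy training text per language. Assumption: a real system would rely on
# a large language database rather than one sentence per language.
SAMPLES = {
    "English": "the quick brown fox jumps over the lazy dog and this is the language of the document",
    "French": "le renard brun rapide saute par dessus le chien paresseux et ceci est la langue du document",
    "German": "der schnelle braune fuchs springt über den faulen hund und dies ist die sprache des dokuments",
}

def ngram_profile(text, n=3):
    """Count contiguous character n-grams in lower-cased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

PROFILES = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}

def cosine(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def detect_languages(text):
    """Return (language, confidence) pairs, most likely language first.

    Confidence here is each language's similarity divided by the sum of
    all similarities, which is only a rough stand-in for a real
    confidence check.
    """
    doc = ngram_profile(text)
    scores = {lang: cosine(doc, prof) for lang, prof in PROFILES.items()}
    total = sum(scores.values())
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(lang, score / total if total else 0.0) for lang, score in ranked]
```

Calling `detect_languages("this is the document")` returns all three languages ranked by similarity, so a document mixing languages would show more than one language with a non-trivial score.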

Input/Output

Language Detector ingests documents in any format and outputs the identified language with a confidence score.