Detect Language
Detect Language is an Activity that uses a large language model (LLM) to determine the primary language (English, Spanish, French, etc.) of a document. Activities executed downstream, such as Extract, can use this information to apply language-specific logic.
The Detect Language activity in Grooper uses artificial intelligence (AI) to automatically determine the primary language, and optionally the locale, of documents or pages within a Batch or Batch Folder. Organizations that process documents in multiple languages can use it to ensure that downstream activities such as Classify, Extract, and Export run with the correct language-specific resources and logic.
Purposes and Use Cases
The Detect Language activity is designed to support a wide range of document processing scenarios, including:
- Multi-language document processing: Automatically identify the language of each document or page, enabling workflows that handle content in multiple languages without manual intervention.
- Language-aware extraction and classification: Ensure that Data Models, Lexicons, and extraction rules are applied using the correct language context, improving accuracy and reducing errors.
- Locale-specific processing: Detect not only the language but also the regional variant (locale), such as distinguishing between US and UK English, to support region-specific formatting, validation, or business rules.
- Preprocessing for downstream activities: Assign the correct "Culture Code" property early in the workflow, so that all subsequent activities—such as validation, export, or reporting—can leverage language and locale information.
- Handling electronic documents: Provide language detection for documents that do not go through OCR, where language information may not be available from the source.
This activity is especially valuable in environments where documents may arrive in any language, or where regulatory, compliance, or business requirements depend on accurate language identification.
What Is Detect Language?
The Detect Language activity leverages a large language model (LLM) to analyze the content of each document or page. It assigns an ISO language or locale code to the "Culture Code" property of the processed item, making this information available for all subsequent processing steps. This is especially important for workflows that require language-aware extraction, validation, or export.
Why Use AI for Language Detection?
AI-based language detection offers several advantages:
- High accuracy: The LLM can consider context, writing style, and subtle linguistic cues.
- Robustness: Effective even for documents with ambiguous, mixed, or short content.
- Locale awareness: Can distinguish between regional variations (e.g., US vs. UK English).
How Does It Work?
The activity operates by sampling text from both the beginning and end of a document. The number of pages sampled is controlled by the "PageDepth" property. This approach ensures that the language detection is based on a representative sample of the document, which is particularly useful for longer or potentially multilingual files.
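The sampling step can be pictured with a short Python sketch. This is an illustration only, not Grooper's implementation: the function name `sample_pages` is hypothetical, and the handling of documents too short to have separate start and end samples (return every page once) is an assumption.

```python
def sample_pages(pages: list[str], page_depth: int) -> list[str]:
    """Return up to `page_depth` pages from the start and from the end.

    Assumption: when the document has no more than 2 * page_depth pages,
    every page is returned once rather than sampling the same page twice.
    """
    if page_depth <= 0:
        raise ValueError("page_depth must be positive")
    if len(pages) <= 2 * page_depth:
        return list(pages)
    # Leading pages followed by trailing pages, in document order.
    return pages[:page_depth] + pages[-page_depth:]
```

With a depth of 1, a five-page document contributes only its first and last page, so the middle pages never reach the model.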
The sampled text is sent to an AI model, which is instructed to return the most appropriate ISO language code (such as en for English or fr for French) or a full locale code (such as en-US or fr-FR) depending on the "DetectLocale" setting. The detected code is then stored in the "Culture Code" property of the Batch Folder or Batch Page.
If the AI model cannot determine the language, the activity will raise an error, ensuring that only documents with a confidently detected language proceed to the next steps.
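As a rough illustration of the last two steps, the Python sketch below checks a model reply and either returns a code suitable for "Culture Code" or raises an error. Everything here is assumed for the example: `parse_culture_code`, `LanguageNotDetectedError`, the loose pattern used to recognize a code, and the choice to fall back to the base language when a locale was requested but the reply has no region. Grooper's actual validation may differ.

```python
import re

# Loose shape check (an assumption): a 2-3 letter language code, optionally
# followed by a 2-letter region, e.g. "en", "fr", "en-US", "fr_FR".
_CODE = re.compile(r"^([A-Za-z]{2,3})(?:[-_]([A-Za-z]{2}))?$")


class LanguageNotDetectedError(Exception):
    """Raised when the model reply is not a usable language or locale code."""


def parse_culture_code(reply: str, detect_locale: bool) -> str:
    match = _CODE.match(reply.strip())
    if match is None:
        raise LanguageNotDetectedError(f"unusable model reply: {reply!r}")
    language = match.group(1).lower()
    region = match.group(2)
    if detect_locale and region:
        # Full locale code, normalized to the "en-US" form.
        return f"{language}-{region.upper()}"
    # Base language only.
    return language
```

Raising instead of returning a default mirrors the behavior described above: an item without a usable code stops here rather than continuing with a guessed language.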
Key Properties
- "Model": Specifies the AI chat model to use for language detection.
- "PageDepth": Determines how many pages from the start and end of the document are analyzed.
- "DetectLocale": When enabled, the activity returns a full locale code (e.g., en-US); otherwise, it returns only the base language code (e.g., en).
Typical configuration and usage
- Add the Detect Language activity to your Batch Process. It must run after text recognition (OCR or native text extraction via a Recognize step) and is typically run before classification (via a Classify step) or data extraction (via an Extract step).
- Configure the "Model", "PageDepth", and "DetectLocale" properties as needed for your workflow.
- Run the process. Each document or page will be analyzed, and the detected language or locale will be assigned to the "Culture Code" property.
- Downstream activities, such as Classify, Extract or Export, can use this information to apply language-specific logic.
How does this differ from Detect Language (Legacy)?
Grooper also provides a legacy version of this activity, called Detect Language (Legacy). The key differences are:
- Detection Method: The legacy activity uses a rule-based approach, comparing document words to a multi-language Lexicon and selecting the best match based on word frequency.
- Configuration: The legacy method requires setting up a Lexicon and a word extractor, and relies on a "Minimum Confidence" threshold.
- Accuracy: The AI-based method is generally more accurate and robust, especially for complex or ambiguous documents. The legacy method is faster and does not require AI resources, but may be less reliable for short or mixed-language content.
Summary
The Detect Language activity in Grooper provides a modern, AI-powered solution for language and locale detection, ensuring that your document processing workflows are language-aware and highly accurate. For most scenarios, the AI-based method is recommended. The legacy method remains available for environments with strict resource constraints or where AI is not available.