Project: Indonesia-MTEB Benchmark
Document: 07 - Validation Strategies for Indonesia-MTEB
Last Updated: 2026-01-25
Status: Research Phase
Version: 1.0
Validation Strategies for Indonesia-MTEB¶
"Quality assurance is the backbone of any machine-translated or synthetically generated dataset. This document provides a comprehensive framework for validating Indonesia-MTEB datasets, combining proven methodologies from regional MTEBs with Indonesian-specific linguistic and cultural considerations."
Table of Contents¶
- Executive Summary
- Validation Framework Overview
- Translation Quality Pipeline
- LLM-as-a-Judge Calibration
- Semantic Similarity Validation
- Indonesian-Specific Validation
- Statistical Validation Methods
- Human Validation Protocols
- Quality Thresholds and Kept Ratios
- Failure Analysis and Recovery
- Implementation Checklist
1. Executive Summary¶
1.1 The Validation Challenge¶
Indonesia-MTEB requires validating data from three sources:
1. Aggregated datasets (50+ existing Indonesian datasets)
2. Translated datasets (full MTEB translation to Indonesian)
3. AI-generated datasets (novel synthetic data for gaps)
Each source requires different validation strategies. Based on VN-MTEB and TR-MTEB precedents, we implement a multi-stage validation pipeline with automated quality control and spot human verification.
1.2 Key Validation Components¶
| Component | Method | Target Metric | Status |
|---|---|---|---|
| Language Detection | LLM-based (Qwen2.5-3B-Instruct) | Indonesian purity > 99% | Core |
| Semantic Similarity | Embedding-based (gte-Qwen2-7B) | Cosine ≥0.75-0.80 | Core |
| LLM-as-Judge | Calibrated 5-criteria evaluation | Score ≥3.5/5.0 | Core |
| Cultural Terms | NER + cultural list check | Preservation >95% | Indonesia-specific |
| Register Detection | Formal/informal classifier | Consistency check | Indonesia-specific |
| Code-Mixing | Word-level language ID | Appropriate handling | Indonesia-specific |
| Human Validation | 10% sampling | IAA κ >0.8 | Quality assurance |
1.3 Expected Kept Ratios (Based on VN-MTEB)¶
| Task Category | Expected Kept Ratio | Justification |
|---|---|---|
| Classification | 70-75% | Straightforward, good EN→ID mapping |
| Clustering | 70-75% | Structural preservation |
| Pair Classification | 65-70% | Entailment relationships |
| Retrieval | 65-70% | Domain-specific challenges |
| Reranking | 65-70% | Nuanced ranking criteria |
| STS | 55-60% | Semantic similarity hardest to preserve |
2. Validation Framework Overview¶
2.1 Three-Stage Translation Pipeline (from VN-MTEB)¶
┌─────────────────────────────────────────────────────────────────────────┐
│ INDONESIA-MTEB VALIDATION PIPELINE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ STAGE 1: LANGUAGE DETECTION │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Purpose: Verify output is Indonesian │ │
│ │ Model: Qwen2.5-3B-Instruct (lightweight, accurate) │ │
│ │ Why LLM not FastText: Interleaved languages cause errors │ │
│ │ Output: Filter/flag samples with non-Indonesian content │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ STAGE 2: SEMANTIC SIMILARITY │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Purpose: Ensure meaning preservation │ │
│ │ Model: gte-Qwen2-7B-instruct (32K context) │ │
│ │ Metric: Cosine similarity between source and translation │ │
│ │ Threshold: ≥0.75-0.80 (tuned for EN-ID) │ │
│ │ Output: Filter samples below threshold │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ STAGE 3: LLM-AS-A-JUDGE │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Purpose: Multi-criteria quality evaluation │ │
│ │ Model: Llama-SEA-LION-v4-70B-IT (Indonesian-optimized) │ │
│ │ Criteria: Grammar, NER, Numbers/Links, Fluency, Meaning │ │
│ │ Formula: score = Σ(αᵢ × scoreᵢ) / |S| │ │
│ │ Threshold: ≥3.5/5.0 (calibrated) │ │
│ │ Technique: Chain-of-Thought prompting for reasoning │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ADDITIONAL FOR INDONESIA-MTEB: │
│ ├─ Cultural Term Preservation Check (Stage 3a) │
│ ├─ Register Consistency Validation (Stage 3b) │
│ └─ Code-Mixing Validation (Stage 3c) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
2.2 Validation for Different Data Sources¶
| Data Source | Validation Stages | Special Considerations |
|---|---|---|
| Aggregated Native | Statistical + Format | Verify existing quality, check for duplicates |
| Translated MTEB | Full 3-stage | Cultural adaptation, terminology consistency |
| AI-Generated | Full 3-stage + Human spot-check | Hallucination detection, consistency |
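For illustration, this routing can be expressed as a simple configuration. The sketch below is a minimal example; the stage names are placeholders for the pipeline steps described in the following sections.

# Minimal routing sketch: map each data source to its validation stages.
# Stage names are placeholders for the pipeline steps described below.
VALIDATION_ROUTES = {
    "aggregated_native": ["statistical_checks", "format_checks", "deduplication"],
    "translated_mteb": ["language_detection", "semantic_similarity", "llm_judge",
                        "cultural_terms", "register", "code_mixing"],
    "ai_generated": ["language_detection", "semantic_similarity", "llm_judge",
                     "hallucination_check", "human_spot_check"],
}

def stages_for(source_type: str) -> list:
    """Return the ordered validation stages for a given data source."""
    return VALIDATION_ROUTES.get(source_type, VALIDATION_ROUTES["translated_mteb"])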
3. Translation Quality Pipeline¶
3.1 Stage 1: Language Detection¶
Purpose: Filter source language and verify output language purity.
Model Selection:
| Model | Parameters | Indonesian Performance | Rationale |
|---|---|---|---|
| Qwen2.5-3B-Instruct | 3B | Top on SEA-HELM for Indonesian | Lightweight, accurate |
| Llama-SEA-LION-v3-70B-IT | 70B | Highest Indonesian score | When resources available |
Why Not FastText (from VN-MTEB findings):
- Interleaved languages (Indonglish) cause FastText errors
- Regional Indonesian variants not well-supported
- LLM-based detection with CoT prompting more robust
Implementation:
# Language detection with an instruction-tuned LLM (e.g., Qwen2.5-3B-Instruct)
def detect_language_llm(text: str, model) -> str:
    # Truncate to the first 500 characters for efficiency
    prompt = f"""Identify the primary language of this text.
Output only the ISO 639-1 code (e.g., 'id', 'en', 'jv').
Text: {text[:500]}
Language code:"""
    return model.generate(prompt, temperature=0.0)
3.2 Stage 2: Semantic Similarity¶
Purpose: Ensure translated text preserves original meaning.
Model Selection:
| Model | Context Length | Indonesian Support | Rationale |
|---|---|---|---|
| gte-Qwen2-7B-instruct | 32,768 | Excellent | Long context for documents |
| SEA-LION-v4-instruct | 8,192 | Native Indonesian | Cultural context awareness |
Threshold Determination:
Based on VN-MTEB analysis, we use semantic similarity distributions to determine thresholds:
| Similarity Range | Decision | Expected Distribution |
|---|---|---|
| ≥0.85 | Excellent | ~15% of samples |
| 0.75-0.85 | Good (keep) | ~55-60% of samples |
| 0.65-0.75 | Borderline | Manual review |
| <0.65 | Reject (filter) | ~10-15% of samples |
EN-ID vs EN-VN Comparison:
| Metric | EN-VN (VN-MTEB) | EN-ID (Est.) | Reason |
|---|---|---|---|
| Linguistic distance | Medium | Low | Indonesian closer to English |
| Script alignment | Latin | Latin | Same script family |
| Cultural adaptation | Moderate | Moderate | Similar challenges |
| Expected kept ratio | ~65% | ~70-75% | Higher for EN-ID |
Implementation:
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

def semantic_similarity_filter(source: list, translated: list,
                               model: str, threshold: float = 0.75):
    """Filter translations whose cosine similarity to the source falls below threshold."""
    embedder = SentenceTransformer(model)
    source_embeddings = embedder.encode(source)
    translated_embeddings = embedder.encode(translated)
    # Compute pairwise cosine similarity and keep only the source[i]-translated[i] diagonal
    similarities = cosine_similarity(
        source_embeddings,
        translated_embeddings
    ).diagonal()
    # Keep samples at or above the threshold
    kept_indices = np.where(similarities >= threshold)[0]
    return [translated[i] for i in kept_indices], similarities[kept_indices]
3.3 Stage 3: LLM-as-a-Judge¶
Purpose: Multi-criteria quality evaluation.
Framework Selection: CheckEval (EMNLP 2025)
| Aspect | CheckEval Benefits | Traditional LLM-Judge Issues |
|---|---|---|
| Reliability | +0.45 agreement improvement | Low inter-evaluator agreement |
| Consistency | Binary questions reduce variance | High score variance |
| Interpretability | Decomposed criteria | Black-box scores |
| Calibration | Human-aligned prompts | Uncalibrated outputs |
Five Evaluation Criteria:
| Criterion | Weight (α) | Description |
|---|---|---|
| Grammar | 0.20 | Syntactic correctness, morphology |
| NER Preservation | 0.20 | Named entities kept/translated correctly |
| Numbers/Links/Special | 0.15 | Technical elements preserved |
| Fluency | 0.20 | Natural flow, appropriate register |
| Meaning Preservation | 0.25 | Semantic equivalence maintained |
Scoring Formula:
score_LLM_judge = α_grammar × score_grammar +
                  α_NER × score_NER +
                  α_special × score_special +
                  α_fluency × score_fluency +
                  α_meaning × score_meaning

where: score_i ∈ [1, 5], Σ α_i = 1.0, threshold = 3.5
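A minimal sketch of this aggregation, using the weights from the criteria table above (the per-criterion scores are assumed to come from the judge model's output):

# Weighted LLM-judge score: weights sum to 1.0, each criterion score is in [1, 5].
JUDGE_WEIGHTS = {"grammar": 0.20, "ner": 0.20, "special": 0.15,
                 "fluency": 0.20, "meaning": 0.25}

def judge_score(criterion_scores: dict, threshold: float = 3.5) -> dict:
    """Aggregate per-criterion scores into a quality score and keep/filter decision."""
    score = sum(JUDGE_WEIGHTS[c] * criterion_scores[c] for c in JUDGE_WEIGHTS)
    return {"score": score, "decision": "KEEP" if score >= threshold else "FILTER"}

# Example: judge_score({"grammar": 4, "ner": 5, "special": 4, "fluency": 3, "meaning": 4})
# -> {"score": 4.0, "decision": "KEEP"}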
Chain-of-Thought Prompt Template (from VN-MTEB):
You are an expert Indonesian-English translation evaluator.
Analyze this translation step-by-step:
1. Check grammatical correctness
2. Verify named entities are preserved
3. Check numbers, links, special characters
4. Assess fluency and naturalness
5. Evaluate meaning preservation
Source: {source_text}
Translation: {translated_text}
For each criterion, provide a score 1-5 and brief justification.
Final format: {"score": X, "justification": "..."}
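Judge responses are not always clean JSON, so a defensive parser is useful; the sketch below is an assumption on our side (the regex fallback is not part of the VN-MTEB prompt):

import json
import re

def parse_judge_output(raw: str) -> dict:
    """Extract the final JSON verdict from a judge response, with a regex fallback."""
    try:
        start, end = raw.index("{"), raw.rindex("}") + 1
        return json.loads(raw[start:end])
    except (ValueError, json.JSONDecodeError):
        # Fallback: pull out whatever looks like a numeric score
        match = re.search(r'"?score"?\s*[:=]\s*([0-9]+(?:\.[0-9]+)?)', raw)
        return {"score": float(match.group(1)) if match else None, "justification": raw}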
4. LLM-as-a-Judge Calibration¶
4.1 Calibration Process (from TR-MTEB)¶
Step 1: Human Annotation
- Sample size: 115 examples (TR-MTEB standard)
- Balanced across datasets
- Binary labeling: PASS/FAIL
- Criteria: Semantic fidelity + fluency
Step 2: Prompt Iteration
- Iteratively refine evaluation prompt
- Align LLM judgments with human labels
- Target: 85%+ agreement, 90%+ precision
Step 3: Validation

| Metric | TR-MTEB Result | Indonesia-MTEB Target |
|---|---|---|
| Agreement | 85.2% | ≥85% |
| Precision | 92.9% | ≥90% |
| Recall | 84.4% | ≥85% |
| F1 Score | 88.4% | ≥88% |
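A minimal sketch of this calibration check, comparing the judge's PASS/FAIL decisions against human labels (uses scikit-learn; PASS is treated as the positive class):

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def calibration_report(human_labels: list, judge_labels: list) -> dict:
    """Compare LLM-judge PASS/FAIL decisions against human labels."""
    y_true = [1 if label == "PASS" else 0 for label in human_labels]
    y_pred = [1 if label == "PASS" else 0 for label in judge_labels]
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", zero_division=0)
    return {"agreement": accuracy_score(y_true, y_pred),
            "precision": precision, "recall": recall, "f1": f1}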
4.2 Calibration Dataset for Indonesian¶
Cultural Context Examples:
| Source | Translation | Decision | Reason |
|---|---|---|---|
| "Gotong royong is important" | "Gotong royong is important" | ✓ PASS | Cultural term preserved |
| "Gotong royong is important" | "Kerja sama adalah penting" | ✗ FAIL | Cultural term lost |
| "Meeting today" | "Meeting hari ini" | ✓ PASS | Code-mixing preserved |
| "Meeting today" | "Pertemuan hari ini" | ~ BORDER | Depends on context |
4.3 Inter-Annotator Agreement (IAA)¶
Target Metrics:
- Cohen's κ: ≥0.8 (substantial agreement)
- Krippendorff's α: ≥0.8
- Percentage agreement: ≥85%
Annotation Protocol:
1. Two native Indonesian speakers
2. Independent labeling
3. Adjudication for disagreements
4. Continuous calibration sessions
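A minimal IAA sketch for the two-annotator setup (Cohen's κ via scikit-learn; Krippendorff's α would require a separate library and is omitted here):

from sklearn.metrics import cohen_kappa_score

def iaa_report(annotator_a: list, annotator_b: list) -> dict:
    """Inter-annotator agreement for two annotators' labels on the same samples."""
    kappa = cohen_kappa_score(annotator_a, annotator_b)
    percent = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)
    return {"cohens_kappa": kappa,
            "percent_agreement": percent,
            "status": "PASS" if kappa >= 0.8 and percent >= 0.85 else "RECALIBRATE"}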
5. Semantic Similarity Validation¶
5.1 Model Selection¶
Primary Model: gte-Qwen2-7B-instruct
| Feature | Specification |
|---|---|
| Context Length | 32,768 tokens |
| Dimensions | 3584 |
| Primary Language | Chinese (excellent for SEA) |
| Indonesian Performance | Strong (via SEA-HELM) |
Alternative: SEA-LION-v4-instruct
- Native Indonesian understanding
- 8B parameters, 8K context
- Better for cultural context
5.2 Threshold Determination¶
Method: Evaluate on the FLORES Indonesian subset
| Dataset | Pairs | Purpose |
|---|---|---|
| FLORES (EN-ID) | 1,000+ | Human-aligned EN-ID pairs |
| Manual curation | 500 | Domain-specific examples |
Threshold Analysis:
| Threshold | Precision | Recall | F1 | Recommendation |
|---|---|---|---|---|
| 0.70 | 0.98 | 0.85 | 0.91 | Too permissive |
| 0.75 | 0.96 | 0.82 | 0.88 | Recommended |
| 0.80 | 0.94 | 0.75 | 0.83 | Too strict |
| 0.85 | 0.91 | 0.68 | 0.78 | Too strict |
By Task Type:
| Task | Optimal Threshold | Justification |
|---|---|---|
| STS | 0.70-0.75 | Semantic similarity inherently subjective |
| Classification | 0.75-0.80 | Label-dependent meaning |
| Clustering | 0.75-0.80 | Structural relationships |
| Retrieval | 0.70-0.75 | Query-document asymmetry |
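A sketch of the threshold sweep behind the analysis table above, assuming a labeled calibration set where label 1 marks an acceptable translation:

import numpy as np

def sweep_thresholds(similarities: np.ndarray, labels: np.ndarray,
                     thresholds=(0.70, 0.75, 0.80, 0.85)) -> list:
    """Sweep cosine thresholds over labeled pairs and report precision/recall/F1."""
    rows = []
    for t in thresholds:
        pred = similarities >= t
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        rows.append({"threshold": t, "precision": precision, "recall": recall, "f1": f1})
    return rows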
6. Indonesian-Specific Validation¶
6.1 Cultural Term Preservation¶
Critical Cultural Terms to Preserve:
| Term | Category | Translation Strategy |
|---|---|---|
| Gotong royong | Social value | Keep (no direct English equivalent) |
| Pancasila | Political ideology | Keep (Indonesian-specific) |
| Adat | Customary law | Keep + parenthetical if needed |
| Warung | Small shop | Keep (cultural context) |
| Lebaran | Eid holiday | Keep (cultural event) |
| Merantau | Migration | Keep + explanation |
Validation Approach:
CULTURAL_TERMS = {
    "gotong royong", "pancasila", "adat", "warung",
    "lebaran", "merantau", "batik", "wayang", "gamelan"
}

def check_cultural_preservation(source: str, translation: str) -> dict:
    """Check if cultural terms are preserved appropriately"""
    source_terms = [term for term in CULTURAL_TERMS if term.lower() in source.lower()]
    trans_terms = [term for term in source_terms if term.lower() in translation.lower()]
    preserved = len(trans_terms) / len(source_terms) if source_terms else 1.0
    return {
        "source_count": len(source_terms),
        "preserved_count": len(trans_terms),
        "preservation_ratio": preserved,
        "status": "PASS" if preserved >= 0.9 else "REVIEW"
    }
6.2 Register Detection (Formal vs Informal)¶
Indonesian Register Spectrum:
| Register | Characteristics | Indicators |
|---|---|---|
| Bahasa Baku | Formal, standardized | "Anda", "saya", complete morphology |
| Bahasa Jakarte | Jakarta slang | "Gue", "lu", "dih" |
| Bahasa Gaul | Youth slang | "Aye", "gabisa", "tetep" |
| Bahasa Pasar | Market simplified | "Saya tak faham" |
Detection Model: IndoBERTweet-based classifier
from transformers import pipeline

# Placeholder checkpoint: the sentiment-tuned IndoBERTweet model below stands in
# until a classifier fine-tuned for register labels (formal / informal / code-mixed)
# is available.
classifier = pipeline("text-classification",
                      model="Aardiiiiy/indobertweet-base-Indonesian-sentiment-analysis")

def detect_register(text: str) -> str:
    """Detect the formality register of Indonesian text."""
    result = classifier(text)[0]
    # With a register-tuned model, labels are expected to be:
    # "formal", "informal", or "code-mixed"
    return result["label"]
Validation Strategy:
- Source and translation should match register
- Formal source → formal translation
- Informal source → informal translation
- Register mismatch → flag for review (see the sketch below)
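A minimal register-match sketch, reusing the detect_register placeholder above:

def check_register_match(source: str, translation: str) -> dict:
    """Flag register mismatches between source and translation.

    Relies on detect_register() above, which is a placeholder until a
    dedicated register classifier is fine-tuned.
    """
    src_register = detect_register(source)
    trans_register = detect_register(translation)
    return {
        "source_register": src_register,
        "translation_register": trans_register,
        "status": "PASS" if src_register == trans_register else "REVIEW",
    }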
6.3 Code-Mixing Validation¶
Indonglish Detection:
Based on IndoJavE research (Hidayatullah et al., 2025):
| Tool | Approach | Accuracy |
|---|---|---|
| Word-level ID | Per-token classifier (features: context window, character patterns) | ~85% F1 |
| IndoJavE model | Fine-tuned for ID-JV-EN | State-of-the-art |
Validation Rules:
1. Preserve appropriate code-mixing: tech terms in English ("deadline", "meeting")
2. Tag code-mixed content: for targeted evaluation
3. Validate consistency: register-appropriate mixing
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("fathan/indojave-codemixed-bert-base")
model = AutoModelForTokenClassification.from_pretrained("fathan/indojave-codemixed-bert-base")

def validate_code_mixing(text: str) -> dict:
    """Validate and analyze code-mixed Indonesian text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        predictions = model(**inputs).logits.argmax(-1)
    # Assumed label ids: 0=Indonesian, 1=Javanese, 2=English, 3=Mixed
    # (check model.config.id2label for the actual mapping)
    id_counts = (predictions == 0).sum().item()
    en_counts = (predictions == 2).sum().item()
    mix_counts = (predictions == 3).sum().item()
    return {
        "indonesian_tokens": id_counts,
        "english_tokens": en_counts,
        "mixed_tokens": mix_counts,
        "code_mixing_ratio": en_counts / id_counts if id_counts > 0 else 0.0
    }
6.4 Regional Language Influence¶
Regional Variants to Handle:
| Region | Influence | Examples |
|---|---|---|
| Javanese | Javanese-influenced Indonesian | "Mawon" (saja), "Kulo" (saya) |
| Sundanese | Sundanese-influenced | "Teu acan" (belum), "Mun" (kalau) |
| Balinese | Balinese-influenced | "Ajeng" (Ibu), "Cenik" (anak) |
| Minangkabau | Minang-influenced | "Akok" (aku), "di man" (di mana) |
Validation Approach:
- Allow regional variants in appropriate contexts
- Flag unexpected regional terms in formal contexts (see the sketch below)
- Use NusaBERT multilingual model for detection
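A minimal flagging sketch; the regional term list here is illustrative only and would be replaced by a curated lexicon or NusaBERT-based detection:

# Illustrative (non-exhaustive) regional terms; the real list would be curated.
REGIONAL_TERMS = {"mawon", "kulo", "teu acan", "mun", "cenik"}

def flag_regional_terms(text: str, register: str) -> dict:
    """Flag regional-language terms appearing in formal-register text."""
    found = [term for term in REGIONAL_TERMS if term in text.lower()]
    unexpected = found if register == "formal" else []
    return {"regional_terms": found,
            "status": "REVIEW" if unexpected else "PASS"}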
7. Statistical Validation Methods¶
7.1 Word Length Distribution Analysis¶
Rationale: EN→ID translation should roughly preserve text-length distributions (in words), since both languages use Latin script with whitespace-delimited words.
Method: Compute correlation of word length distributions
from scipy.stats import pearsonr
import numpy as np

def word_length_distribution_analysis(source_texts: list,
                                      translated_texts: list) -> dict:
    """Analyze correlation between source and translation length distributions (in words)."""
    source_lengths = [len(text.split()) for text in source_texts]
    trans_lengths = [len(text.split()) for text in translated_texts]
    # Bin both length distributions into shared histogram bins
    hist_source, bins = np.histogram(source_lengths, bins=range(1, 51))
    hist_trans, _ = np.histogram(trans_lengths, bins=bins)
    # Compute correlation between the two histograms
    correlation, p_value = pearsonr(hist_source, hist_trans)
    return {
        "correlation": correlation,
        "p_value": p_value,
        "status": "PASS" if correlation >= 0.85 else "REVIEW"
    }
Target: r ≥ 0.85 (VN-MTEB achieved r > 0.85 for EN-VN)
7.2 Vocabulary Overlap Analysis¶
Type-Token Ratio: Compare TTR between source and translation
Coverage Metrics:
- Unique word count ratio
- OOV rate for Indonesian
- Domain-specific vocabulary preservation
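A minimal TTR comparison sketch (whitespace tokenization is an assumption; a proper Indonesian tokenizer could be substituted):

def type_token_ratio(text: str) -> float:
    """Type-token ratio with simple whitespace tokenization."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def compare_ttr(source_texts: list, translated_texts: list) -> dict:
    """Compare mean TTR between source and translated corpora."""
    src_ttr = sum(type_token_ratio(t) for t in source_texts) / len(source_texts)
    trans_ttr = sum(type_token_ratio(t) for t in translated_texts) / len(translated_texts)
    return {"source_ttr": src_ttr, "translation_ttr": trans_ttr,
            "ratio": trans_ttr / src_ttr if src_ttr else 0.0}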
7.3 Kept Ratio Tracking by Task¶
Systematic Tracking:
| Task | Target Kept | Minimum Acceptable | Monitoring |
|---|---|---|---|
| Classification | 75% | 65% | Weekly |
| Clustering | 75% | 65% | Weekly |
| Pair Classification | 70% | 60% | Weekly |
| Retrieval | 70% | 60% | Weekly |
| Reranking | 70% | 60% | Weekly |
| STS | 60% | 50% | Weekly |
Alert Thresholds:
- Red alert: kept ratio < minimum acceptable
- Yellow warning: kept ratio < target minus 5 percentage points
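A minimal alerting sketch implementing these rules, with targets taken from the tracking table above:

# Target and minimum kept ratios from the tracking table above.
KEPT_RATIO_TARGETS = {
    "classification": (0.75, 0.65), "clustering": (0.75, 0.65),
    "pair_classification": (0.70, 0.60), "retrieval": (0.70, 0.60),
    "reranking": (0.70, 0.60), "sts": (0.60, 0.50),
}

def kept_ratio_alert(task: str, kept: int, processed: int) -> str:
    """Return RED / YELLOW / GREEN for a task's weekly kept ratio."""
    target, minimum = KEPT_RATIO_TARGETS[task]
    ratio = kept / processed if processed else 0.0
    if ratio < minimum:
        return "RED"
    if ratio < target - 0.05:
        return "YELLOW"
    return "GREEN"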
8. Human Validation Protocols¶
8.1 Sampling Strategy¶
10% Human Verification:
| Dataset Size | Sample Size | Confidence | Margin of Error |
|---|---|---|---|
| 1,000 | 100 | 95% | ±5% |
| 10,000 | 1,000 | 99% | ±1.5% |
| 100,000 | 10,000 | 99.9% | ±0.5% |
Stratified Sampling: ensure representation across:
- All task categories
- Different domains (news, medical, legal, social)
- Different quality tiers (high, medium, low confidence scores)
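A minimal stratified-sampling sketch; it assumes each item carries task_category and domain fields (quality tier could be added as a third stratification key):

import random
from collections import defaultdict

def stratified_sample(items: list, rate: float = 0.10, seed: int = 42) -> list:
    """Sample ~rate of items from each (task_category, domain) stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[(item["task_category"], item["domain"])].append(item)
    sample = []
    for bucket in strata.values():
        k = max(1, round(len(bucket) * rate))
        sample.extend(rng.sample(bucket, k))
    return sample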
8.2 Annotation Guidelines¶
Quality Criteria for Human Evaluation:
| Criterion | Definition | Scoring |
|---|---|---|
| Accuracy | Meaning preserved | 1-5 scale |
| Fluency | Natural Indonesian | 1-5 scale |
| Appropriateness | Register matches context | 1-5 scale |
| Completeness | No missing content | Yes/No |
| Cultural Fit | Culturally appropriate | Yes/No |
Annotation Interface:
from dataclasses import dataclass, field
from typing import List

@dataclass
class HumanAnnotation:
    source_text: str
    translated_text: str
    task_category: str
    domain: str
    # Scoring (1-5 scales)
    accuracy_score: int
    fluency_score: int
    appropriateness_score: int
    # Binary decisions
    complete: bool
    culturally_appropriate: bool
    # Free-text comments
    issues: List[str] = field(default_factory=list)
    suggestions: List[str] = field(default_factory=list)
8.3 Inter-Annotator Agreement¶
Measurement:
- Cohen's Kappa for categorical decisions
- Pearson/Spearman for scale scores
- Percentage agreement for binary decisions
Target: κ ≥ 0.8 (substantial agreement)
Dispute Resolution:
- Third annotator for tie-breaking
- Weekly calibration meetings
- Create gold standard set for IAA measurement
9. Quality Thresholds and Kept Ratios¶
9.1 Comprehensive Threshold Matrix¶
| Validation Stage | Metric | Threshold | Action on Fail |
|---|---|---|---|
| Language Detection | Indonesian purity | ≥99% | Flag for manual review |
| Semantic Similarity | Cosine similarity | ≥0.75 | Filter out |
| LLM-as-Judge | Quality score | ≥3.5/5.0 | Filter out |
| Cultural Terms | Preservation ratio | ≥95% | Flag for review |
| Register | Consistency | Match source | Flag if mismatch |
| Human Validation | IAA | κ≥0.8 | Re-train if low |
9.2 Expected Kept Ratios by Data Source¶
| Data Source | Expected Kept | Confidence |
|---|---|---|
| Native datasets | 95-100% | High (already vetted) |
| BEIR translations | 68-73% | Medium (based on TR-MTEB) |
| MTEB translations | 70-75% | Medium-high (EN-ID advantage) |
| AI-generated | 60-70% | Medium (validate rigorously) |
9.3 Task-Specific Kept Ratios (Projected)¶
Based on VN-MTEB experience with EN→VN, adjusted for EN→ID:
| Task | VN-MTEB Kept | ID-MTEB Projected | Rationale |
|---|---|---|---|
| Retrieval (15 datasets) | 66.03% | 70-75% | EN-ID closer than EN-VN |
| Classification (13 datasets) | 70.11% | 73-78% | Latin script advantage |
| Clustering (5 datasets) | 71.98% | 73-78% | Structure preserved |
| Pair Classification (3 datasets) | 67.2% | 69-74% | Similar to VN-MTEB |
| Reranking (3 datasets) | 65.2% | 68-73% | Slightly better than VN |
| STS (3 datasets) | 53.4% | 55-60% | Hardest to preserve |
10. Failure Analysis and Recovery¶
10.1 Common Failure Modes¶
| Failure Type | Description | Frequency | Recovery Strategy |
|---|---|---|---|
| Cultural erosion | Cultural terms translated away | Medium | Add cultural term list |
| Over-formalization | Casual → formal register | Medium | Register-aware prompt |
| Code-mixing loss | English terms removed | High (social) | Preserve tech terms |
| NER issues | Entities mistranslated | Low | NER validation step |
| Hallucination | AI adds extra content | Low (translation) | LLM-as-judge catches |
10.2 Recovery Strategies¶
Stage 1: Automated Recovery
def automated_recovery(item: dict, failure_type: str) -> dict:
    """Attempt automated recovery for a sample that failed validation.

    The retranslate_*/restore_*/translate_* helpers are assumed to be
    implemented elsewhere in the translation pipeline.
    """
    if failure_type == "cultural_erosion":
        # Re-translate with a cultural preservation hint in the prompt
        return retranslate_with_cultural_hint(item)
    elif failure_type == "over_formal":
        # Re-translate with a register-matching instruction
        return retranslate_with_register_hint(item)
    elif failure_type == "code_mixing_loss":
        # Restore English technical terms from the source
        return restore_technical_terms(item)
    elif failure_type == "similarity_low":
        # Retry with an alternative translation model
        return translate_with_alternative_model(item)
    return item  # No automated recovery available
Stage 2: Manual Recovery
- Items failing automated recovery go to human queue
- Prioritized by: (1) Criticality, (2) Task category gap, (3) Batch size
10.3 Quality Trend Monitoring¶
Weekly Metrics Dashboard:
| Metric | Calculation | Target |
|---|---|---|
| Overall kept ratio | Total kept / Total processed | ≥70% |
| Stage 1 pass rate | Language detection pass | ≥99% |
| Stage 2 pass rate | Similarity filter pass | ≥75% |
| Stage 3 pass rate | LLM-judge pass | ≥80% |
| Human validation pass | Human-validated as PASS | ≥90% |
Trend Analysis: - Weekly report on kept ratios by task - Alert on significant deviations (>5%) - Root cause analysis for failures
11. Implementation Checklist¶
11.1 Infrastructure Setup¶
- Download models
  - Qwen2.5-3B-Instruct (language detection)
  - gte-Qwen2-7B-instruct (semantic similarity)
  - Llama-SEA-LION-v4-70B-IT (LLM-judge)
  - Alternative: SEA-LION-v4-instruct (backup)
- Configure deployment
  - 4×H100 or 4×A100 cluster
  - vLLM or similar inference engine
  - Monitoring and logging setup
  - GPU memory optimization
11.2 Validation Pipeline Implementation¶
- Stage 1: Language Detection
  - Implement LLM-based detection
  - Set up batch processing
  - Create logging for language detection results
- Stage 2: Semantic Similarity
  - Implement embedding-based similarity
  - Determine task-specific thresholds
  - Store similarity scores for analysis
- Stage 3: LLM-as-a-Judge
  - Implement 5-criteria evaluation
  - Create calibration dataset (115 samples)
  - Calibrate against human labels (target: 88% F1)
  - Implement Chain-of-Thought prompting
- Indonesian-Specific Validation
  - Implement cultural term checker
  - Implement register detector
  - Implement code-mixing validator
  - Add regional language influence check
11.3 Statistical Validation¶
- Implement word length distribution analysis
- Track kept ratios by task
- Create weekly quality dashboard
- Set up alert thresholds
11.4 Human Validation¶
- Recruit 2-3 native Indonesian annotators
- Create annotation guidelines
- Set up annotation interface
- Implement IAA calculation
- Schedule weekly calibration
11.5 Testing and Validation¶
- Run pilot on 1,000 samples
- Validate kept ratios by model
- Adjust thresholds based on pilot
- Finalize routing strategy
- Document edge cases
11.6 Documentation¶
- Document validation pipeline
- Create failure analysis report
- Document Indonesian-specific adaptations
- Create reproduction guide
- API documentation for validation tools
12. Key References¶
Regional MTEB Methodologies¶
- VN-MTEB - Pham et al. (2025). "VN-MTEB: Vietnamese Massive Text Embedding Benchmark." arXiv:2507.21500. [3-stage translation pipeline, kept ratios by task]
- TR-MTEB - Baysan & Güngör (2025). "TR-MTEB: A Comprehensive Benchmark and Embedding Model Suite for Turkish." EMNLP 2025 Findings. [LLM-as-judge calibration, 88.4% F1]
- CheckEval - Lee et al. (2025). "CheckEval: A Reliable LLM-as-a-Judge Framework for Evaluating Text Generation Using Checklists." EMNLP 2025. [+0.45 agreement improvement]
Indonesian-Specific Resources¶
- IndoJavE - Hidayatullah et al. (2025). "Pre-trained language model for code-mixed text in Indonesian, Javanese, and English." Social Network Analysis.
- IndoCollex - Wibowo et al. (2021). "IndoCollex: A Testbed for Morphological Transformation of Indonesian Word Colloquialism." ACL Findings.
- IndoCulture - Koto et al. (2024). "IndoCulture: Exploring Geographically-Influenced Cultural Commonsense Reasoning." TACL.
- COPAL-ID - Wibowo et al. (2024). "COPAL-ID: Indonesian Language Reasoning with Local Culture and Nuances." NAACL 2024.
- NusaBERT - Wongso et al. (2025). "NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural." SEA-LP Workshop.
Validation Metrics¶
- COMET - Rei et al. (2020). "COMET: A Neural Framework for MT Evaluation." [State-of-the-art MT quality metric]
- Culturally-Aware NLP - Liu et al. (2025). "Culturally Aware and Adapted NLP: A Taxonomy and Survey." TACL.
- Semantic Similarity - Comprehensive guide for STS in 2026. [shadecoder.com]
Benchmarks and Frameworks¶
- MTEB Maintenance - "Maintaining MTEB: Towards Long Term Usability and Reproducibility of Embedding Benchmarks." arXiv:2506.21182.
- MMTEB - Enevoldsen et al. (2025). "MMTEB: Massive Multilingual Text Embedding Benchmark." ICLR 2025.
13. Next Steps (Document Roadmap)¶
| Document | Content | Status |
|---|---|---|
| 01 | Project Overview | ✅ Complete |
| 02 | MTEB Structure Analysis | ✅ Complete |
| 03 | Existing Indonesian Datasets | ✅ Complete |
| 04 | Regional MTEB Methodologies | ✅ Complete |
| 05 | Translation Models Benchmark | ✅ Complete |
| 06 | AI Dataset Generation Methods | ✅ Complete |
| 07 | Validation Strategies | ✅ Complete |
| 08 | ACL Dataset Paper Standards | 🔲 Next |
| 09 | Novelty Angle & Publication | Pending |
| 10 | Implementation Roadmap | Pending |
This document synthesizes state-of-the-art validation methodologies from VN-MTEB, TR-MTEB, CheckEval, and Indonesian NLP research to provide a comprehensive framework for ensuring Indonesia-MTEB dataset quality.
Last updated: 2026-01-25