Project: Indonesia-MTEB Benchmark
Document: 01 - Project Overview & Scope Definition
Version: 2.0 (Enhanced Edition)
Last Updated: 2026-01-25
Status: Research Phase - Foundation Planning
[!NOTE]
Document Navigation¶
This is the first of twelve documents comprising the Indonesia-MTEB Benchmark research foundation. Each document builds upon the previous, establishing a comprehensive blueprint for creating Indonesia's first unified text embedding benchmark following MTEB methodology.
| Document | Title | Focus Area |
|---|---|---|
| 01 | Project Overview & Scope | Current Document |
| 02 | MTEB Structure Analysis | Framework deep-dive |
| 03 | Existing Indonesian Datasets | Data aggregation sources |
| 04 | Regional MTEB Methodologies | Precedent analysis |
| 05 | Translation Models Benchmark | Model selection & evaluation |
| 06 | AI Dataset Generation Methods | Novel data creation |
| 07 | Validation Strategies | Quality assurance protocols |
| 08 | ACL Dataset Paper Standards | Publication requirements |
| 09 | Novelty Angle & Publication | Research contribution |
| 10 | Implementation Roadmap | Technical execution plan |
| 11 | Python Package Development | Software architecture |
| 12 | Summary & Quick Reference | Consolidated reference |
Indonesia-MTEB: A Comprehensive Text Embedding Benchmark for Indonesian¶
"The absence of a unified embedding benchmark for Indonesian represents a critical gap in Southeast Asian NLP infrastructure. With 280+ million speakers, Indonesian ranks among the world's most spoken languages, yet remains systematically underrepresented in embedding evaluation frameworks."
Table of Contents¶
- Executive Summary
- The Indonesian Language Context
- Background: The MTEB Framework
- The Gap Analysis
- Regional MTEB Precedents
- Project Scope & Deliverables
- Research Questions
- Proposed Methodology
- Technical Architecture
- Success Criteria
- Timeline & Milestones
- References
1. Executive Summary¶
1.1 The Problem Statement¶
The Massive Text Embedding Benchmark (MTEB) has emerged as the dominant evaluation framework for text embedding models globally. Since its introduction at EACL 2023, MTEB has expanded rapidly, most recently through the MMTEB (Massive Multilingual Text Embedding Benchmark) initiative presented at ICLR 2025, and now encompasses:
| Milestone | Scale | Languages | Datasets |
|---|---|---|---|
| MTEB Original (EACL 2023) | Foundational | 112 | 58 |
| MMTEB (ICLR 2025) | Community-driven | 1,000+ | 500+ |
| Current (2026) | Production | 1,000+ | 1,308+ |
However, Indonesian language coverage remains fragmented and insufficient for rigorous embedding evaluation:
┌─────────────────────────────────────────────────────────────────────────────┐
│ INDONESIAN EMBEDDING EVALUATION GAP │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ GLOBAL MTEB │ INDONESIAN STATUS │
│ ────────────── │ ───────────────── │
│ ✓ 8 Task Categories │ ✗ No unified Indonesian benchmark │
│ ✓ 500+ Quality-controlled tasks │ ✗ Scattered individual datasets │
│ ✓ Standardized metrics │ ✗ No embedding-specific evaluation │
│ ✓ Active leaderboard │ ✗ No Indonesian embedding leaderboard │
│ ✓ Community governance │ ✗ No centralized benchmark hub │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
The Critical Gap: Despite being the 11th most spoken language globally with 280+ million speakers and serving as the lingua franca of Southeast Asia, Indonesian lacks a dedicated, comprehensive embedding benchmark following MTEB standards.
1.2 Research Objective¶
Primary Goal: Create Indonesia-MTEB — a unified, comprehensive Indonesian text embedding benchmark following MTEB methodology, covering all 8 MTEB task categories with minimum 50 datasets (target: 100+).
Three-Pronged Data Strategy:
┌─────────────────────────────────────────────────────────────────────────────┐
│ INDONESIA-MTEB DATASET ACQUISITION STRATEGY │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ╔═════════════════════════════════════════════════════════════════════════╗│
│ ║ PHASE 1: AGGREGATION ║│
│ ║ ───────────────── ║│
│ ║ • Identify and catalog existing Indonesian NLP datasets ║│
│ ║ • Convert to MTEB-compatible format ║│
│ ║ • Sources: IndoNLU, NusaX, IndoMMLU, MIRACL-ID, SEACrowd ║│
│ ║ • Expected Coverage: ~20-30 datasets ║│
│ ╚═════════════════════════════════════════════════════════════════════════╝│
│ │ │
│ ▼ │
│ ╔═════════════════════════════════════════════════════════════════════════╗│
│ ║ PHASE 2: TRANSLATION ║│
│ ║ ───────────────── ║│
│ ║ • Full MTEB benchmark translation to Indonesian ║│
│ ║ • Primary Model: TranslateGemma (4B/12B) - 55 language support ║│
│ ║ • Alternative: NLLB-200, mT5, Bloom ║│
│ ║ • Quality Control: LLM-as-judge + Human validation (10% sample) ║│
│ ║ • Expected Coverage: ~40-60 datasets ║│
│ ╚═════════════════════════════════════════════════════════════════════════╝│
│ │ │
│ ▼ │
│ ╔═════════════════════════════════════════════════════════════════════════╗│
│ ║ PHASE 3: AI-GENERATED DATASETS ║│
│ ║ ───────────────────────────── ║│
│ ║ • Identify task gaps after Phase 1 + 2 ║│
│ ║ • Generate novel Indonesian datasets using LLMs ║│
│ ║ • Domains: Legal, Healthcare, Finance, Social Media ║│
│ ║ • Validation: Statistical consistency + Human expert review ║│
│ ║ • Expected Coverage: ~10-20 novel datasets ║│
│ ╚═════════════════════════════════════════════════════════════════════════╝│
│ │
│ ▼ │
│ ╔═════════════════════════════════════════════════════════════════════════╗│
│ ║ INTEGRATION & VALIDATION ║│
│ ║ ───────────────────────────── ║│
│ ║ • Unified dataset format validation ║│
│ ║ • Baseline model evaluation on all tasks ║│
│ ║ • Leaderboard integration with MTEB ecosystem ║│
│ ║ • Publication: ACL/EMNLP/NAACL dataset paper ║│
│ ╚═════════════════════════════════════════════════════════════════════════╝│
│ │
└─────────────────────────────────────────────────────────────────────────────┘
1.3 Key Contributions¶
| Contribution Type | Description | Impact |
|---|---|---|
| Infrastructure | First unified Indonesian embedding benchmark | Enables systematic model comparison |
| Methodological | Three-pronged data acquisition framework | Replicable for other low-resource languages |
| Empirical | Baseline evaluation of existing models | Establishes performance floor |
| Community | Open-source Python package | Democratizes access to embedding evaluation |
2. The Indonesian Language Context¶
2.1 Demographic Significance¶
Understanding the scale and importance of Indonesian (Bahasa Indonesia) is essential for contextualizing this benchmark:
┌─────────────────────────────────────────────────────────────────────────────┐
│ INDONESIAN LANGUAGE: KEY STATISTICS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ SPEAKER COUNT │
│ ───────────── │
│ • Total Speakers: ~280 million (2024) │
│ • Native Speakers: ~42 million │
│ • Second-Language Speakers: ~238 million │
│ • Global Ranking: 11th most spoken language │
│ │
│ GEOGRAPHIC DISTRIBUTION │
│ ─────────────────────── │
│ • Primary Country: Indonesia (4th most populous nation) │
│ • ASEAN Presence: Working language of ASEAN │
│ • Diaspora: Malaysia, Singapore, Netherlands, etc. │
│ │
│ LINGUISTIC CONTEXT │
│ ─────────────────── │
│ • Language Family: Austronesian │
│ • Script: Latin (Roman) alphabet │
│ • Regional Languages: 700+ indigenous languages in Indonesia │
│ • Official Status: Sole official language (since 1928) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
[!TIP] Why Indonesian Matters for AI: Indonesia is the largest economy in Southeast Asia and a rapidly growing digital market. With over 200 million internet users and a thriving startup ecosystem, Indonesian NLP capabilities have direct commercial and social impact.
2.2 Linguistic Characteristics Affecting Embeddings¶
Indonesian presents unique challenges for text embedding models due to its morphological and syntactic properties:
| Linguistic Feature | Description | Embedding Challenge |
|---|---|---|
| Agglutinative Morphology | Words change through affixation (prefixes, suffixes, infixes, circumfixes) | Embeddings must capture morphological variants |
| Reduplication | Complete or partial word repetition for plurality or emphasis | Creates vocabulary explosion |
| Productive Affixation | Thousands of possible affix combinations | Sparse embedding space for derived forms |
| Loanword Integration | Extensive borrowing from Dutch, Arabic, Sanskrit, English, Javanese | Requires cross-lingual alignment |
| Pro-Drop Language | Subject pronouns often omitted | Embeddings must infer from context |
| Formal vs. Informal Registers | Significant divergence between written and colloquial forms | Domain shift challenges |
Example of Agglutinative Complexity:
Root Word: "tulis" (write)
│
├── "meN-" → "menulis" (to write - active)
│     │
│     └── "-kan" → "menuliskan" (to write something down / write for someone)
│            │
│            └── "di-" → "dituliskan" (to be written down for someone - passive)
│
├── "di-" → "ditulis" (to be written - passive)
│
├── "peN-" → "penulis" (writer)
│     │
│     └── "-an" → "penulisan" (the act or process of writing)
│
├── "-an" → "tulisan" (a piece of writing)
│
└── "ter-" → "tertulis" (written / as stated in writing)
[!NOTE] Implication for Embedding Benchmarks: Indonesian embedding models must demonstrate robustness across these morphological variations. A comprehensive benchmark must include datasets that specifically test these phenomena.
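To make the challenge concrete, the sketch below probes how a multilingual encoder places morphological variants of "tulis" in embedding space. It is illustrative only: the model choice (LaBSE, listed later among the planned baselines) and the word list are assumptions, not a prescribed benchmark task.

```python
# Illustrative probe (not a benchmark task): compare embeddings of
# morphological variants of the root "tulis" with a multilingual encoder.
from sentence_transformers import SentenceTransformer, util

# Assumption: LaBSE is used here only because it appears among the planned baselines.
model = SentenceTransformer("sentence-transformers/LaBSE")

variants = ["tulis", "menulis", "menuliskan", "ditulis", "penulis", "penulisan", "tulisan"]
embeddings = model.encode(variants, normalize_embeddings=True)

# Pairwise cosine similarities; a robust model should keep derivations of the same
# root close while still separating, e.g., agent ("penulis") from process ("penulisan").
similarities = util.cos_sim(embeddings, embeddings)
for i, a in enumerate(variants):
    for j, b in enumerate(variants):
        if i < j:
            print(f"{a:12s} vs {b:12s}: {similarities[i][j].item():.3f}")
```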
2.3 Current NLP Infrastructure in Indonesia¶
| Resource Type | Status | Notable Examples |
|---|---|---|
| Pretrained Language Models | Emerging | IndoBERT, IndoBART, IndoGPT |
| Embedding Models | Limited | LazarusNLP collections (5-10 models) |
| NLU Benchmarks | Available | IndoNLU (12 tasks) |
| Embedding Benchmarks | None | This is the gap |
| Translation Models | Good | NLLB, SeamlessM4T, TranslateGemma |
3. Background: The MTEB Framework¶
3.1 What is MTEB?¶
MTEB (Massive Text Embedding Benchmark) is a standardized evaluation framework for text embedding models, introduced by Muennighoff et al. (2023) at EACL 2023 and significantly expanded through MMTEB at ICLR 2025.
Evolution Timeline:
┌─────────────────────────────────────────────────────────────────────────────┐
│ MTEB EVOLUTIONARY TIMELINE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 2022 (October) │
│ ════════════════ │
│ • Original MTEB paper released (arXiv:2210.07316) │
│ • 58 datasets, 112 languages, 8 task categories │
│ • Establishes unified evaluation protocol │
│ │
│ 2023 (April) │
│ ═════════════ │
│ • MTEB presented at EACL 2023 (Main Conference) │
│ • Paper: 1,400+ citations as of 2026 │
│ • HuggingFace integration launched │
│ │
│ 2024 │
│ ════ │
│ • Regional MTEBs emerge: C-MTEB (Chinese), AfriMTEB (African) │
│ • Dataset count exceeds 1,000 │
│ • Leaderboard becomes industry standard │
│ │
│ 2025 (January) │
│ ═══════════════ │
│ • MMTEB announced: Massive Multilingual expansion │
│ • 500+ tasks, 1,000+ languages │
│ • Community-driven governance model │
│ │
│ 2025 (May) │
│ ═════════════ │
│ • MMTEB presented at ICLR 2025 │
│ • New task categories: Instruction Following, Long-Document Retrieval │
│ • 86+ citations and growing rapidly │
│ │
│ 2026 (Current) │
│ ═════════════ │
│ • 1,308+ datasets in production │
│ • Active model submissions: 500+ models evaluated │
│ • Regional expansions: VN-MTEB, SEA-BED, others │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
3.2 The 8 MTEB Task Categories¶
Indonesia-MTEB will comprehensively cover all 8 MTEB task categories. Each category evaluates different aspects of embedding quality:
┌─────────────────────────────────────────────────────────────────────────────┐
│ MTEB TASK CATEGORIES & EVALUATION METRICS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ 1. CLASSIFICATION │ │
│ │ ──────────────── │ │
│ │ Task: Single-label text classification │ │
│ │ Metrics: Accuracy, F1-score (macro/micro) │ │
│ │ Example: Sentiment analysis, topic categorization │ │
│ │ Indonesian Focus: sentiment (NusaX), news classification (IndoNLU) │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ 2. CLUSTERING │ │
│ │ ───────────── │ │
│ │ Task: Group similar texts without labels │ │
│ │ Metrics: V-measure (homogeneity + completeness), ARI │ │
│ │ Example: Document clustering, topic discovery │ │
│ │ Indonesian Focus: news clustering, social media grouping │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ 3. PAIR CLASSIFICATION │ │
│ │ ────────────────────── │ │
│ │ Task: Binary classification of text pairs │ │
│ │ Metrics: Accuracy, Average Precision (AP) │ │
│ │ Example: Paraphrase detection, duplicate identification │ │
│ │ Indonesian Focus: paraphrase ID, semantic equivalence │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ 4. RERANKING │ │
│ │ ───────────── │ │
│ │ Task: Reorder retrieved documents by relevance │ │
│ │ Metrics: MAP (Mean Average Precision), nDCG │ │
│ │ Example: Search result refinement │ │
│ │ Indonesian Focus: document reranking, web search refinement │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ 5. RETRIEVAL │ │
│ │ ───────────── │ │
│ │ Task: Find relevant documents for queries │ │
│ │ Metrics: nDCG@k, Recall@k, MAP, MRR │ │
│ │ Example: Search engines, RAG systems │ │
│ │ Indonesian Focus: MIRACL-ID, Wikipedia retrieval, FAQ retrieval │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ 6. STS (Semantic Textual Similarity) │ │
│ │ ────────────────────────────────────── │ │
│ │ Task: Predict similarity scores for text pairs │ │
│ │ Metrics: Pearson correlation, Spearman correlation │ │
│ │ Example: Semantic relatedness, paraphrase similarity │ │
│ │ Indonesian Focus: translation-adapted STS datasets │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ 7. SUMMARIZATION │ │
│ │ ──────────────────── │ │
│ │ Task: Assess summary quality relative to source │ │
│ │ Metrics: Cosine similarity, ROUGE (as reference) │ │
│ │ Example: Summary relevance assessment │ │
│ │ Indonesian Focus: news summary evaluation │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ 8. INSTRUCTION FOLLOWING │ │
│ │ ──────────────────────── │ │
│ │ Task: Follow embedding-specific instructions │ │
│ │ Metrics: Task-specific (varies by instruction type) │ │
│ │ Example: Domain-specific retrieval, style-conditioned embedding │ │
│ │ Indonesian Focus: Domain instruction datasets │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
3.3 Key Evaluation Metrics Explained¶
For each task category, MTEB employs specific metrics. Understanding these is crucial for benchmark design:
[!NOTE] Metric Reference for Indonesia-MTEB Implementation:
| Metric | Formula | Range | Interpretation | Use Case |
|---|---|---|---|---|
| Accuracy | correct / total | [0, 1] | Percentage correct | Classification |
| F1-Score | 2·(precision·recall)/(precision+recall) | [0, 1] | Harmonic mean of precision/recall | Classification |
| V-Measure | 2·(homogeneity·completeness)/(homogeneity+completeness) | [0, 1] | Clustering quality independent of label permutation | Clustering |
| ARI | (RI - Expected_RI) / (Max_RI - Expected_RI) | [-1, 1] | Adjusted Rand Index - clustering similarity to ground truth | Clustering |
| MAP | mean(Average_Precision) | [0, 1] | Mean of average precision across queries | Retrieval, Reranking |
| nDCG@k | DCG@k / IDCG@k | [0, 1] | Normalized Discounted Cumulative Gain at position k | Retrieval, Reranking |
| Recall@k | relevant_in_top_k / total_relevant | [0, 1] | Percentage of relevant documents found in top k | Retrieval |
| MRR | mean(1/rank_of_first_relevant) | [0, 1] | Mean Reciprocal Rank | Retrieval |
| Pearson | covariance/(σ_x·σ_y) | [-1, 1] | Linear correlation between predicted and actual | STS |
| Spearman | rank_correlation | [-1, 1] | Monotonic correlation between predicted and actual | STS |
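For implementers, the snippet below shows how several of these metrics can be computed with off-the-shelf scikit-learn and SciPy functions. It is a minimal sketch with toy inputs, not the evaluation code Indonesia-MTEB will ship.

```python
# Minimal sketch: computing a few MTEB-style metrics with scikit-learn (toy data).
import numpy as np
from sklearn.metrics import (
    accuracy_score, f1_score,              # Classification
    v_measure_score, adjusted_rand_score,  # Clustering
    ndcg_score,                            # Retrieval / Reranking
)
from scipy.stats import pearsonr, spearmanr  # STS

# Classification
y_true, y_pred = [0, 1, 1, 2], [0, 1, 2, 2]
print("Accuracy:", accuracy_score(y_true, y_pred))
print("Macro F1:", f1_score(y_true, y_pred, average="macro"))

# Clustering (gold labels vs. predicted cluster assignments)
print("V-measure:", v_measure_score([0, 0, 1, 1], [1, 1, 0, 0]))
print("ARI:", adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0]))

# Retrieval: graded relevance vs. model scores for one query
relevance = np.asarray([[3, 2, 0, 1]])
scores = np.asarray([[0.9, 0.7, 0.5, 0.2]])
print("nDCG@3:", ndcg_score(relevance, scores, k=3))

# STS: predicted similarity vs. gold similarity
gold = [0.1, 0.5, 0.9, 0.7]
pred = [0.2, 0.4, 0.8, 0.6]
print("Pearson:", pearsonr(gold, pred)[0])
print("Spearman:", spearmanr(gold, pred).correlation)
```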
3.4 MTEB Leaderboard & Submission Process¶
The MTEB leaderboard, hosted on HuggingFace, serves as the central hub for embedding model evaluation:
┌─────────────────────────────────────────────────────────────────────────────┐
│ MTEB LEADERBOARD SUBMISSION PROCESS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. MODEL PREPARATION │
│ ──────────────────── │
│ • Upload model to HuggingFace Hub │
│ • Ensure model card includes: │
│ - Model architecture │
│ - Training data sources │
│ - Parameter count │
│ - License information │
│ │
│ 2. SUBMISSION PACKAGE │
│ ──────────────────── │
│ • Fork MTEB repository │
│ • Add model metadata to models/registry.yaml │
│ • Format: │
│ name: "ModelName" │
│ language: ["id"] # for Indonesian models │
│ open_source: true │
│ revision: "commit_hash" │
│ │
│ 3. AUTOMATED EVALUATION │
│ ──────────────────── │
│ • MTEB CI automatically evaluates on all benchmarks │
│ • Results aggregated across task categories │
│ • Leaderboard updated automatically │
│ │
│ 4. TRANSPARENCY REQUIREMENTS │
│ ──────────────────────────── │
│ • Reference implementation required │
│ • Training data disclosure │
│ • Reproducibility checklist │
│ • Code availability │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
[!TIP] For Indonesia-MTEB: We will establish integration with the MTEB leaderboard through:
1. Official dataset submission to MTEB repository
2. Indonesian-specific leaderboard sub-section
3. Automated evaluation pipeline for Indonesian models
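A sketch of how Indonesian tasks could be evaluated through the existing `mteb` Python package is shown below. It assumes Indonesia-MTEB tasks have been registered upstream; the calls shown (`get_tasks`, `MTEB`, `run`) follow recent `mteb` releases and the model identifier is simply one of the multilingual baselines named later, so both should be verified against the installed version.

```python
# Sketch: evaluating a multilingual baseline on Indonesian tasks via the `mteb` package.
# Assumes Indonesia-MTEB tasks are registered upstream; verify API details per version.
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")

# Select tasks whose language metadata includes Indonesian (ISO 639-3 code "ind").
tasks = mteb.get_tasks(languages=["ind"])

evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/multilingual-e5-large")

for task_result in results:
    # Attribute names may differ slightly across mteb versions.
    print(task_result.task_name, task_result.scores)
```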
4. The Gap Analysis¶
4.1 Current Indonesian Embedding Landscape¶
A comprehensive analysis reveals significant gaps in Indonesian embedding evaluation infrastructure:
| Resource | Type | Coverage | MTEB-Compatible | Status |
|---|---|---|---|---|
| IndoNLU | NLU Benchmark | 12 tasks, Indonesian only | ❌ NLU tasks only, not embedding-specific | Established (2020) |
| NusaX | Sentiment Dataset | 10 Indonesian local languages | ❌ Single task (sentiment) | Established (2022) |
| IndoMMLU | Knowledge QA | Culture + language understanding | ❌ Knowledge-focused, not embedding | Available |
| MIRACL-ID | Retrieval | Indonesian subset of 18 languages | ⚠️ Partial - retrieval only | Available |
| LazarusNLP | Embedding Models | 5-10 Indonesian embedding models | ❌ Models, not benchmark | Active (2024) |
| SEA-BED | Regional Benchmark | 10 SEA languages, 169 datasets, 9 tasks | ⚠️ Multi-language, not Indonesia-focused | New (2025) |
| SEACrowd | Data Hub | 13 tasks, 38 SEA indigenous languages | ⚠️ Includes Indonesian but not embedding-specific | New (2024) |
| Indonesia-MTEB | Embedding Benchmark | 8 tasks, 50-100+ datasets | ✅ Full MTEB compatibility | This Project |
Key Findings:
- No Indonesia-Specific Embedding Benchmark: Existing resources are either NLU-focused (IndoNLU) or multi-language (SEA-BED, SEACrowd) without dedicated emphasis on Indonesian embeddings.
- Fragmented Task Coverage: No single resource covers all 8 MTEB task categories for Indonesian.
- No Centralized Evaluation: Indonesian embedding models (LazarusNLP) are evaluated on scattered datasets without unified comparison.
4.2 Comparison with Regional Benchmarks¶
| Benchmark | Language | Datasets | Tasks | MTEB Integration | Indonesia Coverage |
|---|---|---|---|---|---|
| C-MTEB | Chinese | 35 | 6 | ✅ Full | N/A |
| VN-MTEB | Vietnamese | ~30+ | Multi | ✅ Full | N/A |
| AfriMTEB | African languages | Subset | Multi | ✅ Full | N/A |
| SEA-BED | 10 SEA languages | 169 | 9 | ⚠️ Independent | Partial (1 of 10) |
| Indonesia-MTEB | Indonesian | 50-100+ | 8 | 🎯 Planned | 🎯 100% |
Positioning: Indonesia-MTEB will be the first dedicated Indonesian embedding benchmark with full MTEB methodology compatibility and comprehensive task coverage.
5. Regional MTEB Precedents¶
5.1 Successful Regional Benchmarks¶
Analysis of existing regional MTEB implementations provides valuable methodological precedents:
C-MTEB (Chinese Massive Text Embedding Benchmark)¶
Specification:
| Aspect | Details |
|---|---|
| Language | Chinese (Simplified & Traditional) |
| Scale | 35 datasets, 6 task categories |
| Paper | Xiao et al. (2023) - "C-Pack: Packed Resources For General Chinese Embeddings" |
| Citations | 1,171+ (as of 2024) |
| Repository | HuggingFace C-MTEB collection |
| Key Innovation | C-Pack: bundled benchmark (C-MTEB), training corpus (C-MTP), and BGE embedding models |
Methodological Insights for Indonesia-MTEB:
- Emphasis on domain diversity (news, medical, legal, e-commerce)
- Separate evaluation for Simplified vs. Traditional variants
- Comprehensive baseline evaluation (30+ models)
VN-MTEB (Vietnamese Massive Text Embedding Benchmark)¶
Specification:
| Aspect | Details |
|---|---|
| Language | Vietnamese |
| Scale | ~30 datasets, multi-task |
| Paper | Pham et al. (2025) - arXiv:2507.21500 |
| Publication Date | July 2025 |
| Key Focus | Toxicity detection, online content moderation |
| Repository | GreenNode/VN-MTEB collection on HuggingFace |
Methodological Insights for Indonesia-MTEB:
- Recent publication demonstrates viability of new language benchmarks
- Domain-specific focus (toxicity) as novel contribution
- Community-driven model collection approach
SEA-BED (Southeast Asia Embedding Benchmark)¶
Specification:
| Aspect | Details |
|---|---|
| Languages | 10 SEA languages (Indonesian, Thai, Vietnamese, etc.) |
| Scale | 169 datasets, 9 tasks |
| Paper | Ponwitayarat et al. (2025) - arXiv:2508.12243 |
| Publication Date | August 2025 |
| Novelty | 87% of datasets not in MMTEB |
| Human Annotations | 71% human-formulated datasets |
Methodological Insights for Indonesia-MTEB:
- Demonstrates regional benchmark viability
- High proportion of novel (non-MMTEB) datasets validates unique regional needs
- Human annotation emphasis for quality control
5.2 Lessons Learned for Indonesia-MTEB¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ METHODOLOGICAL BEST PRACTICES FROM REGIONAL MTEBS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ FROM C-MTEB (Chinese) │
│ ──────────────────────── │
│ ✓ Domain diversity is critical for comprehensive evaluation │
│ ✓ Multi-grained evaluation (character, word, sentence level) │
│ ✓ Comprehensive baseline evaluation establishes performance floor │
│ │
│ FROM VN-MTEB (Vietnamese) │
│ ──────────────────────────────── │
│ ✓ Domain-specific focus can be a novel contribution │
│ ✓ Community-driven model collection accelerates adoption │
│ ✓ HuggingFace integration maximizes accessibility │
│ │
│ FROM SEA-BED (Southeast Asia) │
│ ──────────────────────────────────────── │
│ ✓ Regional datasets often differ from global MTEB - prioritize novelty │
│ ✓ High human annotation ratio ensures quality │
│ ✓ Language-specific challenges (agglutinative morphology, etc.) warrant │
│ specialized datasets │
│ │
│ INDONESIA-MTEB SYNTHESIS │
│ ──────────────────────── │
│ ✓ Combine domain diversity with Indonesian-specific focus │
│ ✓ Emphasize morphological complexity in dataset design │
│ ✓ High human validation ratio (minimum 10% of translated data) │
│ ✓ Full HuggingFace + MTEB ecosystem integration │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
6. Project Scope & Deliverables¶
6.1 In-Scope Deliverables¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ INDONESIA-MTEB DELIVERABLES │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ╔═════════════════════════════════════════════════════════════════════════╗│
│ ║ DELIVERABLE 1: DATASET SUITE ║│
│ ║ ────────────────────────── ║│
│ ║ Specification: ║│
│ ║ • All 8 MTEB task categories covered ║│
│ ║ • Minimum 50 datasets (target: 100+) ║│
│ ║ • Train/validation/test splits for supervised tasks ║│
│ ║ • Metadata documentation (license, source, creation method) ║│
│ ║ ║│
│ ║ Data Sources: ║│
│ ║ • Aggregation: ~20-30 existing Indonesian datasets ║│
│ ║ • Translation: ~40-60 translated MTEB datasets ║│
│ ║ • AI-Generated: ~10-20 novel Indonesian datasets ║│
│ ║ ║│
│ ║ Format: ║│
│ ║ • HuggingFace datasets format ║│
│ ║ • MTEB-compatible metadata ║│
│ ║ • Comprehensive documentation cards ║│
│ ╚═════════════════════════════════════════════════════════════════════════╝│
│ │
│ ╔═════════════════════════════════════════════════════════════════════════╗│
│ ║ DELIVERABLE 2: EVALUATION FRAMEWORK ║│
│ ║ ───────────────────────────────── ║│
│ ║ Components: ║│
│ ║ • MTEB-compatible evaluation script ║│
│ ║ • Indonesian-specific metric calculations ║│
│ ║ • Baseline model evaluations (10+ models) ║│
│ ║ • Leaderboard integration (HuggingFace Spaces) ║│
│ ║ • Reproducibility guarantees ║│
│ ║ ║│
│ ║ Models for Baseline Evaluation: ║│
│ ║ • Multilingual: E5, BGE, GTE, jina (current SOTA) ║│
│ ║ • Indonesian-specific: LazarusNLP models ║│
│ ║ • General: sentence-transformers baselines ║│
│ ╚═════════════════════════════════════════════════════════════════════════╝│
│ │
│ ╔═════════════════════════════════════════════════════════════════════════╗│
│ ║ DELIVERABLE 3: PYTHON PACKAGE ║│
│ ║ ───────────────────────────── ║│
│ ║ Package: indonesiamteb (PyPI) ║│
│ ║ Features: ║│
│ ║ • pip install indonesiamteb ║│
│ ║ • Easy dataset loading: load_benchmark(task_name) ║│
│ ║ • One-line evaluation: evaluate(model, benchmark) ║│
│ ║ • Leaderboard submission tools ║│
│ ║ • Comprehensive documentation ║│
│ ╚═════════════════════════════════════════════════════════════════════════╝│
│ │
│ ╔═════════════════════════════════════════════════════════════════════════╗│
│ ║ DELIVERABLE 4: RESEARCH PAPER ║│
│ ║ ────────────────────────── ║│
│ ║ Target Venue: ACL/EMNLP/NAACL dataset track ║│
│ ║ Sections: ║│
│ ║ • Abstract & Introduction ║│
│ ║ • Background & Related Work (MTEB, Indonesian NLP, regional MTEBs) ║│
│ ║ • Methodology (data acquisition, translation, generation) ║│
│ ║ • Dataset descriptions (all datasets with statistics) ║│
│ ║ • Baseline evaluation results ║│
│ ║ • Cross-lingual analysis (ID ↔ EN performance) ║│
│ ║ • Limitations & Ethics ║│
│ ║ • Conclusion & Future Work ║│
│ ╚═════════════════════════════════════════════════════════════════════════╝│
│ │
└─────────────────────────────────────────────────────────────────────────────┘
6.2 Out-of-Scope (Explicitly Excluded)¶
| Excluded | Reason | Alternative Approach |
|---|---|---|
| Training new embedding models | Focus is benchmark, not models | Evaluate existing models; model training is separate project |
| Domain-specific evaluation | Keep benchmark general-purpose | Domain-specific datasets included but benchmark remains general |
| Indonesian local languages (Javanese, Sundanese, etc.) | Focus on Bahasa Indonesia first | Future expansion to regional languages |
| Real-time leaderboard hosting | Infrastructure scope | HuggingFace Spaces integration; no independent hosting |
| Commercial applications | Research focus | Open-source for community use |
6.3 Success Criteria¶
| Metric | Target | Measurement Method |
|---|---|---|
| Task Coverage | All 8 MTEB categories | Dataset inventory |
| Dataset Count | Minimum 50, target 100+ | Final dataset count |
| Translation Quality | ≥ 85% human acceptance rate | Human validation on 10% sample |
| Baseline Models | ≥ 10 models evaluated | Evaluation results |
| Publication | ACL/EMNLP/NAACL dataset paper | Acceptance notification |
| MTEB Integration | Official integration into MTEB | Pull request acceptance |
| Package Usage | ≥ 50 monthly downloads (6 months post-release) | PyPI statistics |
| Community Adoption | ≥ 5 models use Indonesia-MTEB for evaluation | Leaderboard, GitHub citations |
7. Research Questions¶
7.1 Primary Research Questions¶
RQ1: Gap Analysis & State of the Art
What is the current state of Indonesian embedding evaluation, and what specific gaps exist compared to MTEB standards?
Sub-questions:
- RQ1.1: Which Indonesian NLP datasets exist and what is their MTEB compatibility?
- RQ1.2: What task categories are currently underrepresented for Indonesian?
- RQ1.3: How do existing Indonesian embedding models perform on MTEB-style evaluations?
RQ2: Translation Methodology
How can we effectively translate MTEB datasets to Indonesian while preserving semantic equivalence and task validity?
Sub-questions:
- RQ2.1: Which translation model (TranslateGemma, NLLB, mT5) achieves optimal quality for Indonesian?
- RQ2.2: What quality control mechanisms (human validation, LLM-as-judge) ensure semantic preservation?
- RQ2.3: How does translation impact embedding model performance relative to original English datasets?
RQ3: Novel Dataset Generation
What novel Indonesian embedding tasks can be created via AI generation that fill unique gaps not addressed by translation or aggregation?
Sub-questions:
- RQ3.1: Which task categories remain underserved after aggregation and translation?
- RQ3.2: How can LLMs generate high-quality Indonesian datasets with statistical consistency?
- RQ3.3: What Indonesian-specific linguistic phenomena should novel datasets target?
RQ4: Baseline Evaluation
How do existing embedding models (multilingual and Indonesian-specific) perform on a unified Indonesian benchmark across all 8 task categories?
Sub-questions:
- RQ4.1: Which model architectures excel on which task types for Indonesian?
- RQ4.2: How does Indonesian performance correlate with performance on other languages?
- RQ4.3: What performance gaps exist between multilingual and Indonesian-specific models?
RQ5: Cross-Lingual Analysis
What does Indonesia-MTEB reveal about cross-lingual embedding capabilities and transfer learning to Indonesian?
Sub-questions:
- RQ5.1: How do models trained on English/other languages transfer to Indonesian?
- RQ5.2: What is the performance gap between monolingual Indonesian and multilingual models?
- RQ5.3: Can Indonesia-MTEB inform embedding model design for other agglutinative languages?
8. Proposed Methodology¶
8.1 Phase Overview¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ INDONESIA-MTEB METHODOLOGY PHASES │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ PHASE 1: AGGREGATION PHASE 2: TRANSLATION │
│ ────────────────────── ──────────────────── │
│ │ │ │
│ │ • Dataset discovery │ • MTEB dataset selection │
│ │ • Format conversion │ • Translation model benchmark │
│ │ • Quality assessment │ • Batch translation │
│ │ • MTEB compatibility check │ • Quality control pipeline │
│ │ │ • Human validation (10% sample) │
│ │ │ │
│ └──────────────────────────────┘ └──────────────────────────────────────┘
│ │ │ │
│ ▼ ▼ │
│ │
│ ╔═══════════════════════════════════════════════════════════════════════╗│
│ ║ PHASE 3: NOVEL DATASET GENERATION ║│
│ ║ ───────────────────────────────────── ║│
│ ║ ║│
│ ║ • Gap identification (post-aggregation + translation) ║│
│ ║ • LLM prompt engineering for dataset generation ║│
│ ║ • Domain-specific dataset creation (legal, medical, etc.) ║│
│ ║ • Statistical consistency validation ║│
│ ║ • Human expert review ║│
│ ║ ║│
│ ╚═══════════════════════════════════════════════════════════════════════╝│
│ │ │
│ ▼ │
│ │
│ ╔═══════════════════════════════════════════════════════════════════════╗│
│ ║ PHASE 4: INTEGRATION & VALIDATION ║│
│ ║ ────────────────────────────────────── ║│
│ ║ ║│
│ ║ • Unified dataset format validation ║│
│ ║ • Baseline model evaluation (10+ models) ║│
│ ║ • Statistical analysis of results ║│
│ ║ • Cross-lingual comparison ║│
│ ║ • Leaderboard deployment ║│
│ ║ • Paper writing and submission ║│
│ ║ ║│
│ ╚═══════════════════════════════════════════════════════════════════════╝│
│ │
└─────────────────────────────────────────────────────────────────────────────┘
8.2 Phase 1: Aggregation - Detailed Methodology¶
Objective: Identify, convert, and validate existing Indonesian datasets for MTEB compatibility.
Step 1: Dataset Discovery
| Source | Datasets of Interest | MTEB Category |
|---|---|---|
| IndoNLU | SMSA, EmoT, etc. | Classification |
| NusaX | Sentiment (10 languages) | Classification |
| IndoMMLU | Knowledge QA | Classification |
| MIRACL-ID | Wikipedia retrieval | Retrieval |
| SEACrowd | Various tasks | Multiple |
Step 2: Format Conversion
- Target Format: HuggingFace datasets with MTEB-specific schema
- Required Fields: text, label, split (train/validation/test)
- Metadata: license, source language, domain, creation date
Step 3: Quality Assessment
- Check for data leakage between splits
- Verify label distribution balance
- Assess text quality (encoding issues, noise)
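The sketch below illustrates the kind of conversion and leakage check intended in Steps 2 and 3. The source file name, column names, split ratios, and target repository id are placeholders, since schemas vary per source dataset.

```python
# Sketch: convert a source classification dataset to an MTEB-style
# HuggingFace DatasetDict and check for text leakage across splits.
# File name, column names, split ratios, and repo id are illustrative placeholders.
from datasets import load_dataset, DatasetDict

raw = load_dataset("csv", data_files="source_dataset.csv")["train"]
raw = raw.rename_columns({"sentence": "text", "category": "label"})

# 80/10/10 split with a fixed seed for reproducibility.
train_rest = raw.train_test_split(test_size=0.2, seed=42)
val_test = train_rest["test"].train_test_split(test_size=0.5, seed=42)
dataset = DatasetDict(
    {"train": train_rest["train"], "validation": val_test["train"], "test": val_test["test"]}
)

# Leakage check: identical texts must not appear in more than one split.
seen = {split: set(dataset[split]["text"]) for split in dataset}
leaked = (seen["train"] & seen["test"]) | (seen["train"] & seen["validation"])
assert not leaked, f"{len(leaked)} texts leak across splits"

dataset.push_to_hub("indonesia-mteb/example-classification")  # placeholder repo id
```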
8.3 Phase 2: Translation - Detailed Methodology¶
Objective: Translate selected MTEB datasets to Indonesian with semantic preservation.
Step 1: Translation Model Selection
| Model | Parameters | Languages | Strength | Weakness |
|---|---|---|---|---|
| TranslateGemma | 4B / 12B / 27B | 55 | Latest, optimized | New (2026) |
| NLLB-200 | 3.3B | 200 | Proven quality | Older architecture |
| mT5 | 580M / 1.1B | 101 | Flexible | Requires fine-tuning |
| SeamlessM4T | 2.3B | 100 | Multimodal | Overkill for text-only |
Step 2: Translation Pipeline
┌─────────────────────────────────────────────────────────────────────────────┐
│ TRANSLATION QUALITY CONTROL PIPELINE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ SOURCE TEXT (English MTEB dataset) │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ AUTOMATED TRANSLATION (TranslateGemma 12B) │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ LLM-AS-JUDGE VALIDATION (GPT-4 / Claude) │ │
│ │ Criteria: │ │
│ │ • Semantic equivalence (1-5 scale) │ │
│ │ • Grammatical correctness │ │
│ │ • Cultural appropriateness │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ├──────────────┬──────────────┐ │
│ ▼ ▼ ▼ │
│ ACCEPT REJECT FLAG │
│ │ │ │ │
│ │ │ ▼ │
│ │ │ ┌───────────────────┐ │
│ │ │ │ HUMAN REVIEW │ │
│ │ │ │ (10% sample) │ │
│ │ │ └───────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌───────────────────────────────────────────────────┐ │
│ │ FINAL INDONESIAN DATASET │ │
│ │ • Accepted translations │ │
│ │ • Human-reviewed corrections │ │
│ │ • Quality score metadata │ │
│ └───────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Step 3: Quality Metrics
- Semantic Preservation Score: LLM-as-judge rating (1-5)
- Acceptance Threshold: ≥ 4.0/5.0
- Human Validation Rate: 10% random sample + all flagged items
- Target Human Acceptance Rate: ≥ 85%
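A minimal sketch of the LLM-as-judge step is shown below. The prompt wording, score parsing, and the `judge` callable are assumptions; in practice the callable would wrap whichever LLM is chosen (GPT-4, Claude, or a local model), and the 3.0 flag threshold is illustrative.

```python
# Sketch of LLM-as-judge scoring for translation quality (Phase 2).
# `judge` is a placeholder for a call to the chosen LLM; the 4.0 acceptance
# threshold mirrors the criteria above, while the 3.0 flag threshold is assumed.
import re
from typing import Callable

JUDGE_PROMPT = """Rate the Indonesian translation of the English source on a 1-5 scale
for semantic equivalence, grammatical correctness, and cultural appropriateness.
Answer with a single number.

English: {source}
Indonesian: {translation}
Score:"""

def score_translation(source: str, translation: str, judge: Callable[[str], str]) -> float:
    """Return the judge's 1-5 score, or 0.0 if no number can be parsed."""
    reply = judge(JUDGE_PROMPT.format(source=source, translation=translation))
    match = re.search(r"[1-5](?:\.\d+)?", reply)
    return float(match.group()) if match else 0.0

def triage(score: float, accept_threshold: float = 4.0, flag_threshold: float = 3.0) -> str:
    """Route a translation to ACCEPT, FLAG (human review), or REJECT."""
    if score >= accept_threshold:
        return "ACCEPT"
    return "FLAG" if score >= flag_threshold else "REJECT"
```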
8.4 Phase 3: AI Generation - Detailed Methodology¶
Objective: Create novel Indonesian datasets for underserved tasks.
Target Domains:
| Domain | Rationale | Task Category |
|---|---|---|
| Legal | Complex morphology in legal texts | Classification, Retrieval |
| Healthcare | Technical terminology, code-switching | STS, Classification |
| Finance | Numeral expressions, named entities | Clustering, Pair Classification |
| Social Media | Informal language, slang | Sentiment, STS |
| News | Formal Indonesian, topic diversity | Clustering, Retrieval |
Generation Methodology:
- Gap Identification: Analyze coverage after Phases 1-2
- Prompt Engineering: Design prompts for LLM dataset generation
- Iterative Generation: Generate, validate, refine
- Statistical Checks: Label distribution, text length, vocabulary diversity
- Human Review: Domain expert validation
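The statistical checks listed above could be automated along the lines of the following sketch; the metric choices and any thresholds later applied to them are placeholders to be tuned per task.

```python
# Sketch: basic statistical consistency checks for an AI-generated dataset.
# Metric choices are illustrative; acceptance thresholds would be set per task.
from collections import Counter

def consistency_report(texts: list[str], labels: list[str]) -> dict:
    label_counts = Counter(labels)
    lengths = [len(t.split()) for t in texts]
    vocab = {token.lower() for t in texts for token in t.split()}
    return {
        "label_distribution": dict(label_counts),
        # Imbalance ratio: most frequent label vs. least frequent label.
        "label_imbalance": max(label_counts.values()) / min(label_counts.values()),
        "mean_length_tokens": sum(lengths) / len(lengths),
        # Type-token ratio as a rough vocabulary-diversity signal.
        "type_token_ratio": len(vocab) / sum(lengths),
        "duplicate_texts": len(texts) - len(set(texts)),
    }

report = consistency_report(
    ["Harga saham naik tajam hari ini.", "Pasien mengeluh demam tinggi."],
    ["finance", "health"],
)
print(report)
```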
8.5 Phase 4: Integration & Validation - Detailed Methodology¶
Objective: Unify all datasets, evaluate baselines, and publish.
Step 1: Unified Format Validation
- Schema validation across all datasets
- Consistent metadata formatting
- HuggingFace dataset card generation
Step 2: Baseline Evaluation
┌─────────────────────────────────────────────────────────────────────────────┐
│ BASELINE MODEL EVALUATION MATRIX │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────┐ │
│ │ MODELS TO EVALUATE │ │
│ ├─────────────────────────────────────┤ │
│ │ Multilingual: │ │
│ │ • E5-large-v2 │ │
│ │ • bge-m3 (multilingual) │ │
│ │ • gte-large │ │
│ │ • jina-embeddings-v3 │ │
│ │ │ │
│ │ Indonesian-Specific: │ │
│ │ • LazarusNLP/indonesian-sbert... │ │
│ │ • (others from HuggingFace) │ │
│ │ │ │
│ │ Baselines: │ │
│ │ • sentence-transformers/LaBSE │ │
│ │ • sentence-transformers/distiluse │ │
│ └─────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ TASK CATEGORIES │ │
│ ├─────────────────────────────────────┤ │
│ │ 1. Classification │ 5. Retrieval │ │
│ │ 2. Clustering │ 6. STS │ │
│ │ 3. Pair Class. │ 7. Summariz. │ │
│ │ 4. Reranking │ 8. Instr. Fol. │ │
│ └─────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ OUTPUT │ │
│ ├─────────────────────────────────────┤ │
│ │ • Per-task performance scores │ │
│ │ • Aggregate benchmark score │ │
│ │ • Cross-lingual comparisons │ │
│ │ • Leaderboard rankings │ │
│ └─────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Step 3: Statistical Analysis
- Mean performance across models per task
- Performance variance analysis
- Correlation between tasks (task similarity)
- Cross-lingual performance correlation
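A sketch of this analysis using pandas is shown below. The results layout (models as rows, task categories as columns) and all numbers are placeholders illustrating the intended computations, not real scores.

```python
# Sketch: aggregate analysis of baseline results (models x task categories).
# All values are placeholders, not actual evaluation results.
import pandas as pd

results = pd.DataFrame(
    {"Classification": [0.71, 0.68, 0.65], "Retrieval": [0.55, 0.60, 0.48], "STS": [0.78, 0.74, 0.70]},
    index=["multilingual-e5-large", "bge-m3", "LaBSE"],
)

# Mean and variance per task across models.
print(results.mean(axis=0))
print(results.var(axis=0))

# Spearman correlation between tasks: do models that rank well on one task
# also rank well on another?
print(results.corr(method="spearman"))

# Aggregate benchmark score per model (unweighted mean over tasks).
print(results.mean(axis=1).sort_values(ascending=False))
```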
9. Technical Architecture¶
9.1 System Architecture¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ INDONESIA-MTEB TECHNICAL ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ DATA LAYER │ │
│ │ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐ │ │
│ │ │ HuggingFace │ │ Source Files │ │ Generated Data │ │ │
│ │ │ Datasets Hub │ │ (IndoNLU, etc) │ │ (AI-created) │ │ │
│ │ └────────────────┘ └────────────────┘ └────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ PROCESSING LAYER │ │
│ │ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐ │ │
│ │ │ Format │ │ Translation │ │ Quality │ │ │
│ │ │ Converters │ │ Pipeline │ │ Validation │ │ │
│ │ └────────────────┘ └────────────────┘ └────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ EVALUATION LAYER │ │
│ │ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐ │ │
│ │ │ MTEB Core │ │ Custom Metrics │ │ Statistical │ │ │
│ │ │ Evaluator │ │ (ID-specific) │ │ Analysis │ │ │
│ │ └────────────────┘ └────────────────┘ └────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ PRESENTATION LAYER │ │
│ │ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐ │ │
│ │ │ HuggingFace │ │ PyPI Package │ │ Documentation │ │ │
│ │ │ Spaces │ │ CLI/API │ │ Site │ │ │
│ │ └────────────────┘ └────────────────┘ └────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
9.2 Python Package Structure¶
indonesiamteb/
├── indonesiamteb/
│ ├── __init__.py
│ ├── data/
│ │ ├── __init__.py
│ │ ├── loading.py # Dataset loading utilities
│ │ └── metadata.py # Dataset metadata registry
│ ├── tasks/
│ │ ├── __init__.py
│ │ ├── classification.py # Classification task wrappers
│ │ ├── clustering.py # Clustering task wrappers
│ │ ├── retrieval.py # Retrieval task wrappers
│ │ ├── sts.py # STS task wrappers
│ │ └── ... # Other task types
│ ├── evaluation/
│ │ ├── __init__.py
│ │ ├── evaluator.py # Main evaluation class
│ │ ├── metrics.py # Custom metrics
│ │ └── leaderboard.py # Leaderboard utilities
│ └── utils/
│ ├── __init__.py
│ ├── translation.py # Translation utilities
│ └── validation.py # Quality validation
├── benchmarks/
│ ├── classification/
│ ├── clustering/
│ └── ... # Dataset implementations
├── tests/
│ ├── test_data.py
│ ├── test_evaluation.py
│ └── test_tasks.py
├── examples/
│ ├── basic_usage.py
│ └── custom_evaluation.py
├── setup.py
├── pyproject.toml
└── README.md
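The intended end-user workflow, matching the features listed under Deliverable 3, might look like the sketch below. The `load_benchmark` / `evaluate` names mirror the planned API and are design targets rather than a published package; the model id is a placeholder.

```python
# Planned usage sketch for the future `indonesiamteb` package (design target,
# not yet on PyPI); function names follow Deliverable 3.
# pip install indonesiamteb
from indonesiamteb import load_benchmark, evaluate
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("LazarusNLP/all-indo-e5-small-v4")  # placeholder model id

# Load one task category (or the full suite) as MTEB-compatible tasks.
benchmark = load_benchmark("classification")

# One-line evaluation returning per-dataset and aggregate scores.
results = evaluate(model, benchmark)
print(results)
```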
10. Success Criteria¶
10.1 Quantitative Metrics¶
| Metric | Minimum | Target | Stretch |
|---|---|---|---|
| Total Datasets | 50 | 100 | 150+ |
| Task Coverage | 8/8 categories | 8/8 categories | 8/8 categories |
| Translation Acceptance Rate | 85% | 90% | 95% |
| Baseline Models Evaluated | 10 | 15 | 20+ |
| MTEB Integration | Official submission | Accepted | Featured |
| PyPI Downloads (6 months) | 50 | 500 | 1000+ |
| Community Adoptions | 3 models | 10 models | 20+ models |
| Paper Citations (1 year) | 5 | 20 | 50+ |
10.2 Qualitative Milestones¶
- All datasets pass quality validation
- Baseline evaluation complete with documented results
- Python package published on PyPI
- HuggingFace Spaces leaderboard deployed
- Research paper submitted to top-tier venue
- Community engagement (GitHub stars, forks, discussions)
- Integration with MTEB main repository
11. Timeline & Milestones¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ INDONESIA-MTEB PROJECT TIMELINE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ MONTH 1-2: FOUNDATION │
│ ═══════════════════ │
│ ✓ Literature review complete │
│ ✓ Dataset inventory finalized │
│ ✓ Translation model benchmark selected │
│ ✓ Technical architecture designed │
│ │
│ MONTH 3-4: DATA ACQUISITION │
│ ════════════════════════ │
│ ✓ Phase 1: Aggregation complete (20-30 datasets) │
│ ✓ Phase 2: Translation pipeline operational │
│ ✓ Phase 3: AI generation begins │
│ │
│ MONTH 5-6: DATASET COMPLETION │
│ ═════════════════════════ │
│ ✓ Translation complete (40-60 datasets) │
│ ✓ AI-generated datasets complete (10-20 datasets) │
│ ✓ Quality validation complete │
│ │
│ MONTH 7-8: EVALUATION │
│ ═════════════════ │
│ ✓ Baseline model evaluations (10+ models) │
│ ✓ Statistical analysis complete │
│ ✓ Cross-lingual comparison complete │
│ │
│ MONTH 9-10: PACKAGE & PAPER │
│ ════════════════════════ │
│ ✓ Python package development complete │
│ ✓ HuggingFace integration complete │
│ ✓ Research paper drafted │
│ │
│ MONTH 11-12: PUBLICATION │
│ ═════════════════════ │
│ ✓ Paper submitted to target venue │
│ ✓ PyPI package released │
│ ✓ Leaderboard deployed │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
12. References¶
12.1 Primary Sources¶
- Muennighoff, N., et al. (2023). "MTEB: Massive Text Embedding Benchmark". Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2023). arXiv:2210.07316
- Enevoldsen, K., et al. (2025). "MMTEB: Massive Multilingual Text Embedding Benchmark". International Conference on Learning Representations (ICLR 2025). arXiv:2502.13595
- Xiao, S., et al. (2023). "C-Pack: Packed Resources For General Chinese Embeddings". arXiv:2309.07597
12.2 Regional Benchmarks¶
- Ponwitayarat, W., et al. (2025). "SEA-BED: Southeast Asia Embedding Benchmark". arXiv:2508.12243
- Pham, L., et al. (2025). "VN-MTEB: Vietnamese Massive Text Embedding Benchmark". arXiv:2507.21500
12.3 Indonesian NLP Resources¶
- Wilie, B., et al. (2020). "IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding". arXiv:2009.05387
- Winata, G., et al. (2022). "NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages". arXiv:2205.15960
- Lovenia, H., et al. (2024). "SEACrowd: A Multilingual Multimodal Data Hub and Benchmark for Southeast Asian Languages". Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024).
12.4 Translation Models¶
- Finkelstein, M., et al. (2026). "TranslateGemma Technical Report". arXiv:2601.09012
- NLLB Team (2022). "No Language Left Behind: Scaling Human-Centered Machine Translation". arXiv:2207.04872
12.5 Evaluation Methodology¶
- Rosenberg, A., & Hirschberg, J. (2007). "V-Measure: A Conditional Entropy-Based External Cluster Evaluation Measure". EMNLP-CoNLL 2007.
- Humeun, L., et al. (2025). "HUME: Measuring the Human-Model Performance Gap in Text Embedding Tasks". arXiv:2510.10062
13. Document Status¶
[!NOTE] Next Document: Document 02 - MTEB Structure Analysis
This document provides detailed analysis of MTEB's internal structure, dataset formats, evaluation protocols, and integration requirements for Indonesia-MTEB.
Change Log:
| Version | Date | Changes | Author |
|---|---|---|---|
| 1.0 | 2026-01-25 | Initial version | Research Team |
| 2.0 | 2026-01-25 | Enhanced edition with expanded sections, latest research | Research Team |
This document is a living record. Updated as research progresses.