
Project: Indonesia-MTEB Benchmark
Document: 01 - Project Overview & Scope Definition
Version: 2.0 (Enhanced Edition)
Last Updated: 2026-01-25
Status: Research Phase - Foundation Planning


[!NOTE]

Document Navigation

This is the first of twelve documents comprising the Indonesia-MTEB Benchmark research foundation. Each document builds upon the previous, establishing a comprehensive blueprint for creating Indonesia's first unified text embedding benchmark following MTEB methodology.

| Document | Title | Focus Area |
|----------|-------|------------|
| 01 | Project Overview & Scope | Current Document |
| 02 | MTEB Structure Analysis | Framework deep-dive |
| 03 | Existing Indonesian Datasets | Data aggregation sources |
| 04 | Regional MTEB Methodologies | Precedent analysis |
| 05 | Translation Models Benchmark | Model selection & evaluation |
| 06 | AI Dataset Generation Methods | Novel data creation |
| 07 | Validation Strategies | Quality assurance protocols |
| 08 | ACL Dataset Paper Standards | Publication requirements |
| 09 | Novelty Angle & Publication | Research contribution |
| 10 | Implementation Roadmap | Technical execution plan |
| 11 | Python Package Development | Software architecture |
| 12 | Summary & Quick Reference | Consolidated reference |

Indonesia-MTEB: A Comprehensive Text Embedding Benchmark for Indonesian

"The absence of a unified embedding benchmark for Indonesian represents a critical gap in Southeast Asian NLP infrastructure. With 280+ million speakers, Indonesian ranks among the world's most spoken languages, yet remains systematically underrepresented in embedding evaluation frameworks."


Table of Contents

  1. Executive Summary
  2. The Indonesian Language Context
  3. Background: The MTEB Framework
  4. The Gap Analysis
  5. Regional MTEB Precedents
  6. Project Scope & Deliverables
  7. Research Questions
  8. Proposed Methodology
  9. Technical Architecture
  10. Success Criteria
  11. Timeline & Milestones
  12. References

1. Executive Summary

1.1 The Problem Statement

The Massive Text Embedding Benchmark (MTEB) has emerged as the dominant evaluation framework for text embedding models globally. Since its introduction at EACL 2023, MTEB has expanded dramatically through the MMTEB (Massive Multilingual Text Embedding Benchmark) initiative at ICLR 2025 and now encompasses:

| Milestone | Scale | Languages | Datasets |
|-----------|-------|-----------|----------|
| MTEB Original (EACL 2023) | Foundational | 112 | 58 |
| MMTEB (ICLR 2025) | Community-driven | 1,000+ | 500+ |
| Current (2026) | Production | 1,000+ | 1,308+ |

However, Indonesian language coverage remains fragmented and insufficient for rigorous embedding evaluation:

┌─────────────────────────────────────────────────────────────────────────────┐
│                    INDONESIAN EMBEDDING EVALUATION GAP                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   GLOBAL MTEB                    │   INDONESIAN STATUS                      │
│   ──────────────                  │   ─────────────────                      │
│   ✓ 8 Task Categories            │   ✗ No unified Indonesian benchmark     │
│   ✓ 500+ Quality-controlled tasks │   ✗ Scattered individual datasets       │
│   ✓ Standardized metrics         │   ✗ No embedding-specific evaluation    │
│   ✓ Active leaderboard           │   ✗ No Indonesian embedding leaderboard │
│   ✓ Community governance         │   ✗ No centralized benchmark hub        │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

The Critical Gap: Despite being the 11th most spoken language globally with 280+ million speakers and serving as the lingua franca of Southeast Asia, Indonesian lacks a dedicated, comprehensive embedding benchmark following MTEB standards.

1.2 Research Objective

Primary Goal: Create Indonesia-MTEB — a unified, comprehensive Indonesian text embedding benchmark following MTEB methodology, covering all 8 MTEB task categories with a minimum of 50 datasets (target: 100+).

Three-Pronged Data Strategy:

┌─────────────────────────────────────────────────────────────────────────────┐
│                    INDONESIA-MTEB DATASET ACQUISITION STRATEGY              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ╔═════════════════════════════════════════════════════════════════════════╗│
│  ║  PHASE 1: AGGREGATION                                                     ║│
│  ║  ─────────────────                                                        ║│
│  ║  • Identify and catalog existing Indonesian NLP datasets                  ║│
│  ║  • Convert to MTEB-compatible format                                      ║│
│  ║  • Sources: IndoNLU, NusaX, IndoMMLU, MIRACL-ID, SEACrowd                ║│
│  ║  • Expected Coverage: ~20-30 datasets                                    ║│
│  ╚═════════════════════════════════════════════════════════════════════════╝│
│                                     │                                         │
│                                     ▼                                         │
│  ╔═════════════════════════════════════════════════════════════════════════╗│
│  ║  PHASE 2: TRANSLATION                                                      ║│
│  ║  ─────────────────                                                        ║│
│  ║  • Full MTEB benchmark translation to Indonesian                          ║│
│  ║  • Primary Model: TranslateGemma (4B/12B) - 55 language support          ║│
│  ║  • Alternative: NLLB-200, mT5, Bloom                                     ║│
│  ║  • Quality Control: LLM-as-judge + Human validation (10% sample)         ║│
│  ║  • Expected Coverage: ~40-60 datasets                                    ║│
│  ╚═════════════════════════════════════════════════════════════════════════╝│
│                                     │                                         │
│                                     ▼                                         │
│  ╔═════════════════════════════════════════════════════════════════════════╗│
│  ║  PHASE 3: AI-GENERATED DATASETS                                           ║│
│  ║  ─────────────────────────────                                           ║│
│  ║  • Identify task gaps after Phase 1 + 2                                  ║│
│  ║  • Generate novel Indonesian datasets using LLMs                          ║│
│  ║  • Domains: Legal, Healthcare, Finance, Social Media                     ║│
│  ║  • Validation: Statistical consistency + Human expert review             ║│
│  ║  • Expected Coverage: ~10-20 novel datasets                               ║│
│  ╚═════════════════════════════════════════════════════════════════════════╝│
│                                                                              │
│                                     ▼                                         │
│  ╔═════════════════════════════════════════════════════════════════════════╗│
│  ║  INTEGRATION & VALIDATION                                                 ║│
│  ║  ─────────────────────────────                                           ║│
│  ║  • Unified dataset format validation                                     ║│
│  ║  • Baseline model evaluation on all tasks                                ║│
│  ║  • Leaderboard integration with MTEB ecosystem                           ║│
│  ║  • Publication: ACL/EMNLP/NAACL dataset paper                            ║│
│  ╚═════════════════════════════════════════════════════════════════════════╝│
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

1.3 Key Contributions

| Contribution Type | Description | Impact |
|-------------------|-------------|--------|
| Infrastructure | First unified Indonesian embedding benchmark | Enables systematic model comparison |
| Methodological | Three-pronged data acquisition framework | Replicable for other low-resource languages |
| Empirical | Baseline evaluation of existing models | Establishes performance floor |
| Community | Open-source Python package | Democratizes access to embedding evaluation |

2. The Indonesian Language Context

2.1 Demographic Significance

Understanding the scale and importance of Indonesian (Bahasa Indonesia) is essential for contextualizing this benchmark:

┌─────────────────────────────────────────────────────────────────────────────┐
│                  INDONESIAN LANGUAGE: KEY STATISTICS                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  SPEAKER COUNT                                                              │
│  ─────────────                                                              │
│  • Total Speakers:              ~280 million (2024)                          │
│  • Native Speakers:            ~42 million                                   │
│  • Second-Language Speakers:   ~238 million                                 │
│  • Global Ranking:             11th most spoken language                    │
│                                                                              │
│  GEOGRAPHIC DISTRIBUTION                                                     │
│  ───────────────────────                                                     │
│  • Primary Country:            Indonesia (4th most populous nation)         │
│  • ASEAN Presence:             Working language of ASEAN                    │
│  • Diaspora:                   Malaysia, Singapore, Netherlands, etc.       │
│                                                                              │
│  LINGUISTIC CONTEXT                                                          │
│  ───────────────────                                                          │
│  • Language Family:           Austronesian                                   │
│  • Script:                    Latin (Roman) alphabet                        │
│  • Regional Languages:        700+ indigenous languages in Indonesia        │
│  • Official Status:           Sole official language (since 1945)           │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

[!TIP] Why Indonesian Matters for AI: Indonesia is the largest economy in Southeast Asia and a rapidly growing digital market. With over 200 million internet users and a thriving startup ecosystem, Indonesian NLP capabilities have direct commercial and social impact.

2.2 Linguistic Characteristics Affecting Embeddings

Indonesian presents unique challenges for text embedding models due to its morphological and syntactic properties:

| Linguistic Feature | Description | Embedding Challenge |
|--------------------|-------------|---------------------|
| Agglutinative Morphology | Words change through affixation (prefixes, suffixes, infixes, circumfixes) | Embeddings must capture morphological variants |
| Reduplication | Complete or partial word repetition for plurality or emphasis | Creates vocabulary explosion |
| Productive Affixation | Thousands of possible affix combinations | Sparse embedding space for derived forms |
| Loanword Integration | Extensive borrowing from Dutch, Arabic, Sanskrit, English, Javanese | Requires cross-lingual alignment |
| Pro-Drop Language | Subject pronouns often omitted | Embeddings must infer referents from context |
| Formal vs. Informal Registers | Significant divergence between written and colloquial forms | Domain shift challenges |

Example of Agglutinative Complexity:

Root Word:     "tulis" (write)
               ├── "me-"          → "menulis"      (to write - active)
               │        └── "-kan" → "menuliskan"   (to write something down / write for someone)
               ├── "di-"          → "ditulis"      (to be written - passive)
               │        └── "-kan" → "dituliskan"   (to be written down / written for someone)
               ├── "pe-"          → "penulis"      (writer)
               ├── "pe-...-an"    → "penulisan"    (the act or process of writing)
               ├── "-an"          → "tulisan"      (a piece of writing)
               └── "ter-"         → "tertulis"     (written - stative)

[!NOTE] Implication for Embedding Benchmarks: Indonesian embedding models must demonstrate robustness across these morphological variations. A comprehensive benchmark must include datasets that specifically test these phenomena.
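To make this concrete, the sketch below probes how closely a multilingual encoder places morphological variants of a single root relative to unrelated control words. It is illustrative only: the checkpoint sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 and the word lists are example choices, not benchmark decisions.

```python
# Minimal sketch: check whether an embedding model keeps morphological variants
# of an Indonesian root close together in vector space.
from sentence_transformers import SentenceTransformer, util

# Any multilingual encoder can be substituted here; this checkpoint is an example.
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

variants = ["tulis", "menulis", "menuliskan", "ditulis", "penulis", "tulisan"]
unrelated = ["makan", "berlari", "gunung"]  # control words with different roots

emb = model.encode(variants + unrelated, convert_to_tensor=True, normalize_embeddings=True)
sims = util.cos_sim(emb, emb)

# Expectation: variants of "tulis" should be more similar to one another
# than to the unrelated control words.
for i, word in enumerate(variants + unrelated):
    row = [round(float(sims[i][j]), 2) for j in range(len(variants))]
    print(word, row)
```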

2.3 Current NLP Infrastructure in Indonesia

| Resource Type | Status | Notable Examples |
|---------------|--------|------------------|
| Pretrained Language Models | Emerging | IndoBERT, IndoBART, IndoGPT |
| Embedding Models | Limited | LazarusNLP collections (5-10 models) |
| NLU Benchmarks | Available | IndoNLU (12 tasks) |
| Embedding Benchmarks | None | This is the gap |
| Translation Models | Good | NLLB, SeamlessM4T, TranslateGemma |

3. Background: The MTEB Framework

3.1 What is MTEB?

MTEB (Massive Text Embedding Benchmark) is a standardized evaluation framework for text embedding models, introduced by Muennighoff et al. (2023) at EACL 2023 and significantly expanded through MMTEB at ICLR 2025.

Evolution Timeline:

┌─────────────────────────────────────────────────────────────────────────────┐
│                      MTEB EVOLUTIONARY TIMELINE                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  2022 (October)                                                             │
│  ════════════════                                                            │
│  • Original MTEB paper released (arXiv:2210.07316)                           │
│  • 58 datasets, 112 languages, 8 task categories                            │
│  • Establishes unified evaluation protocol                                  │
│                                                                              │
│  2023 (April)                                                                │
│  ═════════════                                                               │
│  • MTEB presented at EACL 2023 (Main Conference)                            │
│  • Paper: 1,400+ citations as of 2026                                       │
│  • HuggingFace integration launched                                         │
│                                                                              │
│  2024                                                                        │
│  ════                                                                        │
│  • Regional MTEBs emerge: C-MTEB (Chinese), AfriMTEB (African)              │
│  • Dataset count exceeds 1,000                                              │
│  • Leaderboard becomes industry standard                                    │
│                                                                              │
│  2025 (January)                                                              │
│  ═══════════════                                                             │
│  • MMTEB announced: Massive Multilingual expansion                          │
│  • 500+ tasks, 1,000+ languages                                             │
│  • Community-driven governance model                                        │
│                                                                              │
│  2025 (May)                                                                  │
│  ═════════════                                                               │
│  • MMTEB presented at ICLR 2025                                             │
│  • New task categories: Instruction Following, Long-Document Retrieval      │
│  • 86+ citations and growing rapidly                                        │
│                                                                              │
│  2026 (Current)                                                              │
│  ═════════════                                                               │
│  • 1,308+ datasets in production                                            │
│  • Active model submissions: 500+ models evaluated                          │
│  • Regional expansions: VN-MTEB, SEA-BED, others                            │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

3.2 The 8 MTEB Task Categories

Indonesia-MTEB will comprehensively cover all 8 MTEB task categories. Each category evaluates different aspects of embedding quality:

┌─────────────────────────────────────────────────────────────────────────────┐
│                    MTEB TASK CATEGORIES & EVALUATION METRICS                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │ 1. CLASSIFICATION                                                     │   │
│  │ ────────────────                                                     │   │
│  │ Task: Single-label text classification                              │   │
│  │ Metrics: Accuracy, F1-score (macro/micro)                           │   │
│  │ Example: Sentiment analysis, topic categorization                   │   │
│  │ Indonesian Focus: sentiment (NusaX), news classification (IndoNLU)  │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │ 2. CLUSTERING                                                         │   │
│  │ ─────────────                                                         │   │
│  │ Task: Group similar texts without labels                             │   │
│  │ Metrics: V-measure (homogeneity + completeness), ARI                 │   │
│  │ Example: Document clustering, topic discovery                        │   │
│  │ Indonesian Focus: news clustering, social media grouping            │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │ 3. PAIR CLASSIFICATION                                                │   │
│  │ ──────────────────────                                                │   │
│  │ Task: Binary classification of text pairs                            │   │
│  │ Metrics: Accuracy, Average Precision (AP)                            │   │
│  │ Example: Paraphrase detection, duplicate identification             │   │
│  │ Indonesian Focus: paraphrase ID, semantic equivalence               │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │ 4. RERANKING                                                          │   │
│  │ ─────────────                                                         │   │
│  │ Task: Reorder retrieved documents by relevance                       │   │
│  │ Metrics: MAP (Mean Average Precision), nDCG                          │   │
│  │ Example: Search result refinement                                    │   │
│  │ Indonesian Focus: document reranking, web search refinement         │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │ 5. RETRIEVAL                                                          │   │
│  │ ─────────────                                                         │   │
│  │ Task: Find relevant documents for queries                            │   │
│  │ Metrics: nDCG@k, Recall@k, MAP, MRR                                  │   │
│  │ Example: Search engines, RAG systems                                 │   │
│  │ Indonesian Focus: MIRACL-ID, Wikipedia retrieval, FAQ retrieval     │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │ 6. STS (Semantic Textual Similarity)                                  │   │
│  │ ──────────────────────────────────────                                │   │
│  │ Task: Predict similarity scores for text pairs                      │   │
│  │ Metrics: Pearson correlation, Spearman correlation                   │   │
│  │ Example: Semantic relatedness, paraphrase similarity                │   │
│  │ Indonesian Focus: translation-adapted STS datasets                  │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │ 7. SUMMARIZATION                                                      │   │
│  │ ────────────────────                                                  │   │
│  │ Task: Assess summary quality relative to source                      │   │
│  │ Metrics: Cosine similarity, ROUGE (as reference)                     │   │
│  │ Example: Summary relevance assessment                                │   │
│  │ Indonesian Focus: news summary evaluation                            │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │ 8. INSTRUCTION FOLLOWING                                              │   │
│  │ ────────────────────────                                              │   │
│  │ Task: Follow embedding-specific instructions                         │   │
│  │ Metrics: Task-specific (varies by instruction type)                  │   │
│  │ Example: Domain-specific retrieval, style-conditioned embedding     │   │
│  │ Indonesian Focus: Domain instruction datasets                        │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

3.3 Key Evaluation Metrics Explained

For each task category, MTEB employs specific metrics. Understanding these is crucial for benchmark design:

[!NOTE] Metric Reference for Indonesia-MTEB Implementation:

| Metric | Formula | Range | Interpretation | Use Case |
|--------|---------|-------|----------------|----------|
| Accuracy | correct / total | [0, 1] | Percentage correct | Classification |
| F1-Score | 2·(precision·recall)/(precision+recall) | [0, 1] | Harmonic mean of precision and recall | Classification |
| V-Measure | 2·(homogeneity·completeness)/(homogeneity+completeness) | [0, 1] | Clustering quality, independent of label permutation | Clustering |
| ARI | (RI − Expected_RI) / (Max_RI − Expected_RI) | [−1, 1] | Adjusted Rand Index: clustering similarity to ground truth | Clustering |
| MAP | mean(Average_Precision) | [0, 1] | Mean of average precision across queries | Retrieval, Reranking |
| nDCG@k | DCG@k / IDCG@k | [0, 1] | Normalized Discounted Cumulative Gain at position k | Retrieval, Reranking |
| Recall@k | relevant_in_top_k / total_relevant | [0, 1] | Percentage of relevant documents found in top k | Retrieval |
| MRR | mean(1 / rank_of_first_relevant) | [0, 1] | Mean Reciprocal Rank | Retrieval |
| Pearson | cov(x, y) / (σ_x·σ_y) | [−1, 1] | Linear correlation between predicted and gold scores | STS |
| Spearman | Pearson correlation of ranks | [−1, 1] | Monotonic correlation between predicted and gold scores | STS |
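These metrics map directly onto standard scikit-learn and SciPy functions. The snippet below is a minimal illustration of the definitions above with toy inputs; it is not the MTEB evaluation code itself.

```python
# Illustrative computation of several metrics from the table above,
# using scikit-learn and SciPy on toy data.
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import (accuracy_score, adjusted_rand_score, f1_score,
                             ndcg_score, v_measure_score)

# Classification
y_true, y_pred = [0, 1, 1, 0, 1], [0, 1, 0, 0, 1]
print("Accuracy:", accuracy_score(y_true, y_pred))
print("Macro F1:", f1_score(y_true, y_pred, average="macro"))

# Clustering
labels_true, labels_pred = [0, 0, 1, 1, 2], [1, 1, 0, 0, 2]
print("V-measure:", v_measure_score(labels_true, labels_pred))
print("ARI:", adjusted_rand_score(labels_true, labels_pred))

# Retrieval / reranking: graded relevance vs. model scores for one query
relevance = np.asarray([[3, 2, 0, 1]])
scores = np.asarray([[0.9, 0.3, 0.5, 0.1]])
print("nDCG@3:", ndcg_score(relevance, scores, k=3))

# STS: predicted vs. gold similarity scores
gold, pred = [0.1, 0.5, 0.9, 0.7], [0.2, 0.4, 0.8, 0.6]
print("Pearson:", pearsonr(gold, pred)[0])
print("Spearman:", spearmanr(gold, pred)[0])
```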

3.4 MTEB Leaderboard & Submission Process

The MTEB leaderboard, hosted on HuggingFace, serves as the central hub for embedding model evaluation:

┌─────────────────────────────────────────────────────────────────────────────┐
│                   MTEB LEADERBOARD SUBMISSION PROCESS                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  1. MODEL PREPARATION                                                        │
│  ────────────────────                                                        │
│     • Upload model to HuggingFace Hub                                       │
│     • Ensure model card includes:                                          │
│       - Model architecture                                                  │
│       - Training data sources                                               │
│       - Parameter count                                                     │
│       - License information                                                 │
│                                                                              │
│  2. SUBMISSION PACKAGE                                                       │
│  ────────────────────                                                        │
│     • Fork MTEB repository                                                  │
│     • Add model metadata to models/registry.yaml                            │
│     • Format:                                                               │
│       name: "ModelName"                                                     │
│       language: ["id"]  # for Indonesian models                             │
│       open_source: true                                                     │
│       revision: "commit_hash"                                               │
│                                                                              │
│  3. AUTOMATED EVALUATION                                                     │
│  ────────────────────                                                        │
│     • MTEB CI automatically evaluates on all benchmarks                     │
│     • Results aggregated across task categories                             │
│     • Leaderboard updated automatically                                     │
│                                                                              │
│  4. TRANSPARENCY REQUIREMENTS                                                │
│  ────────────────────────────                                                │
│     • Reference implementation required                                    │
│     • Training data disclosure                                             │
│     • Reproducibility checklist                                            │
│     • Code availability                                                     │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

[!TIP] For Indonesia-MTEB: We will establish integration with the MTEB leaderboard through:

  1. Official dataset submission to MTEB repository
  2. Indonesian-specific leaderboard sub-section
  3. Automated evaluation pipeline for Indonesian models


4. The Gap Analysis

4.1 Current Indonesian Embedding Landscape

A comprehensive analysis reveals significant gaps in Indonesian embedding evaluation infrastructure:

| Resource | Type | Coverage | MTEB-Compatible | Status |
|----------|------|----------|-----------------|--------|
| IndoNLU | NLU Benchmark | 12 tasks, Indonesian only | ❌ NLU tasks only, not embedding-specific | Established (2020) |
| NusaX | Sentiment Dataset | 10 Indonesian local languages | ❌ Single task (sentiment) | Established (2022) |
| IndoMMLU | Knowledge QA | Culture + language understanding | ❌ Knowledge-focused, not embedding | Available |
| MIRACL-ID | Retrieval | Indonesian subset of 18 languages | ⚠️ Partial - retrieval only | Available |
| LazarusNLP | Embedding Models | 5-10 Indonesian embedding models | ❌ Models, not benchmark | Active (2024) |
| SEA-BED | Regional Benchmark | 10 SEA languages, 169 datasets, 9 tasks | ⚠️ Multi-language, not Indonesia-focused | New (2025) |
| SEACrowd | Data Hub | 13 tasks, 38 SEA indigenous languages | ⚠️ Includes Indonesian but not embedding-specific | New (2024) |
| Indonesia-MTEB | Embedding Benchmark | 8 tasks, 50-100+ datasets | Full MTEB compatibility | This Project |

Key Findings:

  1. No Indonesia-Specific Embedding Benchmark: Existing resources are either NLU-focused (IndoNLU) or multi-language (SEA-BED, SEACrowd) without dedicated emphasis on Indonesian embeddings.

  2. Fragmented Task Coverage: No single resource covers all 8 MTEB task categories for Indonesian.

  3. No Centralized Evaluation: Indonesian embedding models (LazarusNLP) are evaluated on scattered datasets without unified comparison.

4.2 Comparison with Regional Benchmarks

| Benchmark | Language | Datasets | Tasks | MTEB Integration | Indonesia Coverage |
|-----------|----------|----------|-------|------------------|--------------------|
| C-MTEB | Chinese | 35 | 6 | ✅ Full | N/A |
| VN-MTEB | Vietnamese | ~30+ | Multi | ✅ Full | N/A |
| AfriMTEB | African languages | Subset | Multi | ✅ Full | N/A |
| SEA-BED | 10 SEA languages | 169 | 9 | ⚠️ Independent | Partial (1 of 10) |
| Indonesia-MTEB | Indonesian | 50-100+ | 8 | 🎯 Planned | 🎯 100% |

Positioning: Indonesia-MTEB will be the first dedicated Indonesian embedding benchmark with full MTEB methodology compatibility and comprehensive task coverage.


5. Regional MTEB Precedents

5.1 Successful Regional Benchmarks

Analysis of existing regional MTEB implementations provides valuable methodological precedents:

C-MTEB (Chinese Massive Text Embedding Benchmark)

Specification:

| Aspect | Details |
|--------|---------|
| Language | Chinese (Simplified & Traditional) |
| Scale | 35 datasets, 6 task categories |
| Paper | Xiao et al. (2023) - "Packed Resources For General Chinese Embeddings" |
| Citations | 1,171+ (as of 2024) |
| Repository | HuggingFace C-MTEB collection |
| Key Innovation | Russian Doll representation learning for multi-grained embeddings |

Methodological Insights for Indonesia-MTEB:

  - Emphasis on domain diversity (news, medical, legal, e-commerce)
  - Separate evaluation for Simplified vs. Traditional variants
  - Comprehensive baseline evaluation (30+ models)

VN-MTEB (Vietnamese Massive Text Embedding Benchmark)

Specification:

| Aspect | Details |
|--------|---------|
| Language | Vietnamese |
| Scale | ~30 datasets, multi-task |
| Paper | Pham et al. (2025) - arXiv:2507.21500 |
| Publication Date | July 2025 |
| Key Focus | Toxicity detection, online content moderation |
| Repository | GreenNode/VN-MTEB collection on HuggingFace |

Methodological Insights for Indonesia-MTEB:

  - Recent publication demonstrates viability of new language benchmarks
  - Domain-specific focus (toxicity) as a novel contribution
  - Community-driven model collection approach

SEA-BED (Southeast Asia Embedding Benchmark)

Specification:

| Aspect | Details |
|--------|---------|
| Languages | 10 SEA languages (Indonesian, Thai, Vietnamese, etc.) |
| Scale | 169 datasets, 9 tasks |
| Paper | Ponwitayarat et al. (2025) - arXiv:2508.12243 |
| Publication Date | August 2025 |
| Novelty | 87% of datasets not in MMTEB |
| Human Annotations | 71% human-formulated datasets |

Methodological Insights for Indonesia-MTEB:

  - Demonstrates regional benchmark viability
  - High proportion of novel (non-MMTEB) datasets validates unique regional needs
  - Human annotation emphasis for quality control

5.2 Lessons Learned for Indonesia-MTEB

┌─────────────────────────────────────────────────────────────────────────────┐
│                   METHODOLOGICAL BEST PRACTICES FROM REGIONAL MTEBS          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  FROM C-MTEB (Chinese)                                                       │
│  ────────────────────────                                                    │
│  ✓ Domain diversity is critical for comprehensive evaluation               │
│  ✓ Multi-grained evaluation (character, word, sentence level)              │
│  ✓ Comprehensive baseline evaluation establishes performance floor         │
│                                                                              │
│  FROM VN-MTEB (Vietnamese)                                                   │
│  ────────────────────────────────                                            │
│  ✓ Domain-specific focus can be a novel contribution                        │
│  ✓ Community-driven model collection accelerates adoption                  │
│  ✓ HuggingFace integration maximizes accessibility                          │
│                                                                              │
│  FROM SEA-BED (Southeast Asia)                                               │
│  ────────────────────────────────────────                                    │
│  ✓ Regional datasets often differ from global MTEB - prioritize novelty    │
│  ✓ High human annotation ratio ensures quality                              │
│  ✓ Language-specific challenges (agglutinative morphology, etc.) warrant   │
│    specialized datasets                                                     │
│                                                                              │
│  INDONESIA-MTEB SYNTHESIS                                                    │
│  ────────────────────────                                                    │
│  ✓ Combine domain diversity with Indonesian-specific focus                 │
│  ✓ Emphasize morphological complexity in dataset design                    │
│  ✓ High human validation ratio (minimum 10% of translated data)            │
│  ✓ Full HuggingFace + MTEB ecosystem integration                           │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

6. Project Scope & Deliverables

6.1 In-Scope Deliverables

┌─────────────────────────────────────────────────────────────────────────────┐
│                      INDONESIA-MTEB DELIVERABLES                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ╔═════════════════════════════════════════════════════════════════════════╗│
│  ║  DELIVERABLE 1: DATASET SUITE                                           ║│
│  ║  ──────────────────────────                                             ║│
│  ║  Specification:                                                          ║│
│  ║  • All 8 MTEB task categories covered                                   ║│
│  ║  • Minimum 50 datasets (target: 100+)                                   ║│
│  ║  • Train/validation/test splits for supervised tasks                    ║│
│  ║  • Metadata documentation (license, source, creation method)            ║│
│  ║                                                                          ║│
│  ║  Data Sources:                                                           ║│
│  ║  • Aggregation: ~20-30 existing Indonesian datasets                     ║│
│  ║  • Translation: ~40-60 translated MTEB datasets                         ║│
│  ║  • AI-Generated: ~10-20 novel Indonesian datasets                      ║│
│  ║                                                                          ║│
│  ║  Format:                                                                 ║│
│  ║  • HuggingFace datasets format                                          ║│
│  ║  • MTEB-compatible metadata                                             ║│
│  ║  • Comprehensive documentation cards                                    ║│
│  ╚═════════════════════════════════════════════════════════════════════════╝│
│                                                                              │
│  ╔═════════════════════════════════════════════════════════════════════════╗│
│  ║  DELIVERABLE 2: EVALUATION FRAMEWORK                                    ║│
│  ║  ─────────────────────────────────                                      ║│
│  ║  Components:                                                             ║│
│  ║  • MTEB-compatible evaluation script                                    ║│
│  ║  • Indonesian-specific metric calculations                              ║│
│  ║  • Baseline model evaluations (10+ models)                              ║│
│  ║  • Leaderboard integration (HuggingFace Spaces)                         ║│
│  ║  • Reproducibility guarantees                                          ║│
│  ║                                                                          ║│
│  ║  Models for Baseline Evaluation:                                         ║│
│  ║  • Multilingual: E5, BGE, GTE, jina (current SOTA)                      ║│
│  ║  • Indonesian-specific: LazarusNLP models                               ║│
│  ║  • General: sentence-transformers baselines                             ║│
│  ╚═════════════════════════════════════════════════════════════════════════╝│
│                                                                              │
│  ╔═════════════════════════════════════════════════════════════════════════╗│
│  ║  DELIVERABLE 3: PYTHON PACKAGE                                          ║│
│  ║  ─────────────────────────────                                          ║│
│  ║  Package: indonesiamteb (PyPI)                                          ║│
│  ║  Features:                                                               ║│
│  ║  • pip install indonesiamteb                                            ║│
│  ║  • Easy dataset loading: load_benchmark(task_name)                     ║│
│  ║  • One-line evaluation: evaluate(model, benchmark)                     ║│
│  ║  • Leaderboard submission tools                                        ║│
│  ║  • Comprehensive documentation                                          ║│
│  ╚═════════════════════════════════════════════════════════════════════════╝│
│                                                                              │
│  ╔═════════════════════════════════════════════════════════════════════════╗│
│  ║  DELIVERABLE 4: RESEARCH PAPER                                          ║│
│  ║  ──────────────────────────                                             ║│
│  ║  Target Venue: ACL/EMNLP/NAACL dataset track                            ║│
│  ║  Sections:                                                               ║│
│  ║  • Abstract & Introduction                                              ║│
│  ║  • Background & Related Work (MTEB, Indonesian NLP, regional MTEBs)     ║│
│  ║  • Methodology (data acquisition, translation, generation)             ║│
│  ║  • Dataset descriptions (all datasets with statistics)                 ║│
│  ║  • Baseline evaluation results                                         ║│
│  ║  • Cross-lingual analysis (ID ↔ EN performance)                        ║│
│  ║  • Limitations & Ethics                                                 ║│
│  ║  • Conclusion & Future Work                                             ║│
│  ╚═════════════════════════════════════════════════════════════════════════╝│
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
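Deliverable 3 names load_benchmark(task_name) and evaluate(model, benchmark) as the planned one-line interface. Since that package does not exist yet, the sketch below shows the reference pattern from the existing mteb library, which Indonesia-MTEB intends to remain compatible with; the model checkpoint and the "ind" language filter are illustrative choices.

```python
# Reference pattern using today's `mteb` library, which the planned
# `indonesiamteb` package (load_benchmark / evaluate) aims to stay compatible
# with. Model choice and the "ind" language filter are illustrative only.
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")

tasks = mteb.get_tasks(languages=["ind"])   # currently available Indonesian-language tasks
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/multilingual-e5-large")

for task_result in results:
    print(task_result)
```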

6.2 Out-of-Scope (Explicitly Excluded)

| Excluded | Reason | Alternative Approach |
|----------|--------|----------------------|
| Training new embedding models | Focus is the benchmark, not models | Evaluate existing models; model training is a separate project |
| Domain-specific evaluation | Keep benchmark general-purpose | Domain-specific datasets included, but benchmark remains general |
| Indonesian local languages (Javanese, Sundanese, etc.) | Focus on Bahasa Indonesia first | Future expansion to regional languages |
| Real-time leaderboard hosting | Infrastructure scope | HuggingFace Spaces integration; no independent hosting |
| Commercial applications | Research focus | Open-source for community use |

6.3 Success Criteria

| Metric | Target | Measurement Method |
|--------|--------|--------------------|
| Task Coverage | All 8 MTEB categories | Dataset inventory |
| Dataset Count | Minimum 50, target 100+ | Final dataset count |
| Translation Quality | ≥ 85% human acceptance rate | Human validation on 10% sample |
| Baseline Models | ≥ 10 models evaluated | Evaluation results |
| Publication | ACL/EMNLP/NAACL dataset paper | Acceptance notification |
| MTEB Integration | Official integration into MTEB | Pull request acceptance |
| Package Usage | ≥ 50 monthly downloads (6 months post-release) | PyPI statistics |
| Community Adoption | ≥ 5 models use Indonesia-MTEB for evaluation | Leaderboard, GitHub citations |

7. Research Questions

7.1 Primary Research Questions

RQ1: Gap Analysis & State of the Art

What is the current state of Indonesian embedding evaluation, and what specific gaps exist compared to MTEB standards?

Sub-questions:

  - RQ1.1: Which Indonesian NLP datasets exist and what is their MTEB compatibility?
  - RQ1.2: What task categories are currently underrepresented for Indonesian?
  - RQ1.3: How do existing Indonesian embedding models perform on MTEB-style evaluations?

RQ2: Translation Methodology

How can we effectively translate MTEB datasets to Indonesian while preserving semantic equivalence and task validity?

Sub-questions:

  - RQ2.1: Which translation model (TranslateGemma, NLLB, mT5) achieves optimal quality for Indonesian?
  - RQ2.2: What quality control mechanisms (human validation, LLM-as-judge) ensure semantic preservation?
  - RQ2.3: How does translation impact embedding model performance relative to original English datasets?

RQ3: Novel Dataset Generation

What novel Indonesian embedding tasks can be created via AI generation that fill unique gaps not addressed by translation or aggregation?

Sub-questions:

  - RQ3.1: Which task categories remain underserved after aggregation and translation?
  - RQ3.2: How can LLMs generate high-quality Indonesian datasets with statistical consistency?
  - RQ3.3: What Indonesian-specific linguistic phenomena should novel datasets target?

RQ4: Baseline Evaluation

How do existing embedding models (multilingual and Indonesian-specific) perform on a unified Indonesian benchmark across all 8 task categories?

Sub-questions:

  - RQ4.1: Which model architectures excel on which task types for Indonesian?
  - RQ4.2: How does Indonesian performance correlate with performance on other languages?
  - RQ4.3: What performance gaps exist between multilingual and Indonesian-specific models?

RQ5: Cross-Lingual Analysis

What does Indonesia-MTEB reveal about cross-lingual embedding capabilities and transfer learning to Indonesian?

Sub-questions:

  - RQ5.1: How do models trained on English/other languages transfer to Indonesian?
  - RQ5.2: What is the performance gap between monolingual Indonesian and multilingual models?
  - RQ5.3: Can Indonesia-MTEB inform embedding model design for other agglutinative languages?


8. Proposed Methodology

8.1 Phase Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                    INDONESIA-MTEB METHODOLOGY PHASES                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  PHASE 1: AGGREGATION              PHASE 2: TRANSLATION                     │
│  ──────────────────────            ────────────────────                     │
│  │                              │                                          │
│  │  • Dataset discovery         │  • MTEB dataset selection              │
│  │  • Format conversion         │  • Translation model benchmark        │
│  │  • Quality assessment        │  • Batch translation                  │
│  │  • MTEB compatibility check  │  • Quality control pipeline           │
│  │                              │  • Human validation (10% sample)      │
│  │                              │                                          │
│  └──────────────────────────────┘  └──────────────────────────────────────┘
│              │                              │                               │
│              ▼                              ▼                               │
│                                                                              │
│  ╔═══════════════════════════════════════════════════════════════════════╗│
│  ║                   PHASE 3: NOVEL DATASET GENERATION                    ║│
│  ║                   ─────────────────────────────────────                 ║│
│  ║                                                                        ║│
│  ║    • Gap identification (post-aggregation + translation)              ║│
│  ║    • LLM prompt engineering for dataset generation                     ║│
│  ║    • Domain-specific dataset creation (legal, medical, etc.)          ║│
│  ║    • Statistical consistency validation                               ║│
│  ║    • Human expert review                                               ║│
│  ║                                                                        ║│
│  ╚═══════════════════════════════════════════════════════════════════════╝│
│                              │                                              │
│                              ▼                                              │
│                                                                              │
│  ╔═══════════════════════════════════════════════════════════════════════╗│
│  ║                   PHASE 4: INTEGRATION & VALIDATION                    ║│
│  ║                   ──────────────────────────────────────                ║│
│  ║                                                                        ║│
│  ║    • Unified dataset format validation                                 ║│
│  ║    • Baseline model evaluation (10+ models)                            ║│
│  ║    • Statistical analysis of results                                   ║│
│  ║    • Cross-lingual comparison                                          ║│
│  ║    • Leaderboard deployment                                            ║│
│  ║    • Paper writing and submission                                      ║│
│  ║                                                                        ║│
│  ╚═══════════════════════════════════════════════════════════════════════╝│
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

8.2 Phase 1: Aggregation - Detailed Methodology

Objective: Identify, convert, and validate existing Indonesian datasets for MTEB compatibility.

Step 1: Dataset Discovery

| Source | Datasets of Interest | MTEB Category |
|--------|----------------------|---------------|
| IndoNLU | SMSA, EmoT, etc. | Classification |
| NusaX | Sentiment (10 languages) | Classification |
| IndoMMLU | Knowledge QA | Classification |
| MIRACL-ID | Wikipedia retrieval | Retrieval |
| SEACrowd | Various tasks | Multiple |

Step 2: Format Conversion

  • Target Format: HuggingFace datasets with MTEB-specific schema
  • Required Fields: text, label, split (train/validation/test)
  • Metadata: license, source language, domain, creation date
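A minimal conversion sketch, assuming the HuggingFace datasets library and using IndoNLU's SMSA config as an illustrative source; actual dataset paths, configs, and column names vary per source and each needs its own mapping.

```python
# Sketch: convert an aggregated Indonesian dataset into the MTEB-style schema
# listed above (text / label / split). Dataset path and column names are
# illustrative; some sources may also require trust_remote_code=True.
from datasets import load_dataset, DatasetDict

raw = load_dataset("indonlp/indonlu", "smsa")  # assumed source; adjust per dataset

def to_mteb_schema(example):
    # Rename / select fields so every converted dataset shares one schema.
    return {"text": example["text"], "label": example["label"]}

converted = DatasetDict({
    split: ds.map(
        to_mteb_schema,
        remove_columns=[c for c in ds.column_names if c not in ("text", "label")],
    )
    for split, ds in raw.items()
})

print(converted)  # expect train / validation / test splits with text + label columns
```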

Step 3: Quality Assessment

  • Check for data leakage between splits
  • Verify label distribution balance
  • Assess text quality (encoding issues, noise)
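These checks can be prototyped in a few lines; the sketch below assumes the converted DatasetDict from the previous snippet and only covers exact-duplicate leakage, label balance, and a crude noise scan, which is a lower bound on the quality assessment actually needed.

```python
# Sketch of the quality checks listed above, run on the `converted` DatasetDict
# from the previous snippet.
from collections import Counter

train_texts = set(converted["train"]["text"])
test_texts = set(converted["test"]["text"])

leakage = train_texts & test_texts
print(f"Exact-duplicate leakage train -> test: {len(leakage)} examples")

label_counts = Counter(converted["train"]["label"])
print("Train label distribution:", dict(label_counts))

# Rough noise check: flag empty texts or texts containing replacement characters.
noisy = [t for t in converted["train"]["text"] if not t.strip() or "\ufffd" in t]
print(f"Potentially noisy training texts: {len(noisy)}")
```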

8.3 Phase 2: Translation - Detailed Methodology

Objective: Translate selected MTEB datasets to Indonesian with semantic preservation.

Step 1: Translation Model Selection

| Model | Parameters | Languages | Strength | Weakness |
|-------|------------|-----------|----------|----------|
| TranslateGemma | 4B / 12B / 27B | 55 | Latest, optimized | New (2026) |
| NLLB-200 | 3.3B | 200 | Proven quality | Older architecture |
| mT5 | 580M / 1.1B | 101 | Flexible | Requires fine-tuning |
| SeamlessM4T | 2.3B | 100 | Multimodal | Overkill for text-only |
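As a concrete reference point, the sketch below runs batch translation with NLLB-200 through the transformers translation pipeline. TranslateGemma would be slotted in the same way once a checkpoint is selected; the distilled 600M NLLB checkpoint is used here only to keep the example lightweight.

```python
# Sketch of machine translation to Indonesian with NLLB-200 via transformers.
# The checkpoint is a lightweight stand-in; a larger model would be used in practice.
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="ind_Latn",
)

batch = [
    "The committee approved the new budget.",
    "How do I reset my password?",
]
for out in translator(batch, max_length=256):
    print(out["translation_text"])
```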

Step 2: Translation Pipeline

┌─────────────────────────────────────────────────────────────────────────────┐
│                    TRANSLATION QUALITY CONTROL PIPELINE                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  SOURCE TEXT (English MTEB dataset)                                          │
│          │                                                                   │
│          ▼                                                                   │
│  ┌─────────────────────────────────────────────────────────────┐           │
│  │  AUTOMATED TRANSLATION (TranslateGemma 12B)                  │           │
│  └─────────────────────────────────────────────────────────────┘           │
│          │                                                                   │
│          ▼                                                                   │
│  ┌─────────────────────────────────────────────────────────────┐           │
│  │  LLM-AS-JUDGE VALIDATION (GPT-4 / Claude)                    │           │
│  │  Criteria:                                                  │           │
│  │  • Semantic equivalence (1-5 scale)                         │           │
│  │  • Grammatical correctness                                  │           │
│  │  • Cultural appropriateness                                 │           │
│  └─────────────────────────────────────────────────────────────┘           │
│          │                                                                   │
│          ├──────────────┬──────────────┐                                    │
│          ▼              ▼              ▼                                    │
│    ACCEPT          REJECT          FLAG                                    │
│        │              │              │                                     │
│        │              │              ▼                                     │
│        │              │    ┌───────────────────┐                          │
│        │              │    │ HUMAN REVIEW      │                          │
│        │              │    │ (10% sample)      │                          │
│        │              │    └───────────────────┘                          │
│        │              │              │                                     │
│        ▼              ▼              ▼                                     │
│  ┌───────────────────────────────────────────────────┐                    │
│  │  FINAL INDONESIAN DATASET                          │                    │
│  │  • Accepted translations                           │                    │
│  │  • Human-reviewed corrections                      │                    │
│  │  • Quality score metadata                          │                    │
│  └───────────────────────────────────────────────────┘                    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Step 3: Quality Metrics

  • Semantic Preservation Score: LLM-as-judge rating (1-5)
  • Acceptance Threshold: ≥ 4.0/5.0
  • Human Validation Rate: 10% random sample + all flagged items
  • Target Human Acceptance Rate: ≥ 85%
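The accept/reject/flag routing implied by the pipeline and the thresholds above can be sketched as follows. The judge call itself is a placeholder, since the exact LLM-as-judge prompt and provider are covered in Document 07, and the outright-rejection cutoff is an assumption, not a figure from this document.

```python
# Sketch of the accept / reject / flag routing from Step 3.
# `judge_semantic_equivalence` is a placeholder for the eventual LLM-as-judge call;
# only ACCEPT_THRESHOLD and HUMAN_SAMPLE_RATE come from the text above.
import random

ACCEPT_THRESHOLD = 4.0      # >= 4.0/5.0 accepted automatically
HUMAN_SAMPLE_RATE = 0.10    # 10% random sample of accepted items audited by humans
REJECT_CUTOFF = 2.0         # assumption: scores at or below this are discarded

def judge_semantic_equivalence(source: str, translation: str) -> float:
    """Placeholder: return a 1-5 semantic-equivalence score from an LLM judge."""
    raise NotImplementedError

def route(source: str, translation: str) -> str:
    score = judge_semantic_equivalence(source, translation)
    if score >= ACCEPT_THRESHOLD:
        # A random slice of accepted items still goes to human review for auditing.
        return "human_review" if random.random() < HUMAN_SAMPLE_RATE else "accept"
    if score <= REJECT_CUTOFF:
        return "reject"
    return "human_review"   # borderline items are flagged for review
```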

8.4 Phase 3: AI Generation - Detailed Methodology

Objective: Create novel Indonesian datasets for underserved tasks.

Target Domains:

| Domain | Rationale | Task Category |
|--------|-----------|---------------|
| Legal | Complex morphology in legal texts | Classification, Retrieval |
| Healthcare | Technical terminology, code-switching | STS, Classification |
| Finance | Numeral expressions, named entities | Clustering, Pair Classification |
| Social Media | Informal language, slang | Sentiment, STS |
| News | Formal Indonesian, topic diversity | Clustering, Retrieval |

Generation Methodology:

  1. Gap Identification: Analyze coverage after Phases 1-2
  2. Prompt Engineering: Design prompts for LLM dataset generation
  3. Iterative Generation: Generate, validate, refine
  4. Statistical Checks: Label distribution, text length, vocabulary diversity
  5. Human Review: Domain expert validation
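The statistical checks in step 4 can be approximated with simple descriptive statistics. The sketch below uses a placeholder list of generated (text, label) pairs and a type-token ratio as a rough vocabulary-diversity proxy; it is not a substitute for the human expert review in step 5.

```python
# Sketch of the statistical checks in step 4: label balance, text length,
# and a type-token ratio as a crude vocabulary-diversity proxy.
from collections import Counter

# Placeholder for LLM-generated (text, label) pairs.
generated = [
    ("Contoh teks hukum tentang perjanjian sewa.", "kontrak"),
    ("Contoh teks lain mengenai tindak pidana.", "pidana"),
]

labels = Counter(label for _, label in generated)
lengths = [len(text.split()) for text, _ in generated]
tokens = [tok for text, _ in generated for tok in text.lower().split()]

print("Label distribution:", dict(labels))
print("Mean length (tokens):", sum(lengths) / len(lengths))
print("Type-token ratio:", len(set(tokens)) / len(tokens))
```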

8.5 Phase 4: Integration & Validation - Detailed Methodology

Objective: Unify all datasets, evaluate baselines, and publish.

Step 1: Unified Format Validation

  • Schema validation across all datasets
  • Consistent metadata formatting
  • HuggingFace dataset card generation

Step 2: Baseline Evaluation

┌─────────────────────────────────────────────────────────────────────────────┐
│                    BASELINE MODEL EVALUATION MATRIX                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│                   ┌─────────────────────────────────────┐                   │
│                   │         MODELS TO EVALUATE           │                   │
│                   ├─────────────────────────────────────┤                   │
│                   │ Multilingual:                        │                   │
│                   │   • E5-large-v2                      │                   │
│                   │   • bge-m3 (multilingual)            │                   │
│                   │   • gte-large                        │                   │
│                   │   • jina-embeddings-v3               │                   │
│                   │                                      │                   │
│                   │ Indonesian-Specific:                 │                   │
│                   │   • LazarusNLP/indonesian-sbert...  │                   │
│                   │   • (others from HuggingFace)        │                   │
│                   │                                      │                   │
│                   │ Baselines:                           │                   │
│                   │   • sentence-transformers/LaBSE     │                   │
│                   │   • sentence-transformers/distiluse │                   │
│                   └─────────────────────────────────────┘                   │
│                                  │                                           │
│                                  ▼                                           │
│                   ┌─────────────────────────────────────┐                   │
│                   │         TASK CATEGORIES              │                   │
│                   ├─────────────────────────────────────┤                   │
│                   │ 1. Classification   │ 5. Retrieval  │                   │
│                   │ 2. Clustering       │ 6. STS        │                   │
│                   │ 3. Pair Class.      │ 7. Summariz.  │                   │
│                   │ 4. Reranking        │ 8. Instr. Fol. │                   │
│                   └─────────────────────────────────────┘                   │
│                                  │                                           │
│                                  ▼                                           │
│                   ┌─────────────────────────────────────┐                   │
│                   │         OUTPUT                       │                   │
│                   ├─────────────────────────────────────┤                   │
│                   │ • Per-task performance scores        │                   │
│                   │ • Aggregate benchmark score           │                   │
│                   │ • Cross-lingual comparisons          │                   │
│                   │ • Leaderboard rankings               │                   │
│                   └─────────────────────────────────────┘                   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Step 3: Statistical Analysis

  • Mean performance across models per task
  • Performance variance analysis
  • Correlation between tasks (task similarity)
  • Cross-lingual performance correlation
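A pandas sketch along these lines would cover the listed statistics once baseline scores exist; the model names and score values below are placeholders, not results.

```python
# Sketch of the analysis in Step 3: per-task means, variance across models,
# and task-to-task correlation. All numbers here are placeholders.
import pandas as pd

# Rows = models, columns = task categories; values = aggregate score per task.
scores = pd.DataFrame(
    {
        "Classification": [0.71, 0.68, 0.62],
        "Retrieval":      [0.55, 0.60, 0.41],
        "STS":            [0.78, 0.74, 0.66],
    },
    index=["multilingual-e5-large", "bge-m3", "LaBSE"],
)

print(scores.mean())                    # mean performance per task across models
print(scores.var())                     # variance per task (how much models disagree)
print(scores.corr(method="spearman"))   # task-to-task correlation (task similarity)
```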

9. Technical Architecture

9.1 System Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                    INDONESIA-MTEB TECHNICAL ARCHITECTURE                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                        DATA LAYER                                     │   │
│  │  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐        │   │
│  │  │ HuggingFace    │  │ Source Files   │  │ Generated Data │        │   │
│  │  │ Datasets Hub   │  │ (IndoNLU, etc) │  │ (AI-created)  │        │   │
│  │  └────────────────┘  └────────────────┘  └────────────────┘        │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                    │                                         │
│                                    ▼                                         │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                     PROCESSING LAYER                                  │   │
│  │  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐        │   │
│  │  │ Format         │  │ Translation    │  │ Quality        │        │   │
│  │  │ Converters     │  │ Pipeline       │  │ Validation     │        │   │
│  │  └────────────────┘  └────────────────┘  └────────────────┘        │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                    │                                         │
│                                    ▼                                         │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                      EVALUATION LAYER                                 │   │
│  │  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐        │   │
│  │  │ MTEB Core      │  │ Custom Metrics │  │ Statistical    │        │   │
│  │  │ Evaluator      │  │ (ID-specific)  │  │ Analysis       │        │   │
│  │  └────────────────┘  └────────────────┘  └────────────────┘        │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                    │                                         │
│                                    ▼                                         │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                       PRESENTATION LAYER                               │   │
│  │  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐        │   │
│  │  │ HuggingFace    │  │ PyPI Package   │  │ Documentation  │        │   │
│  │  │ Spaces         │  │ CLI/API        │  │ Site           │        │   │
│  │  └────────────────┘  └────────────────┘  └────────────────┘        │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
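
To make the data and processing layers concrete, the sketch below pulls a source dataset from the HuggingFace Hub and normalizes it to the text/label column layout that MTEB-style classification tasks expect. The Hub identifier and column names are illustrative assumptions, not entries from the final registry (which would live in data/metadata.py in the package layout shown in Section 9.2).

```python
# Minimal sketch of the DATA -> PROCESSING flow: load a source dataset from the
# HuggingFace Hub and normalize its schema for an MTEB-style classification task.
# The dataset id and column names below are illustrative assumptions only.
from datasets import load_dataset

SOURCE_ID = "example-org/indonesian-sentiment"  # hypothetical Hub identifier

raw = load_dataset(SOURCE_ID)

# Rename source-specific columns to the `text` / `label` layout expected by
# MTEB-style classification tasks.
converted = raw.rename_columns({"review": "text", "sentiment": "label"})
print(converted)
```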

9.2 Python Package Structure

indonesiamteb/
├── indonesiamteb/
│   ├── __init__.py
│   ├── data/
│   │   ├── __init__.py
│   │   ├── loading.py           # Dataset loading utilities
│   │   └── metadata.py          # Dataset metadata registry
│   ├── tasks/
│   │   ├── __init__.py
│   │   ├── classification.py    # Classification task wrappers
│   │   ├── clustering.py        # Clustering task wrappers
│   │   ├── retrieval.py         # Retrieval task wrappers
│   │   ├── sts.py               # STS task wrappers
│   │   └── ...                  # Other task types
│   ├── evaluation/
│   │   ├── __init__.py
│   │   ├── evaluator.py         # Main evaluation class
│   │   ├── metrics.py           # Custom metrics
│   │   └── leaderboard.py       # Leaderboard utilities
│   └── utils/
│       ├── __init__.py
│       ├── translation.py       # Translation utilities
│       └── validation.py        # Quality validation
├── benchmarks/
│   ├── classification/
│   ├── clustering/
│   └── ...                      # Dataset implementations
├── tests/
│   ├── test_data.py
│   ├── test_evaluation.py
│   └── test_tasks.py
├── examples/
│   ├── basic_usage.py
│   └── custom_evaluation.py
├── setup.py
├── pyproject.toml
└── README.md
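
A hypothetical examples/basic_usage.py might look like the sketch below; the IndonesiaMTEB evaluator class and get_tasks helper are design assumptions mirroring the evaluation/ and tasks/ modules in the tree above, not an implemented API.

```python
# Hypothetical basic_usage.py for the proposed `indonesiamteb` package.
# IndonesiaMTEB and get_tasks are design assumptions that mirror the
# evaluation/ and tasks/ modules in the layout above.
from sentence_transformers import SentenceTransformer

from indonesiamteb import IndonesiaMTEB, get_tasks

model = SentenceTransformer("sentence-transformers/LaBSE")

# Select a subset of task categories (names are placeholders).
tasks = get_tasks(task_types=["Classification", "STS", "Retrieval"])

evaluator = IndonesiaMTEB(tasks=tasks)
results = evaluator.run(model, output_folder="results/labse")
print(results)
```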

10. Success Criteria

10.1 Quantitative Metrics

| Metric | Minimum | Target | Stretch |
|--------|---------|--------|---------|
| Total Datasets | 50 | 100 | 150+ |
| Task Coverage | 8/8 categories | 8/8 categories | 8/8 categories |
| Translation Acceptance Rate | 85% | 90% | 95% |
| Baseline Models Evaluated | 10 | 15 | 20+ |
| MTEB Integration | Official submission | Accepted | Featured |
| PyPI Downloads (6 months) | 50 | 500 | 1000+ |
| Community Adoptions | 3 models | 10 models | 20+ models |
| Paper Citations (1 year) | 5 | 20 | 50+ |

10.2 Qualitative Milestones

  • All datasets pass quality validation
  • Baseline evaluation complete with documented results
  • Python package published on PyPI
  • HuggingFace Spaces leaderboard deployed
  • Research paper submitted to top-tier venue
  • Community engagement (GitHub stars, forks, discussions)
  • Integration with MTEB main repository

11. Timeline & Milestones

┌─────────────────────────────────────────────────────────────────────────────┐
│                    INDONESIA-MTEB PROJECT TIMELINE                           │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  MONTH 1-2: FOUNDATION                                                       │
│  ═══════════════════                                                         │
│  ✓ Literature review complete                                               │
│  ✓ Dataset inventory finalized                                              │
│  ✓ Translation model benchmark selected                                     │
│  ✓ Technical architecture designed                                          │
│                                                                              │
│  MONTH 3-4: DATA ACQUISITION                                                  │
│  ════════════════════════                                                    │
│  ✓ Phase 1: Aggregation complete (20-30 datasets)                           │
│  ✓ Phase 2: Translation pipeline operational                                │
│  ✓ Phase 3: AI generation begins                                            │
│                                                                              │
│  MONTH 5-6: DATASET COMPLETION                                               │
│  ═════════════════════════                                                   │
│  ✓ Translation complete (40-60 datasets)                                    │
│  ✓ AI-generated datasets complete (10-20 datasets)                          │
│  ✓ Quality validation complete                                              │
│                                                                              │
│  MONTH 7-8: EVALUATION                                                       │
│  ═════════════════                                                          │
│  ✓ Baseline model evaluations (10+ models)                                  │
│  ✓ Statistical analysis complete                                            │
│  ✓ Cross-lingual comparison complete                                        │
│                                                                              │
│  MONTH 9-10: PACKAGE & PAPER                                                │
│  ════════════════════════                                                    │
│  ✓ Python package development complete                                      │
│  ✓ HuggingFace integration complete                                         │
│  ✓ Research paper drafted                                                  │
│                                                                              │
│  MONTH 11-12: PUBLICATION                                                    │
│  ═════════════════════                                                       │
│  ✓ Paper submitted to target venue                                          │
│  ✓ PyPI package released                                                   │
│  ✓ Leaderboard deployed                                                     │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

12. References

12.1 Primary Sources

  1. Muennighoff, N., et al. (2023). "MTEB: Massive Text Embedding Benchmark". Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2023). arXiv:2210.07316

  2. Enevoldsen, K., et al. (2025). "MMTEB: Massive Multilingual Text Embedding Benchmark". International Conference on Learning Representations (ICLR 2025). arXiv:2502.13595

  3. Xiao, S., et al. (2023). "C-Pack: Packed Resources for General Chinese Embeddings" (introduces C-MTEB and C-MTP). arXiv:2309.07597

12.2 Regional Benchmarks

  1. Ponwitayarat, W., et al. (2025). "SEA-BED: Southeast Asia Embedding Benchmark". arXiv preprint arXiv:2508.12243

  2. Pham, L., et al. (2025). "VN-MTEB: Vietnamese Massive Text Embedding Benchmark". arXiv preprint arXiv:2507.21500

12.3 Indonesian NLP Resources

  1. Wilie, B., et al. (2020). "IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding". arXiv preprint arXiv:2009.05387

  2. Winata, G., et al. (2022). "NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages". arXiv preprint arXiv:2205.15960

  3. Lovenia, H., et al. (2024). "SEACrowd: A Multilingual Multimodal Data Hub and Benchmark for Southeast Asian Languages". Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024).

12.4 Translation Models

  1. Finkelstein, M., et al. (2026). "TranslateGemma Technical Report". arXiv preprint arXiv:2601.09012

  2. NLLB Team (2022). "No Language Left Behind: Scaling Human-Centered Machine Translation". arXiv preprint arXiv:2207.04672

12.5 Evaluation Methodology

  1. Rosenberg, A., & Hirschberg, J. (2007). "V-Measure: A conditional entropy-based external cluster evaluation measure". EMNLP-CoNLL.

  2. Humeun, L., et al. (2025). "HUME: Measuring the Human-Model Performance Gap in Text Embedding Tasks". arXiv preprint arXiv:2510.10062


13. Document Status

[!NOTE] Next Document: Document 02 - MTEB Structure Analysis

Document 02 provides a detailed analysis of MTEB's internal structure, dataset formats, evaluation protocols, and the integration requirements for Indonesia-MTEB.

Change Log:

| Version | Date | Changes | Author |
|---------|------|---------|--------|
| 1.0 | 2026-01-25 | Initial version | Research Team |
| 2.0 | 2026-01-25 | Enhanced edition with expanded sections, latest research | Research Team |

This document is a living record and will be updated as the research progresses.