Project: Indonesia-MTEB Benchmark
Document: 01 - Project Overview & Scope Definition
Version: 2.0 (Enhanced Edition)
Last Updated: 2026-01-25
Status: Research Phase - Foundation Planning
[!NOTE]
Document Navigation¶
This is the first of twelve documents comprising the Indonesia-MTEB Benchmark research foundation. Each document builds upon the previous, establishing a comprehensive blueprint for creating Indonesia's first unified text embedding benchmark following MTEB methodology.
| Document | Title | Focus Area |
|---|---|---|
| 01 | Project Overview & Scope | Current Document |
| 02 | MTEB Structure Analysis | Framework deep-dive |
| 03 | Existing Indonesian Datasets | Data aggregation sources |
| 04 | Regional MTEB Methodologies | Precedent analysis |
| 05 | Translation Models Benchmark | Model selection & evaluation |
| 06 | AI Dataset Generation Methods | Novel data creation |
| 07 | Validation Strategies | Quality assurance protocols |
| 08 | ACL Dataset Paper Standards | Publication requirements |
| 09 | Novelty Angle & Publication | Research contribution |
| 10 | Implementation Roadmap | Technical execution plan |
| 11 | Python Package Development | Software architecture |
| 12 | Summary & Quick Reference | Consolidated reference |
Indonesia-MTEB: A Comprehensive Text Embedding Benchmark for Indonesian¶
"The absence of a unified embedding benchmark for Indonesian represents a critical gap in Southeast Asian NLP infrastructure. With 280+ million speakers, Indonesian ranks among the world's most spoken languages, yet remains systematically underrepresented in embedding evaluation frameworks."
Table of Contents¶
- Executive Summary
- The Indonesian Language Context
- Background: The MTEB Framework
- The Gap Analysis
- Regional MTEB Precedents
- Project Scope & Deliverables
- Research Questions
- Proposed Methodology
- Technical Architecture
- Success Criteria
- Timeline & Milestones
- References
1. Executive Summary¶
1.1 The Problem Statement¶
The Massive Text Embedding Benchmark (MTEB) has emerged as the dominant evaluation framework for text embedding models globally. Since its introduction at EACL 2023, MTEB has expanded rapidly, most recently through the MMTEB (Massive Multilingual Text Embedding Benchmark) initiative presented at ICLR 2025, and now encompasses:
| Milestone | Scale | Languages | Datasets |
|---|---|---|---|
| MTEB Original (EACL 2023) | Foundational | 112 | 58 |
| MMTEB (ICLR 2025) | Community-driven | 1,000+ | 500+ |
| Current (2026) | Production | 1,000+ | 1,308+ |
However, Indonesian language coverage remains fragmented and insufficient for rigorous embedding evaluation:
┌─────────────────────────────────────────────────────────────────────────────┐
│ INDONESIAN EMBEDDING EVALUATION GAP │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ GLOBAL MTEB │ INDONESIAN STATUS │
│ ────────────── │ ───────────────── │
│ ✓ 8 Task Categories │ ✗ No unified Indonesian benchmark │
│ ✓ 500+ Quality-controlled tasks │ ✗ Scattered individual datasets │
│ ✓ Standardized metrics │ ✗ No embedding-specific evaluation │
│ ✓ Active leaderboard │ ✗ No Indonesian embedding leaderboard │
│ ✓ Community governance │ ✗ No centralized benchmark hub │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
The Critical Gap: Despite being the 11th most spoken language globally with 280+ million speakers and serving as the lingua franca of Southeast Asia, Indonesian lacks a dedicated, comprehensive embedding benchmark following MTEB standards.
1.2 Research Objective¶
Primary Goal: Create Indonesia-MTEB — a unified, comprehensive Indonesian text embedding benchmark following MTEB methodology, covering all 8 MTEB task categories with minimum 50 datasets (target: 100+).
Three-Pronged Data Strategy:
┌─────────────────────────────────────────────────────────────────────────────┐
│ INDONESIA-MTEB DATASET ACQUISITION STRATEGY │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ╔═════════════════════════════════════════════════════════════════════════╗│
│ ║ PHASE 1: AGGREGATION ║│
│ ║ ───────────────── ║│
│ ║ • Identify and catalog existing Indonesian NLP datasets ║│
│ ║ • Convert to MTEB-compatible format ║│
│ ║ • Sources: IndoNLU, NusaX, IndoMMLU, MIRACL-ID, SEACrowd ║│
│ ║ • Expected Coverage: ~20-30 datasets ║│
│ ╚═════════════════════════════════════════════════════════════════════════╝│
│ │ │
│ ▼ │
│ ╔═════════════════════════════════════════════════════════════════════════╗│
│ ║ PHASE 2: TRANSLATION ║│
│ ║ ───────────────── ║│
│ ║ • Full MTEB benchmark translation to Indonesian ║│
│ ║ • Primary Model: TranslateGemma (4B/12B) - 55 language support ║│
│ ║ • Alternative: NLLB-200, mT5, Bloom ║│
│ ║ • Quality Control: LLM-as-judge + Human validation (10% sample) ║│
│ ║ • Expected Coverage: ~40-60 datasets ║│
│ ╚═════════════════════════════════════════════════════════════════════════╝│
│ │ │
│ ▼ │
│ ╔═════════════════════════════════════════════════════════════════════════╗│
│ ║ PHASE 3: AI-GENERATED DATASETS ║│
│ ║ ───────────────────────────── ║│
│ ║ • Identify task gaps after Phase 1 + 2 ║│
│ ║ • Generate novel Indonesian datasets using LLMs ║│
│ ║ • Domains: Legal, Healthcare, Finance, Social Media ║│
│ ║ • Validation: Statistical consistency + Human expert review ║│
│ ║ • Expected Coverage: ~10-20 novel datasets ║│
│ ╚═════════════════════════════════════════════════════════════════════════╝│
│ │
│ ▼ │
│ ╔═════════════════════════════════════════════════════════════════════════╗│
│ ║ INTEGRATION & VALIDATION ║│
│ ║ ───────────────────────────── ║│
│ ║ • Unified dataset format validation ║│
│ ║ • Baseline model evaluation on all tasks ║│
│ ║ • Leaderboard integration with MTEB ecosystem ║│
│ ║ • Publication: ACL/EMNLP/NAACL dataset paper ║│
│ ╚═════════════════════════════════════════════════════════════════════════╝│
│ │
└─────────────────────────────────────────────────────────────────────────────┘
1.3 Key Contributions¶
| Contribution Type | Description | Impact |
|---|---|---|
| Infrastructure | First unified Indonesian embedding benchmark | Enables systematic model comparison |
| Methodological | Three-pronged data acquisition framework | Replicable for other low-resource languages |
| Empirical | Baseline evaluation of existing models | Establishes performance floor |
| Community | Open-source Python package | Democratizes access to embedding evaluation |
2. The Indonesian Language Context¶
2.1 Demographic Significance¶
Understanding the scale and importance of Indonesian (Bahasa Indonesia) is essential for contextualizing this benchmark:
┌─────────────────────────────────────────────────────────────────────────────┐
│ INDONESIAN LANGUAGE: KEY STATISTICS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ SPEAKER COUNT │
│ ───────────── │
│ • Total Speakers: ~280 million (2024) │
│ • Native Speakers: ~42 million │
│ • Second-Language Speakers: ~238 million │
│ • Global Ranking: 11th most spoken language │
│ │
│ GEOGRAPHIC DISTRIBUTION │
│ ─────────────────────── │
│ • Primary Country: Indonesia (4th most populous nation) │
│ • ASEAN Presence: Working language of ASEAN │
│ • Diaspora: Malaysia, Singapore, Netherlands, etc. │
│ │
│ LINGUISTIC CONTEXT │
│ ─────────────────── │
│ • Language Family: Austronesian │
│ • Script: Latin (Roman) alphabet │
│ • Regional Languages: 700+ indigenous languages in Indonesia │
│ • Official Status: Sole official language (since 1928) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
[!TIP] Why Indonesian Matters for AI: Indonesia is the largest economy in Southeast Asia and a rapidly growing digital market. With over 200 million internet users and a thriving startup ecosystem, Indonesian NLP capabilities have direct commercial and social impact.
2.2 Linguistic Characteristics Affecting Embeddings¶
Indonesian presents unique challenges for text embedding models due to its morphological and syntactic properties:
| Linguistic Feature | Description | Embedding Challenge |
|---|---|---|
| Agglutinative Morphology | Words change through affixation (prefixes, suffixes, infixes, circumfixes) | Embeddings must capture morphological variants |
| Reduplication | Complete or partial word repetition for plurality or emphasis | Creates vocabulary explosion |
| Productive Affixation | Thousands of possible affix combinations | Sparse embedding space for derived forms |
| Loanword Integration | Extensive borrowing from Dutch, Arabic, Sanskrit, English, Javanese | Requires cross-lingual alignment |
| Pro-Drop Language | Subject pronouns often omitted | Embeddings must infer from context |
| Formal vs. Informal Registers | Significant divergence between written and colloquial forms | Domain shift challenges |
Example of Agglutinative Complexity:
Root Word: "tulis" (write)
│
├── "meN-" → "menulis" (to write - active)
│     │
│     └── "-kan" → "menuliskan" (to write something down / write for someone)
│            │
│            └── "di-" → "dituliskan" (to be written down for someone - passive)
│
├── "di-" → "ditulis" (to be written - passive)
│
├── "peN-" → "penulis" (writer)
│     │
│     └── "-an" → "penulisan" (the act or process of writing)
│
├── "-an" → "tulisan" (a piece of writing)
│
└── "ter-" → "tertulis" (written / as stated in writing)
[!NOTE] Implication for Embedding Benchmarks: Indonesian embedding models must demonstrate robustness across these morphological variations. A comprehensive benchmark must include datasets that specifically test these phenomena.
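To make the challenge concrete, the sketch below probes how a multilingual encoder places morphological variants of "tulis" in embedding space. It is illustrative only: the model choice (LaBSE, listed later among the planned baselines) and the word list are assumptions, not a prescribed benchmark task.

```python
# Illustrative probe (not a benchmark task): compare embeddings of
# morphological variants of the root "tulis" with a multilingual encoder.
from sentence_transformers import SentenceTransformer, util

# Assumption: LaBSE is used here only because it appears among the planned baselines.
model = SentenceTransformer("sentence-transformers/LaBSE")

variants = ["tulis", "menulis", "menuliskan", "ditulis", "penulis", "penulisan", "tulisan"]
embeddings = model.encode(variants, normalize_embeddings=True)

# Pairwise cosine similarities; a robust model should keep derivations of the same
# root close while still separating, e.g., agent ("penulis") from process ("penulisan").
similarities = util.cos_sim(embeddings, embeddings)
for i, a in enumerate(variants):
    for j, b in enumerate(variants):
        if i < j:
            print(f"{a:12s} vs {b:12s}: {similarities[i][j].item():.3f}")
```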
2.3 Current NLP Infrastructure in Indonesia¶
| Resource Type | Status | Notable Examples |
|---|---|---|
| Pretrained Language Models | Emerging | IndoBERT, IndoBART, IndoGPT |
| Embedding Models | Limited | LazarusNLP collections (5-10 models) |
| NLU Benchmarks | Available | IndoNLU (12 tasks) |
| Embedding Benchmarks | None | This is the gap |
| Translation Models | Good | NLLB, SeamlessM4T, TranslateGemma |
3. Background: The MTEB Framework¶
3.1 What is MTEB?¶
MTEB (Massive Text Embedding Benchmark) is a standardized evaluation framework for text embedding models, introduced by Muennighoff et al. (2023) at EACL 2023 and significantly expanded through MMTEB at ICLR 2025.
Evolution Timeline:
┌─────────────────────────────────────────────────────────────────────────────┐
│ MTEB EVOLUTIONARY TIMELINE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 2022 (October) │
│ ════════════════ │
│ • Original MTEB paper released (arXiv:2210.07316) │
│ • 58 datasets, 112 languages, 8 task categories │
│ • Establishes unified evaluation protocol │
│ │
│ 2023 (April) │
│ ═════════════ │
│ • MTEB presented at EACL 2023 (Main Conference) │
│ • Paper: 1,400+ citations as of 2026 │
│ • HuggingFace integration launched │
│ │
│ 2024 │
│ ════ │
│ • Regional MTEBs emerge: C-MTEB (Chinese), AfriMTEB (African) │
│ • Dataset count exceeds 1,000 │
│ • Leaderboard becomes industry standard │
│ │
│ 2025 (January) │
│ ═══════════════ │
│ • MMTEB announced: Massive Multilingual expansion │
│ • 500+ tasks, 1,000+ languages │
│ • Community-driven governance model │
│ │
│ 2025 (May) │
│ ═════════════ │
│ • MMTEB presented at ICLR 2025 │
│ • New task categories: Instruction Following, Long-Document Retrieval │
│ • 86+ citations and growing rapidly │
│ │
│ 2026 (Current) │
│ ═════════════ │
│ • 1,308+ datasets in production │
│ • Active model submissions: 500+ models evaluated │
│ • Regional expansions: VN-MTEB, SEA-BED, others │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
3.2 The 8 MTEB Task Categories¶
Indonesia-MTEB will comprehensively cover all 8 MTEB task categories. Each category evaluates different aspects of embedding quality:
┌─────────────────────────────────────────────────────────────────────────────┐
│ MTEB TASK CATEGORIES & EVALUATION METRICS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ 1. CLASSIFICATION │ │
│ │ ──────────────── │ │
│ │ Task: Single-label text classification │ │
│ │ Metrics: Accuracy, F1-score (macro/micro) │ │
│ │ Example: Sentiment analysis, topic categorization │ │
│ │ Indonesian Focus: sentiment (NusaX), news classification (IndoNLU) │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ 2. CLUSTERING │ │
│ │ ───────────── │ │
│ │ Task: Group similar texts without labels │ │
│ │ Metrics: V-measure (homogeneity + completeness), ARI │ │
│ │ Example: Document clustering, topic discovery │ │
│ │ Indonesian Focus: news clustering, social media grouping │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ 3. PAIR CLASSIFICATION │ │
│ │ ────────────────────── │ │
│ │ Task: Binary classification of text pairs │ │
│ │ Metrics: Accuracy, Average Precision (AP) │ │
│ │ Example: Paraphrase detection, duplicate identification │ │
│ │ Indonesian Focus: paraphrase ID, semantic equivalence │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ 4. RERANKING │ │
│ │ ───────────── │ │
│ │ Task: Reorder retrieved documents by relevance │ │
│ │ Metrics: MAP (Mean Average Precision), nDCG │ │
│ │ Example: Search result refinement │ │
│ │ Indonesian Focus: document reranking, web search refinement │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ 5. RETRIEVAL │ │
│ │ ───────────── │ │
│ │ Task: Find relevant documents for queries │ │
│ │ Metrics: nDCG@k, Recall@k, MAP, MRR │ │
│ │ Example: Search engines, RAG systems │ │
│ │ Indonesian Focus: MIRACL-ID, Wikipedia retrieval, FAQ retrieval │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ 6. STS (Semantic Textual Similarity) │ │
│ │ ────────────────────────────────────── │ │
│ │ Task: Predict similarity scores for text pairs │ │
│ │ Metrics: Pearson correlation, Spearman correlation │ │
│ │ Example: Semantic relatedness, paraphrase similarity │ │
│ │ Indonesian Focus: translation-adapted STS datasets │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ 7. SUMMARIZATION │ │
│ │ ──────────────────── │ │
│ │ Task: Assess summary quality relative to source │ │
│ │ Metrics: Cosine similarity, ROUGE (as reference) │ │
│ │ Example: Summary relevance assessment │ │
│ │ Indonesian Focus: news summary evaluation │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ 8. INSTRUCTION FOLLOWING │ │
│ │ ──────────────────────── │ │
│ │ Task: Follow embedding-specific instructions │ │
│ │ Metrics: Task-specific (varies by instruction type) │ │
│ │ Example: Domain-specific retrieval, style-conditioned embedding │ │
│ │ Indonesian Focus: Domain instruction datasets │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
3.3 Key Evaluation Metrics Explained¶
For each task category, MTEB employs specific metrics. Understanding these is crucial for benchmark design:
[!NOTE] Metric Reference for Indonesia-MTEB Implementation:
| Metric | Formula | Range | Interpretation | Use Case |
|---|---|---|---|---|
| Accuracy | correct / total | [0, 1] | Percentage correct | Classification |
| F1-Score | 2·(precision·recall)/(precision+recall) | [0, 1] | Harmonic mean of precision/recall | Classification |
| V-Measure | 2·(homogeneity·completeness)/(homogeneity+completeness) | [0, 1] | Clustering quality independent of label permutation | Clustering |
| ARI | (RI - Expected_RI) / (Max_RI - Expected_RI) | [-1, 1] | Adjusted Rand Index - clustering similarity to ground truth | Clustering |
| MAP | mean(Average_Precision) | [0, 1] | Mean of average precision across queries | Retrieval, Reranking |
| nDCG@k | DCG@k / IDCG@k | [0, 1] | Normalized Discounted Cumulative Gain at position k | Retrieval, Reranking |
| Recall@k | relevant_in_top_k / total_relevant | [0, 1] | Percentage of relevant documents found in top k | Retrieval |
| MRR | mean(1/rank_of_first_relevant) | [0, 1] | Mean Reciprocal Rank | Retrieval |
| Pearson | covariance/(σ_x·σ_y) | [-1, 1] | Linear correlation between predicted and actual | STS |
| Spearman | rank_correlation | [-1, 1] | Monotonic correlation between predicted and actual | STS |
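For implementers, the snippet below shows how several of these metrics can be computed with off-the-shelf scikit-learn and SciPy functions. It is a minimal sketch with toy inputs, not the evaluation code Indonesia-MTEB will ship.

```python
# Minimal sketch: computing a few MTEB-style metrics with scikit-learn (toy data).
import numpy as np
from sklearn.metrics import (
    accuracy_score, f1_score,              # Classification
    v_measure_score, adjusted_rand_score,  # Clustering
    ndcg_score,                            # Retrieval / Reranking
)
from scipy.stats import pearsonr, spearmanr  # STS

# Classification
y_true, y_pred = [0, 1, 1, 2], [0, 1, 2, 2]
print("Accuracy:", accuracy_score(y_true, y_pred))
print("Macro F1:", f1_score(y_true, y_pred, average="macro"))

# Clustering (gold labels vs. predicted cluster assignments)
print("V-measure:", v_measure_score([0, 0, 1, 1], [1, 1, 0, 0]))
print("ARI:", adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0]))

# Retrieval: graded relevance vs. model scores for one query
relevance = np.asarray([[3, 2, 0, 1]])
scores = np.asarray([[0.9, 0.7, 0.5, 0.2]])
print("nDCG@3:", ndcg_score(relevance, scores, k=3))

# STS: predicted similarity vs. gold similarity
gold = [0.1, 0.5, 0.9, 0.7]
pred = [0.2, 0.4, 0.8, 0.6]
print("Pearson:", pearsonr(gold, pred)[0])
print("Spearman:", spearmanr(gold, pred).correlation)
```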
3.4 MTEB Leaderboard & Submission Process¶
The MTEB leaderboard, hosted on HuggingFace, serves as the central hub for embedding model evaluation:
┌─────────────────────────────────────────────────────────────────────────────┐
│ MTEB LEADERBOARD SUBMISSION PROCESS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. MODEL PREPARATION │
│ ──────────────────── │
│ • Upload model to HuggingFace Hub │
│ • Ensure model card includes: │
│ - Model architecture │
│ - Training data sources │
│ - Parameter count │
│ - License information │
│ │
│ 2. SUBMISSION PACKAGE │
│ ──────────────────── │
│ • Fork MTEB repository │
│ • Add model metadata to models/registry.yaml │
│ • Format: │
│ name: "ModelName" │
│ language: ["id"] # for Indonesian models │
│ open_source: true │
│ revision: "commit_hash" │
│ │
│ 3. AUTOMATED EVALUATION │
│ ──────────────────── │
│ • MTEB CI automatically evaluates on all benchmarks │
│ • Results aggregated across task categories │
│ • Leaderboard updated automatically │
│ │
│ 4. TRANSPARENCY REQUIREMENTS │
│ ──────────────────────────── │
│ • Reference implementation required │
│ • Training data disclosure │
│ • Reproducibility checklist │
│ • Code availability │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
[!TIP] For Indonesia-MTEB: We will establish integration with the MTEB leaderboard through:
1. Official dataset submission to MTEB repository
2. Indonesian-specific leaderboard sub-section
3. Automated evaluation pipeline for Indonesian models
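A sketch of how Indonesian tasks could be evaluated through the existing `mteb` Python package is shown below. It assumes Indonesia-MTEB tasks have been registered upstream; the calls shown (`get_tasks`, `MTEB`, `run`) follow recent `mteb` releases and the model identifier is simply one of the multilingual baselines named later, so both should be verified against the installed version.

```python
# Sketch: evaluating a multilingual baseline on Indonesian tasks via the `mteb` package.
# Assumes Indonesia-MTEB tasks are registered upstream; verify API details per version.
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")

# Select tasks whose language metadata includes Indonesian (ISO 639-3 code "ind").
tasks = mteb.get_tasks(languages=["ind"])

evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/multilingual-e5-large")

for task_result in results:
    # Attribute names may differ slightly across mteb versions.
    print(task_result.task_name, task_result.scores)
```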
4. The Gap Analysis¶
4.1 Current Indonesian Embedding Landscape¶
A comprehensive analysis reveals significant gaps in Indonesian embedding evaluation infrastructure:
| Resource | Type | Coverage | MTEB-Compatible | Status |
|---|---|---|---|---|
| IndoNLU | NLU Benchmark | 12 tasks, Indonesian only | ❌ NLU tasks only, not embedding-specific | Established (2020) |
| NusaX | Sentiment Dataset | 10 Indonesian local languages | ❌ Single task (sentiment) | Established (2022) |
| IndoMMLU | Knowledge QA | Culture + language understanding | ❌ Knowledge-focused, not embedding | Available |
| MIRACL-ID | Retrieval | Indonesian subset of 18 languages | ⚠️ Partial - retrieval only | Available |
| LazarusNLP | Embedding Models | 5-10 Indonesian embedding models | ❌ Models, not benchmark | Active (2024) |
| SEA-BED | Regional Benchmark | 10 SEA languages, 169 datasets, 9 tasks | ⚠️ Multi-language, not Indonesia-focused | New (2025) |
| SEACrowd | Data Hub | 13 tasks, 38 SEA indigenous languages | ⚠️ Includes Indonesian but not embedding-specific | New (2024) |
| Indonesia-MTEB | Embedding Benchmark | 8 tasks, 50-100+ datasets | ✅ Full MTEB compatibility | This Project |
Key Findings:
- No Indonesia-Specific Embedding Benchmark: Existing resources are either NLU-focused (IndoNLU) or multi-language (SEA-BED, SEACrowd) without dedicated emphasis on Indonesian embeddings.
- Fragmented Task Coverage: No single resource covers all 8 MTEB task categories for Indonesian.
- No Centralized Evaluation: Indonesian embedding models (LazarusNLP) are evaluated on scattered datasets without unified comparison.
4.2 Comparison with Regional Benchmarks¶
| Benchmark | Language | Datasets | Tasks | MTEB Integration | Indonesia Coverage |
|---|---|---|---|---|---|
| C-MTEB | Chinese | 35 | 6 | ✅ Full | N/A |
| VN-MTEB | Vietnamese | ~30+ | Multi | ✅ Full | N/A |
| AfriMTEB | African languages | Subset | Multi | ✅ Full | N/A |
| SEA-BED | 10 SEA languages | 169 | 9 | ⚠️ Independent | Partial (1 of 10) |
| Indonesia-MTEB | Indonesian | 50-100+ | 8 | 🎯 Planned | 🎯 100% |
Positioning: Indonesia-MTEB will be the first dedicated Indonesian embedding benchmark with full MTEB methodology compatibility and comprehensive task coverage.
5. Regional MTEB Precedents¶
5.1 Successful Regional Benchmarks¶
Analysis of existing regional MTEB implementations provides valuable methodological precedents:
C-MTEB (Chinese Massive Text Embedding Benchmark)¶
Specification:
| Aspect | Details |
|---|---|
| Language | Chinese (Simplified & Traditional) |
| Scale | 35 datasets, 6 task categories |
| Paper | Xiao et al. (2023) - "C-Pack: Packed Resources For General Chinese Embeddings" |
| Citations | 1,171+ (as of 2024) |
| Repository | HuggingFace C-MTEB collection |
| Key Innovation | C-Pack: bundled benchmark (C-MTEB), training corpus (C-MTP), and BGE embedding models |
Methodological Insights for Indonesia-MTEB:
- Emphasis on domain diversity (news, medical, legal, e-commerce)
- Separate evaluation for Simplified vs. Traditional variants
- Comprehensive baseline evaluation (30+ models)
VN-MTEB (Vietnamese Massive Text Embedding Benchmark)¶
Specification:
| Aspect | Details |
|---|---|
| Language | Vietnamese |
| Scale | ~30 datasets, multi-task |
| Paper | Pham et al. (2025) - arXiv:2507.21500 |
| Publication Date | July 2025 |
| Key Focus | Toxicity detection, online content moderation |
| Repository | GreenNode/VN-MTEB collection on HuggingFace |
Methodological Insights for Indonesia-MTEB:
- Recent publication demonstrates viability of new language benchmarks
- Domain-specific focus (toxicity) as novel contribution
- Community-driven model collection approach
SEA-BED (Southeast Asia Embedding Benchmark)¶
Specification:
| Aspect | Details |
|---|---|
| Languages | 10 SEA languages (Indonesian, Thai, Vietnamese, etc.) |
| Scale | 169 datasets, 9 tasks |
| Paper | Ponwitayarat et al. (2025) - arXiv:2508.12243 |
| Publication Date | August 2025 |
| Novelty | 87% of datasets not in MMTEB |
| Human Annotations | 71% human-formulated datasets |
Methodological Insights for Indonesia-MTEB:
- Demonstrates regional benchmark viability
- High proportion of novel (non-MMTEB) datasets validates unique regional needs
- Human annotation emphasis for quality control
5.2 Lessons Learned for Indonesia-MTEB¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ METHODOLOGICAL BEST PRACTICES FROM REGIONAL MTEBS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ FROM C-MTEB (Chinese) │
│ ──────────────────────── │
│ ✓ Domain diversity is critical for comprehensive evaluation │
│ ✓ Multi-grained evaluation (character, word, sentence level) │
│ ✓ Comprehensive baseline evaluation establishes performance floor │
│ │
│ FROM VN-MTEB (Vietnamese) │
│ ──────────────────────────────── │
│ ✓ Domain-specific focus can be a novel contribution │
│ ✓ Community-driven model collection accelerates adoption │
│ ✓ HuggingFace integration maximizes accessibility │
│ │
│ FROM SEA-BED (Southeast Asia) │
│ ──────────────────────────────────────── │
│ ✓ Regional datasets often differ from global MTEB - prioritize novelty │
│ ✓ High human annotation ratio ensures quality │
│ ✓ Language-specific challenges (agglutinative morphology, etc.) warrant │
│ specialized datasets │
│ │
│ INDONESIA-MTEB SYNTHESIS │
│ ──────────────────────── │
│ ✓ Combine domain diversity with Indonesian-specific focus │
│ ✓ Emphasize morphological complexity in dataset design │
│ ✓ High human validation ratio (minimum 10% of translated data) │
│ ✓ Full HuggingFace + MTEB ecosystem integration │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
6. Project Scope & Deliverables¶
6.1 In-Scope Deliverables¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ INDONESIA-MTEB DELIVERABLES │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ╔═════════════════════════════════════════════════════════════════════════╗│
│ ║ DELIVERABLE 1: DATASET SUITE ║│
│ ║ ────────────────────────── ║│
│ ║ Specification: ║│
│ ║ • All 8 MTEB task categories covered ║│
│ ║ • Minimum 50 datasets (target: 100+) ║│
│ ║ • Train/validation/test splits for supervised tasks ║│
│ ║ • Metadata documentation (license, source, creation method) ║│
│ ║ ║│
│ ║ Data Sources: ║│
│ ║ • Aggregation: ~20-30 existing Indonesian datasets ║│
│ ║ • Translation: ~40-60 translated MTEB datasets ║│
│ ║ • AI-Generated: ~10-20 novel Indonesian datasets ║│
│ ║ ║│
│ ║ Format: ║│
│ ║ • HuggingFace datasets format ║│
│ ║ • MTEB-compatible metadata ║│
│ ║ • Comprehensive documentation cards ║│
│ ╚═════════════════════════════════════════════════════════════════════════╝│
│ │
│ ╔═════════════════════════════════════════════════════════════════════════╗│
│ ║ DELIVERABLE 2: EVALUATION FRAMEWORK ║│
│ ║ ───────────────────────────────── ║│
│ ║ Components: ║│
│ ║ • MTEB-compatible evaluation script ║│
│ ║ • Indonesian-specific metric calculations ║│
│ ║ • Baseline model evaluations (10+ models) ║│
│ ║ • Leaderboard integration (HuggingFace Spaces) ║│
│ ║ • Reproducibility guarantees ║│
│ ║ ║│
│ ║ Models for Baseline Evaluation: ║│
│ ║ • Multilingual: E5, BGE, GTE, jina (current SOTA) ║│
│ ║ • Indonesian-specific: LazarusNLP models ║│
│ ║ • General: sentence-transformers baselines ║│
│ ╚═════════════════════════════════════════════════════════════════════════╝│
│ │
│ ╔═════════════════════════════════════════════════════════════════════════╗│
│ ║ DELIVERABLE 3: PYTHON PACKAGE ║│
│ ║ ───────────────────────────── ║│
│ ║ Package: indonesiamteb (PyPI) ║│
│ ║ Features: ║│
│ ║ • pip install indonesiamteb ║│
│ ║ • Easy dataset loading: load_benchmark(task_name) ║│
│ ║ • One-line evaluation: evaluate(model, benchmark) ║│
│ ║ • Leaderboard submission tools ║│
│ ║ • Comprehensive documentation ║│
│ ╚═════════════════════════════════════════════════════════════════════════╝│
│ │
│ ╔═════════════════════════════════════════════════════════════════════════╗│
│ ║ DELIVERABLE 4: RESEARCH PAPER ║│
│ ║ ────────────────────────── ║│
│ ║ Target Venue: ACL/EMNLP/NAACL dataset track ║│
│ ║ Sections: ║│
│ ║ • Abstract & Introduction ║│
│ ║ • Background & Related Work (MTEB, Indonesian NLP, regional MTEBs) ║│
│ ║ • Methodology (data acquisition, translation, generation) ║│
│ ║ • Dataset descriptions (all datasets with statistics) ║│
│ ║ • Baseline evaluation results ║│
│ ║ • Cross-lingual analysis (ID ↔ EN performance) ║│
│ ║ • Limitations & Ethics ║│
│ ║ • Conclusion & Future Work ║│
│ ╚═════════════════════════════════════════════════════════════════════════╝│
│ │
└─────────────────────────────────────────────────────────────────────────────┘
6.2 Out-of-Scope (Explicitly Excluded)¶
| Excluded | Reason | Alternative Approach |
|---|---|---|
| Training new embedding models | Focus is benchmark, not models | Evaluate existing models; model training is separate project |
| Domain-specific evaluation | Keep benchmark general-purpose | Domain-specific datasets included but benchmark remains general |
| Indonesian local languages (Javanese, Sundanese, etc.) | Focus on Bahasa Indonesia first | Future expansion to regional languages |
| Real-time leaderboard hosting | Infrastructure scope | HuggingFace Spaces integration; no independent hosting |
| Commercial applications | Research focus | Open-source for community use |
6.3 Success Criteria¶
| Metric | Target | Measurement Method |
|---|---|---|
| Task Coverage | All 8 MTEB categories | Dataset inventory |
| Dataset Count | Minimum 50, target 100+ | Final dataset count |
| Translation Quality | ≥ 85% human acceptance rate | Human validation on 10% sample |
| Baseline Models | ≥ 10 models evaluated | Evaluation results |
| Publication | ACL/EMNLP/NAACL dataset paper | Acceptance notification |
| MTEB Integration | Official integration into MTEB | Pull request acceptance |
| Package Usage | ≥ 50 monthly downloads (6 months post-release) | PyPI statistics |
| Community Adoption | ≥ 5 models use Indonesia-MTEB for evaluation | Leaderboard, GitHub citations |
7. Research Questions¶
7.1 Primary Research Questions¶
RQ1: Gap Analysis & State of the Art
What is the current state of Indonesian embedding evaluation, and what specific gaps exist compared to MTEB standards?
Sub-questions:
- RQ1.1: Which Indonesian NLP datasets exist and what is their MTEB compatibility?
- RQ1.2: What task categories are currently underrepresented for Indonesian?
- RQ1.3: How do existing Indonesian embedding models perform on MTEB-style evaluations?
RQ2: Translation Methodology
How can we effectively translate MTEB datasets to Indonesian while preserving semantic equivalence and task validity?
Sub-questions:
- RQ2.1: Which translation model (TranslateGemma, NLLB, mT5) achieves optimal quality for Indonesian?
- RQ2.2: What quality control mechanisms (human validation, LLM-as-judge) ensure semantic preservation?
- RQ2.3: How does translation impact embedding model performance relative to original English datasets?
RQ3: Novel Dataset Generation
What novel Indonesian embedding tasks can be created via AI generation that fill unique gaps not addressed by translation or aggregation?
Sub-questions:
- RQ3.1: Which task categories remain underserved after aggregation and translation?
- RQ3.2: How can LLMs generate high-quality Indonesian datasets with statistical consistency?
- RQ3.3: What Indonesian-specific linguistic phenomena should novel datasets target?
RQ4: Baseline Evaluation
How do existing embedding models (multilingual and Indonesian-specific) perform on a unified Indonesian benchmark across all 8 task categories?
Sub-questions:
- RQ4.1: Which model architectures excel on which task types for Indonesian?
- RQ4.2: How does Indonesian performance correlate with performance on other languages?
- RQ4.3: What performance gaps exist between multilingual and Indonesian-specific models?
RQ5: Cross-Lingual Analysis
What does Indonesia-MTEB reveal about cross-lingual embedding capabilities and transfer learning to Indonesian?
Sub-questions:
- RQ5.1: How do models trained on English/other languages transfer to Indonesian?
- RQ5.2: What is the performance gap between monolingual Indonesian and multilingual models?
- RQ5.3: Can Indonesia-MTEB inform embedding model design for other agglutinative languages?
8. Proposed Methodology¶
8.1 Phase Overview¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ INDONESIA-MTEB METHODOLOGY PHASES │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ PHASE 1: AGGREGATION PHASE 2: TRANSLATION │
│ ────────────────────── ──────────────────── │
│ │ │ │
│ │ • Dataset discovery │ • MTEB dataset selection │
│ │ • Format conversion │ • Translation model benchmark │
│ │ • Quality assessment │ • Batch translation │
│ │ • MTEB compatibility check │ • Quality control pipeline │
│ │ │ • Human validation (10% sample) │
│ │ │ │
│ └──────────────────────────────┘ └──────────────────────────────────────┘
│ │ │ │
│ ▼ ▼ │
│ │
│ ╔═══════════════════════════════════════════════════════════════════════╗│
│ ║ PHASE 3: NOVEL DATASET GENERATION ║│
│ ║ ───────────────────────────────────── ║│
│ ║ ║│
│ ║ • Gap identification (post-aggregation + translation) ║│
│ ║ • LLM prompt engineering for dataset generation ║│
│ ║ • Domain-specific dataset creation (legal, medical, etc.) ║│
│ ║ • Statistical consistency validation ║│
│ ║ • Human expert review ║│
│ ║ ║│
│ ╚═══════════════════════════════════════════════════════════════════════╝│
│ │ │
│ ▼ │
│ │
│ ╔═══════════════════════════════════════════════════════════════════════╗│
│ ║ PHASE 4: INTEGRATION & VALIDATION ║│
│ ║ ────────────────────────────────────── ║│
│ ║ ║│
│ ║ • Unified dataset format validation ║│
│ ║ • Baseline model evaluation (10+ models) ║│
│ ║ • Statistical analysis of results ║│
│ ║ • Cross-lingual comparison ║│
│ ║ • Leaderboard deployment ║│
│ ║ • Paper writing and submission ║│
│ ║ ║│
│ ╚═══════════════════════════════════════════════════════════════════════╝│
│ │
└─────────────────────────────────────────────────────────────────────────────┘
8.2 Phase 1: Aggregation - Detailed Methodology¶
Objective: Identify, convert, and validate existing Indonesian datasets for MTEB compatibility.
Step 1: Dataset Discovery
| Source | Datasets of Interest | MTEB Category |
|---|---|---|
| IndoNLU | SMSA, EmoT, etc. | Classification |
| NusaX | Sentiment (10 languages) | Classification |
| IndoMMLU | Knowledge QA | Classification |
| MIRACL-ID | Wikipedia retrieval | Retrieval |
| SEACrowd | Various tasks | Multiple |
Step 2: Format Conversion
- Target Format: HuggingFace datasets with MTEB-specific schema
- Required Fields: text, label, split (train/validation/test)
- Metadata: license, source language, domain, creation date
Step 3: Quality Assessment
- Check for data leakage between splits
- Verify label distribution balance
- Assess text quality (encoding issues, noise)
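The sketch below illustrates the kind of conversion and leakage check intended in Steps 2 and 3. The source file name, column names, split ratios, and target repository id are placeholders, since schemas vary per source dataset.

```python
# Sketch: convert a source classification dataset to an MTEB-style
# HuggingFace DatasetDict and check for text leakage across splits.
# File name, column names, split ratios, and repo id are illustrative placeholders.
from datasets import load_dataset, DatasetDict

raw = load_dataset("csv", data_files="source_dataset.csv")["train"]
raw = raw.rename_columns({"sentence": "text", "category": "label"})

# 80/10/10 split with a fixed seed for reproducibility.
train_rest = raw.train_test_split(test_size=0.2, seed=42)
val_test = train_rest["test"].train_test_split(test_size=0.5, seed=42)
dataset = DatasetDict(
    {"train": train_rest["train"], "validation": val_test["train"], "test": val_test["test"]}
)

# Leakage check: identical texts must not appear in more than one split.
seen = {split: set(dataset[split]["text"]) for split in dataset}
leaked = (seen["train"] & seen["test"]) | (seen["train"] & seen["validation"])
assert not leaked, f"{len(leaked)} texts leak across splits"

dataset.push_to_hub("indonesia-mteb/example-classification")  # placeholder repo id
```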
8.3 Phase 2: Translation - Detailed Methodology¶
Objective: Translate selected MTEB datasets to Indonesian with semantic preservation.
Step 1: Translation Model Selection
| Model | Parameters | Languages | Strength | Weakness |
|---|---|---|---|---|
| TranslateGemma | 4B / 12B / 27B | 55 | Latest, optimized | New (2026) |
| NLLB-200 | 3.3B | 200 | Proven quality | Older architecture |
| mT5 | 580M / 1.1B | 101 | Flexible | Requires fine-tuning |
| SeamlessM4T | 2.3B | 100 | Multimodal | Overkill for text-only |
Step 2: Translation Pipeline
┌─────────────────────────────────────────────────────────────────────────────┐
│ TRANSLATION QUALITY CONTROL PIPELINE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ SOURCE TEXT (English MTEB dataset) │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ AUTOMATED TRANSLATION (TranslateGemma 12B) │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ LLM-AS-JUDGE VALIDATION (GPT-4 / Claude) │ │
│ │ Criteria: │ │
│ │ • Semantic equivalence (1-5 scale) │ │
│ │ • Grammatical correctness │ │
│ │ • Cultural appropriateness │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ├──────────────┬──────────────┐ │
│ ▼ ▼ ▼ │
│ ACCEPT REJECT FLAG │
│ │ │ │ │
│ │ │ ▼ │
│ │ │ ┌───────────────────┐ │
│ │ │ │ HUMAN REVIEW │ │
│ │ │ │ (10% sample) │ │
│ │ │ └───────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌───────────────────────────────────────────────────┐ │
│ │ FINAL INDONESIAN DATASET │ │
│ │ • Accepted translations │ │
│ │ • Human-reviewed corrections │ │
│ │ • Quality score metadata │ │
│ └───────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Step 3: Quality Metrics
- Semantic Preservation Score: LLM-as-judge rating (1-5)
- Acceptance Threshold: ≥ 4.0/5.0
- Human Validation Rate: 10% random sample + all flagged items
- Target Human Acceptance Rate: ≥ 85%
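A minimal sketch of the LLM-as-judge step is shown below. The prompt wording, score parsing, and the `judge` callable are assumptions; in practice the callable would wrap whichever LLM is chosen (GPT-4, Claude, or a local model), and the 3.0 flag threshold is illustrative.

```python
# Sketch of LLM-as-judge scoring for translation quality (Phase 2).
# `judge` is a placeholder for a call to the chosen LLM; the 4.0 acceptance
# threshold mirrors the criteria above, while the 3.0 flag threshold is assumed.
import re
from typing import Callable

JUDGE_PROMPT = """Rate the Indonesian translation of the English source on a 1-5 scale
for semantic equivalence, grammatical correctness, and cultural appropriateness.
Answer with a single number.

English: {source}
Indonesian: {translation}
Score:"""

def score_translation(source: str, translation: str, judge: Callable[[str], str]) -> float:
    """Return the judge's 1-5 score, or 0.0 if no number can be parsed."""
    reply = judge(JUDGE_PROMPT.format(source=source, translation=translation))
    match = re.search(r"[1-5](?:\.\d+)?", reply)
    return float(match.group()) if match else 0.0

def triage(score: float, accept_threshold: float = 4.0, flag_threshold: float = 3.0) -> str:
    """Route a translation to ACCEPT, FLAG (human review), or REJECT."""
    if score >= accept_threshold:
        return "ACCEPT"
    return "FLAG" if score >= flag_threshold else "REJECT"
```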
8.4 Phase 3: AI Generation - Detailed Methodology¶
Objective: Create novel Indonesian datasets for underserved tasks.
Target Domains:
| Domain | Rationale | Task Category |
|---|---|---|
| Legal | Complex morphology in legal texts | Classification, Retrieval |
| Healthcare | Technical terminology, code-switching | STS, Classification |
| Finance | Numeral expressions, named entities | Clustering, Pair Classification |
| Social Media | Informal language, slang | Sentiment, STS |
| News | Formal Indonesian, topic diversity | Clustering, Retrieval |
Generation Methodology:
- Gap Identification: Analyze coverage after Phases 1-2
- Prompt Engineering: Design prompts for LLM dataset generation
- Iterative Generation: Generate, validate, refine
- Statistical Checks: Label distribution, text length, vocabulary diversity
- Human Review: Domain expert validation
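The statistical checks listed above could be automated along the lines of the following sketch; the metric choices and any thresholds later applied to them are placeholders to be tuned per task.

```python
# Sketch: basic statistical consistency checks for an AI-generated dataset.
# Metric choices are illustrative; acceptance thresholds would be set per task.
from collections import Counter

def consistency_report(texts: list[str], labels: list[str]) -> dict:
    label_counts = Counter(labels)
    lengths = [len(t.split()) for t in texts]
    vocab = {token.lower() for t in texts for token in t.split()}
    return {
        "label_distribution": dict(label_counts),
        # Imbalance ratio: most frequent label vs. least frequent label.
        "label_imbalance": max(label_counts.values()) / min(label_counts.values()),
        "mean_length_tokens": sum(lengths) / len(lengths),
        # Type-token ratio as a rough vocabulary-diversity signal.
        "type_token_ratio": len(vocab) / sum(lengths),
        "duplicate_texts": len(texts) - len(set(texts)),
    }

report = consistency_report(
    ["Harga saham naik tajam hari ini.", "Pasien mengeluh demam tinggi."],
    ["finance", "health"],
)
print(report)
```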
8.5 Phase 4: Integration & Validation - Detailed Methodology¶
Objective: Unify all datasets, evaluate baselines, and publish.
Step 1: Unified Format Validation
- Schema validation across all datasets
- Consistent metadata formatting
- HuggingFace dataset card generation
Step 2: Baseline Evaluation
┌─────────────────────────────────────────────────────────────────────────────┐
│ BASELINE MODEL EVALUATION MATRIX │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────┐ │
│ │ MODELS TO EVALUATE │ │
│ ├─────────────────────────────────────┤ │
│ │ Multilingual: │ │
│ │ • E5-large-v2 │ │
│ │ • bge-m3 (multilingual) │ │
│ │ • gte-large │ │
│ │ • jina-embeddings-v3 │ │
│ │ │ │
│ │ Indonesian-Specific: │ │
│ │ • LazarusNLP/indonesian-sbert... │ │
│ │ • (others from HuggingFace) │ │
│ │ │ │
│ │ Baselines: │ │
│ │ • sentence-transformers/LaBSE │ │
│ │ • sentence-transformers/distiluse │ │
│ └─────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ TASK CATEGORIES │ │
│ ├─────────────────────────────────────┤ │
│ │ 1. Classification │ 5. Retrieval │ │
│ │ 2. Clustering │ 6. STS │ │
│ │ 3. Pair Class. │ 7. Summariz. │ │
│ │ 4. Reranking │ 8. Instr. Fol. │ │
│ └─────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ OUTPUT │ │
│ ├─────────────────────────────────────┤ │
│ │ • Per-task performance scores │ │
│ │ • Aggregate benchmark score │ │
│ │ • Cross-lingual comparisons │ │
│ │ • Leaderboard rankings │ │
│ └─────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Step 3: Statistical Analysis
- Mean performance across models per task
- Performance variance analysis
- Correlation between tasks (task similarity)
- Cross-lingual performance correlation
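A sketch of this analysis using pandas is shown below. The results layout (models as rows, task categories as columns) and all numbers are placeholders illustrating the intended computations, not real scores.

```python
# Sketch: aggregate analysis of baseline results (models x task categories).
# All values are placeholders, not actual evaluation results.
import pandas as pd

results = pd.DataFrame(
    {"Classification": [0.71, 0.68, 0.65], "Retrieval": [0.55, 0.60, 0.48], "STS": [0.78, 0.74, 0.70]},
    index=["multilingual-e5-large", "bge-m3", "LaBSE"],
)

# Mean and variance per task across models.
print(results.mean(axis=0))
print(results.var(axis=0))

# Spearman correlation between tasks: do models that rank well on one task
# also rank well on another?
print(results.corr(method="spearman"))

# Aggregate benchmark score per model (unweighted mean over tasks).
print(results.mean(axis=1).sort_values(ascending=False))
```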
9. Technical Architecture¶
9.1 System Architecture¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ INDONESIA-MTEB TECHNICAL ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ DATA LAYER │ │
│ │ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐ │ │
│ │ │ HuggingFace │ │ Source Files │ │ Generated Data │ │ │
│ │ │ Datasets Hub │ │ (IndoNLU, etc) │ │ (AI-created) │ │ │
│ │ └────────────────┘ └────────────────┘ └────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ PROCESSING LAYER │ │
│ │ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐ │ │
│ │ │ Format │ │ Translation │ │ Quality │ │ │
│ │ │ Converters │ │ Pipeline │ │ Validation │ │ │
│ │ └────────────────┘ └────────────────┘ └────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ EVALUATION LAYER │ │
│ │ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐ │ │
│ │ │ MTEB Core │ │ Custom Metrics │ │ Statistical │ │ │
│ │ │ Evaluator │ │ (ID-specific) │ │ Analysis │ │ │
│ │ └────────────────┘ └────────────────┘ └────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ PRESENTATION LAYER │ │
│ │ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐ │ │
│ │ │ HuggingFace │ │ PyPI Package │ │ Documentation │ │ │
│ │ │ Spaces │ │ CLI/API │ │ Site │ │ │
│ │ └────────────────┘ └────────────────┘ └────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
9.2 Python Package Structure¶
indonesiamteb/
├── indonesiamteb/
│ ├── __init__.py
│ ├── data/
│ │ ├── __init__.py
│ │ ├── loading.py # Dataset loading utilities
│ │ └── metadata.py # Dataset metadata registry
│ ├── tasks/
│ │ ├── __init__.py
│ │ ├── classification.py # Classification task wrappers
│ │ ├── clustering.py # Clustering task wrappers
│ │ ├── retrieval.py # Retrieval task wrappers
│ │ ├── sts.py # STS task wrappers
│ │ └── ... # Other task types
│ ├── evaluation/
│ │ ├── __init__.py
│ │ ├── evaluator.py # Main evaluation class
│ │ ├── metrics.py # Custom metrics
│ │ └── leaderboard.py # Leaderboard utilities
│ └── utils/
│ ├── __init__.py
│ ├── translation.py # Translation utilities
│ └── validation.py # Quality validation
├── benchmarks/
│ ├── classification/
│ ├── clustering/
│ └── ... # Dataset implementations
├── tests/
│ ├── test_data.py
│ ├── test_evaluation.py
│ └── test_tasks.py
├── examples/
│ ├── basic_usage.py
│ └── custom_evaluation.py
├── setup.py
├── pyproject.toml
└── README.md
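The intended end-user workflow, matching the features listed under Deliverable 3, might look like the sketch below. The `load_benchmark` / `evaluate` names mirror the planned API and are design targets rather than a published package; the model id is a placeholder.

```python
# Planned usage sketch for the future `indonesiamteb` package (design target,
# not yet on PyPI); function names follow Deliverable 3.
# pip install indonesiamteb
from indonesiamteb import load_benchmark, evaluate
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("LazarusNLP/all-indo-e5-small-v4")  # placeholder model id

# Load one task category (or the full suite) as MTEB-compatible tasks.
benchmark = load_benchmark("classification")

# One-line evaluation returning per-dataset and aggregate scores.
results = evaluate(model, benchmark)
print(results)
```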
10. Success Criteria¶
10.1 Quantitative Metrics¶
| Metric | Minimum | Target | Stretch |
|---|---|---|---|
| Total Datasets | 50 | 100 | 150+ |
| Task Coverage | 8/8 categories | 8/8 categories | 8/8 categories |
| Translation Acceptance Rate | 85% | 90% | 95% |
| Baseline Models Evaluated | 10 | 15 | 20+ |
| MTEB Integration | Official submission | Accepted | Featured |
| PyPI Downloads (6 months) | 50 | 500 | 1000+ |
| Community Adoptions | 3 models | 10 models | 20+ models |
| Paper Citations (1 year) | 5 | 20 | 50+ |
10.2 Qualitative Milestones¶
- All datasets pass quality validation
- Baseline evaluation complete with documented results
- Python package published on PyPI
- HuggingFace Spaces leaderboard deployed
- Research paper submitted to top-tier venue
- Community engagement (GitHub stars, forks, discussions)
- Integration with MTEB main repository
11. Timeline & Milestones¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ INDONESIA-MTEB PROJECT TIMELINE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ MONTH 1-2: FOUNDATION │
│ ═══════════════════ │
│ ✓ Literature review complete │
│ ✓ Dataset inventory finalized │
│ ✓ Translation model benchmark selected │
│ ✓ Technical architecture designed │
│ │
│ MONTH 3-4: DATA ACQUISITION │
│ ════════════════════════ │
│ ✓ Phase 1: Aggregation complete (20-30 datasets) │
│ ✓ Phase 2: Translation pipeline operational │
│ ✓ Phase 3: AI generation begins │
│ │
│ MONTH 5-6: DATASET COMPLETION │
│ ═════════════════════════ │
│ ✓ Translation complete (40-60 datasets) │
│ ✓ AI-generated datasets complete (10-20 datasets) │
│ ✓ Quality validation complete │
│ │
│ MONTH 7-8: EVALUATION │
│ ═════════════════ │
│ ✓ Baseline model evaluations (10+ models) │
│ ✓ Statistical analysis complete │
│ ✓ Cross-lingual comparison complete │
│ │
│ MONTH 9-10: PACKAGE & PAPER │
│ ════════════════════════ │
│ ✓ Python package development complete │
│ ✓ HuggingFace integration complete │
│ ✓ Research paper drafted │
│ │
│ MONTH 11-12: PUBLICATION │
│ ═════════════════════ │
│ ✓ Paper submitted to target venue │
│ ✓ PyPI package released │
│ ✓ Leaderboard deployed │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
12. References¶
12.1 Primary Sources¶
- Muennighoff, N., et al. (2023). "MTEB: Massive Text Embedding Benchmark". Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2023). arXiv:2210.07316
- Enevoldsen, K., et al. (2025). "MMTEB: Massive Multilingual Text Embedding Benchmark". International Conference on Learning Representations (ICLR 2025). arXiv:2502.13595
- Xiao, S., et al. (2023). "C-Pack: Packed Resources For General Chinese Embeddings". arXiv:2309.07597
12.2 Regional Benchmarks¶
- Ponwitayarat, W., et al. (2025). "SEA-BED: Southeast Asia Embedding Benchmark". arXiv:2508.12243
- Pham, L., et al. (2025). "VN-MTEB: Vietnamese Massive Text Embedding Benchmark". arXiv:2507.21500
12.3 Indonesian NLP Resources¶
- Wilie, B., et al. (2020). "IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding". arXiv:2009.05387
- Winata, G., et al. (2022). "NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages". arXiv:2205.15960
- Lovenia, H., et al. (2024). "SEACrowd: A Multilingual Multimodal Data Hub and Benchmark for Southeast Asian Languages". Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024).
12.4 Translation Models¶
- Finkelstein, M., et al. (2026). "TranslateGemma Technical Report". arXiv:2601.09012
- NLLB Team (2022). "No Language Left Behind: Scaling Human-Centered Machine Translation". arXiv:2207.04872
12.5 Evaluation Methodology¶
- Rosenberg, A., & Hirschberg, J. (2007). "V-Measure: A Conditional Entropy-Based External Cluster Evaluation Measure". EMNLP-CoNLL 2007.
- Humeun, L., et al. (2025). "HUME: Measuring the Human-Model Performance Gap in Text Embedding Tasks". arXiv:2510.10062
13. Document Status¶
[!NOTE] Next Document: Document 02 - MTEB Structure Analysis
This document provides detailed analysis of MTEB's internal structure, dataset formats, evaluation protocols, and integration requirements for Indonesia-MTEB.
Change Log:
| Version | Date | Changes | Author |
|---|---|---|---|
| 1.0 | 2026-01-25 | Initial version | Research Team |
| 2.0 | 2026-01-25 | Enhanced edition with expanded sections, latest research | Research Team |
This document is a living record. Updated as research progresses.