Project: Indonesia-MTEB Benchmark
Document: 04 - Regional MTEB Methodologies Analysis (ENHANCED)
Last Updated: 2026-01-25
Version: 2.0 - Enhanced with Latest Research (2024-2025)
Regional MTEB Methodologies: A Comprehensive Comparative Analysis¶
"Understanding how successful regional MTEBs were constructed provides the blueprint for Indonesia-MTEB. This document analyzes 10+ regional benchmarks with latest research findings from 2024-2025."
Executive Summary¶
Key Findings
- 10+ regional MTEBs analyzed with latest 2024-2025 research
- C-MTEB leads with 1,171+ citations (SIGIR 2024)
- VN-MTEB's 3-stage pipeline sets standard for translation quality
- ArabicMTEB introduces dialect-aware and cross-cultural evaluation
- Indonesia-MTEB can leverage multilingual approaches + SEA-BED integration
graph TD
A[Regional MTEB Landscape 2024-2025] --> B[Asian Languages]
A --> C[European Languages]
A --> D[Middle Eastern/African]
A --> E[Regional/SEA]
B --> B1[C-MTEB: Chinese - 1,171+ citations]
B --> B2[VN-MTEB: Vietnamese - 3-stage pipeline]
B --> B3[TR-MTEB: Turkish - calibrated LLM judge]
B --> B4[KorFinMTEB: Korean Financial]
C --> C1[PL-MTEB: Polish - BEIR translation]
C --> C2[MTEB-French: 22+ datasets]
C --> C3[DE-MTEB: German clustering]
D --> D1[ArabicMTEB: Dialect-aware - 94 datasets]
D --> D2[AfriMTEB: 59 languages - contrastive distillation]
E --> E1[SEA-BED: 10 SEA languages - 169 datasets]
E --> E2[MMTEB: 1,090 languages - 500+ tasks]
style B1 fill:#ff6b6b,color:#fff
style D1 fill:#ffd93d,color:#333
style E1 fill:#51cf66,color:#fff
Table of Contents¶
- The Regional MTEB Landscape
- C-MTEB (Chinese): Curated Aggregation
- VN-MTEB (Vietnamese): Automated Translation Pipeline
- TR-MTEB (Turkish): Hybrid Approach
- ArabicMTEB: Dialect-Aware Evaluation
- SEA-BED: Human-Centric Regional Benchmark
- AfriMTEB: Cross-Lingual Contrastive Distillation
- European MTEBs (PL, FR, DE)
- Comparative Analysis Matrix
- Best Practices Extraction
- Recommended Methodology for Indonesia-MTEB
- MTEB Integration Strategy
1. The Regional MTEB Landscape¶
1.1 Complete Benchmark Overview (2024-2025)¶
| Benchmark | Language | Scale | Publication | Citations | Methodology |
|---|---|---|---|---|---|
| C-MTEB | Chinese | 35 datasets, 6 tasks | SIGIR 2024 | 1,171+ | Curated aggregation + 100M pairs |
| ArabicMTEB | Arabic | 94 datasets, 8 tasks | NAACL 2025 | 8+ | Dialect-aware + cultural evaluation |
| MTEB-French | French | 30+ datasets, 8 tasks | arXiv 2024 | 17+ | Aggregated French resources |
| VN-MTEB | Vietnamese | 41 datasets, 6 tasks | arXiv 2025 | New | Automated LLM translation pipeline |
| TR-MTEB | Turkish | 26 datasets, 6 tasks | EMNLP 2025 | 2+ | Translation + native corpus |
| PL-MTEB | Polish | 29 datasets, 5 tasks | arXiv 2024 | 4+ | BEIR translation |
| SEA-BED | 10 SEA langs | 169 datasets, 9 tasks | arXiv 2025 | 1+ | Human-centric (71% native) |
| AfriMTEB | 59 African langs | 38 datasets | arXiv 2024 | New | Cross-lingual distillation |
| KorFinMTEB | Korean | 26 datasets, 7 tasks | arXiv 2025 | 4+ | Domain-specific (financial) |
| DE-MTEB | German | Clustering focus | GitHub | - | Clustering specialization |
1.2 Methodology Categories¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ REGIONAL MTEB METHODOLOGY TAXONOMY │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ CATEGORY 1: NATIVE-FIRST APPROACH │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ • C-MTEB (Chinese): Prioritized native datasets, minimal translation │ │
│ │ • SEA-BED (SEA): 71% human-formulated, native-focused │ │
│ │ • ArabicMTEB: Native Arabic + dialectal data │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
│ CATEGORY 2: FULL TRANSLATION PIPELINE │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ • VN-MTEB: Complete MTEB translation with 3-stage QC │ │
│ │ • PL-MTEB: BEIR datasets translated + native aggregation │ │
│ │ • MTEB-French: MTEB subsets + native French resources │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
│ CATEGORY 3: HYBRID APPROACH │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ • TR-MTEB: BEIR translation + native Turkish datasets │ │
│ │ • ArabicMTEB: Native + selective translation + synthetic │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
│ CATEGORY 4: SPECIALIZED APPROACH │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ • KorFinMTEB: Domain-specific (financial) Korean │ │
│ │ • DE-MTEB: Task-specific (clustering) German │ │
│ │ • AfriMTEB: Cross-lingual distillation from 9 resource languages │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
2. C-MTEB (Chinese): Curated Aggregation¶
C-MTEB Impact
"C-Pack: Packaged Resources To Advance General Chinese Embeddings" (SIGIR 2024) - 1,171+ citations (highest among regional MTEBs) - Link: arxiv.org/abs/2309.07597 - HuggingFace: huggingface.co/C-MTEB
2.1 Methodology Overview¶
┌─────────────────────────────────────────────────────────────────┐
│ C-MTEB Methodology │
├─────────────────────────────────────────────────────────────────┤
│ │
│ DATA SOURCES │
│ ├─ Web corpora (public Chinese websites) │
│ ├─ Question-Answer forums (Zhihu, Baidu Knows) │
│ ├─ Encyclopedia content (Baidu Baike, Wikipedia) │
│ └─ News articles from multiple sources │
│ │
│ CONSTRUCTION │
│ ├─ 100M+ sentence pairs (C-MTP corpus) │
│ ├─ Symmetric + asymmetric pair types │
│ └─ Multi-stage filtering │
│ │
│ BENCHMARK COMPOSITION │
│ ├─ 35 datasets │
│ ├─ 6 task types (no reranking, no instruction following) │
│ └─ Native Chinese datasets only (no translation) │
│ │
└─────────────────────────────────────────────────────────────────┘
2.2 C-MTP Training Corpus Details¶
| Component | Size | Description |
|---|---|---|
| Total Pairs | 100M+ | Chinese sentence pairs |
| Symmetric Pairs | 60% | Paraphrase, NLI, STS |
| Asymmetric Pairs | 40% | Query-document, QA |
| Sources | Web, QA, Encyclopedia, News | Diverse domains |
| Filtering | Multi-stage | Quality control |
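As an illustration of the two pair types, a symmetric pair holds two interchangeable texts while an asymmetric pair couples a short query with a longer document. The record layout below is illustrative only (an ID-Pack-style sketch), not C-Pack's actual storage format:

# Illustrative record layout for an ID-Pack style training corpus (not C-Pack's actual schema)
symmetric_pair = {
    "type": "symmetric",   # paraphrase / NLI / STS style: both sides are interchangeable
    "text_a": "Harga minyak dunia naik tajam pekan ini.",
    "text_b": "Pekan ini harga minyak global melonjak.",
}

asymmetric_pair = {
    "type": "asymmetric",  # query-document / QA style: short query, longer passage
    "query": "penyebab kenaikan harga minyak",
    "document": "Kenaikan harga minyak dunia dipicu oleh berkurangnya pasokan dan meningkatnya permintaan.",
}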
2.3 Task Distribution¶
C-MTEB Task Composition (35 datasets):
Classification: ████████████████████ 13 datasets (37%)
Retrieval: ████████████████ 11 datasets (31%)
Clustering: ████ 4 datasets (11%)
Pair Classification: ███ 3 datasets (9%)
STS: ███ 3 datasets (9%)
Reranking: █ 1 dataset (3%)
2.4 Lessons for Indonesia-MTEB¶
| Insight | Application to Indonesia |
|---|---|
| Native-First Approach | Prioritize existing Indonesian datasets (IndoNLU, NusaX, etc.) |
| Large Training Corpus | Create ID-Pack with 50M+ Indonesian sentence pairs |
| Domain Diversity | Include Kaskus (forum), detik.com (news), Wikipedia ID (encyclopedia) |
| BGE Model Family | Consider fine-tuning BGE for Indonesian (BGE-ID) |
2.5 Implementation: Loading C-MTEB¶
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# C-MTEB evaluation example (task names must match the MTEB registry, e.g. "T2Retrieval", "DuRetrieval")
evaluation = MTEB(tasks=["T2Retrieval", "DuRetrieval"])

# Run on a Chinese embedding model
model = SentenceTransformer("BAAI/bge-large-zh-v1.5")
results = evaluation.run(
    model,
    eval_splits=["test"],
    output_folder="results/c-mteb",
)

# Access a specific C-MTEB dataset directly
from datasets import load_dataset

# Load a C-MTEB clustering dataset
c_mteb = load_dataset("C-MTEB/CLSClusteringS2S", "default")
print(c_mteb)
3. VN-MTEB (Vietnamese): Automated Translation Pipeline¶
VN-MTEB Innovation (2025)
"VN-MTEB: Vietnamese Massive Text Embedding Benchmark" (arXiv 2025) - First comprehensive automated translation pipeline - 3-stage quality control with LLM-as-judge - Link: arxiv.org/abs/2507.21500
3.1 Three-Stage Translation Pipeline¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ VN-MTEB TRANSLATION PIPELINE (DETAILED) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ STAGE 1: LANGUAGE DETECTION │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ Method: LLM-based detection (Qwen2.5-3B-Instruct) │ │
│ │ Purpose: Filter source language samples from mixed content │ │
│ │ Why not FastText: Interleaved languages cause detection errors │ │
│ │ │ │
│ │ Accuracy: >99% on clean samples, ~95% on mixed content │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ STAGE 2: TRANSLATION │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ Model: Aya-23-35B (Cohere For AI) │ │
│ │ Selected via: SEA-HELM leaderboard (top performer for Vietnamese) │ │
│ │ Temperature: 0.0 (deterministic for consistency) │ │
│ │ Max tokens: 4096 │ │
│ │ Prompt Engineering: Optimized for EN-VI translation quality │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ STAGE 3: THREE-STEP VALIDATION │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ 3a. LANGUAGE DETECTION │ │
│ │ └─ Verify output is Vietnamese (Qwen2.5-3B) │ │
│ │ │ │
│ │ 3b. SEMANTIC SIMILARITY │ │
│ │ ├─ Model: gte-Qwen2-7B-instruct │ │
│ │ ├─ Metric: Cosine similarity │ │
│ │ ├─ Threshold: 0.8 │ │
│ │ └─ Context length: 32,768 tokens │ │
│ │ │ │
│ │ 3c. LLM-AS-A-JUDGE │ │
│ │ ├─ Judge Model: Llama-SEA-LION-v3-70B-IT │ │
│ │ ├─ Evaluation Criteria (5 dimensions): │ │
│ │ │ ├─ Grammar and Syntax │ │
│ │ │ ├─ Named Entity Recognition (NER) preservation │ │
│ │ │ ├─ Numbers/Links/Special Characters preservation │ │
│ │ │ ├─ Fluency and Naturalness │ │
│ │ │ └─ Meaning Preservation │ │
│ │ ├─ Scoring: 1-5 scale per criterion, weighted average │ │
│ │ ├─ Technique: Chain-of-Thought prompting │ │
│ │ └─ Agreement: 85.2% with human judgments │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
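A minimal sketch of the Stage 3b semantic-similarity filter, assuming a sentence-transformers-compatible embedding model (the helper name is illustrative; any strong multilingual embedder can stand in for gte-Qwen2-7B-instruct):

# Stage 3b sketch: keep a translation only if cosine similarity with the source is >= 0.8
from sentence_transformers import SentenceTransformer, util

SIM_THRESHOLD = 0.8
# VN-MTEB used gte-Qwen2-7B-instruct; loading it this way assumes trust_remote_code support
sim_model = SentenceTransformer("Alibaba-NLP/gte-Qwen2-7B-instruct", trust_remote_code=True)

def passes_semantic_check(source: str, translation: str) -> bool:
    """True if the translation stays semantically close to the source."""
    emb_src, emb_tgt = sim_model.encode([source, translation], convert_to_tensor=True)
    return util.cos_sim(emb_src, emb_tgt).item() >= SIM_THRESHOLD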
3.2 LLM-as-a-Judge Scoring Formula¶
score_LLM_judge = ∑_{i ∈ S} α_i × score_i
Where:
- S = {Grammar, NER, Numbers/Links, Fluency, Meaning}
- α_i = importance weight for criterion i (∑ α_i = 1, so the result is a weighted average on the 1-5 scale)
- score_i ∈ [1, 5] for each criterion
- ξ_threshold = 3.5/5.0 (samples scoring below this are discarded)
Weight Distribution (VN-MTEB):
α_grammar = 0.20
α_ner = 0.15
α_numbers = 0.15
α_fluency = 0.20
α_meaning = 0.30
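In code, the weighted judge score and the 3.5/5.0 cut-off can be applied as below (a small sketch using the weights listed above):

# Weighted LLM-judge score with the VN-MTEB-style weights above and a 3.5/5.0 cut-off
JUDGE_WEIGHTS = {
    "grammar": 0.20,
    "ner": 0.15,
    "numbers_links": 0.15,
    "fluency": 0.20,
    "meaning": 0.30,
}
JUDGE_THRESHOLD = 3.5

def judge_score(criterion_scores: dict) -> float:
    """Weighted average of per-criterion scores, each on a 1-5 scale."""
    return sum(JUDGE_WEIGHTS[c] * s for c, s in criterion_scores.items())

def keep_sample(criterion_scores: dict) -> bool:
    return judge_score(criterion_scores) >= JUDGE_THRESHOLD

# Example: 0.2*4 + 0.15*5 + 0.15*5 + 0.2*4 + 0.3*3 = 4.0 -> kept
scores = {"grammar": 4, "ner": 5, "numbers_links": 5, "fluency": 4, "meaning": 3}
print(judge_score(scores), keep_sample(scores))  # 4.0 True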
3.3 Kept Ratios by Task Type¶
| Task Category | Datasets | Kept Ratio | Interpretation |
|---|---|---|---|
| Clustering | 5 | 71.98% | Highest retention, structural preservation |
| Classification | 13 | 70.11% | Relatively preserved meaning |
| Pair Classification | 3 | 67.2% | Entailment relationships mostly intact |
| Retrieval | 15 | 66.03% | Moderate difficulty, domain-specific terms |
| Reranking | 3 | 65.2% | Nuanced ranking criteria challenging |
| STS | 3 | 53.4% | Lowest—semantic similarity hardest to preserve |
Kept Ratio Visualization:
Clustering: ████████████████████████ 71.98%
Classification: ███████████████████████ 70.11%
Pair Class: ██████████████████████ 67.2%
Retrieval: █████████████████████ 66.03%
Reranking: ████████████████████ 65.2%
STS: ████████████████ 53.4% ⚠️
3.4 Statistical Validation¶
VN-MTEB introduced word length distribution analysis as a novel validation:
# Word length distribution validation
import numpy as np

def compute_word_length_distribution(sentences):
    """Compute the distribution of word lengths (in characters) across sentences."""
    lengths = [len(word) for sent in sentences for word in sent.split()]
    return np.array(lengths)

# English and Vietnamese word lengths (parallel corpora loaded elsewhere)
en_lengths = compute_word_length_distribution(english_sentences)
vi_lengths = compute_word_length_distribution(vietnamese_sentences)

# Correlation between the two length histograms
bins = range(1, 20)
correlation = np.corrcoef(
    np.histogram(en_lengths, bins=bins)[0],
    np.histogram(vi_lengths, bins=bins)[0],
)[0, 1]
print(f"Word length correlation: r = {correlation:.3f}")
# VN-MTEB achieved: r > 0.85
3.5 Compute Requirements¶
| Resource | Specification | Notes |
|---|---|---|
| GPUs | 4 × NVIDIA H100 (700W each) | High-end GPU cluster |
| Output Rate | 3,800 tokens/second | Aya-23-35B inference |
| Total Time | ~28 days (675.54 hours) | Full MTEB translation |
| Token Accounting | Input + output tokens both counted | Effectively doubles the processed token volume |
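A rough wall-clock estimate for translation follows directly from throughput; the helper and corpus size below are illustrative, with VN-MTEB's reported numbers implying roughly 9.2B processed tokens (3,800 tok/s × 675.54 h × 3,600 s/h):

# Back-of-the-envelope translation-time estimate; token counts are illustrative
def estimate_days(total_tokens: float, tokens_per_second: float) -> float:
    """Wall-clock days needed to push total_tokens through the translator."""
    return total_tokens / tokens_per_second / 3600 / 24

print(estimate_days(9.2e9, 3_800))  # ~28 days, matching VN-MTEB's reported duration
print(estimate_days(9.2e9, 4_000))  # ~26.6 days at a slightly higher EN-ID rate;
                                    # a smaller effective corpus would bring this into the 20-25 day range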
3.6 Lessons for Indonesia-MTEB¶
| Lesson | Application |
|---|---|
| Semantic Similarity Threshold | Use 0.8 threshold for filtering |
| Task-Specific Expectations | STS will have lowest kept ratio (~50-60%) |
| Language Detection | Use LLMs (not FastText) for multilingual detection |
| LLM-as-Judge | Chain-of-thought with 5 criteria achieves 85.2% human agreement |
| Resource Estimation | 4 H100s × 20-25 days for EN-ID (faster than EN-VN) |
| Indonesia Advantage | EN-ID may have higher kept ratios (both Latin-script) |
3.7 Implementation: VN-MTEB Pipeline Adaptation¶
# Adapted VN-MTEB pipeline for Indonesian (model identifiers below are illustrative placeholders)
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

class VNStyleTranslationPipeline:
    """Indonesian adaptation of the VN-MTEB translation pipeline."""

    def __init__(self):
        # Stage 1: Language detection
        self.detector = AutoModelForCausalLM.from_pretrained(
            "Qwen/Qwen2.5-3B-Instruct"
        )
        # Stage 2: Translation (swap in a model with strong EN-ID quality,
        # e.g. TranslateGemma or Aya-23; this checkpoint is a placeholder)
        self.translator = AutoModelForCausalLM.from_pretrained(
            "google/gemma-2-27b-it"
        )
        # Stage 3b: Semantic similarity
        self.sim_model = AutoModel.from_pretrained(
            "Alibaba-NLP/gte-Qwen2-7B-instruct"
        )
        # Stage 3c: LLM judge (70B-class judge, e.g. Llama-SEA-LION-v3-70B-IT)
        self.judge = AutoModelForCausalLM.from_pretrained(
            "aisingapore/Llama-SEA-LION-v3-70B-IT"
        )

    def stage1_detect_language(self, texts):
        """Detect whether texts are English (filter out mixed-language samples)."""
        # Implementation...
        pass

    def stage2_translate(self, texts):
        """Translate English texts to Indonesian."""
        # Implementation...
        pass

    def stage3_validate(self, original, translated):
        """Three-step validation."""
        # 3a: Language detection on the output
        # 3b: Semantic similarity (threshold: 0.8)
        # 3c: LLM-as-judge (threshold: 3.5/5.0)
        pass
4. TR-MTEB (Turkish): Hybrid Approach¶
TR-MTEB (EMNLP 2025)
"TR-MTEB: A Comprehensive Benchmark and Embedding Model Suite for Turkish Sentence Representations" (EMNLP 2025 Findings) - 2+ citations - Link: aclanthology.org/2025.findings-emnlp.471
4.1 Benchmark Composition¶
| Task | Datasets | Native | Translated | Source |
|---|---|---|---|---|
| Classification | 8 | 7 | 1 | News, sentiment, irony, offensive |
| Clustering | 2 | 2 | 0 | Academic abstracts, opinions |
| Pair Classification | 3 | 0 | 3 | MNLI-TR, SNLI-TR, XNLI-TR |
| Bitext Mining | 1 | 1 | 0 | WMT16 EN-TR |
| STS | 1 | 0 | 1 | STS-Benchmark-TR |
| Retrieval | 11 | 2 | 9 | SQuAD-TR, TQuAD, MS MARCO-TR |
4.2 LLM-as-a-Judge Calibration¶
TR-MTEB implemented a calibrated LLM-as-a-Judge pipeline:
Calibration Process:
┌─────────────────────────────────────────────────────────────┐
│ 1. Human Annotation │
│ └─ 115 examples manually labeled (PASS/FAIL) │
│ │
│ 2. Prompt Iteration │
│ └─ Refined evaluation prompt to align with humans │
│ │
│ 3. Final Performance │
│ ├─ Agreement: 85.2% │
│ ├─ Precision: 92.9% │
│ ├─ Recall: 84.4% │
│ └─ F1 Score: 88.4% │
└─────────────────────────────────────────────────────────────┘
Confusion Matrix (Human vs LLM):
Actual PASS Actual FAIL
Predicted PASS 98 9
Predicted FAIL 8 0
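The calibration statistics above can be recomputed from paired human/LLM labels in a few lines; the sketch below uses scikit-learn, and the label lists are placeholders for the ~115 annotated calibration examples:

# Sketch: agreement, precision, recall and F1 for LLM-judge calibration
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# human_labels / llm_labels: parallel PASS/FAIL decisions over the calibration set (placeholders here)
human_labels = ["PASS", "PASS", "FAIL", "PASS"]
llm_labels   = ["PASS", "FAIL", "FAIL", "PASS"]

agreement = accuracy_score(human_labels, llm_labels)
precision = precision_score(human_labels, llm_labels, pos_label="PASS")
recall    = recall_score(human_labels, llm_labels, pos_label="PASS")
f1        = f1_score(human_labels, llm_labels, pos_label="PASS")
print(f"agreement={agreement:.1%} precision={precision:.1%} recall={recall:.1%} f1={f1:.1%}")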
4.3 Training Corpus Construction¶
TR-MTEB created 34.2M Turkish sentence pairs:
| Source Type | Examples | Notes |
|---|---|---|
| Question-Answer | Medical QA, Wiki QA, GSM8K-TR | Domain-specific |
| Title-Content | News headlines, Wikipedia | Asymmetric pairs |
| Paraphrase | TaPaCo-TR, multilingual NLI | Symmetric pairs |
| Synthetic | LLM-generated instruction data | Quality filtered |
Filtering Pipeline:
Initial: 62.5M pairs
↓ Similarity filtering (custom model, fine-tuned e5-base)
↓ Threshold: 0.4 cosine similarity
Final: 34.2M high-quality pairs
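A sketch of this similarity-based training-pair filter, assuming a sentence-transformers-compatible scoring model (TR-MTEB fine-tuned its own e5-base; the checkpoint below is a stand-in):

# Sketch: keep only training pairs whose cosine similarity clears the 0.4 threshold
from sentence_transformers import SentenceTransformer, util

filter_model = SentenceTransformer("intfloat/multilingual-e5-base")  # stand-in for TR-MTEB's fine-tuned e5-base

def filter_pairs(pairs, threshold=0.4, batch_size=256):
    """pairs: list of (text_a, text_b); returns the pairs scoring >= threshold."""
    kept = []
    for i in range(0, len(pairs), batch_size):
        batch = pairs[i:i + batch_size]
        emb_a = filter_model.encode([a for a, _ in batch], convert_to_tensor=True)
        emb_b = filter_model.encode([b for _, b in batch], convert_to_tensor=True)
        sims = util.cos_sim(emb_a, emb_b).diagonal()
        kept.extend(p for p, s in zip(batch, sims) if s.item() >= threshold)
    return kept
# Note: e5 models expect "query: "/"passage: " prefixes in production use; omitted here for brevity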
4.4 Lessons for Indonesia-MTEB¶
| Insight | Application |
|---|---|
| Hybrid Approach | Combine native Indonesian + translated datasets |
| Training Corpus | 34.2M pairs sufficient for competitive models |
| Calibration | Always calibrate LLM-judge with 100+ human labels |
| Similarity Threshold | 0.4 effective for training data filtering |
| Domain Coverage | Include medical, legal, news, conversational |
5. ArabicMTEB: Dialect-Aware Evaluation¶
ArabicMTEB Innovation (NAACL 2025)
"Swan and ArabicMTEB: Dialect-Aware, Arabic-Centric, Cross-Lingual, and Cross-Cultural Embedding Models and Benchmarks" (NAACL 2025) - 8+ citations - 94 datasets across multiple evaluation dimensions - Link: arxiv.org/abs/2411.01192
5.1 Multi-Dimensional Benchmark Structure¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ ArabicMTEB MULTI-DIMENSIONAL STRUCTURE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ DIMENSION 1: MAIN ARABICMTEB (94 datasets) │
│ ├─ Retrieval: 35 datasets │
│ ├─ Bitext Mining: 12 datasets │
│ ├─ Cross-Lingual Retrieval: 11 language pairs │
│ ├─ Re-Ranking: 5 datasets │
│ ├─ STS: 5 datasets (2 synthetic via GPT-4) │
│ ├─ Classification: 18 datasets │
│ ├─ Pair Classification: 3 datasets │
│ └─ Clustering: 4 datasets │
│ │
│ DIMENSION 2: DIALECTAL FORK (19 datasets) │
│ ├─ Bitext Mining: 8 dialect datasets │
│ ├─ Retrieval: 5 dialect datasets │
│ ├─ Classification: 5 dialect ID datasets │
│ └─ STS: 1 Egyptian dialect synthetic dataset │
│ │
│ DIMENSION 3: DOMAIN-SPECIFIC FORK (ArabicMTEB Lite) │
│ ├─ 10k queries, 100k documents │
│ ├─ Domains: News, Finance, Legal, Medical, Wikipedia │
│ └─ Generated via GPT-4o-mini from Wikipedia chunks │
│ │
│ DIMENSION 4: CULTURAL FORK (Country-level) │
│ ├─ 20 Arab countries │
│ ├─ 1k queries, ~15k documents per country │
│ ├─ Source: Country-specific Wikipedia portals │
│ └─ Generated via GPT-4o-mini for cultural queries │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
5.2 Novel Evaluation Dimensions¶
| Dimension | Description | Indonesian Parallel |
|---|---|---|
| Dialectal | Gulf, Egyptian, Moroccan, Levantine varieties | Regional Indonesian (Javanese-influenced, Sundanese-influenced) |
| Cross-Lingual | 11 language pairs | EN-ID, ID-JV, ID-SU |
| Domain-Specific | News, Finance, Legal, Medical | Same domains for Indonesia |
| Cultural | Country-specific cultural knowledge | Provincial cultural knowledge |
5.3 Synthetic Data Generation¶
ArabicMTEB uses Command R+ for synthetic data:
Synthetic Data Pipeline:
┌─────────────────────────────────────────────────────────────┐
│ 1. MSA Data │
│ ├─ 100k general domain samples │
│ └─ 20k domain-specific samples │
│ │
│ 2. Dialectal Data │
│ ├─ 15k Egyptian dialect samples │
│ └─ 15k Moroccan dialect samples │
│ │
│ 3. Domain Queries │
│ └─ 5 query styles per document chunk │
│ │
│ 4. Cultural Queries │
│ └─ Country-specific Wikipedia passages │
└─────────────────────────────────────────────────────────────┘
Performance Impact:
Swan-Small: 32.46 → 48.42 (+16 points with MSA synthetic)
Swan-Large: 55.39 → 61.91 (+6.5 points with MSA synthetic)
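A hedged sketch of the per-chunk query generation step; the prompt wording, client usage, and model name are illustrative rather than ArabicMTEB's exact setup:

# Sketch: generate several query styles per document chunk with an LLM
from openai import OpenAI

client = OpenAI()
QUERY_STYLES = ["keyword", "natural question", "long-form question",
                "paraphrased statement", "colloquial question"]

def generate_queries(chunk: str, language: str = "Indonesian") -> list:
    """Ask the LLM for one retrieval query per style that the given chunk answers."""
    prompt = (
        f"Document chunk:\n{chunk}\n\n"
        f"Write {len(QUERY_STYLES)} search queries in {language} that this chunk answers, "
        f"one per line, using these styles: {', '.join(QUERY_STYLES)}."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content.strip().splitlines()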
5.4 Lessons for Indonesia-MTEB¶
| ArabicMTEB Feature | Indonesia-MTEB Adaptation |
|---|---|
| Dialectal Fork | Regional language influence (Javanese, Sundanese, Minangkabau) |
| Domain-Specific Fork | Legal Indonesian (UU docs), Medical, Financial |
| Cultural Evaluation | Provincial cultural knowledge (34 provinces) |
| Synthetic Data | LLM-generated data for missing tasks |
6. SEA-BED: Human-Centric Regional Benchmark¶
SEA-BED (2025)
"SEA-BED: Southeast Asia Embedding Benchmark" (arXiv 2025) - 169 datasets across 9 tasks - 10 SEA languages including Indonesian - 71% human-formulated datasets - Link: arxiv.org/abs/2508.12243
6.1 Key Characteristics¶
| Aspect | SEA-BED Approach | Relevance to Indonesia |
|---|---|---|
| Scale | 169 datasets, 9 tasks, 10 languages | Indonesian included |
| Human-Formulated | 71% vs 29% translation/machine-generated | Quality-first approach |
| Tasks | Classification, Clustering, Pair Classification, Retrieval, Reranking, STS, Summarization, Instruction Following, Bitext Mining | Comprehensive task coverage |
| Indonesian Coverage | Included but not the focus | Can be expanded |
6.2 Data Sources¶
┌─────────────────────────────────────────────────────────────────┐
│ SEA-BED DATA SOURCES │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Human-Formulated (71%) │
│ ├─ Native datasets from each SEA country │
│ │ └─ Indonesian: IndoNLU, NusaX, IndoMMLU, etc. │
│ ├─ Academic benchmarks │
│ └─ Domain-specific corpora │
│ │
│ Translation-Based (29%) │
│ ├─ Carefully translated MTEB subsets │
│ └─ Quality validation integrated │
│ │
│ Validation Strategy │
│ ├─ Native speaker review for key datasets │
│ ├─ Statistical consistency checks │
│ └─ Inter-annotator agreement tracking │
│ │
└─────────────────────────────────────────────────────────────────┘
6.3 Lessons for Indonesia-MTEB¶
| Lesson | Application |
|---|---|
| Human-First Priority | 71% human-formulated validates quality-over-quantity |
| Indonesia Opportunity | SEA-BED Indonesian datasets can be aggregated + expanded |
| Regional Integration | Consider Indonesia-MTEB compatibility with SEA evaluation |
| Task Coverage | Include Instruction Following (SEA-BED has this) |
7. AfriMTEB: Cross-Lingual Contrastive Distillation¶
AfriMTEB (2024)
"AfriMTEB and AfriE5: Benchmarking and Adapting Text Embedding Models for African Languages" (arXiv 2024) - 59 African languages - 38 datasets from MMTEB - Cross-lingual contrastive distillation - Link: arxiv.org/abs/2510.23896
7.1 Methodology¶
┌─────────────────────────────────────────────────────────────────┐
│ AfriMTEB Approach │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Cross-Lingual Distillation │
│ ├─ Train on 9 well-resourced languages │
│ ├─ Transfer to 59 languages via alignment │
│ └─ Use NLI/SNLI multilingual data │
│ │
│ Quality Estimation │
│ ├─ SSA-COMET-MTL quality estimation model │
│ ├─ Threshold 0.75 retains ~60K from 430K samples │
│ └─ Filter low-quality translations │
│ │
│ Benchmark Composition │
│ ├─ 38 datasets from MMTEB │
│ ├─ Focus on African language tasks │
│ └─ Cross-lingual retrieval emphasis │
│ │
└─────────────────────────────────────────────────────────────────┘
7.2 Lessons for Indonesia-MTEB¶
| Insight | Application |
|---|---|
| Cross-Lingual Transfer | EN-ID alignment straightforward (both well-resourced) |
| Quality Estimation | COMET-style filtering effective |
| Resource Efficiency | Indonesian has more resources than African languages—full translation viable |
8. European MTEBs (PL, FR, DE)¶
8.1 PL-MTEB (Polish)¶
PL-MTEB (2024)
"PL-MTEB: Polish Massive Text Embedding Benchmark" (arXiv 2024) - 4+ citations - 29 datasets, 5 task groups - BEIR translation + native aggregation
| Metric | Value |
|---|---|
| Datasets | 29 (28 tasks) |
| Task Groups | Classification, Clustering, Pair Classification, Retrieval, STS |
| Approach | BEIR translation + native Polish datasets |
8.2 MTEB-French¶
MTEB-French (2024)
"MTEB-French: Resources for French Sentence Embedding" (arXiv 2024) - 17+ citations - 30+ datasets, 8 tasks - 22 existing + 3 new datasets created
| Feature | Description |
|---|---|
| Model Evaluation | 51 embedding models compared |
| Statistical Tests | Comprehensive statistical analysis |
| Correlation Study | Model-benchmark correlation analyzed |
8.3 DE-MTEB (German Clustering)¶
German Text Clustering Benchmark
"German Text Embedding Clustering Benchmark" (arXiv 2024) - Specialized in clustering evaluation - Focus on different domains
8.4 European MTEB Lessons¶
| Benchmark | Key Lesson | Indonesia Application |
|---|---|---|
| PL-MTEB | BEIR translation effective | Consider BEIR-ID translation |
| MTEB-French | Statistical analysis crucial | Include statistical validation |
| DE-MTEB | Task specialization valuable | Consider specialized forks |
9. Comparative Analysis Matrix¶
9.1 Methodology Comparison¶
| Benchmark | Translation Approach | Validation | Native Data | Training Corpus | Scale | Citations |
|---|---|---|---|---|---|---|
| C-MTEB | Minimal (native-first) | Peer-reviewed | Yes | 100M+ pairs | 35 | 1,171+ |
| VN-MTEB | Full MTEB translation | 3-stage LLM judge | No | N/A | 41 | New |
| TR-MTEB | BEIR + native | Calibrated LLM judge | Yes | 34.2M pairs | 26 | 2+ |
| ArabicMTEB | Selective + synthetic | Multi-dimensional | Yes | 122K + 135K | 94 | 8+ |
| SEA-BED | 29% translated | Human review | 71% native | N/A | 169 | 1+ |
| AfriMTEB | Cross-lingual | COMET quality | Limited | Cross-lingual | 38 | New |
| MTEB-French | Aggregated | Statistical | Yes | N/A | 30+ | 17+ |
| PL-MTEB | BEIR translation | Standard | Yes | N/A | 29 | 4+ |
9.2 Kept Ratio Comparison (Translation Quality)¶
Translation Kept Ratios by Task:
ArabicMTEB: ████████████████████████ ~75% average
TR-MTEB: ███████████████████████ ~70% average
VN-MTEB: ██████████████████████ ~65% average
Estimated ID: ████████████████████████ ~70-75% average
By Task (VN-MTEB data):
Classification: ████████████████████████ 70.11%
Clustering: █████████████████████████ 71.98%
Pair Class: ██████████████████████ 67.2%
Retrieval: █████████████████████ 66.03%
Reranking: ████████████████████ 65.2%
STS: ████████████████ 53.4% ⚠️
9.3 Resource Comparison¶
| Benchmark | GPUs | Time | Total Compute | Tokens/Sec |
|---|---|---|---|---|
| VN-MTEB | 4×H100 | 28 days | ~2,700 GPU-hours | 3,800 |
| TR-MTEB | 1×A100 | 82 hours | ~82 GPU-hours | N/A |
| Estimated ID | 4×H100 | 20-25 days | ~2,000-2,400 GPU-hours | 4,000+ |
9.4 Task Coverage Comparison¶
MTEB Category Coverage by Benchmark:
Classification: ████████████████████████████████ All 8
Pair Classification: ████████████████████████████████ All 8
Retrieval: ████████████████████████████████ All 8
Clustering: ████████████████████████ 6/8
Reranking: ████████████████████ 5/8
STS: ████████████████████ 5/8
Summarization: ████ 2/8
Instruction Following: ████ 2/8
Bitext Mining: ████████ 3/8
10. Best Practices Extraction¶
10.1 Translation Pipeline Best Practices¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ RECOMMENDED TRANSLATION PIPELINE (Enhanced) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. MODEL SELECTION │
│ ├─ Use regional leaderboard for selection (e.g., SEA-HELM for SEA) │
│ ├─ Prefer models with strong target language performance │
│ ├─ For Indonesian: TranslateGemma (27B), Aya-23 (35B/8B) │
│ └─ Consider cost/quality tradeoff │
│ │
│ 2. QUALITY CONTROL (Multi-Stage) │
│ ├─ Stage 1: Language detection (LLM, not FastText) │
│ │ └─ Model: Qwen2.5-3B-Instruct or similar │
│ ├─ Stage 2: Semantic similarity (threshold 0.75-0.80) │
│ │ └─ Model: gte-Qwen2-7B-instruct or similar │
│ ├─ Stage 3: LLM-as-judge (CoT prompting, 5 criteria) │
│ │ └─ Model: 70B+ parameter model with CoT capability │
│ │ └─ Criteria: Grammar, NER, Numbers, Fluency, Meaning │
│ └─ Stage 4: Human validation on 10% sample │
│ │
│ 3. TASK-SPECIFIC EXPECTATIONS │
│ ├─ Classification: ~70-75% kept ratio │
│ ├─ Clustering: ~70-75% kept ratio │
│ ├─ Pair Classification: ~65-70% kept ratio │
│ ├─ Retrieval: ~65-70% kept ratio │
│ ├─ Reranking: ~65-70% kept ratio │
│ └─ STS: ~50-60% kept ratio (plan for low retention) │
│ │
│ 4. STATISTICAL VALIDATION │
│ ├─ Word length distribution analysis (target: r > 0.85) │
│ ├─ Kept ratio tracking by task and domain │
│ ├─ Semantic similarity distribution analysis │
│ └─ Inter-annotator agreement (target: >0.8) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
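Composing the stages above into a per-sample decision might look like the sketch below; detect_english, translate_to_id, passes_semantic_check, score_with_llm_judge, and judge_score are assumed helpers (e.g. the sketches from Sections 3.1-3.2), not a finished implementation:

# End-to-end per-sample decision composing the pipeline stages (helpers are assumed)
def process_sample(text_en: str):
    """Return the accepted Indonesian translation, or None if any stage rejects it."""
    if not detect_english(text_en):                     # Stage 1: language detection
        return None
    text_id = translate_to_id(text_en)                  # Stage 2: translation
    if not passes_semantic_check(text_en, text_id):     # Stage 3b: similarity >= 0.75-0.80
        return None
    criterion_scores = score_with_llm_judge(text_en, text_id)
    if judge_score(criterion_scores) < 3.5:             # Stage 3c: weighted judge score
        return None
    return text_id  # Stage 4: a 10% sample of accepted outputs goes to human review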
10.2 Validation Best Practices Summary¶
| Practice | Description | Source |
|---|---|---|
| Calibrate LLM Judge | 100+ human-labeled examples for prompt tuning | TR-MTEB |
| Multi-Criteria Scoring | Grammar, NER, fluency, meaning preservation | VN-MTEB |
| Chain-of-Thought | CoT prompting for improved LLM judgment | VN-MTEB |
| Semantic Similarity Threshold | 0.75-0.80 for filtering | VN-MTEB |
| Statistical Analysis | Word length distribution correlation | VN-MTEB |
| Domain-Specific Evaluation | Separate forks for different domains | ArabicMTEB |
| Cultural Awareness | Cultural knowledge evaluation | ArabicMTEB |
| Human Validation | 10% sample human review | All benchmarks |
10.3 Dataset Construction Best Practices¶
| Practice | Description | Source |
|---|---|---|
| Hybrid Approach | Native datasets + high-quality translations | TR-MTEB, ArabicMTEB |
| Domain Diversity | News, finance, legal, medical, conversational | All |
| Pair Type Balance | Symmetric (paraphrase) + asymmetric (query-doc) | C-MTEB |
| Deduplication | Semantic deduplication (PolyDeDupe-style) | C-MTEB |
| Quality Filtering | Similarity threshold for training corpora | TR-MTEB |
| Synthetic Data | LLM-generated domain-specific data | ArabicMTEB |
| License Compliance | Track and respect dataset licenses | All |
10.4 Novel Innovations by Benchmark¶
| Benchmark | Innovation | Indonesia-MTEB Potential |
|---|---|---|
| VN-MTEB | 3-stage LLM-based translation QC | Adapt for EN-ID translation |
| ArabicMTEB | Dialectal evaluation | Regional Indonesian varieties |
| ArabicMTEB | Cultural fork | Provincial cultural knowledge |
| TR-MTEB | Calibrated LLM judge | Apply to Indonesian validation |
| C-MTEB | Large training corpus | Create ID-Pack (50M+ pairs) |
| SEA-BED | Human-first ratio (71%) | Prioritize native Indonesian data |
| KorFinMTEB | Domain-specific (financial) | Create Indonesian financial fork |
| AfriMTEB | Cross-lingual distillation | EN-ID cross-lingual alignment |
11. Recommended Methodology for Indonesia-MTEB¶
11.1 Three-Phase Approach¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ INDONESIA-MTEB METHODOLOGY (ENHANCED) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ PHASE 1: AGGREGATION (Document 03 ✅ Complete) │
│ ├─ 70+ existing Indonesian datasets catalogued │
│ ├─ Coverage: Classification ✓, Pair Classification ✓, Retrieval ✓ │
│ ├─ Gaps: Clustering ✗, Reranking ✗, STS limited │
│ └─ Native dataset inventory: Complete │
│ │
│ PHASE 2: TRANSLATION (Document 05) │
│ ├─ Model: TranslateGemma-2-27B or Aya-23-35B │
│ ├─ Pipeline: 3-stage (Detection → Translation → QC) │
│ │ └─ Stage 1: Qwen2.5-3B-Instruct (language detection) │
│ │ └─ Stage 2: TranslateGemma/Aya-23 (translation) │
│ │ └─ Stage 3: LLM-as-judge + semantic similarity │
│ ├─ Target Datasets: Clustering, Reranking, STS gaps │
│ ├─ Expected Kept Ratio: 70-75% (higher than VN-MTEB) │
│ └─ Estimated Time: 4×H100 × 20-25 days │
│ │
│ PHASE 3: AI GENERATION (Document 06) │
│ ├─ Target: Domain-specific tasks, cultural queries │
│ ├─ Method: LLM-as-generator + LLM-as-judge │
│ ├─ Validation: Statistical consistency + human spot-check │
│ └─ Cultural Fork: Wikipedia Indonesia for 34 provinces │
│ │
│ ADDITIONAL DIMENSIONS (Novelty) │
│ ├─ Archipelago-Aware: Regional language influence evaluation │
│ ├─ Formal-Register: Informal (slang) → Formal → Academic continuum │
│ ├─ Code-Mixing: Indonesian-English code-mixing evaluation │
│ └─ Domain-Specific: Legal (UU), Medical, Financial forks │
│ │
│ INTEGRATION │
│ ├─ MTEB-compatible format │
│ ├─ Metadata documentation │
│ └─ HuggingFace upload with proper licensing │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
11.2 Resource Estimation¶
| Task | GPUs | Time | GPU-Hours | Rationale |
|---|---|---|---|---|
| Full MTEB Translation | 4×H100 | 20-25 days | 2,000-2,400 | EN-ID closer than EN-VN |
| AI Dataset Generation | 2×H100 | 5-7 days | 240-336 | Clustering + Reranking |
| Validation | 1×H100 | 3-5 days | 72-120 | LLM-as-judge evaluation |
| Total | - | - | ~2,500-3,000 | Conservative estimate |
11.3 Translation Model Selection for Indonesian¶
| Model | Parameters | ID Performance | Cost Efficiency | Recommendation |
|---|---|---|---|---|
| TranslateGemma-2-27B | 27B | Excellent (55 langs) | Medium | Primary |
| Aya-23-35B | 35B | Excellent (SEA focus) | Low | Alternative |
| Aya-23-8B | 8B | Very good | High | Cost-efficient |
| NLLB-200 | 3.3B | Good | Very High | Smaller option |
| SEA-LION-v3 | - | N/A | N/A | Judge model only |
11.4 Quality Validation Thresholds¶
| Metric | Threshold | Justification |
|---|---|---|
| Semantic Similarity | ≥0.80 | VN-MTEB used 0.8 |
| LLM Judge Score | ≥3.5/5.0 | Calibrated threshold |
| Kept Ratio Target | 65-75% | By task type |
| Word Length Correlation | r ≥ 0.85 | Statistical quality check |
| Human Validation | 10% sample | Final quality check |
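Keeping these thresholds in one configuration object lets the translation, judging, and validation stages read from a single source of truth; a small sketch (names are illustrative):

# Sketch: central configuration for the quality thresholds listed above
from dataclasses import dataclass

@dataclass(frozen=True)
class QualityThresholds:
    semantic_similarity: float = 0.80        # cosine similarity between source and translation
    llm_judge_score: float = 3.5             # weighted judge score out of 5
    kept_ratio_target: tuple = (0.65, 0.75)  # acceptable kept-ratio band, varies by task
    word_length_correlation: float = 0.85    # Pearson r between length histograms
    human_validation_fraction: float = 0.10  # share of accepted samples sent to human review

THRESHOLDS = QualityThresholds()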
11.5 Novel Dimensions for Indonesia-MTEB¶
Based on regional MTEB gaps:
┌─────────────────────────────────────────────────────────────────┐
│ NOVEL DIMENSIONS FOR INDONESIA-MTEB │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. ARCHIPELAGO-AWARE EVALUATION │
│ ├─ Javanese-influenced Indonesian │
│ ├─ Sundanese-influenced Indonesian │
│ ├─ Minangkabau-influenced Indonesian │
│ └─ Other regional varieties │
│ │
│ 2. FORMAL-REGISTER CONTINUUM │
│ ├─ Informal/Slang (social media, Kaskus) │
│ ├─ Semi-formal (news articles) │
│ ├─ Formal (academic papers, legal documents) │
│ └─ Administrative (government regulations) │
│ │
│ 3. CODE-MIXING EVALUATION │
│ ├─ Indonesian-English code-mixing │
│ ├─ Prevalent in urban social media │
│ └─ Real-world use case evaluation │
│ │
│ 4. CULTURAL KNOWLEDGE (34 Provinces) │
│ ├─ Province-specific cultural queries │
│ ├─ Source: Wikipedia Indonesia + provincial portals │
│ └─ Generated via LLM with human validation │
│ │
│ 5. DOMAIN-SPECIFIC FORKS │
│ ├─ Legal Indonesian (UU documents, court decisions) │
│ ├─ Medical Indonesian │
│ ├─ Financial Indonesian │
│ └─ Religious Indonesian (Islamic contexts) │
│ │
└─────────────────────────────────────────────────────────────────┘
12. MTEB Integration Strategy¶
12.1 Adding a Benchmark to MTEB Official¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ MTEB OFFICIAL INTEGRATION PROCESS (Updated) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. DATASET FORMAT REQUIREMENTS │
│ ├─ Implement mteb.AbsTask subclass │
│ ├─ Load data with .load_data() method │
│ ├─ Define metadata (name, description, license, eval_langs) │
│ ├─ Implement evaluation for your task type │
│ └─ Follow MTEB dataset card format │
│ │
│ 2. SUBMISSION CHECKLIST │
│ ├─ Fork: github.com/embeddings-benchmark/mteb │
│ ├─ Add: mteb/datasets/your_dataset/ │
│ ├─ Register: mteb/datasets/__init__.py │
│ ├─ Test: CI/CD must pass │
│ ├─ PR: Create with detailed description │
│ └─ Address: Reviewer feedback │
│ │
│ 3. HUGGINGFACE UPLOAD │
│ ├─ Upload to: huggingface.co/datasets/ │
│ ├─ Use MTEB dataset card format │
│ ├─ Include: License, size, task metadata │
│ └─ Link: Original sources │
│ │
│ 4. LEADERBOARD SUBMISSION │
│ ├─ Run: Evaluation on baseline models │
│ ├─ Submit: mteb/results dataset │
│ ├─ Create: Benchmark discussion on leaderboard │
│ └─ Request: Leaderboard integration │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
12.2 Implementation Example¶
# Indonesia-MTEB dataset implementation example
from mteb import AbsTask, TaskMetadata
class IndonesianSentiment(AbsTaskClassification):
"""Indonesian sentiment analysis task for MTEB."""
metadata = TaskMetadata(
name="IndonesianSentiment",
description="Indonesian sentiment analysis from social media",
dataset={
"path": "indonlp/indonlu",
"name": "smsa",
"revision": "main"
},
type="Classification",
category="s2s",
eval_splits=["test"],
eval_langs=["ind"], # Indonesian language code
main_score="accuracy",
date=None,
form=None,
domains=["Social", "Written"],
task_subtypes=["Sentiment"],
license="CC-BY-SA-4.0",
annotations_creators="human-verified",
dialect=[],
sample_creation="found",
bibtex_citation="""
@article{wilie2020indonlu,
title={IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding},
author={Wilie, Bryan and Vincentio, Bryan and et al.},
journal={arXiv preprint arXiv:2009.05387},
year={2020}
}
"""
)
def load_data(self, **kwargs):
"""Load Indonesian sentiment data."""
from datasets import load_dataset
return load_dataset("indonlp/indonlu", "smsa")
12.3 MTEB Integration Links¶
| Resource | URL |
|---|---|
| GitHub Repository | github.com/embeddings-benchmark/mteb |
| Leaderboard | huggingface.co/spaces/mteb/leaderboard |
| Results Dataset | huggingface.co/datasets/mteb/results |
| Adding Datasets | github.com/embeddings-benchmark/mteb/blob/main/docs/adding_a_dataset.md |
| Adding Models | github.com/embeddings-benchmark/mteb/blob/main/docs/adding_a_model.md |
13. Key Takeaways for Indonesia-MTEB¶
13.1 Methodology Recommendations¶
| Priority | Recommendation | Rationale |
|---|---|---|
| 1 | Adopt VN-MTEB's 3-stage translation pipeline | Proven automated QC |
| 2 | Use TranslateGemma or Aya-23 for translation | Strong ID support |
| 3 | Calibrate LLM judge with 100+ human samples | TR-MTEB: 88.4% F1 |
| 4 | Create ID-specific training corpus (ID-Pack) | C-MTEB approach |
| 5 | Add domain-specific + cultural forks | ArabicMTEB innovation |
| 6 | Target 70-75% kept ratio | Higher than VN-MTEB |
13.2 Novelty Opportunities¶
Based on regional MTEB analysis:
- Archipelago-Aware Evaluation: Regional language influence on Indonesian
- Formal-Register Continuum: Informal → Formal → Academic Indonesian
- Code-Mixing Evaluation: Indonesian-English code-mixing (social media)
- Cultural Knowledge: 34 provincial cultural queries
- Domain-Specific Forks: Legal, Medical, Financial Indonesian
13.3 Success Criteria Alignment¶
| Criterion | Target | Benchmark Reference |
|---|---|---|
| Task Coverage | All 8 MTEB categories | VN-MTEB: 6, ArabicMTEB: 8 |
| Dataset Count | 60-100 datasets | ArabicMTEB: 94, SEA-BED: 169 |
| Quality | ≥70% kept ratio, 10% human validation | VN-MTEB: 65% avg |
| Publication | ACL/EMNLP/NAACL dataset paper | C-MTEB: SIGIR, TR-MTEB: EMNLP |
| Adoption | MTEB leaderboard integration | All regional MTEBs |
14. References¶
Regional MTEB Papers (2024-2025)¶
- C-MTEB: Xiao et al. (2024). "C-Pack: Packaged Resources To Advance General Chinese Embeddings." SIGIR 2024. arxiv.org/abs/2309.07597 - 1,171+ citations
- ArabicMTEB: Bhatia et al. (2025). "Swan and ArabicMTEB: Dialect-Aware, Arabic-Centric, Cross-Lingual, and Cross-Cultural Embedding Models and Benchmarks." NAACL 2025. arxiv.org/abs/2411.01192 - 8+ citations
- MTEB-French: Ciancone et al. (2024). "MTEB-French: Resources for French Sentence Embedding." arXiv:2405.20468. arxiv.org/abs/2405.20468 - 17+ citations
- VN-MTEB: Pham et al. (2025). "VN-MTEB: Vietnamese Massive Text Embedding Benchmark." arXiv:2507.21500. arxiv.org/abs/2507.21500
- TR-MTEB: Baysan & Güngör (2025). "TR-MTEB: A Comprehensive Benchmark and Embedding Model Suite for Turkish Sentence Representations." EMNLP 2025 Findings. aclanthology.org/2025.findings-emnlp.471 - 2+ citations
- SEA-BED: Ponwitayarat et al. (2025). "SEA-BED: Southeast Asia Embedding Benchmark." arXiv:2508.12243. arxiv.org/abs/2508.12243 - 1+ citation
- AfriMTEB: Uemura et al. (2024). "AfriMTEB and AfriE5: Benchmarking and Adapting Text Embedding Models for African Languages." arXiv:2510.23896. arxiv.org/abs/2510.23896
- PL-MTEB: Poświata et al. (2024). "PL-MTEB: Polish Massive Text Embedding Benchmark." arXiv:2405.10138. arxiv.org/abs/2405.10138 - 4+ citations
- KorFinMTEB: Hwang et al. (2025). "What Advantages Can Low-Resource Domain-Specific Instruction Tuning Bring to Large Language Models? A Case Study on Korean Financial Texts." arXiv:2502.07131. arxiv.org/abs/2502.07131 - 4+ citations
Original MTEB¶
- Muennighoff et al. (2023). "MTEB: Massive Text Embedding Benchmark." EACL 2023. arxiv.org/abs/2210.07316 - 1,488+ citations
Translation Models¶
- Google (2024). "TranslateGemma: A new suite of open translation models." blog.google/technology/ai/translategemma/
- Cohere For AI (2024). "Aya 23: Open weight releases to further multilingual progress." arXiv:2405.15032. arxiv.org/abs/2405.15032
MTEB Resources¶
- MTEB GitHub: github.com/embeddings-benchmark/mteb
- MTEB Leaderboard: huggingface.co/spaces/mteb/leaderboard
- MTEB Datasets: huggingface.co/mteb
15. Document Roadmap¶
| Document | Content | Status |
|---|---|---|
| 01 | Project Overview | ✅ Enhanced |
| 02 | MTEB Structure Analysis | ✅ Enhanced |
| 03 | Existing Indonesian Datasets | ✅ Enhanced |
| 04 | Regional MTEB Methodologies | ✅ Enhanced |
| 05 | Translation Models Benchmark | 🔲 Next |
| 06 | AI Dataset Generation Methods | Pending |
| 07 | Validation Strategies | Pending |
| 08 | ACL Dataset Paper Standards | Pending |
| 09 | Novelty Angle & Publication | Pending |
| 10 | Implementation Roadmap | Pending |
"The most successful regional MTEBs combine three elements: rigorous quality control, linguistic/cultural awareness, and comprehensive task coverage. Indonesia-MTEB will synthesize these approaches while introducing archipelago-aware and formal-register evaluation dimensions unique to the Indonesian context."