Project: Indonesia-MTEB Benchmark
Document: 04 - Regional MTEB Methodologies Analysis (ENHANCED)
Last Updated: 2026-01-25
Version: 2.0 - Enhanced with Latest Research (2024-2025)
Regional MTEB Methodologies: A Comprehensive Comparative Analysis¶
"Understanding how successful regional MTEBs were constructed provides the blueprint for Indonesia-MTEB. This document analyzes 10+ regional benchmarks with latest research findings from 2024-2025."
Executive Summary¶
Key Findings
- 10+ regional MTEBs analyzed with latest 2024-2025 research
- C-MTEB leads with 1,171+ citations (SIGIR 2024)
- VN-MTEB's 3-stage pipeline sets standard for translation quality
- ArabicMTEB introduces dialect-aware and cross-cultural evaluation
- Indonesia-MTEB can leverage multilingual approaches + SEA-BED integration
graph TD
A[Regional MTEB Landscape 2024-2025] --> B[Asian Languages]
A --> C[European Languages]
A --> D[Middle Eastern/African]
A --> E[Regional/SEA]
B --> B1[C-MTEB: Chinese - 1,171+ citations]
B --> B2[VN-MTEB: Vietnamese - 3-stage pipeline]
B --> B3[TR-MTEB: Turkish - calibrated LLM judge]
B --> B4[KorFinMTEB: Korean Financial]
C --> C1[PL-MTEB: Polish - BEIR translation]
C --> C2[MTEB-French: 22+ datasets]
C --> C3[DE-MTEB: German clustering]
D --> D1[ArabicMTEB: Dialect-aware - 94 datasets]
D --> D2[AfriMTEB: 59 languages - contrastive distillation]
E --> E1[SEA-BED: 10 SEA languages - 169 datasets]
E --> E2[MMTEB: 1,090 languages - 500+ tasks]
style B1 fill:#ff6b6b,color:#fff
style D1 fill:#ffd93d,color:#333
style E1 fill:#51cf66,color:#fff
Table of Contents¶
- The Regional MTEB Landscape
- C-MTEB (Chinese): Curated Aggregation
- VN-MTEB (Vietnamese): Automated Translation Pipeline
- TR-MTEB (Turkish): Hybrid Approach
- ArabicMTEB: Dialect-Aware Evaluation
- SEA-BED: Human-Centric Regional Benchmark
- AfriMTEB: Cross-Lingual Contrastive Distillation
- European MTEBs (PL, FR, DE)
- Comparative Analysis Matrix
- Best Practices Extraction
- Recommended Methodology for Indonesia-MTEB
- MTEB Integration Strategy
1. The Regional MTEB Landscape¶
1.1 Complete Benchmark Overview (2024-2025)¶
| Benchmark | Language | Scale | Publication | Citations | Methodology |
|---|---|---|---|---|---|
| C-MTEB | Chinese | 35 datasets, 6 tasks | SIGIR 2024 | 1,171+ | Curated aggregation + 100M pairs |
| ArabicMTEB | Arabic | 94 datasets, 8 tasks | NAACL 2025 | 8+ | Dialect-aware + cultural evaluation |
| MTEB-French | French | 30+ datasets, 8 tasks | arXiv 2024 | 17+ | Aggregated French resources |
| VN-MTEB | Vietnamese | 41 datasets, 6 tasks | arXiv 2025 | New | Automated LLM translation pipeline |
| TR-MTEB | Turkish | 26 datasets, 6 tasks | EMNLP 2025 | 2+ | Translation + native corpus |
| PL-MTEB | Polish | 29 datasets, 5 tasks | arXiv 2024 | 4+ | BEIR translation |
| SEA-BED | 10 SEA langs | 169 datasets, 9 tasks | arXiv 2025 | 1+ | Human-centric (71% native) |
| AfriMTEB | 59 African langs | 38 datasets | arXiv 2024 | New | Cross-lingual distillation |
| KorFinMTEB | Korean | 26 datasets, 7 tasks | arXiv 2025 | 4+ | Domain-specific (financial) |
| DE-MTEB | German | Clustering focus | GitHub | - | Clustering specialization |
1.2 Methodology Categories¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ REGIONAL MTEB METHODOLOGY TAXONOMY │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ CATEGORY 1: NATIVE-FIRST APPROACH │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ • C-MTEB (Chinese): Prioritized native datasets, minimal translation │ │
│ │ • SEA-BED (SEA): 71% human-formulated, native-focused │ │
│ │ • ArabicMTEB: Native Arabic + dialectal data │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
│ CATEGORY 2: FULL TRANSLATION PIPELINE │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ • VN-MTEB: Complete MTEB translation with 3-stage QC │ │
│ │ • PL-MTEB: BEIR datasets translated + native aggregation │ │
│ │ • MTEB-French: MTEB subsets + native French resources │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
│ CATEGORY 3: HYBRID APPROACH │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ • TR-MTEB: BEIR translation + native Turkish datasets │ │
│ │ • ArabicMTEB: Native + selective translation + synthetic │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
│ CATEGORY 4: SPECIALIZED APPROACH │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ • KorFinMTEB: Domain-specific (financial) Korean │ │
│ │ • DE-MTEB: Task-specific (clustering) German │ │
│ │ • AfriMTEB: Cross-lingual distillation from 9 resource languages │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
2. C-MTEB (Chinese): Curated Aggregation¶
C-MTEB Impact
"C-Pack: Packaged Resources To Advance General Chinese Embeddings" (SIGIR 2024) - 1,171+ citations (highest among regional MTEBs) - Link: arxiv.org/abs/2309.07597 - HuggingFace: huggingface.co/C-MTEB
2.1 Methodology Overview¶
┌─────────────────────────────────────────────────────────────────┐
│ C-MTEB Methodology │
├─────────────────────────────────────────────────────────────────┤
│ │
│ DATA SOURCES │
│ ├─ Web corpora (public Chinese websites) │
│ ├─ Question-Answer forums (Zhihu, Baidu Knows) │
│ ├─ Encyclopedia content (Baidu Baike, Wikipedia) │
│ └─ News articles from multiple sources │
│ │
│ CONSTRUCTION │
│ ├─ 100M+ sentence pairs (C-MTP corpus) │
│ ├─ Symmetric + asymmetric pair types │
│ └─ Multi-stage filtering │
│ │
│ BENCHMARK COMPOSITION │
│ ├─ 35 datasets │
│ ├─ 6 task types (no reranking, no instruction following) │
│ └─ Native Chinese datasets only (no translation) │
│ │
└─────────────────────────────────────────────────────────────────┘
2.2 C-MTP Training Corpus Details¶
| Component | Size | Description |
|---|---|---|
| Total Pairs | 100M+ | Chinese sentence pairs |
| Symmetric Pairs | 60% | Paraphrase, NLI, STS |
| Asymmetric Pairs | 40% | Query-document, QA |
| Sources | Web, QA, Encyclopedia, News | Diverse domains |
| Filtering | Multi-stage | Quality control |
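As an illustration of the two pair types, a symmetric pair holds two interchangeable texts while an asymmetric pair couples a short query with a longer document. The record layout below is illustrative only (an ID-Pack-style sketch), not C-Pack's actual storage format:

# Illustrative record layout for an ID-Pack style training corpus (not C-Pack's actual schema)
symmetric_pair = {
    "type": "symmetric",   # paraphrase / NLI / STS style: both sides are interchangeable
    "text_a": "Harga minyak dunia naik tajam pekan ini.",
    "text_b": "Pekan ini harga minyak global melonjak.",
}

asymmetric_pair = {
    "type": "asymmetric",  # query-document / QA style: short query, longer passage
    "query": "penyebab kenaikan harga minyak",
    "document": "Kenaikan harga minyak dunia dipicu oleh berkurangnya pasokan dan meningkatnya permintaan.",
}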
2.3 Task Distribution¶
C-MTEB Task Composition (35 datasets):
Classification: ████████████████████ 13 datasets (37%)
Retrieval: ████████████████ 11 datasets (31%)
Clustering: ████ 4 datasets (11%)
Pair Classification: ███ 3 datasets (9%)
STS: ███ 3 datasets (9%)
Reranking: █ 1 dataset (3%)
2.4 Lessons for Indonesia-MTEB¶
| Insight | Application to Indonesia |
|---|---|
| Native-First Approach | Prioritize existing Indonesian datasets (IndoNLU, NusaX, etc.) |
| Large Training Corpus | Create ID-Pack with 50M+ Indonesian sentence pairs |
| Domain Diversity | Include Kaskus (forum), detik.com (news), Wikipedia ID (encyclopedia) |
| BGE Model Family | Consider fine-tuning BGE for Indonesian (BGE-ID) |
2.5 Implementation: Loading C-MTEB¶
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# C-MTEB evaluation example (task names must match the MTEB registry, e.g. "T2Retrieval", "DuRetrieval")
evaluation = MTEB(tasks=["T2Retrieval", "DuRetrieval"])

# Run on a Chinese embedding model
model = SentenceTransformer("BAAI/bge-large-zh-v1.5")
results = evaluation.run(
    model,
    eval_splits=["test"],
    output_folder="results/c-mteb",
)

# Access a specific C-MTEB dataset directly
from datasets import load_dataset

# Load a C-MTEB clustering dataset
c_mteb = load_dataset("C-MTEB/CLSClusteringS2S", "default")
print(c_mteb)
3. VN-MTEB (Vietnamese): Automated Translation Pipeline¶
VN-MTEB Innovation (2025)
"VN-MTEB: Vietnamese Massive Text Embedding Benchmark" (arXiv 2025) - First comprehensive automated translation pipeline - 3-stage quality control with LLM-as-judge - Link: arxiv.org/abs/2507.21500
3.1 Three-Stage Translation Pipeline¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ VN-MTEB TRANSLATION PIPELINE (DETAILED) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ STAGE 1: LANGUAGE DETECTION │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ Method: LLM-based detection (Qwen2.5-3B-Instruct) │ │
│ │ Purpose: Filter source language samples from mixed content │ │
│ │ Why not FastText: Interleaved languages cause detection errors │ │
│ │ │ │
│ │ Accuracy: >99% on clean samples, ~95% on mixed content │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ STAGE 2: TRANSLATION │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ Model: Aya-23-35B (Cohere For AI) │ │
│ │ Selected via: SEA-HELM leaderboard (top performer for Vietnamese) │ │
│ │ Temperature: 0.0 (deterministic for consistency) │ │
│ │ Max tokens: 4096 │ │
│ │ Prompt Engineering: Optimized for EN-VI translation quality │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ STAGE 3: THREE-STEP VALIDATION │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ 3a. LANGUAGE DETECTION │ │
│ │ └─ Verify output is Vietnamese (Qwen2.5-3B) │ │
│ │ │ │
│ │ 3b. SEMANTIC SIMILARITY │ │
│ │ ├─ Model: gte-Qwen2-7B-instruct │ │
│ │ ├─ Metric: Cosine similarity │ │
│ │ ├─ Threshold: 0.8 │ │
│ │ └─ Context length: 32,768 tokens │ │
│ │ │ │
│ │ 3c. LLM-AS-A-JUDGE │ │
│ │ ├─ Judge Model: Llama-SEA-LION-v3-70B-IT │ │
│ │ ├─ Evaluation Criteria (5 dimensions): │ │
│ │ │ ├─ Grammar and Syntax │ │
│ │ │ ├─ Named Entity Recognition (NER) preservation │ │
│ │ │ ├─ Numbers/Links/Special Characters preservation │ │
│ │ │ ├─ Fluency and Naturalness │ │
│ │ │ └─ Meaning Preservation │ │
│ │ ├─ Scoring: 1-5 scale per criterion, weighted average │ │
│ │ ├─ Technique: Chain-of-Thought prompting │ │
│ │ └─ Agreement: 85.2% with human judgments │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
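A minimal sketch of the Stage 3b semantic-similarity filter, assuming a sentence-transformers-compatible embedding model (the helper name is illustrative; any strong multilingual embedder can stand in for gte-Qwen2-7B-instruct):

# Stage 3b sketch: keep a translation only if cosine similarity with the source is >= 0.8
from sentence_transformers import SentenceTransformer, util

SIM_THRESHOLD = 0.8
# VN-MTEB used gte-Qwen2-7B-instruct; loading it this way assumes trust_remote_code support
sim_model = SentenceTransformer("Alibaba-NLP/gte-Qwen2-7B-instruct", trust_remote_code=True)

def passes_semantic_check(source: str, translation: str) -> bool:
    """True if the translation stays semantically close to the source."""
    emb_src, emb_tgt = sim_model.encode([source, translation], convert_to_tensor=True)
    return util.cos_sim(emb_src, emb_tgt).item() >= SIM_THRESHOLD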
3.2 LLM-as-a-Judge Scoring Formula¶
score_LLM_judge = ∑_{i ∈ S} α_i × score_i
Where:
- S = {Grammar, NER, Numbers/Links, Fluency, Meaning}
- α_i = importance weight for criterion i (∑ α_i = 1, so the result is a weighted average on the 1-5 scale)
- score_i ∈ [1, 5] for each criterion
- ξ_threshold = 3.5/5.0 (samples scoring below this are discarded)
Weight Distribution (VN-MTEB):
α_grammar = 0.20
α_ner = 0.15
α_numbers = 0.15
α_fluency = 0.20
α_meaning = 0.30
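In code, the weighted judge score and the 3.5/5.0 cut-off can be applied as below (a small sketch using the weights listed above):

# Weighted LLM-judge score with the VN-MTEB-style weights above and a 3.5/5.0 cut-off
JUDGE_WEIGHTS = {
    "grammar": 0.20,
    "ner": 0.15,
    "numbers_links": 0.15,
    "fluency": 0.20,
    "meaning": 0.30,
}
JUDGE_THRESHOLD = 3.5

def judge_score(criterion_scores: dict) -> float:
    """Weighted average of per-criterion scores, each on a 1-5 scale."""
    return sum(JUDGE_WEIGHTS[c] * s for c, s in criterion_scores.items())

def keep_sample(criterion_scores: dict) -> bool:
    return judge_score(criterion_scores) >= JUDGE_THRESHOLD

# Example: 0.2*4 + 0.15*5 + 0.15*5 + 0.2*4 + 0.3*3 = 4.0 -> kept
scores = {"grammar": 4, "ner": 5, "numbers_links": 5, "fluency": 4, "meaning": 3}
print(judge_score(scores), keep_sample(scores))  # 4.0 True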
3.3 Kept Ratios by Task Type¶
| Task Category | Datasets | Kept Ratio | Interpretation |
|---|---|---|---|
| Clustering | 5 | 71.98% | Highest retention, structural preservation |
| Classification | 13 | 70.11% | Relatively preserved meaning |
| Pair Classification | 3 | 67.2% | Entailment relationships mostly intact |
| Retrieval | 15 | 66.03% | Moderate difficulty, domain-specific terms |
| Reranking | 3 | 65.2% | Nuanced ranking criteria challenging |
| STS | 3 | 53.4% | Lowest—semantic similarity hardest to preserve |
Kept Ratio Visualization:
Clustering: ████████████████████████ 71.98%
Classification: ███████████████████████ 70.11%
Pair Class: ██████████████████████ 67.2%
Retrieval: █████████████████████ 66.03%
Reranking: ████████████████████ 65.2%
STS: ████████████████ 53.4% ⚠️
3.4 Statistical Validation¶
VN-MTEB introduced word length distribution analysis as a novel validation:
# Word length distribution validation
import numpy as np

def compute_word_length_distribution(sentences):
    """Compute the distribution of word lengths (in characters) across sentences."""
    lengths = [len(word) for sent in sentences for word in sent.split()]
    return np.array(lengths)

# English and Vietnamese word lengths (parallel corpora loaded elsewhere)
en_lengths = compute_word_length_distribution(english_sentences)
vi_lengths = compute_word_length_distribution(vietnamese_sentences)

# Correlation between the two length histograms
bins = range(1, 20)
correlation = np.corrcoef(
    np.histogram(en_lengths, bins=bins)[0],
    np.histogram(vi_lengths, bins=bins)[0],
)[0, 1]
print(f"Word length correlation: r = {correlation:.3f}")
# VN-MTEB achieved: r > 0.85
3.5 Compute Requirements¶
| Resource | Specification | Notes |
|---|---|---|
| GPUs | 4 × NVIDIA H100 (700W each) | High-end GPU cluster |
| Output Rate | 3,800 tokens/second | Aya-23-35B inference |
| Total Time | ~28 days (675.54 hours) | Full MTEB translation |
| Token Accounting | Input + output tokens both counted | Effectively doubles the processed token volume |
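A rough wall-clock estimate for translation follows directly from throughput; the helper and corpus size below are illustrative, with VN-MTEB's reported numbers implying roughly 9.2B processed tokens (3,800 tok/s × 675.54 h × 3,600 s/h):

# Back-of-the-envelope translation-time estimate; token counts are illustrative
def estimate_days(total_tokens: float, tokens_per_second: float) -> float:
    """Wall-clock days needed to push total_tokens through the translator."""
    return total_tokens / tokens_per_second / 3600 / 24

print(estimate_days(9.2e9, 3_800))  # ~28 days, matching VN-MTEB's reported duration
print(estimate_days(9.2e9, 4_000))  # ~26.6 days at a slightly higher EN-ID rate;
                                    # a smaller effective corpus would bring this into the 20-25 day range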
3.6 Lessons for Indonesia-MTEB¶
| Lesson | Application |
|---|---|
| Semantic Similarity Threshold | Use 0.8 threshold for filtering |
| Task-Specific Expectations | STS will have lowest kept ratio (~50-60%) |
| Language Detection | Use LLMs (not FastText) for multilingual detection |
| LLM-as-Judge | Chain-of-thought with 5 criteria achieves 85.2% human agreement |
| Resource Estimation | 4 H100s × 20-25 days for EN-ID (faster than EN-VN) |
| Indonesia Advantage | EN-ID may have higher kept ratios (both Latin-script) |
3.7 Implementation: VN-MTEB Pipeline Adaptation¶
# Adapted VN-MTEB pipeline for Indonesian (model identifiers below are illustrative placeholders)
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

class VNStyleTranslationPipeline:
    """Indonesian adaptation of the VN-MTEB translation pipeline."""

    def __init__(self):
        # Stage 1: Language detection
        self.detector = AutoModelForCausalLM.from_pretrained(
            "Qwen/Qwen2.5-3B-Instruct"
        )
        # Stage 2: Translation (swap in a model with strong EN-ID quality,
        # e.g. TranslateGemma or Aya-23; this checkpoint is a placeholder)
        self.translator = AutoModelForCausalLM.from_pretrained(
            "google/gemma-2-27b-it"
        )
        # Stage 3b: Semantic similarity
        self.sim_model = AutoModel.from_pretrained(
            "Alibaba-NLP/gte-Qwen2-7B-instruct"
        )
        # Stage 3c: LLM judge (70B-class judge, e.g. Llama-SEA-LION-v3-70B-IT)
        self.judge = AutoModelForCausalLM.from_pretrained(
            "aisingapore/Llama-SEA-LION-v3-70B-IT"
        )

    def stage1_detect_language(self, texts):
        """Detect whether texts are English (filter out mixed-language samples)."""
        # Implementation...
        pass

    def stage2_translate(self, texts):
        """Translate English texts to Indonesian."""
        # Implementation...
        pass

    def stage3_validate(self, original, translated):
        """Three-step validation."""
        # 3a: Language detection on the output
        # 3b: Semantic similarity (threshold: 0.8)
        # 3c: LLM-as-judge (threshold: 3.5/5.0)
        pass
4. TR-MTEB (Turkish): Hybrid Approach¶
TR-MTEB (EMNLP 2025)
"TR-MTEB: A Comprehensive Benchmark and Embedding Model Suite for Turkish Sentence Representations" (EMNLP 2025 Findings) - 2+ citations - Link: aclanthology.org/2025.findings-emnlp.471
4.1 Benchmark Composition¶
| Task | Datasets | Native | Translated | Source |
|---|---|---|---|---|
| Classification | 8 | 7 | 1 | News, sentiment, irony, offensive |
| Clustering | 2 | 2 | 0 | Academic abstracts, opinions |
| Pair Classification | 3 | 0 | 3 | MNLI-TR, SNLI-TR, XNLI-TR |
| Bitext Mining | 1 | 1 | 0 | WMT16 EN-TR |
| STS | 1 | 0 | 1 | STS-Benchmark-TR |
| Retrieval | 11 | 2 | 9 | SQuAD-TR, TQuAD, MS MARCO-TR |
4.2 LLM-as-a-Judge Calibration¶
TR-MTEB implemented a calibrated LLM-as-a-Judge pipeline:
Calibration Process:
┌─────────────────────────────────────────────────────────────┐
│ 1. Human Annotation │
│ └─ 115 examples manually labeled (PASS/FAIL) │
│ │
│ 2. Prompt Iteration │
│ └─ Refined evaluation prompt to align with humans │
│ │
│ 3. Final Performance │
│ ├─ Agreement: 85.2% │
│ ├─ Precision: 92.9% │
│ ├─ Recall: 84.4% │
│ └─ F1 Score: 88.4% │
└─────────────────────────────────────────────────────────────┘
Confusion Matrix (Human vs LLM):
Actual PASS Actual FAIL
Predicted PASS 98 9
Predicted FAIL 8 0
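The calibration statistics above can be recomputed from paired human/LLM labels in a few lines; the sketch below uses scikit-learn, and the label lists are placeholders for the ~115 annotated calibration examples:

# Sketch: agreement, precision, recall and F1 for LLM-judge calibration
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# human_labels / llm_labels: parallel PASS/FAIL decisions over the calibration set (placeholders here)
human_labels = ["PASS", "PASS", "FAIL", "PASS"]
llm_labels   = ["PASS", "FAIL", "FAIL", "PASS"]

agreement = accuracy_score(human_labels, llm_labels)
precision = precision_score(human_labels, llm_labels, pos_label="PASS")
recall    = recall_score(human_labels, llm_labels, pos_label="PASS")
f1        = f1_score(human_labels, llm_labels, pos_label="PASS")
print(f"agreement={agreement:.1%} precision={precision:.1%} recall={recall:.1%} f1={f1:.1%}")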
4.3 Training Corpus Construction¶
TR-MTEB created 34.2M Turkish sentence pairs:
| Source Type | Examples | Notes |
|---|---|---|
| Question-Answer | Medical QA, Wiki QA, GSM8K-TR | Domain-specific |
| Title-Content | News headlines, Wikipedia | Asymmetric pairs |
| Paraphrase | TaPaCo-TR, multilingual NLI | Symmetric pairs |
| Synthetic | LLM-generated instruction data | Quality filtered |
Filtering Pipeline:
Initial: 62.5M pairs
↓ Similarity filtering (custom model, fine-tuned e5-base)
↓ Threshold: 0.4 cosine similarity
Final: 34.2M high-quality pairs
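A sketch of this similarity-based training-pair filter, assuming a sentence-transformers-compatible scoring model (TR-MTEB fine-tuned its own e5-base; the checkpoint below is a stand-in):

# Sketch: keep only training pairs whose cosine similarity clears the 0.4 threshold
from sentence_transformers import SentenceTransformer, util

filter_model = SentenceTransformer("intfloat/multilingual-e5-base")  # stand-in for TR-MTEB's fine-tuned e5-base

def filter_pairs(pairs, threshold=0.4, batch_size=256):
    """pairs: list of (text_a, text_b); returns the pairs scoring >= threshold."""
    kept = []
    for i in range(0, len(pairs), batch_size):
        batch = pairs[i:i + batch_size]
        emb_a = filter_model.encode([a for a, _ in batch], convert_to_tensor=True)
        emb_b = filter_model.encode([b for _, b in batch], convert_to_tensor=True)
        sims = util.cos_sim(emb_a, emb_b).diagonal()
        kept.extend(p for p, s in zip(batch, sims) if s.item() >= threshold)
    return kept
# Note: e5 models expect "query: "/"passage: " prefixes in production use; omitted here for brevity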
4.4 Lessons for Indonesia-MTEB¶
| Insight | Application |
|---|---|
| Hybrid Approach | Combine native Indonesian + translated datasets |
| Training Corpus | 34.2M pairs sufficient for competitive models |
| Calibration | Always calibrate LLM-judge with 100+ human labels |
| Similarity Threshold | 0.4 effective for training data filtering |
| Domain Coverage | Include medical, legal, news, conversational |
5. ArabicMTEB: Dialect-Aware Evaluation¶
ArabicMTEB Innovation (NAACL 2025)
"Swan and ArabicMTEB: Dialect-Aware, Arabic-Centric, Cross-Lingual, and Cross-Cultural Embedding Models and Benchmarks" (NAACL 2025) - 8+ citations - 94 datasets across multiple evaluation dimensions - Link: arxiv.org/abs/2411.01192
5.1 Multi-Dimensional Benchmark Structure¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ ArabicMTEB MULTI-DIMENSIONAL STRUCTURE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ DIMENSION 1: MAIN ARABICMTEB (94 datasets) │
│ ├─ Retrieval: 35 datasets │
│ ├─ Bitext Mining: 12 datasets │
│ ├─ Cross-Lingual Retrieval: 11 language pairs │
│ ├─ Re-Ranking: 5 datasets │
│ ├─ STS: 5 datasets (2 synthetic via GPT-4) │
│ ├─ Classification: 18 datasets │
│ ├─ Pair Classification: 3 datasets │
│ └─ Clustering: 4 datasets │
│ │
│ DIMENSION 2: DIALECTAL FORK (19 datasets) │
│ ├─ Bitext Mining: 8 dialect datasets │
│ ├─ Retrieval: 5 dialect datasets │
│ ├─ Classification: 5 dialect ID datasets │
│ └─ STS: 1 Egyptian dialect synthetic dataset │
│ │
│ DIMENSION 3: DOMAIN-SPECIFIC FORK (ArabicMTEB Lite) │
│ ├─ 10k queries, 100k documents │
│ ├─ Domains: News, Finance, Legal, Medical, Wikipedia │
│ └─ Generated via GPT-4o-mini from Wikipedia chunks │
│ │
│ DIMENSION 4: CULTURAL FORK (Country-level) │
│ ├─ 20 Arab countries │
│ ├─ 1k queries, ~15k documents per country │
│ ├─ Source: Country-specific Wikipedia portals │
│ └─ Generated via GPT-4o-mini for cultural queries │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
5.2 Novel Evaluation Dimensions¶
| Dimension | Description | Indonesian Parallel |
|---|---|---|
| Dialectal | Gulf, Egyptian, Moroccan, Levantine varieties | Regional Indonesian (Javanese-influenced, Sundanese-influenced) |
| Cross-Lingual | 11 language pairs | EN-ID, ID-JV, ID-SU |
| Domain-Specific | News, Finance, Legal, Medical | Same domains for Indonesia |
| Cultural | Country-specific cultural knowledge | Provincial cultural knowledge |
5.3 Synthetic Data Generation¶
ArabicMTEB uses Command R+ for synthetic data:
Synthetic Data Pipeline:
┌─────────────────────────────────────────────────────────────┐
│ 1. MSA Data │
│ ├─ 100k general domain samples │
│ └─ 20k domain-specific samples │
│ │
│ 2. Dialectal Data │
│ ├─ 15k Egyptian dialect samples │
│ └─ 15k Moroccan dialect samples │
│ │
│ 3. Domain Queries │
│ └─ 5 query styles per document chunk │
│ │
│ 4. Cultural Queries │
│ └─ Country-specific Wikipedia passages │
└─────────────────────────────────────────────────────────────┘
Performance Impact:
Swan-Small: 32.46 → 48.42 (+16 points with MSA synthetic)
Swan-Large: 55.39 → 61.91 (+6.5 points with MSA synthetic)
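A hedged sketch of the per-chunk query generation step; the prompt wording, client usage, and model name are illustrative rather than ArabicMTEB's exact setup:

# Sketch: generate several query styles per document chunk with an LLM
from openai import OpenAI

client = OpenAI()
QUERY_STYLES = ["keyword", "natural question", "long-form question",
                "paraphrased statement", "colloquial question"]

def generate_queries(chunk: str, language: str = "Indonesian") -> list:
    """Ask the LLM for one retrieval query per style that the given chunk answers."""
    prompt = (
        f"Document chunk:\n{chunk}\n\n"
        f"Write {len(QUERY_STYLES)} search queries in {language} that this chunk answers, "
        f"one per line, using these styles: {', '.join(QUERY_STYLES)}."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content.strip().splitlines()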
5.4 Lessons for Indonesia-MTEB¶
| ArabicMTEB Feature | Indonesia-MTEB Adaptation |
|---|---|
| Dialectal Fork | Regional language influence (Javanese, Sundanese, Minangkabau) |
| Domain-Specific Fork | Legal Indonesian (UU docs), Medical, Financial |
| Cultural Evaluation | Provincial cultural knowledge (34 provinces) |
| Synthetic Data | LLM-generated data for missing tasks |
6. SEA-BED: Human-Centric Regional Benchmark¶
SEA-BED (2025)
"SEA-BED: Southeast Asia Embedding Benchmark" (arXiv 2025) - 169 datasets across 9 tasks - 10 SEA languages including Indonesian - 71% human-formulated datasets - Link: arxiv.org/abs/2508.12243
6.1 Key Characteristics¶
| Aspect | SEA-BED Approach | Relevance to Indonesia |
|---|---|---|
| Scale | 169 datasets, 9 tasks, 10 languages | Indonesian included |
| Human-Formulated | 71% vs 29% translation/machine-generated | Quality-first approach |
| Tasks | Classification, Clustering, Pair Classification, Retrieval, Reranking, STS, Summarization, Instruction Following, Bitext Mining | Comprehensive task coverage |
| Indonesian Coverage | Included but not the focus | Can be expanded |
6.2 Data Sources¶
┌─────────────────────────────────────────────────────────────────┐
│ SEA-BED DATA SOURCES │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Human-Formulated (71%) │
│ ├─ Native datasets from each SEA country │
│ │ └─ Indonesian: IndoNLU, NusaX, IndoMMLU, etc. │
│ ├─ Academic benchmarks │
│ └─ Domain-specific corpora │
│ │
│ Translation-Based (29%) │
│ ├─ Carefully translated MTEB subsets │
│ └─ Quality validation integrated │
│ │
│ Validation Strategy │
│ ├─ Native speaker review for key datasets │
│ ├─ Statistical consistency checks │
│ └─ Inter-annotator agreement tracking │
│ │
└─────────────────────────────────────────────────────────────────┘
6.3 Lessons for Indonesia-MTEB¶
| Lesson | Application |
|---|---|
| Human-First Priority | 71% human-formulated validates quality-over-quantity |
| Indonesia Opportunity | SEA-BED Indonesian datasets can be aggregated + expanded |
| Regional Integration | Consider Indonesia-MTEB compatibility with SEA evaluation |
| Task Coverage | Include Instruction Following (SEA-BED has this) |
7. AfriMTEB: Cross-Lingual Contrastive Distillation¶
AfriMTEB (2024)
"AfriMTEB and AfriE5: Benchmarking and Adapting Text Embedding Models for African Languages" (arXiv 2024) - 59 African languages - 38 datasets from MMTEB - Cross-lingual contrastive distillation - Link: arxiv.org/abs/2510.23896
7.1 Methodology¶
┌─────────────────────────────────────────────────────────────────┐
│ AfriMTEB Approach │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Cross-Lingual Distillation │
│ ├─ Train on 9 well-resourced languages │
│ ├─ Transfer to 59 languages via alignment │
│ └─ Use NLI/SNLI multilingual data │
│ │
│ Quality Estimation │
│ ├─ SSA-COMET-MTL quality estimation model │
│ ├─ Threshold 0.75 retains ~60K from 430K samples │
│ └─ Filter low-quality translations │
│ │
│ Benchmark Composition │
│ ├─ 38 datasets from MMTEB │
│ ├─ Focus on African language tasks │
│ └─ Cross-lingual retrieval emphasis │
│ │
└─────────────────────────────────────────────────────────────────┘
7.2 Lessons for Indonesia-MTEB¶
| Insight | Application |
|---|---|
| Cross-Lingual Transfer | EN-ID alignment straightforward (both well-resourced) |
| Quality Estimation | COMET-style filtering effective |
| Resource Efficiency | Indonesian has more resources than African languages—full translation viable |
8. European MTEBs (PL, FR, DE)¶
8.1 PL-MTEB (Polish)¶
PL-MTEB (2024)
"PL-MTEB: Polish Massive Text Embedding Benchmark" (arXiv 2024) - 4+ citations - 29 datasets, 5 task groups - BEIR translation + native aggregation
| Metric | Value |
|---|---|
| Datasets | 29 (28 tasks) |
| Task Groups | Classification, Clustering, Pair Classification, Retrieval, STS |
| Approach | BEIR translation + native Polish datasets |
8.2 MTEB-French¶
MTEB-French (2024)
"MTEB-French: Resources for French Sentence Embedding" (arXiv 2024) - 17+ citations - 30+ datasets, 8 tasks - 22 existing + 3 new datasets created
| Feature | Description |
|---|---|
| Model Evaluation | 51 embedding models compared |
| Statistical Tests | Comprehensive statistical analysis |
| Correlation Study | Model-benchmark correlation analyzed |
8.3 DE-MTEB (German Clustering)¶
German Text Clustering Benchmark
"German Text Embedding Clustering Benchmark" (arXiv 2024) - Specialized in clustering evaluation - Focus on different domains
8.4 European MTEB Lessons¶
| Benchmark | Key Lesson | Indonesia Application |
|---|---|---|
| PL-MTEB | BEIR translation effective | Consider BEIR-ID translation |
| MTEB-French | Statistical analysis crucial | Include statistical validation |
| DE-MTEB | Task specialization valuable | Consider specialized forks |
9. Comparative Analysis Matrix¶
9.1 Methodology Comparison¶
| Benchmark | Translation Approach | Validation | Native Data | Training Corpus | Scale | Citations |
|---|---|---|---|---|---|---|
| C-MTEB | Minimal (native-first) | Peer-reviewed | Yes | 100M+ pairs | 35 | 1,171+ |
| VN-MTEB | Full MTEB translation | 3-stage LLM judge | No | N/A | 41 | New |
| TR-MTEB | BEIR + native | Calibrated LLM judge | Yes | 34.2M pairs | 26 | 2+ |
| ArabicMTEB | Selective + synthetic | Multi-dimensional | Yes | 122K + 135K | 94 | 8+ |
| SEA-BED | 29% translated | Human review | 71% native | N/A | 169 | 1+ |
| AfriMTEB | Cross-lingual | COMET quality | Limited | Cross-lingual | 38 | New |
| MTEB-French | Aggregated | Statistical | Yes | N/A | 30+ | 17+ |
| PL-MTEB | BEIR translation | Standard | Yes | N/A | 29 | 4+ |
9.2 Kept Ratio Comparison (Translation Quality)¶
Translation Kept Ratios by Task:
ArabicMTEB: ████████████████████████ ~75% average
TR-MTEB: ███████████████████████ ~70% average
VN-MTEB: ██████████████████████ ~65% average
Estimated ID: ████████████████████████ ~70-75% average
By Task (VN-MTEB data):
Classification: ████████████████████████ 70.11%
Clustering: █████████████████████████ 71.98%
Pair Class: ██████████████████████ 67.2%
Retrieval: █████████████████████ 66.03%
Reranking: ████████████████████ 65.2%
STS: ████████████████ 53.4% ⚠️
9.3 Resource Comparison¶
| Benchmark | GPUs | Time | Total Compute | Tokens/Sec |
|---|---|---|---|---|
| VN-MTEB | 4×H100 | 28 days | ~2,700 GPU-hours | 3,800 |
| TR-MTEB | 1×A100 | 82 hours | ~82 GPU-hours | N/A |
| Estimated ID | 4×H100 | 20-25 days | ~2,000-2,400 GPU-hours | 4,000+ |
9.4 Task Coverage Comparison¶
MTEB Category Coverage by Benchmark:
Classification: ████████████████████████████████ All 8
Pair Classification: ████████████████████████████████ All 8
Retrieval: ████████████████████████████████ All 8
Clustering: ████████████████████████ 6/8
Reranking: ████████████████████ 5/8
STS: ████████████████████ 5/8
Summarization: ████ 2/8
Instruction Following: ████ 2/8
Bitext Mining: ████████ 3/8
10. Best Practices Extraction¶
10.1 Translation Pipeline Best Practices¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ RECOMMENDED TRANSLATION PIPELINE (Enhanced) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. MODEL SELECTION │
│ ├─ Use regional leaderboard for selection (e.g., SEA-HELM for SEA) │
│ ├─ Prefer models with strong target language performance │
│ ├─ For Indonesian: TranslateGemma (27B), Aya-23 (35B/8B) │
│ └─ Consider cost/quality tradeoff │
│ │
│ 2. QUALITY CONTROL (Multi-Stage) │
│ ├─ Stage 1: Language detection (LLM, not FastText) │
│ │ └─ Model: Qwen2.5-3B-Instruct or similar │
│ ├─ Stage 2: Semantic similarity (threshold 0.75-0.80) │
│ │ └─ Model: gte-Qwen2-7B-instruct or similar │
│ ├─ Stage 3: LLM-as-judge (CoT prompting, 5 criteria) │
│ │ └─ Model: 70B+ parameter model with CoT capability │
│ │ └─ Criteria: Grammar, NER, Numbers, Fluency, Meaning │
│ └─ Stage 4: Human validation on 10% sample │
│ │
│ 3. TASK-SPECIFIC EXPECTATIONS │
│ ├─ Classification: ~70-75% kept ratio │
│ ├─ Clustering: ~70-75% kept ratio │
│ ├─ Pair Classification: ~65-70% kept ratio │
│ ├─ Retrieval: ~65-70% kept ratio │
│ ├─ Reranking: ~65-70% kept ratio │
│ └─ STS: ~50-60% kept ratio (plan for low retention) │
│ │
│ 4. STATISTICAL VALIDATION │
│ ├─ Word length distribution analysis (target: r > 0.85) │
│ ├─ Kept ratio tracking by task and domain │
│ ├─ Semantic similarity distribution analysis │
│ └─ Inter-annotator agreement (target: >0.8) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
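Composing the stages above into a per-sample decision might look like the sketch below; detect_english, translate_to_id, passes_semantic_check, score_with_llm_judge, and judge_score are assumed helpers (e.g. the sketches from Sections 3.1-3.2), not a finished implementation:

# End-to-end per-sample decision composing the pipeline stages (helpers are assumed)
def process_sample(text_en: str):
    """Return the accepted Indonesian translation, or None if any stage rejects it."""
    if not detect_english(text_en):                     # Stage 1: language detection
        return None
    text_id = translate_to_id(text_en)                  # Stage 2: translation
    if not passes_semantic_check(text_en, text_id):     # Stage 3b: similarity >= 0.75-0.80
        return None
    criterion_scores = score_with_llm_judge(text_en, text_id)
    if judge_score(criterion_scores) < 3.5:             # Stage 3c: weighted judge score
        return None
    return text_id  # Stage 4: a 10% sample of accepted outputs goes to human review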
10.2 Validation Best Practices Summary¶
| Practice | Description | Source |
|---|---|---|
| Calibrate LLM Judge | 100+ human-labeled examples for prompt tuning | TR-MTEB |
| Multi-Criteria Scoring | Grammar, NER, fluency, meaning preservation | VN-MTEB |
| Chain-of-Thought | CoT prompting for improved LLM judgment | VN-MTEB |
| Semantic Similarity Threshold | 0.75-0.80 for filtering | VN-MTEB |
| Statistical Analysis | Word length distribution correlation | VN-MTEB |
| Domain-Specific Evaluation | Separate forks for different domains | ArabicMTEB |
| Cultural Awareness | Cultural knowledge evaluation | ArabicMTEB |
| Human Validation | 10% sample human review | All benchmarks |
10.3 Dataset Construction Best Practices¶
| Practice | Description | Source |
|---|---|---|
| Hybrid Approach | Native datasets + high-quality translations | TR-MTEB, ArabicMTEB |
| Domain Diversity | News, finance, legal, medical, conversational | All |
| Pair Type Balance | Symmetric (paraphrase) + asymmetric (query-doc) | C-MTEB |
| Deduplication | Semantic deduplication (PolyDeDupe-style) | C-MTEB |
| Quality Filtering | Similarity threshold for training corpora | TR-MTEB |
| Synthetic Data | LLM-generated domain-specific data | ArabicMTEB |
| License Compliance | Track and respect dataset licenses | All |
10.4 Novel Innovations by Benchmark¶
| Benchmark | Innovation | Indonesia-MTEB Potential |
|---|---|---|
| VN-MTEB | 3-stage LLM-based translation QC | Adapt for EN-ID translation |
| ArabicMTEB | Dialectal evaluation | Regional Indonesian varieties |
| ArabicMTEB | Cultural fork | Provincial cultural knowledge |
| TR-MTEB | Calibrated LLM judge | Apply to Indonesian validation |
| C-MTEB | Large training corpus | Create ID-Pack (50M+ pairs) |
| SEA-BED | Human-first ratio (71%) | Prioritize native Indonesian data |
| KorFinMTEB | Domain-specific (financial) | Create Indonesian financial fork |
| AfriMTEB | Cross-lingual distillation | EN-ID cross-lingual alignment |
11. Recommended Methodology for Indonesia-MTEB¶
11.1 Three-Phase Approach¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ INDONESIA-MTEB METHODOLOGY (ENHANCED) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ PHASE 1: AGGREGATION (Document 03 ✅ Complete) │
│ ├─ 70+ existing Indonesian datasets catalogued │
│ ├─ Coverage: Classification ✓, Pair Classification ✓, Retrieval ✓ │
│ ├─ Gaps: Clustering ✗, Reranking ✗, STS limited │
│ └─ Native dataset inventory: Complete │
│ │
│ PHASE 2: TRANSLATION (Document 05) │
│ ├─ Model: TranslateGemma-2-27B or Aya-23-35B │
│ ├─ Pipeline: 3-stage (Detection → Translation → QC) │
│ │ └─ Stage 1: Qwen2.5-3B-Instruct (language detection) │
│ │ └─ Stage 2: TranslateGemma/Aya-23 (translation) │
│ │ └─ Stage 3: LLM-as-judge + semantic similarity │
│ ├─ Target Datasets: Clustering, Reranking, STS gaps │
│ ├─ Expected Kept Ratio: 70-75% (higher than VN-MTEB) │
│ └─ Estimated Time: 4×H100 × 20-25 days │
│ │
│ PHASE 3: AI GENERATION (Document 06) │
│ ├─ Target: Domain-specific tasks, cultural queries │
│ ├─ Method: LLM-as-generator + LLM-as-judge │
│ ├─ Validation: Statistical consistency + human spot-check │
│ └─ Cultural Fork: Wikipedia Indonesia for 34 provinces │
│ │
│ ADDITIONAL DIMENSIONS (Novelty) │
│ ├─ Archipelago-Aware: Regional language influence evaluation │
│ ├─ Formal-Register: Informal (slang) → Formal → Academic continuum │
│ ├─ Code-Mixing: Indonesian-English code-mixing evaluation │
│ └─ Domain-Specific: Legal (UU), Medical, Financial forks │
│ │
│ INTEGRATION │
│ ├─ MTEB-compatible format │
│ ├─ Metadata documentation │
│ └─ HuggingFace upload with proper licensing │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
11.2 Resource Estimation¶
| Task | GPUs | Time | GPU-Hours | Rationale |
|---|---|---|---|---|
| Full MTEB Translation | 4×H100 | 20-25 days | 2,000-2,400 | EN-ID closer than EN-VN |
| AI Dataset Generation | 2×H100 | 5-7 days | 240-336 | Clustering + Reranking |
| Validation | 1×H100 | 3-5 days | 72-120 | LLM-as-judge evaluation |
| Total | - | - | ~2,500-3,000 | Conservative estimate |
11.3 Translation Model Selection for Indonesian¶
| Model | Parameters | ID Performance | Cost Efficiency | Recommendation |
|---|---|---|---|---|
| TranslateGemma-2-27B | 27B | Excellent (55 langs) | Medium | Primary |
| Aya-23-35B | 35B | Excellent (SEA focus) | Low | Alternative |
| Aya-23-8B | 8B | Very good | High | Cost-efficient |
| NLLB-200 | 3.3B | Good | Very High | Smaller option |
| SEA-LION-v3 | - | N/A | N/A | Judge model only |
11.4 Quality Validation Thresholds¶
| Metric | Threshold | Justification |
|---|---|---|
| Semantic Similarity | ≥0.80 | VN-MTEB used 0.8 |
| LLM Judge Score | ≥3.5/5.0 | Calibrated threshold |
| Kept Ratio Target | 65-75% | By task type |
| Word Length Correlation | r ≥ 0.85 | Statistical quality check |
| Human Validation | 10% sample | Final quality check |
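Keeping these thresholds in one configuration object lets the translation, judging, and validation stages read from a single source of truth; a small sketch (names are illustrative):

# Sketch: central configuration for the quality thresholds listed above
from dataclasses import dataclass

@dataclass(frozen=True)
class QualityThresholds:
    semantic_similarity: float = 0.80        # cosine similarity between source and translation
    llm_judge_score: float = 3.5             # weighted judge score out of 5
    kept_ratio_target: tuple = (0.65, 0.75)  # acceptable kept-ratio band, varies by task
    word_length_correlation: float = 0.85    # Pearson r between length histograms
    human_validation_fraction: float = 0.10  # share of accepted samples sent to human review

THRESHOLDS = QualityThresholds()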
11.5 Novel Dimensions for Indonesia-MTEB¶
Based on regional MTEB gaps:
┌─────────────────────────────────────────────────────────────────┐
│ NOVEL DIMENSIONS FOR INDONESIA-MTEB │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. ARCHIPELAGO-AWARE EVALUATION │
│ ├─ Javanese-influenced Indonesian │
│ ├─ Sundanese-influenced Indonesian │
│ ├─ Minangkabau-influenced Indonesian │
│ └─ Other regional varieties │
│ │
│ 2. FORMAL-REGISTER CONTINUUM │
│ ├─ Informal/Slang (social media, Kaskus) │
│ ├─ Semi-formal (news articles) │
│ ├─ Formal (academic papers, legal documents) │
│ └─ Administrative (government regulations) │
│ │
│ 3. CODE-MIXING EVALUATION │
│ ├─ Indonesian-English code-mixing │
│ ├─ Prevalent in urban social media │
│ └─ Real-world use case evaluation │
│ │
│ 4. CULTURAL KNOWLEDGE (34 Provinces) │
│ ├─ Province-specific cultural queries │
│ ├─ Source: Wikipedia Indonesia + provincial portals │
│ └─ Generated via LLM with human validation │
│ │
│ 5. DOMAIN-SPECIFIC FORKS │
│ ├─ Legal Indonesian (UU documents, court decisions) │
│ ├─ Medical Indonesian │
│ ├─ Financial Indonesian │
│ └─ Religious Indonesian (Islamic contexts) │
│ │
└─────────────────────────────────────────────────────────────────┘
12. MTEB Integration Strategy¶
12.1 Adding a Benchmark to MTEB Official¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ MTEB OFFICIAL INTEGRATION PROCESS (Updated) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. DATASET FORMAT REQUIREMENTS │
│ ├─ Implement mteb.AbsTask subclass │
│ ├─ Load data with .load_data() method │
│ ├─ Define metadata (name, description, license, eval_langs) │
│ ├─ Implement evaluation for your task type │
│ └─ Follow MTEB dataset card format │
│ │
│ 2. SUBMISSION CHECKLIST │
│ ├─ Fork: github.com/embeddings-benchmark/mteb │
│ ├─ Add: mteb/datasets/your_dataset/ │
│ ├─ Register: mteb/datasets/__init__.py │
│ ├─ Test: CI/CD must pass │
│ ├─ PR: Create with detailed description │
│ └─ Address: Reviewer feedback │
│ │
│ 3. HUGGINGFACE UPLOAD │
│ ├─ Upload to: huggingface.co/datasets/ │
│ ├─ Use MTEB dataset card format │
│ ├─ Include: License, size, task metadata │
│ └─ Link: Original sources │
│ │
│ 4. LEADERBOARD SUBMISSION │
│ ├─ Run: Evaluation on baseline models │
│ ├─ Submit: mteb/results dataset │
│ ├─ Create: Benchmark discussion on leaderboard │
│ └─ Request: Leaderboard integration │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
12.2 Implementation Example¶
# Indonesia-MTEB dataset implementation example
from mteb import AbsTask, TaskMetadata
class IndonesianSentiment(AbsTaskClassification):
"""Indonesian sentiment analysis task for MTEB."""
metadata = TaskMetadata(
name="IndonesianSentiment",
description="Indonesian sentiment analysis from social media",
dataset={
"path": "indonlp/indonlu",
"name": "smsa",
"revision": "main"
},
type="Classification",
category="s2s",
eval_splits=["test"],
eval_langs=["ind"], # Indonesian language code
main_score="accuracy",
date=None,
form=None,
domains=["Social", "Written"],
task_subtypes=["Sentiment"],
license="CC-BY-SA-4.0",
annotations_creators="human-verified",
dialect=[],
sample_creation="found",
bibtex_citation="""
@article{wilie2020indonlu,
title={IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding},
author={Wilie, Bryan and Vincentio, Bryan and et al.},
journal={arXiv preprint arXiv:2009.05387},
year={2020}
}
"""
)
def load_data(self, **kwargs):
"""Load Indonesian sentiment data."""
from datasets import load_dataset
return load_dataset("indonlp/indonlu", "smsa")
12.3 MTEB Integration Links¶
| Resource | URL |
|---|---|
| GitHub Repository | github.com/embeddings-benchmark/mteb |
| Leaderboard | huggingface.co/spaces/mteb/leaderboard |
| Results Dataset | huggingface.co/datasets/mteb/results |
| Adding Datasets | github.com/embeddings-benchmark/mteb/blob/main/docs/adding_a_dataset.md |
| Adding Models | github.com/embeddings-benchmark/mteb/blob/main/docs/adding_a_model.md |
13. Key Takeaways for Indonesia-MTEB¶
13.1 Methodology Recommendations¶
| Priority | Recommendation | Rationale |
|---|---|---|
| 1 | Adopt VN-MTEB's 3-stage translation pipeline | Proven automated QC |
| 2 | Use TranslateGemma or Aya-23 for translation | Strong ID support |
| 3 | Calibrate LLM judge with 100+ human samples | TR-MTEB: 88.4% F1 |
| 4 | Create ID-specific training corpus (ID-Pack) | C-MTEB approach |
| 5 | Add domain-specific + cultural forks | ArabicMTEB innovation |
| 6 | Target 70-75% kept ratio | Higher than VN-MTEB |
13.2 Novelty Opportunities¶
Based on regional MTEB analysis:
- Archipelago-Aware Evaluation: Regional language influence on Indonesian
- Formal-Register Continuum: Informal → Formal → Academic Indonesian
- Code-Mixing Evaluation: Indonesian-English code-mixing (social media)
- Cultural Knowledge: 34 provincial cultural queries
- Domain-Specific Forks: Legal, Medical, Financial Indonesian
13.3 Success Criteria Alignment¶
| Criterion | Target | Benchmark Reference |
|---|---|---|
| Task Coverage | All 8 MTEB categories | VN-MTEB: 6, ArabicMTEB: 8 |
| Dataset Count | 60-100 datasets | ArabicMTEB: 94, SEA-BED: 169 |
| Quality | ≥70% kept ratio, 10% human validation | VN-MTEB: 65% avg |
| Publication | ACL/EMNLP/NAACL dataset paper | C-MTEB: SIGIR, TR-MTEB: EMNLP |
| Adoption | MTEB leaderboard integration | All regional MTEBs |
14. References¶
Regional MTEB Papers (2024-2025)¶
- C-MTEB: Xiao et al. (2024). "C-Pack: Packaged Resources To Advance General Chinese Embeddings." SIGIR 2024. arxiv.org/abs/2309.07597 - 1,171+ citations
- ArabicMTEB: Bhatia et al. (2025). "Swan and ArabicMTEB: Dialect-Aware, Arabic-Centric, Cross-Lingual, and Cross-Cultural Embedding Models and Benchmarks." NAACL 2025. arxiv.org/abs/2411.01192 - 8+ citations
- MTEB-French: Ciancone et al. (2024). "MTEB-French: Resources for French Sentence Embedding." arXiv:2405.20468. arxiv.org/abs/2405.20468 - 17+ citations
- VN-MTEB: Pham et al. (2025). "VN-MTEB: Vietnamese Massive Text Embedding Benchmark." arXiv:2507.21500. arxiv.org/abs/2507.21500
- TR-MTEB: Baysan & Güngör (2025). "TR-MTEB: A Comprehensive Benchmark and Embedding Model Suite for Turkish Sentence Representations." EMNLP 2025 Findings. aclanthology.org/2025.findings-emnlp.471 - 2+ citations
- SEA-BED: Ponwitayarat et al. (2025). "SEA-BED: Southeast Asia Embedding Benchmark." arXiv:2508.12243. arxiv.org/abs/2508.12243 - 1+ citation
- AfriMTEB: Uemura et al. (2024). "AfriMTEB and AfriE5: Benchmarking and Adapting Text Embedding Models for African Languages." arXiv:2510.23896. arxiv.org/abs/2510.23896
- PL-MTEB: Poświata et al. (2024). "PL-MTEB: Polish Massive Text Embedding Benchmark." arXiv:2405.10138. arxiv.org/abs/2405.10138 - 4+ citations
- KorFinMTEB: Hwang et al. (2025). "What Advantages Can Low-Resource Domain-Specific Instruction Tuning Bring to Large Language Models? A Case Study on Korean Financial Texts." arXiv:2502.07131. arxiv.org/abs/2502.07131 - 4+ citations
Original MTEB¶
- Muennighoff et al. (2023). "MTEB: Massive Text Embedding Benchmark." EACL 2023. arxiv.org/abs/2210.07316 - 1,488+ citations
Translation Models¶
- Google (2024). "TranslateGemma: A new suite of open translation models." blog.google/technology/ai/translategemma/
- Cohere For AI (2024). "Aya 23: Open weight releases to further multilingual progress." arXiv:2405.15032. arxiv.org/abs/2405.15032
MTEB Resources¶
- MTEB GitHub: github.com/embeddings-benchmark/mteb
- MTEB Leaderboard: huggingface.co/spaces/mteb/leaderboard
- MTEB Datasets: huggingface.co/mteb
15. Document Roadmap¶
| Document | Content | Status |
|---|---|---|
| 01 | Project Overview | ✅ Enhanced |
| 02 | MTEB Structure Analysis | ✅ Enhanced |
| 03 | Existing Indonesian Datasets | ✅ Enhanced |
| 04 | Regional MTEB Methodologies | ✅ Enhanced |
| 05 | Translation Models Benchmark | 🔲 Next |
| 06 | AI Dataset Generation Methods | Pending |
| 07 | Validation Strategies | Pending |
| 08 | ACL Dataset Paper Standards | Pending |
| 09 | Novelty Angle & Publication | Pending |
| 10 | Implementation Roadmap | Pending |
"The most successful regional MTEBs combine three elements: rigorous quality control, linguistic/cultural awareness, and comprehensive task coverage. Indonesia-MTEB will synthesize these approaches while introducing archipelago-aware and formal-register evaluation dimensions unique to the Indonesian context."