
Project: Indonesia-MTEB Benchmark
Document: 04 - Regional MTEB Methodologies Analysis (ENHANCED)
Last Updated: 2026-01-25
Version: 2.0 - Enhanced with Latest Research (2024-2025)


Regional MTEB Methodologies: A Comprehensive Comparative Analysis

"Understanding how successful regional MTEBs were constructed provides the blueprint for Indonesia-MTEB. This document analyzes 10+ regional benchmarks with latest research findings from 2024-2025."


Executive Summary

Key Findings

  • 10+ regional MTEBs analyzed with latest 2024-2025 research
  • C-MTEB leads with 1,171+ citations (SIGIR 2024)
  • VN-MTEB's 3-stage pipeline sets standard for translation quality
  • ArabicMTEB introduces dialect-aware and cross-cultural evaluation
  • Indonesia-MTEB can leverage multilingual approaches + SEA-BED integration
graph TD
    A[Regional MTEB Landscape 2024-2025] --> B[Asian Languages]
    A --> C[European Languages]
    A --> D[Middle Eastern/African]
    A --> E[Regional/SEA]

    B --> B1[C-MTEB: Chinese - 1,171+ citations]
    B --> B2[VN-MTEB: Vietnamese - 3-stage pipeline]
    B --> B3[TR-MTEB: Turkish - calibrated LLM judge]
    B --> B4[KorFinMTEB: Korean Financial]

    C --> C1[PL-MTEB: Polish - BEIR translation]
    C --> C2[MTEB-French: 22+ datasets]
    C --> C3[DE-MTEB: German clustering]

    D --> D1[ArabicMTEB: Dialect-aware - 94 datasets]
    D --> D2[AfriMTEB: 59 languages - contrastive distillation]

    E --> E1[SEA-BED: 10 SEA languages - 169 datasets]
    E --> E2[MMTEB: 1,090 languages - 500+ tasks]

    style B1 fill:#ff6b6b,color:#fff
    style D1 fill:#ffd93d,color:#333
    style E1 fill:#51cf66,color:#fff

Table of Contents

  1. The Regional MTEB Landscape
  2. C-MTEB (Chinese): Curated Aggregation
  3. VN-MTEB (Vietnamese): Automated Translation Pipeline
  4. TR-MTEB (Turkish): Hybrid Approach
  5. ArabicMTEB: Dialect-Aware Evaluation
  6. SEA-BED: Human-Centric Regional Benchmark
  7. AfriMTEB: Cross-Lingual Contrastive Distillation
  8. European MTEBs (PL, FR, DE)
  9. Comparative Analysis Matrix
  10. Best Practices Extraction
  11. Recommended Methodology for Indonesia-MTEB
  12. MTEB Integration Strategy
  13. Key Takeaways for Indonesia-MTEB
  14. References
  15. Document Roadmap

1. The Regional MTEB Landscape

1.1 Complete Benchmark Overview (2024-2025)

| Benchmark | Language | Scale | Publication | Citations | Methodology |
|---|---|---|---|---|---|
| C-MTEB | Chinese | 35 datasets, 6 tasks | SIGIR 2024 | 1,171+ | Curated aggregation + 100M pairs |
| ArabicMTEB | Arabic | 94 datasets, 8 tasks | NAACL 2025 | 8+ | Dialect-aware + cultural evaluation |
| MTEB-French | French | 30+ datasets, 8 tasks | arXiv 2024 | 17+ | Aggregated French resources |
| VN-MTEB | Vietnamese | 41 datasets, 6 tasks | arXiv 2025 | New | Automated LLM translation pipeline |
| TR-MTEB | Turkish | 26 datasets, 6 tasks | EMNLP 2025 | 2+ | Translation + native corpus |
| PL-MTEB | Polish | 29 datasets, 5 tasks | arXiv 2024 | 4+ | BEIR translation |
| SEA-BED | 10 SEA langs | 169 datasets, 9 tasks | arXiv 2025 | 1+ | Human-centric (71% native) |
| AfriMTEB | 59 African langs | 38 datasets | arXiv 2024 | New | Cross-lingual distillation |
| KorFinMTEB | Korean | 26 datasets, 7 tasks | arXiv 2025 | 4+ | Domain-specific (financial) |
| DE-MTEB | German | Clustering focus | GitHub | - | Clustering specialization |

1.2 Methodology Categories

┌─────────────────────────────────────────────────────────────────────────────┐
│                    REGIONAL MTEB METHODOLOGY TAXONOMY                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  CATEGORY 1: NATIVE-FIRST APPROACH                                           │
│  ┌─────────────────────────────────────────────────────────────────────────┐  │
│  │ • C-MTEB (Chinese): Prioritized native datasets, minimal translation   │  │
│  │ • SEA-BED (SEA): 71% human-formulated, native-focused                  │  │
│  │ • ArabicMTEB: Native Arabic + dialectal data                           │  │
│  └─────────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
│  CATEGORY 2: FULL TRANSLATION PIPELINE                                       │
│  ┌─────────────────────────────────────────────────────────────────────────┐  │
│  │ • VN-MTEB: Complete MTEB translation with 3-stage QC                   │  │
│  │ • PL-MTEB: BEIR datasets translated + native aggregation              │  │
│  │ • MTEB-French: MTEB subsets + native French resources                 │  │
│  └─────────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
│  CATEGORY 3: HYBRID APPROACH                                                 │
│  ┌─────────────────────────────────────────────────────────────────────────┐  │
│  │ • TR-MTEB: BEIR translation + native Turkish datasets                  │  │
│  │ • ArabicMTEB: Native + selective translation + synthetic              │  │
│  └─────────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
│  CATEGORY 4: SPECIALIZED APPROACH                                            │
│  ┌─────────────────────────────────────────────────────────────────────────┐  │
│  │ • KorFinMTEB: Domain-specific (financial) Korean                       │  │
│  │ • DE-MTEB: Task-specific (clustering) German                           │  │
│  │ • AfriMTEB: Cross-lingual distillation from 9 resource languages      │  │
│  └─────────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

2. C-MTEB (Chinese): Curated Aggregation

C-MTEB Impact

"C-Pack: Packaged Resources To Advance General Chinese Embeddings" (SIGIR 2024) - 1,171+ citations (highest among regional MTEBs) - Link: arxiv.org/abs/2309.07597 - HuggingFace: huggingface.co/C-MTEB

2.1 Methodology Overview

┌─────────────────────────────────────────────────────────────────┐
│                        C-MTEB Methodology                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                   │
│  DATA SOURCES                                                    │
│  ├─ Web corpora (public Chinese websites)                       │
│  ├─ Question-Answer forums (Zhihu, Baidu Knows)                 │
│  ├─ Encyclopedia content (Baidu Baike, Wikipedia)               │
│  └─ News articles from multiple sources                         │
│                                                                   │
│  CONSTRUCTION                                                    │
│  ├─ 100M+ sentence pairs (C-MTP corpus)                         │
│  ├─ Symmetric + asymmetric pair types                           │
│  └─ Multi-stage filtering                                      │
│                                                                   │
│  BENCHMARK COMPOSITION                                           │
│  ├─ 35 datasets                                                  │
│  ├─ 6 task types (no bitext mining, no summarization)          │
│  └─ Native Chinese datasets only (no translation)              │
│                                                                   │
└─────────────────────────────────────────────────────────────────┘

2.2 C-MTP Training Corpus Details

| Component | Size/Type | Description |
|---|---|---|
| Total Pairs | 100M+ | Chinese sentence pairs |
| Symmetric Pairs | 60% | Paraphrase, NLI, STS |
| Asymmetric Pairs | 40% | Query-document, QA |
| Sources | Web, QA, Encyclopedia, News | Diverse domains |
| Filtering | Multi-stage | Quality control |

2.3 Task Distribution

C-MTEB Task Composition (35 datasets):

Classification:      █████████ 9 datasets (26%)
Retrieval:           ████████  8 datasets (23%)
STS:                 ████████  8 datasets (23%)
Reranking:           ████      4 datasets (11%)
Clustering:          ████      4 datasets (11%)
Pair Classification: ██        2 datasets (6%)

2.4 Lessons for Indonesia-MTEB

| Insight | Application to Indonesia |
|---|---|
| Native-First Approach | Prioritize existing Indonesian datasets (IndoNLU, NusaX, etc.) |
| Large Training Corpus | Create ID-Pack with 50M+ Indonesian sentence pairs |
| Domain Diversity | Include Kaskus (forum), detik.com (news), Wikipedia ID (encyclopedia) |
| BGE Model Family | Consider fine-tuning BGE for Indonesian (BGE-ID) |

2.5 Implementation: Loading C-MTEB

from mteb import MTEB
from sentence_transformers import SentenceTransformer

# C-MTEB evaluation example (two native Chinese retrieval tasks)
evaluation = MTEB(tasks=["T2Retrieval", "DuRetrieval"])

# Run on a Chinese embedding model (MTEB.run expects a model object,
# not a model name string)
model = SentenceTransformer("BAAI/bge-large-zh-v1.5")
results = evaluation.run(
    model,
    eval_splits=["test"],
    output_folder="results/c-mteb"
)

# Access specific C-MTEB datasets directly
from datasets import load_dataset

# Load a C-MTEB clustering dataset
c_mteb = load_dataset("C-MTEB/CLSClusteringS2S", "default")
print(c_mteb)

3. VN-MTEB (Vietnamese): Automated Translation Pipeline

VN-MTEB Innovation (2025)

"VN-MTEB: Vietnamese Massive Text Embedding Benchmark" (arXiv 2025) - First comprehensive automated translation pipeline - 3-stage quality control with LLM-as-judge - Link: arxiv.org/abs/2507.21500

3.1 Three-Stage Translation Pipeline

┌─────────────────────────────────────────────────────────────────────────────┐
│                    VN-MTEB TRANSLATION PIPELINE (DETAILED)                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  STAGE 1: LANGUAGE DETECTION                                                 │
│  ┌─────────────────────────────────────────────────────────────────────────┐  │
│  │  Method: LLM-based detection (Qwen2.5-3B-Instruct)                      │  │
│  │  Purpose: Filter source language samples from mixed content            │  │
│  │  Why not FastText: Interleaved languages cause detection errors       │  │
│  │                                                                          │  │
│  │  Accuracy: >99% on clean samples, ~95% on mixed content                │  │
│  └─────────────────────────────────────────────────────────────────────────┘  │
│                                       ↓                                        │
│  STAGE 2: TRANSLATION                                                        │
│  ┌─────────────────────────────────────────────────────────────────────────┐  │
│  │  Model: Aya-23-35B (Cohere For AI)                                    │  │
│  │  Selected via: SEA-HELM leaderboard (top performer for Vietnamese)    │  │
│  │  Temperature: 0.0 (deterministic for consistency)                      │  │
│  │  Max tokens: 4096                                                      │  │
│  │  Prompt Engineering: Optimized for EN-VI translation quality           │  │
│  └─────────────────────────────────────────────────────────────────────────┘  │
│                                       ↓                                        │
│  STAGE 3: THREE-STEP VALIDATION                                             │
│  ┌─────────────────────────────────────────────────────────────────────────┐  │
│  │  3a. LANGUAGE DETECTION                                                 │  │
│  │      └─ Verify output is Vietnamese (Qwen2.5-3B)                       │  │
│  │                                                                          │  │
│  │  3b. SEMANTIC SIMILARITY                                                │  │
│  │      ├─ Model: gte-Qwen2-7B-instruct                                   │  │
│  │      ├─ Metric: Cosine similarity                                     │  │
│  │      ├─ Threshold: 0.8                                                │  │
│  │      └─ Context length: 32,768 tokens                                  │  │
│  │                                                                          │  │
│  │  3c. LLM-AS-A-JUDGE                                                     │  │
│  │      ├─ Judge Model: Llama-SEA-LION-v3-70B-IT                         │  │
│  │      ├─ Evaluation Criteria (5 dimensions):                           │  │
│  │      │  ├─ Grammar and Syntax                                        │  │
│  │      │  ├─ Named Entity Recognition (NER) preservation               │  │
│  │      │  ├─ Numbers/Links/Special Characters preservation             │  │
│  │      │  ├─ Fluency and Naturalness                                    │  │
│  │      │  └─ Meaning Preservation                                      │  │
│  │      ├─ Scoring: 1-5 scale per criterion, weighted average            │  │
│  │      ├─ Technique: Chain-of-Thought prompting                         │  │
│  │      └─ Agreement: 85.2% with human judgments                         │  │
│  └─────────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

3.2 LLM-as-a-Judge Scoring Formula

score_judge = Σ_{i ∈ S} α_i × score_i

Where:
- S = {Grammar, NER, Numbers/Links, Fluency, Meaning}
- α_i = importance weight, with Σ α_i = 1 (so the score is a weighted average on the 1-5 scale)
- score_i ∈ [1, 5] for each criterion
- ξ_threshold = 3.5/5.0 (samples scoring below are discarded)

Weight Distribution (VN-MTEB):
α_grammar = 0.20
α_ner = 0.15
α_numbers = 0.15
α_fluency = 0.20
α_meaning = 0.30
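
As a minimal sketch, the scoring rule and filtering decision reduce to a few lines of Python (weights and threshold as listed above; function names are illustrative):

# Minimal sketch of the VN-MTEB-style weighted judge score.
WEIGHTS = {"grammar": 0.20, "ner": 0.15, "numbers": 0.15,
           "fluency": 0.20, "meaning": 0.30}  # sums to 1.0
THRESHOLD = 3.5  # samples scoring below 3.5/5.0 are discarded

def judge_score(scores: dict) -> float:
    """Weighted average of per-criterion scores, each in [1, 5]."""
    return sum(WEIGHTS[c] * scores[c] for c in WEIGHTS)

def keep_sample(scores: dict) -> bool:
    return judge_score(scores) >= THRESHOLD

# Example: strong meaning preservation, weak fluency
s = {"grammar": 4, "ner": 5, "numbers": 5, "fluency": 3, "meaning": 4}
print(judge_score(s), keep_sample(s))  # 4.1 True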

3.3 Kept Ratios by Task Type

| Task Category | Datasets | Kept Ratio | Interpretation |
|---|---|---|---|
| Clustering | 5 | 71.98% | Highest retention, structural preservation |
| Classification | 13 | 70.11% | Meaning relatively well preserved |
| Pair Classification | 3 | 67.2% | Entailment relationships mostly intact |
| Retrieval | 15 | 66.03% | Moderate difficulty, domain-specific terms |
| Reranking | 3 | 65.2% | Nuanced ranking criteria challenging |
| STS | 3 | 53.4% | Lowest; semantic similarity hardest to preserve |

Kept Ratio Visualization:

Clustering:       ████████████████████████ 71.98%
Classification:   ███████████████████████  70.11%
Pair Class:       ██████████████████████   67.2%
Retrieval:        █████████████████████    66.03%
Reranking:        ████████████████████     65.2%
STS:              ████████████████         53.4% ⚠️
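
Tracking the same statistic for an Indonesian run is straightforward; a small sketch (the helper is hypothetical, not from VN-MTEB):

# Sketch: compute kept ratio per task from (task, was_kept) records,
# for comparison against the VN-MTEB figures above.
from collections import Counter

def kept_ratios(records):
    total, kept = Counter(), Counter()
    for task, was_kept in records:
        total[task] += 1
        kept[task] += int(was_kept)
    return {task: kept[task] / total[task] for task in total}

print(kept_ratios([("STS", True), ("STS", False), ("Retrieval", True)]))
# {'STS': 0.5, 'Retrieval': 1.0}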

3.4 Statistical Validation

VN-MTEB introduced word-length distribution analysis as a novel validation step:

# Word-length distribution validation
import numpy as np

def compute_word_length_distribution(sentences):
    """Character length of every whitespace-separated word."""
    lengths = [len(word) for sent in sentences for word in sent.split()]
    return np.array(lengths)

# English and Vietnamese word lengths
en_lengths = compute_word_length_distribution(english_sentences)
vi_lengths = compute_word_length_distribution(vietnamese_sentences)

# Correlate the two length histograms
correlation = np.corrcoef(
    np.histogram(en_lengths, bins=range(1, 20))[0],
    np.histogram(vi_lengths, bins=range(1, 20))[0]
)[0, 1]

print(f"Word length correlation: r = {correlation:.3f}")
# VN-MTEB achieved: r > 0.85

3.5 Compute Requirements

| Resource | Specification | Notes |
|---|---|---|
| GPUs | 4 × NVIDIA H100 (700W each) | High-end GPU cluster |
| Output Rate | 3,800 tokens/second | Aya-23-35B inference |
| Total Time | ~28 days (675.54 hours) | Full MTEB translation |
| Token Throughput | 2× processing time | Input + output tokens counted |
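
These figures imply a total token budget that can be sanity-checked with simple arithmetic; the ~9.2B-token total below is derived from the reported rate and duration, not stated in the paper:

# Back-of-envelope: wall-clock days = tokens / rate. The token budget
# is inferred from VN-MTEB's reported 3,800 tok/s over 675.54 hours.
def translation_days(total_tokens: float, tokens_per_sec: float) -> float:
    return total_tokens / tokens_per_sec / 3600 / 24

vn_tokens = 3_800 * 675.54 * 3_600        # ~9.24e9 tokens processed
print(f"{translation_days(vn_tokens, 3_800):.1f} days")  # ~28.1 (matches VN-MTEB)
print(f"{translation_days(vn_tokens, 4_800):.1f} days")  # ~22.3 at a ~25% faster rate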

3.6 Lessons for Indonesia-MTEB

| Lesson | Application |
|---|---|
| Semantic Similarity Threshold | Use a 0.8 threshold for filtering |
| Task-Specific Expectations | STS will have the lowest kept ratio (~50-60%) |
| Language Detection | Use LLMs (not FastText) for multilingual detection |
| LLM-as-Judge | Chain-of-thought with 5 criteria achieves 85.2% human agreement |
| Resource Estimation | 4 H100s × 20-25 days for EN-ID (faster than EN-VN) |
| Indonesia Advantage | EN-ID may have higher kept ratios (both Latin-script) |

3.7 Implementation: VN-MTEB Pipeline Adaptation

# Adapted VN-MTEB pipeline for Indonesian
from transformers import AutoModel, AutoModelForCausalLM

class VNStyleTranslationPipeline:
    """Indonesian adaptation of the VN-MTEB translation pipeline."""

    def __init__(self):
        # Stage 1: Language detection
        self.detector = AutoModelForCausalLM.from_pretrained(
            "Qwen/Qwen2.5-3B-Instruct"
        )

        # Stage 2: Translation (use a model strong on Indonesian;
        # this ID is a placeholder -- swap in TranslateGemma or Aya-23)
        self.translator = AutoModelForCausalLM.from_pretrained(
            "google/gemma-2-27b-it"
        )

        # Stage 3b: Semantic similarity
        self.sim_model = AutoModel.from_pretrained(
            "Alibaba-NLP/gte-Qwen2-7B-instruct"
        )

        # Stage 3c: LLM judge (verify the exact HuggingFace ID)
        self.judge = AutoModelForCausalLM.from_pretrained(
            "aisingapore/Llama-SEA-LION-v3-70B-IT"
        )

    def stage1_detect_language(self, texts):
        """Detect whether the inputs are English."""
        # Implementation...
        pass

    def stage2_translate(self, texts):
        """Translate English texts to Indonesian."""
        # Implementation...
        pass

    def stage3_validate(self, original, translated):
        """Three-step validation:
        3a: output language detection
        3b: semantic similarity (threshold: 0.8)
        3c: LLM-as-judge (threshold: 3.5/5.0)
        """
        pass

4. TR-MTEB (Turkish): Hybrid Approach

TR-MTEB (EMNLP 2025)

"TR-MTEB: A Comprehensive Benchmark and Embedding Model Suite for Turkish Sentence Representations" (EMNLP 2025 Findings) - 2+ citations - Link: aclanthology.org/2025.findings-emnlp.471

4.1 Benchmark Composition

| Task | Datasets | Native | Translated | Source |
|---|---|---|---|---|
| Classification | 8 | 7 | 1 | News, sentiment, irony, offensive |
| Clustering | 2 | 2 | 0 | Academic abstracts, opinions |
| Pair Classification | 3 | 0 | 3 | MNLI-TR, SNLI-TR, XNLI-TR |
| Bitext Mining | 1 | 1 | 0 | WMT16 EN-TR |
| STS | 1 | 0 | 1 | STS-Benchmark-TR |
| Retrieval | 11 | 2 | 9 | SQuAD-TR, TQuAD, MS MARCO-TR |

4.2 LLM-as-a-Judge Calibration

TR-MTEB implemented a calibrated LLM-as-a-Judge pipeline:

Calibration Process:
┌─────────────────────────────────────────────────────────────┐
│ 1. Human Annotation                                         │
│    └─ 115 examples manually labeled (PASS/FAIL)             │
│                                                              │
│ 2. Prompt Iteration                                         │
│    └─ Refined evaluation prompt to align with humans        │
│                                                              │
│ 3. Final Performance                                        │
│    ├─ Agreement: 85.2%                                      │
│    ├─ Precision: 92.9%                                      │
│    ├─ Recall: 84.4%                                         │
│    └─ F1 Score: 88.4%                                       │
└─────────────────────────────────────────────────────────────┘

Confusion Matrix (Human vs LLM):
                Actual PASS    Actual FAIL
Predicted PASS      98              9
Predicted FAIL      8              0
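
Treating PASS as the positive class, the calibration metrics can be recomputed from such a matrix; a short sketch:

# Sketch: standard metrics from a PASS/FAIL confusion matrix
# (PASS = positive class).
def calibration_metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "agreement": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

m = calibration_metrics(tp=98, fp=9, fn=8, tn=0)
print(f"agreement = {m['agreement']:.1%}")  # 85.2%, matching TR-MTEB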

4.3 Training Corpus Construction

TR-MTEB created 34.2M Turkish sentence pairs:

| Source Type | Examples | Notes |
|---|---|---|
| Question-Answer | Medical QA, Wiki QA, GSM8K-TR | Domain-specific |
| Title-Content | News headlines, Wikipedia | Asymmetric pairs |
| Paraphrase | TaPaCo-TR, multilingual NLI | Symmetric pairs |
| Synthetic | LLM-generated instruction data | Quality filtered |

Filtering Pipeline:

Initial: 62.5M pairs
    ↓ Similarity filtering (custom model, fine-tuned e5-base)
    ↓ Threshold: 0.4 cosine similarity
Final: 34.2M high-quality pairs
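
A sketch of this filtering step using sentence-transformers; the model ID is a generic multilingual stand-in, not TR-MTEB's custom fine-tuned e5-base:

# Sketch: keep only pairs whose embeddings exceed the 0.4 cosine
# similarity threshold TR-MTEB reports. Model ID is a placeholder.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-base")

def filter_pairs(pairs, threshold=0.4):
    """pairs: list of (text_a, text_b) tuples."""
    a_emb = model.encode([a for a, _ in pairs], normalize_embeddings=True)
    b_emb = model.encode([b for _, b in pairs], normalize_embeddings=True)
    sims = (a_emb * b_emb).sum(axis=1)  # cosine sim of unit vectors
    return [pair for pair, sim in zip(pairs, sims) if sim >= threshold]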

4.4 Lessons for Indonesia-MTEB

| Insight | Application |
|---|---|
| Hybrid Approach | Combine native Indonesian + translated datasets |
| Training Corpus | 34.2M pairs sufficient for competitive models |
| Calibration | Always calibrate the LLM judge with 100+ human labels |
| Similarity Threshold | 0.4 effective for training-data filtering |
| Domain Coverage | Include medical, legal, news, conversational |

5. ArabicMTEB: Dialect-Aware Evaluation

ArabicMTEB Innovation (NAACL 2025)

"Swan and ArabicMTEB: Dialect-Aware, Arabic-Centric, Cross-Lingual, and Cross-Cultural Embedding Models and Benchmarks" (NAACL 2025) - 8+ citations - 94 datasets across multiple evaluation dimensions - Link: arxiv.org/abs/2411.01192

5.1 Multi-Dimensional Benchmark Structure

┌─────────────────────────────────────────────────────────────────────────────┐
│                       ArabicMTEB MULTI-DIMENSIONAL STRUCTURE                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  DIMENSION 1: MAIN ARABICMTEB (94 datasets)                                 │
│  ├─ Retrieval: 35 datasets                                                  │
│  ├─ Bitext Mining: 12 datasets                                              │
│  ├─ Cross-Lingual Retrieval: 11 language pairs                              │
│  ├─ Re-Ranking: 5 datasets                                                   │
│  ├─ STS: 5 datasets (2 synthetic via GPT-4)                                │
│  ├─ Classification: 18 datasets                                              │
│  ├─ Pair Classification: 3 datasets                                           │
│  └─ Clustering: 4 datasets                                                   │
│                                                                              │
│  DIMENSION 2: DIALECTAL FORK (19 datasets)                                   │
│  ├─ Bitext Mining: 8 dialect datasets                                       │
│  ├─ Retrieval: 5 dialect datasets                                            │
│  ├─ Classification: 5 dialect ID datasets                                    │
│  └─ STS: 1 Egyptian dialect synthetic dataset                               │
│                                                                              │
│  DIMENSION 3: DOMAIN-SPECIFIC FORK (ArabicMTEB Lite)                         │
│  ├─ 10k queries, 100k documents                                              │
│  ├─ Domains: News, Finance, Legal, Medical, Wikipedia                        │
│  └─ Generated via GPT-4o-mini from Wikipedia chunks                          │
│                                                                              │
│  DIMENSION 4: CULTURAL FORK (Country-level)                                   │
│  ├─ 20 Arab countries                                                         │
│  ├─ 1k queries, ~15k documents per country                                    │
│  ├─ Source: Country-specific Wikipedia portals                               │
│  └─ Generated via GPT-4o-mini for cultural queries                            │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

5.2 Novel Evaluation Dimensions

| Dimension | Description | Indonesian Parallel |
|---|---|---|
| Dialectal | Gulf, Egyptian, Moroccan, Levantine varieties | Regional Indonesian (Javanese-influenced, Sundanese-influenced) |
| Cross-Lingual | 11 language pairs | EN-ID, ID-JV, ID-SU |
| Domain-Specific | News, Finance, Legal, Medical | Same domains for Indonesia |
| Cultural | Country-specific cultural knowledge | Provincial cultural knowledge |

5.3 Synthetic Data Generation

ArabicMTEB uses Command R+ for synthetic data:

Synthetic Data Pipeline:
┌─────────────────────────────────────────────────────────────┐
│ 1. MSA Data                                                 │
│    ├─ 100k general domain samples                          │
│    └─ 20k domain-specific samples                          │
│                                                              │
│ 2. Dialectal Data                                            │
│    ├─ 15k Egyptian dialect samples                          │
│    └─ 15k Moroccan dialect samples                          │
│                                                              │
│ 3. Domain Queries                                            │
│    └─ 5 query styles per document chunk                     │
│                                                              │
│ 4. Cultural Queries                                          │
│    └─ Country-specific Wikipedia passages                  │
└─────────────────────────────────────────────────────────────┘

Performance Impact:
Swan-Small:  32.46 → 48.42 (+16 points with MSA synthetic)
Swan-Large:  55.39 → 61.91 (+6.5 points with MSA synthetic)
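
The "query styles per chunk" idea transfers directly to Indonesian; a hedged sketch in which `generate` stands in for whichever LLM client is used (ArabicMTEB used Command R+ and GPT-4o-mini) and the style list and prompt wording are illustrative, not from the paper:

# Sketch: build N query-generation prompts per document chunk,
# following ArabicMTEB's 5-styles-per-chunk recipe.
QUERY_STYLES = [
    "a short keyword query",
    "a natural-language question",
    "a paraphrase of the passage's main claim",
    "a question a domain expert would ask",
    "a question a layperson would ask",
]

def build_prompts(chunk, language="Indonesian"):
    return [
        f"Read the following {language} passage and write {style} "
        f"in {language} that the passage answers.\n\nPassage:\n{chunk}"
        for style in QUERY_STYLES
    ]

# queries = [generate(p) for p in build_prompts(wiki_chunk)]  # any LLM client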

5.4 Lessons for Indonesia-MTEB

| ArabicMTEB Feature | Indonesia-MTEB Adaptation |
|---|---|
| Dialectal Fork | Regional language influence (Javanese, Sundanese, Minangkabau) |
| Domain-Specific Fork | Legal Indonesian (UU docs), Medical, Financial |
| Cultural Evaluation | Provincial cultural knowledge (34 provinces) |
| Synthetic Data | LLM-generated data for missing tasks |

6. SEA-BED: Human-Centric Regional Benchmark

SEA-BED (2025)

"SEA-BED: Southeast Asia Embedding Benchmark" (arXiv 2025) - 169 datasets across 9 tasks - 10 SEA languages including Indonesian - 71% human-formulated datasets - Link: arxiv.org/abs/2508.12243

6.1 Key Characteristics

| Aspect | SEA-BED Approach | Relevance to Indonesia |
|---|---|---|
| Scale | 169 datasets, 9 tasks, 10 languages | Indonesian included |
| Human-Formulated | 71% vs 29% translated/machine-generated | Quality-first approach |
| Tasks | Classification, Clustering, Pair Classification, Retrieval, Reranking, STS, Summarization, Instruction Following, Bitext Mining | Comprehensive task coverage |
| Indonesian Coverage | Included but not the focus | Can be expanded |

6.2 Data Sources

┌─────────────────────────────────────────────────────────────────┐
│                    SEA-BED DATA SOURCES                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                   │
│  Human-Formulated (71%)                                          │
│  ├─ Native datasets from each SEA country                       │
│  │  └─ Indonesian: IndoNLU, NusaX, IndoMMLU, etc.              │
│  ├─ Academic benchmarks                                         │
│  └─ Domain-specific corpora                                      │
│                                                                   │
│  Translation-Based (29%)                                         │
│  ├─ Carefully translated MTEB subsets                            │
│  └─ Quality validation integrated                                │
│                                                                   │
│  Validation Strategy                                             │
│  ├─ Native speaker review for key datasets                      │
│  ├─ Statistical consistency checks                              │
│  └─ Inter-annotator agreement tracking                          │
│                                                                   │
└─────────────────────────────────────────────────────────────────┘

6.3 Lessons for Indonesia-MTEB

| Lesson | Application |
|---|---|
| Human-First Priority | 71% human-formulated validates quality-over-quantity |
| Indonesia Opportunity | SEA-BED's Indonesian datasets can be aggregated and expanded |
| Regional Integration | Consider Indonesia-MTEB compatibility with SEA evaluation |
| Task Coverage | Include Instruction Following (SEA-BED has this) |

7. AfriMTEB: Cross-Lingual Contrastive Distillation

AfriMTEB (2024)

"AfriMTEB and AfriE5: Benchmarking and Adapting Text Embedding Models for African Languages" (arXiv 2024) - 59 African languages - 38 datasets from MMTEB - Cross-lingual contrastive distillation - Link: arxiv.org/abs/2510.23896

7.1 Methodology

┌─────────────────────────────────────────────────────────────────┐
│                    AfriMTEB Approach                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                   │
│  Cross-Lingual Distillation                                      │
│  ├─ Train on 9 well-resourced languages                        │
│  ├─ Transfer to 59 languages via alignment                     │
│  └─ Use NLI/SNLI multilingual data                             │
│                                                                   │
│  Quality Estimation                                              │
│  ├─ SSA-COMET-MTL quality estimation model                     │
│  ├─ Threshold 0.75 retains ~60K from 430K samples              │
│  └─ Filter low-quality translations                           │
│                                                                   │
│  Benchmark Composition                                           │
│  ├─ 38 datasets from MMTEB                                     │
│  ├─ Focus on African language tasks                            │
│  └─ Cross-lingual retrieval emphasis                           │
│                                                                   │
└─────────────────────────────────────────────────────────────────┘
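
A sketch of this quality-estimation filtering with the open unbabel-comet package; the reference-free wmt22-cometkiwi-da checkpoint is a stand-in for AfriMTEB's SSA-COMET-MTL model:

# Sketch: quality-estimation filtering at AfriMTEB's 0.75 threshold.
# wmt22-cometkiwi-da is a stand-in for SSA-COMET-MTL.
from comet import download_model, load_from_checkpoint

model = load_from_checkpoint(download_model("Unbabel/wmt22-cometkiwi-da"))

def qe_filter(src_texts, mt_texts, threshold=0.75):
    data = [{"src": s, "mt": t} for s, t in zip(src_texts, mt_texts)]
    scores = model.predict(data, batch_size=32, gpus=1).scores
    return [d for d, score in zip(data, scores) if score >= threshold]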

7.2 Lessons for Indonesia-MTEB

| Insight | Application |
|---|---|
| Cross-Lingual Transfer | EN-ID alignment is straightforward (both well-resourced) |
| Quality Estimation | COMET-style filtering is effective |
| Resource Efficiency | Indonesian has more resources than African languages; full translation is viable |

8. European MTEBs (PL, FR, DE)

8.1 PL-MTEB (Polish)

PL-MTEB (2024)

"PL-MTEB: Polish Massive Text Embedding Benchmark" (arXiv 2024) - 4+ citations - 29 datasets, 5 task groups - BEIR translation + native aggregation

| Metric | Value |
|---|---|
| Datasets | 29 (28 tasks) |
| Task Groups | Classification, Clustering, Pair Classification, Retrieval, STS |
| Approach | BEIR translation + native Polish datasets |

8.2 MTEB-French

MTEB-French (2024)

"MTEB-French: Resources for French Sentence Embedding" (arXiv 2024) - 17+ citations - 30+ datasets, 8 tasks - 22 existing + 3 new datasets created

| Feature | Description |
|---|---|
| Model Evaluation | 51 embedding models compared |
| Statistical Tests | Comprehensive statistical analysis |
| Correlation Study | Model-benchmark correlation analyzed |

8.3 DE-MTEB (German Clustering)

German Text Clustering Benchmark

"German Text Embedding Clustering Benchmark" (arXiv 2024) - Specialized in clustering evaluation - Focus on different domains

8.4 European MTEB Lessons

| Benchmark | Key Lesson | Indonesia Application |
|---|---|---|
| PL-MTEB | BEIR translation effective | Consider BEIR-ID translation |
| MTEB-French | Statistical analysis crucial | Include statistical validation |
| DE-MTEB | Task specialization valuable | Consider specialized forks |

9. Comparative Analysis Matrix

9.1 Methodology Comparison

| Benchmark | Translation Approach | Validation | Native Data | Training Corpus | Scale | Citations |
|---|---|---|---|---|---|---|
| C-MTEB | Minimal (native-first) | Peer-reviewed | Yes | 100M+ pairs | 35 | 1,171+ |
| VN-MTEB | Full MTEB translation | 3-stage LLM judge | No | N/A | 41 | New |
| TR-MTEB | BEIR + native | Calibrated LLM judge | Yes | 34.2M pairs | 26 | 2+ |
| ArabicMTEB | Selective + synthetic | Multi-dimensional | Yes | 122K + 135K | 94 | 8+ |
| SEA-BED | 29% translated | Human review | 71% | N/A | 169 | 1+ |
| AfriMTEB | Cross-lingual | COMET quality | Limited | Cross-lingual | 38 | New |
| MTEB-French | Aggregated | Statistical | Yes | N/A | 30+ | 17+ |
| PL-MTEB | BEIR translation | Standard | Yes | N/A | 29 | 4+ |

9.2 Kept Ratio Comparison (Translation Quality)

Translation Kept Ratios by Task:

ArabicMTEB:    ████████████████████████ ~75% average
TR-MTEB:       ███████████████████████  ~70% average
VN-MTEB:       ██████████████████████   ~65% average
Estimated ID:  ████████████████████████ ~70-75% average

By Task (VN-MTEB data):
Classification: ████████████████████████ 70.11%
Clustering:      █████████████████████████ 71.98%
Pair Class:      ██████████████████████   67.2%
Retrieval:       █████████████████████    66.03%
Reranking:       ████████████████████     65.2%
STS:             ████████████████         53.4% ⚠️

9.3 Resource Comparison

| Benchmark | GPUs | Time | Total Compute | Tokens/Sec |
|---|---|---|---|---|
| VN-MTEB | 4×H100 | 28 days | ~2,700 GPU-hours | 3,800 |
| TR-MTEB | 1×A100 | 82 hours | ~82 GPU-hours | N/A |
| Estimated ID | 4×H100 | 20-25 days | ~2,000-2,400 GPU-hours | 4,000+ |

9.4 Task Coverage Comparison

MTEB Category Coverage by Benchmark:

Classification:        ████████████████████████████████ All 8
Pair Classification:   ████████████████████████████████ All 8
Retrieval:             ████████████████████████████████ All 8
Clustering:            ████████████████████████ 6/8
Reranking:             ████████████████████ 5/8
STS:                   ████████████████████ 5/8
Bitext Mining:         ████████████ 3/8
Summarization:         ████████ 2/8
Instruction Following: ████████ 2/8

10. Best Practices Extraction

10.1 Translation Pipeline Best Practices

┌─────────────────────────────────────────────────────────────────────────────┐
│              RECOMMENDED TRANSLATION PIPELINE (Enhanced)                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  1. MODEL SELECTION                                                          │
│     ├─ Use regional leaderboard for selection (e.g., SEA-HELM for SEA)    │
│     ├─ Prefer models with strong target language performance               │
│     ├─ For Indonesian: TranslateGemma (27B), Aya-23 (35B/8B)             │
│     └─ Consider cost/quality tradeoff                                      │
│                                                                              │
│  2. QUALITY CONTROL (Multi-Stage)                                          │
│     ├─ Stage 1: Language detection (LLM, not FastText)                     │
│     │  └─ Model: Qwen2.5-3B-Instruct or similar                          │
│     ├─ Stage 2: Semantic similarity (threshold 0.75-0.80)                   │
│     │  └─ Model: gte-Qwen2-7B-instruct or similar                        │
│     ├─ Stage 3: LLM-as-judge (CoT prompting, 5 criteria)                 │
│     │  └─ Model: 70B+ parameter model with CoT capability                │
│     │  └─ Criteria: Grammar, NER, Numbers, Fluency, Meaning             │
│     └─ Stage 4: Human validation on 10% sample                             │
│                                                                              │
│  3. TASK-SPECIFIC EXPECTATIONS                                              │
│     ├─ Classification: ~70-75% kept ratio                                   │
│     ├─ Clustering: ~70-75% kept ratio                                       │
│     ├─ Pair Classification: ~65-70% kept ratio                              │
│     ├─ Retrieval: ~65-70% kept ratio                                        │
│     ├─ Reranking: ~65-70% kept ratio                                        │
│     └─ STS: ~50-60% kept ratio (plan for low retention)                     │
│                                                                              │
│  4. STATISTICAL VALIDATION                                                  │
│     ├─ Word length distribution analysis (target: r > 0.85)                 │
│     ├─ Kept ratio tracking by task and domain                               │
│     ├─ Semantic similarity distribution analysis                             │
│     └─ Inter-annotator agreement (target: >0.8)                             │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
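
The automated stages can be collapsed into a single accept/reject gate; a minimal sketch in which the three stage functions are assumed wrappers around the models recommended above (stage 4, human review of a 10% sample, happens outside this gate):

# Sketch: single gate over the three automated QC stages.
# detect_lang, semantic_sim, and judge_score are assumed wrappers.
def passes_qc(src, translation, detect_lang, semantic_sim, judge_score,
              sim_threshold=0.80, judge_threshold=3.5):
    if detect_lang(translation) != "id":                 # Stage 1: output language
        return False
    if semantic_sim(src, translation) < sim_threshold:   # Stage 2: similarity
        return False
    return judge_score(src, translation) >= judge_threshold  # Stage 3: LLM judge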

10.2 Validation Best Practices Summary

| Practice | Description | Source |
|---|---|---|
| Calibrate LLM Judge | 100+ human-labeled examples for prompt tuning | TR-MTEB |
| Multi-Criteria Scoring | Grammar, NER, fluency, meaning preservation | VN-MTEB |
| Chain-of-Thought | CoT prompting for improved LLM judgment | VN-MTEB |
| Semantic Similarity | Threshold 0.75-0.80 for filtering | VN-MTEB |
| Statistical Analysis | Word-length distribution correlation | VN-MTEB |
| Domain-Specific Evaluation | Separate forks for different domains | ArabicMTEB |
| Cultural Awareness | Cultural knowledge evaluation | ArabicMTEB |
| Human Validation | 10% sample human review | All benchmarks |

10.3 Dataset Construction Best Practices

| Practice | Description | Source |
|---|---|---|
| Hybrid Approach | Native datasets + high-quality translations | TR-MTEB, ArabicMTEB |
| Domain Diversity | News, finance, legal, medical, conversational | All |
| Pair Type Balance | Symmetric (paraphrase) + asymmetric (query-doc) | C-MTEB |
| Deduplication | Semantic deduplication (PolyDeDupe-style) | C-MTEB |
| Quality Filtering | Similarity threshold for training corpora | TR-MTEB |
| Synthetic Data | LLM-generated domain-specific data | ArabicMTEB |
| License Compliance | Track and respect dataset licenses | All |

10.4 Novel Innovations by Benchmark

| Benchmark | Innovation | Indonesia-MTEB Potential |
|---|---|---|
| VN-MTEB | 3-stage LLM-based translation QC | Adapt for EN-ID translation |
| ArabicMTEB | Dialectal evaluation | Regional Indonesian varieties |
| ArabicMTEB | Cultural fork | Provincial cultural knowledge |
| TR-MTEB | Calibrated LLM judge | Apply to Indonesian validation |
| C-MTEB | Large training corpus | Create ID-Pack (50M+ pairs) |
| SEA-BED | Human-first ratio (71%) | Prioritize native Indonesian data |
| KorFinMTEB | Domain-specific (financial) | Create Indonesian financial fork |
| AfriMTEB | Cross-lingual distillation | EN-ID cross-lingual alignment |

11. Recommended Methodology for Indonesia-MTEB

11.1 Three-Phase Approach

┌─────────────────────────────────────────────────────────────────────────────┐
│                  INDONESIA-MTEB METHODOLOGY (ENHANCED)                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  PHASE 1: AGGREGATION (Document 03 ✅ Complete)                             │
│  ├─ 70+ existing Indonesian datasets catalogued                             │
│  ├─ Coverage: Classification ✓, Pair Classification ✓, Retrieval ✓         │
│  ├─ Gaps: Clustering ✗, Reranking ✗, STS limited                          │
│  └─ Native dataset inventory: Complete                                     │
│                                                                              │
│  PHASE 2: TRANSLATION (Document 05)                                         │
│  ├─ Model: TranslateGemma-2-27B or Aya-23-35B                             │
│  ├─ Pipeline: 3-stage (Detection → Translation → QC)                       │
│  │  └─ Stage 1: Qwen2.5-3B-Instruct (language detection)                 │
│  │  └─ Stage 2: TranslateGemma/Aya-23 (translation)                      │
│  │  └─ Stage 3: LLM-as-judge + semantic similarity                        │
│  ├─ Target Datasets: Clustering, Reranking, STS gaps                      │
│  ├─ Expected Kept Ratio: 70-75% (higher than VN-MTEB)                     │
│  └─ Estimated Time: 4×H100 × 20-25 days                                    │
│                                                                              │
│  PHASE 3: AI GENERATION (Document 06)                                       │
│  ├─ Target: Domain-specific tasks, cultural queries                        │
│  ├─ Method: LLM-as-generator + LLM-as-judge                                │
│  ├─ Validation: Statistical consistency + human spot-check                 │
│  └─ Cultural Fork: Wikipedia Indonesia for 34 provinces                   │
│                                                                              │
│  ADDITIONAL DIMENSIONS (Novelty)                                            │
│  ├─ Archipelago-Aware: Regional language influence evaluation              │
│  ├─ Formal-Register: Informal (slang) → Formal → Academic continuum        │
│  ├─ Code-Mixing: Indonesian-English code-mixing evaluation                 │
│  └─ Domain-Specific: Legal (UU), Medical, Financial forks                  │
│                                                                              │
│  INTEGRATION                                                                 │
│  ├─ MTEB-compatible format                                                 │
│  ├─ Metadata documentation                                                  │
│  └─ HuggingFace upload with proper licensing                               │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

11.2 Resource Estimation

| Task | GPUs | Time | GPU-Hours | Rationale |
|---|---|---|---|---|
| Full MTEB Translation | 4×H100 | 20-25 days | 2,000-2,400 | EN-ID closer than EN-VN |
| AI Dataset Generation | 2×H100 | 5-7 days | 240-336 | Clustering + Reranking |
| Validation | 1×H100 | 3-5 days | 72-120 | LLM-as-judge evaluation |
| Total | - | - | ~2,500-3,000 | Conservative estimate |

11.3 Translation Model Selection for Indonesian

| Model | Parameters | ID Performance | Cost Efficiency | Recommendation |
|---|---|---|---|---|
| TranslateGemma-2-27B | 27B | Excellent (55 langs) | Medium | Primary |
| Aya-23-35B | 35B | Excellent (SEA focus) | Low | Alternative |
| Aya-23-8B | 8B | Very good | High | Cost-efficient |
| NLLB-200 | 3.3B | Good | Very High | Smaller option |
| SEA-LION-v3 | - | N/A | N/A | Judge model only |

11.4 Quality Validation Thresholds

| Metric | Threshold | Justification |
|---|---|---|
| Semantic Similarity | ≥0.80 | VN-MTEB used 0.8 |
| LLM Judge Score | ≥3.5/5.0 | Calibrated threshold |
| Kept Ratio Target | 65-75% | Varies by task type |
| Word Length Correlation | r ≥ 0.85 | Statistical quality check |
| Human Validation | 10% sample | Final quality check |

11.5 Novel Dimensions for Indonesia-MTEB

Based on regional MTEB gaps:

┌─────────────────────────────────────────────────────────────────┐
│          NOVEL DIMENSIONS FOR INDONESIA-MTEB                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                   │
│  1. ARCHIPELAGO-AWARE EVALUATION                                   │
│     ├─ Javanese-influenced Indonesian                             │
│     ├─ Sundanese-influenced Indonesian                            │
│     ├─ Minangkabau-influenced Indonesian                         │
│     └─ Other regional varieties                                   │
│                                                                   │
│  2. FORMAL-REGISTER CONTINUUM                                     │
│     ├─ Informal/Slang (social media, Kaskus)                     │
│     ├─ Semi-formal (news articles)                               │
│     ├─ Formal (academic papers, legal documents)                 │
│     └─ Administrative (government regulations)                    │
│                                                                   │
│  3. CODE-MIXING EVALUATION                                        │
│     ├─ Indonesian-English code-mixing                            │
│     ├─ Prevalent in urban social media                           │
│     └─ Real-world use case evaluation                            │
│                                                                   │
│  4. CULTURAL KNOWLEDGE (34 Provinces)                             │
│     ├─ Province-specific cultural queries                        │
│     ├─ Source: Wikipedia Indonesia + provincial portals         │
│     └─ Generated via LLM with human validation                   │
│                                                                   │
│  5. DOMAIN-SPECIFIC FORKS                                         │
│     ├─ Legal Indonesian (UU documents, court decisions)          │
│     ├─ Medical Indonesian                                         │
│     ├─ Financial Indonesian                                       │
│     └─ Religious Indonesian (Islamic contexts)                    │
│                                                                   │
└─────────────────────────────────────────────────────────────────┘

12. MTEB Integration Strategy

12.1 Adding a Benchmark to MTEB Official

┌─────────────────────────────────────────────────────────────────────────────┐
│              MTEB OFFICIAL INTEGRATION PROCESS (Updated)                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  1. DATASET FORMAT REQUIREMENTS                                              │
│     ├─ Implement mteb.AbsTask subclass                                     │
│     ├─ Load data with .load_data() method                                  │
│     ├─ Define metadata (name, description, license, eval_langs)            │
│     ├─ Implement evaluation for your task type                             │
│     └─ Follow MTEB dataset card format                                     │
│                                                                              │
│  2. SUBMISSION CHECKLIST                                                     │
│     ├─ Fork: github.com/embeddings-benchmark/mteb                         │
│     ├─ Add: mteb/datasets/your_dataset/                                     │
│     ├─ Register: mteb/datasets/__init__.py                                 │
│     ├─ Test: CI/CD must pass                                               │
│     ├─ PR: Create with detailed description                                │
│     └─ Address: Reviewer feedback                                          │
│                                                                              │
│  3. HUGGINGFACE UPLOAD                                                        │
│     ├─ Upload to: huggingface.co/datasets/                                 │
│     ├─ Use MTEB dataset card format                                         │
│     ├─ Include: License, size, task metadata                               │
│     └─ Link: Original sources                                               │
│                                                                              │
│  4. LEADERBOARD SUBMISSION                                                    │
│     ├─ Run: Evaluation on baseline models                                   │
│     ├─ Submit: mteb/results dataset                                         │
│     ├─ Create: Benchmark discussion on leaderboard                         │
│     └─ Request: Leaderboard integration                                     │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

12.2 Implementation Example

# Indonesia-MTEB dataset implementation example
# (import paths follow the mteb v1.x layout; check the current docs)
from mteb.abstasks import AbsTaskClassification
from mteb.abstasks.TaskMetadata import TaskMetadata

class IndonesianSentiment(AbsTaskClassification):
    """Indonesian sentiment analysis task for MTEB."""

    metadata = TaskMetadata(
        name="IndonesianSentiment",
        description="Indonesian sentiment analysis from social media",
        dataset={
            "path": "indonlp/indonlu",
            "name": "smsa",
            "revision": "main"
        },
        type="Classification",
        category="s2s",
        eval_splits=["test"],
        eval_langs=["ind-Latn"],  # Indonesian, Latin script
        main_score="accuracy",
        date=None,
        form=None,
        domains=["Social", "Written"],
        task_subtypes=["Sentiment"],
        license="CC-BY-SA-4.0",
        annotations_creators="human-verified",
        dialect=[],
        sample_creation="found",
        bibtex_citation="""
        @article{wilie2020indonlu,
          title={IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding},
          author={Wilie, Bryan and Vincentio, Karissa and others},
          journal={arXiv preprint arXiv:2009.05387},
          year={2020}
        }
        """
    )

    def load_data(self, **kwargs):
        """Load Indonesian sentiment data."""
        from datasets import load_dataset
        self.dataset = load_dataset("indonlp/indonlu", "smsa")
        self.data_loaded = True
12.3 MTEB Resources

| Resource | URL |
|---|---|
| GitHub Repository | github.com/embeddings-benchmark/mteb |
| Leaderboard | huggingface.co/spaces/mteb/leaderboard |
| Results Dataset | huggingface.co/datasets/mteb/results |
| Adding Datasets | github.com/embeddings-benchmark/mteb/blob/main/docs/adding_a_dataset.md |
| Adding Models | github.com/embeddings-benchmark/mteb/blob/main/docs/adding_a_model.md |

13. Key Takeaways for Indonesia-MTEB

13.1 Methodology Recommendations

| Priority | Recommendation | Rationale |
|---|---|---|
| 1 | Adopt VN-MTEB's 3-stage translation pipeline | Proven automated QC |
| 2 | Use TranslateGemma or Aya-23 for translation | Strong ID support |
| 3 | Calibrate LLM judge with 100+ human samples | TR-MTEB: 88.4% F1 |
| 4 | Create ID-specific training corpus (ID-Pack) | C-MTEB approach |
| 5 | Add domain-specific + cultural forks | ArabicMTEB innovation |
| 6 | Target 70-75% kept ratio | Higher than VN-MTEB |

13.2 Novelty Opportunities

Based on regional MTEB analysis:

  1. Archipelago-Aware Evaluation: Regional language influence on Indonesian
  2. Formal-Register Continuum: Informal → Formal → Academic Indonesian
  3. Code-Mixing Evaluation: Indonesian-English code-mixing (social media)
  4. Cultural Knowledge: 34 provincial cultural queries
  5. Domain-Specific Forks: Legal, Medical, Financial Indonesian

13.3 Success Criteria Alignment

| Criterion | Target | Benchmark Reference |
|---|---|---|
| Task Coverage | All 8 MTEB categories | VN-MTEB: 6, ArabicMTEB: 8 |
| Dataset Count | 60-100 datasets | ArabicMTEB: 94, SEA-BED: 169 |
| Quality | ≥70% kept ratio, 10% human validation | VN-MTEB: 65% avg |
| Publication | ACL/EMNLP/NAACL dataset paper | C-MTEB: SIGIR, TR-MTEB: EMNLP |
| Adoption | MTEB leaderboard integration | All regional MTEBs |

14. References

Regional MTEB Papers (2024-2025)

  1. C-MTEB: Xiao et al. (2024). "C-Pack: Packaged Resources To Advance General Chinese Embeddings." SIGIR 2024. arxiv.org/abs/2309.07597 - 1,171+ citations

  2. ArabicMTEB: Bhatia et al. (2025). "Swan and ArabicMTEB: Dialect-Aware, Arabic-Centric, Cross-Lingual, and Cross-Cultural Embedding Models and Benchmarks." NAACL 2025. arxiv.org/abs/2411.01192 - 8+ citations

  3. MTEB-French: Ciancone et al. (2024). "MTEB-French: Resources for French Sentence Embedding." arXiv:2405.20468. arxiv.org/abs/2405.20468 - 17+ citations

  4. VN-MTEB: Pham et al. (2025). "VN-MTEB: Vietnamese Massive Text Embedding Benchmark." arXiv:2507.21500. arxiv.org/abs/2507.21500

  5. TR-MTEB: Baysan & Güngör (2025). "TR-MTEB: A Comprehensive Benchmark and Embedding Model Suite for Turkish Sentence Representations." EMNLP 2025 Findings. aclanthology.org/2025.findings-emnlp.471 - 2+ citations

  6. SEA-BED: Ponwitayarat et al. (2025). "SEA-BED: Southeast Asia Embedding Benchmark." arXiv:2508.12243. arxiv.org/abs/2508.12243 - 1+ citation

  7. AfriMTEB: Uemura et al. (2024). "AfriMTEB and AfriE5: Benchmarking and Adapting Text Embedding Models for African Languages." arXiv:2510.23896. arxiv.org/abs/2510.23896

  8. PL-MTEB: Poświata et al. (2024). "PL-MTEB: Polish Massive Text Embedding Benchmark." arXiv:2405.10138. arxiv.org/abs/2405.10138 - 4+ citations

  9. KorFinMTEB: Hwang et al. (2025). "What Advantages Can Low-Resource Domain-Specific Instruction Tuning Bring to Large Language Models? A Case Study on Korean Financial Texts." arXiv:2502.07131. arxiv.org/abs/2502.07131 - 4+ citations

Original MTEB

  1. Muennighoff et al. (2023). "MTEB: Massive Text Embedding Benchmark." EACL 2023. arxiv.org/abs/2210.07316 - 1,488+ citations

Translation Models

  1. Google (2024). "TranslateGemma: A new suite of open translation models." blog.google/technology/ai/translategemma/

  2. Cohere For AI (2024). "Aya 23: Open weight releases to further multilingual progress." arXiv:2405.15032. arxiv.org/abs/2405.15032

MTEB Resources

  1. MTEB GitHub: github.com/embeddings-benchmark/mteb
  2. MTEB Leaderboard: huggingface.co/spaces/mteb/leaderboard
  3. MTEB Datasets: huggingface.co/mteb

15. Document Roadmap

| Document | Content | Status |
|---|---|---|
| 01 | Project Overview | ✅ Enhanced |
| 02 | MTEB Structure Analysis | ✅ Enhanced |
| 03 | Existing Indonesian Datasets | ✅ Enhanced |
| 04 | Regional MTEB Methodologies | ✅ Enhanced |
| 05 | Translation Models Benchmark | 🔲 Next |
| 06 | AI Dataset Generation Methods | Pending |
| 07 | Validation Strategies | Pending |
| 08 | ACL Dataset Paper Standards | Pending |
| 09 | Novelty Angle & Publication | Pending |
| 10 | Implementation Roadmap | Pending |

"The most successful regional MTEBs combine three elements: rigorous quality control, linguistic/cultural awareness, and comprehensive task coverage. Indonesia-MTEB will synthesize these approaches while introducing archipelago-aware and formal-register evaluation dimensions unique to the Indonesian context."