
Project: Indonesia-MTEB Benchmark
Document: 02 - MTEB Structure & Task Categories Analysis
Version: 2.0 (Enhanced Edition)
Last Updated: 2026-01-25
Status: Research Phase - Foundation Planning


[!NOTE]

Document Navigation

This is the second of twelve documents comprising the Indonesia-MTEB Benchmark research foundation.

| Document | Title | Focus Area |
|----------|-------|------------|
| 01 | Project Overview & Scope | Foundation document |
| 02 | MTEB Structure Analysis | Current Document |
| 03 | Existing Indonesian Datasets | Data aggregation sources |
| 04 | Regional MTEB Methodologies | Precedent analysis |
| 05 | Translation Models Benchmark | Model selection & evaluation |
| 06 | AI Dataset Generation Methods | Novel data creation |
| 07 | Validation Strategies | Quality assurance protocols |
| 08 | ACL Dataset Paper Standards | Publication requirements |
| 09 | Novelty Angle & Publication | Research contribution |
| 10 | Implementation Roadmap | Technical execution plan |
| 11 | Python Package Development | Software architecture |
| 12 | Summary & Quick Reference | Consolidated reference |

MTEB Structure & Task Categories Analysis

"Understanding MTEB's internal architecture, evaluation protocols, and dataset formats is the foundation for building Indonesia-MTEB. This document provides a comprehensive technical deep-dive into the Massive Text Embedding Benchmark framework."


Table of Contents

  1. MTEB Architecture Overview
  2. The 8 Core Task Categories
  3. Detailed Task Analysis
  4. Evaluation Metrics Deep-Dive
  5. MTEB Dataset Format Standards
  6. MTEB v2 & MMTEB Updates
  7. Implementation Guide
  8. Indonesia-MTEB Task Mapping
  9. Technical Considerations
  10. References

1. MTEB Architecture Overview

1.1 Framework Specification

MTEB (Massive Text Embedding Benchmark) is a Python-based evaluation framework that provides standardized protocols for assessing text embedding models across diverse NLP tasks.

┌─────────────────────────────────────────────────────────────────────────────┐
│                    MTEB FRAMEWORK ARCHITECTURE                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ╔═════════════════════════════════════════════════════════════════════════╗│
│  ║                         INPUT LAYER                                       ║│
│  ║  ┌────────────────────────────────────────────────────────────────────┐  ║│
│  ║  │  Model Input: Text / Sentence Pair / Query-Document Pair           │  ║│
│  ║  │  Format: Raw text strings                                         │  ║│
│  ║  └────────────────────────────────────────────────────────────────────┘  ║│
│  ╚═════════════════════════════════════════════════════════════════════════╝│
│                                    │                                         │
│                                    ▼                                         │
│  ╔═════════════════════════════════════════════════════════════════════════╗│
│  ║                      ENCODING LAYER                                       ║│
│  ║  ┌────────────────────────────────────────────────────────────────────┐  ║│
│  ║  │  Embedding Model Interface                                         │  ║│
│  ║  │  - SentenceTransformer                                            │  ║│
│  ║  │  - Custom encoder with .encode() method                            │  ║│
│  ║  │  - Output: Dense vectors (typically 384-4096 dimensions)           │  ║│
│  ║  └────────────────────────────────────────────────────────────────────┘  ║│
│  ╚═════════════════════════════════════════════════════════════════════════╝│
│                                    │                                         │
│                                    ▼                                         │
│  ╔═════════════════════════════════════════════════════════════════════════╗│
│  ║                    TASK-SPECIFIC EVALUATORS                             ║│
│  ║  ┌────────────────────────────────────────────────────────────────────┐  ║│
│  ║  │  │  CLASSIFICATION  │  CLUSTERING  │  RETRIEVAL  │  STS         │  ║│
│  ║  │  ├──────────────────┼──────────────┼─────────────┼──────────────┤  ║│
│  ║  │  │  PAIR CLASS.     │  RERANKING   │  INSTRUCT.  │  SUMMARIZATION║  ║│
│  ║  │  └──────────────────┴──────────────┴─────────────┴──────────────┘  ║│
│  ║  │                                                                      │  ║│
│  ║  │  Each evaluator:                                                     │  ║│
│  ║  │  1. Loads dataset (train/validation/test splits)                     │  ║│
│  ║  │  2. Encodes text using model                                         │  ║│
│  ║  │  3. Computes task-specific metrics                                   │  ║│
│  ║  │  4. Returns structured results                                      │  ║│
│  ║  └────────────────────────────────────────────────────────────────────┘  ║│
│  ╚═════════════════════════════════════════════════════════════════════════╝│
│                                    │                                         │
│                                    ▼                                         │
│  ╔═════════════════════════════════════════════════════════════════════════╗│
│  ║                       AGGREGATION LAYER                                   ║│
│  ║  ┌────────────────────────────────────────────────────────────────────┐  ║│
│  ║  │  Results Collection & Aggregation                                   │  ║│
│  ║  │  - Per-dataset scores                                               │  ║│
│  ║  │  - Per-task averages                                                │  ║│
│  ║  │  - Overall benchmark score                                           │  ║│
│  ║  │  - Leaderboard formatting                                           │  ║│
│  ║  └────────────────────────────────────────────────────────────────────┘  ║│
│  ╚═════════════════════════════════════════════════════════════════════════╝│
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

1.2 MTEB Evolution Timeline

| Version | Code Name | Datasets | Languages | Tasks | Year | Key Innovation |
|---------|-----------|----------|-----------|-------|------|----------------|
| MTEB v1 | Original | 58 | 112 | 8 | 2022 (Oct) | Initial unification |
| MTEB v1.1 | EACL | 58 | 112 | 8 | 2023 (Apr) | EACL publication |
| MTEB v2 | Multilingual | 284 | 112 | 8 | 2024 | Expanded multilingual |
| MMTEB | ICLR 2025 | 500+ | 1000+ | 9+ | 2025 | Instruction following |
| Current | Production | 1,308+ | 1000+ | 8+ | 2026 | Community-driven |

Sources:

- MTEB Original: arxiv.org/abs/2210.07316
- MMTEB: arxiv.org/abs/2502.13595
- MTEB v2: huggingface.co/blog/isaacchung/mteb-v2

1.3 Repository Structure

mteb/
├── mteb/
│   ├── __init__.py
│   ├── benchmark.py            # Main benchmark class
│   ├── encoder_interface.py    # Model interface
│   ├── abstasks/               # Abstract task definitions
│   │   ├── __init__.py
│   │   ├── AbsTask.py          # Base task class
│   │   ├── classification/     # Classification tasks
│   │   ├── clustering/         # Clustering tasks
│   │   ├── retrieval/          # Retrieval tasks
│   │   ├── sts/                # STS tasks
│   │   ├── pair_classification/ # Pair classification
│   │   ├── reranking/          # Reranking tasks
│   │   ├── summarization/      # Summarization tasks
│   │   └── instruction_following/  # Instruction tasks
│   └── models/                 # Model registry
├── scripts/                    # Evaluation scripts
├── tests/                      # Test suite
└── docs/                       # Documentation

[!TIP] For Indonesia-MTEB: We will follow the same structure, creating tasks in parallel categories while maintaining full API compatibility.


2. The 8 Core Task Categories

2.1 Task Summary Matrix

| # | Task | MTEB Type Code | Primary Metric | Secondary Metrics | Typical Input | Typical Output |
|---|------|----------------|----------------|-------------------|---------------|----------------|
| 1 | Classification | s2s / t2c | Accuracy | F1 (macro/micro) | Single text | Class label |
| 2 | Clustering | s2s | V-measure | ARI, NMI | Multiple texts | Cluster assignment |
| 3 | Pair Classification | s2s | Average Precision | Accuracy, F1 | Text pair | Binary label |
| 4 | Reranking | s2s | MAP | MRR, nDCG | Query + doc list | Reordered list |
| 5 | Retrieval | s2p / s2s | nDCG@k | Recall@k, MAP | Query + corpus | Ranked docs |
| 6 | STS | s2s | Spearman ρ | Pearson r | Text pair | Similarity score |
| 7 | Summarization | s2s | Cosine Similarity | ROUGE | Text + summary | Similarity |
| 8 | Instruction Following | s2p | Task-specific | nDCG@k | Instruction + query | Retrieved result |

Type Code Legend:

- s2s: Sentence-to-Sentence (both inputs are sentence-length)
- s2p: Sentence-to-Paragraph (sentence-length query, paragraph-length document)
- t2c: Text-to-Category (text to class label)

2.2 Task Distribution in MTEB

┌─────────────────────────────────────────────────────────────────────────────┐
│                    MTEB TASK DISTRIBUTION (1,308+ DATASETS)                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  CLASSIFICATION      ████████████████████████████████████░░░░  ~35%         │
│  RETRIEVAL          ████████████████████████░░░░░░░░░░░░░░░░  ~25%         │
│  CLUSTERING         ████████████████░░░░░░░░░░░░░░░░░░░░░░░  ~15%         │
│  STS                ████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░  ~10%         │
│  PAIR CLASS.        ████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  ~7%          │
│  RERANKING          ██████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  ~5%          │
│  INSTRUCTION FOLL.  ████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  ~3%          │
│  SUMMARIZATION      ██░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  ~2%          │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

[!NOTE] Implication for Indonesia-MTEB: Classification and Retrieval tasks dominate the benchmark. Our Indonesian datasets should reflect similar proportions for comparability.


3. Detailed Task Analysis

3.1 Classification

Purpose: Assign predefined category labels to individual text instances.

3.1.1 Task Definition

Classification tasks evaluate an embedding model's ability to capture semantic features that distinguish between predefined categories. The model must encode text such that similar texts cluster in embedding space.

3.1.2 Data Format Specification

{
  "text": "Bank Indonesia menaikkan suku bunga acuan sebesar 25 basis point",
  "label": 0,
  "split": "train"
}

HuggingFace Dataset Structure:

DatasetDict({
    'train': Dataset({
        'text': ['...', '...', ...],
        'label': [0, 1, 2, ...],
    }),
    'validation': Dataset({...}),
    'test': Dataset({...}),
})

3.1.3 Evaluation Protocol

┌─────────────────────────────────────────────────────────────────────────────┐
│                    CLASSIFICATION EVALUATION PIPELINE                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  1. ENCODING                                                                  │
│     embeddings = model.encode(test_texts)                                   │
│                                                                              │
│  2. CLASSIFIER TRAINING                                                       │
│     classifier = LogisticRegression()                                       │
│     classifier.fit(train_embeddings, train_labels)                          │
│                                                                              │
│  3. PREDICTION                                                                │
│     predictions = classifier.predict(test_embeddings)                        │
│                                                                              │
│  4. METRIC CALCULATION                                                        │
│     accuracy = accuracy_score(true_labels, predictions)                     │
│     f1_macro = f1_score(true_labels, predictions, average='macro')         │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
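As a rough sketch, the protocol above reduces to a few lines of scikit-learn. This is illustrative only, not the exact MTEB evaluator; it assumes a model exposing an .encode() method and the text/label fields from 3.1.2, and the helper name is ours:

# Minimal sketch of the classification protocol above (not the exact MTEB evaluator).
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

def evaluate_classification(model, train_texts, train_labels, test_texts, test_labels):
    # 1. Encode both splits with the frozen embedding model
    train_embs = model.encode(train_texts, batch_size=32)
    test_embs = model.encode(test_texts, batch_size=32)

    # 2. Fit a lightweight linear classifier on the embeddings
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_embs, train_labels)

    # 3. Predict and compute the reported metrics
    preds = clf.predict(test_embs)
    return {
        "accuracy": accuracy_score(test_labels, preds),
        "f1_macro": f1_score(test_labels, preds, average="macro"),
    }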

3.1.4 Example Datasets in MTEB

| Dataset | Domain | Classes | Train Size | Test Size | Link |
|---------|--------|---------|------------|-----------|------|
| ArxivClassification | Academic | 11 | 35,056 | 4,382 | mteb/ArxivClassification |
| Banking77Classification | Banking | 77 | 10,003 | 3,080 | mteb/Banking77Classification |
| EmotionClassification | Social Media | 6 | 16,000 | 2,000 | mteb/EmotionClassification |
| MassiveIntentClassification | E-commerce | 60 | 164,603 | 10,000 | mteb/amazon_massive_intent |
| TweetSentimentMultilingual | Social Media | 3 | 10,000 | 1,229 | mteb/tweet_sentiment_multilingual |

3.1.5 Indonesian Adaptation

Potential Indonesian Classification Datasets:

| Source | Task | Classes | Status | Notes |
|--------|------|---------|--------|-------|
| IndoNLU | SMSA (Sentiment) | 3 (pos/neg/neu) | Available | ~11,000 tweets |
| IndoNLU | EmoT (Emotion) | 5 | Available | ~3,400 tweets |
| NusaX | Sentiment | 3 | Available | 10 languages + ID |
| IndoNLU | POS Tagging | 23 | Available | Requires adaptation |
[!TIP] Indonesia-MTEB Classification Target: 8-12 datasets covering sentiment, topic, intent, and domain-specific classification.


3.2 Clustering

Purpose: Group similar texts without predefined labels (unsupervised learning).

3.2.1 Task Definition

Clustering evaluates whether an embedding model captures semantic similarity such that related texts form tight clusters in embedding space. Unlike classification, no labels are used when producing the clusters; the metrics then compare the algorithm-assigned clusters to ground-truth labels.

3.2.2 Data Format Specification

{
  "sentences": [
    "Bank Indonesia menaikkan suku bunga acuan",
    "BI rate naik 25 basis point",
    "Timnas Indonesia menang 3-0"
  ],
  "labels": [0, 0, 1]
}

Key Characteristics:

- Labels are for evaluation only; they are never used during clustering
- Clustering algorithm: typically k-means or similar
- Number of clusters: provided (fixed k)

3.2.3 Evaluation Protocol

┌─────────────────────────────────────────────────────────────────────────────┐
│                      CLUSTERING EVALUATION PIPELINE                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  1. ENCODING                                                                  │
│     embeddings = model.encode(sentences)                                    │
│                                                                              │
│  2. CLUSTERING                                                                │
│     from sklearn.cluster import KMeans                                      │
│     kmeans = KMeans(n_clusters=n_classes)                                  │
│     pred_labels = kmeans.fit_predict(embeddings)                            │
│                                                                              │
│  3. METRIC CALCULATION                                                        │
│     v_measure = v_measure_score(true_labels, pred_labels)                  │
│     ari = adjusted_rand_score(true_labels, pred_labels)                    │
│     nmi = normalized_mutual_info_score(true_labels, pred_labels)           │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
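A minimal sketch of this pipeline, assuming a model with an .encode() method and the sentences/labels format from 3.2.2; the helper name is illustrative, not MTEB's own evaluator:

# Minimal sketch of the clustering protocol above.
from sklearn.cluster import KMeans
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score, v_measure_score)

def evaluate_clustering(model, sentences, true_labels):
    embeddings = model.encode(sentences)
    n_clusters = len(set(true_labels))          # k is taken from the gold labels
    pred_labels = KMeans(n_clusters=n_clusters, n_init=10,
                         random_state=42).fit_predict(embeddings)
    return {
        "v_measure": v_measure_score(true_labels, pred_labels),
        "ari": adjusted_rand_score(true_labels, pred_labels),
        "nmi": normalized_mutual_info_score(true_labels, pred_labels),
    }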

3.2.4 Clustering Metrics Explained

[!NOTE] V-Measure (Primary Metric):

V-Measure is the harmonic mean of homogeneity and completeness:

V = 2 × (homogeneity × completeness) / (homogeneity + completeness)

homogeneity = 1 - H(C|K) / H(C)  # Each cluster contains only one class
completeness = 1 - H(K|C) / H(K)  # All members of a class are in one cluster

where H(C|K) is conditional entropy, H(C) is entropy
  • Range: [0, 1]
  • Interpretation: 1 = perfect clustering, 0 = random
  • Advantage: Independent of label permutation (unlike purity)

Adjusted Rand Index (ARI):

ARI = (RI - Expected_RI) / (Max_RI - Expected_RI)

where RI = (TP + TN) / (TP + TN + FP + FN) (Rand Index)
  • Range: [-1, 1]
  • Interpretation: 1 = perfect, 0 = random, < 0 = worse than random
  • Use Case: Robust to different cluster sizes

Normalized Mutual Information (NMI):

NMI = I(C;K) / sqrt(H(C) × H(K))

where I(C;K) is mutual information between true and predicted labels
  • Range: [0, 1]
  • Interpretation: 1 = perfect clustering

3.2.5 Example Datasets in MTEB

| Dataset | Domain | Clusters | Samples | Type | Link |
|---------|--------|----------|---------|------|------|
| reddit-clustering | Social Media | 199 | 1.2M | P2P | mteb/reddit-clustering |
| stackexchange-clustering-p2p | Q&A | 121 | 105K | P2P | mteb/stackexchange-clustering-p2p |
| arxiv-clustering-p2p | Academic | 30 | 96K | P2P | mteb/arxiv-clustering-p2p |
| wikipedia-clustering | Encyclopedia | 10 | 70K | S2S | mteb/wikipedia-clustering |

Type Legend:

- P2P: Paragraph-to-Paragraph (longer texts, e.g. title + abstract)
- S2S: Sentence-to-Sentence (short texts, e.g. title only)

3.2.6 Indonesian Adaptation

Proposed Indonesian Clustering Datasets:

| Domain | Source | Clusters | Method | Status |
|--------|--------|----------|--------|--------|
| News | IndoNLU articles | 10-15 | Aggregation | Available |
| Social Media | Twitter/Instagram | 20-50 | AI Generation | Planned |
| Wikipedia | ID Wikipedia | 10-30 | Aggregation | Available |
| Legal | Indonesian court docs | 5-10 | AI Generation | Gap |

3.3 Pair Classification

Purpose: Determine if two texts are semantically related (binary classification).

3.3.1 Task Definition

Pair classification evaluates whether an embedding model can distinguish between related and unrelated text pairs based on semantic similarity.

3.3.2 Data Format Specification

{
  "text1": "Bank Indonesia menaikkan suku bunga",
  "text2": "BI rate naik 25 basis point",
  "label": 1
}

Label encoding:

- 1: Related / Duplicate / Paraphrase
- 0: Not related / Different meaning

3.3.3 Evaluation Protocol

┌─────────────────────────────────────────────────────────────────────────────┐
│                    PAIR CLASSIFICATION EVALUATION PIPELINE                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  1. ENCODING                                                                  │
│     emb1 = model.encode(text1_list)                                        │
│     emb2 = model.encode(text2_list)                                        │
│                                                                              │
│  2. SIMILARITY CALCULATION                                                   │
│     similarity = cosine_similarity(emb1, emb2)                              │
│                                                                              │
│  3. THRESHOLD CLASSIFICATION                                                  │
│     predictions = (similarity > threshold).astype(int)                       │
│                                                                              │
│  4. METRIC CALCULATION                                                        │
│     ap = average_precision_score(true_labels, similarity)                   │
│     accuracy = accuracy_score(true_labels, predictions)                     │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
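A minimal sketch of this protocol follows. The fixed 0.5 threshold is purely illustrative (the MTEB evaluator searches for the best-performing threshold over the similarity scores), and the helper name is ours:

# Minimal sketch of the pair-classification protocol above.
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score

def evaluate_pair_classification(model, texts1, texts2, labels, threshold=0.5):
    emb1 = model.encode(texts1)
    emb2 = model.encode(texts2)

    # Row-wise cosine similarity between the paired embeddings
    emb1 = emb1 / np.linalg.norm(emb1, axis=1, keepdims=True)
    emb2 = emb2 / np.linalg.norm(emb2, axis=1, keepdims=True)
    similarity = (emb1 * emb2).sum(axis=1)

    predictions = (similarity > threshold).astype(int)
    return {
        "average_precision": average_precision_score(labels, similarity),
        "accuracy": accuracy_score(labels, predictions),
    }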

3.3.4 Example Datasets in MTEB

| Dataset | Domain | Pairs | Type | Link |
|---------|--------|-------|------|------|
| twitterurlcorpus-pairclassification | Social Media | 50K | Paraphrase | mteb/twitterurlcorpus-pairclassification |
| quora-duplicates-questions | Q&A | 400K | Duplicate | mteb/quora-duplicates-questions |
| stackoverflow-dupequestions | Technical | 300K | Duplicate | mteb/stackoverflow-dupequestions |

3.3.5 Indonesian Adaptation

Proposed Indonesian Pair Classification Datasets:

| Domain | Type | Source | Status |
|--------|------|--------|--------|
| News Headlines | Paraphrase | Translation + Generation | Planned |
| Social Media | Duplicate | Twitter/Instagram | Gap |
| Q&A | Duplicate | Kaskus/StackOverflow ID | Gap |

3.4 Reranking

Purpose: Reorder retrieved documents by relevance to a query.

3.4.1 Task Definition

Reranking evaluates whether an embedding model can refine an initial document ranking, placing the most relevant documents at the top.

3.4.2 Data Format Specification

{
  "query": "dampak kenaikan suku bunga pada ekonomi Indonesia",
  "documents": [
    {"text": "...", "id": "doc1"},
    {"text": "...", "id": "doc2"},
    {"text": "...", "id": "doc3"}
  ],
  "relevant": ["doc1", "doc3"]
}

3.4.3 Evaluation Protocol

┌─────────────────────────────────────────────────────────────────────────────┐
│                       RERANKING EVALUATION PIPELINE                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  1. ENCODE QUERY AND DOCUMENTS                                                │
│     query_emb = model.encode(query)                                         │
│     doc_embs = model.encode(documents)                                      │
│                                                                              │
│  2. CALCULATE QUERY-DOC SIMILARITY                                            │
│     scores = cosine_similarity(query_emb, doc_embs)                         │
│                                                                              │
│  3. RANK DOCUMENTS BY SCORE                                                   │
│     ranked_docs = argsort(scores, descending=True)                          │
│                                                                              │
│  4. METRIC CALCULATION                                                        │
│     MAP = mean([ap_score(relevances, rankings)])                            │
│     MRR = mean([1/rank_first_relevant])                                    │
│     nDCG@k = dcg@k / ideal_dcg@k                                           │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
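For a single query, the pipeline above can be sketched as follows; the helper name is illustrative, and full MAP/nDCG would be averaged over all queries as in step 4:

# Sketch of the reranking protocol above for one query, following the data
# format in 3.4.2. Returns the reranked document ids and the reciprocal rank.
import numpy as np

def rerank_single_query(model, query, documents, relevant_ids):
    query_emb = model.encode([query])[0]
    doc_embs = model.encode([doc["text"] for doc in documents])

    # Cosine similarity between the query and every candidate document
    scores = doc_embs @ query_emb / (
        np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(query_emb))
    order = np.argsort(-scores)                       # highest score first
    ranked_ids = [documents[i]["id"] for i in order]

    # Reciprocal rank of the first relevant document (0.0 if none retrieved)
    rr = next((1.0 / rank for rank, doc_id in enumerate(ranked_ids, start=1)
               if doc_id in relevant_ids), 0.0)
    return ranked_ids, rr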

3.4.4 Reranking Metrics Explained

MAP (Mean Average Precision):

For each query:
  AP = (1/R) × Σ (precision_at_k × relevance_at_k)
  where R = total relevant documents

MAP = mean(AP over all queries)
  • Range: [0, 1]
  • Interpretation: Average precision across all recall points

MRR (Mean Reciprocal Rank):

For each query:
  RR = 1 / rank_of_first_relevant_document

MRR = mean(RR over all queries)
  • Range: [0, 1]
  • Interpretation: Focus on first relevant document position

nDCG@k (Normalized Discounted Cumulative Gain):

DCG@k = Σ_{i=1..k} (2^rel_i - 1) / log2(i + 1)
IDCG@k = DCG@k of the ideal (relevance-sorted) ranking
nDCG@k = DCG@k / IDCG@k
  • Range: [0, 1]
  • Interpretation: Ranking quality at position k
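The nDCG@k formula translates directly into code. The helper below is illustrative and assumes relevance grades listed in ranked order (e.g. [1, 0, 1] with binary judgements):

# Direct translation of the DCG@k / nDCG@k formulas above.
import math

def dcg_at_k(relevances, k):
    # i is 0-based here, so position i+1 contributes a log2(i + 2) discount
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k):
    ideal = sorted(ranked_relevances, reverse=True)   # best possible ordering
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / idcg if idcg > 0 else 0.0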

3.4.5 Example Datasets in MTEB

| Dataset | Domain | Queries | Avg Docs | Link |
|---------|--------|---------|----------|------|
| MIRACLReranking | Wikipedia | 12K | 29 | mteb/MIRACLReranking |
| stackoverflow-qa | Technical | 150K | 30 | mteb/StackOverflowQA |

3.4.6 Indonesian Adaptation

Proposed Indonesian Reranking Datasets:

| Domain | Source | Queries | Method | Status |
|--------|--------|---------|--------|--------|
| Wikipedia | MIRACL-ID | ~500 | Translation | Available |
| News | Indonesian news sites | ~1000 | Generation | Gap |
| Legal | Court documents | ~500 | Generation | Gap |

3.5 Retrieval

Purpose: Find relevant documents from a large corpus for a given query.

3.5.1 Task Definition

Retrieval is the core information retrieval task, evaluating whether an embedding model can identify relevant documents from a large collection.

3.5.2 Data Format Specification

{
  "query": "teks query dalam bahasa Indonesia",
  "corpus": [
    {"id": "doc1", "text": "...", "title": "..."},
    {"id": "doc2", "text": "...", "title": "..."}
  ],
  "relevant_docs": {
    "query_id": ["doc1", "doc3", "doc7"]
  }
}

Split Configuration:

- Dev: small corpus for quick evaluation
- Test: full corpus for the benchmark

3.5.3 Evaluation Protocol

┌─────────────────────────────────────────────────────────────────────────────┐
│                       RETRIEVAL EVALUATION PIPELINE                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  1. ENCODE CORPUS (done once, cached)                                        │
│     corpus_embs = model.encode([doc["text"] for doc in corpus])             │
│                                                                              │
│  2. FOR EACH QUERY:                                                          │
│     a. Encode query                                                          │
│        query_emb = model.encode(query)                                      │
│                                                                              │
│     b. Calculate similarities                                                 │
│        scores = cosine_similarity(query_emb, corpus_embs)                   │
│                                                                              │
│     c. Rank and retrieve top-k                                               │
│        ranked_indices = argsort(scores, descending=True)[:k]               │
│                                                                              │
│  3. CALCULATE METRICS ACROSS ALL QUERIES                                      │
│     nDCG@k = mean([ndcg_at_k(query_relevances, query_rankings)])            │
│     Recall@k = mean([recall_at_k(query_relevances, query_rankings)])        │
│     MAP = mean([average_precision(query_relevances, query_scores)])         │
│     MRR = mean([1/rank_first_relevant])                                    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
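A minimal sketch of this pipeline reporting Recall@k only, assuming queries is a dict mapping query_id to query text and relevant_docs follows 3.5.2; production code would additionally use an ANN index and compute nDCG/MAP/MRR:

# Minimal sketch of the retrieval protocol above (Recall@k only).
import numpy as np

def evaluate_retrieval(model, queries, corpus, relevant_docs, k=10):
    doc_ids = [doc["id"] for doc in corpus]

    # 1. Encode the corpus once and L2-normalise for cosine similarity
    corpus_embs = model.encode([doc["text"] for doc in corpus])
    corpus_embs = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)

    recalls = []
    for query_id, query_text in queries.items():
        # 2. Encode the query and score it against every document
        query_emb = model.encode([query_text])[0]
        query_emb = query_emb / np.linalg.norm(query_emb)
        scores = corpus_embs @ query_emb

        # 3. Take the top-k documents and compare against the gold judgements
        top_k = {doc_ids[i] for i in np.argsort(-scores)[:k]}
        gold = set(relevant_docs.get(query_id, []))
        if gold:
            recalls.append(len(gold & top_k) / len(gold))

    return {"recall_at_k": float(np.mean(recalls))}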

3.5.4 Example Datasets in MTEB

| Dataset | Domain | Corpus Size | Queries | Avg Relevant | Link |
|---------|--------|-------------|---------|--------------|------|
| MIRACLRetrieval | Wikipedia | 1.1M (ID: ~50K) | 12K | 1.5 | mteb/MIRACLRetrieval |
| quora | Q&A | 1M | 10K | 1.5 | mteb/quora |
| arguana | Arguments | 24K | 1.4K | 2.8 | mteb/arguana |
| fiqa | Finance | 57K | 648 | 1.6 | mteb/fiqa |
| scidocs | Scientific | 25K | 1K | 4.8 | mteb/scidocs |

3.5.5 Indonesian Adaptation

Available Indonesian Retrieval Resources:

| Resource | Corpus Size | Queries | Status | Notes |
|----------|-------------|---------|--------|-------|
| MIRACL-ID | ~50K Wikipedia | ~500 | Available | Part of MIRACL |
| Wikipedia ID | Full | Custom | Available | Requires query set |
| IndoNLG (news) | ~50K | Custom | Available | Domain-specific |
| Legal documents | ~100K | Custom | Gap | Requires creation |

Proposed Indonesia-MTEB Retrieval Datasets:

| Domain | Corpus | Queries | Method | Priority |
|--------|--------|---------|--------|----------|
| Wikipedia | 50K | 500 | MIRACL-ID adaptation | High |
| News | 30K | 300 | Translation + generation | High |
| Legal | 20K | 200 | AI generation | Medium |
| FAQ | 10K | 200 | Industry collaboration | Medium |

3.6 STS (Semantic Textual Similarity)

Purpose: Predict similarity scores for text pairs.

3.6.1 Task Definition

STS evaluates whether an embedding model captures fine-grained semantic similarity, correlating with human judgment of text relatedness.

3.6.2 Data Format Specification

{
  "text1": "Bank Indonesia menaikkan suku bunga",
  "text2": "BI rate naik 25 basis point",
  "score": 4.5
}

Score Scales:

- 0-5 scale: STS-B, SICK-R
- 0-1 scale: Normalized variants
- Binary: Some simplified datasets

3.6.3 Evaluation Protocol

┌─────────────────────────────────────────────────────────────────────────────┐
│                          STS EVALUATION PIPELINE                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  1. ENCODE BOTH SENTENCES                                                     │
│     emb1 = model.encode(text1_list)                                        │
│     emb2 = model.encode(text2_list)                                        │
│                                                                              │
│  2. CALCULATE COSINE SIMILARITY                                               │
│     similarities = cosine_similarity(emb1, emb2)                            │
│                                                                              │
│  3. CORRELATE WITH HUMAN SCORES                                               │
│     spearman = spearmanr(similarities, human_scores)[0]                     │
│     pearson = pearsonr(similarities, human_scores)[0]                       │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
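A minimal sketch of the STS protocol, assuming a model with an .encode() method; SciPy provides both correlation functions, and the helper name is illustrative:

# Minimal sketch of the STS protocol above.
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate_sts(model, texts1, texts2, human_scores):
    emb1 = model.encode(texts1)
    emb2 = model.encode(texts2)

    # Cosine similarity of each sentence pair
    emb1 = emb1 / np.linalg.norm(emb1, axis=1, keepdims=True)
    emb2 = emb2 / np.linalg.norm(emb2, axis=1, keepdims=True)
    similarities = (emb1 * emb2).sum(axis=1)

    # Correlate model similarities with human judgements
    return {
        "spearman": spearmanr(similarities, human_scores)[0],
        "pearson": pearsonr(similarities, human_scores)[0],
    }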

3.6.4 STS Metrics Explained

Spearman's Rank Correlation (ρ):

ρ = 1 - (6 × Σd_i²) / (n × (n² - 1))

where d_i = difference in ranks for pair i
      n = number of pairs
  • Range: [-1, 1]
  • Interpretation: Monotonic relationship (rank-based)
  • Robustness: Less sensitive to outliers than Pearson

Pearson Correlation (r):

r = cov(X, Y) / (σ_X × σ_Y)

where cov = covariance, σ = standard deviation
  • Range: [-1, 1]
  • Interpretation: Linear relationship
  • Use Case: When scores are approximately normally distributed

3.6.5 Example Datasets in MTEB

| Dataset | Domain | Scale | Pairs | Link |
|---------|--------|-------|-------|------|
| stsbenchmark-sts | General | 0-5 | 8,628 | mteb/stsbenchmark-sts |
| sickr-sts | General | 0-5 | 4,500 | mteb/sickr-sts |
| biosts-sts | Biomedical | 0-5 | 600 | mteb/biosts-sts |

3.6.6 Indonesian Adaptation

Challenge: Limited Indonesian STS datasets exist.

Proposed Indonesia-MTEB STS Datasets:

| Domain | Pairs | Method | Status |
|--------|-------|--------|--------|
| News Headlines | 1,000 | Translation (STS-B) | Planned |
| Social Media | 500 | AI Generation | Gap |
| General | 2,000 | Human annotation | Gap |
| Technical | 300 | Domain-specific | Gap |

3.7 Summarization

Purpose: Evaluate if summary captures document semantics.

3.7.1 Task Definition

Summarization tasks assess whether an embedding model captures the semantic relationship between a document and its summary.

3.7.2 Data Format Specification

{
  "text": "original long document text...",
  "summary": "generated or reference summary..."
}

3.7.3 Evaluation Protocol

┌─────────────────────────────────────────────────────────────────────────────┐
│                       SUMMARIZATION EVALUATION PIPELINE                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  1. ENCODE DOCUMENT AND SUMMARY                                               │
│     doc_emb = model.encode(document)                                        │
│     sum_emb = model.encode(summary)                                         │
│                                                                              │
│  2. CALCULATE COSINE SIMILARITY                                               │
│     similarity = cosine_similarity(doc_emb, sum_emb)                        │
│                                                                              │
│  3. AGGREGATE SCORES                                                          │
│     mean_score = mean(similarities)                                         │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
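A minimal sketch of this protocol, assuming document-summary pairs aligned by index and a model with an .encode() method:

# Minimal sketch of the summarization protocol above: mean document-summary
# cosine similarity over the dataset.
import numpy as np

def evaluate_summarization(model, documents, summaries):
    doc_embs = model.encode(documents)
    sum_embs = model.encode(summaries)

    doc_embs = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sum_embs = sum_embs / np.linalg.norm(sum_embs, axis=1, keepdims=True)
    similarities = (doc_embs * sum_embs).sum(axis=1)

    return {"cosine_similarity": float(np.mean(similarities))}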

3.7.4 Example Datasets in MTEB

| Dataset | Domain | Pairs | Link |
|---------|--------|-------|------|
| summeval-fr | News (French) | 447 | Adaptation example |
| summeval | News | 2,150 | Reference implementation |

3.7.5 Indonesian Adaptation

Challenge: No Indonesian summarization datasets for embedding evaluation.

Proposed Solution:

1. Translate existing summarization datasets
2. Create Indonesian news-summary pairs
3. Collaborate with Indonesian media organizations


3.8 Instruction Following

Purpose: Evaluate model's ability to follow task-specific instructions.

3.8.1 Task Definition

Instruction following (added in MMTEB 2025) evaluates whether an embedding model can condition its representations based on task instructions, enabling domain-aware retrieval.

3.8.2 Data Format Specification

{
  "instruction": "Retrieve documents about Indonesian monetary policy",
  "query": "kebijakan suku bunga Bank Indonesia 2024",
  "expected": ["doc1", "doc3", "doc7"]
}
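One common way to condition retrieval on an instruction is to prepend it to the query before encoding; the hedged sketch below shows that idea only and is not the exact MMTEB evaluator:

# Hedged sketch: instruction-conditioned query encoding by prepending the
# instruction to the query text. Downstream retrieval and scoring then
# proceed exactly as in Section 3.5.
def encode_instructed_query(model, instruction, query):
    combined = f"{instruction} {query}"
    return model.encode([combined])[0]

# Illustrative use with the record above:
# q_emb = encode_instructed_query(model,
#     "Retrieve documents about Indonesian monetary policy",
#     "kebijakan suku bunga Bank Indonesia 2024")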

3.8.3 Example Datasets in MTEB

| Dataset | Domain | Instructions | Queries | Link |
|---------|--------|--------------|---------|------|
| InstructIR-mteb | Mixed | 17 | 500 | mteb/InstructIR-mteb |
| Core17InstructionRetrieval | News | Domain-specific | Custom | mteb/Core17InstructionRetrieval |

3.8.4 Indonesian Adaptation

Novel Contribution Opportunity: Indonesia-MTEB can be the first to introduce Indonesian instruction-following datasets.

Proposed Domains:

- Legal instruction retrieval
- Financial domain instruction
- Healthcare instruction
- Regional language instruction


4. Evaluation Metrics Deep-Dive

4.1 Complete Metrics Reference

| Metric | Formula | Range | Task | Interpretation |
|--------|---------|-------|------|----------------|
| Accuracy | correct / total | [0, 1] | Classification | Percentage correct |
| F1-Score | 2×P×R / (P+R) | [0, 1] | Classification | Harmonic mean of precision/recall |
| V-Measure | 2×h×c / (h+c) | [0, 1] | Clustering | Homogeneity + completeness |
| ARI | (RI-E) / (M-E) | [-1, 1] | Clustering | Adjusted clustering similarity |
| NMI | I(C;K) / √(H(C)H(K)) | [0, 1] | Clustering | Normalized mutual information |
| AP | Σ(P@k×rel) / R | [0, 1] | Pair Class., Reranking | Average precision |
| MAP | mean(AP) | [0, 1] | Retrieval, Reranking | Mean of average precision |
| MRR | mean(1/rank_first) | [0, 1] | Retrieval, Reranking | Mean reciprocal rank |
| nDCG@k | DCG@k / IDCG@k | [0, 1] | Retrieval, Reranking | Normalized discounted gain |
| Recall@k | rel@k / total relevant | [0, 1] | Retrieval | Recall at position k |
| Spearman ρ | rank correlation | [-1, 1] | STS | Rank-based correlation |
| Pearson r | linear correlation | [-1, 1] | STS | Linear correlation |
| Cosine Sim | (A·B) / (|A||B|) | [-1, 1] | Summarization | Cosine similarity |

4.2 Metric Selection by Task

┌─────────────────────────────────────────────────────────────────────────────┐
│                    MTEB METRIC SELECTION MATRIX                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  TASK                    │ PRIMARY        │ SECONDARY        │ TERTIARY     │
│  ────────────────────────┼────────────────┼──────────────────┼──────────────│
│  Classification          │ Accuracy       │ F1-macro         │ F1-micro     │
│  Clustering              │ V-measure      │ ARI              │ NMI          │
│  Pair Classification     │ AP             │ Accuracy         │ F1           │
│  Reranking               │ MAP            │ MRR              │ nDCG@10      │
│  Retrieval               │ nDCG@10        │ Recall@10        │ MAP          │
│  STS                     │ Spearman       │ Pearson          │ -            │
│  Summarization           │ Cosine Sim     │ -                │ -            │
│  Instruction Following   │ Task-specific  │ nDCG@10          │ Recall@10    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

5. MTEB Dataset Format Standards

5.1 HuggingFace Dataset Structure

Standard MTEB Dataset Format:

from datasets import Dataset, DatasetDict

dataset = DatasetDict({
    "train": Dataset.from_dict({
        "text": [...],           # Input text(s)
        "label": [...],          # Labels (for supervised tasks)
        # Additional fields as needed
    }),
    "validation": Dataset.from_dict({...}),
    "test": Dataset.from_dict({...})
})

Task-Specific Variations:

| Task | Required Fields | Optional Fields |
|------|-----------------|-----------------|
| Classification | text, label | - |
| Clustering | sentences, labels | main_category |
| Pair Classification | text1, text2, label | - |
| Reranking | query, documents, relevant | scores |
| Retrieval | corpus, queries, relevant_docs | domain |
| STS | text1, text2, score | dataset |
| Summarization | text, summary | source |
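For example, an Indonesian STS-style dataset following the field layout above could be packaged like this (the repository id is a placeholder):

# Illustrative packaging of an Indonesian STS dataset in the field layout above.
from datasets import Dataset, DatasetDict

sts_id = DatasetDict({
    "test": Dataset.from_dict({
        "text1": ["Bank Indonesia menaikkan suku bunga"],
        "text2": ["BI rate naik 25 basis point"],
        "score": [4.5],
    }),
})

# Push to the Hugging Face Hub so an MTEB task definition can point at it:
# sts_id.push_to_hub("myorg/indonesian-sts-news")   # placeholder repo id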

5.2 Dataset Card Template

Each MTEB dataset requires a README card:

---
dataset_name: "DatasetName"
language: ["id"]
license: "cc-by-4.0"
---

# DatasetName

## Dataset Description
Brief description of the dataset...

## Citation
```bibtex
@dataset{dataset_name,
  title={Dataset Name},
  author={...},
  year={2026}
}

```

## Tasks

- TaskType1
- TaskType2

## Languages

- Indonesian (id)

## Dataset Statistics

| Split | Samples |
|-------|---------|
| Train | X,XXX   |
| Test  | XXX     |

5.3 Adding Custom Datasets to MTEB

Step-by-Step Process:

# 1. Define your task class
from mteb.abstasks.AbsTask import AbsTask

class MyIndonesianTask(AbsTask):
    metadata = {
        "name": "MyIndonesianTask",
        "dataset": {
            "path": "myorg/my-indonesian-dataset",
            "revision": "main"
        },
        "type": "Classification",
        "category": "s2s",
        "eval_splits": ["test"],
        "eval_langs": ["id"],
        "main_score": "accuracy",
    }

# 2. Register the task
from mteb import get_tasks
tasks = get_tasks(tasks=["MyIndonesianTask"])

# 3. Run evaluation
from mteb import MTEB
evaluation = MTEB(tasks=tasks)
results = evaluation.run(model, eval_splits=["test"])

6. MTEB v2 & MMTEB Updates

6.1 What's New in MTEB v2

| Feature | v1 | v2 | Impact |
|---------|----|----|--------|
| API | MTEB(tasks).run() | tasks.evaluate(model) | Simpler interface |
| Format | JSON only | JSON + Parquet | Faster loading |
| Modality | Text only | Text + Image | Multimodal support |
| Caching | Basic | Advanced with validation | Reproducibility |
| Leaderboard | Single | Multi-domain | Better organization |

6.2 MMTEB (ICLR 2025) Additions

New Features:

1. Instruction Following Tasks
   - 17 instruction types
   - Domain-aware retrieval
   - 100+ new datasets

2. Long-Document Retrieval
   - Documents up to 32K tokens
   - Specialized evaluation

3. Code Retrieval
   - Programming language specific
   - Semantic code search

4. Conversational Retrieval
   - Multi-turn dialogue context
   - Conversation history handling

6.3 API Migration Guide

Old (v1):

tasks = mteb.get_tasks(tasks=["Banking77Classification"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, eval_splits=["test"])

New (v2):

tasks = mteb.get_tasks(tasks=["Banking77Classification"])
for task in tasks:
    results = task.evaluate(model, eval_splits=["test"])


7. Implementation Guide

7.1 Basic Evaluation Example

from sentence_transformers import SentenceTransformer
import mteb

# 1. Load model
model = SentenceTransformer('intfloat/multilingual-e5-large')

# 2. Select tasks
tasks = mteb.get_tasks(
    tasks=["Banking77Classification", "STSBenchmark"],
    languages=["id"]  # Will filter for Indonesian tasks
)

# 3. Run evaluation
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(
    model,
    eval_splits=["test"],
    output_folder="results/"
)

# 4. View results
for task_name, task_results in results.items():
    print(f"{task_name}: {task_results}")

7.2 Custom Encoder Example

from mteb.encoder_interface import EncoderInterface

class MyIndonesianEncoder(EncoderInterface):
    def __init__(self, model_name, device="cuda"):
        self.model = load_model(model_name)
        self.device = device

    def encode(self, texts, batch_size=32):
        """Encode texts to embeddings."""
        return self.model.encode(
            texts,
            batch_size=batch_size,
            show_progress_bar=False
        )

    @property
    def dimension(self):
        return self.model.embedding_dim

7.3 Result Format

{
    "dataset_name": {
        "test": {
            "en": {
                "main_score": 0.85,
                "accuracy": 0.85,
                "f1_macro": 0.82,
                "evaluation_time": 12.5,
                "footprint": {
                    "memory_mb": 512,
                    "model_parameters": 560000000
                }
            }
        }
    }
}
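A small helper for collecting main scores from the files written to output_folder is sketched below; the directory layout and filenames are assumptions, only the JSON structure above is taken as given:

# Hedged sketch: walk a results folder of JSON files (layout assumed) and
# pull out each dataset's main_score from the structure shown above.
import json
from pathlib import Path

def collect_main_scores(results_dir="results/"):
    scores = {}
    for path in Path(results_dir).rglob("*.json"):
        data = json.loads(path.read_text())
        for dataset_name, splits in data.items():
            if not isinstance(splits, dict):
                continue
            for split, langs in splits.items():
                if not isinstance(langs, dict):
                    continue
                for lang, metrics in langs.items():
                    if isinstance(metrics, dict) and "main_score" in metrics:
                        scores[(dataset_name, split, lang)] = metrics["main_score"]
    return scores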

8. Indonesia-MTEB Task Mapping

8.1 Proposed Dataset Distribution by Task

| Task Category | Target Count | Current ID Sources | Translation Needed | AI Generation Needed |
|---------------|--------------|--------------------|--------------------|----------------------|
| Classification | 8-12 | 3 (IndoNLU, NusaX) | 4-5 | 2-3 |
| Clustering | 5-8 | 0 | 2-3 | 3-5 |
| Pair Classification | 3-5 | 0 | 2 | 1-2 |
| Reranking | 3-5 | 0 | 2 | 1-2 |
| Retrieval | 8-12 | 1 (MIRACL-ID) | 4-5 | 3-5 |
| STS | 5-8 | 0 | 3-4 | 2-3 |
| Summarization | 3-5 | 0 | 2 | 1-2 |
| Instruction Following | 3-5 | 0 | 0 | 3-5 |
| TOTAL | 38-55 | 4 | 19-23 | 16-26 |

8.2 Priority Matrix for Indonesia-MTEB

┌─────────────────────────────────────────────────────────────────────────────┐
│                  INDONESIA-MTEB TASK PRIORITY MATRIX                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ╔═════════════════════════════════════════════════════════════════════════╗│
│  ║  HIGH PRIORITY (Phase 1: Foundation)                                      ║│
│  ║  ──────────────────────────────────────────────────────────────────────  ║│
│  ║  Classification  │ 8 datasets  │  IndoNLU + Translation + Generation      ║│
│  ║  Retrieval       │ 8 datasets  │  MIRACL-ID + Translation + Generation     ║│
│  ║  Clustering      │ 5 datasets  │  Translation + Generation                 ║│
│  ║  STS            │ 5 datasets  │  Translation + Generation                 ║│
│  ╚═════════════════════════════════════════════════════════════════════════╝│
│                                                                              │
│  ╔═════════════════════════════════════════════════════════════════════════╗│
│  ║  MEDIUM PRIORITY (Phase 2: Coverage)                                      ║│
│  ║  ──────────────────────────────────────────────────────────────────────  ║│
│  ║  Pair Class.     │ 4 datasets  │  Translation + Generation                ║│
│  ║  Reranking       │ 4 datasets  │  Translation + Generation                ║│
│  ║  Summarization   │ 3 datasets  │  Translation + Generation                ║│
│  ╚═════════════════════════════════════════════════════════════════════════╝│
│                                                                              │
│  ╔═════════════════════════════════════════════════════════════════════════╗│
│  ║  NOVEL CONTRIBUTION (Phase 3: Innovation)                                 ║│
│  ║  ──────────────────────────────────────────────────────────────────────  ║│
│  ║  Instruction Following │ 5 datasets  │  AI Generation (Novel)             ║│
│  ╚═════════════════════════════════════════════════════════════════════════╝│
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

8.3 Dataset Naming Convention

Proposed Indonesia-MTEB Naming:

indonesiamteb/{task}_{domain}_{source}

Examples:
- indonesiamteb/classification_sentiment_newsnlp
- indonesiamteb/clustering_news_wikipedia_id
- indonesiamteb/retrieval_wikipedia_miracl_id
- indonesiamteb/sts_news_translated_stsb
- indonesiamteb/instruction_retrieval_legal_generated

9. Technical Considerations

9.1 Performance Optimization

Encoding Speed:

| Technique | Speedup | Implementation |
|-----------|---------|----------------|
| Batch encoding | 10-50x | encode(texts, batch_size=128) |
| GPU utilization | 5-20x | model.to("cuda") |
| Quantization | 2-4x | quantize_model=True |
| Caching | - | cache_dir="cache/" |
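A sketch combining the batching, GPU, and caching techniques from the table above with a SentenceTransformer model (corpus_texts is an assumed list of strings):

# Sketch: batched GPU encoding with a locally cached SentenceTransformer.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "intfloat/multilingual-e5-large",
    device="cuda",              # GPU utilization
    cache_folder="cache/",      # cache downloaded weights locally
)

embeddings = model.encode(
    corpus_texts,               # assumed: a Python list of strings
    batch_size=128,             # batch encoding
    convert_to_numpy=True,
    show_progress_bar=True,
)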

Memory Optimization:

# For large corpora, stream the dataset instead of loading it fully into memory
from datasets import load_dataset

corpus = load_dataset("mteb/MIRACLRetrieval", split="test", streaming=True)
corpus = corpus.map(lambda x: {"text": x["text"]}, batched=True)

9.2 Cross-Lingual Considerations

For Indonesian evaluation, consider:

  1. Script Compatibility: Indonesian uses Latin script (same as English)
  2. Tokenization: Different tokenizers may affect embedding quality
  3. Domain Transfer: English-pretrained models may need adaptation

Recommended Models for Indonesian:

| Model | Parameters | Indonesian Training | MTEB ID Score |
|-------|------------|---------------------|---------------|
| intfloat/multilingual-e5-large | 560M | Yes (100+ langs) | Baseline |
| BAAI/bge-m3 | 600M | Yes (multilingual) | To evaluate |
| sentence-transformers/LaBSE | 470M | Yes | To evaluate |
| LazarusNLP/indonesian-sbert-base | 110M | Yes (ID-only) | To evaluate |

9.3 Reproducibility

Essential for MTEB integration:

import numpy as np
import torch

# Set seeds
np.random.seed(42)
torch.manual_seed(42)

# Use deterministic algorithms
torch.use_deterministic_algorithms(True)

# Log model details
print(f"Model: {model_name}")
print(f"Revision: {model_revision}")
print(f"MTEB version: {mteb.__version__}")

10. References

Primary Sources

  1. Muennighoff, N., et al. (2023). "MTEB: Massive Text Embedding Benchmark". Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2023). arXiv:2210.07316

  2. Enevoldsen, K., et al. (2025). "MMTEB: Massive Multilingual Text Embedding Benchmark". International Conference on Learning Representations (ICLR 2025). arXiv:2502.13595

  3. Chung, I., et al. (2025). "Maintaining MTEB: Towards Long Term Usability and Reproducibility of Embedding Benchmarks". arXiv. arXiv:2506.21182

Technical Documentation

  1. MTEB GitHub Repository: github.com/embeddings-benchmark/mteb

  2. MTEB v2 Introduction: huggingface.co/blog/isaacchung/mteb-v2

  3. Sentence Transformers MTEB Guide: sbert.net/docs/sentence_transformer/usage/mteb_evaluation.html

Evaluation Metrics

  1. Rosenberg, A., & Hirschberg, J. (2007). "V-Measure: A conditional entropy-based external cluster evaluation measure". EMNLP-CoNLL.

  2. Weaviate - Retrieval Evaluation Metrics: weaviate.io/blog/retrieval-evaluation-metrics

  3. Evidently AI - NDCG Explained: evidentlyai.com/ranking-metrics/ndcg-metric

Dataset Examples

  1. MTEB Datasets Hub: huggingface.co/mteb

  2. MIRACL (Multilingual Information Retrieval): github.com/project-miracl/miracl


11. Document Status

[!NOTE] Next Document: Document 03 - Existing Indonesian Datasets

This document provides detailed analysis of existing Indonesian NLP datasets, their MTEB compatibility, and aggregation strategies for Indonesia-MTEB.

Change Log:

| Version | Date | Changes | Author |
|---------|------|---------|--------|
| 1.0 | 2026-01-25 | Initial version | Research Team |
| 2.0 | 2026-01-25 | Enhanced edition with detailed analysis, MMTEB updates, implementation guides | Research Team |

This document is a living record. Updated as research progresses.