Project: Indonesia-MTEB Benchmark
Document: 02 - MTEB Structure & Task Categories Analysis
Version: 2.0 (Enhanced Edition)
Last Updated: 2026-01-25
Status: Research Phase - Foundation Planning
[!NOTE] Document Navigation
This is the second of twelve documents comprising the Indonesia-MTEB Benchmark research foundation.
| Document | Title | Focus Area |
|---|---|---|
| 01 | Project Overview & Scope | Foundation document |
| 02 | MTEB Structure Analysis | Current Document |
| 03 | Existing Indonesian Datasets | Data aggregation sources |
| 04 | Regional MTEB Methodologies | Precedent analysis |
| 05 | Translation Models Benchmark | Model selection & evaluation |
| 06 | AI Dataset Generation Methods | Novel data creation |
| 07 | Validation Strategies | Quality assurance protocols |
| 08 | ACL Dataset Paper Standards | Publication requirements |
| 09 | Novelty Angle & Publication | Research contribution |
| 10 | Implementation Roadmap | Technical execution plan |
| 11 | Python Package Development | Software architecture |
| 12 | Summary & Quick Reference | Consolidated reference |
MTEB Structure & Task Categories Analysis¶
"Understanding MTEB's internal architecture, evaluation protocols, and dataset formats is the foundation for building Indonesia-MTEB. This document provides a comprehensive technical deep-dive into the Massive Text Embedding Benchmark framework."
Table of Contents¶
- MTEB Architecture Overview
- The 8 Core Task Categories
- Detailed Task Analysis
- Evaluation Metrics Deep-Dive
- MTEB Dataset Format Standards
- MTEB v2 & MMTEB Updates
- Implementation Guide
- Indonesia-MTEB Task Mapping
- Technical Considerations
- References
1. MTEB Architecture Overview¶
1.1 Framework Specification¶
MTEB (Massive Text Embedding Benchmark) is a Python-based evaluation framework that provides standardized protocols for assessing text embedding models across diverse NLP tasks.
┌─────────────────────────────────────────────────────────────────────────────┐
│ MTEB FRAMEWORK ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ╔═════════════════════════════════════════════════════════════════════════╗│
│ ║ INPUT LAYER ║│
│ ║ ┌────────────────────────────────────────────────────────────────────┐ ║│
│ ║ │ Model Input: Text / Sentence Pair / Query-Document Pair │ ║│
│ ║ │ Format: Raw text strings │ ║│
│ ║ └────────────────────────────────────────────────────────────────────┘ ║│
│ ╚═════════════════════════════════════════════════════════════════════════╝│
│ │ │
│ ▼ │
│ ╔═════════════════════════════════════════════════════════════════════════╗│
│ ║ ENCODING LAYER ║│
│ ║ ┌────────────────────────────────────────────────────────────────────┐ ║│
│ ║ │ Embedding Model Interface │ ║│
│ ║ │ - SentenceTransformer │ ║│
│ ║ │ - Custom encoder with .encode() method │ ║│
│ ║ │ - Output: Dense vectors (typically 384-4096 dimensions) │ ║│
│ ║ └────────────────────────────────────────────────────────────────────┘ ║│
│ ╚═════════════════════════════════════════════════════════════════════════╝│
│ │ │
│ ▼ │
│ ╔═════════════════════════════════════════════════════════════════════════╗│
│ ║ TASK-SPECIFIC EVALUATORS ║│
│ ║ ┌────────────────────────────────────────────────────────────────────┐ ║│
│ ║ │ │ CLASSIFICATION │ CLUSTERING │ RETRIEVAL │ STS │ ║│
│ ║ │ ├──────────────────┼──────────────┼─────────────┼──────────────┤ ║│
│ ║ │ │ PAIR CLASS. │ RERANKING │ INSTRUCT. │ SUMMARIZATION║ ║│
│ ║ │ └──────────────────┴──────────────┴─────────────┴──────────────┘ ║│
│ ║ │ │ ║│
│ ║ │ Each evaluator: │ ║│
│ ║ │ 1. Loads dataset (train/validation/test splits) │ ║│
│ ║ │ 2. Encodes text using model │ ║│
│ ║ │ 3. Computes task-specific metrics │ ║│
│ ║ │ 4. Returns structured results │ ║│
│ ║ └────────────────────────────────────────────────────────────────────┘ ║│
│ ╚═════════════════════════════════════════════════════════════════════════╝│
│ │ │
│ ▼ │
│ ╔═════════════════════════════════════════════════════════════════════════╗│
│ ║ AGGREGATION LAYER ║│
│ ║ ┌────────────────────────────────────────────────────────────────────┐ ║│
│ ║ │ Results Collection & Aggregation │ ║│
│ ║ │ - Per-dataset scores │ ║│
│ ║ │ - Per-task averages │ ║│
│ ║ │ - Overall benchmark score │ ║│
│ ║ │ - Leaderboard formatting │ ║│
│ ║ └────────────────────────────────────────────────────────────────────┘ ║│
│ ╚═════════════════════════════════════════════════════════════════════════╝│
│ │
└─────────────────────────────────────────────────────────────────────────────┘
1.2 MTEB Evolution Timeline¶
| Version | Code Name | Datasets | Languages | Tasks | Year | Key Innovation |
|---|---|---|---|---|---|---|
| MTEB v1 | Original | 58 | 112 | 8 | 2022 (Oct) | Initial unification |
| MTEB v1.1 | EACL | 58 | 112 | 8 | 2023 (Apr) | EACL publication |
| MTEB v2 | Multilingual | 284 | 112 | 8 | 2024 | Expanded multilingual |
| MMTEB | ICLR 2025 | 500+ | 1000+ | 9+ | 2025 | Instruction Following |
| Current | Production | 1,308+ | 1000+ | 8+ | 2026 | Community-driven |
Sources:
- MTEB Original: arxiv.org/abs/2210.07316
- MMTEB: arxiv.org/abs/2502.13595
- MTEB v2: huggingface.co/blog/isaacchung/mteb-v2
1.3 Repository Structure¶
mteb/
├── mteb/
│ ├── __init__.py
│ ├── benchmark.py # Main benchmark class
│ ├── encoder_interface.py # Model interface
│ ├── abstasks/ # Abstract task definitions
│ │ ├── __init__.py
│ │ ├── AbsTask.py # Base task class
│ │ ├── classification/ # Classification tasks
│ │ ├── clustering/ # Clustering tasks
│ │ ├── retrieval/ # Retrieval tasks
│ │ ├── sts/ # STS tasks
│ │ ├── pair_classification/ # Pair classification
│ │ ├── reranking/ # Reranking tasks
│ │ ├── summarization/ # Summarization tasks
│ │ └── instruction_following/ # Instruction tasks
│ └── models/ # Model registry
├── scripts/ # Evaluation scripts
├── tests/ # Test suite
└── docs/ # Documentation
[!TIP] For Indonesia-MTEB: We will follow the same structure, creating tasks in parallel categories while maintaining full API compatibility.
2. The 8 Core Task Categories¶
2.1 Task Summary Matrix¶
| # | Task | MTEB Type Code | Primary Metric | Secondary Metrics | Typical Input | Typical Output |
|---|---|---|---|---|---|---|
| 1 | Classification | s2s / t2c | Accuracy | F1 (macro/micro) | Single text | Class label |
| 2 | Clustering | s2s | V-measure | ARI, NMI | Multiple texts | Cluster assignment |
| 3 | Pair Classification | s2s | Average Precision | Accuracy, F1 | Text pair | Binary label |
| 4 | Reranking | s2s | MAP | MRR, nDCG | Query + doc list | Reordered list |
| 5 | Retrieval | s2p / s2s | nDCG@k | Recall@k, MAP | Query + corpus | Ranked docs |
| 6 | STS | s2s | Spearman ρ | Pearson r | Text pair | Similarity score |
| 7 | Summarization | s2s | Cosine Similarity | ROUGE | Text + summary | Similarity |
| 8 | Instruction Following | s2p | Task-specific | nDCG@k | Instruction + query | Retrieved result |
Type Code Legend:
- s2s: Sentence-to-Sentence (both inputs are sentence-length)
- s2p: Sentence-to-Paragraph (query sentence, doc paragraph)
- t2c: Text-to-Category (text to class label)
2.2 Task Distribution in MTEB¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ MTEB TASK DISTRIBUTION (1,308+ DATASETS) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ CLASSIFICATION ████████████████████████████████████░░░░ ~35% │
│ RETRIEVAL ████████████████████████░░░░░░░░░░░░░░░░ ~25% │
│ CLUSTERING ████████████████░░░░░░░░░░░░░░░░░░░░░░░ ~15% │
│ STS ████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░ ~10% │
│ PAIR CLASS. ████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ ~7% │
│ RERANKING ██████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ ~5% │
│ INSTRUCTION FOLL. ████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ ~3% │
│ SUMMARIZATION ██░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ ~2% │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
[!NOTE] Implication for Indonesia-MTEB: Classification and Retrieval tasks dominate the benchmark. Our Indonesian datasets should reflect similar proportions for comparability.
3. Detailed Task Analysis¶
3.1 Classification¶
Purpose: Assign predefined category labels to individual text instances.
3.1.1 Task Definition¶
Classification tasks evaluate an embedding model's ability to capture semantic features that distinguish between predefined categories. The model must encode text such that similar texts cluster in embedding space.
3.1.2 Data Format Specification¶
{
"text": "Bank Indonesia menaikkan suku bunga acuan sebesar 25 basis point",
"label": 0,
"split": "train"
}
HuggingFace Dataset Structure:
DatasetDict({
'train': Dataset({
'text': ['...', '...', ...],
'label': [0, 1, 2, ...],
}),
'validation': Dataset({...}),
'test': Dataset({...}),
})
3.1.3 Evaluation Protocol¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ CLASSIFICATION EVALUATION PIPELINE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. ENCODING │
│ embeddings = model.encode(test_texts) │
│ │
│ 2. CLASSIFIER TRAINING │
│ classifier = LogisticRegression() │
│ classifier.fit(train_embeddings, train_labels) │
│ │
│ 3. PREDICTION │
│ predictions = classifier.predict(test_embeddings) │
│ │
│ 4. METRIC CALCULATION │
│ accuracy = accuracy_score(true_labels, predictions) │
│ f1_macro = f1_score(true_labels, predictions, average='macro') │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
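Condensed into runnable form, the pipeline looks like the sketch below. It assumes `model` exposes an `.encode()` method (as in Section 1.1) and that splits are already loaded; MTEB's actual evaluator additionally repeats this over several down-sampled training sets and averages the scores.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

def evaluate_classification(model, train_texts, train_labels, test_texts, test_labels):
    """Fit a linear probe on frozen embeddings, then score it (MTEB-style sketch)."""
    train_embs = model.encode(train_texts)   # embeddings are frozen; no fine-tuning
    test_embs = model.encode(test_texts)
    clf = LogisticRegression(max_iter=1000)  # linear probe over the embeddings
    clf.fit(train_embs, train_labels)
    preds = clf.predict(test_embs)
    return {
        "accuracy": accuracy_score(test_labels, preds),
        "f1_macro": f1_score(test_labels, preds, average="macro"),
    }
```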
3.1.4 Example Datasets in MTEB¶
| Dataset | Domain | Classes | Train Size | Test Size | Link |
|---|---|---|---|---|---|
| ArxivClassification | Academic | 11 | 35,056 | 4,382 | mteb/ArxivClassification |
| Banking77Classification | Banking | 77 | 10,003 | 3,080 | mteb/Banking77Classification |
| EmotionClassification | Social Media | 6 | 16,000 | 2,000 | mteb/EmotionClassification |
| MassiveIntentClassification | E-commerce | 60 | 164,603 | 10,000 | mteb/amazon_massive_intent |
| TweetSentimentMultilingual | Social Media | 3 | 10,000 | 1,229 | mteb/tweet_sentiment_multilingual |
3.1.5 Indonesian Adaptation¶
Potential Indonesian Classification Datasets:
| Source | Task | Classes | Status | Notes |
|---|---|---|---|---|
| IndoNLU | SMSA (Sentiment) | 3 (pos/neg/neu) | Available | ~11,000 tweets |
| IndoNLU | EmoT (Emotion) | 5 | Available | ~3,400 tweets |
| NusaX | Sentiment | 3 | Available | 10 languages + ID |
| IndoNLU | POS Tagging | 23 | Available | Requires adaptation |
[!TIP] Indonesia-MTEB Classification Target: 8-12 datasets covering sentiment, topic, intent, and domain-specific classification.
3.2 Clustering¶
Purpose: Group similar texts without predefined labels (unsupervised learning).
3.2.1 Task Definition¶
Clustering evaluates whether an embedding model captures semantic similarity such that related texts form tight clusters in embedding space. Unlike classification, no labeled training step is involved: a clustering algorithm groups the embeddings, and ground-truth labels are used only to score the resulting assignment.
3.2.2 Data Format Specification¶
{
"sentences": [
"Bank Indonesia menaikkan suku bunga acuan",
"BI rate naik 25 basis point",
"Timnas Indonesia menang 3-0"
],
"labels": [0, 0, 1]
}
Key Characteristics:
- Labels are for evaluation only—never used during clustering
- Clustering algorithm: typically k-means or similar
- Number of clusters: provided (fixed k)
3.2.3 Evaluation Protocol¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ CLUSTERING EVALUATION PIPELINE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. ENCODING │
│ embeddings = model.encode(sentences) │
│ │
│ 2. CLUSTERING │
│ from sklearn.cluster import KMeans │
│ kmeans = KMeans(n_clusters=n_classes) │
│ pred_labels = kmeans.fit_predict(embeddings) │
│ │
│ 3. METRIC CALCULATION │
│ v_measure = v_measure_score(true_labels, pred_labels) │
│ ari = adjusted_rand_score(true_labels, pred_labels) │
│ nmi = normalized_mutual_info_score(true_labels, pred_labels) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
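The same pipeline as a minimal runnable sketch; MTEB evaluates each clustering set in a dataset and averages, whereas this shows a single pass:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import (adjusted_rand_score, normalized_mutual_info_score,
                             v_measure_score)

def evaluate_clustering(model, sentences, true_labels):
    """Cluster frozen embeddings with k-means and score against gold labels."""
    embeddings = model.encode(sentences)
    n_clusters = len(set(true_labels))  # k is fixed to the number of gold classes
    pred_labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
    return {
        "v_measure": v_measure_score(true_labels, pred_labels),  # primary metric
        "ari": adjusted_rand_score(true_labels, pred_labels),
        "nmi": normalized_mutual_info_score(true_labels, pred_labels),
    }
```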
3.2.4 Clustering Metrics Explained¶
[!NOTE] V-Measure (Primary Metric):
V-Measure is the harmonic mean of homogeneity and completeness:
V = 2 × (homogeneity × completeness) / (homogeneity + completeness)
homogeneity = 1 - H(C|K) / H(C) # Each cluster contains only one class
completeness = 1 - H(K|C) / H(K) # All members of a class are in one cluster
where H(C|K) is conditional entropy, H(C) is entropy
- Range: [0, 1]
- Interpretation: 1 = perfect clustering, 0 = random
- Advantage: Independent of label permutation (unlike purity)
Adjusted Rand Index (ARI):
ARI = (RI - Expected_RI) / (Max_RI - Expected_RI)
where RI = (TP + TN) / (TP + TN + FP + FN) (Rand Index)
- Range: [-1, 1]
- Interpretation: 1 = perfect, 0 = random, < 0 = worse than random
- Use Case: Robust to different cluster sizes
Normalized Mutual Information (NMI):
NMI = I(C;K) / sqrt(H(C) × H(K))
where I(C;K) is mutual information between true and predicted labels
- Range: [0, 1]
- Interpretation: 1 = perfect clustering
3.2.5 Example Datasets in MTEB¶
| Dataset | Domain | Clusters | Samples | Type | Link |
|---|---|---|---|---|---|
| reddit-clustering | Social Media | 199 | 1.2M | P2P | mteb/reddit-clustering |
| stackexchange-clustering-p2p | Q&A | 121 | 105K | P2P | mteb/stackexchange-clustering-p2p |
| arxiv-clustering-p2p | Academic | 30 | 96K | P2P | mteb/arxiv-clustering-p2p |
| wikipedia-clustering | Encyclopedia | 10 | 70K | S2S | mteb/wikipedia-clustering |
Type Legend:
- P2P: Paragraph-to-Paragraph (longer text spans, e.g., title plus body)
- S2S: Sentence-to-Sentence (single sentences)
3.2.6 Indonesian Adaptation¶
Proposed Indonesian Clustering Datasets:
| Domain | Source | Clusters | Method | Status |
|---|---|---|---|---|
| News | IndoNLU articles | 10-15 | Aggregation | Available |
| Social Media | Twitter/Instagram | 20-50 | AI Generation | Planned |
| Wikipedia | ID Wikipedia | 10-30 | Aggregation | Available |
| Legal | Indonesian court docs | 5-10 | AI Generation | Gap |
3.3 Pair Classification¶
Purpose: Determine if two texts are semantically related (binary classification).
3.3.1 Task Definition¶
Pair classification evaluates whether an embedding model can distinguish between related and unrelated text pairs based on semantic similarity.
3.3.2 Data Format Specification¶
{
"text1": "Bank Indonesia menaikkan suku bunga",
"text2": "BI rate naik 25 basis point",
"label": 1
}
Label encoding:
- 1: Related / Duplicate / Paraphrase
- 0: Not related / Different meaning
3.3.3 Evaluation Protocol¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ PAIR CLASSIFICATION EVALUATION PIPELINE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. ENCODING │
│ emb1 = model.encode(text1_list) │
│ emb2 = model.encode(text2_list) │
│ │
│ 2. SIMILARITY CALCULATION │
│ similarity = cosine_similarity(emb1, emb2) │
│ │
│ 3. THRESHOLD CLASSIFICATION │
│ predictions = (similarity > threshold).astype(int) │
│ │
│ 4. METRIC CALCULATION │
│ ap = average_precision_score(true_labels, similarity) │
│ accuracy = accuracy_score(true_labels, predictions) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
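A minimal sketch of steps 1-4, assuming aligned lists of pair texts and binary labels. The fixed `threshold` parameter is illustrative only: average precision needs no threshold, and MTEB's evaluator searches for the accuracy-maximizing threshold rather than fixing one.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score

def row_cosine(a, b):
    """Cosine similarity of each aligned row pair in two embedding matrices."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return (a * b).sum(axis=1)

def evaluate_pair_classification(model, texts1, texts2, labels, threshold=0.7):
    sims = row_cosine(model.encode(texts1), model.encode(texts2))
    preds = (sims > threshold).astype(int)
    return {
        "average_precision": average_precision_score(labels, sims),  # primary metric
        "accuracy": accuracy_score(labels, preds),
    }
```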
3.3.4 Example Datasets in MTEB¶
| Dataset | Domain | Pairs | Type | Link |
|---|---|---|---|---|
| twitterurlcorpus-pairclassification | Social Media | 50K | Paraphrase | mteb/twitterurlcorpus-pairclassification |
| quora-duplicates-questions | Q&A | 400K | Duplicate | mteb/quora-duplicates-questions |
| stackoverflow-dupequestions | Technical | 300K | Duplicate | mteb/stackoverflow-dupequestions |
3.3.5 Indonesian Adaptation¶
Proposed Indonesian Pair Classification Datasets:
| Domain | Type | Source | Status |
|---|---|---|---|
| News Headlines | Paraphrase | Translation + Generation | Planned |
| Social Media | Duplicate | Twitter/Instagram | Gap |
| Q&A | Duplicate | Kaskus/StackOverflow ID | Gap |
3.4 Reranking¶
Purpose: Reorder retrieved documents by relevance to a query.
3.4.1 Task Definition¶
Reranking evaluates whether an embedding model can refine an initial document ranking, placing the most relevant documents at the top.
3.4.2 Data Format Specification¶
{
"query": "dampak kenaikan suku bunga pada ekonomi Indonesia",
"documents": [
{"text": "...", "id": "doc1"},
{"text": "...", "id": "doc2"},
{"text": "...", "id": "doc3"}
],
"relevant": ["doc1", "doc3"]
}
3.4.3 Evaluation Protocol¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ RERANKING EVALUATION PIPELINE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. ENCODE QUERY AND DOCUMENTS │
│ query_emb = model.encode(query) │
│ doc_embs = model.encode(documents) │
│ │
│ 2. CALCULATE QUERY-DOC SIMILARITY │
│ scores = cosine_similarity(query_emb, doc_embs) │
│ │
│ 3. RANK DOCUMENTS BY SCORE │
│ ranked_docs = argsort(scores, descending=True) │
│ │
│ 4. METRIC CALCULATION │
│ MAP = mean([ap_score(relevances, rankings)]) │
│ MRR = mean([1/rank_first_relevant]) │
│ nDCG@k = dcg@k / ideal_dcg@k │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
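A per-query sketch of the pipeline above, using the `documents` format from 3.4.2 and returning the reciprocal rank for one query; MAP and nDCG follow the same rank-then-score pattern:

```python
import numpy as np

def rerank_query(model, query, documents, relevant_ids):
    """Rank candidate documents by cosine similarity; return the reciprocal rank."""
    query_emb = model.encode([query])[0]
    doc_embs = model.encode([doc["text"] for doc in documents])
    # cosine similarity = dot product of L2-normalized vectors
    query_emb = query_emb / np.linalg.norm(query_emb)
    doc_embs = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    order = np.argsort(-(doc_embs @ query_emb))          # best-scoring docs first
    ranked_ids = [documents[i]["id"] for i in order]
    for rank, doc_id in enumerate(ranked_ids, start=1):  # 1/position of first hit
        if doc_id in set(relevant_ids):
            return 1.0 / rank
    return 0.0
```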
3.4.4 Reranking Metrics Explained¶
MAP (Mean Average Precision):
For each query:
AP = (1/R) × Σ (precision_at_k × relevance_at_k)
where R = total relevant documents
MAP = mean(AP over all queries)
- Range: [0, 1]
- Interpretation: Average precision across all recall points
MRR (Mean Reciprocal Rank):
- Range: [0, 1]
- Interpretation: Focus on first relevant document position
nDCG@k (Normalized Discounted Cumulative Gain):
- Range: [0, 1]
- Interpretation: Ranking quality at position k
3.4.5 Example Datasets in MTEB¶
| Dataset | Domain | Queries | Avg Docs | Link |
|---|---|---|---|---|
| MIRACLReranking | Wikipedia | 12K | 29 | mteb/MIRACLReranking |
| stackoverflow-qa | Technical | 150K | 30 | mteb/StackOverflowQA |
3.4.6 Indonesian Adaptation¶
Proposed Indonesian Reranking Datasets:
| Domain | Source | Queries | Method | Status |
|---|---|---|---|---|
| Wikipedia | MIRACL-ID | ~500 | Translation | Available |
| News | Indonesian news sites | ~1000 | Generation | Gap |
| Legal | Court documents | ~500 | Generation | Gap |
3.5 Retrieval¶
Purpose: Find relevant documents from a large corpus for a given query.
3.5.1 Task Definition¶
Retrieval is the core information retrieval task, evaluating whether an embedding model can identify relevant documents from a large collection.
3.5.2 Data Format Specification¶
{
"query": "teks query dalam bahasa Indonesia",
"corpus": [
{"id": "doc1", "text": "...", "title": "..."},
{"id": "doc2", "text": "...", "title": "..."}
],
"relevant_docs": {
"query_id": ["doc1", "doc3", "doc7"]
}
}
Split Configuration:
- Dev: small corpus for quick evaluation
- Test: full corpus for the benchmark
3.5.3 Evaluation Protocol¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ RETRIEVAL EVALUATION PIPELINE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. ENCODE CORPUS (done once, cached) │
│ corpus_embs = model.encode([doc["text"] for doc in corpus]) │
│ │
│ 2. FOR EACH QUERY: │
│ a. Encode query │
│ query_emb = model.encode(query) │
│ │
│ b. Calculate similarities │
│ scores = cosine_similarity(query_emb, corpus_embs) │
│ │
│ c. Rank and retrieve top-k │
│ ranked_indices = argsort(scores, descending=True)[:k] │
│ │
│ 3. CALCULATE METRICS ACROSS ALL QUERIES │
│ nDCG@k = mean([ndcg_at_k(query_relevances, query_rankings)]) │
│ Recall@k = mean([recall_at_k(query_relevances, query_rankings)]) │
│ MAP = mean([average_precision(query_relevances, query_scores)]) │
│ MRR = mean([1/rank_first_relevant]) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
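The step-3 metrics reduce to a few lines in their binary-relevance form; a sketch is shown below (MIRACL-style graded relevance would replace the 0/1 gains):

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """nDCG@k with binary gains: DCG of the ranking over DCG of the ideal ranking."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc_id in enumerate(ranked_ids[:k]) if doc_id in relevant_ids)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant_ids), k)))
    return dcg / idcg if idcg > 0 else 0.0

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of all relevant documents that appear in the top k."""
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)
```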
3.5.4 Example Datasets in MTEB¶
| Dataset | Domain | Corpus Size | Queries | Avg Relevant | Link |
|---|---|---|---|---|---|
| MIRACLRetrieval | Wikipedia | 1.1M (ID: ~50K) | 12K | 1.5 | mteb/MIRACLRetrieval |
| quora | Q&A | 1M | 10K | 1.5 | mteb/quora |
| arguana | Arguments | 24K | 1.4K | 2.8 | mteb/arguana |
| fiqa | Finance | 57K | 648 | 1.6 | mteb/fiqa |
| scidocs | Scientific | 25K | 1K | 4.8 | mteb/scidocs |
3.5.5 Indonesian Adaptation¶
Available Indonesian Retrieval Resources:
| Resource | Corpus Size | Queries | Status | Notes |
|---|---|---|---|---|
| MIRACL-ID | ~50K Wikipedia | ~500 | Available | Part of MIRACL |
| Wikipedia ID | Full | Custom | Available | Requires query set |
| IndoNLG (news) | ~50K | Custom | Available | Domain-specific |
| Legal documents | ~100K | Custom | Gap | Requires creation |
Proposed Indonesia-MTEB Retrieval Datasets:
| Domain | Corpus | Queries | Method | Priority |
|---|---|---|---|---|
| Wikipedia | 50K | 500 | MIRACL-ID adaptation | High |
| News | 30K | 300 | Translation + generation | High |
| Legal | 20K | 200 | AI generation | Medium |
| FAQ | 10K | 200 | Industry collaboration | Medium |
3.6 STS (Semantic Textual Similarity)¶
Purpose: Predict similarity scores for text pairs.
3.6.1 Task Definition¶
STS evaluates whether an embedding model captures fine-grained semantic similarity, correlating with human judgment of text relatedness.
3.6.2 Data Format Specification¶
{
"text1": "Bank Indonesia menaikkan suku bunga",
"text2": "BI rate naik 25 basis point",
"score": 4.5
}
Score Scales:
- 0-5 scale: STS-B, SICK-R
- 0-1 scale: normalized variants
- Binary: some simplified datasets
3.6.3 Evaluation Protocol¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ STS EVALUATION PIPELINE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. ENCODE BOTH SENTENCES │
│ emb1 = model.encode(text1_list) │
│ emb2 = model.encode(text2_list) │
│ │
│ 2. CALCULATE COSINE SIMILARITY │
│ similarities = cosine_similarity(emb1, emb2) │
│ │
│ 3. CORRELATE WITH HUMAN SCORES │
│ spearman = spearmanr(similarities, human_scores)[0] │
│ pearson = pearsonr(similarities, human_scores)[0] │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
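As a runnable sketch of the same three steps (the row-wise cosine avoids building a full N×N similarity matrix):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate_sts(model, texts1, texts2, human_scores):
    """Correlate cosine similarities of sentence pairs with human judgments."""
    emb1 = model.encode(texts1)
    emb2 = model.encode(texts2)
    emb1 = emb1 / np.linalg.norm(emb1, axis=1, keepdims=True)
    emb2 = emb2 / np.linalg.norm(emb2, axis=1, keepdims=True)
    sims = (emb1 * emb2).sum(axis=1)  # row-wise cosine similarity
    return {
        "spearman": spearmanr(sims, human_scores)[0],  # primary MTEB metric
        "pearson": pearsonr(sims, human_scores)[0],
    }
```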
3.6.4 STS Metrics Explained¶
Spearman's Rank Correlation (ρ):
- Range: [-1, 1]
- Interpretation: Monotonic relationship (rank-based)
- Robustness: Less sensitive to outliers than Pearson
Pearson Correlation (r):
- Range: [-1, 1]
- Interpretation: Linear relationship
- Use Case: When scores are approximately normally distributed
3.6.5 Example Datasets in MTEB¶
| Dataset | Domain | Scale | Pairs | Link |
|---|---|---|---|---|
| stsbenchmark-sts | General | 0-5 | 8,628 | mteb/stsbenchmark-sts |
| sickr-sts | General | 0-5 | 4,500 | mteb/sickr-sts |
| biosts-sts | Biomedical | 0-5 | 600 | mteb/biosts-sts |
3.6.6 Indonesian Adaptation¶
Challenge: Limited Indonesian STS datasets exist.
Proposed Indonesia-MTEB STS Datasets:
| Domain | Pairs | Method | Status |
|---|---|---|---|
| News Headlines | 1,000 | Translation (STS-B) | Planned |
| Social Media | 500 | AI Generation | Gap |
| General | 2,000 | Human annotation | Gap |
| Technical | 300 | Domain-specific | Gap |
3.7 Summarization¶
Purpose: Evaluate if summary captures document semantics.
3.7.1 Task Definition¶
Summarization tasks assess whether an embedding model captures the semantic relationship between a document and its summary.
3.7.2 Data Format Specification¶
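The original draft leaves this format unspecified. A plausible minimal format, following the required `text`/`summary` fields from Section 5.1, is sketched below; the `relevance` field is a hypothetical human score (MTEB's SummEval data carries human ratings per machine summary):

```json
{
  "text": "Teks artikel lengkap tentang kebijakan Bank Indonesia ...",
  "summary": "BI menaikkan suku bunga acuan 25 basis poin",
  "relevance": 4.0
}
```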
3.7.3 Evaluation Protocol¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ SUMMARIZATION EVALUATION PIPELINE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. ENCODE DOCUMENT AND SUMMARY │
│ doc_emb = model.encode(document) │
│ sum_emb = model.encode(summary) │
│ │
│ 2. CALCULATE COSINE SIMILARITY │
│ similarity = cosine_similarity(doc_emb, sum_emb) │
│ │
│ 3. AGGREGATE SCORES │
│ mean_score = mean(similarities) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
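A sketch of this pipeline, assuming one summary per document; SummEval-style data instead scores multiple machine summaries per document and correlates similarities with human ratings:

```python
import numpy as np

def evaluate_summarization(model, documents, summaries):
    """Mean cosine similarity between each document and its paired summary."""
    doc_embs = model.encode(documents)
    sum_embs = model.encode(summaries)
    doc_embs = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sum_embs = sum_embs / np.linalg.norm(sum_embs, axis=1, keepdims=True)
    return float((doc_embs * sum_embs).sum(axis=1).mean())
```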
3.7.4 Example Datasets in MTEB¶
| Dataset | Domain | Pairs | Link |
|---|---|---|---|
| summeval-fr | News (French) | 447 | Adaptation example |
| summeval | News | 2,150 | Reference implementation |
3.7.5 Indonesian Adaptation¶
Challenge: No Indonesian summarization datasets currently exist for embedding evaluation.
Proposed Solution:
1. Translate existing summarization datasets
2. Create Indonesian news-summary pairs
3. Collaborate with Indonesian media organizations
3.8 Instruction Following¶
Purpose: Evaluate model's ability to follow task-specific instructions.
3.8.1 Task Definition¶
Instruction following (added in MMTEB 2025) evaluates whether an embedding model can condition its representations based on task instructions, enabling domain-aware retrieval.
3.8.2 Data Format Specification¶
{
"instruction": "Retrieve documents about Indonesian monetary policy",
"query": "kebijakan suku bunga Bank Indonesia 2024",
"expected": ["doc1", "doc3", "doc7"]
}
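One common convention for conditioning embeddings on an instruction is to prepend it to the query before encoding. The prompt template below is illustrative and model-specific; instruction-tuned embedding models (E5-style, for example) each define their own:

```python
def encode_with_instruction(model, instruction, queries):
    """Prepend the task instruction to each query before encoding (one convention)."""
    prompted = [f"Instruct: {instruction}\nQuery: {q}" for q in queries]
    return model.encode(prompted)
```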
3.8.3 Example Datasets in MTEB¶
| Dataset | Domain | Instructions | Queries | Link |
|---|---|---|---|---|
| InstructIR-mteb | Mixed | 17 | 500 | mteb/InstructIR-mteb |
| Core17InstructionRetrieval | News | Domain-specific | Custom | mteb/Core17InstructionRetrieval |
3.8.4 Indonesian Adaptation¶
Novel Contribution Opportunity: Indonesia-MTEB can be the first to introduce Indonesian instruction-following datasets.
Proposed Domains:
- Legal instruction retrieval
- Financial domain instruction
- Healthcare instruction
- Regional language instruction
4. Evaluation Metrics Deep-Dive¶
4.1 Complete Metrics Reference¶
| Metric | Formula | Range | Task | Interpretation |
|---|---|---|---|---|
| Accuracy | correct / total | [0,1] | Classification | Percentage correct |
| F1-Score | 2×P×R/(P+R) | [0,1] | Classification | Harmonic mean of precision/recall |
| V-Measure | 2×h×c/(h+c) | [0,1] | Clustering | Homogeneity + completeness |
| ARI | (RI-E)/(M-E) | [-1,1] | Clustering | Adjusted clustering similarity |
| NMI | I(C;K)/√(H(C)H(K)) | [0,1] | Clustering | Normalized mutual information |
| AP | Σ(P@k×rel)/R | [0,1] | Pair Class, Rerank | Average precision |
| MAP | mean(AP) | [0,1] | Retrieval, Rerank | Mean of average precision |
| MRR | mean(1/rank_first) | [0,1] | Retrieval, Rerank | Mean reciprocal rank |
| nDCG@k | DCG@k/IDCG@k | [0,1] | Retrieval, Rerank | Normalized discounted gain |
| Recall@k | rel@k/total | [0,1] | Retrieval | Recall at position k |
| Spearman ρ | rank correlation | [-1,1] | STS | Rank-based correlation |
| Pearson r | linear correlation | [-1,1] | STS | Linear correlation |
| Cosine Sim | (A·B)/(‖A‖‖B‖) | [-1,1] | Summarization | Cosine similarity |
4.2 Metric Selection by Task¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ MTEB METRIC SELECTION MATRIX │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ TASK │ PRIMARY │ SECONDARY │ TERTIARY │
│ ────────────────────────┼────────────────┼──────────────────┼──────────────│
│ Classification │ Accuracy │ F1-macro │ F1-micro │
│ Clustering │ V-measure │ ARI │ NMI │
│ Pair Classification │ AP │ Accuracy │ F1 │
│ Reranking │ MAP │ MRR │ nDCG@10 │
│ Retrieval │ nDCG@10 │ Recall@10 │ MAP │
│ STS │ Spearman │ Pearson │ - │
│ Summarization │ Cosine Sim │ - │ - │
│ Instruction Following │ Task-specific │ nDCG@10 │ Recall@10 │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
5. MTEB Dataset Format Standards¶
5.1 HuggingFace Dataset Structure¶
Standard MTEB Dataset Format:
from datasets import Dataset, DatasetDict
dataset = DatasetDict({
"train": Dataset.from_dict({
"text": [...], # Input text(s)
"label": [...], # Labels (for supervised tasks)
# Additional fields as needed
}),
"validation": Dataset.from_dict({...}),
"test": Dataset.from_dict({...})
})
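A concrete (hypothetical) instantiation for a small Indonesian sentiment classification set; the repository id in the last line is illustrative only:

```python
from datasets import Dataset, DatasetDict

dataset = DatasetDict({
    "train": Dataset.from_dict({
        "text": ["Pelayanan bank ini sangat memuaskan", "Antriannya terlalu lama"],
        "label": [1, 0],  # 1 = positive, 0 = negative
    }),
    "test": Dataset.from_dict({
        "text": ["Aplikasinya mudah digunakan"],
        "label": [1],
    }),
})
# dataset.push_to_hub("indonesiamteb/classification_sentiment_example")  # hypothetical repo
```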
Task-Specific Variations:
| Task | Required Fields | Optional Fields |
|---|---|---|
| Classification | text, label | - |
| Clustering | sentences, labels | main_category |
| Pair Classification | text1, text2, label | - |
| Reranking | query, documents, relevant | scores |
| Retrieval | corpus, queries, relevant_docs | domain |
| STS | text1, text2, score | dataset |
| Summarization | text, summary | source |
5.2 Dataset Card Template¶
Each MTEB dataset requires a README card:
---
dataset_name: "DatasetName"
language: ["id"]
license: "cc-by-4.0"
---
# DatasetName
## Dataset Description
Brief description of the dataset...
## Citation
```bibtex
@dataset{dataset_name,
title={Dataset Name},
author={...},
year={2026}
}
```

## Tasks
- TaskType1
- TaskType2

## Languages
- Indonesian (id)

## Dataset Statistics
6. MTEB v2 & MMTEB Updates¶
6.1 What's New in MTEB v2¶
| Feature | v1 | v2 | Impact |
|---|---|---|---|
| API | MTEB(tasks).run() | tasks.evaluate(model) | Simpler interface |
| Format | JSON only | JSON + Parquet | Faster loading |
| Modality | Text only | Text + Image | Multimodal support |
| Caching | Basic | Advanced with validation | Reproducibility |
| Leaderboard | Single | Multi-domain | Better organization |
6.2 MMTEB (ICLR 2025) Additions¶
New Features:
1. Instruction Following Tasks
   - 17 instruction types
   - Domain-aware retrieval
2. 100+ New Datasets
3. Long-Document Retrieval
   - Documents up to 32K tokens
   - Specialized evaluation
4. Code Retrieval
   - Programming-language specific
   - Semantic code search
5. Conversational Retrieval
   - Multi-turn dialogue context
   - Conversation history handling
6.3 API Migration Guide¶
Old (v1):
tasks = mteb.get_tasks(tasks=["Banking77Classification"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, eval_splits=["test"])
New (v2):
tasks = mteb.get_tasks(tasks=["Banking77Classification"])
for task in tasks:
results = task.evaluate(model, eval_splits=["test"])
7. Implementation Guide¶
7.1 Basic Evaluation Example¶
from sentence_transformers import SentenceTransformer
import mteb
# 1. Load model
model = SentenceTransformer('intfloat/multilingual-e5-large')
# 2. Select tasks
tasks = mteb.get_tasks(
tasks=["Banking77Classification", "STSBenchmark"],
languages=["id"] # Will filter for Indonesian tasks
)
# 3. Run evaluation
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(
model,
eval_splits=["test"],
output_folder="results/"
)
# 4. View results
for task_name, task_results in results.items():
print(f"{task_name}: {task_results}")
7.2 Custom Encoder Example¶
from sentence_transformers import SentenceTransformer

# MTEB only requires an object exposing an .encode() method (see Section 1.1);
# here we wrap a SentenceTransformer, but any embedding backend works.
class MyIndonesianEncoder:
    def __init__(self, model_name, device="cuda"):
        self.model = SentenceTransformer(model_name, device=device)
        self.device = device

    def encode(self, texts, batch_size=32, **kwargs):
        """Encode a list of texts into a (n_texts, dimension) array of embeddings."""
        return self.model.encode(
            texts,
            batch_size=batch_size,
            show_progress_bar=False,
        )

    @property
    def dimension(self):
        return self.model.get_sentence_embedding_dimension()
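Usage, with a model id taken from the table in Section 9.2:

```python
encoder = MyIndonesianEncoder("LazarusNLP/indonesian-sbert-base")
embeddings = encoder.encode(["Bank Indonesia menaikkan suku bunga acuan"])
print(embeddings.shape)  # (1, encoder.dimension)
```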
7.3 Result Format¶
{
"dataset_name": {
"test": {
"en": {
"main_score": 0.85,
"accuracy": 0.85,
"f1_macro": 0.82,
"evaluation_time": 12.5,
"footprint": {
"memory_mb": 512,
"model_parameters": 560000000
}
}
}
}
}
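Scores can be pulled back out of the saved JSON with ordinary file handling; a sketch with an illustrative file path (MTEB writes one JSON file per task under `output_folder`):

```python
import json
from pathlib import Path

# Path layout is illustrative; adjust to your output_folder structure
result_file = Path("results") / "dataset_name.json"
results = json.loads(result_file.read_text())
print(results["dataset_name"]["test"]["en"]["main_score"])  # e.g. 0.85
```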
8. Indonesia-MTEB Task Mapping¶
8.1 Proposed Dataset Distribution by Task¶
| Task Category | Target Count | Current ID Sources | Translation Needed | AI Generation Needed |
|---|---|---|---|---|
| Classification | 8-12 | 3 (IndoNLU, NusaX) | 4-5 | 2-3 |
| Clustering | 5-8 | 0 | 2-3 | 3-5 |
| Pair Classification | 3-5 | 0 | 2 | 1-2 |
| Reranking | 3-5 | 0 | 2 | 1-2 |
| Retrieval | 8-12 | 1 (MIRACL-ID) | 4-5 | 3-5 |
| STS | 5-8 | 0 | 3-4 | 2-3 |
| Summarization | 3-5 | 0 | 2 | 1-2 |
| Instruction Following | 3-5 | 0 | 0 | 3-5 |
| TOTAL | 38-55 | 4 | 19-23 | 16-26 |
8.2 Priority Matrix for Indonesia-MTEB¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ INDONESIA-MTEB TASK PRIORITY MATRIX │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ╔═════════════════════════════════════════════════════════════════════════╗│
│ ║ HIGH PRIORITY (Phase 1: Foundation) ║│
│ ║ ────────────────────────────────────────────────────────────────────── ║│
│ ║ Classification │ 8 datasets │ IndoNLU + Translation + Generation ║│
│ ║ Retrieval │ 8 datasets │ MIRACL-ID + Translation + Generation ║│
│ ║ Clustering │ 5 datasets │ Translation + Generation ║│
│ ║ STS │ 5 datasets │ Translation + Generation ║│
│ ╚═════════════════════════════════════════════════════════════════════════╝│
│ │
│ ╔═════════════════════════════════════════════════════════════════════════╗│
│ ║ MEDIUM PRIORITY (Phase 2: Coverage) ║│
│ ║ ────────────────────────────────────────────────────────────────────── ║│
│ ║ Pair Class. │ 4 datasets │ Translation + Generation ║│
│ ║ Reranking │ 4 datasets │ Translation + Generation ║│
│ ║ Summarization │ 3 datasets │ Translation + Generation ║│
│ ╚═════════════════════════════════════════════════════════════════════════╝│
│ │
│ ╔═════════════════════════════════════════════════════════════════════════╗│
│ ║ NOVEL CONTRIBUTION (Phase 3: Innovation) ║│
│ ║ ────────────────────────────────────────────────────────────────────── ║│
│ ║ Instruction Following │ 5 datasets │ AI Generation (Novel) ║│
│ ╚═════════════════════════════════════════════════════════════════════════╝│
│ │
└─────────────────────────────────────────────────────────────────────────────┘
8.3 Dataset Naming Convention¶
Proposed Indonesia-MTEB Naming:
indonesiamteb/{task}_{domain}_{source}
Examples:
- indonesiamteb/classification_sentiment_newsnlp
- indonesiamteb/clustering_news_wikipedia_id
- indonesiamteb/retrieval_wikipedia_miracl_id
- indonesiamteb/sts_news_translated_stsb
- indonesiamteb/instruction_retrieval_legal_generated
9. Technical Considerations¶
9.1 Performance Optimization¶
Encoding Speed:
| Technique | Speedup | Implementation |
|---|---|---|
| Batch encoding | 10-50x | encode(texts, batch_size=128) |
| GPU utilization | 5-20x | model.to("cuda") |
| Quantization | 2-4x | quantize_model=True |
| Caching | ∞ | cache_dir="cache/" |
Memory Optimization:
# For large corpora, stream the dataset instead of materializing it in memory
from datasets import load_dataset

corpus = load_dataset("mteb/MIRACLRetrieval", split="test", streaming=True)
for batch in corpus.iter(batch_size=1024):
    texts = batch["text"]  # encode each batch incrementally
9.2 Cross-Lingual Considerations¶
For Indonesian evaluation, consider:
- Script Compatibility: Indonesian uses Latin script (same as English)
- Tokenization: Different tokenizers may affect embedding quality
- Domain Transfer: English-pretrained models may need adaptation
Recommended Models for Indonesian:
| Model | Parameters | Indonesian Training | MTEB ID Score |
|---|---|---|---|
| intfloat/multilingual-e5-large | 560M | Yes (100+ langs) | Baseline |
| BAAI/bge-m3 | 600M | Yes (multilingual) | To evaluate |
| sentence-transformers/LaBSE | 470M | Yes | To evaluate |
| LazarusNLP/indonesian-sbert-base | 110M | Yes (ID-only) | To evaluate |
9.3 Reproducibility¶
Essential for MTEB integration:
import numpy as np
import torch
import mteb

# Set seeds
np.random.seed(42)
torch.manual_seed(42)

# Use deterministic algorithms
torch.use_deterministic_algorithms(True)

# Log model details (model_name / model_revision come from your evaluation config)
print(f"Model: {model_name}")
print(f"Revision: {model_revision}")
print(f"MTEB version: {mteb.__version__}")
10. References¶
Primary Sources¶
- Muennighoff, N., et al. (2023). "MTEB: Massive Text Embedding Benchmark". Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2023). arXiv:2210.07316
- Enevoldsen, K., et al. (2025). "MMTEB: Massive Multilingual Text Embedding Benchmark". International Conference on Learning Representations (ICLR 2025). arXiv:2502.13595
- Chung, I., et al. (2025). "Maintaining MTEB: Towards Long Term Usability and Reproducibility of Embedding Benchmarks". arXiv:2506.21182
Technical Documentation¶
- MTEB GitHub Repository: github.com/embeddings-benchmark/mteb
- MTEB v2 Introduction: huggingface.co/blog/isaacchung/mteb-v2
- Sentence Transformers MTEB Guide: sbert.net/docs/sentence_transformer/usage/mteb_evaluation.html
Evaluation Metrics¶
- Rosenberg, A., & Hirschberg, J. (2007). "V-Measure: A Conditional Entropy-Based External Cluster Evaluation Measure". EMNLP-CoNLL 2007.
- Weaviate - Retrieval Evaluation Metrics: weaviate.io/blog/retrieval-evaluation-metrics
- Evidently AI - NDCG Explained: evidentlyai.com/ranking-metrics/ndcg-metric
Dataset Examples¶
- MTEB Datasets Hub: huggingface.co/mteb
- MIRACL (Multilingual Information Retrieval): github.com/project-miracl/miracl
11. Document Status¶
[!NOTE] Next Document: Document 03 - Existing Indonesian Datasets
This document provides detailed analysis of existing Indonesian NLP datasets, their MTEB compatibility, and aggregation strategies for Indonesia-MTEB.
Change Log:
| Version | Date | Changes | Author |
|---|---|---|---|
| 1.0 | 2026-01-25 | Initial version | Research Team |
| 2.0 | 2026-01-25 | Enhanced edition with detailed analysis, MMTEB updates, implementation guides | Research Team |
This document is a living record and will be updated as research progresses.