Project: Indonesia-MTEB Benchmark
Document: 02 - MTEB Structure & Task Categories Analysis
Version: 2.0 (Enhanced Edition)
Last Updated: 2026-01-25
Status: Research Phase - Foundation Planning
[!NOTE] Document Navigation
This is the second of twelve documents comprising the Indonesia-MTEB Benchmark research foundation.
| Document | Title | Focus Area |
|---|---|---|
| 01 | Project Overview & Scope | Foundation document |
| 02 | MTEB Structure Analysis | Current Document |
| 03 | Existing Indonesian Datasets | Data aggregation sources |
| 04 | Regional MTEB Methodologies | Precedent analysis |
| 05 | Translation Models Benchmark | Model selection & evaluation |
| 06 | AI Dataset Generation Methods | Novel data creation |
| 07 | Validation Strategies | Quality assurance protocols |
| 08 | ACL Dataset Paper Standards | Publication requirements |
| 09 | Novelty Angle & Publication | Research contribution |
| 10 | Implementation Roadmap | Technical execution plan |
| 11 | Python Package Development | Software architecture |
| 12 | Summary & Quick Reference | Consolidated reference |
MTEB Structure & Task Categories Analysis¶
"Understanding MTEB's internal architecture, evaluation protocols, and dataset formats is the foundation for building Indonesia-MTEB. This document provides a comprehensive technical deep-dive into the Massive Text Embedding Benchmark framework."
Table of Contents¶
- MTEB Architecture Overview
- The 8 Core Task Categories
- Detailed Task Analysis
- Evaluation Metrics Deep-Dive
- MTEB Dataset Format Standards
- MTEB v2 & MMTEB Updates
- Implementation Guide
- Indonesia-MTEB Task Mapping
- Technical Considerations
- References
1. MTEB Architecture Overview¶
1.1 Framework Specification¶
MTEB (Massive Text Embedding Benchmark) is a Python-based evaluation framework that provides standardized protocols for assessing text embedding models across diverse NLP tasks.
┌─────────────────────────────────────────────────────────────────────────────┐
│ MTEB FRAMEWORK ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ╔═════════════════════════════════════════════════════════════════════════╗│
│ ║ INPUT LAYER ║│
│ ║ ┌────────────────────────────────────────────────────────────────────┐ ║│
│ ║ │ Model Input: Text / Sentence Pair / Query-Document Pair │ ║│
│ ║ │ Format: Raw text strings │ ║│
│ ║ └────────────────────────────────────────────────────────────────────┘ ║│
│ ╚═════════════════════════════════════════════════════════════════════════╝│
│ │ │
│ ▼ │
│ ╔═════════════════════════════════════════════════════════════════════════╗│
│ ║ ENCODING LAYER ║│
│ ║ ┌────────────────────────────────────────────────────────────────────┐ ║│
│ ║ │ Embedding Model Interface │ ║│
│ ║ │ - SentenceTransformer │ ║│
│ ║ │ - Custom encoder with .encode() method │ ║│
│ ║ │ - Output: Dense vectors (typically 384-4096 dimensions) │ ║│
│ ║ └────────────────────────────────────────────────────────────────────┘ ║│
│ ╚═════════════════════════════════════════════════════════════════════════╝│
│ │ │
│ ▼ │
│ ╔═════════════════════════════════════════════════════════════════════════╗│
│ ║ TASK-SPECIFIC EVALUATORS ║│
│ ║ ┌────────────────────────────────────────────────────────────────────┐ ║│
│ ║ │ │ CLASSIFICATION │ CLUSTERING │ RETRIEVAL │ STS │ ║│
│ ║ │ ├──────────────────┼──────────────┼─────────────┼──────────────┤ ║│
│ ║ │ │ PAIR CLASS. │ RERANKING │ INSTRUCT. │ SUMMARIZATION║ ║│
│ ║ │ └──────────────────┴──────────────┴─────────────┴──────────────┘ ║│
│ ║ │ │ ║│
│ ║ │ Each evaluator: │ ║│
│ ║ │ 1. Loads dataset (train/validation/test splits) │ ║│
│ ║ │ 2. Encodes text using model │ ║│
│ ║ │ 3. Computes task-specific metrics │ ║│
│ ║ │ 4. Returns structured results │ ║│
│ ║ └────────────────────────────────────────────────────────────────────┘ ║│
│ ╚═════════════════════════════════════════════════════════════════════════╝│
│ │ │
│ ▼ │
│ ╔═════════════════════════════════════════════════════════════════════════╗│
│ ║ AGGREGATION LAYER ║│
│ ║ ┌────────────────────────────────────────────────────────────────────┐ ║│
│ ║ │ Results Collection & Aggregation │ ║│
│ ║ │ - Per-dataset scores │ ║│
│ ║ │ - Per-task averages │ ║│
│ ║ │ - Overall benchmark score │ ║│
│ ║ │ - Leaderboard formatting │ ║│
│ ║ └────────────────────────────────────────────────────────────────────┘ ║│
│ ╚═════════════════════════════════════════════════════════════════════════╝│
│ │
└─────────────────────────────────────────────────────────────────────────────┘
1.2 MTEB Evolution Timeline¶
| Version | Code Name | Datasets | Languages | Tasks | Year | Key Innovation |
|---|---|---|---|---|---|---|
| MTEB v1 | Original | 58 | 112 | 8 | 2022 (Oct) | Initial unification |
| MTEB v1.1 | EACL | 58 | 112 | 8 | 2023 (Apr) | EACL publication |
| MTEB v2 | Multilingual | 284 | 112 | 8 | 2024 | Expanded multilingual |
| MMTEB | ICLR 2025 | 500+ | 1000+ | 9+ | 2025 | Instruction Following |
| Current | Production | 1,308+ | 1000+ | 8+ | 2026 | Community-driven |
Sources:
- MTEB Original: arxiv.org/abs/2210.07316
- MMTEB: arxiv.org/abs/2502.13595
- MTEB v2: huggingface.co/blog/isaacchung/mteb-v2
1.3 Repository Structure¶
mteb/
├── mteb/
│ ├── __init__.py
│ ├── benchmark.py # Main benchmark class
│ ├── encoder_interface.py # Model interface
│ ├── abstasks/ # Abstract task definitions
│ │ ├── __init__.py
│ │ ├── AbsTask.py # Base task class
│ │ ├── classification/ # Classification tasks
│ │ ├── clustering/ # Clustering tasks
│ │ ├── retrieval/ # Retrieval tasks
│ │ ├── sts/ # STS tasks
│ │ ├── pair_classification/ # Pair classification
│ │ ├── reranking/ # Reranking tasks
│ │ ├── summarization/ # Summarization tasks
│ │ └── instruction_following/ # Instruction tasks
│ └── models/ # Model registry
├── scripts/ # Evaluation scripts
├── tests/ # Test suite
└── docs/ # Documentation
[!TIP] For Indonesia-MTEB: We will follow the same structure, creating tasks in parallel categories while maintaining full API compatibility.
2. The 8 Core Task Categories¶
2.1 Task Summary Matrix¶
| # | Task | MTEB Type Code | Primary Metric | Secondary Metrics | Typical Input | Typical Output |
|---|---|---|---|---|---|---|
| 1 | Classification | s2s / t2c | Accuracy | F1 (macro/micro) | Single text | Class label |
| 2 | Clustering | s2s | V-measure | ARI, NMI | Multiple texts | Cluster assignment |
| 3 | Pair Classification | s2s | Average Precision | Accuracy, F1 | Text pair | Binary label |
| 4 | Reranking | s2s | MAP | MRR, nDCG | Query + doc list | Reordered list |
| 5 | Retrieval | s2p / s2s | nDCG@k | Recall@k, MAP | Query + corpus | Ranked docs |
| 6 | STS | s2s | Spearman ρ | Pearson r | Text pair | Similarity score |
| 7 | Summarization | s2s | Cosine Similarity | ROUGE | Text + summary | Similarity |
| 8 | Instruction Following | s2p | Task-specific | nDCG@k | Instruction + query | Retrieved result |
Type Code Legend:
- s2s: Sentence-to-Sentence (both inputs are sentence-length)
- s2p: Sentence-to-Paragraph (query sentence, doc paragraph)
- t2c: Text-to-Category (text to class label)
2.2 Task Distribution in MTEB¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ MTEB TASK DISTRIBUTION (1,308+ DATASETS) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ CLASSIFICATION ████████████████████████████████████░░░░ ~35% │
│ RETRIEVAL ████████████████████████░░░░░░░░░░░░░░░░ ~25% │
│ CLUSTERING ████████████████░░░░░░░░░░░░░░░░░░░░░░░ ~15% │
│ STS ████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░ ~10% │
│ PAIR CLASS. ████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ ~7% │
│ RERANKING ██████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ ~5% │
│ INSTRUCTION FOLL. ████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ ~3% │
│ SUMMARIZATION ██░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ ~2% │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
[!NOTE] Implication for Indonesia-MTEB: Classification and Retrieval tasks dominate the benchmark. Our Indonesian datasets should reflect similar proportions for comparability.
3. Detailed Task Analysis¶
3.1 Classification¶
Purpose: Assign predefined category labels to individual text instances.
3.1.1 Task Definition¶
Classification tasks evaluate an embedding model's ability to capture semantic features that distinguish between predefined categories. The model must encode text such that similar texts cluster in embedding space.
3.1.2 Data Format Specification¶
{
"text": "Bank Indonesia menaikkan suku bunga acuan sebesar 25 basis point",
"label": 0,
"split": "train"
}
HuggingFace Dataset Structure:
DatasetDict({
'train': Dataset({
'text': ['...', '...', ...],
'label': [0, 1, 2, ...],
}),
'validation': Dataset({...}),
'test': Dataset({...}),
})
3.1.3 Evaluation Protocol¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ CLASSIFICATION EVALUATION PIPELINE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. ENCODING │
│ embeddings = model.encode(test_texts) │
│ │
│ 2. CLASSIFIER TRAINING │
│ classifier = LogisticRegression() │
│ classifier.fit(train_embeddings, train_labels) │
│ │
│ 3. PREDICTION │
│ predictions = classifier.predict(test_embeddings) │
│ │
│ 4. METRIC CALCULATION │
│ accuracy = accuracy_score(true_labels, predictions) │
│ f1_macro = f1_score(true_labels, predictions, average='macro') │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
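Condensed into runnable form, the pipeline looks like the sketch below. It assumes `model` exposes an `.encode()` method (as in Section 1.1) and that splits are already loaded; MTEB's actual evaluator additionally repeats this over several down-sampled training sets and averages the scores.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

def evaluate_classification(model, train_texts, train_labels, test_texts, test_labels):
    """Fit a linear probe on frozen embeddings, then score it (MTEB-style sketch)."""
    train_embs = model.encode(train_texts)   # embeddings are frozen; no fine-tuning
    test_embs = model.encode(test_texts)
    clf = LogisticRegression(max_iter=1000)  # linear probe over the embeddings
    clf.fit(train_embs, train_labels)
    preds = clf.predict(test_embs)
    return {
        "accuracy": accuracy_score(test_labels, preds),
        "f1_macro": f1_score(test_labels, preds, average="macro"),
    }
```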
3.1.4 Example Datasets in MTEB¶
| Dataset | Domain | Classes | Train Size | Test Size | Link |
|---|---|---|---|---|---|
| ArxivClassification | Academic | 11 | 35,056 | 4,382 | mteb/ArxivClassification |
| Banking77Classification | Banking | 77 | 10,003 | 3,080 | mteb/Banking77Classification |
| EmotionClassification | Social Media | 6 | 16,000 | 2,000 | mteb/EmotionClassification |
| MassiveIntentClassification | E-commerce | 60 | 164,603 | 10,000 | mteb/amazon_massive_intent |
| TweetSentimentMultilingual | Social Media | 3 | 10,000 | 1,229 | mteb/tweet_sentiment_multilingual |
3.1.5 Indonesian Adaptation¶
Potential Indonesian Classification Datasets:
| Source | Task | Classes | Status | Notes |
|---|---|---|---|---|
| IndoNLU | SMSA (Sentiment) | 3 (pos/neg/neu) | Available | ~11,000 tweets |
| IndoNLU | EmoT (Emotion) | 5 | Available | ~3,400 tweets |
| NusaX | Sentiment | 3 | Available | 10 languages + ID |
| IndoNLU | POS Tagging | 23 | Available | Requires adaptation |
[!TIP] Indonesia-MTEB Classification Target: 8-12 datasets covering sentiment, topic, intent, and domain-specific classification.
3.2 Clustering¶
Purpose: Group similar texts without predefined labels (unsupervised learning).
3.2.1 Task Definition¶
Clustering evaluates whether an embedding model captures semantic similarity such that related texts form tight clusters in embedding space. Unlike classification, no labeled training step is involved: a clustering algorithm groups the embeddings, and ground-truth labels are used only to score the resulting assignment.
3.2.2 Data Format Specification¶
{
"sentences": [
"Bank Indonesia menaikkan suku bunga acuan",
"BI rate naik 25 basis point",
"Timnas Indonesia menang 3-0"
],
"labels": [0, 0, 1]
}
Key Characteristics:
- Labels are for evaluation only—never used during clustering
- Clustering algorithm: typically k-means or similar
- Number of clusters: provided (fixed k)
3.2.3 Evaluation Protocol¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ CLUSTERING EVALUATION PIPELINE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. ENCODING │
│ embeddings = model.encode(sentences) │
│ │
│ 2. CLUSTERING │
│ from sklearn.cluster import KMeans │
│ kmeans = KMeans(n_clusters=n_classes) │
│ pred_labels = kmeans.fit_predict(embeddings) │
│ │
│ 3. METRIC CALCULATION │
│ v_measure = v_measure_score(true_labels, pred_labels) │
│ ari = adjusted_rand_score(true_labels, pred_labels) │
│ nmi = normalized_mutual_info_score(true_labels, pred_labels) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
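The same pipeline as a minimal runnable sketch; MTEB evaluates each clustering set in a dataset and averages, whereas this shows a single pass:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import (adjusted_rand_score, normalized_mutual_info_score,
                             v_measure_score)

def evaluate_clustering(model, sentences, true_labels):
    """Cluster frozen embeddings with k-means and score against gold labels."""
    embeddings = model.encode(sentences)
    n_clusters = len(set(true_labels))  # k is fixed to the number of gold classes
    pred_labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
    return {
        "v_measure": v_measure_score(true_labels, pred_labels),  # primary metric
        "ari": adjusted_rand_score(true_labels, pred_labels),
        "nmi": normalized_mutual_info_score(true_labels, pred_labels),
    }
```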
3.2.4 Clustering Metrics Explained¶
[!NOTE] V-Measure (Primary Metric):
V-Measure is the harmonic mean of homogeneity and completeness:
V = 2 × (homogeneity × completeness) / (homogeneity + completeness)
homogeneity = 1 - H(C|K) / H(C) # Each cluster contains only one class
completeness = 1 - H(K|C) / H(K) # All members of a class are in one cluster
where H(C|K) is conditional entropy, H(C) is entropy
- Range: [0, 1]
- Interpretation: 1 = perfect clustering, 0 = random
- Advantage: Independent of label permutation (unlike purity)
Adjusted Rand Index (ARI):
ARI = (RI - Expected_RI) / (Max_RI - Expected_RI)
where RI = (TP + TN) / (TP + TN + FP + FN) (Rand Index)
- Range: [-1, 1]
- Interpretation: 1 = perfect, 0 = random, < 0 = worse than random
- Use Case: Robust to different cluster sizes
Normalized Mutual Information (NMI):
NMI = I(C;K) / sqrt(H(C) × H(K))
where I(C;K) is mutual information between true and predicted labels
- Range: [0, 1]
- Interpretation: 1 = perfect clustering
3.2.5 Example Datasets in MTEB¶
| Dataset | Domain | Clusters | Samples | Type | Link |
|---|---|---|---|---|---|
| reddit-clustering | Social Media | 199 | 1.2M | P2P | mteb/reddit-clustering |
| stackexchange-clustering-p2p | Q&A | 121 | 105K | P2P | mteb/stackexchange-clustering-p2p |
| arxiv-clustering-p2p | Academic | 30 | 96K | P2P | mteb/arxiv-clustering-p2p |
| wikipedia-clustering | Encyclopedia | 10 | 70K | S2S | mteb/wikipedia-clustering |
Type Legend:
- P2P: Paragraph-to-Paragraph (longer text spans, e.g., title plus body)
- S2S: Sentence-to-Sentence (single sentences)
3.2.6 Indonesian Adaptation¶
Proposed Indonesian Clustering Datasets:
| Domain | Source | Clusters | Method | Status |
|---|---|---|---|---|
| News | IndoNLU articles | 10-15 | Aggregation | Available |
| Social Media | Twitter/Instagram | 20-50 | AI Generation | Planned |
| Wikipedia | ID Wikipedia | 10-30 | Aggregation | Available |
| Legal | Indonesian court docs | 5-10 | AI Generation | Gap |
3.3 Pair Classification¶
Purpose: Determine if two texts are semantically related (binary classification).
3.3.1 Task Definition¶
Pair classification evaluates whether an embedding model can distinguish between related and unrelated text pairs based on semantic similarity.
3.3.2 Data Format Specification¶
{
"text1": "Bank Indonesia menaikkan suku bunga",
"text2": "BI rate naik 25 basis point",
"label": 1
}
Label encoding:
- 1: Related / Duplicate / Paraphrase
- 0: Not related / Different meaning
3.3.3 Evaluation Protocol¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ PAIR CLASSIFICATION EVALUATION PIPELINE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. ENCODING │
│ emb1 = model.encode(text1_list) │
│ emb2 = model.encode(text2_list) │
│ │
│ 2. SIMILARITY CALCULATION │
│ similarity = cosine_similarity(emb1, emb2) │
│ │
│ 3. THRESHOLD CLASSIFICATION │
│ predictions = (similarity > threshold).astype(int) │
│ │
│ 4. METRIC CALCULATION │
│ ap = average_precision_score(true_labels, similarity) │
│ accuracy = accuracy_score(true_labels, predictions) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
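A minimal sketch of steps 1-4, assuming aligned lists of pair texts and binary labels. The fixed `threshold` parameter is illustrative only: average precision needs no threshold, and MTEB's evaluator searches for the accuracy-maximizing threshold rather than fixing one.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score

def row_cosine(a, b):
    """Cosine similarity of each aligned row pair in two embedding matrices."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return (a * b).sum(axis=1)

def evaluate_pair_classification(model, texts1, texts2, labels, threshold=0.7):
    sims = row_cosine(model.encode(texts1), model.encode(texts2))
    preds = (sims > threshold).astype(int)
    return {
        "average_precision": average_precision_score(labels, sims),  # primary metric
        "accuracy": accuracy_score(labels, preds),
    }
```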
3.3.4 Example Datasets in MTEB¶
| Dataset | Domain | Pairs | Type | Link |
|---|---|---|---|---|
| twitterurlcorpus-pairclassification | Social Media | 50K | Paraphrase | mteb/twitterurlcorpus-pairclassification |
| quora-duplicates-questions | Q&A | 400K | Duplicate | mteb/quora-duplicates-questions |
| stackoverflow-dupequestions | Technical | 300K | Duplicate | mteb/stackoverflow-dupequestions |
3.3.5 Indonesian Adaptation¶
Proposed Indonesian Pair Classification Datasets:
| Domain | Type | Source | Status |
|---|---|---|---|
| News Headlines | Paraphrase | Translation + Generation | Planned |
| Social Media | Duplicate | Twitter/Instagram | Gap |
| Q&A | Duplicate | Kaskus/StackOverflow ID | Gap |
3.4 Reranking¶
Purpose: Reorder retrieved documents by relevance to a query.
3.4.1 Task Definition¶
Reranking evaluates whether an embedding model can refine an initial document ranking, placing the most relevant documents at the top.
3.4.2 Data Format Specification¶
{
"query": "dampak kenaikan suku bunga pada ekonomi Indonesia",
"documents": [
{"text": "...", "id": "doc1"},
{"text": "...", "id": "doc2"},
{"text": "...", "id": "doc3"}
],
"relevant": ["doc1", "doc3"]
}
3.4.3 Evaluation Protocol¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ RERANKING EVALUATION PIPELINE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. ENCODE QUERY AND DOCUMENTS │
│ query_emb = model.encode(query) │
│ doc_embs = model.encode(documents) │
│ │
│ 2. CALCULATE QUERY-DOC SIMILARITY │
│ scores = cosine_similarity(query_emb, doc_embs) │
│ │
│ 3. RANK DOCUMENTS BY SCORE │
│ ranked_docs = argsort(scores, descending=True) │
│ │
│ 4. METRIC CALCULATION │
│ MAP = mean([ap_score(relevances, rankings)]) │
│ MRR = mean([1/rank_first_relevant]) │
│ nDCG@k = dcg@k / ideal_dcg@k │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
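A per-query sketch of the pipeline above, using the `documents` format from 3.4.2 and returning the reciprocal rank for one query; MAP and nDCG follow the same rank-then-score pattern:

```python
import numpy as np

def rerank_query(model, query, documents, relevant_ids):
    """Rank candidate documents by cosine similarity; return the reciprocal rank."""
    query_emb = model.encode([query])[0]
    doc_embs = model.encode([doc["text"] for doc in documents])
    # cosine similarity = dot product of L2-normalized vectors
    query_emb = query_emb / np.linalg.norm(query_emb)
    doc_embs = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    order = np.argsort(-(doc_embs @ query_emb))          # best-scoring docs first
    ranked_ids = [documents[i]["id"] for i in order]
    for rank, doc_id in enumerate(ranked_ids, start=1):  # 1/position of first hit
        if doc_id in set(relevant_ids):
            return 1.0 / rank
    return 0.0
```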
3.4.4 Reranking Metrics Explained¶
MAP (Mean Average Precision):
For each query:
AP = (1/R) × Σ (precision_at_k × relevance_at_k)
where R = total relevant documents
MAP = mean(AP over all queries)
- Range: [0, 1]
- Interpretation: Average precision across all recall points
MRR (Mean Reciprocal Rank):
- Range: [0, 1]
- Interpretation: Focus on first relevant document position
nDCG@k (Normalized Discounted Cumulative Gain):
- Range: [0, 1]
- Interpretation: Ranking quality at position k
3.4.5 Example Datasets in MTEB¶
| Dataset | Domain | Queries | Avg Docs | Link |
|---|---|---|---|---|
| MIRACLReranking | Wikipedia | 12K | 29 | mteb/MIRACLReranking |
| stackoverflow-qa | Technical | 150K | 30 | mteb/StackOverflowQA |
3.4.6 Indonesian Adaptation¶
Proposed Indonesian Reranking Datasets:
| Domain | Source | Queries | Method | Status |
|---|---|---|---|---|
| Wikipedia | MIRACL-ID | ~500 | Translation | Available |
| News | Indonesian news sites | ~1000 | Generation | Gap |
| Legal | Court documents | ~500 | Generation | Gap |
3.5 Retrieval¶
Purpose: Find relevant documents from a large corpus for a given query.
3.5.1 Task Definition¶
Retrieval is the core information retrieval task, evaluating whether an embedding model can identify relevant documents from a large collection.
3.5.2 Data Format Specification¶
{
"query": "teks query dalam bahasa Indonesia",
"corpus": [
{"id": "doc1", "text": "...", "title": "..."},
{"id": "doc2", "text": "...", "title": "..."}
],
"relevant_docs": {
"query_id": ["doc1", "doc3", "doc7"]
}
}
Split Configuration:
- Dev: small corpus for quick evaluation
- Test: full corpus for the benchmark
3.5.3 Evaluation Protocol¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ RETRIEVAL EVALUATION PIPELINE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. ENCODE CORPUS (done once, cached) │
│ corpus_embs = model.encode([doc["text"] for doc in corpus]) │
│ │
│ 2. FOR EACH QUERY: │
│ a. Encode query │
│ query_emb = model.encode(query) │
│ │
│ b. Calculate similarities │
│ scores = cosine_similarity(query_emb, corpus_embs) │
│ │
│ c. Rank and retrieve top-k │
│ ranked_indices = argsort(scores, descending=True)[:k] │
│ │
│ 3. CALCULATE METRICS ACROSS ALL QUERIES │
│ nDCG@k = mean([ndcg_at_k(query_relevances, query_rankings)]) │
│ Recall@k = mean([recall_at_k(query_relevances, query_rankings)]) │
│ MAP = mean([average_precision(query_relevances, query_scores)]) │
│ MRR = mean([1/rank_first_relevant]) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
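The step-3 metrics reduce to a few lines in their binary-relevance form; a sketch is shown below (MIRACL-style graded relevance would replace the 0/1 gains):

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """nDCG@k with binary gains: DCG of the ranking over DCG of the ideal ranking."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc_id in enumerate(ranked_ids[:k]) if doc_id in relevant_ids)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant_ids), k)))
    return dcg / idcg if idcg > 0 else 0.0

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of all relevant documents that appear in the top k."""
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)
```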
3.5.4 Example Datasets in MTEB¶
| Dataset | Domain | Corpus Size | Queries | Avg Relevant | Link |
|---|---|---|---|---|---|
| MIRACLRetrieval | Wikipedia | 1.1M (ID: ~50K) | 12K | 1.5 | mteb/MIRACLRetrieval |
| quora | Q&A | 1M | 10K | 1.5 | mteb/quora |
| arguana | Arguments | 24K | 1.4K | 2.8 | mteb/arguana |
| fiqa | Finance | 57K | 648 | 1.6 | mteb/fiqa |
| scidocs | Scientific | 25K | 1K | 4.8 | mteb/scidocs |
3.5.5 Indonesian Adaptation¶
Available Indonesian Retrieval Resources:
| Resource | Corpus Size | Queries | Status | Notes |
|---|---|---|---|---|
| MIRACL-ID | ~50K Wikipedia | ~500 | Available | Part of MIRACL |
| Wikipedia ID | Full | Custom | Available | Requires query set |
| IndoNLG (news) | ~50K | Custom | Available | Domain-specific |
| Legal documents | ~100K | Custom | Gap | Requires creation |
Proposed Indonesia-MTEB Retrieval Datasets:
| Domain | Corpus | Queries | Method | Priority |
|---|---|---|---|---|
| Wikipedia | 50K | 500 | MIRACL-ID adaptation | High |
| News | 30K | 300 | Translation + generation | High |
| Legal | 20K | 200 | AI generation | Medium |
| FAQ | 10K | 200 | Industry collaboration | Medium |
3.6 STS (Semantic Textual Similarity)¶
Purpose: Predict similarity scores for text pairs.
3.6.1 Task Definition¶
STS evaluates whether an embedding model captures fine-grained semantic similarity, correlating with human judgment of text relatedness.
3.6.2 Data Format Specification¶
{
"text1": "Bank Indonesia menaikkan suku bunga",
"text2": "BI rate naik 25 basis point",
"score": 4.5
}
Score Scales:
- 0-5 scale: STS-B, SICK-R
- 0-1 scale: normalized variants
- Binary: some simplified datasets
3.6.3 Evaluation Protocol¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ STS EVALUATION PIPELINE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. ENCODE BOTH SENTENCES │
│ emb1 = model.encode(text1_list) │
│ emb2 = model.encode(text2_list) │
│ │
│ 2. CALCULATE COSINE SIMILARITY │
│ similarities = cosine_similarity(emb1, emb2) │
│ │
│ 3. CORRELATE WITH HUMAN SCORES │
│ spearman = spearmanr(similarities, human_scores)[0] │
│ pearson = pearsonr(similarities, human_scores)[0] │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
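As a runnable sketch of the same three steps (the row-wise cosine avoids building a full N×N similarity matrix):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate_sts(model, texts1, texts2, human_scores):
    """Correlate cosine similarities of sentence pairs with human judgments."""
    emb1 = model.encode(texts1)
    emb2 = model.encode(texts2)
    emb1 = emb1 / np.linalg.norm(emb1, axis=1, keepdims=True)
    emb2 = emb2 / np.linalg.norm(emb2, axis=1, keepdims=True)
    sims = (emb1 * emb2).sum(axis=1)  # row-wise cosine similarity
    return {
        "spearman": spearmanr(sims, human_scores)[0],  # primary MTEB metric
        "pearson": pearsonr(sims, human_scores)[0],
    }
```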
3.6.4 STS Metrics Explained¶
Spearman's Rank Correlation (ρ):
- Range: [-1, 1]
- Interpretation: Monotonic relationship (rank-based)
- Robustness: Less sensitive to outliers than Pearson
Pearson Correlation (r):
- Range: [-1, 1]
- Interpretation: Linear relationship
- Use Case: When scores are approximately normally distributed
3.6.5 Example Datasets in MTEB¶
| Dataset | Domain | Scale | Pairs | Link |
|---|---|---|---|---|
| stsbenchmark-sts | General | 0-5 | 8,628 | mteb/stsbenchmark-sts |
| sickr-sts | General | 0-5 | 4,500 | mteb/sickr-sts |
| biosts-sts | Biomedical | 0-5 | 600 | mteb/biosts-sts |
3.6.6 Indonesian Adaptation¶
Challenge: Limited Indonesian STS datasets exist.
Proposed Indonesia-MTEB STS Datasets:
| Domain | Pairs | Method | Status |
|---|---|---|---|
| News Headlines | 1,000 | Translation (STS-B) | Planned |
| Social Media | 500 | AI Generation | Gap |
| General | 2,000 | Human annotation | Gap |
| Technical | 300 | Domain-specific | Gap |
3.7 Summarization¶
Purpose: Evaluate if summary captures document semantics.
3.7.1 Task Definition¶
Summarization tasks assess whether an embedding model captures the semantic relationship between a document and its summary.
3.7.2 Data Format Specification¶
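The original draft leaves this format unspecified. A plausible minimal format, following the required `text`/`summary` fields from Section 5.1, is sketched below; the `relevance` field is a hypothetical human score (MTEB's SummEval data carries human ratings per machine summary):

```json
{
  "text": "Teks artikel lengkap tentang kebijakan Bank Indonesia ...",
  "summary": "BI menaikkan suku bunga acuan 25 basis poin",
  "relevance": 4.0
}
```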
3.7.3 Evaluation Protocol¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ SUMMARIZATION EVALUATION PIPELINE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. ENCODE DOCUMENT AND SUMMARY │
│ doc_emb = model.encode(document) │
│ sum_emb = model.encode(summary) │
│ │
│ 2. CALCULATE COSINE SIMILARITY │
│ similarity = cosine_similarity(doc_emb, sum_emb) │
│ │
│ 3. AGGREGATE SCORES │
│ mean_score = mean(similarities) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
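A sketch of this pipeline, assuming one summary per document; SummEval-style data instead scores multiple machine summaries per document and correlates similarities with human ratings:

```python
import numpy as np

def evaluate_summarization(model, documents, summaries):
    """Mean cosine similarity between each document and its paired summary."""
    doc_embs = model.encode(documents)
    sum_embs = model.encode(summaries)
    doc_embs = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sum_embs = sum_embs / np.linalg.norm(sum_embs, axis=1, keepdims=True)
    return float((doc_embs * sum_embs).sum(axis=1).mean())
```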
3.7.4 Example Datasets in MTEB¶
| Dataset | Domain | Pairs | Link |
|---|---|---|---|
| summeval-fr | News (French) | 447 | Adaptation example |
| summeval | News | 2,150 | Reference implementation |
3.7.5 Indonesian Adaptation¶
Challenge: No Indonesian summarization datasets currently exist for embedding evaluation.
Proposed Solution:
1. Translate existing summarization datasets
2. Create Indonesian news-summary pairs
3. Collaborate with Indonesian media organizations
3.8 Instruction Following¶
Purpose: Evaluate model's ability to follow task-specific instructions.
3.8.1 Task Definition¶
Instruction following (added in MMTEB 2025) evaluates whether an embedding model can condition its representations based on task instructions, enabling domain-aware retrieval.
3.8.2 Data Format Specification¶
{
"instruction": "Retrieve documents about Indonesian monetary policy",
"query": "kebijakan suku bunga Bank Indonesia 2024",
"expected": ["doc1", "doc3", "doc7"]
}
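One common convention for conditioning embeddings on an instruction is to prepend it to the query before encoding. The prompt template below is illustrative and model-specific; instruction-tuned embedding models (E5-style, for example) each define their own:

```python
def encode_with_instruction(model, instruction, queries):
    """Prepend the task instruction to each query before encoding (one convention)."""
    prompted = [f"Instruct: {instruction}\nQuery: {q}" for q in queries]
    return model.encode(prompted)
```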
3.8.3 Example Datasets in MTEB¶
| Dataset | Domain | Instructions | Queries | Link |
|---|---|---|---|---|
| InstructIR-mteb | Mixed | 17 | 500 | mteb/InstructIR-mteb |
| Core17InstructionRetrieval | News | Domain-specific | Custom | mteb/Core17InstructionRetrieval |
3.8.4 Indonesian Adaptation¶
Novel Contribution Opportunity: Indonesia-MTEB can be the first to introduce Indonesian instruction-following datasets.
Proposed Domains:
- Legal instruction retrieval
- Financial domain instruction
- Healthcare instruction
- Regional language instruction
4. Evaluation Metrics Deep-Dive¶
4.1 Complete Metrics Reference¶
| Metric | Formula | Range | Task | Interpretation |
|---|---|---|---|---|
| Accuracy | correct / total | [0,1] | Classification | Percentage correct |
| F1-Score | 2×P×R/(P+R) | [0,1] | Classification | Harmonic mean of precision/recall |
| V-Measure | 2×h×c/(h+c) | [0,1] | Clustering | Homogeneity + completeness |
| ARI | (RI-E)/(M-E) | [-1,1] | Clustering | Adjusted clustering similarity |
| NMI | I(C;K)/√(H(C)H(K)) | [0,1] | Clustering | Normalized mutual information |
| AP | Σ(P@k×rel)/R | [0,1] | Pair Class, Rerank | Average precision |
| MAP | mean(AP) | [0,1] | Retrieval, Rerank | Mean of average precision |
| MRR | mean(1/rank_first) | [0,1] | Retrieval, Rerank | Mean reciprocal rank |
| nDCG@k | DCG@k/IDCG@k | [0,1] | Retrieval, Rerank | Normalized discounted gain |
| Recall@k | rel@k/total | [0,1] | Retrieval | Recall at position k |
| Spearman ρ | rank correlation | [-1,1] | STS | Rank-based correlation |
| Pearson r | linear correlation | [-1,1] | STS | Linear correlation |
| Cosine Sim | (A·B)/(‖A‖‖B‖) | [-1,1] | Summarization | Cosine similarity |
4.2 Metric Selection by Task¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ MTEB METRIC SELECTION MATRIX │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ TASK │ PRIMARY │ SECONDARY │ TERTIARY │
│ ────────────────────────┼────────────────┼──────────────────┼──────────────│
│ Classification │ Accuracy │ F1-macro │ F1-micro │
│ Clustering │ V-measure │ ARI │ NMI │
│ Pair Classification │ AP │ Accuracy │ F1 │
│ Reranking │ MAP │ MRR │ nDCG@10 │
│ Retrieval │ nDCG@10 │ Recall@10 │ MAP │
│ STS │ Spearman │ Pearson │ - │
│ Summarization │ Cosine Sim │ - │ - │
│ Instruction Following │ Task-specific │ nDCG@10 │ Recall@10 │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
5. MTEB Dataset Format Standards¶
5.1 HuggingFace Dataset Structure¶
Standard MTEB Dataset Format:
from datasets import Dataset, DatasetDict
dataset = DatasetDict({
"train": Dataset.from_dict({
"text": [...], # Input text(s)
"label": [...], # Labels (for supervised tasks)
# Additional fields as needed
}),
"validation": Dataset.from_dict({...}),
"test": Dataset.from_dict({...})
})
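A concrete (hypothetical) instantiation for a small Indonesian sentiment classification set; the repository id in the last line is illustrative only:

```python
from datasets import Dataset, DatasetDict

dataset = DatasetDict({
    "train": Dataset.from_dict({
        "text": ["Pelayanan bank ini sangat memuaskan", "Antriannya terlalu lama"],
        "label": [1, 0],  # 1 = positive, 0 = negative
    }),
    "test": Dataset.from_dict({
        "text": ["Aplikasinya mudah digunakan"],
        "label": [1],
    }),
})
# dataset.push_to_hub("indonesiamteb/classification_sentiment_example")  # hypothetical repo
```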
Task-Specific Variations:
| Task | Required Fields | Optional Fields |
|---|---|---|
| Classification | text, label | - |
| Clustering | sentences, labels | main_category |
| Pair Classification | text1, text2, label | - |
| Reranking | query, documents, relevant | scores |
| Retrieval | corpus, queries, relevant_docs | domain |
| STS | text1, text2, score | dataset |
| Summarization | text, summary | source |
5.2 Dataset Card Template¶
Each MTEB dataset requires a README card:
---
dataset_name: "DatasetName"
language: ["id"]
license: "cc-by-4.0"
---
# DatasetName
## Dataset Description
Brief description of the dataset...
## Citation
```bibtex
@dataset{dataset_name,
title={Dataset Name},
author={...},
year={2026}
}
```

## Tasks
- TaskType1
- TaskType2

## Languages
- Indonesian (id)

## Dataset Statistics
6. MTEB v2 & MMTEB Updates¶
6.1 What's New in MTEB v2¶
| Feature | v1 | v2 | Impact |
|---|---|---|---|
| API | MTEB(tasks).run() | tasks.evaluate(model) | Simpler interface |
| Format | JSON only | JSON + Parquet | Faster loading |
| Modality | Text only | Text + Image | Multimodal support |
| Caching | Basic | Advanced with validation | Reproducibility |
| Leaderboard | Single | Multi-domain | Better organization |
6.2 MMTEB (ICLR 2025) Additions¶
New Features:
1. Instruction Following Tasks
   - 17 instruction types
   - Domain-aware retrieval
2. 100+ New Datasets
3. Long-Document Retrieval
   - Documents up to 32K tokens
   - Specialized evaluation
4. Code Retrieval
   - Programming-language specific
   - Semantic code search
5. Conversational Retrieval
   - Multi-turn dialogue context
   - Conversation history handling
6.3 API Migration Guide¶
Old (v1):
tasks = mteb.get_tasks(tasks=["Banking77Classification"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, eval_splits=["test"])
New (v2):
tasks = mteb.get_tasks(tasks=["Banking77Classification"])
for task in tasks:
results = task.evaluate(model, eval_splits=["test"])
7. Implementation Guide¶
7.1 Basic Evaluation Example¶
from sentence_transformers import SentenceTransformer
import mteb
# 1. Load model
model = SentenceTransformer('intfloat/multilingual-e5-large')
# 2. Select tasks
tasks = mteb.get_tasks(
tasks=["Banking77Classification", "STSBenchmark"],
languages=["id"] # Will filter for Indonesian tasks
)
# 3. Run evaluation
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(
model,
eval_splits=["test"],
output_folder="results/"
)
# 4. View results
for task_name, task_results in results.items():
print(f"{task_name}: {task_results}")
7.2 Custom Encoder Example¶
from sentence_transformers import SentenceTransformer

# MTEB only requires an object exposing an .encode() method (see Section 1.1);
# here we wrap a SentenceTransformer, but any embedding backend works.
class MyIndonesianEncoder:
    def __init__(self, model_name, device="cuda"):
        self.model = SentenceTransformer(model_name, device=device)
        self.device = device

    def encode(self, texts, batch_size=32, **kwargs):
        """Encode a list of texts into a (n_texts, dimension) array of embeddings."""
        return self.model.encode(
            texts,
            batch_size=batch_size,
            show_progress_bar=False,
        )

    @property
    def dimension(self):
        return self.model.get_sentence_embedding_dimension()
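Usage, with a model id taken from the table in Section 9.2:

```python
encoder = MyIndonesianEncoder("LazarusNLP/indonesian-sbert-base")
embeddings = encoder.encode(["Bank Indonesia menaikkan suku bunga acuan"])
print(embeddings.shape)  # (1, encoder.dimension)
```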
7.3 Result Format¶
{
"dataset_name": {
"test": {
"en": {
"main_score": 0.85,
"accuracy": 0.85,
"f1_macro": 0.82,
"evaluation_time": 12.5,
"footprint": {
"memory_mb": 512,
"model_parameters": 560000000
}
}
}
}
}
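Scores can be pulled back out of the saved JSON with ordinary file handling; a sketch with an illustrative file path (MTEB writes one JSON file per task under `output_folder`):

```python
import json
from pathlib import Path

# Path layout is illustrative; adjust to your output_folder structure
result_file = Path("results") / "dataset_name.json"
results = json.loads(result_file.read_text())
print(results["dataset_name"]["test"]["en"]["main_score"])  # e.g. 0.85
```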
8. Indonesia-MTEB Task Mapping¶
8.1 Proposed Dataset Distribution by Task¶
| Task Category | Target Count | Current ID Sources | Translation Needed | AI Generation Needed |
|---|---|---|---|---|
| Classification | 8-12 | 3 (IndoNLU, NusaX) | 4-5 | 2-3 |
| Clustering | 5-8 | 0 | 2-3 | 3-5 |
| Pair Classification | 3-5 | 0 | 2 | 1-2 |
| Reranking | 3-5 | 0 | 2 | 1-2 |
| Retrieval | 8-12 | 1 (MIRACL-ID) | 4-5 | 3-5 |
| STS | 5-8 | 0 | 3-4 | 2-3 |
| Summarization | 3-5 | 0 | 2 | 1-2 |
| Instruction Following | 3-5 | 0 | 0 | 3-5 |
| TOTAL | 38-55 | 4 | 19-23 | 16-26 |
8.2 Priority Matrix for Indonesia-MTEB¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ INDONESIA-MTEB TASK PRIORITY MATRIX │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ╔═════════════════════════════════════════════════════════════════════════╗│
│ ║ HIGH PRIORITY (Phase 1: Foundation) ║│
│ ║ ────────────────────────────────────────────────────────────────────── ║│
│ ║ Classification │ 8 datasets │ IndoNLU + Translation + Generation ║│
│ ║ Retrieval │ 8 datasets │ MIRACL-ID + Translation + Generation ║│
│ ║ Clustering │ 5 datasets │ Translation + Generation ║│
│ ║ STS │ 5 datasets │ Translation + Generation ║│
│ ╚═════════════════════════════════════════════════════════════════════════╝│
│ │
│ ╔═════════════════════════════════════════════════════════════════════════╗│
│ ║ MEDIUM PRIORITY (Phase 2: Coverage) ║│
│ ║ ────────────────────────────────────────────────────────────────────── ║│
│ ║ Pair Class. │ 4 datasets │ Translation + Generation ║│
│ ║ Reranking │ 4 datasets │ Translation + Generation ║│
│ ║ Summarization │ 3 datasets │ Translation + Generation ║│
│ ╚═════════════════════════════════════════════════════════════════════════╝│
│ │
│ ╔═════════════════════════════════════════════════════════════════════════╗│
│ ║ NOVEL CONTRIBUTION (Phase 3: Innovation) ║│
│ ║ ────────────────────────────────────────────────────────────────────── ║│
│ ║ Instruction Following │ 5 datasets │ AI Generation (Novel) ║│
│ ╚═════════════════════════════════════════════════════════════════════════╝│
│ │
└─────────────────────────────────────────────────────────────────────────────┘
8.3 Dataset Naming Convention¶
Proposed Indonesia-MTEB Naming:
indonesiamteb/{task}_{domain}_{source}
Examples:
- indonesiamteb/classification_sentiment_newsnlp
- indonesiamteb/clustering_news_wikipedia_id
- indonesiamteb/retrieval_wikipedia_miracl_id
- indonesiamteb/sts_news_translated_stsb
- indonesiamteb/instruction_retrieval_legal_generated
9. Technical Considerations¶
9.1 Performance Optimization¶
Encoding Speed:
| Technique | Speedup | Implementation |
|---|---|---|
| Batch encoding | 10-50x | encode(texts, batch_size=128) |
| GPU utilization | 5-20x | model.to("cuda") |
| Quantization | 2-4x | quantize_model=True |
| Caching | ∞ | cache_dir="cache/" |
Memory Optimization:
# For large corpora, stream the dataset instead of materializing it in memory
from datasets import load_dataset

corpus = load_dataset("mteb/MIRACLRetrieval", split="test", streaming=True)
for batch in corpus.iter(batch_size=1024):
    texts = batch["text"]  # encode each batch incrementally
9.2 Cross-Lingual Considerations¶
For Indonesian evaluation, consider:
- Script Compatibility: Indonesian uses Latin script (same as English)
- Tokenization: Different tokenizers may affect embedding quality
- Domain Transfer: English-pretrained models may need adaptation
Recommended Models for Indonesian:
| Model | Parameters | Indonesian Training | MTEB ID Score |
|---|---|---|---|
| intfloat/multilingual-e5-large | 560M | Yes (100+ langs) | Baseline |
| BAAI/bge-m3 | 600M | Yes (multilingual) | To evaluate |
| sentence-transformers/LaBSE | 470M | Yes | To evaluate |
| LazarusNLP/indonesian-sbert-base | 110M | Yes (ID-only) | To evaluate |
9.3 Reproducibility¶
Essential for MTEB integration:
import numpy as np
import torch
import mteb

# Set seeds
np.random.seed(42)
torch.manual_seed(42)

# Use deterministic algorithms
torch.use_deterministic_algorithms(True)

# Log model details (model_name / model_revision come from your evaluation config)
print(f"Model: {model_name}")
print(f"Revision: {model_revision}")
print(f"MTEB version: {mteb.__version__}")
10. References¶
Primary Sources¶
- Muennighoff, N., et al. (2023). "MTEB: Massive Text Embedding Benchmark". Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2023). arXiv:2210.07316
- Enevoldsen, K., et al. (2025). "MMTEB: Massive Multilingual Text Embedding Benchmark". International Conference on Learning Representations (ICLR 2025). arXiv:2502.13595
- Chung, I., et al. (2025). "Maintaining MTEB: Towards Long Term Usability and Reproducibility of Embedding Benchmarks". arXiv:2506.21182
Technical Documentation¶
- MTEB GitHub Repository: github.com/embeddings-benchmark/mteb
- MTEB v2 Introduction: huggingface.co/blog/isaacchung/mteb-v2
- Sentence Transformers MTEB Guide: sbert.net/docs/sentence_transformer/usage/mteb_evaluation.html
Evaluation Metrics¶
- Rosenberg, A., & Hirschberg, J. (2007). "V-Measure: A Conditional Entropy-Based External Cluster Evaluation Measure". EMNLP-CoNLL 2007.
- Weaviate - Retrieval Evaluation Metrics: weaviate.io/blog/retrieval-evaluation-metrics
- Evidently AI - NDCG Explained: evidentlyai.com/ranking-metrics/ndcg-metric
Dataset Examples¶
- MTEB Datasets Hub: huggingface.co/mteb
- MIRACL (Multilingual Information Retrieval): github.com/project-miracl/miracl
11. Document Status¶
[!NOTE] Next Document: Document 03 - Existing Indonesian Datasets
This document provides detailed analysis of existing Indonesian NLP datasets, their MTEB compatibility, and aggregation strategies for Indonesia-MTEB.
Change Log:
| Version | Date | Changes | Author |
|---|---|---|---|
| 1.0 | 2026-01-25 | Initial version | Research Team |
| 2.0 | 2026-01-25 | Enhanced edition with detailed analysis, MMTEB updates, implementation guides | Research Team |
This document is a living record and will be updated as research progresses.