Indonesia-MTEB Benchmark
Complete documentation for the Indonesian Massive Text Embedding Benchmark.
Overview
Indonesia-MTEB is a text embedding benchmark for Indonesian that covers all 8 MTEB task categories using three data strategies:
- Aggregation - 50+ existing Indonesian datasets
- Translation - Full MTEB translated to Indonesian
- AI Generation - Novel datasets for uncovered tasks
Documents
Foundation
- 01. Project Overview - Problem statement, 3-pronged strategy, goals
- 02. MTEB Structure Analysis - 8 task categories, formats, metrics
- 03. Existing Indonesian Datasets - Inventory of 50+ datasets
Methodology
- 04. Regional MTEB Methodologies - C-MTEB, VN-MTEB, TR-MTEB, SEA-BED
- 05. Translation Models Benchmark - EN-ID translation comparison
- 06. AI Dataset Generation Methods - LLM-based generation
Validation & Standards
- 07. Validation Strategies - 3-stage pipeline, quality thresholds
- 08. ACL Dataset Paper Standards - ARR submission, licensing
Publication & Implementation
- 09. Novelty Angle & Publication - Unique contributions
- 10. Implementation Roadmap - 12-month timeline, resources
- 11. Python Package Development - Building the package
Quick Stats
| Metric | Value |
|---|---|
| Primary Tasks | Classification, Clustering, Retrieval, STS, and the remaining MTEB task categories |
| Target Datasets | 50-100+ total |
| Languages | Indonesian, Javanese, Sundanese, Malay |
| Validation | 3-stage pipeline (language → similarity → LLM-judge) |
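
The 3-stage validation order listed above (language → similarity → LLM-judge) can be pictured as a chain of filters, where each stage only sees the items that passed the previous one. The following is a minimal sketch under stated assumptions, not the project's implementation: the `Pair` type, the stopword-based language check, the length-ratio similarity, the `0.5` threshold, and the always-accept judge are all hypothetical placeholders. A real pipeline would presumably swap in a language identifier, an embedding-based similarity score, and an actual LLM call, as described in the Validation Strategies document.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Pair:
    source: str  # original text (e.g. English)
    target: str  # candidate Indonesian text


# Hypothetical stand-in for a real language identifier:
# a handful of very common Indonesian function words.
ID_STOPWORDS = {"yang", "dan", "di", "ini", "itu", "dengan",
                "untuk", "tidak", "dari", "adalah"}


def looks_indonesian(text: str, min_hits: int = 1) -> bool:
    """Stage 1 placeholder: count common Indonesian function words."""
    tokens = text.lower().split()
    return sum(t in ID_STOPWORDS for t in tokens) >= min_hits


def length_ratio_similarity(pair: Pair) -> float:
    """Stage 2 placeholder: word-count ratio in [0, 1].

    This is NOT semantic similarity; it only marks where an
    embedding-based score would plug in.
    """
    a, b = len(pair.source.split()), len(pair.target.split())
    if max(a, b) == 0:
        return 0.0
    return min(a, b) / max(a, b)


def accept_all(pair: Pair) -> bool:
    """Stage 3 placeholder: a real judge would query an LLM."""
    return True


def validate(
    pairs: Iterable[Pair],
    similarity: Callable[[Pair], float] = length_ratio_similarity,
    judge: Callable[[Pair], bool] = accept_all,
    sim_threshold: float = 0.5,  # assumed value, not from the source
) -> List[Pair]:
    """Run the three stages in order, keeping only pairs that pass all."""
    kept = []
    for pair in pairs:
        if not looks_indonesian(pair.target):    # stage 1: language
            continue
        if similarity(pair) < sim_threshold:     # stage 2: similarity
            continue
        if not judge(pair):                      # stage 3: LLM judge
            continue
        kept.append(pair)
    return kept
```

Ordering the stages from cheapest to most expensive means the costly LLM-judge call only runs on items that already passed the two cheaper checks. The `similarity` and `judge` parameters are injectable so that placeholder stages can be replaced without touching the control flow.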