Indonesia-MTEB Benchmark

Complete documentation for the Indonesian Massive Text Embedding Benchmark.

Overview

Indonesia-MTEB is a comprehensive text embedding benchmark for Indonesian, covering all 8 MTEB task categories through three data strategies:

  1. Aggregation - 50+ existing Indonesian datasets
  2. Translation - Full MTEB translated to Indonesian
  3. AI Generation - Novel datasets for uncovered tasks

Documents

Foundation

Methodology

Validation & Standards

Publication & Implementation

Quick Stats

Metric            Value
Primary Tasks     Classification, Clustering, Retrieval, STS, etc.
Target Datasets   50-100+ total
Languages         Indonesian, Javanese, Sundanese, Malay
Validation        3-stage pipeline (language → similarity → LLM-judge)
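
The 3-stage validation pipeline (language → similarity → LLM-judge) can be pictured as a chain of filters, where each sample must pass every stage in order and a rejected sample records the stage that dropped it. The sketch below is only an illustration under stated assumptions: `Pair`, `run_pipeline`, and the three stage functions are hypothetical names, and the stage bodies are toy placeholders (a stopword check, a length ratio, and a stub judge), not the checks the benchmark actually uses.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Pair:
    source: str     # original text, e.g. an English MTEB sample
    candidate: str  # Indonesian translation or generated text


# A stage returns True to keep the pair, False to reject it.
Stage = Callable[[Pair], bool]

# Toy stopword list; a real pipeline would use a language-ID model here.
ID_STOPWORDS = {"yang", "dan", "di", "ini", "itu", "dengan", "untuk", "tidak"}


def language_stage(pair: Pair) -> bool:
    """Placeholder language check: any Indonesian stopword in the candidate."""
    tokens = [t.strip(".,!?") for t in pair.candidate.lower().split()]
    return any(t in ID_STOPWORDS for t in tokens)


def similarity_stage(pair: Pair, min_ratio: float = 0.5) -> bool:
    """Placeholder similarity check: character-length ratio.

    Stands in for an embedding-based similarity score (an assumption).
    """
    a, b = len(pair.source), len(pair.candidate)
    return min(a, b) / max(a, b, 1) >= min_ratio


def judge_stage(pair: Pair) -> bool:
    """Stub for the LLM-judge stage; always accepts in this sketch."""
    return True


def run_pipeline(
    pairs: List[Pair], stages: List[Tuple[str, Stage]]
) -> Tuple[List[Pair], List[Tuple[Pair, str]]]:
    """Apply stages in order; record the first stage that rejects a pair."""
    kept: List[Pair] = []
    rejected: List[Tuple[Pair, str]] = []
    for pair in pairs:
        for name, stage in stages:
            if not stage(pair):
                rejected.append((pair, name))
                break
        else:
            # No stage rejected the pair.
            kept.append(pair)
    return kept, rejected


STAGES = [
    ("language", language_stage),
    ("similarity", similarity_stage),
    ("llm_judge", judge_stage),
]

samples = [
    Pair("The cat sleeps on the chair.", "Kucing itu tidur di kursi."),
    Pair("The cat sleeps on the chair.", "The cat sleeps on the chair."),
    Pair("This is a long sentence about embedding benchmarks and evaluation.",
         "Ini dan itu."),
]

kept, rejected = run_pipeline(samples, STAGES)
```

Ordering the stages from cheapest to most expensive means the costly judge call only runs on samples that already passed the earlier checks, which is one plausible reason for the language → similarity → LLM-judge order.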