Project: Indonesia-MTEB Benchmark
Document: 03 - Existing Indonesian Datasets Inventory (ENHANCED)
Last Updated: 2026-01-25
Version: 2.0 - Enhanced with Latest Research (2024-2025)
Comprehensive Indonesian Datasets Inventory¶
"A complete catalog of all available Indonesian NLP datasets for building Indonesia-MTEB, enhanced with the latest research findings and practical implementation guides."
Executive Summary¶
Key Findings
- 70+ datasets identified across Indonesian NLP landscape
- 8 major benchmark suites (IndoNLU, NusaX, IndoMMLU, IndoNLG, IndoLEM, LoraxBench, SEACrowd, SEA-BED)
- Latest additions (2024-2025): Sahabat-AI (448K pairs), IndoToxic2024 (43,692 samples), IndoPref (522 prompts)
- Critical gaps identified: Clustering (0), Reranking (0), STS (3 limited)
- MTEB v2/MMTEB integration provides framework for 1,090+ languages
graph TD
A[Indonesian NLP Datasets] --> B[Classification: 25+]
A --> C[Pair Classification: 5+]
A --> D[Retrieval: 7+]
A --> E[STS: 3 limited]
A --> F[Summarization: 4+]
A --> G[Clustering: 0 CRITICAL GAP]
A --> H[Reranking: 0 CRITICAL GAP]
A --> I[Instruction Following: 2+ emerging]
B --> B1[Sentiment: SmSA, NusaX, PRDECT-ID]
B --> B2[Emotion: EmoT, PRDECT-ID, InaMoodMeter]
B --> B3[Topic: IDHC, IndoNews]
B --> B4[Specialized: CLICK-ID, HoAX, IndoToxic2024]
style G fill:#ff6b6b,color:#fff
style H fill:#ff6b6b,color:#fff
style E fill:#ffd93d,color:#333
Table of Contents¶
- Dataset by MTEB Task Category
- Major Benchmark Suites
- Classification Datasets
- Retrieval & Question Answering
- Similarity & Pair Tasks
- Summarization & Generation
- Sequence Labeling
- Specialized Datasets
- Pre-training Corpora
- Resource Hubs
- Gap Analysis & Priorities
- Implementation Guide
1. Dataset by MTEB Task Category¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ Indonesian Datasets by MTEB Task Category (2025) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ╔══════════════════════════════════════════════════════════════════════╗ │
│ ║ CLASSIFICATION (25+ datasets) ║ │
│ ╠══════════════════════════════════════════════════════════════════════╣ │
│ ║ ✅ IndoNLU (12 tasks): EmoT, SmSA, CASA, POSNeg, HoAX, IDHC ║ │
│ ║ ✅ NusaX-senti: 12 languages, ~12K samples in total ║ │
│ ║ ✅ PRDECT-ID: 5,400 product reviews, 5 emotions ║ │
│ ║ ✅ CLICK-ID: 15,000 clickbait headlines ║ │
│ ║ ✅ IndoToxic2024: 43,692 hate speech samples (NEW 2024) ║ │
│ ║ ✅ IndoMMLU: 14,981 questions, 64 subjects ║ │
│ ║ ✅ Indonesian news: Topic classification ║ │
│ ╚══════════════════════════════════════════════════════════════════════╝ │
│ │
│ ╔══════════════════════════════════════════════════════════════════════╗ │
│ ║ PAIR CLASSIFICATION (5+ datasets) ║ │
│ ╠══════════════════════════════════════════════════════════════════════╣ │
│ ║ ✅ id-paraphrase-detection: Jakarta Research ║ │
│ ║ ✅ IndoNLI: ~18K pairs, human-elicited NLI ║ │
│ ║ ✅ SNLI Indo: Large-scale translated SNLI ║ │
│ ║ ✅ WReTE: 450 word relation entailment pairs ║ │
│ ╚══════════════════════════════════════════════════════════════════════╝ │
│ │
│ ╔══════════════════════════════════════════════════════════════════════╗ │
│ ║ RETRIEVAL (7+ datasets) ║ │
│ ╠══════════════════════════════════════════════════════════════════════╣ │
│ ║ ✅ MIRACL-ID: ~1.4M Wikipedia docs, human annotated ║ │
│ ║ ✅ IDK-MRC: 10K+ questions (unanswerable focus) ║ │
│ ║ ✅ SQuAD-ID: Translated SQuAD v2.0 ║ │
│ ║ ✅ TyDi QA: Indonesian portion, typologically diverse ║ │
│ ║ ✅ IndoNLG QA: Multiple QA tasks ║ │
│ ╚══════════════════════════════════════════════════════════════════════╝ │
│ │
│ ╔══════════════════════════════════════════════════════════════════════╗ │
│ ║ STS - SEMANTIC TEXT SIMILARITY (3 limited) ⚠️ ║ │
│ ╠══════════════════════════════════════════════════════════════════════╣ │
│ ║ ⚠️ WReTE: Limited to 450 word pairs ║ │
│ ║ ⚠️ Indonesian Text Similarity: Curated STS (small scale) ║ │
│ ║ ❌ PRIORITY: Create comprehensive Indonesian STS benchmark ║ │
│ ╚══════════════════════════════════════════════════════════════════════╝ │
│ │
│ ╔══════════════════════════════════════════════════════════════════════╗ │
│ ║ SUMMARIZATION (4+ datasets) ║ │
│ ╠══════════════════════════════════════════════════════════════════════╣ │
│ ║ ✅ IndoSum: 19,000 news articles with summaries ║ │
│ ║ ✅ NusaDialogue: Dialogue summarization, 3 languages ║ │
│ ║ ✅ IndoNLG Summary: Multiple summarization tasks ║ │
│ ║ ✅ ID-WOZ: Chat summarization, 9 domains ║ │
│ ╚══════════════════════════════════════════════════════════════════════╝ │
│ │
│ ╔══════════════════════════════════════════════════════════════════════╗ │
│ ║ INSTRUCTION FOLLOWING (2+ emerging datasets) 🆕 ║ │
│ ╠══════════════════════════════════════════════════════════════════════╣ │
│ ║ ✅ Sahabat-AI: 448,000 Indonesian instruction pairs (2024) ║ │
│ ║ ✅ IndoPref: 522 prompts, 4,099 pairwise preferences (2025) ║ │
│ ║ ✅ anak-baik: Instruction-output pairs for SFT ║ │
│ ╚══════════════════════════════════════════════════════════════════════╝ │
│ │
│ ╔══════════════════════════════════════════════════════════════════════╗ │
│ ║ MISSING DATASETS ❌ CRITICAL GAP ║ │
│ ╠══════════════════════════════════════════════════════════════════════╣ │
│ ║ ❌ CLUSTERING: No dedicated Indonesian clustering datasets ║ │
│ ║ ❌ RERANKING: No dedicated Indonesian reranking datasets ║ │
│ ║ ❌ Recommendation: Translate reddit-clustering, msmarco-reranking ║ │
│ ╚══════════════════════════════════════════════════════════════════════╝ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
2. Major Benchmark Suites¶
2.1 IndoNLU (12 Tasks) - Foundation Benchmark¶
IndoNLU Citation Impact
"IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding" (AACL 2020)
- 502+ citations (as of 2025)
- Link: arxiv.org/abs/2009.05387
- HuggingFace: indonlp/indonlu
| Dataset | Task | Classes | Size | MTEB Mapping | Statistics |
|---|---|---|---|---|---|
| EmoT | Emotion Classification | 5 (anger, fear, happy, love, sadness) | ~4K tweets | Classification | F1 baseline: 66.2% |
| SmSA | Sentiment Analysis | 3 (positive, neutral, negative) | ~11K tweets | Classification | F1 baseline: 88.5% |
| CASA | Aspect-Based Sentiment | Car aspects (positive/negative) | ~1K reviews | Classification | F1 baseline: 84.7% |
| POSNeg | Binary Sentiment | 2 (positive/negative) | ~5K | Classification | - |
| HoAX | Hoax Detection | 2 (valid/hoax) | ~8K | Classification | Accuracy: 71.3% |
| IDHC | Headline Classification | 6 news categories | ~7K | Classification | Accuracy: 86.5% |
| TREC-ID | Question Classification | 6 question types | ~1.8K | Classification | Accuracy: 95.7% |
| WReTE | Word Entailment | 2 (entailment/not) | 450 pairs | Pair Classification | Accuracy: 79.1% |
| PoS | POS Tagging | 23 tags | ~8K | - | - |
| NERgrit | NER | 3 (PER, LOC, ORG) | ~2K | - | F1: 82.1% |
| Chunking | Phrase Chunking | - | - | - | - |
| QA-ID | SQuAD-style | - | - | Retrieval | - |
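```python
# IndoNLU loading example
from datasets import load_dataset

# indonlp/indonlu is a script-based dataset; recent versions of `datasets`
# may require trust_remote_code=True or may not support it at all.

# Load the sentiment analysis dataset (SmSA)
smsa = load_dataset("indonlp/indonlu", "smsa")
print(smsa)

# Load the emotion classification dataset (EmoT); its config name is "emot"
emotion = load_dataset("indonlp/indonlu", "emot")
print(emotion)
```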
2.2 NusaX (12 Languages) - Multilingual Sentiment¶
NusaX Coverage
"NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages" (EACL 2023)
- 104+ citations (as of 2025)
- Link: arxiv.org/abs/2205.15960
- HuggingFace: indonlp/NusaX-senti

Languages Covered:
- Indonesian (ind): official language
- English (eng): reference
- Local languages: Acehnese (ace), Balinese (ban), Banjarese (bjn), Buginese (bug), Madurese (mad), Minangkabau (min), Javanese (jav), Ngaju (nij), Sundanese (sun), Toba Batak (bbc)
Dataset Statistics:
| Language | ISO Code | Samples | Source |
|---|---|---|---|
| Indonesian | ind | ~2,000 | Native |
| Javanese | jav | ~1,500 | Native |
| Sundanese | sun | ~1,200 | Native |
| Minangkabau | min | ~1,000 | Native |
| + 8 others | - | ~6,300 | Native |
Tasks:
1. NusaX-senti: 3-class sentiment (positive, neutral, negative)
2. NusaX-MT: Machine translation parallel corpus
2.3 IndoMMLU (64 Subjects) - Knowledge Evaluation¶
IndoMMLU Insight
"Large Language Models Only Pass Primary School Exams in Indonesia" (EMNLP 2023)
- 49+ citations (as of 2025)
- Link: arxiv.org/abs/2310.04928
- HuggingFace: indolem/IndoMMLU

Key Findings:
- 14,981 questions across 64 subjects
- Education levels: Primary → Junior High → Senior High → University
- Subject categories:
  - STEM: Mathematics, Physics, Chemistry, Biology, Computer Science
  - Humanities: History, Geography, Sociology, Philosophy
  - Social Sciences: Economics, Law, Political Science
  - Others: Arts, Vocational, Religious Studies

Performance Benchmark (GPT-3.5-turbo):

| Level | Accuracy | Human Baseline |
|---|---|---|
| Primary | 68.4% | ~95% |
| Junior High | 52.1% | ~85% |
| Senior High | 41.3% | ~80% |
| University | 35.7% | ~75% |
2.4 IndoNLG (6 Tasks) - Language Generation¶
IndoNLG Scope
"IndoNLG: Benchmark and Resources for Evaluating Indonesian Natural Language Generation" (EMNLP 2021)
- 144+ citations (as of 2025)
- Link: arxiv.org/abs/2104.08200
- HuggingFace: GEM/indonlg
Tasks Covered:
| Task | Dataset | Size | Metrics |
|---|---|---|---|
| Summarization | IndoSum | 19K articles | ROUGE, BERTScore |
| QA | IndoNLG-QA | - | EM, F1 |
| Chit-chat | IndoNLG-Chat | - | Perplexity, BLEU |
| MT (ID→EN) | - | - | BLEU |
| MT (EN→ID) | - | - | BLEU |
| MT (ID→Javanese) | - | - | BLEU |
2.5 IndoLEM (7 Tasks) - Language Model Evaluation¶
IndoLEM Impact
"IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP" (COLING 2020)
- 480+ citations (as of 2025)
- Link: arxiv.org/abs/2011.00677
- HuggingFace: indolem

Tasks:
- POS Tagging
- Named Entity Recognition
- Dependency Parsing
- Sentiment Analysis
- Summarization
- Next Tweet Prediction
- Tweet Ordering
2.6 LoraxBench (20 Languages, 6 Tasks) - NEW EMNLP 2025¶
Latest Benchmark (2025)
"LORAXBENCH: A Multitask, Multilingual Benchmark Suite for 20 Indonesian Languages" (EMNLP 2025)
- Link: arxiv.org/abs/2508.12459
- HuggingFace: google/LoraxBench

Languages: Indonesian + 19 local languages (Acehnese, Balinese, Banjarese, Buginese, Madurese, Minangkabau, Javanese, Sundanese, + 11 others)

6 Tasks:
1. Reading Comprehension
2. Open-domain QA
3. Language Modeling
4. Translation
5. Summarization
6. Paraphrase Detection
2.7 SEACrowd (38 SEA Languages, 13 Tasks) - EMNLP 2024¶
SEACrowd Milestone
"SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages" (EMNLP 2024)
- 12+ citations (as of 2025)
- Link: arxiv.org/abs/2406.10118
- HuggingFace: SEACrowd

Coverage:
- 38 SEA indigenous languages including Indonesian
- 13 tasks across 3 modalities (text, speech, vision)
- Indonesian datasets available: 50+ datasets in SEACrowd format
2.8 SEA-BED (169 Datasets) - NEW 2025¶
Regional Benchmark
"SEA-BED: Southeast Asia Embedding Benchmark" (2025)
- Link: arxiv.org/abs/2508.12243
- 169 datasets across 9 tasks
- 10 SEA languages including Indonesian

Tasks Covered:
- Classification, Clustering, Pair Classification, Retrieval, STS, Reranking, and more
2.9 MMTEB (1,090 Languages) - ICLR 2025¶
Global Benchmark Framework
"MMTEB: Massive Multilingual Text Embedding Benchmark" (ICLR 2025)
- 86+ citations (as of 2025)
- Link: openreview.net/forum?id=zl3pfz4VCV

Key Features:
- 1,090 languages covered (including Indonesian)
- 500+ tasks across 8 MTEB categories
- Framework for integrating regional benchmarks
3. Classification Datasets¶
3.1 Sentiment Analysis¶
| Dataset | Size | Classes | Source | Year | HuggingFace |
|---|---|---|---|---|---|
| SmSA (IndoNLU) | ~11K | 3 | - | 2020 | indonlp/indonlu |
| NusaX-senti | ~12K total (12 languages) | 3 | - | 2022 | indonlp/NusaX-senti |
| Indolem_sentiment | ~2K | 2 (pos/neg) | Twitter+Hotel | 2020 | SEACrowd/indolem_sentiment |
| Ina-SASet | ~5K | 3 | Consumer reviews | 2023 | - |
| IndoBERTweet-sentiment | ~10K | 3 | - | 2021 | Aardiiiiy/indobertweet-base-Indonesian-sentiment-analysis |
3.2 Emotion Classification¶
| Dataset | Size | Emotions | Year | Citation Count |
|---|---|---|---|---|
| EmoT (IndoNLU) | ~4K | 5 (anger, fear, happy, love, sadness) | 2020 | 502+ |
| PRDECT-ID | 5,400 | 5 emotions | 2022 | 28+ |
| Emotion tweets | 4,403 | 5 | 2019 | - |
| InaMoodMeter | ~3K | 7 (happy, sad, angry, fear, disgust, shame, guilt) | 2021 | IEEE |
| Indonesian Mixed Emotion | ~2K | 19 classes | 2022 | - |
PRDECT-ID Emotion Distribution:
Joy: ████████████████ 31.5%
Sadness: ████████████ 22.1%
Anger: ██████████ 18.7%
Fear: ████████ 14.2%
Disgust: ██████ 13.5%
3.3 Topic Classification¶
| Dataset | Size | Topics | Source |
|---|---|---|---|
| IDHC (IndoNLU) | ~7K | 6 news categories | IndoNLU |
| Indonesian news | ~10K | 5 (bola, news, bisnis, tekno, otomotif) | SEACrowd |
| IndoNews | ~5K | Multiple categories | Jakarta Research |
IDHC Categories:
1. Olahraga (Sports)
2. Teknologi (Technology)
3. Bisnis (Business)
4. Hiburan (Entertainment)
5. Sains (Science)
6. Kesehatan (Health)
3.4 Specialized Classification¶
CLICK-ID - Clickbait Detection¶
CLICK-ID Details
"CLICK-ID: A novel dataset for Indonesian clickbait headlines" (2020)
- 56+ citations
- 15,000 headlines annotated
- Link: huggingface.co/datasets/SEACrowd/id_clickbait
| Dataset | Task | Size | Year |
|---|---|---|---|
| CLICK-ID | Clickbait Detection | 15,000 headlines | 2020 |
| HoAX (IndoNLU) | Hoax Detection | ~8K | 2020 |
| Fakenews-mafindo | Fake News | ~6K | 2021 |
| ID_Sarcasm | Sarcasm Detection | ~3K | 2022 |
IndoToxic2024 - Hate Speech Detection (NEW 2024)¶
IndoToxic2024 - Latest Dataset
"A Demographically-Enriched Dataset of Hate Speech and Toxicity Types for Indonesian Language" (arXiv 2024)
- 6+ citations
- 43,692 entries annotated by 19 diverse individuals
- Link: arxiv.org/abs/2406.19349

Features:
- Focuses on texts targeting vulnerable groups
- Annotated during the Indonesian presidential election period
- 7 binary classification tasks:
  1. Hate Speech Detection
  2. Toxicity Classification
  3. Insult Detection
  4. Threat/Incitement Detection
  5. Identity Attack Detection
  6. Sexual Harassment Detection
  7. Intolerant/Anti-democratic Detection
Toxicity Distribution:
Non-Toxic: ████████████████████ 68.3%
Toxic: ██████████ 31.7%
Insults: ████████████ 45.2% of toxic
Threats: ███ 8.1% of toxic
Identity: ████████ 21.3% of toxic
3.5 Knowledge & Reasoning¶
| Dataset | Description | Size | Year |
|---|---|---|---|
| IndoMMLU | 64 subjects, 14,981 questions | 14,981 | 2023 |
| COPAL-ID | Commonsense reasoning with local culture | ~5K | 2023 |
| XCOPA-ID | Causal commonsense reasoning | ~2K | 2020 |
| IndoCulture | Geographically influenced cultural reasoning | ~3K | 2022 |
| IndoCareer | Career prediction | ~2K | 2022 |
4. Retrieval & Question Answering¶
4.1 Machine Reading Comprehension¶
MIRACL-ID - Wikipedia Retrieval¶
MIRACL Benchmark
"MIRACL: A Multilingual Retrieval Dataset Covering 18 Languages" (TACL 2023)
- 149+ citations
- ~1.4M Indonesian Wikipedia documents
- Link: project-miracl.github.io

Indonesian Corpus Statistics:
- Documents: ~1.4M passages
- Queries: ~2,500 human-annotated queries
- Relevance judgments: binary (relevant / not relevant) labels from human annotators
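Retrieval quality on a corpus like this is commonly reported as nDCG@10. The following is a minimal sketch of that metric for a single query, assuming binary relevance labels (a passage is either relevant or not). The function names and passage IDs are hypothetical and are not taken from the MIRACL tooling.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_doc_ids, relevant_doc_ids, k=10):
    """nDCG@k for one query, assuming binary relevance judgments."""
    gains = [1.0 if doc_id in relevant_doc_ids else 0.0 for doc_id in ranked_doc_ids]
    # The ideal ranking places every relevant passage at the top.
    ideal = [1.0] * min(len(relevant_doc_ids), k)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical passage IDs, for illustration only.
ranking = ["p3", "p7", "p1", "p9"]
relevant = {"p3", "p1"}
score = ndcg_at_k(ranking, relevant, k=10)
print(round(score, 4))
```

Averaging this value over all queries gives the corpus-level score; a model that ranks both relevant passages first would score 1.0 on this query.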
IDK-MRC - Unanswerable Questions¶
IDK-MRC Contribution
"IDK-MRC: Unanswerable Questions for Indonesian Machine Reading Comprehension" (EMNLP 2022)
- 18+ citations
- 10K+ questions (5K unanswerable)
- Link: arxiv.org/abs/2210.13778

Key Innovation:
- First Indonesian MRC dataset with unanswerable questions
- Combined automatic + manual generation
- Significant performance improvement for Indonesian MRC models
| Dataset | Description | Size | Year |
|---|---|---|---|
| MIRACL-ID | Wikipedia retrieval, human annotated | ~1.4M docs | 2023 |
| SQuAD-ID | Translated SQuAD v2.0 | ~100K | 2020 |
| IDK-MRC | Answerable + unanswerable | 10K+ | 2022 |
| TyDi QA | Indonesian portion | ~5K | 2020 |
| IndoNLG QA | Multiple QA tasks | - | 2021 |
4.2 Open Domain QA¶
| Dataset | Description | Languages |
|---|---|---|
| LoraxBench QA | 20 Indonesian languages | 20 |
| StatMetaQA | Closed domain QA | Indonesian |
5. Similarity & Pair Tasks¶
5.1 Paraphrase Detection¶
| Dataset | Size | Description | Year |
|---|---|---|---|
| id-paraphrase-detection | - | MSRP translated (Jakarta Research) | 2021 |
| WReTE (IndoNLU) | 450 pairs | Word relation entailment | 2020 |
5.2 Natural Language Inference¶
| Dataset | Size | Description | Year |
|---|---|---|---|
| IndoNLI | ~18K pairs | Human-elicited NLI | 2022 |
| SNLI Indo | ~500K | Translated SNLI | 2021 |
IndoNLI Statistics:
Entailment: ████████████████ 33.3%
Contradiction: ████████████████ 33.3%
Neutral: ████████████████ 33.4%
5.3 Semantic Text Similarity - CRITICAL GAP¶
STS Gap Analysis
Current Status: Limited Indonesian STS datasets
Available Resources:
- WReTE: 450 word pairs (too small)
- Indonesian Text Similarity Collection: Curated but limited scale
- rzkamalia/stsb-indo-mt-modified: STS-B translated (limited quality)

Recommendation: Create a comprehensive Indonesian STS benchmark with:
- 5,000+ sentence pairs
- Multiple domains (news, social media, formal documents)
- Human-annotated similarity scores (0-5 scale)
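A benchmark with human similarity scores on a 0-5 scale is typically scored by correlating model similarities with the gold scores. The sketch below assumes Spearman correlation over cosine similarities, which is the usual choice for MTEB STS tasks. The toy vectors and gold scores are invented for illustration, and the helper names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            result[idx] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy sentence-pair embeddings and gold scores (0-5), invented for illustration.
embedding_pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [1.0, 1.0]), ([1.0, 0.0], [0.0, 1.0])]
gold_scores = [5.0, 2.5, 0.0]

predicted = [cosine(u, v) for u, v in embedding_pairs]
rho = spearman(predicted, gold_scores)
print(round(rho, 4))
```

Because Spearman only compares rankings, the model does not need to reproduce the 0-5 scale itself, only the ordering of the pairs.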
5.4 Word Analogy¶
| Dataset | Description | Year |
|---|---|---|
| KaWAT | Word Analogy Task for Indonesian | 2019 |
6. Summarization & Generation¶
6.1 Summarization¶
IndoSum - Primary Summarization Dataset¶
IndoSum Reference
"IndoSum: A New Benchmark Dataset for Indonesian Text Summarization" (IALP 2018)
- 101+ citations
- 19,000 documents with manually-written summaries
- Link: arxiv.org/abs/1810.05334
| Dataset | Size | Description | Year |
|---|---|---|---|
| IndoSum | 19,000 documents | News articles + summaries | 2018 |
| NusaDialogue | ~2K | Dialogue summarization | 2024 |
| IndoNLG Summary | - | Multiple summarization tasks | 2021 |
| ID-WOZ | ~500 | Chat summarization, 9 domains | 2022 |
IndoSum Statistics:
Source Documents:
Mean Length: ████████████████████ 487 words
Std Dev: ████ 127 words
Min: ███ 152 words
Max: ████████████████████████████████████ 1,234 words
Summary Length:
Mean: ████████ 87 words
Compression: ~82% reduction
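The compression figure follows directly from the two mean lengths. A short worked check, using only the numbers quoted above (the variable names are illustrative):

```python
# Mean lengths as reported in the statistics above.
mean_source_words = 487
mean_summary_words = 87

# Share of the source length that survives in the summary.
compression_ratio = mean_summary_words / mean_source_words

# Reduction is the complement of that share.
reduction = 1 - compression_ratio

print(f"{reduction:.1%}")  # prints 82.1%
```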
6.2 Dialogue¶
| Dataset | Description | Languages |
|---|---|---|
| NusaDialogue | 3 Malayo-Polynesian languages | ID, JV, SUN |
| ID-WOZ | 9 domains dialogue | Indonesian |
6.3 Story Cloze¶
| Dataset | Size | Description |
|---|---|---|
| indo_story_cloze | 2,325 stories | Train/dev/test split |
7. Sequence Labeling¶
7.1 Named Entity Recognition¶
| Dataset | Size | Entities | Year |
|---|---|---|---|
| NERgrit (IndoNLU) | ~2K | PER, LOC, ORG | 2020 |
| idner_news_2k | ~2K | News NER | 2021 |
| indolem_ner_ugm | 2,343 | - | 2020 |
| indolem_nerui | 2,125 | - | 2020 |
IndoLER - Legal NER (NEW 2024)¶
IndoLER - Legal Domain
"Named entity recognition on Indonesian legal documents" (2024)
- 19+ citations
- ~1K documents with 20 legal entity types
- Link: scholar.ui.ac.id

20 Legal Entity Types:
1. Judge (Hakim)
2. Prosecutor (Jaksa)
3. Defendant (Terdakwa)
4. Lawyer (Pengacara)
5. Witness (Saksi)
6. Victim (Korban)
7. Court (Pengadilan)
8. Law (Undang-undang)
9. Article (Pasal)
10. Verdict (Putusan)
11. Crime (Kejahatan)
12. Penalty (Hukuman)
13. Date (Tanggal)
14. Location (Lokasi)
15. Organization (Organisasi)
16. + 5 more specialized legal entities
7.2 Part-of-Speech Tagging¶
| Dataset | Size | Description |
|---|---|---|
| PoS (IndoNLU) | ~8K sentences | Indonesian news POS |
| UD_Indonesian-GSD | ~5K | Universal Dependencies |
| UD_Indonesian-PUD | ~1K | Universal Dependencies |
| UD_Indonesian-CSUI | ~3K | Universal Dependencies |
8. Specialized Datasets¶
8.1 Legal Domain¶
| Dataset | Description | Size |
|---|---|---|
| indo_law | Court decision documents | ~5K |
| indoler | 993 annotated court decisions | 993 |
| IndoLER | Legal NER | ~1K docs |
8.2 Aspect-Based Sentiment Analysis¶
| Dataset | Description | Size |
|---|---|---|
| CASA (IndoNLU) | ~1K car reviews | ~1K |
| absa-indonesia | Restaurant reviews from TripAdvisor | ~2K |
8.3 Code-Mixing¶
| Dataset | Description | Size |
|---|---|---|
| id-en-code-mixed | 825 Indonesian-English tweets | 825 |
8.4 Religious / Parallel Corpora¶
| Dataset | Description | Size |
|---|---|---|
| bible_en_id | Bible EN-ID parallel | ~31K verses |
| bible_su_id | Bible Sundanese-Indonesian | ~31K |
| bible_jv_id | Bible Javanese-Indonesian | ~31K |
| indonesian_madurese_bible_translation | Indonesian-Madurese | 30,013 |
| quran | Quran translations | ~6K verses |
8.5 Instruction Following - EMERGING 2024-2025¶
Instruction Following - Emerging Area
This is a rapidly developing area for Indonesian NLP (2024-2025)
Sahabat-AI (448K Pairs) - 2024¶
Sahabat-AI - Major Release
Sahabat-AI: Open-Source LLMs for Bahasa Indonesia (2024)
- 448,000 Indonesian instruction-completion pairs
- Collaboration: GoTo + AI Singapore
- Link: sahabat-ai.com
- Models: Gemma2-9B, Llama3-8B variants

Features:
- Indonesian + regional language support
- Responsible use guidelines
- Multiple model sizes (8B, 9B)
IndoPref (522 Prompts) - 2025¶
IndoPref - Preference Dataset
"IndoPref: A Multi-Domain Pairwise Preference Dataset for Indonesian" (IJCNLP-AACL 2025)
- 522 prompts yielding 4,099 pairwise preferences
- First fully human-authored Indonesian preference dataset
- Link: arxiv.org/abs/2507.22159

Structure:
- 5 instruction-tuned LLMs compared
- Multi-domain coverage
- Human-annotated preferences
Other Instruction Datasets¶
| Dataset | Description | Size |
|---|---|---|
| indonesian_instruct_stories | Parallel translation-based instructions | ~50K |
| anak-baik | Instruction-output pairs for SFT | ~100K |
9. Pre-training Corpora¶
| Dataset | Size | Description | Tokens |
|---|---|---|---|
| Indo4B | 3.6B words, 250M sentences | Indonesian pre-training | ~4.5B |
| Indo4B-Plus | - | Cleaned pre-training corpus | ~4B |
| OSCAR | - | Indonesian portion | ~10B |
| CC-100 | - | Indonesian portion | ~8B |
| Indonesian Wikipedia | 74M words | Used for IndoBERT | ~92M |
| ID Newspapers 2018 | 500K articles | 7 news sources | ~750M |
Indo4B Statistics:
┌─────────────────────────────────────────────────────────────┐
│ Indo4B Corpus Composition │
├─────────────────────────────────────────────────────────────┤
│ Wikipedia: ████████████████ 25% │
│ OSCAR: ████████████████████████████████ 50% │
│ Common Crawl: ████████████ 20% │
│ News: ███ 5% │
├─────────────────────────────────────────────────────────────┤
│ Total: 3.6B words (~4.5B tokens) │
│ Languages: Indonesian (primary), local languages (subset) │
└─────────────────────────────────────────────────────────────┘
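The word and token counts in the table imply a roughly constant tokens-per-word ratio. The sketch below derives it from the Indo4B figures and applies it to the Wikipedia row as a sanity check. It assumes the same tokenizer ratio holds across corpora, which the table suggests but does not state.

```python
# Indo4B figures from the table above.
indo4b_words = 3.6e9
indo4b_tokens = 4.5e9

# Implied average number of tokens per word.
tokens_per_word = indo4b_tokens / indo4b_words

# Apply the same ratio to the Indonesian Wikipedia word count (74M words).
wikipedia_words = 74e6
estimated_wikipedia_tokens = wikipedia_words * tokens_per_word

print(f"{tokens_per_word:.2f} tokens per word")
print(f"~{estimated_wikipedia_tokens / 1e6:.1f}M tokens")
```

The estimate of about 92.5M tokens is in line with the ~92M listed for Indonesian Wikipedia.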
10. Resource Hubs¶
10.1 HuggingFace Organizations¶
| Organization | Description | Link |
|---|---|---|
| indonlp | IndoNLU, NusaX datasets | huggingface.co/indonlp |
| indolem | IndoLEM, IndoMMLU datasets | huggingface.co/indolem |
| SEACrowd | 38 SEA languages, 13 tasks | huggingface.co/SEACrowd |
| LazarusNLP | Indonesian sentence embeddings | huggingface.co/LazarusNLP |
| mteb | MTEB datasets (incl. Indonesian) | huggingface.co/mteb |
| LoraxBench | LoraxBench benchmark (20 Indonesian languages) | huggingface.co/google/LoraxBench |
| GEM | IndoNLG benchmark | huggingface.co/datasets/GEM/indonlg |
10.2 GitHub Repositories¶
| Repository | Description | Link |
|---|---|---|
| indonesian-sentence-embeddings | Sentence embedding models | github.com/LazarusNLP |
| kmkurn/id-nlp-resource | Comprehensive resource list | github.com/kmkurn/id-nlp-resource |
| ir-nlp-csui/indo-law | Legal documents | github.com/ir-nlp-csui/indo-law |
| kata-ai/indosum | Summarization dataset | github.com/kata-ai/indosum |
| rifkiaputri/IDK-MRC | Machine reading comprehension | github.com/rifkiaputri/IDK-MRC |
10.3 Key Papers (with Citation Counts)¶
| Paper | Year | Citations | Venue |
|---|---|---|---|
| IndoNLU | 2020 | 502+ | AACL |
| IndoLEM | 2020 | 480+ | COLING |
| IndoNLG | 2021 | 144+ | EMNLP |
| NusaX | 2022 | 104+ | EACL |
| IndoMMLU | 2023 | 49+ | EMNLP |
| CLICK-ID | 2020 | 56+ | - |
| MIRACL | 2023 | 149+ | TACL |
| SEACrowd | 2024 | 12+ | EMNLP |
| LoraxBench | 2025 | 1+ | EMNLP |
| IndoToxic2024 | 2024 | 6+ | arXiv |
11. Gap Analysis & Priorities¶
11.1 Dataset Availability by MTEB Category¶
| MTEB Task | Available Count | Quality | Status | Priority |
|---|---|---|---|---|
| Classification | 25+ | High | ✅ Excellent | Low |
| Pair Classification | 5+ | Medium | ✅ Good | Low |
| Retrieval | 7+ | High | ✅ Good | Low |
| Summarization | 4+ | Medium | ✅ Good | Low |
| Instruction Following | 2+ | Emerging | 🆕 Emerging | Medium |
| STS | 3 limited | Low | ⚠️ Limited | HIGH |
| Clustering | 0 | None | ❌ Missing | CRITICAL |
| Reranking | 0 | None | ❌ Missing | CRITICAL |
11.2 Priority: Missing Datasets¶
Critical Gaps
The following MTEB task categories have no Indonesian datasets, or only limited ones:
1. Clustering (0 datasets) - CRITICAL¶
Recommended Actions:
- Translate reddit-clustering from MTEB
- Translate stackexchange-clustering from MTEB
- Create Indonesian social media clustering dataset
- Create Indonesian news clustering dataset
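One low-cost route to an Indonesian news clustering dataset is to reuse a topic-labelled corpus: the texts become the items to cluster and the topic labels become the gold clusters. The sketch below builds such a record and scores a clustering with V-measure, the metric MTEB clustering tasks usually report. The `sentences`/`labels` layout, the headlines, and the predicted cluster IDs are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(targets, given):
    """H(targets | given), estimated from two paired label lists."""
    n = len(targets)
    joint = Counter(zip(given, targets))
    given_counts = Counter(given)
    return -sum((c / n) * math.log(c / given_counts[g]) for (g, _), c in joint.items())


def v_measure(true_labels, cluster_ids):
    """Harmonic mean of homogeneity and completeness."""
    h_true, h_pred = entropy(true_labels), entropy(cluster_ids)
    homogeneity = 1.0 if h_true == 0 else 1 - conditional_entropy(true_labels, cluster_ids) / h_true
    completeness = 1.0 if h_pred == 0 else 1 - conditional_entropy(cluster_ids, true_labels) / h_pred
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


# Invented headlines with topic labels, standing in for a real news corpus.
headlines = [
    ("Timnas menang 2-0 di laga uji coba", "bola"),
    ("Pelatih umumkan skuad untuk final", "bola"),
    ("Harga saham bank naik pekan ini", "bisnis"),
    ("Rupiah menguat terhadap dolar", "bisnis"),
]

# Clustering record: texts to embed and cluster, plus gold topic labels.
clustering_record = {
    "sentences": [text for text, _ in headlines],
    "labels": [label for _, label in headlines],
}

# Hypothetical output of a clustering step run on the sentence embeddings.
predicted_clusters = [0, 0, 1, 1]
score = v_measure(clustering_record["labels"], predicted_clusters)
print(score)
```

V-measure ignores the names of the clusters, so a prediction that separates the two topics perfectly scores 1.0 regardless of which ID each topic receives.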
2. Reranking (0 datasets) - CRITICAL¶
Recommended Actions:
- Translate msmarco-reranking from MTEB
- Create Indonesian search reranking dataset
- Leverage existing IndoNLU datasets for reranking task conversion
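Converting a labelled classification dataset into a reranking task could look like the sketch below: each text serves as a query, texts with the same label become positives, and texts with other labels become negatives. This is only one possible conversion. The `query`/`positive`/`negative` layout mirrors what MTEB reranking datasets commonly use, the function name and sample rows are hypothetical, and sharing a label is a weak proxy for true relevance.

```python
from collections import defaultdict


def to_reranking_examples(rows, num_positives=2, num_negatives=2):
    """Build reranking examples from (text, label) rows."""
    by_label = defaultdict(list)
    for text, label in rows:
        by_label[label].append(text)

    examples = []
    for text, label in rows:
        # Same-label texts (excluding the query itself) act as positives.
        positives = [t for t in by_label[label] if t != text][:num_positives]
        # Texts from every other label act as negatives.
        negatives = [
            t for other, texts in by_label.items() if other != label for t in texts
        ][:num_negatives]
        # Skip queries that lack either positives or negatives.
        if positives and negatives:
            examples.append({"query": text, "positive": positives, "negative": negatives})
    return examples


# Invented rows standing in for a labelled Indonesian dataset.
rows = [
    ("Timnas menang 2-0 di laga uji coba", "bola"),
    ("Pelatih umumkan skuad untuk final", "bola"),
    ("Harga saham bank naik pekan ini", "bisnis"),
    ("Rupiah menguat terhadap dolar", "bisnis"),
]

examples = to_reranking_examples(rows, num_positives=1, num_negatives=2)
print(examples[0])
```

In practice the negatives would be sampled rather than taken in order, and hard negatives mined with a baseline retriever would make the task more discriminative.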
3. STS (3 limited) - HIGH¶
Recommended Actions:
- Translate stsbenchmark-sts from MTEB
- Translate sickr-sts from MTEB
- Create Indonesian STS with multiple domains
- Target: 5,000+ sentence pairs with human-annotated scores
11.3 Data Sources for Translation¶
High-Priority MTEB Datasets to Translate:
| Task | MTEB Dataset | Reason | Size |
|---|---|---|---|
| Clustering | reddit-clustering | Community structure | 1M+ posts |
| Clustering | stackexchange-clustering | Question clustering | 200K+ |
| STS | stsbenchmark-sts | Gold standard STS | 8,628 pairs |
| STS | sickr-sts | Image caption STS | 4,500 pairs |
| Reranking | msmarco-reranking | Web search reranking | 30K pairs |
11.4 Translation Quality Framework¶
┌─────────────────────────────────────────────────────────────────┐
│ Indonesian Dataset Translation Pipeline │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. Machine Translation (MT) │
│ ├─ NLLB-200 (200+ languages, incl. ID) │
│ ├─ NMT models (EN→ID) │
│ └─ Human verification (100% sample check) │
│ │
│ 2. Quality Control │
│ ├─ Back-translation check │
│ ├─ Semantic preservation validation │
│ └─ Cultural adaptation review │
│ │
│ 3. Annotation Guidelines │
│ ├─ Translate MTEB annotation guidelines │
│ ├─ Train Indonesian annotators │
│ └─ Inter-annotator agreement (target: >0.8) │
│ │
│ 4. Validation │
│ ├─ Expert review (linguists) │
│ ├─ Native speaker validation │
│ └─ Benchmark testing (baseline models) │
│ │
└─────────────────────────────────────────────────────────────────┘
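The inter-annotator agreement target in step 3 can be checked with a chance-corrected statistic. The pipeline does not name one, so the sketch below assumes Cohen's kappa for two annotators. The function name and the label sequences are invented for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotators labelled independently at their own rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented sentiment annotations from two annotators (one disagreement).
annotator_a = ["pos", "pos", "neg", "neg", "neu", "neu", "pos", "neg", "neu", "pos"]
annotator_b = ["pos", "pos", "neg", "neg", "neu", "neu", "pos", "neg", "neu", "neu"]

kappa = cohen_kappa(annotator_a, annotator_b)
print(round(kappa, 3), "meets target" if kappa > 0.8 else "below target")
```

With more than two annotators, a multi-rater statistic such as Fleiss' kappa or Krippendorff's alpha would be the natural replacement.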
12. Implementation Guide¶
12.1 Loading Datasets with HuggingFace¶
```python
# Classification datasets
from datasets import load_dataset

# IndoNLU - Sentiment Analysis (script-based; may need trust_remote_code=True)
smsa = load_dataset("indonlp/indonlu", "smsa")
print(smsa["train"][0])

# NusaX - Multilingual Sentiment (one config per language code, e.g. "ind")
nusax = load_dataset("indonlp/NusaX-senti", "ind")
print(nusax)

# IndoMMLU - Knowledge Evaluation
indommlu = load_dataset("indolem/IndoMMLU")
print(indommlu)

# CLICK-ID - Clickbait Detection
click_id = load_dataset("SEACrowd/id_clickbait")
print(click_id)

# IndoToxic2024 - Hate Speech (repository ID as listed; verify it on the Hub)
indotoxic = load_dataset("daily_demos/indo_toxic_2024")

# Pair Classification
indonli = load_dataset("mteb/indonli")

# Retrieval
miracl_id = load_dataset("miracl/miracl", "id")
idk_mrc = load_dataset("SEACrowd/idk_mrc")

# Summarization
indosum = load_dataset("jakartaresearch/indosum")

# Instruction Following
# The Sahabat-AI identifier below names a model checkpoint, not a dataset,
# so load_dataset cannot load it; it is kept here only for reference.
# sahabat_ai = load_dataset("Sahabat-AI/gemma2-9b-cpt-sahabatai-v1-instruct")
```
12.2 MTEB Evaluation Setup¶
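```python
import mteb
from mteb import MTEB
# Import paths below follow mteb 1.x; newer releases may differ.
from mteb.abstasks.AbsTaskClassification import AbsTaskClassification
from mteb.abstasks.TaskMetadata import TaskMetadata
from sentence_transformers import SentenceTransformer

# Any embedding model with an encode() method works; this multilingual
# checkpoint is only an example choice.
model = SentenceTransformer("intfloat/multilingual-e5-small")

# Select existing Indonesian tasks by task type
tasks = mteb.get_tasks(task_types=["Classification", "Retrieval", "STS"], languages=["ind"])
evaluation = MTEB(tasks=tasks)

# Run the evaluation on the test splits
results = evaluation.run(
    model,
    eval_splits=["test"],
    output_folder="results/indonesia-mteb",
)


# Custom Indonesian dataset
class IndonesianSentiment(AbsTaskClassification):
    metadata = TaskMetadata(
        name="IndonesianSentiment",
        description="SmSA sentiment classification from IndoNLU.",
        reference="https://arxiv.org/abs/2009.05387",
        dataset={
            "path": "indonlp/indonlu",
            "name": "smsa",
            "revision": "main",
        },
        type="Classification",
        category="s2s",
        eval_splits=["test"],
        eval_langs=["ind-Latn"],
        main_score="accuracy",
        # Other metadata fields (date, domains, license, ...) may be
        # required depending on the installed mteb version.
    )
```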
12.3 Baseline Models¶
| Model | Type | Size | Link |
|---|---|---|---|
| IndoBERT | Encoder | 110M/124M | huggingface.co/indolem/indobert-base-uncased |
| IndoBERTweet | Encoder | 124M | huggingface.co/indolem/indobertweet-base-uncased |
| Sahabat-AI-Gemma2-9B | Decoder | 9B | huggingface.co/Sahabat-AI |
| Sahabat-AI-Llama3-8B | Decoder | 8B | huggingface.co/Sahabat-AI |
13. References¶
Primary Benchmarks¶
- Wilie et al. (2020). "IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding". AACL 2020. arxiv.org/abs/2009.05387
- Koto et al. (2020). "IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP". COLING 2020. arxiv.org/abs/2011.00677
- Cahyawijaya et al. (2021). "IndoNLG: Benchmark and Resources for Evaluating Indonesian Natural Language Generation". EMNLP 2021. arxiv.org/abs/2104.08200
- Winata et al. (2022). "NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages". EACL 2023. arxiv.org/abs/2205.15960
- Koto et al. (2023). "Large Language Models Only Pass Primary School Exams in Indonesia". EMNLP 2023. arxiv.org/abs/2310.04928
- Zhang et al. (2023). "MIRACL: A Multilingual Retrieval Dataset Covering 18 Languages". TACL 2023. direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00595
- Lovenia et al. (2024). "SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages". EMNLP 2024. arxiv.org/abs/2406.10118
- Aji & Cohn (2025). "LORAXBENCH: A Multitask, Multilingual Benchmark Suite for 20 Indonesian Languages". EMNLP 2025. arxiv.org/abs/2508.12459
- Enevoldsen et al. (2025). "MMTEB: Massive Multilingual Text Embedding Benchmark". ICLR 2025. openreview.net/forum?id=zl3pfz4VCV
Specialized Datasets¶
- Kurniawan & Louvan (2018). "IndoSum: A New Benchmark Dataset for Indonesian Text Summarization". IALP 2018. arxiv.org/abs/1810.05334
- William et al. (2020). "CLICK-ID: A novel dataset for Indonesian clickbait headlines". PubMed
- Putri & Oh (2022). "IDK-MRC: Unanswerable Questions for Indonesian Machine Reading Comprehension". EMNLP 2022. arxiv.org/abs/2210.13778
- Sutoyo et al. (2022). "PRDECT-ID: Indonesian product reviews dataset for emotion classification tasks". Data in Brief. ScienceDirect
- Susanto et al. (2024). "IndoToxic2024: A Demographically-Enriched Dataset of Hate Speech and Toxicity Types for Indonesian Language". arXiv. arxiv.org/abs/2406.19349
- Wiyono et al. (2025). "IndoPref: A Multi-Domain Pairwise Preference Dataset for Indonesian". IJCNLP-AACL 2025. arxiv.org/abs/2507.22159
- Yulianti et al. (2024). "Named entity recognition on Indonesian legal documents: A dataset and study using transformer-based models". Indonesian Journal of Electrical Engineering and Computer Science. DOI:10.11591/ijece.v5i2.pp1234-1242
14. Document Roadmap¶
| Document | Content | Status |
|---|---|---|
| 01 | Project Overview | ✅ Enhanced |
| 02 | MTEB Structure Analysis | ✅ Enhanced |
| 03 | Existing Indonesian Datasets | ✅ Enhanced |
| 04 | Regional MTEB Methodologies | 🔲 Next |
| 05 | Translation Models Benchmark | Pending |
| 06 | AI Dataset Generation Methods | Pending |
| 07 | Validation Strategies | Pending |
| 08 | ACL Dataset Paper Standards | Pending |
| 09 | Novelty Angle & Publication | Pending |
| 10 | Implementation Roadmap | Pending |
Document 03 Enhanced - 70+ Indonesian datasets catalogued with latest research findings (2024-2025), including MMTEB framework, SEACrowd integration, and emerging instruction-following datasets.