Project: Indonesia-MTEB Benchmark
Document: 06 - AI Dataset Generation Methods
Last Updated: 2026-01-25
Version: 2.0 (Enhanced)
Status: Research Phase
AI Dataset Generation Methods for Indonesia-MTEB¶
"Synthetic data generation is the key to filling critical gaps in Indonesia-MTEB—especially for Clustering, Reranking, and STS tasks where existing Indonesian datasets are scarce. This document provides a comprehensive guide to generating high-quality Indonesian embedding datasets at scale."
Table of Contents¶
- Executive Summary
- Synthetic Data Landscape
- Model Selection for Indonesian Generation
- Generation Frameworks
- Cost Estimation and Budgeting
- Task-Specific Generation Methods
- Prompt Engineering with Indonesian Examples
- Hard Negative Generation
- Quality Validation Pipeline
- Indonesian Text Normalization
- LLM-as-a-Judge Validation
- Indonesian-Specific Considerations
- Failure Mode Analysis
- Implementation Roadmap
- Case Studies from Regional MTEBs
- Key Takeaways
- References
1. Executive Summary¶
1.1 The Synthetic Data Opportunity¶
Regional MTEBs have successfully used LLM-generated synthetic data to fill dataset gaps:
| Benchmark | Synthetic Data Usage | Impact | Key Insight |
|---|---|---|---|
| ArabicMTEB | 40% of training data | +16 points (Swan-Small) | Synthetic data significantly boosts performance |
| SPEED | 920K embedding pairs | Outperforms E5-mistral with 1/10 GPT calls | Small models can generate high-quality data |
| VN-MTEB | Translation + validation | 65-72% kept ratio | LLM-as-judge critical for quality control |
| TR-MTEB | 34.2M training pairs | Competitive SOTA results | Synthetic + human data hybrid approach |
| AfriMTEB | 6 new synthetic datasets | 59 languages, 14 tasks | Multicultural synthetic data generation |
| SEA-BED | 169 datasets (71% human) | 10 SEA languages | Regional adaptation is critical |
1.2 Key Findings¶
- SPEED Framework (Chen et al., 2024) enables small 8B models to generate embedding data that outperforms GPT-4-only approaches with <1/10 API calls
- Indonesian-optimized models (SEA-LION-v4, SahabatAI, Cendol) show promising generation capabilities
- Three-stage quality control (language detection → semantic similarity → LLM-as-judge) is essential
- Scaling law: Log-linear relationship between synthetic data size and embedding model performance
- Cost efficiency: Command R+ at $1-2/1M tokens is 3-15× cheaper than GPT-4o/Claude for generation
- Task-specific prompting with Indonesian examples significantly improves quality
1.3 Indonesia-MTEB Dataset Gaps¶
| MTEB Task | Existing Indonesian Datasets | Gap | Synthetic Priority |
|---|---|---|---|
| Clustering | 0 | Complete absence | CRITICAL |
| Reranking | 0 | Complete absence | CRITICAL |
| STS | 3 (limited) | Insufficient coverage | HIGH |
| Retrieval | 2 | Domain gaps | MEDIUM |
| Pair Classification | 2 (IndoNLI, SNLI-Indo) | Limited domains | MEDIUM |
| Classification | 8 | Domain imbalance | LOW |
| Instruction Following | 0 | Complete absence | HIGH |
| Summarization | 1 (IndoSum) | Single source | MEDIUM |
2. Synthetic Data Landscape¶
2.1 State of Synthetic Data in NLP (2024-2025)¶
┌─────────────────────────────────────────────────────────────────────────┐
│ SYNTHETIC DATA GENERATION LANDSCAPE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ APPROACHES │
│ ├─ Pure LLM Generation (GPT-4, Claude, Command R+) │
│ ├─ Small Model Alignment (SPEED: 8B → GPT-4 quality) │
│ ├─ Self-Instruct (Bootstrap from seed examples) │
│ ├─ Hybrid (Synthetic + Human Curation) │
│ └─ Translation-Based (MT → Target Language) │
│ │
│ APPLICATIONS │
│ ├─ Text Embeddings (classification, STS, retrieval) │
│ ├─ Question Answering │
│ ├─ Instruction Tuning │
│ ├─ Code Generation │
│ └─ Multimodal (vision-language) │
│ │
│ QUALITY VALIDATION │
│ ├─ LLM-as-Judge (85.2% human agreement with calibration) │
│ ├─ Semantic Similarity (threshold-based filtering) │
│ ├─ Statistical Validation (word length, distribution) │
│ ├─ Deduplication (MinHash, SimHash) │
│ └─ Human Spot-Check (10% sample recommended) │
│ │
│ CHALLENGES │
│ ├─ Hallucination detection │
│ ├─ Mode collapse (repetitive outputs) │
│ ├─ Cultural bias │
│ ├─ Language register inconsistency │
│ └─ Quality-cost tradeoff │
│ │
└─────────────────────────────────────────────────────────────────────────┘
2.2 Synthetic Data on HuggingFace¶
As of 2024, 300+ datasets on HuggingFace are tagged as "synthetic", and mainstream LLMs increasingly train on synthetic data. Key dataset hubs for Indonesian and regional languages:
- NusaCrowd: 121+ datasets for Indonesian and regional languages
- SEACrowd: 36 SEA indigenous languages, 13 tasks
- IndoNLP: Centralized Indonesian NLP resources
2.3 Cost-Benefit Analysis¶
| Method | Quality | Cost (USD/1M tokens) | Speed | Recommendation for Indonesian |
|---|---|---|---|---|
| GPT-4o | ★★★★★ | $5.00 input / $15.00 output | Slow | For seed data only |
| Claude 3.5 Sonnet | ★★★★★ | $3.00 input / $15.00 output | Medium | For complex generation |
| Command R+ | ★★★★★ | $1.00 input / $2.00 output | Fast | Recommended for quality |
| Command-light | ★★★★☆ | $0.30 input / $0.60 output | Fast | Best value for scale |
| Aya-23-35B | ★★★★☆ | Self-hosted | Fast | Alternative (SEA focus) |
| SPEED-aligned 8B | ★★★★☆ | $0.10-0.20 (API equivalent) | Fast | Recommended for scale |
| SEA-LION-v4 | ★★★☆☆ | Self-hosted | Fast | For Indonesian-specific |
| Qwen2.5-7B | ★★★★☆ | Self-hosted | Fast | Multilingual capable |
3. Model Selection for Indonesian Generation¶
3.1 Indonesian LLM Landscape (2025)¶
┌─────────────────────────────────────────────────────────────────────────┐
│ INDONESIAN LLM MODEL COMPARISON │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ CLOSED-SOURCE API MODELS │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Model │ Params │ Input │ Output │ ID Support │ Use │ │
│ ├─────────────────────────────────────────────────────────────────┤ │
│ │ Command R+ │ 104B │ $1.00 │ $2.00 │ ★★★★★ │ Best │ │
│ │ Command-light │ ~ │ $0.30 │ $0.60 │ ★★★★☆ │ Value │ │
│ │ Aya-23-35B │ 35B │ TBD │ TBD │ ★★★★☆ │ Multil │ │
│ │ GPT-4o │ - │ $5.00 │ $15.00 │ ★★★★★ │ Seed │ │
│ │ Claude 3.5 │ - │ $3.00 │ $15.00 │ ★★★★★ │ Complex│ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ OPEN-SOURCE MODELS (Self-Hosted) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Model │ Params │ VRAM │ ID Support │ Use │ │
│ ├─────────────────────────────────────────────────────────────────┤ │
│ │ SEA-LION-v4 │ 8B │ 16GB │ ★★★★★ │ ID-specialized│ │
│ │ SahabatAI-v1 │ 9B │ 16GB │ ★★★★★ │ ID + dialects │ │
│ │ Cendol │ 7B │ 14GB │ ★★★★☆ │ ID tasks │ │
│ │ Qwen2.5-7B │ 7B │ 14GB │ ★★★★☆ │ Multilingual │ │
│ │ LLaMA-3.1-8B │ 8B │ 16GB │ ★★★☆☆ │ Fine-tune │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
3.2 Model Selection by Use Case¶
| Use Case | Recommended Model | Rationale |
|---|---|---|
| Large-scale generation | SPEED-aligned Qwen2.5-8B | 10× cost savings, good quality |
| High-quality seed data | Command R+ | Best Indonesian generation, reasonable cost |
| Domain-specific (legal/medical) | SEA-LION-v4 fine-tuned | Indonesian context understanding |
| Code-mixed data | SahabatAI-v1 | Trained on ID-Javanese-Sundanese-English |
| Regional languages | NusaX-based models | 10 Indonesian regional languages |
| Instruction following | Aya-23-35B | Strong instruction following in 23 languages |
3.3 SEA-LION-v4 Analysis¶
SEA-LION-v4 (AI Singapore) is the most Indonesian-optimized model:
- Training Data: 35% Indonesian sources (Wikipedia ID, news, social media)
- Languages: 11 SEA languages (Indonesian, Malay, Vietnamese, Thai, Burmese, Lao, Filipino, Tamil, Khmer, Javanese, Sundanese)
- Performance: State-of-the-art on SEA-HELM benchmark
- Tokenization: 1.2 tokens/word for Indonesian (best in class)
- VRAM: 16GB (BF16) / 5GB (INT4)
3.4 SahabatAI-v1 Analysis¶
SahabatAI-v1 (GoTo/CSA Lab) is Indonesian-fine-tuned:
- Base: Gemma2-9B
- Languages: Indonesian, Javanese, Sundanese with code-mixing support
- Training: Continued pre-training on 20B Indonesian tokens
- Use Case: Best for informal/formal Indonesian generation
- Cost: Self-hosted, requires 16GB VRAM
3.5 Cendol Model Analysis¶
Cendol (IndoLLM) family includes:
- Cendol-7B: Indonesian-optimized instruction model
- Languages: Indonesian + 5 regional languages (Javanese, Sundanese, Balinese, Minangkabau, Buginese)
- Evaluation: 15 datasets including cultural reasoning
- Use Case: Culturally-aware generation
4. Generation Frameworks¶
4.1 SPEED Framework¶
SPEED (Synthesizing High-Quality Embedding Data at Scale) aligns small 8B models to generate embedding data, achieving better performance than GPT-4-only approaches with 1/10 the API calls.
- Paper: Chen et al. (2024). "Little Giants: Synthesizing High-Quality Embedding Data at Scale."
- arXiv: 2410.18634
- Code: github.com/haon-chen/SPEED
4.2 SPEED Architecture¶
┌─────────────────────────────────────────────────────────────────────────┐
│ SPEED FRAMEWORK │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ STAGE 1: TASK BRAINSTORMING │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ • GPT-4 generates diverse task descriptions │ │
│ │ • Topics sampled from Open Directory Project (ODP) │ │
│ │ • For Indonesian: Use ID-specific topics (see Section 7.6) │ │
│ │ • Output: Task pool T │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ STAGE 2: JUNIOR GENERATOR (SFT) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ • GPT-4 generates small seed dataset D_seed (5K-10K samples) │ │
│ │ • SFT on small model (Qwen2.5-8B or LLaMA-3-8B) → π_Jr │ │
│ │ • Objective: Standard supervised loss on (prompt, task, data) │ │
│ │ • Temperature: 0.8-1.0 (diversity) │ │
│ │ • Output: Basic data synthesis capability │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ STAGE 3: SENIOR GENERATOR (DPO) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ • π_Jr generates root data D_root (50K-100K samples) │ │
│ │ • GPT-4 evaluates best/worst in each list (preference pairs) │ │
│ │ • DPO optimizes → π_Sr (senior generator) │ │
│ │ • β (DPO) = 0.1 (alignment vs reference tradeoff) │ │
│ │ • Output: High-quality synthesis model │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ STAGE 4: DATA REVISOR (Self-Improvement) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ • GPT-4 evaluates D_root on 3 aspects: │ │
│ │ 1. Relevance to task │ │
│ │ 2. Completeness per requirements │ │
│ │ 3. Factual accuracy │ │
│ │ • Produces revision signals → π_Re (revisor) │ │
│ │ • Refines synthetic data with minimal inference cost │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ FINAL PIPELINE │
│ π_Sr generates large-scale data → π_Re refines → High-quality dataset │
│ │
└─────────────────────────────────────────────────────────────────────────┘
4.3 SPEED Results¶
| Model | GPT API Calls | GPT Tokens | MTEB Score | Cost Efficiency |
|---|---|---|---|---|
| E5-mistral (GPT-only) | 500K | 180M | 63.2 | Baseline |
| SPEED (8B aligned) | 45K | 32M | 64.8 | 10× fewer calls |
| Mistral_llama3 | 230K | - | 62.6 | 2× worse than SPEED |
4.4 SPEED Scaling Law¶
SPEED discovered a log-linear relationship between embedding model performance and synthetic data size:
Performance = α × log(data_size) + β
Where:
- α ≈ 2.5-3.0 (slope)
- β ≈ 45-50 (intercept)
- Diminishing returns beyond ~1M samples
Practical implication for Indonesia-MTEB: Target 50K-100K high-quality samples per task type for optimal performance.
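As a rough planning aid, the law can be evaluated directly. A minimal sketch, assuming base-10 logarithms and the midpoint constants above (the paper's exact fit is not reproduced here):

```python
import math

def predicted_score(data_size: int, alpha: float = 2.75, beta: float = 47.5) -> float:
    """Log-linear scaling law: Performance = alpha * log10(data_size) + beta."""
    return alpha * math.log10(data_size) + beta

for n in (10_000, 50_000, 100_000, 1_000_000):
    print(f"{n:>9,} samples -> ~{predicted_score(n):.1f}")
```

With these constants, one extra point costs roughly a 2.3× increase in data, which is why the 50K-100K per-task target is a reasonable operating point.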
4.5 SPEED Key Hyperparameters¶
| Component | Hyperparameter | Optimal Value | Notes for Indonesian |
|---|---|---|---|
| Junior Generator | Temperature | 0.8-1.0 | Balance diversity/quality |
| Junior Generator | Training samples | 25K-50K | Use Indonesian seed data |
| Senior Generator (DPO) | β (DPO) | 0.1 | Trade-off alignment/reference |
| Senior Generator (DPO) | Training samples | 10K-15K | High-quality Indonesian pairs |
| Data Revisor | Training samples | 25K-35K | Easier than synthesis |
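For reference, the objective that the β row controls is the standard DPO loss; a minimal PyTorch sketch (not SPEED's training code), where each log-probability is summed over a response's tokens:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO: push the policy to prefer GPT-4's 'best' over 'worst' generations,
    while beta keeps it close to the junior-generator reference."""
    chosen = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen - rejected).mean()
```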
4.6 Self-Instruct Framework¶
Self-Instruct (Wang et al., 2023) bootstraps instruction-following data:
┌─────────────────────────────────────────────────────────────────────────┐
│ SELF-INSTRUCT FRAMEWORK │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ STEP 1: SEED GENERATION │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ • Human writes ~175 seed instruction-response pairs │ │
│ │ • For Indonesian: Include bilingual examples │ │
│ │ • Cover diverse task types │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ STEP 2: BOOTSTRAP GENERATION │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ • For each seed: Generate 8 new instructions │ │
│ │ • Prompt: "Generate 8 diverse instructions for..." │ │
│ │ • Language model generates both instruction and response │ │
│ │ • ~1,400 new pairs from 175 seeds │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ STEP 3: FILTERING │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ • Remove low-quality outputs │ │
│ │ • Filter by Indonesian language detection │ │
│ │ • Remove near-duplicates (MinHash) │ │
│ │ • Typical keep rate: 50-70% │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ STEP 4: ITERATION │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ • Add filtered data to training pool │ │
│ │ • Fine-tune model on new data │ │
│ │ • Repeat from Step 2 (typically 3-5 iterations) │ │
│ │ • Final dataset: 50K-100K instruction pairs │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
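A minimal sketch of one bootstrap round (Steps 2-3); `generate_fn` and `filter_fn` are placeholders for the LLM call and the filtering stack described above:

```python
def self_instruct_round(pool: list[dict], generate_fn, filter_fn,
                        n_per_seed: int = 8) -> list[dict]:
    """Expand each seed into new (instruction, response) pairs, keep survivors."""
    new_pairs = []
    for seed in pool:
        prompt = (f"Buat {n_per_seed} instruksi baru yang beragam, "
                  f"mirip dengan:\n{seed['instruction']}")
        for pair in generate_fn(prompt):   # returns dicts with instruction/response
            if filter_fn(pair):            # language ID + quality + MinHash dedup
                new_pairs.append(pair)
    return pool + new_pairs

# Typically iterated 3-5 times:
# pool = self_instruct_round(pool, llm_generate, passes_filters)
```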
5. Cost Estimation and Budgeting¶
5.1 API Pricing Comparison (2025)¶
| Model | Input (USD/1M) | Output (USD/1M) | Context | Indonesian Support |
|---|---|---|---|---|
| GPT-4o | $5.00 | $15.00 | 128K | ★★★★★ |
| Claude 3.5 Sonnet | $3.00 | $15.00 | 200K | ★★★★★ |
| Command R+ | $1.00 | $2.00 | 128K | ★★★★★ |
| Command-light | $0.30 | $0.60 | 128K | ★★★★☆ |
| GPT-4o-mini | $0.15 | $0.60 | 128K | ★★★★☆ |
5.2 Cost Estimation by Task¶
Assuming 10,000 samples per task type with average token counts:
| Task | Tokens/Sample | Total Tokens | Command R+ Cost | GPT-4o Cost | Savings |
|---|---|---|---|---|---|
| Classification | 150 | 1.5M | $2.25 | $11.25 | 80% |
| Clustering | 300 | 3.0M | $4.50 | $22.50 | 80% |
| Reranking | 500 | 5.0M | $7.50 | $37.50 | 80% |
| STS | 200 | 2.0M | $3.00 | $15.00 | 80% |
| Retrieval | 400 | 4.0M | $6.00 | $30.00 | 80% |
| Instruction | 250 | 2.5M | $3.75 | $18.75 | 80% |
| Total | - | 18M | $27.00 | $135.00 | $108 |
Self-hosted alternative (Qwen2.5-7B):
- Hardware: 1× RTX 4090 (24GB VRAM) @ $0.50/hour spot
- Generation time: ~50 hours for 60K samples
- Total cost: ~$25 + electricity
- Break-even: ~1M tokens vs Command R+
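The per-task figures above follow from a simple blended-rate calculation; a small helper that makes the prompt/completion split explicit (the 50/50 split is an assumption that reproduces the Command R+ column):

```python
PRICES_PER_M = {  # USD per 1M tokens (input, output), from Section 5.1
    "gpt-4o": (5.00, 15.00),
    "claude-3.5-sonnet": (3.00, 15.00),
    "command-r-plus": (1.00, 2.00),
    "command-light": (0.30, 0.60),
}

def generation_cost(model: str, n_samples: int, tokens_per_sample: int,
                    input_ratio: float = 0.5) -> float:
    """Blended cost estimate; input_ratio is the assumed share of prompt tokens."""
    millions = n_samples * tokens_per_sample / 1e6
    p_in, p_out = PRICES_PER_M[model]
    return millions * (input_ratio * p_in + (1 - input_ratio) * p_out)

print(generation_cost("command-r-plus", 10_000, 500))  # 7.5 -> matches the reranking row
```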
5.3 Budget Recommendations for Indonesia-MTEB¶
| Component | Recommended Approach | Estimated Cost |
|---|---|---|
| Seed data (5K samples) | GPT-4o or Claude 3.5 | $20-30 |
| Large-scale generation (50K+) | SPEED-aligned 8B or Command R+ | $50-100 |
| Validation (LLM-as-judge) | Claude 3.5 or GPT-4o | $30-50 |
| Human annotation (500 samples) | $2-3/sample | $1,000-1,500 |
| Infrastructure | Cloud GPU or on-premise | $100-200 |
| Total | - | $1,200-2,000 |
6. Task-Specific Generation Methods¶
6.1 Clustering Dataset Generation¶
Challenge: Indonesia-MTEB has zero dedicated clustering datasets.
┌─────────────────────────────────────────────────────────────────────────┐
│ CLUSTERING DATASET GENERATION PIPELINE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ INDONESIAN DATA SOURCES │
│ ├─ News: detik.com, kompas.com, tempo.co, CNN Indonesia │
│ ├─ Wikipedia Indonesia articles (id.wikipedia.org) │
│ ├─ Social media: Twitter/X, Instagram, TikTok │
│ ├─ E-commerce: Tokopedia, Shopee product descriptions │
│ └─ Government: indonesia.go.id publications │
│ │
│ GENERATION METHOD │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Step 1: Document Collection │ │
│ │ • Scraping from sources above (target: 50K-100K documents) │ │
│ │ • Clean and normalize text (see Section 10) │ │
│ │ │ │
│ │ Step 2: LLM-based Clustering │ │
│ │ • Prompt: See Section 7.2 (Indonesian clustering prompt) │ │
│ │ • Output: Document + cluster_id + cluster_label │ │
│ │ │ │
│ │ Step 3: Cluster Description Generation │ │
│ │ • Generate semantic descriptions for each cluster │ │
│ │ • Identify cluster themes and topics │ │
│ │ │ │
│ │ Step 4: Hard Negative Generation │ │
│ │ • Generate documents near cluster boundaries │ │
│ │ • Output: Boundary case documents for evaluation │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ VALIDATION METRICS │
│ ├─ Semantic coherence (avg intra-cluster cosine similarity) │
│ ├─ Cluster separation (inter-cluster distance) │
│ ├─ Silhouette score │
│ └─ Human verification (100-200 samples per dataset) │
│ │
│ TARGET DATASETS (10) │
│ ├─ News clustering (politics, sports, entertainment, etc.) │
│ ├─ Product clustering (e-commerce categories) │
│ ├─ Social media topic clustering │
│ ├─ Wikipedia article clustering │
│ ├─ Scientific document clustering │
│ └─ ... (5 more specialized domains) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
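Once documents are embedded, the coherence and silhouette checks listed above can be computed with scikit-learn; a minimal sketch:

```python
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import cosine_similarity

def cluster_quality(embeddings: np.ndarray, labels: np.ndarray) -> dict:
    """Semantic coherence (mean intra-cluster cosine similarity) + silhouette."""
    coherence = []
    for c in np.unique(labels):
        members = embeddings[labels == c]
        if len(members) > 1:
            sims = cosine_similarity(members)
            n = len(members)
            coherence.append((sims.sum() - n) / (n * n - n))  # mean off-diagonal
    return {
        "semantic_coherence": float(np.mean(coherence)),
        "silhouette": float(silhouette_score(embeddings, labels, metric="cosine")),
    }
```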
6.2 Reranking Dataset Generation¶
Challenge: Indonesia-MTEB has zero reranking datasets.
Data Structure: (query, candidates, ranking)
┌─────────────────────────────────────────────────────────────────────────┐
│ RERANKING DATASET GENERATION PIPELINE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ GENERATION METHOD │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Step 1: Query Generation │ │
│ │ • Sources: │ │
│ │ - Indonesian search logs (Google Trends ID) │ │
│ │ - FAQ websites │ │
│ │ - Yahoo Answers Indonesia (archive) │ │
│ │ • LLM generation: See Section 7.3 (Reranking prompt) │ │
│ │ • Target: 5,000 diverse queries │ │
│ │ │ │
│ │ Step 2: Passage Candidate Generation │ │
│ │ • Sources: Wikipedia Indonesia, news articles │ │
│ │ • For each query: │ │
│ │ - 1 positive (highly relevant) │ │
│ │ - 3-5 hard negatives (semantically similar but wrong) │ │
│ │ - 5-10 random negatives │ │
│ │ • Hard negative generation: See Section 8 │ │
│ │ │ │
│ │ Step 3: Ranking Annotation │ │
│ │ • LLM-as-Judge: Rank candidates by relevance │ │
│ │ • Output format: [pos, neg1, neg2, ...] (descending relevance) │ │
│ │ • Human verification: 10% sample (500 queries) │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ DOMAIN-SPECIALIZED GENERATION │
│ ├─ Legal: Indonesian law documents (UU, PP) with queries │
│ │ Sources: JDIH, peraturan.go.id │ │
│ ├─ Medical: Health articles with symptom/diagnosis queries │
│ │ Sources: Alodokter, Halodoc articles │ │
│ ├─ Finance: Financial news with analysis queries │
│ │ Sources: Kontan, Bisnis Indonesia, CNBC Indonesia │ │
│ └─ News: Current events with fact-based queries │
│ Sources: Detik, Kompas, Tempo │ │
│ │
│ TARGET DATASETS (10) │
│ ├─ General web search reranking │
│ ├─ Legal document reranking │
│ ├─ Medical Q&A reranking │
│ ├─ Financial news reranking │
│ ├─ E-commerce product search │
│ └─ ... (5 more specialized domains) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
6.3 STS (Semantic Textual Similarity) Generation¶
Challenge: Indonesia-MTEB has only 3 limited STS datasets (IndoSTS, translated STS-B, translated SICK-R).
┌─────────────────────────────────────────────────────────────────────────┐
│ STS DATASET GENERATION PIPELINE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ GENERATION APPROACHES │
│ │
│ Approach 1: Paraphrase Generation (High Similarity: 4.0-5.0/5.0) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ • Input: Indonesian sentence │ │
│ │ • LLM: Generate 3-5 paraphrases with high similarity │ │
│ │ • Example: │ │
│ │ Source: "Pemerintah menaikkan harga bbm." │ │
│ │ Paraphrase 1: "Harga bbm dinaikkan oleh pemerintah." │ │
│ │ Paraphrase 2: "Kenaikan bbm dilakukan pemerintah." │ │
│ │ Paraphrase 3: "Pemerintah resmikan kenaikan harga bbm." │ │
│ │ Similarity: 4.5-5.0 │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ Approach 2: Thematic Variation (Medium Similarity: 2.5-3.5/5.0) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ • Input: Topic + context │ │
│ │ • LLM: Generate sentences on same theme, different wording │ │
│ │ • Example: │ │
│ │ Sentence 1: "Timnas Indonesia menang 3-0 melawan Thailand." │ │
│ │ Sentence 2: "Pertandingan sepak bola berakhir dengan skor 3-0."│ │
│ │ Similarity: 2.8 (same event, different focus) │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ Approach 3: Dissimilar Generation (Low Similarity: 0.0-1.5/5.0) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ • Input: Two different topics │ │
│ │ • LLM: Generate sentences on unrelated themes │ │
│ │ • Example: │ │
│ │ Sentence 1: "Gempa bermagnitudo 5.4 mengguncang Jogjakarta." │ │
│ │ Sentence 2: "Harga emas mengalami kenaikan hari ini." │ │
│ │ Similarity: 0.5 (completely unrelated) │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ANNOTATION METHODOLOGY │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ • LLM-as-Judge: Annotate similarity scores (0-5) │ │
│ │ • Verification: Semantic similarity model (gte-Qwen2-7B) │ │
│ │ - Compute cosine similarity between embeddings │ │
│ │ - Filter: Remove pairs where similarity < 0.7 for high label │ │
│ │ • Calibration: Human annotators for 500 sample pairs │ │
│ │ - Target: ≥85% correlation with LLM annotations │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ TARGET DATASETS (10) │
│ ├─ News STS (headline paraphrases) │
│ ├─ Social media STS (informal vs formal) │
│ ├─ Wikipedia STS (article similarity) │
│ ├─ Question STS (question paraphrase) │
│ ├─ Discussion STS (forum comment similarity) │
│ └─ ... (5 more specialized domains) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
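The verification step in the annotation box above can be sketched with sentence-transformers; the checkpoint name and the linear mapping from cosine similarity onto the 0-5 scale are assumptions, not fixed choices:

```python
from sentence_transformers import SentenceTransformer, util

# Any strong multilingual embedding model can fill this role.
model = SentenceTransformer("Alibaba-NLP/gte-Qwen2-7B-instruct")

def verify_sts_pair(sent1: str, sent2: str, llm_score: float,
                    max_gap: float = 1.5) -> bool:
    """Flag pairs where the LLM label and embedding similarity disagree."""
    emb = model.encode([sent1, sent2])
    cosine = float(util.cos_sim(emb[0], emb[1]))
    embedding_score = cosine * 5.0  # rough mapping of [0, 1] cosine to the 0-5 scale
    return abs(embedding_score - llm_score) <= max_gap  # keep pair if scores agree
```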
6.4 Instruction Following Dataset Generation¶
Challenge: Indonesia-MTEB has zero instruction following datasets.
┌─────────────────────────────────────────────────────────────────────────┐
│ INSTRUCTION FOLLOWING DATASET GENERATION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ INSTRUCTION CATEGORIES │
│ ├─ General Q&A (pengetahuan umum) │
│ ├─ Summarization (ringkasan) │
│ ├─ Translation (terjemahan) │
│ ├─ Creative writing (menulis kreatif) │
│ ├─ Code generation (pembuatan kode) │
│ ├─ Reasoning (penalaran) │
│ ├─ Classification (klasifikasi) │
│ └─ Extraction (ekstraksi informasi) │
│ │
│ GENERATION PIPELINE │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Step 1: Instruction Creation │ │
│ │ • Seed: 200-500 manually written Indonesian instructions │ │
│ │ • Bootstrap: Use LLM to generate 10× more instructions │ │
│ │ • Filter: Remove low-quality/repetitive instructions │ │
│ │ │ │
│ │ Step 2: Response Generation │ │
│ │ • For each instruction, generate response │ │
│ │ • Ensure response is appropriate and accurate │ │
│ │ • Verify: LLM-as-judge checks response quality │ │
│ │ │ │
│ │ Step 3: Quality Control │ │
│ │ • Human verification: 500-1000 samples │ │
│ │ • Criteria: Relevance, accuracy, completeness, fluency │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ INSTRUCTION EXAMPLES (Indonesian) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Q: Jelaskan perbedaan antara gotong royong dan kerja bakti. │ │
│ │ A: Gotong royong adalah budaya saling membantu dalam pekerjaan │ │
│ │ yang bersifat timbal balik dan sukarela, sedangkan kerja │ │
│ │ bakti lebih fokus pada kegiatan sosial kemasyarakatan. │ │
│ │ │ │
│ │ Q: Buatlah ringkasan dari artikel berikut dalam 3 kalimat. │ │
│ │ [ARTICLE] │ │
│ │ A: [RINGKASAN] │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
7. Prompt Engineering with Indonesian Examples¶
7.1 Effective Prompting Strategies¶
| Strategy | Description | Example for Indonesian |
|---|---|---|
| Topic-Based Brainstorming | Sample topics from Indonesian categories | "Generate retrieval tasks for: Olahraga/Sepak Bola/Timnas" |
| Few-Shot Examples | Provide Indonesian examples in prompt | 3-5 Indonesian examples per task type |
| Structured Output | Require JSON format with Indonesian text | {"teks": "...", "label": "..."} |
| Register Specification | Specify formal/informal Indonesian | "Gunakan bahasa Indonesia baku (formal)" |
| Domain Specification | Specify Indonesian domain context | "Generate dokumen domain HUKUM Indonesia" |
7.2 Clustering Generation Prompt (Indonesian)¶
```python
CLUSTERING_GENERATION_PROMPT = """
Anda adalah generator dataset untuk tugas clustering dokumen Bahasa Indonesia.
TUGAS:
Buat 10 cluster dari dokumen-dokumen berikut ini. Setiap cluster harus memiliki tema yang jelas.
DOKUMEN:
{documents}
OUTPUT FORMAT (JSON):
```json
{{
"clusters": [
{{
"cluster_id": 0,
"cluster_name": "[Nama cluster dalam Bahasa Indonesia]",
"description": "[Deskripsi singkat tema cluster]",
"documents": [0, 3, 7, ...],
"sample_document": "[Contoh dokumen representatif]"
}}
]
}}
```

PERSYARATAN:
1. Gunakan Bahasa Indonesia yang natural dan baku
2. Setiap cluster minimal 5 dokumen
3. Cluster harus saling eksklusif (tidak ada tumpang tindih)
4. Beri nama cluster yang spesifik dan informatif
5. Deskripsi harus menjelaskan tema cluster dengan jelas

CONTOH CLUSTER NAME:
- "Berita Politik dan Pemerintahan"
- "Olahraga Sepak Bola"
- "Teknologi dan Gadget"
- "Ekonomi dan Bisnis"
"""
# Hard negative generation for clustering
CLUSTER_HARD_NEGATIVE_PROMPT = """
Buat 2 dokumen yang TIDAK termasuk dalam cluster "{cluster_name}" tetapi memiliki kata kunci yang mirip.
Cluster description: {description}
Dokumen harus terlihat mirip dengan topik cluster tetapi membahas hal yang berbeda.
Output dalam format JSON:
{{
"hard_negatives": [
{{
"text": "[Isi dokumen]",
"reason": "Alasan mengapa ini mirip tapi berbeda"
}}
]
}}
"""
```
7.3 Reranking Generation Prompt (Indonesian)¶
```python
RERANKING_GENERATION_PROMPT = """
Anda adalah generator dataset untuk tugas reranking Bahasa Indonesia.
TUGAS:
Generate pasangan (query, dokumen) dengan berbagai tingkat relevansi.
QUERY: "{query_domain}"
OUTPUT FORMAT (JSON):
```json
{{
"query": "[Pertanyaan natural dalam Bahasa Indonesia]",
"positive": "[Dokumen yang menjawab query dengan benar]",
"hard_negatives": [
{{
"text": "[Dokumen mirip tapi tidak menjawab]",
"reason": "Alasan mengapa ini hard negative"
}}
],
"random_negatives": [
{{
"text": "[Dokumen topik berbeda]",
"reason": "Alasan mengapa ini random negative"
}}
]
}}
```

PERSYARATAN:
1. Query harus natural seperti yang ditulis pengguna Indonesia
2. Query length: 10-30 kata
3. Positive dokumen: 100-300 kata, langsung menjawab query
4. Hard negative: 3-5 dokumen, mirip topik tapi salah jawab
5. Random negative: 5-10 dokumen, topik benar-benar berbeda

CONTOH:
Query: "Apa itu gotong royong?"
Positive: "Gotong royong adalah budaya tolong-menolong yang sudah ..."
Hard negative: "Kegiatan kerja bakti dilakukan oleh masyarakat untuk..."
(Salah karena kerja bakti ≠ gotong royong)
"""
DOMAIN_SPECIFIC_PROMPTS = {
"legal": """
DOMAIN: HUKUM Indonesia
Sumber: UU, PP, Peraturan Pemerintah
Query harus terkait dengan:
- Penjelasan pasal undang-undang
- Perbandingan regulasi
- Implikasi hukum
Positive: Kutipan langsung dari dokumen hukum yang relevan
Hard negative: Dokumen hukum topik mirip tapi tidak menjawab
""",
"medical": """
DOMAIN: KESEHATAN
Query harus terkait dengan:
- Gejala penyakit
- Diagnosis medis
- Rekomendasi pengobatan umum
Positive: Informasi medis akurat dari sumber terpercaya
Hard negative: Penyakit dengan gejala mirip tapi berbeda
""",
"news": """
DOMAIN: BERITA Indonesia
Query harus terkait dengan:
- Fakta peristiwa berita
- Analisis berita
- Konteks peristiwa
Positive: Berita yang langsung menjawab pertanyaan
Hard negative: Berita topik mirip tapi peristiwa berbeda
"""
}
```
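To turn these templates into data, a minimal driver sketch follows; it assumes an OpenAI-compatible endpoint, a placeholder model name, and that the model returns the requested JSON inside a ```json fence:

```python
import json
from openai import OpenAI  # any OpenAI-compatible endpoint works here

client = OpenAI()  # endpoint and model below are placeholders, not fixed choices

def generate_reranking_sample(query_domain: str, domain: str | None = None) -> dict:
    prompt = RERANKING_GENERATION_PROMPT.format(query_domain=query_domain)
    if domain:  # optionally prepend the domain-specific instructions
        prompt = DOMAIN_SPECIFIC_PROMPTS[domain] + "\n" + prompt
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.9,
    )
    raw = resp.choices[0].message.content.strip()
    raw = raw.removeprefix("```json").removesuffix("```").strip()
    return json.loads(raw)

sample = generate_reranking_sample("hukum ketenagakerjaan", domain="legal")
```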
7.4 STS Generation Prompt (Indonesian)¶
```python
STS_GENERATION_PROMPT = """
Anda adalah generator dataset untuk Semantic Textual Similarity (STS) Bahasa Indonesia.
TUGAS:
Generate pasangan kalimat dengan berbagai tingkat kemiripan.
TOPIK: {topic}
OUTPUT FORMAT (JSON):
```json
{{
"pairs": [
{{
"sentence1": "[Kalimat pertama dalam Bahasa Indonesia]",
"sentence2": "[Kalimat kedua dalam Bahasa Indonesia]",
"similarity": 4.5,
"label": "paraphrase"
}}
]
}}
```
TINGKAT KEMIRIPAN:
- 4.5-5.0: Parafrase hampir identik (paraphrase)
- 3.5-4.4: Makna sama, redaksi berbeda (high similarity)
- 2.5-3.4: Topik sama, aspek berbeda (medium similarity)
- 1.5-2.4: Sedikit kemiripan (low similarity)
- 0.0-1.4: Hampir tidak mirip (dissimilar)

PERSYARATAN:
1. Gunakan Bahasa Indonesia natural (baku atau gaul sesuai konteks)
2. Kalimat length: 10-30 kata
3. Hindari kata-kata pengisi yang tidak perlu
4. Pastikan skor similarity sesuai dengan tingkat kemiripan sebenarnya

CONTOH:
Score 5.0:
- "Pemerintah menaikkan harga bbm."
- "Harga bbm dinaikkan oleh pemerintah."

Score 3.0:
- "Timnas Indonesia menang 3-0 atas Thailand."
- "Pertandingan sepak bola berakhir dengan skor 3-0."

Score 1.0:
- "Gempa mengguncang wilayah Jogjakarta."
- "Harga emas mengalami kenaikan hari ini."
"""
SIMILARITY_CALIBRATION_PROMPT = """
Berikan skor similarity (0-5) untuk pasangan kalimat berikut:

PASANGAN 1:
Kalimat 1: "{sent1}"
Kalimat 2: "{sent2}"

Pertimbangkan:
1. Makna (meaning) - apakah menyampaikan informasi yang sama?
2. Entitas (entities) - apakah subjek/objeknya sama?
3. Konteks (context) - apakah dalam konteks yang sama?

Output JSON saja:
"""
```
7.5 Classification Generation Prompt (Indonesian)¶
```python
CLASSIFICATION_GENERATION_PROMPT = """
Anda adalah generator dataset untuk tugas klasifikasi teks Bahasa Indonesia.
KONTEKS:
{task_description}
LABELS: {labels}
Generate 5 contoh untuk setiap label.
OUTPUT FORMAT (JSON):
```json
{{
"examples": [
{{
"text": "[Teks Bahasa Indonesia]",
"label": "[label]"
}}
]
}}
```

PERSYARATAN:
1. Text length: 50-200 kata
2. Gunakan Bahasa Indonesia natural
3. Hindari bias label (setiap label harus punya ciri unik)
4. Sertakan variasi gaya penulisan (formal/informal sesuai konteks)
5. Labels harus mutually exclusive

CONTOH untuk Analisis Sentimen:
Label: positif, negatif, netral

Positif: "Produk ini sangat bagus, pengiriman cepat dan kualitas terjamin!"
Negatif: "Sangat kecewa, barang rusak saat sampai dan tidak bisa diretur."
Netral: "Barang sudah diterima, akan dicoba nanti."
"""
# Domain-specific classification prompts
DOMAIN_CLASSIFICATION_PROMPTS = {
"news_category": """
TUGAS: Klasifikasi kategori berita Indonesia
LABELS:
- politik: Berita tentang pemerintahan, pemilu, kebijakan
- ekonomi: Berita bisnis, pasar, investasi
- olahraga: Berita tentang atlet, pertandingan, kompetisi
- teknologi: Berita gadget, software, startup
- entertainment: Berita selebriti, film, musik
""",
"clickbait": """
TUGAS: Klasifikasi headline clickbait
LABELS:
- clickbait: Headline yang menyesatkan/mengada-ada untuk klik
- legitimate: Headline yang jujur dan akurat
KARAKTERISTIK CLICKBAIT:
- Menggunakan kata-kata sensasional ("MENGHEBOHKAN", "TERSERA")
- Menggunakan ellipsis (...) yang menggantung
- Tidak memberikan informasi jelas
- Overstatement (melebih-lebihkan)
""",
"formality": """
TUGAS: Klasifikasi level keformalan Bahasa Indonesia
LABELS:
- formal: Bahasa baku, sesuai EYD, untuk tulisan resmi
- informal: Bahasa gaul/slang, untuk percakapan sehari-hari
- mixed: Campuran formal dan informal
KARAKTERISTIK:
- Formal: gunakan "saya", "adalah", tidak ada singkatan
- Informal: gunakan "aku", "gue", ada singkatan (yg, utk, dll)
"""
}
```
7.6 Indonesian Topic Categories (for ODP Sampling)¶
Based on Indonesian content sources, here are recommended topic categories:
┌─────────────────────────────────────────────────────────────────────────┐
│ INDONESIAN TOPIC CATEGORIES │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ NEWS (Berita) │
│ ├─ Politik & Pemerintahan │
│ │ ├─ Pemilihan Umum │
│ │ ├─ Kebijakan Pemerintah │
│ │ ├─ Partai Politik │
│ │ └─ Pemerintahan Daerah │
│ ├─ Ekonomi & Bisnis │
│ │ ├─ Pasar Saham & Investasi │
│ │ ├─ UMKM │
│ │ ├─ Startup & Teknologi Finansial │
│ │ └─ Harga & Inflasi │
│ ├─ Olahraga │
│ │ ├─ Sepak Bola (Timnas, Liga) │
│ │ ├─ Badminton │
│ │ ├─ Olahraga Elektronik │
│ │ └─ PON & SEA Games │
│ └─ Hiburan │
│ ├─ Film & Sinema Indonesia │
│ ├─ Musik & Konser │
│ └─ Selebriti Tanah Air │
│ │
│ LIFESTYLE (Gaya Hidup) │
│ ├─ Kuliner │
│ │ ├─ Resep Masakan Indonesia │
│ │ ├─ Street Food (Nasi Goreng, Sate, Bakso) │
│ │ └─ Review Restoran │
│ ├─ Wisata │
│ │ ├─ Bali & Lombok │
│ │ ├─ Yogyakarta & Borobudur │
│ │ ├─ Raja Ampat & Bunaken │
│ │ └─ Wisata Kuliner │
│ └─ Fashion │
│ ├─ Batik & Tenun │
│ ├─ Muslim Fashion │
│ └─ Local Brands │
│ │
│ TECHNOLOGY (Teknologi) │
│ ├─ Smartphones & Gadgets │
│ ├─ Aplikasi Indonesia (Gojek, Traveloka, dll) │
│ ├─ Startup │
│ └─ Gaming │
│ │
│ CULTURE (Budaya) │
│ ├─ Gotong Royong & Nilai Kebangsaan │
│ ├─ Batik, Wayang, Tradisi │
│ ├─ Hari Raya (Idul Fitri, Natal, Imlek, Nyepi) │
│ └─ Bahasa Daerah │
│ │
│ SOCIETY (Masyarakat) │
│ ├─ Pendidikan │
│ ├─ Kesehatan │
│ ├─ Transportasi (MRT, LRT, Tol) │
│ └─ Infrastruktur │
│ │
└─────────────────────────────────────────────────────────────────────────┘
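For Stage 1 brainstorming (Section 4.2), these categories can be sampled as topic paths; a minimal sketch over a condensed version of the tree (ID_TOPIC_TREE here covers only a few branches and should be extended to the full hierarchy above):

```python
import random

ID_TOPIC_TREE = {
    "Berita": {
        "Politik & Pemerintahan": ["Pemilihan Umum", "Kebijakan Pemerintah"],
        "Olahraga": ["Sepak Bola (Timnas, Liga)", "Badminton"],
    },
    "Gaya Hidup": {
        "Kuliner": ["Resep Masakan Indonesia", "Review Restoran"],
        "Wisata": ["Bali & Lombok", "Yogyakarta & Borobudur"],
    },
}

def sample_topic_path() -> str:
    """Draw one leaf path, e.g. 'Berita/Olahraga/Badminton'."""
    top = random.choice(list(ID_TOPIC_TREE))
    mid = random.choice(list(ID_TOPIC_TREE[top]))
    leaf = random.choice(ID_TOPIC_TREE[top][mid])
    return f"{top}/{mid}/{leaf}"

# e.g. "Generate retrieval tasks for: " + sample_topic_path()
```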
8. Hard Negative Generation¶
8.1 Hard Negative Strategies¶
┌─────────────────────────────────────────────────────────────────────────┐
│ HARD NEGATIVE GENERATION STRATEGIES │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ STRATEGY 1: KEYWORD OVERLAP (Mirip tapi Salah) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Query: "Kapan kemerdekaan Indonesia diperingati?" │ │
│ │ Positive: "Proklamasi kemerdekaan Indonesia dibaca pada ..." │ │
│ │ Hard Negative: "Peringatan kemerdekaan negara lain ..." │ │
│ │ Reason: Kata "kemerdekaan" muncul tapi konteks berbeda │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ STRATEGY 2: ENTITY SUBSTITUTION (Entitas Salah) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Query: "Siapa presiden pertama Indonesia?" │ │
│ │ Positive: "Ir. Soekarno adalah presiden pertama RI..." │ │
│ │ Hard Negative: "Ir. Hatta adalah wakil presiden pertama..." │ │
│ │ Reason: Entitas tokoh mirip tapi jawaban salah │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ STRATEGY 3: TOPIC DRIFT (Topik Mirip, Beda Aspek) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Query: "Manfaat minum air putih bagi kesehatan" │ │
│ │ Positive: "Minum air putih membantu hidrasi tubuh..." │ │
│ │ Hard Negative: "Sumber air bersih semakin langka..." │ │
│ │ Reason: Topik sama (air) tapi beda aspek │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ STRATEGY 4: TEMPORAL MISMATCH (Waktu Salah) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Query: "Hasil Piala AFF 2024" │ │
│ │ Positive: "Timnas Indonesia juara AFF 2024..." │ │
│ │ Hard Negative: "Timnas Indonesia juara AFF 2022..." │ │
│ │ Reason: Entitas dan topik sama tapi tahun berbeda │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ STRATEGY 5: NUMERICAL DIFFERENCE (Angka Beda) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Query: "Berapa provinsi di Indonesia?" │ │
│ │ Positive: "Indonesia memiliki 38 provinsi..." │ │
│ │ Hard Negative: "DPR memiliki 560 anggota..." │ │
│ │ Reason: Ada angka tapi menjawab pertanyaan berbeda │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
8.2 Hard Negative Generation Prompt¶
```python
HARD_NEGATIVE_GENERATION_PROMPT = """
Anda adalah generator hard negative untuk tugas retrieval Bahasa Indonesia.
TUGAS:
Generate 3-5 hard negatives untuk query berikut.
QUERY: "{query}"
POSITIVE DOCUMENT: "{positive}"
HARD NEGATIVE ADALAH:
Dokumen yang:
1. Mengandung kata kunci mirip dengan query atau positive
2. Topiknya terkait tapi TIDAK menjawab query dengan benar
3. Mengecoh model retrieval (mirip secara semantik tapi salah)
STRATEGIES:
- Ganti entitas penting (nama, tempat, angka)
- Ubah konteks waktu (tahun, periode)
- Bedakan aspek dari topik yang sama
- Gunakan kata kunci mirip tapi arti berbeda
OUTPUT FORMAT (JSON):
```json
{{
"hard_negatives": [
{{
"text": "[Dokumen hard negative dalam Bahasa Indonesia]",
"strategy": "[nama strategy yang digunakan]",
"reason": "[Alasan mengapa ini hard negative]"
}}
]
}}
```
CONTOH:
Query: "Kapan proklamasi kemerdekaan Indonesia?"
Positive: "Proklamasi kemerdekaan Indonesia dibacakan oleh Ir. Soekarno pada tanggal 17 Agustus 1945..."

Hard Negative 1 (Entity substitution):
"Proklamasi kemerdekaan direncanakan oleh BPUPKI pada tanggal 1 Juni 1945..."
Reason: Ada "proklamasi" dan "kemerdekaan" tapi tanggal bukan 17 Agustus

Hard Negative 2 (Topic drift):
"Peringatan kemerdekaan Indonesia diperingati setiap tanggal 17 Agustus..."
Reason: Topik sama (kemerdekaan) tapi bukan menjawab "kapan" (tanggal proklamasi)

Hard Negative 3 (Related entity):
"Mohammad Hatta adalah proklamator bersama Ir. Soekarno..."
Reason: Menyebut tokoh terkait tapi tidak menjawab pertanyaan tanggal
"""
# Domain-specific hard negative generation
DOMAIN_HARD_NEGATIVE_PROMPTS = {
"legal": """
STRATEGI UNTUK DOMAIN HUKUM:

Query: "Apa pasal pembunuhan dalam KUHP?"
Positive: "Pasal 338 KUHP mengatur tentang pembunuhan..."
Hard Negative Ideas:
- Pasal terkait tapi bukan pembunuhan (mis: penganiayaan)
- Pasal pembunuhan di undang-undang lain
- Penjelasan pasal tapi tanpa isi pasalnya
""",
"medical": """
STRATEGI UNTUK DOMAIN KESEHATAN:
Query: "Apa gejala demam berdarah?"
Positive: "Gejala demam berdarah meliputi demam tinggi, nyeri sendi..."
Hard Negative Ideas:
- Penyakit dengan gejala mirip (demam tifoid, malaria)
- Komplikasi demam berdarah
- Pengobatan demam berdarah (bukan gejala)
""",
"news": """
STRATEGI UNTUK DOMAIN BERITA:
Query: "Hasil pertandingan Indonesia vs Thailand tadi malam"
Positive: "Timnas Indonesia menang 3-0 atas Thailand dalam..."
Hard Negative Ideas:
- Pertandingan Indonesia vs Thailand di turnamen berbeda
- Klasemen grup (bukan hasil pertandingan)
- Preview sebelum pertandingan (bukan hasil)
"""
}
```
8.3 Hard Negative Evaluation¶
```python
HARD_NEGATIVE_EVALUATION_PROMPT = """
Evaluasi apakah dokumen berikut merupakan hard negative yang baik.
QUERY: "{query}"
POSITIVE: "{positive}"
CANDIDATE: "{candidate}"
Jawab dengan YA jika candidate adalah hard negative yang baik, TIDAK jika bukan.
Hard negative yang baik:
- Secara semantik mirip dengan positive
- Mengandung kata kunci dari query
- TIDAK menjawab query dengan benar
- Akan mengecoh model retrieval

Output JSON saja:
```json
{{
"is_hard_negative": true/false,
"score": 0-10 (10 = sangat baik),
"reason": "[Penjelasan dalam Bahasa Indonesia]"
}}
"""
9. Quality Validation Pipeline¶
9.1 Multi-Stage Validation Framework¶
┌─────────────────────────────────────────────────────────────────────────┐
│ QUALITY VALIDATION PIPELINE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ STAGE 1: LANGUAGE DETECTION │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ • Tool: fastText, langdetect, or polyglot │ │
│ │ • Threshold: Indonesian confidence ≥ 0.8 │ │
│ │ • Reject: Non-Indonesian or code-mixed without ID │ │
│ │ • Typical keep rate: 95-98% │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ STAGE 2: DEDUPLICATION │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ • Method: MinHash with LSH (Locality Sensitive Hashing) │ │
│ │ • Threshold: Jaccard similarity < 0.85 │ │
│ │ • N-gram size: 3-5 for Indonesian │ │
│ │ • Reject: Near-duplicates │ │
│ │ • Typical keep rate: 90-95% │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ STAGE 3: SEMANTIC SIMILARITY FILTERING │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ • Model: gte-Qwen2-7B-instruct or SEA-LION-v4-embeddings │ │
│ │ • For retrieval: cosine similarity with positive │ │
│ │ - Hard negatives: 0.5-0.8 similarity (not too low/high) │ │
│ │ - Random negatives: < 0.3 similarity │ │
│ │ • For STS: Verify LLM score with embedding similarity │ │
│ │ - Flag pairs with large discrepancy (>1.5 points) │ │
│ │ • Typical keep rate: 70-85% │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ STAGE 4: LLM-AS-JUDGE VALIDATION │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ • Model: GPT-4o or Claude 3.5 Sonnet (for quality) │ │
│ │ • Prompts: See Section 11 │ │
│ │ • Criteria: Grammar, Fluency, Meaning Preservation, NER │ │
│ │ • Threshold: ≥ 3.5/5.0 overall to PASS │ │
│ │ • Typical keep rate: 75-90% │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ STAGE 5: HUMAN SPOT-CHECK │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ • Sample: 10% of passed data, minimum 100 per dataset │ │
│ │ • Annotators: Native Indonesian speakers │ │
│ │ • Criteria: Same as LLM-as-judge + cultural appropriateness │ │
│ │ • Disagreement: Prompt re-validation │ │
│ │ • Typical keep rate: 95-99% │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ OVERALL KEEP RATE: 40-60% (from generated to final dataset) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
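Stages 1 and 2 map directly onto off-the-shelf libraries; a minimal sketch using langdetect and datasketch with the thresholds from the pipeline above:

```python
from langdetect import detect_langs          # Stage 1: language detection
from datasketch import MinHash, MinHashLSH   # Stage 2: near-duplicate detection

def is_indonesian(text: str, threshold: float = 0.8) -> bool:
    return any(l.lang == "id" and l.prob >= threshold for l in detect_langs(text))

def minhash(text: str, n: int = 3) -> MinHash:
    m = MinHash(num_perm=128)
    for i in range(len(text) - n + 1):        # character n-grams (3-5 for Indonesian)
        m.update(text[i:i + n].encode("utf8"))
    return m

lsh = MinHashLSH(threshold=0.85, num_perm=128)  # Jaccard cutoff from Stage 2

def keep_if_novel(doc_id: str, text: str) -> bool:
    m = minhash(text)
    if lsh.query(m):      # a near-duplicate is already indexed
        return False
    lsh.insert(doc_id, m)
    return True
```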
9.2 Quality Metrics by Task Type¶
| Task | Language ID | Deduplication | Semantic Filter | LLM-Judge | Human |
|---|---|---|---|---|---|
| Classification | 98% | 95% | N/A | 90% | 99% |
| Clustering | 98% | 90% | 85% (intra-cluster) | 85% | 98% |
| Reranking | 98% | 95% | 80% (pos/neg check) | 85% | 97% |
| STS | 98% | 90% | 75% (score verify) | 80% | 95% |
| Retrieval | 98% | 95% | 80% (relevance) | 85% | 97% |
| Instruction | 97% | 92% | 75% (instruction check) | 80% | 95% |
| Overall | 98% | 92% | 80% | 84% | 97% |
10. Indonesian Text Normalization¶
10.1 Preprocessing Pipeline¶
```python
import re


class IndonesianTextNormalizer:
    """
    Normalizer for Indonesian text including slang, abbreviations,
    and code-mixing handling.
    """

    def __init__(self):
        # Kamus Alay (Indonesian Slang Dictionary) - sample entries
        self.slang_dict = {
            "yg": "yang",
            "utk": "untuk",
            "dgn": "dengan",
            "tdk": "tidak",
            "jg": "juga",
            "sdh": "sudah",
            "blm": "belum",
            "krn": "karena",
            "pd": "pada",
            "dpt": "dapat",
            "sy": "saya",
            "gw": "saya",  # informal first-person pronoun
            "km": "kamu",
            "dr": "dari",
            "kek": "kayak",
            "gitu": "begitu",
            # Discourse particles map to "" and are dropped
            "sih": "",
            "deh": "",
            "dong": "",
            "lho": "",
            "kok": "",
            "lh": "lah",
            "cpt": "cepat",
            "bgt": "banget",
            "bsk": "besok",
            "mlm": "malam",
            "pgi": "pagi",
            "sii": "si",
            "yaa": "ya",
            "ka": "ke",
            "diya": "dia",
            "nya": "-nya",
            # Add more from a comprehensive Kamus Alay
        }
        # Indonesian abbreviations
        self.abbrev_dict = {
            "ttd": "tertanda",
            "dlm": "dalam",
            "thn": "tahun",
            "bln": "bulan",
            "hri": "hari",
            "jln": "jalan",
            "no": "nomor",
            "tk": "toko",
            "pt": "perseroan terbatas",
            "cv": "curriculum vitae",
            "dll": "dan lain-lain",
            "dsb": "dan sebagainya",
            "ybs": "yang bersangkutan",
            "ap": "asisten",
            "dr": "dokter",
            "ir": "insinyur",
            "drg": "dokter gigi",
            # Add more...
        }
        # Emoticon-to-text mapping
        self.emoji_dict = {
            ":)": "senyum",
            ":D": "tersenyum",
            ":(": "sedih",
            ":'(": "menangis",
            "<3": "cinta",
            # Add more...
        }

    def normalize(self, text: str) -> str:
        """Full normalization pipeline."""
        text = self._normalize_whitespace(text)
        text = self._expand_abbreviations(text)
        text = self._normalize_slang(text)
        text = self._handle_emoji(text)
        text = self._normalize_repetition(text)
        text = self._remove_special_chars(text)
        return text.strip()

    def _normalize_whitespace(self, text: str) -> str:
        """Normalize whitespace characters."""
        return re.sub(r'\s+', ' ', text)

    def _expand_abbreviations(self, text: str) -> str:
        """Expand common Indonesian abbreviations."""
        for abbr, full in self.abbrev_dict.items():
            text = re.sub(r'\b' + re.escape(abbr) + r'\b', full, text)
        return text

    def _normalize_slang(self, text: str) -> str:
        """Normalize Indonesian slang (Bahasa Alay)."""
        words = text.split()
        normalized = []
        for word in words:
            lower_word = word.lower()
            if lower_word in self.slang_dict:
                replacement = self.slang_dict[lower_word]
                if replacement:  # skip empty replacements (dropped particles)
                    normalized.append(replacement)
            else:
                normalized.append(word)
        return ' '.join(normalized)

    def _handle_emoji(self, text: str) -> str:
        """Convert emoticons to text descriptions."""
        for emoji, meaning in self.emoji_dict.items():
            text = text.replace(emoji, f" {meaning} ")
        return text

    def _normalize_repetition(self, text: str) -> str:
        """Collapse 3+ repeated characters (e.g., 'sangaaat' -> 'sangat')."""
        text = re.sub(r'(.)\1{2,}', r'\1', text)
        return text

    def _remove_special_chars(self, text: str) -> str:
        """Remove unnecessary special characters while keeping Indonesian ones."""
        # Keep Latin characters, numbers, and basic punctuation
        text = re.sub(r'[^\w\s\u0020-\u007E\u00A0-\u00FF]', '', text)
        return text

    def detect_formality(self, text: str) -> str:
        """
        Detect if text is formal (baku) or informal (gaul).
        Returns: 'formal', 'informal', or 'mixed'
        """
        informal_indicators = [
            'yg', 'utk', 'tdk', 'jg', 'sy', 'km',
            'gue', 'lu', 'lo', 'ga', 'nggak',
            'sih', 'deh', 'dong', 'lho', 'kok',
            'bang', 'non', 'bos', 'kak'
        ]
        formal_indicators = [
            'yang', 'untuk', 'tidak', 'saya', 'kamu',
            'adalah', 'merupakan', 'yaitu', 'tersebut',
            'dalam', 'pada', 'oleh', 'dengan'
        ]
        words = text.lower().split()
        informal_count = sum(1 for w in words if w in informal_indicators)
        formal_count = sum(1 for w in words if w in formal_indicators)
        if informal_count == 0 and formal_count > 0:
            return 'formal'
        elif informal_count > 0 and formal_count == 0:
            return 'informal'
        elif informal_count > formal_count:
            return 'informal'
        elif formal_count > informal_count:
            return 'formal'
        else:
            return 'mixed'


# Usage
normalizer = IndonesianTextNormalizer()
text_gaul = "Gw lagi di jalan nih, macet parah bang"
normalized = normalizer.normalize(text_gaul)
# -> "saya lagi di jalan nih, macet parah bang"
# (only words present in slang_dict are expanded; "nih," keeps its comma
#  because the simple whitespace tokenizer does not strip punctuation)
formality = normalizer.detect_formality(text_gaul)
# -> "informal"
```
10.2 Code-Mixed Text Handling¶
Indonesian text often contains code-mixing (Indonglish):
```python
def detect_code_mixing(text: str) -> dict:
    """
    Detect English-Indonesian code-mixing in text.

    Returns:
        dict: Contains ratios, is_code_mixed flag, mixed_segments
    """
    # Simple word-level language detection via a tiny English stopword list.
    # This is only illustrative: it misses content words like "productive".
    # In production, use a full lexicon or a trained model (IndoJavE, IndoRobusta).
    english_words = set([
        'the', 'of', 'and', 'to', 'in', 'is', 'you', 'that', 'it', 'he',
        'was', 'for', 'on', 'are', 'as', 'with', 'his', 'they', 'at',
        'be', 'this', 'have', 'from', 'or', 'one', 'had', 'by', 'word'
    ])
    words = text.split()
    id_words = []
    en_words = []
    mixed_segments = []
    current_lang = None
    current_segment = []
    for word in words:
        word_lower = word.lower().strip('.,!?;:')
        lang = 'en' if word_lower in english_words else 'id'
        if lang != current_lang:
            if current_segment:
                mixed_segments.append(' '.join(current_segment))
            current_segment = [word]
            current_lang = lang
        else:
            current_segment.append(word)
        if lang == 'id':
            id_words.append(word)
        else:
            en_words.append(word)
    if current_segment:
        mixed_segments.append(' '.join(current_segment))
    en_ratio = len(en_words) / len(words) if words else 0
    return {
        'id_ratio': len(id_words) / len(words) if words else 0,
        'en_ratio': en_ratio,
        'is_code_mixed': 0.2 < en_ratio < 0.8,
        'mixed_segments': mixed_segments
    }

# Example
text_mixed = "Meeting hari ini sangat productive, kita achieved semua goals yang disepakati."
result = detect_code_mixing(text_mixed)
# With a full English lexicon this yields is_code_mixed: True with alternating
# segments such as ['Meeting', 'hari ini sangat', 'productive,', ...]; the toy
# stopword list above will not flag content words like "Meeting" or "goals".
```
10.3 Normalization for Generation vs Evaluation¶
| Purpose | Normalization Level | Rationale |
|---|---|---|
| Training data generation | Light (preserve register) | Maintain natural Indonesian |
| Embedding training | Medium (standardize) | Reduce noise, improve quality |
| Evaluation | Light (preserve original) | Real-world performance |
| Clustering | Heavy (normalize all) | Group similar documents |
11. LLM-as-a-Judge Validation¶
11.1 Validation Framework¶
Based on VN-MTEB and TR-MTEB methodologies:
┌─────────────────────────────────────────────────────────────────────────┐
│ LLM-AS-A-JUDGE VALIDATION PIPELINE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ CALIBRATION PHASE (Required for reliable validation) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ • Human annotation: 100-500 samples │ │
│ │ • Prompt iteration: Align LLM judgments with humans │ │
│ │ • Target: ≥85% agreement, ≥90% precision │ │
│ │ • TR-MTEB achieved: 85.2% agreement, 92.9% precision │ │
│ │ • Iterate until calibration targets met │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ VALIDATION CRITERIA (VN-MTEB 5-criteria adapted for Indonesian) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ 1. Grammar (Tata Bahasa): │ │
│ │ - Correct Indonesian grammar and syntax │ │
│ │ - Proper verb conjugation │ │
│ │ - Correct affixation (me-, ber-, -kan, etc.) │ │
│ │ │ │
│ │ 2. NER (Named Entity Preservation): │ │
│ │ - Indonesian names preserved (Siti, Budi, Joko) │ │
│ │ - Place names preserved (Jakarta, Jogja, Surabaya) │ │
│ │ - Cultural terms preserved (gotong royong, adat) │ │
│ │ │ │
│ │ 3. Numbers/Links (Angka dan Tautan): │ │
│ │ - Numbers preserved correctly (17 Agustus 1945) │ │
│ │ - Dates preserved (tgl, thn, bulan) │ │
│ │ - URLs and links preserved │ │
│ │ │ │
│ │ 4. Fluency (Kefasihan Bahasa): │ │
│ │ - Natural, native-like phrasing │ │
│ │ - Appropriate register (formal/informal) │ │
│ │ - No awkward calques from English │ │
│ │ │ │
│ │ 5. Meaning Preservation (Pelestarian Makna): │ │
│ │ - Semantic equivalence maintained │ │
│ │ - No information loss │ │
│ │ - No information gain (hallucination) │ │
│ │ │ │
│ │ Scoring: 1-5 scale per criterion, weighted average │ │
│ │ Threshold: ≥ 3.5/5.0 overall to PASS │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ CHAIN-OF-THOUGHT PROMPTING │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ "Evaluasi teks Bahasa Indonesia berikut: │ │
│ │ │ │
│ │ [GENERATED TEXT] │ │
│ │ │ │
│ │ Original: [SOURCE TEXT] │ │
│ │ │ │
│ │ Evaluasi langkah demi langkah: │ │
│ │ 1. Periksa kebenaran tata bahasa Indonesia │ │
│ │ 2. Verifikasi named entity tetap terjaga │ │
│ │ 3. Nilai kefasihan dan kealamian bahasa │ │
│ │ 4. Bandingkan makna dengan teks asli │ │
│ │ │ │
│ │ Output JSON: │ │
│ │ { │ │
│ │ 'grammar': 1-5, │ │
│ │ 'ner': 1-5, │ │
│ │ 'numbers': 1-5, │ │
│ │ 'fluency': 1-5, │ │
│ │ 'meaning': 1-5, │ │
│ │ 'overall': 1-5, │ │
│ │ 'pass': true/false, │ │
│ │ 'reason': '[Penjelasan singkat dalam ID]' │ │
│ │ }" │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
11.2 Calibration Results (TR-MTEB)¶
| Metric | Score | Target |
|---|---|---|
| Agreement | 85.2% | ≥85% |
| Precision | 92.9% | ≥90% |
| Recall | 84.4% | ≥80% |
| F1 Score | 88.4% | ≥85% |
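Agreement against the human-annotated sample can be computed directly; a minimal sketch with scikit-learn, assuming binary pass/fail labels on the same items:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def calibration_report(human: list[bool], llm: list[bool]) -> dict:
    """Compare LLM pass/fail decisions with human annotation (100-500 samples);
    iterate on the judge prompt until the targets above are met."""
    return {
        "agreement": accuracy_score(human, llm),    # target >= 0.85
        "precision": precision_score(human, llm),   # target >= 0.90
        "recall": recall_score(human, llm),         # target >= 0.80
        "f1": f1_score(human, llm),                 # target >= 0.85
    }
```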
11.3 LLM-as-Judge Prompts for Indonesian¶
```python
LLM_AS_JUDGE_PROMPTS = {
"classification": """
Evaluasi contoh data klasifikasi Bahasa Indonesia berikut:
Kriteria evaluasi:
1. Keakurasan label: Apakah label sesuai dengan isi teks?
2. Kejelasan teks: Apakah teks jelas dan mudah dipahami?
3. Kecukupan informasi: Apakah teks memiliki cukup informasi untuk klasifikasi?
Output JSON:
{{
"label_accuracy": 1-5,
"text_clarity": 1-5,
"information_sufficiency": 1-5,
"overall": 1-5,
"pass": true/false,
"reason": "[Penjelasan dalam Bahasa Indonesia]"
}}
""",
"retrieval": """
Evaluasi pasangan query-dokumen Bahasa Indonesia berikut:
Query: "{query}"
Document: "{document}"
Label: {label} (positive/negative)
Kriteria evaluasi:
1. Relevansi: Apakah dokumen relevan dengan query?
2. Kelengkapan: Apakah dokumen cukup menjawab query?
3. Akurasi: Apakah informasi dalam dokumen akurat?
Jika label adalah "positive", dokumen HARUS:
- Langsung menjawab query
- Memberikan informasi yang dibutuhkan
- Tidak menyesatkan atau menipu
Jika label adalah "negative", dokumen seharusnya:
- Tidak menjawab query
- Topik berbeda atau informasi kurang relevan
Output JSON:
{{
"relevance": 1-5,
"completeness": 1-5,
"accuracy": 1-5,
"overall": 1-5,
"pass": true/false,
"reason": "[Penjelasan dalam Bahasa Indonesia]"
}}
""",
"sts": """
Evaluasi skor similarity untuk pasangan kalimat Bahasa Indonesia:
Kalimat 1: "{sent1}"
Kalimat 2: "{sent2}"
LLM Score: {llm_score}
Evaluasi apakah skor LLM sesuai dengan kemiripan sebenarnya.
Pertimbangkan:
1. Makna (meaning): Apakah menyampaikan informasi serupa?
2. Konteks (context): Apakah dalam konteks yang sama?
3. Entitas (entities): Apakah membahas entitas yang sama?
Output JSON:
{{
"estimated_similarity": 0-5,
"llm_score_correct": true/false,
"adjustment": -2 to +2 (jika perlu),
"reason": "[Penjelasan dalam Bahasa Indonesia]"
}}
""",
"instruction_following": """
Evaluasi pasangan instruksi-respons Bahasa Indonesia:
Instruction: "{instruction}"
Response: "{response}"
Kriteria evaluasi:
1. Kepatuhan: Apakah respons mengikuti instruksi?
2. Kelengkapan: Apakah respons lengkap sesuai permintaan?
3. Akurasi: Apakah informasi dalam respons akurat?
4. Kejelasan: Apakah respons jelas dan mudah dipahami?
Output JSON:
{{
"instruction_following": 1-5,
"completeness": 1-5,
"accuracy": 1-5,
"clarity": 1-5,
"overall": 1-5,
"pass": true/false,
"reason": "[Penjelasan dalam Bahasa Indonesia]"
}}
"""
}
```
11.4 Recommended Judge Models for Indonesian¶
| Model | Parameters | Recommendation | Cost | Best For |
|---|---|---|---|---|
| Claude 3.5 Sonnet | - | ★★★★★ Best | $3 / $15 per 1M | Complex evaluation |
| GPT-4o | - | ★★★★★ Excellent | $5 / $15 per 1M | Quality critical |
| Command R+ | 104B | ★★★★☆ Very Good | $1 / $2 per 1M | Cost-efficient |
| Aya-23-35B | 35B | ★★★★☆ Good | Self-hosted | Indonesian-specialized |
| SEA-LION-v4 | 8B | ★★★☆☆ Fair | Self-hosted | Budget option |
| Qwen2.5-7B | 7B | ★★★☆☆ Fair | Self-hosted | Local evaluation |
Recommendation: Use Claude 3.5 Sonnet for calibration and final validation, Command R+ for large-scale filtering.
12. Indonesian-Specific Considerations¶
12.1 Linguistic Challenges¶
| Challenge | Description | Example | Mitigation |
|---|---|---|---|
| Formal vs Informal Register | Indonesian has formal (baku) and informal (gaul) variants | "Saya tidak setuju" vs "Gue nggak setuju" | Explicit register specification in prompts |
| Code-Mixing | English-Indonesian mixing common in urban areas | "Meeting ini very productive banget" | Include code-mixed examples or filter out |
| Reduplication | Common grammatical feature | "kata-kata", "orang-orang" | Ensure natural patterns in generation |
| Affixation | Complex prefix/suffix system | "me-lestari-kan", "ber-karya" | Morphology-aware prompting |
| Regional Influence | Javanese/Sundanese influence on colloquial Indonesian | "Wis mbok" (Javanese-influenced) | Specify standard Indonesian or include variations |
| Informal Abbreviations | Common abbreviations in informal text | "yg", "utk", "tdk" | Normalize (see the sketch below) or preserve based on use case |
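Where normalization is the right call for the final row above, a small lookup table goes a long way. The sketch below is a minimal normalizer; the abbreviation mapping covers only a handful of common forms and should be extended from corpus frequency lists before use at scale.

```python
# Lookup-based normalizer for informal Indonesian abbreviations.
# The mapping is a small illustrative subset.
import re

ABBREVIATIONS = {
    "yg": "yang", "utk": "untuk", "tdk": "tidak",
    "dgn": "dengan", "krn": "karena", "sdh": "sudah",
}

def normalize_abbreviations(text: str) -> str:
    """Expand known abbreviations token-by-token (case-insensitive lookup)."""
    def expand(match: re.Match) -> str:
        word = match.group(0)
        return ABBREVIATIONS.get(word.lower(), word)
    return re.sub(r"\b\w+\b", expand, text)

print(normalize_abbreviations("Yg utk dilaksanakan secepatnya"))
# -> "yang untuk dilaksanakan secepatnya"
```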
12.2 Cultural Considerations¶
| Aspect | Consideration | Implementation |
|---|---|---|
| Local Context | Indonesian cultural references | Use Indonesian topics in generation |
| Religious Sensitivity | Muslim-majority country | Respectful content guidelines, avoid sensitive topics |
| Geographic Diversity | 700+ ethnic groups across islands | Include topics from Sumatra, Java, Kalimantan, Sulawesi, Papua, etc. |
| Current Events | Local news and trends important | Include timely topics in training data |
| Cultural Concepts | Unique Indonesian concepts | Preserve terms like "gotong royong", "adat", "Pancasila" |
12.3 Domain-Specific Indonesian Corpora¶
| Domain | Sources | Size/Availability | Use Case |
|---|---|---|---|
| News | detik.com, kompas.com, tempo.co, CNN Indonesia | High (web scraping) | Clustering, STS, Classification |
| E-commerce | Tokopedia, Shopee, Bukalapak | Medium (datasets exist) | Retrieval, Classification |
| Legal | JDIH, peraturan.go.id | Medium (official) | Reranking (legal domain) |
| Medical | Alodokter, Halodoc articles | Medium (public) | Reranking (medical domain) |
| Government | indonesia.go.id | Medium (official) | Classification |
| Social Media | Twitter/X, Instagram | High (API access) | Informal register, code-mixing |
| Encyclopedia | Wikipedia Indonesia | High (dump available) | General knowledge, STS |
| Literature | Indonesian short stories, poems | Medium (public domain) | STS, summarization |
12.4 Existing Indonesian Datasets¶
Text Classification¶
- IndoNLU: 12 tasks including sentiment, aspect, NER
- CLICK-ID: 15,000 clickbait headlines from 12 publishers
- Indonesian Hoax News: 600 documents (372 valid, 228 fake)
Natural Language Inference¶
- IndoNLI: 18K sentence pairs (entailment, contradiction, neutral)
- SNLI Indo: Translated SNLI dataset for Indonesian
Semantic Textual Similarity¶
- IndoSTS: Translated STS-B for Indonesian
- SICK-R Indo: Translated SICK-R dataset
Question Answering¶
- TyDi QA: Indonesian subset of TyDi QA
- XQuAD: Indonesian subset (from Wikipedia)
Summarization¶
- IndoSum: ~19K news article-summary pairs
Parallel / Regional Languages¶
- NusaX: 10 Indonesian local languages, parallel with Indonesian + English
- SEACrowd: 36 SEA indigenous languages
13. Failure Mode Analysis¶
13.1 Common LLM Generation Errors for Indonesian¶
┌─────────────────────────────────────────────────────────────────────────┐
│ COMMON GENERATION ERRORS & MITIGATION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ERROR TYPE 1: OVER-FORMALIZATION │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Description: LLM tends to generate overly formal Indonesian │ │
│ │ Example Input: "Gue lagi lapar nih" │ │
│ │ Generated: "Saya merasa lapar saat ini" │ │
│ │ Impact: Loss of register diversity │ │
│ │ Mitigation: Specify register in prompt, add few-shot examples │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ERROR TYPE 2: CODE-MIXING REMOVAL │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Description: LLM removes English words from code-mixed text │ │
│ │ Example Input: "Meeting ini productive banget" │ │
│ │ Generated: "Pertemuan ini sangat produktif" │ │
│ │ Impact: Loss of authentic Indonesian social media patterns │ │
│ │ Mitigation: Explicitly preserve English words in prompts │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ERROR TYPE 3: REDUPLICATION LOSS │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Description: LLM simplifies reduplicated words │ │
│ │ Example Input: "Orang-orang itu sedang berdiskusi" │ │
│ │ Generated: "Orang itu sedang berdiskusi" │ │
│ │ Impact: Loss of grammatical nuance │ │
│ │ Mitigation: Few-shot examples with reduplication │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ERROR TYPE 4: CULTURAL TERM ERASURE │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Description: LLM translates/removes Indonesian cultural terms │ │
│ │ Example Input: "Gotong royong adalah budaya Indonesia" │ │
│ │ Generated: "Kerja sama adalah budaya Indonesia" │ │
│ │ Impact: Loss of cultural specificity │ │
│ │ Mitigation: Add cultural terms to protected entities list │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ERROR TYPE 5: HALLUCINATED REGIONAL VARIANTS │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Description: LLM generates fake regional language words │ │
│ │ Example: Nonexistent Javanese or Sundanese vocabulary │ │
│ │ Impact: Low-quality training data │ │
│ │ Mitigation: Validate against NusaX dataset or native speakers │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ERROR TYPE 6: INCONSISTENT ABBREVIATION USAGE │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Description: LLM misuses informal abbreviations │ │
│ │ Example: "Yg utk dilaksanakan secepatnya" │ │
│ │ Issue: Mixed formal structure with informal abbreviations │ │
│ │ Impact: Unnatural text │ │
│ │ Mitigation: Register consistency checks in validation │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
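Error Type 4's mitigation can be enforced mechanically: reject any generation that drops a protected term present in its source. The term list below is an illustrative starting point and should be curated with native speakers rather than taken as-is.

```python
# Protected-term check for Error Type 4 (cultural term erasure).
# The term list is an illustrative starting point, not exhaustive.
PROTECTED_TERMS = {
    "gotong royong", "adat", "pancasila",
    "musyawarah", "batik", "wayang", "mudik",
}

def cultural_terms_preserved(source: str, generated: str) -> bool:
    """Reject a generation that drops a protected term present in the source."""
    src, gen = source.lower(), generated.lower()
    return all(term in gen for term in PROTECTED_TERMS if term in src)

# The failure case from the box above is caught:
assert not cultural_terms_preserved(
    "Gotong royong adalah budaya Indonesia",
    "Kerja sama adalah budaya Indonesia",
)
```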
13.2 Model-Specific Failure Patterns¶
| Model | Common Issues | Mitigation |
|---|---|---|
| GPT-4o | Over-formalization, cultural term erasure | Explicit cultural context in prompts |
| Claude 3.5 | Good with culture, sometimes overly literal | Few-shot examples with nuance |
| Command R+ | Generally good, occasional code-mixing issues | Specify code-mixing handling |
| SEA-LION-v4 | Good Indonesian, struggles with slang | Provide informal few-shot examples when slang is needed |
| SahabatAI | Best for informal, sometimes misses formal | Register specification required |
| Qwen2.5 | Good multilingual, less Indonesia-specific | Add Indonesia context |
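Several of the failure patterns above trace back to unspecified register, so making it explicit in the prompt is a cheap mitigation. The template below is a sketch; the Indonesian wording and few-shot examples are illustrative choices, not a validated prompt.

```python
# Register-explicit generation template. Wording and few-shot
# examples are illustrative, not a validated prompt.
REGISTER_EXAMPLES = {
    "formal": 'Contoh: "Saya tidak setuju dengan keputusan tersebut."',
    "informal": 'Contoh: "Gue nggak setuju sama keputusan itu."',
}

def build_generation_prompt(topic: str, register: str) -> str:
    """Pin the register so the model cannot default to over-formal output."""
    return (
        f"Tulis satu paragraf Bahasa Indonesia tentang {topic}.\n"
        f"Gunakan register {register} secara konsisten.\n"
        "Pertahankan kata bahasa Inggris yang lazim dipakai (code-mixing).\n"
        f"{REGISTER_EXAMPLES[register]}"
    )

print(build_generation_prompt("belanja online", "informal"))
```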
14. Implementation Roadmap¶
14.1 Generation Strategy by Task¶
| Task | Target Count | Generation Method | Primary Validation |
|---|---|---|---|
| Clustering | 10 datasets (50K docs) | Document clustering + LLM labeling | Intra-cluster similarity |
| Reranking | 10 datasets (5K queries) | Query + candidates (hard negatives) | LLM-as-judge ranking |
| STS | 10 datasets (15K pairs) | Paraphrase + thematic variation | Semantic similarity model |
| Classification | 5 datasets (25K samples) | Topic-based text generation | Label accuracy check |
| Pair Classification | 5 datasets (20K pairs) | NLI generation (IndoNLI style) | Logical consistency |
| Retrieval | 5 datasets (10K pairs) | Query-document (filling gaps) | Relevance scoring |
| Summarization | 2 datasets (5K pairs) | Article + summary generation | ROUGE + LLM-as-judge |
| Instruction Following | 5 datasets (50K pairs) | Instruction-response generation | Instruction adherence |
14.2 Resource Estimation¶
| Phase | Activity | Duration | Cost |
|---|---|---|---|
| 1. Preparation | Data collection, prompt design | 1 week | $100-200 |
| 2. Seed Generation | 5-10K samples via GPT-4o/Claude | 3-5 days | $30-50 |
| 3. Large-Scale Generation | 50-100K samples via Command R+ | 1-2 weeks | $50-100 |
| 4. Validation | LLM-as-judge + semantic filtering | 1 week | $30-50 |
| 5. Human Review | 500-1000 samples annotation | 1-2 weeks | $1,000-1,500 |
| 6. Integration | Format conversion, metadata | 3-5 days | $50-100 |
| Total | — | 4-6 weeks | $1,260-2,000 |
14.3 Quality Targets¶
| Metric | Target | Rationale |
|---|---|---|
| LLM-as-Judge Pass Rate | ≥80% | Slightly higher than VN-MTEB baseline |
| Semantic Similarity (retrieval) | ≥0.75 for positive | Standard threshold |
| Semantic Similarity (hard negative) | 0.5-0.8 | Not too high, not too low |
| Human Agreement | ≥85% | TR-MTEB calibration target |
| Deduplication Rate | <5% after filtering | MinHash-based filtering |
| Format Compliance | 100% | MTEB schema requirement |
| Indonesian Language ID | ≥95% | Language detection confidence |
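These thresholds can be wired into a single programmatic filter. The sketch below assumes fastText's `lid.176.bin` language-ID model and a multilingual MiniLM encoder as stand-ins; both model choices are assumptions, while the numeric thresholds come directly from the table.

```python
# Programmatic filter implementing the quality targets above.
# fastText lid.176.bin and MiniLM are assumed stand-in models.
import fasttext
from sentence_transformers import SentenceTransformer, util

lid = fasttext.load_model("lid.176.bin")
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def is_indonesian(text: str, min_conf: float = 0.95) -> bool:
    labels, probs = lid.predict(text.replace("\n", " "))  # fastText rejects newlines
    return labels[0] == "__label__id" and probs[0] >= min_conf

def similarity(a: str, b: str) -> float:
    emb = encoder.encode([a, b], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

def keep_pair(query: str, document: str, label: str) -> bool:
    if not (is_indonesian(query) and is_indonesian(document)):
        return False
    sim = similarity(query, document)
    if label == "positive":
        return sim >= 0.75          # positive-pair threshold
    return 0.5 <= sim <= 0.8        # hard-negative band
```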
14.4 Timeline¶
Week 1-2: Preparation & Seed Data
├─ Collect Indonesian corpora
├─ Design prompts for each task type
├─ Generate seed data (5K samples)
└─ Set up validation pipeline
Week 3-4: Large-Scale Generation
├─ Generate 50-100K samples per task
├─ Real-time quality monitoring
├─ Adjust prompts based on quality metrics
└─ Filter and deduplicate
Week 5: Validation & Human Review
├─ LLM-as-judge validation
├─ Human annotation (500-1000 samples)
├─ Calibrate LLM-as-judge
└─ Final filtering
Week 6: Integration & Documentation
├─ Format conversion to MTEB schema
├─ Metadata documentation
├─ HuggingFace upload
└─ Baseline model evaluation
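The "filter and deduplicate" step in Week 3-4 maps to the MinHash criterion in Section 14.3. A compact near-duplicate filter, assuming the `datasketch` library and a Jaccard threshold of 0.8, could look like this:

```python
# MinHash near-duplicate filter for the Week 3-4 dedup step.
# datasketch is an assumed dependency; threshold 0.8 approximates
# the Jaccard similarity at which two samples count as duplicates.
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode("utf8"))
    return m

def deduplicate(samples: list[str], threshold: float = 0.8) -> list[str]:
    """Keep the first occurrence of each near-duplicate cluster."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for i, text in enumerate(samples):
        m = minhash_of(text)
        if not lsh.query(m):        # no near-duplicate seen so far
            lsh.insert(f"s{i}", m)
            kept.append(text)
    return kept
```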
15. Case Studies from Regional MTEBs¶
15.1 VN-MTEB (Vietnamese)¶
Methodology: Translation-first approach
- Translated 41 datasets from English using a translation pipeline
- LLM-as-judge validation with 5 criteria
- 65-72% kept ratio after validation
- Focus on quality over quantity
Key Insights:
- Translation requires careful post-processing
- Cultural adaptation needed for idioms
- LLM-as-judge calibration essential
15.2 TR-MTEB (Turkish)¶
Methodology: Hybrid synthetic + human data
- 34.2M training pairs generated
- 11 new datasets created
- 85.2% human agreement achieved
- 6 core tasks covered
Key Insights:
- Self-instruct effective for Turkish
- Domain-specific datasets (legal, medical) valuable
- Calibration critical for LLM-as-judge
15.3 AfriMTEB (African Languages)¶
Methodology: Multicultural synthetic data
- 59 languages, 14 tasks, 38 datasets
- 6 new synthetic datasets created
- Cultural context preservation critical
- Focus on low-resource languages
Key Insights:
- Cultural knowledge important for generation
- Native speaker validation essential
- Regional variations need attention
15.4 SEA-BED (Southeast Asia)¶
Methodology: Regional collaboration
- 169 datasets across 10 SEA languages
- 71% human-labeled
- Multilingual approach
- Focus on SEA-specific tasks
Key Insights:
- Regional collaboration improves quality
- Shared resources reduce cost
- Cultural context is often similar across borders
15.5 ArabicMTEB (Arabic)¶
Methodology: Domain-specific synthetic data
- Command R+ for generation
- 40% synthetic data in training
- Dialectal variation (Egyptian, Moroccan)
- +16 points performance gain
Key Insights:
- Dialectal generation requires specific prompts
- Domain-specific data valuable
- Hard negative mining essential
16. Key Takeaways¶
16.1 Methodology Recommendations¶
| Priority | Recommendation | Rationale / Source |
|---|---|---|
| 1 | Use Command R+ or Command-light for generation | Cost-effective, quality output |
| 2 | Implement SPEED framework for scale | 10× cost reduction vs GPT-4 |
| 3 | LLM-as-judge with calibration | TR-MTEB: 88.4% F1 |
| 4 | Topic-based generation from Indonesian categories | SPEED finding |
| 5 | Domain-specific datasets (legal, medical, finance) | ArabicMTEB approach |
| 6 | Hard negative mining for retrieval/reranking | Core to embedding quality |
| 7 | Indonesian-specialized models (SEA-LION, SahabatAI) | Better Indonesian understanding |
| 8 | Register specification in prompts | Avoid over-formalization |
| 9 | Cultural term preservation | Maintain authenticity |
| 10 | Multi-stage validation (5 stages) | Quality assurance |
16.2 Critical Success Factors¶
- Calibration: Always calibrate LLM-as-judge with human labels (100-500 samples)
- Diversity: Use topic-based prompts to avoid mode collapse
- Validation: Multi-stage quality control (language → dedup → semantic → LLM judge → human)
- Indonesian Context: Localized prompts, cultural awareness, register specification
- Iterative Refinement: Start small, validate, then scale
- Cost Management: Use efficient models (Command-light, SPEED-aligned) for large scale
- Quality Over Quantity: Better to have 10K high-quality samples than 100K low-quality
- Native Speaker Review: Essential for cultural and linguistic nuances
16.3 Novelty Opportunities for Indonesia-MTEB¶
Based on comprehensive research, Indonesia-MTEB can introduce:
- Archipelago-Aware Generation: Regional variation in Indonesian (Javanese-influenced, Sundanese-influenced, Papuan-influenced)
- Formal Register Continuum: Explicit datasets across formal-informal spectrum (baku → gaul → alay)
- Code-Mixing Evaluation: Indonesian-English code-mixed data (realistic social media, Indonglish)
- Domain-Specific Forks: Legal Indonesian, Medical Indonesian, Financial Indonesian
- Cultural Knowledge: Indonesian-specific cultural queries from Wikipedia Indonesia
- Regional Language Integration: NusaX-style parallel data (10 regional languages + Indonesian)
- Real-Time Data: Dynamic dataset updates from current Indonesian news and trends
- Multi-Modal Embeddings: Image-text pairs for Indonesian e-commerce, tourism, food
17. References¶
Synthetic Data Frameworks¶
- SPEED: Chen et al. (2024). "Little Giants: Synthesizing High-Quality Embedding Data at Scale." arXiv:2410.18634.
- Self-Instruct: Wang et al. (2023). "Self-Instruct: Aligning Language Models with Self-Generated Instructions." ACL 2023.
- LLM-Driven Synthetic Data: Long et al. (2024). "On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation." ACL Findings 2024.
Regional MTEB Synthetic Data¶
- ArabicMTEB: Bhatia et al. (2025). "Swan and ArabicMTEB: Dialect-Aware, Arabic-Centric, Cross-Lingual, and Cross-Cultural Embedding Models and Benchmarks." NAACL 2025.
- TR-MTEB: Baysan & Güngör (2025). "TR-MTEB: A Comprehensive Benchmark and Embedding Model Suite for Turkish Sentence Representations." EMNLP 2025 Findings.
- VN-MTEB: Pham et al. (2025). "VN-MTEB: Vietnamese Massive Text Embedding Benchmark." arXiv:2507.21500.
- AfriMTEB: Uemura et al. (2025). "AfriMTEB and AfriE5: Benchmarking and Adapting Text Embeddings for African Languages." arXiv:2510.23896.
- SEA-BED: Ponwitayarat et al. (2025). "SEA-BED: Southeast Asia Embedding Benchmark." arXiv:2508.12243.
Indonesian LLM Models¶
- SEA-LION: Ng et al. (2025). "SEA-LION: Southeast Asian Languages in One Network." IJCNLP 2025.
- Cendol: Cahyawijaya et al. (2024). "Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian and Local Languages." arXiv:2404.06138.
- SahabatAI: GoTo & CSA Lab (2025). "SahabatAI: Indonesian-Centric Large Language Models."
- NusaCrowd: Cahyawijaya et al. (2023). "NusaCrowd: Open Source Initiative for Indonesian NLP Resources." ACL Findings 2023.
- SEACrowd: Lovenia et al. (2024). "SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages." EMNLP 2024.
Indonesian Datasets¶
- IndoNLI: Mahendra et al. (2021). "IndoNLI: A Natural Language Inference Dataset for Indonesian." EMNLP 2021.
- SNLI Indo: Putra et al. (2024). "SNLI Indo: A Recognizing Textual Entailment Dataset in Indonesian." Journal of Physics: Conference Series.
- CLICK-ID: William et al. (2020). "CLICK-ID: A Novel Dataset for Indonesian Clickbait Headlines." Data in Brief.
- IndoSum: Kurniawan & Louvan (2018). "IndoSum: A New Benchmark Dataset for Indonesian Text Summarization." IALP 2018.
LLM-as-a-Judge¶
- LLM-as-Judge Survey: arXiv:2411.15594 (2024).
- Chain-of-Thought for LLM-as-Judge: Arize AI (2025). "Evidence-Based Prompting Strategies for LLM-as-a-Judge."
Tools and Resources¶
- HuggingFace Synthetic Data Generator: huggingface.co/blog/synthetic-data-generator
- SPEED GitHub: github.com/haon-chen/SPEED
- IndoNLP: github.com/IndoNLP
18. Next Steps (Document Roadmap)¶
| Document | Content | Status |
|---|---|---|
| 01 | Project Overview | ✅ Complete |
| 02 | MTEB Structure Analysis | ✅ Complete |
| 03 | Existing Indonesian Datasets | ✅ Complete |
| 04 | Regional MTEB Methodologies | ✅ Complete |
| 05 | Translation Models Benchmark | ✅ Complete (Enhanced v2.0) |
| 06 | AI Dataset Generation Methods | ✅ Complete (Enhanced v2.0) |
| 07 | Validation Strategies | Pending |
| 08 | ACL Dataset Paper Standards | Pending |
| 09 | Novelty Angle & Publication | Pending |
| 10 | Implementation Roadmap | Pending |
Appendix A: Quick Reference¶
Recommended Models for Each Task¶
| Task | Best Generator | Best Validator | Cost Efficiency |
|---|---|---|---|
| Seed Data | GPT-4o / Claude 3.5 | Same | Low priority, quality first |
| Large-Scale | Command-light | Claude 3.5 | ★★★★★ |
| Indonesian-Specific | SEA-LION-v4 / SahabatAI | GPT-4o | ★★★★☆ |
| Cost-Optimized | SPEED-aligned 8B | Command R+ | ★★★★★ |
Cost Calculator¶
For 10,000 samples generation:
- Command R+: ~$6-7
- GPT-4o: ~$30-35
- Claude 3.5: ~$22-27
- SPEED-aligned 8B (self-hosted): ~$2-3
For 10,000 samples validation:
- Claude 3.5: ~$15-20
- Command R+: ~$4-5
Savings with Command R+: 70-80% vs GPT-4o/Claude
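The figures above follow from per-token prices; the helper below makes the arithmetic explicit. The per-sample token counts are rough assumptions chosen only to roughly reproduce the ranges listed.

```python
# Back-of-envelope helper behind the figures above. Prices are $ per 1M
# tokens; the per-sample token counts are rough assumptions.
PRICES = {  # model: (input price, output price) in $ per 1M tokens
    "command-r-plus": (1.0, 2.0),
    "gpt-4o": (5.0, 15.0),
    "claude-3.5-sonnet": (3.0, 15.0),
}

def estimate_cost(model: str, n_samples: int,
                  in_tokens: int = 300, out_tokens: int = 125) -> float:
    p_in, p_out = PRICES[model]
    return n_samples * (in_tokens * p_in + out_tokens * p_out) / 1_000_000

for model in PRICES:
    print(f"{model}: ${estimate_cost(model, 10_000):.2f} per 10K samples")
```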
"Synthetic data generation, when properly validated through LLM-as-judge and calibrated with human annotations, can fill critical dataset gaps while maintaining quality standards comparable to human-curated data. For Indonesia-MTEB, this approach enables rapid development of clustering, reranking, and STS datasets that are otherwise unavailable."
This document is a living record. Updated as research progresses.