Project: Indonesia-MTEB Benchmark
Document: 06 - AI Dataset Generation Methods
Last Updated: 2026-01-25
Version: 2.0 (Enhanced)
Status: Research Phase
AI Dataset Generation Methods for Indonesia-MTEB¶
"Synthetic data generation is the key to filling critical gaps in Indonesia-MTEB—especially for Clustering, Reranking, and STS tasks where existing Indonesian datasets are scarce. This document provides a comprehensive guide to generating high-quality Indonesian embedding datasets at scale."
Table of Contents¶
- Executive Summary
- Synthetic Data Landscape
- Model Selection for Indonesian Generation
- Generation Frameworks
- Cost Estimation and Budgeting
- Task-Specific Generation Methods
- Prompt Engineering with Indonesian Examples
- Hard Negative Generation
- Quality Validation Pipeline
- Indonesian Text Normalization
- LLM-as-a-Judge Validation
- Indonesian-Specific Considerations
- Failure Mode Analysis
- Implementation Roadmap
- Case Studies from Regional MTEBs
- Key Takeaways
- References
1. Executive Summary¶
1.1 The Synthetic Data Opportunity¶
Regional MTEBs have successfully used LLM-generated synthetic data to fill dataset gaps:
| Benchmark | Synthetic Data Usage | Impact | Key Insight |
|---|---|---|---|
| ArabicMTEB | 40% of training data | +16 points (Swan-Small) | Synthetic data significantly boosts performance |
| SPEED | 920K embedding pairs | Outperforms E5-mistral with 1/10 GPT calls | Small models can generate high-quality data |
| VN-MTEB | Translation + validation | 65-72% kept ratio | LLM-as-judge critical for quality control |
| TR-MTEB | 34.2M training pairs | Competitive SOTA results | Synthetic + human data hybrid approach |
| AfriMTEB | 6 new synthetic datasets | 59 languages, 14 tasks | Multicultural synthetic data generation |
| SEA-BED | 169 datasets (71% human) | 10 SEA languages | Regional adaptation is critical |
1.2 Key Findings¶
- SPEED Framework (Chen et al., 2024) enables small 8B models to generate embedding data that outperforms GPT-4-only approaches with <1/10 API calls
- Indonesian-optimized models (SEA-LION-v4, SahabatAI, Cendol) show promising generation capabilities
- Three-stage quality control (language detection → semantic similarity → LLM-as-judge) is essential
- Scaling law: Log-linear relationship between synthetic data size and embedding model performance
- Cost efficiency: Command R+ at $1-2/1M tokens is 3-15× cheaper than GPT-4o/Claude for generation
- Task-specific prompting with Indonesian examples significantly improves quality
1.3 Indonesia-MTEB Dataset Gaps¶
| MTEB Task | Existing Indonesian Datasets | Gap | Synthetic Priority |
|---|---|---|---|
| Clustering | 0 | Complete absence | CRITICAL |
| Reranking | 0 | Complete absence | CRITICAL |
| STS | 3 (limited) | Insufficient coverage | HIGH |
| Retrieval | 2 | Domain gaps | MEDIUM |
| Pair Classification | 2 (IndoNLI, SNLI-Indo) | Limited domains | MEDIUM |
| Classification | 8 | Domain imbalance | LOW |
| Instruction Following | 0 | Complete absence | HIGH |
| Summarization | 1 (IndoSum) | Single source | MEDIUM |
2. Synthetic Data Landscape¶
2.1 State of Synthetic Data in NLP (2024-2025)¶
┌─────────────────────────────────────────────────────────────────────────┐
│ SYNTHETIC DATA GENERATION LANDSCAPE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ APPROACHES │
│ ├─ Pure LLM Generation (GPT-4, Claude, Command R+) │
│ ├─ Small Model Alignment (SPEED: 8B → GPT-4 quality) │
│ ├─ Self-Instruct (Bootstrap from seed examples) │
│ ├─ Hybrid (Synthetic + Human Curation) │
│ └─ Translation-Based (MT → Target Language) │
│ │
│ APPLICATIONS │
│ ├─ Text Embeddings (classification, STS, retrieval) │
│ ├─ Question Answering │
│ ├─ Instruction Tuning │
│ ├─ Code Generation │
│ └─ Multimodal (vision-language) │
│ │
│ QUALITY VALIDATION │
│ ├─ LLM-as-Judge (85.2% human agreement with calibration) │
│ ├─ Semantic Similarity (threshold-based filtering) │
│ ├─ Statistical Validation (word length, distribution) │
│ ├─ Deduplication (MinHash, SimHash) │
│ └─ Human Spot-Check (10% sample recommended) │
│ │
│ CHALLENGES │
│ ├─ Hallucination detection │
│ ├─ Mode collapse (repetitive outputs) │
│ ├─ Cultural bias │
│ ├─ Language register inconsistency │
│ └─ Quality-cost tradeoff │
│ │
└─────────────────────────────────────────────────────────────────────────┘
2.2 Synthetic Data on HuggingFace¶
As of 2024, 300+ datasets on HuggingFace are tagged as "synthetic", and mainstream LLMs increasingly train on synthetic data. Key dataset hubs for Indonesian and regional languages:
- NusaCrowd: 121+ datasets for Indonesian and regional languages
- SEACrowd: 36 SEA indigenous languages, 13 tasks
- IndoNLP: Centralized Indonesian NLP resources
2.3 Cost-Benefit Analysis¶
| Method | Quality | Cost (USD/1M tokens) | Speed | Recommendation for Indonesian |
|---|---|---|---|---|
| GPT-4o | ★★★★★ | $5.00 input / $15.00 output | Slow | For seed data only |
| Claude 3.5 Sonnet | ★★★★★ | $3.00 input / $15.00 output | Medium | For complex generation |
| Command R+ | ★★★★★ | $1.00 input / $2.00 output | Fast | Recommended for quality |
| Command-light | ★★★★☆ | $0.30 input / $0.60 output | Fast | Best value for scale |
| Aya-23-35B | ★★★★☆ | Self-hosted | Fast | Alternative (SEA focus) |
| SPEED-aligned 8B | ★★★★☆ | $0.10-0.20 (API equivalent) | Fast | Recommended for scale |
| SEA-LION-v4 | ★★★☆☆ | Self-hosted | Fast | For Indonesian-specific |
| Qwen2.5-7B | ★★★★☆ | Self-hosted | Fast | Multilingual capable |
3. Model Selection for Indonesian Generation¶
3.1 Indonesian LLM Landscape (2025)¶
┌─────────────────────────────────────────────────────────────────────────┐
│ INDONESIAN LLM MODEL COMPARISON │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ CLOSED-SOURCE API MODELS │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Model │ Params │ Input │ Output │ ID Support │ Use │ │
│ ├─────────────────────────────────────────────────────────────────┤ │
│ │ Command R+ │ 104B │ $1.00 │ $2.00 │ ★★★★★ │ Best │ │
│ │ Command-light │ ~ │ $0.30 │ $0.60 │ ★★★★☆ │ Value │ │
│ │ Aya-23-35B │ 35B │ TBD │ TBD │ ★★★★☆ │ Multil │ │
│ │ GPT-4o │ - │ $5.00 │ $15.00 │ ★★★★★ │ Seed │ │
│ │ Claude 3.5 │ - │ $3.00 │ $15.00 │ ★★★★★ │ Complex│ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ OPEN-SOURCE MODELS (Self-Hosted) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Model │ Params │ VRAM │ ID Support │ Use │ │
│ ├─────────────────────────────────────────────────────────────────┤ │
│ │ SEA-LION-v4 │ 8B │ 16GB │ ★★★★★ │ ID-specialized│ │
│ │ SahabatAI-v1 │ 9B │ 16GB │ ★★★★★ │ ID + dialects │ │
│ │ Cendol │ 7B │ 14GB │ ★★★★☆ │ ID tasks │ │
│ │ Qwen2.5-7B │ 7B │ 14GB │ ★★★★☆ │ Multilingual │ │
│ │ LLaMA-3.1-8B │ 8B │ 16GB │ ★★★☆☆ │ Fine-tune │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
3.2 Model Selection by Use Case¶
| Use Case | Recommended Model | Rationale |
|---|---|---|
| Large-scale generation | SPEED-aligned Qwen2.5-8B | 10× cost savings, good quality |
| High-quality seed data | Command R+ | Best Indonesian generation, reasonable cost |
| Domain-specific (legal/medical) | SEA-LION-v4 fine-tuned | Indonesian context understanding |
| Code-mixed data | SahabatAI-v1 | Trained on ID-Javanese-Sundanese-English |
| Regional languages | NusaX-based models | 10 Indonesian regional languages |
| Instruction following | Aya-23-35B | Strong instruction following in 23 languages |
3.3 SEA-LION-v4 Analysis¶
SEA-LION-v4 (AI Singapore) is the most Indonesian-optimized model:
- Training Data: 35% Indonesian sources (Wikipedia ID, news, social media)
- Languages: 11 SEA languages (Indonesian, Malay, Vietnamese, Thai, Burmese, Lao, Filipino, Tamil, Khmer, Javanese, Sundanese)
- Performance: State-of-the-art on SEA-HELM benchmark
- Tokenization: 1.2 tokens/word for Indonesian (best in class)
- VRAM: 16GB (BF16) / 5GB (INT4)
3.4 SahabatAI-v1 Analysis¶
SahabatAI-v1 (GoTo/CSA Lab) is Indonesian-fine-tuned:
- Base: Gemma2-9B
- Languages: Indonesian, Javanese, Sundanese with code-mixing support
- Training: Continued pre-training on 20B Indonesian tokens
- Use Case: Best for informal/formal Indonesian generation
- Cost: Self-hosted, requires 16GB VRAM
3.5 Cendol Model Analysis¶
Cendol (IndoLLM) family includes:
- Cendol-7B: Indonesian-optimized instruction model
- Languages: Indonesian + 5 regional languages (Javanese, Sundanese, Balinese, Minangkabau, Buginese)
- Evaluation: 15 datasets including cultural reasoning
- Use Case: Culturally-aware generation
4. Generation Frameworks¶
4.1 SPEED Framework¶
SPEED (Synthesizing High-Quality Embedding Data at Scale) aligns small 8B models to generate embedding data, achieving better performance than GPT-4-only approaches with 1/10 the API calls.
- Paper: Chen et al. (2024). "Little Giants: Synthesizing High-Quality Embedding Data at Scale."
- arXiv: 2410.18634
- Code: github.com/haon-chen/SPEED
4.2 SPEED Architecture¶
┌─────────────────────────────────────────────────────────────────────────┐
│ SPEED FRAMEWORK │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ STAGE 1: TASK BRAINSTORMING │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ • GPT-4 generates diverse task descriptions │ │
│ │ • Topics sampled from Open Directory Project (ODP) │ │
│ │ • For Indonesian: Use ID-specific topics (see Section 7.6) │ │
│ │ • Output: Task pool T │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ STAGE 2: JUNIOR GENERATOR (SFT) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ • GPT-4 generates small seed dataset D_seed (5K-10K samples) │ │
│ │ • SFT on small model (Qwen2.5-8B or LLaMA-3-8B) → π_Jr │ │
│ │ • Objective: Standard supervised loss on (prompt, task, data) │ │
│ │ • Temperature: 0.8-1.0 (diversity) │ │
│ │ • Output: Basic data synthesis capability │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ STAGE 3: SENIOR GENERATOR (DPO) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ • π_Jr generates root data D_root (50K-100K samples) │ │
│ │ • GPT-4 evaluates best/worst in each list (preference pairs) │ │
│ │ • DPO optimizes → π_Sr (senior generator) │ │
│ │ • β (DPO) = 0.1 (alignment vs reference tradeoff) │ │
│ │ • Output: High-quality synthesis model │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ STAGE 4: DATA REVISOR (Self-Improvement) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ • GPT-4 evaluates D_root on 3 aspects: │ │
│ │ 1. Relevance to task │ │
│ │ 2. Completeness per requirements │ │
│ │ 3. Factual accuracy │ │
│ │ • Produces revision signals → π_Re (revisor) │ │
│ │ • Refines synthetic data with minimal inference cost │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ FINAL PIPELINE │
│ π_Sr generates large-scale data → π_Re refines → High-quality dataset │
│ │
└─────────────────────────────────────────────────────────────────────────┘
4.3 SPEED Results¶
| Model | GPT API Calls | GPT Tokens | MTEB Score | Cost Efficiency |
|---|---|---|---|---|
| E5-mistral (GPT-only) | 500K | 180M | 63.2 | Baseline |
| SPEED (8B aligned) | 45K | 32M | 64.8 | 10× fewer calls |
| Mistral_llama3 | 230K | - | 62.6 | 2× worse than SPEED |
4.4 SPEED Scaling Law¶
SPEED discovered a log-linear relationship between embedding model performance and synthetic data size:
Performance = α × log(data_size) + β
Where:
- α ≈ 2.5-3.0 (slope)
- β ≈ 45-50 (intercept)
- Diminishing returns beyond ~1M samples
Practical implication for Indonesia-MTEB: Target 50K-100K high-quality samples per task type for optimal performance.
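As a rough planning aid, the law can be evaluated directly. A minimal sketch, assuming base-10 logarithms and the midpoint constants above (the paper's exact fit is not reproduced here):

```python
import math

def predicted_score(data_size: int, alpha: float = 2.75, beta: float = 47.5) -> float:
    """Log-linear scaling law: Performance = alpha * log10(data_size) + beta."""
    return alpha * math.log10(data_size) + beta

for n in (10_000, 50_000, 100_000, 1_000_000):
    print(f"{n:>9,} samples -> ~{predicted_score(n):.1f}")
```

With these constants, one extra point costs roughly a 2.3× increase in data, which is why the 50K-100K per-task target is a reasonable operating point.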
4.5 SPEED Key Hyperparameters¶
| Component | Hyperparameter | Optimal Value | Notes for Indonesian |
|---|---|---|---|
| Junior Generator | Temperature | 0.8-1.0 | Balance diversity/quality |
| Junior Generator | Training samples | 25K-50K | Use Indonesian seed data |
| Senior Generator (DPO) | β (DPO) | 0.1 | Trade-off alignment/reference |
| Senior Generator (DPO) | Training samples | 10K-15K | High-quality Indonesian pairs |
| Data Revisor | Training samples | 25K-35K | Easier than synthesis |
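For reference, the objective that the β row controls is the standard DPO loss; a minimal PyTorch sketch (not SPEED's training code), where each log-probability is summed over a response's tokens:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO: push the policy to prefer GPT-4's 'best' over 'worst' generations,
    while beta keeps it close to the junior-generator reference."""
    chosen = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen - rejected).mean()
```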
4.6 Self-Instruct Framework¶
Self-Instruct (Wang et al., 2023) bootstraps instruction-following data:
┌─────────────────────────────────────────────────────────────────────────┐
│ SELF-INSTRUCT FRAMEWORK │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ STEP 1: SEED GENERATION │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ • Human writes ~175 seed instruction-response pairs │ │
│ │ • For Indonesian: Include bilingual examples │ │
│ │ • Cover diverse task types │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ STEP 2: BOOTSTRAP GENERATION │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ • For each seed: Generate 8 new instructions │ │
│ │ • Prompt: "Generate 8 diverse instructions for..." │ │
│ │ • Language model generates both instruction and response │ │
│ │ • ~1,400 new pairs from 175 seeds │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ STEP 3: FILTERING │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ • Remove low-quality outputs │ │
│ │ • Filter by Indonesian language detection │ │
│ │ • Remove near-duplicates (MinHash) │ │
│ │ • Typical keep rate: 50-70% │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ STEP 4: ITERATION │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ • Add filtered data to training pool │ │
│ │ • Fine-tune model on new data │ │
│ │ • Repeat from Step 2 (typically 3-5 iterations) │ │
│ │ • Final dataset: 50K-100K instruction pairs │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
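A minimal sketch of one bootstrap round (Steps 2-3); `generate_fn` and `filter_fn` are placeholders for the LLM call and the filtering stack described above:

```python
def self_instruct_round(pool: list[dict], generate_fn, filter_fn,
                        n_per_seed: int = 8) -> list[dict]:
    """Expand each seed into new (instruction, response) pairs, keep survivors."""
    new_pairs = []
    for seed in pool:
        prompt = (f"Buat {n_per_seed} instruksi baru yang beragam, "
                  f"mirip dengan:\n{seed['instruction']}")
        for pair in generate_fn(prompt):   # returns dicts with instruction/response
            if filter_fn(pair):            # language ID + quality + MinHash dedup
                new_pairs.append(pair)
    return pool + new_pairs

# Typically iterated 3-5 times:
# pool = self_instruct_round(pool, llm_generate, passes_filters)
```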
5. Cost Estimation and Budgeting¶
5.1 API Pricing Comparison (2025)¶
| Model | Input (USD/1M) | Output (USD/1M) | Context | Indonesian Support |
|---|---|---|---|---|
| GPT-4o | $5.00 | $15.00 | 128K | ★★★★★ |
| Claude 3.5 Sonnet | $3.00 | $15.00 | 200K | ★★★★★ |
| Command R+ | $1.00 | $2.00 | 128K | ★★★★★ |
| Command-light | $0.30 | $0.60 | 128K | ★★★★☆ |
| GPT-4o-mini | $0.15 | $0.60 | 128K | ★★★★☆ |
5.2 Cost Estimation by Task¶
Assuming 10,000 samples per task type with average token counts:
| Task | Tokens/Sample | Total Tokens | Command R+ Cost | GPT-4o Cost | Savings |
|---|---|---|---|---|---|
| Classification | 150 | 1.5M | $2.25 | $11.25 | 80% |
| Clustering | 300 | 3.0M | $4.50 | $22.50 | 80% |
| Reranking | 500 | 5.0M | $7.50 | $37.50 | 80% |
| STS | 200 | 2.0M | $3.00 | $15.00 | 80% |
| Retrieval | 400 | 4.0M | $6.00 | $30.00 | 80% |
| Instruction | 250 | 2.5M | $3.75 | $18.75 | 80% |
| Total | - | 18M | $27.00 | $135.00 | $108 |
Self-hosted alternative (Qwen2.5-7B):
- Hardware: 1× RTX 4090 (24GB VRAM) @ $0.50/hour spot
- Generation time: ~50 hours for 60K samples
- Total cost: ~$25 + electricity
- Break-even: ~1M tokens vs Command R+
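The per-task figures above follow from a simple blended-rate calculation; a small helper that makes the prompt/completion split explicit (the 50/50 split is an assumption that reproduces the Command R+ column):

```python
PRICES_PER_M = {  # USD per 1M tokens (input, output), from Section 5.1
    "gpt-4o": (5.00, 15.00),
    "claude-3.5-sonnet": (3.00, 15.00),
    "command-r-plus": (1.00, 2.00),
    "command-light": (0.30, 0.60),
}

def generation_cost(model: str, n_samples: int, tokens_per_sample: int,
                    input_ratio: float = 0.5) -> float:
    """Blended cost estimate; input_ratio is the assumed share of prompt tokens."""
    millions = n_samples * tokens_per_sample / 1e6
    p_in, p_out = PRICES_PER_M[model]
    return millions * (input_ratio * p_in + (1 - input_ratio) * p_out)

print(generation_cost("command-r-plus", 10_000, 500))  # 7.5 -> matches the reranking row
```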
5.3 Budget Recommendations for Indonesia-MTEB¶
| Component | Recommended Approach | Estimated Cost |
|---|---|---|
| Seed data (5K samples) | GPT-4o or Claude 3.5 | $20-30 |
| Large-scale generation (50K+) | SPEED-aligned 8B or Command R+ | $50-100 |
| Validation (LLM-as-judge) | Claude 3.5 or GPT-4o | $30-50 |
| Human annotation (500 samples) | $2-3/sample | $1,000-1,500 |
| Infrastructure | Cloud GPU or on-premise | $100-200 |
| Total | - | $1,200-2,000 |
6. Task-Specific Generation Methods¶
6.1 Clustering Dataset Generation¶
Challenge: Indonesia-MTEB has zero dedicated clustering datasets.
┌─────────────────────────────────────────────────────────────────────────┐
│ CLUSTERING DATASET GENERATION PIPELINE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ INDONESIAN DATA SOURCES │
│ ├─ News: detik.com, kompas.com, tempo.co, CNN Indonesia │
│ ├─ Wikipedia Indonesia articles (id.wikipedia.org) │
│ ├─ Social media: Twitter/X, Instagram, TikTok │
│ ├─ E-commerce: Tokopedia, Shopee product descriptions │
│ └─ Government: indonesia.go.id publications │
│ │
│ GENERATION METHOD │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Step 1: Document Collection │ │
│ │ • Scraping from sources above (target: 50K-100K documents) │ │
│ │ • Clean and normalize text (see Section 10) │ │
│ │ │ │
│ │ Step 2: LLM-based Clustering │ │
│ │ • Prompt: See Section 7.2 (Indonesian clustering prompt) │ │
│ │ • Output: Document + cluster_id + cluster_label │ │
│ │ │ │
│ │ Step 3: Cluster Description Generation │ │
│ │ • Generate semantic descriptions for each cluster │ │
│ │ • Identify cluster themes and topics │ │
│ │ │ │
│ │ Step 4: Hard Negative Generation │ │
│ │ • Generate documents near cluster boundaries │ │
│ │ • Output: Boundary case documents for evaluation │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ VALIDATION METRICS │
│ ├─ Semantic coherence (avg intra-cluster cosine similarity) │
│ ├─ Cluster separation (inter-cluster distance) │
│ ├─ Silhouette score │
│ └─ Human verification (100-200 samples per dataset) │
│ │
│ TARGET DATASETS (10) │
│ ├─ News clustering (politics, sports, entertainment, etc.) │
│ ├─ Product clustering (e-commerce categories) │
│ ├─ Social media topic clustering │
│ ├─ Wikipedia article clustering │
│ ├─ Scientific document clustering │
│ └─ ... (5 more specialized domains) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
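Once documents are embedded, the coherence and silhouette checks listed above can be computed with scikit-learn; a minimal sketch:

```python
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import cosine_similarity

def cluster_quality(embeddings: np.ndarray, labels: np.ndarray) -> dict:
    """Semantic coherence (mean intra-cluster cosine similarity) + silhouette."""
    coherence = []
    for c in np.unique(labels):
        members = embeddings[labels == c]
        if len(members) > 1:
            sims = cosine_similarity(members)
            n = len(members)
            coherence.append((sims.sum() - n) / (n * n - n))  # mean off-diagonal
    return {
        "semantic_coherence": float(np.mean(coherence)),
        "silhouette": float(silhouette_score(embeddings, labels, metric="cosine")),
    }
```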
6.2 Reranking Dataset Generation¶
Challenge: Indonesia-MTEB has zero reranking datasets.
Data Structure: (query, candidates, ranking)
┌─────────────────────────────────────────────────────────────────────────┐
│ RERANKING DATASET GENERATION PIPELINE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ GENERATION METHOD │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Step 1: Query Generation │ │
│ │ • Sources: │ │
│ │ - Indonesian search logs (Google Trends ID) │ │
│ │ - FAQ websites │ │
│ │ - Yahoo Answers Indonesia (archive) │ │
│ │ • LLM generation: See Section 7.3 (Reranking prompt) │ │
│ │ • Target: 5,000 diverse queries │ │
│ │ │ │
│ │ Step 2: Passage Candidate Generation │ │
│ │ • Sources: Wikipedia Indonesia, news articles │ │
│ │ • For each query: │ │
│ │ - 1 positive (highly relevant) │ │
│ │ - 3-5 hard negatives (semantically similar but wrong) │ │
│ │ - 5-10 random negatives │ │
│ │ • Hard negative generation: See Section 8 │ │
│ │ │ │
│ │ Step 3: Ranking Annotation │ │
│ │ • LLM-as-Judge: Rank candidates by relevance │ │
│ │ • Output format: [pos, neg1, neg2, ...] (descending relevance) │ │
│ │ • Human verification: 10% sample (500 queries) │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ DOMAIN-SPECIALIZED GENERATION │
│ ├─ Legal: Indonesian law documents (UU, PP) with queries │
│ │ Sources: JDIH, peraturan.go.id │ │
│ ├─ Medical: Health articles with symptom/diagnosis queries │
│ │ Sources: Alodokter, Halodoc articles │ │
│ ├─ Finance: Financial news with analysis queries │
│ │ Sources: Kontan, Bisnis Indonesia, CNBC Indonesia │ │
│ └─ News: Current events with fact-based queries │
│ Sources: Detik, Kompas, Tempo │ │
│ │
│ TARGET DATASETS (10) │
│ ├─ General web search reranking │
│ ├─ Legal document reranking │
│ ├─ Medical Q&A reranking │
│ ├─ Financial news reranking │
│ ├─ E-commerce product search │
│ └─ ... (5 more specialized domains) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
6.3 STS (Semantic Textual Similarity) Generation¶
Challenge: Indonesia-MTEB has only 3 limited STS datasets (IndoSTS, translated STS-B, translated SICK-R).
┌─────────────────────────────────────────────────────────────────────────┐
│ STS DATASET GENERATION PIPELINE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ GENERATION APPROACHES │
│ │
│ Approach 1: Paraphrase Generation (High Similarity: 4.0-5.0/5.0) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ • Input: Indonesian sentence │ │
│ │ • LLM: Generate 3-5 paraphrases with high similarity │ │
│ │ • Example: │ │
│ │ Source: "Pemerintah menaikkan harga bbm." │ │
│ │ Paraphrase 1: "Harga bbm dinaikkan oleh pemerintah." │ │
│ │ Paraphrase 2: "Kenaikan bbm dilakukan pemerintah." │ │
│ │ Paraphrase 3: "Pemerintah resmikan kenaikan harga bbm." │ │
│ │ Similarity: 4.5-5.0 │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ Approach 2: Thematic Variation (Medium Similarity: 2.5-3.5/5.0) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ • Input: Topic + context │ │
│ │ • LLM: Generate sentences on same theme, different wording │ │
│ │ • Example: │ │
│ │ Sentence 1: "Timnas Indonesia menang 3-0 melawan Thailand." │ │
│ │ Sentence 2: "Pertandingan sepak bola berakhir dengan skor 3-0."│ │
│ │ Similarity: 2.8 (same event, different focus) │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ Approach 3: Dissimilar Generation (Low Similarity: 0.0-1.5/5.0) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ • Input: Two different topics │ │
│ │ • LLM: Generate sentences on unrelated themes │ │
│ │ • Example: │ │
│ │ Sentence 1: "Gempa bermagnitudo 5.4 mengguncang Jogjakarta." │ │
│ │ Sentence 2: "Harga emas mengalami kenaikan hari ini." │ │
│ │ Similarity: 0.5 (completely unrelated) │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ANNOTATION METHODOLOGY │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ • LLM-as-Judge: Annotate similarity scores (0-5) │ │
│ │ • Verification: Semantic similarity model (gte-Qwen2-7B) │ │
│ │ - Compute cosine similarity between embeddings │ │
│ │ - Filter: Remove pairs where similarity < 0.7 for high label │ │
│ │ • Calibration: Human annotators for 500 sample pairs │ │
│ │ - Target: ≥85% correlation with LLM annotations │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ TARGET DATASETS (10) │
│ ├─ News STS (headline paraphrases) │
│ ├─ Social media STS (informal vs formal) │
│ ├─ Wikipedia STS (article similarity) │
│ ├─ Question STS (question paraphrase) │
│ ├─ Discussion STS (forum comment similarity) │
│ └─ ... (5 more specialized domains) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
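The verification step in the annotation box above can be sketched with sentence-transformers; the checkpoint name and the linear mapping from cosine similarity onto the 0-5 scale are assumptions, not fixed choices:

```python
from sentence_transformers import SentenceTransformer, util

# Any strong multilingual embedding model can fill this role.
model = SentenceTransformer("Alibaba-NLP/gte-Qwen2-7B-instruct")

def verify_sts_pair(sent1: str, sent2: str, llm_score: float,
                    max_gap: float = 1.5) -> bool:
    """Flag pairs where the LLM label and embedding similarity disagree."""
    emb = model.encode([sent1, sent2])
    cosine = float(util.cos_sim(emb[0], emb[1]))
    embedding_score = cosine * 5.0  # rough mapping of [0, 1] cosine to the 0-5 scale
    return abs(embedding_score - llm_score) <= max_gap  # keep pair if scores agree
```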
6.4 Instruction Following Dataset Generation¶
Challenge: Indonesia-MTEB has zero instruction following datasets.
┌─────────────────────────────────────────────────────────────────────────┐
│ INSTRUCTION FOLLOWING DATASET GENERATION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ INSTRUCTION CATEGORIES │
│ ├─ General Q&A (pengetahuan umum) │
│ ├─ Summarization (ringkasan) │
│ ├─ Translation (terjemahan) │
│ ├─ Creative writing (menulis kreatif) │
│ ├─ Code generation (pembuatan kode) │
│ ├─ Reasoning (penalaran) │
│ ├─ Classification (klasifikasi) │
│ └─ Extraction (ekstraksi informasi) │
│ │
│ GENERATION PIPELINE │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Step 1: Instruction Creation │ │
│ │ • Seed: 200-500 manually written Indonesian instructions │ │
│ │ • Bootstrap: Use LLM to generate 10× more instructions │ │
│ │ • Filter: Remove low-quality/repetitive instructions │ │
│ │ │ │
│ │ Step 2: Response Generation │ │
│ │ • For each instruction, generate response │ │
│ │ • Ensure response is appropriate and accurate │ │
│ │ • Verify: LLM-as-judge checks response quality │ │
│ │ │ │
│ │ Step 3: Quality Control │ │
│ │ • Human verification: 500-1000 samples │ │
│ │ • Criteria: Relevance, accuracy, completeness, fluency │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ INSTRUCTION EXAMPLES (Indonesian) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Q: Jelaskan perbedaan antara gotong royong dan kerja bakti. │ │
│ │ A: Gotong royong adalah budaya saling membantu dalam pekerjaan │ │
│ │ yang bersifat timbal balik dan sukarela, sedangkan kerja │ │
│ │ bakti lebih fokus pada kegiatan sosial kemasyarakatan. │ │
│ │ │ │
│ │ Q: Buatlah ringkasan dari artikel berikut dalam 3 kalimat. │ │
│ │ [ARTICLE] │ │
│ │ A: [RINGKASAN] │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
7. Prompt Engineering with Indonesian Examples¶
7.1 Effective Prompting Strategies¶
| Strategy | Description | Example for Indonesian |
|---|---|---|
| Topic-Based Brainstorming | Sample topics from Indonesian categories | "Generate retrieval tasks for: Olahraga/Sepak Bola/Timnas" |
| Few-Shot Examples | Provide Indonesian examples in prompt | 3-5 Indonesian examples per task type |
| Structured Output | Require JSON format with Indonesian text | {"teks": "...", "label": "..."} |
| Register Specification | Specify formal/informal Indonesian | "Gunakan bahasa Indonesia baku (formal)" |
| Domain Specification | Specify Indonesian domain context | "Generate dokumen domain HUKUM Indonesia" |
7.2 Clustering Generation Prompt (Indonesian)¶
```python
CLUSTERING_GENERATION_PROMPT = """
Anda adalah generator dataset untuk tugas clustering dokumen Bahasa Indonesia.
TUGAS:
Buat 10 cluster dari dokumen-dokumen berikut ini. Setiap cluster harus memiliki tema yang jelas.
DOKUMEN:
{documents}
OUTPUT FORMAT (JSON):
```json
{{
"clusters": [
{{
"cluster_id": 0,
"cluster_name": "[Nama cluster dalam Bahasa Indonesia]",
"description": "[Deskripsi singkat tema cluster]",
"documents": [0, 3, 7, ...],
"sample_document": "[Contoh dokumen representatif]"
}}
]
}}
```

PERSYARATAN:
1. Gunakan Bahasa Indonesia yang natural dan baku
2. Setiap cluster minimal 5 dokumen
3. Cluster harus saling eksklusif (tidak ada tumpang tindih)
4. Beri nama cluster yang spesifik dan informatif
5. Deskripsi harus menjelaskan tema cluster dengan jelas

CONTOH CLUSTER NAME:
- "Berita Politik dan Pemerintahan"
- "Olahraga Sepak Bola"
- "Teknologi dan Gadget"
- "Ekonomi dan Bisnis"
"""
# Hard negative generation for clustering
CLUSTER_HARD_NEGATIVE_PROMPT = """
Buat 2 dokumen yang TIDAK termasuk dalam cluster "{cluster_name}" tetapi memiliki kata kunci yang mirip.
Cluster description: {description}
Dokumen harus terlihat mirip dengan topik cluster tetapi membahas hal yang berbeda.
Output dalam format JSON:
{{
"hard_negatives": [
{{
"text": "[Isi dokumen]",
"reason": "Alasan mengapa ini mirip tapi berbeda"
}}
]
}}
"""
```
7.3 Reranking Generation Prompt (Indonesian)¶
```python
RERANKING_GENERATION_PROMPT = """
Anda adalah generator dataset untuk tugas reranking Bahasa Indonesia.
TUGAS:
Generate pasangan (query, dokumen) dengan berbagai tingkat relevansi.
QUERY: "{query_domain}"
OUTPUT FORMAT (JSON):
```json
{{
"query": "[Pertanyaan natural dalam Bahasa Indonesia]",
"positive": "[Dokumen yang menjawab query dengan benar]",
"hard_negatives": [
{{
"text": "[Dokumen mirip tapi tidak menjawab]",
"reason": "Alasan mengapa ini hard negative"
}}
],
"random_negatives": [
{{
"text": "[Dokumen topik berbeda]",
"reason": "Alasan mengapa ini random negative"
}}
]
}}
```

PERSYARATAN:
1. Query harus natural seperti yang ditulis pengguna Indonesia
2. Query length: 10-30 kata
3. Positive dokumen: 100-300 kata, langsung menjawab query
4. Hard negative: 3-5 dokumen, mirip topik tapi salah jawab
5. Random negative: 5-10 dokumen, topik benar-benar berbeda

CONTOH:
Query: "Apa itu gotong royong?"
Positive: "Gotong royong adalah budaya tolong-menolong yang sudah ..."
Hard negative: "Kegiatan kerja bakti dilakukan oleh masyarakat untuk..."
(Salah karena kerja bakti ≠ gotong royong)
"""
DOMAIN_SPECIFIC_PROMPTS = {
"legal": """
DOMAIN: HUKUM Indonesia
Sumber: UU, PP, Peraturan Pemerintah
Query harus terkait dengan:
- Penjelasan pasal undang-undang
- Perbandingan regulasi
- Implikasi hukum
Positive: Kutipan langsung dari dokumen hukum yang relevan
Hard negative: Dokumen hukum topik mirip tapi tidak menjawab
""",
"medical": """
DOMAIN: KESEHATAN
Query harus terkait dengan:
- Gejala penyakit
- Diagnosis medis
- Rekomendasi pengobatan umum
Positive: Informasi medis akurat dari sumber terpercaya
Hard negative: Penyakit dengan gejala mirip tapi berbeda
""",
"news": """
DOMAIN: BERITA Indonesia
Query harus terkait dengan:
- Fakta peristiwa berita
- Analisis berita
- Konteks peristiwa
Positive: Berita yang langsung menjawab pertanyaan
Hard negative: Berita topik mirip tapi peristiwa berbeda
"""
}
```
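To turn these templates into data, a minimal driver sketch follows; it assumes an OpenAI-compatible endpoint, a placeholder model name, and that the model returns the requested JSON inside a ```json fence:

```python
import json
from openai import OpenAI  # any OpenAI-compatible endpoint works here

client = OpenAI()  # endpoint and model below are placeholders, not fixed choices

def generate_reranking_sample(query_domain: str, domain: str | None = None) -> dict:
    prompt = RERANKING_GENERATION_PROMPT.format(query_domain=query_domain)
    if domain:  # optionally prepend the domain-specific instructions
        prompt = DOMAIN_SPECIFIC_PROMPTS[domain] + "\n" + prompt
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.9,
    )
    raw = resp.choices[0].message.content.strip()
    raw = raw.removeprefix("```json").removesuffix("```").strip()
    return json.loads(raw)

sample = generate_reranking_sample("hukum ketenagakerjaan", domain="legal")
```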
7.4 STS Generation Prompt (Indonesian)¶
```python
STS_GENERATION_PROMPT = """
Anda adalah generator dataset untuk Semantic Textual Similarity (STS) Bahasa Indonesia.
TUGAS:
Generate pasangan kalimat dengan berbagai tingkat kemiripan.
TOPIK: {topic}
OUTPUT FORMAT (JSON):
```json
{{
"pairs": [
{{
"sentence1": "[Kalimat pertama dalam Bahasa Indonesia]",
"sentence2": "[Kalimat kedua dalam Bahasa Indonesia]",
"similarity": 4.5,
"label": "paraphrase"
}}
]
}}
```
TINGKAT KEMIRIPAN:
- 4.5-5.0: Parafrase hampir identik (paraphrase)
- 3.5-4.4: Makna sama, redaksi berbeda (high similarity)
- 2.5-3.4: Topik sama, aspek berbeda (medium similarity)
- 1.5-2.4: Sedikit kemiripan (low similarity)
- 0.0-1.4: Hampir tidak mirip (dissimilar)

PERSYARATAN:
1. Gunakan Bahasa Indonesia natural (baku atau gaul sesuai konteks)
2. Kalimat length: 10-30 kata
3. Hindari kata-kata pengisi yang tidak perlu
4. Pastikan skor similarity sesuai dengan tingkat kemiripan sebenarnya

CONTOH:
Score 5.0:
- "Pemerintah menaikkan harga bbm."
- "Harga bbm dinaikkan oleh pemerintah."

Score 3.0:
- "Timnas Indonesia menang 3-0 atas Thailand."
- "Pertandingan sepak bola berakhir dengan skor 3-0."

Score 1.0:
- "Gempa mengguncang wilayah Jogjakarta."
- "Harga emas mengalami kenaikan hari ini."
"""
SIMILARITY_CALIBRATION_PROMPT = """
Berikan skor similarity (0-5) untuk pasangan kalimat berikut:

PASANGAN 1:
Kalimat 1: "{sent1}"
Kalimat 2: "{sent2}"

Pertimbangkan:
1. Makna (meaning) - apakah menyampaikan informasi yang sama?
2. Entitas (entities) - apakah subjek/objeknya sama?
3. Konteks (context) - apakah dalam konteks yang sama?

Output JSON saja:
"""
```
7.5 Classification Generation Prompt (Indonesian)¶
```python
CLASSIFICATION_GENERATION_PROMPT = """
Anda adalah generator dataset untuk tugas klasifikasi teks Bahasa Indonesia.
KONTEKS:
{task_description}
LABELS: {labels}
Generate 5 contoh untuk setiap label.
OUTPUT FORMAT (JSON):
```json
{{
"examples": [
{{
"text": "[Teks Bahasa Indonesia]",
"label": "[label]"
}}
]
}}
```

PERSYARATAN:
1. Text length: 50-200 kata
2. Gunakan Bahasa Indonesia natural
3. Hindari bias label (setiap label harus punya ciri unik)
4. Sertakan variasi gaya penulisan (formal/informal sesuai konteks)
5. Labels harus mutually exclusive

CONTOH untuk Analisis Sentimen:
Label: positif, negatif, netral

Positif: "Produk ini sangat bagus, pengiriman cepat dan kualitas terjamin!"
Negatif: "Sangat kecewa, barang rusak saat sampai dan tidak bisa diretur."
Netral: "Barang sudah diterima, akan dicoba nanti."
"""
# Domain-specific classification prompts
DOMAIN_CLASSIFICATION_PROMPTS = {
"news_category": """
TUGAS: Klasifikasi kategori berita Indonesia
LABELS:
- politik: Berita tentang pemerintahan, pemilu, kebijakan
- ekonomi: Berita bisnis, pasar, investasi
- olahraga: Berita tentang atlet, pertandingan, kompetisi
- teknologi: Berita gadget, software, startup
- entertainment: Berita selebriti, film, musik
""",
"clickbait": """
TUGAS: Klasifikasi headline clickbait
LABELS:
- clickbait: Headline yang menyesatkan/mengada-ada untuk klik
- legitimate: Headline yang jujur dan akurat
KARAKTERISTIK CLICKBAIT:
- Menggunakan kata-kata sensasional ("MENGHEBOHKAN", "TERSERA")
- Menggunakan ellipsis (...) yang menggantung
- Tidak memberikan informasi jelas
- Overstatement (melebih-lebihkan)
""",
"formality": """
TUGAS: Klasifikasi level keformalan Bahasa Indonesia
LABELS:
- formal: Bahasa baku, sesuai EYD, untuk tulisan resmi
- informal: Bahasa gaul/slang, untuk percakapan sehari-hari
- mixed: Campuran formal dan informal
KARAKTERISTIK:
- Formal: gunakan "saya", "adalah", tidak ada singkatan
- Informal: gunakan "aku", "gue", ada singkatan (yg, utk, dll)
"""
}
```
7.6 Indonesian Topic Categories (for ODP Sampling)¶
Based on Indonesian content sources, here are recommended topic categories:
┌─────────────────────────────────────────────────────────────────────────┐
│ INDONESIAN TOPIC CATEGORIES │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ NEWS (Berita) │
│ ├─ Politik & Pemerintahan │
│ │ ├─ Pemilihan Umum │
│ │ ├─ Kebijakan Pemerintah │
│ │ ├─ Partai Politik │
│ │ └─ Pemerintahan Daerah │
│ ├─ Ekonomi & Bisnis │
│ │ ├─ Pasar Saham & Investasi │
│ │ ├─ UMKM │
│ │ ├─ Startup & Teknologi Finansial │
│ │ └─ Harga & Inflasi │
│ ├─ Olahraga │
│ │ ├─ Sepak Bola (Timnas, Liga) │
│ │ ├─ Badminton │
│ │ ├─ Olahraga Elektronik │
│ │ └─ PON & SEA Games │
│ └─ Hiburan │
│ ├─ Film & Sinema Indonesia │
│ ├─ Musik & Konser │
│ └─ Selebriti Tanah Air │
│ │
│ LIFESTYLE (Gaya Hidup) │
│ ├─ Kuliner │
│ │ ├─ Resep Masakan Indonesia │
│ │ ├─ Street Food (Nasi Goreng, Sate, Bakso) │
│ │ └─ Review Restoran │
│ ├─ Wisata │
│ │ ├─ Bali & Lombok │
│ │ ├─ Yogyakarta & Borobudur │
│ │ ├─ Raja Ampat & Bunaken │
│ │ └─ Wisata Kuliner │
│ └─ Fashion │
│ ├─ Batik & Tenun │
│ ├─ Muslim Fashion │
│ └─ Local Brands │
│ │
│ TECHNOLOGY (Teknologi) │
│ ├─ Smartphones & Gadgets │
│ ├─ Aplikasi Indonesia (Gojek, Traveloka, dll) │
│ ├─ Startup │
│ └─ Gaming │
│ │
│ CULTURE (Budaya) │
│ ├─ Gotong Royong & Nilai Kebangsaan │
│ ├─ Batik, Wayang, Tradisi │
│ ├─ Hari Raya (Idul Fitri, Natal, Imlek, Nyepi) │
│ └─ Bahasa Daerah │
│ │
│ SOCIETY (Masyarakat) │
│ ├─ Pendidikan │
│ ├─ Kesehatan │
│ ├─ Transportasi (MRT, LRT, Tol) │
│ └─ Infrastruktur │
│ │
└─────────────────────────────────────────────────────────────────────────┘
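For Stage 1 brainstorming (Section 4.2), these categories can be sampled as topic paths; a minimal sketch over a condensed version of the tree (ID_TOPIC_TREE here covers only a few branches and should be extended to the full hierarchy above):

```python
import random

ID_TOPIC_TREE = {
    "Berita": {
        "Politik & Pemerintahan": ["Pemilihan Umum", "Kebijakan Pemerintah"],
        "Olahraga": ["Sepak Bola (Timnas, Liga)", "Badminton"],
    },
    "Gaya Hidup": {
        "Kuliner": ["Resep Masakan Indonesia", "Review Restoran"],
        "Wisata": ["Bali & Lombok", "Yogyakarta & Borobudur"],
    },
}

def sample_topic_path() -> str:
    """Draw one leaf path, e.g. 'Berita/Olahraga/Badminton'."""
    top = random.choice(list(ID_TOPIC_TREE))
    mid = random.choice(list(ID_TOPIC_TREE[top]))
    leaf = random.choice(ID_TOPIC_TREE[top][mid])
    return f"{top}/{mid}/{leaf}"

# e.g. "Generate retrieval tasks for: " + sample_topic_path()
```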
8. Hard Negative Generation¶
8.1 Hard Negative Strategies¶
┌─────────────────────────────────────────────────────────────────────────┐
│ HARD NEGATIVE GENERATION STRATEGIES │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ STRATEGY 1: KEYWORD OVERLAP (Mirip tapi Salah) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Query: "Kapan kemerdekaan Indonesia diperingati?" │ │
│ │ Positive: "Proklamasi kemerdekaan Indonesia dibaca pada ..." │ │
│ │ Hard Negative: "Peringatan kemerdekaan negara lain ..." │ │
│ │ Reason: Kata "kemerdekaan" muncul tapi konteks berbeda │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ STRATEGY 2: ENTITY SUBSTITUTION (Entitas Salah) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Query: "Siapa presiden pertama Indonesia?" │ │
│ │ Positive: "Ir. Soekarno adalah presiden pertama RI..." │ │
│ │ Hard Negative: "Ir. Hatta adalah wakil presiden pertama..." │ │
│ │ Reason: Entitas tokoh mirip tapi jawaban salah │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ STRATEGY 3: TOPIC DRIFT (Topik Mirip, Beda Aspek) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Query: "Manfaat minum air putih bagi kesehatan" │ │
│ │ Positive: "Minum air putih membantu hidrasi tubuh..." │ │
│ │ Hard Negative: "Sumber air bersih semakin langka..." │ │
│ │ Reason: Topik sama (air) tapi beda aspek │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ STRATEGY 4: TEMPORAL MISMATCH (Waktu Salah) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Query: "Hasil Piala AFF 2024" │ │
│ │ Positive: "Timnas Indonesia juara AFF 2024..." │ │
│ │ Hard Negative: "Timnas Indonesia juara AFF 2022..." │ │
│ │ Reason: Entitas dan topik sama tapi tahun berbeda │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ STRATEGY 5: NUMERICAL DIFFERENCE (Angka Beda) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Query: "Berapa provinsi di Indonesia?" │ │
│ │ Positive: "Indonesia memiliki 38 provinsi..." │ │
│ │ Hard Negative: "DPR memiliki 560 anggota..." │ │
│ │ Reason: Ada angka tapi menjawab pertanyaan berbeda │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
8.2 Hard Negative Generation Prompt¶
```python
HARD_NEGATIVE_GENERATION_PROMPT = """
Anda adalah generator hard negative untuk tugas retrieval Bahasa Indonesia.
TUGAS:
Generate 3-5 hard negatives untuk query berikut.
QUERY: "{query}"
POSITIVE DOCUMENT: "{positive}"
HARD NEGATIVE ADALAH:
Dokumen yang:
1. Mengandung kata kunci mirip dengan query atau positive
2. Topiknya terkait tapi TIDAK menjawab query dengan benar
3. Mengecoh model retrieval (mirip secara semantik tapi salah)
STRATEGIES:
- Ganti entitas penting (nama, tempat, angka)
- Ubah konteks waktu (tahun, periode)
- Bedakan aspek dari topik yang sama
- Gunakan kata kunci mirip tapi arti berbeda
OUTPUT FORMAT (JSON):
```json
{{
"hard_negatives": [
{{
"text": "[Dokumen hard negative dalam Bahasa Indonesia]",
"strategy": "[nama strategy yang digunakan]",
"reason": "[Alasan mengapa ini hard negative]"
}}
]
}}
```
CONTOH:
Query: "Kapan proklamasi kemerdekaan Indonesia?"
Positive: "Proklamasi kemerdekaan Indonesia dibacakan oleh Ir. Soekarno pada tanggal 17 Agustus 1945..."

Hard Negative 1 (Entity substitution):
"Proklamasi kemerdekaan direncanakan oleh BPUPKI pada tanggal 1 Juni 1945..."
Reason: Ada "proklamasi" dan "kemerdekaan" tapi tanggal bukan 17 Agustus

Hard Negative 2 (Topic drift):
"Peringatan kemerdekaan Indonesia diperingati setiap tanggal 17 Agustus..."
Reason: Topik sama (kemerdekaan) tapi bukan menjawab "kapan" (tanggal proklamasi)

Hard Negative 3 (Related entity):
"Mohammad Hatta adalah proklamator bersama Ir. Soekarno..."
Reason: Menyebut tokoh terkait tapi tidak menjawab pertanyaan tanggal
"""
# Domain-specific hard negative generation
DOMAIN_HARD_NEGATIVE_PROMPTS = {
"legal": """
STRATEGI UNTUK DOMAIN HUKUM:

Query: "Apa pasal pembunuhan dalam KUHP?"
Positive: "Pasal 338 KUHP mengatur tentang pembunuhan..."
Hard Negative Ideas:
- Pasal terkait tapi bukan pembunuhan (mis: penganiayaan)
- Pasal pembunuhan di undang-undang lain
- Penjelasan pasal tapi tanpa isi pasalnya
""",
"medical": """
STRATEGI UNTUK DOMAIN KESEHATAN:
Query: "Apa gejala demam berdarah?"
Positive: "Gejala demam berdarah meliputi demam tinggi, nyeri sendi..."
Hard Negative Ideas:
- Penyakit dengan gejala mirip (demam tifoid, malaria)
- Komplikasi demam berdarah
- Pengobatan demam berdarah (bukan gejala)
""",
"news": """
STRATEGI UNTUK DOMAIN BERITA:
Query: "Hasil pertandingan Indonesia vs Thailand tadi malam"
Positive: "Timnas Indonesia menang 3-0 atas Thailand dalam..."
Hard Negative Ideas:
- Pertandingan Indonesia vs Thailand di turnamen berbeda
- Klasemen grup (bukan hasil pertandingan)
- Preview sebelum pertandingan (bukan hasil)
"""
}
```
8.3 Hard Negative Evaluation¶
```python
HARD_NEGATIVE_EVALUATION_PROMPT = """
Evaluasi apakah dokumen berikut merupakan hard negative yang baik.
QUERY: "{query}"
POSITIVE: "{positive}"
CANDIDATE: "{candidate}"
Jawab dengan YA jika candidate adalah hard negative yang baik, TIDAK jika bukan.
Hard negative yang baik:
- Secara semantik mirip dengan positive
- Mengandung kata kunci dari query
- TIDAK menjawab query dengan benar
- Akan mengecoh model retrieval

Output JSON saja:
```json
{{
"is_hard_negative": true/false,
"score": 0-10 (10 = sangat baik),
"reason": "[Penjelasan dalam Bahasa Indonesia]"
}}
"""
9. Quality Validation Pipeline¶
9.1 Multi-Stage Validation Framework¶
┌─────────────────────────────────────────────────────────────────────────┐
│ QUALITY VALIDATION PIPELINE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ STAGE 1: LANGUAGE DETECTION │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ • Tool: fastText, langdetect, or polyglot │ │
│ │ • Threshold: Indonesian confidence ≥ 0.8 │ │
│ │ • Reject: Non-Indonesian or code-mixed without ID │ │
│ │ • Typical keep rate: 95-98% │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ STAGE 2: DEDUPLICATION │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ • Method: MinHash with LSH (Locality Sensitive Hashing) │ │
│ │ • Threshold: Jaccard similarity < 0.85 │ │
│ │ • N-gram size: 3-5 for Indonesian │ │
│ │ • Reject: Near-duplicates │ │
│ │ • Typical keep rate: 90-95% │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ STAGE 3: SEMANTIC SIMILARITY FILTERING │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ • Model: gte-Qwen2-7B-instruct or SEA-LION-v4-embeddings │ │
│ │ • For retrieval: cosine similarity with positive │ │
│ │ - Hard negatives: 0.5-0.8 similarity (not too low/high) │ │
│ │ - Random negatives: < 0.3 similarity │ │
│ │ • For STS: Verify LLM score with embedding similarity │ │
│ │ - Flag pairs with large discrepancy (>1.5 points) │ │
│ │ • Typical keep rate: 70-85% │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ STAGE 4: LLM-AS-JUDGE VALIDATION │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ • Model: GPT-4o or Claude 3.5 Sonnet (for quality) │ │
│ │ • Prompts: See Section 11 │ │
│ │ • Criteria: Grammar, Fluency, Meaning Preservation, NER │ │
│ │ • Threshold: ≥ 3.5/5.0 overall to PASS │ │
│ │ • Typical keep rate: 75-90% │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ STAGE 5: HUMAN SPOT-CHECK │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ • Sample: 10% of passed data, minimum 100 per dataset │ │
│ │ • Annotators: Native Indonesian speakers │ │
│ │ • Criteria: Same as LLM-as-judge + cultural appropriateness │ │
│ │ • Disagreement: Prompt re-validation │ │
│ │ • Typical keep rate: 95-99% │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ OVERALL KEEP RATE: 40-60% (from generated to final dataset) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
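Stages 1 and 2 map directly onto off-the-shelf libraries; a minimal sketch using langdetect and datasketch with the thresholds from the pipeline above:

```python
from langdetect import detect_langs          # Stage 1: language detection
from datasketch import MinHash, MinHashLSH   # Stage 2: near-duplicate detection

def is_indonesian(text: str, threshold: float = 0.8) -> bool:
    return any(l.lang == "id" and l.prob >= threshold for l in detect_langs(text))

def minhash(text: str, n: int = 3) -> MinHash:
    m = MinHash(num_perm=128)
    for i in range(len(text) - n + 1):        # character n-grams (3-5 for Indonesian)
        m.update(text[i:i + n].encode("utf8"))
    return m

lsh = MinHashLSH(threshold=0.85, num_perm=128)  # Jaccard cutoff from Stage 2

def keep_if_novel(doc_id: str, text: str) -> bool:
    m = minhash(text)
    if lsh.query(m):      # a near-duplicate is already indexed
        return False
    lsh.insert(doc_id, m)
    return True
```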
9.2 Quality Metrics by Task Type¶
| Task | Language ID | Deduplication | Semantic Filter | LLM-Judge | Human |
|---|---|---|---|---|---|
| Classification | 98% | 95% | N/A | 90% | 99% |
| Clustering | 98% | 90% | 85% (intra-cluster) | 85% | 98% |
| Reranking | 98% | 95% | 80% (pos/neg check) | 85% | 97% |
| STS | 98% | 90% | 75% (score verify) | 80% | 95% |
| Retrieval | 98% | 95% | 80% (relevance) | 85% | 97% |
| Instruction | 97% | 92% | 75% (instruction check) | 80% | 95% |
| Overall | 98% | 92% | 80% | 84% | 97% |
10. Indonesian Text Normalization¶
10.1 Preprocessing Pipeline¶
```python
import re


class IndonesianTextNormalizer:
    """
    Normalizer for Indonesian text including slang, abbreviations,
    and code-mixing handling.
    """

    def __init__(self):
        # Kamus Alay (Indonesian Slang Dictionary) - sample entries
        self.slang_dict = {
            "yg": "yang",
            "utk": "untuk",
            "dgn": "dengan",
            "tdk": "tidak",
            "jg": "juga",
            "sdh": "sudah",
            "blm": "belum",
            "krn": "karena",
            "pd": "pada",
            "dpt": "dapat",
            "sy": "saya",
            "gw": "saya",  # informal first-person pronoun
            "km": "kamu",
            "dr": "dari",
            "kek": "kayak",
            "gitu": "begitu",
            # Discourse particles map to "" and are dropped
            "sih": "",
            "deh": "",
            "dong": "",
            "lho": "",
            "kok": "",
            "lh": "lah",
            "cpt": "cepat",
            "bgt": "banget",
            "bsk": "besok",
            "mlm": "malam",
            "pgi": "pagi",
            "sii": "si",
            "yaa": "ya",
            "ka": "ke",
            "diya": "dia",
            "nya": "-nya",
            # Add more from a comprehensive Kamus Alay
        }
        # Indonesian abbreviations
        self.abbrev_dict = {
            "ttd": "tertanda",
            "dlm": "dalam",
            "thn": "tahun",
            "bln": "bulan",
            "hri": "hari",
            "jln": "jalan",
            "no": "nomor",
            "tk": "toko",
            "pt": "perseroan terbatas",
            "cv": "curriculum vitae",
            "dll": "dan lain-lain",
            "dsb": "dan sebagainya",
            "ybs": "yang bersangkutan",
            "ap": "asisten",
            "dr": "dokter",
            "ir": "insinyur",
            "drg": "dokter gigi",
            # Add more...
        }
        # Emoticon-to-text mapping
        self.emoji_dict = {
            ":)": "senyum",
            ":D": "tersenyum",
            ":(": "sedih",
            ":'(": "menangis",
            "<3": "cinta",
            # Add more...
        }

    def normalize(self, text: str) -> str:
        """Full normalization pipeline."""
        text = self._normalize_whitespace(text)
        text = self._expand_abbreviations(text)
        text = self._normalize_slang(text)
        text = self._handle_emoji(text)
        text = self._normalize_repetition(text)
        text = self._remove_special_chars(text)
        return text.strip()

    def _normalize_whitespace(self, text: str) -> str:
        """Normalize whitespace characters."""
        return re.sub(r'\s+', ' ', text)

    def _expand_abbreviations(self, text: str) -> str:
        """Expand common Indonesian abbreviations."""
        for abbr, full in self.abbrev_dict.items():
            text = re.sub(r'\b' + re.escape(abbr) + r'\b', full, text)
        return text

    def _normalize_slang(self, text: str) -> str:
        """Normalize Indonesian slang (Bahasa Alay)."""
        words = text.split()
        normalized = []
        for word in words:
            lower_word = word.lower()
            if lower_word in self.slang_dict:
                replacement = self.slang_dict[lower_word]
                if replacement:  # skip empty replacements (dropped particles)
                    normalized.append(replacement)
            else:
                normalized.append(word)
        return ' '.join(normalized)

    def _handle_emoji(self, text: str) -> str:
        """Convert emoticons to text descriptions."""
        for emoji, meaning in self.emoji_dict.items():
            text = text.replace(emoji, f" {meaning} ")
        return text

    def _normalize_repetition(self, text: str) -> str:
        """Collapse 3+ repeated characters (e.g., 'sangaaat' -> 'sangat')."""
        text = re.sub(r'(.)\1{2,}', r'\1', text)
        return text

    def _remove_special_chars(self, text: str) -> str:
        """Remove unnecessary special characters while keeping Indonesian ones."""
        # Keep Latin characters, numbers, and basic punctuation
        text = re.sub(r'[^\w\s\u0020-\u007E\u00A0-\u00FF]', '', text)
        return text

    def detect_formality(self, text: str) -> str:
        """
        Detect if text is formal (baku) or informal (gaul).
        Returns: 'formal', 'informal', or 'mixed'
        """
        informal_indicators = [
            'yg', 'utk', 'tdk', 'jg', 'sy', 'km',
            'gue', 'lu', 'lo', 'ga', 'nggak',
            'sih', 'deh', 'dong', 'lho', 'kok',
            'bang', 'non', 'bos', 'kak'
        ]
        formal_indicators = [
            'yang', 'untuk', 'tidak', 'saya', 'kamu',
            'adalah', 'merupakan', 'yaitu', 'tersebut',
            'dalam', 'pada', 'oleh', 'dengan'
        ]
        words = text.lower().split()
        informal_count = sum(1 for w in words if w in informal_indicators)
        formal_count = sum(1 for w in words if w in formal_indicators)
        if informal_count == 0 and formal_count > 0:
            return 'formal'
        elif informal_count > 0 and formal_count == 0:
            return 'informal'
        elif informal_count > formal_count:
            return 'informal'
        elif formal_count > informal_count:
            return 'formal'
        else:
            return 'mixed'


# Usage
normalizer = IndonesianTextNormalizer()
text_gaul = "Gw lagi di jalan nih, macet parah bang"
normalized = normalizer.normalize(text_gaul)
# -> "saya lagi di jalan nih, macet parah bang"
# (only words present in slang_dict are expanded; "nih," keeps its comma
#  because the simple whitespace tokenizer does not strip punctuation)
formality = normalizer.detect_formality(text_gaul)
# -> "informal"
```
10.2 Code-Mixed Text Handling¶
Indonesian text often contains code-mixing (Indonglish):
```python
def detect_code_mixing(text: str) -> dict:
    """
    Detect English-Indonesian code-mixing in text.

    Returns:
        dict: Contains ratios, is_code_mixed flag, mixed_segments
    """
    # Simple word-level language detection via a tiny English stopword list.
    # This is only illustrative: it misses content words like "productive".
    # In production, use a full lexicon or a trained model (IndoJavE, IndoRobusta).
    english_words = set([
        'the', 'of', 'and', 'to', 'in', 'is', 'you', 'that', 'it', 'he',
        'was', 'for', 'on', 'are', 'as', 'with', 'his', 'they', 'at',
        'be', 'this', 'have', 'from', 'or', 'one', 'had', 'by', 'word'
    ])
    words = text.split()
    id_words = []
    en_words = []
    mixed_segments = []
    current_lang = None
    current_segment = []
    for word in words:
        word_lower = word.lower().strip('.,!?;:')
        lang = 'en' if word_lower in english_words else 'id'
        if lang != current_lang:
            if current_segment:
                mixed_segments.append(' '.join(current_segment))
            current_segment = [word]
            current_lang = lang
        else:
            current_segment.append(word)
        if lang == 'id':
            id_words.append(word)
        else:
            en_words.append(word)
    if current_segment:
        mixed_segments.append(' '.join(current_segment))
    en_ratio = len(en_words) / len(words) if words else 0
    return {
        'id_ratio': len(id_words) / len(words) if words else 0,
        'en_ratio': en_ratio,
        'is_code_mixed': 0.2 < en_ratio < 0.8,
        'mixed_segments': mixed_segments
    }

# Example
text_mixed = "Meeting hari ini sangat productive, kita achieved semua goals yang disepakati."
result = detect_code_mixing(text_mixed)
# With a full English lexicon this yields is_code_mixed: True with alternating
# segments such as ['Meeting', 'hari ini sangat', 'productive,', ...]; the toy
# stopword list above will not flag content words like "Meeting" or "goals".
```
10.3 Normalization for Generation vs Evaluation¶
| Purpose | Normalization Level | Rationale |
|---|---|---|
| Training data generation | Light (preserve register) | Maintain natural Indonesian |
| Embedding training | Medium (standardize) | Reduce noise, improve quality |
| Evaluation | Light (preserve original) | Real-world performance |
| Clustering | Heavy (normalize all) | Group similar documents |
11. LLM-as-a-Judge Validation¶
11.1 Validation Framework¶
Based on VN-MTEB and TR-MTEB methodologies:
┌─────────────────────────────────────────────────────────────────────────┐
│ LLM-AS-A-JUDGE VALIDATION PIPELINE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ CALIBRATION PHASE (Required for reliable validation) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ • Human annotation: 100-500 samples │ │
│ │ • Prompt iteration: Align LLM judgments with humans │ │
│ │ • Target: ≥85% agreement, ≥90% precision │ │
│ │ • TR-MTEB achieved: 85.2% agreement, 92.9% precision │ │
│ │ • Iterate until calibration targets met │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ VALIDATION CRITERIA (VN-MTEB 5-criteria adapted for Indonesian) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ 1. Grammar (Tata Bahasa): │ │
│ │ - Correct Indonesian grammar and syntax │ │
│ │ - Proper verb conjugation │ │
│ │ - Correct affixation (me-, ber-, -kan, etc.) │ │
│ │ │ │
│ │ 2. NER (Named Entity Preservation): │ │
│ │ - Indonesian names preserved (Siti, Budi, Joko) │ │
│ │ - Place names preserved (Jakarta, Jogja, Surabaya) │ │
│ │ - Cultural terms preserved (gotong royong, adat) │ │
│ │ │ │
│ │ 3. Numbers/Links (Angka dan Tautan): │ │
│ │ - Numbers preserved correctly (17 Agustus 1945) │ │
│ │ - Dates preserved (tgl, thn, bulan) │ │
│ │ - URLs and links preserved │ │
│ │ │ │
│ │ 4. Fluency (Kefasihan Bahasa): │ │
│ │ - Natural, native-like phrasing │ │
│ │ - Appropriate register (formal/informal) │ │
│ │ - No awkward calques from English │ │
│ │ │ │
│ │ 5. Meaning Preservation (Pelestarian Makna): │ │
│ │ - Semantic equivalence maintained │ │
│ │ - No information loss │ │
│ │ - No information gain (hallucination) │ │
│ │ │ │
│ │ Scoring: 1-5 scale per criterion, weighted average │ │
│ │ Threshold: ≥ 3.5/5.0 overall to PASS │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ CHAIN-OF-THOUGHT PROMPTING │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ "Evaluasi teks Bahasa Indonesia berikut: │ │
│ │ │ │
│ │ [GENERATED TEXT] │ │
│ │ │ │
│ │ Original: [SOURCE TEXT] │ │
│ │ │ │
│ │ Evaluasi langkah demi langkah: │ │
│ │ 1. Periksa kebenaran tata bahasa Indonesia │ │
│ │ 2. Verifikasi named entity tetap terjaga │ │
│ │ 3. Nilai kefasihan dan kealamian bahasa │ │
│ │ 4. Bandingkan makna dengan teks asli │ │
│ │ │ │
│ │ Output JSON: │ │
│ │ { │ │
│ │ 'grammar': 1-5, │ │
│ │ 'ner': 1-5, │ │
│ │ 'numbers': 1-5, │ │
│ │ 'fluency': 1-5, │ │
│ │ 'meaning': 1-5, │ │
│ │ 'overall': 1-5, │ │
│ │ 'pass': true/false, │ │
│ │ 'reason': '[Penjelasan singkat dalam ID]' │ │
│ │ }" │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
11.2 Calibration Results (TR-MTEB)¶
| Metric | Score | Target |
|---|---|---|
| Agreement | 85.2% | ≥85% |
| Precision | 92.9% | ≥90% |
| Recall | 84.4% | ≥80% |
| F1 Score | 88.4% | ≥85% |
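Agreement against the human-annotated sample can be computed directly; a minimal sketch with scikit-learn, assuming binary pass/fail labels on the same items:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def calibration_report(human: list[bool], llm: list[bool]) -> dict:
    """Compare LLM pass/fail decisions with human annotation (100-500 samples);
    iterate on the judge prompt until the targets above are met."""
    return {
        "agreement": accuracy_score(human, llm),    # target >= 0.85
        "precision": precision_score(human, llm),   # target >= 0.90
        "recall": recall_score(human, llm),         # target >= 0.80
        "f1": f1_score(human, llm),                 # target >= 0.85
    }
```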
11.3 LLM-as-Judge Prompts for Indonesian¶
```python
LLM_AS_JUDGE_PROMPTS = {
"classification": """
Evaluasi contoh data klasifikasi Bahasa Indonesia berikut:
Kriteria evaluasi:
1. Keakurasan label: Apakah label sesuai dengan isi teks?
2. Kejelasan teks: Apakah teks jelas dan mudah dipahami?
3. Kecukupan informasi: Apakah teks memiliki cukup informasi untuk klasifikasi?
Output JSON:
{{
"label_accuracy": 1-5,
"text_clarity": 1-5,
"information_sufficiency": 1-5,
"overall": 1-5,
"pass": true/false,
"reason": "[Penjelasan dalam Bahasa Indonesia]"
}}
""",
"retrieval": """
Evaluasi pasangan query-dokumen Bahasa Indonesia berikut:
Query: "{query}"
Document: "{document}"
Label: {label} (positive/negative)
Kriteria evaluasi:
1. Relevansi: Apakah dokumen relevan dengan query?
2. Kelengkapan: Apakah dokumen cukup menjawab query?
3. Akurasi: Apakah informasi dalam dokumen akurat?
Jika label adalah "positive", dokumen HARUS:
- Langsung menjawab query
- Memberikan informasi yang dibutuhkan
- Tidak menyesatkan atau menipu
Jika label adalah "negative", dokumen seharusnya:
- Tidak menjawab query
- Topik berbeda atau informasi kurang relevan
Output JSON:
{{
"relevance": 1-5,
"completeness": 1-5,
"accuracy": 1-5,
"overall": 1-5,
"pass": true/false,
"reason": "[Penjelasan dalam Bahasa Indonesia]"
}}
""",
"sts": """
Evaluasi skor similarity untuk pasangan kalimat Bahasa Indonesia:
Kalimat 1: "{sent1}"
Kalimat 2: "{sent2}"
LLM Score: {llm_score}
Evaluasi apakah skor LLM sesuai dengan kemiripan sebenarnya.
Pertimbangkan:
1. Makna (meaning): Apakah menyampaikan informasi serupa?
2. Konteks (context): Apakah dalam konteks yang sama?
3. Entitas (entities): Apakah membahas entitas yang sama?
Output JSON:
{{
"estimated_similarity": 0-5,
"llm_score_correct": true/false,
"adjustment": -2 to +2 (jika perlu),
"reason": "[Penjelasan dalam Bahasa Indonesia]"
}}
""",
"instruction_following": """
Evaluasi pasangan instruksi-respons Bahasa Indonesia:
Instruction: "{instruction}"
Response: "{response}"
Kriteria evaluasi:
1. Kepatuhan: Apakah respons mengikuti instruksi?
2. Kelengkapan: Apakah respons lengkap sesuai permintaan?
3. Akurasi: Apakah informasi dalam respons akurat?
4. Kejelasan: Apakah respons jelas dan mudah dipahami?
Output JSON:
{{
"instruction_following": 1-5,
"completeness": 1-5,
"accuracy": 1-5,
"clarity": 1-5,
"overall": 1-5,
"pass": true/false,
"reason": "[Penjelasan dalam Bahasa Indonesia]"
}}
"""
}
```
11.4 Recommended Judge Models for Indonesian¶
| Model | Parameters | Recommendation | Cost | Best For |
|---|---|---|---|---|
| Claude 3.5 Sonnet | - | ★★★★★ Best | $3 / $15 per 1M | Complex evaluation |
| GPT-4o | - | ★★★★★ Excellent | $5 / $15 per 1M | Quality critical |
| Command R+ | 104B | ★★★★☆ Very Good | $1 / $2 per 1M | Cost-efficient |
| Aya-23-35B | 35B | ★★★★☆ Good | Self-hosted | Indonesian-specialized |
| SEA-LION-v4 | 8B | ★★★☆☆ Fair | Self-hosted | Budget option |
| Qwen2.5-7B | 7B | ★★★☆☆ Fair | Self-hosted | Local evaluation |
Recommendation: Use Claude 3.5 Sonnet for calibration and final validation, Command R+ for large-scale filtering.
12. Indonesian-Specific Considerations¶
12.1 Linguistic Challenges¶
| Challenge | Description | Example | Mitigation |
|---|---|---|---|
| Formal vs Informal Register | Indonesian has formal (baku) and informal (gaul) variants | "Saya tidak setuju" vs "Gue nggak setuju" | Explicit register specification in prompts |
| Code-Mixing | English-Indonesian mixing common in urban areas | "Meeting ini very productive banget" | Include code-mixed examples or filter out |
| Reduplication | Common grammatical feature | "kata-kata", "orang-orang" | Ensure natural patterns in generation |
| Affixation | Complex prefix/suffix system | "me-lestari-kan", "ber-karya" | Morphology-aware prompting |
| Regional Influence | Javanese/Sundanese influence on colloquial Indonesian | "Wis mbok" (Javanese-influenced) | Specify standard Indonesian or include variations |
| Informal Abbreviations | Common abbreviations in informal text | "yg", "utk", "tdk" | Normalize (see the sketch below) or preserve based on use case |
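Where normalization is the right call for the final row above, a small lookup table goes a long way. The sketch below is a minimal normalizer; the abbreviation mapping covers only a handful of common forms and should be extended from corpus frequency lists before use at scale.

```python
# Lookup-based normalizer for informal Indonesian abbreviations.
# The mapping is a small illustrative subset.
import re

ABBREVIATIONS = {
    "yg": "yang", "utk": "untuk", "tdk": "tidak",
    "dgn": "dengan", "krn": "karena", "sdh": "sudah",
}

def normalize_abbreviations(text: str) -> str:
    """Expand known abbreviations token-by-token (case-insensitive lookup)."""
    def expand(match: re.Match) -> str:
        word = match.group(0)
        return ABBREVIATIONS.get(word.lower(), word)
    return re.sub(r"\b\w+\b", expand, text)

print(normalize_abbreviations("Yg utk dilaksanakan secepatnya"))
# -> "yang untuk dilaksanakan secepatnya"
```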
12.2 Cultural Considerations¶
| Aspect | Consideration | Implementation |
|---|---|---|
| Local Context | Indonesian cultural references | Use Indonesian topics in generation |
| Religious Sensitivity | Muslim-majority country | Respectful content guidelines, avoid sensitive topics |
| Geographic Diversity | 700+ ethnic groups across islands | Include topics from Sumatra, Java, Kalimantan, Sulawesi, Papua, etc. |
| Current Events | Local news and trends important | Include timely topics in training data |
| Cultural Concepts | Unique Indonesian concepts | Preserve terms like "gotong royong", "adat", "Pancasila" |
12.3 Domain-Specific Indonesian Corpora¶
| Domain | Sources | Size/Availability | Use Case |
|---|---|---|---|
| News | detik.com, kompas.com, tempo.co, CNN Indonesia | High (web scraping) | Clustering, STS, Classification |
| E-commerce | Tokopedia, Shopee, Bukalapak | Medium (datasets exist) | Retrieval, Classification |
| Legal | JDIH, peraturan.go.id | Medium (official) | Reranking (legal domain) |
| Medical | Alodokter, Halodoc articles | Medium (public) | Reranking (medical domain) |
| Government | indonesia.go.id | Medium (official) | Classification |
| Social Media | Twitter/X, Instagram | High (API access) | Informal register, code-mixing |
| Encyclopedia | Wikipedia Indonesia | High (dump available) | General knowledge, STS |
| Literature | Indonesian short stories, poems | Medium (public domain) | STS, summarization |
12.4 Existing Indonesian Datasets¶
Text Classification¶
- IndoNLU: 12 tasks including sentiment, aspect, NER
- CLICK-ID: 15,000 clickbait headlines from 12 publishers
- Indonesian Hoax News: 600 documents (372 valid, 228 fake)
Natural Language Inference¶
- IndoNLI: 18K sentence pairs (entailment, contradiction, neutral)
- SNLI Indo: Translated SNLI dataset for Indonesian
Semantic Textual Similarity¶
- IndoSTS: Translated STS-B for Indonesian
- SICK-R Indo: Translated SICK-R dataset
Question Answering¶
- TyDi QA: Indonesian subset of TyDi QA
- XQuAD: Indonesian subset (from Wikipedia)
Summarization¶
- IndoSum: ~19K news article-summary pairs
Parallel / Regional Languages¶
- NusaX: 10 Indonesian local languages, parallel with Indonesian + English
- SEACrowd: 36 SEA indigenous languages
13. Failure Mode Analysis¶
13.1 Common LLM Generation Errors for Indonesian¶
┌─────────────────────────────────────────────────────────────────────────┐
│ COMMON GENERATION ERRORS & MITIGATION │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ERROR TYPE 1: OVER-FORMALIZATION │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Description: LLM tends to generate overly formal Indonesian │ │
│ │ Example Input: "Gue lagi lapar nih" │ │
│ │ Generated: "Saya merasa lapar saat ini" │ │
│ │ Impact: Loss of register diversity │ │
│ │ Mitigation: Specify register in prompt, add few-shot examples │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ERROR TYPE 2: CODE-MIXING REMOVAL │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Description: LLM removes English words from code-mixed text │ │
│ │ Example Input: "Meeting ini productive banget" │ │
│ │ Generated: "Pertemuan ini sangat produktif" │ │
│ │ Impact: Loss of authentic Indonesian social media patterns │ │
│ │ Mitigation: Explicitly preserve English words in prompts │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ERROR TYPE 3: REDUPLICATION LOSS │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Description: LLM simplifies reduplicated words │ │
│ │ Example Input: "Orang-orang itu sedang berdiskusi" │ │
│ │ Generated: "Orang itu sedang berdiskusi" │ │
│ │ Impact: Loss of grammatical nuance │ │
│ │ Mitigation: Few-shot examples with reduplication │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ERROR TYPE 4: CULTURAL TERM ERASURE │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Description: LLM translates/removes Indonesian cultural terms │ │
│ │ Example Input: "Gotong royong adalah budaya Indonesia" │ │
│ │ Generated: "Kerja sama adalah budaya Indonesia" │ │
│ │ Impact: Loss of cultural specificity │ │
│ │ Mitigation: Add cultural terms to protected entities list │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ERROR TYPE 5: HALLUCINATED REGIONAL VARIANTS │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Description: LLM generates fake regional language words │ │
│ │ Example: Nonexistent Javanese or Sundanese vocabulary │ │
│ │ Impact: Low-quality training data │ │
│ │ Mitigation: Validate against NusaX dataset or native speakers │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ERROR TYPE 6: INCONSISTENT ABBREVIATION USAGE │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Description: LLM misuses informal abbreviations │ │
│ │ Example: "Yg utk dilaksanakan secepatnya" │ │
│ │ Issue: Mixed formal structure with informal abbreviations │ │
│ │ Impact: Unnatural text │ │
│ │ Mitigation: Register consistency checks in validation │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
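Error Type 4's mitigation can be enforced mechanically: reject any generation that drops a protected term present in its source. The term list below is an illustrative starting point and should be curated with native speakers rather than taken as-is.

```python
# Protected-term check for Error Type 4 (cultural term erasure).
# The term list is an illustrative starting point, not exhaustive.
PROTECTED_TERMS = {
    "gotong royong", "adat", "pancasila",
    "musyawarah", "batik", "wayang", "mudik",
}

def cultural_terms_preserved(source: str, generated: str) -> bool:
    """Reject a generation that drops a protected term present in the source."""
    src, gen = source.lower(), generated.lower()
    return all(term in gen for term in PROTECTED_TERMS if term in src)

# The failure case from the box above is caught:
assert not cultural_terms_preserved(
    "Gotong royong adalah budaya Indonesia",
    "Kerja sama adalah budaya Indonesia",
)
```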
13.2 Model-Specific Failure Patterns¶
| Model | Common Issues | Mitigation |
|---|---|---|
| GPT-4o | Over-formalization, cultural term erasure | Explicit cultural context in prompts |
| Claude 3.5 | Good with culture, sometimes overly literal | Few-shot examples with nuance |
| Command R+ | Generally good, occasional code-mixing issues | Specify code-mixing handling |
| SEA-LION-v4 | Good Indonesian, struggles with slang | Provide informal few-shot examples when slang is needed |
| SahabatAI | Best for informal, sometimes misses formal | Register specification required |
| Qwen2.5 | Good multilingual, less Indonesia-specific | Add Indonesia context |
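Several of the failure patterns above trace back to unspecified register, so making it explicit in the prompt is a cheap mitigation. The template below is a sketch; the Indonesian wording and few-shot examples are illustrative choices, not a validated prompt.

```python
# Register-explicit generation template. Wording and few-shot
# examples are illustrative, not a validated prompt.
REGISTER_EXAMPLES = {
    "formal": 'Contoh: "Saya tidak setuju dengan keputusan tersebut."',
    "informal": 'Contoh: "Gue nggak setuju sama keputusan itu."',
}

def build_generation_prompt(topic: str, register: str) -> str:
    """Pin the register so the model cannot default to over-formal output."""
    return (
        f"Tulis satu paragraf Bahasa Indonesia tentang {topic}.\n"
        f"Gunakan register {register} secara konsisten.\n"
        "Pertahankan kata bahasa Inggris yang lazim dipakai (code-mixing).\n"
        f"{REGISTER_EXAMPLES[register]}"
    )

print(build_generation_prompt("belanja online", "informal"))
```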
14. Implementation Roadmap¶
14.1 Generation Strategy by Task¶
| Task | Target Count | Generation Method | Primary Validation |
|---|---|---|---|
| Clustering | 10 datasets (50K docs) | Document clustering + LLM labeling | Intra-cluster similarity |
| Reranking | 10 datasets (5K queries) | Query + candidates (hard negatives) | LLM-as-judge ranking |
| STS | 10 datasets (15K pairs) | Paraphrase + thematic variation | Semantic similarity model |
| Classification | 5 datasets (25K samples) | Topic-based text generation | Label accuracy check |
| Pair Classification | 5 datasets (20K pairs) | NLI generation (IndoNLI style) | Logical consistency |
| Retrieval | 5 datasets (10K pairs) | Query-document (filling gaps) | Relevance scoring |
| Summarization | 2 datasets (5K pairs) | Article + summary generation | ROUGE + LLM-as-judge |
| Instruction Following | 5 datasets (50K pairs) | Instruction-response generation | Instruction adherence |
14.2 Resource Estimation¶
| Phase | Activity | Duration | Cost |
|---|---|---|---|
| 1. Preparation | Data collection, prompt design | 1 week | $100-200 |
| 2. Seed Generation | 5-10K samples via GPT-4o/Claude | 3-5 days | $30-50 |
| 3. Large-Scale Generation | 50-100K samples via Command R+ | 1-2 weeks | $50-100 |
| 4. Validation | LLM-as-judge + semantic filtering | 1 week | $30-50 |
| 5. Human Review | 500-1000 samples annotation | 1-2 weeks | $1,000-1,500 |
| 6. Integration | Format conversion, metadata | 3-5 days | $50-100 |
| Total | — | 4-6 weeks | $1,260-2,000 |
14.3 Quality Targets¶
| Metric | Target | Rationale |
|---|---|---|
| LLM-as-Judge Pass Rate | ≥80% | Slightly higher than VN-MTEB baseline |
| Semantic Similarity (retrieval) | ≥0.75 for positive | Standard threshold |
| Semantic Similarity (hard negative) | 0.5-0.8 | Not too high, not too low |
| Human Agreement | ≥85% | TR-MTEB calibration target |
| Deduplication Rate | <5% after filtering | MinHash-based filtering |
| Format Compliance | 100% | MTEB schema requirement |
| Indonesian Language ID | ≥95% | Language detection confidence |
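These thresholds can be wired into a single programmatic filter. The sketch below assumes fastText's `lid.176.bin` language-ID model and a multilingual MiniLM encoder as stand-ins; both model choices are assumptions, while the numeric thresholds come directly from the table.

```python
# Programmatic filter implementing the quality targets above.
# fastText lid.176.bin and MiniLM are assumed stand-in models.
import fasttext
from sentence_transformers import SentenceTransformer, util

lid = fasttext.load_model("lid.176.bin")
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def is_indonesian(text: str, min_conf: float = 0.95) -> bool:
    labels, probs = lid.predict(text.replace("\n", " "))  # fastText rejects newlines
    return labels[0] == "__label__id" and probs[0] >= min_conf

def similarity(a: str, b: str) -> float:
    emb = encoder.encode([a, b], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

def keep_pair(query: str, document: str, label: str) -> bool:
    if not (is_indonesian(query) and is_indonesian(document)):
        return False
    sim = similarity(query, document)
    if label == "positive":
        return sim >= 0.75          # positive-pair threshold
    return 0.5 <= sim <= 0.8        # hard-negative band
```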
14.4 Timeline¶
Week 1-2: Preparation & Seed Data
├─ Collect Indonesian corpora
├─ Design prompts for each task type
├─ Generate seed data (5K samples)
└─ Set up validation pipeline
Week 3-4: Large-Scale Generation
├─ Generate 50-100K samples per task
├─ Real-time quality monitoring
├─ Adjust prompts based on quality metrics
└─ Filter and deduplicate
Week 5: Validation & Human Review
├─ LLM-as-judge validation
├─ Human annotation (500-1000 samples)
├─ Calibrate LLM-as-judge
└─ Final filtering
Week 6: Integration & Documentation
├─ Format conversion to MTEB schema
├─ Metadata documentation
├─ HuggingFace upload
└─ Baseline model evaluation
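The "filter and deduplicate" step in Week 3-4 maps to the MinHash criterion in Section 14.3. A compact near-duplicate filter, assuming the `datasketch` library and a Jaccard threshold of 0.8, could look like this:

```python
# MinHash near-duplicate filter for the Week 3-4 dedup step.
# datasketch is an assumed dependency; threshold 0.8 approximates
# the Jaccard similarity at which two samples count as duplicates.
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode("utf8"))
    return m

def deduplicate(samples: list[str], threshold: float = 0.8) -> list[str]:
    """Keep the first occurrence of each near-duplicate cluster."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for i, text in enumerate(samples):
        m = minhash_of(text)
        if not lsh.query(m):        # no near-duplicate seen so far
            lsh.insert(f"s{i}", m)
            kept.append(text)
    return kept
```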
15. Case Studies from Regional MTEBs¶
15.1 VN-MTEB (Vietnamese)¶
Methodology: Translation-first approach
- Translated 41 datasets from English using a translation pipeline
- LLM-as-judge validation with 5 criteria
- 65-72% kept ratio after validation
- Focus on quality over quantity
Key Insights:
- Translation requires careful post-processing
- Cultural adaptation needed for idioms
- LLM-as-judge calibration essential
15.2 TR-MTEB (Turkish)¶
Methodology: Hybrid synthetic + human data
- 34.2M training pairs generated
- 11 new datasets created
- 85.2% human agreement achieved
- 6 core tasks covered
Key Insights:
- Self-instruct effective for Turkish
- Domain-specific datasets (legal, medical) valuable
- Calibration critical for LLM-as-judge
15.3 AfriMTEB (African Languages)¶
Methodology: Multicultural synthetic data
- 59 languages, 14 tasks, 38 datasets
- 6 new synthetic datasets created
- Cultural context preservation critical
- Focus on low-resource languages
Key Insights:
- Cultural knowledge important for generation
- Native speaker validation essential
- Regional variations need attention
15.4 SEA-BED (Southeast Asia)¶
Methodology: Regional collaboration
- 169 datasets across 10 SEA languages
- 71% human-labeled
- Multilingual approach
- Focus on SEA-specific tasks
Key Insights:
- Regional collaboration improves quality
- Shared resources reduce cost
- Cultural context is often similar across borders
15.5 ArabicMTEB (Arabic)¶
Methodology: Domain-specific synthetic data
- Command R+ for generation
- 40% synthetic data in training
- Dialectal variation (Egyptian, Moroccan)
- +16 points performance gain
Key Insights:
- Dialectal generation requires specific prompts
- Domain-specific data valuable
- Hard negative mining essential
16. Key Takeaways¶
16.1 Methodology Recommendations¶
| Priority | Recommendation | Rationale / Source |
|---|---|---|
| 1 | Use Command R+ or Command-light for generation | Cost-effective, quality output |
| 2 | Implement SPEED framework for scale | 10× cost reduction vs GPT-4 |
| 3 | LLM-as-judge with calibration | TR-MTEB: 88.4% F1 |
| 4 | Topic-based generation from Indonesian categories | SPEED finding |
| 5 | Domain-specific datasets (legal, medical, finance) | ArabicMTEB approach |
| 6 | Hard negative mining for retrieval/reranking | Core to embedding quality |
| 7 | Indonesian-specialized models (SEA-LION, SahabatAI) | Better Indonesian understanding |
| 8 | Register specification in prompts | Avoid over-formalization |
| 9 | Cultural term preservation | Maintain authenticity |
| 10 | Multi-stage validation (5 stages) | Quality assurance |
16.2 Critical Success Factors¶
- Calibration: Always calibrate LLM-as-judge with human labels (100-500 samples)
- Diversity: Use topic-based prompts to avoid mode collapse
- Validation: Multi-stage quality control (language → dedup → semantic → LLM judge → human)
- Indonesian Context: Localized prompts, cultural awareness, register specification
- Iterative Refinement: Start small, validate, then scale
- Cost Management: Use efficient models (Command-light, SPEED-aligned) for large scale
- Quality Over Quantity: Better to have 10K high-quality samples than 100K low-quality
- Native Speaker Review: Essential for cultural and linguistic nuances
16.3 Novelty Opportunities for Indonesia-MTEB¶
Based on comprehensive research, Indonesia-MTEB can introduce:
- Archipelago-Aware Generation: Regional variation in Indonesian (Javanese-influenced, Sundanese-influenced, Papuan-influenced)
- Formal Register Continuum: Explicit datasets across formal-informal spectrum (baku → gaul → alay)
- Code-Mixing Evaluation: Indonesian-English code-mixed data (realistic social media, Indonglish)
- Domain-Specific Forks: Legal Indonesian, Medical Indonesian, Financial Indonesian
- Cultural Knowledge: Indonesian-specific cultural queries from Wikipedia Indonesia
- Regional Language Integration: NusaX-style parallel data (10 regional languages + Indonesian)
- Real-Time Data: Dynamic dataset updates from current Indonesian news and trends
- Multi-Modal Embeddings: Image-text pairs for Indonesian e-commerce, tourism, food
17. References¶
Synthetic Data Frameworks¶
- SPEED: Chen et al. (2024). "Little Giants: Synthesizing High-Quality Embedding Data at Scale." arXiv:2410.18634.
- Self-Instruct: Wang et al. (2023). "Self-Instruct: Aligning Language Models with Self-Generated Instructions." ACL 2023.
- LLM-Driven Synthetic Data: Long et al. (2024). "On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation." ACL Findings 2024.
Regional MTEB Synthetic Data¶
- ArabicMTEB: Bhatia et al. (2025). "Swan and ArabicMTEB: Dialect-Aware, Arabic-Centric, Cross-Lingual, and Cross-Cultural Embedding Models and Benchmarks." NAACL 2025.
- TR-MTEB: Baysan & Güngör (2025). "TR-MTEB: A Comprehensive Benchmark and Embedding Model Suite for Turkish Sentence Representations." EMNLP 2025 Findings.
- VN-MTEB: Pham et al. (2025). "VN-MTEB: Vietnamese Massive Text Embedding Benchmark." arXiv:2507.21500.
- AfriMTEB: Uemura et al. (2025). "AfriMTEB and AfriE5: Benchmarking and Adapting Text Embeddings for African Languages." arXiv:2510.23896.
- SEA-BED: Ponwitayarat et al. (2025). "SEA-BED: Southeast Asia Embedding Benchmark." arXiv:2508.12243.
Indonesian LLM Models¶
- SEA-LION: Ng et al. (2025). "SEA-LION: Southeast Asian Languages in One Network." IJCNLP 2025.
- Cendol: Cahyawijaya et al. (2024). "Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian and Local Languages." arXiv:2404.06138.
- SahabatAI: GoTo & CSA Lab (2025). "SahabatAI: Indonesian-Centric Large Language Models."
- NusaCrowd: Cahyawijaya et al. (2023). "NusaCrowd: Open Source Initiative for Indonesian NLP Resources." ACL Findings 2023.
- SEACrowd: Lovenia et al. (2024). "SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages." EMNLP 2024.
Indonesian Datasets¶
- IndoNLI: Mahendra et al. (2021). "IndoNLI: A Natural Language Inference Dataset for Indonesian." EMNLP 2021.
- SNLI Indo: Putra et al. (2024). "SNLI Indo: A Recognizing Textual Entailment Dataset in Indonesian." Journal of Physics: Conference Series.
- CLICK-ID: William et al. (2020). "CLICK-ID: A Novel Dataset for Indonesian Clickbait Headlines." Data in Brief.
- IndoSum: Kurniawan & Louvan (2018). "IndoSum: A New Benchmark Dataset for Indonesian Text Summarization." IALP 2018.
LLM-as-a-Judge¶
- LLM-as-Judge Survey: arXiv:2411.15594 (2024).
- Chain-of-Thought for LLM-as-Judge: Arize AI (2025). "Evidence-Based Prompting Strategies for LLM-as-a-Judge."
Tools and Resources¶
- HuggingFace Synthetic Data Generator: huggingface.co/blog/synthetic-data-generator
- SPEED GitHub: github.com/haon-chen/SPEED
- IndoNLP: github.com/IndoNLP
18. Next Steps (Document Roadmap)¶
| Document | Content | Status |
|---|---|---|
| 01 | Project Overview | ✅ Complete |
| 02 | MTEB Structure Analysis | ✅ Complete |
| 03 | Existing Indonesian Datasets | ✅ Complete |
| 04 | Regional MTEB Methodologies | ✅ Complete |
| 05 | Translation Models Benchmark | ✅ Complete (Enhanced v2.0) |
| 06 | AI Dataset Generation Methods | ✅ Complete (Enhanced v2.0) |
| 07 | Validation Strategies | Pending |
| 08 | ACL Dataset Paper Standards | Pending |
| 09 | Novelty Angle & Publication | Pending |
| 10 | Implementation Roadmap | Pending |
Appendix A: Quick Reference¶
Recommended Models for Each Task¶
| Task | Best Generator | Best Validator | Cost Efficiency |
|---|---|---|---|
| Seed Data | GPT-4o / Claude 3.5 | Same | Low priority, quality first |
| Large-Scale | Command-light | Claude 3.5 | ★★★★★ |
| Indonesian-Specific | SEA-LION-v4 / SahabatAI | GPT-4o | ★★★★☆ |
| Cost-Optimized | SPEED-aligned 8B | Command R+ | ★★★★★ |
Cost Calculator¶
For 10,000 samples generation:
- Command R+: ~$6-7
- GPT-4o: ~$30-35
- Claude 3.5: ~$22-27
- SPEED-aligned 8B (self-hosted): ~$2-3
For 10,000 samples validation:
- Claude 3.5: ~$15-20
- Command R+: ~$4-5
Savings with Command R+: 70-80% vs GPT-4o/Claude
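The figures above follow from per-token prices; the helper below makes the arithmetic explicit. The per-sample token counts are rough assumptions chosen only to roughly reproduce the ranges listed.

```python
# Back-of-envelope helper behind the figures above. Prices are $ per 1M
# tokens; the per-sample token counts are rough assumptions.
PRICES = {  # model: (input price, output price) in $ per 1M tokens
    "command-r-plus": (1.0, 2.0),
    "gpt-4o": (5.0, 15.0),
    "claude-3.5-sonnet": (3.0, 15.0),
}

def estimate_cost(model: str, n_samples: int,
                  in_tokens: int = 300, out_tokens: int = 125) -> float:
    p_in, p_out = PRICES[model]
    return n_samples * (in_tokens * p_in + out_tokens * p_out) / 1_000_000

for model in PRICES:
    print(f"{model}: ${estimate_cost(model, 10_000):.2f} per 10K samples")
```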
"Synthetic data generation, when properly validated through LLM-as-judge and calibrated with human annotations, can fill critical dataset gaps while maintaining quality standards comparable to human-curated data. For Indonesia-MTEB, this approach enables rapid development of clustering, reranking, and STS datasets that are otherwise unavailable."
This document is a living record. Updated as research progresses.