Document 10: Implementation Roadmap
Overview
This document provides a comprehensive implementation roadmap for the Indonesia-MTEB project, including phases, timeline, resource requirements, technical specifications, team structure, risk management, and deployment strategy. It serves as the practical guide for transforming the research vision into a production-ready benchmark.
1. Project Timeline Overview
1.1 Gantt Chart Summary
```text
Phase 1: Foundation           Month 1-2    ████████
Phase 2: Data Aggregation     Month 2-4    ████████████
Phase 3: Translation Pipeline Month 3-6    ████████████████████
Phase 4: AI Generation        Month 5-7    ██████████████
Phase 5: Validation           Month 6-8    ████████████████
Phase 6: Integration          Month 8-9    ████████
Phase 7: Benchmark            Month 9-10   ████████
Phase 8: Publication          Month 10-12  ████████████████
```
1.2 Phase Summary
| Phase | Duration | Key Deliverables | Dependencies |
|---|---|---|---|
| 1. Foundation | 2 months | Infrastructure, licenses, source datasets | None |
| 2. Aggregation | 2 months | 50+ existing datasets formatted | Phase 1 |
| 3. Translation | 3 months | Translated MTEB datasets | Phase 1 |
| 4. AI Generation | 2 months | Novel datasets for gaps | Phase 1 |
| 5. Validation | 2 months | Quality validation, kept ratios | Phases 2-4 |
| 6. Integration | 1 month | MTEB PR, HuggingFace upload | Phase 5 |
| 7. Benchmark | 1 month | Model evaluation results | Phase 6 |
| 8. Publication | 3 months | Paper submitted, arXiv preprint | Phase 7 |
2. Phase 1: Foundation (Months 1-2)
2.1 Objectives
2.2 Infrastructure Setup
2.2.1 Development Environment
```text
# Repository structure
indonesia-mteb/
├── data/
│   ├── raw/            # Source datasets
│   ├── processed/      # Aggregated datasets
│   ├── translated/     # Machine-translated datasets
│   ├── generated/      # AI-generated datasets
│   └── validated/      # Final validated datasets
├── code/
│   ├── aggregation/    # Dataset aggregation scripts
│   ├── translation/    # Translation pipeline
│   ├── generation/     # AI dataset generation
│   ├── validation/     # Quality validation
│   └── evaluation/     # Benchmark evaluation
├── configs/
│   ├── datasets.yaml   # Dataset configurations
│   ├── models.yaml     # Model configurations
│   └── evaluation.yaml # Evaluation configurations
├── docs/
│   ├── dataset_cards/  # HuggingFace README templates
│   └── licenses/       # License tracking
└── tests/
    ├── unit/           # Unit tests
    ├── integration/    # Integration tests
    └── quality/        # Quality assurance tests
```
2.2.2 Cloud Infrastructure
| Service | Purpose | Est. Cost |
|---|---|---|
| GPU Compute | Translation, validation | $500-1,000/month |
| Storage | Dataset storage (S3/GCS) | $50-100/month |
| HuggingFace | Dataset hosting | Free-$20/month |
| GitHub | Code repository | Free |
| CI/CD | Automated testing | Free |
2.3 License Acquisition
```python
# License tracking template
LICENSE_TRACKING = {
    "dataset_name": {
        "source": "Original dataset URL",
        "license": "CC-BY-4.0",
        "attribution": "Citation string",
        "restrictions": [],  # e.g. ["ShareAlike", "NC"] for CC-BY-NC-SA sources
        "compatible": True,
        "derivative_license": "CC-BY-4.0",
    }
}

# Required licenses to track:
# - IndoNLU datasets (MIT License)
# - NusaCrowd datasets (varies)
# - SEACrowd datasets (varies)
# - MTEB original datasets (varies)
# - Machine translation models
```
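A small helper on top of this structure could flag incompatible sources early. A minimal sketch, assuming the simplified rules below (a rule-of-thumb check, not legal advice):

```python
def check_license_compatibility(entry: dict,
                                target_license: str = "CC-BY-4.0") -> bool:
    """Flag sources whose restrictions conflict with the planned
    derivative license (simplified rule-of-thumb check)."""
    restrictions = set(entry.get("restrictions", []))
    # NC sources cannot feed a benchmark released for unrestricted use;
    # SA sources force the derivative onto the same ShareAlike license.
    if "NC" in restrictions:
        return False
    if "ShareAlike" in restrictions and target_license != entry["license"]:
        return False
    return True

# Usage over the tracking table
incompatible = [name for name, entry in LICENSE_TRACKING.items()
                if not check_license_compatibility(entry)]
```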
2.4 Deliverables
3. Phase 2: Data Aggregation (Months 2-4)
3.1 Objectives
3.2 Dataset Inventory
| Task Category | Existing Datasets | Target | Gap |
|---|---|---|---|
| Classification | 15+ | 15+ | 0 |
| Clustering | 0 | 5+ | 5+ |
| Pair Classification | 2+ | 3+ | 1-2 |
| Reranking | 0 | 3+ | 3+ |
| Retrieval | 5+ | 15+ | 10+ |
| STS | 3+ | 5+ | 2-3 |
| Summarization | 5+ | 5+ | 0 |
| Instruction Following | 0 | 3+ | 3+ |
3.3 Aggregation Pipeline
```python
# Dataset aggregation framework
class DatasetAggregator:
    """Aggregates existing Indonesian datasets into MTEB format."""

    def __init__(self, source_dir: str, output_dir: str):
        self.source_dir = source_dir
        self.output_dir = output_dir
        self.formatters = {
            "classification": self.format_classification,
            "retrieval": self.format_retrieval,
            "sts": self.format_sts,
            # ... other formatters
        }

    def format_classification(self, raw_data: dict) -> dict:
        """Convert raw classification data to MTEB format."""
        return {
            "texts": raw_data["text"],
            "labels": raw_data["label"],
            "split": raw_data.get("split", "test"),
        }

    def format_retrieval(self, raw_data: dict) -> dict:
        """Convert raw retrieval data (queries/corpus/qrels) to MTEB format."""
        ...

    def format_sts(self, raw_data: dict) -> dict:
        """Convert raw STS pairs to MTEB format."""
        ...

    def validate_mteb_compatibility(self, dataset: dict) -> bool:
        """Validate that a dataset matches the MTEB schema."""
        required_fields = ["texts", "labels"]
        return all(field in dataset for field in required_fields)

    def create_dataset_card(self, dataset_name: str, metadata: dict) -> str:
        """Generate a HuggingFace dataset card (README.md with YAML metadata)."""
        ...
```
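A minimal usage sketch, with a toy classification sample and paths following the repository layout in Section 2.2.1:

```python
aggregator = DatasetAggregator(source_dir="data/raw", output_dir="data/processed")

raw = {"text": ["Filmnya bagus sekali"], "label": [1]}  # toy sample
formatted = aggregator.format_classification(raw)
assert aggregator.validate_mteb_compatibility(formatted)
```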
3.4 Deliverables
4. Phase 3: Translation Pipeline (Months 3-6)
4.1 Objectives
4.2 Translation Infrastructure
```python
# Translation pipeline implementation
# (the model classes below are stand-in wrappers around the actual
# inference code for the named checkpoints)
class TranslationPipeline:
    """3-stage translation pipeline adapted from VN-MTEB."""

    def __init__(self, config: dict):
        # Stage 1: language detection
        self.language_detector = Qwen2_5_3B_Instruct()
        # Stage 2: translation
        self.translator = TranslateGemma_12B()
        # Stage 3: validation
        self.semantic_model = gte_Qwen2_7B_instruct()
        self.llm_judge = Llama_SEA_LION_v4_70B_IT()

    def translate_dataset(self, source_data: list, task_type: str) -> dict:
        """Translate a dataset with 3-stage quality control."""
        # Stage 1: language detection
        english_samples = self.detect_english(source_data)
        # Stage 2: translation
        translated = self.batch_translate(english_samples)
        # Stage 3: validation
        validated = self.validate_translation(english_samples, translated, task_type)
        return {
            "original": english_samples,
            "translated": validated["texts"],
            "kept_indices": validated["kept_indices"],
            "kept_ratio": len(validated["texts"]) / len(english_samples),
        }

    def validate_translation(self, source: list, translated: list,
                             task_type: str) -> dict:
        """
        3-step validation:
        1. Language-contamination check
        2. Semantic-similarity threshold
        3. LLM-as-a-judge scoring
        """
        # Step 1: language check
        lang_valid = self.check_indonesian(translated)
        # Step 2: semantic similarity (per-task thresholds, Section 4.3)
        sem_valid = self.semantic_similarity_filter(
            source, translated,
            threshold=THRESHOLDS[task_type]
        )
        # Step 3: LLM-as-a-judge with cultural criteria
        llm_valid = self.llm_judge_scoring(
            source, translated,
            criteria=CULTURAL_CRITERIA
        )
        # Combine results: keep only samples that pass all three steps
        kept_indices = self.combine_validations(lang_valid, sem_valid, llm_valid)
        return {
            "texts": [translated[i] for i in kept_indices],
            "kept_indices": kept_indices,
        }
```
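A hedged usage sketch; `english_sts_pairs` is a hypothetical list of source samples, and the config keys are illustrative:

```python
pipeline = TranslationPipeline(config={"batch_size": 64})
result = pipeline.translate_dataset(source_data=english_sts_pairs, task_type="sts")
print(f"Kept ratio: {result['kept_ratio']:.2%}")  # target 55-60% for STS (Section 4.3)
```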
4.3 Threshold Configuration
| Task Type | Semantic Threshold | LLM Judge Threshold | Expected Kept Ratio |
|---|---|---|---|
| Classification | ≥0.75 | ≥3.5/5.0 | 70-75% |
| Clustering | ≥0.75 | ≥3.5/5.0 | 70-75% |
| Pair Classification | ≥0.75 | ≥3.5/5.0 | 65-70% |
| Reranking | ≥0.70 | ≥3.0/5.0 | 60-65% |
| Retrieval | ≥0.75 | ≥3.5/5.0 | 65-70% |
| STS | ≥0.80 | ≥4.0/5.0 | 55-60% |
| Summarization | ≥0.70 | ≥3.0/5.0 | 60-65% |
| Instruction Following | ≥0.75 | ≥3.5/5.0 | 65-70% |
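The `THRESHOLDS` and `CULTURAL_CRITERIA` constants referenced by the pipeline above can be read directly off this table; a minimal sketch (the `JUDGE_THRESHOLDS` name and the criteria strings are assumptions):

```python
# Semantic-similarity thresholds used by semantic_similarity_filter()
THRESHOLDS = {
    "classification": 0.75, "clustering": 0.75, "pair_classification": 0.75,
    "reranking": 0.70, "retrieval": 0.75, "sts": 0.80,
    "summarization": 0.70, "instruction_following": 0.75,
}

# LLM-as-a-judge minimum scores (out of 5.0)
JUDGE_THRESHOLDS = {
    "classification": 3.5, "clustering": 3.5, "pair_classification": 3.5,
    "reranking": 3.0, "retrieval": 3.5, "sts": 4.0,
    "summarization": 3.0, "instruction_following": 3.5,
}

# Illustrative judging criteria passed to llm_judge_scoring()
CULTURAL_CRITERIA = [
    "meaning preserved",
    "Indonesian cultural terms retained",
    "natural Indonesian phrasing",
]
```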
4.4 Resource Estimation
| Metric | Value | Notes |
|---|---|---|
| Total tokens to translate | ~500M tokens | Based on MTEB EN datasets |
| Translation speed | ~3,800 tokens/sec | With 4x H100 GPUs |
| Estimated time | ~28 days | ≈675 node-hours (2,700 GPU-hours ÷ 4 GPUs); raw translation is fast, validation passes dominate |
| GPU hours required | ~2,700 H100 hours | Including validation |
| Estimated cost | $5,000-10,000 | Cloud GPU costs |
4.5 Deliverables
5. Phase 4: AI Dataset Generation (Months 5-7)
5.1 Objectives
5.2 Target Tasks for AI Generation
| Task | Reason for AI Generation | Sample Size Target |
|---|---|---|
| Clustering | No existing Indonesian clustering datasets | 5 datasets, 100K+ samples each |
| Reranking | Limited Indonesian reranking data | 3 datasets, 10K+ samples each |
| Instruction Following | Novel task for Indonesian | 3 datasets, 5K+ samples each |
5.3 Generation Pipeline
```python
# AI dataset generation framework
class AIDatasetGenerator:
    """Generate synthetic datasets for tasks with limited Indonesian data."""

    def __init__(self, llm, embedder):
        self.llm = llm            # Llama-SEA-LION-v4-70B-IT
        self.llm_judge = llm      # the generator LLM doubles as the judge
        self.embedder = embedder  # gte-Qwen2-7B-instruct

    def generate_clustering_data(self, domain: str, n_samples: int) -> dict:
        """
        Generate a clustering dataset for Indonesian text.

        Strategy: use the LLM to generate diverse Indonesian texts within
        specific domains, then validate semantic diversity.
        """
        prompt = f"""
        Generate {n_samples} diverse Indonesian texts about {domain}.
        Each text should be 2-4 sentences long.
        Cover different aspects and subtopics within {domain}.
        Ensure natural Indonesian phrasing.
        Output format: JSON list with "text" field.
        """
        raw_texts = self.llm.generate(prompt, temperature=0.8)
        # Validate diversity (near-duplicate removal)
        validated = self.validate_diversity(raw_texts)
        return {
            "texts": validated,
            "domain": domain,
            "generated_by": "AI",
        }

    def generate_reranking_data(self, query: str, n_docs: int) -> dict:
        """
        Generate a reranking dataset.

        Strategy: generate query-document pairs with varying relevance levels.
        """
        prompt = f"""
        Given the query: "{query}"
        Generate {n_docs} Indonesian documents.
        - 30% highly relevant
        - 40% somewhat relevant
        - 30% not relevant
        Output format: JSON with "query", "documents", "relevance" fields
        """
        raw_data = self.llm.generate(prompt, temperature=0.7)
        # Validate relevance discrimination
        return self.validate_relevance(raw_data)

    def validate_diversity(self, texts: list) -> list:
        """
        Validate semantic diversity of generated texts.
        Remove near-duplicates using embedding similarity.
        """
        embeddings = self.embedder.encode(texts)
        diverse = diversity_filter(embeddings, threshold=0.85)  # see sketch below
        return [texts[i] for i in diverse]

    def validate_relevance(self, data: dict) -> dict:
        """Validate relevance labels using LLM-as-a-judge."""
        return self.llm_judge.review_relevance(data)
```
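The `diversity_filter` helper used above is not specified; a minimal sketch using greedy cosine-similarity deduplication (the greedy strategy and the 0.85 cutoff are assumptions):

```python
import numpy as np

def diversity_filter(embeddings: np.ndarray, threshold: float = 0.85) -> list[int]:
    """Greedily keep indices whose embedding stays below `threshold` cosine
    similarity to every embedding already kept (near-duplicate removal)."""
    # Normalize rows so dot products are cosine similarities
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i in range(len(normed)):
        if all(normed[i] @ normed[j] < threshold for j in kept):
            kept.append(i)
    return kept
```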
5.4 Quality Validation
```python
# AI-generated data validation
# (measure_diversity, measure_fluency, measure_label_quality, and
# human_check_sample are assumed project-level helpers.)
def validate_ai_dataset(dataset: dict, task_type: str) -> dict:
    """Comprehensive validation of AI-generated datasets."""
    results = {
        "diversity_score": measure_diversity(dataset["texts"]),
        "fluency_score": measure_fluency(dataset["texts"]),
        "label_quality": measure_label_quality(dataset),
        "human_sample": human_check_sample(dataset, n=100),
    }
    # Pass validation if:
    # - diversity score ≥ 0.7
    # - fluency score ≥ 0.8
    # - label-quality F1 ≥ 0.85
    # - human agreement ≥ 0.8
    passed = all([
        results["diversity_score"] >= 0.7,
        results["fluency_score"] >= 0.8,
        results["label_quality"]["f1"] >= 0.85,
        results["human_sample"]["agreement"] >= 0.8,
    ])
    return {"passed": passed, "details": results}
```
5.5 Deliverables
6. Phase 5: Validation (Months 6-8)
6.1 Objectives
6.2 Cultural Validation Framework
```python
# Cultural term preservation validation
# (terms are stored with underscores; matching uses their surface form,
# e.g. "gotong_royong" matches "gotong royong" in running text)
CULTURAL_TERMS = {
    # Social concepts
    "gotong_royong", "pancasila", "rukun", "siskamling", "musyawarah",
    # Religious/cultural
    "lebaran", "puasa", "halal_bil_halal", "nyepi", "waisak", "galungan",
    # Culinary
    "warung", "nasi_goreng", "rendang", "sate", "bakso", "gado_gado",
    # Arts/crafts
    "batik", "wayang", "gamelan", "keris", "ikat", "songket",
    # Geographic/identity
    "merantau", "kampung", "desa", "kos", "rumah_tinggi"
}

def validate_cultural_preservation(dataset: dict) -> dict:
    """Validate preservation of Indonesian cultural terms."""
    source_texts = dataset.get("source", [])
    translated_texts = dataset.get("translated", [])
    results = []
    for src, trans in zip(source_texts, translated_texts):
        source_terms = [t for t in CULTURAL_TERMS
                        if t.replace("_", " ") in src.lower()]
        trans_terms = [t for t in source_terms
                       if t.replace("_", " ") in trans.lower()]
        if source_terms:
            preservation_rate = len(trans_terms) / len(source_terms)
            results.append({
                "source": src,
                "translated": trans,
                "source_terms": source_terms,
                "preserved_terms": trans_terms,
                "preservation_rate": preservation_rate,
            })
    overall_rate = (sum(r["preservation_rate"] for r in results) / len(results)
                    if results else 1.0)
    return {
        "overall_preservation_rate": overall_rate,
        "term_level_results": results,
        "passes_threshold": overall_rate >= 0.9,
    }
```
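A small usage sketch with a toy translated pair:

```python
sample = {
    "source": ["The villagers practiced gotong royong to rebuild the bridge."],
    "translated": ["Warga desa bergotong royong membangun kembali jembatan itu."],
}
report = validate_cultural_preservation(sample)
print(report["overall_preservation_rate"], report["passes_threshold"])
```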
6.3 Code-Mixing Validation
```python
# Code-mixing detection for Indonesian-English
from nltk.tokenize import word_tokenize

def detect_code_mixing(text: str) -> dict:
    """Detect Indonesian-English code-mixing in text."""
    # Word-level language identification
    # (detect_language_word returns "id" or "en"; see sketch below)
    tokens = word_tokenize(text)
    lang_ids = [detect_language_word(t) for t in tokens]
    # Count switch points between adjacent tokens
    switches = sum(1 for i in range(1, len(lang_ids))
                   if lang_ids[i] != lang_ids[i - 1])
    # Share of English tokens (guard against empty input)
    en_ratio = sum(1 for l in lang_ids if l == "en") / max(len(lang_ids), 1)
    return {
        "has_code_mixing": switches > 0,
        "switch_count": switches,
        "english_ratio": en_ratio,
        "dominant_lang": max(set(lang_ids), key=lang_ids.count) if lang_ids else None,
    }

def validate_code_mixing_dataset(dataset: dict) -> dict:
    """Aggregate code-mixing statistics over a dataset."""
    results = [detect_code_mixing(text) for text in dataset["texts"]]
    return {
        "total_samples": len(dataset["texts"]),
        "code_mixed_samples": sum(r["has_code_mixing"] for r in results),
        "avg_switches": sum(r["switch_count"] for r in results) / len(results),
        "avg_english_ratio": sum(r["english_ratio"] for r in results) / len(results),
    }
```
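`detect_language_word` is left unspecified above. One minimal sketch is a lexicon lookup; the word lists here are illustrative stand-ins, and a production system would use a trained word-level language-ID model instead:

```python
# Hypothetical lexicon-based word-level language ID
INDONESIAN_WORDS = {"yang", "dan", "di", "ini", "itu", "tidak", "saya", "akan"}
ENGLISH_WORDS = {"the", "and", "of", "is", "this", "that", "not", "will"}

def detect_language_word(token: str) -> str:
    t = token.lower()
    if t in INDONESIAN_WORDS:
        return "id"
    if t in ENGLISH_WORDS:
        return "en"
    return "id"  # default to the matrix language
```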
6.4 Deliverables
7. Phase 6: Integration (Months 8-9)
7.1 Objectives
7.2 HuggingFace Organization Setup
```text
# Organization structure
huggingface.co/indonesia-mteb/
├── datasets/
│   ├── classification/
│   │   ├── indo-sentiment/
│   │   ├── indo-emotion/
│   │   └── ...
│   ├── clustering/
│   ├── reranking/
│   ├── retrieval/
│   ├── sts/
│   ├── summarization/
│   └── instruction-following/
└── models/  # For future Indonesian embedding models
```
7.3 Dataset Upload Script
```python
# Automated dataset upload to HuggingFace
import os

from huggingface_hub import HfApi

def upload_dataset_to_huggingface(
    dataset_path: str,
    dataset_name: str,
    organization: str = "indonesia-mteb",
) -> str:
    """Upload a validated dataset to HuggingFace."""
    api = HfApi()
    # Create the repository (no-op if it already exists)
    repo_id = f"{organization}/{dataset_name}"
    api.create_repo(repo_id, repo_type="dataset", exist_ok=True)
    # Upload files
    api.upload_folder(
        repo_id=repo_id,
        folder_path=dataset_path,
        repo_type="dataset",
    )
    return repo_id

# Batch upload all datasets
def upload_all_datasets(validated_dir: str):
    """Upload all validated datasets to HuggingFace."""
    for dataset_name in os.listdir(validated_dir):
        dataset_path = os.path.join(validated_dir, dataset_name)
        upload_dataset_to_huggingface(dataset_path, dataset_name)
        print(f"Uploaded: {dataset_name}")
```
7.4 MTEB Integration PR
```python
# MTEB dataset loader template
# File: mteb/indonesia_mteb/__init__.py
from .indo_sentiment import IndoSentimentClassification
from .indo_clustering import IndoClustering
from .indo_reranking import IndoReranking
# ... all datasets

__all__ = [
    "IndoSentimentClassification",
    "IndoClustering",
    "IndoReranking",
    # ...
]

# File: mteb/indonesia_mteb/indo_sentiment.py
from mteb.abstasks.AbsTaskClassification import AbsTaskClassification
from mteb.abstasks.TaskMetadata import TaskMetadata

class IndoSentimentClassification(AbsTaskClassification):
    metadata = TaskMetadata(
        name="IndoSentimentClassification",
        description="Indonesian sentiment classification dataset",
        reference="https://huggingface.co/datasets/indonesia-mteb/indo-sentiment",
        dataset={
            "path": "indonesia-mteb/indo-sentiment",
            "revision": "main",
        },
        type="Classification",
        category="s2s",
        eval_splits=["test"],
        eval_langs=["id-ID"],
        main_score="accuracy",
        date=("2024-01-01", "2024-12-31"),
        form=["written"],
        domains=["Social", "Reviews"],
        task_subtypes=["Sentiment classification"],
        license="CC-BY-4.0",
        annotations_creators="human-verified",
        dialect=[],
        sample_creation="found",
        bibtex_citation="""@dataset{indo_sentiment_2024,
  title={Indonesian Sentiment Classification},
  author={Indonesia-MTEB Team},
  year={2024}
}""",
    )
```
7.5 Deliverables
8. Phase 7: Benchmark (Months 9-10)
8.1 Objectives
8.2 Model Selection
| Model Type | Models to Evaluate | Reasoning |
|---|---|---|
| Multilingual (APE) | bge-m3, m-e5-large, gte-multilingual | Baseline comparison |
| Multilingual (RoPE) | e5-mistral-7b, gte-Qwen2-7B | State-of-the-art |
| Instruct-tuned | m-e5-large-instruct, bge-large-instruct | Instruction following |
| Indonesian-specialized | IndoBERT, SEA-LION variants | Local comparison |
8.3 Evaluation Script
```python
# Comprehensive benchmark evaluation
import mteb
from sentence_transformers import SentenceTransformer

class IndonesiaMTEBBenchmark:
    """Benchmark models on Indonesia-MTEB datasets."""

    def __init__(self):
        # Once the tasks are merged upstream, they can be selected by language
        self.evaluation = mteb.MTEB(tasks=mteb.get_tasks(languages=["ind"]))
        self.results = {}

    def evaluate_model(self, model_name: str, model_path: str):
        """Evaluate a single model on all Indonesia-MTEB tasks."""
        model = SentenceTransformer(model_path)
        results = self.evaluation.run(
            model,
            output_folder=f"results/{model_name}",
            eval_splits=["test"],
            encode_kwargs={"batch_size": 32},
            verbosity=2,
        )
        self.results[model_name] = results
        return results

    def generate_leaderboard(self) -> dict:
        """Generate a leaderboard from evaluation results."""
        leaderboard = {}
        for model_name, results in self.results.items():
            scores = {task: task_results.get("main_score", 0)
                      for task, task_results in results.items()}
            leaderboard[model_name] = {
                "scores": scores,
                "average": sum(scores.values()) / len(scores),
            }
        # Sort by average score, best first
        return dict(sorted(leaderboard.items(),
                           key=lambda x: x[1]["average"],
                           reverse=True))

    def compare_architectures(self) -> dict:
        """Compare APE vs RoPE vs Instruct-tuned performance."""
        comparison = {"APE": [], "RoPE": [], "Instruct": []}
        for model_name, results in self.results.items():
            arch_type = classify_architecture(model_name)  # see sketch below
            avg_score = sum(r.get("main_score", 0)
                            for r in results.values()) / len(results)
            comparison[arch_type].append(avg_score)
        return {
            arch: (sum(scores) / len(scores) if scores else None)
            for arch, scores in comparison.items()
        }
```
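`classify_architecture` is assumed above; a minimal name-based sketch covering the model families from Section 8.2 (the substring rules are illustrative, not a robust classifier):

```python
def classify_architecture(model_name: str) -> str:
    """Map a model name to the APE / RoPE / Instruct buckets of Section 8.2."""
    name = model_name.lower()
    if "instruct" in name:
        return "Instruct"
    if any(k in name for k in ("mistral", "qwen")):  # RoPE-based LLM encoders
        return "RoPE"
    return "APE"  # BERT-style absolute position embeddings by default
```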
8.4 Deliverables
9. Phase 8: Publication (Months 10-12)
9.1 Objectives
9.2 Paper Writing Timeline
| Week | Task | Deliverable |
|---|---|---|
| 1-2 | Outline & Abstract | Paper structure |
| 3-4 | Introduction & Related Work | Sections 1-2 |
| 5-6 | Methodology | Section 3 |
| 7-8 | Experiments & Results | Sections 4-5 |
| 9 | Discussion & Conclusion | Sections 6-7 |
| 10 | Ethics & Broader Impact | Appendices A-B |
| 11 | Internal Review | Revised draft |
| 12 | ARR Submission | Submitted paper |
9.3 Community Engagement

```python
# Community engagement checklist
COMMUNITY_TASKS = [
    # Pre-submission
    ("Create GitHub repository", "Month 10"),
    ("Post arXiv preprint", "Month 10"),
    ("Write blog post", "Month 10"),
    ("Social media announcement", "Month 10"),
    # Post-submission
    ("Submit to workshop", "Month 11"),
    ("Create demo/explanation", "Month 11"),
    ("Reach out to Indonesian NLP community", "Month 11"),
    ("Submit MTEB integration PR", "Month 11"),
    # Post-acceptance
    ("Prepare presentation", "Month 12"),
    ("Release code and data", "Month 12"),
    ("Create tutorial notebooks", "Month 12"),
]
```
9.4 Deliverables
10. Resource Requirements
10.1 Team Structure
| Role | Responsibilities | FTE |
|---|---|---|
| Principal Investigator | Overall direction, paper writing | 0.5 |
| Project Lead | Day-to-day management, coordination | 1.0 |
| ML Engineer | Translation pipeline, validation | 1.0 |
| Data Engineer | Aggregation, formatting | 1.0 |
| Backend Developer | Infrastructure, integration | 0.5 |
| Research Assistant | Literature review, testing | 0.5 |
| Indonesian Linguist | Cultural validation, annotation | 0.5 |

Total: 5 FTE
10.2 Compute Requirements
| Phase | GPU Type | GPU Hours | Est. Cost |
|---|---|---|---|
| Translation | 4x H100 | 2,700 | $5,000 |
| Validation | 4x A100 | 500 | $500 |
| Benchmark | 4x A100 | 800 | $800 |
| Total | - | 4,000 | $6,300 |
10.3 Budget Summary
| Category | Item | Cost (USD) |
|---|---|---|
| Compute | GPU hours | $6,300 |
| Storage | Cloud storage (1 year) | $600 |
| Personnel | 5 FTE × 12 months | $150,000 |
| Contingency | 10% buffer | $15,700 |
| Total | | $172,600 |
11. Risk Management
11.1 Risk Register
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Translation quality too low | Medium | High | Use stronger translation models (TranslateGemma-12B); tighten validation thresholds, accepting lower kept ratios |
| MTEB PR rejected | Low | Medium | Engage maintainers early; fall back to an independent release |
| Insufficient compute | Low | High | Secure funding allocation; use spot instances |
| License issues | Low | High | Comprehensive tracking; legal review |
| Team turnover | Medium | Medium | Documentation; knowledge transfer |
| Competition (SEA-BED) | Low | Low | Differentiate with the cultural validation framework |
11.2 Contingency Plans
If translation kept ratios fall below 50%:
- Lower the LLM-as-a-judge threshold, with manual spot-checks to confirm quality
- Use human annotation for critical samples
- Consider alternative translation models

If MTEB integration is delayed:
- Release an independent evaluation framework
- Create a compatibility layer
- Maintain a community fork if necessary

If the budget is insufficient:
- Prioritize core tasks (Classification, Retrieval, STS)
- Use smaller GPU clusters
- Seek additional funding
12. Quality Assurance
12.1 Testing Strategy
```python
# Test suite for Indonesia-MTEB
# (DATASETS, TASKS, MODEL, etc. are illustrative project-level fixtures.)
class TestIndonesiaMTEB:
    """Comprehensive test suite for all components."""

    def test_dataset_format(self):
        """All datasets match the MTEB format."""
        for dataset in DATASETS:
            assert validate_mteb_format(dataset)

    def test_translation_quality(self):
        """Kept ratios meet per-task minimums."""
        for task in TASKS:
            assert kept_ratio(task) >= MIN_KEPT_RATIOS[task]

    def test_cultural_preservation(self):
        """Cultural terms survive translation."""
        for dataset in TRANSLATED_DATASETS:
            assert cultural_preservation(dataset) >= 0.9

    def test_mteb_integration(self):
        """Tasks run end-to-end under the MTEB framework."""
        for task in MTEB_TASKS:
            assert mteb.run_evaluation(task, MODEL)

    def test_reproducibility(self):
        """Identical seeds produce identical results."""
        results1 = run_benchmark(MODEL, seed=42)
        results2 = run_benchmark(MODEL, seed=42)
        assert results1 == results2
```
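With the repository layout from Section 2.2.1, this suite would run under pytest, e.g. `pytest tests/ -v`.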
12.2 Validation Checklist
Each dataset must pass, at minimum, the checks exercised by the suite above:
- MTEB format validation
- Per-task minimum kept ratio (Section 4.3)
- Cultural term preservation ≥90% (Section 6.2)
- An end-to-end MTEB evaluation run
- Reproducibility under a fixed seed
13. Success Criteria
13.1 Phase-Based Milestones
| Phase | Success Criteria |
|---|---|
| Foundation | Infrastructure operational, licenses secured |
| Aggregation | 50+ datasets in MTEB format |
| Translation | ≥60% average kept ratio across tasks |
| AI Generation | All generated datasets pass validation |
| Validation | Cultural preservation ≥90% |
| Integration | All datasets on HuggingFace |
| Benchmark | 18 models evaluated |
| Publication | Paper submitted to tier-1 venue |
13.2 Overall Project Success
The Indonesia-MTEB project will be considered successful if:
- ✅ Coverage: All 8 MTEB task categories have Indonesian datasets
- ✅ Quality: Average kept ratio ≥65% across translated datasets
- ✅ Integration: Successfully integrated into MTEB leaderboard
- ✅ Community: At least 5 external uses/citations within 6 months
- ✅ Publication: Paper accepted at tier-1 NLP venue
14. Maintenance Plan
14.1 Long-Term Maintenance
| Activity | Frequency | Responsibility |
|---|---|---|
| Dataset updates | Quarterly | Project Lead |
| Bug fixes | As needed | ML Engineer |
| New model evaluation | Monthly | Community + Team |
| Documentation updates | As needed | Research Assistant |
| Community support | Ongoing | Project Lead |
14.2 Version Control
v1.0.0 (2026): Initial release with 50-100 datasets
v1.1.0 (2026): Add regional language tasks
v1.2.0 (2027): Add new datasets from community
v2.0.0 (2027): Major expansion with new task types
15. Summary
15.1 Implementation Timeline
Month 1-2: Foundation setup
Month 2-4: Dataset aggregation
Month 3-6: Translation pipeline
Month 5-7: AI dataset generation
Month 6-8: Validation and quality control
Month 8-9: MTEB integration
Month 9-10: Benchmark evaluation
Month 10-12: Paper submission and publication
15.2 Key Deliverables
| Phase | Deliverable |
|---|---|
| Foundation | Infrastructure, licenses, environment |
| Aggregation | 50+ existing datasets formatted |
| Translation | 40+ translated datasets |
| Generation | 11 novel AI-generated datasets |
| Validation | Quality validation reports |
| Integration | HuggingFace release, MTEB PR |
| Benchmark | Leaderboard, analysis |
| Publication | Conference paper, arXiv preprint |
15.3 Resource Summary
| Resource | Quantity |
|---|---|
| Team | 5 FTE |
| Duration | 12 months |
| Compute | 4,000 GPU hours |
| Budget | $172,600 |
| Datasets | 50-100+ |
| Models Evaluated | 18 |
This roadmap provides the practical foundation for transforming Indonesia-MTEB from vision to reality, ensuring all phases are properly planned, resourced, and executed.