Xertus AI Research Foundation

Welcome to the Xertus AI Research Foundation documentation.

Indonesia-MTEB Benchmark

A comprehensive text embedding benchmark for the Indonesian language.

11 documents covering:

- 3-pronged data strategy (aggregation, translation, AI generation)
- Cultural term preservation framework
- Code-mixing evaluation
- Regional language support


Quick Navigation

Indonesia-MTEB Benchmark

In Progress

This is an active research project. Documentation is being updated regularly.

| Document | Description |
| --- | --- |
| Project Overview | Problem statement, 3-pronged strategy |
| MTEB Structure | 8 task categories, formats, metrics |
| Indonesian Datasets | Inventory of 50+ datasets |
| Regional MTEBs | C-MTEB, VN-MTEB, TR-MTEB analysis |
| Translation Models | EN-ID translation comparison |
| AI Generation | LLM-based dataset generation |
| Validation | 3-stage pipeline, quality thresholds |
| ACL Standards | ARR submission, licensing |
| Novelty & Publication | Unique contributions |
| Implementation | 12-month timeline |
| Python Package | Building the package |

About

Xertus AI focuses on advancing NLP capabilities for Indonesian and Southeast Asian languages through open research.