
Document 08: ACL Dataset Paper Standards

Overview

This document outlines the standards, requirements, and best practices for publishing Indonesia-MTEB as a dataset paper in ACL venues (ACL, EMNLP, NAACL, COLING) via the ACL Rolling Review (ARR) system. It covers submission formats, dataset documentation standards, licensing requirements, responsible NLP considerations, and integration with the MTEB ecosystem.


1. ACL Rolling Review (ARR) Submission Requirements

1.1 Paper Format Specifications

| Requirement | Specification |
|---|---|
| Paper size | A4 (21 cm × 29.7 cm), strictly enforced |
| Templates | ACL 2025 style files (LaTeX/Word) |
| Page limits | Long papers: 8 pages + unlimited references; short papers: 4 pages + unlimited references |
| Anonymity | No anonymity period required (since February 2024) |
| File format | PDF only |
| Supplementary materials | Optional .tgz or .zip archive (max 200 MB) |

1.2 ARR Submission Checklist

  • Paper formatted with ACL template
  • Responsible NLP Research checklist completed (required since Dec 2024)
  • All cited artifacts properly attributed
  • License/terms of use discussed for all datasets
  • Supplementary materials prepared (code, data, appendices)
  • At least one author registered as reviewer for the same cycle (required since April 2024)

1.3 Responsible NLP Checklist (Required)

The ARR Responsible NLP Research checklist addresses:

  • Research Ethics: human subjects, data collection ethics
  • Societal Impact: potential harms, mitigation strategies
  • Reproducibility: code availability, data access, experimental details
  • Attribution: proper citation of all artifacts used

Critical Note: Since December 2024, ARR enforces desk rejections for incorrect, incomplete, or misleading Responsible NLP checklist filings.


2. Dataset Paper Structure

Based on analysis of the VN-MTEB, TR-MTEB, C-MTEB, and SEA-BED papers, a regional MTEB dataset paper should follow the structure below.

2.1 Recommended Structure

1. Abstract
2. Introduction
   - Motivation (language coverage gap)
   - Problem statement
   - Contributions
3. Related Work
   - MTEB and existing benchmarks
   - Regional MTEBs (C-MTEB, VN-MTEB, TR-MTEB, etc.)
   - Indonesian NLP resources (IndoNLU, NusaCrowd, Indo4B)
4. Methodology
   - Dataset collection strategy
   - Translation pipeline (if applicable)
   - Quality validation procedures
5. Indonesia-MTEB Benchmark
   - Task categories covered
   - Dataset overview table
   - Statistics and kept ratios
6. Experiments
   - Implementation details (models, hyperparameters)
   - Benchmark results
   - Analysis and insights
7. Conclusion
8. Limitations
9. Ethics Statement
A. Broader Impact Statement
B. Dataset Licenses
C. Supplementary Materials

2.2 Required Tables and Figures

| Element | Description |
|---|---|
| Dataset overview table | Task, # datasets, samples (before/after), kept ratio |
| Benchmark results table | Model performance across tasks |
| Translation pipeline figure | Visual representation of the methodology |
| Kept-ratio boxplot | Quality distribution by task type |
| Model performance vs. size | Correlation analysis |

2.3 Abstract Template

We introduce [Indonesia-MTEB], a comprehensive text embedding benchmark
for Indonesian covering [X] datasets across [Y] tasks. Despite Indonesian
being spoken by [270M+] people, existing embedding benchmarks lack
comprehensive Indonesian coverage. Our benchmark addresses this gap through
[3-pronged strategy: aggregation, translation, AI generation]. We evaluate
[Z] embedding models, revealing insights about [key findings]. Datasets are
available at: [HuggingFace URL]

3. HuggingFace Dataset Card Standards

3.1 YAML Metadata Format

Each dataset must include a README.md with YAML metadata:

```yaml
---
language:
  - id
  - en
  - jv
  - su
  - ms
license: cc-by-4.0
task_categories:
  - text-classification
  - clustering
  - pair-classification
  - reranking
  - retrieval
  - sentence-similarity
  - summarization
  - instruction-following
task_ids:
  - BitextMining
  - Classification
  - Clustering
  - PairClassification
  - Reranking
  - Retrieval
  - STS
  - Summarization
multilinguality:
  - translation
  - multilingual
size_categories:
  - 10K<n<100K
  - 100K<n<1M
  - 1M<n<10M
source_datasets:
  - original
  - extended
pretty_name: Indonesia MTEB
dataset_info:
  config_names:
    - default
  features:
    - name: text
      dtype: string
    - name: label
      dtype: string
  splits:
    - name: test
      num_bytes: X
      num_examples: Y
  download_size: X
  dataset_size: Y
---
```
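Before uploading, the front matter can be checked mechanically. A minimal validation sketch (assumes PyYAML is installed; the required-key set is our own convention, not a Hub rule):

```python
import re
import sys

import yaml  # pip install pyyaml

# Keys we require for every Indonesia-MTEB dataset card (project convention).
REQUIRED_KEYS = {"language", "license", "task_categories", "pretty_name"}

def validate_readme(path: str) -> list[str]:
    """Return a list of problems found in a README.md's YAML front matter."""
    text = open(path, encoding="utf-8").read()
    match = re.match(r"^---\n(.*?)\n---\n", text, re.DOTALL)
    if not match:
        return ["no YAML front matter found"]
    meta = yaml.safe_load(match.group(1)) or {}
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - meta.keys())]
    if "language" in meta and not all(isinstance(l, str) for l in meta["language"]):
        problems.append("language entries must be ISO 639-1 strings")
    return problems

if __name__ == "__main__":
    for problem in validate_readme(sys.argv[1]):
        print("WARN:", problem)
```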

3.2 Dataset Card Sections

```markdown
# Dataset Name

## Dataset Description
- Brief overview (2-3 sentences)
- Homepage/Project page URL
- Repository URL
- Paper URL

## Languages
- Language codes (ISO 639-1)
- Dialect/variety information

## Dataset Structure
### Data Instances
Example format for each task type

### Data Fields
- Field descriptions and types
- Label descriptions

### Data Splits
Train/validation/test split sizes

## Dataset Creation
### Curation Rationale
Why this dataset was created

### Source Data
Original data sources and licensing

### Annotations
Annotation process, annotator information

### Personal and Sensitive Information
Any PII considerations

## Considerations for Using the Data
### Social Impact of Dataset
Potential societal implications

### Discussion of Biases
Known biases in the data

### Other Known Limitations
Technical or quality limitations

## Additional Information
### Dataset Curators
Who created/maintained the dataset

### Licensing Information
License type and restrictions

### Citation Information
BibTeX citation

### Contributions
How to contribute or report issues
```
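Cards can also be generated programmatically. A sketch using huggingface_hub's DatasetCard helpers (the repo id is hypothetical, and the template kwargs depend on the default card template shipped with the library):

```python
from huggingface_hub import DatasetCard, DatasetCardData  # pip install huggingface_hub

# Structured metadata; this becomes the YAML front matter of the card.
card_data = DatasetCardData(
    language=["id"],
    license="cc-by-4.0",
    task_categories=["text-classification"],
    pretty_name="Indonesian Sentiment Classification",
)

# Render the card body from the library's default template, then publish it.
card = DatasetCard.from_template(
    card_data,
    pretty_name=card_data.pretty_name,
)
card.push_to_hub("indonesia-mteb/sentiment")  # hypothetical repo id
```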

3.3 License Metadata Standards

For Indonesia-MTEB datasets:

| License Type | When to Use | Notes |
|---|---|---|
| CC-BY-4.0 | Default for new translations | Recommended by OpenAIRE for academic data |
| CC-BY-SA-4.0 | Derived from CC-BY-SA sources | ShareAlike requirement propagates |
| CC0 | Public domain dedications | For fully permissive use |
| ODC-BY-1.0 | Open Data Commons attribution | Alternative to CC licenses |
| Custom | Original dataset licenses | Terms must be specified clearly |

4. Licensing and Attribution Requirements

4.1 Indonesian Dataset Licenses

| Dataset | License | Citation Requirement |
|---|---|---|
| Indo4B | Custom (check IndoNLP) | Wilie et al. (2020) |
| IndoNLU | MIT License | Wilie et al. (2020) |
| NusaCrowd | Varies by dataset | Cahyawijaya et al. (2023) |
| SEACrowd | Varies by dataset | SEACrowd Consortium |

4.2 Translation Derivative Works

Key Legal Considerations:

  1. Machine Translation Creates Derivative Works: Under copyright law, translations are derivative works of the original
  2. CC License Compatibility: When translating CC-BY-SA content, translations must also be CC-BY-SA (ShareAlike requirement)
  3. Source License Propagation: Track all source licenses as they may impose restrictions on derivative works

Best Practices:

  • Maintain a license tracking table for all source datasets (a minimal sketch follows below)
  • Clearly document any license incompatibilities
  • Prefer CC-BY-4.0 (no ShareAlike) where possible to avoid downstream restrictions
  • Provide attribution in README.md and in the paper's acknowledgments
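To make the ShareAlike propagation rule mechanical, the tracking table can be kept in code. A minimal sketch (the dataset names and license picks are illustrative, not a legal determination):

```python
from dataclasses import dataclass

# ShareAlike licenses whose terms propagate to derivative works.
SHARE_ALIKE = {"cc-by-sa-4.0", "cc-by-sa-3.0"}

@dataclass
class SourceDataset:
    name: str
    license: str  # SPDX-style identifier, lowercased

def derivative_license(sources: list[SourceDataset], preferred: str = "cc-by-4.0") -> str:
    """Pick a license for a translated derivative: ShareAlike sources force ShareAlike."""
    sa_licenses = {s.license for s in sources if s.license in SHARE_ALIKE}
    if len(sa_licenses) > 1:
        raise ValueError(f"incompatible ShareAlike licenses: {sa_licenses}")
    return sa_licenses.pop() if sa_licenses else preferred

# Illustrative tracking table for one derived dataset.
sources = [SourceDataset("source-sentiment", "mit"),
           SourceDataset("source-nli", "cc-by-sa-4.0")]
print(derivative_license(sources))  # -> cc-by-sa-4.0
```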

4.3 Attribution Format

```bibtex
@dataset{indonesia_mteb_2025,
  title        = {Indonesia-MTEB: Indonesian Massive Text Embedding Benchmark},
  author       = {Author Names},
  year         = {2025},
  publisher    = {Hugging Face},
  version      = {1.0.0},
  url          = {https://huggingface.co/datasets/indonesia-mteb/...},
  license      = {CC-BY-4.0}
}
```

Required Attribution for Sources:

  • Original MTEB datasets (Muennighoff et al., 2023)
  • Indo4B/IndoNLU (Wilie et al., 2020)
  • NusaCrowd (Cahyawijaya et al., 2023)
  • Translation models used (e.g., TranslateGemma-12B)

4.4 Indonesia Personal Data Protection Law (PDP Law)

Law No. 27/2022 Compliance:

  • Came into force October 17, 2022
  • Full compliance required since October 17, 2024 (end of the two-year transition period)
  • Data breach notification required within 72 hours
  • Explicit consent required for personal data processing

Implications for Indonesia-MTEB:

  • Review all datasets for potential PII (a first-pass scan sketch follows below)
  • Anonymize or exclude personal data
  • Document privacy measures in the paper
  • Consider IRB review if any human-subject data is involved
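A first-pass PII scan can be automated ahead of manual review. A minimal sketch using regular expressions (the patterns are illustrative, not exhaustive; real review needs locale-aware rules and a human pass):

```python
import re

# Illustrative patterns only; high false-positive/negative rates are expected.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone_id": re.compile(r"\b(?:\+62|0)8\d{8,11}\b"),  # common Indonesian mobile formats
    "nik": re.compile(r"\b\d{16}\b"),  # 16-digit national ID; very noisy, flag for review
}

def scan_text(text: str) -> dict[str, list[str]]:
    """Return suspected PII matches per category for one text sample."""
    hits = {kind: pattern.findall(text) for kind, pattern in PII_PATTERNS.items()}
    return {kind: found for kind, found in hits.items() if found}

sample = "Hubungi saya di budi@example.com atau 081234567890."
print(scan_text(sample))
# {'email': ['budi@example.com'], 'phone_id': ['081234567890']}
```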


5. MTEB Integration Requirements

5.1 Dataset Submission to MTEB

Contribution Points System (as of 2026):

  • First dataset for a language × task combination: 4 bonus points
  • Additional datasets: 1 point per language × task combination
  • New languages (≥12): extra recognition

PR Requirements:

  1. Follow the MTEB contribution guidelines
  2. Include the dataset loader in the mteb/ directory
  3. Add metadata for the leaderboard
  4. Ensure compatibility with the MTEB evaluation scripts

5.2 Dataset Format Requirements

```python
# Example: MTEB-compatible dataset class (field values are illustrative;
# check the current TaskMetadata schema in the mteb repository before submitting)
from mteb.abstasks import AbsTaskClassification
from mteb.abstasks.TaskMetadata import TaskMetadata

class IndonesianDataset(AbsTaskClassification):
    metadata = TaskMetadata(
        name="DatasetName",
        description="Description of dataset",
        reference="https://arxiv.org/abs/xxxx.xxxxx",
        dataset={
            "path": "indonesia-mteb/dataset-name",
            "revision": "revision-hash",
        },
        type="Classification",  # or Retrieval, STS, etc.
        category="s2s",  # sentence-to-sentence; alternatives: s2p, p2p
        eval_splits=["test"],
        eval_langs=["ind-Latn"],  # mteb expects ISO 639-3 plus script, not "id-ID"
        main_score="accuracy",
        date=("2024-01-01", "2024-12-31"),
        form=["written"],
        domains=["Social", "Reviews"],
        task_subtypes=["Sentiment classification"],
        license="cc-by-4.0",
        annotations_creators="human-annotated",
        dialect=[],
        sample_creation="created",
        bibtex_citation="""@article{...}""",
    )
```
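Once the loader is in place, the task can be smoke-tested locally. A minimal sketch, assuming a recent mteb release and the sentence-transformers package (the model choice and output folder are arbitrary):

```python
import mteb
from sentence_transformers import SentenceTransformer

# Any embedding model works here; this small multilingual baseline is arbitrary.
model = SentenceTransformer("intfloat/multilingual-e5-small")

# Look the task up by the name declared in TaskMetadata above.
tasks = mteb.get_tasks(tasks=["DatasetName"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/indonesia-mteb")
print(results)
```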

5.3 Leaderboard Integration

Steps to Add Indonesia-MTEB to Leaderboard:

  1. Create HuggingFace organization: indonesia-mteb
  2. Upload all datasets with proper YAML metadata
  3. Submit PR to embeddings-benchmark/mteb GitHub
  4. Update leaderboard configuration
  5. Run evaluation on reference models

Benchmark Configuration:

```yaml
# mteb/leaderboard/benchmark_configs/indonesia_mteb.yaml
name: Indonesia-MTEB
description: Indonesian Massive Text Embedding Benchmark
version: 1.0.0
languages: [id]
tasks:
  - Classification
  - Clustering
  - PairClassification
  - Reranking
  - Retrieval
  - STS
datasets:
  - dataset1
  - dataset2
  # ...
```


6. Responsible NLP Checklist Responses

6.1 Required Checklist Items

| Category | Checklist Item | Indonesia-MTEB Response |
|---|---|---|
| A. Ethics | A1. Human subjects | Existing datasets only; no new human data collection |
| | A2. Personal data | PII removed; PDP Law compliance documented |
| | A3. Consent | Public domain or explicitly licensed datasets |
| B. Attribution | B1. Artifact creators cited | Yes (MTEB, IndoNLU, NusaCrowd, etc.) |
| | B2. License terms discussed | Yes, see Appendix B |
| C. Impact | C1. Societal impact | Discussed in Broader Impact statement |
| | C2. Risks mitigated | Data filtering, bias documentation |
| D. Reproducibility | D1. Code availability | GitHub repository |
| | D2. Data availability | HuggingFace datasets |
| | D3. Experimental details | Full methodology in paper |

6.2 Ethics Statement Template

Ethics Statement

This work aggregates and translates existing publicly available datasets.
No new human subjects research was conducted. All source datasets were
collected with appropriate consent or are public domain. Personal
identifying information has been removed where present. The work complies
with Indonesia's Personal Data Protection Law (Law No. 27/2022).

Potential risks include amplification of biases present in source data.
We document known biases and limitations in Section X.

6.3 Broader Impact Statement Template

Broader Impact

Positive Impact:
- Enables better Indonesian NLP applications (search, recommendation)
- Supports Indonesian language preservation in AI
- Provides evaluation resources for Indonesian model development

Potential Negative Impact:
- May encode biases from source datasets
- Translation may introduce cultural artifacts
- Could be used for surveillance or content moderation

Mitigation:
- Bias documentation for each dataset
- Open licensing enables community audit
- Responsible use guidelines provided

7. Dataset Documentation Standards

7.1 Datasheets for Datasets (Gebru et al., 2018)

Recommended sections for Indonesia-MTEB datasheets:

  1. Motivation: Why was the dataset created?
  2. Composition: What are the instances? What fields?
  3. Collection Process: How was data collected?
  4. Preprocessing: What cleaning/filtering was applied?
  5. Uses: What is the intended use? Unintended uses?
  6. Distribution: How is the dataset distributed?
  7. Maintenance: Will the dataset be updated?

7.2 GEM Data Card Standards

For natural language generation tasks (Summarization, Instruction Following):

```yaml
data_card:
  motivation:
    rationale: "Fill in rationale"
    primary_use: "Describe primary use"
    other_uses: ["List", "other", "uses"]
  composition:
    data_format: "Format description"
    data_fields: [
      {name: "field1", description: "...", type: "..."}
    ]
  collection:
    collection_process: "Description"
    source_datasets: ["List", "sources"]
  preprocessing:
    cleaning_steps: ["Step1", "Step2"]
    filtering_criteria: "Criteria description"
  annotation:
    annotation_process: "Description"
    annotator_demographics: "If known"
  uses:
    intended_uses: ["Use1", "Use2"]
    out_of_scope_uses: ["Misuse1", "Misuse2"]
  distribution:
    license: "CC-BY-4.0"
  maintenance:
    update_frequency: "As needed"
    contact: "maintainer@email.com"
```

7.3 Example Dataset Card (Classification Task)

````markdown
---
language:
  - id
license: cc-by-4.0
task_categories:
  - text-classification
pretty_name: Indonesian Sentiment Classification
---

# Indonesian Sentiment Classification

## Dataset Description
This dataset contains Indonesian text samples labeled for sentiment analysis.
Translated from [Original Dataset] using [Translation Model].

## Languages
- Indonesian (id-ID)

## Dataset Structure

### Data Instances
{ "text": "Produk ini sangat bagus dan pengiriman cepat.", "label": "positive" }
### Data Fields
- `text`: Indonesian text string
- `label`: Sentiment label (positive/negative/neutral)

### Data Splits
- Test: 3,424 samples

## Dataset Creation

### Source Data
Original: [Source dataset name and citation]
License: [Original license]

### Translation
- Model: TranslateGemma-12B
- Validation: 3-stage pipeline (language detection, semantic similarity, LLM-as-judge)
- Kept Ratio: 72.5%

### Quality Assurance
- Semantic similarity threshold: ≥0.75
- LLM-as-judge calibration: 88.4% F1 target

## Considerations

### Known Biases
- E-commerce domain bias
- Formal Indonesian bias (less colloquial)

### Limitations
- Machine translation may lose cultural nuances
- Limited to specific domains

## Citation
```bibtex
@dataset{indo_sentiment_2025,
  author = {Indonesia-MTEB Team},
  title = {Indonesian Sentiment Classification},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/indonesia-mteb/sentiment}
}
```
````

8. Citation and Bibliography Standards

8.1 Required Citations

Must Cite:

```bibtex
@inproceedings{muennighoff2023mteb,
  title={MTEB: Massive Text Embedding Benchmark},
  author={Muennighoff, Niklas and Tazi, Nouamane and Magne, Loic and Reimers, Nils},
  booktitle={EACL},
  year={2023}
}

@inproceedings{wilie2020indonlu,
  title={IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding},
  author={Wilie, Bryan and Vincentio, Karissa and Winata, Genta Indra and Cahyawijaya, Samuel},
  booktitle={AACL},
  year={2020}
}

@inproceedings{cahyawijaya2023nusacrowd,
  title={NusaCrowd: A Collaborative Project to Collect Indonesian and Regional Languages Datasets},
  author={Cahyawijaya, Samuel and others},
  booktitle={Findings of ACL},
  year={2023}
}
```

8.2 Translation Model Citations

```bibtex

@article{translategemma2024,
  title={TranslateGemma: High-Quality Machine Translation with Gemma},
  author={Gemma Team},
  year={2024}
}

@article{sea_lion_v4,
  title={SEA-LION-v4: Southeast Asian Language Model},
  author={AI Singapore},
  year={2024}
}
```

8.3 Regional MTEB Citations

```bibtex

@article{pham2025vnmteb,
  title={VN-MTEB: Vietnamese Massive Text Embedding Benchmark},
  author={Pham, Loc and Luu, Tung and Vo, Thu and others},
  journal={arXiv preprint arXiv:2507.21500},
  year={2025}
}

@inproceedings{baysan2025trmteb,
  title={TR-MTEB: A Comprehensive Benchmark for Turkish Text Embeddings},
  author={Baysan, Muhammed Selim and others},
  booktitle={Findings of EMNLP},
  year={2025}
}

@article{ponwitayarat2025seabed,
  title={SEA-BED: Southeast Asia Embedding Benchmark},
  author={Ponwitayarat, Wuttikorn and others},
  journal={arXiv preprint arXiv:2508.12243},
  year={2025}
}
```

9. Submission Timeline and Milestones

9.1 Pre-Submission Checklist

  • All datasets uploaded to HuggingFace with complete README.md
  • YAML metadata validated
  • License tracking table completed
  • Responsible NLP checklist filled out
  • Code repository prepared and documented
  • Benchmark results replicated with baseline models
  • Paper formatted with ACL template
  • Supplementary materials (appendices, license table) prepared

9.2 MTEB Integration Timeline

| Phase | Duration | Deliverable |
|---|---|---|
| Preparation | 2-4 weeks | Datasets uploaded, README.md complete |
| PR submission | 1 week | MTEB PR submitted |
| Review | 2-4 weeks | MTEB maintainer review |
| Revision | 1-2 weeks | Feedback addressed |
| Integration | 1 week | Merged to main branch |

9.3 ARR Submission Timeline

| Phase | Duration | Deadline |
|---|---|---|
| ARR submission | - | February 15, 2025 (example) |
| Reviews available | ~2 months | April 15, 2025 |
| Rebuttal period | 1 week | After reviews |
| Final decision | ~1 month | May 2025 |
| Conference selection | - | ACL/EMNLP/NAACL |

10. Common Pitfalls to Avoid

10.1 Licensing Mistakes

| Mistake | Consequence | Solution |
|---|---|---|
| License not specified | Desk rejection; unusable dataset | Always include a license in the YAML |
| Wrong license type | Legal issues downstream | Track source licenses carefully |
| Ignoring ShareAlike | License violation | CC-BY-SA sources → CC-BY-SA derivatives |
| Missing attribution | Plagiarism concerns | Cite all sources explicitly |

10.2 Documentation Mistakes

| Mistake | Consequence | Solution |
|---|---|---|
| Incomplete README | Dataset not discoverable | Use the HuggingFace template |
| No contact info | Issues cannot be reported | Include a maintainer email |
| Unclear data format | Integration problems | Provide examples and a schema |
| Missing split info | Wrong evaluation | Document train/dev/test sizes |

10.3 Submission Mistakes

| Mistake | Consequence | Solution |
|---|---|---|
| Wrong paper size | Formatting rejection | Use A4 only (21 cm × 29.7 cm) |
| Incomplete checklist | Desk rejection (since Dec 2024) | Fill in all items carefully |
| Missing supplementary materials | Reviewer concerns | Upload code/data even though optional |
| Anonymity violation | Historical issue only | No anonymity period since Feb 2024 |

11. Implementation Checklist

11.1 Dataset Preparation

```
# For each dataset:
- [ ] Verify the source license
- [ ] Apply the translation pipeline
- [ ] Run the 3-stage validation (sketched below)
- [ ] Format to the MTEB structure
- [ ] Create README.md with YAML metadata
- [ ] Upload to HuggingFace
- [ ] Smoke-test the task with the mteb evaluation harness
```
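A sketch of the first two validation stages, language detection and semantic-similarity filtering (assumes the langdetect and sentence-transformers packages; the 0.75 threshold mirrors the quality-assurance target used in the example card above, and the LLM-as-judge stage is left as a stub):

```python
from langdetect import detect  # pip install langdetect
from sentence_transformers import SentenceTransformer, util

# Multilingual encoder used only to score source/translation similarity.
encoder = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

SIM_THRESHOLD = 0.75  # semantic-similarity cutoff from the QA section

def validate_pair(source_en: str, translation_id: str) -> bool:
    """Stages 1-2 of the validation pipeline; stage 3 (LLM-as-judge) is stubbed."""
    # Stage 1: the output must actually be Indonesian.
    if detect(translation_id) != "id":
        return False
    # Stage 2: source and translation embeddings must be close.
    emb = encoder.encode([source_en, translation_id], convert_to_tensor=True)
    if util.cos_sim(emb[0], emb[1]).item() < SIM_THRESHOLD:
        return False
    # Stage 3: LLM-as-judge would go here (calibrated against human labels).
    return True

print(validate_pair("This product is great and shipping was fast.",
                    "Produk ini sangat bagus dan pengirimannya cepat."))
```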

11.2 Paper Preparation

```
# Paper components:
- [ ] Abstract (200-250 words)
- [ ] Introduction (motivation, contributions)
- [ ] Related Work (MTEB, regional MTEBs, Indonesian NLP)
- [ ] Methodology (3-pronged approach)
- [ ] Benchmark overview (tables, figures)
- [ ] Experiments (implementation, results)
- [ ] Analysis and insights
- [ ] Conclusion and limitations
- [ ] Ethics statement
- [ ] Broader impact
- [ ] Acknowledgments
- [ ] References (BibTeX)
- [ ] Appendix (license table, examples)
```

11.3 Submission Package

```
# Final submission:
- [ ] Main paper PDF (A4 format)
- [ ] Supplementary materials (.tgz, max 200MB)
  - [ ] Evaluation code
  - [ ] Dataset statistics
  - [ ] Additional examples
  - [ ] License compatibility table
- [ ] Responsible NLP checklist (submitted via ARR portal)
- [ ] Author information (if no anonymity)
- [ ] ORCID IDs (recommended)
```

12. References and Resources

12.1 Official Resources

| Resource | URL |
|---|---|
| ACL Rolling Review | http://aclrollingreview.org |
| ARR Author Guidelines | http://aclrollingreview.org/authors |
| Responsible NLP Checklist | http://aclrollingreview.org/responsibleNLPresearch |
| MTEB GitHub | https://github.com/embeddings-benchmark/mteb |
| MTEB Leaderboard | https://huggingface.co/spaces/mteb/leaderboard |
| HuggingFace Datasets Docs | https://huggingface.co/docs/datasets |

12.2 Dataset Documentation Standards

| Standard | Citation |
|---|---|
| Datasheets for Datasets | Gebru et al., 2018, arXiv:1803.09010 |
| Model Cards | Mitchell et al., 2019 |
| GEM Data Cards | Gehrmann et al., 2021 |
| Data Cards for Dataset Documentation | Pushkarna et al., 2022 |

12.3 Licensing and Legal Resources

| Topic | Resource |
|---|---|
| CC for AI Training | https://creativecommons.org/using-cc-licensed-works-for-ai-training |
| Dataset Licensing Audit | Longpre et al., 2024, Nature Machine Intelligence |
| Indonesia PDP Law | Law No. 27 of 2022 on Personal Data Protection |
| Derivative Works | US Copyright Office, Circular 14 |

Summary

Publishing Indonesia-MTEB as an ACL dataset paper requires:

  1. ARR Compliance: Paper format, Responsible NLP checklist, proper anonymization
  2. Dataset Documentation: Complete HuggingFace cards with YAML metadata
  3. Licensing Clarity: Track all source licenses, specify downstream terms
  4. MTEB Integration: Follow contribution guidelines, ensure compatibility
  5. Ethics and Impact: Ethics statement, broader impact discussion, bias documentation
  6. Attribution: Cite all source datasets, models, and related work

By following these standards, Indonesia-MTEB can achieve:

  • Acceptance at top-tier NLP venues
  • Integration with the MTEB leaderboard
  • Community adoption and reproducibility
  • Legal clarity for downstream users