
Document 08: ACL Dataset Paper Standards

Overview

This document outlines the standards, requirements, and best practices for publishing Indonesia-MTEB as a dataset paper in ACL venues (ACL, EMNLP, NAACL, COLING) via the ACL Rolling Review (ARR) system. It covers submission formats, dataset documentation standards, licensing requirements, responsible NLP considerations, and integration with the MTEB ecosystem.


1. ACL Rolling Review (ARR) Submission Requirements

1.1 Paper Format Specifications

| Requirement | Specification |
|---|---|
| Paper size | A4 (21 cm × 29.7 cm), strictly enforced |
| Templates | ACL 2025 style files (LaTeX/Word) |
| Page limits | Long papers: 8 pages + unlimited references; short papers: 4 pages + unlimited references |
| Anonymity | No anonymity period required (since February 2024) |
| File format | PDF only |
| Supplementary materials | Optional .tgz or .zip archive (max 200 MB) |

1.2 ARR Submission Checklist

  • Paper formatted with ACL template
  • Responsible NLP Research checklist completed (required since Dec 2024)
  • All cited artifacts properly attributed
  • License/terms of use discussed for all datasets
  • Supplementary materials prepared (code, data, appendices)
  • At least one author registered as reviewer for the same cycle (required since April 2024)

1.3 Responsible NLP Checklist (Required)

The ARR Responsible NLP Research checklist addresses:

  • Research Ethics: human subjects, data collection ethics
  • Societal Impact: potential harms, mitigation strategies
  • Reproducibility: code availability, data access, experimental details
  • Attribution: proper citation of all artifacts used

Critical Note: Since December 2024, ARR enforces desk rejections for incorrect, incomplete, or misleading Responsible NLP checklist filings.


2. Dataset Paper Structure

Based on analysis of the VN-MTEB, TR-MTEB, C-MTEB, and SEA-BED papers, a regional MTEB dataset paper should follow the structure below.

2.1 Recommended Structure

1. Abstract
2. Introduction
   - Motivation (language coverage gap)
   - Problem statement
   - Contributions
3. Related Work
   - MTEB and existing benchmarks
   - Regional MTEBs (C-MTEB, VN-MTEB, TR-MTEB, etc.)
   - Indonesian NLP resources (IndoNLU, NusaCrowd, Indo4B)
4. Methodology
   - Dataset collection strategy
   - Translation pipeline (if applicable)
   - Quality validation procedures
5. Indonesia-MTEB Benchmark
   - Task categories covered
   - Dataset overview table
   - Statistics and kept ratios
6. Experiments
   - Implementation details (models, hyperparameters)
   - Benchmark results
   - Analysis and insights
7. Conclusion
8. Limitations
9. Ethics Statement
A. Broader Impact Statement
B. Dataset Licenses
C. Supplementary Materials

2.2 Required Tables and Figures

| Element | Description |
|---|---|
| Dataset overview table | Task, # datasets, samples (before/after), kept ratio |
| Benchmark results table | Model performance across tasks |
| Translation pipeline figure | Visual representation of the methodology |
| Kept-ratio boxplot | Quality distribution by task type |
| Model performance vs. size | Correlation analysis |

2.3 Abstract Template

We introduce [Indonesia-MTEB], a comprehensive text embedding benchmark
for Indonesian covering [X] datasets across [Y] tasks. Despite Indonesian
being spoken by [270M+] people, existing embedding benchmarks lack
comprehensive Indonesian coverage. Our benchmark addresses this gap through
[3-pronged strategy: aggregation, translation, AI generation]. We evaluate
[Z] embedding models, revealing insights about [key findings]. Datasets are
available at: [HuggingFace URL]

3. HuggingFace Dataset Card Standards

3.1 YAML Metadata Format

Each dataset must include a README.md with YAML metadata:

```yaml
---
language:
  - id
  - en
  - jv
  - su
  - ms
license: cc-by-4.0
task_categories:
  - text-classification
  - clustering
  - pair-classification
  - reranking
  - retrieval
  - sentence-similarity
  - summarization
  - instruction-following
task_ids:
  - BitextMining
  - Classification
  - Clustering
  - PairClassification
  - Reranking
  - Retrieval
  - STS
  - Summarization
multilinguality:
  - translation
  - multilingual
size_categories:
  - 10K<n<100K
  - 100K<n<1M
  - 1M<n<10M
source_datasets:
  - original
  - extended
pretty_name: Indonesia MTEB
dataset_info:
  config_names:
    - default
  features:
    - name: text
      dtype: string
    - name: label
      dtype: string
  splits:
    - name: test
      num_bytes: X
      num_examples: Y
  download_size: X
  dataset_size: Y
---
```
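Before uploading, the front matter can be checked mechanically. A minimal validation sketch (assumes PyYAML is installed; the required-key set is our own convention, not a Hub rule):

```python
import re
import sys

import yaml  # pip install pyyaml

# Keys we require for every Indonesia-MTEB dataset card (project convention).
REQUIRED_KEYS = {"language", "license", "task_categories", "pretty_name"}

def validate_readme(path: str) -> list[str]:
    """Return a list of problems found in a README.md's YAML front matter."""
    text = open(path, encoding="utf-8").read()
    match = re.match(r"^---\n(.*?)\n---\n", text, re.DOTALL)
    if not match:
        return ["no YAML front matter found"]
    meta = yaml.safe_load(match.group(1)) or {}
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - meta.keys())]
    if "language" in meta and not all(isinstance(l, str) for l in meta["language"]):
        problems.append("language entries must be ISO 639-1 strings")
    return problems

if __name__ == "__main__":
    for problem in validate_readme(sys.argv[1]):
        print("WARN:", problem)
```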

3.2 Dataset Card Sections

```markdown
# Dataset Name

## Dataset Description
- Brief overview (2-3 sentences)
- Homepage/Project page URL
- Repository URL
- Paper URL

## Languages
- Language codes (ISO 639-1)
- Dialect/variety information

## Dataset Structure
### Data Instances
Example format for each task type

### Data Fields
- Field descriptions and types
- Label descriptions

### Data Splits
Train/validation/test split sizes

## Dataset Creation
### Curation Rationale
Why this dataset was created

### Source Data
Original data sources and licensing

### Annotations
Annotation process, annotator information

### Personal and Sensitive Information
Any PII considerations

## Considerations for Using the Data
### Social Impact of Dataset
Potential societal implications

### Discussion of Biases
Known biases in the data

### Other Known Limitations
Technical or quality limitations

## Additional Information
### Dataset Curators
Who created/maintained the dataset

### Licensing Information
License type and restrictions

### Citation Information
BibTeX citation

### Contributions
How to contribute or report issues
```
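Cards can also be generated programmatically. A sketch using huggingface_hub's DatasetCard helpers (the repo id is hypothetical, and the template kwargs depend on the default card template shipped with the library):

```python
from huggingface_hub import DatasetCard, DatasetCardData  # pip install huggingface_hub

# Structured metadata; this becomes the YAML front matter of the card.
card_data = DatasetCardData(
    language=["id"],
    license="cc-by-4.0",
    task_categories=["text-classification"],
    pretty_name="Indonesian Sentiment Classification",
)

# Render the card body from the library's default template, then publish it.
card = DatasetCard.from_template(
    card_data,
    pretty_name=card_data.pretty_name,
)
card.push_to_hub("indonesia-mteb/sentiment")  # hypothetical repo id
```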

3.3 License Metadata Standards

For Indonesia-MTEB datasets:

| License Type | When to Use | Notes |
|---|---|---|
| CC-BY-4.0 | Default for new translations | Recommended by OpenAIRE for academic data |
| CC-BY-SA-4.0 | Derived from CC-BY-SA sources | ShareAlike requirement propagates |
| CC0 | Public domain dedications | For fully permissive use |
| ODC-BY-1.0 | Open Data Commons attribution | Alternative to CC licenses |
| Custom | Original dataset licenses | Terms must be specified clearly |

4. Licensing and Attribution Requirements

4.1 Indonesian Dataset Licenses

| Dataset | License | Citation Requirement |
|---|---|---|
| Indo4B | Custom (check IndoNLP) | Wilie et al. (2020) |
| IndoNLU | MIT License | Wilie et al. (2020) |
| NusaCrowd | Varies by dataset | Cahyawijaya et al. (2023) |
| SEACrowd | Varies by dataset | SEACrowd Consortium |

4.2 Translation Derivative Works

Key Legal Considerations:

  1. Machine Translation Creates Derivative Works: Under copyright law, translations are derivative works of the original
  2. CC License Compatibility: When translating CC-BY-SA content, translations must also be CC-BY-SA (ShareAlike requirement)
  3. Source License Propagation: Track all source licenses as they may impose restrictions on derivative works

Best Practices:

  • Maintain a license tracking table for all source datasets (a minimal sketch follows below)
  • Clearly document any license incompatibilities
  • Prefer CC-BY-4.0 (no ShareAlike) where possible to avoid downstream restrictions
  • Provide attribution in README.md and in the paper's acknowledgments
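To make the ShareAlike propagation rule mechanical, the tracking table can be kept in code. A minimal sketch (the dataset names and license picks are illustrative, not a legal determination):

```python
from dataclasses import dataclass

# ShareAlike licenses whose terms propagate to derivative works.
SHARE_ALIKE = {"cc-by-sa-4.0", "cc-by-sa-3.0"}

@dataclass
class SourceDataset:
    name: str
    license: str  # SPDX-style identifier, lowercased

def derivative_license(sources: list[SourceDataset], preferred: str = "cc-by-4.0") -> str:
    """Pick a license for a translated derivative: ShareAlike sources force ShareAlike."""
    sa_licenses = {s.license for s in sources if s.license in SHARE_ALIKE}
    if len(sa_licenses) > 1:
        raise ValueError(f"incompatible ShareAlike licenses: {sa_licenses}")
    return sa_licenses.pop() if sa_licenses else preferred

# Illustrative tracking table for one derived dataset.
sources = [SourceDataset("source-sentiment", "mit"),
           SourceDataset("source-nli", "cc-by-sa-4.0")]
print(derivative_license(sources))  # -> cc-by-sa-4.0
```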

4.3 Attribution Format

```bibtex
@dataset{indonesia_mteb_2025,
  title        = {Indonesia-MTEB: Indonesian Massive Text Embedding Benchmark},
  author       = {Author Names},
  year         = {2025},
  publisher    = {Hugging Face},
  version      = {1.0.0},
  url          = {https://huggingface.co/datasets/indonesia-mteb/...},
  license      = {CC-BY-4.0}
}
```

Required Attribution for Sources:

  • Original MTEB datasets (Muennighoff et al., 2023)
  • Indo4B/IndoNLU (Wilie et al., 2020)
  • NusaCrowd (Cahyawijaya et al., 2023)
  • Translation models used (e.g., TranslateGemma-12B)

4.4 Indonesia Personal Data Protection Law (PDP Law)

Law No. 27/2022 Compliance:

  • Came into force October 17, 2022
  • Full compliance required since October 17, 2024 (end of the two-year transition period)
  • Data breach notification required within 72 hours
  • Explicit consent required for personal data processing

Implications for Indonesia-MTEB:

  • Review all datasets for potential PII (a first-pass scan sketch follows below)
  • Anonymize or exclude personal data
  • Document privacy measures in the paper
  • Consider IRB review if any human-subject data is involved
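A first-pass PII scan can be automated ahead of manual review. A minimal sketch using regular expressions (the patterns are illustrative, not exhaustive; real review needs locale-aware rules and a human pass):

```python
import re

# Illustrative patterns only; high false-positive/negative rates are expected.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone_id": re.compile(r"\b(?:\+62|0)8\d{8,11}\b"),  # common Indonesian mobile formats
    "nik": re.compile(r"\b\d{16}\b"),  # 16-digit national ID; very noisy, flag for review
}

def scan_text(text: str) -> dict[str, list[str]]:
    """Return suspected PII matches per category for one text sample."""
    hits = {kind: pattern.findall(text) for kind, pattern in PII_PATTERNS.items()}
    return {kind: found for kind, found in hits.items() if found}

sample = "Hubungi saya di budi@example.com atau 081234567890."
print(scan_text(sample))
# {'email': ['budi@example.com'], 'phone_id': ['081234567890']}
```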


5. MTEB Integration Requirements

5.1 Dataset Submission to MTEB

Contribution Points System (as of 2026):

  • First dataset for a language × task combination: 4 bonus points
  • Additional datasets: 1 point per language × task combination
  • New languages (≥12): extra recognition

PR Requirements:

  1. Follow the MTEB contribution guidelines
  2. Include the dataset loader in the mteb/ directory
  3. Add metadata for the leaderboard
  4. Ensure compatibility with the MTEB evaluation scripts

5.2 Dataset Format Requirements

```python
# Example: MTEB-compatible dataset class (field values are illustrative;
# check the current TaskMetadata schema in the mteb repository before submitting)
from mteb.abstasks import AbsTaskClassification
from mteb.abstasks.TaskMetadata import TaskMetadata

class IndonesianDataset(AbsTaskClassification):
    metadata = TaskMetadata(
        name="DatasetName",
        description="Description of dataset",
        reference="https://arxiv.org/abs/xxxx.xxxxx",
        dataset={
            "path": "indonesia-mteb/dataset-name",
            "revision": "revision-hash",
        },
        type="Classification",  # or Retrieval, STS, etc.
        category="s2s",  # sentence-to-sentence; alternatives: s2p, p2p
        eval_splits=["test"],
        eval_langs=["ind-Latn"],  # mteb expects ISO 639-3 plus script, not "id-ID"
        main_score="accuracy",
        date=("2024-01-01", "2024-12-31"),
        form=["written"],
        domains=["Social", "Reviews"],
        task_subtypes=["Sentiment classification"],
        license="cc-by-4.0",
        annotations_creators="human-annotated",
        dialect=[],
        sample_creation="created",
        bibtex_citation="""@article{...}""",
    )
```
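Once the loader is in place, the task can be smoke-tested locally. A minimal sketch, assuming a recent mteb release and the sentence-transformers package (the model choice and output folder are arbitrary):

```python
import mteb
from sentence_transformers import SentenceTransformer

# Any embedding model works here; this small multilingual baseline is arbitrary.
model = SentenceTransformer("intfloat/multilingual-e5-small")

# Look the task up by the name declared in TaskMetadata above.
tasks = mteb.get_tasks(tasks=["DatasetName"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/indonesia-mteb")
print(results)
```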

5.3 Leaderboard Integration

Steps to Add Indonesia-MTEB to Leaderboard:

  1. Create HuggingFace organization: indonesia-mteb
  2. Upload all datasets with proper YAML metadata
  3. Submit PR to embeddings-benchmark/mteb GitHub
  4. Update leaderboard configuration
  5. Run evaluation on reference models

Benchmark Configuration:

```yaml
# mteb/leaderboard/benchmark_configs/indonesia_mteb.yaml
name: Indonesia-MTEB
description: Indonesian Massive Text Embedding Benchmark
version: 1.0.0
languages: [id]
tasks:
  - Classification
  - Clustering
  - PairClassification
  - Reranking
  - Retrieval
  - STS
datasets:
  - dataset1
  - dataset2
  # ...
```


6. Responsible NLP Checklist Responses

6.1 Required Checklist Items

| Category | Checklist Item | Indonesia-MTEB Response |
|---|---|---|
| A. Ethics | A1. Human subjects | Existing datasets only; no new human data collection |
| | A2. Personal data | PII removed; PDP Law compliance documented |
| | A3. Consent | Public domain or explicitly licensed datasets |
| B. Attribution | B1. Artifact creators cited | Yes (MTEB, IndoNLU, NusaCrowd, etc.) |
| | B2. License terms discussed | Yes, see Appendix B |
| C. Impact | C1. Societal impact | Discussed in Broader Impact statement |
| | C2. Risks mitigated | Data filtering, bias documentation |
| D. Reproducibility | D1. Code availability | GitHub repository |
| | D2. Data availability | HuggingFace datasets |
| | D3. Experimental details | Full methodology in paper |

6.2 Ethics Statement Template

Ethics Statement

This work aggregates and translates existing publicly available datasets.
No new human subjects research was conducted. All source datasets were
collected with appropriate consent or are public domain. Personal
identifying information has been removed where present. The work complies
with Indonesia's Personal Data Protection Law (Law No. 27/2022).

Potential risks include amplification of biases present in source data.
We document known biases and limitations in Section X.

6.3 Broader Impact Statement Template

Broader Impact

Positive Impact:
- Enables better Indonesian NLP applications (search, recommendation)
- Supports Indonesian language preservation in AI
- Provides evaluation resources for Indonesian model development

Potential Negative Impact:
- May encode biases from source datasets
- Translation may introduce cultural artifacts
- Could be used for surveillance or content moderation

Mitigation:
- Bias documentation for each dataset
- Open licensing enables community audit
- Responsible use guidelines provided

7. Dataset Documentation Standards

7.1 Datasheets for Datasets (Gebru et al., 2018)

Recommended sections for Indonesia-MTEB datasheets:

  1. Motivation: Why was the dataset created?
  2. Composition: What are the instances? What fields?
  3. Collection Process: How was data collected?
  4. Preprocessing: What cleaning/filtering was applied?
  5. Uses: What is the intended use? Unintended uses?
  6. Distribution: How is the dataset distributed?
  7. Maintenance: Will the dataset be updated?

7.2 GEM Data Card Standards

For natural language generation tasks (Summarization, Instruction Following):

```yaml
data_card:
  motivation:
    rationale: "Fill in rationale"
    primary_use: "Describe primary use"
    other_uses: ["List", "other", "uses"]
  composition:
    data_format: "Format description"
    data_fields: [
      {name: "field1", description: "...", type: "..."}
    ]
  collection:
    collection_process: "Description"
    source_datasets: ["List", "sources"]
  preprocessing:
    cleaning_steps: ["Step1", "Step2"]
    filtering_criteria: "Criteria description"
  annotation:
    annotation_process: "Description"
    annotator_demographics: "If known"
  uses:
    intended_uses: ["Use1", "Use2"]
    out_of_scope_uses: ["Misuse1", "Misuse2"]
  distribution:
    license: "CC-BY-4.0"
  maintenance:
    update_frequency: "As needed"
    contact: "maintainer@email.com"
```

7.3 Example Dataset Card (Classification Task)

````markdown
---
language:
  - id
license: cc-by-4.0
task_categories:
  - text-classification
pretty_name: Indonesian Sentiment Classification
---

# Indonesian Sentiment Classification

## Dataset Description
This dataset contains Indonesian text samples labeled for sentiment analysis.
Translated from [Original Dataset] using [Translation Model].

## Languages
- Indonesian (id-ID)

## Dataset Structure

### Data Instances
{ "text": "Produk ini sangat bagus dan pengiriman cepat.", "label": "positive" }
### Data Fields
- `text`: Indonesian text string
- `label`: Sentiment label (positive/negative/neutral)

### Data Splits
- Test: 3,424 samples

## Dataset Creation

### Source Data
Original: [Source dataset name and citation]
License: [Original license]

### Translation
- Model: TranslateGemma-12B
- Validation: 3-stage pipeline (language detection, semantic similarity, LLM-as-judge)
- Kept Ratio: 72.5%

### Quality Assurance
- Semantic similarity threshold: ≥0.75
- LLM-as-judge calibration: 88.4% F1 target

## Considerations

### Known Biases
- E-commerce domain bias
- Formal Indonesian bias (less colloquial)

### Limitations
- Machine translation may lose cultural nuances
- Limited to specific domains

## Citation
```bibtex
@dataset{indo_sentiment_2025,
  author = {Indonesia-MTEB Team},
  title = {Indonesian Sentiment Classification},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/indonesia-mteb/sentiment}
}
```
````

8. Citation and Bibliography Standards

8.1 Required Citations

Must Cite:

```bibtex
@inproceedings{muennighoff2023mteb,
  title={MTEB: Massive Text Embedding Benchmark},
  author={Muennighoff, Niklas and Tazi, Nouamane and Magne, Loic and Reimers, Nils},
  booktitle={EACL},
  year={2023}
}

@inproceedings{wilie2020indonlu,
  title={IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding},
  author={Wilie, Bryan and Vincentio, Karissa and Winata, Genta Indra and Cahyawijaya, Samuel},
  booktitle={AACL},
  year={2020}
}

@inproceedings{cahyawijaya2023nusacrowd,
  title={NusaCrowd: A Collaborative Project to Collect Indonesian and Regional Languages Datasets},
  author={Cahyawijaya, Samuel and others},
  booktitle={Findings of ACL},
  year={2023}
}
```

8.2 Translation Model Citations

```bibtex

@article{translategemma2024,
  title={TranslateGemma: High-Quality Machine Translation with Gemma},
  author={Gemma Team},
  year={2024}
}

@article{sea_lion_v4,
  title={SEA-LION-v4: Southeast Asian Language Model},
  author={AI Singapore},
  year={2024}
}
```

8.3 Regional MTEB Citations

```bibtex

@article{pham2025vnmteb,
  title={VN-MTEB: Vietnamese Massive Text Embedding Benchmark},
  author={Pham, Loc and Luu, Tung and Vo, Thu and others},
  journal={arXiv preprint arXiv:2507.21500},
  year={2025}
}

@inproceedings{baysan2025trmteb,
  title={TR-MTEB: A Comprehensive Benchmark for Turkish Text Embeddings},
  author={Baysan, Muhammed Selim and others},
  booktitle={Findings of EMNLP},
  year={2025}
}

@article{ponwitayarat2025seabed,
  title={SEA-BED: Southeast Asia Embedding Benchmark},
  author={Ponwitayarat, Wuttikorn and others},
  journal={arXiv preprint arXiv:2508.12243},
  year={2025}
}
```

9. Submission Timeline and Milestones

9.1 Pre-Submission Checklist

  • All datasets uploaded to HuggingFace with complete README.md
  • YAML metadata validated
  • License tracking table completed
  • Responsible NLP checklist filled out
  • Code repository prepared and documented
  • Benchmark results replicated with baseline models
  • Paper formatted with ACL template
  • Supplementary materials (appendices, license table) prepared

9.2 MTEB Integration Timeline

| Phase | Duration | Deliverable |
|---|---|---|
| Preparation | 2-4 weeks | Datasets uploaded, README.md complete |
| PR submission | 1 week | MTEB PR submitted |
| Review | 2-4 weeks | MTEB maintainer review |
| Revision | 1-2 weeks | Feedback addressed |
| Integration | 1 week | Merged to main branch |

9.3 ARR Submission Timeline

| Phase | Duration | Deadline |
|---|---|---|
| ARR submission | - | February 15, 2025 (example) |
| Reviews available | ~2 months | April 15, 2025 |
| Rebuttal period | 1 week | After reviews |
| Final decision | ~1 month | May 2025 |
| Conference selection | - | ACL/EMNLP/NAACL |

10. Common Pitfalls to Avoid

10.1 Licensing Mistakes

| Mistake | Consequence | Solution |
|---|---|---|
| License not specified | Desk rejection; unusable dataset | Always include a license in the YAML |
| Wrong license type | Legal issues downstream | Track source licenses carefully |
| Ignoring ShareAlike | License violation | CC-BY-SA sources → CC-BY-SA derivatives |
| Missing attribution | Plagiarism concerns | Cite all sources explicitly |

10.2 Documentation Mistakes

| Mistake | Consequence | Solution |
|---|---|---|
| Incomplete README | Dataset not discoverable | Use the HuggingFace template |
| No contact info | Issues cannot be reported | Include a maintainer email |
| Unclear data format | Integration problems | Provide examples and a schema |
| Missing split info | Wrong evaluation | Document train/dev/test sizes |

10.3 Submission Mistakes

| Mistake | Consequence | Solution |
|---|---|---|
| Wrong paper size | Formatting rejection | Use A4 only (21 cm × 29.7 cm) |
| Incomplete checklist | Desk rejection (since Dec 2024) | Fill in all items carefully |
| Missing supplementary materials | Reviewer concerns | Upload code/data even though optional |
| Anonymity violation | Historical issue only | No anonymity period since Feb 2024 |

11. Implementation Checklist

11.1 Dataset Preparation

```
# For each dataset:
- [ ] Verify the source license
- [ ] Apply the translation pipeline
- [ ] Run the 3-stage validation (sketched below)
- [ ] Format to the MTEB structure
- [ ] Create README.md with YAML metadata
- [ ] Upload to HuggingFace
- [ ] Smoke-test the task with the mteb evaluation harness
```
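A sketch of the first two validation stages, language detection and semantic-similarity filtering (assumes the langdetect and sentence-transformers packages; the 0.75 threshold mirrors the quality-assurance target used in the example card above, and the LLM-as-judge stage is left as a stub):

```python
from langdetect import detect  # pip install langdetect
from sentence_transformers import SentenceTransformer, util

# Multilingual encoder used only to score source/translation similarity.
encoder = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

SIM_THRESHOLD = 0.75  # semantic-similarity cutoff from the QA section

def validate_pair(source_en: str, translation_id: str) -> bool:
    """Stages 1-2 of the validation pipeline; stage 3 (LLM-as-judge) is stubbed."""
    # Stage 1: the output must actually be Indonesian.
    if detect(translation_id) != "id":
        return False
    # Stage 2: source and translation embeddings must be close.
    emb = encoder.encode([source_en, translation_id], convert_to_tensor=True)
    if util.cos_sim(emb[0], emb[1]).item() < SIM_THRESHOLD:
        return False
    # Stage 3: LLM-as-judge would go here (calibrated against human labels).
    return True

print(validate_pair("This product is great and shipping was fast.",
                    "Produk ini sangat bagus dan pengirimannya cepat."))
```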

11.2 Paper Preparation

```
# Paper components:
- [ ] Abstract (200-250 words)
- [ ] Introduction (motivation, contributions)
- [ ] Related Work (MTEB, regional MTEBs, Indonesian NLP)
- [ ] Methodology (3-pronged approach)
- [ ] Benchmark overview (tables, figures)
- [ ] Experiments (implementation, results)
- [ ] Analysis and insights
- [ ] Conclusion and limitations
- [ ] Ethics statement
- [ ] Broader impact
- [ ] Acknowledgments
- [ ] References (BibTeX)
- [ ] Appendix (license table, examples)
```

11.3 Submission Package

```
# Final submission:
- [ ] Main paper PDF (A4 format)
- [ ] Supplementary materials (.tgz, max 200MB)
  - [ ] Evaluation code
  - [ ] Dataset statistics
  - [ ] Additional examples
  - [ ] License compatibility table
- [ ] Responsible NLP checklist (submitted via ARR portal)
- [ ] Author information (if no anonymity)
- [ ] ORCID IDs (recommended)
```

12. References and Resources

12.1 Official Resources

| Resource | URL |
|---|---|
| ACL Rolling Review | http://aclrollingreview.org |
| ARR Author Guidelines | http://aclrollingreview.org/authors |
| Responsible NLP Checklist | http://aclrollingreview.org/responsibleNLPresearch |
| MTEB GitHub | https://github.com/embeddings-benchmark/mteb |
| MTEB Leaderboard | https://huggingface.co/spaces/mteb/leaderboard |
| HuggingFace Datasets Docs | https://huggingface.co/docs/datasets |

12.2 Dataset Documentation Standards

| Standard | Citation |
|---|---|
| Datasheets for Datasets | Gebru et al., 2018, arXiv:1803.09010 |
| Model Cards | Mitchell et al., 2019 |
| GEM Data Cards | Gehrmann et al., 2021 |
| Data Cards for Dataset Documentation | Pushkarna et al., 2022 |

12.3 Licensing and Legal Resources

| Topic | Resource |
|---|---|
| CC for AI Training | https://creativecommons.org/using-cc-licensed-works-for-ai-training |
| Dataset Licensing Audit | Longpre et al., 2024, Nature Machine Intelligence |
| Indonesia PDP Law | Law No. 27 of 2022 on Personal Data Protection |
| Derivative Works | US Copyright Office, Circular 14 |

Summary

Publishing Indonesia-MTEB as an ACL dataset paper requires:

  1. ARR Compliance: Paper format, Responsible NLP checklist, proper anonymization
  2. Dataset Documentation: Complete HuggingFace cards with YAML metadata
  3. Licensing Clarity: Track all source licenses, specify downstream terms
  4. MTEB Integration: Follow contribution guidelines, ensure compatibility
  5. Ethics and Impact: Ethics statement, broader impact discussion, bias documentation
  6. Attribution: Cite all source datasets, models, and related work

By following these standards, Indonesia-MTEB can achieve:

  • Acceptance at top-tier NLP venues
  • Integration with the MTEB leaderboard
  • Community adoption and reproducibility
  • Legal clarity for downstream users