Document 08: ACL Dataset Paper Standards¶
Overview¶
This document outlines the standards, requirements, and best practices for publishing Indonesia-MTEB as a dataset paper in ACL venues (ACL, EMNLP, NAACL, COLING) via the ACL Rolling Review (ARR) system. It covers submission formats, dataset documentation standards, licensing requirements, responsible NLP considerations, and integration with the MTEB ecosystem.
1. ACL Rolling Review (ARR) Submission Requirements¶
1.1 Paper Format Specifications¶
| Requirement | Specification |
|---|---|
| Paper Size | A4 (21 cm × 29.7 cm) - strictly enforced |
| Templates | ACL 2025 style files (LaTeX/Word) |
| Page Limits | Long papers: 8 pages + unlimited references; Short papers: 4 pages + unlimited references |
| Anonymity | Submissions must be anonymized; no pre-submission anonymity period (since Feb 2024) |
| File Format | PDF only |
| Supplementary Materials | Optional .tgz or .zip archive (max 200MB) |
1.2 ARR Submission Checklist¶
- Paper formatted with ACL template
- Responsible NLP Research checklist completed (required since Dec 2024)
- All cited artifacts properly attributed
- License/terms of use discussed for all datasets
- Supplementary materials prepared (code, data, appendices)
- At least one author registered as reviewer for the same cycle (required since April 2024)
1.3 Responsible NLP Checklist (Required)¶
The ARR Responsible NLP Research checklist addresses:
- Research Ethics: human subjects, data collection ethics
- Societal Impact: potential harms, mitigation strategies
- Reproducibility: code availability, data access, experimental details
- Attribution: proper citation of all artifacts used
Critical Note: Since December 2024, ARR enforces desk rejections for incorrect, incomplete, or misleading Responsible NLP checklist filings.
2. Dataset Paper Structure¶
Based on analysis of VN-MTEB, TR-MTEB, C-MTEB, and SEA-BED papers, a regional MTEB dataset paper should follow this structure:
2.1 Recommended Sections¶
1. Abstract
2. Introduction
- Motivation (language coverage gap)
- Problem statement
- Contributions
3. Related Work
- MTEB and existing benchmarks
- Regional MTEBs (C-MTEB, VN-MTEB, TR-MTEB, etc.)
- Indonesian NLP resources (IndoNLU, NusaCrowd, Indo4B)
4. Methodology
- Dataset collection strategy
- Translation pipeline (if applicable)
- Quality validation procedures
5. Indonesia-MTEB Benchmark
- Task categories covered
- Dataset overview table
- Statistics and kept ratios
6. Experiments
- Implementation details (models, hyperparameters)
- Benchmark results
- Analysis and insights
7. Conclusion
8. Limitations
9. Ethics Statement
A. Broader Impact Statement
B. Dataset Licenses
C. Supplementary Materials
2.2 Required Tables and Figures¶
| Element | Description |
|---|---|
| Dataset Overview Table | Task, # datasets, samples (before/after), kept ratio |
| Benchmark Results Table | Model performance across tasks |
| Translation Pipeline Figure | Visual representation of methodology |
| Kept Ratio Boxplot | Quality distribution by task type |
| Model Performance vs. Size | Correlation analysis |
2.3 Abstract Template¶
We introduce [Indonesia-MTEB], a comprehensive text embedding benchmark
for Indonesian covering [X] datasets across [Y] tasks. Despite Indonesian
being spoken by [270M+] people, existing embedding benchmarks lack
comprehensive Indonesian coverage. Our benchmark addresses this gap through
[3-pronged strategy: aggregation, translation, AI generation]. We evaluate
[Z] embedding models, revealing insights about [key findings]. Datasets are
available at: [HuggingFace URL]
3. HuggingFace Dataset Card Standards¶
3.1 YAML Metadata Format¶
Each dataset must include a README.md with YAML metadata:
---
language:
- id
- en
- jv
- su
- ms
license: cc-by-4.0
task_categories:
- text-classification
- clustering
- pair-classification
- reranking
- retrieval
- semantic-similarity
- summarization
- instruction-following
task_ids:
- BitextMining
- Classification
- Clustering
- PairClassification
- Reranking
- Retrieval
- STS
- Summarization
multilinguality:
- translation
- multilingual
size_categories:
- 10K<n<100K
- 100K<n<1M
- 1M<n<10M
source_datasets:
- original
- extended
pretty_name: Indonesia MTEB
dataset_info:
  config_names:
  - default
  features:
  - name: text
    dtype: string
  - name: label
    dtype: string
  splits:
  - name: test
    num_bytes: X
    num_examples: Y
  download_size: X
  dataset_size: Y
---
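Before upload, the front matter above can be sanity-checked with a small script. The following is a minimal sketch (not part of huggingface_hub; `validate_card` and the required-key set are illustrative assumptions based on the template above), using a rough regex scan rather than a full YAML parse:

```python
import re

# Keys the template above treats as mandatory (an assumption for this sketch).
REQUIRED_KEYS = {"language", "license", "task_categories", "pretty_name"}

def extract_front_matter(readme_text: str) -> str:
    # Front matter sits between the first two '---' lines of README.md.
    match = re.match(r"^---\n(.*?)\n---", readme_text, flags=re.DOTALL)
    if match is None:
        raise ValueError("README.md has no YAML front matter block")
    return match.group(1)

def validate_card(readme_text: str) -> list[str]:
    """Return required top-level keys missing from the front matter."""
    body = extract_front_matter(readme_text)
    # Top-level keys are unindented 'key:' lines (rough check, not a YAML parse).
    present = {m.group(1) for m in re.finditer(r"^(\w[\w-]*):", body, flags=re.MULTILINE)}
    return sorted(REQUIRED_KEYS - present)

readme = "---\nlanguage:\n- id\nlicense: cc-by-4.0\npretty_name: Indonesia MTEB\n---\n# Card"
print(validate_card(readme))  # any keys still to add
```

A real pipeline would parse the YAML properly (e.g., with PyYAML) and also validate values, but a key-presence check like this catches the most common desk-rejection-level omissions early.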
3.2 Dataset Card Sections¶
# Dataset Name
## Dataset Description
- Brief overview (2-3 sentences)
- Homepage/Project page URL
- Repository URL
- Paper URL
## Languages
- Language codes (ISO 639-1)
- Dialect/variety information
## Dataset Structure
### Data Instances
Example format for each task type
### Data Fields
- Field descriptions and types
- Label descriptions
### Data Splits
Train/validation/test split sizes
## Dataset Creation
### Curation Rationale
Why this dataset was created
### Source Data
Original data sources and licensing
### Annotations
Annotation process, annotator information
### Personal and Sensitive Information
Any PII considerations
## Considerations for Using the Data
### Social Impact of Dataset
Potential societal implications
### Discussion of Biases
Known biases in the data
### Other Known Limitations
Technical or quality limitations
## Additional Information
### Dataset Curators
Who created/maintained the dataset
### Licensing Information
License type and restrictions
### Citation Information
BibTeX citation
### Contributions
How to contribute or report issues
3.3 License Metadata Standards¶
For Indonesia-MTEB datasets:
| License Type | When to Use | Notes |
|---|---|---|
| CC-BY-4.0 | Default for new translations | Recommended by OpenAIRE for academic data |
| CC-BY-SA-4.0 | Derived from CC-BY-SA sources | Share-alike requirement |
| CC0 | Public domain dedications | For fully permissive use |
| ODC-BY-1.0 | Open data commons | Alternative to CC licenses |
| Custom | Original dataset licenses | Must specify terms clearly |
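The ShareAlike propagation rule in the table above can be encoded mechanically. The helper below is a hypothetical sketch (function name and license tag sets are assumptions; tags follow the lowercase SPDX-style identifiers used in HF metadata), not legal advice:

```python
# Hypothetical helper illustrating ShareAlike propagation for derivative datasets.
SHARE_ALIKE = {"cc-by-sa-4.0", "cc-by-sa-3.0"}
PERMISSIVE = {"cc0-1.0", "cc-by-4.0", "cc-by-3.0", "odc-by-1.0", "mit"}

def derivative_license(source_licenses: set[str], default: str = "cc-by-4.0") -> str:
    """Pick a license for a translated (derivative) dataset.

    If any source is ShareAlike, the derivative must carry that license;
    otherwise fall back to the project default (CC-BY-4.0).
    """
    share_alike = source_licenses & SHARE_ALIKE
    if share_alike:
        if len(share_alike) > 1:
            raise ValueError(f"Incompatible ShareAlike sources: {share_alike}")
        return share_alike.pop()
    unknown = source_licenses - PERMISSIVE
    if unknown:
        raise ValueError(f"Review custom/unknown licenses manually: {unknown}")
    return default

print(derivative_license({"cc-by-4.0", "mit"}))         # cc-by-4.0
print(derivative_license({"cc-by-sa-4.0", "cc0-1.0"}))  # cc-by-sa-4.0
```

Custom licenses deliberately raise an error here: they always need human review, matching the "Must specify terms clearly" note above.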
4. Licensing and Attribution Requirements¶
4.1 Indonesian Dataset Licenses¶
| Dataset | License | Citation Requirement |
|---|---|---|
| Indo4B | Custom (check IndoNLP) | Wilie et al. (2020) |
| IndoNLU | MIT License | Wilie et al. (2020) |
| NusaCrowd | varies by dataset | Cahyawijaya et al. (2023) |
| SEACrowd | varies by dataset | SEACrowd Consortium |
4.2 Translation Derivative Works¶
Key Legal Considerations:
- Machine Translation Creates Derivative Works: Under copyright law, translations are derivative works of the original
- CC License Compatibility: When translating CC-BY-SA content, translations must also be CC-BY-SA (ShareAlike requirement)
- Source License Propagation: Track all source licenses as they may impose restrictions on derivative works
Best Practices:
- Maintain a license tracking table for all source datasets
- Clearly document any license incompatibilities
- Use CC-BY-4.0 (no ShareAlike) when possible to avoid downstream restrictions
- Provide attribution in README.md and paper acknowledgments
4.3 Attribution Format¶
@dataset{indonesia_mteb_2025,
title = {Indonesia-MTEB: Indonesian Massive Text Embedding Benchmark},
author = {Author Names},
year = {2025},
publisher = {Hugging Face},
version = {1.0.0},
url = {https://huggingface.co/datasets/indonesia-mteb/...},
license = {CC-BY-4.0}
}
Required Attribution for Sources:
- Original MTEB datasets (Muennighoff et al., 2023)
- Indo4B/IndoNLU (Wilie et al., 2020)
- NusaCrowd (Cahyawijaya et al., 2023)
- Translation models used (e.g., TranslateGemma-12B)
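Keeping per-dataset BibTeX entries consistent is easier when they are generated from a tracking table rather than hand-written. A minimal sketch (the `dataset_bibtex` helper and its field set are illustrative assumptions mirroring the template above):

```python
def dataset_bibtex(key: str, fields: dict[str, str]) -> str:
    """Render a BibTeX @dataset entry with aligned field names
    (illustrative; field set mirrors the attribution template above)."""
    width = max(len(name) for name in fields)
    lines = [f"  {name:<{width}} = {{{value}}}," for name, value in fields.items()]
    return "@dataset{" + key + ",\n" + "\n".join(lines) + "\n}"

entry = dataset_bibtex("indonesia_mteb_2025", {
    "title": "Indonesia-MTEB: Indonesian Massive Text Embedding Benchmark",
    "author": "Author Names",
    "year": "2025",
    "publisher": "Hugging Face",
    "license": "CC-BY-4.0",
})
print(entry)
```

Generating entries this way keeps the citation block, the license tracking table, and the dataset cards in sync from a single source of truth.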
4.4 Indonesia Personal Data Protection Law (PDP Law)¶
Law No. 27/2022 Compliance:
- Came into force October 17, 2022
- Full compliance required since October 17, 2024
- Data breach reporting within 72 hours
- Explicit consent requirements for personal data processing
Implications for Indonesia-MTEB:
- Review all datasets for potential PII
- Anonymize or exclude personal data
- Document privacy measures in the paper
- Consider IRB review if using human-subject data
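A first-pass PII audit can be automated with pattern matching. The sketch below is illustrative only (the pattern set and `scan_for_pii` helper are assumptions; a real PDP Law review requires human inspection and far more robust detection):

```python
import re

# Rough PII patterns for a first-pass audit. The phone pattern covers common
# Indonesian formats (+62 / 08 prefixes); NIK is the 16-digit national ID.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"(?:\+62|08)\d{8,11}"),
    "nik":   re.compile(r"\b\d{16}\b"),
}

def scan_for_pii(samples: list[str]) -> dict[str, list[int]]:
    """Map each PII category to the indices of samples that may contain it."""
    hits: dict[str, list[int]] = {name: [] for name in PII_PATTERNS}
    for i, text in enumerate(samples):
        for name, pattern in PII_PATTERNS.items():
            if pattern.search(text):
                hits[name].append(i)
    return hits

samples = ["Hubungi kami di info@example.co.id", "Produk bagus sekali", "WA 081234567890"]
print(scan_for_pii(samples))
```

Flagged samples should be routed to manual review before anonymization or exclusion; regex scans produce both false positives and false negatives.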
5. MTEB Integration Requirements¶
5.1 Dataset Submission to MTEB¶
Contribution Points System (as of 2026):
- First dataset for a language × task: 4 bonus points
- Additional datasets: 1 point per language × task combination
- New languages (≥12): extra recognition
PR Requirements:
1. Follow MTEB contribution guidelines
2. Include dataset loader in mteb/ directory
3. Add metadata to leaderboard
4. Ensure compatibility with MTEB evaluation scripts
5.2 Dataset Format Requirements¶
# Example: MTEB-compatible dataset class
from mteb.abstasks import AbsTask
from mteb.abstasks.TaskMetadata import TaskMetadata

class IndonesianDataset(AbsTask):
    metadata = TaskMetadata(
        name="DatasetName",
        description="Description of dataset",
        reference="https://arxiv.org/xxxx.xxxxx",
        dataset={
            "path": "indonesia-mteb/dataset-name",
            "revision": "revision-hash",
        },
        type="Classification",  # or Retrieval, STS, etc.
        category="s2s",  # or s2p, p2p
        eval_splits=["test"],
        eval_langs=["id-ID"],
        main_score="accuracy",
        date=("2024-01-01", "2024-12-31"),
        form=["written"],
        domains=["Social", "Reviews"],
        task_subtypes=["Sentiment classification"],
        license="CC-BY-4.0",
        annotations_creators="human-verified",
        dialect=[],
        sample_creation="created",
        bibtex_citation="""@article{...}""",
    )
5.3 Leaderboard Integration¶
Steps to Add Indonesia-MTEB to Leaderboard:
- Create a HuggingFace organization: indonesia-mteb
- Upload all datasets with proper YAML metadata
- Submit a PR to embeddings-benchmark/mteb on GitHub
- Update the leaderboard configuration
- Run evaluation on reference models
Benchmark Configuration:
# mteb/leaderboard/benchmark_configs/indonesia_mteb.yaml
name: Indonesia-MTEB
description: Indonesian Massive Text Embedding Benchmark
version: 1.0.0
languages: [id]
tasks:
- Classification
- Clustering
- PairClassification
- Reranking
- Retrieval
- STS
datasets:
- dataset1
- dataset2
# ...
6. Responsible NLP Checklist Responses¶
6.1 Required Checklist Items¶
| Category | Checklist Item | Indonesia-MTEB Response |
|---|---|---|
| A. Ethics | A1. Human subjects | Existing datasets only; no new human data collection |
| | A2. Personal data | PII removed; PDP Law compliance documented |
| | A3. Consent | Public domain/explicitly licensed datasets |
| B. Attribution | B1. Artifact creators cited | Yes (MTEB, IndoNLU, NusaCrowd, etc.) |
| | B2. License terms discussed | Yes, see Appendix B |
| C. Impact | C1. Societal impact | Discussed in Broader Impact |
| | C2. Risks mitigated | Data filtering, bias documentation |
| D. Reproducibility | D1. Code availability | GitHub repo |
| | D2. Data availability | HuggingFace datasets |
| | D3. Experimental details | Full methodology in paper |
6.2 Ethics Statement Template¶
Ethics Statement
This work aggregates and translates existing publicly available datasets.
No new human subjects research was conducted. All source datasets were
collected with appropriate consent or are public domain. Personal
identifying information has been removed where present. The work complies
with Indonesia's Personal Data Protection Law (Law No. 27/2022).
Potential risks include amplification of biases present in source data.
We document known biases and limitations in Section X.
6.3 Broader Impact Statement Template¶
Broader Impact
Positive Impact:
- Enables better Indonesian NLP applications (search, recommendation)
- Supports Indonesian language preservation in AI
- Provides evaluation resources for Indonesian model development
Potential Negative Impact:
- May encode biases from source datasets
- Translation may introduce cultural artifacts
- Could be used for surveillance or content moderation
Mitigation:
- Bias documentation for each dataset
- Open licensing enables community audit
- Responsible use guidelines provided
7. Dataset Documentation Standards¶
7.1 Datasheets for Datasets (Gebru et al., 2018)¶
Recommended sections for Indonesia-MTEB datasheets:
- Motivation: Why was the dataset created?
- Composition: What are the instances? What fields?
- Collection Process: How was data collected?
- Preprocessing: What cleaning/filtering was applied?
- Uses: What is the intended use? Unintended uses?
- Distribution: How is the dataset distributed?
- Maintenance: Will the dataset be updated?
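As a release-gate sketch, the seven sections above can be checked for completeness before a dataset ships. The skeleton below is a hypothetical helper (section keys and `missing_sections` are assumptions, not part of any library):

```python
# Minimal datasheet skeleton following the seven sections listed above
# (Gebru et al., 2018); a hypothetical completeness check before release.
DATASHEET_SECTIONS = [
    "motivation", "composition", "collection_process",
    "preprocessing", "uses", "distribution", "maintenance",
]

def missing_sections(datasheet: dict[str, str]) -> list[str]:
    """Return required sections that are absent or left empty."""
    return [s for s in DATASHEET_SECTIONS if not datasheet.get(s, "").strip()]

draft = {
    "motivation": "Fill the Indonesian embedding-benchmark gap.",
    "composition": "Text/label pairs per MTEB task type.",
    "collection_process": "Aggregated and machine-translated public datasets.",
    "uses": "Embedding model evaluation.",
}
print(missing_sections(draft))  # sections still to write
```

Running such a check in CI keeps datasheets from silently drifting out of date as datasets are revised.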
7.2 GEM Data Card Standards¶
For natural language generation tasks (Summarization, Instruction Following):
data_card:
  motivation:
    rationale: "Fill in rationale"
    primary_use: "Describe primary use"
    other_uses: ["List", "other", "uses"]
  composition:
    data_format: "Format description"
    data_fields: [
      {name: "field1", description: "...", type: "..."}
    ]
  collection:
    collection_process: "Description"
    source_datasets: ["List", "sources"]
  preprocessing:
    cleaning_steps: ["Step1", "Step2"]
    filtering_criteria: "Criteria description"
  annotation:
    annotation_process: "Description"
    annotator_demographics: "If known"
  uses:
    intended_uses: ["Use1", "Use2"]
    out_of_scope_uses: ["Misuse1", "Misuse2"]
  distribution:
    license: "CC-BY-4.0"
  maintenance:
    update_frequency: "As needed"
    contact: "maintainer@email.com"
7.3 Example Dataset Card (Classification Task)¶
---
language:
- id
license: cc-by-4.0
task_categories:
- text-classification
pretty_name: Indonesian Sentiment Classification
---
# Indonesian Sentiment Classification
## Dataset Description
This dataset contains Indonesian text samples labeled for sentiment analysis.
Translated from [Original Dataset] using [Translation Model].
## Languages
- Indonesian (id-ID)
## Dataset Structure
### Data Instances
### Data Fields
- `text`: Indonesian text string
- `label`: Sentiment label (positive/negative/neutral)
### Data Splits
- Test: 3,424 samples
## Dataset Creation
### Source Data
Original: [Source dataset name and citation]
License: [Original license]
### Translation
- Model: TranslateGemma-12B
- Validation: 3-stage pipeline (language detection, semantic similarity, LLM-as-judge)
- Kept Ratio: 72.5%
### Quality Assurance
- Semantic similarity threshold: ≥0.75
- LLM-as-judge calibration: 88.4% F1 target
## Considerations
### Known Biases
- E-commerce domain bias
- Formal Indonesian bias (less colloquial)
### Limitations
- Machine translation may lose cultural nuances
- Limited to specific domains
## Citation
```bibtex
@dataset{indo_sentiment_2025,
author = {Indonesia-MTEB Team},
title = {Indonesian Sentiment Classification},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/indonesia-mteb/sentiment}
}
```
8. Citation and Bibliography Standards¶
8.1 Required Citations¶
Must Cite:
@inproceedings{muennighoff2023mteb,
title={MTEB: Massive Text Embedding Benchmark},
author={Muennighoff, Niklas and Tazi, Nouamane and Magne, Loïc and Reimers, Nils},
booktitle={EACL},
year={2023}
}
@inproceedings{wilie2020indonlu,
title={IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding},
author={Wilie, Bryan and Vincentio, Karissa and Winata, Genta Indra and Cahyawijaya, Samuel},
booktitle={AACL},
year={2020}
}
@inproceedings{cahyawijaya2023nusacrowd,
title={NusaCrowd: A Collaborative Project to Collect Indonesian and Regional Languages Datasets},
author={Cahyawijaya, Samuel and others},
booktitle={Findings of ACL},
year={2023}
}
8.2 Translation Model Citations¶
@article{translategemma2024,
title={TranslateGemma: High-Quality Machine Translation with Gemma},
author={Gemma Team},
year={2024}
}
@article{sea_lion_v4,
title={SEA-LION-v4: Southeast Asian Language Model},
author={AI Singapore},
year={2024}
}
8.3 Regional MTEB Citations¶
@article{pham2025vnmteb,
title={VN-MTEB: Vietnamese Massive Text Embedding Benchmark},
author={Pham, Loc and Luu, Tung and Vo, Thu and others},
journal={arXiv preprint arXiv:2507.21500},
year={2025}
}
@inproceedings{baysan2025trmteb,
title={TR-MTEB: A Comprehensive Benchmark for Turkish Text Embeddings},
author={Baysan, Muhammed Selim and others},
booktitle={Findings of EMNLP},
year={2025}
}
@article{ponwitayarat2025seabed,
title={SEA-BED: Southeast Asia Embedding Benchmark},
author={Ponwitayarat, Wuttikorn and others},
journal={arXiv preprint arXiv:2508.12243},
year={2025}
}
9. Submission Timeline and Milestones¶
9.1 Pre-Submission Checklist¶
- All datasets uploaded to HuggingFace with complete README.md
- YAML metadata validated
- License tracking table completed
- Responsible NLP checklist filled out
- Code repository prepared and documented
- Benchmark results replicated with baseline models
- Paper formatted with ACL template
- Supplementary materials (appendices, license table) prepared
9.2 MTEB Integration Timeline¶
| Phase | Duration | Deliverable |
|---|---|---|
| Preparation | 2-4 weeks | Datasets uploaded, README.md complete |
| PR Submission | 1 week | MTEB PR submitted |
| Review | 2-4 weeks | MTEB maintainer review |
| Revision | 1-2 weeks | Address feedback |
| Integration | 1 week | Merged to main branch |
9.3 ARR Submission Timeline¶
| Phase | Duration | Deadline |
|---|---|---|
| ARR Submission | - | February 15, 2025 (example) |
| Reviews Available | ~2 months | April 15, 2025 |
| Rebuttal Period | 1 week | After reviews |
| Final Decision | ~1 month | May 2025 |
| Conference Selection | - | ACL/EMNLP/NAACL |
10. Common Pitfalls to Avoid¶
10.1 Licensing Mistakes¶
| Mistake | Consequence | Solution |
|---|---|---|
| License not specified | Desk rejection, unusable dataset | Always include license in YAML |
| Wrong license type | Legal issues downstream | Track source licenses properly |
| Ignoring ShareAlike | License violation | CC-BY-SA → CC-BY-SA for derivatives |
| Missing attribution | Plagiarism concerns | Cite all sources explicitly |
10.2 Documentation Mistakes¶
| Mistake | Consequence | Solution |
|---|---|---|
| Incomplete README | Dataset not discoverable | Use HuggingFace template |
| No contact info | Cannot report issues | Include maintainer email |
| Unclear data format | Integration problems | Provide examples and schema |
| Missing splits info | Wrong evaluation | Document train/dev/test sizes |
10.3 Submission Mistakes¶
| Mistake | Consequence | Solution |
|---|---|---|
| Wrong paper size | Formatting rejection | Use A4 only (21cm × 29.7cm) |
| Incomplete checklist | Desk rejection (since Dec 2024) | Fill all items carefully |
| Missing supplementary | Reviewer concerns | Upload code/data even if optional |
| Anonymity violation | Desk rejection | Submissions remain anonymized; only the pre-submission anonymity period ended (Feb 2024) |
11. Implementation Checklist¶
11.1 Dataset Preparation¶
# For each dataset:
- [ ] Verify source license
- [ ] Apply translation pipeline
- [ ] Run validation (3-stage)
- [ ] Format to MTEB structure
- [ ] Create README.md with YAML
- [ ] Upload to HuggingFace
- [ ] Test loading and evaluation with the mteb library (e.g., mteb.MTEB(tasks=[...]).run(model))
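The "Run validation (3-stage)" step above includes a semantic-similarity filter. The following is a minimal sketch of that single stage under stated assumptions (the 0.75 threshold comes from the quality-assurance notes earlier in this document; scores, sample texts, and the `filter_by_similarity` helper are illustrative):

```python
# Sketch of the similarity-threshold stage of translation validation.
def filter_by_similarity(pairs: list[tuple[str, float]], threshold: float = 0.75):
    """Keep translations whose semantic-similarity score meets the threshold,
    and report the kept ratio for the dataset card."""
    kept = [text for text, score in pairs if score >= threshold]
    kept_ratio = len(kept) / len(pairs) if pairs else 0.0
    return kept, kept_ratio

pairs = [
    ("terjemahan A", 0.91),
    ("terjemahan B", 0.62),  # dropped: below threshold
    ("terjemahan C", 0.80),
    ("terjemahan D", 0.77),
]
kept, ratio = filter_by_similarity(pairs)
print(len(kept), f"{ratio:.2%}")  # 3 75.00%
```

The resulting kept ratio feeds directly into the Dataset Overview Table and the per-dataset cards; language detection and LLM-as-judge stages would apply before and after this filter, respectively.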
11.2 Paper Preparation¶
# Paper components:
- [ ] Abstract (200-250 words)
- [ ] Introduction (motivation, contributions)
- [ ] Related Work (MTEB, regional MTEBs, Indonesian NLP)
- [ ] Methodology (3-pronged approach)
- [ ] Benchmark overview (tables, figures)
- [ ] Experiments (implementation, results)
- [ ] Analysis and insights
- [ ] Conclusion and limitations
- [ ] Ethics statement
- [ ] Broader impact
- [ ] Acknowledgments
- [ ] References (BibTeX)
- [ ] Appendix (license table, examples)
11.3 Submission Package¶
# Final submission:
- [ ] Main paper PDF (A4 format)
- [ ] Supplementary materials (.tgz, max 200MB)
- [ ] Evaluation code
- [ ] Dataset statistics
- [ ] Additional examples
- [ ] License compatibility table
- [ ] Responsible NLP checklist (submitted via ARR portal)
- [ ] Author information (if no anonymity)
- [ ] ORCID IDs (recommended)
12. References and Resources¶
12.1 Official Resources¶
| Resource | URL |
|---|---|
| ACL Rolling Review | http://aclrollingreview.org |
| ARR Author Guidelines | http://aclrollingreview.org/authors |
| Responsible NLP Checklist | http://aclrollingreview.org/responsibleNLPresearch |
| MTEB GitHub | https://github.com/embeddings-benchmark/mteb |
| MTEB Leaderboard | https://huggingface.co/spaces/mteb/leaderboard |
| HuggingFace Datasets Docs | https://huggingface.co/docs/datasets |
12.2 Dataset Documentation Standards¶
| Standard | Citation |
|---|---|
| Datasheets for Datasets | Gebru et al., 2018, arXiv:1803.09010 |
| Model Cards | Mitchell et al., 2019 |
| GEM Data Cards | Gehrmann et al., 2021 |
| Data Cards for Dataset Documentation | Pushkarna et al., 2022 |
12.3 Legal Resources¶
| Topic | Resource |
|---|---|
| CC for AI Training | https://creativecommons.org/using-cc-licensed-works-for-ai-training |
| Dataset Licensing Audit | Longpre et al., 2024, Nature Machine Intelligence |
| Indonesia PDP Law | Law No. 27 of 2022 on Personal Data Protection |
| Derivative Works | US Copyright Office, Circular 14 |
Summary¶
Publishing Indonesia-MTEB as an ACL dataset paper requires:
- ARR Compliance: Paper format, Responsible NLP checklist, proper anonymization
- Dataset Documentation: Complete HuggingFace cards with YAML metadata
- Licensing Clarity: Track all source licenses, specify downstream terms
- MTEB Integration: Follow contribution guidelines, ensure compatibility
- Ethics and Impact: Ethics statement, broader impact discussion, bias documentation
- Attribution: Cite all source datasets, models, and related work
By following these standards, Indonesia-MTEB can achieve:
- Acceptance at top-tier NLP venues
- Integration with the MTEB leaderboard
- Community adoption and reproducibility
- Legal clarity for downstream users