haznitrama/idn-ban-cbn-synthetic
Source: Hugging Face. Updated 2025-12-10. Indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/haznitrama/idn-ban-cbn-synthetic
---
pretty_name: Indonesian ↔ Balinese ↔ Cirebonese Synthetic Corpus
license: mit
task_categories:
- text-classification
- text-generation
language:
- id
- ban
- jav
---
# Indonesian–Balinese–Cirebonese Synthetic Parallel Corpus
## Dataset Summary
This corpus is **fully synthetic** and targets two extremely low-resource languages:
- **Balinese** (ban)
- **Cirebonese** (often grouped under Javanese dialects)
Key facts:
- Seeded with 10k high-level topics generated by GPT-5.
- For each topic, gpt-oss-120b produced 30 subtopics and then generated 30
question–answer pairs per subtopic (10k × 30 × 30 ≈ 9M Q&A pairs).
- Answers were translated into Balinese and Cirebonese by gpt-oss-120b while
leveraging bilingual lexicons/dictionaries through prompting and in-context
learning.
- The parallel corpus contains roughly **1B tokens per low-resource language**
(Balinese and Cirebonese) across all splits.
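The Q&A count in the key facts can be checked with a short calculation. This is a minimal sketch that assumes the 30 question–answer pairs are generated per subtopic, which is the reading that reproduces the ≈9M figure; the constant and function names are illustrative only.

```python
# Figures taken from the key facts above.
NUM_TOPICS = 10_000        # seed topics
SUBTOPICS_PER_TOPIC = 30   # subtopics generated per topic
QA_PER_SUBTOPIC = 30       # assumption: 30 Q&A pairs per subtopic


def expected_qa_pairs(topics, subtopics_per_topic, qa_per_subtopic):
    """Number of Q&A pairs generated before any validation or filtering."""
    return topics * subtopics_per_topic * qa_per_subtopic


total_pairs = expected_qa_pairs(NUM_TOPICS, SUBTOPICS_PER_TOPIC, QA_PER_SUBTOPIC)
print(f"{total_pairs:,} Q&A pairs")  # prints "9,000,000 Q&A pairs"
```

This is the count before validation and filtering, so the `raw` and `filtered_heuristic` subsets may contain fewer rows.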
## Available Subsets
- `raw`: direct aggregation of every translated answer that passes structural
validation (an ID and all three language fields are present).
- `filtered_heuristic`: subset of `raw` that passes heuristic filtering,
including a minimum length check, repetition checks, and GlotLID verification
for Balinese and Cirebonese (GlotLID labels Cirebonese as `jav`).
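The two stages described above can be illustrated roughly as follows. This is a sketch under stated assumptions, not the authors' pipeline: the field names (`id`, `indonesian`, `balinese`, `cirebonese`), the thresholds, and the specific repetition measure are all hypothetical, and the GlotLID language-ID step is left out.

```python
from collections import Counter

# Hypothetical field names; the card does not state the actual column names.
ID_FIELD = "id"
TEXT_FIELDS = ("indonesian", "balinese", "cirebonese")
LOW_RESOURCE_FIELDS = ("balinese", "cirebonese")


def is_structurally_valid(record):
    """True if the ID and all three language fields are present and non-empty."""
    if record.get(ID_FIELD) in (None, ""):
        return False
    return all(
        isinstance(record.get(field), str) and record[field].strip()
        for field in TEXT_FIELDS
    )


def max_token_share(text):
    """Share of whitespace tokens taken by the single most frequent token."""
    tokens = text.lower().split()
    if not tokens:
        return 1.0
    return Counter(tokens).most_common(1)[0][1] / len(tokens)


def passes_heuristics(record, min_tokens=5, max_share=0.5):
    """Minimum-length and repetition checks (thresholds are assumptions)."""
    for field in LOW_RESOURCE_FIELDS:
        text = record[field]
        if len(text.split()) < min_tokens:
            return False  # too short
        if max_token_share(text) > max_share:
            return False  # dominated by one repeated token
    return True


def filter_records(records):
    """Return (raw, filtered): structurally valid rows, then heuristic-passing rows."""
    raw = [r for r in records if is_structurally_valid(r)]
    filtered = [r for r in raw if passes_heuristics(r)]
    return raw, filtered
```

Here `filter_records` returns both lists so that the second is always a subset of the first, mirroring the relationship between `raw` and `filtered_heuristic`. A language-ID check would slot into `passes_heuristics` as one more condition.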
## Citation
If you use this dataset (or derivatives), please cite:
```
@misc{scale-lowres-synth-2025,
title = {Scale Resources for Low-Resource Languages via Synthetic Data Generation},
author = {Faiz Ghifari Haznitrama and Najma Qalbi Dwiharani and Alice Oh},
year = {2025},
howpublished = {\url{https://huggingface.co/datasets/haznitrama/idn-ban-cbn-synthetic}},
}
```
## Notes
- Dataset is synthetic; no human-written or human-translated text is included.
- Licensed under the permissive MIT terms to encourage downstream reuse.
- Please verify downstream safety/quality constraints that apply to your
deployment scenario before production use.
Provider: haznitrama



