
haznitrama/idn-ban-cbn-synthetic

Source: Hugging Face. Updated 2025-12-10; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/haznitrama/idn-ban-cbn-synthetic
Resource description:
---
pretty_name: Indonesian ↔ Balinese ↔ Cirebonese Synthetic Corpus
license: mit
task_categories:
- text-classification
- text-generation
language:
- id
- ban
- jav
---

# Indonesian–Balinese–Cirebonese Synthetic Parallel Corpus

## Dataset Summary

This corpus is **fully synthetic** and targets two extremely low-resource languages:

- **Balinese** (ban)
- **Cirebonese** (often grouped under Javanese dialects)

Key facts:

- Seeded with 10k high-level topics generated by GPT-5.
- For each topic, gpt-oss-120b produced 30 subtopics and then generated 30 question–answer pairs per subtopic (10k × 30 × 30 ≈ 9M Q&A pairs).
- Answers were translated into Balinese and Cirebonese by gpt-oss-120b, which was given bilingual lexicons/dictionaries through prompting and in-context learning.
- The parallel corpus contains roughly **1B tokens per low-resource language** (Balinese and Cirebonese) across all splits.

## Available Subsets

- `raw`: direct aggregation of every translated answer that passes structural validation (IDs and all three language fields present).
- `filtered_heuristic`: subset of `raw` that passes heuristic filtering, including a minimum length, repetition checks, and GlotLID verification for Balinese and Cirebonese (Cirebonese is labeled "jav" in GlotLID).

## Citation

If you use this dataset (or derivatives), please cite:

```
@misc{scale-lowres-synth-2025,
  title = {Scale Resources for Low-Resource Languages via Synthetic Data Generation},
  author = {Faiz Ghifari Haznitrama and Najma Qalbi Dwiharani and Alice Oh},
  year = {2025},
  howpublished = {\url{https://huggingface.co/datasets/haznitrama/idn-ban-cbn-synthetic}},
}
```

## Notes

- Dataset is synthetic; no human-written or human-translated text is included.
- Licensed under the permissive MIT terms to encourage downstream reuse.
- Please verify the downstream safety and quality constraints that apply to your deployment scenario before production use.
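The two subsets differ only in how strictly records are screened, so a small sketch may help make the criteria concrete. The code below is an illustration, not the authors' pipeline: the field names (`id`, `indonesian`, `balinese`, `cirebonese`), the thresholds `MIN_CHARS` and `MAX_REPEAT_RATIO`, and the `lid` callable are all assumptions, since the card does not publish them. The GlotLID step is represented by an optional caller-supplied function rather than by the real model.

```python
from collections import Counter
from typing import Any, Callable, Mapping, Optional

# Hypothetical field names and thresholds; the dataset card does not specify them.
LANG_FIELDS = ("indonesian", "balinese", "cirebonese")
MIN_CHARS = 20
MAX_REPEAT_RATIO = 0.5

# Expected language-ID labels; the card notes that GlotLID labels Cirebonese as "jav".
EXPECTED_LID = {"balinese": "ban", "cirebonese": "jav"}


def is_structurally_valid(record: Mapping[str, Any]) -> bool:
    """Approximate the `raw` criterion: an ID plus three non-empty language fields."""
    if record.get("id") is None:
        return False
    return all(
        isinstance(record.get(field), str) and bool(record[field].strip())
        for field in LANG_FIELDS
    )


def repeat_ratio(text: str) -> float:
    """Fraction of whitespace tokens taken up by the single most frequent token."""
    tokens = text.lower().split()
    if not tokens:
        return 1.0
    return Counter(tokens).most_common(1)[0][1] / len(tokens)


def passes_heuristics(
    record: Mapping[str, Any],
    lid: Optional[Callable[[str], str]] = None,
) -> bool:
    """Approximate the `filtered_heuristic` criteria on one record.

    `lid` is a hypothetical stand-in for a language identifier such as GlotLID:
    it takes a text and returns a language code. When omitted, the check is skipped.
    """
    if not is_structurally_valid(record):
        return False
    for field in LANG_FIELDS:
        text = record[field].strip()
        if len(text) < MIN_CHARS:
            return False
        if repeat_ratio(text) > MAX_REPEAT_RATIO:
            return False
    if lid is not None:
        for field, code in EXPECTED_LID.items():
            if lid(record[field]) != code:
                return False
    return True


# Placeholder record: the strings are illustrative filler, not dataset content
# and not real Balinese or Cirebonese.
sample = {
    "id": "example-0001",
    "indonesian": "contoh kalimat bahasa Indonesia yang cukup panjang",
    "balinese": "placeholder Balinese text w1 w2 w3 w4",
    "cirebonese": "placeholder Cirebonese text w1 w2 w3 w4",
}

if __name__ == "__main__":
    print(passes_heuristics(sample))
```

The most-frequent-token ratio is only one of many possible repetition checks; n-gram or line-level duplication measures would be equally plausible readings of the card's "repetition checks".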

Provider:
haznitrama