haznitrama/idn-ban-cbn-synthetic

Name: haznitrama/idn-ban-cbn-synthetic
Creator: haznitrama
Published: 2025-12-10 09:55:24
License: MIT

Source: Hugging Face
Updated: 2025-12-10
Indexed: 2025-12-20

Download link:

https://hf-mirror.com/datasets/haznitrama/idn-ban-cbn-synthetic
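As a minimal sketch of how the corpus might be read programmatically, the snippet below streams one subset with the Hugging Face `datasets` library. It assumes the two subsets named in the dataset card (`raw` and `filtered_heuristic`) are exposed as configuration names of the repository; the helper `open_subset` is hypothetical, and the split names are not stated in the card, so none are hard-coded here.

```python
REPO_ID = "haznitrama/idn-ban-cbn-synthetic"

# Subset names taken from the dataset card; treating them as
# `datasets` configuration names is an assumption.
SUBSETS = ("raw", "filtered_heuristic")


def open_subset(subset: str = "filtered_heuristic"):
    """Return a streaming view of one subset (hypothetical helper).

    Streaming avoids downloading the full corpus up front.
    """
    if subset not in SUBSETS:
        raise ValueError(f"unknown subset {subset!r}; expected one of {SUBSETS}")
    # Imported lazily so the name check above works without the library.
    from datasets import load_dataset

    # With streaming=True and no `split`, this returns a mapping
    # from split name to an iterable dataset.
    return load_dataset(REPO_ID, name=subset, streaming=True)


if __name__ == "__main__":
    splits = open_subset("filtered_heuristic")
    for split_name, stream in splits.items():
        first = next(iter(stream))
        # Field names are not documented in the card, so just list them.
        print(split_name, sorted(first.keys()))
```

Printing the keys of the first record is a cheap way to discover the actual field names before writing any processing code against them.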


Description:

---
pretty_name: Indonesian ↔ Balinese ↔ Cirebonese Synthetic Corpus
license: mit
task_categories:
- text-classification
- text-generation
language:
- id
- ban
- jav
---

# Indonesian–Balinese–Cirebonese Synthetic Parallel Corpus

## Dataset Summary

This corpus is **fully synthetic** and targets two extremely low-resource languages:

- **Balinese** (ban)
- **Cirebonese** (often grouped under Javanese dialects)

Key facts:

- Seeded with 10k high-level topics generated by GPT-5.
- For each topic, gpt-oss-120b produced 30 subtopics, then generated 30 question–answer pairs per subtopic (10k × 30 × 30 ≈ 9M Q&A pairs).
- Answers were translated into Balinese and Cirebonese by gpt-oss-120b, which was given bilingual lexicons/dictionaries through prompting and in-context learning.
- The parallel corpus contains roughly **1B tokens per low-resource language** (Balinese and Cirebonese) across all splits.

## Available Subsets

- `raw`: direct aggregation of every translated answer that passes structural validation (IDs and all three language fields present).
- `filtered_heuristic`: subset of `raw` that passes heuristic filtering, including a minimum-length check, repetition checks, and GlotLID language verification for Balinese and Cirebonese (GlotLID labels Cirebonese as `jav`).

## Citation

If you use this dataset (or derivatives), please cite:

```
@misc{scale-lowres-synth-2025,
  title = {Scale Resources for Low-Resource Languages via Synthetic Data Generation},
  author = {Faiz Ghifari Haznitrama and Najma Qalbi Dwiharani and Alice Oh},
  year = {2025},
  howpublished = {\url{https://huggingface.co/datasets/haznitrama/idn-ban-cbn-synthetic}},
}
```

## Notes

- The dataset is synthetic; no human-written or human-translated text is included.
- Licensed under the permissive MIT terms to encourage downstream reuse.
- Before production use, verify the safety and quality constraints that apply to your deployment scenario.
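The card describes `filtered_heuristic` only in general terms (minimum length, repetition checks, GlotLID verification). The following is a rough sketch of what such a filter could look like, not the authors' actual pipeline. The field names (`balinese`, `cirebonese`), both thresholds, the trigram-based repetition measure, and the `detect_lang` callable are all assumptions introduced for illustration; `detect_lang` stands in for a language identifier such as GlotLID and is assumed to return a bare code like `ban` or `jav`, so a real wrapper would need to normalize the model's label format.

```python
from typing import Callable, Dict, Optional

# Hypothetical field names keyed by the language code expected from the
# identifier. The card only says each record has an ID and three language fields.
LANG_FIELDS: Dict[str, str] = {"ban": "balinese", "jav": "cirebonese"}

MIN_CHARS = 20           # assumed minimum length, not from the card
MAX_REPEAT_RATIO = 0.3   # assumed repetition threshold, not from the card


def repeated_ngram_ratio(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that duplicate an earlier n-gram."""
    tokens = text.lower().split()
    total = len(tokens) - n + 1
    if total <= 0:
        return 0.0  # too short to contain any n-gram
    ngrams = [tuple(tokens[i:i + n]) for i in range(total)]
    return 1.0 - len(set(ngrams)) / total


def passes_heuristics(
    record: Dict[str, str],
    detect_lang: Optional[Callable[[str], str]] = None,
) -> bool:
    """Apply length, repetition, and optional language-ID checks."""
    for expected_code, field in LANG_FIELDS.items():
        text = record.get(field, "")
        if len(text.strip()) < MIN_CHARS:
            return False
        if repeated_ngram_ratio(text) > MAX_REPEAT_RATIO:
            return False
        # Language check is skipped when no identifier is supplied.
        if detect_lang is not None and detect_lang(text) != expected_code:
            return False
    return True
```

Keeping the language identifier behind a plain callable means the length and repetition checks can be exercised without downloading a model, and a different identifier can be swapped in without touching the filter itself.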


