
llm-jp/scaling-data-constrained-llms

Source: Hugging Face. Updated 2026-03-23. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/llm-jp/scaling-data-constrained-llms
Resource description:
---
license: cc-by-4.0
task_categories:
- text-generation
language:
- ja
---

# Scaling Data-Constrained Language Models with Synthetic Data

This repository provides the pre-training corpora used in **Scaling Data-Constrained Language Models with Synthetic Data (Findings of EACL 2026)**.

## Overview

![](overview.png)

This repository contains multiple corpora designed to study data augmentation strategies for pre-training Japanese LLMs in a data-constrained setting. Starting from a limited Japanese web corpus and a larger English web corpus, we construct three Japanese synthetic corpora via paraphrasing, instruction generation, and translation.

## Corpora

### Organic Corpora

- **JA-WEB-9B**: A 9B-token Japanese web corpus derived from [the FineWeb2 dataset](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2).
- **EN-WEB-63B**: A 63B-token English web corpus derived from [the FineWeb dataset](https://huggingface.co/datasets/HuggingFaceFW/fineweb).
- **JA-WEB-63B**: A 63B-token Japanese web corpus derived from [the FineWeb2 dataset](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2).

### Synthetic Corpora

All synthetic corpora are constructed from the above organic datasets using [Qwen3-14B](https://huggingface.co/Qwen/Qwen3-14B).

- **JA-PARAPHRASE-63B**: A paraphrased version of JA-WEB-9B.
- **JA-INSTRUCT-63B**: Instruction-style data generated from JA-WEB-9B.
- **JA-TRANSLATE-63B**: Japanese translations of EN-WEB-63B.

Further details of the data construction pipeline are described in the paper.
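The relationship between the organic and synthetic corpora can be summarized with a little arithmetic. The sketch below is illustrative only: it assumes the `63B` suffix in the synthetic corpus names denotes a token count in billions, as it does for the organic corpora, and the `Corpus` class and `expansion_factor` helper are hypothetical names that are not part of this repository.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Corpus:
    name: str
    tokens_b: int           # size in billions of tokens (assumed from the name suffix)
    source: Optional[str]   # corpus in this repository it was generated from, if any


# Sizes and sources as listed in the corpus descriptions above.
CORPORA = {
    c.name: c
    for c in [
        Corpus("JA-WEB-9B", 9, None),
        Corpus("EN-WEB-63B", 63, None),
        Corpus("JA-WEB-63B", 63, None),
        Corpus("JA-PARAPHRASE-63B", 63, "JA-WEB-9B"),
        Corpus("JA-INSTRUCT-63B", 63, "JA-WEB-9B"),
        Corpus("JA-TRANSLATE-63B", 63, "EN-WEB-63B"),
    ]
}


def expansion_factor(name: str) -> Optional[float]:
    """Return synthetic tokens divided by source tokens, or None for organic corpora."""
    corpus = CORPORA[name]
    if corpus.source is None:
        return None
    return corpus.tokens_b / CORPORA[corpus.source].tokens_b


for name, corpus in CORPORA.items():
    factor = expansion_factor(name)
    if factor is not None:
        print(f"{name}: {factor:.1f}x the tokens of {corpus.source}")
```

Under that assumption, the paraphrase and instruction corpora each expand the 9B-token Japanese source by a factor of 7, while the translated corpus matches its English source in nominal size.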
### Citation

If you use this dataset, please cite:

```bibtex
@inproceedings{kiyomaru-etal-2026-scaling,
    title = "Scaling Data-Constrained Language Models with Synthetic Data",
    author = "Kiyomaru, Hirokazu and Oda, Yusuke and Kodama, Takashi and Liu, Chaoran and Kawahara, Daisuke",
    editor = "Demberg, Vera and Inui, Kentaro and Marquez, Llu{\'i}s",
    booktitle = "Findings of the {A}ssociation for {C}omputational {L}inguistics: {EACL} 2026",
    month = mar,
    year = "2026",
    address = "Rabat, Morocco",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2026.findings-eacl.52/",
    pages = "1002--1016",
    ISBN = "979-8-89176-386-9",
}
```
Provider:
llm-jp