llm-jp/scaling-data-constrained-llms
Source: Hugging Face. Updated 2026-03-23; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/llm-jp/scaling-data-constrained-llms
Resource description:
---
license: cc-by-4.0
task_categories:
- text-generation
language:
- ja
---
# Scaling Data-Constrained Language Models with Synthetic Data
This repository provides the pre-training corpora used in **Scaling Data-Constrained Language Models with Synthetic Data (Findings of EACL 2026)**.
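As a minimal sketch of how the corpora might be accessed, the snippet below uses the Hugging Face `datasets` library to list the repository's configurations and stream a few records. The repository ID is taken from this page. The helper names `list_corpora` and `peek`, the `split="train"` default, and the idea that each corpus is exposed as a separate configuration are assumptions for illustration and are not confirmed by this card, so check the dataset viewer for the actual layout.

```python
from itertools import islice

# Repository ID as given on this page.
REPO_ID = "llm-jp/scaling-data-constrained-llms"


def list_corpora(repo_id: str = REPO_ID) -> list[str]:
    """Return the configuration names the repository exposes (needs network access)."""
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import get_dataset_config_names

    return get_dataset_config_names(repo_id)


def peek(config: str, n: int = 3, split: str = "train", repo_id: str = REPO_ID) -> list[dict]:
    """Stream the first `n` records of one configuration without a full download.

    Assumption: the configuration has a split named `split`.
    """
    from datasets import load_dataset

    # streaming=True yields records lazily, which matters for multi-billion-token corpora.
    stream = load_dataset(repo_id, name=config, split=split, streaming=True)
    return list(islice(stream, n))
```

Calling `list_corpora()` first and passing one of the returned names to `peek` avoids guessing configuration names. Streaming is used here because the corpora are large enough that a full download is rarely what a first inspection needs.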
## Overview

This repository contains multiple corpora designed to study data augmentation strategies for pre-training Japanese LLMs in a data-constrained setting.
Starting from a limited Japanese Web corpus and a larger English Web corpus, we construct three Japanese synthetic corpora via paraphrasing, instruction generation, and translation.
## Corpora
### Organic Corpora
- **JA-WEB-9B**: A 9B-token Japanese web corpus derived from [the FineWeb2 dataset](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2).
- **EN-WEB-63B**: A 63B-token English web corpus derived from [the FineWeb dataset](https://huggingface.co/datasets/HuggingFaceFW/fineweb).
- **JA-WEB-63B**: A 63B-token Japanese web corpus derived from [the FineWeb2 dataset](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2).
### Synthetic Corpora
All synthetic corpora are constructed from the above organic datasets using [Qwen3-14B](https://huggingface.co/Qwen/Qwen3-14B).
- **JA-PARAPHRASE-63B**: A paraphrased version of JA-WEB-9B.
- **JA-INSTRUCT-63B**: Instruction-style data generated from JA-WEB-9B.
- **JA-TRANSLATE-63B**: Japanese translations of EN-WEB-63B.
Further details of the data construction pipeline are described in the paper.
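The relationships among the corpora listed above can be summarized in a small lookup table. The sketch below is illustrative only: the `Corpus` record and the `expansion_factor` helper are hypothetical names, and the token counts for the synthetic corpora are read off the `-63B` suffix in their names, which is an assumption, since this card states sizes explicitly only for the organic corpora.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Corpus:
    name: str
    language: str            # "ja" or "en"
    tokens_b: int            # nominal size in billions of tokens
    source: Optional[str]    # corpus it was generated from; None for organic corpora
    method: Optional[str]    # generation method; None for organic corpora


CORPORA = {
    c.name: c
    for c in [
        Corpus("JA-WEB-9B", "ja", 9, None, None),
        Corpus("EN-WEB-63B", "en", 63, None, None),
        Corpus("JA-WEB-63B", "ja", 63, None, None),
        # Sizes below are assumed from the "-63B" suffix in the corpus names.
        Corpus("JA-PARAPHRASE-63B", "ja", 63, "JA-WEB-9B", "paraphrasing"),
        Corpus("JA-INSTRUCT-63B", "ja", 63, "JA-WEB-9B", "instruction generation"),
        Corpus("JA-TRANSLATE-63B", "ja", 63, "EN-WEB-63B", "translation"),
    ]
}


def expansion_factor(name: str) -> Optional[float]:
    """Ratio of a synthetic corpus's token count to its source's; None if organic."""
    corpus = CORPORA[name]
    if corpus.source is None:
        return None
    return corpus.tokens_b / CORPORA[corpus.source].tokens_b


for name in CORPORA:
    print(name, expansion_factor(name))
```

Under these assumptions, the paraphrase and instruction corpora are nominally about seven times the size of JA-WEB-9B, while the translation corpus matches the nominal size of EN-WEB-63B.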
## Citation
If you use this dataset, please cite:
```bibtex
@inproceedings{kiyomaru-etal-2026-scaling,
    title = "Scaling Data-Constrained Language Models with Synthetic Data",
    author = "Kiyomaru, Hirokazu and
      Oda, Yusuke and
      Kodama, Takashi and
      Liu, Chaoran and
      Kawahara, Daisuke",
    editor = "Demberg, Vera and
      Inui, Kentaro and
      Marquez, Llu{\'i}s",
    booktitle = "Findings of the {A}ssociation for {C}omputational {L}inguistics: {EACL} 2026",
    month = mar,
    year = "2026",
    address = "Rabat, Morocco",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2026.findings-eacl.52/",
    pages = "1002--1016",
    ISBN = "979-8-89176-386-9",
}
```
Provider:
llm-jp



