five

hachi-intelligence/JapaneseSummarization-FW2EduJa-Distill

收藏
Hugging Face2026-01-29 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/hachi-intelligence/JapaneseSummarization-FW2EduJa-Distill
下载链接
链接失效反馈
官方服务:
资源简介:
--- license: apache-2.0 task_categories: - summarization language: - ja dataset_info: features: - name: id dtype: string - name: input dtype: string - name: instruction dtype: string - name: length dtype: string - name: output dtype: string - name: reasoning dtype: string - name: token_count dtype: int64 - name: url dtype: string configs: - config_name: default data_files: - split: train path: "*.parquet" pretty_name: Japanese Summarization Dataset (FW2EduJa Distilled) tags: - summarization - instruction - distillation - japanese - HACHI-Intelligence task_templates: - task: summarization input: input output: output --- # JapaneseSummarization-FW2EduJa-Distill ## Dataset Description This dataset is a large-scale Japanese summarization dataset (approximately 1 billion tokens) designed for high-fidelity knowledge distillation. It is built upon the **[fineweb-2-edu-japanese](https://huggingface.co/datasets/hotchpotch/fineweb-2-edu-japanese)** corpus, utilizing state-of-the-art LLMs to generate summaries that preserve precise factual information. The core objective is to create "extractive-style abstractive summaries" that maintain the integrity of proper nouns, numerical values, chronological order, and causal relationships. This makes it particularly suitable for training SLMs (Small Language Models) for professional domains such as public administration, research, medical, and finance. ### Method and Model Diversity To ensure a balanced dataset and mitigate model-specific biases, we employed two distinct model architectures: 1. **Qwen3-30B-A3B-Thinking-2507**: Leveraged for its superior Japanese linguistic capabilities. * Generates: Basic summary & Three-line summary. 2. **gpt-oss-120b**: Utilized for its high instruction-following performance. * Generates: Summaries within 100, 300, and 500 characters. ### Key Features * **Factual Accuracy**: Explicit instructions were given to accurately transcribe units, numbers, and proper nouns. * **Logical Consistency**: Preserves the original flow of time and causal links. * **High Utility**: Optimized for sectors requiring high-precision information extraction. ## License [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0)
提供机构:
hachi-intelligence
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作