hachi-intelligence/JapaneseSummarization-FW2EduJa-Distill

Name: hachi-intelligence/JapaneseSummarization-FW2EduJa-Distill
Creator: hachi-intelligence
Published: 2026-01-29 11:55:37
License: Apache-2.0 (per the dataset card)

Source: Hugging Face. Updated: 2026-01-29. Indexed: 2026-03-29.

Download link:

https://hf-mirror.com/datasets/hachi-intelligence/JapaneseSummarization-FW2EduJa-Distill

Description:

---
license: apache-2.0
task_categories:
- summarization
language:
- ja
dataset_info:
  features:
  - name: id
    dtype: string
  - name: input
    dtype: string
  - name: instruction
    dtype: string
  - name: length
    dtype: string
  - name: output
    dtype: string
  - name: reasoning
    dtype: string
  - name: token_count
    dtype: int64
  - name: url
    dtype: string
configs:
- config_name: default
  data_files:
  - split: train
    path: "*.parquet"
pretty_name: Japanese Summarization Dataset (FW2EduJa Distilled)
tags:
- summarization
- instruction
- distillation
- japanese
- HACHI-Intelligence
task_templates:
- task: summarization
  input: input
  output: output
---

# JapaneseSummarization-FW2EduJa-Distill

## Dataset Description

This dataset is a large-scale Japanese summarization dataset (approximately 1 billion tokens) designed for high-fidelity knowledge distillation. It is built upon the **[fineweb-2-edu-japanese](https://huggingface.co/datasets/hotchpotch/fineweb-2-edu-japanese)** corpus, utilizing state-of-the-art LLMs to generate summaries that preserve precise factual information.

The core objective is to create "extractive-style abstractive summaries" that maintain the integrity of proper nouns, numerical values, chronological order, and causal relationships. This makes it particularly suitable for training SLMs (Small Language Models) for professional domains such as public administration, research, medical, and finance.

### Method and Model Diversity

To ensure a balanced dataset and mitigate model-specific biases, we employed two distinct model architectures:

1. **Qwen3-30B-A3B-Thinking-2507**: Leveraged for its superior Japanese linguistic capabilities.
   * Generates: Basic summary & Three-line summary.
2. **gpt-oss-120b**: Utilized for its high instruction-following performance.
   * Generates: Summaries within 100, 300, and 500 characters.

### Key Features

* **Factual Accuracy**: Explicit instructions were given to accurately transcribe units, numbers, and proper nouns.
* **Logical Consistency**: Preserves the original flow of time and causal links.
* **High Utility**: Optimized for sectors requiring high-precision information extraction.

## License

[Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0)
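## Usage Sketch

The card describes a single `train` split stored as Parquet, summaries capped at 100, 300, or 500 characters, and an emphasis on transcribing numbers exactly. The following is a minimal sketch of how a user might load the data and spot-check those two properties. It is not part of the original card: the function names (`load_train_split`, `within_limit`, `unsupported_numbers`) are hypothetical, the loader assumes the Hugging Face `datasets` library is installed, and the number check is a rough heuristic that ignores kanji numerals.

```python
import re
import unicodedata

# Repository id taken from the download link on this page.
REPO_ID = "hachi-intelligence/JapaneseSummarization-FW2EduJa-Distill"

# Character limits the card lists for the gpt-oss-120b summaries.
CHAR_LIMITS = (100, 300, 500)

# Digit runs with optional thousands separators or decimals, e.g. 1,200 or 3.5.
_NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")


def load_train_split():
    """Stream the train split. Assumes `pip install datasets`; needs network access."""
    # Imported lazily so the checks below depend only on the standard library.
    from datasets import load_dataset

    # streaming=True avoids downloading the whole corpus before iterating.
    return load_dataset(REPO_ID, split="train", streaming=True)


def within_limit(summary: str, limit: int) -> bool:
    """Return True if the summary has at most `limit` characters (code points)."""
    return len(summary) <= limit


def _numbers(text: str) -> set[str]:
    """Extract number strings after folding full-width digits to ASCII."""
    normalized = unicodedata.normalize("NFKC", text)
    return set(_NUMBER_RE.findall(normalized))


def unsupported_numbers(source: str, summary: str) -> list[str]:
    """List numbers that appear in the summary but nowhere in the source."""
    return sorted(_numbers(summary) - _numbers(source))


if __name__ == "__main__":
    # Hand-written example texts, not records from the dataset.
    source = "2024年3月、市は予算を1,200万円増額した。"
    faithful = "市は2024年3月に予算を1,200万円増額した。"
    altered = "市は2023年に予算を1,200万円増額した。"

    print(within_limit(faithful, CHAR_LIMITS[0]))    # True
    print(unsupported_numbers(source, faithful))     # []
    print(unsupported_numbers(source, altered))      # ['2023']
```

When iterating real records, the same checks could be applied to the `input` and `output` fields listed in the schema. Which character limit applies to a given record presumably depends on its `length` or `instruction` field, whose exact values are not documented on this page.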

Provided by:

hachi-intelligence
