DeL-TaiseiOzaki/Tengentoppa-pretrain-v1.0
Source: Hugging Face. Updated: 2025-12-14. Indexed: 2025-12-20.
Download link:
https://hf-mirror.com/datasets/DeL-TaiseiOzaki/Tengentoppa-pretrain-v1.0
Description:
Tengentoppa-pretrain-v1.0 is a synthetic dataset for Japanese pretraining. It contains high-quality Japanese text generated from diverse personas. The dataset has approximately 100,000 samples, and each sample includes the text itself along with its topic, text type, and quality score. Generation proceeds through several steps (persona extraction, plan generation, text generation, quality evaluation, and a diversity check), which are intended to keep the texts both varied and high in quality.
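Because each sample carries a quality score, a common first step is to keep only the samples above some threshold. The following is a minimal sketch of that idea, not the dataset author's tooling. The field names (`text`, `topic`, `text_type`, `quality_score`), the score scale, and the sample records are all assumptions inferred from the description; check the actual column names on the dataset card before relying on them.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical records. The field names and the score scale are assumptions
# based on the description (text, topic, text type, quality score); they are
# not taken from the real dataset schema.
SAMPLE_RECORDS = [
    {"text": "今日は天気がいいですね。", "topic": "weather",
     "text_type": "dialogue", "quality_score": 4.5},
    {"text": "量子計算の基礎について。", "topic": "science",
     "text_type": "essay", "quality_score": 2.0},
    {"text": "京都の歴史を紹介します。", "topic": "history",
     "text_type": "article", "quality_score": 3.8},
]


def filter_by_quality(records: Iterable[Dict], min_score: float) -> Iterator[Dict]:
    """Yield only the records whose quality score is at least min_score."""
    for record in records:
        if record["quality_score"] >= min_score:
            yield record


# Keep samples scoring 3.5 or higher and list their topics.
kept = list(filter_by_quality(SAMPLE_RECORDS, min_score=3.5))
print([record["topic"] for record in kept])
```

The function accepts any iterable of dictionaries, so the same filter would apply to rows streamed from the real dataset once the assumed field names are replaced with the actual ones.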
Provider:
DeL-TaiseiOzaki



