
COIG-Writer

ModelScope community · Updated 2025-12-05 · Indexed 2025-10-11
Download link:
https://modelscope.cn/datasets/m-a-p/COIG-Writer
This repository contains the dataset and supplementary materials for the paper **COIG-Writer: A High-Quality Dataset for Chinese Creative Writing with Thought Processes**.

---

## 🔔 Introduction

**COIG-Writer** is a large-scale Chinese creative writing dataset that connects final literary works with their underlying reasoning processes. Each sample includes a **reverse-engineered writing prompt**, a **step-by-step reasoning trace**, and the **final article**. This design allows researchers to explore *process-level creative reasoning* rather than only text-level output.

The dataset contains **1,665** triplets covering **51** genres, including novels, essays, scripts, and poetry. Data were curated through systematic reverse engineering and manually validated to ensure narrative coherence and creative depth.

**Key Insights**

- Process supervision stabilizes generation when balanced with general data at a **1:12** ratio.
- Creative ability is **language-specific**, showing limited cross-lingual transfer.
- Higher lexical diversity (**TTR**, type-token ratio) does not necessarily imply higher creativity.
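The third insight refers to the type-token ratio, i.e. the number of distinct tokens divided by the total number of tokens. The sketch below is only an illustration of that definition: it assumes character-level tokens with whitespace and punctuation dropped, which may differ from the tokenization used in the paper, and `char_ttr` is a hypothetical helper name, not part of the dataset tooling.

```python
import unicodedata


def char_ttr(text: str) -> float:
    """Type-token ratio over characters, skipping whitespace and punctuation.

    Assumption: every remaining character counts as one token. A word-level
    TTR would need a Chinese word segmenter, which is not used here.
    """
    tokens = [
        ch for ch in text
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    ]
    if not tokens:
        return 0.0
    # Distinct characters (types) divided by all characters (tokens).
    return len(set(tokens)) / len(tokens)


# One line of the sample poem from the dataset card.
sample = "山花像一万年前一样盛开"
print(round(char_ttr(sample), 3))
```

Because TTR tends to fall as texts get longer, comparing values across articles of very different lengths would need some form of length normalization.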
---

## 🏆 Dataset Statistics

| Metric | Value |
|---|---|
| Total Triplets | 1,665 |
| Total Genres | 51 |
| Prompt Length (min / avg / max) | 30 / 283 / 2,642 |
| Reasoning Length (min / avg / max) | 252 / 1,089 / 4,094 |
| Article Length (min / avg / max) | 12 / 2,214 / 31,071 |

**Genre Distribution**

| Category | Count | Share |
|---|---:|---:|
| Communication Writing | 481 | 28.9% |
| Novel | 467 | 28.0% |
| Non-fiction | 243 | 14.6% |
| Functional Writing | 221 | 13.3% |
| Poetry | 128 | 7.7% |
| Funny Literature | 68 | 4.1% |
| Script | 57 | 3.4% |

---

## ⚙️ Installation & Usage

To download and explore the dataset:

```bash
git clone https://github.com/Juno6222222/COIG-Writer.git
cd COIG-Writer
```

You can also load it directly from Hugging Face:

```python
from datasets import load_dataset

# Load the dataset from the Hugging Face Hub and print the first training record.
dataset = load_dataset("m-a-p/COIG-Writer")
print(dataset["train"][0])
```

---

## 📦 Dataset Format

Each record in **COIG-Writer** follows the triplet structure below:

```json
{
  "id": "d6544671-d025-4832-a5f9-7fe261a290bf",
  "query": "请帮我以现代诗的形式创作一首关于山居与人间生活的对比作品,要求:\n\n使用山花、蜜蜂、蝴蝶等自然意象\n\n包含'寻找不能耕种的岩石'的哲学隐喻\n\n展现黄昏下山时的人间烟火气\n\n结尾通过'山中寒夜'与'人间空缺'形成双重对照\n\n语言简洁但富有画面感,适当使用通感(如蜜蜂的轰鸣)\n\n分三至四节,每节4-6行。",
  "thought": "主题是山居和人间生活的对比。首先,需要明确要求,确保每个点都覆盖到。问题提到了自然意象,比如山花、蜜蜂、蝴蝶,还有哲学隐喻“寻找不能耕种的岩石”,黄昏下山的人间烟火气,以及结尾的双重对照“山中寒夜”和“人间空缺”。语言要简洁但有画面感,用通感,比如蜜蜂的轰鸣。分三到四节,每节4-6行。……",
  "answer": "山中\n有时需要去山中,尤其在春天。\n离开人类,去寻找草木和鸟,\n离开人的平地,\n寻找不能耕种、不能居住的岩石。\n\n山花像一万年前一样盛开,\n仿佛不曾中断过。\n树随意生长,更接近太阳。\n寂静之中,一只蜜蜂过去,\n发出直升机般的轰鸣。\n蝴蝶过去的时候保持沉默。\n\n在黄昏走下山,走近人间。\n千万种声音从那里升起,\n聚成无法散去的云。\n\n在人间,人的居所等待着他,\n他没有做完的事,\n几个需要他的人。\n他需要填补自己留下的小小空缺。\n而山中的夜晚是冷的。\n"
}
```

**Field Descriptions**

- `id`: A unique UUID identifying each sample.
- `query`: The reverse-engineered creative writing instruction that motivates the piece.
- `thought`: The step-by-step reasoning process describing the writer’s conceptual and structural planning.
- `answer`: The final completed literary text corresponding to the reasoning trace.
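The field list above can be checked mechanically before using a record. What follows is a minimal sketch, assuming each record is a plain dict with string values as in the example record. `validate_record` and `field_lengths` are hypothetical helpers introduced for illustration, the `example` dict is an abbreviated stand-in for a real record, and lengths are counted in characters, which may not be the unit used in the statistics table.

```python
import uuid

REQUIRED_FIELDS = ("id", "query", "thought", "answer")


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    # The dataset card describes `id` as a UUID, so try to parse it as one.
    record_id = record.get("id")
    if isinstance(record_id, str) and record_id.strip():
        try:
            uuid.UUID(record_id)
        except ValueError:
            problems.append("id is not a valid UUID")
    return problems


def field_lengths(record: dict) -> dict[str, int]:
    """Character counts of the three text fields (assumed unit: characters)."""
    return {name: len(record[name]) for name in ("query", "thought", "answer")}


# Abbreviated stand-in for a real record; only the id is copied in full.
example = {
    "id": "d6544671-d025-4832-a5f9-7fe261a290bf",
    "query": "请帮我以现代诗的形式创作一首关于山居与人间生活的对比作品",
    "thought": "主题是山居和人间生活的对比。",
    "answer": "山中\n有时需要去山中",
}
print(validate_record(example))
print(field_lengths(example))
```

The same two functions could be applied to every row returned by `load_dataset` to spot malformed samples or to recompute per-field length statistics.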
Each triplet (`query`, `thought`, `answer`) captures **prompt-level intent**, **reasoning-level planning**, and **output-level realization**, enabling detailed research into *process-supervised creative generation*.

---

## 📜 License

**COIG-Writer** is distributed under the [Open Data Commons Attribution License (ODC-BY)](https://opendatacommons.org/licenses/by/). When using this dataset, please provide proper attribution.

---

## 📚 Citation

```bibtex
@misc{coigwriter2025,
  title         = {COIG-Writer: A High-Quality Dataset for Chinese Creative Writing with Thought Processes},
  author        = {Yunwen Li and Shuangshuang Ying and Xingwei Qu and Xin Li and Sheng Jin and Minghao Liu and Zhoufutu Wen and Tianyu Zheng and Xeron Du and Qiguang Chen and Jiajun Shi and Wangchunshu Zhou and Jiazhan Feng and Wanjun Zhong and Chenghua Lin and Eli Zhang},
  year          = {2025},
  eprint        = {2510.14763},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2510.14763}
}
```

Provider:
maas
Created:
2025-08-28