alexander-llm-wiki-zh-article-to-title

Name: alexander-llm-wiki-zh-article-to-title
Creator: maas
Published: 2025-12-05 16:58:00
License: 暂无描述

魔搭社区2025-12-05 更新2025-12-06 收录

下载链接：

https://modelscope.cn/datasets/siliconflow/alexander-llm-wiki-zh-article-to-title

下载链接

链接失效反馈

官方服务：

资源简介：

A Chinese dataset for generating Wikipedia-style article titles from full article content (snapshot 2023-11-01). The full cleaned and deduplicated corpus contains 1,384,748 samples. ## Source This dataset is derived from the Chinese Wikipedia dump: - wikimedia / wikipedia (configuration: 20231101.zh) https://huggingface.co/datasets/wikimedia/wikipedia All textual content is originally from Chinese Wikipedia and is licensed under CC-BY-SA 3.0 + GFDL. ## Description This dataset is intended to evaluate and benchmark LLMs on the task of title generation from article content. Each sample consists of a full Chinese Wikipedia–style article content (or sufficiently long excerpt) as input; the model is expected to output a concise, accurate, and Wikipedia-style article title in Chinese. Processing steps include: - Extract full article content from the 20231101.zh corpus. - Remove duplicates by content (or by title + content) to avoid repeated entries. - Construct a generation-style prompt of the form: ```你是一名资深的中文维基百科编辑，熟悉维基百科的写作规范。\n请根据我提供的条目内容，生成维基百科条目标题。\n条目内容：{text}``` - Reformat the data into OpenAI Batch–compatible JSONL format, where each line is a POST /v1/chat/completions with a single user message. - The dataset includes only prompts (i.e. content → title generation); evaluation or model outputs are external. This dataset supports research in title generation, summarization-to-title, LLM comprehension and condensation, and consistency between content and title. ## Token Length Statistics (prompt side) Tokenizer | Mean | P50 | P75 | P90 | P95 | P99 -----------------------|--------------------|------|------|-------|---------|---------- DeepSeek-V3.2 | 536.8321449101209 | 199 | 470 | 1153 | 2016 | 5246.53 Kimi-K2-Thinking | 589.060510648869 | 202 | 509 | 1287 | 2250 | 5935.53 MiniMax-M2 | 529.7712890720911 | 196 | 465 | 1142 | 1993 | 5187 GLM-4.6 | 605.57643556806 | 216 | 526 | 1312 | 2290 | 6001.53 Qwen3-235B-Thinking | 596.6013289060537 | 219 | 521 | 1282 | 2241 | 5839.53 ## License All textual content in this dataset is derived from Chinese Wikipedia and thus remains under: - Creative Commons Attribution-ShareAlike 3.0 (CC-BY-SA 3.0) - GNU Free Documentation License (GFDL)

本数据集为面向中文场景的专用数据集，用于从完整文章内容生成维基百科风格的文章标题（数据集快照时间为2023年11月1日）。经清洗、去重后的完整语料库共包含1,384,748条样本。 ## 数据集来源本数据集源自中文维基百科快照： - 维基媒体（wikimedia）/维基百科（wikipedia）（配置版本：20231101.zh）数据集链接：https://huggingface.co/datasets/wikimedia/wikipedia 所有文本内容均源自中文维基百科，授权协议为CC-BY-SA 3.0 + GNU自由文档许可证（GFDL）。 ## 数据集说明本数据集旨在用于评估大语言模型（LLM）在文章内容生成标题任务上的性能与基准表现。每条样本均以完整的中文维基百科风格文章内容（或足够长度的节选内容）作为输入，要求模型输出简洁准确、符合维基百科规范的中文文章标题。数据集处理流程如下： 1. 从20231101.zh语料库中提取完整文章内容； 2. 基于内容（或标题+内容）进行去重，避免重复条目； 3. 构建生成式提示词，格式为：你是一名资深的中文维基百科编辑，熟悉维基百科的写作规范。请根据我提供的条目内容，生成维基百科条目标题。条目内容：{text} 4. 将数据重新格式化为适配OpenAI Batch的JSONL格式，每行对应一条POST请求至`/v1/chat/completions`接口，仅包含单条用户消息。 5. 本数据集仅包含提示词（即内容到标题的生成任务），评估过程及模型输出均为外部环节。本数据集可用于文章标题生成、摘要转标题、大语言模型理解与凝练能力、内容与标题一致性等方向的研究。 ## 提示词侧Token长度统计 | 分词器 | 均值 | P50 | P75 | P90 | P95 | P99 | |-----------------------|---------------------|------|------|-------|---------|-----------| | DeepSeek-V3.2 | 536.8321449101209 | 199 | 470 | 1153 | 2016 | 5246.53 | | Kimi-K2-Thinking | 589.060510648869 | 202 | 509 | 1287 | 2250 | 5935.53 | | MiniMax-M2 | 529.7712890720911 | 196 | 465 | 1142 | 1993 | 5187 | | GLM-4.6 | 605.57643556806 | 216 | 526 | 1312 | 2290 | 6001.53 | | Qwen3-235B-Thinking | 596.6013289060537 | 219 | 521 | 1282 | 2241 | 5839.53 | ## 授权协议本数据集内所有文本内容均源自中文维基百科，因此沿用以下授权协议： - 知识共享署名-相同方式共享3.0（CC-BY-SA 3.0） - GNU自由文档许可证（GFDL）

提供机构：

maas

创建时间：

2025-12-05

搜集汇总

数据集介绍