siliconflow/alexander-llm-wiki-zh-title-to-article
收藏Hugging Face2025-12-05 更新2025-12-20 收录
下载链接:
https://hf-mirror.com/datasets/siliconflow/alexander-llm-wiki-zh-title-to-article
下载链接
链接失效反馈官方服务:
资源简介:
---
license: cc-by-sa-3.0
task_categories:
- text-generation
language:
- zh
size_categories:
- 1M<n<10M
---
# ChineseWikipedia_Title2Article_20231101
A Chinese dataset for generating full Wikipedia-style article content from only a given article title (snapshot 2023-11-01).
The full cleaned and deduplicated corpus contains 1,384,748 samples.
## Source
This dataset is constructed from the Chinese Wikipedia dump:
- wikimedia / wikipedia (configuration: 20231101.zh)
https://huggingface.co/datasets/wikimedia/wikipedia
All textual content comes from Chinese Wikipedia and is published under Wikipedia content licenses (CC-BY-SA 3.0 + GFDL).
## Description
This dataset is designed to evaluate and benchmark LLMs on encyclopedic article generation.
The model receives only a Wikipedia article title and is expected to produce a complete, neutral, well-structured Chinese Wikipedia–style article.
Processing steps:
- Extract all article titles from the 20231101.zh corpus.
- Remove duplicate titles.
- Construct a generation-style user prompt:
```你是一名资深的中文维基百科编辑,熟悉维基百科的写作规范。\n请根据我提供的条目标题,撰写维基百科条目内容。\n条目标题:{title}```
- Reformat the data into OpenAI Batch–compatible JSONL, with each line containing a `/v1/chat/completions` request body (including a `custom_id`, `method`, `url`, and `body.messages` with a single user role)
## Token Length Statistics
Tokenizer | Mean | P50 | P75 | P90 | P95 | P99
-----------------------|---------------------|-----|-----|-----|-----|------
DeepSeek-V3.2 | 42.10771851629322 | 41 | 43 | 46 | 48 | 52
Kimi-K2-Thinking | 37.55160000231089 | 37 | 39 | 42 | 44 | 50
MiniMax-M2 | 41.117435085661796 | 40 | 42 | 45 | 47 | 51
GLM-4.6 | 43.72347170748757 | 43 | 45 | 48 | 50 | 56
Qwen3-235B-Thinking | 43.4858992394284 | 43 | 45 | 48 | 50 | 55
## License
All textual content is derived from Chinese Wikipedia and remains under:
- Creative Commons Attribution-ShareAlike 3.0 (CC-BY-SA 3.0)
- GNU Free Documentation License (GFDL)
提供机构:
siliconflow



