five

siliconflow/alexander-llm-wiki-zh-title-to-article

收藏
Hugging Face2025-12-05 更新2025-12-20 收录
下载链接:
https://hf-mirror.com/datasets/siliconflow/alexander-llm-wiki-zh-title-to-article
下载链接
链接失效反馈
官方服务:
资源简介:
--- license: cc-by-sa-3.0 task_categories: - text-generation language: - zh size_categories: - 1M<n<10M --- # ChineseWikipedia_Title2Article_20231101 A Chinese dataset for generating full Wikipedia-style article content from only a given article title (snapshot 2023-11-01). The full cleaned and deduplicated corpus contains 1,384,748 samples. ## Source This dataset is constructed from the Chinese Wikipedia dump: - wikimedia / wikipedia (configuration: 20231101.zh) https://huggingface.co/datasets/wikimedia/wikipedia All textual content comes from Chinese Wikipedia and is published under Wikipedia content licenses (CC-BY-SA 3.0 + GFDL). ## Description This dataset is designed to evaluate and benchmark LLMs on encyclopedic article generation. The model receives only a Wikipedia article title and is expected to produce a complete, neutral, well-structured Chinese Wikipedia–style article. Processing steps: - Extract all article titles from the 20231101.zh corpus. - Remove duplicate titles. - Construct a generation-style user prompt: ```你是一名资深的中文维基百科编辑,熟悉维基百科的写作规范。\n请根据我提供的条目标题,撰写维基百科条目内容。\n条目标题:{title}``` - Reformat the data into OpenAI Batch–compatible JSONL, with each line containing a `/v1/chat/completions` request body (including a `custom_id`, `method`, `url`, and `body.messages` with a single user role) ## Token Length Statistics Tokenizer | Mean | P50 | P75 | P90 | P95 | P99 -----------------------|---------------------|-----|-----|-----|-----|------ DeepSeek-V3.2 | 42.10771851629322 | 41 | 43 | 46 | 48 | 52 Kimi-K2-Thinking | 37.55160000231089 | 37 | 39 | 42 | 44 | 50 MiniMax-M2 | 41.117435085661796 | 40 | 42 | 45 | 47 | 51 GLM-4.6 | 43.72347170748757 | 43 | 45 | 48 | 50 | 56 Qwen3-235B-Thinking | 43.4858992394284 | 43 | 45 | 48 | 50 | 55 ## License All textual content is derived from Chinese Wikipedia and remains under: - Creative Commons Attribution-ShareAlike 3.0 (CC-BY-SA 3.0) - GNU Free Documentation License (GFDL)
提供机构:
siliconflow
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作