alexander-llm-wiki-zh-to-en

Name: alexander-llm-wiki-zh-to-en
Creator: maas
Published: 2025-12-12 19:19:10
License: 暂无描述

魔搭社区2025-12-12 更新2025-12-06 收录

下载链接：

https://modelscope.cn/datasets/siliconflow/alexander-llm-wiki-zh-to-en

下载链接

链接失效反馈

官方服务：

资源简介：

# ChineseWikipedia_ZhEn_Translate_20231101 A Chinese–English translation dataset constructed from Chinese Wikipedia articles (snapshot 2023-11-01). The full cleaned and deduplicated corpus contains 1,384,748 samples. ## Source This dataset is derived from the Hugging Face dataset: - wikimedia / wikipedia (configuration: 20231101.zh) https://huggingface.co/datasets/wikimedia/wikipedia All textual content in the original dataset comes from Chinese Wikipedia and is published under the Wikipedia content licenses (CC-BY-SA 3.0 + GFDL). ## Description Based on the original 20231101.zh split, we perform the following processing steps: - Remove duplicates using the pair (title, text) on the full corpus - Clean and normalize article text where appropriate - Construct a Chinese→English translation-style user message of the form: ```你是一名专业的、以英语为母语的译者和编辑，专门处理维基百科风格的条目。将给定的中文维基百科内容翻译成清晰、地道且中性的英文，使之适合用于英文维基百科。\n条目标题：{title}\n条目内容：{text}``` - Reformat the data into OpenAI Batch–compatible JSONL, with each line containing a `/v1/chat/completions` request body (including a `custom_id`, `method`, `url`, and `body.messages` with a single user role) This dataset is intended for benchmarking and evaluating Chinese–English translation models on Wikipedia-style encyclopedic articles. ## Token Length Statistics | Tokenizer | Mean | P50 | P75 | P90 | P95 | P99 | |------------------------|---------------------|-----|-----|------|----------|----------| | DeepSeek-V3.2 | 559.8099935872808 | 222 | 493 | 1176 | 2039 | 5270 | | Kimi-K2-Thinking | 613.4831102843261 | 226 | 534 | 1312 | 2274.65 | 5959.53 | | MiniMax-M2 | 550.7587626051816 | 217 | 485 | 1162 | 2014 | 5208 | | GLM-4.6 | 630.1708419149188 | 240 | 551 | 1336 | 2314 | 6025 | | Qwen3-235B-Thinking | 624.9581613405471 | 247 | 549 | 1310 | 2269 | 5869 | License All **textual content** in this dataset is derived from Chinese Wikipedia and therefore remains under the original Wikipedia licenses: - Creative Commons Attribution-ShareAlike 3.0 (CC-BY-SA 3.0) - GNU Free Documentation License (GFDL)

# ChineseWikipedia_ZhEn_Translate_20231101 本数据集为基于2023年11月1日快照的中文维基百科（Chinese Wikipedia）条目构建的中英翻译数据集。经清洗与去重后的完整语料库共包含1,384,748条样本。 ## 数据集来源本数据集源自以下Hugging Face（Hugging Face）数据集： - wikimedia / wikipedia（配置项：20231101.zh） https://huggingface.co/datasets/wikimedia/wikipedia 原始数据集中的所有文本内容均来自中文维基百科（Chinese Wikipedia），并遵循维基百科内容许可协议（CC-BY-SA 3.0 + GFDL）发布。 ## 数据集说明基于原始的20231101.zh数据集划分，我们执行了如下处理流程： - 针对完整语料库，以（标题，文本）为键进行去重处理 - 按需对条目文本进行清洗与标准化操作 - 构建符合中英翻译任务格式的用户提示词，具体形式如下：你是一名专业的、以英语为母语的译者和编辑，专门处理维基百科风格的条目。将给定的中文维基百科内容翻译成清晰、地道且中性的英文，使之适合用于英文维基百科。条目标题：{title} 条目内容：{text} - 将数据重构为适配OpenAI批量处理（OpenAI Batch）的JSONL格式，每行包含一个`/v1/chat/completions`请求体（包含`custom_id`、`method`、`url`以及仅含单条用户角色消息的`body.messages`字段）本数据集旨在针对维基百科风格的百科条目，对中英翻译模型进行基准测试与性能评估。 ## 分词长度统计 | 分词器（Tokenizer） | 均值（Mean） | P50 | P75 | P90 | P95 | P99 | |------------------------|---------------------|-----|-----|------|----------|----------| | DeepSeek-V3.2 | 559.8099935872808 | 222 | 493 | 1176 | 2039 | 5270 | | Kimi-K2-Thinking | 613.4831102843261 | 226 | 534 | 1312 | 2274.65 | 5959.53 | | MiniMax-M2 | 550.7587626051816 | 217 | 485 | 1162 | 2014 | 5208 | | GLM-4.6 | 630.1708419149188 | 240 | 551 | 1336 | 2314 | 6025 | | Qwen3-235B-Thinking | 624.9581613405471 | 247 | 549 | 1310 | 2269 | 5869 | ## 许可协议本数据集中的所有文本内容均源自中文维基百科（Chinese Wikipedia），因此沿用原始维基百科的许可协议： - 知识共享署名-相同方式共享3.0协议（Creative Commons Attribution-ShareAlike 3.0，CC-BY-SA 3.0） - GNU自由文档许可证（GNU Free Documentation License，GFDL）

提供机构：

maas

创建时间：

2025-12-05

搜集汇总

数据集介绍