alexander-llm-public-domain-book-summarization

Name: alexander-llm-public-domain-book-summarization
Creator: maas
Published: 2026-01-10 00:40:43
License: 暂无描述

魔搭社区2026-01-10 更新2026-01-10 收录

下载链接：

https://modelscope.cn/datasets/siliconflow/alexander-llm-public-domain-book-summarization

下载链接

链接失效反馈

官方服务：

资源简介：

# alexander-llm-public-domain-book-summarization A long-context English book summarization dataset sampled from public-domain books. Using the DeepSeek-V3.2 tokenizer, we sampled 10,857 documents whose token length falls in [64k, 100k] tokens. ## Source This dataset is derived from the Hugging Face dataset: - common-pile/project_gutenberg_filtered https://huggingface.co/datasets/common-pile/project_gutenberg_filtered The upstream dataset is built from Project Gutenberg books that are marked as public domain in its metadata (and includes PG19 items). Note: licensing metadata can be imperfect; users should verify copyright status in their jurisdiction. ## Description This dataset is intended for benchmarking and evaluating LLMs on long-context summarization of book-length public-domain text. Processing steps: - Load the upstream corpus. - Compute token length using the DeepSeek-V3.2 tokenizer. - Select documents with token length in [64k, 100k]. - Construct a summarization-style user prompt: ```text Summarize the following content. title: {title} text: {text} ``` - Reformat into OpenAI Batch–compatible JSONL: each line is a /v1/chat/completions request body containing: - custom_id (UUID) - method - url - body.messages (single user role) ## Token Length Statistics | Tokenizer | Mean | P50 | P75 | P90 | P95 | P99 | |---|---:|---:|---:|---:|---:|---:| | DeepSeek-V3.2 | 82792.4606244819 | 82360.0 | 91601.0 | 97987.4 | 100142.4 | 102019.64 | | Kimi-K2-Thinking | 83348.66399557889 | 82892.0 | 92225.0 | 98693.0 | 100963.0 | 102956.40000000001 | | MiniMax-M2 | 80860.96315740996 | 80409.0 | 89472.0 | 95741.2 | 97904.79999999999 | 99926.52 | | GLM-4.6 | 82815.61950815142 | 82406.0 | 91642.0 | 98117.59999999999 | 100264.79999999999 | 102309.2 | | Qwen3-235B-Thinking | 83323.6754167818 | 82784.0 | 92142.0 | 98806.2 | 100970.2 | 103255.48000000001 | ## License ### Upstream text status (Public Domain) The upstream dataset indicates the underlying books are public domain (per its metadata). However, public-domain status can vary by country/region, and metadata can be incorrect. Users are responsible for checking copyright status and compliance in their jurisdiction. ### Acknowledgments, References, and Trademark notice (Project Gutenberg) Data were sampled from https://huggingface.co/datasets/common-pile/project_gutenberg_filtered. “Project Gutenberg” is a trademark. This dataset does NOT use the name “Project Gutenberg” as the dataset title/brand or for marketing purposes. If you redistribute or advertise copies of ebooks while using the name “Project Gutenberg” on the distribution medium or in promotional materials, you may be subject to additional trademark-related restrictions (including requirements to distribute verbatim copies and other conditions). A safer approach for derived datasets is to avoid using the Project Gutenberg trademark as a product name or branding and to keep any mentions limited to source attribution/references.

# alexander-llm-public-domain-book-summarization 本数据集为从公有领域书籍中采样得到的长上下文英文书籍摘要数据集。我们采用DeepSeek-V3.2分词器（DeepSeek-V3.2），从Token长度介于64k至100k之间的文档中采样得到10857条样本。 ## 来源本数据集衍生自Hugging Face平台上的如下数据集： - common-pile/project_gutenberg_filtered https://huggingface.co/datasets/common-pile/project_gutenberg_filtered 上游数据集源自元数据中标注为公有领域的Project Gutenberg（古腾堡计划）书籍（包含PG19相关条目）。注意：许可元数据可能存在不完善之处，使用者需自行验证其所在司法辖区内的版权状态。 ## 数据集说明本数据集旨在用于基准测试与评估大语言模型（LLM）对书籍长度的公有领域文本开展长上下文摘要的能力。 ### 处理流程 1. 加载上游语料库； 2. 采用DeepSeek-V3.2分词器计算Token长度； 3. 筛选Token长度介于64k至100k之间的文档； 4. 构建摘要风格的用户提示词： text Summarize the following content. title: {title} text: {text} 5. 将数据重构为兼容OpenAI批量处理（OpenAI Batch）的JSONL格式：每行均为一条/v1/chat/completions请求体，包含以下字段： - custom_id（通用唯一识别码UUID） - method（请求方法） - url（请求地址） - body.messages（仅包含单条用户角色消息） ## Token长度统计 | 分词器 | 均值 | P50 | P75 | P90 | P95 | P99 | |---|---:|---:|---:|---:|---:|---:| | DeepSeek-V3.2 | 82792.4606244819 | 82360.0 | 91601.0 | 97987.4 | 100142.4 | 102019.64 | | Kimi-K2-Thinking | 83348.66399557889 | 82892.0 | 92225.0 | 98693.0 | 100963.0 | 102956.40000000001 | | MiniMax-M2 | 80860.96315740996 | 80409.0 | 89472.0 | 95741.2 | 97904.79999999999 | 99926.52 | | GLM-4.6 | 82815.61950815142 | 82406.0 | 91642.0 | 98117.59999999999 | 100264.79999999999 | 102309.2 | | Qwen3-235B-Thinking | 83323.6754167818 | 82784.0 | 92142.0 | 98806.2 | 100970.2 | 103255.48000000001 | ## 许可声明 ### 上游文本版权状态（公有领域）上游数据集显示其收录的书籍符合公有领域标注（基于其元数据）。但公有领域的认定标准会因国家/地区而异，且元数据可能存在错误。使用者需自行核查其所在司法辖区内的版权状态与合规要求。 ### 致谢、参考文献与商标声明（Project Gutenberg，古腾堡计划）本数据集采样自https://huggingface.co/datasets/common-pile/project_gutenberg_filtered。 “Project Gutenberg（古腾堡计划）”为注册商标。本数据集未将“Project Gutenberg”用作数据集名称、品牌或用于营销目的。若您在分发介质或宣传材料中使用“Project Gutenberg”名称来重新分发或宣传电子书副本，可能需遵守额外的商标相关限制（包括要求分发完全一致的副本及其他条款）。对于衍生数据集而言，更稳妥的做法是避免将Project Gutenberg商标用作产品名称或品牌，仅将提及内容限定为来源标注与参考文献范畴。

提供机构：

maas

创建时间：

2026-01-05

5,000+

优质数据集

54 个

任务类型

进入经典数据集