
blog-key-points

ModelScope community; updated 2025-11-27; indexed 2025-11-29
Download link:
https://modelscope.cn/datasets/ncls-p/blog-key-points
Resource description:
# Article Key Points Dataset

## Dataset Description

This dataset contains articles and their corresponding key points extracted using AI. Each entry consists of the full article text and a concise bullet-point summary highlighting the most important information from the article.

### Dataset Summary

The dataset is designed for training and evaluating models on extractive and abstractive summarization tasks. It provides pairs of full article content and human-readable key point summaries that capture the essential information from each article.

## Dataset Structure

### Data Instances

Each instance in the dataset contains:

```json
{
  "instruction": "",
  "input": "Full article content",
  "output": "Here are the key points of the article:\n* Key point 1\n* Key point 2\n* Key point 3\n..."
}
```

### Data Fields

- `instruction`: Empty field
- `input`: The full text of the article
- `output`: A bullet-point list of key points extracted from the article

### Data Splits

The dataset is provided as a single collection without predefined splits.

## Dataset Creation

### Curation Rationale

This dataset was created to provide high-quality article-summary pairs for training summarization models. The key points are focused on extracting the most important information from each article in a concise format.

### Source Data

#### Initial Data Collection and Normalization

Articles were collected from various online sources. The full text of each article was preserved in its original form.

#### Who are the source data producers?

The source data comes from publicly available articles from various online publications.

### Annotations

#### Annotation process

The key points were automatically extracted using AI. The extraction process focused on identifying the most important information from each article and presenting it in a concise, bullet-point format.

#### Who are the annotators?

The annotations (key points) were generated using AI, a large language model service. The extraction was performed using a custom tool called [Dataset Enhancer](https://github.com/ncls-p/pplx-to-dataset), which processes articles and extracts key points using the AI.

### Personal and Sensitive Information

Care was taken to avoid including personal or sensitive information in the dataset. However, as the dataset contains content from public articles, it may include names of public figures or information that was already in the public domain.

## Considerations for Using the Dataset

### Social Impact of Dataset

This dataset aims to improve summarization capabilities of language models, which can help users quickly understand the key points of long articles, saving time and improving information accessibility.

### Discussion of Biases

The dataset may contain biases present in the source articles or introduced during the key point extraction process. Users should be aware that:

1. The selection of articles may not represent all perspectives equally
2. The automated extraction of key points may emphasize certain types of information over others
3. The language model used for extraction may have its own inherent biases

### Dataset Curators

This dataset was curated using the [llm-to-blog-key-points-dataset](https://github.com/ncls-p/llm-to-blog-key-points-dataset) tool, which was created to automate the process of extracting key points from articles using AI.

### Licensing Information

This dataset is released under the [Creative Commons Attribution 4.0 International](https://creativecommons.org/licenses/by/4.0/) license.

### Citation Information

If you use this dataset in your research, please cite:

```
@misc{article-key-points-dataset,
  author = {ncls-p},
  title = {blog-key-points},
  year = {2025},
  howpublished = {Hugging Face Dataset},
  url = {https://huggingface.co/datasets/ncls-p/blog-key-points}
}
```
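
### Example: Parsing Key Points and Creating a Split

The record layout described under Data Instances and the absence of predefined splits suggest two small preprocessing steps, sketched below in Python. This is a minimal illustration, not the curator's tooling: the helper names `parse_key_points` and `split_records`, the 10% validation fraction, and the fixed seed are all assumptions, and the parser assumes every bullet starts with `* ` as in the sample instance. Real records may deviate from that format.

```python
import random

# Sample record copied from the Data Instances section of this card.
record = {
    "instruction": "",
    "input": "Full article content",
    "output": "Here are the key points of the article:\n"
              "* Key point 1\n* Key point 2\n* Key point 3",
}


def parse_key_points(output: str) -> list[str]:
    """Return the bullet items found in an `output` string.

    Assumption: each key point sits on its own line and begins with "* ".
    The introductory sentence and any other lines are ignored.
    """
    points = []
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("* "):
            points.append(line[2:].strip())
    return points


def split_records(records, val_fraction=0.1, seed=0):
    """Shuffle records deterministically and return (train, validation).

    Assumption: the dataset ships as one collection, so the split ratio
    and seed here are arbitrary illustrative choices.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


if __name__ == "__main__":
    print(parse_key_points(record["output"]))
    train, val = split_records([record] * 10)
    print(len(train), len(val))
```

Fixing the seed keeps the validation set stable across runs, which matters when comparing summarization models trained at different times.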

Provider:
maas
Created:
2025-11-18