abokbot/wikipedia-first-paragraph
Dataset description
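This dataset contains the cleaned first paragraph of each English Wikipedia article. It is derived from the "20220301.en" version of the Wikipedia dataset and was generated in the following steps:
- Load the original Wikipedia dataset.
- Define a function, get_first_paragraph, that extracts the first paragraph of each article.
- Apply that function to the dataset to produce the new dataset.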
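The steps above can be sketched roughly as follows. This is a minimal illustration, not the author's actual code: it assumes paragraphs are separated by a blank line, and the `records` list is a hypothetical in-memory stand-in for the real Wikipedia records.

```python
def get_first_paragraph(example):
    # Assumption: paragraphs are separated by a blank line, so the first
    # paragraph is everything before the first double newline.
    first = example["text"].strip().split("\n\n", 1)[0]
    # Return a new record so the input is left unmodified.
    return {**example, "text": first}


# Hypothetical stand-in data; the real input is the "20220301.en" Wikipedia dataset.
records = [
    {"id": "1", "title": "Example", "text": "First paragraph.\n\nSecond paragraph."},
    {"id": "2", "title": "Single", "text": "Only one paragraph."},
]

# Apply the function to every record to build the reduced dataset.
first_paragraphs = [get_first_paragraph(r) for r in records]
```

With the Hugging Face `datasets` library, a function of this shape could be passed to `Dataset.map` instead of a list comprehension.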
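Dataset uses
The original English Wikipedia dataset is over 20GB, which makes it costly to load and process. This dataset keeps only the first paragraph of each article, is 1.39GB in size, and takes about 5 minutes to load. It suits cases where the main information about an article is needed quickly.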
Loading the dataset
The dataset can be loaded with the following code:
    from datasets import load_dataset
    load_dataset("abokbot/wikipedia-first-paragraph")
Dataset structure
An example record looks like this:
    {
      "id": "12",
      "url": "https://en.wikipedia.org/wiki/Anarchism",
      "title": "Anarchism",
      "text": "Anarchism is a political philosophy and movement that is sceptical of authority and rejects all involuntary, coercive forms of hierarchy. Anarchism calls for the abolition of the state, which it holds to be unnecessary, undesirable, and harmful. As a historically left-wing movement, placed on the farthest left of the political spectrum, it is usually described alongside communalism and libertarian Marxism as the libertarian wing (libertarian socialism) of the socialist movement, and has a strong historical association with anti-capitalism and socialism."
    }
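As a rough illustration of working with a record of this shape, the sketch below types the four fields and builds a one-line preview. The `Article` type and the `summarize` helper are hypothetical names introduced here, all fields are assumed to be strings, and the sample text is shortened from the example record.

```python
from typing import TypedDict


class Article(TypedDict):
    # Assumption: every field is stored as a string.
    id: str
    url: str
    title: str
    text: str


def summarize(article: Article, max_chars: int = 80) -> str:
    """Return the title followed by the start of the first paragraph."""
    text = article["text"]
    # Truncate long paragraphs and mark the cut with an ellipsis.
    preview = text if len(text) <= max_chars else text[:max_chars].rstrip() + "..."
    return f'{article["title"]}: {preview}'


# Sample record with the text shortened for brevity.
sample: Article = {
    "id": "12",
    "url": "https://en.wikipedia.org/wiki/Anarchism",
    "title": "Anarchism",
    "text": "Anarchism is a political philosophy and movement that is sceptical of authority.",
}

line = summarize(sample, max_chars=40)
```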




