
abokbot/wikipedia-first-paragraph

Hugging Face | Updated 2023-06-04 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/abokbot/wikipedia-first-paragraph
Resource description:
---
language:
- en
tags:
- wikipedia
---

# Dataset Description

This dataset contains the first paragraph of cleaned Wikipedia articles in English.

It was obtained by transforming the [Wikipedia](https://huggingface.co/datasets/wikipedia) "20220301.en" dataset as follows:

```python
from datasets import load_dataset

# Load the train split of the "20220301.en" Wikipedia configuration
dataset = load_dataset("wikipedia", "20220301.en")["train"]

def get_first_paragraph(example):
    # Keep only the text that precedes the first blank line
    example["text"] = example['text'].split('\n\n')[0]
    return example

# Apply the function to every article
dataset = dataset.map(get_first_paragraph)
```

# Why use this dataset?

The original English Wikipedia dataset is over 20GB. It takes 20 minutes to load on a Google Colab notebook, and running computations on it can be costly.

If your use case mostly needs the information in the first paragraph of a Wikipedia article (the paragraph with the most important information), this 'wikipedia-first-paragraph' dataset is for you.

Its size is 1.39GB and it takes 5 minutes to load on a Google Colab notebook.

# How to load dataset

You can load it by running:

```python
from datasets import load_dataset

load_dataset("abokbot/wikipedia-first-paragraph")
```

# Dataset Structure

An example looks as follows:

```
{
  'id': '12',
  'url': 'https://en.wikipedia.org/wiki/Anarchism',
  'title': 'Anarchism',
  'text': 'Anarchism is a political philosophy and movement that is sceptical of authority and rejects \
all involuntary, coercive forms of hierarchy. Anarchism calls for the abolition of the state, \
which it holds to be unnecessary, undesirable, and harmful. As a historically left-wing movement, \
placed on the farthest left of the political spectrum, it is usually described alongside communalism \
and libertarian Marxism as the libertarian wing (libertarian socialism) of the socialist movement, and \
has a strong historical association with anti-capitalism and socialism.'
}
```
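As a quick reference for the record structure shown above, the layout can be written down as a typed dictionary. This is only a sketch: the `WikiRecord` name and the `is_valid_record` helper are hypothetical and are not part of the dataset or of the `datasets` library. The four field names and their string types are taken from the example record, and the sample text is shortened.

```python
from typing import TypedDict


class WikiRecord(TypedDict):
    """Hypothetical type mirroring the four fields of the example record."""
    id: str
    url: str
    title: str
    text: str


def is_valid_record(record: dict) -> bool:
    """Return True if `record` has exactly the four expected string fields."""
    expected = {"id", "url", "title", "text"}
    return set(record) == expected and all(
        isinstance(value, str) for value in record.values()
    )


# Sample based on the example record; the text is shortened for brevity.
sample: WikiRecord = {
    "id": "12",
    "url": "https://en.wikipedia.org/wiki/Anarchism",
    "title": "Anarchism",
    "text": "Anarchism is a political philosophy and movement ...",
}

print(is_valid_record(sample))
```

Note that `id` is a string in the example record, not an integer, so a check like this would reject `{"id": 12, ...}`.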
Provider:
abokbot
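Summary of original information

Dataset description

This dataset contains the first paragraph of each cleaned English Wikipedia article. It is derived from the "20220301.en" version of the Wikipedia dataset through the following steps:

  1. Load the original Wikipedia dataset.
  2. Define a function, get_first_paragraph, that extracts the first paragraph of each article.
  3. Apply the function to the dataset to produce the new dataset.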
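These three steps can be tried without downloading anything. The sketch below applies the same `split('\n\n')[0]` logic to a small in-memory list of records. The sample records are made up for illustration, and a plain list comprehension stands in for the `Dataset.map` call used in the original transformation.

```python
def get_first_paragraph(example: dict) -> dict:
    """Keep only the text before the first blank line; other fields are untouched."""
    example = dict(example)  # copy so the input record is not mutated
    example["text"] = example["text"].split("\n\n")[0]
    return example


# Made-up records standing in for rows of the full Wikipedia dataset.
articles = [
    {"id": "1", "title": "Alpha", "text": "Intro paragraph.\n\nSecond paragraph.\n\nThird."},
    {"id": "2", "title": "Beta", "text": "Only one paragraph."},
    {"id": "3", "title": "Gamma", "text": "\n\nText after a leading blank line."},
]

# Stand-in for dataset.map(get_first_paragraph)
first_paragraphs = [get_first_paragraph(article) for article in articles]

for row in first_paragraphs:
    print(row["id"], repr(row["text"]))
```

The third record illustrates an edge case of this logic: if a text begins with a blank line, the first "paragraph" is an empty string, so downstream code may want to check for empty `text` values.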

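Dataset purpose

The original English Wikipedia dataset is over 20GB, which makes loading it and running computations on it costly. This dataset keeps only the first paragraph of each article: it is 1.39GB, takes about 5 minutes to load, and suits use cases that need quick access to an article's main information.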
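To put these figures in proportion, the arithmetic can be spelled out. The sizes are the ones quoted here, and the load times (20 minutes for the original, 5 minutes for this dataset, both on a Google Colab notebook) come from the dataset card. Because the original is described as "over 20GB", the size ratio computed below is an upper bound. The variable names are only illustrative.

```python
original_size_gb = 20.0   # "over 20GB", so this is a lower bound
subset_size_gb = 1.39
original_load_min = 20    # minutes, as quoted in the dataset card
subset_load_min = 5       # minutes

# Fraction of the original size that the first-paragraph dataset occupies
size_ratio = subset_size_gb / original_size_gb

# How many times faster the smaller dataset loads
load_speedup = original_load_min / subset_load_min

print(f"size: at most {size_ratio:.2%} of the original")
print(f"load time: about {load_speedup:.0f}x faster")
```

In other words, the first-paragraph dataset is under a tenth of the original size and loads roughly four times faster by the quoted figures.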

How to load the dataset

The dataset can be loaded with the following code:

    from datasets import load_dataset

    load_dataset("abokbot/wikipedia-first-paragraph")

Dataset structure

An example record looks as follows:

    {
      "id": "12",
      "url": "https://en.wikipedia.org/wiki/Anarchism",
      "title": "Anarchism",
      "text": "Anarchism is a political philosophy and movement that is sceptical of authority and rejects all involuntary, coercive forms of hierarchy. Anarchism calls for the abolition of the state, which it holds to be unnecessary, undesirable, and harmful. As a historically left-wing movement, placed on the farthest left of the political spectrum, it is usually described alongside communalism and libertarian Marxism as the libertarian wing (libertarian socialism) of the socialist movement, and has a strong historical association with anti-capitalism and socialism."
    }
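A common next step with records of this shape is looking up a first paragraph by article title. The sketch below builds a plain dictionary index over two in-memory records. `build_title_index` is a hypothetical helper, the second record is made up, and the first record's text is shortened. With the real dataset, the records would come from iterating over the loaded split.

```python
def build_title_index(records):
    """Map each article title to its first-paragraph text."""
    return {record["title"]: record["text"] for record in records}


records = [
    {
        "id": "12",
        "url": "https://en.wikipedia.org/wiki/Anarchism",
        "title": "Anarchism",
        "text": "Anarchism is a political philosophy and movement ...",  # shortened
    },
    {
        # Made-up record for illustration only
        "id": "0",
        "url": "https://example.org/wiki/Example",
        "title": "Example",
        "text": "A made-up record.",
    },
]

index = build_title_index(records)
print(index["Example"])
```

If two records shared a title, the later one would overwrite the earlier one in this index, so keying on `id` may be safer when uniqueness matters.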
