Wikipedia-Abstract

Name: Wikipedia-Abstract
Creator: maas
Published: 2025-12-05 16:52:48
License: No description available

ModelScope community. Updated: 2025-12-05. Indexed: 2025-10-11.

Download link:

https://modelscope.cn/datasets/laion/Wikipedia-Abstract

Resource overview:

<h1 style="text-align: center;">Wikipedia Abstract</h1>

<p align="center"> <img src="Wikipedia.jpg" alt="Wikipedia X Logo" width="250" height="250" /> </p>

**Introducing Wikipedia Abstract**, a comprehensive dataset encompassing abstracts, complete articles, and a popularity score index for both widely spoken and lesser-known Wikipedia subsets. Our dedication to Wikipedia-X ensures a centralized Wikipedia dataset that undergoes regular updates and adheres to the highest standards.

A central focus of our efforts was to include exotic languages that often lack up-to-date Wikipedia dumps or may not have any dumps at all. Languages such as Hebrew, Urdu, Bengali, Aramaic, Uighur, and Polish were prioritized to ensure high-quality processed Wikipedia datasets are accessible for these languages with substantial speaker bases. This initiative aims to enable Artificial Intelligence to thrive across all languages, breaking down language barriers and fostering inclusivity.

Notice: We're continuously updating this dataset every 8 months as part of a broader effort at LAION AI dedicated to textual embeddings. If you'd like to see a specific language added, please don't hesitate to reach out to us.

#### Dataset Information:

Indexed on (history): 19th of August 2024

**Sourcing:** Our dataset is sourced from the outstanding dumps of the Wikimedia project. It represents the content of Wikipedia pages of corresponding articles without any alterations.

**Cleaning:** Some languages, like English and German, underwent cleaning while maintaining their Unicode representation.

**Structure:** Our dataset includes the following columns:

- **Abstract:** Contains complete abstracts for each entry
- **Version Control:** Base64-encoded metadata of the official Wikipedia extraction code.
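
Since the **Version Control** column is described as Base64-encoded, reading it requires a decoding step. The following is a minimal sketch using only the Python standard library. The sample record, its payload text, and the helper name `decode_version_control` are hypothetical and for illustration only; the dataset card does not specify what the decoded metadata looks like, and the sketch assumes it is UTF-8 text.

```python
import base64


def decode_version_control(encoded: str) -> str:
    """Decode a Base64 string into text, assuming a UTF-8 payload (an assumption)."""
    return base64.b64decode(encoded).decode("utf-8")


# Hypothetical record: the column names come from the dataset card,
# but the values below are invented placeholders, not real dataset content.
sample_meta = "example extraction metadata"
record = {
    "Abstract": "Example abstract text.",
    "Version Control": base64.b64encode(sample_meta.encode("utf-8")).decode("ascii"),
}

decoded = decode_version_control(record["Version Control"])
print(decoded)
```

If the real payload turns out not to be UTF-8 text, keeping the raw bytes returned by `base64.b64decode` and inspecting them first would be the safer choice.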
**WE HAVE RELEASED WIKIPEDIA X (FULL) FOR ENTIRE TEXT OF ALL THE ARTICLES IN THE BELOW MENTIONED 17 LANGUAGES** [HF LINK](https://huggingface.co/datasets/laion/Wikipedia-X-Full)

| Language   | Code |
|------------|------|
| English    | en   |
| German     | de   |
| Polish     | pl   |
| Spanish    | es   |
| Hebrew     | he   |
| French     | fr   |
| Chinese    | zh   |
| Italian    | it   |
| Russian    | ru   |
| Urdu       | ur   |
| Portuguese | pt   |
| Aramaic    | arc  |
| Cebuano    | ceb  |
| Swedish    | sv   |
| Uighur     | ug   |
| Bengali    | bn   |
| Arabic     | ar   |

Provider:

maas

Created:

2025-10-04
