CALM/arwiki

Name: CALM/arwiki
Creator: CALM
Published: 2022-08-01 16:37:23
License: No description available

Hugging Face | Updated: 2022-08-01 | Indexed: 2024-03-04

Download link:

https://hf-mirror.com/datasets/CALM/arwiki

Resource description:

---
pretty_name: Wikipedia Arabic dumps dataset.
language:
- ar
license:
- unknown
multilinguality:
- monolingual
---

# Arabic Wiki Dataset

## Dataset Summary

This dataset was extracted from [Wikipedia Arabic pages](https://dumps.wikimedia.org/arwiki/) using the [`wikiextractor`](https://github.com/attardi/wikiextractor) tool.

## Supported Tasks and Leaderboards

Intended for training **Arabic** language models on MSA (Modern Standard Arabic).

## Dataset Structure

The dataset is organized into two folders:

- `arwiki_20211213_txt`: the dataset is divided into subfolders, each containing no more than 100 documents.
- `arwiki_20211213_txt_single`: all documents merged into a single txt file.

## Dataset Statistics

#### Extracts from **December 13, 2021**:

| documents | vocabulary | words |
| --- | --- | --- |
| 1,136,455 | 5,446,560 | 175,566,016 |

## Usage

Load the entire dataset from the single txt file:

```python
from datasets import load_dataset

load_dataset('CALM/arwiki', data_files='arwiki_2021_txt_single/arwiki_20211213.txt')

# Or stream the file instead of downloading it first
load_dataset('CALM/arwiki', data_files='arwiki_2021_txt_single/arwiki_20211213.txt', streaming=True)
```

Load a smaller subset from the individual txt files:

```python
from datasets import load_dataset

load_dataset('CALM/arwiki', data_files='arwiki_2021_txt/AA/arwiki_20211213_1208.txt')

# Or stream the file instead of downloading it first
load_dataset('CALM/arwiki', data_files='arwiki_2021_txt/AA/arwiki_20211213_1208.txt', streaming=True)
```

Provider:

CALM

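Summary of original information

Dataset overview

Dataset name

Name: Wikipedia Arabic dumps dataset
Language: Arabic (ar)
License: unknown
Multilinguality: monolingual

Dataset summary

Extraction tool: wikiextractor
Source: Wikipedia Arabic pages

Supported tasks and leaderboards

Purpose: training Modern Standard Arabic (MSA) language models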

Dataset structure

arwiki_20211213_txt: split into multiple subfolders, each containing no more than 100 documents.
arwiki_20211213_txt_single: all documents merged into a single txt file.
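
The sharded layout described above could be walked as follows. This is only a minimal sketch under stated assumptions: it expects a local copy of the `arwiki_20211213_txt` folder, it assumes the documents are `.txt` files sitting directly inside each subfolder, and `iter_shard_files` is a hypothetical helper name rather than anything shipped with the dataset.

```python
from pathlib import Path


def iter_shard_files(root):
    """Yield .txt files under `root`, one subfolder at a time, in sorted order.

    Assumes a local copy of the sharded folder whose subfolders each hold
    a batch of plain-text documents.
    """
    root = Path(root)
    # Visit subfolders in a stable order so repeated runs see the same sequence
    for subfolder in sorted(p for p in root.iterdir() if p.is_dir()):
        # Within a subfolder, yield its text files in sorted order
        yield from sorted(subfolder.glob("*.txt"))
```

Yielding paths lazily keeps memory use flat even when the folder holds many subfolders.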

Dataset statistics

Extraction date: December 13, 2021
Number of documents: 1,136,455
Vocabulary size: 5,446,560
Total words: 175,566,016
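
Word and vocabulary counts of this kind could be computed roughly as shown here. The source does not say how its figures were produced, so the whitespace tokenization below is an assumption and the results may not match the published numbers; `corpus_stats` is a hypothetical helper name.

```python
def corpus_stats(lines):
    """Return (total_words, vocabulary_size) for an iterable of text lines.

    Tokenization is a plain whitespace split, which is an assumption and not
    necessarily the method behind the published statistics.
    """
    vocabulary = set()
    total_words = 0
    for line in lines:
        tokens = line.split()
        total_words += len(tokens)
        vocabulary.update(tokens)
    return total_words, len(vocabulary)
```

Because the function accepts any iterable of strings, it can be fed an open file handle and process the text line by line.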

Usage

Load the entire dataset: load_dataset('CALM/arwiki', data_files='arwiki_2021_txt_single/arwiki_20211213.txt')

Or load it with streaming: load_dataset('CALM/arwiki', data_files='arwiki_2021_txt_single/arwiki_20211213.txt', streaming=True)

Load a small subset of the dataset: load_dataset('CALM/arwiki', data_files='arwiki_2021_txt/AA/arwiki_20211213_1208.txt')

Or load it with streaming: load_dataset('CALM/arwiki', data_files='arwiki_2021_txt/AA/arwiki_20211213_1208.txt', streaming=True)
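
To look at a few records from a streamed load, something like the following sketch may help. It assumes the `datasets` library is installed and that the plain-text loader yields one record per line under a `text` key. The helper names `first_nonempty_lines` and `stream_arwiki_sample` are hypothetical, and the file path is copied unchanged from the usage notes above.

```python
from itertools import islice


def first_nonempty_lines(records, limit=5):
    """Return up to `limit` non-empty, stripped lines from {"text": ...} records."""
    stripped = (record["text"].strip() for record in records)
    return list(islice((line for line in stripped if line), limit))


def stream_arwiki_sample(limit=5):
    """Stream one shard and return its first few non-empty lines (needs network)."""
    # Imported lazily so the helper above works without `datasets` installed
    from datasets import load_dataset

    dataset = load_dataset(
        'CALM/arwiki',
        data_files='arwiki_2021_txt/AA/arwiki_20211213_1208.txt',
        split='train',
        streaming=True,
    )
    return first_nonempty_lines(dataset, limit)
```

Streaming avoids downloading the file up front, and `islice` stops reading as soon as enough lines have been collected.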
