tarekeldeeb/ArabicCorpus2B

Name: tarekeldeeb/ArabicCorpus2B
Creator: tarekeldeeb
Published: 2022-12-14 11:17:34
License: No description available

Source: Hugging Face. Updated 2022-12-14. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/tarekeldeeb/ArabicCorpus2B

Resource description:

---
license: other
---

```
BUILDING VOCABULARY
Processed 1754541204 tokens.
Counted 5329509 unique words.
Truncating vocabulary at min count 5.
Using vocabulary of size 1539115.
```

---

# Build the Arabic Corpus

#### Download Resources

The Arabic corpus {1.9B words} consists of the following resources:

- ShamelaLibrary348.7z [link](https://www.quran.tv/ketab/ShamelaLibrary348.7z) {1.15B}
- UN arabic corpus [mirror1](http://lotus.kuee.kyoto-u.ac.jp/~raj/rajwindroot/corpora_downloads/UN_CORPUS/UNv1.0.6way.ar.txt) [mirror2](http://corpus.leeds.ac.uk/bogdan/resources/UN-corpus/6way/UNv1.0.6way.ar.txt) {0.37B}
- AraCorpus.tar.gz [link](http://aracorpus.e3rab.com/argistestsrv.nmsu.edu/AraCorpus.tar.gz) {0.14B}
- Arabic Wikipedia Latest Articles Dump [link](https://dumps.wikimedia.org/arwiki/latest/arwiki-latest-pages-articles.xml.bz2) {0.11B}
- Tashkeela-arabic-diacritized-text-utf8-0.3.zip [link](https://netix.dl.sourceforge.net/project/tashkeela/) {0.07B}
- Arabic Tweets [link](https://github.com/bakrianoo/Datasets) {0.03B}
- watan-2004.7z [link](https://netix.dl.sourceforge.net/project/arabiccorpus/watan-2004corpus/watan-2004.7z) {0.01B}

#### Build Script: https://github.com/tarekeldeeb/GloVe-Arabic/tree/master/arabic_corpus

---

# Download the dataset

Mirror: https://archive.org/details/arabic_corpus

---

license: Waqf v2 (https://github.com/ojuba-org/waqf/tree/master/2.0)

Provided by:

tarekeldeeb

Summary of original information

Dataset overview

Dataset composition

ShamelaLibrary348.7z: 1.15B words
UN arabic corpus: 0.37B words
AraCorpus.tar.gz: 0.14B words
Arabic Wikipedia Latest Articles Dump: 0.11B words
Tashkeela-arabic-diacritized-text-utf8-0.3.zip: 0.07B words
Arabic Tweets: 0.03B words
watan-2004.7z: 0.01B words
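
As a quick consistency check, the per-resource sizes listed above can be added up and compared with the headline figure of 1.9B words. The sketch below only restates the numbers from the list; the variable names are illustrative and not part of the dataset or its build script.

```python
# Sizes in billions of words, copied from the resource list above.
component_sizes_b = {
    "ShamelaLibrary348.7z": 1.15,
    "UN arabic corpus": 0.37,
    "AraCorpus.tar.gz": 0.14,
    "Arabic Wikipedia Latest Articles Dump": 0.11,
    "Tashkeela-arabic-diacritized-text-utf8-0.3.zip": 0.07,
    "Arabic Tweets": 0.03,
    "watan-2004.7z": 0.01,
}

# Total of the listed components, in billions of words.
total_b = sum(component_sizes_b.values())
print(f"Sum of listed components: {total_b:.2f}B words")

# Share of each resource, largest first.
for name, size in sorted(component_sizes_b.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {size / total_b:.1%}")
```

The listed sizes add up to about 1.88B, which matches the stated 1.9B after rounding.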

Vocabulary construction

Processed 1754541204 tokens
Counted 5329509 unique words
Vocabulary truncated at a minimum count of 5
Final vocabulary size: 1539115
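
The figures above describe a count-then-truncate vocabulary step. A minimal sketch of that idea follows, assuming that words seen at least `min_count` times are kept; the function `build_vocab` and the toy tokens are hypothetical and are not taken from the build script.

```python
from collections import Counter


def build_vocab(tokens, min_count):
    """Count tokens and keep the words seen at least min_count times (assumed rule)."""
    counts = Counter(tokens)
    vocab = {word: n for word, n in counts.items() if n >= min_count}
    return counts, vocab


# Toy example with placeholder tokens, not real corpus data.
toy_tokens = "a a a b b c".split()
counts, vocab = build_vocab(toy_tokens, min_count=2)
print(len(counts), "unique words,", len(vocab), "kept")

# Figures reported in the log above.
unique_words = 5329509
vocab_size = 1539115

# Fraction of unique words that survive the min-count cutoff.
kept_share = vocab_size / unique_words
print(f"Share of unique words kept: {kept_share:.1%}")
```

With the reported numbers, roughly 29% of the unique words remain after truncation, so most distinct word forms in the corpus occur fewer than 5 times.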

License

The dataset uses the Waqf v2 license
