tarekeldeeb/ArabicCorpus2B
Source: Hugging Face. Updated 2022-12-14; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/tarekeldeeb/ArabicCorpus2B
Resource description:
---
license: other
---
```
BUILDING VOCABULARY
Processed 1754541204 tokens.
Counted 5329509 unique words.
Truncating vocabulary at min count 5.
Using vocabulary of size 1539115.
```
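
The log above reports a token count, a unique-word count, and a vocabulary truncated at a minimum count of 5. A minimal sketch of that counting step follows, assuming whitespace tokenization and that words seen at least `min_count` times are kept. The function name `build_vocab` and the sample lines are hypothetical and are not taken from the build script.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def build_vocab(lines: Iterable[str], min_count: int = 5) -> Tuple[int, int, Dict[str, int]]:
    """Count whitespace-separated tokens and keep words seen at least min_count times."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.split())
    total_tokens = sum(counts.values())
    # Truncate the vocabulary: drop every word below the minimum count.
    vocab = {word: n for word, n in counts.items() if n >= min_count}
    return total_tokens, len(counts), vocab


# Hypothetical two-line sample; the real corpus has about 1.75B tokens.
sample = ["كتاب علم كتاب", "علم كتاب نور"]
total, unique, vocab = build_vocab(sample, min_count=2)
print(f"Processed {total} tokens.")
print(f"Counted {unique} unique words.")
print(f"Using vocabulary of size {len(vocab)}.")
```

At corpus scale the same idea applies, but the text would be streamed line by line rather than held in memory.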
---
# Build the Arabic Corpus
#### Download Resources
The Arabic corpus {about 1.9B words} consists of the following resources; the figure in braces is each resource's approximate word count:
- ShamelaLibrary348.7z [link](https://www.quran.tv/ketab/ShamelaLibrary348.7z) {1.15B}
- UN Arabic corpus [mirror1](http://lotus.kuee.kyoto-u.ac.jp/~raj/rajwindroot/corpora_downloads/UN_CORPUS/UNv1.0.6way.ar.txt) [mirror2](http://corpus.leeds.ac.uk/bogdan/resources/UN-corpus/6way/UNv1.0.6way.ar.txt) {0.37B}
- AraCorpus.tar.gz [link](http://aracorpus.e3rab.com/argistestsrv.nmsu.edu/AraCorpus.tar.gz) {0.14B}
- Arabic Wikipedia Latest Articles Dump [link](https://dumps.wikimedia.org/arwiki/latest/arwiki-latest-pages-articles.xml.bz2) {0.11B}
- Tashkeela-arabic-diacritized-text-utf8-0.3.zip [link](https://netix.dl.sourceforge.net/project/tashkeela/) {0.07B}
- Arabic Tweets [link](https://github.com/bakrianoo/Datasets) {0.03B}
- watan-2004.7z [link](https://netix.dl.sourceforge.net/project/arabiccorpus/watan-2004corpus/watan-2004.7z) {0.01B}
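
As a quick check, the per-resource figures in the list can be summed and compared with the headline size. The sketch below uses only the numbers stated in this card; the variable names are illustrative.

```python
# Approximate word counts per resource, in billions, as listed above.
sizes_billion = {
    "ShamelaLibrary348.7z": 1.15,
    "UN Arabic corpus": 0.37,
    "AraCorpus.tar.gz": 0.14,
    "Arabic Wikipedia dump": 0.11,
    "Tashkeela": 0.07,
    "Arabic Tweets": 0.03,
    "watan-2004.7z": 0.01,
}

# Round to two decimals to absorb floating-point noise.
total = round(sum(sizes_billion.values()), 2)
print(f"Sum of listed resources: {total}B words")

# Token count reported in the vocabulary-building log.
tokens_logged = 1754541204
print(f"Tokens in the log: {round(tokens_logged / 1e9, 2)}B")
```

The listed sizes add up to 1.88B, consistent with the rounded 1.9B headline. The log reports about 1.75B processed tokens; the card does not say what accounts for the difference.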
#### Build Script: https://github.com/tarekeldeeb/GloVe-Arabic/tree/master/arabic_corpus
---
# Download the dataset
Mirror: https://archive.org/details/arabic_corpus
---
license: Waqf v2 (https://github.com/ojuba-org/waqf/tree/master/2.0)
Provider:
tarekeldeeb
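Summary of original information
Dataset overview
Dataset composition
- ShamelaLibrary348.7z: 1.15B words
- UN Arabic corpus: 0.37B words
- AraCorpus.tar.gz: 0.14B words
- Arabic Wikipedia Latest Articles Dump: 0.11B words
- Tashkeela-arabic-diacritized-text-utf8-0.3.zip: 0.07B words
- Arabic Tweets: 0.03B words
- watan-2004.7z: 0.01B words
Vocabulary construction
- Processed 1754541204 tokens
- Counted 5329509 unique words
- Vocabulary truncated at a minimum count of 5
- Final vocabulary size: 1539115
License
- The dataset is released under the Waqf v2 license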



