
tarekeldeeb/ArabicCorpus2B

Source: Hugging Face. Updated 2022-12-14. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/tarekeldeeb/ArabicCorpus2B
Resource description:

---
license: other
---

```
BUILDING VOCABULARY
Processed 1754541204 tokens.
Counted 5329509 unique words.
Truncating vocabulary at min count 5.
Using vocabulary of size 1539115.
```

---

# Build the Arabic Corpus

#### Download Resources

The Arabic corpus {1.9B words} consists of the following resources:

- ShamelaLibrary348.7z [link](https://www.quran.tv/ketab/ShamelaLibrary348.7z) {1.15B}
- UN arabic corpus [mirror1](http://lotus.kuee.kyoto-u.ac.jp/~raj/rajwindroot/corpora_downloads/UN_CORPUS/UNv1.0.6way.ar.txt) [mirror2](http://corpus.leeds.ac.uk/bogdan/resources/UN-corpus/6way/UNv1.0.6way.ar.txt) {0.37B}
- AraCorpus.tar.gz [link](http://aracorpus.e3rab.com/argistestsrv.nmsu.edu/AraCorpus.tar.gz) {0.14B}
- Arabic Wikipedia Latest Articles Dump [link](https://dumps.wikimedia.org/arwiki/latest/arwiki-latest-pages-articles.xml.bz2) {0.11B}
- Tashkeela-arabic-diacritized-text-utf8-0.3.zip [link](https://netix.dl.sourceforge.net/project/tashkeela/) {0.07B}
- Arabic Tweets [link](https://github.com/bakrianoo/Datasets) {0.03B}
- watan-2004.7z [link](https://netix.dl.sourceforge.net/project/arabiccorpus/watan-2004corpus/watan-2004.7z) {0.01B}

#### Build Script:

https://github.com/tarekeldeeb/GloVe-Arabic/tree/master/arabic_corpus

---

# Download the dataset

Mirror : https://archive.org/details/arabic_corpus

---

license: Waqf v2 (https://github.com/ojuba-org/waqf/tree/master/2.0)
Provided by:
tarekeldeeb
Summary of original information

Dataset overview

Dataset composition

  • ShamelaLibrary348.7z: 1.15B words
  • UN arabic corpus: 0.37B words
  • AraCorpus.tar.gz: 0.14B words
  • Arabic Wikipedia Latest Articles Dump: 0.11B words
  • Tashkeela-arabic-diacritized-text-utf8-0.3.zip: 0.07B words
  • Arabic Tweets: 0.03B words
  • watan-2004.7z: 0.01B words
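
As a quick sanity check on the figures above, the per-resource counts can be summed and compared with the stated corpus size of about 1.9B words. This is only an illustrative sketch: the dictionary name `resource_words_b` is hypothetical, and the values are the rounded counts from the list, so the total is approximate.

```python
# Approximate word counts in billions, copied from the composition list.
# The variable names here are illustrative, not part of the dataset.
resource_words_b = {
    "ShamelaLibrary348.7z": 1.15,
    "UN arabic corpus": 0.37,
    "AraCorpus.tar.gz": 0.14,
    "Arabic Wikipedia Latest Articles Dump": 0.11,
    "Tashkeela-arabic-diacritized-text-utf8-0.3.zip": 0.07,
    "Arabic Tweets": 0.03,
    "watan-2004.7z": 0.01,
}

# Sum of the rounded per-resource counts.
total_b = sum(resource_words_b.values())

# Share of the total contributed by each resource.
shares = {name: count / total_b for name, count in resource_words_b.items()}

print(f"Total: {total_b:.2f}B words")
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {share:.1%}")
```

The rounded counts add up to 1.88B, consistent with the stated size of roughly 1.9B words, and the Shamela library alone accounts for a little over 60% of the total.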

Vocabulary construction

  • Processed 1,754,541,204 tokens
  • Counted 5,329,509 unique words
  • Vocabulary truncated at a minimum count of 5
  • Final vocabulary size: 1,539,115
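
The truncation step listed above can be illustrated with a minimal sketch. It assumes whitespace tokenization and uses a hypothetical helper, `build_vocab`; it is not the author's build script, only an example of dropping words that occur fewer than `min_count` times. The toy sample text is invented for illustration.

```python
from collections import Counter

def build_vocab(tokens, min_count=5):
    """Count tokens and keep only words seen at least min_count times.

    Hypothetical helper for illustration; not taken from the build script.
    """
    counts = Counter(tokens)
    vocab = {word: n for word, n in counts.items() if n >= min_count}
    return counts, vocab

# Toy sample (assumed whitespace tokenization): two frequent words, one rare.
sample = ("كتاب " * 6 + "علم " * 5 + "نادر " * 2).split()
counts, vocab = build_vocab(sample, min_count=5)

print(f"Processed {len(sample)} tokens.")
print(f"Counted {len(counts)} unique words.")
print(f"Using vocabulary of size {len(vocab)}.")

# Fraction of unique words that survived truncation in the reported log.
kept_fraction = 1539115 / 5329509
print(f"Kept {kept_fraction:.1%} of unique words in the reported run.")
```

Applied to the reported figures, the minimum count of 5 keeps about 29% of the 5,329,509 unique words, which suggests that most distinct word forms in the corpus occur fewer than five times.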

License

  • The dataset is released under the Waqf v2 license