
MoryBinM/101_billion_arabic_words_dataset

Hugging Face · Updated 2026-01-12 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/MoryBinM/101_billion_arabic_words_dataset
Description:
---
language:
- ar
license: apache-2.0
size_categories:
- 100B<n<1T
task_categories:
- text-generation
pretty_name: 101 Billion Arabic Words Dataset
dataset_info:
  features:
  - name: date
    dtype: string
  - name: text
    dtype: string
  - name: url
    dtype: string
  splits:
  - name: train
    num_bytes: 234862507623
    num_examples: 33059988
  download_size: 96089262509
  dataset_size: 234862507623
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# 101 Billion Arabic Words Dataset

### Updates

- **Maintenance Status:** Actively Maintained
- **Update Frequency:** Weekly updates to refine data quality and expand coverage.

### Upcoming Version

- **More Cleaned Version:** A further-cleaned version of the dataset is being processed. It adds a UUID column for better data traceability and management.

## Dataset Details

The 101 Billion Arabic Words Dataset is curated by the Clusterlab team and consists of 101 billion words of Arabic text extracted and cleaned from web content. It is intended for natural language processing applications, particularly training and fine-tuning Large Language Models (LLMs) that understand and generate Arabic text.

- **Curated by:** Clusterlab Team
- **Language(s) (NLP):** Mix of Modern Standard Arabic (MSA) & Arabic Dialects
- **License:** Apache 2.0
- **Repository:** [HuggingFace Dataset Page](https://huggingface.co/datasets/ClusterlabAi/101_billion_arabic_words_dataset)
- **Paper:** [101 Billion Arabic Words Dataset](https://arxiv.org/abs/2405.01590)

## Uses

### Direct Use

The dataset is suitable for training and fine-tuning models that perform text-generation tasks in Arabic. Its size and broad coverage of Arabic text make it a valuable resource for developing language models.

### Out-of-Scope Use

The dataset is not intended for uses that require personal or sensitive data, as it consists of general web text. Uses that require fine-grained dialectal understanding or specific cultural nuances may find the dataset limited without further processing and adaptation.

## Dataset Structure

```json
{
  "text": "content...",
  "date": "YYYY-MM-DDTHH:MM:SSZ",
  "uuid": "123e4567-e89b-12d3-a456-426614174000"
}
```

## Dataset Creation

### Curation Rationale

This dataset was created to address the significant lack of large-scale, high-quality datasets for the Arabic language in NLP research and application development. It aims to provide a foundation for developing more accurate and efficient Arabic language models.

### Source Data

#### Data Collection and Processing

We initially gathered data from specified sources, primarily Common Crawl, and extracted Arabic content from WET files using Rust. We then applied our preprocessing pipeline, which included text cleaning and deduplication.

## Bias, Risks, and Limitations

The dataset consists primarily of web text, which may carry biases present in online content. Users should be aware of these potential biases when training models on this dataset. Further research and adjustment may be necessary to mitigate them for specific applications.

### Recommendations

Because the dataset is web-derived, users should critically evaluate it for potential biases or misrepresentations of the Arabic language and culture.

### Citation Information

```
@misc{aloui2024101,
  title={101 Billion Arabic Words Dataset},
  author={Manel Aloui and Hasna Chouikhi and Ghaith Chaabane and Haithem Kchaou and Chehir Dhaouadi},
  year={2024},
  eprint={2405.01590},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
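
### Illustrative Sketch: Cleaning and Exact Deduplication

The Data Collection and Processing section mentions text cleaning and deduplication but does not describe either step. The Python sketch below shows one way such a pass could look over records with the `text`, `date`, and `url` fields listed in the card metadata. It is not the Clusterlab pipeline (which the card says uses Rust). The helper names `clean_text` and `dedup_records`, the choice of NFKC normalization, tatweel removal, and SHA-1 exact-match hashing are all assumptions made for illustration.

```python
import hashlib
import re
import unicodedata
from typing import Iterable, Iterator

_WHITESPACE = re.compile(r"\s+")
_TATWEEL = "\u0640"  # Arabic elongation character (kashida)


def clean_text(text: str) -> str:
    """Hypothetical cleaning step: normalize Unicode, drop tatweel, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)  # assumption: NFKC is acceptable here
    text = text.replace(_TATWEEL, "")
    return _WHITESPACE.sub(" ", text).strip()


def dedup_records(records: Iterable[dict]) -> Iterator[dict]:
    """Yield cleaned records, skipping empty texts and exact duplicates of earlier texts."""
    seen = set()  # SHA-1 digests of cleaned texts already emitted
    for record in records:
        cleaned = clean_text(record.get("text", ""))
        if not cleaned:
            continue  # nothing left after cleaning
        digest = hashlib.sha1(cleaned.encode("utf-8")).digest()
        if digest in seen:
            continue  # same cleaned text was already yielded
        seen.add(digest)
        yield {**record, "text": cleaned}


if __name__ == "__main__":
    # Made-up sample rows; the URLs are placeholders, not dataset content.
    sample = [
        {"text": "مرحبا   بالعالم", "date": "2024-01-01T00:00:00Z", "url": "https://example.com/a"},
        {"text": "مرحب\u0640ا بالعالم", "date": "2024-01-02T00:00:00Z", "url": "https://example.com/b"},
    ]
    for row in dedup_records(sample):
        print(row["url"], row["text"])
```

Because `dedup_records` is a generator, it can be fed any iterable of dictionaries, including a streamed split obtained with the Hugging Face `datasets` library (`load_dataset(..., split="train", streaming=True)`). Keeping every digest in memory would not scale to all 33,059,988 rows without care, and exact hashing misses near-duplicates, so treat this as a starting point only.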
Provided by:
MoryBinM