premio-ai/TheArabicPile_Web

Name: premio-ai/TheArabicPile_Web
Creator: premio-ai
Published: 2024-03-21 21:46:28
License: No description available

Hugging Face · Updated 2024-03-21 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/premio-ai/TheArabicPile_Web
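
As a minimal sketch of reading this subset without downloading it up front, the Hugging Face `datasets` library can stream the `train` split. The repository ID and the `text` column come from the dataset card; the helper names `iter_texts` and `stream_web_subset` are hypothetical, and the sketch assumes the repository is publicly accessible and `datasets` is installed.

```python
from itertools import islice
from typing import Iterable, Iterator


def iter_texts(rows: Iterable[dict], limit: int) -> Iterator[str]:
    """Yield the `text` field of the first `limit` rows."""
    for row in islice(rows, limit):
        yield row["text"]


def stream_web_subset():
    """Open the train split lazily rather than fetching the full download."""
    # Imported here so iter_texts stays usable without the library installed.
    from datasets import load_dataset

    return load_dataset(
        "premio-ai/TheArabicPile_Web", split="train", streaming=True
    )


# Usage (needs network access and `pip install datasets`):
#   for text in iter_texts(stream_web_subset(), limit=3):
#       print(text[:200])
```

Streaming is used here only because the card reports a download of roughly 30 GB; a regular `load_dataset` call without `streaming=True` would cache the whole split locally.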

Resource description:

---
dataset_info:
  splits:
  - name: train
    num_bytes: 62937755336.483734
    num_examples: 4949349
  download_size: 29828157902
  dataset_size: 62937755336.483734
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# The Arabic Pile

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64da0fd923557cdce3e514c3/J0oY67lVvecV75SOlWpjc.png)

## Introduction:

The Arabic Pile is a comprehensive dataset designed to parallel the structure of The Pile and The Nordic Pile. Focused on the Arabic language, it covers both Modern Standard Arabic (MSA) and various Levantine, North African, and Egyptian dialects. Tailored for training and fine-tuning large language models, the dataset consists of 13 subsets, each built for a different linguistic domain.

## The Web Subset:

This dataset is a collection of Arabic web articles based on Common Crawl.

## Other Subsets:

1. premio-ai/TheArabicPile
2. premio-ai/TheArabicPile_Web
3. premio-ai/TheArabicPile_Lyrics
4. premio-ai/TheArabicPile_Reviews
5. premio-ai/TheArabicPile_Dialects
6. premio-ai/TheArabicPile_Mathematics
7. premio-ai/TheArabicPile_Conversational
8. premio-ai/TheArabicPile_Articles
9. premio-ai/TheArabicPile_Poetry
10. premio-ai/TheArabicPile_Medical
11. premio-ai/TheArabicPile_Miscellaneous
12. premio-ai/TheArabicPile_SocialMedia
13. premio-ai/TheArabicPile_Translations
14. premio-ai/TheArabicPile_Books

These subsets serve distinct purposes, ranging from mathematical content to conversational dialogue, medical texts, and more. Notably, there is a dedicated subset, "premio-ai/TheArabicPile_SocialMedia," covering the language commonly found in social media contexts.

## Dataset Description

* Curated by: Premio.AI team
* Language(s) (NLP): Arabic; multiple languages in the translation dataset.
* License: CC BY-NC 4.0 Deed - Non Commercial.
* For any commercial uses or licensing, please contact mo@premio.ai.

## Data Structure

The datasets are divided into two main subsets:

1. Original Subset: The raw data as collected from sources, without modifications.
2. Deduplication Subset: A filtered and cleaned version, enhancing usability for large language models by reducing redundancy and noise.

The Arabic Pile is intended not only for training and fine-tuning large language models but also for diverse applications across linguistic domains. Whether for research, analysis, or other linguistic work, it is a rich resource for exploring the Arabic language.

## Data Collection

Please refer to the paper for more details on our data collection procedures.

## Data Format

The dataset has a single column called `text`. The text contains the required metadata and the body combined, so that it is a good fit for direct training or fine-tuning of large language models. Note that the metadata may need to be repeated if the entire body of text does not fit in your training context window.

## Potential Bias

As with any large-scale dataset, The Arabic Pile is not immune to potential biases that may influence the training and performance of language models. It is crucial to address these biases transparently to ensure responsible usage and interpretation of the dataset. Here are some considerations:

1. Dialectal Imbalance: The dataset incorporates various Arabic dialects, with a focus on Levantine, North African, and Egyptian variants. However, these dialects may not be equally represented, potentially leading to an imbalance in the training data.
2. Source Influence: Bias may arise from the sources of the original data. The dataset collects information from diverse platforms and domains, and biases inherent in those sources could transfer to the dataset.
3. Social Media Context: Some of our datasets contain language from social media and other online platforms. This may introduce biases inherent in online discourse, such as informal language, colloquial expressions, and potential subjectivity in politics, religion, or culture.
4. Genre and Domain Bias: Different subsets cater to distinct linguistic domains, such as medical texts, poetry, reviews, and more. Each domain carries its own linguistic characteristics, potentially leading to biases based on the genres represented.

## License Information for The Arabic Pile: No Commercial Use

The Arabic Pile is released under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). This license is designed to facilitate open sharing of and collaboration on the dataset while ensuring responsible, non-commercial usage.

Key Points of the License:

* Attribution (BY): Users are free to share, adapt, and build upon the dataset for non-commercial purposes, as long as they provide appropriate attribution to the dataset creators.
* Non-Commercial (NC): The dataset may not be used for commercial purposes. Any use for commercial gain requires explicit permission from the dataset creators.
* No Additional Restrictions: No further restrictions apply, provided the terms of attribution and non-commercial use are adhered to.

How to Cite: When using The Arabic Pile in your work, please include a proper citation to acknowledge the dataset creators. A recommended citation is given in the Citation section of this card.

License Deed: For a comprehensive understanding of the terms and conditions, please refer to the CC BY-NC 4.0 License Deed.

By adopting this license, we aim to foster a collaborative and open environment for the exploration and advancement of Arabic language understanding and natural language processing.

## Citation

When utilizing The Arabic Pile in your research, development, or other projects, we kindly request that you cite the dataset using the following format:

    @article{alrefaie2024arabicpile,
      author = {Mohamed Taher Alrefaie, Mahmoud Ibrahim Barbary, Ahmed Yasser Hassanein, Shiref Khaled Elhalawany, Karim Ashraf Elsayed, Ahmed Yasser},
      title = {The Arabic Pile: A Large Scale Dataset of Diverse Text for Large Language Modeling},
      year = {2024},
      url = {https://huggingface.co/datasets/premio-ai/TheArabicPile}
    }
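
The `dataset_info` header at the top of the card gives enough numbers for a quick sanity check of the subset's scale. The sketch below only does arithmetic on those three published values; the function name `summarize` and the choice of decimal gigabytes are illustrative assumptions, not part of the card.

```python
# Values copied from the dataset_info header of the dataset card.
NUM_BYTES = 62937755336.483734   # dataset_size of the train split, in bytes
NUM_EXAMPLES = 4949349           # number of rows in the train split
DOWNLOAD_SIZE = 29828157902      # size of the files to download, in bytes


def summarize(num_bytes: float, num_examples: int, download_size: int) -> dict:
    """Derive a few scale figures from the published sizes."""
    return {
        "avg_bytes_per_example": num_bytes / num_examples,
        "download_to_dataset_ratio": download_size / num_bytes,
        "dataset_gb": num_bytes / 1e9,      # decimal gigabytes
        "download_gb": download_size / 1e9,
    }


stats = summarize(NUM_BYTES, NUM_EXAMPLES, DOWNLOAD_SIZE)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

On these figures each example averages roughly 12.7 KB of text, and the download is a little under half the in-memory dataset size.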

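Provider:

premio-ai

Summary of original information

The Arabic Pile

Introduction

The Arabic Pile is a carefully designed, comprehensive dataset whose structure parallels that of The Pile and The Nordic Pile. It focuses on Arabic and covers Modern Standard Arabic (MSA) as well as various Levantine, North African, and Egyptian dialects. It consists of 13 subsets, each designed for a different linguistic domain, and is intended for training and fine-tuning large language models.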

Subset list

premio-ai/TheArabicPile
premio-ai/TheArabicPile_Web
premio-ai/TheArabicPile_Lyrics
premio-ai/TheArabicPile_Reviews
premio-ai/TheArabicPile_Dialects
premio-ai/TheArabicPile_Mathematics
premio-ai/TheArabicPile_Conversational
premio-ai/TheArabicPile_Articles
premio-ai/TheArabicPile_Poetry
premio-ai/TheArabicPile_Medical
premio-ai/TheArabicPile_Miscellaneous
premio-ai/TheArabicPile_SocialMedia
premio-ai/TheArabicPile_Translations
premio-ai/TheArabicPile_Books

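These subsets serve purposes ranging from mathematical content to conversational dialogue and medical texts. In particular, the "premio-ai/TheArabicPile_SocialMedia" subset covers the language commonly found on social media.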

Dataset description

Curated by: Premio.AI team
Languages: Arabic; the translation dataset contains multiple languages.
License: CC BY-NC 4.0 Deed - Non Commercial.

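Data structure

The datasets are divided into two main subsets:

Original subset: the raw data as collected from sources, without modifications.
Deduplication subset: a filtered and cleaned version that improves usability for large language models by reducing redundancy and noise.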
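
The card does not say how the deduplication subset was produced. Purely as an illustration of what "reducing redundancy" can mean, the sketch below drops exact duplicates after light normalization; the NFKC normalization, whitespace collapsing, SHA-256 hashing, and the names `normalize` and `dedupe_exact` are all assumptions, not the authors' pipeline.

```python
import hashlib
import unicodedata
from typing import Iterable, Iterator


def normalize(text: str) -> str:
    """NFKC-normalize and collapse whitespace so trivial variants compare equal."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def dedupe_exact(texts: Iterable[str]) -> Iterator[str]:
    """Yield each text whose normalized form has not been seen before."""
    seen = set()
    for text in texts:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield text  # keep the first occurrence, unmodified


docs = ["مرحبا  بالعالم", "مرحبا بالعالم", "نص آخر"]
unique_docs = list(dedupe_exact(docs))
```

Storing fixed-size digests rather than full strings keeps memory bounded per document, which matters at the scale of several million web articles. Near-duplicate detection (for example MinHash) would be a separate technique and is not shown here.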

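Data format

The dataset has a single column named "text", which contains the required metadata and the body combined. This design makes it suitable for direct training or fine-tuning of large language models.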
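
The dataset card adds that the metadata may need to be repeated when the body does not fit in the training context window. A minimal sketch of that idea follows, under two stated assumptions: the caller has already separated metadata from body (the card does not specify a delimiter), and whitespace-separated words stand in for tokenizer tokens. The function name `chunk_with_metadata` is hypothetical.

```python
from typing import List


def chunk_with_metadata(metadata: str, body: str, max_words: int) -> List[str]:
    """Split `body` into chunks of at most `max_words` words, each prefixed
    with the full metadata so every chunk carries its own context."""
    meta_words = metadata.split()
    body_words = body.split()
    budget = max_words - len(meta_words)  # words left for body text per chunk
    if budget <= 0:
        raise ValueError("max_words must be larger than the metadata length")
    chunks = []
    for start in range(0, len(body_words), budget):
        piece = body_words[start:start + budget]
        chunks.append(" ".join(meta_words + piece))
    return chunks


chunks = chunk_with_metadata("title: x", "a b c d e", max_words=4)
```

In a real pipeline the word count would be replaced by the token count of whatever tokenizer the model uses, since word and token lengths can differ substantially for Arabic.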

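Potential bias

As with any large-scale dataset, The Arabic Pile may contain biases that affect the training and performance of language models. Some considerations:

Dialectal imbalance: the dataset includes various Arabic dialects, but they may not be equally represented.
Source influence: biases in the original data sources may transfer to the dataset.
Social media context: some datasets contain language from social media platforms, which may introduce biases found in online discourse.
Genre and domain bias: different subsets target different linguistic domains, each with its own linguistic characteristics.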

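License information

The Arabic Pile is released under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). The license is intended to facilitate open sharing of and collaboration on the dataset while ensuring responsible, non-commercial use.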

Citation

When using The Arabic Pile, please cite it in the following format:

@article{alrefaie2024arabicpile, author = {Mohamed Taher Alrefaie, Mahmoud Ibrahim Barbary, Ahmed Yasser Hassanein, Shiref Khaled Elhalawany, Karim Ashraf Elsayed, Ahmed Yasser }, title = {The Arabic Pile: A Large Scale Dataset of Diverse Text for Large Language Modeling}, year = {2024}, url = {https://huggingface.co/datasets/premio-ai/TheArabicPile} }
