premio-ai/TheArabicPile_Books

Source: Hugging Face. Updated 2024-03-23. Indexed 2024-06-22.
Download link:
https://hf-mirror.com/datasets/premio-ai/TheArabicPile_Books
Resource description:
---
language:
- ar
license: cc-by-nc-4.0
task_categories:
- text-generation
dataset_info:
- config_name: dedup
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 2579823752
    num_examples: 1923
  download_size: 1116323743
  dataset_size: 2579823752
- config_name: original
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 3079842775
    num_examples: 1932
  download_size: 1324665373
  dataset_size: 3079842775
configs:
- config_name: dedup
  data_files:
  - split: train
    path: dedup/train-*
- config_name: original
  data_files:
  - split: train
    path: data/train-*
---

# The Arabic Pile

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64da0fd923557cdce3e514c3/J0oY67lVvecV75SOlWpjc.png)

## Introduction:

The Arabic Pile is a dataset designed to parallel the structure of The Pile and The Nordic Pile. It covers Modern Standard Arabic (MSA) as well as various Levantine, North African, and Egyptian dialects. Built for training and fine-tuning large language models, it consists of 13 subsets, each covering a different linguistic domain.

## The Books Subset:

This subset is a collection of Arabic books found in the open domain.

## Other Subsets:

1. premio-ai/TheArabicPile
2. premio-ai/TheArabicPile_Web
3. premio-ai/TheArabicPile_Lyrics
4. premio-ai/TheArabicPile_Reviews
5. premio-ai/TheArabicPile_Dialects
6. premio-ai/TheArabicPile_Mathematics
7. premio-ai/TheArabicPile_Conversational
8. premio-ai/TheArabicPile_Articles
9. premio-ai/TheArabicPile_Poetry
10. premio-ai/TheArabicPile_Medical
11. premio-ai/TheArabicPile_Miscellaneous
12. premio-ai/TheArabicPile_SocialMedia
13. premio-ai/TheArabicPile_Translations
14. premio-ai/TheArabicPile_Books

These subsets serve distinct purposes, ranging from mathematical content to conversational dialogue, medical texts, and more. Notably, the dedicated subset "premio-ai/TheArabicPile_SocialMedia" covers language commonly found in social media contexts.

## Dataset Description

* Curated by: Premio.AI team
* Language(s) (NLP): Arabic; the translation dataset also contains other languages.
* License: CC BY-NC 4.0 Deed - Non Commercial.
* For any commercial uses or licensing, please contact mo@premio.ai.

## Data Structure

The datasets are divided into two main subsets:

1. Original Subset (config `original`): The raw data as collected from sources, without modifications.
2. Deduplication Subset (config `dedup`): A filtered and cleaned version that is more usable for large language models because redundancy and noise have been reduced.

The Arabic Pile is intended not only for training and fine-tuning large language models but also for other applications across linguistic domains. Whether for research, analysis, or other linguistic work, it is a resource for exploring the intricacies of the Arabic language.

## Data Collection

Please refer to the paper for more details on our data collection procedures.

## Data Format

The dataset has a single column called `text`, which contains the required metadata and the body combined. This makes it a good fit for direct training or fine-tuning of large language models. Note that the metadata may need to be repeated if the entire body of text does not fit in your training context window.

## Potential Bias

As with any large-scale dataset, The Arabic Pile is not immune to biases that may influence the training and performance of language models. These biases should be addressed transparently to ensure responsible usage and interpretation of the dataset. Here are some considerations:

1. Dialectal Imbalance: The dataset incorporates various Arabic dialects, with a focus on Levantine, North African, and Egyptian variants. However, these dialects may not be equally represented, potentially leading to an imbalance in the training data.
2. Source Influence: Bias may arise from the sources of the original data. The dataset collects information from diverse platforms and domains, and biases inherent in those sources could transfer to the dataset.
3. Social Media Context: Some of our datasets contain language from social media and other online platforms. These subsets may introduce biases inherent in online discourse, such as informal language, colloquial expressions, and potential subjectivity on politics, religion, or culture.
4. Genre and Domain Bias: Different subsets cater to distinct linguistic domains, such as medical texts, poetry, reviews, and more. Each domain carries its own linguistic characteristics, potentially leading to biases based on the genres represented.

## License Information for The Arabic Pile: No Commercial Use

The Arabic Pile is released under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). This license is designed to facilitate open sharing of and collaboration on the dataset while ensuring responsible, non-commercial usage.

Key Points of the License:

* Attribution (BY): Users are free to share, adapt, and build upon the dataset for non-commercial purposes, as long as they provide appropriate attribution to the dataset creators.
* Non-Commercial (NC): The dataset may not be used for commercial purposes. Any use for commercial gain requires explicit permission from the dataset creators.
* No Additional Restrictions: The license allows for maximum freedom of use, provided the terms of attribution and non-commercial use are adhered to.

How to Cite: When using The Arabic Pile in your work, please include a proper citation to acknowledge the dataset creators. A recommended citation is given in the Citation section below.

License Deed: For a comprehensive understanding of the terms and conditions, please refer to the CC BY-NC 4.0 License Deed.

By adopting this license, we aim to foster a collaborative and open environment for the exploration and advancement of Arabic language understanding and natural language processing.

## Citation

When using The Arabic Pile in your research, development, or other projects, we kindly request that you cite the dataset using the following format:

    @article{alrefaie2024arabicpile,
      author = {Mohamed Taher Alrefaie and Mahmoud Ibrahim Barbary and Ahmed Yasser Hassanein and Shiref Khaled Elhalawany and Karim Ashraf Elsayed and Ahmed Yasser},
      title = {The Arabic Pile: A Large Scale Dataset of Diverse Text for Large Language Modeling},
      year = {2024},
      url = {https://huggingface.co/datasets/premio-ai/TheArabicPile}
    }
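Provided by:
premio-ai
Summary of Original Information

Arabic Dataset Overview

Dataset Information

  • Language: Arabic
  • License: CC BY-NC 4.0 (non-commercial use)
  • Task category: Text generation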

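Dataset Configurations

  • Config name: dedup

    • Features:
      • Name: text
      • Data type: string
    • Splits:
      • Name: train
      • Bytes: 2579823752
      • Examples: 1923
    • Download size: 1116323743 bytes
    • Dataset size: 2579823752 bytes
  • Config name: original

    • Features:
      • Name: text
      • Data type: string
    • Splits:
      • Name: train
      • Bytes: 3079842775
      • Examples: 1932
    • Download size: 1324665373 bytes
    • Dataset size: 3079842775 bytes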
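
The published figures allow a quick comparison of the two configurations. The sketch below is plain arithmetic on the numbers listed above; the dictionary layout and the `reduction` helper are illustrative names, not part of the dataset.

```python
# Sizes and example counts copied from the dataset card metadata.
CONFIGS = {
    "original": {"num_bytes": 3_079_842_775, "num_examples": 1932},
    "dedup": {"num_bytes": 2_579_823_752, "num_examples": 1923},
}


def reduction(before: int, after: int) -> float:
    """Fractional reduction going from `before` to `after`."""
    return (before - after) / before


orig, dedup = CONFIGS["original"], CONFIGS["dedup"]
examples_removed = orig["num_examples"] - dedup["num_examples"]
bytes_removed = orig["num_bytes"] - dedup["num_bytes"]
byte_reduction = reduction(orig["num_bytes"], dedup["num_bytes"])
mean_bytes_dedup = dedup["num_bytes"] / dedup["num_examples"]

print(f"examples removed: {examples_removed}")
print(f"bytes removed: {bytes_removed} ({byte_reduction:.1%})")
print(f"mean bytes per example (dedup): {mean_bytes_dedup:,.0f}")
```

Only 9 examples are dropped while the byte count falls by roughly 16%, so most of the reduction appears to come from cleaning within documents rather than from removing whole documents. The dataset card does not state this explicitly.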

Data File Configuration

  • Config name: dedup

    • Data files:
      • Split: train
      • Path: dedup/train-*
  • Config name: original

    • Data files:
      • Split: train
      • Path: data/train-*
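
Given the configuration names and file patterns listed above, loading one configuration with the Hugging Face `datasets` library might look like the following sketch. The helper name `load_books` and the default of streaming are assumptions made for illustration; the repository ID and config names are taken from this page.

```python
REPO_ID = "premio-ai/TheArabicPile_Books"

# Config name -> data file pattern, as listed on the dataset card.
DATA_FILES = {"dedup": "dedup/train-*", "original": "data/train-*"}


def load_books(config: str = "dedup", streaming: bool = True):
    """Load the train split of one configuration (hypothetical helper)."""
    if config not in DATA_FILES:
        raise ValueError(
            f"unknown config {config!r}; expected one of {sorted(DATA_FILES)}"
        )
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config, split="train", streaming=streaming)
```

Calling `load_books("dedup")` requires network access and the `datasets` package. With `streaming=True`, records are read lazily instead of downloading the full configuration up front; each record is expected to expose the single `text` field described on the card.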

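Dataset Description

  • Curated by: Premio.AI
  • Language: Arabic; the translation dataset also contains other languages.
  • License: CC BY-NC 4.0 (non-commercial use)
  • Commercial use: contact mo@premio.ai

Data Structure

  • Original subset: The raw data as collected from sources, without modifications.
  • Deduplicated subset: A filtered and cleaned version that is more usable for large language models because redundancy and noise have been reduced.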

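Data Format

  • The dataset contains a single column named text, which should hold the required metadata and the body combined. This design makes it suitable for direct use in training or fine-tuning large language models.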
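
The dataset card notes that the metadata may need to be repeated when a document does not fit in the training context window. The sketch below shows one way to do that. It assumes the caller has already separated metadata from body (the card does not document how the two are delimited inside `text`), and it approximates tokens with whitespace-separated words; `chunk_with_metadata` is a hypothetical helper, not something shipped with the dataset.

```python
def chunk_with_metadata(metadata: str, body: str, max_tokens: int) -> list[str]:
    """Split `body` into chunks, each prefixed with `metadata`.

    Tokens are approximated by whitespace-separated words, so original
    line breaks are not preserved. A real pipeline would count tokens
    with the model's own tokenizer instead.
    """
    meta_tokens = metadata.split()
    body_tokens = body.split()
    # Tokens left for body text once the metadata prefix is accounted for.
    budget = max_tokens - len(meta_tokens)
    if budget <= 0:
        raise ValueError("max_tokens must exceed the metadata length")
    chunks = []
    for start in range(0, len(body_tokens), budget):
        piece = body_tokens[start:start + budget]
        chunks.append(" ".join(meta_tokens + piece))
    return chunks


# Toy input; the metadata string here is invented for the example.
example = chunk_with_metadata(
    metadata="Title: Sample Book",
    body="one two three four five six seven",
    max_tokens=6,
)
for chunk in example:
    print(chunk)
```

Every chunk stays within `max_tokens` and starts with the same metadata prefix, so a model trained on any single chunk still sees which document the text came from.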

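Potential Bias

  • Dialectal imbalance: The dataset includes various Arabic dialects, but they may not be equally represented, which can lead to an imbalance in the training data.
  • Source influence: Biases inherent in the diverse sources of the original data may carry over into the dataset.
  • Social media context: Some datasets contain language from social media platforms, which may introduce biases inherent in online discourse.
  • Genre and domain bias: Different subsets serve different linguistic domains, each with its own linguistic characteristics, which may lead to biases based on the genres represented.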

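License Information

  • License: CC BY-NC 4.0 (non-commercial use)
  • Key points:
    • Attribution (BY): Users are free to share, adapt, and build upon the dataset, as long as they give appropriate attribution.
    • Non-Commercial (NC): The dataset may not be used for commercial purposes. Any commercial use requires explicit permission.
    • No additional restrictions: The license allows maximum freedom of use, provided the attribution and non-commercial terms are respected.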

Citation

  • Citation format:

    @article{alrefaie2024arabicpile,
      author = {Mohamed Taher Alrefaie and Mahmoud Ibrahim Barbary and Ahmed Yasser Hassanein and Shiref Khaled Elhalawany and Karim Ashraf Elsayed and Ahmed Yasser},
      title = {The Arabic Pile: A Large Scale Dataset of Diverse Text for Large Language Modeling},
      year = {2024},
      url = {https://huggingface.co/datasets/premio-ai/TheArabicPile}
    }
