
premio-ai/TheArabicPile_Lyrics

Source: Hugging Face. Updated 2024-03-05; indexed 2024-06-22.
Download link:
https://hf-mirror.com/datasets/premio-ai/TheArabicPile_Lyrics
Resource description:
---
configs:
- config_name: original
  data_files:
  - split: train
    path: original/*
- config_name: dedup
  data_files:
  - split: train
    path: dedup/train-*
dataset_info:
  config_name: dedup
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 32302240
    num_examples: 28846
  download_size: 14204745
  dataset_size: 32302240
license: cc-by-nc-4.0
task_categories:
- text-generation
language:
- ar
---

# The Arabic Pile

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64da0fd923557cdce3e514c3/J0oY67lVvecV75SOlWpjc.png)

## Introduction:

The Arabic Pile is a comprehensive dataset designed to parallel the structure of The Pile and The Nordic Pile. Focused on the Arabic language, it covers both Modern Standard Arabic (MSA) and various Levantine, North African, and Egyptian dialects. Built for training and fine-tuning large language models, the dataset consists of 13 subsets, each covering a different linguistic domain.

## The Lyrics Subset:

This subset is a collection of Arabic lyrics gathered from various sources. It includes lyrics in MSA and in various Arabic dialects. It is intended for fine-tuning large language models; please use it accordingly.

## Other Subsets:

1. premio-ai/TheArabicPile
2. premio-ai/TheArabicPile_Web
3. premio-ai/TheArabicPile_Lyrics
4. premio-ai/TheArabicPile_Reviews
5. premio-ai/TheArabicPile_Dialects
6. premio-ai/TheArabicPile_Mathematics
7. premio-ai/TheArabicPile_Conversational
8. premio-ai/TheArabicPile_Articles
9. premio-ai/TheArabicPile_Poetry
10. premio-ai/TheArabicPile_Medical
11. premio-ai/TheArabicPile_Miscellaneous
12. premio-ai/TheArabicPile_SocialMedia
13. premio-ai/TheArabicPile_Translations
14. premio-ai/TheArabicPile_Books

These subsets serve distinct purposes, ranging from mathematical content to conversational dialogue, medical texts, and more. Notably, there is a dedicated subset, "premio-ai/TheArabicPile_SocialMedia", for language commonly found in social media contexts.

## Dataset Description

* Curated by: Premio.AI team
* Language(s) (NLP): Arabic; multiple languages in the translations subset.
* License: CC BY-NC 4.0 Deed (non-commercial).
* For any commercial use or licensing, please contact mo@premio.ai.

## Data Structure

The datasets are divided into two main subsets:

1. Original Subset: The raw data as collected from sources, without modifications.
2. Deduplication Subset: A filtered and cleaned version, more usable for large language models because redundancy and noise are reduced.

Beyond training and fine-tuning large language models, The Arabic Pile can support other work across linguistic domains, including research and analysis of the Arabic language.

## Data Collection

Please refer to the paper for more details on our data collection procedures.

## Data Format

The dataset has a single column called text. Each text value combines the required metadata and the body, so that the data can be used directly for training or fine-tuning large language models. Note that the metadata may need to be repeated if the entire body does not fit in your training context window.

## Potential Bias

As with any large-scale dataset, The Arabic Pile is not immune to biases that may influence the training and performance of language models. These biases should be stated openly so that the dataset is used and interpreted responsibly. Here are some considerations:

1. Dialectal Imbalance: The dataset incorporates various Arabic dialects, with a focus on Levantine, North African, and Egyptian variants. However, these dialects may not be equally represented, which could lead to an imbalance in the training data.
2. Source Influence: Bias may arise from the sources of the original data. The dataset collects information from diverse platforms and domains, and biases inherent in those sources could transfer to the dataset.
3. Social Media Context: Some subsets contain language from social media and other online platforms. These subsets may introduce biases inherent in online discourse, such as informal language, colloquial expressions, and subjective views on politics, religion, or culture.
4. Genre and Domain Bias: Different subsets cover distinct linguistic domains, such as medical texts, poetry, reviews, and more. Each domain carries its own linguistic characteristics, potentially leading to biases based on the genres represented.

## License Information for The Arabic Pile: No Commercial Use

The Arabic Pile is released under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). This license is designed to facilitate open sharing of and collaboration on the dataset while ensuring responsible, non-commercial usage.

Key Points of the License:

* Attribution (BY): Users are free to share, adapt, and build upon the dataset for non-commercial purposes, as long as they provide appropriate attribution to the dataset creators.
* Non-Commercial (NC): The dataset may not be used for commercial purposes. Any use for commercial gain requires explicit permission from the dataset creators.
* No Additional Restrictions: No further restrictions apply, provided the terms of attribution and non-commercial use are adhered to.

How to Cite: When using The Arabic Pile in your work, please include a proper citation to acknowledge the dataset creators. A recommended citation is given in the Citation section below.

License Deed: For a comprehensive understanding of the terms and conditions, please refer to the CC BY-NC 4.0 License Deed. By adopting this license, we aim to foster a collaborative and open environment for the exploration and advancement of Arabic language understanding and natural language processing.

## Citation

When using The Arabic Pile in your research, development, or other projects, we kindly request that you cite the dataset using the following format:

    @article{alrefaie2024arabicpile,
      author = {Mohamed Taher Alrefaie and Mahmoud Ibrahim Barbary and Ahmed Yasser Hassanein and Shiref Khaled Elhalawany and Karim Ashraf Elsayed and Ahmed Yasser},
      title = {The Arabic Pile: A Large Scale Dataset of Diverse Text for Large Language Modeling},
      year = {2024},
      url = {https://huggingface.co/datasets/premio-ai/TheArabicPile}
    }
Provider:
premio-ai
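Summary of Original Information

Dataset Overview

Dataset Name

The Arabic Pile

Dataset Introduction

The Arabic Pile is a comprehensive dataset designed to parallel the structure of The Pile and The Nordic Pile. It focuses on Arabic and covers Modern Standard Arabic (MSA) as well as various Levantine, North African, and Egyptian dialects. The dataset consists of 13 subsets, each designed for a different linguistic domain.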

Dataset Configuration

  • Config name: dedup
  • Features:
    • Name: text
    • Data type: string
  • Splits:
    • Name: train
    • Size in bytes: 32302240
    • Number of examples: 28846
  • Download size: 14204745 bytes
  • Dataset size: 32302240 bytes
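
The split statistics above allow a quick sanity check before downloading. The following is a minimal sketch, assuming the figures reported in the dataset card are accurate. The helper name `load_dedup_train` is hypothetical; it relies on `load_dataset` from the Hugging Face `datasets` library, imported lazily so that the arithmetic runs even when the library is not installed.

```python
# Figures copied from the dataset card's `dedup` configuration.
NUM_BYTES = 32_302_240      # size of the train split in bytes
NUM_EXAMPLES = 28_846       # rows in the train split
DOWNLOAD_SIZE = 14_204_745  # bytes transferred on download

avg_bytes_per_example = NUM_BYTES / NUM_EXAMPLES
download_ratio = DOWNLOAD_SIZE / NUM_BYTES

print(f"average example size: {avg_bytes_per_example:.0f} bytes")
print(f"download size / dataset size: {download_ratio:.2f}")


def load_dedup_train():
    """Load the dedup train split (hypothetical helper; needs network access)."""
    from datasets import load_dataset  # pip install datasets

    return load_dataset("premio-ai/TheArabicPile_Lyrics", "dedup", split="train")
```

On these figures an average record is a little over 1 KB, and the download is a bit under half the size of the loaded dataset.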

License

CC BY-NC 4.0 Deed (non-commercial use).

Task Categories

  • Text generation

Languages

  • Arabic

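Data Structure

The dataset is divided into two main subsets:

  1. Original subset: The raw data as collected from sources, without modifications.
  2. Deduplication subset: A filtered and cleaned version, more usable for large language models because redundancy and noise are reduced.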
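
The dataset card does not say how the deduplication subset was produced. As an illustration only, one common approach is exact deduplication on normalized text. The function names below are hypothetical, and this sketch is not claimed to be the Premio.AI pipeline.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """Apply NFKC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def deduplicate(texts):
    """Keep the first occurrence of each text, comparing normalized forms."""
    seen = set()
    unique = []
    for text in texts:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


# Illustrative records: the second differs from the first only in spacing.
sample = [
    "line one of a song",
    "line   one of  a song",
    "a different song",
]
unique = deduplicate(sample)
print(len(unique))
```

Hashing the normalized form keeps memory use bounded by the number of distinct records rather than their total length. Near-duplicate detection (for example MinHash) would need a different technique.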

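Data Format

The dataset has a single column named text. Each text value combines the required metadata and the body, so that the data can be used directly for training or fine-tuning large language models.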
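
The dataset card adds that the metadata may need to be repeated when a record's body does not fit in the training context window. Below is a minimal sketch of that idea. It assumes the caller has already separated metadata from body (the card does not document a delimiter) and uses whitespace-separated words as a stand-in for real tokens. `chunk_with_metadata` is a hypothetical helper.

```python
def chunk_with_metadata(metadata: str, body: str, max_tokens: int) -> list[str]:
    """Split body into chunks and prefix each chunk with the metadata.

    Tokens are approximated by whitespace-separated words; swap in a real
    tokenizer to match an actual model's context window.
    """
    meta_tokens = metadata.split()
    body_tokens = body.split()
    budget = max_tokens - len(meta_tokens)  # room left for body in each chunk
    if budget <= 0:
        raise ValueError("max_tokens must exceed the metadata length")
    chunks = []
    for start in range(0, len(body_tokens), budget):
        piece = body_tokens[start:start + budget]
        chunks.append(" ".join(meta_tokens + piece))
    return chunks


chunks = chunk_with_metadata("Title: Example Song", "a b c d e f g", max_tokens=6)
for chunk in chunks:
    print(chunk)
```

Repeating the metadata costs a fixed number of tokens per chunk, so a long metadata prefix leaves less room for the body in each chunk.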

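Potential Bias

As with any large-scale dataset, The Arabic Pile may contain biases that can influence the training and performance of language models. Here are some considerations:

  1. Dialectal imbalance: The dataset incorporates various Arabic dialects, with a focus on Levantine, North African, and Egyptian variants. However, these dialects may not be equally represented, which could lead to an imbalance in the training data.
  2. Source influence: The sources of the original data may introduce bias. The dataset collects information from diverse platforms and domains, and biases inherent in those sources could transfer to the dataset.
  3. Social media context: Some subsets contain language from social media and other online platforms. These subsets may introduce biases inherent in online discourse, such as informal language, colloquial expressions, and subjective views on politics, religion, or culture.
  4. Genre and domain bias: Different subsets cover distinct linguistic domains, such as medical texts, poetry, and reviews. Each domain has its own linguistic characteristics, potentially leading to biases based on the genres represented.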

Citation

When using The Arabic Pile in research, development, or other projects, please cite the dataset using the following format:

    @article{alrefaie2024arabicpile,
      author = {Mohamed Taher Alrefaie and Mahmoud Ibrahim Barbary and Ahmed Yasser Hassanein and Shiref Khaled Elhalawany and Karim Ashraf Elsayed and Ahmed Yasser},
      title = {The Arabic Pile: A Large Scale Dataset of Diverse Text for Large Language Modeling},
      year = {2024},
      url = {https://huggingface.co/datasets/premio-ai/TheArabicPile}
    }
