
premio-ai/TheArabicPile_Medical

Source: Hugging Face | Updated: 2024-03-21 | Indexed: 2024-06-22
Download link:
https://hf-mirror.com/datasets/premio-ai/TheArabicPile_Medical
Resource description:
---
language:
- ar
license: cc-by-nc-4.0
task_categories:
- text-generation
dataset_info:
- config_name: dedup
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 19198059
    num_examples: 32016
  download_size: 9223336
  dataset_size: 19198059
- config_name: default
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 34717649
    num_examples: 53058
  download_size: 11091835
  dataset_size: 34717649
configs:
- config_name: dedup
  data_files:
  - split: train
    path: dedup/train-*
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# The Arabic Pile

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64da0fd923557cdce3e514c3/J0oY67lVvecV75SOlWpjc.png)

## Introduction:

The Arabic Pile is a dataset designed to parallel the structure of The Pile and The Nordic Pile. It focuses on the Arabic language and covers both Modern Standard Arabic (MSA) and various Levantine, North African, and Egyptian dialects. It is intended for training and fine-tuning large language models and consists of 13 subsets, each targeting a different linguistic domain.

## The Medical Subset:

This subset gathers the Arabic-language medical text that was collected from the internet. It is small, which reflects how limited Arabic medical content is online.

## Other Subsets:

1. premio-ai/TheArabicPile
2. premio-ai/TheArabicPile_Web
3. premio-ai/TheArabicPile_Lyrics
4. premio-ai/TheArabicPile_Reviews
5. premio-ai/TheArabicPile_Dialects
6. premio-ai/TheArabicPile_Mathematics
7. premio-ai/TheArabicPile_Conversational
8. premio-ai/TheArabicPile_Articles
9. premio-ai/TheArabicPile_Poetry
10. premio-ai/TheArabicPile_Medical
11. premio-ai/TheArabicPile_Miscellaneous
12. premio-ai/TheArabicPile_SocialMedia
13. premio-ai/TheArabicPile_Translations
14. premio-ai/TheArabicPile_Books

These subsets serve distinct purposes, ranging from mathematical content to conversational dialogue, medical texts, and more. Notably, the dedicated subset "premio-ai/TheArabicPile_SocialMedia" covers language commonly found in social media contexts.

## Dataset Description

* Curated by: Premio.AI team
* Language(s) (NLP): Arabic; multiple languages in the translation dataset.
* License: CC BY-NC 4.0 Deed (Non-Commercial).
* For any commercial uses or licensing, please contact mo@premio.ai.

## Data Structure

The datasets are divided into two main subsets:

1. Original Subset (the `default` config): The raw data as collected from sources, without modifications.
2. Deduplication Subset (the `dedup` config): A filtered and cleaned version that is easier to use for large language models because redundancy and noise have been reduced.

The Arabic Pile is intended not only for training and fine-tuning large language models but also for research, analysis, and other work on the Arabic language.

## Data Collection

Please refer to the paper for more details on our data collection procedures.

## Data Format

The dataset has a single column called `text`. Each entry combines the relevant metadata and the body, so that it can be used directly for training or fine-tuning large language models. Note that the metadata may need to be repeated in each chunk if your training context window cannot fit the entire body of text.

## Potential Bias

As with any large-scale dataset, The Arabic Pile is not immune to biases that may influence the training and performance of language models. Stating these biases openly supports responsible use and interpretation of the dataset. Here are some considerations:

1. Dialectal Imbalance: The dataset incorporates various Arabic dialects, with a focus on Levantine, North African, and Egyptian variants. However, these dialects may not be equally represented, which can lead to an imbalance in the training data.
2. Source Influence: Bias may arise from the sources of the original data. The dataset collects information from diverse platforms and domains, and biases inherent in those sources could transfer to the dataset.
3. Social Media Context: Some of our datasets contain language from social media and other online platforms. This content may introduce biases inherent in online discourse, such as informal language, colloquial expressions, and subjectivity on politics, religion, or culture.
4. Genre and Domain Bias: Different subsets cater to distinct linguistic domains, such as medical texts, poetry, reviews, and more. Each domain carries its own linguistic characteristics, potentially leading to biases based on the genres represented.

## License Information for The Arabic Pile: No Commercial Use

The Arabic Pile is released under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). This license is designed to facilitate open sharing of and collaboration on the dataset while ensuring responsible, non-commercial usage.

Key Points of the License:

* Attribution (BY): Users are free to share, adapt, and build upon the dataset for non-commercial purposes, as long as they provide appropriate attribution to the dataset creators.
* Non-Commercial (NC): The dataset may not be used for commercial purposes. Any use for commercial gain requires explicit permission from the dataset creators.
* No Additional Restrictions: The license allows for maximum freedom of use, provided the terms of attribution and non-commercial use are adhered to.

How to Cite: When using The Arabic Pile in your work, please include a proper citation to acknowledge the dataset creators. A recommended citation is given in the Citation section below.

License Deed: For a comprehensive understanding of the terms and conditions, please refer to the CC BY-NC 4.0 License Deed.

By adopting this license, we aim to foster a collaborative and open environment for the exploration and advancement of Arabic language understanding and natural language processing.

## Citation

When using The Arabic Pile in your research, development, or other projects, we kindly request that you cite the dataset using the following format:

    @article{alrefaie2024arabicpile,
      author = {Mohamed Taher Alrefaie, Mahmoud Ibrahim Barbary, Ahmed Yasser Hassanein, Shiref Khaled Elhalawany, Karim Ashraf Elsayed, Ahmed Yasser},
      title = {The Arabic Pile: A Large Scale Dataset of Diverse Text for Large Language Modeling},
      year = {2024},
      url = {https://huggingface.co/datasets/premio-ai/TheArabicPile}
    }
Provided by:
premio-ai
Summary of original information

Dataset overview

Dataset information

  • Language: Arabic
  • License: CC BY-NC 4.0 (non-commercial use)
  • Task category: text generation

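Configuration details

Configuration name: dedup

  • Features:
    • Name: text
    • Data type: string
  • Splits:
    • Name: train
    • Size in bytes: 19198059
    • Number of examples: 32016
  • Download size: 9223336 bytes
  • Dataset size: 19198059 bytes

Configuration name: default

  • Features:
    • Name: text
    • Data type: string
  • Splits:
    • Name: train
    • Size in bytes: 34717649
    • Number of examples: 53058
  • Download size: 11091835 bytes
  • Dataset size: 34717649 bytes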
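
To put the two configurations side by side, the short calculation below compares their sizes. It is only an illustrative sketch: the figures are copied from the configuration details above, while the names `CONFIGS` and `retention` are introduced here for the example.

```python
# Figures copied from the configuration details listed above.
CONFIGS = {
    "default": {"num_examples": 53058, "num_bytes": 34717649},
    "dedup": {"num_examples": 32016, "num_bytes": 19198059},
}


def retention(metric: str) -> float:
    """Fraction of the default config that remains in the dedup config."""
    return CONFIGS["dedup"][metric] / CONFIGS["default"][metric]


# Number of examples dropped when going from default to dedup.
removed_examples = (
    CONFIGS["default"]["num_examples"] - CONFIGS["dedup"]["num_examples"]
)

print(f"examples removed: {removed_examples}")
print(f"examples kept:    {retention('num_examples'):.1%}")
print(f"bytes kept:       {retention('num_bytes'):.1%}")
```

In other words, the dedup configuration drops 21,042 examples and keeps roughly 60.3% of the examples and 55.3% of the bytes of the default configuration.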

Data files

  • Configuration name: dedup
    • Split: train
    • Path: dedup/train-*
  • Configuration name: default
    • Split: train
    • Path: data/train-*
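
The configurations listed above can be selected by name when loading the dataset. The following is a minimal sketch that assumes the Hugging Face `datasets` library is installed and the repository is reachable. The helper `load_medical` and the constant `EXPECTED_ROWS` are hypothetical names used only for illustration, and the row counts are the ones reported in the dataset metadata.

```python
REPO_ID = "premio-ai/TheArabicPile_Medical"

# Row counts reported in the dataset card metadata, used as a sanity check.
EXPECTED_ROWS = {"default": 53058, "dedup": 32016}


def load_medical(config: str = "dedup"):
    """Load the train split of one configuration of the Medical subset."""
    if config not in EXPECTED_ROWS:
        raise ValueError(f"unknown config {config!r}; use 'default' or 'dedup'")

    # Imported lazily so the module can be read without `datasets` installed.
    from datasets import load_dataset

    ds = load_dataset(REPO_ID, config, split="train")
    if len(ds) != EXPECTED_ROWS[config]:
        # The hosted data may have changed since the card was written.
        print(f"warning: got {len(ds)} rows, card lists {EXPECTED_ROWS[config]}")
    return ds
```

Calling `load_medical("default")` would fetch the files under `data/train-*`, and `load_medical("dedup")` those under `dedup/train-*`. Each row is expected to expose a single `text` field.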

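Dataset description

  • Dataset name: The Arabic Pile
  • Curated by: Premio.AI team
  • Language: Arabic; the translation dataset contains multiple languages
  • License: CC BY-NC 4.0 (non-commercial use)
  • Commercial use: contact mo@premio.ai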

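Data structure

  • Original subset: the raw data as collected from sources, without modifications.
  • Deduplication subset: a filtered and cleaned version that is easier to use for large language models because redundancy and noise have been reduced.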
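
The source does not describe how the deduplication subset was produced. Purely to illustrate what reducing redundancy can mean in practice, here is a sketch of exact-duplicate removal after whitespace normalization. This is an assumed, simplified technique, not the authors' pipeline, and `drop_exact_duplicates` is a hypothetical helper.

```python
import hashlib


def drop_exact_duplicates(texts: list[str]) -> list[str]:
    """Keep the first occurrence of each text, ignoring whitespace differences.

    Illustrative only: the dataset's actual deduplication method is not
    documented in the card.
    """
    seen: set[str] = set()
    unique: list[str] = []
    for text in texts:
        # Collapse runs of whitespace so trivially different copies match.
        normalized = " ".join(text.split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


sample = ["مرحبا  بالعالم", "مرحبا بالعالم", "دواء"]
print(drop_exact_duplicates(sample))
```

Near-duplicate detection, for example with MinHash, would catch more redundancy than this exact-match approach, at the cost of extra complexity.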

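Data format

  • Data column: text
  • Text content: combines the relevant metadata and the body, so that it can be used directly for training or fine-tuning large language models.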
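
The dataset card notes that the metadata may need to be repeated when the body does not fit in the training context window. The sketch below shows one way to do that, under two assumptions that are not from the source: the caller has already separated the metadata from the body, and whitespace-separated words stand in for tokenizer tokens. The function name `chunk_with_metadata` is hypothetical.

```python
def chunk_with_metadata(metadata: str, body: str, max_words: int) -> list[str]:
    """Split `body` into chunks and prepend `metadata` to every chunk.

    Each returned chunk holds at most `max_words` whitespace-separated words,
    counting the metadata. Word counts are a rough stand-in for tokens.
    """
    meta_words = metadata.split()
    budget = max_words - len(meta_words)
    if budget <= 0:
        raise ValueError("metadata alone does not fit in the window")

    body_words = body.split()
    chunks = []
    for start in range(0, len(body_words), budget):
        piece = body_words[start:start + budget]
        chunks.append(" ".join(meta_words + piece))
    return chunks


# Two words of metadata leave room for two body words per chunk.
for chunk in chunk_with_metadata("title: x", "a b c d e", max_words=4):
    print(chunk)
```

With a real tokenizer, the word counts would be replaced by token counts from that tokenizer, since Arabic text often splits into several tokens per word.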

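Potential bias

  • Dialectal imbalance: the dataset covers several Arabic dialects, but they may not be equally represented.
  • Source influence: the dataset collects information from different platforms and domains, and biases inherent in those sources may carry over.
  • Social media context: some datasets contain language from social media platforms, which may introduce biases found in online discourse.
  • Genre and domain bias: different subsets serve different linguistic domains, each with its own linguistic characteristics, which may lead to genre-based biases.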

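License information

  • License: CC BY-NC 4.0 (non-commercial use)
  • Key terms:
    • Attribution (BY): users are free to share, adapt, and build upon the dataset, as long as they provide appropriate attribution.
    • Non-Commercial (NC): the dataset may not be used for commercial purposes.
    • No additional restrictions: the license allows maximum freedom of use, provided the attribution and non-commercial terms are followed.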

Citation

  • Citation format:

    @article{alrefaie2024arabicpile,
      author = {Mohamed Taher Alrefaie, Mahmoud Ibrahim Barbary, Ahmed Yasser Hassanein, Shiref Khaled Elhalawany, Karim Ashraf Elsayed, Ahmed Yasser},
      title = {The Arabic Pile: A Large Scale Dataset of Diverse Text for Large Language Modeling},
      year = {2024},
      url = {https://huggingface.co/datasets/premio-ai/TheArabicPile}
    }
