
Ali-C137/Darija-Stories-Dataset

Source: Hugging Face. Updated 2023-07-29. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/Ali-C137/Darija-Stories-Dataset
Dataset description:
---
dataset_info:
  features:
  - name: ChapterName
    dtype: string
  - name: ChapterLink
    dtype: string
  - name: Author
    dtype: string
  - name: Text
    dtype: string
  - name: Tags
    dtype: int64
  splits:
  - name: train
    num_bytes: 476926644
    num_examples: 6142
  download_size: 241528641
  dataset_size: 476926644
license: cc-by-nc-4.0
task_categories:
- text-generation
language:
- ar
pretty_name: Darija (Moroccan Arabic) Stories Dataset
---

# Dataset Card for "Darija-Stories-Dataset"

**Darija (Moroccan Arabic) Stories Dataset is a large-scale collection of stories written in Moroccan Arabic dialect (Darija).**

## Dataset Description

Darija (Moroccan Arabic) Stories Dataset contains a diverse range of stories that provide insights into Moroccan culture, traditions, and everyday life. The dataset consists of textual content from various chapters, including narratives, dialogues, and descriptions. Each story chapter is associated with a URL link for online reading or reference. The dataset also includes information about the author and tags that provide additional context or categorization.

## Dataset Details

- **Homepage:** https://huggingface.co/datasets/Ali-C137/Darija-Stories-Dataset
- **Author:** Elfilali Ali
- **Email:** ali.elfilali00@gmail.com, alielfilali0909@gmail.com
- **Github Profile:** [https://github.com/alielfilali01](https://github.com/alielfilali01)
- **LinkedIn Profile:** [https://www.linkedin.com/in/alielfilali01/](https://www.linkedin.com/in/alielfilali01/)

## Dataset Size

The Darija (Moroccan Arabic) Stories Dataset is the largest publicly available dataset in Moroccan Arabic dialect (Darija) to date, with over 70 million tokens.

## Potential Use Cases

- **Arabic Dialect NLP:** Researchers can utilize this dataset to develop and evaluate NLP models specifically designed for Arabic dialects, with a focus on Moroccan Arabic (Darija). Tasks such as dialect identification, part-of-speech tagging, and named entity recognition can be explored.
- **Sentiment Analysis:** The dataset can be used to analyze sentiment expressed in Darija stories, enabling sentiment classification, emotion detection, or opinion mining within the context of Moroccan culture.
- **Text Generation:** Researchers and developers can leverage the dataset to generate new stories or expand existing ones using various text generation techniques, facilitating the development of story generation systems specifically tailored for Moroccan Arabic dialect.

## Dataset Access

The Darija (Moroccan Arabic) Stories Dataset is available for academic and non-commercial use, under a Creative Commons Non Commercial license.

## Citation

Please use the following citation when referencing the Darija (Moroccan Arabic) Stories Dataset:

```
@dataset{
  title = {Darija (Moroccan Arabic) Stories Dataset},
  author = {Elfilali Ali},
  howpublished = {Dataset},
  url = {https://huggingface.co/datasets/Ali-C137/Darija-Stories-Dataset},
  year = {2023},
}
```
Provider:
Ali-C137
Summary of Original Information

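Dataset Overview

Dataset Name

Darija (Moroccan Arabic) Stories Dataset

Dataset Description

The dataset contains a diverse range of stories covering Moroccan culture, traditions, and everyday life. Each story chapter comes with a URL for online reading or reference, along with author information and tags that provide additional context or categorization.

Dataset Features

  • ChapterName: string
  • ChapterLink: string
  • Author: string
  • Text: string
  • Tags: integer (int64)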
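The feature list above can be turned into a quick schema check. The following is a minimal sketch, not the author's tooling: `EXPECTED_FEATURES`, `load_train_split`, and `schema_problems` are hypothetical names introduced here for illustration. The loader assumes the Hugging Face `datasets` package is installed and that the repository is reachable over the network; the validator itself uses only the standard library.

```python
from typing import Any, List, Mapping

# Column names and Python types mirroring the dtypes in the dataset card
# (string -> str, int64 -> int). The mapping name is an illustrative choice.
EXPECTED_FEATURES = {
    "ChapterName": str,
    "ChapterLink": str,
    "Author": str,
    "Text": str,
    "Tags": int,
}


def load_train_split():
    """Load the train split from the Hugging Face Hub (needs network access)."""
    # Imported lazily so the rest of this sketch runs without the package.
    from datasets import load_dataset

    return load_dataset("Ali-C137/Darija-Stories-Dataset", split="train")


def schema_problems(record: Mapping[str, Any]) -> List[str]:
    """Return human-readable problems found in one record; empty if none."""
    problems = []
    for name, expected_type in EXPECTED_FEATURES.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(
                f"{name}: expected {expected_type.__name__}, "
                f"got {type(record[name]).__name__}"
            )
    return problems
```

With the dataset downloaded, `schema_problems(load_train_split()[0])` would check the first chapter against the listed features.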

Dataset Splits

  • train: 6,142 examples, 476,926,644 bytes

Dataset Size

  • Download size: 241,528,641 bytes
  • Dataset size: 476,926,644 bytes
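To put the raw byte counts in more familiar units, the short sketch below converts them to MiB and derives an average size per example. The constants are copied from the dataset card; the variable and function names are illustrative only.

```python
# Figures taken from the dataset card's `dataset_info` block.
NUM_BYTES = 476_926_644       # dataset_size (train split)
DOWNLOAD_BYTES = 241_528_641  # download_size
NUM_EXAMPLES = 6_142          # train examples

MIB = 1024 ** 2


def to_mib(n_bytes: int) -> float:
    """Convert a byte count to mebibytes."""
    return n_bytes / MIB


dataset_mib = to_mib(NUM_BYTES)
download_mib = to_mib(DOWNLOAD_BYTES)
avg_bytes_per_example = NUM_BYTES / NUM_EXAMPLES
download_ratio = DOWNLOAD_BYTES / NUM_BYTES

print(f"dataset size:   {dataset_mib:.2f} MiB")
print(f"download size:  {download_mib:.2f} MiB")
print(f"avg per example: {avg_bytes_per_example:.0f} bytes")
print(f"download / dataset ratio: {download_ratio:.2f}")
```

The download is roughly half the size of the dataset, and each chapter averages a little under 78 KB of data.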

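License

cc-by-nc-4.0

Task Categories

  • Text generation

Language

  • Arabic

Potential Uses

  • Arabic dialect NLP research
  • Sentiment analysis
  • Text generation

Access

The dataset is available for academic and non-commercial use.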
