Ali-C137/Darija-Stories-Dataset
Source: Hugging Face · Updated: 2023-07-29 · Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/Ali-C137/Darija-Stories-Dataset
Resource description:
---
dataset_info:
  features:
  - name: ChapterName
    dtype: string
  - name: ChapterLink
    dtype: string
  - name: Author
    dtype: string
  - name: Text
    dtype: string
  - name: Tags
    dtype: int64
  splits:
  - name: train
    num_bytes: 476926644
    num_examples: 6142
  download_size: 241528641
  dataset_size: 476926644
license: cc-by-nc-4.0
task_categories:
- text-generation
language:
- ar
pretty_name: Darija (Moroccan Arabic) Stories Dataset
---
# Dataset Card for "Darija-Stories-Dataset"
**Darija (Moroccan Arabic) Stories Dataset is a large-scale collection of stories written in Moroccan Arabic dialect (Darija).**
## Dataset Description
Darija (Moroccan Arabic) Stories Dataset contains a diverse range of stories that provide insights into Moroccan culture, traditions, and everyday life. The dataset consists of textual content from various chapters, including narratives, dialogues, and descriptions. Each story chapter is associated with a URL link for online reading or reference. The dataset also includes information about the author and tags that provide additional context or categorization.
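As a rough illustration of the record layout described above, the sketch below models one row using the field names and dtypes listed in the card metadata. `StoryChapter`, `EXPECTED_TYPES`, `is_valid_chapter`, and `load_chapters` are hypothetical names introduced here, not part of the dataset. `load_chapters` assumes the Hugging Face `datasets` package is installed and that network access is available.

```python
from typing import Any, Mapping, TypedDict


class StoryChapter(TypedDict):
    """One row of the dataset, following the features in the card metadata."""
    ChapterName: str
    ChapterLink: str
    Author: str
    Text: str
    Tags: int  # declared as int64 in the metadata


# Expected Python type for each column (hypothetical helper table).
EXPECTED_TYPES = {
    "ChapterName": str,
    "ChapterLink": str,
    "Author": str,
    "Text": str,
    "Tags": int,
}


def is_valid_chapter(record: Mapping[str, Any]) -> bool:
    """Return True if the record has every expected column with the expected type."""
    return all(
        name in record and isinstance(record[name], expected)
        for name, expected in EXPECTED_TYPES.items()
    )


def load_chapters(repo_id: str = "Ali-C137/Darija-Stories-Dataset"):
    """Load the train split from the Hugging Face Hub (needs `datasets` and network)."""
    from datasets import load_dataset  # imported lazily so the rest works offline

    return load_dataset(repo_id, split="train")
```

A check such as `is_valid_chapter(row)` can be run over the rows returned by `load_chapters()` before any further processing. The import is kept inside the function so the schema helpers stay usable without the `datasets` dependency.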
## Dataset Details
- **Homepage:** https://huggingface.co/datasets/Ali-C137/Darija-Stories-Dataset
- **Author:** Elfilali Ali
- **Email:** ali.elfilali00@gmail.com, alielfilali0909@gmail.com
- **Github Profile:** [https://github.com/alielfilali01](https://github.com/alielfilali01)
- **LinkedIn Profile:** [https://www.linkedin.com/in/alielfilali01/](https://www.linkedin.com/in/alielfilali01/)
## Dataset Size
The author describes the Darija (Moroccan Arabic) Stories Dataset as the largest publicly available dataset in Moroccan Arabic dialect (Darija) as of its 2023 release, with over 70 million tokens.
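To put the size figures in perspective, the following sketch derives a few per-chapter averages from the numbers in the card metadata. The constant and function names are hypothetical. The token figure is only the lower bound quoted above, and the card does not say which tokenizer produced it, so the per-chapter token average is a rough estimate.

```python
# Figures copied from the dataset card metadata.
NUM_EXAMPLES = 6142
DATASET_SIZE_BYTES = 476_926_644
DOWNLOAD_SIZE_BYTES = 241_528_641
# Lower bound stated in the card ("over 70 million tokens"); tokenizer unspecified.
MIN_TOKENS = 70_000_000


def summarize_size() -> dict:
    """Derive simple size statistics from the card's metadata figures."""
    return {
        "dataset_mb": DATASET_SIZE_BYTES / 1e6,    # decimal megabytes
        "download_mb": DOWNLOAD_SIZE_BYTES / 1e6,  # decimal megabytes
        "download_ratio": DOWNLOAD_SIZE_BYTES / DATASET_SIZE_BYTES,
        "avg_bytes_per_chapter": DATASET_SIZE_BYTES / NUM_EXAMPLES,
        "min_avg_tokens_per_chapter": MIN_TOKENS / NUM_EXAMPLES,
    }


if __name__ == "__main__":
    for key, value in summarize_size().items():
        print(f"{key}: {value:,.2f}")
```

By these figures the train split occupies roughly 477 MB once loaded, and the download is about half that size.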
## Potential Use Cases
- **Arabic Dialect NLP:** Researchers can utilize this dataset to develop and evaluate NLP models specifically designed for Arabic dialects, with a focus on Moroccan Arabic (Darija). Tasks such as dialect identification, part-of-speech tagging, and named entity recognition can be explored.
- **Sentiment Analysis:** The dataset can be used to analyze sentiment expressed in Darija stories, enabling sentiment classification, emotion detection, or opinion mining within the context of Moroccan culture.
- **Text Generation:** Researchers and developers can leverage the dataset to generate new stories or expand existing ones using various text generation techniques, facilitating the development of story generation systems specifically tailored for Moroccan Arabic dialect.
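For the text-generation use case, long chapters usually need to be cut into shorter training samples. The sketch below shows one simple way to do that with overlapping word windows. `chunk_words` and its default sizes are assumptions made for illustration, not a preprocessing step prescribed by the dataset author, and a real pipeline would more likely chunk by tokenizer tokens than by whitespace-separated words.

```python
from typing import List


def chunk_words(text: str, size: int = 256, overlap: int = 32) -> List[str]:
    """Split text into windows of `size` whitespace-separated words.

    Consecutive windows share `overlap` words so context is not lost at the cut.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    words = text.split()
    step = size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks
```

Applied to the `Text` field of each row, this yields fixed-length samples that can be fed to a language-model fine-tuning script.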
## Dataset Access
The Darija (Moroccan Arabic) Stories Dataset is available for academic and non-commercial use under the Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0) license.
## Citation
Please use the following citation when referencing the Darija (Moroccan Arabic) Stories Dataset:
```
@dataset{elfilali2023darija,
  title = {Darija (Moroccan Arabic) Stories Dataset},
  author = {Elfilali Ali},
  howpublished = {Dataset},
  url = {https://huggingface.co/datasets/Ali-C137/Darija-Stories-Dataset},
  year = {2023}
}
```
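**Provider:** Ali-C137
## Summary of Original Information
### Dataset Overview
#### Dataset Name
Darija (Moroccan Arabic) Stories Dataset
#### Dataset Description
The dataset contains a diverse range of stories covering Moroccan culture, traditions, and everyday life. Each story chapter comes with a URL for online reading or reference, along with author information and tags that provide additional context or categorization.
#### Dataset Features
- ChapterName: string
- ChapterLink: string
- Author: string
- Text: string
- Tags: integer
#### Dataset Splits
- train: 6142 examples, 476926644 bytes
#### Dataset Size
- Download size: 241528641 bytes
- Dataset size: 476926644 bytes
#### License
cc-by-nc-4.0
#### Task Categories
- Text generation
#### Language
- Arabic
#### Potential Uses
- Arabic dialect NLP research
- Sentiment analysis
- Text generation
#### Access
The dataset is available for academic and non-commercial use.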



