
MBZUAI-Paris/DarijaStory

Hugging Face | Updated 2024-11-13 | Indexed 2025-04-12
Download link:
https://hf-mirror.com/datasets/MBZUAI-Paris/DarijaStory
Resource description:
---
language:
- ary
task_categories:
- text-generation
multilinguality:
- monolingual
language_creators:
- machine-translated
source_datasets:
- original
size_categories:
- 1K<n<10K
---

# Dataset Card for DarijaStory

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)

## Dataset Description

- **Repository:** [https://hf.co/datasets/MBZUAI-Paris/DarijaStory](https://hf.co/datasets/MBZUAI-Paris/DarijaStory)
- **Paper:** [https://arxiv.org/pdf/2409.17912](https://arxiv.org/pdf/2409.17912)

### Dataset Summary

DarijaStory is a story completion dataset. It consists of 4,392 long stories scraped from [9esa](https://www.9esa.com), a website featuring a variety of stories written in Moroccan Darija.

### Supported Tasks and Leaderboards

- **Task Category:** Conditional Text Generation
- **Task:** Story Completion in Moroccan Darija

### Languages

The dataset is available in Moroccan Arabic (Darija).

## Dataset Structure

### Data Instances

Each data instance contains a story or a chapter of a story.

#### Example Data Instance:

```
{
  'id': 1170,
  'story_name': 'قصة اللؤلؤة السوداء',
  'content': 'حلات عوييناتها بتقااالة... حاسة بحرييق كيقطع فراااسها...
قوي وعينيها مضببين ليها رؤيا ... هزات يديها بتقالة حطاتها فوق رااسها....'
}
```

### Data Fields

- **id**: *(integer)* Index of the story.
- **story_name**: *(string)* The story name.
- **content**: *(string)* The content of the story.

### Data Splits

The dataset consists of a single split:

| Split | Number of Instances |
|-------|---------------------|
| train | 4,392 |

## Dataset Creation

### Curation Rationale

The dataset was web-scraped from [9esa.com](https://www.9esa.com), a website that contains stories in Darija.

### Personal and Sensitive Information

The dataset does not contain personal, private, or sensitive information. All stories are general and cover fictional or societal themes relevant to Morocco.

## Considerations for Using the Data

### Social Impact of Dataset

This dataset promotes the development and evaluation of language models capable of understanding and generating extended narratives in Moroccan Darija, thus contributing to the advancement of NLP in underrepresented languages and supporting cultural diversity in AI applications.

### Discussion of Biases

The dataset consists of Moroccan Darija stories, which may reflect specific cultural and societal themes relevant to Morocco. Users should be aware of this when using the dataset for general language model applications.

## Additional Information

### Dataset Curators

- **MBZUAI-Paris Team**

### Licensing Information

- **License:** [ODC-BY](https://opendatacommons.org/licenses/by/1-0/).

### Citation Information

If you use this dataset in your research, please cite our paper:

```none
@article{shang2024atlaschatadaptinglargelanguage,
  title={Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect},
  author={Guokan Shang and Hadi Abdine and Yousef Khoubrane and Amr Mohamed and Yassine Abbahaddou and Sofiane Ennadir and Imane Momayiz and Xuguang Ren and Eric Moulines and Preslav Nakov and Michalis Vazirgiannis and Eric Xing},
  year={2024},
  eprint={2409.17912},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2409.17912},
}
```
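
#### Example: Building Story Completion Pairs (sketch)

The card describes the task as story completion but does not specify how prompts are derived from a story. The sketch below shows one possible approach, assuming each record is a dictionary with the `id`, `story_name`, and `content` fields listed under Data Fields. The function name `split_for_completion` and the `prompt_fraction` parameter are hypothetical and are not part of the dataset or of the paper's method. The record is a toy placeholder; in practice, records could be obtained with the `datasets` library's `load_dataset` function.

```python
from typing import Dict, Tuple


def split_for_completion(
    record: Dict[str, object], prompt_fraction: float = 0.5
) -> Tuple[str, str]:
    """Split a story's content into a (prompt, continuation) pair.

    Hypothetical helper: the split is made on whitespace, so the original
    spacing and line breaks of the story are not preserved.
    """
    if not 0.0 < prompt_fraction < 1.0:
        raise ValueError("prompt_fraction must be strictly between 0 and 1")

    content = record["content"]
    if not isinstance(content, str):
        raise TypeError("content must be a string")

    words = content.split()
    if len(words) < 2:
        raise ValueError("content needs at least two words to split")

    # Keep at least one word on each side of the cut.
    cut = min(max(1, int(len(words) * prompt_fraction)), len(words) - 1)
    return " ".join(words[:cut]), " ".join(words[cut:])


# Toy placeholder record with the same field names as the dataset.
record = {"id": 0, "story_name": "example", "content": "w1 w2 w3 w4 w5 w6"}
prompt, continuation = split_for_completion(record)
print(prompt)
print(continuation)
```

A model would then be asked to generate the continuation given the prompt. Splitting on whitespace is a simplification; a tokenizer-based split may be more appropriate for a specific model.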
Provider:
MBZUAI-Paris