
MBZUAI-Paris/MoroccanSocialMedia-MultiGen

Source: Hugging Face. Updated 2024-09-27. Indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/MBZUAI-Paris/MoroccanSocialMedia-MultiGen
Resource description:
---
language:
- ma
task_categories:
- text-generation
task_ids:
- text-generation
- text2text-generation
multilinguality:
- monolingual
language_creators:
- machine-translated
annotations_creators:
- machine-generated
source_datasets:
- qadi
- omcd
size_categories:
- 10K<n<100K
license:
- mit
---

# Dataset Card for MoroccanSocialMedia-MultiGen (MSM-MG)

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks](#supported-tasks)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)

## Dataset Description

- **Homepage:** [https://hf.co/datasets/MBZUAI-Paris/MoroccanSocialMedia-MultiGen](https://hf.co/datasets/MBZUAI-Paris/MoroccanSocialMedia-MultiGen)
- **Repository:** [https://github.com/MBZUAI-Paris/lm-evaluation-harness-atlas-chat](https://github.com/MBZUAI-Paris/lm-evaluation-harness-atlas-chat)
- **Paper:** [More Information Needed]

### Dataset Summary

MoroccanSocialMedia-MultiGen (MSM-MG) is a dataset of 12,973 pairs of native Darija social media posts (tweets and YouTube comments) and their synthetic counterparts. The dataset supports six tasks: Continuation, Reply, Summarization, Rephrasing, Explanation, and Safe Response.
The synthetic generations were created by prompting Claude 3.5 Sonnet to perform each of these tasks based on the original posts. Data sources include QADI, Twitter API, and OMCD.

### Supported Tasks

- **Task Category:** Text generation
- **Task:** Generating synthetic responses to native Darija social media posts based on six specific tasks: Continuation, Reply, Summarization, Rephrasing, Explanation, and Safe Response.

### Languages

The dataset is available in Moroccan Arabic (Darija).

## Dataset Structure

The dataset consists of 12,973 pairs of posts and their synthetic counterparts, divided across different tasks.

### Data Instances

Each data instance includes:

- **index**: Unique identifier for the social media post.
- **Text**: The original Darija social media post.
- **Task**: The task performed (one of Continuation, Reply, Summarization, Rephrasing, Explanation, and SafeResponse).
- **Generation**: The response generated for the specific task.

Example:

```
{
  "index": 135,
  "Text": "غير لعبو الكورة وايلا ربحتو مبروك عليكم",
  "Task": "Continuation",
  "Generation": "وإيلا خسرتو ما تبقاوش تلوموا الحكم والا الظروف. الرياضة روح وأخلاق قبل كل شيء."
}
```

## Dataset Creation

### Curation Rationale

This dataset was created to enhance NLP capabilities for Moroccan Darija by generating synthetic data for various text generation tasks. It enables models to handle common social media use cases in Darija.

### Source Data

#### Initial Data Collection and Normalization

The data was collected from three sources:

- **QADI**: 6,362 valid tweets were collected from the QADI dataset.
- **Twitter API**: 4,226 tweets were gathered using Darija-specific keywords.
- **OMCD**: 3,219 YouTube comments labeled as offensive were collected from OMCD for the Safe Response task.

#### Who are the source language producers?

The original social media posts were produced by native Darija speakers on Twitter and YouTube.
The synthetic generations were created by Claude 3.5 Sonnet, based on prompts related to each specific task.

### Annotations

#### Annotation process

The dataset was generated using machine translation and text generation, followed by manual review to ensure the quality and relevance of the synthetic outputs.

#### Who are the annotators?

The annotations were machine-generated, with manual oversight provided by experts in Moroccan Darija.

### Personal and Sensitive Information

The dataset does not contain personal or sensitive information.

## Considerations for Using the Data

### Social Impact of Dataset

This dataset promotes the development of language models capable of understanding and responding in Moroccan Darija, contributing to the advancement of NLP for underrepresented languages.

### Discussion of Biases

The dataset excludes certain technical topics and culturally inappropriate questions to ensure relevance and accessibility in the Moroccan context. However, as the data was machine-translated and adapted, it may still contain linguistic biases inherent in the model used for translation, namely Claude 3.5 Sonnet.

### Other Known Limitations

- Some social media posts may have been misclassified or misrepresented during data collection and filtering.
- The quality of the synthetic generations may vary based on the specific task and the nature of the original post.

## Additional Information

### Dataset Curators

- MBZUAI-Paris team

### Licensing Information

- [ODC-BY](https://opendatacommons.org/licenses/by/1-0/).
### Citation Information

```
@article{shang2024atlaschatadaptinglargelanguage,
  title={Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect},
  author={Guokan Shang and Hadi Abdine and Yousef Khoubrane and Amr Mohamed and Yassine Abbahaddou and Sofiane Ennadir and Imane Momayiz and Xuguang Ren and Eric Moulines and Preslav Nakov and Michalis Vazirgiannis and Eric Xing},
  year={2024},
  eprint={2409.17912},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2409.17912},
}
```

```
@article{abdelali2020arabic,
  title={Arabic dialect identification in the wild},
  author={Abdelali, Ahmed and Mubarak, Hamdy and Samih, Younes and Hassan, Sabit and Darwish, Kareem},
  journal={arXiv preprint arXiv:2005.06557},
  year={2020}
}
```

```
@article{essefar2023omcd,
  title={OMCD: Offensive Moroccan comments dataset},
  author={Essefar, Kabil and Ait Baha, Hassan and El Mahdaouy, Abdelkader and El Mekki, Abdellah and Berrada, Ismail},
  journal={Language Resources and Evaluation},
  volume={57},
  number={4},
  pages={1745--1765},
  year={2023},
  publisher={Springer}
}
```
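
### Usage Sketch

The record layout described under Data Instances (`index`, `Text`, `Task`, `Generation`) can be checked and tallied with a few lines of Python. This is a minimal sketch, not code shipped with the dataset: the helper names (`normalize_task`, `validate`, `count_by_task`) are hypothetical, and the exact task strings stored in the data are an assumption, since the card writes both "Safe Response" and "SafeResponse". The first sample record is the example from the card; the second is a placeholder, not real data.

```python
from collections import Counter
from typing import Iterable

# Fields listed under "Data Instances" in the card.
REQUIRED_FIELDS = {"index", "Text", "Task", "Generation"}

# Task labels as named in the card. Assumption: labels are stored without
# spaces ("SafeResponse"); the card uses both spellings.
TASKS = {
    "Continuation", "Reply", "Summarization",
    "Rephrasing", "Explanation", "SafeResponse",
}


def normalize_task(label: str) -> str:
    """Drop whitespace so 'Safe Response' and 'SafeResponse' compare equal."""
    return "".join(label.split())


def validate(record: dict) -> dict:
    """Return a copy of the record with a normalized task, or raise ValueError."""
    missing = REQUIRED_FIELDS.difference(record)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    task = normalize_task(record["Task"])
    if task not in TASKS:
        raise ValueError(f"unknown task: {record['Task']!r}")
    return {**record, "Task": task}


def count_by_task(records: Iterable[dict]) -> Counter:
    """Count validated records per task label."""
    return Counter(validate(r)["Task"] for r in records)


# Example record copied from the card.
SAMPLE = {
    "index": 135,
    "Text": "غير لعبو الكورة وايلا ربحتو مبروك عليكم",
    "Task": "Continuation",
    "Generation": "وإيلا خسرتو ما تبقاوش تلوموا الحكم والا الظروف. الرياضة روح وأخلاق قبل كل شيء.",
}

# Placeholder record (NOT from the dataset) to exercise label normalization.
PLACEHOLDER = {
    "index": 0,
    "Text": "<post>",
    "Task": "Safe Response",
    "Generation": "<generation>",
}

if __name__ == "__main__":
    print(count_by_task([SAMPLE, PLACEHOLDER]))
```

In practice the records would typically come from the Hugging Face `datasets` library, for example `load_dataset("MBZUAI-Paris/MoroccanSocialMedia-MultiGen")`; the card does not state the split names, so inspect the returned object before iterating over it.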
Provider:
MBZUAI-Paris