universityofbucharest/moroco

Name: universityofbucharest/moroco
Creator: universityofbucharest
Published: 2024-01-18 11:09:14
License: No description provided

Source: Hugging Face. Updated 2024-01-18; indexed 2024-05-25.

Download link:

https://hf-mirror.com/datasets/universityofbucharest/moroco


Resource description:

---
annotations_creators:
- found
language_creators:
- found
language:
- ro
license:
- cc-by-4.0
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- topic-classification
paperswithcode_id: moroco
pretty_name: 'MOROCO: The Moldavian and Romanian Dialectal Corpus'
language_bcp47:
- ro-MD
dataset_info:
  features:
  - name: id
    dtype: string
  - name: category
    dtype:
      class_label:
        names:
          '0': culture
          '1': finance
          '2': politics
          '3': science
          '4': sports
          '5': tech
  - name: sample
    dtype: string
  config_name: moroco
  splits:
  - name: train
    num_bytes: 39314292
    num_examples: 21719
  - name: test
    num_bytes: 10877813
    num_examples: 5924
  - name: validation
    num_bytes: 10721304
    num_examples: 5921
  download_size: 60711985
  dataset_size: 60913409
---

# Dataset Card for MOROCO

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [Github](https://github.com/butnaruandrei/MOROCO)
- **Repository:** [Github](https://github.com/butnaruandrei/MOROCO)
- **Paper:** [Arxiv](https://arxiv.org/abs/1901.06543)
- **Leaderboard:** [Needs More Information]
- **Point of Contact:** [email](raducu.ionescu@gmail.com)

### Dataset Summary

Introducing MOROCO - The **Mo**ldavian and **Ro**manian Dialectal **Co**rpus. The MOROCO data set contains Moldavian and Romanian samples of text collected from the news domain. Each sample belongs to one of the following six topics: (0) culture, (1) finance, (2) politics, (3) science, (4) sports, (5) tech. The corpus contains a total of 33,564 samples, each labelled with one of the aforementioned six categories. A train/validation/test split is also included, with 21,719/5,921/5,924 samples in the respective subsets.

### Supported Tasks and Leaderboards

[LiRo Benchmark and Leaderboard](https://eemlcommunity.github.io/ro_benchmark_leaderboard/site/)

### Languages

The text in the dataset is in Romanian (`ro`).

## Dataset Structure

### Data Instances

Below is an example of a sample from MOROCO:

```
{'id': '48482',
 'category': 2,
 'sample': '“$NE$ cum am spus, nu este un sfârşit de drum . Vom continua lupta cu toate instrumentele şi cu toate mijloacele legale, parlamentare şi civice pe care le avem la dispoziţie . Evident că vom contesta la $NE$ această lege, au anunţat şi colegii de la $NE$ o astfel de contestaţie . Practic trebuie utilizat orice instrument pe care îl identificăm pentru a bloca intrarea în vigoare a acestei legi . Bineînţeles, şi preşedintele are punctul său de vedere . ( . . . ) $NE$ legi sunt împănate de motive de neconstituţionalitate . Colegii mei de la departamentul juridic lucrează în prezent pentru a definitiva textul contestaţiei”, a declarat $NE$ $NE$ citat de news . ro . Senatul a adoptat, marţi, în calitate de for decizional, $NE$ privind statutul judecătorilor şi procurorilor, cu 80 de voturi ”pentru” şi niciun vot ”împotrivă”, în condiţiile în care niciun partid din opoziţie nu a fost prezent în sală .'}
```

where 48482 is the sample ID, followed by the ground-truth category label, and then the text to be classified by topic.

Note: The category label takes integer values from 0 to 5.

### Data Fields

- `id`: string, the unique identifier of a sample.
- `category`: integer in the range [0, 5]; the category assigned to a sample.
- `sample`: string, the news report to be classified / used in classification.

### Data Splits

The train/validation/test split contains 21,719/5,921/5,924 samples, each tagged with its assigned category.

## Dataset Creation

### Curation Rationale

The samples are preprocessed to eliminate named entities. This prevents classifiers from basing their decisions on features that are unrelated to the topics. For example, named entities that refer to politicians or football players can provide clues about the topic. For more details, please read the [paper](https://arxiv.org/abs/1901.06543).

### Source Data

#### Data Collection and Normalization

For the data collection, five of the most popular news websites in Romania and the Republic of Moldova were targeted. Because the data set was obtained through web scraping, all HTML tags had to be removed and consecutive whitespace characters replaced with a single space. As part of the pre-processing, named entities such as country names, cities, and public figures were removed and replaced with $NE$. The need to remove them also follows from the purpose of this dataset: categorization by topic. The authors therefore removed named entities to prevent classifiers from basing their decisions on features that are not truly indicative of the topics.

#### Who are the source language producers?

The original text comes from news websites from Romania and the Republic of Moldova.

### Annotations

#### Annotation process

As mentioned above, MOROCO is composed of text samples from the top five most popular news websites in Romania and the Republic of Moldova, respectively. Since the targeted news websites provide topic tags, the text samples can be labeled automatically with the corresponding category.

#### Who are the annotators?

N/A

### Personal and Sensitive Information

The textual data collected for MOROCO consists of news reports that are freely available on the Internet and of public interest. To the best of the authors' knowledge, the collected texts contain no personal or sensitive information that needed to be considered.

## Considerations for Using the Data

### Social Impact of Dataset

This dataset is part of an effort to encourage text classification research in languages other than English. Such work makes natural language technology accessible to more regions and cultures. Over the past three years there has been growing interest in studying Romanian from a Computational Linguistics perspective. However, we are far from having enough datasets and resources in this particular language.

### Discussion of Biases

The data included in MOROCO spans a well-defined time frame of a few years. Some topics that were prominent in the news during that period may not appear on news websites today or a few years from now.

### Other Known Limitations

[Needs More Information]

## Additional Information

### Dataset Curators

Published and managed by Radu Tudor Ionescu and Andrei Butnaru.

### Licensing Information

CC BY-SA 4.0 License

### Citation Information

```
@inproceedings{ Butnaru-ACL-2019,
    author = {Andrei M. Butnaru and Radu Tudor Ionescu},
    title = "{MOROCO: The Moldavian and Romanian Dialectal Corpus}",
    booktitle = {Proceedings of ACL},
    year = {2019},
    pages={688--698},
}
```

### Contributions

Thanks to [@MihaelaGaman](https://github.com/MihaelaGaman) for adding this dataset.

Provider:

universityofbucharest

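Summary of Original Information

Dataset Overview

Name: MOROCO: The Moldavian and Romanian Dialectal Corpus
Language: Romanian (ro)
License: CC-BY-4.0
Multilinguality: monolingual
Size: 10K<n<100K
Source data: original
Task category: text classification
Task ID: topic classification

Dataset Structure

Data Instances

ID: string, the unique identifier of a sample
Category: integer in the range [0, 5], the category assigned to the sample
Sample: string, the news report text to be classified

Data Fields

id: string, the unique identifier of a sample
category: integer in the range [0, 5], the category assigned to the sample
sample: string, the news report text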
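
The integer `category` values correspond, in order, to the six topic names listed in the dataset card metadata. A minimal sketch of decoding them in plain Python follows; the names `CATEGORY_NAMES` and `decode_category` are illustrative and not part of any dataset tooling, and the record's text is abbreviated here.

```python
# Topic names in label order, as listed in the dataset card metadata.
CATEGORY_NAMES = ["culture", "finance", "politics", "science", "sports", "tech"]


def decode_category(label: int) -> str:
    """Return the topic name for an integer label in the range [0, 5]."""
    if not 0 <= label < len(CATEGORY_NAMES):
        raise ValueError(f"category must be in [0, 5], got {label}")
    return CATEGORY_NAMES[label]


# Record shaped like the card's example; the sample text is shortened for display.
record = {"id": "48482", "category": 2, "sample": "Senatul a adoptat, marţi, ..."}
print(decode_category(record["category"]))  # politics
```

Raising on out-of-range labels, rather than returning a default, makes a mislabeled or corrupted row visible immediately.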

Data Splits

Training set: 21,719 samples
Test set: 5,924 samples
Validation set: 5,921 samples
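
As a quick sanity check, the three split sizes can be summed and compared with the 33,564 total stated in the dataset card. The sketch below uses only the counts reported on this page; the variable names are illustrative.

```python
# Split sizes as reported in the dataset card.
SPLITS = {"train": 21719, "validation": 5921, "test": 5924}

total = sum(SPLITS.values())
print(f"total: {total}")  # 33564, matching the total stated in the card

for name, count in SPLITS.items():
    # Share of the whole corpus held by each split.
    print(f"{name}: {count} samples ({count / total:.1%})")
```

The split works out to roughly 65% training data, with the remainder divided almost evenly between validation and test.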

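Dataset Creation

Source Data

Data collection: gathered from five of the most popular news websites in Romania and Moldova
Preprocessing: HTML tags removed, consecutive whitespace replaced with a single space, and named entities replaced with $NE$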
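
The preprocessing steps can be sketched roughly as follows. This is not the authors' pipeline: the tag regex is a naive assumption, the entity list is passed in by hand instead of coming from a named-entity recognizer, and the function name `normalize` and the input sentence are made up for illustration.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")     # naive HTML tag matcher (assumption)
WHITESPACE_RE = re.compile(r"\s+")  # any run of whitespace characters


def normalize(html: str, entities: list[str]) -> str:
    """Strip tags, collapse whitespace, then mask the given entities with $NE$."""
    # Replace each tag with a space so adjacent words do not get glued together.
    text = TAG_RE.sub(" ", html)
    # Collapse consecutive whitespace into a single space.
    text = WHITESPACE_RE.sub(" ", text).strip()
    # Mask longer entities first so a shorter one cannot split a longer one.
    for entity in sorted(entities, key=len, reverse=True):
        text = text.replace(entity, "$NE$")
    return text


# Made-up input; a real pipeline would obtain the entity list from an NER step.
raw = "<p>Echipa din   <b>Cluj</b> a câştigat\n meciul.</p>"
print(normalize(raw, ["Cluj"]))  # Echipa din $NE$ a câştigat meciul.
```

Plain string replacement is used for masking so that the `$` in `$NE$` is never interpreted as a regex metacharacter.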

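Annotations

Annotation process: samples are labeled automatically using the topic tags provided by the news websites

Considerations for Using the Data

Social Impact

Encourages text classification research in languages other than English and makes natural language technology accessible to more regions and cultures

Discussion of Biases

The data covers a limited time span, so it may not include some topics that appear on news websites today or in the future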
