
projecte-aina/GuiaCat

Source: Hugging Face. Updated 2024-02-28. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/projecte-aina/GuiaCat
Resource description:
---
annotations_creators:
- found
language_creators:
- found
language:
- ca
license:
- cc-by-nc-nd-4.0
multilinguality:
- monolingual
pretty_name: GuiaCat
task_categories:
- text-classification
task_ids:
- sentiment-classification
- sentiment-scoring
---

# Dataset Card for GuiaCat

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage** [Projecte AINA](https://projecteaina.cat/tech/)
- **Repository** [HuggingFace](https://huggingface.co/projecte-aina)
- **Point of Contact** langtech@bsc.es

### Dataset Summary

GuiaCat is a dataset consisting of 5,750 restaurant reviews in Catalan, each with 5 associated scores and a sentiment label. The data was provided by [GuiaCat](https://guiacat.cat) and curated by the BSC.

This work is licensed under a [Creative Commons Attribution Non-commercial No-Derivatives 4.0 International License](https://creativecommons.org/licenses/by-nc-nd/4.0/).

### Supported Tasks and Leaderboards

This corpus is mainly intended for sentiment analysis.

### Languages

The dataset is in Catalan (`ca-ES`).

## Dataset Structure

The dataset consists of restaurant reviews labelled with 5 scores: service, food, price-quality, environment, and average. Each review also has a sentiment label derived from the average score. The data is stored in CSV files.

### Data Instances

```
7,7,7,7,7.0,"Aquest restaurant té una llarga història. Ara han tornat a canviar d'amos i aquest canvi s'ha vist molt repercutit en la carta, preus, servei, etc. Hi ha molta varietat de menjar, i tot boníssim, amb especialitats molt ben trobades. El servei molt càlid i agradable, dóna gust que et serveixin així. I la decoració molt agradable també, bastant curiosa. En fi, pel meu gust, un bon restaurant i bé de preu.",bo
8,9,8,7,8.0,"Molt recomanable en tots els sentits. El servei és molt atent, pulcre i gens agobiant; alhora els plats també presenten un aspecte acurat, cosa que fa, juntament amb l'ambient, que t'oblidis de que, malauradament, està situat pròxim a l'autopista.Com deia, l'ambient és molt acollidor, té un menjador principal molt elegant, perfecte per quedar bé amb tothom!Tot i això, destacar la bona calitat / preu, ja que aquest restaurant té una carta molt extensa en totes les branques i completa, tant de menjar com de vins.
Pel qui entengui de vins, podriem dir que tot i tenir una carta molt rica, es recolza una mica en els clàssics.",molt bo
```

### Data Fields

- service: a score from 0 to 10 grading the service
- food: a score from 0 to 10 grading the food
- price-quality: a score from 0 to 10 grading the relation between price and quality
- environment: a score from 0 to 10 grading the environment
- avg: average of all the scores
- text: the review
- label: one of "molt bo", "bo", "regular", "dolent", "molt dolent"

### Data Splits

* dev.csv: 500 examples
* test.csv: 500 examples
* train.csv: 4,750 examples

## Dataset Creation

### Curation Rationale

We created this corpus to contribute to the development of language models in Catalan, a low-resource language.

### Source Data

The data was provided by [GuiaCat](https://guiacat.cat).

#### Initial Data Collection and Normalization

[N/A]

#### Who are the source language producers?

The language producers were the users of GuiaCat.

### Annotations

The annotations are automatically derived from the scores that the users provided while reviewing the restaurants.

#### Annotation process

The mapping between average scores and labels is:

- Higher than 8: molt bo
- Between 8 and 6: bo
- Between 6 and 4: regular
- Between 4 and 2: dolent
- Less than 2: molt dolent

#### Who are the annotators?

Users

### Personal and Sensitive Information

No personal information is included, although the data could contain hateful or abusive language.

## Considerations for Using the Data

### Social Impact of Dataset

We hope this corpus contributes to the development of language models in Catalan, a low-resource language.

### Discussion of Biases

We are aware that this data might contain biases. We have not applied any steps to reduce their impact.

### Other Known Limitations

[N/A]

## Additional Information

### Dataset Curators

Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (bsc-temu@bsc.es).

This work was funded by the [Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya](https://politiquesdigitals.gencat.cat/ca/inici/index.html#googtrans(ca|en)) within the framework of [Projecte AINA](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina).

### Licensing Information

This work is licensed under a [Creative Commons Attribution Non-commercial No-Derivatives 4.0 International License](https://creativecommons.org/licenses/by-nc-nd/4.0/).

### Citation Information

```
```

### Contributions

We want to thank GuiaCat for providing this data.
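Provider:
projecte-aina
Summary of original information

Dataset overview

Dataset name

  • Name: GuiaCat

Dataset summary

Supported tasks and leaderboards

  • Main task: sentiment analysis

Dataset structure

  • Data instances: each example includes scores for service, food, price-quality and environment, the average score, the review text, and a sentiment label.
  • Data fields: service score, food score, price-quality score, environment score, average score, review text, and sentiment label.
  • Data splits: 4,750 training examples, 500 development examples, and 500 test examples.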
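
As a minimal sketch of how a row could be read, the snippet below parses one line in the format of the dataset card's data instances using Python's standard `csv` module. The column order (service, food, price-quality, environment, avg, text, label) follows the card's field list. The absence of a header row, the `Review` type, the `parse_rows` helper and the shortened review text are assumptions made for illustration.

```python
import csv
import io
from typing import Iterator, NamedTuple


class Review(NamedTuple):
    """One GuiaCat row (hypothetical container; field names follow the dataset card)."""
    service: float
    food: float
    price_quality: float
    environment: float
    avg: float
    text: str
    label: str


def parse_rows(handle) -> Iterator[Review]:
    """Yield reviews from an open CSV handle, assuming there is no header row."""
    for row in csv.reader(handle):
        if len(row) != 7:
            continue  # skip blank or malformed lines
        *scores, text, label = row
        yield Review(*(float(s) for s in scores), text, label)


# Shortened sample in the format shown in the dataset card.
SAMPLE = '7,7,7,7,7.0,"Hi ha molta varietat de menjar, i tot boníssim.",bo\n'

reviews = list(parse_rows(io.StringIO(SAMPLE)))
print(reviews[0].label, reviews[0].avg)  # prints: bo 7.0
```

Because `csv.reader` honours quoted fields, commas and line breaks inside the review text stay within a single `text` value. For a real split such as `train.csv`, the file would need to be opened with `newline=""` and `encoding="utf-8"` before being passed to `parse_rows`.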

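Dataset creation

  • Source data: the data was provided by GuiaCat users.
  • Annotations: the annotations are generated automatically from the scores that users provided.
  • Personal and sensitive information: the data contains no personal information, but may contain hateful or abusive language.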
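
The automatic annotation can be illustrated with a short sketch. The thresholds (8, 6, 4, 2) come from the dataset card's mapping between average scores and labels. How exact boundary values are handled is an assumption: the card's example row with an average of 8.0 is labelled "molt bo", so this sketch treats each lower bound as inclusive. The names `THRESHOLDS`, `label_from_average` and `label_from_scores` are hypothetical, and the average is assumed to be the plain mean of the four scores, which matches the card's two example rows.

```python
# (lower bound, label) pairs, checked from highest to lowest.
# Inclusive lower bounds are an assumption; the card only gives ranges.
THRESHOLDS = [
    (8.0, "molt bo"),
    (6.0, "bo"),
    (4.0, "regular"),
    (2.0, "dolent"),
]


def label_from_average(avg: float) -> str:
    """Map an average score (0 to 10) to a sentiment label."""
    for lower, label in THRESHOLDS:
        if avg >= lower:
            return label
    return "molt dolent"


def label_from_scores(service, food, price_quality, environment):
    """Return (average, label) for the four per-aspect scores."""
    scores = [service, food, price_quality, environment]
    avg = sum(scores) / len(scores)
    return avg, label_from_average(avg)


print(label_from_scores(8, 9, 8, 7))  # prints: (8.0, 'molt bo')
print(label_from_scores(7, 7, 7, 7))  # prints: (7.0, 'bo')
```

Keeping the thresholds in a single ordered list makes it easy to switch to exclusive bounds if the real data turns out to treat boundary values differently.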

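Considerations for using the data

  • Social impact: intended to support the development of language models in Catalan.
  • Discussion of biases: the data may contain biases, and no steps have been taken to reduce their impact.

Additional information

  • Dataset curators: Text Mining Unit (TeMU) at the Barcelona Supercomputing Center.
  • Funding: funded by the Departament de la Vicepresidència i de Polítiques Digitals i Territori of the Generalitat de Catalunya.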