projecte-aina/GuiaCat
收藏Hugging Face2024-02-28 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/projecte-aina/GuiaCat
下载链接
链接失效反馈官方服务:
资源简介:
---
annotations_creators:
- found
language_creators:
- found
language:
- ca
license:
- cc-by-nc-nd-4.0
multilinguality:
- monolingual
pretty_name: GuiaCat
task_categories:
- text-classification
task_ids:
- sentiment-classification
- sentiment-scoring
---
# Dataset Card for GuiaCat
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Curation Rationale](#curation-rationale)
- [Annotations](#annotations)
- [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Social Impact of Dataset](#social-impact-of-dataset)
- [Discussion of Biases](#discussion-of-biases)
- [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
- [Dataset Curators](#dataset-curators)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Contributions](#contributions)
## Dataset Description
- **Homepage** [Projecte AINA](https://projecteaina.cat/tech/)
- **Repository** [HuggingFace](https://huggingface.co/projecte-aina)
- **Point of Contact** langtech@bsc.es
### Dataset Summary
GuiaCat is a dataset consisting of 5.750 restaurant reviews in Catalan, with 5 associated scores and a label of sentiment. The data was provided by [GuiaCat](https://guiacat.cat) and curated by the BSC.
This work is licensed under a [Creative Commons Attribution Non-commercial No-Derivatives 4.0 International License](https://creativecommons.org/licenses/by-nc-nd/4.0/).
### Supported Tasks and Leaderboards
This corpus is mainly intended for sentiment analysis.
### Languages
The dataset is in Catalan (`ca-ES`).
## Dataset Structure
The dataset consists of restaurant reviews labelled with 5 scores: service, food, price-quality, environment, and average. Reviews also have a sentiment label, derived from the average score, all stored as a csv file.
### Data Instances
```
7,7,7,7,7.0,"Aquest restaurant té una llarga història. Ara han tornat a canviar d'amos i aquest canvi s'ha vist molt repercutit en la carta, preus, servei, etc. Hi ha molta varietat de menjar, i tot boníssim, amb especialitats molt ben trobades. El servei molt càlid i agradable, dóna gust que et serveixin així. I la decoració molt agradable també, bastant curiosa. En fi, pel meu gust, un bon restaurant i bé de preu.",bo
8,9,8,7,8.0,"Molt recomanable en tots els sentits. El servei és molt atent, pulcre i gens agobiant; alhora els plats també presenten un aspecte acurat, cosa que fa, juntament amb l'ambient, que t'oblidis de que, malauradament, està situat pròxim a l'autopista.Com deia, l'ambient és molt acollidor, té un menjador principal molt elegant, perfecte per quedar bé amb tothom!Tot i això, destacar la bona calitat / preu, ja que aquest restaurant té una carta molt extensa en totes les branques i completa, tant de menjar com de vins. Pel qui entengui de vins, podriem dir que tot i tenir una carta molt rica, es recolza una mica en els clàssics.",molt bo
```
### Data Fields
- service: a score from 0 to 10 grading the service
- food: a score from 0 to 10 grading the food
- price-quality: a score from 0 to 10 grading the relation between price and quality
- environment: a score from 0 to 10 grading the environment
- avg: average of all the scores
- text: the review
- label: it can be "molt bo", "bo", "regular", "dolent", "molt dolent"
### Data Splits
* dev.csv: 500 examples
* test.csv: 500 examples
* train.csv: 4,750 examples
## Dataset Creation
### Curation Rationale
We created this corpus to contribute to the development of language models in Catalan, a low-resource language.
### Source Data
The data of this dataset has been provided by [GuiaCat](https://guiacat.cat).
#### Initial Data Collection and Normalization
[N/A]
#### Who are the source language producers?
The language producers were the users from GuiaCat.
### Annotations
The annotations are automatically derived from the scores that the users provided while reviewing the restaurants.
#### Annotation process
The mapping between average scores and labels is:
- Higher than 8: molt bo
- Between 8 and 6: bo
- Between 6 and 4: regular
- Between 4 and 2: dolent
- Less than 2: molt dolent
#### Who are the annotators?
Users
### Personal and Sensitive Information
No personal information included, although it could contain hate or abusive language.
## Considerations for Using the Data
### Social Impact of Dataset
We hope this corpus contributes to the development of language models in Catalan, a low-resource language.
### Discussion of Biases
We are aware that this data might contain biases. We have not applied any steps to reduce their impact.
### Other Known Limitations
[N/A]
## Additional Information
### Dataset Curators
Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (bsc-temu@bsc.es).
This work was funded by the [Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya](https://politiquesdigitals.gencat.cat/ca/inici/index.html#googtrans(ca|en) within the framework of [Projecte AINA](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina).
### Licensing Information
This work is licensed under a [Creative Commons Attribution Non-commercial No-Derivatives 4.0 International License](https://creativecommons.org/licenses/by-nc-nd/4.0/).
### Citation Information
```
```
### Contributions
We want to thank GuiaCat for providing this data.
提供机构:
projecte-aina
原始信息汇总
数据集概述
数据集名称
- 名称: GuiaCat
数据集摘要
- 描述: GuiaCat是一个包含5,750条加泰罗尼亚语餐厅评论的数据集,每条评论附有5个评分和一个情感标签。数据由GuiaCat提供,由BSC进行整理。
- 语言: 加泰罗尼亚语 (
ca-ES) - 许可: Creative Commons Attribution Non-commercial No-Derivatives 4.0 International License
支持的任务和排行榜
- 主要任务: 情感分析
数据集结构
- 数据实例: 示例包括服务、食物、价格质量、环境、平均分和评论文本及情感标签。
- 数据字段: 包括服务评分、食物评分、价格质量评分、环境评分、平均分、评论文本和情感标签。
- 数据分割: 训练集4,750条,开发集500条,测试集500条。
数据集创建
- 来源数据: 数据由GuiaCat用户提供。
- 注释: 注释是根据用户提供的评分自动生成的。
- 个人和敏感信息: 数据中不包含个人信息,但可能包含仇恨或滥用语言。
使用数据的考虑
- 社会影响: 旨在促进加泰罗尼亚语语言模型的发展。
- 偏见讨论: 数据可能包含偏见,但未采取措施减少其影响。
附加信息
- 数据集整理者: 巴塞罗那超级计算中心文本挖掘单元(TeMU)。
- 资金支持: 由加泰罗尼亚政府数字政策和领土部门资助。



