
CaSSA-catalan-structured-sentiment-analysis

Indexed from NIAID Data Ecosystem on 2026-05-02
Download link:
https://zenodo.org/records/12189000
Description:
Dataset Summary

The CaSSA dataset is a corpus of 6,400 reviews and forum messages annotated with polar expressions. Each text is annotated with every expression of polarity it contains. For each polar expression, we annotated the expression itself, the target (the object of the expression), and the source (the subject expressing the sentiment). In total, 25,453 polar expressions have been annotated.

We provide the following files and folders:

- The dataset folder, which contains the final dataset as well as the dataloader and README file as provided in Hugging Face.
- The annotations file.
- The annotation guidelines file.

Supported Tasks and Leaderboards

This dataset can be used to train models for sentiment analysis.

Languages

The dataset is in Catalan (ca-ES).

Dataset Structure

Each instance in the dataset is a text. Each text can have zero or more polar expressions, which are contained in the "opinions" field. Each opinion contains a source, a target, a polar expression, a polarity value and an intensity value.

Data Instances

    {
      "sent_id": "2d6a3a0f-6686-4d8b-9c5f-51c424ff90be",
      "text": "El seu menú de nit de cap de setmana es boníssim, plats fets amb criteri i que surten com un rellotge. Servei proper i amable. Per poc mes de 20 euros entre pisos i flautes menges com un rei.",
      "opinions": [
        {
          "Source": None,
          "Target": [["Servei"], ["103:109"]],
          "Polar_expression": [["proper"], ["110:116"]],
          "Polarity": "Neutral",
          "Intensity": "Standard"
        },
        {
          "Source": None,
          "Target": [["Servei"], ["103:109"]],
          "Polar_expression": [["amable"], ["119:125"]],
          "Polarity": "Positive",
          "Intensity": "Standard"
        },
        {
          "Source": None,
          "Target": None,
          "Polar_expression": [["menges com un rei"], ["173:190"]],
          "Polarity": "Positive",
          "Intensity": "Strong"
        },
        {
          "Source": [["seu"], ["3:6"]],
          "Target": [["menú de nit de cap de setmana"], ["7:36"]],
          "Polar_expression": [["boníssim"], ["40:48"]],
          "Polarity": "Positive",
          "Intensity": "Strong"
        },
        {
          "Source": None,
          "Target": [["plats"], ["50:55"]],
          "Polar_expression": [["amb criteri"], ["61:72"]],
          "Polarity": "Positive",
          "Intensity": "Standard"
        }
      ]
    }

Data Splits

The dataset does not contain splits.

Dataset Creation

We created this corpus to contribute to the development of language models in Catalan, a low-resource language.

Source Data

The data was collected from messages on the GuiaCat online guide and the forum Racó Català.

Initial Data Collection and Normalization

We selected all the restaurant reviews we had from GuiaCat, and used an LLM to select messages in Racó Català that were written in the style of reviews.

Who are the source language producers?

The source language producers are users of GuiaCat and Racó Català.

Annotations

Each opinion contains a source, a target, a polar expression, a polarity value and an intensity value. Source, Target, and Polar_expression are spans, each represented both by its string and by its character positions in the text.
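The span representation described above can be checked with a short script. The following is a minimal sketch, not part of the dataset's own tooling: it assumes that each offset string has the form "start:end", counted in characters with an exclusive end (which is what the example instance suggests), and that the two inner lists of a span are parallel. The helper names `span_mismatches` and `check_record` are hypothetical, and the record is an excerpt of the instance shown under Data Instances, cut after "amable."

```python
# Excerpt of the example instance (text cut after "amable."; two of its opinions).
# Accented characters are written as escapes so the offsets are unambiguous.
record = {
    "sent_id": "2d6a3a0f-6686-4d8b-9c5f-51c424ff90be",
    "text": (
        "El seu men\u00fa de nit de cap de setmana es bon\u00edssim, plats fets "
        "amb criteri i que surten com un rellotge. Servei proper i amable."
    ),
    "opinions": [
        {
            "Source": None,
            "Target": [["Servei"], ["103:109"]],
            "Polar_expression": [["amable"], ["119:125"]],
            "Polarity": "Positive",
            "Intensity": "Standard",
        },
        {
            "Source": [["seu"], ["3:6"]],
            "Target": [["men\u00fa de nit de cap de setmana"], ["7:36"]],
            "Polar_expression": [["bon\u00edssim"], ["40:48"]],
            "Polarity": "Positive",
            "Intensity": "Strong",
        },
    ],
}

SPAN_FIELDS = ("Source", "Target", "Polar_expression")


def span_mismatches(text, span):
    """Return (expected, found) pairs where the offsets do not reproduce the string.

    Assumes span is None or [[string, ...], ["start:end", ...]] with an exclusive end.
    """
    if span is None:
        return []
    strings, offsets = span
    mismatches = []
    for expected, offset in zip(strings, offsets):
        start, end = (int(part) for part in offset.split(":"))
        found = text[start:end]
        if found != expected:
            mismatches.append((expected, found))
    return mismatches


def check_record(rec):
    """Collect (field, expected, found) for every span that does not match the text."""
    problems = []
    for opinion in rec["opinions"]:
        for field in SPAN_FIELDS:
            for expected, found in span_mismatches(rec["text"], opinion[field]):
                problems.append((field, expected, found))
    return problems


print(check_record(record))  # an empty list means every span matched its offsets
```

A check like this is mainly useful after any preprocessing that could shift character positions, such as whitespace normalisation or Unicode re-encoding of the text.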
Polarity and Intensity are labels. Polarity can be Positive, Negative, or Neutral; Intensity can be Standard or Strong.

Annotation process

The data was annotated by 2 annotators. In the cases in which they did not fully agree, a third annotator selected the preferred annotation.

Who are the annotators?

All the annotators are native speakers of Catalan.

Personal and Sensitive Information

The data from Racó Català was anonymised to remove user names and emails, which were replaced with random Catalan names. Mentions of the forum itself have also been changed.

Social Impact of Dataset

We hope this corpus contributes to the development of language models in Catalan, a low-resource language.

Discussion of Biases

We are aware that, since the data comes from online reviews and a public forum, it will contain biases, hate speech and toxic content. We have not applied any steps to reduce their impact.

Dataset Curators

Language Technologies Unit (LangTech) at the Barcelona Supercomputing Center. This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.

Licensing Information

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Note: the license has been updated to a more restrictive open license. Consequently, any downloads initiated after 12/03/2024 must adhere to the current licensing terms.
Citation Information

    @inproceedings{gonzalez-agirre-etal-2024-building-data,
        title = "Building a Data Infrastructure for a Mid-Resource Language: The Case of {C}atalan",
        author = "Gonzalez-Agirre, Aitor and
          Marimon, Montserrat and
          Rodriguez-Penagos, Carlos and
          Aula-Blasco, Javier and
          Baucells, Irene and
          Armentano-Oller, Carme and
          Palomar-Giner, Jorge and
          Kulebi, Baybars and
          Villegas, Marta",
        editor = "Calzolari, Nicoletta and
          Kan, Min-Yen and
          Hoste, Veronique and
          Lenci, Alessandro and
          Sakti, Sakriani and
          Xue, Nianwen",
        booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
        month = may,
        year = "2024",
        address = "Torino, Italia",
        publisher = "ELRA and ICCL",
        url = "https://aclanthology.org/2024.lrec-main.231",
        pages = "2556--2566",
    }
Created:
2024-06-20