CaSSA-catalan-structured-sentiment-analysis

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://zenodo.org/records/12189000

Resource description:

Dataset Summary

The CaSSA dataset is a corpus of 6,400 reviews and forum messages annotated with polar expressions. Each text is annotated with every polar expression it contains. For each polar expression, we annotated the expression itself, the target (the object of the expression), and the source (the subject expressing the sentiment). In total, 25,453 polar expressions have been annotated.

We provide the following files and folders:

- The dataset folder, which contains the final dataset, as well as the dataloader and README file as provided in Hugging Face.
- The annotations file.
- The annotation guidelines file.

Supported Tasks and Leaderboards

This dataset can be used to train models for sentiment analysis.

Languages

The dataset is in Catalan (ca-ES).

Dataset Structure

Each instance in the dataset is a text. Each text can have any number of polar expressions, from zero upwards, which are contained in the "opinions" field. Each opinion contains a source, a target, a polar expression, a polarity value and an intensity value.

Data Instances

{"sent_id": "2d6a3a0f-6686-4d8b-9c5f-51c424ff90be",
 "text": "El seu menú de nit de cap de setmana es boníssim, plats fets amb criteri i que surten com un rellotge. Servei proper i amable. 
Per poc mes de 20 euros entre pisos i flautes menges com un rei.",
 "opinions": [
  {"Source": None, "Target": [["Servei"], ["103:109"]], "Polar_expression": [["proper"], ["110:116"]], "Polarity": "Neutral", "Intensity": "Standard"},
  {"Source": None, "Target": [["Servei"], ["103:109"]], "Polar_expression": [["amable"], ["119:125"]], "Polarity": "Positive", "Intensity": "Standard"},
  {"Source": None, "Target": None, "Polar_expression": [["menges com un rei"], ["173:190"]], "Polarity": "Positive", "Intensity": "Strong"},
  {"Source": [["seu"], ["3:6"]], "Target": [["menú de nit de cap de setmana"], ["7:36"]], "Polar_expression": [["bon\u00edssim"], ["40:48"]], "Polarity": "Positive", "Intensity": "Strong"},
  {"Source": None, "Target": [["plats"], ["50:55"]], "Polar_expression": [["amb criteri"], ["61:72"]], "Polarity": "Positive", "Intensity": "Standard"}
 ]}

Data Splits

The dataset does not contain splits.

Dataset Creation

We created this corpus to contribute to the development of language models in Catalan, a low-resource language.

Source Data

The data was collected from messages on the GuiaCat online guide and the forum Racó Català.

Initial Data Collection and Normalization

We selected all the restaurant reviews we had from GuiaCat, and used an LLM to select messages in Racó Català that were written in the style of reviews.

Who are the source language producers?

The source language producers are users of GuiaCat and Racó Català.

Annotations

Each opinion contains a source, a target, a polar expression, a polarity value and an intensity value. Source, Target, and Polar_expression are spans, each represented both by its string and by its character positions in the text. Polarity and Intensity are labels: Polarity can be Positive, Negative or Neutral, and Intensity can be Standard or Strong.

Annotation process

The data was annotated by two annotators. Where they did not fully agree, a third annotator selected the preferred annotation.

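The span representation described under Data Instances and Annotations can be checked mechanically. The following is a minimal sketch, not part of the dataset's own dataloader: the helper names `resolve_span` and `find_mismatches` are hypothetical, the instance is abridged to three of its five opinions, and it assumes that offsets are "start:end" character positions with an exclusive end and that the two sentences of the review are separated by a single newline character (the reading under which the published offsets line up).

```python
# Abridged copy of the example instance from the dataset card.
# Assumption: a single "\n" separates "amable." from "Per poc mes".
TEXT = (
    "El seu menú de nit de cap de setmana es boníssim, plats fets amb "
    "criteri i que surten com un rellotge. Servei proper i amable.\n"
    "Per poc mes de 20 euros entre pisos i flautes menges com un rei."
)

INSTANCE = {
    "sent_id": "2d6a3a0f-6686-4d8b-9c5f-51c424ff90be",
    "text": TEXT,
    "opinions": [
        {"Source": None,
         "Target": [["Servei"], ["103:109"]],
         "Polar_expression": [["amable"], ["119:125"]],
         "Polarity": "Positive", "Intensity": "Standard"},
        {"Source": None,
         "Target": None,
         "Polar_expression": [["menges com un rei"], ["173:190"]],
         "Polarity": "Positive", "Intensity": "Strong"},
        {"Source": [["seu"], ["3:6"]],
         "Target": [["menú de nit de cap de setmana"], ["7:36"]],
         "Polar_expression": [["boníssim"], ["40:48"]],
         "Polarity": "Positive", "Intensity": "Strong"},
    ],
}


def resolve_span(text, span):
    """Resolve one Source/Target/Polar_expression field against the text.

    Returns a list of (string, start, end, matches) tuples.
    A field set to None yields an empty list.
    """
    if span is None:
        return []
    strings, offsets = span
    resolved = []
    for string, offset in zip(strings, offsets):
        # Assumption: offsets are "start:end" with an exclusive end index.
        start, end = (int(part) for part in offset.split(":"))
        resolved.append((string, start, end, text[start:end] == string))
    return resolved


def find_mismatches(instance):
    """List (opinion_index, field, string) for spans that disagree with the text."""
    text = instance["text"]
    problems = []
    for index, opinion in enumerate(instance["opinions"]):
        for field in ("Source", "Target", "Polar_expression"):
            for string, _start, _end, ok in resolve_span(text, opinion[field]):
                if not ok:
                    problems.append((index, field, string))
    return problems


# An empty list means every annotated string matches its character offsets.
print(find_mismatches(INSTANCE))
```

Running a check like this over the full dataset is one way to catch whitespace or encoding changes introduced when the text is copied between formats, since any such change shifts the offsets.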
Who are the annotators?

All the annotators are native speakers of Catalan.

Personal and Sensitive Information

The data from Racó Català was anonymised to remove user names and emails, which were replaced with random Catalan names. Mentions of the forum itself have also been changed.

Social Impact of Dataset

We hope this corpus contributes to the development of language models in Catalan, a low-resource language.

Discussion of Biases

We are aware that, since the data comes from online reviews and a public forum, it will contain biases, hate speech and toxic content. We have not applied any steps to reduce their impact.

Dataset Curators

Language Technologies Unit (LangTech) at the Barcelona Supercomputing Center.

This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.

Licensing Information

This work is licensed under a Creative Commons Attribution Non-commercial No-Derivatives 4.0 International License.

The license has been updated to a more restrictive open license. Consequently, any downloads initiated after 12/03/2024 must adhere to the current licensing terms.

Citation Information

@inproceedings{gonzalez-agirre-etal-2024-building-data,
    title = "Building a Data Infrastructure for a Mid-Resource Language: The Case of {C}atalan",
    author = "Gonzalez-Agirre, Aitor and Marimon, Montserrat and Rodriguez-Penagos, Carlos and Aula-Blasco, Javier and Baucells, Irene and Armentano-Oller, Carme and Palomar-Giner, Jorge and Kulebi, Baybars and Villegas, Marta",
    editor = "Calzolari, Nicoletta and Kan, Min-Yen and Hoste, Veronique and Lenci, Alessandro and Sakti, Sakriani and Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.231",
    pages = "2556--2566",
}

Created:

2024-06-20
