NLUCat

NIAID Data Ecosystem, indexed 2026-05-01

Download link:

https://zenodo.org/record/10362025

Description:

Dataset Description

Dataset Summary

NLUCat is a dataset for natural language understanding (NLU) in Catalan. It consists of nearly 12,000 instructions annotated with the most relevant intents and spans. Each instruction is also accompanied by the instructions given to the annotator who wrote it.

The intents covered are the usual ones for a virtual home assistant (activity calendar, IoT, list management, leisure, etc.), plus specific intents that address the social and healthcare needs of vulnerable people (information on administrative procedures, menu and medication reminders, etc.).

The spans are annotated with a tag describing the type of information they contain. The tags are fine-grained but can easily be grouped for use in robust systems.

The examples are not only written in Catalan; they also reflect the geographical and cultural reality of its speakers (geographic points, cultural references, etc.).

This dataset can be used to train models for intent classification, span identification, and example generation.

This is the complete version of the dataset. A version prepared for training and evaluating intent classifiers has been published on HuggingFace.

This repository contains the following items:

* NLUCat_annotation_guidelines.docx: the guidelines provided to the annotation team
* NLUCat_dataset.json: the complete NLUCat dataset
* NLUCat_stats.tsv: statistics about the NLUCat dataset
* dataset: folder with the dataset as published on HuggingFace, split and prepared for training and evaluating intent classifiers
* reports: folder with the reports given as feedback to the annotators during the annotation process

This dataset can be used for any purpose, whether academic or commercial, under the terms of the CC BY 4.0 license. Give appropriate credit, provide a link to the license, and indicate if changes were made.

Supported Tasks and Leaderboards

Intent classification, span identification, and example generation.
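For the intent classification task, a minimal sketch of turning records into (text, intent) pairs might look like the following. It assumes each record follows the layout given under Data Fields (`example`, `annotation.intent`); the helper names `to_intent_pairs` and `build_label_index` are hypothetical, and the single record is the one shown in the card's Example section, with its slots omitted for brevity.

```python
from collections import Counter

# One record copied from the card's Example section (slots omitted here).
# Assumption: a real run would load a list of such records from a split file.
records = [
    {
        "example": "Demana una ambulància; la meva dona està de part.",
        "annotation": {"intent": "call_emergency", "slots": []},
    }
]


def to_intent_pairs(records):
    """Return (text, intent) pairs suitable for an intent classifier."""
    return [(r["example"], r["annotation"]["intent"]) for r in records]


def build_label_index(pairs):
    """Map each distinct intent tag to an integer id, in alphabetical order."""
    labels = sorted({label for _, label in pairs})
    return {label: i for i, label in enumerate(labels)}


pairs = to_intent_pairs(records)
label2id = build_label_index(pairs)
intent_counts = Counter(label for _, label in pairs)
```

Sorting the labels before assigning ids keeps the mapping stable across runs, which helps when the same index has to be reused for the train, dev, and test splits.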
Languages

The dataset is in Catalan (ca-ES).

Dataset Structure

Data Instances

Three JSON files, one for each split.

Data Fields

* example: `str`. Example
* annotation: `dict`. Annotation of the example
  * intent: `str`. Intent tag
  * slots: `list`. List of slots
    * Tag: `str`. Tag of the slot
    * Text: `str`. Text of the slot
    * Start_char: `int`. First character of the span
    * End_char: `int`. End of the span; in the example below this is the offset just after the last character

Example

An example looks as follows:

    {
      "example": "Demana una ambulància; la meva dona està de part.",
      "annotation": {
        "intent": "call_emergency",
        "slots": [
          {
            "Tag": "service",
            "Text": "ambulància",
            "Start_char": 11,
            "End_char": 21
          },
          {
            "Tag": "situation",
            "Text": "la meva dona està de part",
            "Start_char": 23,
            "End_char": 48
          }
        ]
      }
    }

Data Splits

* NLUCat.train: 9128 examples
* NLUCat.dev: 1441 examples
* NLUCat.test: 1441 examples

Dataset Creation

Curation Rationale

We created this dataset to contribute to the development of language models in Catalan, a low-resource language. When creating this dataset, we took into account not only the language but also the entire socio-cultural reality of the Catalan-speaking population. Special consideration was also given to the needs of the vulnerable population.

Source Data

Initial Data Collection and Normalization

We commissioned a company to create fictitious examples for this dataset.

Who are the source language producers?

We commissioned the writing of the examples to the company m47 labs.

Annotations

Annotation process

This dataset was built in three steps, modeled on the process followed for the NLU-Evaluation-Data dataset, as explained in the paper.

* First step: translation or drafting of the instructions given to the annotators for writing the examples.
* Second step: writing the examples. This step also includes grammatical correction and normalization of the texts.
* Third step: annotating the intents and the slots of each example.
In this step, some modifications were made to the annotation guidelines to adjust them to real situations.

Who are the annotators?

The drafting of the examples and their annotation was entrusted to the company m47 labs through a public tender process.

Personal and Sensitive Information

No personal or sensitive information is included. The examples used to prepare this dataset are fictitious and, therefore, the information shown is not real.

Considerations for Using the Data

Social Impact of Dataset

We hope that this dataset will help the development of virtual assistants in Catalan, a language that is often not taken into account, and that it will especially help to improve the quality of life of people with special needs.

Discussion of Biases

When writing the examples, the annotators were asked to take into account the socio-cultural reality (geographic points, artists and cultural references, etc.) of the Catalan-speaking population. Likewise, they were asked to avoid examples that reinforce the stereotypes that exist in this society. For example: be careful with the gender or origin of personal names that are associated with certain activities.

Other Known Limitations

[N/A]

Additional Information

Dataset Curators

Language Technologies Unit at the Barcelona Supercomputing Center (langtech@bsc.es)

This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.

Licensing Information

This dataset can be used for any purpose, whether academic or commercial, under the terms of the CC BY 4.0 license. Give appropriate credit, provide a link to the license, and indicate if changes were made.

Citation Information

DOI

Contributions

The drafting of the examples and their annotation was entrusted to the company m47 labs through a public tender process.
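As a sanity check on the span annotation described under Data Fields, the sketch below verifies that each slot's offsets reproduce its `Text` when used to slice `example`. It uses the record from the card's Example section. The helper name `check_slots` is hypothetical, and treating `End_char` as an exclusive offset is an assumption inferred from that one example rather than something the card states.

```python
def check_slots(record):
    """Return the slots whose offsets do not reproduce their Text."""
    text = record["example"]
    mismatched = []
    for slot in record["annotation"]["slots"]:
        # Assumption: End_char is exclusive (Python slice convention),
        # which is what the card's own example implies.
        span = text[slot["Start_char"]:slot["End_char"]]
        if span != slot["Text"]:
            mismatched.append(slot)
    return mismatched


# Record copied from the card's Example section.
record = {
    "example": "Demana una ambulància; la meva dona està de part.",
    "annotation": {
        "intent": "call_emergency",
        "slots": [
            {"Tag": "service", "Text": "ambulància",
             "Start_char": 11, "End_char": 21},
            {"Tag": "situation", "Text": "la meva dona està de part",
             "Start_char": 23, "End_char": 48},
        ],
    },
}

mismatches = check_slots(record)
print(mismatches)  # [] when every span matches its Text
```

Running the same check over a whole split would flag records where offsets and text have drifted apart, for instance after re-normalizing the example text.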

Created:

2024-03-04
