iRead4Skills Dataset 2: annotated corpora by level of complexity for FR, PT and SP

Indexed from the NIAID Data Ecosystem on 2026-05-02

Download link:

https://zenodo.org/record/12821881

Resource description:

The Dataset 2: annotated corpora by level of complexity for FR, PT and SP is a collection of texts categorized by complexity level and annotated for complexity features, presented in Excel format (.xlsx). These corpora were compiled and annotated under the scope of the project iRead4Skills – Intelligent Reading Improvement System for Fundamental and Transversal Skills Development, funded by the European Commission (grant number: 1010094837). The project aims to enhance reading skills within the adult population by creating an intelligent system that assesses text complexity and recommends suitable reading materials to adults with low literacy skills, contributing to reducing skills gaps and facilitating access to information and culture (https://iread4skills.com).

This dataset is the result of specifically devised classification and annotation tasks, in which selected texts were organized and distributed to trainers in Adult Learning (AL) and Vocational Education Training (VET) centres, as well as to adult students in those centres. The tasks were conducted via the Qualtrics platform.

The Dataset 2: annotated corpora by level of complexity for FR, PT and SP is derived from the iRead4Skills Dataset 1: corpora by level of complexity for FR, PT and SP (https://doi.org/10.5281/zenodo.10055909), which comprises written texts of various genres and complexity levels. From this collection, a sample of texts was selected for classification and annotation. The classification and annotation task aimed to provide additional data and test sets for the complexity analysis systems for the three languages of the project: French, Portuguese, and Spanish. The sample texts in each of the language corpora were selected taking into account the diversity of topics/domains, genres, and the reading preferences of the target audience of the iRead4Skills project.
The sample amounted to a total of 462 texts per language, divided by level of complexity as follows:
· 140 Very Easy texts
· 140 Easy texts
· 140 Plain texts
· 42 More Complex texts

Trainers and students were asked to classify the texts according to the complexity levels of the project, here informally defined as:
· Very Easy (everyone can understand the text or most of the text)
· Easy (a person with less than the 9th year of schooling can understand the text or most of the text)
· Plain (a person with the 9th year of schooling can understand the text the first time he/she reads it)
· More Complex (a person with the 9th year of schooling cannot understand the text the first time he/she reads it)

Annotators were also asked to mark the parts of the texts they considered complex according to various types of features, at word level and at sentence level (e.g., word order, sentence composition, etc.). The full details regarding the students' and the trainers' tasks, the qualitative and quantitative description of the data, and inter-annotator agreement are described here: https://zenodo.org/records/14653180

The results are presented here in Excel format. For each language and each group (trainers and students) there are two files, an annotation file and a classification file, giving four files per language and twelve files in total. In all files, the data is organized as a matrix, with each row representing an 'answer' from a particular participant and the columns containing various details about that specific input, as listed below (column name: data):
· Annotator's ID: The randomly generated ID code for each annotator, together with information on the dataset assigned to them.
· Progress: Information on the completion of the task (for each text).
· Duration (seconds): Time used in the completion of the task (for each text).
· File Name (N1 = Very Easy, N2 = Easy, N3 = Plain, N4 = More Complex): File internal identification, providing its iRead4Skills classification.
· Text: The content of the file, i.e. the text itself.
· Annotated Level: Level assigned by the annotator (trainer).
· Proficiency SubLevel (Likert Scale - 1 to 5): SubLevel assigned by the annotator (trainer) for FR data.
· Corresponding CEFR Level: CEFR level closest to the iRead4Skills level.
· Additional Info: Observations made by the trainers/students.
· Annotated Term: Word or set of words selected for annotation.
· Term Label: Annotation assigned to the Annotated Term (difficult word, word order, etc.).
· Term Index: Position of the annotated term in the text.
· Annotator's Proficiency Level: Level of AL/VET of the student.
· Text adequate for user: Validation of the text by the students.

The content of the column "File Name" is color-coded: a green shade indicates a text with a lower level of complexity and a red shade indicates one with a higher level of complexity.

The complete datasets are available under a Creative Commons CC BY-NC-ND 4.0 licence.
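The counts and column layout described above can be sanity-checked with a short script. The following is only a minimal sketch in standard-library Python, not the project's own tooling. It rests on several assumptions that the description does not confirm: that a "File Name" value carries its level code (N1 to N4) as a separate token, that "Annotated Level" stores the level label as plain text, and that rows have already been read from the .xlsx files into dictionaries. The sample rows and the helper names `level_from_file_name` and `agreement_rate` are hypothetical.

```python
import re
from itertools import product

# Level codes as listed in the "File Name" column legend.
LEVELS = {"N1": "Very Easy", "N2": "Easy", "N3": "Plain", "N4": "More Complex"}

# Texts per level and per language, as stated in the description.
TEXTS_PER_LEVEL = {"N1": 140, "N2": 140, "N3": 140, "N4": 42}

# One file per (language, group, file type) combination.
LANGUAGES = ["FR", "PT", "SP"]
GROUPS = ["trainers", "students"]
FILE_TYPES = ["annotation", "classification"]


def level_from_file_name(file_name):
    """Return the level label encoded in a file name, or None.

    Assumption: the name contains one of N1..N4 as a standalone token.
    """
    match = re.search(r"(?<![A-Za-z0-9])N([1-4])(?![0-9])", file_name)
    return LEVELS["N" + match.group(1)] if match else None


def agreement_rate(rows):
    """Share of rows whose annotated level equals the file-name level."""
    rows = list(rows)
    if not rows:
        return 0.0
    hits = sum(
        1
        for row in rows
        if level_from_file_name(row["File Name"]) == row["Annotated Level"]
    )
    return hits / len(rows)


# Hypothetical rows for illustration only; not taken from the dataset.
SAMPLE_ROWS = [
    {"File Name": "N1_text_001", "Annotated Level": "Very Easy"},
    {"File Name": "N2_text_014", "Annotated Level": "Plain"},
    {"File Name": "N4_text_003", "Annotated Level": "More Complex"},
    {"File Name": "N3_text_120", "Annotated Level": "Plain"},
]

total_texts = sum(TEXTS_PER_LEVEL.values())
all_files = list(product(LANGUAGES, GROUPS, FILE_TYPES))

print("texts per language:", total_texts)
print("files in total:", len(all_files))
print("agreement on sample rows:", agreement_rate(SAMPLE_ROWS))
```

With real data, the rows could be produced by any spreadsheet reader that yields one dictionary per row keyed by column name. The comparison in `agreement_rate` would need adjusting if the annotated level turns out to be stored as a code rather than a label.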

Created:

2025-01-15