iRead4Skills Dataset 2: annotated corpora by level of complexity for FR, PT and SP
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/12821881
Resource description:
The Dataset 2: annotated corpora by level of complexity for FR, PT and SP is a collection of texts categorized by complexity level and annotated for complexity features, presented in Excel format (.xlsx). These corpora were compiled and annotated under the scope of the project iRead4Skills – Intelligent Reading Improvement System for Fundamental and Transversal Skills Development, funded by the European Commission (grant number: 1010094837). The project aims to enhance reading skills within the adult population by creating an intelligent system that assesses text complexity and recommends suitable reading materials to adults with low literacy skills, contributing to reducing skills gaps and facilitating access to information and culture (https://iread4skills.com).
This dataset is the result of specifically devised classification and annotation tasks, in which selected texts were organized and distributed to trainers in Adult Learning (AL) and Vocational Education and Training (VET) centres, as well as to adult students in AL and VET centres. These tasks were conducted via the Qualtrics platform.
The Dataset 2: annotated corpora by level of complexity for FR, PT and SP is derived from the iRead4Skills Dataset 1: corpora by level of complexity for FR, PT and SP (https://doi.org/10.5281/zenodo.10055909), which comprises written texts of various genres and complexity levels. From this collection, a sample of texts was selected for classification and annotation. This classification and annotation task aimed to provide additional data and test sets for the complexity analysis systems for the three languages of the project: French, Portuguese, and Spanish. The sample texts in each of the language corpora were selected taking into account the diversity of topics/domains, genres, and the reading preferences of the target audience of the iRead4Skills project. The sample amounted to a total of 462 texts per language, which were divided by level of complexity, resulting in the following distribution:
· 140 Very Easy texts
· 140 Easy texts
· 140 Plain texts
· 42 More Complex texts.
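As a quick sanity check on the figures above, the per-level counts can be summed and expressed as shares of the per-language sample. This is a minimal sketch; the names LEVEL_COUNTS, total and shares are illustrative and not part of the dataset.

```python
# Per-language sample sizes by complexity level, as listed above.
LEVEL_COUNTS = {
    "Very Easy": 140,
    "Easy": 140,
    "Plain": 140,
    "More Complex": 42,
}

# Total number of texts sampled for one language.
total = sum(LEVEL_COUNTS.values())

# Share of each level within the per-language sample.
shares = {level: count / total for level, count in LEVEL_COUNTS.items()}

for level, share in shares.items():
    print(f"{level}: {LEVEL_COUNTS[level]} texts ({share:.1%})")
print(f"Total per language: {total}")
```

The sum reproduces the 462 texts per language stated in the description.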
Trainers and students were asked to classify the texts according to the complexity levels of the project, here informally defined as:
· Very Easy (everyone can understand the text or most of the text).
· Easy (a person with less than the 9th year of schooling can understand the text or most of the text).
· Plain (a person with the 9th year of schooling can understand the text the first time he/she reads it).
· More Complex (a person with the 9th year of schooling cannot understand the text the first time he/she reads it).
Annotators were also asked to mark the parts of the texts considered complex according to various types of features, at word level and at sentence level (e.g., word order, sentence composition, etc.). The full details regarding the students' and the trainers' tasks, the qualitative and quantitative description of the data, and the inter-annotator agreement are described here: https://zenodo.org/records/14653180
The results are presented here in Excel format. For each language and for each group (trainers and students), there are two files, an annotation file and a classification file, resulting in four files per language and twelve files in total.
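The file count follows from combining three languages, two groups and two file types. The sketch below enumerates those combinations; it yields (language, group, file type) tuples only, because the actual file names in the Zenodo record are not given here, and the constant and function names are assumptions made for illustration.

```python
from itertools import product

# Dimensions described above; the labels are illustrative, not real file names.
LANGUAGES = ("FR", "PT", "SP")
GROUPS = ("trainers", "students")
FILE_TYPES = ("annotation", "classification")


def expected_files():
    """Return one (language, group, file_type) tuple per expected file."""
    return list(product(LANGUAGES, GROUPS, FILE_TYPES))


files = expected_files()

# Number of expected files for each language.
per_language = {
    lang: sum(1 for f in files if f[0] == lang) for lang in LANGUAGES
}

print(len(files), per_language)
```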
In all files, the data is organized as a matrix, with each row representing an ‘answer’ from a particular participant, and the columns containing various details about that specific input, as shown below:
Column name | Data
Annotator's ID | The randomly generated ID code for each annotator, together with information on the dataset assigned to them.
Progress | Information on the completion of the task (for each text).
Duration (seconds) | Time used in the completion of the task (for each text).
File Name (N1 = Very Easy, N2 = Easy, N3 = Plain, N4 = More Complex) | File internal identification, providing its iRead4Skills classification.
Text | The content of the file, i.e. the text itself.
Annotated Level | Level assigned by the annotator (trainer).
Proficiency SubLevel (Likert Scale - 1 to 5) | SubLevel assigned by the annotator (trainer) for FR data.
Corresponding CEFR Level | CEFR level closest to the iRead4Skills
Additional Info | Observations made by the trainers/students
Annotated Term | Word or set of words selected for annotation
Term Label | Annotation assigned to the Annotated Term (difficult word, word order, etc.)
Term Index | Position of the annotated term in the text
Annotator's Proficiency Level | Level of AL/VET of the student
Text adequate for user | Validation of the text by the students
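One way the "File Name" and "Annotated Level" columns might be used together is to compare the original iRead4Skills level with the level an annotator assigned. The following is a minimal sketch under two assumptions that the description does not confirm: that the level code (N1 to N4) appears as a separate token in the file name, and that "Annotated Level" holds the level label as text. The function names and the sample rows are hypothetical; real rows would first have to be read from the .xlsx files.

```python
import re

# Level codes as given in the "File Name" column description.
LEVEL_BY_CODE = {
    "N1": "Very Easy",
    "N2": "Easy",
    "N3": "Plain",
    "N4": "More Complex",
}

# Assumption: the code is a standalone token, e.g. "PT_N1_001".
_CODE_PATTERN = re.compile(r"(?<![A-Za-z0-9])N[1-4](?![0-9])")


def level_from_file_name(file_name):
    """Return the level label encoded in a file name, or None if absent."""
    match = _CODE_PATTERN.search(file_name)
    return LEVEL_BY_CODE[match.group(0)] if match else None


def agreement_rate(rows):
    """Fraction of rows whose annotated level equals the file-name level.

    Rows without a recognisable level code are skipped.
    """
    compared = matched = 0
    for row in rows:
        original = level_from_file_name(row["File Name"])
        if original is None:
            continue
        compared += 1
        if row["Annotated Level"] == original:
            matched += 1
    return matched / compared if compared else None


# Hypothetical rows for illustration only; not taken from the dataset.
sample_rows = [
    {"File Name": "PT_N1_001", "Annotated Level": "Very Easy"},
    {"File Name": "PT_N3_014", "Annotated Level": "Easy"},
    {"File Name": "untitled", "Annotated Level": "Plain"},
]
print(agreement_rate(sample_rows))
```

If the real file names or level values follow a different convention, the pattern and the comparison would need to be adapted.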
The content of the column “File Name” is color-coded: a green shade indicates a text with a lower level of complexity, and a red shade indicates one with a higher level of complexity.
The complete datasets are available under the Creative Commons CC BY-NC-ND 4.0 license.
Created:
2025-01-15



