somosnlp-hackathon-2022/readability-es-caes
Source: Hugging Face. Updated: 2023-04-13. Indexed: 2024-05-25.
Download link:
https://hf-mirror.com/datasets/somosnlp-hackathon-2022/readability-es-caes
Resource description:
---
annotations_creators:
- other
language_creators:
- other
language:
- es
license:
- cc-by-4.0
multilinguality:
- monolingual
size_categories:
- unknown
source_datasets:
- original
task_categories:
- text-classification
task_ids: []
pretty_name: readability-es-caes
tags:
- readability
---
# Dataset Card for [readability-es-caes]
## Dataset Description
### Dataset Summary
This dataset is a compilation of short articles from websites dedicated to learning Spanish as a second language. The articles were compiled from the following source:
- [CAES corpus](http://galvan.usc.es/caes/) (Martínez et al., 2019): the "Corpus de Aprendices del Español" is a collection of texts produced by Spanish L2 learners from Spanish learning centers and universities. These texts are produced by students of all levels (A1 to C1), with different backgrounds (11 native languages) and levels of experience.
### Languages
Spanish
## Dataset Structure
Texts are tokenized to create a paragraph-based dataset.
### Data Fields
The dataset is formatted as JSON Lines (one JSON object per line) and includes the following fields:
- **Category:** when available, this includes the level of this text according to the Common European Framework of Reference for Languages (CEFR).
- **Level:** standardized readability level: simple or complex.
- **Level-3:** standardized readability level: basic, intermediate or advanced.
- **Text:** original text formatted into sentences.
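
The field list above can be illustrated with a minimal sketch that parses JSON Lines records and counts the readability labels. The sample records are invented for illustration, and the key names (`Category`, `Level`, `Level-3`, `Text`) are assumed to match the field names listed above; the actual files may use different casing or label strings, so inspect a record before relying on them.

```python
import json
from collections import Counter

# Hypothetical records: the key names follow the field list in this card,
# and all values are made up for illustration only.
SAMPLE_JSONL = """\
{"Category": "A1", "Level": "simple", "Level-3": "basic", "Text": "Me llamo Ana. Vivo en Madrid."}
{"Category": "B1", "Level": "simple", "Level-3": "intermediate", "Text": "Ayer fui al mercado con mi hermana."}
{"Category": null, "Level": "complex", "Level-3": "advanced", "Text": "El informe analiza las consecuencias del cambio."}
"""


def parse_jsonl(raw):
    """Parse a JSON Lines string into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in raw.splitlines() if line.strip()]


records = parse_jsonl(SAMPLE_JSONL)

# Count how many texts fall into each two-way and three-way readability level.
level_counts = Counter(record["Level"] for record in records)
level3_counts = Counter(record["Level-3"] for record in records)

# Category is only present "when available", so allow for missing values.
with_cefr = [record for record in records if record.get("Category") is not None]

print(level_counts)
print(level3_counts)
print(len(with_cefr), "of", len(records), "records carry a CEFR category")
```

To read a real file instead of the inline sample, the same `parse_jsonl` helper can be applied to the contents of a local `.jsonl` file opened with `encoding="utf-8"`.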
## Additional Information
### Licensing Information
https://creativecommons.org/licenses/by-nc-sa/4.0/
### Citation Information
Please cite this page to give credit to the authors :)
### Team
- [Laura Vásquez-Rodríguez](https://lmvasque.github.io/)
- [Pedro Cuenca](https://twitter.com/pcuenq)
- [Sergio Morales](https://www.fireblend.com/)
- [Fernando Alva-Manchego](https://feralvam.github.io/)
Provided by:
somosnlp-hackathon-2022