
somosnlp-hackathon-2022/readability-es-caes

Source: Hugging Face. Updated: 2023-04-13. Indexed: 2024-05-25.
Download link:
https://hf-mirror.com/datasets/somosnlp-hackathon-2022/readability-es-caes
Resource description:
---
annotations_creators:
- other
language_creators:
- other
language:
- es
license:
- cc-by-4.0
multilinguality:
- monolingual
size_categories:
- unknown
source_datasets:
- original
task_categories:
- text-classification
task_ids: []
pretty_name: readability-es-caes
tags:
- readability
---

# Dataset Card for [readability-es-caes]

## Dataset Description

### Dataset Summary

This dataset is a compilation of short articles from websites dedicated to learning Spanish as a second language. These articles have been compiled from the following sources:

- [CAES corpus](http://galvan.usc.es/caes/) (Martínez et al., 2019): the "Corpus de Aprendices del Español" is a collection of texts produced by Spanish L2 learners from Spanish learning centers and universities. These texts are produced by students of all levels (A1 to C1), with different backgrounds (11 native languages) and levels of experience.

### Languages

Spanish

## Dataset Structure

Texts are tokenized to create a paragraph-based dataset.

### Data Fields

The dataset is formatted as JSON Lines and includes the following fields:

- **Category:** when available, the level of the text according to the Common European Framework of Reference for Languages (CEFR).
- **Level:** standardized readability level: simple or complex.
- **Level-3:** standardized readability level: basic, intermediate or advanced.
- **Text:** original text formatted into sentences.

## Additional Information

### Licensing Information

https://creativecommons.org/licenses/by-nc-sa/4.0/

### Citation Information

Please cite this page to give credit to the authors :)

### Team

- [Laura Vásquez-Rodríguez](https://lmvasque.github.io/)
- [Pedro Cuenca](https://twitter.com/pcuenq)
- [Sergio Morales](https://www.fireblend.com/)
- [Fernando Alva-Manchego](https://feralvam.github.io/)
Provided by:
somosnlp-hackathon-2022
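Summary of Original Information

Dataset Overview

Dataset Description

Dataset Summary

This dataset is a compilation of short articles from websites dedicated to learning Spanish. The articles come mainly from the following sources:

  • CAES corpus (Martínez et al., 2019): the "Corpus de Aprendices del Español" is a collection of texts written by learners of Spanish as a second language. The texts were written by students from learning centers and universities, cover all levels (A1 to C1), and reflect different backgrounds (11 native languages) and levels of experience.

Languages

Spanish

Dataset Structure

Texts are tokenized to create a paragraph-based dataset.

Data Fields

The dataset is in JSON Lines format and includes the following fields:

  • Category: when available, the level of the text according to the Common European Framework of Reference for Languages (CEFR).
  • Level: standardized readability level: simple or complex.
  • Level-3: standardized readability level: basic, intermediate or advanced.
  • Text: original text, formatted into sentences.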
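
The field list above can be illustrated with a minimal Python sketch that parses JSON Lines records and tallies the three-way readability label. The sample records are invented for illustration, and the exact key spelling and casing (`Category`, `Level`, `Level-3`, `Text`) is assumed from the field names on this page; the published files may differ.

```python
import json
from collections import Counter

# Hypothetical sample in JSON Lines form (one JSON object per line).
# Keys follow the field names listed on this page; values are invented.
SAMPLE_JSONL = """\
{"Category": "A1", "Level": "simple", "Level-3": "basic", "Text": "Me llamo Ana. Vivo en Madrid."}
{"Category": "C1", "Level": "complex", "Level-3": "advanced", "Text": "Aunque el debate sigue abierto, conviene matizar algunas conclusiones."}
{"Category": null, "Level": "simple", "Level-3": "basic", "Text": "Tengo dos hermanos."}
"""


def parse_jsonl(raw):
    """Parse a JSON Lines string into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in raw.splitlines() if line.strip()]


records = parse_jsonl(SAMPLE_JSONL)

# Count how many texts fall under each three-way readability label.
level_counts = Counter(record["Level-3"] for record in records)

# Category is only present "when available", so handle a missing value.
for record in records:
    cefr = record["Category"] or "unknown"
    print(f"{cefr:>7} | {record['Level']:<7} | {record['Level-3']:<8} | {record['Text']}")
```

Reading a real file would follow the same pattern, iterating over the lines of the opened file instead of `SAMPLE_JSONL.splitlines()`.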

Additional Information

Licensing Information

This dataset is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
