misyaz/BAREC-Corpus-v1.0

Source: Hugging Face. Updated 2026-04-29. Indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/misyaz/BAREC-Corpus-v1.0
Description:
BAREC (the Balanced Arabic Readability Evaluation Corpus) is a large-scale dataset for fine-grained Arabic readability assessment. It contains over 1M words, annotated at the sentence level with 19 readability levels, which are also mapped to coarser 7-, 5-, and 3-level schemes. It supports multi-class readability classification at any of these four granularities. Each data instance includes the sentence text, word count, lexical information, readability level (in the 19-, 7-, 5-, and 3-level schemes), annotator ID, document source, book, author, domain (Arts & Humanities, STEM, or Social Sciences), and text class (Foundational, Advanced, or Specialized). The dataset is split into train (80%), dev (10%), and test (10%) sets, balanced across readability levels, domains, and text classes.
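
As a rough illustration of the data instance described above, the sketch below models one sentence record in Python and picks the label that matches a chosen classification granularity. The class name, field names, and the assumption that labels run from 1 to N are hypothetical, since the description lists the fields only in prose; the example row uses made-up values and does not reflect the corpus's actual level mapping.

```python
from dataclasses import dataclass

# Hypothetical field names: the listing describes these fields in prose only,
# so the real column names in the dataset may differ.
@dataclass(frozen=True)
class BarecSentence:
    sentence: str
    word_count: int
    lexical_info: str
    level_19: int      # fine-grained label (assumed range 1..19)
    level_7: int       # coarser label (assumed range 1..7)
    level_5: int       # coarser label (assumed range 1..5)
    level_3: int       # coarsest label (assumed range 1..3)
    annotator_id: str
    document: str
    book: str
    author: str
    domain: str        # "Arts & Humanities", "STEM", or "Social Sciences"
    text_class: str    # "Foundational", "Advanced", or "Specialized"

# The four label schemes named in the description.
GRANULARITIES = (19, 7, 5, 3)


def label_for(row: BarecSentence, granularity: int) -> int:
    """Return the stored label for the requested scheme (19, 7, 5, or 3)."""
    if granularity not in GRANULARITIES:
        raise ValueError(f"granularity must be one of {GRANULARITIES}")
    label = getattr(row, f"level_{granularity}")
    # Sanity check under the assumed 1..N labeling convention.
    if not 1 <= label <= granularity:
        raise ValueError(f"label {label} is outside 1..{granularity}")
    return label


# Made-up example row; the values are illustrative, not taken from the corpus.
example = BarecSentence(
    sentence="مرحبا بالعالم",
    word_count=2,
    lexical_info="",
    level_19=3,
    level_7=1,
    level_5=1,
    level_3=1,
    annotator_id="A1",
    document="doc-0",
    book="book-0",
    author="author-0",
    domain="STEM",
    text_class="Foundational",
)

print(label_for(example, 19), label_for(example, 3))
```

Because every record carries all four labels, switching between the 19-, 7-, 5-, and 3-level tasks only changes which stored label is read; no remapping is needed at training time.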
Provider:
misyaz