BAREC-Shared-Task-2025-doc
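BAREC Shared Task 2025 Dataset Overview
Basic Dataset Information
- Name: BAREC (Balanced Arabic Readability Evaluation Corpus)
- License: MIT
- Task category: text classification
- Language: Arabic (Modern Standard Arabic)
- Tags: readability assessment
- Size: 1K<n<10K
- Alias: BAREC 2025: Readability Assessment Shared Task
Dataset Summary
- Purpose: used for the BAREC Shared Task 2025, which focuses on fine-grained Arabic readability assessment
- Data volume: more than 1 million words
- Annotation granularity: 19 readability levels, which also map to coarser 7-, 5-, and 3-level schemes
- Annotation level: sentences are annotated individually; a document's readability score is the 19-level label of its most difficult sentence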
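The document-level rule in the summary (a document takes the level of its hardest sentence) can be sketched as a maximum over sentence labels. This is a minimal illustration, not the organizers' code: the function name `document_levels` and the sample rows are hypothetical, and it assumes sentence-level rows carry the `Document` and `Readability_Level_19` fields.

```python
from collections import defaultdict

def document_levels(sentence_rows):
    """Map each document name to the highest 19-level label among its sentences."""
    levels = defaultdict(int)
    for row in sentence_rows:
        doc = row["Document"]
        levels[doc] = max(levels[doc], row["Readability_Level_19"])
    return dict(levels)

# Hypothetical sentence-level rows; only the two fields used here are shown.
rows = [
    {"Document": "doc_a.txt", "Readability_Level_19": 8},
    {"Document": "doc_a.txt", "Readability_Level_19": 12},
    {"Document": "doc_b.txt", "Readability_Level_19": 3},
]
doc_levels = document_levels(rows)
```

With these made-up rows, `doc_a.txt` receives level 12 because its hardest sentence is labeled 12.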
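Supported Tasks and Leaderboards
- Task type: multi-class readability classification
- Classification schemes:
  - 19 levels (default)
  - 7 levels
  - 5 levels
  - 3 levels
- Shared task details: see the Shared Task Website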
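Choosing a scheme amounts to choosing which label column to train on. The helper below is a small sketch under that assumption; `label_field` and `class_index` are hypothetical names, and the zero-based shift is only a common classifier convention, not something the dataset prescribes.

```python
SCHEMES = (19, 7, 5, 3)

def label_field(scheme=19):
    """Return the column name holding the labels for the given scheme."""
    if scheme not in SCHEMES:
        raise ValueError(f"unknown scheme {scheme}; expected one of {SCHEMES}")
    return f"Readability_Level_{scheme}"

def class_index(row, scheme=19):
    """Convert a 1-based readability level to a 0-based class index."""
    return row[label_field(scheme)] - 1

# Label values copied from the dataset's example instance.
row = {
    "Readability_Level_19": 8,
    "Readability_Level_7": 3,
    "Readability_Level_5": 2,
    "Readability_Level_3": 1,
}
fine_class = class_index(row)        # 19-level scheme (default)
coarse_class = class_index(row, 3)   # 3-level scheme
```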
Dataset Structure
Example Data Instance
{
  "ID": 1010219,
  "Document": "BAREC_Majed_1481_2007_038.txt",
  "Sentences": "موزة الحبوبة وشقيقها رشود آيس كريم بالكريمة.. أم كريمة بالآيس كريم؟!",
  "Sentence_Count": 3,
  "Word_Count": 15,
  "Readability_Level": "8-Ha",
  "Readability_Level_19": 8,
  "Readability_Level_7": 3,
  "Readability_Level_5": 2,
  "Readability_Level_3": 1,
  "Source": "Majed",
  "Book": "Edition: 1481",
  "Author": "#",
  "Domain": "Arts & Humanities",
  "Text_Class": "Foundational"
}
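As a rough illustration of reading such an instance, the sketch below parses a subset of the fields shown above and splits the `Readability_Level` string into its number and letter parts. The helper `split_level` is hypothetical, and the assumption that the numeric prefix always equals `Readability_Level_19` is inferred from this single example.

```python
import json

# A subset of the fields from the example instance, as a JSON string.
raw = """
{
  "ID": 1010219,
  "Document": "BAREC_Majed_1481_2007_038.txt",
  "Readability_Level": "8-Ha",
  "Readability_Level_19": 8,
  "Source": "Majed"
}
"""
record = json.loads(raw)

def split_level(label):
    """Split a label such as '8-Ha' into (8, 'Ha')."""
    number, _, letter = label.partition("-")
    return int(number), letter

level_number, level_letter = split_level(record["Readability_Level"])

# Sanity check (assumption): the numeric prefix matches the 19-level column.
if level_number != record["Readability_Level_19"]:
    raise ValueError("Readability_Level and Readability_Level_19 disagree")
```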
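Data Fields
- ID: unique document identifier
- Document: document file name
- Sentences: full text of the document
- Sentence_Count: number of sentences
- Word_Count: total number of words
- Readability_Level: readability level in the 19-level scheme (1-alif to 19-qaf)
- Readability_Level_19: readability level in the 19-level scheme (1 to 19)
- Readability_Level_7: readability level in the 7-level scheme (1 to 7)
- Readability_Level_5: readability level in the 5-level scheme (1 to 5)
- Readability_Level_3: readability level in the 3-level scheme (1 to 3)
- Source: document source
- Book: book title
- Author: author name
- Domain: domain (Arts & Humanities, STEM, or Social Sciences)
- Text_Class: readership group (Foundational, Advanced, or Specialized)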
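Because every row carries all four label columns, the mapping from the 19-level scheme to a coarser one can be recovered from the data rather than hard-coded. The sketch below assumes each coarse label is a function of the 19-level label; `derive_mapping` is a hypothetical helper, and the single row uses the label values from the example instance.

```python
def derive_mapping(rows, coarse_scheme):
    """Build a {19-level: coarse-level} lookup and fail on contradictions."""
    field = f"Readability_Level_{coarse_scheme}"
    mapping = {}
    for row in rows:
        fine = row["Readability_Level_19"]
        coarse = row[field]
        # setdefault stores the first value seen; later rows must agree with it.
        if mapping.setdefault(fine, coarse) != coarse:
            raise ValueError(f"level {fine} maps to more than one {field} value")
    return mapping

# Label values copied from the dataset's example instance.
rows = [
    {
        "Readability_Level_19": 8,
        "Readability_Level_7": 3,
        "Readability_Level_5": 2,
        "Readability_Level_3": 1,
    }
]
to_7 = derive_mapping(rows, 7)
to_3 = derive_mapping(rows, 3)
```

Run over the full corpus, the same function would yield the complete lookup tables, provided the assumption of a consistent mapping holds.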
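Data Splits
- Training set: 80%
- Development set: 10%
- Test set: 10%
- Split level: document
- Balance: the splits are balanced across readability levels, domains, and text classes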
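To show what splitting at the document level means, the sketch below assigns each document name to a split by hashing it, so all sentences of one document land in the same split. This is only an illustration: it does not reproduce the official split or its balancing across levels, domains, and text classes, and `assign_split` is a hypothetical helper. For shared task experiments, the released splits should be used.

```python
import hashlib

def assign_split(document_name, train=0.8, dev=0.1):
    """Deterministically map a document name to 'train', 'dev', or 'test'."""
    digest = hashlib.sha256(document_name.encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < train:
        return "train"
    if bucket < train + dev:
        return "dev"
    return "test"

# Hypothetical document names; every sentence of a document shares its split.
documents = ["doc_a.txt", "doc_b.txt", "doc_c.txt"]
splits = {name: assign_split(name) for name in documents}
```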
Evaluation Metrics
- Accuracy: Acc<sup>19</sup>, Acc<sup>7</sup>, Acc<sup>5</sup>, Acc<sup>3</sup>
- Adjacent accuracy: ±1 Acc<sup>19</sup>
- Average distance: Dist (Mean Absolute Error)
- Quadratic Weighted Kappa: QWK
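The four metrics can be sketched in plain Python using their standard definitions. This is not the official scorer: the function names are hypothetical, labels are assumed to be integers from 1 to `num_levels`, and returning 1.0 when all labels fall in one shared class is a convention chosen here for the otherwise undefined kappa.

```python
def accuracy(gold, pred):
    """Fraction of exact matches."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def adjacent_accuracy(gold, pred, tolerance=1):
    """Fraction of predictions within `tolerance` levels of the gold label."""
    return sum(abs(g - p) <= tolerance for g, p in zip(gold, pred)) / len(gold)

def mean_distance(gold, pred):
    """Mean absolute error between gold and predicted levels."""
    return sum(abs(g - p) for g, p in zip(gold, pred)) / len(gold)

def quadratic_weighted_kappa(gold, pred, num_levels=19):
    """Cohen's kappa with quadratic disagreement weights."""
    n = len(gold)
    observed = [[0] * num_levels for _ in range(num_levels)]
    for g, p in zip(gold, pred):
        observed[g - 1][p - 1] += 1
    gold_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(num_levels))
                 for j in range(num_levels)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            weight = (i - j) ** 2 / (num_levels - 1) ** 2
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * gold_hist[i] * pred_hist[j] / n
    if weighted_expected == 0:
        return 1.0  # all labels in one shared class; convention chosen here
    return 1.0 - weighted_observed / weighted_expected

# Made-up labels for illustration only.
gold = [1, 5, 10, 19]
pred = [1, 6, 12, 19]
scores = {
    "acc": accuracy(gold, pred),
    "adj_acc": adjacent_accuracy(gold, pred),
    "dist": mean_distance(gold, pred),
    "qwk": quadratic_weighted_kappa(gold, pred),
}
```

Accuracy under the 7-, 5-, or 3-level schemes follows by mapping both label lists to the coarser scheme before calling `accuracy`.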
Citation
@inproceedings{elmadani-etal-2025-readability,
  title = "A Large and Balanced Corpus for Fine-grained Arabic Readability Assessment",
  author = "Elmadani, Khalid N. and Habash, Nizar and Taha-Thomure, Hanada",
  booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
  year = "2025",
  address = "Vienna, Austria",
  publisher = "Association for Computational Linguistics"
}

@inproceedings{habash-etal-2025-guidelines,
  title = "Guidelines for Fine-grained Sentence-level Arabic Readability Annotation",
  author = "Habash, Nizar and Taha-Thomure, Hanada and Elmadani, Khalid N. and Zeino, Zeina and Abushmaes, Abdallah",
  booktitle = "Proceedings of the 19th Linguistic Annotation Workshop (LAW-XIX)",
  year = "2025",
  address = "Vienna, Austria",
  publisher = "Association for Computational Linguistics"
}




