SEACrowd/ara_close

Name: SEACrowd/ara_close
Creator: SEACrowd
Published: 2024-06-24 13:25:26
License: Creative Commons Attribution 4.0 (cc-by-4.0)

Source: Hugging Face (updated 2024-06-24, indexed 2024-06-29)

Download link:

https://hf-mirror.com/datasets/SEACrowd/ara_close

Overview:

The dataset contribution of this study is a compilation of short fictional stories written in Bikol for readability assessment. The data was combined with other collected Philippine language corpora, such as Tagalog and Cebuano. The data from these languages are all distributed across the first three grade levels (L1, L2, L3) of the Philippine elementary school system. We sourced this dataset from Let's Read Asia (LRA), Bloom Library, the Department of Education, and Adarna House. The dataset supports the task of readability assessment.

Provider:

SEACrowd

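Summary of original information

Ara Close dataset overview

Dataset description

The Ara Close dataset is a collection of short fictional stories for readability assessment, written primarily in Bikol. It is combined with corpora from other Philippine languages, such as Tagalog and Cebuano, and the data for these languages is distributed across the first three grade levels (L1, L2, L3) of the Philippine elementary school system. The dataset was sourced from Let's Read Asia (LRA), Bloom Library, the Department of Education, and Adarna House.

Languages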

Bikol (bcl)
Cebuano (ceb)

Supported tasks

Readability Assessment

Dataset version

Source version: 1.0.0
SEACrowd version: 2024.06.20

Dataset license

Creative Commons Attribution 4.0 (cc-by-4.0)

Citation

If you use the Ara Close dataset in your work, please cite the following:

@inproceedings{imperial-kochmar-2023-automatic,
    title = "Automatic Readability Assessment for Closely Related Languages",
    author = "Imperial, Joseph Marvin and Kochmar, Ekaterina",
    editor = "Rogers, Anna and Boyd-Graber, Jordan and Okazaki, Naoaki",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.331",
    doi = "10.18653/v1/2023.findings-acl.331",
    pages = "5371--5386",
    abstract = "In recent years, the main focus of research on automatic readability assessment (ARA) has shifted towards using expensive deep learning-based methods with the primary goal of increasing models{'} accuracy. This, however, is rarely applicable for low-resource languages where traditional handcrafted features are still widely used due to the lack of existing NLP tools to extract deeper linguistic representations. In this work, we take a step back from the technical component and focus on how linguistic aspects such as mutual intelligibility or degree of language relatedness can improve ARA in a low-resource setting. We collect short stories written in three languages in the Philippines{---}Tagalog, Bikol, and Cebuano{---}to train readability assessment models and explore the interaction of data and features in various cross-lingual setups. Our results show that the inclusion of CrossNGO, a novel specialized feature exploiting n-gram overlap applied to languages with high mutual intelligibility, significantly improves the performance of ARA models compared to the use of off-the-shelf large multilingual language models alone. Consequently, when both linguistic representations are combined, we achieve state-of-the-art results for Tagalog and Cebuano, and baseline scores for ARA in Bikol.",
}

@article{lovenia2024seacrowd,
    title={SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages},
    author={Holy Lovenia and Rahmad Mahendra and Salsabil Maulana Akbar and Lester James V. Miranda and Jennifer Santoso and Elyanah Aco and Akhdan Fadhilah and Jonibek Mansurov and Joseph Marvin Imperial and Onno P. Kampman and Joel Ruben Antony Moniz and Muhammad Ravi Shulthan Habibi and Frederikus Hudi and Railey Montalan and Ryan Ignatius and Joanito Agili Lopo and William Nixon and Börje F. Karlsson and James Jaya and Ryandito Diandaru and Yuze Gao and Patrick Amadeus and Bin Wang and Jan Christian Blaise Cruz and Chenxi Whitehouse and Ivan Halim Parmonangan and Maria Khelli and Wenyu Zhang and Lucky Susanto and Reynard Adha Ryanda and Sonny Lazuardi Hermawan and Dan John Velasco and Muhammad Dehan Al Kautsar and Willy Fitra Hendria and Yasmin Moslem and Noah Flynn and Muhammad Farid Adilazuarda and Haochen Li and Johanes Lee and R. Damanhuri and Shuo Sun and Muhammad Reza Qorib and Amirbek Djanibekov and Wei Qi Leong and Quyet V. Do and Niklas Muennighoff and Tanrada Pansuwan and Ilham Firdausi Putra and Yan Xu and Ngee Chia Tai and Ayu Purwarianti and Sebastian Ruder and William Tjhi and Peerat Limkonchotiwat and Alham Fikri Aji and Sedrick Keh and Genta Indra Winata and Ruochen Zhang and Fajri Koto and Zheng-Xin Yong and Samuel Cahyawijaya},
    year={2024},
    eprint={2406.10118},
    journal={arXiv preprint arXiv: 2406.10118}
}
