filbench/UD_Tagalog-NewsCrawl

Name: filbench/UD_Tagalog-NewsCrawl
Creator: filbench
Published: 2025-07-23 03:41:21
License: Not specified

Source: Hugging Face. Updated: 2025-07-23. Indexed: 2026-01-03.

Download link:

https://hf-mirror.com/datasets/filbench/UD_Tagalog-NewsCrawl

Description:

---
dataset_info:
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: tokens
    sequence: string
  - name: lemmas
    sequence: string
  - name: xpos_tags
    sequence: string
  - name: upos_tags
    sequence: string
  - name: feats
    sequence: string
  - name: heads
    sequence: int64
  - name: deprels
    sequence: string
  splits:
  - name: train
    num_bytes: 18262593
    num_examples: 12495
  - name: validation
    num_bytes: 2357509
    num_examples: 1561
  - name: test
    num_bytes: 2349279
    num_examples: 1563
  download_size: 5193712
  dataset_size: 22969381
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
task_categories:
- token-classification
language:
- tl
tags:
- parsing
pretty_name: UD_Tagalog-NewsCrawl
size_categories:
- 10K<n<100K
---

# UD_Tagalog-NewsCrawl

**Paper**: https://arxiv.org/abs/2505.20428

The Tagalog Universal Dependencies NewsCrawl dataset consists of annotated text extracted from the Leipzig Tagalog Corpus. Data included in the Leipzig Tagalog Corpus were crawled from Tagalog-language online news sites by the Leipzig University Institute for Computer Science. The text data was automatically parsed and annotated by [Angelina Aquino](https://researchers.cdu.edu.au/en/persons/angelina-aquino) (University of the Philippines), and then manually corrected according to the UD guidelines adapted for Tagalog by [Elsie Marie Or](https://www.researchgate.net/profile/Elsie-Or) (University of the Philippines), [Maria Bardají Farré](https://ifl.phil-fak.uni-koeln.de/en/general-linguistics/people/maria-bardaji-i-farre) (University of Cologne), and [Dr. Nikolaus Himmelmann](https://ifl.phil-fak.uni-koeln.de/en/prof-himmelmann) (University of Cologne). Further verification and automated corrections were done by [Lester James Miranda](https://ljvmiranda921.github.io) (Allen AI).
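
The per-token features declared in the metadata (`tokens`, `lemmas`, `upos_tags`, `xpos_tags`, `feats`, `heads`, `deprels`) line up with columns of the CoNLL-U format used by Universal Dependencies. The following is a minimal sketch of that mapping, not code from the dataset authors. It assumes that `heads` holds 1-based token indices with `0` marking the root, and that empty values are stored as `_`; neither assumption is confirmed by the card. The function name `record_to_conllu` and the sample record are hypothetical, and the sample annotations are illustrative rather than the treebank's actual analysis.

```python
from typing import Any, Dict, List


def record_to_conllu(record: Dict[str, Any]) -> str:
    """Render one record as a CoNLL-U sentence block (10 tab-separated columns)."""
    lines: List[str] = [
        f"# sent_id = {record['id']}",
        f"# text = {record['text']}",
    ]
    columns = zip(
        record["tokens"],
        record["lemmas"],
        record["upos_tags"],
        record["xpos_tags"],
        record["feats"],
        record["heads"],
        record["deprels"],
    )
    for index, (form, lemma, upos, xpos, feats, head, deprel) in enumerate(columns, start=1):
        # DEPS and MISC are not among the declared features, so they are left as "_".
        # Assumption: `head` is already a 1-based index with 0 for the root.
        fields = [str(index), form, lemma, upos, xpos, feats or "_", str(head), deprel, "_", "_"]
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"


# Hypothetical record invented for illustration; it is not taken from the dataset.
example = {
    "id": "example-0001",
    "text": "Kumain ang bata.",
    "tokens": ["Kumain", "ang", "bata", "."],
    "lemmas": ["kain", "ang", "bata", "."],
    "xpos_tags": ["_", "_", "_", "_"],
    "upos_tags": ["VERB", "ADP", "NOUN", "PUNCT"],
    "feats": ["_", "_", "_", "_"],
    "heads": [0, 3, 1, 1],
    "deprels": ["root", "case", "nsubj", "punct"],
}

conllu = record_to_conllu(example)
print(conllu)
```

If the stored `heads` turn out to use a different convention (for example 0-based indices), the `str(head)` conversion is the only place that would need to change.
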
Due to the source of the data, several typos, grammatical errors, incomplete sentences, and Tagalog-English code-mixing can be found in the dataset.

## Treebank structure

- Train: 12495 sents, 286891 tokens
- Dev: 1561 sents, 37045 tokens
- Test: 1563 sents, 36974 tokens

## Acknowledgments

Aside from the named persons in the previous section, the following also contributed to the project as manual annotators of the dataset:

- Patricia Anne Asuncion
- Paola Ellaine Luzon
- Jenard Tricano
- Mary Dianne Jamindang
- Michael Wilson Rosero
- Jim Bagano
- Yeddah Joy Piedad
- Farah Cunanan
- Calen Manzano
- Aien Gengania
- Prince Heinreich Omang
- Noah Cruz
- Leila Ysabelle Suarez
- Orlyn Joyce Esquivel
- Andre Magpantay

The annotation project was made possible by the Deutsche Forschungsgemeinschaft (DFG)-funded project titled "Information distribution and language structure - correlation of grammatical expressions of the noun/verb distinction and lexical information content in Tagalog, Indonesian and German." The DFG project team is composed of Dr. Nikolaus Himmelmann and Maria Bardají Farré from the University of Cologne, and Dr. Gerhard Heyer, Dr. Michael Richter, and Tariq Yousef from Leipzig University.

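
As a quick sanity check on the figures listed under Treebank structure, the split proportions and average sentence lengths can be derived directly from the counts. This is only an arithmetic sketch; the numbers are copied from the list above and nothing is read from the dataset itself.

```python
# Sentence and token counts copied from the "Treebank structure" list.
splits = {
    "train": (12495, 286891),
    "dev": (1561, 37045),
    "test": (1563, 36974),
}

total_sents = sum(sents for sents, _ in splits.values())
total_tokens = sum(tokens for _, tokens in splits.values())

for name, (sents, tokens) in splits.items():
    share = sents / total_sents
    avg_len = tokens / sents
    print(f"{name}: {share:.1%} of sentences, {avg_len:.1f} tokens per sentence")

print(f"total: {total_sents} sentences, {total_tokens} tokens")
```

The totals come to 15619 sentences and 360910 tokens, split roughly 80/10/10, which is consistent with the "15.6k trees" reported in the paper abstract.
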
## Citation

```
@inproceedings{aquino-etal-2025-ud,
    title = "The {UD}-{N}ews{C}rawl Treebank: Reflections and Challenges from a Large-scale {T}agalog Syntactic Annotation Project",
    author = "Aquino, Angelina Aspra and
      Miranda, Lester James Validad and
      Or, Elsie Marie T.",
    editor = "Che, Wanxiang and
      Nabende, Joyce and
      Shutova, Ekaterina and
      Pilehvar, Mohammad Taher",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.acl-long.357/",
    pages = "7219--7239",
    ISBN = "979-8-89176-251-0",
    abstract = "This paper presents UD-NewsCrawl, the largest Tagalog treebank to date, containing 15.6k trees manually annotated according to the Universal Dependencies framework. We detail our treebank development process, including data collection, pre-processing, manual annotation, and quality assurance procedures. We provide baseline evaluations using multiple transformer-based models to assess the performance of state-of-the-art dependency parsers on Tagalog. We also highlight challenges in the syntactic analysis of Tagalog given its distinctive grammatical properties, and discuss its implications for the annotation of this treebank. We anticipate that UD-NewsCrawl and our baseline model implementations will serve as valuable resources for advancing computational linguistics research in underrepresented languages like Tagalog."
}
```
