
PACTib - PArsed Corpus of Tibetan (11th-21st c.)

Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/12104249
Description:
The PArsed Corpus of Tibetan (PACTib) contains more than 5,000 historical Tibetan texts (more than 82 million words) spanning over ten centuries. The original texts come from the Buddhist Digital Resource Center (BDRC) and have been automatically enriched with linguistic annotation: segmentation (tokenisation), part-of-speech tags, and constituency parses.

Files in this deposit:
- a CSV file giving an overview of all texts, with metadata linking file IDs and date ranges
- segmented and POS-tagged txt files (produced with the ACTib segmenter and tagger)
- parsed txt files (produced with the ACTib parser, forthcoming)

Note that only the dated files are part of this collection.

More information about the corpus can be found in:
Meelen, M., & Roux, É. (2020). Meta-dating the PArsed Corpus of Tibetan (PACTib). In Proceedings of the 19th International Workshop on Treebanks and Linguistic Theories (pp. 31-42).
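
Because the overview CSV links file IDs to date ranges, one plausible first step is to select the texts that fall within a period of interest. The sketch below is only an illustration: the listing does not document the CSV layout, so the column names `file_id`, `date_from`, and `date_to` are hypothetical defaults and should be replaced with the headers actually found in the deposit. The sample rows are invented for the demonstration and are not PACTib data.

```python
import csv
import io


def texts_in_range(csv_file, start, end,
                   id_col="file_id", from_col="date_from", to_col="date_to"):
    """Return IDs of texts whose date range overlaps [start, end].

    The column names are assumptions, not the documented PACTib headers;
    pass the real ones once the overview CSV has been inspected.
    """
    selected = []
    for row in csv.DictReader(csv_file):
        try:
            lo, hi = int(row[from_col]), int(row[to_col])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows without usable numeric dates
        # Two closed ranges overlap when each starts before the other ends.
        if lo <= end and hi >= start:
            selected.append(row[id_col])
    return selected


# Invented sample rows, only to demonstrate the call.
sample = io.StringIO(
    "file_id,date_from,date_to\n"
    "demo_a,1000,1099\n"
    "demo_b,1400,1499\n"
    "demo_c,1450,1650\n"
)
print(texts_in_range(sample, 1400, 1500))  # ['demo_b', 'demo_c']
```

With the real file, the same function could be called on an opened file handle, for example `open(path, newline="", encoding="utf-8")`, where `path` points at the downloaded overview CSV.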
Created:
2024-06-18