
Buddhist Chinese Word Embeddings

Source: NIAID Data Ecosystem (indexed 2026-03-13)
Download link:
https://zenodo.org/record/6782931
Description:
Buddhist Chinese word embeddings trained with FastText on the Buddhist texts in the Kanseki repository. Four models are provided (plus the full binary output for one of them), and they differ only in how the Kanseki texts were segmented into tokens. The "chinese_model" was segmented into 1-grams (individual characters). The "chinese_model_word" was segmented into words using a dictionary of Buddhist terms; phrases not found in the dictionary were segmented with the Classical Chinese word segmenter distributed with the stanza Python library (https://stanfordnlp.github.io/stanza/available_models.html). The "chinese_model_hybrid_char_term" was segmented with a glossary of Buddhist terms; phrases not found in the glossary were divided into 1-grams. The "chinese_model_hybrid_char_term_2" uses a more extensive glossary of Buddhist terms, and text not covered by the glossary is likewise divided into 1-grams.

All models are FastText models with the default 100 dimensions, created for this pilot study: Felbur, Rafal, Marieke Meelen & Paul Vierthaler (2022), 'Crosslinguistic Semantic Textual Similarity of Buddhist Chinese and Classical Tibetan' in Journal of Open Humanities Data. This research was done with generous funding from the Open Philology project, which ran from 2018 to 2022 and is funded by the European Research Council (ERC) under the Horizon 2020 program (Advanced Grant agreement No 741884). The project is based at the Leiden University Institute for Area Studies.
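The hybrid segmentation described above (glossary terms kept whole, everything else split into single characters) can be illustrated with a minimal sketch. The description does not say how glossary matches were found, so the greedy longest-match strategy here is an assumption, and the function name `segment_hybrid` and the three-entry toy glossary are hypothetical stand-ins for the authors' actual tooling and glossary.

```python
from typing import Iterable, List


def segment_hybrid(text: str, glossary: Iterable[str]) -> List[str]:
    """Split text into glossary terms where possible, else single characters.

    Assumption: greedy longest-match, scanning left to right. The original
    pipeline may have resolved overlapping terms differently.
    """
    terms = {t for t in glossary if t}
    max_len = max((len(t) for t in terms), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, down to two characters.
        for length in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + length]
            if candidate in terms:
                match = candidate
                break
        if match is None:
            # No glossary term starts here: fall back to a 1-gram.
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens


# Hypothetical toy glossary, for illustration only.
toy_glossary = ["般若波羅蜜多", "菩薩", "涅槃"]
print(segment_hybrid("菩薩依般若波羅蜜多故", toy_glossary))
# → ['菩薩', '依', '般若波羅蜜多', '故']
```

With an empty glossary the same function reduces to the pure 1-gram segmentation used for "chinese_model", which is why the hybrid models can be seen as the character model with selected multi-character terms protected from splitting.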
Created:
2022-06-30