Buddhist Chinese Word Embeddings
Source: NIAID Data Ecosystem (indexed 2026-03-13)
Download link:
https://zenodo.org/record/6782931
Description:
Buddhist Chinese word embeddings trained with FastText on the Buddhist texts present in the Kanseki repository.
Four models are provided (along with the full binary output for one), each differing in how the Kanseki repository was segmented into tokens. The "chinese_model" was segmented into 1-grams (individual characters). The "chinese_model_word" was segmented into words using a dictionary of Buddhist terms; phrases not found in the dictionary were segmented with the Classical Chinese word segmenter distributed with the stanza Python library (https://stanfordnlp.github.io/stanza/available_models.html). The "chinese_model_hybrid_char_term" was segmented with a glossary of Buddhist terms; phrases not found in the glossary were divided into 1-grams. The "chinese_model_hybrid_char_term_2" uses a more extensive glossary of Buddhist terms, and sections not covered by it are divided into 1-grams.
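The hybrid segmentation described above (glossary terms kept whole, everything else split into single characters) can be sketched roughly as follows. This is only an illustration, not the authors' pipeline: the function name segment_hybrid, the greedy longest-match strategy, and the two-entry glossary are all assumptions, since the description does not say how glossary matches were resolved.

```python
def segment_hybrid(text, glossary):
    """Greedy longest-match segmentation with a 1-gram fallback.

    Assumption: overlapping glossary matches are resolved by taking
    the longest term starting at the current position. The original
    description does not specify the matching strategy.
    """
    max_len = max((len(term) for term in glossary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, down to two characters.
        for length in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + length]
            if candidate in glossary:
                tokens.append(candidate)
                i += length
                break
        else:
            # No glossary term starts here: emit a single character.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical two-entry glossary, for illustration only.
glossary = {"菩薩", "般若波羅蜜多"}
tokens = segment_hybrid("觀自在菩薩行深般若波羅蜜多時", glossary)
print(" ".join(tokens))
```

Joining the tokens with spaces matters because FastText splits its training input on whitespace, so a pre-segmented line like this is treated as one token per glossary term or character.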
All models are default 100-dimensional FastText models created for this pilot study: Felbur, Rafal, Marieke Meelen & Paul Vierthaler (2022), 'Crosslinguistic Semantic Textual Similarity of Buddhist Chinese and Classical Tibetan', in Journal of Open Humanities Data.
This research was done with generous funding from the Open Philology project. This project (running 2018–2022) is funded by the European Research Council (ERC) under the Horizon 2020 program (Advanced Grant agreement No 741884). It is based at the Leiden University Institute for Area Studies.
Created:
2022-06-30



