MayaGalvez/linguistic_representation_mBERT
Source: Hugging Face | Updated: 2023-01-26 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/MayaGalvez/linguistic_representation_mBERT
Description:
This dataset provides genealogical and typological information for the 104 languages used to pre-train the multilingual BERT language model (Devlin et al., 2019).
The genealogical information covers the language family and the genus for each language.
The typological description of the pre-training languages draws on 36 features from WALS (Dryer & Haspelmath, 2013).
The information provided here can be used, among other things, to investigate how the pre-training corpus is structured from a genealogical and typological perspective and to what extent, if any, this structure is related to the performance of the language model.
In addition to the table of linguistic features, a PDF file is provided that lists all the grammars and descriptive materials used to compile the linguistic information.
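As a rough illustration of the genealogical analysis described above, the following sketch computes the share of languages per family. The column names (`language`, `family`, `genus`) and the in-memory rows are assumptions made for this example; the card does not state the actual file layout, and the five rows are a toy sample, not an excerpt from the dataset.

```python
from collections import Counter

# Toy rows for illustration only. The keys "language", "family" and "genus"
# are hypothetical; check the real table for its actual column names.
rows = [
    {"language": "English", "family": "Indo-European", "genus": "Germanic"},
    {"language": "German", "family": "Indo-European", "genus": "Germanic"},
    {"language": "Spanish", "family": "Indo-European", "genus": "Romance"},
    {"language": "Finnish", "family": "Uralic", "genus": "Finnic"},
    {"language": "Hungarian", "family": "Uralic", "genus": "Ugric"},
]


def share_by(rows, key):
    """Return the fraction of rows for each distinct value of `key`."""
    counts = Counter(row[key] for row in rows)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.most_common()}


family_share = share_by(rows, "family")
genus_share = share_by(rows, "genus")

for family, share in family_share.items():
    print(f"{family}: {share:.0%}")
```

Run on the full 104-language table, the same grouping would show how strongly the pre-training languages are concentrated in a few families or genera, which could then be compared against per-language model performance.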
Provider:
MayaGalvez
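Summary of original information
Dataset overview
Dataset contents
- Number of languages: covers 104 languages.
- Types of information:
  - Genealogical information: the language family and genus of each language.
  - Typological description: 36 features from WALS (Dryer & Haspelmath, 2013) describe the typological properties of the pre-training languages.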
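One way the typological features could be put to use is to measure how far apart two languages are. The sketch below computes a normalized Hamming distance over the features coded for both languages; it assumes that uncoded features appear as `None`, since WALS does not code every feature for every language. The feature names and values are placeholders, not real WALS codes, and `typological_distance` is a hypothetical helper, not part of the dataset.

```python
def typological_distance(a, b):
    """Fraction of differing values over features coded for both languages.

    `a` and `b` map feature names to categorical values; None marks a
    feature that is not coded. Returns None if no feature is shared.
    """
    shared = [
        feature
        for feature in a
        if feature in b and a[feature] is not None and b[feature] is not None
    ]
    if not shared:
        return None
    differing = sum(1 for feature in shared if a[feature] != b[feature])
    return differing / len(shared)


# Placeholder feature vectors (hypothetical names and values).
lang_x = {"feature_1": "A", "feature_2": "B", "feature_3": None}
lang_y = {"feature_1": "A", "feature_2": "C", "feature_3": "D"}

# feature_3 is skipped because lang_x has no value for it;
# of the two remaining features, one differs.
distance = typological_distance(lang_x, lang_y)
print(distance)
```

Restricting the comparison to shared features keeps missing values from counting as disagreements, at the cost of making distances between sparsely coded languages less reliable.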
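Dataset uses
- Studying how the pre-training corpus is structured from a genealogical and typological perspective, and how that structure relates to the performance of the language model.
Additional files
- A PDF file lists the grammars and descriptive materials used to compile the linguistic information.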



