
MayaGalvez/linguistic_representation_mBERT

Source: Hugging Face. Updated 2023-01-26. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/MayaGalvez/linguistic_representation_mBERT
Resource description:
This dataset provides genealogical and typological information for the 104 languages used to pre-train the multilingual BERT language model (Devlin et al., 2019). The genealogical information covers the language family and genus of each language. The typological description of the pre-training languages draws on 36 features from WALS (Dryer & Haspelmath, 2013). Among other things, the information can be used to investigate how the pre-training corpus is structured from a genealogical and typological perspective and to what extent, if any, this structure is related to the performance of the language model. In addition to the table of linguistic features, a PDF file lists all the grammars and descriptive language materials used to compile the linguistic information.
Provider:
MayaGalvez
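Summary of original information

Dataset overview

Dataset contents

  • Number of languages: covers 104 languages.
  • Types of information
    • Genealogical information: the language family and genus of each language.
    • Typological description: 36 features from WALS (Dryer & Haspelmath, 2013) describe the typological properties of the pre-training languages.

Dataset use

  • Investigating how the pre-training corpus is structured from a genealogical and typological perspective, and how this structure relates to the performance of the language model.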
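
As a rough illustration of the genealogical analysis described above, the sketch below computes the share of languages per family and per genus. The column names (`language`, `family`, `genus`) and the five rows are assumptions chosen for illustration; they are not taken from the dataset, whose actual schema should be checked before adapting this.

```python
from collections import Counter

# Illustrative rows only: the column names ("language", "family", "genus")
# are assumed for this sketch and are not the dataset's confirmed schema.
rows = [
    {"language": "English", "family": "Indo-European", "genus": "Germanic"},
    {"language": "German", "family": "Indo-European", "genus": "Germanic"},
    {"language": "Hindi", "family": "Indo-European", "genus": "Indic"},
    {"language": "Finnish", "family": "Uralic", "genus": "Finnic"},
    {"language": "Hungarian", "family": "Uralic", "genus": "Ugric"},
]


def share_by(rows, key):
    """Return the fraction of rows for each distinct value of `key`."""
    counts = Counter(row[key] for row in rows)
    total = sum(counts.values())
    # most_common() orders the values from most to least frequent.
    return {value: count / total for value, count in counts.most_common()}


family_share = share_by(rows, "family")
genus_share = share_by(rows, "genus")

for family, share in family_share.items():
    print(f"{family}: {share:.0%}")
```

Run over the full table, the same grouping would show how heavily the pre-training languages concentrate in a few families, which is one way to approach the structural question raised above.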

Additional files

  • A PDF file lists the grammars and descriptive language materials used to compile the linguistic information.