MayaGalvez/linguistic_representation_mBERT

Name: MayaGalvez/linguistic_representation_mBERT
Creator: MayaGalvez
Published: 2023-01-26 10:56:38
License: not specified

Source: Hugging Face. Updated: 2023-01-26. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/MayaGalvez/linguistic_representation_mBERT

Description:

This dataset provides genealogical and typological information for the 104 languages used to pre-train the language model multilingual BERT (Devlin et al., 2019). The genealogical information covers the language family and the genus of each language. The typological description of the pre-training languages draws on 36 features from WALS (Dryer & Haspelmath, 2013). Among other things, the data can be used to investigate how the pre-training corpus is structured from a genealogical and typological perspective and to what extent, if any, that structure is related to the performance of the language model. In addition to the table of linguistic features, a PDF file lists all the grammars and language-descriptive materials used to compile the linguistic information.
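As a rough illustration of the kind of genealogical analysis described above, the following minimal sketch computes the share of languages per family. The rows are a tiny hand-written stand-in for the feature table, and the column names (`language`, `family`, `genus`) are assumptions made for illustration, not the dataset's actual schema.

```python
from collections import Counter

# Toy rows standing in for the dataset's feature table. The column names
# ("language", "family", "genus") are assumptions, not the real schema.
rows = [
    {"language": "English", "family": "Indo-European", "genus": "Germanic"},
    {"language": "German", "family": "Indo-European", "genus": "Germanic"},
    {"language": "Hindi", "family": "Indo-European", "genus": "Indic"},
    {"language": "Finnish", "family": "Uralic", "genus": "Finnic"},
    {"language": "Turkish", "family": "Turkic", "genus": "Turkic"},
]


def family_shares(rows):
    """Return each family's share of the languages in `rows`."""
    counts = Counter(row["family"] for row in rows)
    total = sum(counts.values())
    return {family: n / total for family, n in counts.items()}


shares = family_shares(rows)
# Print families from most to least represented.
for family, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{family}: {share:.0%}")
```

Applied to the full 104-language table, the same grouping could be repeated on the genus column or on individual WALS features to see how evenly the pre-training languages cover each category.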

Provided by:

MayaGalvez

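Summary of original information

Dataset overview

Dataset contents

Number of languages: covers 104 languages.
Types of information:
- Genealogical information: the language family and genus of each language.
- Typological description: 36 features from WALS (Dryer & Haspelmath, 2013) describe the typological properties of the pre-training languages.

Dataset uses

Investigating how the pre-training corpus is structured from a genealogical and typological perspective, and how that structure relates to the performance of the language model.

Additional files

A PDF file lists the grammars and language-descriptive materials used to compile the linguistic information.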
