Data for Finnish Dialect Detection
Source: NIAID Data Ecosystem (indexed 2026-03-13)
Download link:
https://zenodo.org/record/5330672
Description:
The data used in the paper "Finnish Dialect Identification: The Effect of Audio and Text".
If you use the data, please cite:
Hämäläinen, Mika; Alnajjar, Khalid; Partanen, Niko & Rueter, Jack (2021). Finnish Dialect Identification: The Effect of Audio and Text. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP).
The metadata.json file contains the dialectal and normalized transcriptions, the length of each wav file in milliseconds, the path to the wav file, the role of the speaker, and the dialect. The data.zip* files contain the wav files; they are split into partial zip files to make the Zenodo upload easier. Code_models.zip contains the code for training the bimodal model and the final trained model presented in the paper. There is also an OpenNMT model, which is the text-only model described in the paper. You can use it by running:
python3 -m onmt.bin.translate -model nmt-model_step_100000.pt -src source_test.txt -output pred.txt -replace_unk
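As a rough illustration of how the fields listed for metadata.json might be used, the sketch below sums the audio length per dialect. The description names the fields but not their JSON keys or the file's layout, so the key names "dialect" and "length" and the assumption that the file holds a list of per-utterance objects are hypothetical; adjust them after inspecting the actual file.

```python
import json
from collections import defaultdict
from pathlib import Path

# Hypothetical key names: the description lists the fields but not their JSON keys.
DIALECT_KEY = "dialect"
LENGTH_KEY = "length"


def load_records(path):
    """Read metadata.json, assuming it holds a list of per-utterance objects."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def total_ms_per_dialect(records, dialect_key=DIALECT_KEY, length_key=LENGTH_KEY):
    """Sum the audio length in milliseconds for each dialect label."""
    totals = defaultdict(int)
    for record in records:
        totals[record[dialect_key]] += record[length_key]
    return dict(totals)


if __name__ == "__main__":
    metadata_path = Path("metadata.json")
    # Only runs when the file has been downloaded next to this script.
    if metadata_path.exists():
        totals = total_ms_per_dialect(load_records(metadata_path))
        for dialect, total_ms in sorted(totals.items()):
            print(f"{dialect}: {total_ms / 60000:.1f} min")
```

If the real file is keyed differently (for example, a dictionary keyed by wav path), only load_records would need to change; the summing step works on any list of dictionaries carrying the two keys.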
Our dataset is licensed under CC BY-NC-ND 4.0. Academic use only.
The corpus is based on Suomenkielen näytteitä (CC BY Kotimaisten kielten keskus).
Created:
2021-11-06



