zouharvi/pwesuite-eval

Name: zouharvi/pwesuite-eval
Creator: zouharvi
Published: 2024-07-21 11:40:36
License: Apache-2.0

Source: Hugging Face. Updated: 2024-07-21. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/zouharvi/pwesuite-eval

Overview:

PWESuite Evaluation v1 is a multilingual dataset for evaluating phonetic word embeddings. It covers English, Amharic, Bengali, and several other languages. Each record holds the word's orthographic form (token_ort), its International Phonetic Alphabet transcription (token_ipa), its ARPABET transcription (token_arp), a language code (lang), and the purpose the record serves (purpose). The dataset card lists the size category as 100K<n<1M, although the single train split is reported to contain 1,738,008 samples. The dataset was created to allow a fair evaluation of past, present, and future phonetic word embedding methods. It supports intrinsic evaluations (word retrieval and correlation with sound similarity) and extrinsic tasks (rhyme detection, cognate detection, and sound analogies). Use of the dataset requires citing the associated paper.

Provider:

zouharvi

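Summary of original information

Dataset overview

Basic information

Name: PWESuite Evaluation v1
Languages: English, Amharic, Bengali, Swahili, Uzbek, Spanish, Polish, French, German
Category: multilingual
Tags: words, word embeddings, phonetics, cognates, rhyme, analogy
Size: 100K<n<1M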

Dataset structure

Features:
- token_ort: string
- token_ipa: string
- token_arp: string
- lang: string
- purpose: string
Splits:
- train: 1,738,008 samples
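
The structure above can be illustrated with a minimal Python sketch that checks rows against the five listed string fields and counts them per language and purpose. The sample rows, the language codes "en" and "fr", and the purpose value "demo" are made up for illustration and are not taken from the dataset; the helper names are hypothetical as well.

```python
from collections import Counter

# Column names as listed in the dataset structure; all five are strings.
FIELDS = ("token_ort", "token_ipa", "token_arp", "lang", "purpose")


def is_valid_row(row):
    """Return True if the row has exactly the five expected string fields."""
    return set(row) == set(FIELDS) and all(isinstance(row[f], str) for f in FIELDS)


def count_by_lang_and_purpose(rows):
    """Count valid rows per (lang, purpose) pair, skipping malformed rows."""
    return Counter((r["lang"], r["purpose"]) for r in rows if is_valid_row(r))


# Made-up illustration rows (assumption): not real dataset content.
sample_rows = [
    {"token_ort": "cat", "token_ipa": "kæt", "token_arp": "K AE1 T",
     "lang": "en", "purpose": "demo"},
    {"token_ort": "hat", "token_ipa": "hæt", "token_arp": "HH AE1 T",
     "lang": "en", "purpose": "demo"},
    # The empty ARPABET string is only a placeholder in this sketch.
    {"token_ort": "chat", "token_ipa": "ʃa", "token_arp": "",
     "lang": "fr", "purpose": "demo"},
    {"token_ort": "broken", "lang": "en"},  # missing fields, so it is skipped
]

counts = count_by_lang_and_purpose(sample_rows)
print(counts)
```

With the Hugging Face `datasets` library installed, rows obtained from `datasets.load_dataset("zouharvi/pwesuite-eval", split="train")` iterate as dictionaries, so they should be usable in place of `sample_rows`, assuming the published columns match the list above.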

License

License: Apache-2.0

Citation

Paper: PWESuite: Phonetic Word Embeddings and Tasks They Facilitate
Authors: Vilém Zouhar, Kalvin Chang, Chenxuan Cui, Nathaniel Carlson, Nathaniel Robinson, Mrinmaya Sachan, David Mortensen
Year of publication: 2023
