bookbot/id_word2phoneme
Hugging Face | Updated 2023-03-20 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/bookbot/id_word2phoneme
Resource description:
---
annotations_creators:
- no-annotation
language_creators:
- found
language:
- id
- ms
source_datasets:
- original
task_categories:
- text2text-generation
task_ids: []
pretty_name: ID Word2Phoneme
---
# Dataset Card for ID Word2Phoneme
## Table of Contents
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Additional Information](#additional-information)
- [Citation Information](#citation-information)
## Dataset Description
- **Homepage:** [GitHub](https://github.com/open-dict-data/ipa-dict/blob/master/data/ma.txt)
- **Repository:** [GitHub](https://github.com/open-dict-data/ipa-dict/blob/master/data/ma.txt)
- **Point of Contact:**
- **Size of downloaded dataset files:**
- **Size of the generated dataset:**
- **Total amount of disk used:**
### Dataset Summary
This dataset is derived from the [Malay/Indonesian Lexicon](https://github.com/open-dict-data/ipa-dict/blob/master/data/ma.txt) in [ipa-dict](https://github.com/open-dict-data/ipa-dict). We removed accented letters from the words (Indonesian graphemes do not use accents), separated homographs, and removed backslashes from the phonemes, resulting in a word-to-phoneme dataset.
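The cleaning steps described above can be illustrated with a minimal sketch. This is not the script used to build the dataset. The function names `strip_accents` and `clean_entry` are hypothetical, and the input layout (a word followed by one or more comma-separated, slash-delimited pronunciations) is an assumption about the raw lexicon. Treating "separated homographs" as "one row per pronunciation" is likewise only one plausible reading.

```python
from __future__ import annotations

import unicodedata


def strip_accents(text: str) -> str:
    """Drop combining marks, so an accented letter such as 'é' becomes 'e'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_entry(word: str, raw_phonemes: str) -> list[tuple[str, str]]:
    """Turn one raw lexicon entry into (word, phoneme) rows.

    Assumption: `raw_phonemes` holds one or more pronunciations separated
    by commas, each wrapped in slash delimiters.
    """
    word = strip_accents(word.strip())
    rows = []
    for variant in raw_phonemes.split(","):
        # Remove surrounding whitespace, then any leading/trailing slashes
        # or backslashes around the IPA string.
        phoneme = variant.strip().strip("/\\")
        if phoneme:
            rows.append((word, phoneme))
    return rows
```

Only the word is accent-stripped here. The phoneme string is left untouched apart from its delimiters, so IPA symbols such as `ʔ` are preserved.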
### Languages
- Indonesian
- Malay
## Dataset Structure
### Data Instances
| word | phoneme |
| ----- | ------- |
| aba | aba |
| ab | ab |
| ab’ad | abʔad |
| abad | abad |
| abadi | abadi |
| ... | ... |
### Data Fields
- `word`: Word (grapheme) as a string.
- `phoneme`: Phoneme (IPA) as a string.
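As a rough illustration of how the two fields might be consumed, the sketch below models a row as a small record and builds a word-to-phonemes lookup. The names `Entry`, `SAMPLE`, and `build_lookup` are hypothetical. The sample rows are copied from the Data Instances table. A list is used for the values on the assumption that a word may appear in more than one row (for example, homographs).

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable, NamedTuple


class Entry(NamedTuple):
    word: str      # grapheme form
    phoneme: str   # IPA transcription


# Rows copied from the Data Instances table.
SAMPLE = [
    Entry("aba", "aba"),
    Entry("ab", "ab"),
    Entry("ab’ad", "abʔad"),
    Entry("abad", "abad"),
    Entry("abadi", "abadi"),
]


def build_lookup(entries: Iterable[Entry]) -> dict[str, list[str]]:
    """Group phonemes by word, keeping every pronunciation seen for a word."""
    lookup: dict[str, list[str]] = defaultdict(list)
    for entry in entries:
        lookup[entry.word].append(entry.phoneme)
    return dict(lookup)


LOOKUP = build_lookup(SAMPLE)
```

With this sample, `LOOKUP["ab’ad"]` holds the single transcription `abʔad`.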
### Data Splits
| train |
| ----- |
| 27553 |
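A minimal loading sketch, assuming the dataset is reachable on the Hugging Face Hub under the ID shown in the download link and that the `datasets` library is installed. The helper names `load_word2phoneme` and `check_size` are hypothetical. The expected row count is taken from the table above.

```python
DATASET_ID = "bookbot/id_word2phoneme"  # assumed Hub ID, taken from the download link
EXPECTED_TRAIN_ROWS = 27553             # size of the train split per the table above


def load_word2phoneme():
    """Load the single `train` split (needs network access on first call)."""
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split="train")


def check_size(num_rows: int) -> bool:
    """Return True when a row count matches the documented split size."""
    return num_rows == EXPECTED_TRAIN_ROWS
```

After `train = load_word2phoneme()`, calling `check_size(train.num_rows)` is a quick way to see whether the hosted copy still matches the documented size.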
## Additional Information
### Citation Information
```
@misc{open-dict-data-no-date,
author = {{Open-Dict-Data}},
title = {{GitHub - open-dict-data/ipa-dict: Monolingual wordlists with pronunciation information in IPA}},
url = {https://github.com/open-dict-data/ipa-dict},
}
```
Provider:
bookbot
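Summary of Original Information

Dataset Overview

Dataset Description

Dataset Summary
- Source: The original dataset is the Malay/Indonesian Lexicon from ipa-dict.
- Processing: Accented characters were removed (Indonesian letters do not use accents), homographs were separated, and backslashes were removed from the phonemes, yielding a word-to-phoneme dataset.

Languages
- Indonesian
- Malay

Dataset Structure

Data Instances
- Structure: Each instance contains two fields, `word` and `phoneme`.
- Example:

| word | phoneme |
| ----- | ------- |
| aba | aba |
| ab | ab |
| ab’ad | abʔad |
| abad | abad |
| abadi | abadi |
| ... | ... |

Data Fields
- `word`: Word (grapheme) as a string.
- `phoneme`: Phoneme (IPA) as a string.

Data Splits
- Train: 27553 instances.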



