
bookbot/id_word2phoneme

Hugging Face | Updated 2023-03-20 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/bookbot/id_word2phoneme
Resource description:
---
annotations_creators:
- no-annotation
language_creators:
- found
language:
- id
- ms
source_datasets:
- original
task_categories:
- text2text-generation
task_ids: []
pretty_name: ID Word2Phoneme
---

# Dataset Card for ID Word2Phoneme

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Additional Information](#additional-information)
  - [Citation Information](#citation-information)

## Dataset Description

- **Homepage:** [Github](https://github.com/open-dict-data/ipa-dict/blob/master/data/ma.txt)
- **Repository:** [Github](https://github.com/open-dict-data/ipa-dict/blob/master/data/ma.txt)
- **Point of Contact:**
- **Size of downloaded dataset files:**
- **Size of the generated dataset:**
- **Total amount of disk used:**

### Dataset Summary

This dataset was originally a [Malay/Indonesian Lexicon](https://github.com/open-dict-data/ipa-dict/blob/master/data/ma.txt) retrieved from [ipa-dict](https://github.com/open-dict-data/ipa-dict). We removed the accented letters (because Indonesian graphemes do not use accents), separated homographs, and removed backslashes in phonemes, resulting in a word-to-phoneme dataset.

### Languages

- Indonesian
- Malay

## Dataset Structure

### Data Instances

| word  | phoneme |
| ----- | ------- |
| aba   | aba     |
| ab    | ab      |
| ab’ad | abʔad   |
| abad  | abad    |
| abadi | abadi   |
| ...   | ...     |

### Data Fields

- `word`: Word (grapheme) as a string.
- `phoneme`: Phoneme (IPA) as a string.

### Data Splits

| train |
| ----- |
| 27553 |

## Additional Information

### Citation Information

```
@misc{open-dict-data-no-date,
  author = {{Open-Dict-Data}},
  title = {{GitHub - open-dict-data/ipa-dict: Monolingual wordlists with pronunciation information in IPA}},
  url = {https://github.com/open-dict-data/ipa-dict},
}
```
Provider:
bookbot
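Summary of Original Information

Dataset Overview

Dataset Description

Dataset Summary

  • Source: The original data is the Malay/Indonesian Lexicon from ipa-dict.
  • Processing: Accented letters were removed (Indonesian graphemes do not use accents), homographs were separated, and backslashes were removed from the phonemes, yielding a word-to-phoneme dataset.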
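The three processing steps above can be sketched roughly as follows. This is not the authors' actual script, which the card does not show. It assumes that alternative pronunciations of a homograph arrive joined by a separator in a single raw string, and that each pronunciation is wrapped in slash or backslash delimiters. The helper names `strip_accents` and `clean_entry` are hypothetical.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Drop combining marks from a word, e.g. 'é' -> 'e'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_entry(word: str, raw_phonemes: str, sep: str = ",") -> list[tuple[str, str]]:
    """Return one (word, phoneme) pair per pronunciation.

    Assumptions (not confirmed by the dataset card): alternative
    pronunciations are joined by `sep`, and each one is wrapped in
    '/' or '\\' delimiters that should be stripped.
    """
    # Only the grapheme side is de-accented; IPA diacritics are left alone.
    clean_word = strip_accents(word.strip())
    pairs = []
    for variant in raw_phonemes.split(sep):
        phoneme = variant.strip().strip("/\\")
        if phoneme:
            pairs.append((clean_word, phoneme))
    return pairs


print(clean_entry("abadé", "/abade/, /abadə/"))
```

Each pronunciation becomes its own row, so a word with several readings appears several times in the output, once per phoneme string.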

Languages

  • Indonesian
  • Malay

Dataset Structure

Data Instances

  • Structure: Each instance has two fields: word and phoneme.

  • Example:

    word phoneme
    aba aba
    ab ab
    ab’ad abʔad
    abad abad
    abadi abadi
    ... ...

Data Fields

  • word: Word (grapheme) as a string.
  • phoneme: Phoneme (IPA) as a string.

Data Splits

  • Train: 27553 instances.
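As a minimal sketch of how the single train split might be consumed, the snippet below indexes rows by `word`. It assumes the Hugging Face `datasets` package is installed and that the dataset is reachable under the ID `bookbot/id_word2phoneme`. The helpers `build_lookup` and `load_train_lookup` are hypothetical names. Values are lists on the assumption that separated homographs show up as multiple rows for the same word.

```python
from typing import Iterable, Mapping


def build_lookup(rows: Iterable[Mapping[str, str]]) -> dict[str, list[str]]:
    """Map each word to all of its phoneme strings, preserving row order."""
    lookup: dict[str, list[str]] = {}
    for row in rows:
        lookup.setdefault(row["word"], []).append(row["phoneme"])
    return lookup


def load_train_lookup() -> dict[str, list[str]]:
    """Download the train split from the Hub and index it by word."""
    # Third-party dependency, imported lazily so the offline demo below
    # runs without it.
    from datasets import load_dataset

    train = load_dataset("bookbot/id_word2phoneme", split="train")
    return build_lookup(train)


# Offline usage with rows copied from the Data Instances example.
sample = [
    {"word": "aba", "phoneme": "aba"},
    {"word": "ab’ad", "phoneme": "abʔad"},
]
print(build_lookup(sample)["ab’ad"])
```

Calling `load_train_lookup()` needs network access; `build_lookup` works on any iterable of rows that have `word` and `phoneme` keys.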