YembaEGRA

Name: YembaEGRA
Creator: Mendeley Data
License: No description available

Indexed from doi.org on 2025-01-21

Download link:

http://doi.org/10.17632/74p9d5frg3.1


Description:

The Yemba language is a Bantu language spoken in the western region of Cameroon. It is one of the ten languages spoken by the Bamileke peoples. The basic education system in Cameroon is made up of three levels: level 1 (SIL-CEP), level 2 (CE1-CE2), and level 3 (CM1-CM2). Lessons in “National Languages and Cultures” (introduced in 2019) are guided by a curriculum that, for each level, contains all the teaching units as well as eight thematic fields (or centers of interest) around which learning takes place.

This corpus was built for training automatic speech recognition models that can be used to facilitate the learning and assessment of national languages in the basic education system in Cameroon. For each center of interest, an educational facilitator proposed a set of words. A linguist specializing in the Yemba language translated them, yielding a corpus of 60 words. Each word was then pronounced twice by 69 native speakers, all level 3 students (36 girls and 33 boys). The recordings were made in classrooms and quiet rooms close to the public schools of Melah and Toudjoua (located in the village of Bamendou in the Menoua department, West region, Cameroon).

In the metadata folder, the word corpus is stored in a csv file named words_corpus. Information about each speaker (gender, age, class) is stored in the speakers_description file, also in csv format. The audio folder is divided into eight sub-folders named CI1 to CI8, one per center of interest. Each of these contains three sub-folders named 1 to 3, one per level. Each level sub-folder holds the audio files of the words belonging to that center of interest and level, grouped in sub-folders named W1 to Wx (where x is the number of words of the center of interest). Each word folder contains audio files in wav format.
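The folder layout described above can be traversed with a short script. The following is a minimal sketch, assuming the audio folder has been downloaded and extracted locally; the function name `count_wav_files` is hypothetical and not part of the dataset. It makes no assumption about how the word folders are numbered and simply matches any folder whose name starts with W.

```python
from collections import Counter
from pathlib import Path


def count_wav_files(audio_root):
    """Count .wav files per (center of interest, level) under audio_root.

    Assumes the layout CI<k>/<level>/W<n>/<file>.wav described in the
    dataset description. Files that do not sit at that depth are ignored.
    """
    counts = Counter()
    for wav in Path(audio_root).glob("CI*/*/W*/*.wav"):
        # Relative parts are: (CI folder, level folder, word folder, file name).
        ci_dir, level_dir, _word_dir, _name = wav.relative_to(audio_root).parts
        counts[(ci_dir, level_dir)] += 1
    return counts
```

Calling `count_wav_files("audio")` returns a `Counter` keyed by pairs such as `("CI5", "3")`, which gives a quick check that every center of interest and level extracted correctly.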
Each audio file is named as follows: spkr_<speaker id>_word_<word id>_occ_<occurrence number>_ci_<center of interest id>_l_<level id>.wav. For example, the files spkr_2_word_40_occ_1_ci_5_l_3.wav and spkr_2_word_40_occ_2_ci_5_l_3.wav are occurrences 1 and 2, respectively, of word 40, which belongs to center of interest 5, pronounced by speaker number 2 of level 3.
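The naming convention above lends itself to a small parser. This is a sketch only: the `Recording` type and `parse_filename` function are hypothetical names introduced for illustration, and the regular expression assumes every id is a non-negative integer, as in the two example file names.

```python
import re
from typing import NamedTuple


class Recording(NamedTuple):
    speaker: int
    word: int
    occurrence: int
    center_of_interest: int
    level: int


# Mirrors spkr_<id>_word_<id>_occ_<n>_ci_<id>_l_<id>.wav with integer ids.
_PATTERN = re.compile(r"spkr_(\d+)_word_(\d+)_occ_(\d+)_ci_(\d+)_l_(\d+)\.wav")


def parse_filename(name):
    """Split a corpus file name into its five numeric fields."""
    match = _PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    return Recording(*(int(group) for group in match.groups()))
```

For the first example file name, `parse_filename("spkr_2_word_40_occ_1_ci_5_l_3.wav")` yields speaker 2, word 40, occurrence 1, center of interest 5, level 3. Raising on a mismatch rather than returning `None` makes stray or renamed files visible early.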

Provider:

Mendeley Data
