M2MET
Source: OpenDataLab | Updated: 2026-05-17 | Indexed: 2024-05-09
Download link:
https://opendatalab.org.cn/OpenDataLab/M2MET
Resource description:
AliMeeting contains a total of 118.75 hours of speech data: 104.75 hours in the training set (Train), 4 hours in the validation set (Eval), and 10 hours in the test set (Test). The training and validation sets contain 212 and 8 meetings, respectively. In each meeting, several speakers hold a discussion lasting 15 to 30 minutes. The training and validation sets involve 456 and 25 participants in total, respectively, with a balanced ratio of male and female participants. The training and validation sets will be emailed to participants when the challenge starts, and the test set will be released during the final evaluation stage.
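The split figures above can be cross-checked with a little arithmetic. The sketch below is a minimal Python illustration that hard-codes the numbers stated in this description. The dictionary and function names are hypothetical, and it assumes that each split's duration equals the combined length of its meeting recordings.

```python
# Figures as stated in the description; names are illustrative only.
SPLIT_HOURS = {"Train": 104.75, "Eval": 4.0, "Test": 10.0}
SPLIT_MEETINGS = {"Train": 212, "Eval": 8}  # meeting counts are given only for Train and Eval


def total_hours(splits: dict[str, float]) -> float:
    """Sum the per-split durations in hours."""
    return sum(splits.values())


def mean_meeting_minutes(split: str) -> float:
    """Average meeting length in minutes.

    Assumes a split's duration is the combined length of its meeting recordings.
    """
    return SPLIT_HOURS[split] * 60 / SPLIT_MEETINGS[split]


if __name__ == "__main__":
    print(total_hours(SPLIT_HOURS))  # 118.75
    for split in SPLIT_MEETINGS:
        print(split, round(mean_meeting_minutes(split), 1))  # Train 29.6, Eval 30.0
```

Under that assumption, the three splits add up to the stated 118.75 hours, and the average meeting lasts about 29.6 minutes in Train and 30 minutes in Eval, at the upper end of the stated 15 to 30 minute range.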
The dataset was recorded in 13 different meeting rooms, grouped by size into small, medium, and large rooms, with areas ranging from 8 to 55 square meters. The rooms differ in layout and acoustic characteristics, and the detailed parameters of each room will also be sent to participants. Wall materials include cement and glass, among others. Furnishings include sofas, televisions, blackboards, fans, air conditioners, and plants. During recording, the microphone array was placed on the table and the speakers sat around it, holding natural conversations. The distance between the microphone array and the speakers ranges from approximately 0.3 to 5.0 meters. All speakers are native Chinese speakers who speak Mandarin without a heavy accent. Various indoor noises may occur during recording, including keyboard clicks, doors opening and closing, fan noise, and bubble noise. All speakers stayed in the same position throughout the recording and did not walk around. No speaker appears in both the training and validation sets. Figure 1 shows the layout of a meeting room and the topology of the microphone array.
The number of speakers per meeting ranges from 2 to 4. To cover a variety of meeting scenarios, we selected multiple meeting topics, including regular meetings on medical care, education, business, organizational management, and industrial production. The average speech overlap rates of the training and validation sets are 42.27% and 34.76%, respectively. Table 1 gives details of the AliMeeting training and validation sets, and Table 2 shows the speech overlap rate and the number of meetings for each speaker count in the training and validation sets.
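This description does not define how the overlap rates are computed. As a hedged sketch, the Python function below assumes that the overlap rate is the time during which two or more speakers talk simultaneously, divided by the time during which at least one speaker talks, computed from per-segment (start, end) timestamps in seconds. The function name and this definition are assumptions, and the official definition may differ.

```python
from collections.abc import Iterable

Segment = tuple[float, float]  # (start, end) of one utterance, in seconds


def overlap_ratio(segments: Iterable[Segment]) -> float:
    """Overlapped speech time divided by total speech time.

    Assumed definition: time with >= 2 active speakers over time with >= 1
    active speaker. Overlapping segments are assumed to belong to different
    speakers, as they would in per-speaker transcripts.
    """
    events: list[tuple[float, int]] = []
    for start, end in segments:
        if end > start:
            events.append((start, +1))
            events.append((end, -1))
    # Sorting (time, delta) puts ends (-1) before starts (+1) at the same
    # instant, so back-to-back segments are not counted as overlapping.
    events.sort()

    active = 0
    prev_t = 0.0
    speech = overlap = 0.0
    for t, delta in events:
        dt = t - prev_t
        if active >= 1:
            speech += dt
        if active >= 2:
            overlap += dt
        active += delta
        prev_t = t
    return overlap / speech if speech > 0 else 0.0


if __name__ == "__main__":
    # Two speakers overlap for 5 s out of 20 s of speech.
    print(overlap_ratio([(0.0, 10.0), (5.0, 15.0), (20.0, 25.0)]))  # 0.25
```

A single sweep over sorted start and end events keeps the cost at O(n log n) in the number of segments. Multiplying the result by 100 gives a percentage in the same form as the 42.27% and 34.76% quoted above. Reproducing those exact values, however, would require the official definition and the official segment boundaries.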
We also recorded each speaker's near-field audio with a headset microphone and ensured that only that speaker's own speech was transcribed. Note that the far-field audio recorded by the microphone array and the near-field audio recorded by the headset microphones are time-synchronized. All transcripts of each meeting are stored in TextGrid format and include the meeting duration, speaker information (number of speakers, speaker IDs, gender, etc.), the total number of segments per speaker, and the timestamps and transcription of each segment.
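Because the transcripts are stored as TextGrid files, per-speaker segment counts and timestamps can be read back programmatically. The sketch below is a minimal, hypothetical parser for Praat's long text TextGrid layout with single-line text values. It assumes, without confirmation from this description, that each speaker has an IntervalTier named by speaker ID and that silence appears as intervals with empty text. The sample content and the speaker ID `SPK_A` are invented for illustration. For real files, an established library such as `praatio` is the safer choice.

```python
import re
from collections import defaultdict

# Hypothetical type: (xmin, xmax, text) for one interval, times in seconds.
Interval = tuple[float, float, str]

ITEM_RE = re.compile(r"^item \[\d+\]:$")
INTERVAL_RE = re.compile(r"^intervals \[\d+\]:$")
KV_RE = re.compile(r"^(\w+)\s*=\s*(.*)$")


def parse_textgrid(content: str) -> dict[str, list[Interval]]:
    """Map each tier name to its list of intervals.

    Handles only the long ("ooTextFile") layout with single-line text values.
    Point tiers and multi-line strings are not supported.
    """
    tiers: dict[str, list[Interval]] = defaultdict(list)
    tier_name = None
    in_interval = False
    xmin = xmax = 0.0
    for raw in content.splitlines():
        line = raw.strip()
        if ITEM_RE.match(line):
            # A new tier starts: its own xmin/xmax are tier bounds, not segments.
            in_interval = False
            continue
        if INTERVAL_RE.match(line):
            in_interval = True
            continue
        m = KV_RE.match(line)
        if m is None:
            continue
        key, value = m.group(1), m.group(2).strip()
        if key == "name" and not in_interval:
            tier_name = value.strip('"')
        elif in_interval and key == "xmin":
            xmin = float(value)
        elif in_interval and key == "xmax":
            xmax = float(value)
        elif in_interval and key == "text" and tier_name is not None:
            # Strip the outer quotes. Praat writes a literal quote as "".
            tiers[tier_name].append((xmin, xmax, value[1:-1].replace('""', '"')))
    return dict(tiers)


def segments_per_speaker(tiers: dict[str, list[Interval]]) -> dict[str, int]:
    """Count non-empty intervals per tier (assumed: one tier per speaker)."""
    return {name: sum(1 for _, _, text in ivs if text.strip())
            for name, ivs in tiers.items()}


# Invented sample in Praat's long TextGrid layout.
SAMPLE = '''File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 6
tiers? <exists>
size = 1
item []:
    item [1]:
        class = "IntervalTier"
        name = "SPK_A"
        xmin = 0
        xmax = 6
        intervals: size = 2
        intervals [1]:
            xmin = 0
            xmax = 2.5
            text = "你好"
        intervals [2]:
            xmin = 2.5
            xmax = 6
            text = ""
'''

if __name__ == "__main__":
    tiers = parse_textgrid(SAMPLE)
    print(tiers)                        # {'SPK_A': [(0.0, 2.5, '你好'), (2.5, 6.0, '')]}
    print(segments_per_speaker(tiers))  # {'SPK_A': 1}
```

Because the far-field and near-field recordings are time-synchronized, the same (xmin, xmax) pair can in principle be used to cut a segment from either signal.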
Provider:
OpenDataLab
Created:
2023-06-25
Compiled Summary
Dataset Introduction

Background and Challenges
Background Overview
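AliMeeting, the dataset of the M2MeT (Multi-channel Multi-party Meeting Transcription) challenge, is a Mandarin Chinese meeting speech corpus containing 118.75 hours of speech recorded in 13 meeting rooms of different sizes and acoustic characteristics. It covers natural conversations among 2 to 4 speakers on topics from medical care, education, and other domains, and has a high speech overlap rate (42.27% in the training set), which makes it suitable for research on multi-channel, multi-speaker speech recognition. The dataset provides time-synchronized far-field and near-field audio with detailed transcripts. It was released in 2022 by Alibaba and other institutions, primarily for the ICASSP 2022 challenge.
The above content was collected and summarized by 遇见数据集.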



