SeniorTalk
Source: ModelScope community. Updated: 2026-05-17. Indexed: 2025-04-05.
Download link:
https://modelscope.cn/datasets/BAAI/SeniorTalk
# SeniorTalk: A Chinese Conversation Dataset with Rich Annotations for Super-Aged Seniors
[Hugging Face](https://huggingface.co/datasets/BAAI/SeniorTalk)
[Paper (arXiv)](https://www.arxiv.org/pdf/2503.16578)
[License: CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/)
[GitHub](https://github.com/flageval-baai/SeniorTalk)
## Introduction
**SeniorTalk** is a comprehensive, open-source Mandarin Chinese speech dataset designed for research on elderly speakers aged 75 to 85. It addresses the critical lack of publicly available resources for this age group and supports work on automatic speech recognition (ASR), speaker verification (SV), speaker diarization (SD), speech editing, and related fields. The dataset is released under a **CC BY-NC-SA 4.0 license**, which permits non-commercial use only.
## Dataset Details
This dataset contains 55.53 hours of high-quality speech data collected from 202 elderly speakers across 16 provinces in China. Key features of the dataset include:
* **Age Range:** 75-85 years old (inclusive). This is a crucial age range often overlooked in speech datasets.
* **Speakers:** 202 unique elderly speakers.
* **Geographic Diversity:** Speakers from 16 of China's 34 provincial-level administrative divisions, capturing a range of regional accents.
* **Gender Balance:** A male-to-female speaker ratio of approximately 7:13, largely attributed to the difference in life expectancy between men and women.
* **Recording Conditions:** Recordings were made in quiet environments using a variety of smartphones (both Android and iPhone devices) to ensure real-world applicability.
* **Content:** Natural, conversational speech during age-appropriate activities. The content is unrestricted, promoting spontaneous and natural interactions.
* **Audio Format:** WAV files with a 16kHz sampling rate.
* **Transcriptions:** Carefully crafted, character-level manual transcriptions.
* **Annotations:** The dataset is annotated at the session, utterance, token, and speaker levels:
    * **Session-level**: `sentence_start_time`, `sentence_end_time`, `overlapped speech`.
    * **Utterance-level**: `id`, `accent_level`, `text` (transcription).
    * **Token-level**: `special token` (`[SONANT]`, `[MUSIC]`, `[NOISE]`, ...).
    * **Speaker-level**: `speaker_id`, `age`, `gender`, `location` (province), `device`.
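
The annotation levels listed above can be pictured as simple records. The sketch below is illustrative only: the class names, the field types, the assumption that times are in seconds, and the attribute names `overlapped_speech` and `special_tokens` are all hypothetical, since this card does not specify an on-disk schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Speaker:
    # Speaker-level fields named in the card; the types are assumptions.
    speaker_id: str
    age: int
    gender: str
    location: str  # province
    device: str


@dataclass
class Utterance:
    # Utterance-level fields plus the session-level timing fields.
    id: str
    text: str
    accent_level: str  # type assumed; the card does not describe the scale
    sentence_start_time: float  # assumed to be seconds
    sentence_end_time: float
    overlapped_speech: bool = False
    special_tokens: List[str] = field(default_factory=list)

    @property
    def duration(self) -> float:
        """Length of the utterance, derived from its start and end times."""
        return self.sentence_end_time - self.sentence_start_time


def check_age_range(speaker: Speaker, low: int = 75, high: int = 85) -> bool:
    """Return True if the speaker is within the card's stated age range (inclusive)."""
    return low <= speaker.age <= high
```

Keeping speaker metadata separate from utterances mirrors the card's split between speaker-level and utterance-level annotations, so one `Speaker` record can be shared by many `Utterance` records.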
### Dataset Structure
#### Dialogue Dataset
The dataset is split into two subsets:
| Split | # Speakers | # Dialogues | Duration (hrs) | Avg. Dialogue Length (h) |
| :--------- | :--------: | :----------: | :------------: | :-----------------------: |
| `train` | 182 | 91 | 49.83 | 0.54 |
| `test` | 20 | 10 | 5.70 | 0.57 |
| **Total** | **202** | **101** | **55.53** | **0.55** |
The dataset file structure is as follows.
```
dialogue_data/
├── wav
│ ├── train/*.tar
│ └── test/*.tar
└── transcript/*.txt
UTTERANCEINFO.txt # annotation of topics and duration
SPKINFO.txt # annotation of location, age, gender, and device
```
Each WAV file has a corresponding TXT file with the same name, containing its annotations.
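
A minimal sketch of how the WAV-to-transcript pairing might be indexed without unpacking the audio. It assumes that each `.tar` shard contains `.wav` members and that the matching transcript sits directly in `transcript/` under the same stem; the function name `index_split` is hypothetical, and the real archive layout should be checked after download.

```python
import tarfile
from pathlib import Path
from typing import Dict, Optional


def index_split(wav_split_dir: Path, transcript_dir: Path) -> Dict[str, Optional[Path]]:
    """Map each WAV stem found in a split's tar shards to its transcript path.

    The value is None when no same-named .txt file exists in transcript_dir.
    """
    pairs: Dict[str, Optional[Path]] = {}
    for shard in sorted(wav_split_dir.glob("*.tar")):
        with tarfile.open(shard) as tar:
            for member in tar.getmembers():
                # Only regular files ending in .wav are treated as audio.
                if member.isfile() and member.name.lower().endswith(".wav"):
                    stem = Path(member.name).stem
                    txt = transcript_dir / f"{stem}.txt"
                    pairs[stem] = txt if txt.is_file() else None
    return pairs
```

For the dialogue data this would be called as `index_split(Path("dialogue_data/wav/train"), Path("dialogue_data/transcript"))`. Any stem mapped to `None` signals a WAV file whose transcript was not found where this sketch expects it.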
For more details, please refer to our paper [SeniorTalk](https://www.arxiv.org/abs/2503.16578).
#### ASR Dataset
The dataset is split into three subsets:
| Split | # Speakers | # Utterances | Duration (hrs) | Avg. Utterance Length (s) |
| :--------- | :--------: | :----------: | :------------: | :-----------------------: |
| `train` | 162 | 47,269 | 29.95 | 2.28 |
| `validation` | 20 | 6,891 | 4.09 | 2.14 |
| `test` | 20 | 5,869 | 3.77 | 2.31 |
| **Total** | **202** | **60,029** | **37.81** | **2.27** |
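
The last column follows from the other two: average utterance length in seconds is the duration in hours times 3600, divided by the number of utterances. A quick consistency check using only the figures in the table above (the variable and function names are illustrative):

```python
# Figures copied from the ASR split table: (utterances, duration in hours).
ASR_SPLITS = {
    "train": (47_269, 29.95),
    "validation": (6_891, 4.09),
    "test": (5_869, 3.77),
}


def avg_utterance_seconds(n_utterances: int, hours: float) -> float:
    """Average utterance length in seconds, rounded to two decimals."""
    return round(hours * 3600 / n_utterances, 2)


per_split = {name: avg_utterance_seconds(n, h) for name, (n, h) in ASR_SPLITS.items()}
total_utterances = sum(n for n, _ in ASR_SPLITS.values())
total_hours = round(sum(h for _, h in ASR_SPLITS.values()), 2)
overall = avg_utterance_seconds(total_utterances, total_hours)

print(per_split, total_utterances, total_hours, overall)
```

The recomputed values agree with the table: 2.28, 2.14, and 2.31 seconds per split, and 2.27 seconds over all 60,029 utterances.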
The dataset file structure is as follows.
```
sentence_data/
├── wav
│ ├── train/*.tar
│ ├── dev/*.tar
│ └── test/*.tar
└── transcript/*.txt
UTTERANCEINFO.txt # annotation of topics and duration
SPKINFO.txt # annotation of location, age, gender, and device
```
Each WAV file has a corresponding TXT file with the same name, containing its annotations.
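
After extracting a shard, it can be useful to confirm that the audio matches the 16 kHz sampling rate stated under Dataset Details. The sketch below uses only Python's standard `wave` module, which reads uncompressed PCM WAV files; the helper names `wav_info` and `is_expected_rate` are hypothetical.

```python
import wave
from pathlib import Path

EXPECTED_RATE_HZ = 16_000  # sampling rate stated in the card


def wav_info(path: Path) -> dict:
    """Read basic header fields from a PCM WAV file."""
    with wave.open(str(path), "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return {
            "sample_rate": rate,
            "channels": wf.getnchannels(),
            "duration_s": frames / rate,
        }


def is_expected_rate(path: Path, expected: int = EXPECTED_RATE_HZ) -> bool:
    """Return True if the file's sampling rate equals the expected value."""
    return wav_info(path)["sample_rate"] == expected
```

Summing `duration_s` over a split gives a rough cross-check against the durations reported in the table, though small differences from rounding are to be expected.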
For more details, please refer to our paper [SeniorTalk](https://www.arxiv.org/abs/2503.16578).
## Dataset Access Control
This dataset is available to researchers upon request for academic and non-commercial use. To request access, please follow these steps:
1. **Request Access on Hugging Face:** Make sure you are logged into your Hugging Face account and click the "Request access to this dataset" button on this page.
2. **Submit Application via Email:** Send an email to **`your-research-email@example.com`** with the following information:
    * **Subject:** Dataset Access Request: [Your Name/Institution]
    * **Body:**
        * Your Hugging Face username.
        * Your full name, title, and academic/institutional affiliation.
        * A link to your professional profile (e.g., university page, Google Scholar, LinkedIn).
        * A brief description of your research project and how you intend to use the dataset.
We will review your application and grant access on Hugging Face upon approval. Please allow 3-5 business days for processing.
## 📚 Cite me
```
@misc{chen2025seniortalkchineseconversationdataset,
title={SeniorTalk: A Chinese Conversation Dataset with Rich Annotations for Super-Aged Seniors},
author={Yang Chen and Hui Wang and Shiyao Wang and Junyang Chen and Jiabei He and Jiaming Zhou and Xi Yang and Yequan Wang and Yonghua Lin and Yong Qin},
year={2025},
eprint={2503.16578},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2503.16578},
}
```
Provider: maas
Created: 2025-03-31