skit-ai/skit-s2i
Source: Hugging Face. Updated 2024-10-02; indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/skit-ai/skit-s2i
---
language:
- en
license: cc-by-nc-4.0
size_categories:
- 1K<n<10K
task_categories:
- audio-classification
- automatic-speech-recognition
pretty_name: Skit-S2I
tags:
- intent-recognition
- speech
dataset_info:
  features:
  - name: audio
    dtype: audio
  - name: intent_class
    dtype: int64
  - name: template
    dtype: string
  - name: speaker_id
    dtype: int64
  splits:
  - name: train
    num_bytes: 698801842.48
    num_examples: 10445
  - name: test
    num_bytes: 93949690.4
    num_examples: 1400
  download_size: 495247674
  dataset_size: 792751532.88
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
---
Skit-S2I is a **Speech to Intent** dataset for Indian English (`en-IN`) that covers 14 coarse-grained intents from the Banking domain. This work is inspired by a similar release, the [Minds-14 dataset](https://huggingface.co/datasets/PolyAI/minds14); here, we restrict ourselves to Indian English but provide a larger training set. The dataset is split into:
- test - `100` samples per intent
- train - `>650` samples per intent
The data was generated by 11 Indian speakers, recorded over a telephony line. We also provide access to anonymised speaker information - like gender, languages spoken, native language - to enable more structured discussions around robustness and bias in the models you train.
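A minimal sketch of loading the two splits with the Hugging Face `datasets` library. It assumes the library is installed (a release recent enough to provide `select_columns`), that network access is available, and that the repository id `skit-ai/skit-s2i` from the dataset page resolves on the Hub. The helper names `load_skit_s2i`, `label_histogram` and `demo` are illustrative only.

```python
from collections import Counter


def load_skit_s2i(repo_id="skit-ai/skit-s2i"):
    """Download both splits from the Hugging Face Hub (needs network access)."""
    # Imported lazily so label_histogram works even without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id)


def label_histogram(rows, key="intent_class"):
    """Count rows per value of `key`; `rows` is any iterable of dict-like records."""
    return Counter(row[key] for row in rows)


def demo():
    """Print per-intent and per-speaker counts for the train split."""
    ds = load_skit_s2i()
    # Keep only the metadata columns so iteration does not decode every audio clip.
    meta = ds["train"].select_columns(["intent_class", "speaker_id"])
    print(label_histogram(meta))
    print(label_histogram(meta, key="speaker_id"))
```

Calling `demo()` triggers the download. `label_histogram` can be used on its own with any list of records that carry an `intent_class` or `speaker_id` field.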
<div class="course-tip course-tip-orange bg-gradient-to-br dark:bg-gradient-to-r before:border-orange-500 dark:before:border-orange-800 from-orange-50 dark:from-gray-900 to-white dark:to-gray-950 border border-orange-50 text-orange-700 dark:text-gray-400">
<p>This Datasheet follows from the <a href="https://arxiv.org/pdf/1803.09010.pdf" target="_blank">Datasheets for datasets</a> paper.</p>
</div>
# Motivation
**Q1) For what purpose was the dataset created ? Was there a specific task in mind ? Was there a specific gap that needed to be filled ?**
Ans. This is a dataset for Intent classification from (Indian English) speech, and covers 14 coarse-grained intents from the Banking domain. While there are other datasets that have approached this task, here we provide a much larger training dataset (`>650` samples per intent) to train models in an end-to-end fashion. We also provide anonymised speaker information to help answer questions around model robustness and bias.
**Q2) Who created the dataset and on behalf of which entity ?**
Ans. The (internal) Operations team at Skit was involved in the generation of the dataset, and provided their information for (anonymous) release. [Unnati Senani](https://unnu.so/about/) was involved in the curation of utterance templates, and [Kriti Anandan](https://github.com/kritianandan98) and [Kumarmanas Nethil](https://huggingface.co/janaab) were involved in the planning and collection of utterances - using an internal tool called [sandbox](https://github.com/skit-ai/sandbox). These contributors worked on this dataset as part of the Conversational UX and ML teams at Skit.
**Q3) Who funded the creation of the dataset ?**
Ans. Skit funded the creation of this dataset.
# Composition
**Q4) What do the instances that comprise the dataset consist of ?**
Ans. The intent dataset is split across `train.csv` and `test.csv`. In both, individual instances consist of the following fields:
- `id`
- `intent_class`
- `template`
- `audio_path`
- `speaker_id`
You can trace more information on the intents, using the shared `intent_class` field in `intent_info.csv`:
- `intent_class`
- `intent_name`
- `description`
You can trace more information on the speakers, using the shared `speaker_id` field in `speaker_info.csv`:
- `speaker_id`
- `native_language`
- `languages_spoken`
- `places_lived`
- `gender`
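The lookups described above amount to two key-based joins. The sketch below shows one way to do them with the standard library only. The field names come from the lists above; every value in the toy CSV strings (the intent name, the template text, the `;` separator in `languages_spoken`) is invented for illustration, and `read_csv` and `attach_metadata` are hypothetical helper names.

```python
import csv
import io


def read_csv(handle):
    """Read a CSV file object into a list of dicts keyed by the header row."""
    return list(csv.DictReader(handle))


def attach_metadata(samples, intents, speakers):
    """Add intent and speaker columns to each sample via the shared key fields."""
    intent_by_class = {row["intent_class"]: row for row in intents}
    speaker_by_id = {row["speaker_id"]: row for row in speakers}
    joined = []
    for sample in samples:
        merged = dict(sample)
        merged.update(intent_by_class[sample["intent_class"]])
        merged.update(speaker_by_id[sample["speaker_id"]])
        joined.append(merged)
    return joined


# Toy stand-ins for the three files; every value below is invented.
SAMPLES_CSV = (
    "id,intent_class,template,audio_path,speaker_id\n"
    "0,3,example template,audio/0.wav,7\n"
)
INTENTS_CSV = (
    "intent_class,intent_name,description\n"
    "3,example_intent,An invented intent.\n"
)
SPEAKERS_CSV = (
    "speaker_id,native_language,languages_spoken,places_lived,gender\n"
    "7,Hindi,Hindi;English,Bihar,female\n"
)

rows = attach_metadata(
    read_csv(io.StringIO(SAMPLES_CSV)),
    read_csv(io.StringIO(INTENTS_CSV)),
    read_csv(io.StringIO(SPEAKERS_CSV)),
)
print(rows[0]["intent_name"], rows[0]["native_language"])
```

With the real files, the `io.StringIO(...)` arguments would be replaced by opened file handles, for example `open("train.csv", newline="")`. The `csv` module returns every value as a string, so the keys match without any type conversion.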
**Q5) How many instances are there in total (of each type, if appropriate) ?**
Ans. In all there are `11845` samples, across the train and test splits:
- `test.csv` has a total of `1400` samples, with exactly `100` samples per intent
- `train.csv` has a total of `10445` samples, with at least `650` samples per intent
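The figures above can be checked mechanically. The sketch below first confirms that the quoted totals are arithmetically consistent, then defines a hypothetical helper, `check_split_counts`, that takes the `intent_class` values of each split and reports any intent that violates the stated per-intent counts. It assumes the labels have already been read into plain Python lists.

```python
from collections import Counter

N_INTENTS = 14
TEST_PER_INTENT = 100
TRAIN_MIN_PER_INTENT = 650

# The totals quoted in the datasheet agree with the per-intent figures.
assert N_INTENTS * TEST_PER_INTENT == 1400
assert 10445 + 1400 == 11845
assert N_INTENTS * TRAIN_MIN_PER_INTENT <= 10445


def check_split_counts(train_labels, test_labels):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    train_counts = Counter(train_labels)
    test_counts = Counter(test_labels)
    if len(test_counts) != N_INTENTS:
        problems.append(f"expected {N_INTENTS} intents in test, found {len(test_counts)}")
    for intent in sorted(set(train_counts) | set(test_counts)):
        if test_counts[intent] != TEST_PER_INTENT:
            problems.append(f"intent {intent}: {test_counts[intent]} test samples")
        if train_counts[intent] < TRAIN_MIN_PER_INTENT:
            problems.append(f"intent {intent}: only {train_counts[intent]} train samples")
    return problems
```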
The 11 speakers are distributed across the dataset, but unequally. However:
- each intent has data from all speakers
- the speakers are stratified across the train and test split - for each intent independently
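One way such a per-intent, speaker-stratified split could be produced is sketched below. This is an assumption about the general technique, not the procedure the authors actually used: within each intent, the test quota is shared among speakers in proportion to how many samples each contributed, using largest-remainder rounding. The function name `stratified_split` and its parameters are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(samples, test_per_intent, seed=0):
    """Split samples into (train, test), stratifying speakers within each intent."""
    rng = random.Random(seed)
    by_intent = defaultdict(lambda: defaultdict(list))
    for sample in samples:
        by_intent[sample["intent_class"]][sample["speaker_id"]].append(sample)

    train, test = [], []
    for by_speaker in by_intent.values():
        total = sum(len(items) for items in by_speaker.values())
        if test_per_intent > total:
            raise ValueError("test quota exceeds the samples available for an intent")
        # Proportional share per speaker, in integer arithmetic to avoid float error.
        quota = {spk: len(items) * test_per_intent // total for spk, items in by_speaker.items()}
        remainder = {spk: len(items) * test_per_intent % total for spk, items in by_speaker.items()}
        leftover = test_per_intent - sum(quota.values())
        # Hand the remaining slots to the speakers with the largest remainders.
        for spk in sorted(remainder, key=remainder.get, reverse=True)[:leftover]:
            quota[spk] += 1
        for spk, items in by_speaker.items():
            shuffled = items[:]
            rng.shuffle(shuffled)
            test.extend(shuffled[: quota[spk]])
            train.extend(shuffled[quota[spk]:])
    return train, test
```

Each intent is processed on its own, so the speaker proportions in the test split mirror those of that intent rather than those of the dataset as a whole.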
Some statistics on the speakers are provided below. More granular information can be found in `speaker_info.csv`:
- Native languages: `Hindi`(4), `Bengali`(3), `Kannada`(2), `Malayalam`(1), `Punjabi`(1)
- Languages spoken: `Hindi`, `English`, `Bengali`, `Odia`, `Kannada`, `Punjabi`, `Malayalam`, `Bihari`, `Marathi`
- Indian states lived in: `Bihar`, `Odisha`, `Karnataka`, `West Bengal`, `Punjab`, `Kerala`, `Jharkhand`, `Maharashtra`
**Q6) Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set ?**
Ans. For each intent, our Conversational UX team generated a list of templates. These are meant to be a (satisfactory) representation of the variations in utterances seen in human speech. These templates were used as a guide by the speakers when generating data. So, this dataset is limited by the templates and the variations that speakers added (spontaneously).
**Q7) Are there recommended data splits (e.g., training, development/validation, testing) ?**
Ans. The recommended split into train and test sets is provided as `train.csv` and `test.csv` respectively.
**Q8) Are there any errors, sources of noise, or redundancies in the dataset?**
Ans. There could be channel noise present in the dataset, because the data was generated through telephone calls. However, background noise will not be as prevalent as in real-world scenarios, since these telephone calls were made in a semi-controlled environment.
**Q9) Other comments.**
Ans. Speakers were responsible for generating variations in utterances, using the `template` field as a guide. So, there could be some unintentional overlap across the content of utterances.
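A rough way to gauge this overlap is to compare the `template` values across splits. The sketch below is only a proxy: the template is the guide a speaker was given, not a transcript of what was actually said, so it can both over- and under-estimate overlap in the audio. The helper names `normalise` and `shared_templates` are hypothetical.

```python
def normalise(text):
    """Lower-case and collapse whitespace so trivially different templates compare equal."""
    return " ".join(text.lower().split())


def shared_templates(train_rows, test_rows):
    """Return the set of normalised templates that occur in both splits."""
    train_set = {normalise(row["template"]) for row in train_rows}
    test_set = {normalise(row["template"]) for row in test_rows}
    return train_set & test_set
```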
# Collection Process
**Q10) How was the data associated with each instance acquired ?**
Ans. Members of the (internal) Operations team generated each utterance, using the associated `template` field as a guide and injecting their own variations into the speech utterance.
**Q11) Who was involved in the data collection process and how were they compensated ?**
Ans. The data was generated by the (internal) Operations team and they are/were full-time employees.
**Q12) Over what timeframe was the data collected ?**
Ans. This data was collected over a time period of 1 month.
**Q13) Was any preprocessing/cleaning/labelling of the data done ?**
Ans. Audio instances in the dataset were *auto-labelled* with their associated `intent` and `template` fields. For more information on this, refer to the documentation of [sandbox](https://github.com/skit-ai/sandbox).
# Recommended Uses
**Q14) Has the dataset been used for any tasks already ?**
Ans. It has been used to benchmark models for the task of intent classification from speech.
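When benchmarking, the speaker information can be used to break a single accuracy figure down by group, which is one way to approach the robustness and bias questions raised in this datasheet. The sketch below is a generic illustration, not the evaluation the authors ran; `accuracy_by_group` is a hypothetical helper, and `groups` would hold one attribute value per test sample (for example the speaker's `gender` or `native_language`).

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return {group: accuracy} for parallel lists of labels, predictions and groups."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for truth, prediction, group in zip(y_true, y_pred, groups):
        totals[group] += 1
        hits[group] += int(truth == prediction)
    return {group: hits[group] / totals[group] for group in totals}
```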
**Q15) What (other) tasks could the dataset be used for ?**
Ans. We provide speaker characteristics, so this dataset could also be used for other classification tasks from speech, such as predicting gender or native language.
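Targets for such alternate tasks can be derived by mapping each sample's `speaker_id` to the chosen column of `speaker_info.csv`. A minimal sketch, assuming both tables are already loaded as lists of dicts; `alternate_targets` is a hypothetical helper name.

```python
def alternate_targets(samples, speakers, attribute):
    """Return (labels, classes): integer targets per sample and the sorted class names."""
    value_by_id = {row["speaker_id"]: row[attribute] for row in speakers}
    values = [value_by_id[sample["speaker_id"]] for sample in samples]
    classes = sorted(set(values))
    index = {name: i for i, name in enumerate(classes)}
    return [index[value] for value in values], classes
```

For example, `alternate_targets(train_rows, speaker_rows, "native_language")` would yield one integer label per utterance, with `classes` giving the language each integer stands for.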
# Distribution and Maintenance
**Q16) Will the dataset be distributed under a copyright or other intellectual property (IP) license ?**
Ans. This dataset is being distributed under a [CC BY-NC 4.0 license](https://creativecommons.org/licenses/by-nc/4.0/).
**Q17) Who will be maintaining the dataset ?**
Ans. The research team at Skit will be maintaining the dataset. They can be contacted by sending an email to ml-research@skit.ai.
**Q18) Will the dataset be updated in the future (e.g., to correct labelling errors, add new instances, delete instances) ?**
Ans. In case there are errors, we will try to collate and share an updated version every 3 months. We also plan to add more instances and variations to the dataset - to make it more robust.
Provider: skit-ai



