
The People’s Speech | Speech Recognition Dataset | Data Diversity Dataset

Source: arXiv | Updated: 2021-11-18 | Indexed: 2024-06-21
Speech recognition
Data diversity
Download link:
https://github.com/mlcommons/peoples-speech
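Resource overview:
The People’s Speech is a large-scale, diverse English speech recognition dataset created by Harvard University and other research institutions. It contains 30,000 hours of speech, drawn mainly from the Internet Archive. The data were collected by searching the internet for appropriately licensed audio with existing transcriptions, and the data collection system is released under the Apache 2.0 license. The dataset targets the quality and diversity of training data for automatic speech recognition systems, particularly in commercial applications, where a large and varied body of speech can improve model accuracy and generalization.
Provided by:
Harvard University
Created:
2021-11-18
AI-collected summary
Dataset introduction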
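Construction
The People’s Speech was built by searching the internet for appropriately licensed audio with existing transcripts, yielding a 30,000-hour supervised dataset of conversational English speech for speech recognition. Construction relied on forced alignment to match the audio to its transcript text, and the data collection system is released as open source under the Apache 2.0 license. The pipeline draws on the Internet Archive, which supports both the diversity and the legal standing of the data.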
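To make the forced alignment step concrete, here is a minimal sketch in Python. It is not the project's actual aligner: it assumes a hypothetical acoustic model has already produced per-frame log-probabilities over token classes, ignores CTC blank symbols, and simply finds the best monotonic assignment of frames to transcript tokens with a Viterbi-style dynamic program. The function name `forced_align` and the toy input are illustrative only.

```python
import math

def forced_align(log_probs, tokens):
    """Assign each frame to one transcript token, monotonically.

    log_probs[t][k] is the (assumed) log-probability of token class k at
    frame t; tokens is the transcript as a list of class ids. Every token
    must cover at least one frame. Returns (token, start, end) spans with
    an exclusive end frame.
    """
    T, N = len(log_probs), len(tokens)
    if N == 0 or T < N:
        raise ValueError("need at least one frame per token")
    NEG = -math.inf
    score = [[NEG] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]  # 0 = stay on token, 1 = advance
    score[0][0] = log_probs[0][tokens[0]]
    for t in range(1, T):
        for j in range(min(t + 1, N)):
            stay = score[t - 1][j]
            advance = score[t - 1][j - 1] if j > 0 else NEG
            if advance > stay:
                best, back[t][j] = advance, 1
            else:
                best, back[t][j] = stay, 0
            if best > NEG:
                score[t][j] = best + log_probs[t][tokens[j]]
    # Backtrack from the last frame of the last token.
    path, j = [0] * T, N - 1
    for t in range(T - 1, -1, -1):
        path[t] = j
        if t > 0:
            j -= back[t][j]
    # Collapse the frame-level path into contiguous spans.
    spans, start = [], 0
    for t in range(1, T + 1):
        if t == T or path[t] != path[start]:
            spans.append((tokens[path[start]], start, t))
            start = t
    return spans

# Toy input: two frames favour class 0, then two frames favour class 1.
hi, lo = math.log(0.9), math.log(0.05)
frames = [[hi, lo, lo], [hi, lo, lo], [lo, hi, lo], [lo, hi, lo]]
print(forced_align(frames, [0, 1]))
```

Production aligners typically add blank handling, beam pruning, and tolerance for transcript errors, none of which this sketch attempts.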
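Features
The main features of The People’s Speech are its scale, diversity, and commercial usability. It covers many speech settings, including film, television, news, and music, and contains natural background noise, which brings it closer to real deployment conditions. The data are licensed under CC-BY and CC-BY-SA, which permit both academic and commercial use.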
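The licensing constraint can be illustrated with a small filter. This is a sketch under stated assumptions, not the project's collection code: the record layout (dicts with `id` and `license` keys), the helper names, and the normalization rules are all hypothetical, and real metadata would likely carry license URLs and version numbers that need more careful parsing.

```python
# Licenses the text names as permitting academic and commercial use.
ALLOWED_LICENSES = {"CC-BY", "CC-BY-SA"}

def normalize_license(raw):
    """Normalize a free-form license label, e.g. 'cc by-sa' -> 'CC-BY-SA'."""
    return raw.strip().upper().replace("_", "-").replace(" ", "-")

def filter_commercially_usable(records):
    """Keep only records whose (hypothetical) 'license' field is allowed."""
    return [
        r for r in records
        if normalize_license(r.get("license", "")) in ALLOWED_LICENSES
    ]

# Hypothetical metadata records for illustration.
records = [
    {"id": "a1", "license": "cc-by"},
    {"id": "b2", "license": "CC BY-SA"},
    {"id": "c3", "license": "CC-BY-NC"},  # non-commercial: excluded
    {"id": "d4"},                          # unknown license: excluded
]
print([r["id"] for r in filter_commercially_usable(records)])
```

Treating a missing license as disallowed is a deliberately conservative choice here.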
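Usage
The People’s Speech is suited to training and evaluating automatic speech recognition (ASR) systems. Users can download the dataset, preprocess it with the provided forced alignment tools, and then train deep learning models. Its diversity and scale make it particularly suitable for developing speech recognition systems that generalize across environments and speakers. The open-source tooling and detailed documentation also support custom processing and extension.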
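Background and challenges
Background overview
The People’s Speech is a large-scale, diverse English speech recognition dataset developed jointly by researchers at NVIDIA, Landing AI, Factored, and other institutions, and created in 2021. It contains 30,000 hours of supervised speech data covering many settings and kinds of background noise, and aims to provide high-quality speech recognition training data for academic and commercial use. Its core research question is how to collect and curate large-scale, diverse speech data from the internet while ensuring that the data are legal to use and commercially usable. By adopting the Creative Commons Attribution (CC-BY) and Creative Commons Attribution-ShareAlike (CC-BY-SA) licenses, the dataset addresses the licensing and commercial-use restrictions that have constrained speech recognition data, supporting further progress in the field.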
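Current challenges
Building The People’s Speech involved several challenges. First, the required diversity and scale meant collecting large amounts of transcribed audio from the internet while ensuring that its licenses permit commercial use. Second, construction had to handle multiple languages and background noise, which made data processing more complex. Third, forced alignment posed technical difficulties of its own, such as handling inaccurate transcripts, segmenting long audio files, and keeping the alignment accurate. Finally, maintaining and updating the dataset is an ongoing challenge, particularly on legal and ethical questions such as verifying that data sources are legitimate and handling potential copyright disputes.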
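The long-audio segmentation problem can be sketched as follows. This is an illustration rather than the dataset's actual procedure: it assumes word-level timestamps are already available from alignment, and the 15-second default for `max_seconds` is an arbitrary assumption, not a value taken from the source.

```python
def segment_words(words, max_seconds=15.0):
    """Group (word, start, end) tuples into segments of bounded duration.

    A new segment starts whenever adding the next word would make the
    current segment longer than max_seconds. Returns a list of
    (text, segment_start, segment_end) tuples.
    """
    segments, current = [], []
    for word, start, end in words:
        if current and end - current[0][1] > max_seconds:
            segments.append(current)
            current = []
        current.append((word, start, end))
    if current:
        segments.append(current)
    return [
        (" ".join(w for w, _, _ in seg), seg[0][1], seg[-1][2])
        for seg in segments
    ]

# Hypothetical aligned words with times in seconds.
aligned = [("a", 0.0, 1.0), ("b", 1.0, 2.0), ("c", 2.0, 3.5), ("d", 3.5, 4.0)]
print(segment_words(aligned, max_seconds=2.0))
```

A practical segmenter would also prefer to cut at silences or sentence boundaries, and would have to decide what to do with a single word longer than the limit, which this sketch keeps as its own segment.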
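Common scenarios
Classic use cases
One of the most typical uses of The People’s Speech is building and training automatic speech recognition (ASR) systems. Because the dataset contains 30,000 hours of diverse English speech, ranging from government speeches, interviews, and health lectures to entertainment programs, it is well suited to training speech recognition models that adapt to different environments and speakers. With it, researchers and developers can build ASR systems that generalize well and perform reliably across a range of real-world applications.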
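Practical applications
The People’s Speech has broad practical potential, particularly for commercial speech recognition systems. For example, it can be used to develop voice assistants, speech-to-text services, and speech translation tools. Because the dataset includes rich background noise and varied speech content, models trained on it can maintain relatively high recognition accuracy in noisy environments, which suits commercial settings such as meeting transcription, telephone customer service, and voice search. The open license also lets companies legally use the data for commercial development, lowering the barrier to entry.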
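Derived and related work
The release of The People’s Speech has prompted a range of related research. For example, a speech recognition model trained on the dataset achieved a word error rate of 9.98% on the LibriSpeech test set, showing its potential for improving speech recognition performance. The dataset's openness and diversity have also encouraged other researchers to explore how to build larger and more diverse speech datasets from internet resources. Some studies, for instance, look at extending dataset construction to non-English languages and at using weakly supervised learning to handle unlabeled speech data. This derived work has further advanced speech recognition technology.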
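For readers unfamiliar with the metric, word error rate is the word-level edit distance (substitutions, deletions, insertions) between a hypothesis and a reference transcript, divided by the number of reference words. The sketch below is a plain-Python illustration with whitespace tokenization and no text normalization; it is not the evaluation code behind the reported figure, and the function name is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Return WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Reported WER figures usually depend on normalization choices such as casing and punctuation handling, so numbers computed with a sketch like this are only comparable when those choices match.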
The content above was collected and summarized by AI.
用户留言
有没有相关的论文或文献参考?
这个数据集是基于什么背景创建的?
数据集的作者是谁?
能帮我联系到这个数据集的作者吗?
这个数据集如何下载?
点击留言
