Treble10-Speech

ModelScope community · Updated 2026-01-06 · Indexed 2025-11-03
Download link:
https://modelscope.cn/datasets/treble-technologies/Treble10-Speech
Resource description:
# **Treble10-Speech (16 kHz)**

## Dataset Description

- **Paper:** https://arxiv.org/abs/2510.23141
- **Point of contact:** contact@treble.tech

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1pSJHitYAOGIv2uzr7kGNDxqVflWvTQ2c?usp=sharing)

The **Treble10-Speech** dataset is a dataset for automatic speech recognition (ASR). It contains pre-convolved reverberant speech files generated with high-fidelity room-acoustic simulations from the [Treble10-RIR dataset](https://huggingface.co/datasets/treble-technologies/Treble10-RIR), covering 10 different furnished rooms: 2 bathrooms, 2 bedrooms, 2 living rooms with hallway, 2 living rooms without hallway, and 2 meeting rooms. The room volumes range between 14 and 46 m³, resulting in reverberation times between 0.17 and 0.84 s.

## Example: Accessing a reverberant mono speech file

```python
from datasets import load_dataset, Audio
import matplotlib.pyplot as plt
import numpy as np

ds = load_dataset(
    "treble-technologies/Treble10-Speech",
    split="speech_mono",
    streaming=True,
)
ds = ds.cast_column("audio", Audio())

# Read the samples from the TorchCodec decoder object:
rec = next(iter(ds))
samples = rec["audio"].get_all_samples()
speech_mono = samples.data  # tensor of shape (channels, samples)
sr = samples.sample_rate
print(f"Mono speech has this shape: {speech_mono.shape}, and a sampling rate of {sr} Hz.")

# Plot the mono waveform against time
t_axis = np.arange(speech_mono.shape[1]) / sr
plt.figure()
plt.plot(t_axis, speech_mono.numpy().T, label="Mono speech")
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.legend()
plt.show()
```

## Example: Accessing a reverberant speech file encoded to 8th-order Ambisonics

```python
from datasets import load_dataset, Audio
import io
import soundfile as sf

# Load dataset in streaming mode
ds = load_dataset("treble-technologies/Treble10-Speech", split="speech_hoa8", streaming=True)

# Disable automatic decoding (we'll do it manually)
ds = ds.cast_column("audio", Audio(decode=False))

# Get one sample from the iterator
sample = next(iter(ds))

# Fetch raw audio bytes
audio_bytes = sample["audio"]["bytes"]

# Some older datasets may not have "bytes", so fall back to reading from the file
if audio_bytes is None:
    # "path" is a plain string, so open it with the built-in open()
    # (this assumes the path is readable from the local file system)
    with open(sample["audio"]["path"], "rb") as f:
        audio_bytes = f.read()

# Decode the HOA audio directly from memory; soundfile returns (frames, channels)
speech_hoa, sr = sf.read(io.BytesIO(audio_bytes))
print(f"Loaded HOA audio: shape={speech_hoa.shape}, sr={sr}")
```

## Example: Accessing a reverberant speech file at the microphones of a 6-channel device

```python
from datasets import load_dataset, Audio
import matplotlib.pyplot as plt
import numpy as np

ds = load_dataset(
    "treble-technologies/Treble10-Speech",
    split="speech_6ch",
    streaming=True,
)
ds = ds.cast_column("audio", Audio())

# Read the samples from the TorchCodec decoder object:
rec = next(iter(ds))
samples = rec["audio"].get_all_samples()
speech_6ch = samples.data  # tensor of shape (channels, samples)
sr = samples.sample_rate
print(f"6 channel speech has this shape: {speech_6ch.shape}, and a sampling rate of {sr} Hz.")

# We can access and compare individual channels from the 6ch device like this
speech0 = speech_6ch[0]  # mic 0
speech4 = speech_6ch[4]  # mic 4
t_axis = np.arange(speech0.shape[0]) / sr
plt.figure()
plt.plot(t_axis, speech0.numpy(), label="Microphone 0")
plt.plot(t_axis, speech4.numpy(), label="Microphone 4")
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.legend()
plt.show()
```

## Dataset Details

The dataset contains three subsets:

- **Treble10-Speech-mono**: This subset contains reverberant mono speech files, obtained by convolving dry speech signals with mono room impulse responses (RIRs). In each room, RIRs are available between 5 sound sources and several receivers. The receivers are placed along horizontal receiver grids with 0.5 m resolution at three heights (0.5 m, 1.0 m, 1.5 m).
  The validity of all source and receiver positions is checked to ensure that none of them intersects with the room geometry or furniture.
- **Treble10-Speech-hoa8**: This subset contains reverberant speech files encoded in 8th-order Ambisonics. These files are obtained by convolving dry speech signals with 8th-order Ambisonics RIRs. The sound sources and receivers are identical to those of the Speech-mono subset.
- **Treble10-Speech-6ch**: For this subset, a 6-channel cylindrical device is placed at the receiver positions from the Speech-mono subset. RIRs are then acquired between the same 5 sound sources and each of the 6 device microphones. In other words, there is a 6-channel DeviceRIR for each source-receiver combination of the Speech-mono subset. Each channel of the DeviceRIR is then convolved with the same dry speech signal, resulting in a 6-channel reverberant speech signal. This signal resembles the recording you would obtain when placing that 6-channel device at the corresponding receiver position and recording speech played back at the source position.

All RIRs (mono/HOA/device) that were used to generate reverberant speech for this dataset were simulated with the Treble SDK. We use a hybrid simulation paradigm that combines a numerical wave-based solver (discontinuous Galerkin finite element method, DG-FEM) at low to midrange frequencies with geometrical acoustics (GA) simulations at high frequencies. For this dataset, the transition frequency between the wave-based and the GA simulation is set at 5 kHz. The resulting hybrid RIRs are broadband signals with a 32 kHz sampling rate, and therefore contain audio content up to 16 kHz.

All dry speech files that were convolved with the above RIRs to generate reverberant speech were taken from the _test_ splits of the [LibriSpeech corpus](https://www.openslr.org/12). As the dry speech files were sampled at 16 kHz, the RIRs were downsampled while generating the Treble10-Speech set. You can create your own 32 kHz speech samples by downloading the [Treble10-RIR](https://huggingface.co/datasets/treble-technologies/Treble10-RIR) dataset and convolving its RIRs with audio signals of your choice.

## Uses

Use cases such as far-field automatic speech recognition (ASR), speech enhancement, dereverberation, and source separation benefit greatly from the **Treble10-Speech** dataset. To illustrate this, consider the contrast between near-field and far-field ASR. In near-field setups, such as smartphones or headsets, the microphone is close to the speaker and captures a clean signal dominated by the direct sound. In far-field scenarios, as in smart speakers or conference-room devices, the microphone is several meters away, and the recorded signal becomes a complex blend of direct sound, reverberation, and background noise. This difference is not merely spatial but physical: in far-field conditions, sound waves reflect off walls, diffract around objects, and decay over time, all of which is captured by the RIR. To achieve robust performance in such environments, ASR and related models must be trained on datasets that accurately represent these acoustic interactions, which is precisely what **Treble10-Speech** provides. Similarly, the performance of such systems can only be determined reliably by evaluating them on data that models sound propagation in complex environments accurately enough.

## Dataset Structure

Each subset of **Treble10-Speech** corresponds to a different channel configuration of the simulated room impulse responses (RIRs). All subsets share the same metadata schema and organization.

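
To make the channel configurations concrete, the following minimal sketch convolves one dry signal with placeholder RIRs of 1, 81, and 6 channels, mirroring the per-channel convolution described under Dataset Details. The helper `convolve_multichannel`, the noise stand-ins for speech and RIRs, and the 0.25 s RIR length are illustrative assumptions, not part of the dataset tooling. The count of 81 follows from the (N + 1)² channels of an Ambisonics signal of order N = 8.

```python
import numpy as np


def convolve_multichannel(dry, rir):
    """Convolve a mono dry signal with every channel of a multichannel RIR.

    dry: array of shape (num_samples,)
    rir: array of shape (num_channels, rir_length)
    Returns an array of shape (num_channels, num_samples + rir_length - 1).
    """
    out_len = dry.shape[0] + rir.shape[1] - 1
    n_fft = 1 << (out_len - 1).bit_length()  # next power of two >= out_len
    spectrum = np.fft.rfft(dry, n_fft)[None, :] * np.fft.rfft(rir, n_fft, axis=1)
    return np.fft.irfft(spectrum, n_fft, axis=1)[:, :out_len]


rng = np.random.default_rng(0)
sr = 16_000
dry = rng.standard_normal(sr)  # 1 s of noise as a stand-in for dry speech

# Channel counts per split: mono, 8th-order Ambisonics, 6-microphone device
channel_counts = {"speech_mono": 1, "speech_hoa8": (8 + 1) ** 2, "speech_6ch": 6}

for split, n_ch in channel_counts.items():
    t = np.arange(sr // 4) / sr  # 0.25 s placeholder RIR (hypothetical length)
    rir = rng.standard_normal((n_ch, t.size)) * np.exp(-t / 0.05)
    wet = convolve_multichannel(dry, rir)
    print(split, wet.shape)
```

With real data, `rir` would be an RIR from Treble10-RIR resampled to the rate of the dry signal, and `dry` a LibriSpeech utterance; the reverberant output is longer than the dry input by the RIR length minus one sample.
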
| Split | Description | Channels |
|--------------|---------------------|----------|
| `speech_mono` | Single-channel (mono) reverberant speech | 1 |
| `speech_hoa8` | Reverberant speech encoded as 8th-order Ambisonics (ACN/SN3D format) | 81 |
| `speech_6ch` | Reverberant speech at the microphones of a six-channel home audio device | 6 |

The six-channel device has microphones positioned at the following locations relative to the center of the device:

| Channel | Position [m] |
|-------|--------|
| 0 | [0.03, 0., 0.] |
| 1 | [0.015, 0.026, 0.] |
| 2 | [-0.0145, 0.026, 0.] |
| 3 | [-0.03, 0., 0.] |
| 4 | [-0.015, -0.026, 0.] |
| 5 | [0.015, -0.026, 0.] |

### File Contents

Each `.parquet` file contains the metadata for one subset (split) of the dataset. As this set of reverberant speech signals may be used for a variety of audio machine-learning tasks, we leave the actual segmentation of the data to the users. The metadata links each reverberant speech file to its corresponding dry speech file and includes detailed acoustic parameters.

| Column | Description |
|---------|-------------|
| **audio** | The convolved speech file. |
| **audio_filename** | Filename and relative path of the RIR WAV file. |
| **room** | Short room nickname (e.g., `Room1`, `Room5`). |
| **room_description** | Descriptive room type (e.g., `meeting_room`, `living_room`). |
| **room_volume** | Volume of the room in cubic meters. |
| **source** | Label of the source. |
| **source_position** | 3D coordinates of the source in meters. |
| **receiver** | Label of the receiver. |
| **receiver_position** | 3D coordinates of the receiver in meters. |
| **direct_path_length** | Distance between source and receiver in meters. |
| **rir_format** | Format of the RIR used (`mono`, `6ch`, or `hoa8`). |
| **Frequencies, EDT, T30, C50, Average Absorption** | Octave-band acoustic parameters. |
| **librispeech_split** | Source split of the dry speech signal. |
| **librispeech_file** | File path and name of the dry signal, relative to the LibriSpeech dataset. |
| **transcript** | The transcript of the utterance. |

## Acoustic Parameters

The RIRs that were used to generate the reverberant speech signals come with a few relevant acoustic parameters that describe the sound field as sampled by the specific source/receiver pairs.

### T30: Reverberation Time

T30 is a measure of how long a sound takes to fade away in a room after the sound source stops emitting. It is a key measure of how reverberant a space is. Specifically, it is the time needed for the sound energy to drop by 60 dB, estimated from the first 30 dB of the decay. A short T30 corresponds to a "dry" sounding room, like a small office or recording booth (ideally, under 0.2 s). A long T30 corresponds to a room that sounds "wet", such as a concert hall or parking garage (1.0 s or more).

### EDT: Early Decay Time

Early Decay Time is another measure of reverberation, but is calculated from the first 10 dB of energy decay. EDT is highly correlated with the psychoacoustic perception of reverberation, and can also provide information about the uniformity of the acoustic field within a space. If EDT is approximately equal to T30, the reverberation is approximately a single-slope decay. If EDT is much shorter than T30, this indicates a double-slope energy decay, which may form when two rooms are acoustically coupled.

### C50: Clarity Index (Speech)

C50 is the energy ratio of the early-arriving sound (the first 50 milliseconds) to the late-arriving sound (from 50 milliseconds to the end of the RIR). C50 is typically used as a measure of the potential speech intelligibility and clarity of a room, as it quantifies how much the early sound is obscured by the room's reverberation. High C50 values (above 0 dB) are typically considered ideal for clear and intelligible speech.
Low C50 values (below 0 dB) typically make clear speech difficult.

## More Information

More information on the dataset can be found in the corresponding blog post.

## Licensing Information

The **Treble10-Speech** dataset combines two components with different licenses:

- Speech recordings (dry signals): sourced from the LibriSpeech corpus, licensed under the [Creative Commons Attribution 4.0 International License (CC-BY-4.0)](https://creativecommons.org/licenses/by/4.0/deed.en).
- Acoustic impulse responses (RIRs) and acoustical metadata: originating from the Treble10-RIR dataset, licensed under the [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license](https://creativecommons.org/licenses/by-nc-sa/4.0/).

The convolved (“wet”) speech recordings in this dataset are derivative works that combine both sources. As a result, they are governed by the CC-BY-4.0 license. The room impulse responses and all acoustical metadata associated with them remain governed by the CC-BY-NC-SA-4.0 license.

## Citation Information

```
@misc{mullins2025treble10highqualitydatasetfarfield,
  title={Treble10: A high-quality dataset for far-field speech recognition, dereverberation, and enhancement},
  author={Sarabeth S. Mullins and Georg G\"otz and Eric Bezzam and Steven Zheng and Daniel Gert Nielsen},
  year={2025},
  eprint={2510.23141},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2510.23141},
}
```
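
As a supplementary illustration of the T30, EDT, and C50 definitions in the Acoustic Parameters section, here is a minimal broadband sketch that estimates them from an RIR using Schroeder backward integration. It is not the procedure used to produce the octave-band values in the metadata. The function names, the synthetic RIR (random signs under an exponential envelope with a 0.5 s decay time), and the fit ranges (−5 to −35 dB for T30 and 0 to −10 dB for EDT, following the common ISO 3382 convention) are assumptions made for illustration. The sketch also assumes the RIR starts at the arrival of the direct sound.

```python
import numpy as np


def energy_decay_curve_db(rir):
    """Schroeder backward integration of the squared RIR, in dB re. total energy."""
    energy = np.cumsum(rir[::-1] ** 2)[::-1]
    return 10.0 * np.log10(energy / energy[0])


def decay_time(edc_db, sr, start_db, stop_db):
    """Fit a line to the decay curve between two levels and extrapolate to -60 dB."""
    idx = np.flatnonzero((edc_db <= start_db) & (edc_db >= stop_db))
    slope, _ = np.polyfit(idx / sr, edc_db[idx], 1)  # slope in dB per second
    return -60.0 / slope


def clarity_c50(rir, sr):
    """Ratio of energy in the first 50 ms to the energy after 50 ms, in dB."""
    split = int(round(0.050 * sr))
    early = np.sum(rir[:split] ** 2)
    late = np.sum(rir[split:] ** 2)
    return 10.0 * np.log10(early / late)


# Synthetic stand-in RIR: amplitude falls by 60 dB every 0.5 s (assumed value)
sr = 16_000
t60 = 0.5
t = np.arange(int(1.5 * sr)) / sr
rng = np.random.default_rng(0)
signs = rng.choice([-1.0, 1.0], size=t.size)
rir = signs * 10.0 ** (-3.0 * t / t60)

edc_db = energy_decay_curve_db(rir)
t30 = decay_time(edc_db, sr, start_db=-5.0, stop_db=-35.0)
edt = decay_time(edc_db, sr, start_db=0.0, stop_db=-10.0)
c50 = clarity_c50(rir, sr)
print(f"T30 = {t30:.3f} s, EDT = {edt:.3f} s, C50 = {c50:.2f} dB")
```

Because the synthetic decay has a single slope, T30 and EDT come out nearly equal here; for a real RIR from Treble10-RIR, the same functions could be applied per octave band after band-pass filtering.
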

Provider:
maas
Created:
2025-10-14