juliasdata/medical-audio-sample-brazilian-portuguese
Hugging Face. Updated 2026-03-23, indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/juliasdata/medical-audio-sample-brazilian-portuguese
Resource description:
---
pretty_name: "Julia's Data - Brazilian Portuguese Medical Audio Sample"
language:
- pt
multilinguality: monolingual
license: other
license_name: juliasdata-sample-evaluation-license-1.0
license_link: https://huggingface.co/datasets/juliasdata/medical-audio-sample-brazilian-portuguese/resolve/main/LICENSE
size_categories:
- n<1K
task_categories:
- automatic-speech-recognition
- text-to-speech
tags:
- audio
- speech
- medical
- healthcare
- brazilian-portuguese
- pt-BR
---
# Julia's Data: Brazilian Portuguese Medical Audio Sample
Public sample of a Brazilian Portuguese medical audio dataset built for ASR,
TTS, and conversational AI evaluation. This repository contains deidentified
clinical source material transformed into five spoken content types and
recorded by a human speaker.
This sample includes 1 record, 20 aligned audio segments, 1 speaker, and about
5.26 minutes of audio.
Full dataset and commercial licensing: [juliasdata.com](https://juliasdata.com)
Commercial overview: [juliasdata.com/commercial](https://juliasdata.com/commercial)
Contact: [julia@juliasdata.com](mailto:julia@juliasdata.com)
License: `juliasdata-sample-evaluation-license-1.0`
## Quick Start
**Programmatic ingestion**: start with `manifests/segments.jsonl`. It has one
row per audio segment with transcript text, ordering fields, speaker IDs, and
paths to both source and WAV audio files. Rows are keyed by `segment_id`; for
session and speaker metadata, join them to `manifests/recording_sessions.jsonl`
via `step_id` or to `manifests/speakers.jsonl` via `speaker_id`.
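The join described above can be sketched in a few lines of standard-library Python. This is a minimal sketch, not the package's own tooling: the helper names `read_jsonl` and `attach_speakers` are hypothetical, and the only field it relies on is `speaker_id`, as named in this README. Check `SCHEMA.md` before depending on any other field.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    rows = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows


def attach_speakers(segments, speakers):
    """Left-join speaker metadata onto segment rows via `speaker_id`.

    Segments without a matching speaker keep `speaker` set to None.
    """
    by_id = {speaker["speaker_id"]: speaker for speaker in speakers}
    return [
        {**segment, "speaker": by_id.get(segment.get("speaker_id"))}
        for segment in segments
    ]
```

From the package root, `attach_speakers(read_jsonl("manifests/segments.jsonl"), read_jsonl("manifests/speakers.jsonl"))` would return the segment rows with a nested `speaker` entry. The same pattern should apply to `recording_sessions.jsonl` with `step_id` as the key.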
**Audio-first preview**: `metadata.jsonl` is a flat index of the WAV files in
this sample. It is included to make simple audio loading and Hub preview
workflows easier.
**Human review**: open any folder under `records/<record_id>/`. Each record is
self-contained with a summary, source notes, rights, provenance, and per-step
transcript and audio files.
**Package metadata**: see `DELIVERY.json` for export identity, audio conversion
config, and schema versions. See `SCHEMA.md` for field-level documentation of
every JSON and JSONL artifact.
## Sample Scope
This public sample is a manually trimmed delivery package derived from a larger
internal export. It includes all five spoken content types used in Julia's
Data:
| Step Type | Source Code | What It Contains |
| --- | --- | --- |
| `source_notes_narration` | `raw_notes` | Direct narration of the source text |
| `long_form_narration` | `long_form` | Expanded narrative retelling |
| `structured_question_answer` | `qa` | Question, clean answer, and natural answer |
| `terminology_definition_pair` | `terminology` | Medical term and spoken definition |
| `multi_speaker_dialog` | `dialog` | Multi-speaker conversation |
Per-step segment counts in this sample:
- `source_notes_narration`: 2
- `long_form_narration`: 2
- `structured_question_answer`: 3
- `terminology_definition_pair`: 8
- `multi_speaker_dialog`: 5
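The counts above sum to the 20 segments stated earlier. A small sketch for re-deriving them from loaded segment rows follows. It assumes each row exposes its step type under a key named `step_type`, which is an assumption and not confirmed by this README; the function names are hypothetical.

```python
from collections import Counter

# Per-step segment counts as stated in this README.
EXPECTED_COUNTS = {
    "source_notes_narration": 2,
    "long_form_narration": 2,
    "structured_question_answer": 3,
    "terminology_definition_pair": 8,
    "multi_speaker_dialog": 5,
}


def count_by_step_type(segments, key="step_type"):
    """Count segment rows per step type; `key` is an assumed field name."""
    return Counter(row[key] for row in segments)


def count_mismatches(segments, expected=EXPECTED_COUNTS, key="step_type"):
    """Return {step_type: (expected, actual)} for every count that differs."""
    actual = count_by_step_type(segments, key)
    return {
        step: (expected.get(step, 0), actual.get(step, 0))
        for step in set(expected) | set(actual)
        if expected.get(step, 0) != actual.get(step, 0)
    }
```

An empty result from `count_mismatches` means the loaded manifest agrees with the documented counts.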
## Package Layout
```text
juliasdata-delivery-sample-2026-03-21-v2/
  DELIVERY.json                  Package identity, audio config, schema versions
  metadata.jsonl                 Flat WAV index for audio-first loading
  SCHEMA.md                      Field-level reference for every JSON/JSONL artifact
  SHA256SUMS                     Integrity checksums for all delivered files
  manifests/                     Dataset-wide JSONL indexes (one entity per line)
    records.jsonl
    steps.jsonl
    segments.jsonl               <- primary ingest artifact
    speakers.jsonl
    recording_sessions.jsonl
    provenance.jsonl
  records/                       Per-record folders for isolated review
    <record_id>/
      record.json                Record summary
      source_notes.txt           Original deidentified source text
      rights.json                PHI and consent status
      provenance/                Full LLM generation audit trail
      steps/
        <step_type>/
          transcript.jsonl       Segment text and ordering
          media.jsonl            Audio file metadata and checksums
          audio/
            source/              Original WebM recordings
            wav/                 Converted WAV derivatives
```
## Data Model
A **record** is one clinical source note and everything derived from it.
A **step** is one spoken content type produced from a record, stored under
`steps/<step_type>/`.
Each step is divided into **segments**, the atomic unit of audio and transcript
alignment. Segments are grouped and ordered using:
- `segment_index`: absolute playback order within the step
- `group_index`: logical content group, such as one QA item or one term pair
- `sequence_in_group`: position within that group
- `segment_role`: semantic label such as `paragraph`, `question`,
`clean_answer`, `natural_answer`, `term`, `definition`, or `dialog_line`
Important text fields:
- `text_verbatim`: transcript text exactly as delivered
- `text_normalized`: whitespace-collapsed and trimmed, with casing and
punctuation preserved
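The description of `text_normalized` suggests a transformation along the following lines. This is a sketch of the stated rule only, not the export's confirmed implementation; the function name is hypothetical, and details such as how non-breaking spaces are treated are assumed.

```python
import re


def normalize_text(text_verbatim):
    """Collapse whitespace runs to one space and trim the ends.

    Casing and punctuation are left untouched, matching the description
    of `text_normalized`; the exact rules used by the export are assumed.
    """
    return re.sub(r"\s+", " ", text_verbatim).strip()
```

Comparing `normalize_text(row["text_verbatim"])` with `row["text_normalized"]` is one way to check whether this reading of the rule matches the delivered data.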
Do not rely on filename sort order. Segment order is defined by
`segment_index`, `group_index`, and `sequence_in_group`.
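As an illustration of ordering by these fields, the sketch below sorts the rows of a single step for playback and then regroups them into logical units such as one QA item. It assumes the rows passed in already belong to one step; the function names are hypothetical.

```python
from itertools import groupby


def playback_order(step_segments):
    """Sort one step's segment rows by absolute playback order."""
    return sorted(step_segments, key=lambda row: row["segment_index"])


def logical_groups(step_segments):
    """Return a list of (group_index, rows) pairs for one step.

    Rows inside each group are ordered by `sequence_in_group`.
    """
    ordered = sorted(
        step_segments,
        key=lambda row: (row["group_index"], row["sequence_in_group"]),
    )
    return [
        (group_index, list(rows))
        for group_index, rows in groupby(ordered, key=lambda row: row["group_index"])
    ]
```

For a `structured_question_answer` step, each pair returned by `logical_groups` would be expected to hold the `question`, `clean_answer`, and `natural_answer` rows of one item.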
## Audio
Every segment with audio includes two files:
- `audio/source/`: original browser-recorded WebM/Opus
- `audio/wav/`: converted WAV derivative
WAV conversion target: PCM signed 16-bit little-endian, mono, 48 kHz.
Conversion details are recorded in `DELIVERY.json` under `audio.conversion`.
`transcript.jsonl` maps each segment to its audio files. `media.jsonl` provides
per-file technical metadata such as size, checksum, codec, duration, sample
rate, and conversion provenance.
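To confirm that a WAV derivative matches the stated conversion target, the header can be inspected with Python's standard `wave` module. This is a minimal sketch; the `TARGET` mapping and function names are hypothetical, and the `wave` module does not report signedness or byte order separately, so only channel count, sample width, and sample rate are checked.

```python
import wave

# Conversion target stated in this README: 16-bit PCM, mono, 48 kHz.
TARGET = {"channels": 1, "sample_width_bytes": 2, "sample_rate_hz": 48000}


def describe_wav(path):
    """Read basic format parameters and duration from a PCM WAV header."""
    with wave.open(str(path), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_seconds": frames / rate if rate else 0.0,
        }


def matches_target(info, target=TARGET):
    """True when channels, sample width, and sample rate match the target."""
    return all(info[name] == value for name, value in target.items())
```

The `duration_seconds` value can also be compared against the duration recorded in `media.jsonl` as a sanity check.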
## Deidentification And Rights
The included sample record ships with a `rights.json` file. Its metadata
indicates:
- `contains_phi: false`
- `deidentified: true`
- speaker consent was confirmed for the included speaker
- commercial voice use is allowed for the included speaker
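An ingestion pipeline may want to gate on these flags before using a record. The sketch below checks only the two keys named above, `contains_phi` and `deidentified`; the key names for speaker consent and commercial voice use are not given in this README, so they are left out instead of guessed. The function names are hypothetical.

```python
import json
from pathlib import Path


def load_rights(record_dir):
    """Load `rights.json` from one record folder."""
    return json.loads((Path(record_dir) / "rights.json").read_text(encoding="utf-8"))


def is_cleared(rights):
    """True only when the record is flagged deidentified and free of PHI.

    Missing keys are treated as not cleared.
    """
    return rights.get("contains_phi") is False and rights.get("deidentified") is True
```

Treating missing keys as a failure keeps the check conservative if a future export changes the file's shape.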
For broader access, pilot packs, or commercial licensing of the full dataset,
see [juliasdata.com/commercial](https://juliasdata.com/commercial).
## License Summary
This repository is released under the custom
`juliasdata-sample-evaluation-license-1.0`.
- Internal research and evaluation use is allowed, including by commercial
teams.
- Publishing aggregate results and benchmarks with attribution is allowed.
- Redistribution, mirroring, resale, sublicensing, or inclusion in another
public dataset is not allowed.
- Production use, commercial exploitation of the sample itself, and voice
cloning or impersonation use require separate written permission.
- Any future commercial purchase or separate dataset delivery is governed by
its own written agreement, not by this sample repository license.
See `LICENSE` for the full terms.
## Intended Use
This sample is best suited for:
- evaluating Brazilian Portuguese medical speech quality
- testing ASR and TTS pipelines on domain-specific audio
- reviewing dataset structure, manifests, and provenance fields
- validating ingestion against a realistic delivery package
## Limitations
This repository is a sample, not the full dataset.
- It contains 1 record and 1 speaker only.
- It is too small to be treated as a benchmark.
- Provenance files may reference broader generation artifacts than the trimmed
audio subset included here.
## Checksums
All files in this prepared sample folder are listed in `SHA256SUMS`. Verify
them with:
```bash
shasum -a 256 -c SHA256SUMS
```
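Where `shasum` is not available, the same check can be approximated in Python. This is a sketch under the assumption that `SHA256SUMS` uses the usual `<hex digest>  <relative path>` line format, with paths relative to the folder that holds the file; the function names are hypothetical, and a listed file that is missing on disk raises `FileNotFoundError`.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large audio files are not loaded at once."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sums(sums_file):
    """Return relative paths whose SHA-256 differs from the listed value.

    Assumes the usual `<hex digest>  <relative path>` line format, with an
    optional `*` binary-mode marker before the path.
    """
    root = Path(sums_file).parent
    failures = []
    for line in Path(sums_file).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, _, name = line.partition(" ")
        name = name.lstrip(" *")
        if sha256_of(root / name) != expected.lower():
            failures.append(name)
    return failures
```

An empty list from `verify_sums("SHA256SUMS")` indicates that every listed file matched its checksum.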
Provider: juliasdata



