laubonghaudoi/legco-speech
Hugging Face · updated 2026-02-26 · indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/laubonghaudoi/legco-speech
Resource description:
---
language:
- yue
license: cc0-1.0
task_categories:
- automatic-speech-recognition
- audio-to-audio
- audio-classification
- text-generation
tags:
- cantonese
- speech
- hong-kong
- legco
- legislative-council
size_categories:
- 1M<n<10M
configs:
- config_name: raw
  data_files:
  - split: train
    path: raw/train-*.parquet
- config_name: segmented
  data_files:
  - split: train
    path: segmented/train-*.parquet
  default: true
---
# Hong Kong Legislative Council Meeting Speech Dataset
## Dataset Description
- **License:** [CC0 1.0 Universal](https://creativecommons.org/publicdomain/zero/1.0/)
- **Language:** Cantonese
- **Audio Format:** 16kHz OPUS
- **Total Duration (Raw):** 22,195.55 hours
- **Total Duration (Segmented):** 20,471.21 hours
- **Average Meeting Duration:** 5692.79 seconds (1.58 hours)
- **Average Segment Duration:** 7.71 seconds
- **Median Meeting Duration:** 6153.00 seconds (1.71 hours)
- **Median Segment Duration:** 8.17 seconds
- **Total number of characters:** 377,722,545
- **Average characters per segment:** 39.52
- **Median characters per segment:** 40
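This is a large-scale speech dataset built from [Hong Kong Legislative Council meetings](https://www.youtube.com/legcogovhk). The original recordings total 22,196 hours; after segmentation, the speech totals 20,471 hours. The dataset has two subsets, `raw` and `segmented`, which contain the original recordings and the speech segments produced by VAD-based splitting, respectively.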
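## Dataset Creation Pipeline
1. Download all meeting recordings from the [HKSAR Legislative Council YouTube channel](https://www.youtube.com/legcogovhk) and convert them to Opus audio at a 16 kHz sampling rate.
1. Segment all speech with [fsmn-vad](https://huggingface.co/funasr/fsmn-vad) and transcribe it into written-Cantonese SRT subtitles with [Qwen3-ASR-1.7B](https://huggingface.co/Qwen/Qwen3-ASR-1.7B).
1. After transcription, fix common transcription errors in the subtitles with regular expressions.
1. Split the dataset into two subsets, `raw` and `segmented`, and upload them to Hugging Face.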
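
The card describes the cleanup step only as regex-based correction and does not publish the actual rules. The following is a minimal sketch of what such a pass could look like. The rule list `CORRECTIONS` and the helper `fix_transcript` are hypothetical names, and the patterns are illustrative placeholders rather than the rules used to build this dataset.

```python
import re

# Hypothetical correction rules as (compiled pattern, replacement) pairs.
# Illustrative placeholders only, NOT the rules used for this dataset.
CJK = r"[\u4e00-\u9fff]"
CORRECTIONS = [
    # Remove stray whitespace between two CJK characters.
    (re.compile(rf"(?<={CJK})\s+(?={CJK})"), ""),
    # Collapse a run of the same full-width punctuation mark into one.
    (re.compile(r"([,。?!、])\1+"), r"\1"),
    # Turn an ASCII comma that follows a CJK character into a full-width comma.
    (re.compile(rf"(?<={CJK}),"), ","),
]


def fix_transcript(text: str) -> str:
    """Apply every correction rule in order and trim surrounding whitespace."""
    for pattern, replacement in CORRECTIONS:
        text = pattern.sub(replacement, text)
    return text.strip()


if __name__ == "__main__":
    print(fix_transcript("多謝 主席,,我 想問,政府。。"))
```

Keeping the rules in an ordered list makes it easy to add, remove, or reorder corrections and to apply the same pass to every SRT cue.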
| Subset | `raw` | `segmented` |
|---|---|---|
| Row count | 14,036 | 9,557,109 |
| Total duration | 22,195.55 hr (79,903,980.00 s) | 20,471.21 hr (73,696,365.27 s) |
| Average duration | 1.58 hr (5692.79 s) | 7.71 s |
| Median duration | 1.71 hr (6153.00 s) | 8.17 s |
| Average subtitle characters | 26,910.98 | 39.52 |
| Median subtitle characters | 29,219.00 | 40.00 |
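
As a quick sanity check, the hour figures and averages can be re-derived from the row counts, second totals, and character total reported in this card. The sketch below only redoes that arithmetic; the variable names are illustrative, and the retained fraction and characters-per-second values are derived here rather than stated by the card.

```python
# Figures copied from this dataset card.
RAW_ROWS = 14_036
RAW_SECONDS = 79_903_980.00
SEG_ROWS = 9_557_109
SEG_SECONDS = 73_696_365.27
TOTAL_CHARS = 377_722_545

SECONDS_PER_HOUR = 3600.0

raw_hours = RAW_SECONDS / SECONDS_PER_HOUR      # total raw duration in hours
seg_hours = SEG_SECONDS / SECONDS_PER_HOUR      # total segmented duration in hours
avg_meeting_s = RAW_SECONDS / RAW_ROWS          # mean meeting length in seconds
avg_segment_s = SEG_SECONDS / SEG_ROWS          # mean segment length in seconds
avg_chars = TOTAL_CHARS / SEG_ROWS              # mean characters per segment
retained = SEG_SECONDS / RAW_SECONDS            # share of audio kept after VAD
chars_per_second = TOTAL_CHARS / SEG_SECONDS    # average transcript density

print(f"raw: {raw_hours:.2f} hr, segmented: {seg_hours:.2f} hr")
print(f"avg meeting: {avg_meeting_s:.2f} s, avg segment: {avg_segment_s:.2f} s")
print(f"avg chars/segment: {avg_chars:.2f}")
print(f"retained after VAD: {retained:.1%}, chars/s: {chars_per_second:.2f}")
```

The derived totals and averages agree with the reported values to two decimal places, and roughly 92% of the raw audio survives segmentation.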
## Data Columns
### `segmented`
| Name | Type | Description |
|---|---|---|
| `video_id` | string | Source YouTube video ID |
| `segment_id` | int | Segment ID within the subtitle SRT file |
| `audio` | Audio | Audio |
| `text` | string | Written-Cantonese transcript |
| `start_time` | float | Start time of the segment |
| `end_time` | float | End time of the segment |
| `duration` | float | Duration of the segment |
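
The `text` and `duration` columns are enough to screen segment rows before training. The following is a minimal sketch, assuming hypothetical thresholds and hypothetical helper names (`chars_per_second`, `keep_segment`); none of these come from the dataset card, and the limits should be tuned to the task.

```python
def chars_per_second(sample: dict) -> float:
    """Transcript characters per second of audio for one segment row."""
    duration = sample["duration"]
    if duration <= 0:
        return 0.0
    return len(sample["text"]) / duration


def keep_segment(
    sample: dict,
    min_duration: float = 1.0,   # hypothetical lower bound, in seconds
    max_duration: float = 30.0,  # hypothetical upper bound, in seconds
    max_cps: float = 12.0,       # hypothetical ceiling on characters per second
) -> bool:
    """Return True if the row passes the illustrative sanity thresholds."""
    if not sample["text"].strip():
        return False  # empty transcript
    if not (min_duration <= sample["duration"] <= max_duration):
        return False  # too short or too long
    return chars_per_second(sample) <= max_cps


if __name__ == "__main__":
    row = {"text": "香港嘅社會同城市結構已經跨越咗壯年嘅階段。", "duration": 8.66}
    print(keep_segment(row), round(chars_per_second(row), 2))
```

Because the predicate takes a plain dict, it can be passed to the `filter` method of a dataset loaded with the `datasets` library, including a streaming one.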
### `raw`
| Name | Type | Description |
|---|---|---|
| `id` | string | YouTube video ID |
| `audio` | string | Path to the meeting audio (e.g. `audio/2026/xxx.opus`) |
| `transcription` | string | Written-Cantonese transcript as SRT subtitles |
| `title` | string | YouTube video title |
| `description` | string | YouTube video description |
| `publish_date` | string | YouTube video publish date |
| `duration` | string | Total duration (HH:MM:SS format) |
| `duration_seconds` | int | Total duration (in seconds) |

The original SRT files are also stored under `raw/transcriptions/`.
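
Since each `raw` row carries the whole meeting transcript as one SRT string, segment-level timing can be recovered by parsing it. The following is a minimal standard-library sketch, assuming the conventional SRT layout (an index line, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing line, then the text). `parse_srt` is a hypothetical helper, and naming its output keys after the segment columns is an assumption, not a mapping stated by the card.

```python
import re

# Timing line of a conventional SRT cue, e.g. "00:00:00,420 --> 00:00:09,080".
_TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def _to_seconds(h: str, m: str, s: str, ms: str) -> float:
    """Convert hour, minute, second, millisecond fields to seconds."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0


def parse_srt(srt_text: str) -> list:
    """Parse SRT text into dicts with id, start, end, duration, and text."""
    segments = []
    normalized = srt_text.replace("\r\n", "\n").strip()
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", normalized):
        lines = [ln.strip() for ln in block.split("\n") if ln.strip()]
        if len(lines) < 2 or not lines[0].isdigit():
            continue  # skip malformed cues rather than guessing
        match = _TIMING.search(lines[1])
        if match is None:
            continue
        fields = match.groups()
        start = _to_seconds(*fields[:4])
        end = _to_seconds(*fields[4:])
        segments.append({
            "segment_id": int(lines[0]),
            "start_time": start,
            "end_time": end,
            "duration": round(end - start, 3),
            # Multi-line cues are joined without a separator (assumed choice).
            "text": "".join(lines[2:]),
        })
    return segments


if __name__ == "__main__":
    # Made-up two-cue example, not taken from the dataset.
    demo = "1\n00:00:00,420 --> 00:00:09,080\n香港\n\n2\n00:00:09,500 --> 00:00:12,000\n多謝主席\n"
    for seg in parse_srt(demo):
        print(seg)
```

Malformed cues are skipped instead of repaired, so the number of parsed segments may be lower than the number of cues in a damaged file.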
## Usage
```python
from datasets import Audio, load_dataset
from huggingface_hub import hf_hub_download

# NOTE: "<user>" is a placeholder for the repository owner's namespace.

# Sentence-level segments (default, recommended for training)
ds = load_dataset("<user>/legco-speech", "segmented", split="train", streaming=True)
sample = next(iter(ds))
print(sample["text"])
# 香港嘅社會同城市結構已經跨越咗壯年嘅階段。
print(sample["start_time"], sample["end_time"], sample["duration"])
# 0.42 9.08 8.66

# Full meeting recordings
ds_raw = load_dataset("<user>/legco-speech", "raw", split="train", streaming=True)
sample = next(iter(ds_raw))
print(sample["audio"])
# audio/2026/uoCHLDPldq4.opus
print(sample["title"])
# 房屋事務委員會特別會議 (2026/02/23)

# To load the actual audio, either cast the column:
ds_raw = ds_raw.cast_column("audio", Audio(sampling_rate=16000))

# Or download individual files:
path = hf_hub_download("<user>/legco-speech", sample["audio"], repo_type="dataset")
```
Provider:
laubonghaudoi



