laubonghaudoi/legco-speech

Name: laubonghaudoi/legco-speech
Creator: laubonghaudoi
Published: 2026-02-26 07:09:41
License: CC0 1.0 Universal (cc0-1.0)

Source: Hugging Face. Updated 2026-02-26. Indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/laubonghaudoi/legco-speech


Description:

---
language:
- yue
license: cc0-1.0
task_categories:
- automatic-speech-recognition
- audio-to-audio
- audio-classification
- text-generation
tags:
- cantonese
- speech
- hong-kong
- legco
- legislative-council
size_categories:
- 1M<n<10M
configs:
- config_name: raw
  data_files:
  - split: train
    path: raw/train-*.parquet
- config_name: segmented
  data_files:
  - split: train
    path: segmented/train-*.parquet
  default: true
---

# Hong Kong Legislative Council Meeting Speech Dataset (香港立法會會議語音數據集)

## Dataset Description

- **License:** [CC0 1.0 Universal](https://creativecommons.org/publicdomain/zero/1.0/)
- **Language:** Cantonese
- **Audio Format:** 16kHz OPUS
- **Total Duration (Raw):** 22,195.55 hours
- **Total Duration (Segmented):** 20,471.21 hours
- **Average Meeting Duration:** 5692.79 seconds (1.58 hours)
- **Average Segment Duration:** 7.71 seconds
- **Median Meeting Duration:** 6153.00 seconds (1.71 hours)
- **Median Segment Duration:** 8.17 seconds
- **Total number of characters:** 377,722,545
- **Average characters per segment:** 39.52
- **Median characters per segment:** 40

This is a large-scale speech dataset built from [Hong Kong Legislative Council meetings](https://www.youtube.com/legcogovhk). The original recordings total 22,196 hours; after segmentation, the speech totals 20,471 hours. The dataset has two subsets, `raw` and `segmented`, which hold the original recordings and the VAD-segmented speech respectively.

## Dataset Creation Pipeline

1. Download all meeting recordings from the [Legislative Council of the Hong Kong Special Administrative Region YouTube channel](https://www.youtube.com/legcogovhk) and convert them to OPUS audio at a 16kHz sampling rate.
2. Segment all speech with [fsmn-vad](https://huggingface.co/funasr/fsmn-vad) and transcribe it into written-Cantonese SRT subtitles with [Qwen3-ASR-1.7B](https://huggingface.co/Qwen/Qwen3-ASR-1.7B).
3. After transcription, fix common transcription errors in the subtitles with regular expressions.
4. Split the dataset into the `raw` and `segmented` subsets and upload them to Hugging Face.

| Subset | `raw` | `segmented` |
|---|---|---|
| Row count | 14,036 | 9,557,109 |
| Total duration | 22,195.55 hr (79,903,980.00 s) | 20,471.21 hr (73,696,365.27 s) |
| Average duration | 1.58 hr (5692.79 s) | 7.71 s |
| Median duration | 1.71 hr (6153.00 s) | 8.17 s |
| Average subtitle characters | 26,910.98 | 39.52 |
| Median subtitle characters | 29,219.00 | 40.00 |

## Data Columns

### `segmented`

| Name | Type | Description |
|---|---|---|
| `video_id` | string | Source YouTube video ID |
| `segment_id` | int | Segment ID within the SRT subtitle file |
| `audio` | Audio | Audio |
| `text` | string | Written-Cantonese transcription |
| `start_time` | float | Start time of the segment |
| `end_time` | float | End time of the segment |
| `duration` | float | Duration of the segment |

### `raw`

| Name | Type | Description |
|---|---|---|
| `id` | string | YouTube video ID |
| `audio` | string | Path to the meeting audio (e.g. `audio/2026/xxx.opus`) |
| `transcription` | string | Written-Cantonese transcription as SRT subtitles |
| `title` | string | YouTube video title |
| `description` | string | YouTube video description |
| `publish_date` | string | YouTube video publish date |
| `duration` | string | Total duration (HH:MM:SS format) |
| `duration_seconds` | int | Total duration (in seconds) |

The original SRT files are all stored under `raw/transcriptions/`.

## Usage

```python
from datasets import Audio, load_dataset
from huggingface_hub import hf_hub_download

# Replace <user> with the dataset owner's Hugging Face username.

# Sentence-level segments (default, recommended for training)
ds = load_dataset("<user>/legco-speech", "segmented", split="train", streaming=True)
sample = next(iter(ds))
print(sample["text"])  # 香港嘅社會同城市結構已經跨越咗壯年嘅階段。
print(sample["start_time"], sample["end_time"], sample["duration"])  # 0.42 9.08 8.66

# Full meeting recordings
ds_raw = load_dataset("<user>/legco-speech", "raw", split="train", streaming=True)
sample = next(iter(ds_raw))
print(sample["audio"])  # audio/2026/uoCHLDPldq4.opus
print(sample["title"])  # 房屋事務委員會特別會議 (2026/02/23)

# To load the actual audio, either cast the column:
ds_raw = ds_raw.cast_column("audio", Audio(sampling_rate=16000))

# Or download individual files:
path = hf_hub_download("<user>/legco-speech", sample["audio"], repo_type="dataset")
```
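
The `transcription` column of the `raw` subset holds a whole meeting's subtitles as a single SRT string, so per-segment timing has to be parsed out before it can be compared with the `start_time`, `end_time`, and `duration` columns of the segmented subset. The following is a minimal sketch of such a parser, assuming the transcriptions follow the standard SubRip layout (index line, `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing line, then text, with blank lines between cues). The names `Cue`, `parse_srt`, and `sample_srt` are hypothetical, and the sample string is illustrative rather than copied from the dataset.

```python
import re
from typing import List, NamedTuple


class Cue(NamedTuple):
    """One subtitle cue; times are in seconds. (Hypothetical helper type.)"""
    index: int
    start: float
    end: float
    text: str


# Matches "HH:MM:SS,mmm --> HH:MM:SS,mmm" (a "." separator is also tolerated).
_TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def _to_seconds(h: str, m: str, s: str, ms: str) -> float:
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_srt(srt_text: str) -> List[Cue]:
    """Parse SRT text into cues, skipping any block that is malformed."""
    cues = []
    normalized = srt_text.replace("\r\n", "\n").strip()
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.split("\n")
        if len(lines) < 3 or not lines[0].strip().isdigit():
            continue
        match = _TIMING.search(lines[1])
        if match is None:
            continue
        g = match.groups()
        text = "\n".join(lines[2:]).strip()
        cues.append(Cue(int(lines[0]), _to_seconds(*g[:4]), _to_seconds(*g[4:]), text))
    return cues


# Illustrative input only; real rows would come from sample["transcription"].
sample_srt = """1
00:00:00,420 --> 00:00:09,080
first cue text

2
00:00:09,500 --> 00:00:12,000
second cue text
"""

cues = parse_srt(sample_srt)
for cue in cues:
    print(cue.index, f"{cue.start:.2f}", f"{cue.end:.2f}", f"{cue.end - cue.start:.2f}", cue.text)
```

Malformed blocks are skipped rather than raising, which seems the safer default for machine-generated subtitles at this scale; a stricter variant could raise instead to surface corrupted files.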
