TalTechNLP/riigikogu-audio-stenograms-2018-2025
Source: Hugging Face. Updated 2026-03-24; indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/TalTechNLP/riigikogu-audio-stenograms-2018-2025
Resource description:
---
license: cc-by-sa-3.0
task_categories:
- automatic-speech-recognition
language:
- et
---
## Riigikogu Stenograms 2018-2025
This dataset contains stenograms (approximate transcripts) of the sessions of the Estonian Parliament _Riigikogu_, spanning a time period from the very end of 2017 to May 2025, together with the corresponding audio.
The transcripts are not verbatim (word-for-word); they have been edited for readability and grammatical correctness. Sentence start and end times (relative to the corresponding audio file) are provided.
The transcripts are converted into Transcriber XML (.trs) files. Speaker names and topic names (e.g. "Tulumaksuseaduse muutmise seaduse eelnõu esimene lugemine") are also given.
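As a rough illustration of how such files might be read, the sketch below parses a small, made-up Transcriber document with Python's standard library. It assumes the usual Transcriber layout (`Speakers`, `Topics`, `Section`, `Turn`, and `Sync` elements, with each sentence stored as the text following a `Sync` marker); the exact layout of the files in this dataset has not been verified here, and the sample content, the speaker name, and the `read_trs` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature .trs document; real files may carry more attributes.
SAMPLE_TRS = """<Trans audio_filename="example">
<Topics><Topic id="to1" desc="Tulumaksuseaduse muutmise seaduse eelnõu esimene lugemine"/></Topics>
<Speakers><Speaker id="spk1" name="Hypothetical Speaker"/></Speakers>
<Episode>
<Section type="report" topic="to1" startTime="0.0" endTime="9.5">
<Turn speaker="spk1" startTime="0.0" endTime="9.5">
<Sync time="0.0"/>
Esimene lause.
<Sync time="4.2"/>
Teine lause.
</Turn>
</Section>
</Episode>
</Trans>"""


def read_trs(xml_text):
    """Return one dict per non-empty segment: speaker, topic, start, end, text."""
    root = ET.fromstring(xml_text)
    speakers = {s.get("id"): s.get("name") for s in root.iter("Speaker")}
    topics = {t.get("id"): t.get("desc") for t in root.iter("Topic")}
    rows = []
    for section in root.iter("Section"):
        topic = topics.get(section.get("topic"))
        for turn in section.iter("Turn"):
            speaker = speakers.get(turn.get("speaker"))
            syncs = turn.findall("Sync")
            for i, sync in enumerate(syncs):
                # The segment text is the "tail" that follows the empty Sync tag.
                text = (sync.tail or "").strip()
                if not text:
                    continue  # skip empty segments (e.g. silence between sentences)
                start = float(sync.get("time"))
                # Assumption: a segment ends at the next Sync, or at the turn end.
                if i + 1 < len(syncs):
                    end = float(syncs[i + 1].get("time"))
                else:
                    end = float(turn.get("endTime"))
                rows.append({"speaker": speaker, "topic": topic,
                             "start": start, "end": end, "text": text})
    return rows


for row in read_trs(SAMPLE_TRS):
    print(row["start"], row["end"], row["speaker"], row["text"])
```

To read a file from disk, the same function could be given the file's contents as a string. Files containing other inline elements between `Sync` markers would need extra handling.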
## Data
The stenograms originate from the Open API provided by Riigikogu (https://www.riigikogu.ee/avaandmed/). The audio was originally hosted on YouTube by Riigikogu.
The original stenograms contain only rough timestamps that mark the start of each speech. The data provided here, however, includes start and end times for every sentence.
Audio was aligned to the transcripts with a chunked forced-alignment pipeline. The transcript JSON files were first split into ordered sentences and grouped into larger time-based chunks, and the corresponding audio regions were extracted with a small amount of surrounding context. Each audio chunk was then processed with a multilingual Wav2Vec2 CTC speech model, while the transcript text was normalized and converted into a token sequence suitable for alignment. Forced alignment then matched the transcript tokens to the model's frame-level predictions, yielding approximate start and end times for each sentence. Based on our broad evaluation, the timestamps are accurate to within one second for more than 99% of the sentences.
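The core of CTC forced alignment can be sketched in a few lines of plain Python. This is a generic Viterbi search over the CTC lattice, not the code used to build this dataset: the function name `ctc_forced_align`, the toy three-symbol vocabulary, and the hand-made frame scores are all illustrative assumptions. In a real pipeline the per-frame log-probabilities would come from the Wav2Vec2 CTC model.

```python
import math


def ctc_forced_align(log_probs, tokens, blank=0):
    """Viterbi-align `tokens` to frames; return (first_frame, last_frame) per token.

    log_probs: list of T frames, each a list of per-symbol log-probabilities.
    """
    # Interleave blanks: [blank, t1, blank, t2, ..., tN, blank]
    ext = [blank]
    for tok in tokens:
        ext += [tok, blank]
    T, S = len(log_probs), len(ext)
    neg_inf = -math.inf
    score = [[neg_inf] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    # A path may start in the leading blank or in the first token.
    score[0][0] = log_probs[0][ext[0]]
    if S > 1:
        score[0][1] = log_probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            cands = [(score[t - 1][s], s)]                  # stay in same state
            if s >= 1:
                cands.append((score[t - 1][s - 1], s - 1))  # advance one state
            # Skipping a blank is allowed only between two different tokens.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append((score[t - 1][s - 2], s - 2))
            best, prev = max(cands)
            score[t][s] = best + log_probs[t][ext[s]]
            back[t][s] = prev
    # A path may end in the trailing blank or in the last token.
    s = S - 1
    if S > 1 and score[T - 1][S - 2] > score[T - 1][S - 1]:
        s = S - 2
    if score[T - 1][s] == neg_inf:
        raise ValueError("not enough frames to align all tokens")
    path = [s]
    for t in range(T - 1, 0, -1):
        s = back[t][s]
        path.append(s)
    path.reverse()
    # Odd lattice states are tokens; record first and last frame of each.
    spans = [None] * len(tokens)
    for t, state in enumerate(path):
        if state % 2 == 1:
            i = state // 2
            first = spans[i][0] if spans[i] else t
            spans[i] = (first, t)
    return spans


def frame(best, vocab_size=3):
    """Toy frame that puts probability 0.9 on symbol `best`."""
    return [math.log(0.9 if k == best else 0.05) for k in range(vocab_size)]


# Toy vocabulary: 0 = blank, 1 = "a", 2 = "b"
frames = [frame(k) for k in [0, 1, 1, 0, 2, 0]]
spans = ctc_forced_align(frames, tokens=[1, 2])
print(spans)  # [(1, 2), (4, 4)]
```

Frame indices convert to seconds by multiplying by the model's frame duration (about 20 ms for Wav2Vec2 models at 16 kHz). A sentence's start and end times would then be taken from the first frame of its first token and the last frame of its last token, offset by the chunk's position in the full recording.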
There are 1001 session recordings, with a total duration of ~3084 hours.
Provider:
TalTechNLP



