michaellin/morse-translate-transcribe-100k
Hugging Face | Updated 2026-04-29 | Indexed 2026-05-03
Download link:
https://hf-mirror.com/datasets/michaellin/morse-translate-transcribe-100k
Description:
This synthetic text dataset contains 100k samples, each a triple `(english, morse, category)`, generated largely with Claude Sonnet 4.6. The Morse text was generated first and simulates plausible transmissions over radiotelegraphy, particularly amateur radio. The English interpretation (a.k.a. "translation") of each sample was then generated from the Morse ground truth. Samples fall into four categories: Ragchew (70%, casual conversation between operators), Contest (10%, short formulaic contest exchanges), Radiogram (10%, ARRL-style formal messages), and Prose (10%, plain English text in uppercase). The dataset was originally created so that the Morse text could be converted to audio for training whisper-small-morse. When splitting the dataset, stratifying by category is recommended.
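A stratified split by category, as the description recommends, could look roughly like the sketch below. It uses only the Python standard library and a toy list of rows that mirrors the stated 70/10/10/10 category mix. The field names follow the `(english, morse, category)` triple, but the exact column names and category label spellings in the published files are assumptions, and `stratified_split` is a hypothetical helper, not part of the dataset.

```python
import random
from collections import defaultdict


def stratified_split(rows, key="category", test_fraction=0.1, seed=0):
    """Split rows into (train, test), sampling each category separately
    so both parts keep roughly the same category proportions."""
    by_category = defaultdict(list)
    for row in rows:
        by_category[row[key]].append(row)

    rng = random.Random(seed)
    train, test = [], []
    for category in sorted(by_category):  # sorted for reproducibility
        group = by_category[category]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])

    # Shuffle again so categories are not grouped together in the output.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Toy stand-in for the real data: 100 placeholder rows in the stated mix.
# Label spellings are assumed; every row reuses the same placeholder text.
category_counts = {"Ragchew": 70, "Contest": 10, "Radiogram": 10, "Prose": 10}
rows = [
    {"english": "CQ", "morse": "-.-. --.-", "category": category}
    for category, count in category_counts.items()
    for _ in range(count)
]

train, test = stratified_split(rows, test_fraction=0.1)
```

Splitting per category rather than over the whole dataset keeps the three 10% categories from being under-represented in a small test set by chance.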
Provider:
michaellin



