anakib1/synth-rag

Source: Hugging Face. Updated 2024-05-31; indexed 2024-06-11.
Download link:
https://hf-mirror.com/datasets/anakib1/synth-rag
Resource description:
---
dataset_info:
- config_name: Large-gera
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  - name: summary
    dtype: string
  - name: noise
    dtype: string
  splits:
  - name: train
    num_bytes: 213252706.0
    num_examples: 200
  download_size: 209824384
  dataset_size: 213252706.0
- config_name: Large-gera2
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  - name: summary
    dtype: string
  - name: noise
    dtype: string
  splits:
  - name: train
    num_bytes: 106430452.0
    num_examples: 100
  download_size: 104457471
  dataset_size: 106430452.0
- config_name: Large-gera3
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  - name: summary
    dtype: string
  - name: noise
    dtype: string
  splits:
  - name: train
    num_bytes: 186858486.0
    num_examples: 200
  download_size: 182341561
  dataset_size: 186858486.0
- config_name: Large-gera4
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  - name: summary
    dtype: string
  - name: noise
    dtype: string
  splits:
  - name: train
    num_bytes: 221199483.0
    num_examples: 200
  download_size: 217891357
  dataset_size: 221199483.0
- config_name: Large-gera5
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  - name: summary
    dtype: string
  - name: noise
    dtype: string
  splits:
  - name: train
    num_bytes: 113959164.0
    num_examples: 100
  download_size: 112464725
  dataset_size: 113959164.0
- config_name: Large-gera6
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  - name: summary
    dtype: string
  - name: noise
    dtype: string
  splits:
  - name: train
    num_bytes: 112992422.0
    num_examples: 100
  download_size: 110622823
  dataset_size: 112992422.0
- config_name: MWP-ru
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  - name: summary
    dtype: string
  - name: noise
    dtype: string
  splits:
  - name: train
    num_bytes: 34334807.0
    num_examples: 20
  download_size: 34328238
  dataset_size: 34334807.0
- config_name: MWP-ru-mistral
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  - name: summary
    dtype: string
  - name: noise
    dtype: string
  splits:
  - name: train
    num_bytes: 27609775.0
    num_examples: 14
  download_size: 27573701
  dataset_size: 27609775.0
- config_name: MWP-ru-mistral-vosk
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  - name: summary
    dtype: string
  - name: noise
    dtype: string
  splits:
  - name: train
    num_bytes: 51682924.0
    num_examples: 20
  download_size: 51519771
  dataset_size: 51682924.0
- config_name: MWP-ukr
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  - name: summary
    dtype: string
  - name: noise
    dtype: string
  splits:
  - name: train
    num_bytes: 38452053.0
    num_examples: 20
  download_size: 38448310
  dataset_size: 38452053.0
- config_name: base
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  - name: summary
    dtype: string
  - name: noise
    dtype: string
  splits:
  - name: train
    num_bytes: 109817719.0
    num_examples: 100
  download_size: 108440974
  dataset_size: 109817719.0
- config_name: base-summary
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  - name: summary
    dtype: string
  - name: noise
    dtype: string
  splits:
  - name: train
    num_bytes: 109867587.0
    num_examples: 100
  download_size: 108466548
  dataset_size: 109867587.0
- config_name: concept
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  splits:
  - name: train
    num_bytes: 7900550.0
    num_examples: 5
  download_size: 6952224
  dataset_size: 7900550.0
- config_name: dummy
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  splits:
  - name: train
    num_bytes: 39684356.0
    num_examples: 20
  download_size: 38522196
  dataset_size: 39684356.0
- config_name: large
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  - name: summary
    dtype: string
  - name: noise
    dtype: string
  splits:
  - name: train
    num_bytes: 1676337265.0
    num_examples: 1620
  download_size: 1647107918
  dataset_size: 1676337265.0
- config_name: large-sb1
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  - name: summary
    dtype: string
  - name: noise
    dtype: string
  splits:
  - name: train
    num_bytes: 213069301.0
    num_examples: 200
  download_size: 209633218
  dataset_size: 213069301.0
- config_name: large-zahar-1
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  - name: summary
    dtype: string
  - name: noise
    dtype: string
  splits:
  - name: train
    num_bytes: 18514760.0
    num_examples: 20
  download_size: 18234499
  dataset_size: 18514760.0
- config_name: large-zahar-3
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  - name: summary
    dtype: string
  - name: noise
    dtype: string
  splits:
  - name: train
    num_bytes: 200360483.0
    num_examples: 200
  download_size: 197003536
  dataset_size: 200360483.0
- config_name: large-zahar-4
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  - name: summary
    dtype: string
  - name: noise
    dtype: string
  splits:
  - name: train
    num_bytes: 209593222.0
    num_examples: 200
  download_size: 205262821
  dataset_size: 209593222.0
- config_name: large-zahar-test-200
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  - name: summary
    dtype: string
  - name: noise
    dtype: string
  splits:
  - name: train
    num_bytes: 189662264.0
    num_examples: 200
  download_size: 186092807
  dataset_size: 189662264.0
- config_name: sb-val-time
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  - name: summary
    dtype: string
  - name: noise
    dtype: string
  splits:
  - name: train
    num_bytes: 48565301.0
    num_examples: 29
  download_size: 47467688
  dataset_size: 48565301.0
- config_name: working-example
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  splits:
  - name: train
    num_bytes: 104460945.0
    num_examples: 51
  download_size: 91278093
  dataset_size: 104460945.0
- config_name: zahar-val-test-0
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  - name: summary
    dtype: string
  - name: noise
    dtype: string
  splits:
  - name: train
    num_bytes: 1192326.0
    num_examples: 1
  download_size: 1204199
  dataset_size: 1192326.0
- config_name: zahar-val-time
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  - name: summary
    dtype: string
  - name: noise
    dtype: string
  splits:
  - name: train
    num_bytes: 15378824.0
    num_examples: 10
  download_size: 14824313
  dataset_size: 15378824.0
configs:
- config_name: Large-gera
  data_files:
  - split: train
    path: Large-gera/train-*
- config_name: Large-gera2
  data_files:
  - split: train
    path: Large-gera2/train-*
- config_name: Large-gera3
  data_files:
  - split: train
    path: Large-gera3/train-*
- config_name: Large-gera4
  data_files:
  - split: train
    path: Large-gera4/train-*
- config_name: Large-gera5
  data_files:
  - split: train
    path: Large-gera5/train-*
- config_name: Large-gera6
  data_files:
  - split: train
    path: Large-gera6/train-*
- config_name: MWP-ru
  data_files:
  - split: train
    path: MWP-ru/train-*
- config_name: MWP-ru-mistral
  data_files:
  - split: train
    path: MWP-ru-mistral/train-*
- config_name: MWP-ru-mistral-vosk
  data_files:
  - split: train
    path: MWP-ru-mistral-vosk/train-*
- config_name: MWP-ukr
  data_files:
  - split: train
    path: MWP-ukr/train-*
- config_name: base
  data_files:
  - split: train
    path: base/train-*
- config_name: base-summary
  data_files:
  - split: train
    path: base-summary/train-*
- config_name: concept
  data_files:
  - split: train
    path: concept/train-*
- config_name: dummy
  data_files:
  - split: train
    path: dummy/train-*
- config_name: large
  data_files:
  - split: train
    path: large/train-*
- config_name: large-sb1
  data_files:
  - split: train
    path: large-sb1/train-*
- config_name: large-zahar-1
  data_files:
  - split: train
    path: large-zahar-1/train-*
- config_name: large-zahar-3
  data_files:
  - split: train
    path: large-zahar-3/train-*
- config_name: large-zahar-4
  data_files:
  - split: train
    path: large-zahar-4/train-*
- config_name: large-zahar-test-200
  data_files:
  - split: train
    path: large-zahar-test-200/train-*
- config_name: sb-val-time
  data_files:
  - split: train
    path: sb-val-time/train-*
- config_name: working-example
  data_files:
  - split: train
    path: working-example/train-*
- config_name: zahar-val-test-0
  data_files:
  - split: train
    path: zahar-val-test-0/train-*
- config_name: zahar-val-time
  data_files:
  - split: train
    path: zahar-val-time/train-*
---

### Dataset information

This dataset is split into multiple configurations. All sizes are in bytes. Details for each configuration follow.

1. **Config name: Large-gera**
   Features:
   - audio: audio data
   - theme: string
   - transcription: string
   - summary: string
   - noise: string

   Splits: train only, 213252706.0 bytes, 200 examples. Download size: 209824384; total dataset size: 213252706.0.
2. **Config name: Large-gera2** Same features as above. Train split: 106430452.0 bytes, 100 examples. Download size: 104457471; total dataset size: 106430452.0.
3. **Config name: Large-gera3** Same features as above. Train split: 186858486.0 bytes, 200 examples. Download size: 182341561; total dataset size: 186858486.0.
4. **Config name: Large-gera4** Same features as above. Train split: 221199483.0 bytes, 200 examples. Download size: 217891357; total dataset size: 221199483.0.
5. **Config name: Large-gera5** Same features as above. Train split: 113959164.0 bytes, 100 examples. Download size: 112464725; total dataset size: 113959164.0.
6. **Config name: Large-gera6** Same features as above. Train split: 112992422.0 bytes, 100 examples. Download size: 110622823; total dataset size: 112992422.0.
7. **Config name: MWP-ru** Same features as above. Train split: 34334807.0 bytes, 20 examples. Download size: 34328238; total dataset size: 34334807.0.
8. **Config name: MWP-ru-mistral** Same features as above. Train split: 27609775.0 bytes, 14 examples. Download size: 27573701; total dataset size: 27609775.0.
9. **Config name: MWP-ru-mistral-vosk** Same features as above. Train split: 51682924.0 bytes, 20 examples. Download size: 51519771; total dataset size: 51682924.0.
10. **Config name: MWP-ukr** Same features as above. Train split: 38452053.0 bytes, 20 examples. Download size: 38448310; total dataset size: 38452053.0.
11. **Config name: base** Same features as above. Train split: 109817719.0 bytes, 100 examples. Download size: 108440974; total dataset size: 109817719.0.
12. **Config name: base-summary** Same features as above. Train split: 109867587.0 bytes, 100 examples. Download size: 108466548; total dataset size: 109867587.0.
13. **Config name: concept**
    Features (no summary or noise fields):
    - audio: audio data
    - theme: string
    - transcription: string

    Splits: train only, 7900550.0 bytes, 5 examples. Download size: 6952224; total dataset size: 7900550.0.
14. **Config name: dummy**
    Features (no summary or noise fields):
    - audio: audio data
    - theme: string
    - transcription: string

    Splits: train only, 39684356.0 bytes, 20 examples. Download size: 38522196; total dataset size: 39684356.0.
15. **Config name: large** Same features as Large-gera. Train split: 1676337265.0 bytes, 1620 examples. Download size: 1647107918; total dataset size: 1676337265.0.
16. **Config name: large-sb1** Same features as above. Train split: 213069301.0 bytes, 200 examples. Download size: 209633218; total dataset size: 213069301.0.
17. **Config name: large-zahar-1** Same features as above. Train split: 18514760.0 bytes, 20 examples. Download size: 18234499; total dataset size: 18514760.0.
18. **Config name: large-zahar-3** Same features as above. Train split: 200360483.0 bytes, 200 examples. Download size: 197003536; total dataset size: 200360483.0.
19. **Config name: large-zahar-4** Same features as above. Train split: 209593222.0 bytes, 200 examples. Download size: 205262821; total dataset size: 209593222.0.
20. **Config name: large-zahar-test-200** Same features as above. Train split: 189662264.0 bytes, 200 examples. Download size: 186092807; total dataset size: 189662264.0.
21. **Config name: sb-val-time** Same features as above. Train split: 48565301.0 bytes, 29 examples. Download size: 47467688; total dataset size: 48565301.0.
22. **Config name: working-example**
    Features (no summary or noise fields):
    - audio: audio data
    - theme: string
    - transcription: string

    Splits: train only, 104460945.0 bytes, 51 examples. Download size: 91278093; total dataset size: 104460945.0.
23. **Config name: zahar-val-test-0** Same features as Large-gera (including summary and noise). Train split: 1192326.0 bytes, 1 example. Download size: 1204199; total dataset size: 1192326.0.
24. **Config name: zahar-val-time** Same features as above. Train split: 15378824.0 bytes, 10 examples. Download size: 14824313; total dataset size: 15378824.0.

### Config data file paths

Every configuration contains only a train split, and its data files follow the path pattern `[config name]/train-*`:

- Large-gera: Large-gera/train-*
- Large-gera2: Large-gera2/train-*
- Large-gera3: Large-gera3/train-*
- Large-gera4: Large-gera4/train-*
- Large-gera5: Large-gera5/train-*
- Large-gera6: Large-gera6/train-*
- MWP-ru: MWP-ru/train-*
- MWP-ru-mistral: MWP-ru-mistral/train-*
- MWP-ru-mistral-vosk: MWP-ru-mistral-vosk/train-*
- MWP-ukr: MWP-ukr/train-*
- base: base/train-*
- base-summary: base-summary/train-*
- concept: concept/train-*
- dummy: dummy/train-*
- large: large/train-*
- large-sb1: large-sb1/train-*
- large-zahar-1: large-zahar-1/train-*
- large-zahar-3: large-zahar-3/train-*
- large-zahar-4: large-zahar-4/train-*
- large-zahar-test-200: large-zahar-test-200/train-*
- sb-val-time: sb-val-time/train-*
- working-example: working-example/train-*
- zahar-val-test-0: zahar-val-test-0/train-*
- zahar-val-time: zahar-val-time/train-*
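Since each configuration exposes a single train split under `<config>/train-*`, loading one could look roughly like the sketch below. It assumes the Hugging Face `datasets` package is installed and the Hub is reachable. `train_glob` and `load_config` are hypothetical helper names introduced here for illustration; they are not part of the dataset or of any library.

```python
from typing import Any

REPO_ID = "anakib1/synth-rag"  # repository named in this card


def train_glob(config_name: str) -> str:
    """Return the data-file glob the card lists for a config, e.g. 'base/train-*'."""
    return f"{config_name}/train-*"


def load_config(config_name: str) -> Any:
    """Load the single 'train' split of one configuration.

    Assumption: the `datasets` package is installed and the Hub is reachable.
    The import is lazy so that `train_glob` stays usable without it.
    """
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name, split="train")


if __name__ == "__main__":
    print(train_glob("MWP-ukr"))  # prints MWP-ukr/train-*
```

Calling `load_config("concept")` would fetch the smallest listed configuration (5 examples), which makes it a reasonable first smoke test before pulling the roughly 1.6 GB `large` configuration.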
Provider: anakib1

Summary of original information

Dataset overview

Dataset configurations and features

| Config name | Feature name | Data type |
| --- | --- | --- |
| Large-gera | audio | audio |
| Large-gera | theme | string |
| Large-gera | transcription | string |
| Large-gera | summary | string |
| Large-gera | noise | string |
| Large-gera2 | audio | audio |
| Large-gera2 | theme | string |
| Large-gera2 | transcription | string |
| Large-gera2 | summary | string |
| Large-gera2 | noise | string |
| Large-gera3 | audio | audio |
| Large-gera3 | theme | string |
| Large-gera3 | transcription | string |
| Large-gera3 | summary | string |
| Large-gera3 | noise | string |
| Large-gera4 | audio | audio |
| Large-gera4 | theme | string |
| Large-gera4 | transcription | string |
| Large-gera4 | summary | string |
| Large-gera4 | noise | string |
| Large-gera5 | audio | audio |
| Large-gera5 | theme | string |
| Large-gera5 | transcription | string |
| Large-gera5 | summary | string |
| Large-gera5 | noise | string |
| Large-gera6 | audio | audio |
| Large-gera6 | theme | string |
| Large-gera6 | transcription | string |
| Large-gera6 | summary | string |
| Large-gera6 | noise | string |
| MWP-ru | audio | audio |
| MWP-ru | theme | string |
| MWP-ru | transcription | string |
| MWP-ru | summary | string |
| MWP-ru | noise | string |
| MWP-ru-mistral | audio | audio |
| MWP-ru-mistral | theme | string |
| MWP-ru-mistral | transcription | string |
| MWP-ru-mistral | summary | string |
| MWP-ru-mistral | noise | string |
| MWP-ru-mistral-vosk | audio | audio |
| MWP-ru-mistral-vosk | theme | string |
| MWP-ru-mistral-vosk | transcription | string |
| MWP-ru-mistral-vosk | summary | string |
| MWP-ru-mistral-vosk | noise | string |
| MWP-ukr | audio | audio |
| MWP-ukr | theme | string |
| MWP-ukr | transcription | string |
| MWP-ukr | summary | string |
| MWP-ukr | noise | string |
| base | audio | audio |
| base | theme | string |
| base | transcription | string |
| base | summary | string |
| base | noise | string |
| base-summary | audio | audio |
| base-summary | theme | string |
| base-summary | transcription | string |
| base-summary | summary | string |
| base-summary | noise | string |
| concept | audio | audio |
| concept | theme | string |
| concept | transcription | string |
| dummy | audio | audio |
| dummy | theme | string |
| dummy | transcription | string |
| large | audio | audio |
| large | theme | string |
| large | transcription | string |
| large | summary | string |
| large | noise | string |
| large-sb1 | audio | audio |
| large-sb1 | theme | string |
| large-sb1 | transcription | string |
| large-sb1 | summary | string |
| large-sb1 | noise | string |
| large-zahar-1 | audio | audio |
| large-zahar-1 | theme | string |
| large-zahar-1 | transcription | string |
| large-zahar-1 | summary | string |
| large-zahar-1 | noise | string |
| large-zahar-3 | audio | audio |
| large-zahar-3 | theme | string |
| large-zahar-3 | transcription | string |
| large-zahar-3 | summary | string |
| large-zahar-3 | noise | string |
| large-zahar-4 | audio | audio |
| large-zahar-4 | theme | string |
| large-zahar-4 | transcription | string |
| large-zahar-4 | summary | string |
| large-zahar-4 | noise | string |
| large-zahar-test-200 | audio | audio |
| large-zahar-test-200 | theme | string |
| large-zahar-test-200 | transcription | string |
| large-zahar-test-200 | summary | string |
| large-zahar-test-200 | noise | string |
| sb-val-time | audio | audio |
| sb-val-time | theme | string |
| sb-val-time | transcription | string |
| sb-val-time | summary | string |
| sb-val-time | noise | string |
| working-example | audio | audio |
| working-example | theme | string |
| working-example | transcription | string |
| zahar-val-test-0 | audio | audio |
| zahar-val-test-0 | theme | string |
| zahar-val-test-0 | transcription | string |
| zahar-val-test-0 | summary | string |
| zahar-val-test-0 | noise | string |
| zahar-val-time | audio | audio |
| zahar-val-time | theme | string |
| zahar-val-time | transcription | string |
| zahar-val-time | summary | string |
| zahar-val-time | noise | string |
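The feature listing shows that three configurations (`concept`, `dummy`, `working-example`) carry only `audio`, `theme`, and `transcription`, while the rest also have `summary` and `noise`. Code that iterates over several configurations may want to guard against the missing columns. The sketch below is one way to do that; `columns_for` and `get_summary` are hypothetical helpers, and rows are assumed to be plain dictionaries keyed by feature name.

```python
from typing import Optional

BASE_COLUMNS = ("audio", "theme", "transcription")
EXTRA_COLUMNS = ("summary", "noise")

# Configurations listed in this card without summary/noise fields.
REDUCED_CONFIGS = frozenset({"concept", "dummy", "working-example"})


def columns_for(config_name: str) -> tuple:
    """Return the feature names this card lists for a configuration.

    Note: names that are not in the card are not validated; they are
    treated like a full five-column configuration.
    """
    if config_name in REDUCED_CONFIGS:
        return BASE_COLUMNS
    return BASE_COLUMNS + EXTRA_COLUMNS


def get_summary(row: dict, default: Optional[str] = None) -> Optional[str]:
    """Read 'summary' from a row, tolerating configurations that lack it."""
    return row.get("summary", default)


if __name__ == "__main__":
    print(columns_for("dummy"))
    print(columns_for("base-summary"))
```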

Dataset size and train split information

| Config name | Train split size (bytes) | Train examples | Download size (bytes) |
| --- | --- | --- | --- |
| Large-gera | 213252706.0 | 200 | 209824384 |
| Large-gera2 | 106430452.0 | 100 | 104457471 |
| Large-gera3 | 186858486.0 | 200 | 182341561 |
| Large-gera4 | 221199483.0 | 200 | 217891357 |
| Large-gera5 | 113959164.0 | 100 | 112464725 |
| Large-gera6 | 112992422.0 | 100 | 110622823 |
| MWP-ru | 34334807.0 | 20 | 34328238 |
| MWP-ru-mistral | 27609775.0 | 14 | 27573701 |
| MWP-ru-mistral-vosk | 51682924.0 | 20 | 51519771 |
| MWP-ukr | 38452053.0 | 20 | 38448310 |
| base | 109817719.0 | 100 | 108440974 |
| base-summary | 109867587.0 | 100 | 108466548 |
| concept | 7900550.0 | 5 | 6952224 |
| dummy | 39684356.0 | 20 | 38522196 |
| large | 1676337265.0 | 1620 | 1647107918 |
| large-sb1 | 213069301.0 | 200 | 209633218 |
| large-zahar-1 | 18514760.0 | 20 | 18234499 |
| large-zahar-3 | 200360483.0 | 200 | 197003536 |
| large-zahar-4 | 209593222.0 | 200 | 205262821 |
| large-zahar-test-200 | 189662264.0 | 200 | 186092807 |
| sb-val-time | 48565301.0 | 29 | 47467688 |
| working-example | 104460945.0 | 51 | 91278093 |
| zahar-val-test-0 | 1192326.0 | 1 | 1204199 |
| zahar-val-time | 15378824.0 | 10 | 14824313 |
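The size figures are easier to compare once normalised per example. The sketch below copies the numbers from the size table and derives the average bytes per example and the download-to-dataset size ratio for each configuration. The variable and function names are illustrative only. The summed example count adds up the per-configuration rows and says nothing about whether configurations share examples.

```python
# (config, train bytes, train examples, download bytes), copied from the size table.
ROWS = [
    ("Large-gera", 213252706, 200, 209824384),
    ("Large-gera2", 106430452, 100, 104457471),
    ("Large-gera3", 186858486, 200, 182341561),
    ("Large-gera4", 221199483, 200, 217891357),
    ("Large-gera5", 113959164, 100, 112464725),
    ("Large-gera6", 112992422, 100, 110622823),
    ("MWP-ru", 34334807, 20, 34328238),
    ("MWP-ru-mistral", 27609775, 14, 27573701),
    ("MWP-ru-mistral-vosk", 51682924, 20, 51519771),
    ("MWP-ukr", 38452053, 20, 38448310),
    ("base", 109817719, 100, 108440974),
    ("base-summary", 109867587, 100, 108466548),
    ("concept", 7900550, 5, 6952224),
    ("dummy", 39684356, 20, 38522196),
    ("large", 1676337265, 1620, 1647107918),
    ("large-sb1", 213069301, 200, 209633218),
    ("large-zahar-1", 18514760, 20, 18234499),
    ("large-zahar-3", 200360483, 200, 197003536),
    ("large-zahar-4", 209593222, 200, 205262821),
    ("large-zahar-test-200", 189662264, 200, 186092807),
    ("sb-val-time", 48565301, 29, 47467688),
    ("working-example", 104460945, 51, 91278093),
    ("zahar-val-test-0", 1192326, 1, 1204199),
    ("zahar-val-time", 15378824, 10, 14824313),
]


def bytes_per_example(num_bytes: int, num_examples: int) -> float:
    """Average size of one example in bytes."""
    return num_bytes / num_examples


# Sum of per-config example counts (configs may overlap; this is not a unique count).
total_examples = sum(n for _, _, n, _ in ROWS)

# Configurations whose download is larger than the reported dataset size.
larger_download = [name for name, size, _, dl in ROWS if dl > size]

for name, size, n, dl in ROWS:
    avg_mb = bytes_per_example(size, n) / 1e6
    print(f"{name:22s} {avg_mb:6.2f} MB/example  download/size = {dl / size:.3f}")
print("sum of example counts:", total_examples)
print("download larger than dataset size:", larger_download)
```

On these figures most configurations average roughly 1 MB per example, and `zahar-val-test-0` is the only one whose download size exceeds its reported dataset size.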