anakib1/synth-rag
Source: Hugging Face. Updated 2024-05-31; indexed 2024-06-11.
Download link:
https://hf-mirror.com/datasets/anakib1/synth-rag
Resource description:
---
dataset_info:
- config_name: Large-gera
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  - name: summary
    dtype: string
  - name: noise
    dtype: string
  splits:
  - name: train
    num_bytes: 213252706.0
    num_examples: 200
  download_size: 209824384
  dataset_size: 213252706.0
- config_name: Large-gera2
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  - name: summary
    dtype: string
  - name: noise
    dtype: string
  splits:
  - name: train
    num_bytes: 106430452.0
    num_examples: 100
  download_size: 104457471
  dataset_size: 106430452.0
- config_name: Large-gera3
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  - name: summary
    dtype: string
  - name: noise
    dtype: string
  splits:
  - name: train
    num_bytes: 186858486.0
    num_examples: 200
  download_size: 182341561
  dataset_size: 186858486.0
- config_name: Large-gera4
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  - name: summary
    dtype: string
  - name: noise
    dtype: string
  splits:
  - name: train
    num_bytes: 221199483.0
    num_examples: 200
  download_size: 217891357
  dataset_size: 221199483.0
- config_name: Large-gera5
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  - name: summary
    dtype: string
  - name: noise
    dtype: string
  splits:
  - name: train
    num_bytes: 113959164.0
    num_examples: 100
  download_size: 112464725
  dataset_size: 113959164.0
- config_name: Large-gera6
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  - name: summary
    dtype: string
  - name: noise
    dtype: string
  splits:
  - name: train
    num_bytes: 112992422.0
    num_examples: 100
  download_size: 110622823
  dataset_size: 112992422.0
- config_name: MWP-ru
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  - name: summary
    dtype: string
  - name: noise
    dtype: string
  splits:
  - name: train
    num_bytes: 34334807.0
    num_examples: 20
  download_size: 34328238
  dataset_size: 34334807.0
- config_name: MWP-ru-mistral
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  - name: summary
    dtype: string
  - name: noise
    dtype: string
  splits:
  - name: train
    num_bytes: 27609775.0
    num_examples: 14
  download_size: 27573701
  dataset_size: 27609775.0
- config_name: MWP-ru-mistral-vosk
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  - name: summary
    dtype: string
  - name: noise
    dtype: string
  splits:
  - name: train
    num_bytes: 51682924.0
    num_examples: 20
  download_size: 51519771
  dataset_size: 51682924.0
- config_name: MWP-ukr
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  - name: summary
    dtype: string
  - name: noise
    dtype: string
  splits:
  - name: train
    num_bytes: 38452053.0
    num_examples: 20
  download_size: 38448310
  dataset_size: 38452053.0
- config_name: base
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  - name: summary
    dtype: string
  - name: noise
    dtype: string
  splits:
  - name: train
    num_bytes: 109817719.0
    num_examples: 100
  download_size: 108440974
  dataset_size: 109817719.0
- config_name: base-summary
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  - name: summary
    dtype: string
  - name: noise
    dtype: string
  splits:
  - name: train
    num_bytes: 109867587.0
    num_examples: 100
  download_size: 108466548
  dataset_size: 109867587.0
- config_name: concept
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  splits:
  - name: train
    num_bytes: 7900550.0
    num_examples: 5
  download_size: 6952224
  dataset_size: 7900550.0
- config_name: dummy
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  splits:
  - name: train
    num_bytes: 39684356.0
    num_examples: 20
  download_size: 38522196
  dataset_size: 39684356.0
- config_name: large
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  - name: summary
    dtype: string
  - name: noise
    dtype: string
  splits:
  - name: train
    num_bytes: 1676337265.0
    num_examples: 1620
  download_size: 1647107918
  dataset_size: 1676337265.0
- config_name: large-sb1
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  - name: summary
    dtype: string
  - name: noise
    dtype: string
  splits:
  - name: train
    num_bytes: 213069301.0
    num_examples: 200
  download_size: 209633218
  dataset_size: 213069301.0
- config_name: large-zahar-1
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  - name: summary
    dtype: string
  - name: noise
    dtype: string
  splits:
  - name: train
    num_bytes: 18514760.0
    num_examples: 20
  download_size: 18234499
  dataset_size: 18514760.0
- config_name: large-zahar-3
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  - name: summary
    dtype: string
  - name: noise
    dtype: string
  splits:
  - name: train
    num_bytes: 200360483.0
    num_examples: 200
  download_size: 197003536
  dataset_size: 200360483.0
- config_name: large-zahar-4
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  - name: summary
    dtype: string
  - name: noise
    dtype: string
  splits:
  - name: train
    num_bytes: 209593222.0
    num_examples: 200
  download_size: 205262821
  dataset_size: 209593222.0
- config_name: large-zahar-test-200
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  - name: summary
    dtype: string
  - name: noise
    dtype: string
  splits:
  - name: train
    num_bytes: 189662264.0
    num_examples: 200
  download_size: 186092807
  dataset_size: 189662264.0
- config_name: sb-val-time
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  - name: summary
    dtype: string
  - name: noise
    dtype: string
  splits:
  - name: train
    num_bytes: 48565301.0
    num_examples: 29
  download_size: 47467688
  dataset_size: 48565301.0
- config_name: working-example
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  splits:
  - name: train
    num_bytes: 104460945.0
    num_examples: 51
  download_size: 91278093
  dataset_size: 104460945.0
- config_name: zahar-val-test-0
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  - name: summary
    dtype: string
  - name: noise
    dtype: string
  splits:
  - name: train
    num_bytes: 1192326.0
    num_examples: 1
  download_size: 1204199
  dataset_size: 1192326.0
- config_name: zahar-val-time
  features:
  - name: audio
    dtype: audio
  - name: theme
    dtype: string
  - name: transcription
    dtype: string
  - name: summary
    dtype: string
  - name: noise
    dtype: string
  splits:
  - name: train
    num_bytes: 15378824.0
    num_examples: 10
  download_size: 14824313
  dataset_size: 15378824.0
configs:
- config_name: Large-gera
  data_files:
  - split: train
    path: Large-gera/train-*
- config_name: Large-gera2
  data_files:
  - split: train
    path: Large-gera2/train-*
- config_name: Large-gera3
  data_files:
  - split: train
    path: Large-gera3/train-*
- config_name: Large-gera4
  data_files:
  - split: train
    path: Large-gera4/train-*
- config_name: Large-gera5
  data_files:
  - split: train
    path: Large-gera5/train-*
- config_name: Large-gera6
  data_files:
  - split: train
    path: Large-gera6/train-*
- config_name: MWP-ru
  data_files:
  - split: train
    path: MWP-ru/train-*
- config_name: MWP-ru-mistral
  data_files:
  - split: train
    path: MWP-ru-mistral/train-*
- config_name: MWP-ru-mistral-vosk
  data_files:
  - split: train
    path: MWP-ru-mistral-vosk/train-*
- config_name: MWP-ukr
  data_files:
  - split: train
    path: MWP-ukr/train-*
- config_name: base
  data_files:
  - split: train
    path: base/train-*
- config_name: base-summary
  data_files:
  - split: train
    path: base-summary/train-*
- config_name: concept
  data_files:
  - split: train
    path: concept/train-*
- config_name: dummy
  data_files:
  - split: train
    path: dummy/train-*
- config_name: large
  data_files:
  - split: train
    path: large/train-*
- config_name: large-sb1
  data_files:
  - split: train
    path: large-sb1/train-*
- config_name: large-zahar-1
  data_files:
  - split: train
    path: large-zahar-1/train-*
- config_name: large-zahar-3
  data_files:
  - split: train
    path: large-zahar-3/train-*
- config_name: large-zahar-4
  data_files:
  - split: train
    path: large-zahar-4/train-*
- config_name: large-zahar-test-200
  data_files:
  - split: train
    path: large-zahar-test-200/train-*
- config_name: sb-val-time
  data_files:
  - split: train
    path: sb-val-time/train-*
- config_name: working-example
  data_files:
  - split: train
    path: working-example/train-*
- config_name: zahar-val-test-0
  data_files:
  - split: train
    path: zahar-val-test-0/train-*
- config_name: zahar-val-time
  data_files:
  - split: train
    path: zahar-val-time/train-*
---
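### Dataset information
This dataset has multiple configurations. Details for each follow:
1. **Config: Large-gera**
Features:
- audio: audio
- theme: string
- transcription: string
- summary: string
- noise: string
Splits: train only, 213252706 bytes, 200 examples
Download size: 209824384 bytes; dataset size: 213252706 bytes
2. **Config: Large-gera2**
Same features as Large-gera. Train split: 106430452 bytes, 100 examples. Download size: 104457471 bytes; dataset size: 106430452 bytes
3. **Config: Large-gera3**
Same features as Large-gera. Train split: 186858486 bytes, 200 examples. Download size: 182341561 bytes; dataset size: 186858486 bytes
4. **Config: Large-gera4**
Same features as Large-gera. Train split: 221199483 bytes, 200 examples. Download size: 217891357 bytes; dataset size: 221199483 bytes
5. **Config: Large-gera5**
Same features as Large-gera. Train split: 113959164 bytes, 100 examples. Download size: 112464725 bytes; dataset size: 113959164 bytes
6. **Config: Large-gera6**
Same features as Large-gera. Train split: 112992422 bytes, 100 examples. Download size: 110622823 bytes; dataset size: 112992422 bytes
7. **Config: MWP-ru**
Same features as Large-gera. Train split: 34334807 bytes, 20 examples. Download size: 34328238 bytes; dataset size: 34334807 bytes
8. **Config: MWP-ru-mistral**
Same features as Large-gera. Train split: 27609775 bytes, 14 examples. Download size: 27573701 bytes; dataset size: 27609775 bytes
9. **Config: MWP-ru-mistral-vosk**
Same features as Large-gera. Train split: 51682924 bytes, 20 examples. Download size: 51519771 bytes; dataset size: 51682924 bytes
10. **Config: MWP-ukr**
Same features as Large-gera. Train split: 38452053 bytes, 20 examples. Download size: 38448310 bytes; dataset size: 38452053 bytes
11. **Config: base**
Same features as Large-gera. Train split: 109817719 bytes, 100 examples. Download size: 108440974 bytes; dataset size: 109817719 bytes
12. **Config: base-summary**
Same features as Large-gera. Train split: 109867587 bytes, 100 examples. Download size: 108466548 bytes; dataset size: 109867587 bytes
13. **Config: concept**
Features:
- audio: audio
- theme: string
- transcription: string
(no summary or noise fields)
Splits: train only, 7900550 bytes, 5 examples
Download size: 6952224 bytes; dataset size: 7900550 bytes
14. **Config: dummy**
Features:
- audio: audio
- theme: string
- transcription: string
(no summary or noise fields)
Splits: train only, 39684356 bytes, 20 examples
Download size: 38522196 bytes; dataset size: 39684356 bytes
15. **Config: large**
Same features as Large-gera. Train split: 1676337265 bytes, 1620 examples. Download size: 1647107918 bytes; dataset size: 1676337265 bytes
16. **Config: large-sb1**
Same features as Large-gera. Train split: 213069301 bytes, 200 examples. Download size: 209633218 bytes; dataset size: 213069301 bytes
17. **Config: large-zahar-1**
Same features as Large-gera. Train split: 18514760 bytes, 20 examples. Download size: 18234499 bytes; dataset size: 18514760 bytes
18. **Config: large-zahar-3**
Same features as Large-gera. Train split: 200360483 bytes, 200 examples. Download size: 197003536 bytes; dataset size: 200360483 bytes
19. **Config: large-zahar-4**
Same features as Large-gera. Train split: 209593222 bytes, 200 examples. Download size: 205262821 bytes; dataset size: 209593222 bytes
20. **Config: large-zahar-test-200**
Same features as Large-gera. Train split: 189662264 bytes, 200 examples. Download size: 186092807 bytes; dataset size: 189662264 bytes
21. **Config: sb-val-time**
Same features as Large-gera. Train split: 48565301 bytes, 29 examples. Download size: 47467688 bytes; dataset size: 48565301 bytes
22. **Config: working-example**
Features:
- audio: audio
- theme: string
- transcription: string
(no summary or noise fields)
Splits: train only, 104460945 bytes, 51 examples
Download size: 91278093 bytes; dataset size: 104460945 bytes
23. **Config: zahar-val-test-0**
Same features as Large-gera (including summary and noise). Train split: 1192326 bytes, 1 example. Download size: 1204199 bytes; dataset size: 1192326 bytes
24. **Config: zahar-val-time**
Same features as Large-gera (including summary and noise). Train split: 15378824 bytes, 10 examples. Download size: 14824313 bytes; dataset size: 15378824 bytes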
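As a minimal sketch of how one of the configurations above might be loaded, the snippet below uses `load_dataset` from the Hugging Face `datasets` library. The repository id is taken from the page title; the helper names `expected_columns` and `load_config` are hypothetical, and the loader assumes the repository is still publicly reachable and that `datasets` is installed.

```python
from typing import List

REPO_ID = "anakib1/synth-rag"  # repository id as given in the page title

# Configs that the card lists without `summary` and `noise` columns.
CONFIGS_WITHOUT_SUMMARY = {"concept", "dummy", "working-example"}


def expected_columns(config_name: str) -> List[str]:
    """Return the column names the card lists for a given config."""
    columns = ["audio", "theme", "transcription"]
    if config_name not in CONFIGS_WITHOUT_SUMMARY:
        columns += ["summary", "noise"]
    return columns


def load_config(config_name: str):
    """Load the train split of one config (needs `datasets` and network access)."""
    # Imported lazily so that expected_columns() has no third-party dependency.
    from datasets import load_dataset

    ds = load_dataset(REPO_ID, config_name, split="train")
    missing = set(expected_columns(config_name)) - set(ds.column_names)
    if missing:
        raise ValueError(f"{config_name} is missing columns: {sorted(missing)}")
    return ds
```

Calling `load_config("concept")` would fetch the smallest config (5 examples), which is a reasonable first check before downloading `large`.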
### Config data file paths
Every config has only a train split, and its data files follow the `[config name]/train-*` path pattern:
- Large-gera: `Large-gera/train-*`
- Large-gera2: `Large-gera2/train-*`
- Large-gera3: `Large-gera3/train-*`
- Large-gera4: `Large-gera4/train-*`
- Large-gera5: `Large-gera5/train-*`
- Large-gera6: `Large-gera6/train-*`
- MWP-ru: `MWP-ru/train-*`
- MWP-ru-mistral: `MWP-ru-mistral/train-*`
- MWP-ru-mistral-vosk: `MWP-ru-mistral-vosk/train-*`
- MWP-ukr: `MWP-ukr/train-*`
- base: `base/train-*`
- base-summary: `base-summary/train-*`
- concept: `concept/train-*`
- dummy: `dummy/train-*`
- large: `large/train-*`
- large-sb1: `large-sb1/train-*`
- large-zahar-1: `large-zahar-1/train-*`
- large-zahar-3: `large-zahar-3/train-*`
- large-zahar-4: `large-zahar-4/train-*`
- large-zahar-test-200: `large-zahar-test-200/train-*`
- sb-val-time: `sb-val-time/train-*`
- working-example: `working-example/train-*`
- zahar-val-test-0: `zahar-val-test-0/train-*`
- zahar-val-time: `zahar-val-time/train-*`
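The path pattern listed above can be checked mechanically. The sketch below builds the glob for a config and tests a file path against it with the standard-library `fnmatch` module. The function names are hypothetical, and the shard file names in the usage note are illustrative assumptions, not names taken from the repository.

```python
from fnmatch import fnmatchcase


def train_glob(config_name: str) -> str:
    """Build the glob the card gives for a config's train split."""
    return f"{config_name}/train-*"


def belongs_to_config(file_path: str, config_name: str) -> bool:
    """Return True if file_path matches the config's train glob (case-sensitive)."""
    return fnmatchcase(file_path, train_glob(config_name))
```

For example, `belongs_to_config("large/train-00000-of-00004.parquet", "large")` is true, while a file under `large-sb1/` does not match the `large` glob because the directory prefix must match exactly.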
Provider:
anakib1
Summary of original information
Dataset overview
Dataset configurations and features
| Config name | Feature name | Data type |
|---|---|---|
| Large-gera | audio | audio |
| Large-gera | theme | string |
| Large-gera | transcription | string |
| Large-gera | summary | string |
| Large-gera | noise | string |
| Large-gera2 | audio | audio |
| Large-gera2 | theme | string |
| Large-gera2 | transcription | string |
| Large-gera2 | summary | string |
| Large-gera2 | noise | string |
| Large-gera3 | audio | audio |
| Large-gera3 | theme | string |
| Large-gera3 | transcription | string |
| Large-gera3 | summary | string |
| Large-gera3 | noise | string |
| Large-gera4 | audio | audio |
| Large-gera4 | theme | string |
| Large-gera4 | transcription | string |
| Large-gera4 | summary | string |
| Large-gera4 | noise | string |
| Large-gera5 | audio | audio |
| Large-gera5 | theme | string |
| Large-gera5 | transcription | string |
| Large-gera5 | summary | string |
| Large-gera5 | noise | string |
| Large-gera6 | audio | audio |
| Large-gera6 | theme | string |
| Large-gera6 | transcription | string |
| Large-gera6 | summary | string |
| Large-gera6 | noise | string |
| MWP-ru | audio | audio |
| MWP-ru | theme | string |
| MWP-ru | transcription | string |
| MWP-ru | summary | string |
| MWP-ru | noise | string |
| MWP-ru-mistral | audio | audio |
| MWP-ru-mistral | theme | string |
| MWP-ru-mistral | transcription | string |
| MWP-ru-mistral | summary | string |
| MWP-ru-mistral | noise | string |
| MWP-ru-mistral-vosk | audio | audio |
| MWP-ru-mistral-vosk | theme | string |
| MWP-ru-mistral-vosk | transcription | string |
| MWP-ru-mistral-vosk | summary | string |
| MWP-ru-mistral-vosk | noise | string |
| MWP-ukr | audio | audio |
| MWP-ukr | theme | string |
| MWP-ukr | transcription | string |
| MWP-ukr | summary | string |
| MWP-ukr | noise | string |
| base | audio | audio |
| base | theme | string |
| base | transcription | string |
| base | summary | string |
| base | noise | string |
| base-summary | audio | audio |
| base-summary | theme | string |
| base-summary | transcription | string |
| base-summary | summary | string |
| base-summary | noise | string |
| concept | audio | audio |
| concept | theme | string |
| concept | transcription | string |
| dummy | audio | audio |
| dummy | theme | string |
| dummy | transcription | string |
| large | audio | audio |
| large | theme | string |
| large | transcription | string |
| large | summary | string |
| large | noise | string |
| large-sb1 | audio | audio |
| large-sb1 | theme | string |
| large-sb1 | transcription | string |
| large-sb1 | summary | string |
| large-sb1 | noise | string |
| large-zahar-1 | audio | audio |
| large-zahar-1 | theme | string |
| large-zahar-1 | transcription | string |
| large-zahar-1 | summary | string |
| large-zahar-1 | noise | string |
| large-zahar-3 | audio | audio |
| large-zahar-3 | theme | string |
| large-zahar-3 | transcription | string |
| large-zahar-3 | summary | string |
| large-zahar-3 | noise | string |
| large-zahar-4 | audio | audio |
| large-zahar-4 | theme | string |
| large-zahar-4 | transcription | string |
| large-zahar-4 | summary | string |
| large-zahar-4 | noise | string |
| large-zahar-test-200 | audio | audio |
| large-zahar-test-200 | theme | string |
| large-zahar-test-200 | transcription | string |
| large-zahar-test-200 | summary | string |
| large-zahar-test-200 | noise | string |
| sb-val-time | audio | audio |
| sb-val-time | theme | string |
| sb-val-time | transcription | string |
| sb-val-time | summary | string |
| sb-val-time | noise | string |
| working-example | audio | audio |
| working-example | theme | string |
| working-example | transcription | string |
| zahar-val-test-0 | audio | audio |
| zahar-val-test-0 | theme | string |
| zahar-val-test-0 | transcription | string |
| zahar-val-test-0 | summary | string |
| zahar-val-test-0 | noise | string |
| zahar-val-time | audio | audio |
| zahar-val-time | theme | string |
| zahar-val-time | transcription | string |
| zahar-val-time | summary | string |
| zahar-val-time | noise | string |
Dataset sizes and train split information
| Config name | Train split size (bytes) | Train examples | Download size (bytes) |
|---|---|---|---|
| Large-gera | 213252706.0 | 200 | 209824384 |
| Large-gera2 | 106430452.0 | 100 | 104457471 |
| Large-gera3 | 186858486.0 | 200 | 182341561 |
| Large-gera4 | 221199483.0 | 200 | 217891357 |
| Large-gera5 | 113959164.0 | 100 | 112464725 |
| Large-gera6 | 112992422.0 | 100 | 110622823 |
| MWP-ru | 34334807.0 | 20 | 34328238 |
| MWP-ru-mistral | 27609775.0 | 14 | 27573701 |
| MWP-ru-mistral-vosk | 51682924.0 | 20 | 51519771 |
| MWP-ukr | 38452053.0 | 20 | 38448310 |
| base | 109817719.0 | 100 | 108440974 |
| base-summary | 109867587.0 | 100 | 108466548 |
| concept | 7900550.0 | 5 | 6952224 |
| dummy | 39684356.0 | 20 | 38522196 |
| large | 1676337265.0 | 1620 | 1647107918 |
| large-sb1 | 213069301.0 | 200 | 209633218 |
| large-zahar-1 | 18514760.0 | 20 | 18234499 |
| large-zahar-3 | 200360483.0 | 200 | 197003536 |
| large-zahar-4 | 209593222.0 | 200 | 205262821 |
| large-zahar-test-200 | 189662264.0 | 200 | 186092807 |
| sb-val-time | 48565301.0 | 29 | 47467688 |
| working-example | 104460945.0 | 51 | 91278093 |
| zahar-val-test-0 | 1192326.0 | 1 | 1204199 |
| zahar-val-time | 15378824.0 | 10 | 14824313 |
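To make the size figures easier to compare, the sketch below derives two quantities from a few rows of the table above: average bytes per example and the ratio of download size to dataset size. Only four rows are copied in for brevity, and both function names are hypothetical.

```python
# (config, train split size in bytes, number of examples, download size in bytes)
# Values copied from four rows of the table above.
ROWS = [
    ("base", 109817719, 100, 108440974),
    ("concept", 7900550, 5, 6952224),
    ("large", 1676337265, 1620, 1647107918),
    ("zahar-val-test-0", 1192326, 1, 1204199),
]


def bytes_per_example(size_bytes: int, num_examples: int) -> float:
    """Average size of one example in bytes."""
    return size_bytes / num_examples


def download_ratio(download_bytes: int, size_bytes: int) -> float:
    """Download size relative to dataset size (values above 1 mean the download is larger)."""
    return download_bytes / size_bytes


if __name__ == "__main__":
    for name, size, count, download in ROWS:
        print(f"{name}: {bytes_per_example(size, count):.0f} B/example, "
              f"ratio {download_ratio(download, size):.3f}")
```

In these rows the download is slightly smaller than the dataset size except for `zahar-val-test-0`, where it is slightly larger.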



