asahi417/seamless-align-enA-zhA.tokenized.encodec
Hugging Face | Updated 2024-06-11 | Indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/asahi417/seamless-align-enA-zhA.tokenized.encodec
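Description:
This dataset consists of multiple subsets, each pairing English and Chinese data. Every record carries a line number, an identifier and a `laser_score` for each language, and an audio token sequence for each language. Each subset also lists the size of its training split and its number of examples.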
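The record layout described above can be sketched in Python. This is a minimal illustration, not the provider's own code: it assumes the Hugging Face `datasets` package is installed and that each config (for example `subset_1`) exposes a single `train` split, as the listing on this page suggests. The helper names `load_subset` and `summarize_row` are hypothetical, and the sample row is made-up data used only to show the field names.

```python
from typing import Any, Dict

# Repository id taken from the download link on this page.
REPO_ID = "asahi417/seamless-align-enA-zhA.tokenized.encodec"


def load_subset(config_name: str = "subset_1"):
    """Download one config's train split (needs `datasets` and network access)."""
    # Imported lazily so that summarize_row() below also works offline.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name, split="train")


def summarize_row(row: Dict[str, Any]) -> Dict[str, Any]:
    """Reduce one record to its ids, lowest laser_score, and token counts."""
    return {
        "line_no": row["line_no"],
        "pair": (row["enA.id"], row["zhA.id"]),
        "min_laser_score": min(row["enA.laser_score"], row["zhA.laser_score"]),
        "n_en_tokens": len(row["enA.audio.tokens"]),
        "n_zh_tokens": len(row["zhA.audio.tokens"]),
    }


if __name__ == "__main__":
    # Made-up row with the listed feature names; real ids and tokens will differ.
    fake_row = {
        "line_no": 0,
        "enA.id": "en-example",
        "enA.laser_score": 1.10,
        "zhA.id": "zh-example",
        "zhA.laser_score": 1.20,
        "enA.audio.tokens": [12, 7, 301],
        "zhA.audio.tokens": [5, 44],
    }
    print(summarize_row(fake_row))
```

With network access, `summarize_row(load_subset("subset_1")[0])` would apply the same summary to a real record. The column names contain dots, so they are read with plain dictionary indexing rather than attribute access.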
Provider:
asahi417
Summary of original information
Dataset overview
Dataset configurations and features
- config_name: subset_1
- features:
- line_no: int64
- enA.id: string
- enA.laser_score: float64
- zhA.id: string
- zhA.laser_score: float64
- enA.audio.tokens: sequence of int64
- zhA.audio.tokens: sequence of int64
- splits:
- train: 1962 examples, 710852619 bytes
- download_size: 109098575 bytes
- dataset_size: 710852619 bytes
- config_name: subset_10
- features:
- line_no: int64
- enA.id: string
- enA.laser_score: float64
- zhA.id: string
- zhA.laser_score: float64
- enA.audio.tokens: sequence of int64
- zhA.audio.tokens: sequence of int64
- splits:
- train: 2031 examples, 679175545 bytes
- download_size: 104060255 bytes
- dataset_size: 679175545 bytes
- config_name: subset_100
- features:
- line_no: int64
- enA.id: string
- enA.laser_score: float64
- zhA.id: string
- zhA.laser_score: float64
- enA.audio.tokens: sequence of int64
- zhA.audio.tokens: sequence of int64
- splits:
- train: 1891 examples, 661577445 bytes
- download_size: 102774345 bytes
- dataset_size: 661577445 bytes
- config_name: subset_101
- features:
- line_no: int64
- enA.id: string
- enA.laser_score: float64
- zhA.id: string
- zhA.laser_score: float64
- zhA.audio.tokens: sequence of int64
- enA.audio.tokens: sequence of int64
- splits:
- train: 1885 examples, 652302383 bytes
- download_size: 101253284 bytes
- dataset_size: 652302383 bytes
- config_name: subset_102
- features:
- line_no: int64
- enA.id: string
- enA.laser_score: float64
- zhA.id: string
- zhA.laser_score: float64
- enA.audio.tokens: sequence of int64
- zhA.audio.tokens: sequence of int64
- splits:
- train: 1863 examples, 636971522 bytes
- download_size: 98936328 bytes
- dataset_size: 636971522 bytes
- config_name: subset_103
- features:
- line_no: int64
- enA.id: string
- enA.laser_score: float64
- zhA.id: string
- zhA.laser_score: float64
- zhA.audio.tokens: sequence of int64
- enA.audio.tokens: sequence of int64
- splits:
- train: 1861 examples, 648739957 bytes
- download_size: 100689017 bytes
- dataset_size: 648739957 bytes
- config_name: subset_104
- features:
- line_no: int64
- enA.id: string
- enA.laser_score: float64
- zhA.id: string
- zhA.laser_score: float64
- zhA.audio.tokens: sequence of int64
- enA.audio.tokens: sequence of int64
- splits:
- train: 1875 examples, 640330458 bytes
- download_size: 99441227 bytes
- dataset_size: 640330458 bytes
- config_name: subset_105
- features:
- line_no: int64
- enA.id: string
- enA.laser_score: float64
- zhA.id: string
- zhA.laser_score: float64
- enA.audio.tokens: sequence of int64
- zhA.audio.tokens: sequence of int64
- splits:
- train: 1871 examples, 656736394 bytes
- download_size: 102004996 bytes
- dataset_size: 656736394 bytes
- config_name: subset_106
- features:
- line_no: int64
- enA.id: string
- enA.laser_score: float64
- zhA.id: string
- zhA.laser_score: float64
- enA.audio.tokens: sequence of int64
- zhA.audio.tokens: sequence of int64
- splits:
- train: 1865 examples, 621738950 bytes
- download_size: 96546849 bytes
- dataset_size: 621738950 bytes
- config_name: subset_107
- features:
- line_no: int64
- enA.id: string
- enA.laser_score: float64
- zhA.id: string
- zhA.laser_score: float64
- zhA.audio.tokens: sequence of int64
- enA.audio.tokens: sequence of int64
- splits:
- train: 1838 examples, 624614454 bytes
- download_size: 96978610 bytes
- dataset_size: 624614454 bytes
- config_name: subset_108
- features:
- line_no: int64
- enA.id: string
- enA.laser_score: float64
- zhA.id: string
- zhA.laser_score: float64
- enA.audio.tokens: sequence of int64
- zhA.audio.tokens: sequence of int64
- splits:
- train: 1860 examples, 651288129 bytes
- download_size: 101079595 bytes
- dataset_size: 651288129 bytes
- config_name: subset_109
- features:
- line_no: int64
- enA.id: string
- enA.laser_score: float64
- zhA.id: string
- zhA.laser_score: float64
- enA.audio.tokens: sequence of int64
- zhA.audio.tokens: sequence of int64
- splits:
- train: 1866 examples, 649726202 bytes
- download_size: 100916572 bytes
- dataset_size: 649726202 bytes
- config_name: subset_11
- features:
- line_no: int64
- enA.id: string
- enA.laser_score: float64
- zhA.id: string
- zhA.laser_score: float64
- enA.audio.tokens: sequence of int64
- zhA.audio.tokens: sequence of int64
- splits:
- train: 1994 examples, 652354271 bytes
- download_size: 100162655 bytes
- dataset_size: 652354271 bytes
- config_name: subset_110
- features:
- line_no: int64
- enA.id: string
- enA.laser_score: float64
- zhA.id: string
- zhA.laser_score: float64
- enA.audio.tokens: sequence of int64
- zhA.audio.tokens: sequence of int64
- splits:
- train: 1843 examples, 627233442 bytes
- download_size: 97384819 bytes
- dataset_size: 627233442 bytes
- config_name: subset_111
- features:
- line_no: int64
- enA.id: string
- enA.laser_score: float64
- zhA.id: string
- zhA.laser_score: float64
- zhA.audio.tokens: sequence of int64
- enA.audio.tokens: sequence of int64
- splits:
- train: 1845 examples, 646406232 bytes
- download_size: 100280432 bytes
- dataset_size: 646406232 bytes
- config_name: subset_112
- features:
- line_no: int64
- enA.id: string
- enA.laser_score: float64
- zhA.id: string
- zhA.laser_score: float64
- enA.audio.tokens: sequence of int64
- zhA.audio.tokens: sequence of int64
- splits:
- train: 1844 examples, 633693165 bytes
- download_size: 98424960 bytes
- dataset_size: 633693165 bytes
- config_name: subset_113
- features:
- line_no: int64
- enA.id: string
- enA.laser_score: float64
- zhA.id: string
- zhA.laser_score: float64
- zhA.audio.tokens: sequence of int64
- enA.audio.tokens: sequence of int64
- splits:
- train: 1839 examples, 628986718 bytes
- download_size: 97696784 bytes
- dataset_size: 628986718 bytes
- config_name: subset_114
- features:
- line_no: int64
- enA.id: string
- enA.laser_score: float64
- zhA.id: string
- zhA.laser_score: float64
- enA.audio.tokens: sequence of int64
- zhA.audio.tokens: sequence of int64
- splits:
- train: 1851 examples, 646298717 bytes
- download_size: 100311749 bytes
- dataset_size: 646298717 bytes
- config_name: subset_115
- features:
- line_no: int64
- enA.id: string
- enA.laser_score: float64
- zhA.id: string
- zhA.laser_score: float64
- zhA.audio.tokens: sequence of int64
- enA.audio.tokens: sequence of int64
- splits:
- train: 1821 examples, 641968057 bytes
- download_size: 99667687 bytes
- dataset_size: 641968057 bytes
- config_name: subset_116
- features:
- line_no: int64
- enA.id: string
- enA.laser_score: float64
- zhA.id: string
- zhA.laser_score: float64
- zhA.audio.tokens: sequence of int64
- enA.audio.tokens: sequence of int64
- splits:
- train: 1837 examples, 640626123 bytes
- download_size: 99365627 bytes
- dataset_size: 640626123 bytes
- config_name: subset_117
- features:
- line_no: int64
- enA.id: string
- enA.laser_score: float64
- zhA.id: string
- zhA.laser_score: float64
- zhA.audio.tokens: sequence of int64
- enA.audio.tokens: sequence of int64
- splits:
- train: 1854 examples, 646082877 bytes
- download_size: 100377054 bytes
- dataset_size: 646082877 bytes
- config_name: subset_118
- features:
- line_no: int64
- enA.id: string
- enA.laser_score: float64
- zhA.id: string
- zhA.laser_score: float64
- enA.audio.tokens: sequence of int64
- zhA.audio.tokens: sequence of int64
- splits:
- train: 1814 examples, 627190139 bytes
- download_size: 97295945 bytes
- dataset_size: 627190139 bytes
- config_name: subset_119
- features:
- line_no: int64
- enA.id: string
- enA.laser_score: float64
- zhA.id: string
- zhA.laser_score: float64
- zhA.audio.tokens: sequence of int64
- enA.audio.tokens: sequence of int64
- splits:
- train: 1823 examples, 633562188 bytes
- download_size: 98314879 bytes
- dataset_size: 633562188 bytes
- config_name: subset_12
- features:
- line_no: int64
- enA.id: string
- enA.laser_score: float64
- zhA.id: string
- zhA.laser_score: float64
- enA.audio.tokens: sequence of int64



