asahi417/seamless-align-deA-enA.tokenized.encodec
Source: Hugging Face. Updated 2024-06-17; indexed 2024-06-29.
Download link:
https://hf-mirror.com/datasets/asahi417/seamless-align-deA-enA.tokenized.encodec
Resource summary:
This dataset contains multiple subsets (subset_1 to subset_136) that share the same features: line number, German and English IDs, LASER scores, and audio tokens. The feature data types are int64, string, and float64. The training-set size and the number of examples vary by subset.
Provider:
asahi417
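Summary of original information
Dataset overview
Dataset configuration
The dataset contains multiple subsets, each with its own configuration name and the same set of features. Details of the subsets follow:
Subset list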
- subset_1
- subset_10
- subset_100
- subset_101
- subset_102
- subset_103
- subset_104
- subset_105
- subset_106
- subset_107
- subset_108
- subset_109
- subset_11
- subset_110
- subset_111
- subset_112
- subset_113
- subset_114
- subset_115
- subset_116
- subset_117
- subset_118
- subset_119
- subset_12
- subset_120
- subset_121
- subset_122
- subset_123
- subset_124
- subset_125
- subset_126
- subset_127
- subset_128
- subset_129
- subset_13
- subset_130
- subset_131
- subset_132
- subset_133
- subset_134
- subset_135
- subset_136
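The names in the list above are in lexicographic order, so subset_100 comes before subset_11. A minimal sketch, assuming numeric order is wanted instead, sorts by the trailing index. The helper name `subset_index` is hypothetical and only a few of the listed names are used.

```python
import re


def subset_index(name: str) -> int:
    """Return the numeric suffix of a config name such as 'subset_12'."""
    match = re.fullmatch(r"subset_(\d+)", name)
    if match is None:
        raise ValueError(f"unexpected config name: {name!r}")
    return int(match.group(1))


# A few names, in the lexicographic order used by the list above.
names = ["subset_1", "subset_10", "subset_100", "subset_101", "subset_11", "subset_12"]

# Sort by the integer suffix rather than by the raw string.
ordered = sorted(names, key=subset_index)
print(ordered)
```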
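Feature description
Each subset contains the following features:
- line_no: line number, of type int64.
- deA.id: German ID, of type string.
- deA.laser_score: German LASER score, of type float64.
- enA.id: English ID, of type string.
- enA.laser_score: English LASER score, of type float64.
- enA.audio.tokens: English audio tokens, a sequence of int64.
- deA.audio.tokens: German audio tokens, a sequence of int64.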
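The feature list can be turned into a small row check. The following is a sketch only: `validate_row` and the sample row are hypothetical, the sample values are made up for illustration, and nested token lists are accepted because the nesting depth of the token sequences is not stated here.

```python
from typing import Any, List, Mapping

# Expected Python type for each scalar feature named in the list above.
SCALAR_FEATURES = {
    "line_no": int,
    "deA.id": str,
    "deA.laser_score": float,
    "enA.id": str,
    "enA.laser_score": float,
}
TOKEN_FEATURES = ("enA.audio.tokens", "deA.audio.tokens")


def _all_ints(value: Any) -> bool:
    """True if value is an int or an arbitrarily nested list of ints."""
    if isinstance(value, bool):
        return False
    if isinstance(value, int):
        return True
    if isinstance(value, (list, tuple)):
        return all(_all_ints(item) for item in value)
    return False


def validate_row(row: Mapping[str, Any]) -> List[str]:
    """Return a list of problems found in one row; empty means the row fits."""
    problems = []
    for name, expected in SCALAR_FEATURES.items():
        if name not in row:
            problems.append(f"missing feature: {name}")
        elif isinstance(row[name], bool) or not isinstance(row[name], expected):
            problems.append(f"{name}: expected {expected.__name__}")
    for name in TOKEN_FEATURES:
        if name not in row:
            problems.append(f"missing feature: {name}")
        elif not isinstance(row[name], (list, tuple)) or not _all_ints(row[name]):
            problems.append(f"{name}: expected a sequence of integers")
    return problems


# Hypothetical row with made-up values, used only to exercise the check.
sample_row = {
    "line_no": 0,
    "deA.id": "de-example",
    "deA.laser_score": 1.1,
    "enA.id": "en-example",
    "enA.laser_score": 1.1,
    "enA.audio.tokens": [1, 2, 3],
    "deA.audio.tokens": [4, 5, 6],
}
print(validate_row(sample_row))
```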
Data splits
Each subset contains a single data split:
- train: the training set
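Since every subset exposes only a train split, loading one subset might look like the sketch below. It assumes the Hugging Face `datasets` library is installed and that the repository ID matches the path in the download link; `config_name` and `load_subset` are hypothetical helper names, and the 1 to 136 range comes from the summary above.

```python
REPO_ID = "asahi417/seamless-align-deA-enA.tokenized.encodec"


def config_name(index: int) -> str:
    """Build a config name such as 'subset_12' for an index in 1..136."""
    if not 1 <= index <= 136:
        raise ValueError(f"subset index out of range: {index}")
    return f"subset_{index}"


def load_subset(index: int, streaming: bool = True):
    """Load the train split of one subset (needs network access)."""
    # Imported lazily so this module can be imported without `datasets`.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name(index), split="train", streaming=streaming)
```

Calling `load_subset(1)` would request the train split of subset_1. With `streaming=True` the rows are iterated lazily instead of downloading the whole subset first.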
Dataset size
The training-set size of each subset is as follows:
| Subset | Train size (bytes) | Train examples | Download size (bytes) | Dataset size (bytes) |
|---|---|---|---|---|
| subset_1 | 825169182 | 2064 | 127787389 | 825169182 |
| subset_10 | 805994313 | 2109 | 124771864 | 805994313 |
| subset_100 | 589512805 | 1982 | 91508494 | 589512805 |
| subset_101 | 603841640 | 2029 | 93644008 | 603841640 |
| subset_102 | 604643147 | 2029 | 93825688 | 604643147 |
| subset_103 | 601136745 | 1982 | 93374940 | 601136745 |
| subset_104 | 590015590 | 1985 | 91516565 | 590015590 |
| subset_105 | 613543339 | 2064 | 95142736 | 613543339 |
| subset_106 | 601000658 | 2031 | 93261298 | 601000658 |
| subset_107 | 595494310 | 2015 | 92316169 | 595494310 |
| subset_108 | 569984625 | 1995 | 88461075 | 569984625 |
| subset_109 | 568826131 | 1984 | 88274007 | 568826131 |
| subset_11 | 811970945 | 2135 | 125769382 | 811970945 |
| subset_110 | 560748004 | 1987 | 87140540 | 560748004 |
| subset_111 | 575245728 | 2015 | 89198832 | 575245728 |
| subset_112 | 590894994 | 2036 | 91852642 | 590894994 |
| subset_113 | 567243832 | 1966 | 88015479 | 567243832 |
| subset_114 | 552905345 | 1999 | 85798966 | 552905345 |
| subset_115 | 571727881 | 1987 | 88662318 | 571727881 |
| subset_116 | 569922717 | 2005 | 88411724 | 569922717 |
| subset_117 | 558468694 | 1948 | 86613458 | 558468694 |
| subset_118 | 564068581 | 1965 | 87487846 | 564068581 |
| subset_119 | 577849244 | 2008 | 89546501 | 577849244 |
| subset_12 | 783606636 | 2110 | 121396474 | 783606636 |
| subset_120 | 558239070 | 1964 | 86591020 | 558239070 |
| subset_121 | 550013935 | 1965 | 85325510 | 550013935 |
| subset_122 | 553715504 | 1991 | 85980948 | 553715504 |
| subset_123 | 544456438 | 1950 | 84578293 | 544456438 |
| subset_124 | 538462573 | 1935 | 83540787 | 538462573 |
| subset_125 | 548628599 | 1976 | 85243330 | 548628599 |
| subset_126 | 538842285 | 1959 | 83546796 | 538842285 |
| subset_127 | 547088626 | 1947 | 84914109 | 547088626 |
| subset_128 | 520973069 | 1931 | 80864250 | 520973069 |
| subset_129 | 547347053 | 1948 | 85085221 | 547347053 |
| subset_13 | 818546882 | 2163 | 126774705 | 818546882 |
| subset_130 | 532932087 | 1893 | 82786273 | 532932087 |
| subset_131 | 538704981 | 1929 | 83670221 | 538704981 |
| subset_132 | 520929778 | 1868 | 80944419 | 520929778 |
| subset_133 | 524061160 | 1874 | 81340476 | 524061160 |
| subset_134 | 542335076 | 1945 | 84225720 | 542335076 |
| subset_135 | 508463577 | 1887 | 78965221 | 508463577 |
| subset_136 | 508463577 | 1887 | 78965221 | 508463577 |
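The table above can be read programmatically to compare download size with dataset size and to estimate bytes per example. This is a sketch using the first three rows copied from the table; the function name `parse_rows` and the dictionary keys are hypothetical.

```python
# First three data rows, copied verbatim from the table above.
ROWS = """
| subset_1 | 825169182 | 2064 | 127787389 | 825169182 |
| subset_10 | 805994313 | 2109 | 124771864 | 805994313 |
| subset_100 | 589512805 | 1982 | 91508494 | 589512805 |
"""


def parse_rows(text: str) -> dict:
    """Parse pipe-delimited table rows into a dict keyed by subset name."""
    rows = {}
    for line in text.strip().splitlines():
        # Drop the outer pipes, then split the remaining cells.
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        name, train_bytes, examples, download_bytes, dataset_bytes = cells
        rows[name] = {
            "train_bytes": int(train_bytes),
            "examples": int(examples),
            "download_bytes": int(download_bytes),
            "dataset_bytes": int(dataset_bytes),
        }
    return rows


stats = parse_rows(ROWS)
for name, row in stats.items():
    ratio = row["download_bytes"] / row["dataset_bytes"]
    per_example = row["train_bytes"] / row["examples"]
    print(f"{name}: download/dataset = {ratio:.3f}, bytes per example = {per_example:,.0f}")
```

For these three rows the download size is a bit over 15% of the dataset size, and each example accounts for a few hundred kilobytes.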



