
asahi417/seamless-align-enA-zhA.tokenized.encodec

Source: Hugging Face. Updated 2024-06-11; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/asahi417/seamless-align-enA-zhA.tokenized.encodec
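Resource summary:
The dataset consists of multiple subsets, each containing paired English and Chinese data: a line number, an identifier for each language, a LASER score for each language, and an audio token sequence for each language. Each subset also records the size of its training split and its number of examples.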
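The fields described above can be read with the Hugging Face `datasets` library. The following is a minimal sketch, not the dataset author's own code: `load_subset` and `filter_by_laser` are hypothetical helper names, and the score threshold is a placeholder the reader must choose. It assumes the `datasets` package is installed and the Hub is reachable.

```python
from typing import Any, Dict, Iterable, List

# Repository id as given in the listing.
REPO_ID = "asahi417/seamless-align-enA-zhA.tokenized.encodec"


def load_subset(config_name: str = "subset_1"):
    """Load the train split of one configuration (needs network access)."""
    # Imported lazily so the pure helper below works without the dependency.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name, split="train")


def filter_by_laser(
    rows: Iterable[Dict[str, Any]], threshold: float
) -> List[Dict[str, Any]]:
    """Keep rows whose English and Chinese LASER scores both reach `threshold`.

    `threshold` is a caller-chosen placeholder; the listing does not state
    a recommended value.
    """
    return [
        row
        for row in rows
        if row["enA.laser_score"] >= threshold
        and row["zhA.laser_score"] >= threshold
    ]
```

A call such as `filter_by_laser(load_subset("subset_1"), threshold)` would then return only the pairs that clear the chosen score on both sides.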
Provided by:
asahi417
Summary of original information

Dataset overview

Dataset configurations and features

  • config_name: subset_1

    • features:
      • line_no: int64
      • enA.id: string
      • enA.laser_score: float64
      • zhA.id: string
      • zhA.laser_score: float64
      • enA.audio.tokens: sequence of int64
      • zhA.audio.tokens: sequence of int64
    • splits:
      • train: 1962 examples, 710852619 bytes
    • download_size: 109098575 bytes
    • dataset_size: 710852619 bytes
  • config_name: subset_10

    • features:
      • line_no: int64
      • enA.id: string
      • enA.laser_score: float64
      • zhA.id: string
      • zhA.laser_score: float64
      • enA.audio.tokens: sequence of int64
      • zhA.audio.tokens: sequence of int64
    • splits:
      • train: 2031 examples, 679175545 bytes
    • download_size: 104060255 bytes
    • dataset_size: 679175545 bytes
  • config_name: subset_100

    • features:
      • line_no: int64
      • enA.id: string
      • enA.laser_score: float64
      • zhA.id: string
      • zhA.laser_score: float64
      • enA.audio.tokens: sequence of int64
      • zhA.audio.tokens: sequence of int64
    • splits:
      • train: 1891 examples, 661577445 bytes
    • download_size: 102774345 bytes
    • dataset_size: 661577445 bytes
  • config_name: subset_101

    • features:
      • line_no: int64
      • enA.id: string
      • enA.laser_score: float64
      • zhA.id: string
      • zhA.laser_score: float64
      • zhA.audio.tokens: sequence of int64
      • enA.audio.tokens: sequence of int64
    • splits:
      • train: 1885 examples, 652302383 bytes
    • download_size: 101253284 bytes
    • dataset_size: 652302383 bytes
  • config_name: subset_102

    • features:
      • line_no: int64
      • enA.id: string
      • enA.laser_score: float64
      • zhA.id: string
      • zhA.laser_score: float64
      • enA.audio.tokens: sequence of int64
      • zhA.audio.tokens: sequence of int64
    • splits:
      • train: 1863 examples, 636971522 bytes
    • download_size: 98936328 bytes
    • dataset_size: 636971522 bytes
  • config_name: subset_103

    • features:
      • line_no: int64
      • enA.id: string
      • enA.laser_score: float64
      • zhA.id: string
      • zhA.laser_score: float64
      • zhA.audio.tokens: sequence of int64
      • enA.audio.tokens: sequence of int64
    • splits:
      • train: 1861 examples, 648739957 bytes
    • download_size: 100689017 bytes
    • dataset_size: 648739957 bytes
  • config_name: subset_104

    • features:
      • line_no: int64
      • enA.id: string
      • enA.laser_score: float64
      • zhA.id: string
      • zhA.laser_score: float64
      • zhA.audio.tokens: sequence of int64
      • enA.audio.tokens: sequence of int64
    • splits:
      • train: 1875 examples, 640330458 bytes
    • download_size: 99441227 bytes
    • dataset_size: 640330458 bytes
  • config_name: subset_105

    • features:
      • line_no: int64
      • enA.id: string
      • enA.laser_score: float64
      • zhA.id: string
      • zhA.laser_score: float64
      • enA.audio.tokens: sequence of int64
      • zhA.audio.tokens: sequence of int64
    • splits:
      • train: 1871 examples, 656736394 bytes
    • download_size: 102004996 bytes
    • dataset_size: 656736394 bytes
  • config_name: subset_106

    • features:
      • line_no: int64
      • enA.id: string
      • enA.laser_score: float64
      • zhA.id: string
      • zhA.laser_score: float64
      • enA.audio.tokens: sequence of int64
      • zhA.audio.tokens: sequence of int64
    • splits:
      • train: 1865 examples, 621738950 bytes
    • download_size: 96546849 bytes
    • dataset_size: 621738950 bytes
  • config_name: subset_107

    • features:
      • line_no: int64
      • enA.id: string
      • enA.laser_score: float64
      • zhA.id: string
      • zhA.laser_score: float64
      • zhA.audio.tokens: sequence of int64
      • enA.audio.tokens: sequence of int64
    • splits:
      • train: 1838 examples, 624614454 bytes
    • download_size: 96978610 bytes
    • dataset_size: 624614454 bytes
  • config_name: subset_108

    • features:
      • line_no: int64
      • enA.id: string
      • enA.laser_score: float64
      • zhA.id: string
      • zhA.laser_score: float64
      • enA.audio.tokens: sequence of int64
      • zhA.audio.tokens: sequence of int64
    • splits:
      • train: 1860 examples, 651288129 bytes
    • download_size: 101079595 bytes
    • dataset_size: 651288129 bytes
  • config_name: subset_109

    • features:
      • line_no: int64
      • enA.id: string
      • enA.laser_score: float64
      • zhA.id: string
      • zhA.laser_score: float64
      • enA.audio.tokens: sequence of int64
      • zhA.audio.tokens: sequence of int64
    • splits:
      • train: 1866 examples, 649726202 bytes
    • download_size: 100916572 bytes
    • dataset_size: 649726202 bytes
  • config_name: subset_11

    • features:
      • line_no: int64
      • enA.id: string
      • enA.laser_score: float64
      • zhA.id: string
      • zhA.laser_score: float64
      • enA.audio.tokens: sequence of int64
      • zhA.audio.tokens: sequence of int64
    • splits:
      • train: 1994 examples, 652354271 bytes
    • download_size: 100162655 bytes
    • dataset_size: 652354271 bytes
  • config_name: subset_110

    • features:
      • line_no: int64
      • enA.id: string
      • enA.laser_score: float64
      • zhA.id: string
      • zhA.laser_score: float64
      • enA.audio.tokens: sequence of int64
      • zhA.audio.tokens: sequence of int64
    • splits:
      • train: 1843 examples, 627233442 bytes
    • download_size: 97384819 bytes
    • dataset_size: 627233442 bytes
  • config_name: subset_111

    • features:
      • line_no: int64
      • enA.id: string
      • enA.laser_score: float64
      • zhA.id: string
      • zhA.laser_score: float64
      • zhA.audio.tokens: sequence of int64
      • enA.audio.tokens: sequence of int64
    • splits:
      • train: 1845 examples, 646406232 bytes
    • download_size: 100280432 bytes
    • dataset_size: 646406232 bytes
  • config_name: subset_112

    • features:
      • line_no: int64
      • enA.id: string
      • enA.laser_score: float64
      • zhA.id: string
      • zhA.laser_score: float64
      • enA.audio.tokens: sequence of int64
      • zhA.audio.tokens: sequence of int64
    • splits:
      • train: 1844 examples, 633693165 bytes
    • download_size: 98424960 bytes
    • dataset_size: 633693165 bytes
  • config_name: subset_113

    • features:
      • line_no: int64
      • enA.id: string
      • enA.laser_score: float64
      • zhA.id: string
      • zhA.laser_score: float64
      • zhA.audio.tokens: sequence of int64
      • enA.audio.tokens: sequence of int64
    • splits:
      • train: 1839 examples, 628986718 bytes
    • download_size: 97696784 bytes
    • dataset_size: 628986718 bytes
  • config_name: subset_114

    • features:
      • line_no: int64
      • enA.id: string
      • enA.laser_score: float64
      • zhA.id: string
      • zhA.laser_score: float64
      • enA.audio.tokens: sequence of int64
      • zhA.audio.tokens: sequence of int64
    • splits:
      • train: 1851 examples, 646298717 bytes
    • download_size: 100311749 bytes
    • dataset_size: 646298717 bytes
  • config_name: subset_115

    • features:
      • line_no: int64
      • enA.id: string
      • enA.laser_score: float64
      • zhA.id: string
      • zhA.laser_score: float64
      • zhA.audio.tokens: sequence of int64
      • enA.audio.tokens: sequence of int64
    • splits:
      • train: 1821 examples, 641968057 bytes
    • download_size: 99667687 bytes
    • dataset_size: 641968057 bytes
  • config_name: subset_116

    • features:
      • line_no: int64
      • enA.id: string
      • enA.laser_score: float64
      • zhA.id: string
      • zhA.laser_score: float64
      • zhA.audio.tokens: sequence of int64
      • enA.audio.tokens: sequence of int64
    • splits:
      • train: 1837 examples, 640626123 bytes
    • download_size: 99365627 bytes
    • dataset_size: 640626123 bytes
  • config_name: subset_117

    • features:
      • line_no: int64
      • enA.id: string
      • enA.laser_score: float64
      • zhA.id: string
      • zhA.laser_score: float64
      • zhA.audio.tokens: sequence of int64
      • enA.audio.tokens: sequence of int64
    • splits:
      • train: 1854 examples, 646082877 bytes
    • download_size: 100377054 bytes
    • dataset_size: 646082877 bytes
  • config_name: subset_118

    • features:
      • line_no: int64
      • enA.id: string
      • enA.laser_score: float64
      • zhA.id: string
      • zhA.laser_score: float64
      • enA.audio.tokens: sequence of int64
      • zhA.audio.tokens: sequence of int64
    • splits:
      • train: 1814 examples, 627190139 bytes
    • download_size: 97295945 bytes
    • dataset_size: 627190139 bytes
  • config_name: subset_119

    • features:
      • line_no: int64
      • enA.id: string
      • enA.laser_score: float64
      • zhA.id: string
      • zhA.laser_score: float64
      • zhA.audio.tokens: sequence of int64
      • enA.audio.tokens: sequence of int64
    • splits:
      • train: 1823 examples, 633562188 bytes
    • download_size: 98314879 bytes
    • dataset_size: 633562188 bytes
  • config_name: subset_12

    • features:
      • line_no: int64
      • enA.id: string
      • enA.laser_score: float64
      • zhA.id: string
      • zhA.laser_score: float64
      • enA.audio.tokens: sequence of int64
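The per-configuration figures above lend themselves to a quick sanity check. The sketch below copies the numbers for three configurations from the listing and derives the average size per example and the ratio of `dataset_size` to `download_size`. The names `SplitInfo`, `bytes_per_example`, and `size_ratio` are illustrative. Reading the ratio as a compression factor assumes that `download_size` refers to the compressed files fetched from the Hub, which the listing itself does not state.

```python
from typing import Dict, NamedTuple


class SplitInfo(NamedTuple):
    examples: int        # number of train examples
    dataset_bytes: int   # dataset_size from the listing
    download_bytes: int  # download_size from the listing


# Values copied from the listing for three configurations.
SUBSETS: Dict[str, SplitInfo] = {
    "subset_1": SplitInfo(1962, 710852619, 109098575),
    "subset_10": SplitInfo(2031, 679175545, 104060255),
    "subset_100": SplitInfo(1891, 661577445, 102774345),
}


def bytes_per_example(info: SplitInfo) -> float:
    """Average dataset bytes per train example."""
    return info.dataset_bytes / info.examples


def size_ratio(info: SplitInfo) -> float:
    """How many times larger dataset_size is than download_size."""
    return info.dataset_bytes / info.download_bytes


if __name__ == "__main__":
    for name, info in SUBSETS.items():
        print(
            f"{name}: {bytes_per_example(info) / 1e3:.0f} kB per example, "
            f"dataset/download ratio {size_ratio(info):.2f}"
        )
```

The same two functions apply unchanged to any other configuration once its three numbers are added to `SUBSETS`.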