
manu/french-30b_separate

Source: Hugging Face | Updated: 2023-10-16 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/manu/french-30b_separate
Resource description:
---
configs:
- config_name: default
  data_files:
  - split: WmtEnFrTest
    path: data/WmtEnFrTest-*
  - split: EnglishFrenchWebpagesScrapedTranslatedTest
    path: data/EnglishFrenchWebpagesScrapedTranslatedTest-*
  - split: FrenchLibrispeechTextOnlyTest
    path: data/FrenchLibrispeechTextOnlyTest-*
  - split: FrenchPodcastsTest
    path: data/FrenchPodcastsTest-*
  - split: FrenchOpenSubtitlesTest
    path: data/FrenchOpenSubtitlesTest-*
  - split: OriginalSongsLyricsWithFrenchTranslationTest
    path: data/OriginalSongsLyricsWithFrenchTranslationTest-*
  - split: ProjectgutenbergFrTest
    path: data/ProjectgutenbergFrTest-*
  - split: BnfGallicaTest
    path: data/BnfGallicaTest-*
  - split: ThesesFr20132023Test
    path: data/ThesesFr20132023Test-*
  - split: LegiOpendataTest
    path: data/LegiOpendataTest-*
  - split: BaloOpendataTest
    path: data/BaloOpendataTest-*
  - split: JadeOpendataTest
    path: data/JadeOpendataTest-*
  - split: DoleOpendataTest
    path: data/DoleOpendataTest-*
  - split: SardeOpendataTest
    path: data/SardeOpendataTest-*
  - split: QrOpendataTest
    path: data/QrOpendataTest-*
  - split: JorfOpendataTest
    path: data/JorfOpendataTest-*
  - split: IncaOpendataTest
    path: data/IncaOpendataTest-*
  - split: AccoOpendataTest
    path: data/AccoOpendataTest-*
  - split: KaliOpendataTest
    path: data/KaliOpendataTest-*
  - split: DebatsOpendataTest
    path: data/DebatsOpendataTest-*
  - split: CnilOpendataTest
    path: data/CnilOpendataTest-*
  - split: CappOpendataTest
    path: data/CappOpendataTest-*
  - split: CassOpendataTest
    path: data/CassOpendataTest-*
  - split: ConstitOpendataTest
    path: data/ConstitOpendataTest-*
  - split: IlluinLayoutDatasetTextOnlyTest
    path: data/IlluinLayoutDatasetTextOnlyTest-*
  - split: WikisourceFrTest
    path: data/WikisourceFrTest-*
  - split: Wikipedia20220301.frTest
    path: data/Wikipedia20220301.frTest-*
  - split: Oscar2301FrTest
    path: data/Oscar2301FrTest-*
dataset_info:
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: dataset_id
    dtype: string
  splits:
  - name: WmtEnFrTest
    num_bytes: 933080
    num_examples: 3003
  - name: EnglishFrenchWebpagesScrapedTranslatedTest
    num_bytes: 3557903
    num_examples: 8580
  - name: FrenchLibrispeechTextOnlyTest
    num_bytes: 698968
    num_examples: 2582
  - name: FrenchPodcastsTest
    num_bytes: 505018
    num_examples: 100
  - name: FrenchOpenSubtitlesTest
    num_bytes: 3048714
    num_examples: 100
  - name: OriginalSongsLyricsWithFrenchTranslationTest
    num_bytes: 2156145
    num_examples: 756
  - name: ProjectgutenbergFrTest
    num_bytes: 39019119
    num_examples: 100
  - name: BnfGallicaTest
    num_bytes: 43160730
    num_examples: 100
  - name: ThesesFr20132023Test
    num_bytes: 3957037
    num_examples: 959
  - name: LegiOpendataTest
    num_bytes: 16589963
    num_examples: 10000
  - name: BaloOpendataTest
    num_bytes: 11094568
    num_examples: 1355
  - name: JadeOpendataTest
    num_bytes: 56977150
    num_examples: 5586
  - name: DoleOpendataTest
    num_bytes: 2065780
    num_examples: 100
  - name: SardeOpendataTest
    num_bytes: 1044391
    num_examples: 2244
  - name: QrOpendataTest
    num_bytes: 18924359
    num_examples: 100
  - name: JorfOpendataTest
    num_bytes: 11892298
    num_examples: 10000
  - name: IncaOpendataTest
    num_bytes: 27827026
    num_examples: 3737
  - name: AccoOpendataTest
    num_bytes: 36928857
    num_examples: 2541
  - name: KaliOpendataTest
    num_bytes: 7740933
    num_examples: 4306
  - name: DebatsOpendataTest
    num_bytes: 38200789
    num_examples: 100
  - name: CnilOpendataTest
    num_bytes: 1495015
    num_examples: 181
  - name: CappOpendataTest
    num_bytes: 9680857
    num_examples: 727
  - name: CassOpendataTest
    num_bytes: 8283986
    num_examples: 1422
  - name: ConstitOpendataTest
    num_bytes: 1340350
    num_examples: 100
  - name: IlluinLayoutDatasetTextOnlyTest
    num_bytes: 11714355
    num_examples: 4885
  - name: WikisourceFrTest
    num_bytes: 44358940
    num_examples: 10000
  - name: Wikipedia20220301.frTest
    num_bytes: 28814742
    num_examples: 10000
  - name: Oscar2301FrTest
    num_bytes: 51030875
    num_examples: 9834
  download_size: 0
  dataset_size: 483041948
---
# Dataset Card for "french-30b_separate"

[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
Provider:
manu
Summary of original information

Dataset overview

Dataset configuration

  • Config name: default
  • Data files:
    • WmtEnFrTest: path data/WmtEnFrTest-*
    • EnglishFrenchWebpagesScrapedTranslatedTest: path data/EnglishFrenchWebpagesScrapedTranslatedTest-*
    • FrenchLibrispeechTextOnlyTest: path data/FrenchLibrispeechTextOnlyTest-*
    • FrenchPodcastsTest: path data/FrenchPodcastsTest-*
    • FrenchOpenSubtitlesTest: path data/FrenchOpenSubtitlesTest-*
    • OriginalSongsLyricsWithFrenchTranslationTest: path data/OriginalSongsLyricsWithFrenchTranslationTest-*
    • ProjectgutenbergFrTest: path data/ProjectgutenbergFrTest-*
    • BnfGallicaTest: path data/BnfGallicaTest-*
    • ThesesFr20132023Test: path data/ThesesFr20132023Test-*
    • LegiOpendataTest: path data/LegiOpendataTest-*
    • BaloOpendataTest: path data/BaloOpendataTest-*
    • JadeOpendataTest: path data/JadeOpendataTest-*
    • DoleOpendataTest: path data/DoleOpendataTest-*
    • SardeOpendataTest: path data/SardeOpendataTest-*
    • QrOpendataTest: path data/QrOpendataTest-*
    • JorfOpendataTest: path data/JorfOpendataTest-*
    • IncaOpendataTest: path data/IncaOpendataTest-*
    • AccoOpendataTest: path data/AccoOpendataTest-*
    • KaliOpendataTest: path data/KaliOpendataTest-*
    • DebatsOpendataTest: path data/DebatsOpendataTest-*
    • CnilOpendataTest: path data/CnilOpendataTest-*
    • CappOpendataTest: path data/CappOpendataTest-*
    • CassOpendataTest: path data/CassOpendataTest-*
    • ConstitOpendataTest: path data/ConstitOpendataTest-*
    • IlluinLayoutDatasetTextOnlyTest: path data/IlluinLayoutDatasetTextOnlyTest-*
    • WikisourceFrTest: path data/WikisourceFrTest-*
    • Wikipedia20220301.frTest: path data/Wikipedia20220301.frTest-*
    • Oscar2301FrTest: path data/Oscar2301FrTest-*
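
Each split in the default config maps to a glob of the form data/<split>-*, so a single split can be requested by name. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and the repository is still reachable on the Hub. The helper names `split_pattern`, `matches_split`, and `load_split` are hypothetical, introduced only for illustration, and the shard filename in the comment is an assumed example, not a file listed on this page.

```python
from fnmatch import fnmatch

# Repository id as shown in the page title.
REPO_ID = "manu/french-30b_separate"


def split_pattern(split: str) -> str:
    """Return the data_files glob that the default config declares for a split."""
    return f"data/{split}-*"


def matches_split(filename: str, split: str) -> bool:
    """Check whether a repository file path belongs to the given split."""
    return fnmatch(filename, split_pattern(split))


def load_split(split: str):
    """Load one split from the Hub (needs network access and `pip install datasets`)."""
    # Imported lazily so the two helpers above also work without the library.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split)


# Usage (not executed here because it downloads data):
#   ds = load_split("WmtEnFrTest")
#   print(ds.column_names)   # the card declares the features id, text, dataset_id
#
# Hypothetical shard name, for illustration only:
#   matches_split("data/WmtEnFrTest-00000-of-00001.parquet", "WmtEnFrTest")
```

Loading by split name avoids downloading all 28 test splits when only one source (for example the WMT English-French test set) is needed.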

Dataset information

  • Features:
    • id: type string
    • text: type string
    • dataset_id: type string
  • Splits:
    • WmtEnFrTest: 933080 bytes, 3003 examples
    • EnglishFrenchWebpagesScrapedTranslatedTest: 3557903 bytes, 8580 examples
    • FrenchLibrispeechTextOnlyTest: 698968 bytes, 2582 examples
    • FrenchPodcastsTest: 505018 bytes, 100 examples
    • FrenchOpenSubtitlesTest: 3048714 bytes, 100 examples
    • OriginalSongsLyricsWithFrenchTranslationTest: 2156145 bytes, 756 examples
    • ProjectgutenbergFrTest: 39019119 bytes, 100 examples
    • BnfGallicaTest: 43160730 bytes, 100 examples
    • ThesesFr20132023Test: 3957037 bytes, 959 examples
    • LegiOpendataTest: 16589963 bytes, 10000 examples
    • BaloOpendataTest: 11094568 bytes, 1355 examples
    • JadeOpendataTest: 56977150 bytes, 5586 examples
    • DoleOpendataTest: 2065780 bytes, 100 examples
    • SardeOpendataTest: 1044391 bytes, 2244 examples
    • QrOpendataTest: 18924359 bytes, 100 examples
    • JorfOpendataTest: 11892298 bytes, 10000 examples
    • IncaOpendataTest: 27827026 bytes, 3737 examples
    • AccoOpendataTest: 36928857 bytes, 2541 examples
    • KaliOpendataTest: 7740933 bytes, 4306 examples
    • DebatsOpendataTest: 38200789 bytes, 100 examples
    • CnilOpendataTest: 1495015 bytes, 181 examples
    • CappOpendataTest: 9680857 bytes, 727 examples
    • CassOpendataTest: 8283986 bytes, 1422 examples
    • ConstitOpendataTest: 1340350 bytes, 100 examples
    • IlluinLayoutDatasetTextOnlyTest: 11714355 bytes, 4885 examples
    • WikisourceFrTest: 44358940 bytes, 10000 examples
    • Wikipedia20220301.frTest: 28814742 bytes, 10000 examples
    • Oscar2301FrTest: 51030875 bytes, 9834 examples
  • Download size: 0
  • Dataset size: 483041948 bytes
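
The per-split figures above can be cross-checked against the declared dataset size with a few lines of arithmetic. The sketch below copies the byte and example counts from the list; the names `SPLITS`, `DECLARED_DATASET_SIZE`, and `mean_bytes_per_example` are hypothetical, chosen only for this illustration.

```python
# (num_bytes, num_examples) per split, copied from the split list above.
SPLITS = {
    "WmtEnFrTest": (933080, 3003),
    "EnglishFrenchWebpagesScrapedTranslatedTest": (3557903, 8580),
    "FrenchLibrispeechTextOnlyTest": (698968, 2582),
    "FrenchPodcastsTest": (505018, 100),
    "FrenchOpenSubtitlesTest": (3048714, 100),
    "OriginalSongsLyricsWithFrenchTranslationTest": (2156145, 756),
    "ProjectgutenbergFrTest": (39019119, 100),
    "BnfGallicaTest": (43160730, 100),
    "ThesesFr20132023Test": (3957037, 959),
    "LegiOpendataTest": (16589963, 10000),
    "BaloOpendataTest": (11094568, 1355),
    "JadeOpendataTest": (56977150, 5586),
    "DoleOpendataTest": (2065780, 100),
    "SardeOpendataTest": (1044391, 2244),
    "QrOpendataTest": (18924359, 100),
    "JorfOpendataTest": (11892298, 10000),
    "IncaOpendataTest": (27827026, 3737),
    "AccoOpendataTest": (36928857, 2541),
    "KaliOpendataTest": (7740933, 4306),
    "DebatsOpendataTest": (38200789, 100),
    "CnilOpendataTest": (1495015, 181),
    "CappOpendataTest": (9680857, 727),
    "CassOpendataTest": (8283986, 1422),
    "ConstitOpendataTest": (1340350, 100),
    "IlluinLayoutDatasetTextOnlyTest": (11714355, 4885),
    "WikisourceFrTest": (44358940, 10000),
    "Wikipedia20220301.frTest": (28814742, 10000),
    "Oscar2301FrTest": (51030875, 9834),
}

# Value of dataset_size as declared in the card metadata.
DECLARED_DATASET_SIZE = 483041948

total_bytes = sum(num_bytes for num_bytes, _ in SPLITS.values())
total_examples = sum(num_examples for _, num_examples in SPLITS.values())


def mean_bytes_per_example(split: str) -> float:
    """Average document size in bytes for one split."""
    num_bytes, num_examples = SPLITS[split]
    return num_bytes / num_examples


# Split whose documents are longest on average.
longest_docs_split = max(SPLITS, key=mean_bytes_per_example)

print(f"splits: {len(SPLITS)}")
print(f"total bytes: {total_bytes} (declared: {DECLARED_DATASET_SIZE})")
print(f"total examples: {total_examples}")
print(f"longest documents on average: {longest_docs_split}")
```

The per-split byte counts sum to exactly the declared dataset size. The mean size per example also shows why several splits hold only 100 examples: sources such as BnfGallicaTest and ProjectgutenbergFrTest consist of book-length documents, while WmtEnFrTest holds short sentence-level entries.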