
fujiki/wiki40b_ja

Source: Hugging Face. Updated 2023-04-28; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/fujiki/wiki40b_ja
Description:
---
license: cc-by-sa-4.0
language:
- ja
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 1954209746
    num_examples: 745392
  - name: validation
    num_bytes: 107186201
    num_examples: 41576
  - name: test
    num_bytes: 107509760
    num_examples: 41268
  download_size: 420085060
  dataset_size: 2168905707
---

This dataset is a reformatted version of the Japanese portion of the [wiki40b](https://aclanthology.org/2020.lrec-1.297/) dataset.

When you use this dataset, please cite the original paper:

```
@inproceedings{guo-etal-2020-wiki,
    title = "{W}iki-40{B}: Multilingual Language Model Dataset",
    author = "Guo, Mandy and Dai, Zihang and Vrande{\v{c}}i{\'c}, Denny and Al-Rfou, Rami",
    booktitle = "Proceedings of the Twelfth Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2020.lrec-1.297",
    pages = "2440--2452",
    abstract = "We propose a new multilingual language model benchmark that is composed of 40+ languages spanning several scripts and linguistic families. With around 40 billion characters, we hope this new resource will accelerate the research of multilingual modeling. We train monolingual causal language models using a state-of-the-art model (Transformer-XL) establishing baselines for many languages. We also introduce the task of multilingual causal language modeling where we train our model on the combined text of 40+ languages from Wikipedia with different vocabulary sizes and evaluate on the languages individually. We released the cleaned-up text of 40+ Wikipedia language editions, the corresponding trained monolingual language models, and several multilingual language models with different fixed vocabulary sizes.",
    language = "English",
    ISBN = "979-10-95546-34-4",
}
```
Provider:
fujiki
Summary of original information

Dataset overview

Basic information

  • License: cc-by-sa-4.0
  • Language: Japanese (ja)

Dataset structure

  • Features:
    • text (string)
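
Since each record exposes a single `text` string, reading the data can be sketched as below. This is a minimal sketch, not an official recipe: it assumes the Hugging Face `datasets` package is installed and that the repository is still published under the id `fujiki/wiki40b_ja`. The helper names `load_wiki40b_ja` and `first_text` are hypothetical and introduced only for illustration.

```python
def load_wiki40b_ja(split="validation"):
    """Load one split ("train", "validation" or "test") from the Hugging Face Hub.

    Assumption: the `datasets` package is installed and the dataset id
    "fujiki/wiki40b_ja" is still available. Calling this needs network access.
    """
    # Imported lazily so that defining the helper does not require `datasets`.
    from datasets import load_dataset

    return load_dataset("fujiki/wiki40b_ja", split=split)


def first_text(dataset):
    """Return the `text` field of the first record of an indexable dataset."""
    return dataset[0]["text"]
```

For example, `first_text(load_wiki40b_ja("validation"))[:200]` would return the first 200 characters of the first validation record. The validation split is used as the default here only because it is much smaller than the training split.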

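Data splits

  • Training set:
    • Number of examples: 745392
    • Size in bytes: 1954209746
  • Validation set:
    • Number of examples: 41576
    • Size in bytes: 107186201
  • Test set:
    • Number of examples: 41268
    • Size in bytes: 107509760

Dataset size

  • Download size: 420085060 bytes
  • Total size: 2168905707 bytes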
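
The figures above can be cross-checked with a little arithmetic. The sketch below uses only the numbers listed on this page; the variable names are illustrative. It confirms that the three split sizes add up to the stated total and derives each split's share of the examples. Reading the download-to-total ratio as a compression ratio is an assumption, since the page does not say how the download is packaged.

```python
# Figures copied from the dataset card above.
SPLITS = {
    "train": {"num_examples": 745392, "num_bytes": 1954209746},
    "validation": {"num_examples": 41576, "num_bytes": 107186201},
    "test": {"num_examples": 41268, "num_bytes": 107509760},
}
DOWNLOAD_SIZE = 420085060
DATASET_SIZE = 2168905707

total_examples = sum(s["num_examples"] for s in SPLITS.values())
total_bytes = sum(s["num_bytes"] for s in SPLITS.values())

# True when the per-split byte counts add up to the stated dataset size.
sizes_consistent = total_bytes == DATASET_SIZE

# Fraction of all examples that falls into each split.
split_share = {
    name: s["num_examples"] / total_examples for name, s in SPLITS.items()
}

# Download size relative to the total size (assumed to reflect compression).
download_ratio = DOWNLOAD_SIZE / DATASET_SIZE

print(f"examples: {total_examples}, sizes consistent: {sizes_consistent}")
for name, share in split_share.items():
    print(f"{name}: {share:.2%}")
print(f"download / total: {download_ratio:.3f}")
```

The split works out to roughly 90% training, 5% validation and 5% test.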