JParaCrawl
Dataset Card for JParaCrawl
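Dataset Summary
JParaCrawl is the largest publicly available English-Japanese parallel corpus, created by NTT. It was built by large-scale web crawling and automatic alignment of parallel sentences.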
Dataset Info
Features
- translation
  - en: string
  - ja: string
Splits
- train
  - Size in bytes: 1084069907
  - Number of examples: 3669859
Download and Dataset Size
- Download size: 603669921 bytes
- Dataset size: 1084069907 bytes
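As a rough sanity check on the figures above, the following sketch derives the average size of one sentence pair and the ratio of download size to prepared dataset size. The constant and function names are illustrative only, and the values are copied from this card on the assumption that they are byte counts.

```python
# Figures copied from the dataset card; assumed to be byte counts.
TRAIN_NUM_BYTES = 1_084_069_907
TRAIN_NUM_EXAMPLES = 3_669_859
DOWNLOAD_SIZE = 603_669_921
DATASET_SIZE = 1_084_069_907


def bytes_per_example(num_bytes: int, num_examples: int) -> float:
    """Average stored size of one en/ja pair, in bytes."""
    return num_bytes / num_examples


def download_ratio(download_size: int, dataset_size: int) -> float:
    """Fraction of the prepared dataset size that has to be downloaded."""
    return download_size / dataset_size


avg = bytes_per_example(TRAIN_NUM_BYTES, TRAIN_NUM_EXAMPLES)
ratio = download_ratio(DOWNLOAD_SIZE, DATASET_SIZE)
print(f"{avg:.1f} bytes per example on average")
print(f"download is {ratio:.1%} of the prepared dataset size")
```

An average of a few hundred bytes per pair is consistent with sentence-level alignment rather than document-level alignment.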
Configs
- default
  - Data files:
    - train: data/train-*
How to Use
```python
from datasets import load_dataset

dataset = load_dataset("Hoshikuzu/JParaCrawl")
```
If loading takes too long, you can stream the data instead:
```python
from datasets import load_dataset

dataset = load_dataset("Hoshikuzu/JParaCrawl", streaming=True)
```
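With streaming enabled, records arrive lazily, so it can help to look at only the first few before committing to a full pass. The sketch below is one way to do that. The helper names `load_streaming_train` and `preview` are hypothetical and not part of the `datasets` library, and `load_streaming_train` assumes the `datasets` package is installed and the network is reachable.

```python
from itertools import islice
from typing import Iterable


def load_streaming_train() -> Iterable[dict]:
    """Open the train split lazily (needs `datasets` and network access)."""
    # Imported inside the function so that `preview` stays dependency-free.
    from datasets import load_dataset

    return load_dataset("Hoshikuzu/JParaCrawl", split="train", streaming=True)


def preview(records: Iterable[dict], n: int = 3) -> list:
    """Return the first n records, pulling no more than n from the stream."""
    return list(islice(records, n))
```

A typical call would be `preview(load_streaming_train(), n=5)`, which fetches only as much data as the five records require.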
Data Instances
```json
{
  "en": "Of course, we’ll keep the important stuff, but we’ll try to sell as much as possible of the stuff we don’t need. afterwards I feel like we can save money by reducing things and making life related patterns too.",
  "ja": "もちろん大切なものは取っておきますが、なくても困らないものはなるべく売るようにします。 さいごに ものを減らして、生活関連もパターン化することでお金は貯まる気がしています。"
}
```
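The feature list on this card nests `en` and `ja` under a `translation` field, while the instance above is shown flat. Since the card does not say which layout a loaded record actually has, a small helper that accepts either one may be convenient. `get_pair` is a hypothetical name, not part of any library.

```python
from typing import Mapping, Tuple


def get_pair(record: Mapping) -> Tuple[str, str]:
    """Return (english, japanese) from a record in either layout.

    Nested layout: {"translation": {"en": ..., "ja": ...}}
    Flat layout:   {"en": ..., "ja": ...}
    """
    # Fall back to the record itself when there is no "translation" key.
    inner = record.get("translation", record)
    return inner["en"], inner["ja"]
```

For example, `en, ja = get_pair(record)` works the same way for both layouts.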
Licensing Information
JParaCrawl is distributed under its own license. For details, see https://www.kecl.ntt.co.jp/icl/lirg/jparacrawl/.
Data Splits
Only the train split is provided.
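Because there is no validation or test split, users who need one have to carve it out of train themselves. For a fully downloaded dataset, `Dataset.train_test_split` from the `datasets` library is the usual route. The sketch below shows an alternative that also works on streamed records: a deterministic hash-based assignment. The function name `is_validation` and the 1% default are assumptions made for illustration, not something the corpus authors prescribe.

```python
import hashlib
from typing import Mapping


def is_validation(record: Mapping, percent: int = 1) -> bool:
    """Deterministically assign roughly `percent`% of records to validation."""
    # Accept both the nested ("translation") and the flat record layout.
    pair = record.get("translation", record)
    key = (pair["en"] + "\t" + pair["ja"]).encode("utf-8")
    # The first 8 bytes of SHA-256 give a stable bucket in the range 0..99.
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % 100
    return bucket < percent
```

Since the assignment depends only on the text of the pair, the same record always lands in the same split across runs and machines, and exact duplicate pairs cannot leak between the two splits.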
Citation Information
```bibtex
@inproceedings{morishita-etal-2020-jparacrawl,
    title = "{JP}ara{C}rawl: A Large Scale Web-Based {E}nglish-{J}apanese Parallel Corpus",
    author = "Morishita, Makoto and Suzuki, Jun and Nagata, Masaaki",
    booktitle = "Proceedings of The 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://www.aclweb.org/anthology/2020.lrec-1.443",
    pages = "3603--3609",
    ISBN = "979-10-95546-34-4",
}
```




