cdleong/piglatin-mt

Name: cdleong/piglatin-mt
Creator: cdleong
Published: 2022-10-24 19:22:09
License: MIT (per the dataset card)

Hugging Face · Updated 2022-10-24 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/cdleong/piglatin-mt


Resource description:

---
language:
- en
license:
- mit
multilinguality:
- translation
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- translation
task_ids: []
language_details: eng and engyay
---

## Dataset Description

- **Homepage:** cdleong.github.io

# Dataset Summary

Pig Latin and English parallel machine translation corpus.

Based on [The Project Gutenberg EBook of "De Bello Gallico" and Other Commentaries](https://www.gutenberg.org/ebooks/10657)

Converted to Pig Latin with https://github.com/bpabel/piglatin

Blank lines removed.

## Dataset Structure

```
DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 14778
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 1000
    })
})
```

### Data Instances

```
{
    'translation': {
        'eng': 'thrown into disorder they returned with more precipitation than is usual',
        'engyay': 'own-thray into-ay isorder-day ey-thay eturned-ray ith-way ore-may ecipitation-pray an-thay is-ay usual-ay'
    }
}
```

### Data Fields

- `translation`: a dictionary containing two strings, each paired with a key indicating its language (`eng` or `engyay`).

### Data Splits

- `train`: most of the data, 13,232 samples total.
- `validation` (dev): 1k holdout samples, created with the `datasets.train_test_split()` function.
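The card says the holdout set was carved out with `train_test_split()`. As a rough illustration of that idea only, here is a pure-Python sketch of a fixed-size shuffled holdout. It is not the `datasets` library's implementation; the function name `holdout_split`, the seed, and the toy rows are all hypothetical.

```python
import random


def holdout_split(rows, holdout_size=1000, seed=0):
    """Shuffle a copy of `rows` and set aside `holdout_size` items for validation."""
    if not 0 <= holdout_size <= len(rows):
        raise ValueError("holdout_size must be between 0 and len(rows)")
    shuffled = list(rows)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    return {
        "train": shuffled[holdout_size:],
        "validation": shuffled[:holdout_size],
    }


# Toy stand-in for the corpus lines (hypothetical data, not from the dataset).
rows = [f"line {i}" for i in range(5000)]
splits = holdout_split(rows, holdout_size=1000, seed=0)
print(len(splits["train"]), len(splits["validation"]))
```

Fixing the seed keeps the split stable across runs, which matters when the validation set is meant to stay disjoint from the training data.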

Provider:

cdleong

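Summary of original information

Dataset overview

Basic information

Languages: English (eng) and Pig Latin (engyay)
License: MIT
Multilinguality: translation
Size: 10K<n<100K
Source datasets: original
Task categories: translation

Dataset description

Summary: A Pig Latin and English parallel machine translation corpus.
Source: Based on The Project Gutenberg EBook of "De Bello Gallico" and Other Commentaries, converted to Pig Latin.

Dataset structure

Dataset dictionary: contains a training split and a validation split
- Training split: 14,778 rows, with the feature translation
- Validation split: 1,000 rows, with the feature translation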

Data instances

Example:

{ 'translation': { 'eng': 'thrown into disorder they returned with more precipitation than is usual', 'engyay': 'own-thray into-ay isorder-day ey-thay eturned-ray ith-way ore-may ecipitation-pray an-thay is-ay usual-ay' } }
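The example above suggests a simple transformation: the leading consonant cluster moves behind a hyphen and "ay" is appended, while vowel-initial words just get "-ay". The sketch below reproduces that pattern for the sample sentence. It is an assumption inferred from this one example, not the code of the bpabel/piglatin library; the function names are hypothetical, and punctuation and capitalization are not handled.

```python
VOWELS = set("aeiou")


def to_pig_latin_word(word: str) -> str:
    """Move the leading consonant cluster behind a hyphen and append 'ay'."""
    lower = word.lower()
    split = 0
    while split < len(lower):
        ch = lower[split]
        # Assumption: 'y' counts as a vowel except at the start of a word.
        if ch in VOWELS or (ch == "y" and split > 0):
            break
        split += 1
    head, rest = word[:split], word[split:]
    return f"{rest}-{head}ay"


def to_pig_latin(sentence: str) -> str:
    """Convert a whitespace-separated sentence word by word."""
    return " ".join(to_pig_latin_word(w) for w in sentence.split())


eng = "thrown into disorder they returned with more precipitation than is usual"
print(to_pig_latin(eng))
```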

Data fields

translation: a dictionary containing two strings, keyed by language: English (eng) and Pig Latin (engyay).
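For translation work it is often convenient to flatten each record's `translation` dictionary into a (source, target) pair. A minimal sketch, assuming records shaped like the example instance; `iter_pairs` is a hypothetical helper, not part of any library.

```python
from typing import Dict, Iterable, Iterator, Tuple

Record = Dict[str, Dict[str, str]]


def iter_pairs(
    records: Iterable[Record], src: str = "eng", tgt: str = "engyay"
) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) strings from records with a 'translation' dict."""
    for record in records:
        pair = record["translation"]
        yield pair[src], pair[tgt]


# One record copied from the example instance.
sample = [
    {
        "translation": {
            "eng": "thrown into disorder they returned with more precipitation than is usual",
            "engyay": "own-thray into-ay isorder-day ey-thay eturned-ray ith-way ore-may ecipitation-pray an-thay is-ay usual-ay",
        }
    }
]

pairs = list(iter_pairs(sample))
reverse_pairs = list(iter_pairs(sample, src="engyay", tgt="eng"))
print(pairs[0])
```

Swapping `src` and `tgt` gives the Pig Latin to English direction from the same records.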

Data splits

Training split: 13,232 samples
Validation split: 1,000 samples, created with the datasets.train_test_split() function
