sappho192/Tatoeba-Challenge-jpn-kor
Source: Hugging Face, updated 2024-01-30, indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/sappho192/Tatoeba-Challenge-jpn-kor
Resource description:
---
license: cc-by-nc-sa-4.0
task_categories:
- translation
language:
- ja
- ko
size_categories:
- 10M<n<100M
---
# Dataset Card for Tatoeba-Challenge-jpn-kor
This dataset contains Japanese-Korean sentence pairs taken from [Helsinki-NLP/Tatoeba-Challenge](https://github.com/Helsinki-NLP/Tatoeba-Challenge/blob/master/data/README-v2023-09-26.md).
## Dataset Details
### Dataset Description
- **Curated by:** [Helsinki-NLP](https://github.com/Helsinki-NLP)
- **Language(s) (NLP):** Japanese-Korean
- **License:** CC BY-NC-SA 4.0
### Dataset Sources
- **Repository:** [Helsinki-NLP/Tatoeba-Challenge](https://github.com/Helsinki-NLP/Tatoeba-Challenge/blob/master/data/README-v2023-09-26.md)
- **Detail**: Japanese - Korean [jpn-kor](https://object.pouta.csc.fi/Tatoeba-Challenge-v2023-09-26/jpn-kor.tar)
## Uses
This dataset can be used to train a translation model that translates Japanese sentences into Korean.
### Out-of-Scope Use
In line with the CC BY-NC-SA 4.0 license, this dataset must not be used to train models intended for commercial services.
## Dataset Structure
Each dataset has two columns, `sourceString` and `targetString`, which correspond to the Japanese and Korean sentences, respectively.
See the [example code](https://huggingface.co/datasets/sappho192/Tatoeba-Challenge-jpn-kor/blob/main/example.ipynb) to learn how to load the dataset.
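
As a minimal sketch of working with these two columns, the snippet below converts one row into the nested `translation` layout that many Hugging Face translation examples expect. The helper names `to_translation_record` and `load_pairs`, the sample sentences, and the `"train"` split name are illustrative assumptions rather than part of this dataset; the linked notebook remains the reference for how the author loads the data.

```python
from typing import Dict, Iterator


def to_translation_record(row: Dict[str, str]) -> Dict[str, Dict[str, str]]:
    """Map one row with `sourceString`/`targetString` to a ja/ko pair."""
    return {
        "translation": {
            "ja": row["sourceString"].strip(),
            "ko": row["targetString"].strip(),
        }
    }


def load_pairs(repo_id: str = "sappho192/Tatoeba-Challenge-jpn-kor") -> Iterator[dict]:
    """Stream rows from the Hub and convert them one by one.

    Assumes `pip install datasets` and network access. The split name
    "train" is an assumption; check the dataset viewer or the example
    notebook for the splits that actually exist.
    """
    # Imported lazily so the pure helper above also works without `datasets`.
    from datasets import load_dataset

    stream = load_dataset(repo_id, split="train", streaming=True)
    for row in stream:
        yield to_translation_record(row)


# Hypothetical sample row, shaped like the columns described above.
sample = {"sourceString": "こんにちは。 ", "targetString": "안녕하세요."}
print(to_translation_record(sample))
```

Streaming is used here only because the card lists the size as 10M to 100M rows; loading without `streaming=True` should also work if enough disk space is available.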
## Dataset Creation
### Personal and Sensitive Information
This dataset may contain inappropriate or explicit sentences, as well as the following kinds of information:
- personal information
- sensitive information
- private information
- data that reveals addresses
- uniquely identifiable names or aliases
- racial or ethnic origins
- sexual orientations
- religious beliefs
- political opinions
- financial or health data
- etc.

Use it at your own risk.
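
Since such content may be present, one option is to screen sentence pairs before training. The sketch below is a hypothetical heuristic filter and is not provided with the dataset: it drops a pair when either side contains something that looks like an email address, a URL, or a long digit run. The patterns and the names `looks_sensitive` and `filter_pairs` are assumptions, and they will miss most of the categories listed above (names, addresses, opinions).

```python
import re
from typing import Dict, Iterable, List

# Illustrative patterns only; they are assumptions, not part of the dataset.
_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),  # email-like strings
    re.compile(r"https?://\S+"),              # URLs
    re.compile(r"\d[\d\- ]{8,}\d"),           # long digit runs, e.g. phone numbers
]


def looks_sensitive(text: str) -> bool:
    """Return True if any heuristic pattern matches the text."""
    return any(p.search(text) for p in _PATTERNS)


def filter_pairs(rows: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only rows where neither column triggers a pattern."""
    return [
        row
        for row in rows
        if not looks_sensitive(row["sourceString"])
        and not looks_sensitive(row["targetString"])
    ]


# Hypothetical rows shaped like the dataset's two columns.
rows = [
    {"sourceString": "こんにちは。", "targetString": "안녕하세요."},
    {"sourceString": "電話は090-1234-5678です。", "targetString": "전화는 090-1234-5678입니다."},
]
kept = filter_pairs(rows)
print(len(kept), "of", len(rows), "pairs kept")
```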
## Citation
**BibTeX:**
```bibtex
@inproceedings{tiedemann-2020-tatoeba,
title = "The {T}atoeba {T}ranslation {C}hallenge {--} {R}ealistic Data Sets for Low Resource and Multilingual {MT}",
author = {Tiedemann, J{\"o}rg},
booktitle = "Proceedings of the Fifth Conference on Machine Translation",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.wmt-1.139",
pages = "1174--1182"
}
```
## Dataset Card Authors
[sappho192](https://huggingface.co/sappho192)
## Dataset Card Contact
Please open a discussion thread in the dataset's Community tab.
Provider:
sappho192