NilanE/ParallelFiction-Ja_En-100k
Source: Hugging Face. Updated 2024-06-02; indexed 2024-03-29.
Download link:
https://hf-mirror.com/datasets/NilanE/ParallelFiction-Ja_En-100k
Resource description:
---
license: apache-2.0
task_categories:
- translation
language:
- ja
- en
---
# Dataset details:
Each entry in this dataset pairs a Japanese web novel chapter with its English fan translation, aligned at the sentence level.
The intended use case is document translation.
# Dataset format:
```json
{
  "src": "JAPANESE WEB NOVEL CHAPTER",
  "trg": "CORRESPONDING ENGLISH TRANSLATION",
  "meta": {
    "general": {
      "series_title_eng": "ENGLISH SERIES TITLE",
      "series_title_jap": "JAPANESE SERIES TITLE",
      "sentence_alignment_score": "ALIGNMENT SCORE"
    },
    "novelupdates": {
      "link": "NOVELUPDATES URL",
      "genres": "NOVELUPDATES GENRES",
      "tags": "NOVELUPDATES TAGS (think sub-genres)",
      "rating": "NOVELUPDATES RATING (X/5)",
      "rating_votes": "NOVELUPDATES RATING VOTES"
    },
    "syosetu": {
      "link": "SYOSETU URL",
      "series_active": "IS THE SERIES STILL UP ON SYOSETU (is false for 3 series, each one has no syosetu metadata beyond the link and active status)",
      "writer": "AUTHOR'S NAME ON SYOSETU",
      "fav_novel_cnt": "FROM SYOSETU API FOR CHECKING SERIES QUALITY",
      "global_points": "ALSO FROM SYOSETU API FOR CHECKING SERIES QUALITY"
    }
  }
}
```
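As a rough illustration of the nested layout above, the following sketch flattens one entry into a single-level dict. The function name `flatten_entry` and the sample record are hypothetical, and the value types (for example, whether `series_active` is a boolean) are assumptions, since the card only describes the fields in words. Lookups under `meta` use `.get` because the card notes that three series carry no Syosetu metadata beyond the link and active status.

```python
from typing import Any


def flatten_entry(entry: dict[str, Any]) -> dict[str, Any]:
    """Flatten one entry into a single-level dict of commonly used fields."""
    meta = entry.get("meta", {})
    general = meta.get("general", {})
    novelupdates = meta.get("novelupdates", {})
    syosetu = meta.get("syosetu", {})
    return {
        "src": entry["src"],
        "trg": entry["trg"],
        "series_title_eng": general.get("series_title_eng"),
        "series_title_jap": general.get("series_title_jap"),
        "sentence_alignment_score": general.get("sentence_alignment_score"),
        "novelupdates_link": novelupdates.get("link"),
        "syosetu_link": syosetu.get("link"),
        "series_active": syosetu.get("series_active"),
        # Expected to be None for series that are no longer up on Syosetu.
        "writer": syosetu.get("writer"),
    }


# Hypothetical record with placeholder values, shaped like the schema above.
# It mimics an inactive series: only the link and active status are present.
example = {
    "src": "JAPANESE WEB NOVEL CHAPTER",
    "trg": "CORRESPONDING ENGLISH TRANSLATION",
    "meta": {
        "general": {
            "series_title_eng": "ENGLISH SERIES TITLE",
            "series_title_jap": "JAPANESE SERIES TITLE",
            "sentence_alignment_score": 0.9,  # assumed numeric
        },
        "novelupdates": {"link": "NOVELUPDATES URL"},
        "syosetu": {"link": "SYOSETU URL", "series_active": False},
    },
}

flat = flatten_entry(example)
```

If the entries are loaded with the Hugging Face `datasets` library, a call along the lines of `load_dataset("NilanE/ParallelFiction-Ja_En-100k", split="train")` would presumably yield dicts of this shape; the split name is an assumption and has not been checked against the repository.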
This is version 2 of the dataset. It contains more chapters (103K -> 106K), but has slightly fewer tokens due to an overhaul of the alignment code.
This version should fix the issues found in discussions #3 and #4, adds series-specific metadata as requested in #1, and does not remove chapter titles.
No translation quality filtering has been applied to the dataset. Methods for doing so are being researched.
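Until such methods are available, one crude stopgap is to drop entries whose `sentence_alignment_score` falls below a threshold. The sketch below is not the author's method and makes several assumptions: that the score is numeric (or a numeric string), that higher means better alignment, and that the threshold of 0.5 is meaningful, which it may not be. The score also measures alignment rather than translation quality, so this is a weak proxy at best. The helper name `filter_by_alignment` and the sample entries are hypothetical.

```python
from typing import Any, Iterable, Iterator


def filter_by_alignment(
    entries: Iterable[dict[str, Any]], min_score: float
) -> Iterator[dict[str, Any]]:
    """Yield entries whose alignment score is at least min_score."""
    for entry in entries:
        raw = (
            entry.get("meta", {})
            .get("general", {})
            .get("sentence_alignment_score")
        )
        try:
            score = float(raw)
        except (TypeError, ValueError):
            # Skip entries with a missing or non-numeric score.
            continue
        if score >= min_score:
            yield entry


def make_entry(src: str, score: Any) -> dict[str, Any]:
    """Build a tiny placeholder entry with only the fields used here."""
    return {
        "src": src,
        "trg": "",
        "meta": {"general": {"sentence_alignment_score": score}},
    }


# Hypothetical entries; the scores are invented for illustration.
sample = [
    make_entry("a", 0.9),
    make_entry("b", 0.4),
    make_entry("c", None),
    make_entry("d", "0.75"),
]

# The 0.5 threshold is arbitrary and would need tuning on real data.
kept = list(filter_by_alignment(sample, min_score=0.5))
```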
# License note:
The texts and site-specific metadata are distributed under fair use principles, with everything else being under an Apache 2.0 license.
If an author, translator or one of the sites mentioned above requests a takedown of one or more series, it will be promptly addressed.
Takedowns can be requested by opening a Hugging Face discussion.
I am not a lawyer, and the above notice is probably not legally sound. As such, I recommend discretion when using the contents of the dataset.
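Provider:
NilanE
Summary of original information
Dataset details
Dataset description
Each entry pairs a Japanese web novel chapter with its corresponding English fan translation. The dataset is intended for document translation tasks.
Dataset format
The dataset is stored in JSON format, and each entry contains the following fields:
- src: the Japanese web novel chapter
- trg: the corresponding English translation
- meta: metadata, comprising:
  - general: general information, such as the series title (English and Japanese) and the sentence alignment score
  - novelupdates: information from NovelUpdates, such as the link, genres, tags, rating, and number of rating votes
  - syosetu: information from Syosetu, such as the link, whether the series is still up on the site, the author's name, the favorite count, and global points
Version information
This is version 2 of the dataset. It contains more chapters (103K -> 106K) but slightly fewer tokens, owing to an overhaul of the alignment code. This version fixes the issues from discussions #3 and #4 and adds series-specific metadata. No translation quality filtering has been applied.
License
The texts and site-specific metadata are distributed under fair use principles; everything else is under the Apache 2.0 license. If an author, translator, or one of the sites concerned requests the removal of a series, the request will be handled promptly. Takedown requests can be made by opening a Hugging Face discussion.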



