
NilanE/ParallelFiction-Ja_En-100k

Source: Hugging Face. Updated 2024-06-02. Indexed 2024-03-29.
Download link:
https://hf-mirror.com/datasets/NilanE/ParallelFiction-Ja_En-100k
Resource overview:
---
license: apache-2.0
task_categories:
- translation
language:
- ja
- en
---

# Dataset details:
Each entry in this dataset is a sentence-aligned Japanese web novel chapter and English fan translation.
The intended use-case is for document translation tasks.

# Dataset format:
```json
{
    "src": "JAPANESE WEB NOVEL CHAPTER",
    "trg": "CORRESPONDING ENGLISH TRANSLATION",
    "meta": {
        "general": {
            "series_title_eng": "ENGLISH SERIES TITLE",
            "series_title_jap": "JAPANESE SERIES TITLE",
            "sentence_alignment_score": "ALIGNMENT SCORE"
        },
        "novelupdates": {
            "link": "NOVELUPDATES URL",
            "genres": "NOVELUPDATES GENRES",
            "tags": "NOVELUPDATES TAGS (think sub-genres)",
            "rating": "NOVELUPDATES RATING (X/5)",
            "rating_votes": "NOVELUPDATES RATING VOTES"
        },
        "syosetu": {
            "link": "SYOSETU URL",
            "series_active": "IS THE SERIES STILL UP ON SYOSETU (is false for 3 series, each one has no syosetu metadata beyond the link and active status)",
            "writer": "AUTHOR'S NAME ON SYOSETU",
            "fav_novel_cnt": "FROM SYOSETU API FOR CHECKING SERIES QUALITY",
            "global_points": "ALSO FROM SYOSETU API FOR CHECKING SERIES QUALITY"
        }
    }
}
```

This is version 2 of the dataset. It contains more chapters (103K -> 106K), but has slightly fewer tokens due to an overhaul of the alignment code.
This version should fix the issues found in discussions #3 and #4, adds series-specific metadata as requested in #1, and does not remove chapter titles.

No translation quality filtering has been applied to the dataset. Methods for doing so are being researched.

# License note:
The texts and site-specific metadata are distributed under fair use principles, with everything else being under an Apache 2.0 license.
If an author, translator or one of the sites mentioned above requests a takedown of one or more series, it will be promptly addressed.
Takedowns can be requested through the creation of a Hugging Face discussion.

I am not a lawyer, and the above notice is probably not legally sound. As such, I recommend discretion when using the contents of the dataset.
Provider:
NilanE
Summary of original information

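Dataset details

Dataset description

Each entry pairs a Japanese web novel chapter with its corresponding English fan translation. The dataset is intended for document translation tasks.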

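Dataset format

The dataset is stored in JSON format. Each entry contains the following fields:

  • src: the Japanese web novel chapter
  • trg: the corresponding English translation
  • meta: metadata, comprising:
    • general: general information, such as the series title (English and Japanese) and the sentence alignment score
    • novelupdates: information from the NovelUpdates site, such as the link, genres, tags, rating, and number of rating votes
    • syosetu: information from the Syosetu site, such as the link, whether the series is still available on Syosetu, the author's name, the favorites count, and global points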
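The field layout above can be sketched as a small Python reader. This is a minimal illustration only: the sample record is made up, the value types (a numeric score, list-valued genres and tags) are assumptions not confirmed by the source, and `summarize_entry` is a hypothetical helper rather than part of any dataset tooling.

```python
import json

# Hypothetical sample record that follows the documented field layout.
# Every value here is a placeholder, and the value types are assumptions.
SAMPLE = json.loads("""
{
  "src": "日本語の章本文",
  "trg": "English chapter text",
  "meta": {
    "general": {
      "series_title_eng": "Example Series",
      "series_title_jap": "例のシリーズ",
      "sentence_alignment_score": 0.9
    },
    "novelupdates": {
      "link": "", "genres": [], "tags": [], "rating": 4.0, "rating_votes": 10
    },
    "syosetu": {"link": "", "series_active": true}
  }
}
""")


def summarize_entry(entry):
    """Return a flat summary of one record, tolerating missing syosetu fields."""
    meta = entry["meta"]
    syosetu = meta.get("syosetu", {})
    return {
        "title": meta["general"]["series_title_eng"],
        "src_chars": len(entry["src"]),
        "trg_chars": len(entry["trg"]),
        # Some series carry no syosetu metadata beyond the link and status,
        # so optional keys are read with .get() instead of indexing.
        "writer": syosetu.get("writer"),
        "active": syosetu.get("series_active"),
    }


summary = summarize_entry(SAMPLE)
print(summary)
```

Reading the optional `syosetu` keys with `.get()` keeps the helper from raising `KeyError` on records that only carry the link and the active flag.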

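Version information

This is version 2 of the dataset. It contains more chapters (up from 103K to 106K), but the token count is slightly lower because the alignment code was reworked. This version fixes the issues raised in discussions #3 and #4 and adds series-specific metadata. No translation quality filtering has been applied.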
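Because the dataset ships unfiltered, users may want to apply their own filter. The following is a rough sketch, not the author's method: it assumes `sentence_alignment_score` is numeric (or a numeric string) and that higher means better alignment, neither of which the source states. The function name and the 0.5 threshold are hypothetical, and an alignment score is at best a proxy, not a measure of translation quality.

```python
def filter_by_alignment(entries, min_score):
    """Keep entries whose alignment score parses as a number >= min_score.

    Assumption: higher scores mean better alignment. Check the actual
    score distribution before choosing a threshold.
    """
    kept = []
    for entry in entries:
        general = entry.get("meta", {}).get("general", {})
        raw = general.get("sentence_alignment_score")
        try:
            score = float(raw)
        except (TypeError, ValueError):
            continue  # skip entries with a missing or non-numeric score
        if score >= min_score:
            kept.append(entry)
    return kept


# Made-up records for illustration only.
entries = [
    {"src": "a", "trg": "a", "meta": {"general": {"sentence_alignment_score": 0.9}}},
    {"src": "b", "trg": "b", "meta": {"general": {"sentence_alignment_score": "0.3"}}},
    {"src": "c", "trg": "c", "meta": {"general": {}}},
]

kept = filter_by_alignment(entries, min_score=0.5)
print([e["src"] for e in kept])
```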

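License

The texts and site-specific metadata are distributed under fair use principles; everything else is under the Apache 2.0 license. If an author, translator, or one of the sites involved requests the removal of one or more series, the request will be handled promptly. Takedown requests can be made by opening a Hugging Face discussion.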
