veezbo/akkadian_english_corpus

Name: veezbo/akkadian_english_corpus
Creator: veezbo
Published: 2023-09-30 21:32:28
License: no description available

Source: Hugging Face. Updated: 2023-09-30. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/veezbo/akkadian_english_corpus

Resource overview:

This dataset is a cleaned corpus of English translations of Akkadian texts, suitable for text generation tasks such as fine-tuning large language models. It was built by sourcing high-quality, expert-produced English translations of Akkadian, enforcing a minimum line length, removing duplicate lines, deleting textual notes and other generic notes within parentheses, and inserting translation notes and literal notes in place to preserve grammar and add clarity. The raw data was aggregated by the Akkademia project; the original source of the raw data is the Open Richly Annotated Cuneiform Corpus (ORACC) project.

Provided by:

veezbo

Summary of original information

Akkadian English Corpus

Overview

Name: Akkadian English Corpus
License: MIT
Task category: text generation
Language: English
Dataset size: 1K<n<10K
Alias: English-translated Akkadian Corpus

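Dataset description

This is a cleaned dataset of English translations of Akkadian, suitable for text generation tasks such as fine-tuning large language models (LLMs).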
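
For readers who want to inspect the corpus before fine-tuning, a minimal loading sketch follows. It assumes the Hugging Face `datasets` library is installed and that the repository can be loaded by its ID with the library's automatic format detection; the helper name `load_corpus` is hypothetical, and no split or column names are assumed because the listing does not state them.

```python
DATASET_ID = "veezbo/akkadian_english_corpus"  # repository ID taken from the listing


def load_corpus(dataset_id=DATASET_ID):
    """Download the corpus from the Hugging Face Hub and return it.

    Hypothetical helper: assumes the repository loads without extra
    configuration, which the listing does not confirm.
    """
    # Imported lazily so this module can be read without the dependency installed.
    from datasets import load_dataset

    return load_dataset(dataset_id)
```

Calling `load_corpus()` and printing the result should list whatever splits and columns the repository actually exposes, which is the safest way to discover them.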

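Data generation process

Data source: a high-quality dataset of English translations of Akkadian, provided by experts
Processing steps:
- Enforce a minimum line length
- Remove duplicate lines
- Remove textual notes and other generic notes within parentheses
- Insert translation notes and literal notes in place, to preserve grammar and add clarity to the corpus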
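
The first three steps can be illustrated with a short sketch. This is not the creator's actual script: the threshold `MIN_LINE_LENGTH`, the function name `clean_lines`, and the sample sentences are all invented for illustration, and the sketch simplifies by stripping every parenthesised span rather than distinguishing textual notes from translation or literal notes. The fourth step (inserting translation and literal notes in place) is not attempted, since the listing does not describe how those notes are marked.

```python
import re

# Hypothetical threshold; the listing does not state the actual minimum length.
MIN_LINE_LENGTH = 20

# Matches one parenthesised span (no nesting) plus any whitespace before it.
PAREN_NOTE = re.compile(r"\s*\([^()]*\)")


def clean_lines(lines, min_length=MIN_LINE_LENGTH):
    """Strip parenthetical notes, drop short lines, and remove duplicates."""
    seen = set()
    cleaned = []
    for line in lines:
        text = PAREN_NOTE.sub("", line)
        text = " ".join(text.split())  # collapse runs of whitespace
        if len(text) < min_length:
            continue  # enforce the minimum line length
        if text in seen:
            continue  # skip lines already emitted
        seen.add(text)
        cleaned.append(text)
    return cleaned


# Made-up sample lines, not taken from the corpus.
sample = [
    "I marched to the city (name broken) and conquered it.",
    "I marched to the city and conquered it.",
    "(broken)",
    "He brought tribute before me.",
]
result = clean_lines(sample)
```

Notes are stripped before the length and duplicate checks here, so two lines that differ only in their notes collapse into one; the original pipeline may order these steps differently.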

Data sources

Raw dataset: provided by the Akkademia project
Raw data file: train.en
Original data source: Open Richly Annotated Cuneiform Corpus (ORACC) project
Specific datasets: RINAP 1, 3, 4, and 5

Citation

Gai Gutherz et al. 2023: Translating Akkadian to English with neural machine translation, PNAS Nexus, Volume 2, Issue 5, May 2023, pgad096, https://doi.org/10.1093/pnasnexus/pgad096
Jamie Novotny et al.: Open Richly Annotated Cuneiform Corpus, http://oracc.org
