FrancophonIA/Romance-Croatian_Parallel_Corpus

Source: Hugging Face. Updated 2025-03-30; indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/FrancophonIA/Romance-Croatian_Parallel_Corpus
Description:
RomCro is a parallel corpus of Romance languages and Croatian, built at the Department of Romance Languages and Literatures, Faculty of Humanities and Social Sciences, University of Zagreb. It covers five Romance languages (French, Italian, Portuguese, Romanian, Spanish) plus Croatian and consists of literary texts from the 20th and 21st centuries. Each original sentence is aligned with its translation equivalents in the other five languages, but the original sentence order has been shuffled. The corpus totals 15.9 million words, distributed by language as follows: French 2.9 million words, Italian 2.5 million words, Portuguese 2.5 million words, Romanian 2.6 million words, Spanish 2.7 million words, and Croatian 2.4 million words. The corpus is provided in two formats, TMX and TSV, with the languages ordered Spanish, French, Italian, Portuguese, Romanian, Croatian. Both formats include notes giving the original language, author, and title of the text from which each segment is taken.
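As a rough illustration of working with the TSV format, the sketch below reads tab-separated rows and extracts sentence pairs for one language pair. It rests on several assumptions not confirmed by the description: that the file has no header row, that the six language columns come first in the stated order, and that the three notes (original language, author, title) follow as separate columns. The column names (`es`, `fr`, ..., `orig_lang`, `author`, `title`) and the helper functions are hypothetical, and the sample row is made up rather than taken from the corpus.

```python
import csv
import io

# Assumed layout: six language columns in the order given in the description,
# followed by three note columns. Names and positions are hypothetical.
LANGS = ["es", "fr", "it", "pt", "ro", "hr"]
NOTES = ["orig_lang", "author", "title"]
COLUMNS = LANGS + NOTES


def read_segments(handle):
    """Yield one dict per TSV row, keyed by the assumed column names."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != len(COLUMNS):
            continue  # skip rows that do not match the assumed layout
        yield dict(zip(COLUMNS, row))


def pairs(segments, src, tgt):
    """Return (source, target) sentence pairs, skipping empty cells."""
    return [(s[src], s[tgt]) for s in segments if s[src] and s[tgt]]


# Made-up sample row for demonstration only, not actual corpus content.
SAMPLE = (
    "Hola.\tBonjour.\tCiao.\tOlá.\tBună.\tBok.\t"
    "es\tExample Author\tExample Title\n"
)

segments = list(read_segments(io.StringIO(SAMPLE)))
print(pairs(segments, "fr", "hr"))
```

To use this on the real file, replace the `io.StringIO(SAMPLE)` handle with `open(path, encoding="utf-8", newline="")` and adjust `COLUMNS` once the actual layout has been checked against the first few rows.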
Provided by:
FrancophonIA