Hindi-English News Parallel Corpus

Name: Hindi-English News Parallel Corpus
Creator: Indian Institute of Information Technology, Bhubaneswar
Published: 2019-01-25 03:49:43
License: Not specified

Source: arXiv. Updated: 2019-01-25. Indexed: 2024-08-06.

Download link:

http://arxiv.org/abs/1901.08625v1

Resource description:

This study presents a prototype system that automatically builds a parallel corpus for Hindi-English news translation. The corpus is constructed from comparable corpora crawled from the web, and a fuzzy string matching algorithm is used to improve its quality. It is intended to address the limited size of existing Hindi-English parallel corpora and to support the training of statistical machine translation (SMT) and neural machine translation (NMT) systems. To build the dataset, Hindi news content is extracted from sources such as Navbharat Times and translated into English; the translations are then aligned with related English news content retrieved through the Google Search API. The dataset is primarily intended for machine translation, particularly Hindi-English news translation.
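The fuzzy-matching alignment step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the description does not name the matching algorithm or any threshold, so the use of Python's standard-library `difflib.SequenceMatcher`, the helper names `similarity` and `align`, and the `threshold=0.6` default are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] for two strings,
    compared after lowercasing and collapsing whitespace."""
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def align(translated, candidates, threshold=0.6):
    """For each machine-translated sentence, pick the most similar
    candidate English sentence and keep it only if the score reaches
    the (assumed) threshold.

    Returns a list of (index_into_translated, candidate, score) tuples,
    so the caller can pair the original Hindi sentence at that index
    with the matched English sentence.
    """
    pairs = []
    if not candidates:
        return pairs
    for i, sentence in enumerate(translated):
        best = max(candidates, key=lambda c: similarity(sentence, c))
        score = similarity(sentence, best)
        if score >= threshold:
            pairs.append((i, best, score))
    return pairs


if __name__ == "__main__":
    # Hypothetical data: one machine-translated Hindi sentence and two
    # English sentences retrieved from a related news article.
    translated = ["The prime minister visited Mumbai on Monday"]
    candidates = [
        "Prime Minister visited Mumbai on Monday.",
        "Stock markets closed higher today.",
    ]
    for index, match, score in align(translated, candidates):
        print(index, match, round(score, 2))
```

Raising the threshold trades corpus size for cleaner sentence pairs; a suitable value would have to be tuned on real data.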

Provided by:

Indian Institute of Information Technology, Bhubaneswar

Created:

2019-01-25