
Challenge Dataset of Cognates and False Friend Pairs from Indian Languages

arXiv | Updated 2021-12-17 | Indexed 2024-07-24
Download link:
https://github.com/cfiltnlp/challengeCognateFF
Description:
This dataset, 'Challenge Dataset of Cognates and False Friend Pairs from Indian Languages', was created at the Indian Institute of Technology Bombay and covers cognate data for twelve Indian languages. The cognate sets were generated by digitizing an Indian language cognate dictionary and by using the linked Indian language Wordnets. A false friends dataset was also built for eleven language pairs. The dataset is intended to support natural language processing applications such as machine translation, cross-lingual information retrieval, and computational phylogenetics, where identifying cognates across languages is a known challenge.
Provider:
Indian Institute of Technology Bombay
Created:
2021-12-17
Summary of original information

Dataset overview

Dataset details

This repository contains data from two publications:

  1. Challenge Dataset of Cognates and False Friend Pairs from Indian Languages (LREC 2020)
  2. Harnessing Cross-lingual Features to Improve Cognate Detection for Low-resource Languages (COLING 2020)

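Dataset contents

  • D1, D2, D3: these datasets correspond to those described in the LREC 2020 paper and can be found in their respective folders.
    • D1 and D2 can be merged to reproduce the COLING 2020 paper's experiments on cognate detection for Indian languages.
    • D3 relates only to the LREC 2020 paper and contains the false friends data for Indian languages.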
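
Merging D1 and D2 could be sketched roughly as follows. This is a minimal illustration, not the authors' procedure: it assumes each dataset file holds one word pair per line in two tab-separated columns, which the listing does not confirm, and the helper names `read_pairs` and `merge_pairs` are hypothetical. Check the actual file layout in the repository before adapting it.

```python
from pathlib import Path
from typing import Iterable, Iterator


def read_pairs(path: Path) -> Iterator[tuple[str, str]]:
    """Yield (word_a, word_b) pairs from a tab-separated text file.

    ASSUMPTION: one pair per line, two tab-separated columns. The real
    D1/D2 files may use a different layout.
    """
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:  # skip blank or malformed lines
                yield fields[0].strip(), fields[1].strip()


def merge_pairs(*sources: Iterable[tuple[str, str]]) -> list[tuple[str, str]]:
    """Concatenate pair collections, dropping exact duplicates.

    First-seen order is preserved so the merged list is reproducible.
    """
    seen: set[tuple[str, str]] = set()
    merged: list[tuple[str, str]] = []
    for source in sources:
        for pair in source:
            if pair not in seen:
                seen.add(pair)
                merged.append(pair)
    return merged
```

A call such as `merge_pairs(read_pairs(d1_file), read_pairs(d2_file))` would then give a single deduplicated list, where `d1_file` and `d2_file` are placeholder paths to one language pair's files in the D1 and D2 folders.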

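Notes

  • ILCI Parallel Corpus: the ILCI parallel corpus used for the machine translation experiments cannot be redistributed and must be requested through the TDIL website.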

Citation

LREC 2020

@inproceedings{kanojia-etal-2020-challenge,
    title = "Challenge Dataset of Cognates and False Friend Pairs from {I}ndian Languages",
    author = "Kanojia, Diptesh and Kulkarni, Malhar and Bhattacharyya, Pushpak and Haffari, Gholamreza",
    booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2020.lrec-1.378",
    pages = "3096--3102",
    abstract = "Cognates are present in multiple variants of the same text across different languages (e.g., {``}hund{''} in German and {``}hound{''} in the English language mean {``}dog{''}). They pose a challenge to various Natural Language Processing (NLP) applications such as Machine Translation, Cross-lingual Sense Disambiguation, Computational Phylogenetics, and Information Retrieval. A possible solution to address this challenge is to identify cognates across language pairs. In this paper, we describe the creation of two cognate datasets for twelve Indian languages namely Sanskrit, Hindi, Assamese, Oriya, Kannada, Gujarati, Tamil, Telugu, Punjabi, Bengali, Marathi, and Malayalam. We digitize the cognate data from an Indian language cognate dictionary and utilize linked Indian language Wordnets to generate cognate sets. Additionally, we use the Wordnet data to create a {``}False Friends{''} dataset for eleven language pairs. We also evaluate the efficacy of our dataset using previously available baseline cognate detection approaches. We also perform a manual evaluation with the help of lexicographers and release the curated gold-standard dataset with this paper.",
    language = "English",
    ISBN = "979-10-95546-34-4",
}

COLING 2020

@inproceedings{kanojia-etal-2020-harnessing,
    title = "Harnessing Cross-lingual Features to Improve Cognate Detection for Low-resource Languages",
    author = "Kanojia, Diptesh and Dabre, Raj and Dewangan, Shubham and Bhattacharyya, Pushpak and Haffari, Gholamreza and Kulkarni, Malhar",
    booktitle = "Proceedings of the 28th International Conference on Computational Linguistics",
    month = dec,
    year = "2020",
    address = "Barcelona, Spain (Online)",
    publisher = "International Committee on Computational Linguistics",
    url = "https://aclanthology.org/2020.coling-main.119",
    doi = "10.18653/v1/2020.coling-main.119",
    pages = "1384--1395",
    abstract = "Cognates are variants of the same lexical form across different languages; for example {``}fonema{''} in Spanish and {``}phoneme{''} in English are cognates, both of which mean {``}a unit of sound{''}. The task of automatic detection of cognates among any two languages can help downstream NLP tasks such as Cross-lingual Information Retrieval, Computational Phylogenetics, and Machine Translation. In this paper, we demonstrate the use of cross-lingual word embeddings for detecting cognates among fourteen Indian Languages. Our approach introduces the use of context from a knowledge graph to generate improved feature representations for cognate detection. We, then, evaluate the impact of our cognate detection mechanism on neural machine translation (NMT), as a downstream task. We evaluate our methods to detect cognates on a challenging dataset of twelve Indian languages, namely, Sanskrit, Hindi, Assamese, Oriya, Kannada, Gujarati, Tamil, Telugu, Punjabi, Bengali, Marathi, and Malayalam. Additionally, we create evaluation datasets for two more Indian languages, Konkani and Nepali. We observe an improvement of up to 18{\%} points, in terms of F-score, for cognate detection. Furthermore, we observe that cognates extracted using our method help improve NMT quality by up to 2.76 BLEU. We also release our code, newly constructed datasets and cross-lingual models publicly.",
}
