
HinDialect 1.1: 26 Hindi-related languages and dialects of the Indic Continuum in North India

Indexed from hdl.handle.net on 2025-03-23
Download link:
http://hdl.handle.net/11234/1-4839
Description:
HinDialect: 26 Hindi-related languages and dialects of the Indic Continuum in North India

Languages

This is a collection of folksongs in 26 languages that form a dialect continuum in North India and nearby regions. The languages are: Angika, Awadhi, Baiga, Bengali, Bhadrawahi, Bhili, Bhojpuri, Braj, Bundeli, Chhattisgarhi, Garhwali, Gujarati, Haryanvi, Himachali, Hindi, Kanauji, Khadi Boli, Korku, Kumaoni, Magahi, Malvi, Marathi, Nimadi, Panjabi, Rajasthani, Sanskrit. The data was originally collected by the Kavita Kosh Project at http://www.kavitakosh.org/ .

The main characteristics of the languages in this collection are:
- They are all Indic languages except for Korku.
- Most of them are genealogically close to the standard Hindi dialect (for example Haryanvi and Bhojpuri), although the collection also contains more distant relatives such as Bengali and Gujarati.
- They are all primarily spoken in (North) India (Bengali is also spoken in Bangladesh).
- All except Sanskrit are living languages.

Data

Grouped by the NLP resources previously available for them, the languages fall into three bands:
* Band 1 languages: Hindi, Panjabi, Gujarati, Bengali, Nepali. These languages already have other large standard datasets available. Kavita Kosh may have very little data for these languages.
* Band 2 languages: Bhojpuri, Magahi, Awadhi, Braj. These languages attract growing interest and have some datasets, which are small compared to Band 1 language resources.
* Band 3 languages: All other languages in the collection, which were previously zero-resource. These are the languages for which this dataset is the most relevant.

Script

This dataset is entirely in Devanagari. Content in languages not normally written in Devanagari (such as Bengali and Gujarati) has been transliterated by the Kavita Kosh Project.

Format

The dataset contains one text file of folksongs per language. Folksongs are separated from each other by an empty line. The first line of a new piece is the title of the folksong, and line separation within folksongs is preserved.
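
The format described above can be read with a few lines of Python. The following is a minimal sketch, not part of the dataset's own tooling: the `Folksong` class and the `parse_folksongs` function are hypothetical names introduced here, and the sketch assumes the files are UTF-8 encoded and that any run of blank lines marks a boundary between folksongs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Folksong:
    """One folksong: its title line plus the remaining lines of the piece."""
    title: str
    lines: List[str]


def parse_folksongs(text: str) -> List[Folksong]:
    """Split the contents of one language file into folksongs.

    Assumption: pieces are separated by one or more empty lines, and the
    first non-empty line of each piece is its title.
    """
    songs: List[Folksong] = []
    block: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if line:
            block.append(line)
        elif block:
            # A blank line closes the current piece.
            songs.append(Folksong(title=block[0], lines=block[1:]))
            block = []
    if block:
        # The last piece may not be followed by a blank line.
        songs.append(Folksong(title=block[0], lines=block[1:]))
    return songs
```

To use it on a file, read the text with an explicit encoding, for example `parse_folksongs(open(path, encoding="utf-8").read())`, where `path` points to one of the per-language files. The actual file names inside the archive are not stated in this description, so check them after downloading.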

Provider:
hdl.handle.net