
Divyanshu/IE_SemParse

Hugging Face | Updated 2023-07-13 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/Divyanshu/IE_SemParse
Resource description:
---
annotations_creators:
- machine-generated
language_creators:
- machine-generated
language:
- as
- bn
- gu
- hi
- kn
- ml
- mr
- or
- pa
- ta
- te
license:
- cc0-1.0
multilinguality:
- multilingual
pretty_name: IE-SemParse
size_categories:
- 1M<n<10M
source_datasets:
- original
task_categories:
- text2text-generation
task_ids:
- parsing
---

# Dataset Card for "IE-SemParse"

## Table of Contents

- [Dataset Card for "IE-SemParse"](#dataset-card-for-ie-semparse)
  - [Table of Contents](#table-of-contents)
  - [Dataset Description](#dataset-description)
    - [Dataset Summary](#dataset-summary)
    - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
    - [Languages](#languages)
  - [Dataset usage](#dataset-usage)
  - [Dataset Creation](#dataset-creation)
    - [Curation Rationale](#curation-rationale)
    - [Source Data](#source-data)
      - [Initial Data Collection and Normalization](#initial-data-collection-and-normalization)
      - [Who are the source language producers?](#who-are-the-source-language-producers)
      - [Human Verification Process](#human-verification-process)
  - [Considerations for Using the Data](#considerations-for-using-the-data)
    - [Social Impact of Dataset](#social-impact-of-dataset)
    - [Discussion of Biases](#discussion-of-biases)
    - [Other Known Limitations](#other-known-limitations)
    - [Dataset Curators](#dataset-curators)
    - [Licensing Information](#licensing-information)
    - [Citation Information](#citation-information)

## Dataset Description

- **Homepage:** <https://github.com/divyanshuaggarwal/IE-SemParse>
- **Paper:** [Evaluating Inter-Bilingual Semantic Parsing for Indian Languages](https://arxiv.org/abs/2304.13005)
- **Point of Contact:** [Divyanshu Aggarwal](mailto:divyanshuggrwl@gmail.com)

### Dataset Summary

IE-SemParse is an Inter-Bilingual Semantic Parsing dataset for eleven major Indic languages: Assamese (‘as’), Gujarati (‘gu’), Kannada (‘kn’), Malayalam (‘ml’), Marathi (‘mr’), Odia (‘or’), Punjabi (‘pa’), Tamil (‘ta’), Telugu (‘te’), Hindi (‘hi’), and Bengali (‘bn’).

### Supported Tasks and Leaderboards

**Tasks:** Inter-Bilingual Semantic Parsing

**Leaderboards:** Currently there is no leaderboard for this dataset.

### Languages

- `Assamese (as)`
- `Bengali (bn)`
- `Gujarati (gu)`
- `Kannada (kn)`
- `Hindi (hi)`
- `Malayalam (ml)`
- `Marathi (mr)`
- `Oriya (or)`
- `Punjabi (pa)`
- `Tamil (ta)`
- `Telugu (te)`

...

<!-- Below is the dataset split given for `hi` dataset.

```python
DatasetDict({
    train: Dataset({
        features: ['utterance', 'logical form', 'intent'],
        num_rows: 36000
    })
    test: Dataset({
        features: ['utterance', 'logical form', 'intent'],
        num_rows: 3000
    })
    validation: Dataset({
        features: ['utterance', 'logical form', 'intent'],
        num_rows: 1500
    })
})
```
-->

## Dataset usage

Code snippet for loading the dataset with the `datasets` library.

```python
from datasets import load_dataset

dataset = load_dataset("Divyanshu/IE_SemParse")
```

## Dataset Creation

The dataset was created by machine-translating the English portions of three multilingual semantic parsing datasets into the 11 listed Indic languages.

### Curation Rationale

[More information needed]

### Source Data

- [mTOP dataset](https://aclanthology.org/2021.eacl-main.257/)
- [multilingualTOP dataset](https://github.com/awslabs/multilingual-top)
- [multi-ATIS++ dataset](https://paperswithcode.com/paper/end-to-end-slot-alignment-and-recognition-for)

#### Initial Data Collection and Normalization

[Detailed in the paper](https://arxiv.org/abs/2304.13005)

#### Who are the source language producers?

[Detailed in the paper](https://arxiv.org/abs/2304.13005)

#### Human Verification Process

[Detailed in the paper](https://arxiv.org/abs/2304.13005)

## Considerations for Using the Data

### Social Impact of Dataset

[Detailed in the paper](https://arxiv.org/abs/2304.13005)

### Discussion of Biases

[Detailed in the paper](https://arxiv.org/abs/2304.13005)

### Other Known Limitations

[Detailed in the paper](https://arxiv.org/abs/2304.13005)

### Dataset Curators

Divyanshu Aggarwal, Vivek Gupta, Anoop Kunchukuttan

### Licensing Information

Contents of this repository are restricted to non-commercial research purposes only, under the [Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/). Copyright of the dataset contents belongs to the original copyright holders.

### Citation Information

If you use any of the datasets, models or code modules, please cite the following paper:

```
@misc{aggarwal2023evaluating,
      title={Evaluating Inter-Bilingual Semantic Parsing for Indian Languages},
      author={Divyanshu Aggarwal and Vivek Gupta and Anoop Kunchukuttan},
      year={2023},
      eprint={2304.13005},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

<!-- ### Contributions -->
Provider:
Divyanshu
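Summary of original information

Dataset Card for "IE-SemParse"

Dataset Description

Dataset Summary

IE-SemParse is an inter-bilingual semantic parsing dataset for eleven major Indic languages: Assamese (‘as’), Gujarati (‘gu’), Kannada (‘kn’), Malayalam (‘ml’), Marathi (‘mr’), Odia (‘or’), Punjabi (‘pa’), Tamil (‘ta’), Telugu (‘te’), Hindi (‘hi’), and Bengali (‘bn’).

Supported Tasks and Leaderboards

Task: Inter-Bilingual Semantic Parsing

Leaderboard: There is currently no leaderboard for this dataset.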

Languages

  • Assamese (as)
  • Bengali (bn)
  • Gujarati (gu)
  • Kannada (kn)
  • Hindi (hi)
  • Malayalam (ml)
  • Marathi (mr)
  • Odia (or)
  • Punjabi (pa)
  • Tamil (ta)
  • Telugu (te)
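
When working with per-language subsets, it can help to keep the eleven codes listed above in one place. The following is a minimal sketch; the names `IE_SEMPARSE_LANGUAGES` and `language_name` are hypothetical helpers introduced for illustration and are not part of the dataset or of any library.

```python
# Language codes and names as listed in the dataset card.
# IE_SEMPARSE_LANGUAGES and language_name are hypothetical helper names.
IE_SEMPARSE_LANGUAGES = {
    "as": "Assamese",
    "bn": "Bengali",
    "gu": "Gujarati",
    "kn": "Kannada",
    "hi": "Hindi",
    "ml": "Malayalam",
    "mr": "Marathi",
    "or": "Odia",
    "pa": "Punjabi",
    "ta": "Tamil",
    "te": "Telugu",
}


def language_name(code: str) -> str:
    """Return the language name for a code, or raise ValueError if it is not listed."""
    try:
        return IE_SEMPARSE_LANGUAGES[code.lower()]
    except KeyError:
        raise ValueError(f"{code!r} is not one of the IE-SemParse languages") from None
```

The lookup is case-insensitive, so `language_name("HI")` and `language_name("hi")` resolve to the same entry, while an unlisted code such as `en` raises `ValueError`.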

Dataset Usage

Code snippet for loading the dataset with the datasets library:

    from datasets import load_dataset

    dataset = load_dataset("Divyanshu/IE_SemParse")
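
As a minimal sketch of inspecting what the snippet above returns, the helper below reports the number of rows in each split. `split_sizes` and `load_and_summarize` are hypothetical names, not part of the `datasets` library. The optional `config_name` argument is an assumption: the card does not say whether a per-language configuration (for example `hi`) must be passed.

```python
from typing import Dict, Mapping, Optional, Sized


def split_sizes(splits: Mapping[str, Sized]) -> Dict[str, int]:
    """Map each split name (e.g. 'train') to its number of rows."""
    return {name: len(rows) for name, rows in splits.items()}


def load_and_summarize(
    repo_id: str = "Divyanshu/IE_SemParse",
    config_name: Optional[str] = None,  # assumption: a language code such as "hi"
) -> Dict[str, int]:
    """Download the dataset and return its per-split row counts."""
    # Imported lazily so split_sizes stays usable without the library installed.
    from datasets import load_dataset

    if config_name is None:
        dataset = load_dataset(repo_id)
    else:
        dataset = load_dataset(repo_id, config_name)
    return split_sizes(dataset)
```

Calling `load_and_summarize()` needs network access and the `datasets` package, whereas `split_sizes` works on any mapping from split names to sized collections.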

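Dataset Creation

The dataset was created by machine-translating the English portions of three multilingual semantic parsing datasets into the eleven listed Indic languages.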

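Source Data

Initial Data Collection and Normalization

Details are provided in the paper: Evaluating Inter-Bilingual Semantic Parsing for Indian Languages

Who are the source language producers?

Details are provided in the paper: Evaluating Inter-Bilingual Semantic Parsing for Indian Languages

Human Verification Process

Details are provided in the paper: Evaluating Inter-Bilingual Semantic Parsing for Indian Languages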

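Considerations for Using the Data

Social Impact of Dataset

Details are provided in the paper: Evaluating Inter-Bilingual Semantic Parsing for Indian Languages

Discussion of Biases

Details are provided in the paper: Evaluating Inter-Bilingual Semantic Parsing for Indian Languages

Other Known Limitations

Details are provided in the paper: Evaluating Inter-Bilingual Semantic Parsing for Indian Languages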

Dataset Curators

Divyanshu Aggarwal, Vivek Gupta, Anoop Kunchukuttan

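Licensing Information

Contents of this repository are restricted to non-commercial research purposes only, under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). Copyright of the dataset contents belongs to the original copyright holders.

Citation Information

If you use any of the datasets, models or code modules, please cite the following paper: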

@misc{aggarwal2023evaluating,
      title={Evaluating Inter-Bilingual Semantic Parsing for Indian Languages},
      author={Divyanshu Aggarwal and Vivek Gupta and Anoop Kunchukuttan},
      year={2023},
      eprint={2304.13005},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
