neuralspace/citizen_nlu
收藏Hugging Face2022-09-09 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/neuralspace/citizen_nlu
下载链接
链接失效反馈官方服务:
资源简介:
---
annotations_creators:
- other
language_creators:
- other
language:
- as
- bn
- gu
- hi
- kn
- mr
- pa
- ta
- te
expert-generated license:
- cc-by-nc-sa-4.0
multilinguality:
- multilingual
size_categories:
- n>1K
source_datasets:
- original
task_categories:
- question-answering
- text-retrieval
- text2text-generation
- other
- translation
- conversational
task_ids:
- extractive-qa
- closed-domain-qa
- utterance-retrieval
- document-retrieval
- closed-domain-qa
- open-book-qa
- closed-book-qa
paperswithcode_id: acronym-identification
pretty_name: Citizen Services NLU Multilingual Dataset.
train-eval-index:
- config: citizen_nlu
task: token-classification
task_id: entity_extraction
splits:
train_split: train
eval_split: test
col_mapping:
sentence: text
label: target
metrics:
- type: citizen_nlu
name: citizen_nlu
config:
citizen_nlu
tags:
- chatbots
- citizen services
- help
- emergency services
- health
- reporting crime
configs:
- citizen_nlu
---
# Dataset Card for citizen_nlu
## Table of Contents
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks](#supported-tasks)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Curation Rationale](#curation-rationale)
- [Source Data](#source-data)
- [Annotations](#annotations)
- [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Social Impact of Dataset](#social-impact-of-dataset)
- [Discussion of Biases](#discussion-of-biases)
- [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
- [Dataset Curators](#dataset-curators)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Contributions](#contributions)
### Dataset Description
- **Homepage**: [NeuralSpace Homepage](https://huggingface.co/neuralspace)
- **Repository:** [citizen_nlu Dataset](https://huggingface.co/datasets/neuralspace/citizen_nlu)
- **Point of Contact:** [Juhi Jain](mailto:juhi@neuralspace.ai)
- **Point of Contact:** [Ayushman Dash](mailto:ayushman@neuralspace.ai)
- **Size of downloaded dataset files:** 67.6 MB
### Dataset Summary
NeuralSpace strives to provide AutoNLP text and speech services, especially for low-resource languages. One of the major services provided by NeuralSpace on its platform is the “Language Understanding” service, where you can build, train and deploy your NLU model to recognize intents and entities with minimal code and just a few clicks.
The initiative of this challenge is created with the purpose of sparkling AI applications to address some of the pressing problems in India and find unique ways to address them. Starting with a focus on NLU, this challenge hopes to make progress towards multilingual modelling, as language diversity is significantly underserved on the web.
NeuralSpace aims at mastering the low-resource domain, and the citizen services use case is naturally a multilingual and essential domain for the general citizen.
Citizen services refer to the essential services provided by organizations to general citizens. In this case, we focus on important services like various FIR-based requests, Blood/Platelets Donation, and Coronavirus-related queries.
Such services may not be needed regularly by any particular city but when needed are of utmost importance, and in general, the needs for such services are prevalent every day.
Despite the importance of citizen services, linguistically rich countries like India are still far behind in delivering such essential needs to the citizens with absolute ease. The best services currently available do not exist in various low-resource languages that are native to different groups of people. This challenge aims to make government services more efficient, responsive, and customer-friendly.
As our computing resources and modelling capabilities grow, so does our potential to support our citizens by delivering a far superior customer experience. Equipping a Citizen services bot with the ability to converse in vernacular languages would make them accessible to a vast group of people for whom English is not a language of choice, but for who are increasingly turning to digital platforms and interfaces for a wide range of needs and wants.
### Supported Tasks
A key component of any chatbot system is the NLU pipeline for ‘Intent Classification’ and ‘Named Entity Recognition. This primarily enables any chatbot to perform various tasks at ease. A fully functional multilingual chatbot needs to be able to decipher the language and understand exactly what the user wants.
#### citizen_nlu
A manually-curated multilingual dataset by Data Engineers at [NeuralSpace](https://www.neuralspace.ai/) for citizen services in 9 Indian languages for a realistic information-seeking task with data samples written by native-speaking expert data annotators [here](https://www.neuralspace.ai/). The dataset files are available in CSV format.
### Languages
The citizen_nlu data is available in nine Indian languages i.e, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Punjabi, Tamil, and Telugu
## Dataset Structure
### Data Instances
- **Size of downloaded dataset files:** 67.6 MB
An example of 'test' looks as follows.
``` text,intents
मेरे पिता की कार उनके कार्यालय की पार्किंग से कल से गायब है। वाहन संख्या केए-03-एचए-1985 । मैं एफआईआर कराना चाहता हूं।,ReportingMissingVehicle
```
An example of 'train' looks as follows.
```text,intents
என் தாத்தா எனக்கு பிறந்தநாள் பரிசு கொடுத்தார் மஞ்சள் நான் டாடனானோவை இழந்தேன். காணவில்லை என புகார் தெரிவிக்க விரும்புகிறேன்,ReportingMissingVehicle
```
### Data Fields
The data fields are the same among all splits.
#### citizen_nlu
- `text`: a `string` feature.
- `intent`: a `string` feature.
- `type`: a classification label, with possible values including `train` or `test`.
### Data Splits
#### citizen_nlu
| |train|test|
|----|----:|---:|
|citizen_nlu| 287832| 4752|
### Contributions
Mehar Bhatia (mehar@neuralspace.ai)
提供机构:
neuralspace
原始信息汇总
数据集概述
数据集名称
- 名称: Citizen Services NLU Multilingual Dataset
- 别名: citizen_nlu
语言
- 支持语言: 9种印度语言,包括Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Punjabi, Tamil, Telugu
许可
- 许可证: cc-by-nc-sa-4.0
多语言性
- 多语言支持: 是
数据集大小
- 下载大小: 67.6 MB
任务类别
- 任务:
- 问答
- 文本检索
- 文本到文本生成
- 翻译
- 对话
数据集结构
- 数据实例:
- 训练集: 287832条记录
- 测试集: 4752条记录
- 数据字段:
text: 文本字符串intent: 意图字符串type: 分类标签,值为train或test
数据集创建
- 来源: 原始数据
- 注释: 专家生成
使用考虑
- 社会影响: 旨在提高政府服务的效率和响应性,使服务更加客户友好。
- 偏见讨论: 未详细说明
- 其他已知限制: 未详细说明
贡献者
- 主要贡献者: Mehar Bhatia (mehar@neuralspace.ai)



