neuralspace/citizen_nlu

Name: neuralspace/citizen_nlu
Creator: neuralspace
Published: 2022-09-09 05:53:16
License: No description available

Source: Hugging Face. Updated: 2022-09-09. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/neuralspace/citizen_nlu

Resource description:

---
annotations_creators:
- other
language_creators:
- other
language:
- as
- bn
- gu
- hi
- kn
- mr
- pa
- ta
- te
expert-generated
license:
- cc-by-nc-sa-4.0
multilinguality:
- multilingual
size_categories:
- n>1K
source_datasets:
- original
task_categories:
- question-answering
- text-retrieval
- text2text-generation
- other
- translation
- conversational
task_ids:
- extractive-qa
- closed-domain-qa
- utterance-retrieval
- document-retrieval
- closed-domain-qa
- open-book-qa
- closed-book-qa
paperswithcode_id: acronym-identification
pretty_name: Citizen Services NLU Multilingual Dataset.
train-eval-index:
- config: citizen_nlu
  task: token-classification
  task_id: entity_extraction
  splits:
    train_split: train
    eval_split: test
  col_mapping:
    sentence: text
    label: target
  metrics:
  - type: citizen_nlu
    name: citizen_nlu
    config: citizen_nlu
tags:
- chatbots
- citizen services
- help
- emergency services
- health
- reporting crime
configs:
- citizen_nlu
---

# Dataset Card for citizen_nlu

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks](#supported-tasks)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

### Dataset Description

- **Homepage:** [NeuralSpace Homepage](https://huggingface.co/neuralspace)
- **Repository:** [citizen_nlu Dataset](https://huggingface.co/datasets/neuralspace/citizen_nlu)
- **Point of Contact:** [Juhi Jain](mailto:juhi@neuralspace.ai)
- **Point of Contact:** [Ayushman Dash](mailto:ayushman@neuralspace.ai)
- **Size of downloaded dataset files:** 67.6 MB

### Dataset Summary

NeuralSpace strives to provide AutoNLP text and speech services, especially for low-resource languages. One of the major services on the NeuralSpace platform is the “Language Understanding” service, where you can build, train and deploy an NLU model that recognizes intents and entities with minimal code and just a few clicks.

This challenge was created to spark AI applications that address some of the pressing problems in India in new ways. Starting with a focus on NLU, the challenge hopes to make progress towards multilingual modelling, as language diversity is significantly underserved on the web.

NeuralSpace aims at mastering the low-resource domain, and the citizen services use case is a naturally multilingual domain that is essential for the general citizen. Citizen services are the essential services that organizations provide to general citizens. In this case, we focus on important services such as FIR-based requests, Blood/Platelets Donation, and Coronavirus-related queries. A particular city may not need such services regularly, but when they are needed they are of utmost importance, and in general the need for them arises every day.

Despite the importance of citizen services, linguistically rich countries like India are still far behind in delivering these essential services to citizens with ease. Even the best services currently available are not offered in many of the low-resource languages that are native to different groups of people.
This challenge aims to make government services more efficient, responsive, and customer-friendly. As our computing resources and modelling capabilities grow, so does our potential to support citizens by delivering a far better customer experience. Equipping a citizen services bot with the ability to converse in vernacular languages would make it accessible to a vast group of people for whom English is not a language of choice but who are increasingly turning to digital platforms and interfaces for a wide range of needs.

### Supported Tasks

A key component of any chatbot system is the NLU pipeline for ‘Intent Classification’ and ‘Named Entity Recognition’. This pipeline is what lets a chatbot carry out its tasks with ease. A fully functional multilingual chatbot needs to identify the language and understand exactly what the user wants.

#### citizen_nlu

A manually curated multilingual dataset for citizen services in 9 Indian languages, built by Data Engineers at [NeuralSpace](https://www.neuralspace.ai/) for a realistic information-seeking task. The data samples were written by native-speaking expert data annotators [here](https://www.neuralspace.ai/). The dataset files are available in CSV format.

### Languages

The citizen_nlu data is available in nine Indian languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Punjabi, Tamil, and Telugu.

## Dataset Structure

### Data Instances

- **Size of downloaded dataset files:** 67.6 MB

An example of 'test' looks as follows.

```
text,intents
मेरे पिता की कार उनके कार्यालय की पार्किंग से कल से गायब है। वाहन संख्या केए-03-एचए-1985 । मैं एफआईआर कराना चाहता हूं।,ReportingMissingVehicle
```

An example of 'train' looks as follows.

```
text,intents
என் தாத்தா எனக்கு பிறந்தநாள் பரிசு கொடுத்தார் மஞ்சள் நான் டாடனானோவை இழந்தேன். காணவில்லை என புகார் தெரிவிக்க விரும்புகிறேன்,ReportingMissingVehicle
```

### Data Fields

The data fields are the same among all splits.
#### citizen_nlu

- `text`: a `string` feature.
- `intent`: a `string` feature.
- `type`: a classification label, with possible values including `train` or `test`.

### Data Splits

#### citizen_nlu

|           | train | test |
|-----------|------:|-----:|
|citizen_nlu| 287832| 4752 |

### Contributions

Mehar Bhatia (mehar@neuralspace.ai)
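
The CSV layout shown under Data Instances can be read with Python's standard `csv` module. The sketch below is a minimal illustration, not the curators' tooling: it parses an in-memory string holding the header and the Hindi 'test' row quoted in the card, and the helper name `read_rows` is hypothetical. The sample header reads `intents` while the field list names `intent`, so the helper accepts either spelling; which one the published files use is an open assumption here.

```python
import csv
import io
from collections import Counter

# Header and one row copied from the 'test' example in the dataset card.
SAMPLE_CSV = (
    "text,intents\n"
    "मेरे पिता की कार उनके कार्यालय की पार्किंग से कल से गायब है। "
    "वाहन संख्या केए-03-एचए-1985 । मैं एफआईआर कराना चाहता हूं।,"
    "ReportingMissingVehicle\n"
)


def read_rows(handle):
    """Yield (text, intent) pairs from a CSV handle (hypothetical helper)."""
    reader = csv.DictReader(handle)
    for row in reader:
        # Assumption: the label column is named either 'intents' or 'intent'.
        label = row.get("intents", row.get("intent"))
        yield row["text"], label


rows = list(read_rows(io.StringIO(SAMPLE_CSV)))

# Tally how many utterances carry each intent label.
intent_counts = Counter(intent for _, intent in rows)
print(intent_counts)
```

For a file on disk, the same helper can be given a handle opened with `newline=""` and `encoding="utf-8"`, which the `csv` module documentation recommends for files containing non-ASCII text and embedded line breaks.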

Provider:

neuralspace

Summary of original information

Dataset overview

Dataset name

Name: Citizen Services NLU Multilingual Dataset
Alias: citizen_nlu

Languages

Supported languages: 9 Indian languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Punjabi, Tamil, Telugu

License

License: cc-by-nc-sa-4.0

Multilinguality

Multilingual: yes

Dataset size

Download size: 67.6 MB

Task categories

Tasks:
- Question answering
- Text retrieval
- Text-to-text generation
- Translation
- Conversational

Dataset structure

Data instances:
- Training set: 287832 records
- Test set: 4752 records
Data fields:
- text: text string
- intent: intent string
- type: classification label, with value train or test
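
Because the `type` field marks each record as `train` or `test`, a combined table can be partitioned on that field. The following is a minimal sketch, assuming the rows have already been parsed into dictionaries; the function name `split_by_type` and the three placeholder rows are illustrative and not taken from the dataset. The last lines compute the share of the test set from the record counts listed above.

```python
from collections import defaultdict


def split_by_type(rows):
    """Group row dicts by their 'type' value (hypothetical helper)."""
    splits = defaultdict(list)
    for row in rows:
        splits[row["type"]].append(row)
    return dict(splits)


# Placeholder rows for illustration only; these are not real dataset records.
rows = [
    {"text": "example utterance 1", "intent": "ReportingMissingVehicle", "type": "train"},
    {"text": "example utterance 2", "intent": "ReportingMissingVehicle", "type": "test"},
    {"text": "example utterance 3", "intent": "ReportingMissingVehicle", "type": "train"},
]

splits = split_by_type(rows)
print({name: len(members) for name, members in splits.items()})

# Record counts as listed for the training and test sets.
TRAIN_SIZE, TEST_SIZE = 287832, 4752
test_share = TEST_SIZE / (TRAIN_SIZE + TEST_SIZE)
print(f"test share: {test_share:.2%}")
```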

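Dataset creation

Source: original data
Annotations: expert-generated

Considerations for use

Social impact: aims to make government services more efficient, more responsive, and more customer-friendly.
Discussion of biases: not specified
Other known limitations: not specified

Contributors

Main contributor: Mehar Bhatia (mehar@neuralspace.ai)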
