iitrsamrat/piqa_indic
Source: Hugging Face. Updated 2024-02-06; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/iitrsamrat/piqa_indic
Resource description:
---
license: apache-2.0
dataset_info:
- config_name: ben
  features:
  - name: goal
    dtype: string
  - name: sol1
    dtype: string
  - name: sol2
    dtype: string
  - name: label
    dtype: int64
  splits:
  - name: train
    num_bytes: 10915667
    num_examples: 16113
  - name: valid
    num_bytes: 1238392
    num_examples: 1838
  download_size: 4716439
  dataset_size: 12154059
- config_name: eng
  features:
  - name: goal
    dtype: string
  - name: sol1
    dtype: string
  - name: sol2
    dtype: string
  - name: label
    dtype: int64
  splits:
  - name: train
    num_bytes: 4104002
    num_examples: 16113
  - name: valid
    num_bytes: 464309
    num_examples: 1838
  download_size: 2958845
  dataset_size: 4568311
- config_name: hin
  features:
  - name: goal
    dtype: string
  - name: sol1
    dtype: string
  - name: sol2
    dtype: string
  - name: label
    dtype: int64
  splits:
  - name: train
    num_bytes: 10377270
    num_examples: 16113
  - name: valid
    num_bytes: 1170817
    num_examples: 1838
  download_size: 4597934
  dataset_size: 11548087
- config_name: kan
  features:
  - name: goal
    dtype: string
  - name: sol1
    dtype: string
  - name: sol2
    dtype: string
  - name: label
    dtype: int64
  splits:
  - name: train
    num_bytes: 11890364
    num_examples: 16113
  - name: valid
    num_bytes: 1348293
    num_examples: 1838
  download_size: 4984600
  dataset_size: 13238657
- config_name: tam
  features:
  - name: goal
    dtype: string
  - name: sol1
    dtype: string
  - name: sol2
    dtype: string
  - name: label
    dtype: int64
  splits:
  - name: train
    num_bytes: 12949508
    num_examples: 16113
  - name: valid
    num_bytes: 1468796
    num_examples: 1838
  download_size: 5199760
  dataset_size: 14418304
configs:
- config_name: ben
  data_files:
  - split: train
    path: ben/train-*
  - split: valid
    path: ben/valid-*
- config_name: eng
  data_files:
  - split: train
    path: eng/train-*
  - split: valid
    path: eng/valid-*
- config_name: hin
  data_files:
  - split: train
    path: hin/train-*
  - split: valid
    path: hin/valid-*
- config_name: kan
  data_files:
  - split: train
    path: kan/train-*
  - split: valid
    path: kan/valid-*
- config_name: tam
  data_files:
  - split: train
    path: tam/train-*
  - split: valid
    path: tam/valid-*
---
# Dataset Card for piqa_indic
## Dataset Summary (taken from PIQA)
To apply eyeshadow without a brush, should I use a cotton swab or a toothpick? Questions requiring this kind of physical commonsense pose a challenge to state-of-the-art natural language understanding systems. The PIQA dataset introduces the task of physical commonsense reasoning and a corresponding benchmark dataset, Physical Interaction: Question Answering (PIQA).
Physical commonsense knowledge is a major challenge on the road to true AI-completeness, including robots that interact with the world and understand natural language.
PIQA focuses on everyday situations with a preference for atypical solutions. The dataset is inspired by instructables.com, which provides users with instructions on how to build, craft, bake, or manipulate objects using everyday materials.
- **Curated by:** Samrat Saha
- **Language(s) (NLP):** ISO 639-2 codes: ben, hin, kan (the card metadata also lists `eng` and `tam` configs)
- **License:** Apache-2.0
### Dataset Sources [optional]
- **Demo [optional]:**

| goal | sol1 | sol2 | label |
| --- | --- | --- | --- |
| ಬೆಣ್ಣೆಯನ್ನು ಕುದಿಸುವಾಗ, ಅದು ಸಿದ್ಧವಾದಾಗ, ನೀವು ಮಾ... | ಅದನ್ನು ತಟ್ಟೆಯಲ್ಲಿ ಸುರಿಯಿರಿ. | ಅದನ್ನು ಬಾಟಲಿಯಲ್ಲಿ ಸುರಿಯಿರಿ. | 1 |

## Dataset Structure
Please refer to PIQA.
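
Because the card defers to PIQA for the structure, the following is a minimal sketch of how one row could be represented and read. It assumes the PIQA convention that `label` 0 marks `sol1` as the correct solution and 1 marks `sol2`. The names `PiqaRow`, `correct_solution`, and `load_config` are hypothetical helpers, not part of the dataset, and `load_config` assumes the Hugging Face `datasets` library is installed and the Hub is reachable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PiqaRow:
    """One example, mirroring the four columns declared in the card metadata."""
    goal: str
    sol1: str
    sol2: str
    label: int  # assumed PIQA convention: 0 -> sol1 is correct, 1 -> sol2


def correct_solution(row: PiqaRow) -> str:
    """Return the solution text selected by the label."""
    if row.label not in (0, 1):
        raise ValueError(f"unexpected label: {row.label}")
    return row.sol1 if row.label == 0 else row.sol2


def load_config(lang: str):
    """Load one language config (for example "kan") from the Hub.

    The import is lazy so the helpers above work without `datasets` installed.
    """
    from datasets import load_dataset
    return load_dataset("iitrsamrat/piqa_indic", lang)


# Hypothetical English row for illustration only (not copied from the dataset).
example = PiqaRow(
    goal="To apply eyeshadow without a brush,",
    sol1="use a cotton swab.",
    sol2="use a toothpick.",
    label=0,
)
print(correct_solution(example))
```

With the library available, `load_config("kan")["valid"]` would be expected to expose the `valid` split named in the metadata; rows can then be wrapped with `PiqaRow(**row)`.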
## Dataset Creation
This dataset was created from PIQA using a 1B high-quality Indic Transformer (Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin).
Currently, train and validation splits are provided for Bengali, Hindi, and Kannada.
The translation was done using beam search with a beam width of 3.
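
The card gives the beam width but not the decoding details, so the following is only a toy sketch of beam search with a width of 3, not the author's translation pipeline. The function `toy_model` and its probability table are invented stand-ins for a real translation model's next-token distribution, and the sketch omits length normalization and other refinements that production decoders often use.

```python
import math
from typing import Callable, Dict, List, Tuple

Token = str
Hypothesis = Tuple[List[Token], float]  # (tokens so far, cumulative log-probability)


def beam_search(
    next_log_probs: Callable[[List[Token]], Dict[Token, float]],
    beam_width: int = 3,
    max_len: int = 10,
    eos: Token = "</s>",
) -> List[Token]:
    """Keep the `beam_width` best partial sequences at every step."""
    beams: List[Hypothesis] = [([], 0.0)]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates: List[Hypothesis] = []
        for tokens, score in beams:
            for tok, lp in next_log_probs(tokens).items():
                candidates.append((tokens + [tok], score + lp))
        if not candidates:
            break
        # Prune to the top `beam_width` candidates by cumulative score.
        candidates.sort(key=lambda h: h[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == eos:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    pool = finished or beams
    return max(pool, key=lambda h: h[1])[0]


def toy_model(prefix: List[Token]) -> Dict[Token, float]:
    """Hypothetical next-token distribution; any unlisted prefix ends the sequence."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"z": 0.9, "w": 0.1},
    }
    probs = table.get(tuple(prefix), {"</s>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


print(beam_search(toy_model, beam_width=3))  # beam width used in the card
print(beam_search(toy_model, beam_width=1))  # greedy decoding, for comparison
```

In this toy table the greedy path starts with the locally best token `a` and ends with total probability 0.30, while a width of 3 keeps `b` alive long enough to find `b z` at 0.36. That is the usual motivation for a beam wider than 1.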
### Curation Rationale
The goal of the dataset is to translate the PIQA data into Indic languages to support the development of Indic LLMs.
### Source Data
PIQA
### Annotations
No manual annotation was done; the dataset consists entirely of high-quality machine translation.
## Citation
**BibTeX:**
@inproceedings{Bisk2020,
  author = {Yonatan Bisk and Rowan Zellers and Ronan Le Bras and Jianfeng Gao and Yejin Choi},
  title = {PIQA: Reasoning about Physical Commonsense in Natural Language},
  booktitle = {Thirty-Fourth AAAI Conference on Artificial Intelligence},
  year = {2020},
}
**APA:**
[More Information Needed]
## Dataset Card Authors [optional]
Samrat Saha (iitr.samrat@gmail.com)
## Dataset Card Contact
{
author = {Samrat Saha},
title = {PIQA_indic: Reasoning about Physical Commonsense in
Natural Language For Indic Languages},
year = {2024},
}
Provider:
iitrsamrat
Summary of original information
Dataset overview
Dataset configurations
Config name: ben
- Features:
  - goal: string
  - sol1: string
  - sol2: string
  - label: int64
- Splits:
  - train:
    - Bytes: 10915667
    - Examples: 16113
  - valid:
    - Bytes: 1238392
    - Examples: 1838
- Download size: 4716439 bytes
- Dataset size: 12154059 bytes
Config name: eng
- Features:
  - goal: string
  - sol1: string
  - sol2: string
  - label: int64
- Splits:
  - train:
    - Bytes: 4104002
    - Examples: 16113
  - valid:
    - Bytes: 464309
    - Examples: 1838
- Download size: 2958845 bytes
- Dataset size: 4568311 bytes
Config name: hin
- Features:
  - goal: string
  - sol1: string
  - sol2: string
  - label: int64
- Splits:
  - train:
    - Bytes: 10377270
    - Examples: 16113
  - valid:
    - Bytes: 1170817
    - Examples: 1838
- Download size: 4597934 bytes
- Dataset size: 11548087 bytes
Config name: kan
- Features:
  - goal: string
  - sol1: string
  - sol2: string
  - label: int64
- Splits:
  - train:
    - Bytes: 11890364
    - Examples: 16113
  - valid:
    - Bytes: 1348293
    - Examples: 1838
- Download size: 4984600 bytes
- Dataset size: 13238657 bytes
Config name: tam
- Features:
  - goal: string
  - sol1: string
  - sol2: string
  - label: int64
- Splits:
  - train:
    - Bytes: 12949508
    - Examples: 16113
  - valid:
    - Bytes: 1468796
    - Examples: 1838
- Download size: 5199760 bytes
- Dataset size: 14418304 bytes
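
The byte counts above can be cross-checked with simple arithmetic: for each config, the `train` and `valid` sizes should add up to the reported dataset size. The sketch below copies the numbers from the metadata; the helper name `size_mismatches` is hypothetical.

```python
# Split sizes in bytes, copied from the card metadata.
CONFIGS = {
    "ben": {"train": 10915667, "valid": 1238392, "dataset_size": 12154059},
    "eng": {"train": 4104002, "valid": 464309, "dataset_size": 4568311},
    "hin": {"train": 10377270, "valid": 1170817, "dataset_size": 11548087},
    "kan": {"train": 11890364, "valid": 1348293, "dataset_size": 13238657},
    "tam": {"train": 12949508, "valid": 1468796, "dataset_size": 14418304},
}


def size_mismatches(configs: dict) -> list:
    """Return the config names whose split bytes do not sum to dataset_size."""
    return [
        name
        for name, c in configs.items()
        if c["train"] + c["valid"] != c["dataset_size"]
    ]


print(size_mismatches(CONFIGS))  # → []
```

An empty list means every reported dataset size equals the sum of its two splits. The download sizes are smaller and are not expected to match these sums.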
Data files
Config name: ben
- Data files:
  - train: ben/train-*
  - valid: ben/valid-*
Config name: eng
- Data files:
  - train: eng/train-*
  - valid: eng/valid-*
Config name: hin
- Data files:
  - train: hin/train-*
  - valid: hin/valid-*
Config name: kan
- Data files:
  - train: kan/train-*
  - valid: kan/valid-*
Config name: tam
- Data files:
  - train: tam/train-*
  - valid: tam/valid-*
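
The data-file entries follow one pattern, `<config>/<split>-*`. As a rough illustration of how such patterns select files, the sketch below rebuilds them and matches them against a list of file names with Python's `fnmatch`. The shard names in `repo_files` are hypothetical, and `fnmatch` is only an approximation of the Hub's own pattern resolution (its `*` also matches `/`).

```python
from fnmatch import fnmatch

LANGS = ["ben", "eng", "hin", "kan", "tam"]
SPLITS = ["train", "valid"]


def data_file_patterns() -> dict:
    """Rebuild the `<config>/<split>-*` patterns listed in the card."""
    return {
        lang: {split: f"{lang}/{split}-*" for split in SPLITS}
        for lang in LANGS
    }


def files_for(pattern: str, repo_files: list) -> list:
    """Return the files whose path matches the glob pattern."""
    return [f for f in repo_files if fnmatch(f, pattern)]


# Hypothetical shard names; the real repository listing may differ.
repo_files = [
    "kan/train-00000-of-00001.parquet",
    "kan/valid-00000-of-00001.parquet",
    "tam/train-00000-of-00001.parquet",
]

patterns = data_file_patterns()
print(files_for(patterns["kan"]["train"], repo_files))
```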



