iitrsamrat/piqa_indic

Name: iitrsamrat/piqa_indic
Creator: iitrsamrat
Published: 2024-02-06 12:56:14
License: Apache-2.0 (as declared in the dataset card)

Hugging Face · Updated 2024-02-06 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/iitrsamrat/piqa_indic

Resource overview:

---
license: apache-2.0
dataset_info:
- config_name: ben
  features:
  - name: goal
    dtype: string
  - name: sol1
    dtype: string
  - name: sol2
    dtype: string
  - name: label
    dtype: int64
  splits:
  - name: train
    num_bytes: 10915667
    num_examples: 16113
  - name: valid
    num_bytes: 1238392
    num_examples: 1838
  download_size: 4716439
  dataset_size: 12154059
- config_name: eng
  features:
  - name: goal
    dtype: string
  - name: sol1
    dtype: string
  - name: sol2
    dtype: string
  - name: label
    dtype: int64
  splits:
  - name: train
    num_bytes: 4104002
    num_examples: 16113
  - name: valid
    num_bytes: 464309
    num_examples: 1838
  download_size: 2958845
  dataset_size: 4568311
- config_name: hin
  features:
  - name: goal
    dtype: string
  - name: sol1
    dtype: string
  - name: sol2
    dtype: string
  - name: label
    dtype: int64
  splits:
  - name: train
    num_bytes: 10377270
    num_examples: 16113
  - name: valid
    num_bytes: 1170817
    num_examples: 1838
  download_size: 4597934
  dataset_size: 11548087
- config_name: kan
  features:
  - name: goal
    dtype: string
  - name: sol1
    dtype: string
  - name: sol2
    dtype: string
  - name: label
    dtype: int64
  splits:
  - name: train
    num_bytes: 11890364
    num_examples: 16113
  - name: valid
    num_bytes: 1348293
    num_examples: 1838
  download_size: 4984600
  dataset_size: 13238657
- config_name: tam
  features:
  - name: goal
    dtype: string
  - name: sol1
    dtype: string
  - name: sol2
    dtype: string
  - name: label
    dtype: int64
  splits:
  - name: train
    num_bytes: 12949508
    num_examples: 16113
  - name: valid
    num_bytes: 1468796
    num_examples: 1838
  download_size: 5199760
  dataset_size: 14418304
configs:
- config_name: ben
  data_files:
  - split: train
    path: ben/train-*
  - split: valid
    path: ben/valid-*
- config_name: eng
  data_files:
  - split: train
    path: eng/train-*
  - split: valid
    path: eng/valid-*
- config_name: hin
  data_files:
  - split: train
    path: hin/train-*
  - split: valid
    path: hin/valid-*
- config_name: kan
  data_files:
  - split: train
    path: kan/train-*
  - split: valid
    path: kan/valid-*
- config_name: tam
  data_files:
  - split: train
    path: tam/train-*
  - split: valid
    path: tam/valid-*
---

# Dataset Card for piqa_indic

## Dataset Summary (taken from PIQA)

*To apply eyeshadow without a brush, should I use a cotton swab or a toothpick?* Questions requiring this kind of physical commonsense pose a challenge to state-of-the-art natural language understanding systems. The PIQA dataset introduces the task of physical commonsense reasoning and a corresponding benchmark dataset, Physical Interaction: Question Answering, or PIQA.

Physical commonsense knowledge is a major challenge on the road to true AI-completeness, including robots that interact with the world and understand natural language.

PIQA focuses on everyday situations with a preference for atypical solutions. The dataset is inspired by instructables.com, which provides users with instructions on how to build, craft, bake, or manipulate objects using everyday materials.

- **Curated by:** Samrat Saha
- **Language(s) (NLP):** ISO 639-2 codes: ben, hin, kan
- **License:** Apache-2.0

### Dataset Sources

- **Demo:**

| goal | sol1 | sol2 | label |
|---|---|---|---|
| ಬೆಣ್ಣೆಯನ್ನು ಕುದಿಸುವಾಗ, ಅದು ಸಿದ್ಧವಾದಾಗ, ನೀವು ಮಾ... | ಅದನ್ನು ತಟ್ಟೆಯಲ್ಲಿ ಸುರಿಯಿರಿ. | ಅದನ್ನು ಬಾಟಲಿಯಲ್ಲಿ ಸುರಿಯಿರಿ. | 1 |

## Dataset Structure

Please refer to PIQA.

## Dataset Creation

This dataset was created from PIQA using a 1B high-quality Indic Transformer (Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin). Train and validation splits are currently provided for Bengali, Hindi, and Kannada. The translation was done using beam search with a beam width of 3.

### Curation Rationale

The goal of the dataset is to convert the PIQA data into Indic languages for the development of Indic LLMs.

### Source Data

PIQA

### Annotations

No manual annotation was done; this is entirely a high-quality machine-translated dataset.
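The creation notes above mention beam search with a beam width of 3. As a rough illustration of that decoding strategy (not the author's translation pipeline), the sketch below runs beam search over a hypothetical toy next-token table. The function `toy_next_token_probs`, its vocabulary, and its probabilities are invented for the example; a real translation model would also condition on the source sentence, and many systems add length normalization, which is omitted here.

```python
import math
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[Tuple[str, ...], float]  # (tokens, cumulative log-probability)


def toy_next_token_probs(prefix: Tuple[str, ...]) -> Dict[str, float]:
    """Hypothetical stand-in for a model: next-token probabilities for a prefix."""
    table = {
        (): {"pour": 0.6, "put": 0.4},
        ("pour",): {"it": 0.9, "<eos>": 0.1},
        ("put",): {"it": 0.5, "<eos>": 0.5},
    }
    # Any prefix not listed simply ends the sentence.
    return table.get(prefix, {"<eos>": 1.0})


def beam_search(
    next_probs: Callable[[Tuple[str, ...]], Dict[str, float]],
    beam_width: int = 3,
    max_len: int = 5,
    eos: str = "<eos>",
) -> List[Hypothesis]:
    """Return hypotheses sorted by log-probability, best first."""
    beams: List[Hypothesis] = [((), 0.0)]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        candidates: List[Hypothesis] = []
        for tokens, score in beams:
            for tok, p in next_probs(tokens).items():
                if p <= 0.0:
                    continue
                cand = (tokens + (tok,), score + math.log(p))
                # Completed hypotheses leave the beam; others compete for a slot.
                (finished if tok == eos else candidates).append(cand)
        # Keep only the `beam_width` best unfinished hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len, if any
    return sorted(finished, key=lambda c: c[1], reverse=True)


results = beam_search(toy_next_token_probs, beam_width=3)
best_tokens, best_logprob = results[0]
```

A wider beam explores more alternatives per step at higher cost; a width of 1 reduces to greedy decoding.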
## Citation

    @inproceedings{Bisk2020,
      author = {Yonatan Bisk and Rowan Zellers and Ronan Le Bras and Jianfeng Gao and Yejin Choi},
      title = {PIQA: Reasoning about Physical Commonsense in Natural Language},
      booktitle = {Thirty-Fourth AAAI Conference on Artificial Intelligence},
      year = {2020},
    }

## Dataset Card Authors

Samrat Saha (iitr.samrat@gmail.com)

## Dataset Card Contact

    {
      author = {Samrat Saha},
      title = {PIQA_indic: Reasoning about Physical Commonsense in Natural Language For Indic Languages},
      year = {2024},
    }
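The card's metadata declares five configurations (`ben`, `eng`, `hin`, `kan`, `tam`), each with `train` and `valid` splits and the fields `goal`, `sol1`, `sol2`, and `label`. A minimal sketch of loading one configuration with the Hugging Face `datasets` library and reading a record follows. The helper names `load_config` and `correct_solution` are hypothetical, and the label convention (0 selects `sol1`, 1 selects `sol2`) is assumed to carry over unchanged from the original PIQA. The demo row is the one shown in the card, with its `goal` left truncated as in the source.

```python
from typing import Any, Dict

# Configuration names as declared in the card's YAML metadata.
CONFIGS = ("ben", "eng", "hin", "kan", "tam")


def load_config(config_name: str, split: str = "valid"):
    """Load one language configuration (needs network access and `pip install datasets`)."""
    if config_name not in CONFIGS:
        raise ValueError(f"unknown config {config_name!r}; expected one of {CONFIGS}")
    from datasets import load_dataset  # imported lazily so the helpers below work offline
    return load_dataset("iitrsamrat/piqa_indic", config_name, split=split)


def correct_solution(example: Dict[str, Any]) -> str:
    """Return the solution marked correct (assumed PIQA convention: 0 -> sol1, 1 -> sol2)."""
    return example["sol1"] if example["label"] == 0 else example["sol2"]


# Demo row copied from the card; the goal is truncated in the source.
demo_row = {
    "goal": "ಬೆಣ್ಣೆಯನ್ನು ಕುದಿಸುವಾಗ, ಅದು ಸಿದ್ಧವಾದಾಗ, ನೀವು ಮಾ...",
    "sol1": "ಅದನ್ನು ತಟ್ಟೆಯಲ್ಲಿ ಸುರಿಯಿರಿ.",
    "sol2": "ಅದನ್ನು ಬಾಟಲಿಯಲ್ಲಿ ಸುರಿಯಿರಿ.",
    "label": 1,
}
chosen = correct_solution(demo_row)
```

With network access, `load_config("kan", split="train")` would return the Kannada training split, and `correct_solution` can be applied to each of its rows.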

Provider:

iitrsamrat

Summary of original information

Dataset overview

Dataset configurations

Configuration name: ben

Features:
- goal: string
- sol1: string
- sol2: string
- label: int64
Splits:
- train:
  - Bytes: 10915667
  - Examples: 16113
- valid:
  - Bytes: 1238392
  - Examples: 1838
Download size: 4716439 bytes
Dataset size: 12154059 bytes

Configuration name: eng

Features:
- goal: string
- sol1: string
- sol2: string
- label: int64
Splits:
- train:
  - Bytes: 4104002
  - Examples: 16113
- valid:
  - Bytes: 464309
  - Examples: 1838
Download size: 2958845 bytes
Dataset size: 4568311 bytes

Configuration name: hin

Features:
- goal: string
- sol1: string
- sol2: string
- label: int64
Splits:
- train:
  - Bytes: 10377270
  - Examples: 16113
- valid:
  - Bytes: 1170817
  - Examples: 1838
Download size: 4597934 bytes
Dataset size: 11548087 bytes

Configuration name: kan

Features:
- goal: string
- sol1: string
- sol2: string
- label: int64
Splits:
- train:
  - Bytes: 11890364
  - Examples: 16113
- valid:
  - Bytes: 1348293
  - Examples: 1838
Download size: 4984600 bytes
Dataset size: 13238657 bytes

Configuration name: tam

Features:
- goal: string
- sol1: string
- sol2: string
- label: int64
Splits:
- train:
  - Bytes: 12949508
  - Examples: 16113
- valid:
  - Bytes: 1468796
  - Examples: 1838
Download size: 5199760 bytes
Dataset size: 14418304 bytes
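The figures listed for the five configurations can be cross-checked with a little arithmetic: each dataset size should equal the sum of its train and valid byte counts, and the Indic configurations can be compared against `eng` to see how much larger the translated text is in bytes. The sketch below copies the numbers from the listing; the function names are hypothetical. The larger byte counts probably reflect, at least in part, the multi-byte UTF-8 encoding of Indic scripts, though the listing itself does not say so.

```python
from typing import Dict

# Byte counts copied from the configuration listing.
SIZES: Dict[str, Dict[str, int]] = {
    "ben": {"train": 10915667, "valid": 1238392, "download": 4716439, "dataset": 12154059},
    "eng": {"train": 4104002, "valid": 464309, "download": 2958845, "dataset": 4568311},
    "hin": {"train": 10377270, "valid": 1170817, "download": 4597934, "dataset": 11548087},
    "kan": {"train": 11890364, "valid": 1348293, "download": 4984600, "dataset": 13238657},
    "tam": {"train": 12949508, "valid": 1468796, "download": 5199760, "dataset": 14418304},
}

EXAMPLES_PER_CONFIG = 16113 + 1838  # train + valid, identical for every configuration


def totals_consistent(sizes: Dict[str, Dict[str, int]]) -> Dict[str, bool]:
    """Check that train + valid bytes equals the reported dataset size."""
    return {name: s["train"] + s["valid"] == s["dataset"] for name, s in sizes.items()}


def size_relative_to_english(sizes: Dict[str, Dict[str, int]]) -> Dict[str, float]:
    """Ratio of each configuration's dataset size to the English one."""
    eng = sizes["eng"]["dataset"]
    return {name: s["dataset"] / eng for name, s in sizes.items()}


consistency = totals_consistent(SIZES)
ratios = size_relative_to_english(SIZES)

for name in SIZES:
    print(f"{name}: consistent={consistency[name]}, size vs eng = {ratios[name]:.2f}x")
```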