iitrsamrat/piqa_indic

Hugging Face | Updated 2024-02-06 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/iitrsamrat/piqa_indic
Resource description:
---
license: apache-2.0
dataset_info:
- config_name: ben
  features:
  - name: goal
    dtype: string
  - name: sol1
    dtype: string
  - name: sol2
    dtype: string
  - name: label
    dtype: int64
  splits:
  - name: train
    num_bytes: 10915667
    num_examples: 16113
  - name: valid
    num_bytes: 1238392
    num_examples: 1838
  download_size: 4716439
  dataset_size: 12154059
- config_name: eng
  features:
  - name: goal
    dtype: string
  - name: sol1
    dtype: string
  - name: sol2
    dtype: string
  - name: label
    dtype: int64
  splits:
  - name: train
    num_bytes: 4104002
    num_examples: 16113
  - name: valid
    num_bytes: 464309
    num_examples: 1838
  download_size: 2958845
  dataset_size: 4568311
- config_name: hin
  features:
  - name: goal
    dtype: string
  - name: sol1
    dtype: string
  - name: sol2
    dtype: string
  - name: label
    dtype: int64
  splits:
  - name: train
    num_bytes: 10377270
    num_examples: 16113
  - name: valid
    num_bytes: 1170817
    num_examples: 1838
  download_size: 4597934
  dataset_size: 11548087
- config_name: kan
  features:
  - name: goal
    dtype: string
  - name: sol1
    dtype: string
  - name: sol2
    dtype: string
  - name: label
    dtype: int64
  splits:
  - name: train
    num_bytes: 11890364
    num_examples: 16113
  - name: valid
    num_bytes: 1348293
    num_examples: 1838
  download_size: 4984600
  dataset_size: 13238657
- config_name: tam
  features:
  - name: goal
    dtype: string
  - name: sol1
    dtype: string
  - name: sol2
    dtype: string
  - name: label
    dtype: int64
  splits:
  - name: train
    num_bytes: 12949508
    num_examples: 16113
  - name: valid
    num_bytes: 1468796
    num_examples: 1838
  download_size: 5199760
  dataset_size: 14418304
configs:
- config_name: ben
  data_files:
  - split: train
    path: ben/train-*
  - split: valid
    path: ben/valid-*
- config_name: eng
  data_files:
  - split: train
    path: eng/train-*
  - split: valid
    path: eng/valid-*
- config_name: hin
  data_files:
  - split: train
    path: hin/train-*
  - split: valid
    path: hin/valid-*
- config_name: kan
  data_files:
  - split: train
    path: kan/train-*
  - split: valid
    path: kan/valid-*
- config_name: tam
  data_files:
  - split: train
    path: tam/train-*
  - split: valid
    path: tam/valid-*
---

# Dataset Card for PIQA_indic

## Dataset Summary (Taken from Piqa)

To apply eyeshadow without a brush, should I use a cotton swab or a toothpick? Questions requiring this kind of physical commonsense pose a challenge to state-of-the-art natural language understanding systems. The PIQA dataset introduces the task of physical commonsense reasoning and a corresponding benchmark dataset, Physical Interaction: Question Answering (PIQA).

Physical commonsense knowledge is a major challenge on the road to true AI-completeness, including robots that interact with the world and understand natural language.

PIQA focuses on everyday situations with a preference for atypical solutions. The dataset is inspired by instructables.com, which provides users with instructions on how to build, craft, bake, or manipulate objects using everyday materials.

- **Curated by:** Samrat Saha
- **Language(s) (NLP):** ISO 639-2 Code - ben, hin, kan
- **License:** Apache-2.0

### Dataset Sources [optional]

- **Demo [optional]:**

| goal | sol1 | sol2 | label |
|------|------|------|-------|
| ಬೆಣ್ಣೆಯನ್ನು ಕುದಿಸುವಾಗ, ಅದು ಸಿದ್ಧವಾದಾಗ, ನೀವು ಮಾ... | ಅದನ್ನು ತಟ್ಟೆಯಲ್ಲಿ ಸುರಿಯಿರಿ. | ಅದನ್ನು ಬಾಟಲಿಯಲ್ಲಿ ಸುರಿಯಿರಿ. | 1 |

## Dataset Structure

Please refer to Piqa.

## Dataset Creation

This dataset was created from Piqa using a 1B high-quality Indic Transformer (Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin). Currently, the train and validation sets are provided for the Bengali, Hindi, and Kannada languages. The translation was done using beam search with a beam width of 3.

### Curation Rationale

The goal of the dataset is to convert the Piqa data into Indic languages for the development of Indic LLMs.

### Source Data

Piqa

### Annotations

No manual annotation was done; this is entirely a high-quality machine-translated dataset.

## Citation

    @inproceedings{Bisk2020,
      author = {Yonatan Bisk and Rowan Zellers and Ronan Le Bras and Jianfeng Gao and Yejin Choi},
      title = {PIQA: Reasoning about Physical Commonsense in Natural Language},
      booktitle = {Thirty-Fourth AAAI Conference on Artificial Intelligence},
      year = {2020},
    }

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

<!-- If relevant, include terms and calculations in this section that can help readers understand the dataset or dataset card. -->

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Dataset Card Authors [optional]

Samrat Saha, iitr.samrat@gmail.com

## Dataset Card Contact

    {
      author = {Samrat Saha},
      title = {PIQA_indic: Reasoning about Physical Commonsense in Natural Language For Indic Languages},
      year = {2024},
    }
Provider:
iitrsamrat
Original information summary

Dataset overview

Dataset configurations

Config name: ben

  • Features:
    • goal: string
    • sol1: string
    • sol2: string
    • label: int64
  • Splits:
    • train:
      • Bytes: 10915667
      • Examples: 16113
    • valid:
      • Bytes: 1238392
      • Examples: 1838
  • Download size: 4716439 bytes
  • Dataset size: 12154059 bytes

Config name: eng

  • Features:
    • goal: string
    • sol1: string
    • sol2: string
    • label: int64
  • Splits:
    • train:
      • Bytes: 4104002
      • Examples: 16113
    • valid:
      • Bytes: 464309
      • Examples: 1838
  • Download size: 2958845 bytes
  • Dataset size: 4568311 bytes

Config name: hin

  • Features:
    • goal: string
    • sol1: string
    • sol2: string
    • label: int64
  • Splits:
    • train:
      • Bytes: 10377270
      • Examples: 16113
    • valid:
      • Bytes: 1170817
      • Examples: 1838
  • Download size: 4597934 bytes
  • Dataset size: 11548087 bytes

Config name: kan

  • Features:
    • goal: string
    • sol1: string
    • sol2: string
    • label: int64
  • Splits:
    • train:
      • Bytes: 11890364
      • Examples: 16113
    • valid:
      • Bytes: 1348293
      • Examples: 1838
  • Download size: 4984600 bytes
  • Dataset size: 13238657 bytes

Config name: tam

  • Features:
    • goal: string
    • sol1: string
    • sol2: string
    • label: int64
  • Splits:
    • train:
      • Bytes: 12949508
      • Examples: 16113
    • valid:
      • Bytes: 1468796
      • Examples: 1838
  • Download size: 5199760 bytes
  • Dataset size: 14418304 bytes
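
Every config shares the same four fields, so a row can be modeled as one small record. The following is a minimal sketch, not part of the dataset itself. It assumes, following the original PIQA release rather than anything stated on this page, that `label` is the 0-based index of the correct solution (0 for `sol1`, 1 for `sol2`). The names `PiqaRow`, `correct_solution`, and `accuracy` are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PiqaRow:
    """One record with the four fields listed for every config."""
    goal: str
    sol1: str
    sol2: str
    label: int  # assumption: 0 -> sol1 is correct, 1 -> sol2 is correct


def correct_solution(row: PiqaRow) -> str:
    """Return the solution text that the label points at."""
    if row.label not in (0, 1):
        raise ValueError(f"unexpected label: {row.label}")
    return row.sol1 if row.label == 0 else row.sol2


def accuracy(rows: Sequence[PiqaRow], predictions: Sequence[int]) -> float:
    """Fraction of predictions (each 0 or 1) that match the gold labels."""
    if len(rows) != len(predictions):
        raise ValueError("rows and predictions must have the same length")
    if not rows:
        return 0.0
    hits = sum(1 for row, pred in zip(rows, predictions) if row.label == pred)
    return hits / len(rows)
```

A model that scores both solutions for a goal would emit 0 or 1 per row, and `accuracy` compares those choices against the `label` column.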

Data files

Config name: ben

  • Data files:
    • train: ben/train-*
    • valid: ben/valid-*

Config name: eng

  • Data files:
    • train: eng/train-*
    • valid: eng/valid-*

Config name: hin

  • Data files:
    • train: hin/train-*
    • valid: hin/valid-*

Config name: kan

  • Data files:
    • train: kan/train-*
    • valid: kan/valid-*

Config name: tam

  • Data files:
    • train: tam/train-*
    • valid: tam/valid-*
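
The config and split names above map directly onto file patterns of the form `<config>/<split>-*`. A minimal sketch of loading one split with the Hugging Face `datasets` library follows. It assumes the library is installed and the repository `iitrsamrat/piqa_indic` is reachable over the network. The helper names `data_file_pattern` and `load_split` are hypothetical.

```python
REPO_ID = "iitrsamrat/piqa_indic"
CONFIGS = ("ben", "eng", "hin", "kan", "tam")
SPLITS = ("train", "valid")


def data_file_pattern(config: str, split: str) -> str:
    """Return the file glob listed for a config/split pair, e.g. 'hin/valid-*'."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    return f"{config}/{split}-*"


def load_split(config: str, split: str):
    """Load one split of one language config (needs `datasets` and network access)."""
    # Imported lazily so the helpers above work without the library installed.
    from datasets import load_dataset

    data_file_pattern(config, split)  # validates the names before any download
    return load_dataset(REPO_ID, config, split=split)
```

For example, `load_split("kan", "valid")` would request the Kannada validation split. Note that the validation split is named `valid` here, not `validation`.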