davidfant/natural-questions-chunk-5

Name: davidfant/natural-questions-chunk-5
Creator: davidfant
Published: 2023-10-22 23:06:32
License: No description available

Hugging Face | Updated 2023-10-22 | Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/davidfant/natural-questions-chunk-5

Resource overview:

---
dataset_info:
  features:
  - name: id
    dtype: string
  - name: document
    struct:
    - name: html
      dtype: string
    - name: title
      dtype: string
    - name: tokens
      sequence:
      - name: end_byte
        dtype: int64
      - name: is_html
        dtype: bool
      - name: start_byte
        dtype: int64
      - name: token
        dtype: string
    - name: url
      dtype: string
  - name: question
    struct:
    - name: text
      dtype: string
    - name: tokens
      sequence: string
  - name: long_answer_candidates
    sequence:
    - name: end_byte
      dtype: int64
    - name: end_token
      dtype: int64
    - name: start_byte
      dtype: int64
    - name: start_token
      dtype: int64
    - name: top_level
      dtype: bool
  - name: annotations
    sequence:
    - name: id
      dtype: string
    - name: long_answer
      struct:
      - name: candidate_index
        dtype: int64
      - name: end_byte
        dtype: int64
      - name: end_token
        dtype: int64
      - name: start_byte
        dtype: int64
      - name: start_token
        dtype: int64
    - name: short_answers
      sequence:
      - name: end_byte
        dtype: int64
      - name: end_token
        dtype: int64
      - name: start_byte
        dtype: int64
      - name: start_token
        dtype: int64
      - name: text
        dtype: string
    - name: yes_no_answer
      dtype:
        class_label:
          names:
            '0': 'NO'
            '1': 'YES'
  splits:
  - name: train
    num_bytes: 4651468477
    num_examples: 10000
  download_size: 1807817811
  dataset_size: 4651468477
---
# Dataset Card for "natural-questions-chunk-5"

[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

In addition to a string id, each example in natural-questions-chunk-5 has four features: document, question, long_answer_candidates, and annotations. The document feature holds the page HTML, title, and URL, plus a token sequence in which every token carries its start and end byte offsets and an is_html flag. The question feature holds the question text and its tokens. Each long answer candidate is a span given by start and end positions in both bytes and tokens, with a top_level flag. Each annotation has an id, a long answer span with the index of the matching candidate, a sequence of short answer spans with their text, and a yes/no answer label (NO or YES). The dataset has a single train split of 10,000 examples, about 4.65 GB (4,651,468,477 bytes) in total, with a download size of about 1.81 GB (1,807,817,811 bytes).

Provider:

davidfant

Summary of original information

Dataset overview

Feature structure

id: string.
document: struct with the following fields:
- html: string.
- title: string.
- tokens: sequence with the following fields:
  - end_byte: int64.
  - is_html: bool.
  - start_byte: int64.
  - token: string.
- url: string.
question: struct with the following fields:
- text: string.
- tokens: sequence of strings.
long_answer_candidates: sequence with the following fields:
- end_byte: int64.
- end_token: int64.
- start_byte: int64.
- start_token: int64.
- top_level: bool.
annotations: sequence with the following fields:
- id: string.
- long_answer: struct with the following fields:
  - candidate_index: int64.
  - end_byte: int64.
  - end_token: int64.
  - start_byte: int64.
  - start_token: int64.
- short_answers: sequence with the following fields:
  - end_byte: int64.
  - end_token: int64.
  - start_byte: int64.
  - start_token: int64.
  - text: string.
- yes_no_answer: class label with the following classes:
  - 0: NO
  - 1: YES
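
To make the nesting above concrete, here is a minimal Python sketch that reads the long answer text and the yes/no label from one record. It rests on several assumptions not stated on this page: that a sequence of named fields is exposed as a dict of lists (the usual behavior of the Hugging Face `datasets` library), that `end_token` is exclusive, and that a negative value means "no answer". The `example` record is a hypothetical, hand-made stand-in, not real data from the dataset, and the helper names are illustrative.

```python
from typing import Optional

# Label names taken from the class_label definition of yes_no_answer.
YES_NO_NAMES = ["NO", "YES"]


def decode_yes_no(label: int) -> Optional[str]:
    """Map a yes_no_answer class index to its name."""
    if 0 <= label < len(YES_NO_NAMES):
        return YES_NO_NAMES[label]
    # Assumption: out-of-range values such as -1 mean "no yes/no answer".
    return None


def long_answer_text(record: dict, annotation_index: int = 0) -> Optional[str]:
    """Join the non-HTML tokens covered by one annotation's long answer."""
    tokens = record["document"]["tokens"]  # assumed dict of parallel lists
    span = record["annotations"]["long_answer"][annotation_index]
    start, end = span["start_token"], span["end_token"]
    # Assumption: end_token is exclusive and negative values mean no answer.
    if start < 0 or end <= start:
        return None
    pairs = zip(tokens["token"][start:end], tokens["is_html"][start:end])
    return " ".join(tok for tok, is_html in pairs if not is_html)


# Hypothetical record shaped like the feature structure listed above.
example = {
    "id": "0",
    "document": {
        "tokens": {
            "token": ["<P>", "Paris", "is", "the", "capital", "</P>"],
            "is_html": [True, False, False, False, False, True],
            "start_byte": [0, 3, 9, 12, 16, 23],
            "end_byte": [3, 8, 11, 15, 23, 27],
        },
    },
    "annotations": {
        "id": ["a0"],
        "long_answer": [
            {"candidate_index": 0, "start_token": 0, "end_token": 6,
             "start_byte": 0, "end_byte": 27},
        ],
        "short_answers": [
            {"start_token": [1], "end_token": [2],
             "start_byte": [3], "end_byte": [8], "text": ["Paris"]},
        ],
        "yes_no_answer": [-1],
    },
}

print(long_answer_text(example))
print(decode_yes_no(example["annotations"]["yes_no_answer"][0]))
```

With the `datasets` library installed, real records would presumably come from `load_dataset("davidfant/natural-questions-chunk-5", split="train")`; check the actual record layout before relying on the assumptions above.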

Dataset splits

train: 10,000 examples, 4651468477 bytes.

Dataset size

Download size: 1807817811 bytes.
Dataset size: 4651468477 bytes.
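
As a quick sanity check on these figures, the short Python sketch below derives the average size per example and the ratio of download size to dataset size. The constants are copied from the listing; the variable names are illustrative, and decimal units (1 GB = 10^9 bytes) are an assumption.

```python
# Figures copied from the split and size information above.
NUM_BYTES = 4651468477      # dataset_size / train num_bytes
NUM_EXAMPLES = 10000        # train num_examples
DOWNLOAD_SIZE = 1807817811  # download_size

# Average uncompressed size of one example, in bytes.
avg_bytes_per_example = NUM_BYTES / NUM_EXAMPLES

# Fraction of the dataset size that has to be downloaded.
download_ratio = DOWNLOAD_SIZE / NUM_BYTES

print(f"dataset size:    {NUM_BYTES / 1e9:.2f} GB")
print(f"download size:   {DOWNLOAD_SIZE / 1e9:.2f} GB")
print(f"avg per example: {avg_bytes_per_example / 1e3:.1f} KB")
print(f"download ratio:  {download_ratio:.2f}")
```

Each example averages roughly 465 KB, which is plausible given that every record embeds the full HTML of a page, and the download is a bit under 40% of the dataset size.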
