OmBayus/Turk-GPT-Dataset-tr-1
Source: Hugging Face | Updated: 2024-06-02 | Indexed: 2024-06-12
Download link:
https://hf-mirror.com/datasets/OmBayus/Turk-GPT-Dataset-tr-1
Resource description:
---
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  - name: meta
    struct:
    - name: annotations
      sequence: string
    - name: identification
      struct:
      - name: label
        dtype: string
      - name: prob
        dtype: float64
    - name: line_identifications
      list:
      - name: label
        dtype: string
      - name: prob
        dtype: float64
    - name: warc_headers
      struct:
      - name: content-length
        dtype: int64
      - name: content-type
        dtype: string
      - name: warc-block-digest
        dtype: string
      - name: warc-date
        dtype: string
      - name: warc-identified-content-language
        dtype: string
      - name: warc-record-id
        dtype: string
      - name: warc-refers-to
        dtype: string
      - name: warc-target-uri
        dtype: string
      - name: warc-type
        dtype: string
  splits:
  - name: train
    num_bytes: 621066823
    num_examples: 100001
  download_size: 329767625
  dataset_size: 621066823
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
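The metadata above is enough to read the data with the Hugging Face `datasets` library. The following is a minimal sketch, assuming `datasets` is installed and the repository is still publicly reachable; the helper name `iter_texts` is hypothetical and not part of the dataset or the library.

```python
from itertools import islice
from typing import Iterator

REPO_ID = "OmBayus/Turk-GPT-Dataset-tr-1"  # repository name from the page title
SPLIT = "train"                             # the only split listed in the metadata


def iter_texts(limit: int = 3) -> Iterator[str]:
    """Yield the `text` field of the first `limit` records (hypothetical helper)."""
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    # streaming=True returns an iterable dataset and fetches data on demand
    # instead of downloading every file before the first record is available.
    stream = load_dataset(REPO_ID, split=SPLIT, streaming=True)
    for record in islice(stream, limit):
        yield record["text"]
```

Calling `list(iter_texts(3))` would then return the first three documents as strings. To go through the mirror given in the download link, `huggingface_hub` reads the `HF_ENDPOINT` environment variable, which would have to be set before the library is imported.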
The dataset has three top-level features: id (int64), text (string), and meta (struct). The meta struct contains annotations (a sequence of strings), identification (a struct with label (string) and prob (float64)), line_identifications (a list of structs with the same label and prob fields), and warc_headers (a struct with one int64 field, content-length, and eight string fields). The dataset has a single split, train, with 100001 examples and a size of 621066823 bytes. The download size is 329767625 bytes, and the total dataset size is 621066823 bytes.
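The raw byte counts are easier to judge once converted. The sketch below derives the size in MiB, the average size per example, and the ratio of download size to dataset size. It assumes the usual Hugging Face convention that `download_size` refers to the stored files and `dataset_size` to the decoded data; the source does not state this explicitly.

```python
NUM_BYTES = 621_066_823       # dataset_size / num_bytes from the metadata
DOWNLOAD_BYTES = 329_767_625  # download_size from the metadata
NUM_EXAMPLES = 100_001        # num_examples in the train split


def to_mib(n_bytes: int) -> float:
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return n_bytes / (1024 * 1024)


# Average decoded size of one record, in bytes.
avg_bytes_per_example = NUM_BYTES / NUM_EXAMPLES

# Fraction of the decoded size that has to be downloaded.
download_ratio = DOWNLOAD_BYTES / NUM_BYTES

print(f"dataset size:   {to_mib(NUM_BYTES):.1f} MiB")
print(f"download size:  {to_mib(DOWNLOAD_BYTES):.1f} MiB")
print(f"avg per record: {avg_bytes_per_example:.0f} bytes")
print(f"download ratio: {download_ratio:.2f}")
```

Under these figures a record averages a little over 6 KB, and the download is roughly half the size of the decoded data.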
Provided by:
OmBayus
Original information summary
Dataset overview
Dataset features
- id: integer (int64)
- text: string
- meta: struct with the following sub-features:
  - annotations: sequence of strings (sequence: string)
  - identification: struct containing:
    - label: string
    - prob: floating point (float64)
  - line_identifications: list in which each element contains:
    - label: string
    - prob: floating point (float64)
  - warc_headers: struct containing:
    - content-length: integer (int64)
    - content-type: string
    - warc-block-digest: string
    - warc-date: string
    - warc-identified-content-language: string
    - warc-record-id: string
    - warc-refers-to: string
    - warc-target-uri: string
    - warc-type: string
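The nested feature list above maps naturally onto typed dictionaries. The sketch below models one record and pulls out a few fields. The type names, the `summarize` helper, and every value in `sample` are hypothetical and made up for illustration; only the field names and dtypes come from the feature list.

```python
from typing import List, Tuple, TypedDict


class Identification(TypedDict):
    label: str   # dtype: string
    prob: float  # dtype: float64


# Functional syntax is needed because the keys contain hyphens.
WarcHeaders = TypedDict("WarcHeaders", {
    "content-length": int,
    "content-type": str,
    "warc-block-digest": str,
    "warc-date": str,
    "warc-identified-content-language": str,
    "warc-record-id": str,
    "warc-refers-to": str,
    "warc-target-uri": str,
    "warc-type": str,
})


class Meta(TypedDict):
    annotations: List[str]
    identification: Identification
    line_identifications: List[Identification]
    warc_headers: WarcHeaders


class Record(TypedDict):
    id: int
    text: str
    meta: Meta


def summarize(record: Record) -> Tuple[str, str, float]:
    """Return (target URI, document-level label, label probability)."""
    meta = record["meta"]
    ident = meta["identification"]
    return meta["warc_headers"]["warc-target-uri"], ident["label"], ident["prob"]


# Hypothetical record: all values are invented, only the shape follows the schema.
sample: Record = {
    "id": 0,
    "text": "Merhaba dünya.\nBu bir örnek metindir.",
    "meta": {
        "annotations": [],
        "identification": {"label": "tr", "prob": 0.98},
        "line_identifications": [
            {"label": "tr", "prob": 0.97},
            {"label": "tr", "prob": 0.99},
        ],
        "warc_headers": {
            "content-length": 42,
            "content-type": "text/plain",
            "warc-block-digest": "sha1:EXAMPLE",
            "warc-date": "2023-01-01T00:00:00Z",
            "warc-identified-content-language": "tur",
            "warc-record-id": "<urn:uuid:00000000-0000-0000-0000-000000000000>",
            "warc-refers-to": "<urn:uuid:00000000-0000-0000-0000-000000000001>",
            "warc-target-uri": "https://example.com/",
            "warc-type": "conversion",
        },
    },
}

print(summarize(sample))
```

Real records may contain null values in some of these fields, which this sketch does not handle.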
Dataset splits
- train:
  - Data size: 621066823 bytes
  - Number of examples: 100001
Dataset size
- Download size: 329767625 bytes
- Total dataset size: 621066823 bytes
Configuration
- config_name: default
- data_files:
  - split: train
  - path: data/train-*
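The path entry in the configuration is a glob pattern that selects the files belonging to the train split. The sketch below illustrates the matching with the standard library's `fnmatchcase`. The file names are hypothetical, since the source does not list the actual files, and the Hub's own pattern resolution may differ in details from `fnmatch`.

```python
from fnmatch import fnmatchcase

DATA_FILES_PATTERN = "data/train-*"  # path pattern from the default config

# Hypothetical file names; the real file names are not given in the source.
candidates = [
    "data/train-00000-of-00002.parquet",
    "data/train-00001-of-00002.parquet",
    "data/test-00000-of-00001.parquet",
    "README.md",
]

# Keep only the names that the train pattern would select.
train_files = [name for name in candidates if fnmatchcase(name, DATA_FILES_PATTERN)]
print(train_files)
```

Only the two names that start with `data/train-` are kept, so every file with that prefix would be assigned to the train split.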



