datablations/oscar-filter-small
Source: Hugging Face. Updated: 2022-11-24. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/datablations/oscar-filter-small
Resource description:
---
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  - name: meta
    struct:
    - name: annotations
      sequence: string
    - name: identification
      struct:
      - name: label
        dtype: string
      - name: prob
        dtype: float64
    - name: line_identifications
      list:
      - name: label
        dtype: string
      - name: prob
        dtype: float64
    - name: perplexity_score
      dtype: float64
    - name: warc_headers
      struct:
      - name: content-length
        dtype: int64
      - name: content-type
        dtype: string
      - name: warc-block-digest
        dtype: string
      - name: warc-date
        dtype: string
      - name: warc-identified-content-language
        dtype: string
      - name: warc-record-id
        dtype: string
      - name: warc-refers-to
        dtype: string
      - name: warc-target-uri
        dtype: string
      - name: warc-type
        dtype: string
  splits:
  - name: train
    num_bytes: 658480427
    num_examples: 100000
  download_size: 347756473
  dataset_size: 658480427
---
# Dataset Card for "small-oscar"
[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
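The card gives no usage notes. As a minimal sketch, the train split could be streamed with the Hugging Face `datasets` library, assuming it is installed, that network access is available, and that the repository id matches the page title (`datablations/oscar-filter-small`). The helper names `preview_text`, `stream_train`, and `print_first` are hypothetical and not part of the dataset.

```python
from itertools import islice


def preview_text(record, width=80):
    """Return a one-line preview of a record's `text` field."""
    # Collapse newlines and repeated whitespace into single spaces.
    flat = " ".join(record["text"].split())
    return flat if len(flat) <= width else flat[: width - 3] + "..."


def stream_train(repo_id="datablations/oscar-filter-small"):
    """Stream the train split lazily instead of downloading it in full."""
    # Imported here because `datasets` is a third-party dependency.
    from datasets import load_dataset

    return load_dataset(repo_id, split="train", streaming=True)


def print_first(n=3):
    """Print the id and a short text preview for the first n records."""
    for record in islice(stream_train(), n):
        print(record["id"], preview_text(record))
```

Calling `print_first(3)` would fetch only the first few records. Streaming is used here so that a quick inspection does not require the full download.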
Provider:
datablations
Summary of original information
Dataset overview
Dataset features
- id: integer (int64)
- text: string
- meta: struct with the following sub-features:
  - annotations: sequence of strings
  - identification: struct containing:
    - label: string
    - prob: float (float64)
  - line_identifications: list whose elements each contain:
    - label: string
    - prob: float (float64)
  - perplexity_score: float (float64)
  - warc_headers: struct containing:
    - content-length: integer (int64)
    - content-type: string
    - warc-block-digest: string
    - warc-date: string
    - warc-identified-content-language: string
    - warc-record-id: string
    - warc-refers-to: string
    - warc-target-uri: string
    - warc-type: string
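To illustrate how the nested fields listed above might be accessed, here is a small sketch that filters records by document-level language identification and, optionally, by perplexity. The function name `keep_record`, the thresholds, and every value in the sample record are invented for illustration; the sample is abbreviated and omits `line_identifications` and `warc_headers`.

```python
def keep_record(record, lang="en", min_prob=0.8, max_perplexity=None):
    """Decide whether a record passes simple language and perplexity checks.

    The thresholds are hypothetical defaults, not values used by the
    dataset authors.
    """
    meta = record["meta"]
    ident = meta["identification"]
    # Defensive check: assume the document-level identification may be absent.
    if ident is None:
        return False
    if ident["label"] != lang or ident["prob"] < min_prob:
        return False
    if max_perplexity is not None and meta["perplexity_score"] > max_perplexity:
        return False
    return True


# Hand-built sample following the schema; all values are made up.
sample = {
    "id": 0,
    "text": "An example document.",
    "meta": {
        "annotations": [],
        "identification": {"label": "en", "prob": 0.95},
        "perplexity_score": 312.5,
    },
}

print(keep_record(sample))                        # True
print(keep_record(sample, max_perplexity=100.0))  # False
```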
Dataset splits
- train:
  - Size: 658480427 bytes
  - Number of examples: 100000
Dataset size
- Download size: 347756473 bytes
- Dataset size: 658480427 bytes
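The reported sizes allow a quick back-of-the-envelope check. The sketch below derives the average size per example and the ratio of download size to dataset size from the figures above; the variable names are illustrative only.

```python
# Figures copied from the dataset card.
NUM_BYTES = 658_480_427       # dataset size of the train split, in bytes
NUM_EXAMPLES = 100_000        # number of examples in the train split
DOWNLOAD_SIZE = 347_756_473   # download size, in bytes

avg_bytes_per_example = NUM_BYTES / NUM_EXAMPLES
download_ratio = DOWNLOAD_SIZE / NUM_BYTES

print(f"{avg_bytes_per_example:.0f} bytes per example on average")  # 6585
print(f"download is {download_ratio:.1%} of the dataset size")      # 52.8%
```

So each example averages roughly 6.6 kB, and the download is a little over half the size of the prepared dataset.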



