bjoernp/oscar2301_de_deduped_filtered
Source: Hugging Face | Updated: 2023-06-11 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/bjoernp/oscar2301_de_deduped_filtered
Resource description:
---
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  - name: meta
    struct:
    - name: warc_headers
      struct:
      - name: warc-record-id
        dtype: string
      - name: warc-date
        dtype: string
      - name: content-type
        dtype: string
      - name: content-length
        dtype: int32
      - name: warc-type
        dtype: string
      - name: warc-identified-content-language
        dtype: string
      - name: warc-refers-to
        dtype: string
      - name: warc-target-uri
        dtype: string
      - name: warc-block-digest
        dtype: string
    - name: identification
      struct:
      - name: label
        dtype: string
      - name: prob
        dtype: float32
    - name: harmful_pp
      dtype: float32
    - name: tlsh
      dtype: string
    - name: quality_warnings
      sequence: string
    - name: categories
      sequence: string
    - name: sentence_identifications
      list:
      - name: label
        dtype: string
      - name: prob
        dtype: float32
  splits:
  - name: train
    num_bytes: 303054722776.2827
    num_examples: 42108307
  download_size: 211315018208
  dataset_size: 303054722776.2827
---
# Dataset Card for "oscar2301_de_deduped_filtered"
[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
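
Since the card body itself is empty, here is a minimal sketch of how the train split might be inspected without downloading all of it. It assumes the third-party `datasets` library is installed and that the repository is publicly readable; the helper name `first_examples` is hypothetical and not part of the dataset or the library.

```python
from itertools import islice
from typing import Any, Dict, List

# Repository name as given on this page.
REPO_ID = "bjoernp/oscar2301_de_deduped_filtered"


def first_examples(n: int = 3) -> List[Dict[str, Any]]:
    """Return the first n records of the train split (hypothetical helper)."""
    # Imported lazily so that defining this function needs only the stdlib.
    from datasets import load_dataset  # third-party: pip install datasets

    # streaming=True yields records lazily instead of fetching every shard,
    # which matters for a download of roughly 211 GB.
    stream = load_dataset(REPO_ID, split="train", streaming=True)
    return list(islice(stream, n))
```

Calling `first_examples(3)` performs network requests, so it is left out of the block; each returned record should be a dict with the `id`, `text`, and `meta` keys described in the metadata.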
Provider:
bjoernp
Summary of original information
Dataset overview
Dataset name
oscar2301_de_deduped_filtered
Dataset features
- id: integer (int64)
- text: string
- meta: struct
  - warc_headers: struct
    - warc-record-id: string
    - warc-date: string
    - content-type: string
    - content-length: integer (int32)
    - warc-type: string
    - warc-identified-content-language: string
    - warc-refers-to: string
    - warc-target-uri: string
    - warc-block-digest: string
  - identification: struct
    - label: string
    - prob: float (float32)
  - harmful_pp: float (float32)
  - tlsh: string
  - quality_warnings: sequence of strings
  - categories: sequence of strings
  - sentence_identifications: list
    - label: string
    - prob: float (float32)
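
To make the nested feature layout concrete, the sketch below pulls a few fields out of one record. The function name `summarize` and every value in `sample` are invented for illustration and are not taken from the dataset; only the key names come from the feature list.

```python
from typing import Any, Dict


def summarize(example: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten a few nested fields of one record (hypothetical helper)."""
    meta = example["meta"]
    headers = meta["warc_headers"]      # keys contain hyphens, so use [] access
    ident = meta["identification"]      # document-level language identification
    return {
        "id": example["id"],
        "url": headers["warc-target-uri"],
        "crawl_date": headers["warc-date"],
        "language": ident["label"],
        "language_prob": ident["prob"],
        "num_chars": len(example["text"]),
        # Treat a missing list as empty so callers can always iterate.
        "quality_warnings": list(meta["quality_warnings"] or []),
    }


# Placeholder record: all values are made up, only the structure is real.
sample = {
    "id": 0,
    "text": "Ein kurzer Beispieltext.",
    "meta": {
        "warc_headers": {
            "warc-target-uri": "https://example.org/",
            "warc-date": "2023-01-01T00:00:00Z",
        },
        "identification": {"label": "de", "prob": 0.99},
        "quality_warnings": None,
    },
}

print(summarize(sample))
```

The sample omits the remaining `meta` fields for brevity; a real record would carry all of the keys listed above.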
Dataset splits
- train:
  - Size: 303054722776.2827 bytes
  - Number of examples: 42108307
Dataset size
- Download size: 211315018208 bytes
- Dataset size: 303054722776.2827 bytes
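
The raw byte counts are hard to read at a glance, so this short calculation converts them to binary gigabytes and derives an average record size. The constants are copied from the metadata; the variable and function names are just for illustration.

```python
NUM_BYTES = 303054722776.2827   # num_bytes / dataset_size from the metadata
DOWNLOAD_BYTES = 211315018208   # download_size from the metadata
NUM_EXAMPLES = 42108307         # num_examples of the train split


def to_gib(num_bytes: float) -> float:
    """Convert a byte count to gibibytes (1 GiB = 2**30 bytes)."""
    return num_bytes / 2**30


dataset_gib = to_gib(NUM_BYTES)
download_gib = to_gib(DOWNLOAD_BYTES)
avg_bytes = NUM_BYTES / NUM_EXAMPLES   # mean size of one example in bytes
ratio = DOWNLOAD_BYTES / NUM_BYTES     # download size relative to dataset size

print(f"dataset:  {dataset_gib:.1f} GiB")
print(f"download: {download_gib:.1f} GiB ({ratio:.0%} of dataset size)")
print(f"average:  {avg_bytes:.0f} bytes per example")
```

This works out to roughly 282 GiB for the dataset, about 197 GiB to download, and an average of around 7.2 KB per example.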



