datablations/oscar-filter
Hugging Face · Updated 2023-05-10 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/datablations/oscar-filter
Resource description:
---
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  - name: meta
    struct:
    - name: warc_headers
      struct:
      - name: warc-record-id
        dtype: string
      - name: warc-date
        dtype: string
      - name: content-type
        dtype: string
      - name: content-length
        dtype: int32
      - name: warc-type
        dtype: string
      - name: warc-identified-content-language
        dtype: string
      - name: warc-refers-to
        dtype: string
      - name: warc-target-uri
        dtype: string
      - name: warc-block-digest
        dtype: string
    - name: identification
      struct:
      - name: label
        dtype: string
      - name: prob
        dtype: float32
    - name: annotations
      sequence: string
    - name: line_identifications
      list:
      - name: label
        dtype: string
      - name: prob
        dtype: float32
  - name: perplexity_score
    dtype: float64
  - name: text_length
    dtype: int64
  - name: url
    dtype: string
  - name: domain
    dtype: string
  - name: dup_ratio
    dtype: float64
  - name: pairs
    sequence:
      sequence: int64
  - name: repetitions
    sequence: binary
  - name: included_in_dedup
    dtype: bool
  - name: cluster
    sequence: int64
  splits:
  - name: train
    num_bytes: 3188486875748
    num_examples: 431992659
  download_size: 419397499659
  dataset_size: 3188486875748
---
This is the variant where we build the suffix array for 25% of OSCAR and deduplicate only that part. By deduplication I mean removing any document that shares a span of at least 100 characters with another document in the 25% chunk. This is very strict and preserves only about 20 million documents, so less than 5% of the full OSCAR.
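The removal rule described above can be illustrated on a toy corpus. The following is a minimal sketch, not the authors' pipeline: it swaps the suffix array for a plain sliding-window index, which is only practical for small inputs. The function names and the `min_span` parameter are hypothetical. It follows the literal reading of the description, in which every document involved in an overlap is dropped; whether the original pipeline keeps one copy is not stated.

```python
from collections import defaultdict


def find_overlapping_docs(docs, min_span=100):
    """Return indices of documents that share a span of at least
    `min_span` characters with some other document.

    Two documents share a span of >= min_span characters exactly when
    they share at least one window of exactly min_span characters, so
    indexing fixed-size windows is sufficient. (Assumption: this
    replaces the suffix array used for the real dataset.)
    """
    window_owners = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for start in range(len(text) - min_span + 1):
            window_owners[text[start:start + min_span]].add(doc_id)

    flagged = set()
    for owners in window_owners.values():
        # A window repeated only inside one document does not count.
        if len(owners) > 1:
            flagged.update(owners)
    return flagged


def strict_dedup(docs, min_span=100):
    """Drop every document involved in a cross-document overlap."""
    flagged = find_overlapping_docs(docs, min_span)
    return [text for doc_id, text in enumerate(docs) if doc_id not in flagged]


if __name__ == "__main__":
    toy = ["hello world foo", "say hello world", "unique text"]
    # A short span is used here only so the toy strings can overlap.
    print(strict_dedup(toy, min_span=5))
```

Because both sides of every overlap are removed, a single widely copied passage eliminates all documents that contain it, which is consistent with the low survival rate reported above.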
Provided by:
datablations
Summary of original information
Dataset overview
Dataset features
- id: integer (int64)
- text: string
- meta: struct with the following sub-features:
  - warc_headers: struct with the following sub-features:
    - warc-record-id: string
    - warc-date: string
    - content-type: string
    - content-length: integer (int32)
    - warc-type: string
    - warc-identified-content-language: string
    - warc-refers-to: string
    - warc-target-uri: string
    - warc-block-digest: string
  - identification: struct with the following sub-features:
    - label: string
    - prob: float (float32)
  - annotations: sequence of strings
  - line_identifications: list with the following sub-features:
    - label: string
    - prob: float (float32)
- perplexity_score: float (float64)
- text_length: integer (int64)
- url: string
- domain: string
- dup_ratio: float (float64)
- pairs: sequence of integer sequences (sequence: sequence: int64)
- repetitions: sequence of binary values (sequence: binary)
- included_in_dedup: boolean (bool)
- cluster: sequence of integers (sequence: int64)
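To make the feature list concrete, here is a hedged sketch of how a record might be read. It assumes the Hugging Face `datasets` library is installed and that the nested fields sit under `meta` as in the usual OSCAR layout. The helper names `stream_records` and `summarize` are hypothetical, and the meaning of `included_in_dedup` is not documented here, so the code only passes it through.

```python
def stream_records(limit=3):
    """Yield the first `limit` records without downloading the full split.

    Requires the third-party `datasets` package and network access.
    """
    from datasets import load_dataset  # imported lazily; pip install datasets

    ds = load_dataset("datablations/oscar-filter", split="train", streaming=True)
    for i, record in enumerate(ds):
        if i >= limit:
            break
        yield record


def summarize(record):
    """Pull a few fields out of one record, tolerating missing values."""
    meta = record.get("meta") or {}
    identification = meta.get("identification") or {}
    headers = meta.get("warc_headers") or {}
    return {
        "id": record["id"],
        "language": identification.get("label"),
        "language_prob": identification.get("prob"),
        "uri": headers.get("warc-target-uri"),
        "text_length": record.get("text_length"),
        "included_in_dedup": record.get("included_in_dedup"),
    }


if __name__ == "__main__":
    for rec in stream_records(limit=3):
        print(summarize(rec))
```

Streaming is used because the full split is several terabytes, so iterating lazily avoids materializing it on disk.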
Dataset splits
- train: training set
  - Bytes: 3188486875748
  - Examples: 431992659
Dataset size
- Download size: 419397499659 bytes
- Dataset size: 3188486875748 bytes
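The figures above can be sanity-checked with a little arithmetic. This sketch assumes the train split count covers the full corpus and takes the "about 20 million" surviving documents from the dataset description as an approximate value; the variable names are illustrative.

```python
NUM_BYTES = 3_188_486_875_748      # dataset size in bytes
NUM_EXAMPLES = 431_992_659         # examples in the train split
DOWNLOAD_SIZE = 419_397_499_659    # download size in bytes
KEPT_AFTER_DEDUP = 20_000_000      # approximate, from the description

# Average uncompressed size of one example.
avg_bytes_per_example = NUM_BYTES / NUM_EXAMPLES

# Download size relative to the uncompressed dataset size.
download_ratio = DOWNLOAD_SIZE / NUM_BYTES

# Share of examples that survive deduplication, if the split count
# is the full corpus (assumption).
kept_fraction = KEPT_AFTER_DEDUP / NUM_EXAMPLES

print(f"average bytes per example: {avg_bytes_per_example:.0f}")
print(f"download / dataset size:   {download_ratio:.3f}")
print(f"kept after dedup:          {kept_fraction:.1%}")
```

Under that assumption the surviving share comes out below 5%, which matches the claim in the description.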



