datablations/oscar-filter

Name: datablations/oscar-filter
Creator: datablations
Published: 2023-05-10 06:58:28
License: No description available

Source: Hugging Face. Updated 2023-05-10; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/datablations/oscar-filter

Description:

---
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  - name: meta
    struct:
    - name: warc_headers
      struct:
      - name: warc-record-id
        dtype: string
      - name: warc-date
        dtype: string
      - name: content-type
        dtype: string
      - name: content-length
        dtype: int32
      - name: warc-type
        dtype: string
      - name: warc-identified-content-language
        dtype: string
      - name: warc-refers-to
        dtype: string
      - name: warc-target-uri
        dtype: string
      - name: warc-block-digest
        dtype: string
    - name: identification
      struct:
      - name: label
        dtype: string
      - name: prob
        dtype: float32
    - name: annotations
      sequence: string
    - name: line_identifications
      list:
      - name: label
        dtype: string
      - name: prob
        dtype: float32
  - name: perplexity_score
    dtype: float64
  - name: text_length
    dtype: int64
  - name: url
    dtype: string
  - name: domain
    dtype: string
  - name: dup_ratio
    dtype: float64
  - name: pairs
    sequence:
      sequence: int64
  - name: repetitions
    sequence: binary
  - name: included_in_dedup
    dtype: bool
  - name: cluster
    sequence: int64
  splits:
  - name: train
    num_bytes: 3188486875748
    num_examples: 431992659
  download_size: 419397499659
  dataset_size: 3188486875748
---

This is the variant where we build the suffix array for 25% of Oscar and deduplicate only that part. By deduplication I mean removing any document that has a span of at least 100 characters overlapping with another document in the 25% chunk. This is very strict and preserves only about 20 million documents, so less than 5% of the full Oscar.
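The removal rule described above can be illustrated on a toy scale. The sketch below is not the authors' pipeline: the description says a suffix array was used, whereas this example swaps in a plain dictionary of fixed-length character windows, which gives the same keep/remove decision but only scales to small inputs. The function name `dedup_by_shared_span` and its `span` parameter are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def dedup_by_shared_span(docs, span=100):
    """Return the ids of documents that share no `span`-character substring
    with any other document.

    Two documents overlap on a span of at least `span` characters exactly
    when they have one `span`-character window in common, so indexing every
    window is enough. Assumption: this hash-based index stands in for the
    suffix array mentioned in the description.
    """
    owners = defaultdict(set)  # window text -> ids of documents containing it
    for doc_id, text in docs.items():
        for start in range(len(text) - span + 1):
            owners[text[start:start + span]].add(doc_id)

    # Every document that takes part in a shared window is removed,
    # so both sides of a duplicate pair are dropped.
    flagged = set()
    for ids in owners.values():
        if len(ids) > 1:
            flagged |= ids

    return [doc_id for doc_id in docs if doc_id not in flagged]


# Toy corpus with a short span so the overlap is visible by eye.
docs = {
    "a": "the quick brown fox",
    "b": "a quick brown dog",
    "c": "nothing in common",
}
print(dedup_by_shared_span(docs, span=8))  # prints ['c']
```

Documents shorter than `span` contribute no windows and are always kept in this sketch; whether the original pipeline treats short documents the same way is not stated in the description.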

Provider:

datablations

Summary of original information

Dataset overview

Dataset features

id: integer (int64)
text: string
meta: struct with the following sub-features:
- warc_headers: struct with the following sub-features:
  - warc-record-id: string
  - warc-date: string
  - content-type: string
  - content-length: integer (int32)
  - warc-type: string
  - warc-identified-content-language: string
  - warc-refers-to: string
  - warc-target-uri: string
  - warc-block-digest: string
- identification: struct with the following sub-features:
  - label: string
  - prob: floating point (float32)
- annotations: sequence of strings
- line_identifications: list with the following sub-features:
  - label: string
  - prob: floating point (float32)
perplexity_score: floating point (float64)
text_length: integer (int64)
url: string
domain: string
dup_ratio: floating point (float64)
pairs: sequence of integer sequences (sequence: sequence: int64)
repetitions: sequence of binary values (sequence: binary)
included_in_dedup: boolean (bool)
cluster: sequence of integers (sequence: int64)
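To make the top-level layout concrete, here is a minimal sketch that checks a single record against the field names and types listed above. It is an illustration only: the helper `schema_problems`, the mapping `TOP_LEVEL_TYPES`, and every value in `sample_record` are hypothetical and not taken from the dataset. The Python types are an assumed mapping from the declared dtypes (for example `sequence` to `list`), and the nested `meta` struct is only checked for being a dictionary.

```python
# Assumed mapping from the declared dtypes to plain Python types.
TOP_LEVEL_TYPES = {
    "id": int,
    "text": str,
    "meta": dict,
    "perplexity_score": float,
    "text_length": int,
    "url": str,
    "domain": str,
    "dup_ratio": float,
    "pairs": list,
    "repetitions": list,
    "included_in_dedup": bool,
    "cluster": list,
}


def schema_problems(record):
    """Return a list of human-readable problems; an empty list means the
    record has every top-level field with the expected Python type."""
    problems = []
    for name, expected in TOP_LEVEL_TYPES.items():
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        # bool is a subclass of int in Python, so reject it for int fields.
        wrong_bool = isinstance(value, bool) and expected is not bool
        if wrong_bool or not isinstance(value, expected):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems


# Entirely made-up record, used only to exercise the check.
sample_record = {
    "id": 0,
    "text": "placeholder text",
    "meta": {"warc_headers": {}, "identification": {}, "annotations": [],
             "line_identifications": []},
    "perplexity_score": 0.0,
    "text_length": 16,
    "url": "https://example.com/page",
    "domain": "example.com",
    "dup_ratio": 0.0,
    "pairs": [],
    "repetitions": [],
    "included_in_dedup": False,
    "cluster": [],
}
print(schema_problems(sample_record))  # prints []
```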

Dataset splits

train: training set
- Bytes: 3188486875748
- Examples: 431992659

Dataset size

Download size: 419397499659 bytes
Dataset size: 3188486875748 bytes
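The raw byte counts are easier to read after a little arithmetic. The sketch below derives the average record size and the ratio between the prepared dataset size and the download size from the three figures above. Reading that ratio as an on-disk compression factor is an assumption, since the page does not say how the downloaded files are stored; the variable names are chosen for illustration.

```python
# Figures copied from the split and size metadata above.
NUM_BYTES = 3_188_486_875_748      # dataset_size / num_bytes of the train split
NUM_EXAMPLES = 431_992_659         # num_examples of the train split
DOWNLOAD_SIZE = 419_397_499_659    # download_size

# Average size of one record, all fields included.
avg_bytes_per_example = NUM_BYTES / NUM_EXAMPLES

# Prepared size relative to download size (assumed to reflect compression).
size_ratio = NUM_BYTES / DOWNLOAD_SIZE

print(f"dataset size:  {NUM_BYTES / 1e12:.2f} TB")
print(f"download size: {DOWNLOAD_SIZE / 1e9:.1f} GB")
print(f"average bytes per example: {avg_bytes_per_example:.0f}")
print(f"dataset size / download size: {size_ratio:.2f}")
```

Decimal units are used here (1 TB = 10^12 bytes); binary units would give slightly smaller numbers.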
