musabg/commoncrawl-tr
Hugging Face | Updated 2023-05-09 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/musabg/commoncrawl-tr
Resource description:
---
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: text
    dtype: string
  - name: meta
    struct:
    - name: warc_headers
      struct:
      - name: warc-record-id
        dtype: string
      - name: warc-date
        dtype: string
      - name: content-type
        dtype: string
      - name: content-length
        dtype: int32
      - name: warc-type
        dtype: string
      - name: warc-identified-content-language
        dtype: string
      - name: warc-refers-to
        dtype: string
      - name: warc-target-uri
        dtype: string
      - name: warc-block-digest
        dtype: string
    - name: identification
      struct:
      - name: label
        dtype: string
      - name: prob
        dtype: float32
    - name: harmful_pp
      dtype: float32
    - name: tlsh
      dtype: string
    - name: quality_warnings
      sequence: string
    - name: categories
      sequence: string
    - name: sentence_identifications
      list:
      - name: label
        dtype: string
      - name: prob
        dtype: float32
  splits:
  - name: train
    num_bytes: 85952224217
    num_examples: 13327165
  download_size: 46952332972
  dataset_size: 85952224217
---
# Dataset Card for "commoncrawl-tr"
[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
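
The schema above can be read as a nested record: `id` and `text` at the top level, with everything else under `meta`. As a minimal sketch, assuming each example arrives as a plain Python dict shaped like that schema, the snippet below filters records by the document-level language identification. The helper names (`keep_record`, `filter_records`), the `"tr"` label value, the `0.8` threshold, and the sample record are all illustrative assumptions and are not taken from the dataset card.

```python
from typing import Any, Dict, Iterable, Iterator

Record = Dict[str, Any]


def keep_record(record: Record, lang: str = "tr", min_prob: float = 0.8) -> bool:
    """Return True if the record looks like confidently identified `lang` text.

    Assumptions (not stated in the card): `identification.label` holds a
    language code such as "tr", and an empty or missing `quality_warnings`
    means no warnings were raised.
    """
    meta = record.get("meta") or {}
    ident = meta.get("identification") or {}
    if ident.get("label") != lang:
        return False
    if float(ident.get("prob") or 0.0) < min_prob:
        return False
    # Drop anything that carries at least one quality warning.
    return not meta.get("quality_warnings")


def filter_records(records: Iterable[Record], **kwargs: Any) -> Iterator[Record]:
    """Lazily yield only the records accepted by `keep_record`."""
    return (r for r in records if keep_record(r, **kwargs))


# A made-up record that follows the field names and types of the schema.
SAMPLE_RECORD: Record = {
    "id": 0,
    "text": "Merhaba dünya.",
    "meta": {
        "warc_headers": {
            "warc-record-id": "<urn:uuid:00000000-0000-0000-0000-000000000000>",
            "warc-date": "2023-01-01T00:00:00Z",
            "content-type": "text/plain",
            "content-length": 15,
            "warc-type": "conversion",
            "warc-identified-content-language": "tur",
            "warc-refers-to": "<urn:uuid:00000000-0000-0000-0000-000000000001>",
            "warc-target-uri": "https://example.com/",
            "warc-block-digest": "sha1:EXAMPLE",
        },
        "identification": {"label": "tr", "prob": 0.97},
        "harmful_pp": 250.0,
        "tlsh": None,
        "quality_warnings": None,
        "categories": None,
        "sentence_identifications": [{"label": "tr", "prob": 0.97}],
    },
}

if __name__ == "__main__":
    kept = list(filter_records([SAMPLE_RECORD]))
    print(f"kept {len(kept)} record(s)")
```

With the Hugging Face `datasets` library installed, the same filter could presumably be applied to the real data by iterating over `load_dataset("musabg/commoncrawl-tr", split="train", streaming=True)`, which avoids downloading the full archive up front.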
Provider:
musabg
Summary of original information
Dataset overview
Dataset information
- Feature list:
  - id: int64
  - text: string
  - meta: struct with the following fields:
    - warc_headers: struct with the following fields:
      - warc-record-id: string
      - warc-date: string
      - content-type: string
      - content-length: int32
      - warc-type: string
      - warc-identified-content-language: string
      - warc-refers-to: string
      - warc-target-uri: string
      - warc-block-digest: string
    - identification: struct with the following fields:
      - label: string
      - prob: float32
    - harmful_pp: float32
    - tlsh: string
    - quality_warnings: sequence of string
    - categories: sequence of string
    - sentence_identifications: list of structs with the following fields:
      - label: string
      - prob: float32
Dataset splits
- Training set:
  - train: 13,327,165 examples, 85,952,224,217 bytes in total
Dataset size
- Download size: 46,952,332,972 bytes
- Dataset size: 85,952,224,217 bytes
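
To put the raw byte counts in perspective, the short sketch below derives a few convenience figures from them: average bytes per example, the ratio of prepared size to download size, and both sizes in GiB. The three constants are copied from the card; the function name `summarize_sizes` and the choice of binary gigabytes are illustrative assumptions.

```python
from typing import Dict

# Figures copied from the dataset card.
NUM_EXAMPLES = 13_327_165
DATASET_BYTES = 85_952_224_217
DOWNLOAD_BYTES = 46_952_332_972

GIB = 2**30  # binary gigabyte; an arbitrary choice of unit for this sketch


def summarize_sizes(
    num_examples: int, dataset_bytes: int, download_bytes: int
) -> Dict[str, float]:
    """Derive per-example and ratio statistics from the card's size fields."""
    return {
        "avg_bytes_per_example": dataset_bytes / num_examples,
        "dataset_to_download_ratio": dataset_bytes / download_bytes,
        "dataset_gib": dataset_bytes / GIB,
        "download_gib": download_bytes / GIB,
    }


if __name__ == "__main__":
    stats = summarize_sizes(NUM_EXAMPLES, DATASET_BYTES, DOWNLOAD_BYTES)
    for name, value in stats.items():
        print(f"{name}: {value:.2f}")
```

In round numbers this works out to roughly 6.4 KB per example, with about 80 GiB of prepared data behind a download of about 43.7 GiB.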



