dim/habr_10k
Source: Hugging Face. Updated 2023-09-24; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/dim/habr_10k
Resource description:
---
dataset_info:
  features:
  - name: id
    dtype: uint32
  - name: language
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text_markdown
    dtype: string
  - name: text_html
    dtype: string
  - name: author
    dtype: string
  - name: original_author
    dtype: string
  - name: original_url
    dtype: string
  - name: lead_html
    dtype: string
  - name: lead_markdown
    dtype: string
  - name: type
    dtype: string
  - name: time_published
    dtype: uint64
  - name: statistics
    struct:
    - name: commentsCount
      dtype: uint32
    - name: favoritesCount
      dtype: uint32
    - name: readingCount
      dtype: uint32
    - name: score
      dtype: int32
    - name: votesCount
      dtype: int32
    - name: votesCountPlus
      dtype: int32
    - name: votesCountMinus
      dtype: int32
  - name: labels
    sequence: string
  - name: hubs
    sequence: string
  - name: flows
    sequence: string
  - name: tags
    sequence: string
  - name: reading_time
    dtype: uint32
  - name: format
    dtype: string
  - name: complexity
    dtype: string
  - name: comments
    sequence:
    - name: id
      dtype: uint64
    - name: parent_id
      dtype: uint64
    - name: level
      dtype: uint32
    - name: time_published
      dtype: uint64
    - name: score
      dtype: int32
    - name: votes
      dtype: uint32
    - name: message_html
      dtype: string
    - name: message_markdown
      dtype: string
    - name: author
      dtype: string
    - name: children
      sequence: uint64
  - name: readingCount
    dtype: int64
  splits:
  - name: train
    num_bytes: 661170132.0315578
    num_examples: 10000
  download_size: 901387901
  dataset_size: 661170132.0315578
---
# Dataset Card for "habr_10k"
[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
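
Since the card itself gives no usage notes, here is a minimal sketch of how the train split might be loaded with the `datasets` library. The helper name `load_habr_10k` and its `use_mirror` parameter are hypothetical, introduced only for illustration. The mirror option assumes that `huggingface_hub` picks up the `HF_ENDPOINT` environment variable when it is first imported, which is how the hf-mirror.com host in the download link is commonly used.

```python
import os


def load_habr_10k(use_mirror: bool = False):
    """Load the train split of dim/habr_10k (hypothetical helper).

    The download is roughly 900 MB according to the metadata above.
    """
    if use_mirror:
        # Assumption: the endpoint must be set before `datasets` and
        # `huggingface_hub` are imported for the first time.
        os.environ.setdefault("HF_ENDPOINT", "https://hf-mirror.com")
    # Imported lazily so that the endpoint setting above can take effect.
    from datasets import load_dataset

    return load_dataset("dim/habr_10k", split="train")
```

Calling `ds = load_habr_10k()` should return a dataset with 10000 rows, after which `ds[0]["title"]` and `ds[0]["statistics"]["score"]` would read single fields from the first record.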
Provider:
dim
Summary of original information
Dataset overview
Dataset information
- Dataset name: habr_10k
- Download size: 901387901 bytes
- Dataset size: 661170132.0315578 bytes
Dataset features
- id: uint32
- language: string
- url: string
- title: string
- text_markdown: string
- text_html: string
- author: string
- original_author: string
- original_url: string
- lead_html: string
- lead_markdown: string
- type: string
- time_published: uint64
- statistics: struct with the following fields:
  - commentsCount: uint32
  - favoritesCount: uint32
  - readingCount: uint32
  - score: int32
  - votesCount: int32
  - votesCountPlus: int32
  - votesCountMinus: int32
- labels: sequence of string
- hubs: sequence of string
- flows: sequence of string
- tags: sequence of string
- reading_time: uint32
- format: string
- complexity: string
- comments: sequence with the following fields:
  - id: uint64
  - parent_id: uint64
  - level: uint32
  - time_published: uint64
  - score: int32
  - votes: uint32
  - message_html: string
  - message_markdown: string
  - author: string
  - children: sequence of uint64
- readingCount: int64
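
The `comments` feature is a sequence of named sub-fields. In many versions of the `datasets` library such a feature is returned as a dictionary of parallel lists (one list per sub-field) rather than a list of per-comment dictionaries; this is worth verifying against the installed version. The sketch below shows one way to transpose that layout. The helper `comments_to_records` is hypothetical and the sample values are made up, not taken from the dataset.

```python
from typing import Any


def comments_to_records(comments: dict[str, list[Any]]) -> list[dict[str, Any]]:
    """Transpose a dict of equal-length lists into a list of per-comment dicts."""
    keys = list(comments)
    if not keys:
        return []
    length = len(comments[keys[0]])
    # Reject ragged columns explicitly instead of truncating silently.
    for key in keys:
        if len(comments[key]) != length:
            raise ValueError(
                f"column {key!r} has length {len(comments[key])}, expected {length}"
            )
    return [{key: comments[key][i] for key in keys} for i in range(length)]


# Synthetic example: field names follow the feature list above,
# values (including parent_id 0 for a root comment) are invented.
sample_comments = {
    "id": [101, 102, 103],
    "parent_id": [0, 101, 101],
    "level": [0, 1, 1],
    "score": [5, -1, 2],
    "author": ["alice", "bob", "carol"],
    "children": [[102, 103], [], []],
}

records = comments_to_records(sample_comments)
# Select comments at nesting level 0.
top_level = [c for c in records if c["level"] == 0]
```

With per-comment dictionaries, filtering by `level` or following the ids in `children` becomes ordinary list processing.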
Dataset splits
- train: 10000 examples, 661170132.0315578 bytes
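
The size figures above can be put in more readable terms with a little arithmetic. The sketch below only uses the three numbers reported in the metadata; the variable names and the `to_mib` helper are illustrative.

```python
# Figures copied from the dataset metadata above.
num_bytes = 661170132.0315578
num_examples = 10000
download_size = 901387901


def to_mib(n_bytes: float) -> float:
    """Convert a byte count to mebibytes (1 MiB = 2**20 bytes)."""
    return n_bytes / 2**20


# Average in-memory size of one article with its comments.
avg_bytes = num_bytes / num_examples

print(f"average example size: {avg_bytes / 1024:.1f} KiB")
print(f"dataset size:         {to_mib(num_bytes):.1f} MiB")
print(f"download size:        {to_mib(download_size):.1f} MiB")
```

This works out to roughly 65 KiB per example. The reported download size is larger than the reported dataset size, so disk space should be planned around the download figure.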



