IlyaGusev/habr
Dataset overview
Dataset name
Habr dataset
Dataset features
- id: uint32
- language: string
- url: string
- title: string
- text_markdown: string
- text_html: string
- author: string
- original_author: string
- original_url: string
- lead_html: string
- lead_markdown: string
- type: string
- time_published: uint64
- statistics: struct
  - commentsCount: uint32
  - favoritesCount: uint32
  - readingCount: uint32
  - score: int32
  - votesCount: int32
  - votesCountPlus: int32
  - votesCountMinus: int32
- labels: sequence: string
- hubs: sequence: string
- flows: sequence: string
- tags: sequence: string
- reading_time: uint32
- format: string
- complexity: string
- comments: sequence
  - id: uint64
  - parent_id: uint64
  - level: uint32
  - time_published: uint64
  - score: int32
  - votes: uint32
  - message_html: string
  - message_markdown: string
  - author: string
  - children: sequence: uint64
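Both `time_published` fields are stored as unsigned integers. The following is a minimal sketch of turning such a value into a timezone-aware datetime. It assumes the values are Unix timestamps in seconds, which the card does not state explicitly; the helper name `to_datetime` is hypothetical.

```python
from datetime import datetime, timezone


def to_datetime(time_published: int) -> datetime:
    """Interpret a time_published value as Unix seconds in UTC (an assumption)."""
    return datetime.fromtimestamp(time_published, tz=timezone.utc)


# Value taken from the "time_published" field of the sample record in this card.
published = to_datetime(1185962380)
print(published.isoformat())
```

If the values turned out to be in milliseconds instead, the resulting dates would land far in the future, which makes the assumption easy to check against a few records.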
Dataset size
- Download size: 3485570346 bytes
- Dataset size: 19968161329 bytes
- Train split size: 19968161329 bytes, 302049 examples
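To put these figures in perspective, the short sketch below derives the average record size and the ratio between the dataset size and the download size. It assumes all three size figures are byte counts; the constant and function names are illustrative only.

```python
# Figures copied from the size list above; assumed to be byte counts.
DOWNLOAD_BYTES = 3_485_570_346
DATASET_BYTES = 19_968_161_329
NUM_EXAMPLES = 302_049


def to_gib(n_bytes: int) -> float:
    """Convert a byte count to gibibytes (2**30 bytes)."""
    return n_bytes / 2**30


avg_bytes_per_example = DATASET_BYTES / NUM_EXAMPLES
expansion_ratio = DATASET_BYTES / DOWNLOAD_BYTES

print(f"download size:   {to_gib(DOWNLOAD_BYTES):.2f} GiB")
print(f"dataset size:    {to_gib(DATASET_BYTES):.2f} GiB")
print(f"average example: {avg_bytes_per_example / 1024:.1f} KiB")
print(f"dataset/download ratio: {expansion_ratio:.2f}")
```

The ratio gives a rough idea of how much disk space to reserve beyond the download itself.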
Languages
- Russian (ru)
- English (en)
Task categories
- Text generation
Data instances
```json
{
  "id": 12730,
  "language": "ru",
  "url": "https://habr.com/ru/post/12730/",
  "text_markdown": "...",
  "text_html": "...",
  "lead_markdown": "...",
  "lead_html": "...",
  "type": "article",
  "labels": [],
  "original_author": null,
  "original_url": null,
  "time_published": 1185962380,
  "author": "...",
  "title": "Хочешь в университет — сделай презентацию",
  "statistics": {
    "commentsCount": 23,
    "favoritesCount": 1,
    "readingCount": 1542,
    "score": 7,
    "votesCount": 15,
    "votesCountPlus": 11,
    "votesCountMinus": 4
  },
  "hubs": ["itcompanies"],
  "flows": ["popsci"],
  "tags": ["PowerPoint", "презентация", "абитуриенты"],
  "reading_time": 1,
  "format": null,
  "complexity": null,
  "comments": {
    "id": [11653537, 11653541],
    "parent_id": [null, 11653537],
    "level": [0, 1],
    "time_published": [1185963192, 1185967886],
    "score": [-1, 0],
    "votes": [1, 0],
    "message_html": ["...", "..."],
    "author": ["...", "..."],
    "children": [[11653541], []]
  }
}
```
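In the sample record, `comments` is stored column-wise: one list per field, aligned by position. A sketch of converting it into per-comment records and then into a nested reply thread could look like the following. The function names `comments_to_records` and `build_thread` are hypothetical, and the code assumes that top-level comments have a `parent_id` of `None`, as in the sample.

```python
from typing import Any


def comments_to_records(comments: dict[str, list[Any]]) -> list[dict[str, Any]]:
    """Turn the column-wise comments struct into one dict per comment."""
    count = len(comments["id"])
    return [{key: values[i] for key, values in comments.items()} for i in range(count)]


def build_thread(comments: dict[str, list[Any]]) -> list[dict[str, Any]]:
    """Nest replies under their parents by following the `children` id lists."""
    records = {record["id"]: record for record in comments_to_records(comments)}

    def expand(comment_id: int) -> dict[str, Any]:
        node = dict(records[comment_id])
        # Skip child ids that have no matching record.
        node["children"] = [expand(c) for c in node["children"] if c in records]
        return node

    # Assumption: root comments are the ones whose parent_id is None.
    return [expand(r["id"]) for r in records.values() if r["parent_id"] is None]


# The comments struct from the sample record above.
sample_comments = {
    "id": [11653537, 11653541],
    "parent_id": [None, 11653537],
    "level": [0, 1],
    "time_published": [1185963192, 1185967886],
    "score": [-1, 0],
    "votes": [1, 0],
    "message_html": ["...", "..."],
    "author": ["...", "..."],
    "children": [[11653541], []],
}

thread = build_thread(sample_comments)
print(thread[0]["id"], [child["id"] for child in thread[0]["children"]])
```

The recursive expansion keeps the sketch short; for very deep threads an iterative traversal would avoid hitting Python's recursion limit.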
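Data source
- The data comes from the Habr website.
Personal and sensitive information
- The dataset is not anonymized, so it may contain individuals' names. Information about the original author is included wherever possible.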



