TheFinAI/hacker-news
Source: Hugging Face. Updated 2026-04-10; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/TheFinAI/hacker-news
Resource description:
# Hacker News Dataset
## Dataset Summary
This dataset is derived from the official Hacker News data provided via the Hacker News Firebase API. It contains user-generated content including stories, comments, and metadata from the Hacker News platform.
Hacker News is a social news website focusing on computer science, entrepreneurship, and technology. The dataset captures real-world discussions, technical conversations, and community interactions over time.
The data was programmatically collected using the official API endpoints and processed into structured JSONL format suitable for large-scale language modeling and analysis.
---
## Dataset Structure
Each sample in the dataset is stored as a JSON object with the following fields:
- **Source**: Dataset name (e.g., "HackerNews")
- **Date**: Year extracted from the timestamp
- **Text**: Main textual content (e.g., comment or story text)
- **Token_count**: Number of tokens computed using `tiktoken` (cl100k_base)
### Example
```json
{
  "Source": "HackerNews",
  "Date": 2021,
  "Text": "This is an example comment from Hacker News.",
  "Token_count": 18
}
```
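As a rough illustration of the schema above, the following sketch parses one JSONL line and checks that the four documented fields are present with the expected types. The helper name `parse_record` and the `EXPECTED_FIELDS` mapping are hypothetical and not part of the dataset's tooling; the types are inferred from the example record.

```python
import json

# Field names come from the dataset card; the Python types are an
# assumption inferred from the example record.
EXPECTED_FIELDS = {
    "Source": str,
    "Date": int,
    "Text": str,
    "Token_count": int,
}


def parse_record(line: str) -> dict:
    """Parse one JSONL line and verify it matches the documented fields."""
    record = json.loads(line)
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            raise ValueError(f"missing field: {name}")
        if not isinstance(record[name], expected_type):
            raise ValueError(f"field {name} is not {expected_type.__name__}")
    return record


# The example record from the card, written as a single JSONL line.
line = (
    '{"Source": "HackerNews", "Date": 2021, '
    '"Text": "This is an example comment from Hacker News.", '
    '"Token_count": 18}'
)
record = parse_record(line)
```

When reading a full shard, the same function could be applied to each non-empty line of the file.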
## Data Collection
Data was collected using the official Hacker News API: https://github.com/HackerNews/API
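For readers unfamiliar with the API, here is a minimal sketch of fetching a single item. It assumes the public item endpoint documented in the API repository (`https://hacker-news.firebaseio.com/v0/item/<id>.json`); the function names `item_url` and `fetch_item` are hypothetical, and this is not the collection script the dataset authors used.

```python
import json
from urllib.request import urlopen

# Base URL of the public Hacker News Firebase API (v0).
BASE_URL = "https://hacker-news.firebaseio.com/v0"


def item_url(item_id: int) -> str:
    """Build the URL for one item (story, comment, job, poll, ...)."""
    return f"{BASE_URL}/item/{item_id}.json"


def fetch_item(item_id: int, opener=urlopen):
    """Fetch one item and decode it from JSON.

    `opener` is injectable so the function can be exercised without
    network access. The API answers `null` for unknown ids, in which
    case this returns None.
    """
    with opener(item_url(item_id)) as response:
        return json.load(response)
```

Calling `fetch_item(8863)` would issue one HTTP request; a real collector would also need rate limiting, retries, and a way to enumerate item ids, none of which the card describes.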
## Data Processing
The following preprocessing steps were applied:
- Removed empty or null text entries
- Converted timestamps to year format
- Tokenized text using cl100k_base
- Split into multiple JSONL shards for efficient storage
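The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: the input is assumed to be raw API items with `time` (Unix seconds) and `text` keys, the year is taken in UTC, and the token counter is a whitespace stand-in so the example runs without third-party packages. All function names are hypothetical.

```python
from datetime import datetime, timezone
from typing import Callable, Iterable, Iterator, List, Optional


def whitespace_token_count(text: str) -> int:
    # Stand-in counter (assumption). To mirror the card, one could use
    # tiktoken instead:
    #   enc = tiktoken.get_encoding("cl100k_base")
    #   count = len(enc.encode(text))
    return len(text.split())


def to_record(
    item: dict,
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> Optional[dict]:
    """Convert one raw item into the dataset's record shape."""
    text = item.get("text")
    # Step 1: drop empty or null text entries.
    if not text or not text.strip():
        return None
    # Step 2: reduce the Unix timestamp to a year (UTC is an assumption).
    year = datetime.fromtimestamp(item["time"], tz=timezone.utc).year
    # Step 3: attach the token count.
    return {
        "Source": "HackerNews",
        "Date": year,
        "Text": text,
        "Token_count": count_tokens(text),
    }


def shard(records: Iterable, shard_size: int) -> Iterator[List]:
    """Step 4: group records into fixed-size batches, one per JSONL shard."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == shard_size:
            yield batch
            batch = []
    if batch:
        yield batch
```

Each batch yielded by `shard` could then be written to its own file, one `json.dumps(record)` per line. The shard size is a free parameter here; the card does not state the value used.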
Provided by:
TheFinAI



