brusic/hacker-news-who-is-hiring-posts
Hugging Face. Updated 2024-07-03; indexed 2024-06-11.
Download link:
https://hf-mirror.com/datasets/brusic/hacker-news-who-is-hiring-posts
Description:
This dataset contains all first-level comments on Hacker News Who Is Hiring posts since April 2011. All data comes from the official Firebase API, and no data cleansing has been applied. Each row represents a single month and holds the month, the parent submission ID, and the list of comments for that month. The data files are available in several formats: compressed parquet, feather, pickle, or uncompressed pickle. The README also provides a Python example that converts the dataframe to one row per comment.
Provider:
brusic
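Summary of original information
Dataset overview
Dataset contents
- Time range: April 2011 to March 2024
- Data source: Hacker News Who Is Hiring posts, retrieved through the official Firebase API
- Data format: pickle
- Exclusions: "Who wants to be hired?" and "Freelancer" posts are not included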
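
To make "first-level comments from the Firebase API" concrete, here is a minimal sketch of how the direct replies to one submission could be retrieved. It assumes the public Hacker News item endpoint and its `kids` field; the helper names `fetch_item` and `first_level_comments` are hypothetical, and this is not necessarily how the dataset author collected the data.

```python
import json
from urllib.request import urlopen

# Public Hacker News API item endpoint (served from Firebase).
API_ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{item_id}.json"


def fetch_item(item_id):
    """Fetch one Hacker News item (story or comment) as a dict, or None."""
    with urlopen(API_ITEM_URL.format(item_id=item_id)) as response:
        return json.load(response)


def first_level_comments(parent_id, fetch=fetch_item):
    """Return the direct replies to a submission (hypothetical helper)."""
    parent = fetch(parent_id) or {}
    comments = []
    # "kids" lists the IDs of the item's direct replies.
    for kid_id in parent.get("kids", []):
        item = fetch(kid_id)
        if item:  # skip IDs for which the API returns null
            comments.append(item)
    return comments
```

The `fetch` parameter exists only so the function can be exercised without network access, for example by passing `some_dict.get`.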
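Data structure
- Each row contains:
  - month: the month, in mmmm yyyy format (for example, March 2024)
  - parent_id: the submission ID
  - comments: the list of comments for that month
- Organization: by month, with one row per month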
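
The row layout described above can be illustrated with a small hand-built dataframe. The IDs, user names, and comment text below are made up for illustration, and the derived columns `month_start` and `n_comments` are not part of the dataset. The sketch assumes month labels are English month names, as in "March 2024".

```python
import pandas as pd

# Two made-up rows in the documented shape: month, parent_id, comments.
hiring_df = pd.DataFrame(
    {
        "month": ["March 2024", "February 2024"],
        "parent_id": [1002, 1001],
        "comments": [
            [
                {"id": 11, "by": "alice", "text": "Example Co | Remote"},
                {"id": 12, "by": "bob", "text": "Sample Inc | Onsite"},
            ],
            [{"id": 21, "by": "carol", "text": "Demo LLC | Hybrid"}],
        ],
    }
)

# Parse "March 2024"-style labels into timestamps so rows sort chronologically.
hiring_df["month_start"] = pd.to_datetime(hiring_df["month"], format="%B %Y")
# Count the comments held in each month's list.
hiring_df["n_comments"] = hiring_df["comments"].map(len)

print(hiring_df.sort_values("month_start")[["month", "n_comments"]])
```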
Data formats
- Available formats: compressed parquet, feather, pickle, or uncompressed pickle
- Conversion: Python code can convert the dataframe to one row per comment
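
Since the same dataframe is published in several formats, a small loader that picks a pandas reader from the file suffix can be handy. This is a sketch: the helper `load_hiring` is hypothetical and the suffixes in the mapping are assumptions, so check the repository for the actual file names. Only unpickle files from sources you trust, because loading a pickle can execute arbitrary code.

```python
from pathlib import Path

import pandas as pd

# Map file suffixes to pandas readers. The suffixes are assumptions.
READERS = {
    ".parquet": pd.read_parquet,  # needs pyarrow or fastparquet installed
    ".feather": pd.read_feather,  # needs pyarrow installed
    ".pkl": pd.read_pickle,
    ".pickle": pd.read_pickle,
}


def load_hiring(path):
    """Load the month-per-row dataframe from any of the listed formats."""
    # Check every suffix so that names like "hiring.pkl.gz" still match;
    # read_pickle infers the compression from the final suffix by default.
    for suffix in Path(path).suffixes:
        reader = READERS.get(suffix.lower())
        if reader is not None:
            return reader(path)
    raise ValueError(f"Unsupported file type: {path!r}")
```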
Sample data
| Index | month | parent_id | comments |
|---|---|---|---|
| 0 | March 2024 | 39562986 | [{id: 39563104, by: jnathsf, text: Ci... |
| 1 | February 2024 | 39217310 | [{id: 39375047, by: lpnoel1, text: Di... |
| 2 | January 2024 | 38842977 | [{id: 38844766, by: pudo, text: OpenS... |
| ... | ... | ... | ... |
| 159 | June 2011 | 2607052 | [{id: 2607280, by: yummyfajitas, text:... |
| 160 | May 2011 | 2503204 | [{id: 2504067, by: jfarmer, text: Eve... |
| 161 | April 2011 | 2396027 | [{id: 2396144, by: pchristensen, text:... |
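Data conversion example
```python
import pandas as pd

# Load the month-per-row dataframe
hiring_df = pd.read_parquet("hiring.parquet")
# One row per comment; rows with missing values are dropped
exploded_df = hiring_df.explode("comments").dropna().reset_index(drop=True).rename(columns={"comments": "comment"})
# Expand each comment dict into its own columns (id, by, text, ...)
comments_df = exploded_df.join(pd.DataFrame(exploded_df["comment"].tolist())).drop("comment", axis=1)
```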
| Index | month | parent_id | id | by | text |
|---|---|---|---|---|---|
| 0 | March 2024 | 39562986 | 39563104 | jnathsf | City Innovate |
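
To see what the conversion does without downloading anything, the same explode-and-join steps can be run on a tiny synthetic dataframe. All values below are made up, and real comments may carry more fields than the three shown. The empty list shows that a month with no comments becomes NaN after `explode` and is then removed by `dropna`.

```python
import pandas as pd

hiring_df = pd.DataFrame(
    {
        "month": ["March 2024", "February 2024", "January 2024"],
        "parent_id": [1003, 1002, 1001],
        "comments": [
            [
                {"id": 31, "by": "alice", "text": "Example Co"},
                {"id": 32, "by": "bob", "text": "Sample Inc"},
            ],
            [],  # a month with no comments explodes to NaN and is dropped
            [{"id": 11, "by": "carol", "text": "Demo LLC"}],
        ],
    }
)

# Same steps as the conversion example, split over several lines.
exploded_df = (
    hiring_df.explode("comments")
    .dropna()
    .reset_index(drop=True)  # a fresh 0..n-1 index so the join aligns row by row
    .rename(columns={"comments": "comment"})
)
comments_df = exploded_df.join(
    pd.DataFrame(exploded_df["comment"].tolist())
).drop("comment", axis=1)

print(comments_df)
```

The `reset_index(drop=True)` step matters: `explode` repeats the original index for every comment, and the join with the expanded columns relies on both frames sharing a clean positional index.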



