brusic/hacker-news-who-is-hiring-posts

Name: brusic/hacker-news-who-is-hiring-posts
Creator: brusic
Published: 2024-07-03 19:13:55
License: not specified

Source: Hugging Face. Updated 2024-07-03; indexed 2024-06-11.

Download link:

https://hf-mirror.com/datasets/brusic/hacker-news-who-is-hiring-posts

Description:

This dataset contains all first-level comments on the Hacker News Who Is Hiring posts since April 2011. All data comes from the official Firebase API, and no data cleansing has been applied. Each row holds the data for a single month: the month, the parent submission ID, and the list of comments for that month. The data is available as compressed parquet, feather, or pickle files, or as an uncompressed pickle file. The README also provides a Python example showing how to convert the dataframe to a one-row-per-comment format.

Provider:

brusic

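Summary of original information

Dataset overview

Dataset contents

Time range: April 2011 to March 2024
Data source: Hacker News Who Is Hiring posts, retrieved through the official Firebase API
Data format: pickle
Excluded: "Who wants to be hired?" and "Freelancer" posts are not included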
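
The retrieval described above could look roughly like the following. This is a minimal sketch, not the dataset author's code: it assumes the public Hacker News Firebase endpoint `https://hacker-news.firebaseio.com/v0/item/<id>.json`, where a submission's `kids` field lists the IDs of its first-level comments. The function names and the injectable `fetch` parameter are illustrative choices.

```python
import json
from urllib.request import urlopen

# Public item endpoint of the Hacker News Firebase API.
ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{}.json"


def fetch_item(item_id):
    """Fetch one item as a dict; the API returns null (None) for unknown IDs."""
    with urlopen(ITEM_URL.format(item_id), timeout=30) as resp:
        return json.load(resp)


def first_level_comments(parent_id, fetch=fetch_item):
    """Return the direct replies to a submission, skipping items that come back empty.

    `fetch` is injectable (an assumption made for this sketch) so the logic
    can be exercised without network access.
    """
    parent = fetch(parent_id) or {}
    comments = []
    for kid_id in parent.get("kids", []):
        item = fetch(kid_id)
        if item is not None:
            comments.append(item)
    return comments
```

Calling `first_level_comments(39562986)` would issue one request for the March 2024 thread plus one per comment, so a real run needs rate limiting and error handling that this sketch leaves out.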

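Data structure

Each row contains:
- month: the month, in mmmm yyyy format (for example, March 2024)
- parent_id: the ID of the parent submission
- comments: the list of comments for that month
Organization: by month, with one row per month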
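
Because the month labels are text, they do not sort chronologically as strings. A small sketch of one way to parse them, assuming the labels are always an English full month name followed by a four-digit year and that the process runs under an English or C locale (`%B` is locale-dependent); `parse_month` is a hypothetical helper name:

```python
from datetime import datetime


def parse_month(label):
    """Parse a label such as 'March 2024' into a datetime for the 1st of that month."""
    return datetime.strptime(label, "%B %Y")


labels = ["March 2024", "April 2011", "January 2024"]
ordered = sorted(labels, key=parse_month)
print(ordered)  # ['April 2011', 'January 2024', 'March 2024']
```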

Data formats

Available formats: compressed parquet, feather, or pickle, or uncompressed pickle
Conversion: the dataframe can be converted with Python code to one row per comment
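
Loading any of these formats with pandas might look like the sketch below. The file suffixes are assumptions about how the downloads are named (only `hiring.parquet` appears in the README example), and `load_hiring` is a hypothetical helper. Reading parquet or feather also requires an engine such as pyarrow to be installed.

```python
from pathlib import Path

import pandas as pd

# Map file suffixes to pandas readers. The suffixes are assumptions;
# adjust them to match the actual file names.
READERS = {
    ".parquet": pd.read_parquet,
    ".feather": pd.read_feather,
    ".pkl": pd.read_pickle,
    ".pickle": pd.read_pickle,
}


def load_hiring(path):
    """Load the monthly dataframe, choosing a reader from the file suffix."""
    path = Path(path)
    # Check every suffix so names like 'hiring.pkl.gz' still match '.pkl';
    # read_pickle infers the compression from the final extension.
    for suffix in path.suffixes:
        reader = READERS.get(suffix.lower())
        if reader is not None:
            return reader(path)
    raise ValueError(f"Unrecognised file type: {path.name}")
```

Pickle files can execute arbitrary code when loaded, so parquet or feather is the safer choice for data from a source you have not vetted.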

Sample data

index	month	parent_id	comments
0	March 2024	39562986	[{id: 39563104, by: jnathsf, text: Ci...
1	February 2024	39217310	[{id: 39375047, by: lpnoel1, text: Di...
2	January 2024	38842977	[{id: 38844766, by: pudo, text: OpenS...
...	...	...	...
159	June 2011	2607052	[{id: 2607280, by: yummyfajitas, text:...
160	May 2011	2503204	[{id: 2504067, by: jfarmer, text: Eve...
161	April 2011	2396027	[{id: 2396144, by: pchristensen, text:...

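Data conversion example

import pandas as pd

# Load the monthly dataframe (one row per month)
hiring_df = pd.read_parquet('hiring.parquet')

# Expand each month's comment list into one row per comment
exploded_df = hiring_df.explode('comments').dropna().reset_index(drop=True).rename(columns={'comments': 'comment'})

# Turn each comment dict into columns (id, by, text, ...) and drop the dict column
comments_df = exploded_df.join(pd.DataFrame(exploded_df['comment'].tolist())).drop('comment', axis=1)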

index	month	parent_id	id	by	text
0	March 2024	39562986	39563104	jnathsf	City Innovate
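
Once the data is in one-row-per-comment form, ordinary pandas operations apply. The sketch below runs the same conversion on a synthetic stand-in frame (the rows, IDs, and usernames are made up for illustration and are not taken from the dataset), then counts comments per month and filters by keyword:

```python
import pandas as pd

# Synthetic stand-in for the monthly frame; all values are invented.
hiring_df = pd.DataFrame({
    "month": ["March 2024", "February 2024"],
    "parent_id": [2, 1],
    "comments": [
        [{"id": 21, "by": "alice", "text": "Acme | Remote | Python"},
         {"id": 22, "by": "bob", "text": "Initech | Onsite | Java"}],
        [{"id": 11, "by": "carol", "text": "Globex | Remote | Rust"}],
    ],
})

# Same conversion as in the README example: one row per comment.
exploded_df = (
    hiring_df.explode("comments")
    .dropna()
    .reset_index(drop=True)
    .rename(columns={"comments": "comment"})
)
comments_df = exploded_df.join(
    pd.DataFrame(exploded_df["comment"].tolist())
).drop("comment", axis=1)

# Number of comments per month.
per_month = comments_df.groupby("month").size()

# Comments whose text mentions "remote", case-insensitively.
remote = comments_df[comments_df["text"].str.contains("remote", case=False, na=False)]
```

Since the source applies no data cleansing, real comment text contains HTML markup and some entries may lack a `text` field, which is why the filter passes `na=False`.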
