NetGene/reddit-it-labor-sentiment-2020-2026
Source: Hugging Face. Updated 2026-04-25; indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/NetGene/reddit-it-labor-sentiment-2020-2026
Resource description:
---
title: "Reddit IT Labor Market Sentiment (2020–2026)"
pretty_name: "IT Labor Market & AI Impact Sentiment"
license: mit
language:
- en
tags:
- labor-market
- reddit
- ai-impact
- tech-jobs
size_categories:
- 10M<n<100M
task_categories:
- text-classification
- token-classification
task_ids:
- sentiment-analysis
dataset_info:
  features:
  - name: created_utc
    dtype: int64
  - name: subreddit
    dtype: string
  - name: author_id
    dtype: string
  - name: text
    dtype: string
  - name: score
    dtype: int32
  splits:
  - name: train
    num_examples: 56458273
---
# Dataset
## Overview
Reddit IT Labor Market Sentiment (2020–2026)
### Thesis title
The impact of the artificial intelligence bubble on the job market for new IT specialists:
An analysis of the disconnect between recruitment requirements and attitudes
This dataset is a collection of Reddit posts and comments pulled from 32 different IT-focused subreddits.
It’s built for researchers looking at shifts in the labor market, skill inflation, and how people working in IT really feel, especially as AI investments ramp up.
### Ethics & Anonymization
Protecting user privacy matters, so the dataset follows strict guidelines:
- Every username is replaced with a salted SHA-256 hash, so individual identities do not appear in the dataset.
- Personal information is scrubbed from the text.
- This dataset is **NOT** for business or profit; it is intended only for ethical research.
#### Right to Erasure
If you spot any content in this dataset that can be traced back to you or your social media profile and want it removed, contact the repository owner, [NetGene](https://huggingface.co/NetGene), and include the specific post identifiers you want redacted.
You have the “Right to be Forgotten”, so if pseudonymization doesn’t feel secure enough, you can ask to have your data removed.
Your privacy matters, and ethical research is a priority here.
## Data Selection & Filtering Logic
The data comes from the Pushshift Reddit Archive.
The subreddits were sorted into three groups based on how useful they are for my model- and dashboard-building project, *[it-labor-decoupling-ai-cycle-impact](https://github.com/Net-Gene/it-labor-decoupling-ai-cycle-impact)*:
### 1. subreddits_sentimental
- Packed with personal sentimental stories, talking about:
  - Career struggles
  - Job hunts
  - Changes in the tech industry
- Included subreddits:
  - r/csMajors
  - r/ExperiencedDevs
  - r/careerchange
  - r/it
  - r/SecurityCareerAdvice
  - r/learnprogramming
### 2. subreddits_less_sentimental
- More technical, less emotional:
  - Troubleshooting
  - Certifications
  - The occasional sentiment buried in technical advice
- Included subreddits:
  - r/networking
  - r/ccna
  - r/CompTIA
  - r/AZURE
  - r/aws
  - r/netsec
### 3. subreddits_undecided
- Too much noise
- Big, busy communities
- Still figuring out whether the labor market signals are strong or just lost in all the noise
- Included subreddits:
  - r/programming
  - r/MachineLearning
  - r/datascience
  - r/AskProgramming
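
The three groups above can be written down as a small lookup table. This is only a sketch: the group names come from the headings above, it covers only the subreddits listed in this card, and the `group_of` helper is a hypothetical name, not part of the linked project.

```python
# Group names follow the section headings; only subreddits listed in this card appear.
SUBREDDIT_GROUPS = {
    "subreddits_sentimental": [
        "csMajors", "ExperiencedDevs", "careerchange",
        "it", "SecurityCareerAdvice", "learnprogramming",
    ],
    "subreddits_less_sentimental": [
        "networking", "ccna", "CompTIA", "AZURE", "aws", "netsec",
    ],
    "subreddits_undecided": [
        "programming", "MachineLearning", "datascience", "AskProgramming",
    ],
}

# Reverse index keyed by lowercase name so lookups ignore capitalization.
_GROUP_BY_SUBREDDIT = {
    name.lower(): group
    for group, names in SUBREDDIT_GROUPS.items()
    for name in names
}


def group_of(subreddit):
    """Return the group for a subreddit name, or None if it is not listed here."""
    name = subreddit[2:] if subreddit.startswith("r/") else subreddit
    return _GROUP_BY_SUBREDDIT.get(name.lower())


print(group_of("r/aws"))  # subreddits_less_sentimental
```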
## Technical Specs
- Format
  - Apache Parquet (compressed)
- Schema
  - `created_utc` (int64): Unix timestamp of the post or comment
  - `subreddit` (string): the subreddit it came from
  - `author_id` (string): hashed, pseudonymized author ID
  - `text` (string): the comment or post itself
  - `score` (int32): net score (upvotes minus downvotes)
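
As a quick illustration of the schema, the sketch below turns `created_utc` into a readable UTC date and a monthly bucket. It assumes the timestamp is in seconds, as in the Pushshift dumps; the example row and the helper names `created_at` and `month_key` are made up for illustration.

```python
from datetime import datetime, timezone

# Illustrative row following the schema above; the values are invented.
row = {
    "created_utc": 1577836800,
    "subreddit": "csMajors",
    "author_id": "0123456789abcdef",
    "text": "Example comment text.",
    "score": 3,
}


def created_at(record):
    """Convert the Unix timestamp (assumed to be in seconds) to an aware UTC datetime."""
    return datetime.fromtimestamp(record["created_utc"], tz=timezone.utc)


def month_key(record):
    """Return a 'YYYY-MM' bucket, handy for grouping rows by month."""
    return created_at(record).strftime("%Y-%m")


print(created_at(row).isoformat())  # 2020-01-01T00:00:00+00:00
print(month_key(row))               # 2020-01
```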
## Data Transformation & Anonymization
Turning the massive 3.8 TB Pushshift archive into something usable required a custom Python pipeline. Here's how it works:
### 1. Streaming Decompression & Filtering
The source files are huge, so the pipeline uses `zstandard` stream readers to process the data one line at a time. This keeps memory usage low and lets the pipeline filter by subreddit and date as records stream through.
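
A minimal sketch of this step, not the project's actual pipeline: it assumes the dumps are newline-delimited JSON with `subreddit` and `created_utc` keys, and the function names are hypothetical. The `zstandard` import sits inside the reader so the filter can be tried on plain strings without the package installed.

```python
import io
import json


def iter_zst_lines(path):
    """Yield decoded lines from a .zst file without loading it all into memory."""
    import zstandard  # third-party: pip install zstandard

    # Pushshift dumps are compressed with a long window, so raise the limit.
    dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
    with open(path, "rb") as fh:
        reader = dctx.stream_reader(fh)
        with io.TextIOWrapper(reader, encoding="utf-8", errors="replace") as text:
            yield from text


def filter_records(lines, subreddits, start_ts, end_ts):
    """Yield parsed records from the wanted subreddits with start_ts <= created_utc < end_ts."""
    wanted = {name.lower() for name in subreddits}
    for line in lines:
        try:
            record = json.loads(line)
            created = int(record["created_utc"])
        except (ValueError, KeyError, TypeError):
            continue  # skip malformed lines or records without a timestamp
        if str(record.get("subreddit", "")).lower() not in wanted:
            continue
        if start_ts <= created < end_ts:
            yield record
```

The two pieces compose as `filter_records(iter_zst_lines(path), ["aws", "netsec"], start_ts, end_ts)`, where `path` points at one of the downloaded dump files. Only one line is held in memory at a time.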
### 2. Irreversible Anonymization
To protect privacy:
- Author Masking:
  - Every username gets salted and hashed with SHA-256.
  - Only the first 16 characters are saved (`author_id`).
  - That way, nobody can reverse-engineer the usernames, but you can still track unique users over time.
- Deleted Content:
  - If a post’s author is `[deleted]` or `[removed]`, it’s changed to a uniform `[deleted]` tag.
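
The masking rules above could look roughly like this. It is a sketch under assumptions: the card does not say how the salt is combined with the username or which encoding of the hash is truncated, so prepending the salt and keeping the first 16 hex characters are guesses, and `pseudonymize_author` is a hypothetical name.

```python
import hashlib

# Author values that mark removed accounts or content in the source data.
DELETED_MARKERS = {"[deleted]", "[removed]"}


def pseudonymize_author(author, salt):
    """Map a username to a 16-character ID, or to a uniform '[deleted]' tag."""
    if author in DELETED_MARKERS:
        return "[deleted]"
    # Assumption: the salt is prepended and the hex digest is truncated.
    digest = hashlib.sha256((salt + author).encode("utf-8")).hexdigest()
    return digest[:16]


# The same username and salt always produce the same ID, so a user can be
# followed over time without storing the username itself.
user_id = pseudonymize_author("example_user", salt="keep-this-secret")
print(len(user_id))  # 16
```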
### 3. Parquet Conversion & Type Enforcement
Everything ends up in Apache Parquet, compressed with snappy. Perks include:
- Strict typing for `created_utc` and `score` so data won’t get corrupted
- Models can load just the `text` column for fast sentiment analysis
- Data takes up way less space than the original JSON
## Implementation
You can load this dataset directly from the Hugging Face Hub:
```python
from datasets import load_dataset
# Load reddit dataset from Hugging Face path
dataset = load_dataset("NetGene/reddit-it-labor-sentiment-2020-2026")
# Extract the data into reddit_df
reddit_df = dataset['train'].to_pandas()
```
Or load a Parquet file you have already downloaded to a local directory:
```python
import pandas as pd
# Load a specific category (replace the placeholder with the path to your local file)
df = pd.read_parquet("{Your local file location for the parquet files}.parquet")
```
## Citation
If you use this dataset or the associated research in your work, please cite:
> **Boussakine, D. (2026).** *The impact of the artificial intelligence bubble on the job market for new IT specialists: An analysis of the disconnect between recruitment requirements and attitudes*
>
> Bachelor's Thesis, Lapland University of Applied Sciences (2026).
>
> Data originally from the Pushshift Reddit Archive via Academic Torrents.
Provider:
NetGene



