NetGene/reddit-it-labor-sentiment-2020-2026

Source: Hugging Face · Updated 2026-04-25 · Indexed 2026-05-03
Download link:
https://hf-mirror.com/datasets/NetGene/reddit-it-labor-sentiment-2020-2026
Description:
---
title: "Reddit IT Labor Market Sentiment (2020–2026)"
pretty_name: "IT Labor Market & AI Impact Sentiment"
license: mit
language:
- en
tags:
- labor-market
- reddit
- ai-impact
- tech-jobs
size_categories:
- 10M<n<100M
task_categories:
- text-classification
- token-classification
task_ids:
- sentiment-analysis
dataset_info:
  features:
  - name: created_utc
    dtype: int64
  - name: subreddit
    dtype: string
  - name: author_id
    dtype: string
  - name: text
    dtype: string
  - name: score
    dtype: int32
  splits:
  - name: train
    num_examples: 56458273
---

# Dataset

## Overview

Reddit IT Labor Market Sentiment (2020–2026)

### Thesis title

The impact of the artificial intelligence bubble on the job market for new IT specialists: An analysis of the disconnect between recruitment requirements and attitudes

This dataset is a collection of Reddit posts and comments pulled from 32 different IT-focused subreddits. It is built for researchers studying shifts in the labor market, skill inflation, and how people working in IT really feel, especially as AI investments ramp up.

### Ethics & Anonymization

Protecting user privacy matters, so the dataset follows strict guidelines:

- Every username is replaced with a salted SHA-256 hash so that individual identities are not exposed in the dataset.
- Personal information is scrubbed from the text.
- This dataset is **NOT** for business or profit; it is intended only for ethical research.

#### Right to Erasure

If you spot any content in this dataset that can be traced back to you or your social media profile and want it removed, contact the repository owner, [NetGene](https://huggingface.co/NetGene), and include the specific post identifiers you want redacted. You have the "Right to be Forgotten", so you can request removal if pseudonymization does not feel secure enough. Your privacy matters, and ethical research is a priority here.

## Data Selection & Filtering Logic

The data comes from the Pushshift Reddit Archive. Everything was filtered into three groups based on how useful it is for my model and dashboard building project, *[it-labor-decoupling-ai-cycle-impact](https://github.com/Net-Gene/it-labor-decoupling-ai-cycle-impact)*:

### 1. subreddits_sentimental

- Packed with personal, sentiment-rich stories about:
  - Career struggles
  - Job hunts
  - Changes in the tech industry
- Included subreddits:
  - r/csMajors
  - r/ExperiencedDevs
  - r/careerchange
  - r/it
  - r/SecurityCareerAdvice
  - r/learnprogramming

### 2. subreddits_less_sentimental

- More technical, less emotional:
  - Troubleshooting
  - Certifications
  - The occasional sentiment buried in technical advice
- Included subreddits:
  - r/networking
  - r/ccna
  - r/CompTIA
  - r/AZURE
  - r/aws
  - r/netsec

### 3. subreddits_undecided

- Too much noise:
  - Big, busy communities
  - Still unclear whether the labor market signals are strong or just lost in all the noise
- Included subreddits:
  - r/programming
  - r/MachineLearning
  - r/datascience
  - r/AskProgramming

## Technical Specs

- Format
  - Apache Parquet (compressed)
- Schema
  - `created_utc` (int64) - Unix timestamp of the post
  - `subreddit` (string) - The subreddit it came from
  - `author_id` (string) - Hashed, pseudonymized author ID
  - `text` (string) - The comment or post itself
  - `score` (int32) - Net score (upvotes minus downvotes)

## Data Transformation & Anonymization

Turning the massive 3.8TB Pushshift archive into something useful meant building a custom Python pipeline. Here is how it works:

### 1. Streaming Decompression & Filtering

The source files are huge, so the pipeline uses `zstandard` stream readers to process the data one line at a time. This keeps memory usage low while records are filtered by subreddit and date.

### 2. Irreversible Anonymization

To protect privacy:

- Author Masking:
  - Every username gets salted and hashed with SHA-256.
  - Only the first 16 characters of the hash are saved (`author_id`).
  - That way, usernames cannot practically be reverse-engineered, but unique users can still be tracked over time.
- Deleted Content:
  - If a post's author is `[deleted]` or `[removed]`, it is replaced with a uniform `[deleted]` tag.

### 3. Parquet Conversion & Type Enforcement

Everything ends up in Apache Parquet, compressed with Snappy. Perks include:

- Strict typing for `created_utc` and `score`, so values keep a consistent type and do not get corrupted
- Models can load just the `text` column for fast sentiment analysis
- Data takes up far less space than the original JSON

## Implementation

You can load this dataset directly from the Hugging Face Hub:

```python
from datasets import load_dataset

# Load the Reddit dataset from its Hugging Face repository path
dataset = load_dataset("NetGene/reddit-it-labor-sentiment-2020-2026")

# Convert the train split into a pandas DataFrame (the full split is large)
reddit_df = dataset['train'].to_pandas()
```

Or load a Parquet file from your local file directory into your Python environment:

```python
import pandas as pd

# Load a specific category; replace the placeholder with your local Parquet file path
df = pd.read_parquet("{Your Local file location for the parquet files}.parquet")
```

## Citation

If you use this dataset or the associated research in your work, please cite:

> **Boussakine, D. (2026).** *The impact of the artificial intelligence bubble on the job market for new IT specialists: An analysis of the disconnect between recruitment requirements and attitudes*
>
> Bachelor's Thesis, Lapland University of Applied Sciences (2026).
>
> Data originally from the Pushshift Reddit Archive via Academic Torrents.
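
## Appendix: Filtering and Pseudonymization Sketch

The filtering and author-masking steps described under Data Transformation & Anonymization can be illustrated with a minimal sketch. This is not the author's actual pipeline: the function names `pseudonymize_author` and `filter_and_anonymize`, the choice to prepend the salt to the username, and the Pushshift field names (`author`, `body`, `selftext`) are assumptions made for illustration. It uses only the standard library and takes any iterable of JSON lines as input.

```python
import hashlib
import json
from typing import Iterable, Iterator, Set

# Author values that are collapsed into a single uniform tag.
DELETED_MARKERS = {"[deleted]", "[removed]"}


def pseudonymize_author(username: str, salt: str) -> str:
    """Return the first 16 hex characters of a salted SHA-256 hash.

    Assumption: the salt is simply prepended to the username; the
    original pipeline may combine them differently.
    """
    if username in DELETED_MARKERS:
        return "[deleted]"
    digest = hashlib.sha256((salt + username).encode("utf-8")).hexdigest()
    return digest[:16]


def filter_and_anonymize(
    lines: Iterable[str],
    subreddits: Set[str],
    start_ts: int,
    end_ts: int,
    salt: str,
) -> Iterator[dict]:
    """Stream JSON lines, keep matching records, and mask authors.

    Records are processed one at a time, so memory use stays flat
    regardless of input size.
    """
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines
        if not isinstance(record, dict):
            continue
        if record.get("subreddit") not in subreddits:
            continue
        try:
            created = int(record["created_utc"])
        except (KeyError, TypeError, ValueError):
            continue  # skip records without a usable timestamp
        if not (start_ts <= created < end_ts):
            continue
        author = record.get("author") or "[deleted]"
        yield {
            "created_utc": created,
            "subreddit": record["subreddit"],
            "author_id": pseudonymize_author(author, salt),
            # Assumed field names: comments carry `body`, posts carry `selftext`.
            "text": record.get("body") or record.get("selftext") or "",
            "score": int(record.get("score") or 0),
        }
```

In practice, `lines` could come from a text wrapper around a `zstandard` stream reader opened on a Pushshift dump; that wiring is left out so the sketch stays dependency-free. Because the same salt always maps a username to the same `author_id`, unique users remain trackable over time, and the salt must be kept private for the masking to hold.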
Provider:
NetGene