
MIT-WAL/ai-jobs-news-articles-abstracts

Source: Hugging Face | Updated: 2026-04-06 | Indexed: 2026-04-12
Download link:
https://hf-mirror.com/datasets/MIT-WAL/ai-jobs-news-articles-abstracts
Resource description:
---
language:
- en
license: mit
size_categories:
- 10K<n<100K
tags:
- news
- scholarly-articles
- artificial-intelligence
- labor-market
- future-of-work
- gdelt
- semantic-scholar
- retrieval-augmented-generation
- economics
pretty_name: News articles and paper abstracts (AI, labor, and jobs)
---

# News articles and research abstracts on AI, labor, and jobs

## Dataset summary

This file is a **standalone CSV** of **news articles** (full scraped text) and **scholarly paper abstracts** curated for research on **artificial intelligence, work, and labor markets**. Each row is one document: a stable id, publication date, normalized title and main text, and a small metadata dictionary.

**Rows:** 49,077

| `document_class` | Rows   | Approx. date range (`date` column) |
|------------------|--------|-------------------------------------|
| `news`           | 30,014 | Jan. 2025 → Mar. 2026               |
| `paper`          | 19,063 | Jan. 2020 → Apr. 2026               |

Use it for retrieval, RAG, classification, topic modeling, or qualitative sampling, without depending on any other repository layout or auxiliary files.

<details>
<summary><strong>News sources in this snapshot</strong> (32 outlets; expand for full list)</summary>

These are the distinct values of `metadata["SourceCommonName"]` for rows with `document_class == "news"`.

- abc.net.au
- aljazeera.com
- apnews.com
- bbc.com
- bloomberg.com
- businessinsider.com
- businesstoday.in
- cbsnews.com
- chicagotribune.com
- chinadaily.com.cn
- cnbc.com
- cnn.com
- dailymail.co.uk
- dw.com
- econotimes.com
- indianexpress.com
- indiatimes.com
- livemint.com
- manilatimes.net
- nbcnews.com
- newsweek.com
- npr.org
- nytimes.com
- scmp.com
- techcrunch.com
- techradar.com
- theglobeandmail.com
- theguardian.com
- time.com
- webpronews.com
- wsj.com
- yahoo.com

</details>

## What’s in each row

- **News** (`document_class == "news"`): `doc_id` is the **SHA-256 (hex)** of the article URL. `text` is the article body; `metadata` holds GDELT-derived source and content signals (see below).
- **Papers** (`document_class == "paper"`): `doc_id` is the **Semantic Scholar `paperId`**. `text` is the **abstract only** (not full PDF text). `metadata` includes bibliographic fields when available.

## Column reference

| Column           | Description |
|------------------|-------------|
| `doc_id`         | Stable id: URL hash (news) or `paperId` (papers). |
| `date`           | `YYYY-MM-DD` publication date (or best available). |
| `document_class` | `"news"` or `"paper"`. |
| `metadata`       | Python `repr` of a `dict` (use `ast.literal_eval` in Python). |
| `wordcount`      | Word count (float; often aligned with GDELT for news). |
| `title`          | Headline (news) or paper title. |
| `text`           | Article body (news) or abstract (papers). |

## News metadata (GDELT-related keys)

For news rows, `metadata` may include these keys. They originate from **GDELT Global Knowledge Graph** style exports; see the [GKG codebook](http://data.gdeltproject.org/documentation/GDELT-Global_Knowledge_Graph_Codebook-V2.1.pdf) for full detail.

| Key | Meaning |
|-----|---------|
| `SourceCommonName` | Human-readable **outlet** (usually the website domain, e.g. `reuters.com`). |
| `SourceCollectionIdentifier` | **Type of document id**: `1` means **open web** and `DocumentIdentifier` is a full **URL**; other values mean citations, DOIs, etc. |
| `DocumentIdentifier` | **Canonical document id**; for web news, the **article URL** (same URL used to compute `doc_id`). |
| `V2Locations` | **Places** mentioned in the article. Raw GDELT encodes geocoded blocks; in this release the string may be **normalized** (simplified place tokens, not the full raw GKG encoding). |
| `V2Tone` | **Comma-separated numbers** (in order): (1) overall **tone** −100…+100; (2) **positive** word %; (3) **negative** word %; (4) **polarity** (how emotionally charged the text is); (5) **activity** density; (6) **self/group** (pronoun) density; (7) GDELT **word count** for the text they analyzed. |
| `AllNames` | **Proper names** GDELT associated with the article (people, orgs, named events, etc.); often **simplified** to a deduplicated token list in this release, not raw offset-encoded blocks. |
| `wordcount` | Word count carried in metadata (may mirror the last `V2Tone` field or the `wordcount` column). |

## Paper metadata keys (when present)

| Key | Meaning |
|-----|---------|
| `paperId` | Semantic Scholar identifier (matches `doc_id`). |
| `year` | Publication year. |
| `authors` | Author string or list as stored at export time. |
| `citationCount` | Citation count from Semantic Scholar. |

## Intended uses

- Build **retrieval** or **RAG** corpora on AI, automation, and labor themes.
- Train or evaluate **models** on mixed news + academic abstract text.
- **Filter or join** on dates, outlets, or tone; trace articles back via `DocumentIdentifier`.

## Limitations

- **Not factual ground truth**: news reflects publishers; abstracts are summaries only.

## File

- **Format:** CSV, UTF-8, header row.
- **File name:** `text_db.csv`

## Quick start

Dataset hub id: [`MIT-WAL/ai-jobs-news-articles-abstracts`](https://huggingface.co/datasets/MIT-WAL/ai-jobs-news-articles-abstracts).

### With `load_dataset`

```python
from datasets import load_dataset

ds = load_dataset("MIT-WAL/ai-jobs-news-articles-abstracts", split="train")
```

### With pandas (`hf://`)

```python
import pandas as pd

df = pd.read_csv("hf://datasets/MIT-WAL/ai-jobs-news-articles-abstracts/text_db.csv")
```

## Citation

If you use this dataset, please cite the MIT Work Analytics Laboratory.

## License

This dataset is released under the MIT License.
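
## Example: parsing `metadata` and `V2Tone`

The column reference says `metadata` is stored as the Python `repr` of a `dict`, and the GDELT table lists seven comma-separated numbers in `V2Tone`. The sketch below shows one way to decode both, and to recompute a news `doc_id` from its URL. It is illustrative only: the helper names (`parse_metadata`, `parse_v2tone`, `url_to_doc_id`) and the sample row are hypothetical, and hashing the UTF-8 bytes of the URL is an assumption the card does not spell out.

```python
import ast
import hashlib

# Names for the seven V2Tone values, in the order the card lists them.
# The names themselves are illustrative, not official GDELT field names.
TONE_FIELDS = (
    "tone", "positive_pct", "negative_pct", "polarity",
    "activity_density", "self_group_density", "word_count",
)


def parse_metadata(raw):
    """Decode a `metadata` cell (Python dict repr) into a dict; {} on failure."""
    if isinstance(raw, dict):
        return raw
    if not isinstance(raw, str) or not raw.strip():
        return {}  # covers NaN / empty cells read by pandas
    try:
        value = ast.literal_eval(raw)  # safe: evaluates literals only
    except (ValueError, SyntaxError):
        return {}
    return value if isinstance(value, dict) else {}


def parse_v2tone(v2tone):
    """Split a comma-separated V2Tone string into named floats."""
    parts = [p.strip() for p in str(v2tone).split(",") if p.strip()]
    return {name: float(p) for name, p in zip(TONE_FIELDS, parts)}


def url_to_doc_id(url):
    """SHA-256 hex digest of the URL (UTF-8 encoding is an assumption)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


# Hypothetical metadata cell; the values are made up for illustration.
sample_raw = (
    "{'SourceCommonName': 'example.com', "
    "'DocumentIdentifier': 'https://example.com/ai-jobs', "
    "'V2Tone': '-1.5,2.0,3.5,5.5,20.0,1.0,400'}"
)

meta = parse_metadata(sample_raw)
tone = parse_v2tone(meta["V2Tone"])
print(meta["SourceCommonName"], tone["tone"], tone["word_count"])
print(url_to_doc_id(meta["DocumentIdentifier"]))
```

On the real file, something like `df["metadata"].apply(parse_metadata)` would apply the same decoding to every row. Whether a recomputed hash matches the stored `doc_id` depends on how the URL was normalized at export time, so treat any mismatch as a prompt to inspect the data rather than as an error.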
Provider:
MIT-WAL