MIT-WAL/ai-jobs-news-articles-abstracts
Source: Hugging Face. Updated 2026-04-06; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/MIT-WAL/ai-jobs-news-articles-abstracts
Resource overview:
---
language:
- en
license: mit
size_categories:
- 10K<n<100K
tags:
- news
- scholarly-articles
- artificial-intelligence
- labor-market
- future-of-work
- gdelt
- semantic-scholar
- retrieval-augmented-generation
- economics
pretty_name: News articles and paper abstracts (AI, labor, and jobs)
---
# News articles and research abstracts on AI, labor, and jobs
## Dataset summary
This file is a **standalone CSV** of **news articles** (full scraped text) and **scholarly paper abstracts** curated for research on **artificial intelligence, work, and labor markets**. Each row is one document: a stable id, publication date, normalized title and main text, and a small metadata dictionary.
**Rows:** 49,077
| `document_class` | Rows | Approx. date range (`date` column) |
|------------------|--------|-------------------------------------|
| `news` | 30,014 | Jan. 2025 → Mar. 2026 |
| `paper` | 19,063 | Jan. 2020 → Apr. 2026 |
Use it for retrieval, RAG, classification, topic modeling, or qualitative sampling. The file is self-contained and does not depend on any other repository layout or auxiliary files.
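As a sanity check after downloading, the per-class counts in the table above can be compared against what was actually loaded. The following is a minimal sketch using only the standard library; the helper names (`class_counts`, `matches_expected`) and the tiny in-memory sample are illustrative and not part of the dataset.

```python
from collections import Counter

# Row counts as stated in the summary table of this card.
EXPECTED = {"news": 30_014, "paper": 19_063}

def class_counts(rows):
    """Count rows per `document_class`; `rows` is any iterable of dict-like rows."""
    return Counter(row["document_class"] for row in rows)

def matches_expected(counts, expected=EXPECTED):
    """Return True when the observed counts equal the documented ones."""
    return dict(counts) == expected

# Hypothetical three-row sample, only to show the call pattern.
sample_rows = [
    {"document_class": "news"},
    {"document_class": "news"},
    {"document_class": "paper"},
]
sample_counts = class_counts(sample_rows)
print(sample_counts["news"], sample_counts["paper"])
print(matches_expected(sample_counts))  # False for the toy sample
```

With the full file, `rows` could come from `csv.DictReader` or from `df.to_dict("records")`.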
<details>
<summary><strong>News sources in this snapshot</strong> (32 outlets — expand for full list)</summary>
These are the distinct values of `metadata["SourceCommonName"]` for rows with `document_class == "news"`.
- abc.net.au
- aljazeera.com
- apnews.com
- bbc.com
- bloomberg.com
- businessinsider.com
- businesstoday.in
- cbsnews.com
- chicagotribune.com
- chinadaily.com.cn
- cnbc.com
- cnn.com
- dailymail.co.uk
- dw.com
- econotimes.com
- indianexpress.com
- indiatimes.com
- livemint.com
- manilatimes.net
- nbcnews.com
- newsweek.com
- npr.org
- nytimes.com
- scmp.com
- techcrunch.com
- techradar.com
- theglobeandmail.com
- theguardian.com
- time.com
- webpronews.com
- wsj.com
- yahoo.com
</details>
## What’s in each row
- **News** (`document_class == "news"`): `doc_id` is the **SHA-256 (hex)** of the article URL. `text` is the article body; `metadata` holds GDELT-derived source and content signals (see below).
- **Papers** (`document_class == "paper"`): `doc_id` is the **Semantic Scholar `paperId`**. `text` is the **abstract only** (not full PDF text). `metadata` includes bibliographic fields when available.
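To illustrate how a news `doc_id` relates to its URL, here is a small sketch. It assumes the URL string is hashed as UTF-8 bytes with no prior normalization; the card does not state these details, so a recomputed hash may differ from the stored `doc_id` if the export applied other rules. The function name and the example URL are hypothetical.

```python
import hashlib

def news_doc_id(url: str) -> str:
    """Return the SHA-256 hex digest of a URL (assumption: UTF-8, no normalization)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()

# Hypothetical URL, for illustration only.
example_id = news_doc_id("https://example.com/news/ai-and-jobs")
print(len(example_id))  # a SHA-256 hex digest is always 64 characters
```

For real rows, the natural input would be `metadata["DocumentIdentifier"]`, which the card describes as the URL used to compute `doc_id`.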
## Column reference
| Column | Description |
|------------------|-------------|
| `doc_id` | Stable id: URL hash (news) or `paperId` (papers). |
| `date` | `YYYY-MM-DD` publication date (or best available). |
| `document_class` | `"news"` or `"paper"`. |
| `metadata` | Python `repr` of a `dict` (use `ast.literal_eval` in Python). |
| `wordcount` | Word count (float; often aligned with GDELT for news). |
| `title` | Headline (news) or paper title. |
| `text` | Article body (news) or abstract (papers). |
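Because `metadata` is stored as the `repr` of a Python `dict`, it has to be parsed before use. A defensive sketch with `ast.literal_eval` follows; the helper name `parse_metadata` and the sample string are illustrative, and the choice to return an empty dict for missing or malformed cells is an assumption rather than documented behavior.

```python
import ast

def parse_metadata(raw):
    """Parse a `metadata` cell (Python dict repr) into a dict; return {} on failure."""
    if not isinstance(raw, str) or not raw.strip():
        return {}  # e.g. NaN from pandas or an empty cell
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return {}  # not a valid Python literal
    return value if isinstance(value, dict) else {}

# Hypothetical cell content; real values and types may differ.
sample = "{'SourceCommonName': 'bbc.com', 'SourceCollectionIdentifier': 1}"
meta = parse_metadata(sample)
print(meta["SourceCommonName"])
```

With pandas, this could be applied column-wide as `df["metadata"].map(parse_metadata)`.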
## News metadata (GDELT-related keys)
For news rows, `metadata` may include these keys. They originate from **GDELT Global Knowledge Graph** (GKG) style exports; see the [GKG codebook](http://data.gdeltproject.org/documentation/GDELT-Global_Knowledge_Graph_Codebook-V2.1.pdf) for full detail.
| Key | Meaning |
|-----|--------|
| `SourceCommonName` | Human-readable **outlet** (usually the website domain, e.g. `bbc.com`). |
| `SourceCollectionIdentifier` | **Type of document id**: `1` means **open web** and `DocumentIdentifier` is a full **URL**; other values mean citations, DOIs, etc. |
| `DocumentIdentifier` | **Canonical document id**; for web news, the **article URL** (same URL used to compute `doc_id`). |
| `V2Locations` | **Places** mentioned in the article. Raw GDELT encodes geocoded blocks; in this release the string may be **normalized** (simplified place tokens, not the full raw GKG encoding). |
| `V2Tone` | **Comma-separated numbers** (in order): (1) overall **tone** −100…+100; (2) **positive** word %; (3) **negative** word %; (4) **polarity** (how emotionally charged the text is); (5) **activity** density; (6) **self/group** (pronoun) density; (7) GDELT **word count** for the text it analyzed. |
| `AllNames` | **Proper names** GDELT associated with the article (people, orgs, named events, etc.); often **simplified** to a deduplicated token list in this release, not raw offset-encoded blocks. |
| `wordcount` | Word count carried in metadata (may mirror the last `V2Tone` field or the `wordcount` column). |
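The `V2Tone` string described above can be unpacked into named fields. This sketch assumes the value is a comma-separated string of exactly seven numbers in the order listed in the table; the `Tone` field names are my own labels, not GDELT identifiers, and the sample string is made up.

```python
from typing import NamedTuple, Optional

class Tone(NamedTuple):
    tone: float                # overall tone
    positive: float            # positive word %
    negative: float            # negative word %
    polarity: float            # emotional charge
    activity_density: float    # activity density
    self_group_density: float  # self/group (pronoun) density
    word_count: float          # GDELT word count

def parse_v2tone(raw: str) -> Optional[Tone]:
    """Split a V2Tone string into seven floats; return None if it does not fit."""
    parts = [p.strip() for p in raw.split(",")]
    if len(parts) != 7:
        return None
    try:
        return Tone(*(float(p) for p in parts))
    except ValueError:
        return None

# Hypothetical value, for illustration only.
tone = parse_v2tone("-1.5,2.0,3.5,5.5,20.1,0.4,812")
print(tone.tone, tone.word_count)
```

Returning `None` instead of raising keeps row-wise processing simple when some rows lack a well-formed `V2Tone`.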
## Paper metadata keys (when present)
| Key | Meaning |
|-----|--------|
| `paperId` | Semantic Scholar identifier (matches `doc_id`). |
| `year` | Publication year. |
| `authors` | Author string or list as stored at export time. |
| `citationCount` | Citation count from Semantic Scholar. |
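Since these keys are only present when available, code that filters papers should tolerate missing values. A minimal sketch, assuming `metadata` has already been parsed into a dict; the function name and the default thresholds are arbitrary illustrations.

```python
def is_well_cited(meta: dict, min_citations: int = 10, since_year: int = 2020) -> bool:
    """Return True if a paper's metadata has year >= since_year and enough citations.

    Missing or non-numeric `year` / `citationCount` values count as not matching.
    """
    year = meta.get("year")
    cites = meta.get("citationCount")
    if year is None or cites is None:
        return False
    try:
        return int(year) >= since_year and int(cites) >= min_citations
    except (TypeError, ValueError):
        return False

# Hypothetical metadata dicts, for illustration only.
print(is_well_cited({"paperId": "abc", "year": 2023, "citationCount": 42}))
print(is_well_cited({"paperId": "def", "year": 2023}))
```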
## Intended uses
- Build **retrieval** or **RAG** corpora on AI, automation, and labor themes.
- Train or evaluate **models** on mixed news + academic abstract text.
- **Filter or join** on dates, outlets, or tone; trace articles back via `DocumentIdentifier`.
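The filtering use case above can be sketched with the standard library alone. The example below runs on a tiny in-memory CSV with made-up rows that mimic the documented columns; `filter_news` is a hypothetical helper. It relies on `YYYY-MM-DD` strings sorting chronologically, so dates are compared as plain strings.

```python
import ast
import csv
import io

# Hypothetical rows that follow the documented column layout.
SAMPLE_CSV = """doc_id,date,document_class,metadata,wordcount,title,text
a1,2025-03-02,news,"{'SourceCommonName': 'bbc.com'}",120.0,Hypothetical headline A,Body A
b2,2025-07-15,news,"{'SourceCommonName': 'cnn.com'}",300.0,Hypothetical headline B,Body B
c3,2021-05-01,paper,"{'year': 2021}",150.0,Hypothetical paper C,Abstract C
"""

def filter_news(rows, start, end, outlet=None):
    """Keep news rows dated within [start, end], optionally from a single outlet."""
    selected = []
    for row in rows:
        if row["document_class"] != "news":
            continue
        if not (start <= row["date"] <= end):
            continue
        meta = ast.literal_eval(row["metadata"])
        if outlet is not None and meta.get("SourceCommonName") != outlet:
            continue
        selected.append(row)
    return selected

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
hits = filter_news(rows, "2025-01-01", "2025-06-30")
print([r["doc_id"] for r in hits])
```

The same logic carries over to pandas with boolean masks on `df["date"]` and a parsed `metadata` column.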
## Limitations
- **Not factual ground truth.** News articles reflect their publishers' reporting, and abstracts are summaries only.
## File
- **Format:** CSV, UTF-8, header row.
- **File name:** `text_db.csv`
## Quick start
Dataset hub id: [`MIT-WAL/ai-jobs-news-articles-abstracts`](https://huggingface.co/datasets/MIT-WAL/ai-jobs-news-articles-abstracts).
### With `load_dataset`
```python
from datasets import load_dataset
ds = load_dataset("MIT-WAL/ai-jobs-news-articles-abstracts", split="train")
```
### With pandas (`hf://`)
```python
import pandas as pd

# hf:// paths are resolved through fsspec; this requires the huggingface_hub package.
df = pd.read_csv("hf://datasets/MIT-WAL/ai-jobs-news-articles-abstracts/text_db.csv")
```
## Citation
If you use this dataset, please cite the MIT Work Analytics Laboratory.
## License
This dataset is released under the MIT License.
Provider:
MIT-WAL



