bguzzo2k/nyt_100y_news_headlines

Name: bguzzo2k/nyt_100y_news_headlines
Creator: bguzzo2k
Published: 2026-03-22 22:10:26
License: No description provided

Source: Hugging Face. Updated: 2026-03-22. Indexed: 2026-03-29.

Download link:

https://hf-mirror.com/datasets/bguzzo2k/nyt_100y_news_headlines


Description:

---
license: apache-2.0
language:
- en
tags:
- news
- nlp
- titles
- abstract
- nyt
- corpus
- headlines
- pre-training
pretty_name: New York Times News 100 Years
size_categories:
- 10M<n<100M
---

# New York Times 100 Years of News Headlines (1927-2026)

This dataset contains approximately 100 years of New York Times news headlines and abstracts, ranging from **1927 to January 2026**. It is designed for time-series analysis, NLP tasks, and historical research.

**Hugging Face Dataset Page:** [bguzzo2k/nyt_100y_news_headlines](https://huggingface.co/datasets/bguzzo2k/nyt_100y_news_headlines)

## Dataset Description

The dataset consists of metadata for articles published by The New York Times. It captures the "Main" headline and the "Abstract" (summary) provided by the NYT Archive.

- **Sources:** [New York Times Archive API](https://developer.nytimes.com/docs/archive-product/1/overview).
- **Temporal Coverage:** January 1, 1927, to January 31, 2026.
- **Content:** Headline (Title) and Abstract (Summary) of articles.

## Data Format

The data is organized into yearly Parquet files located in the `yearly_parquet/` directory (e.g., `nyt_1927.parquet`, `nyt_2026.parquet`). Each file is optimized for performance using:

- **Compression:** Snappy
- **Indexing:** Sorted by publication date.

### Schema

| Column | Type | Description |
| :--- | :--- | :--- |
| `date` | datetime64[ns, UTC] | The publication date and time of the article. |
| `headline` | string | The main headline of the article. |
| `abstract` | string | A brief summary or abstract of the article content. |

## Data Generation & Maintenance

The dataset is generated and updated using the Python scripts found in the `utils/` directory:

1. **`utils/nyt_archive_downloader.py`**: A multi-threaded downloader that fetches raw JSON data from the NYT Archive API. It uses exponential backoff to respect API limits.
2. **`utils/nyt_pqt_convert.py`**: Processes the raw JSON files, extracts relevant fields, cleans the data, and converts it into structured Parquet files grouped by year.

## Usage

To read the data using Python and Pandas:

```python
import pandas as pd

# Example: Read the year 2025
df = pd.read_parquet("yearly_parquet/nyt_2025.parquet")
print(df.head())
```

## License

This dataset is provided for educational and research purposes. Please refer to the [New York Times Developer Terms of Service](https://developer.nytimes.com/terms) regarding the use of their API data.
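
The maintenance notes above say the downloader uses exponential backoff to respect API limits, but the script itself is not reproduced here. The following is a minimal sketch of that general technique, not the code in `utils/nyt_archive_downloader.py`. The function name `fetch_with_backoff`, its parameters, and the default delays are all hypothetical, and a real downloader would likely retry only on specific HTTP errors and add random jitter.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_backoff(
    fetch: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until it succeeds, doubling the wait after each failure.

    Hypothetical helper for illustration only; `fetch` stands in for
    whatever request the real downloader makes.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            # Give up once the final attempt has failed.
            if attempt == max_attempts - 1:
                raise
            # Delay grows as base_delay * 2**attempt, capped at max_delay.
            sleep(min(base_delay * 2 ** attempt, max_delay))
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the retry logic separate from the actual waiting, so the schedule can be checked without pausing. A caller would wrap one request per call, for example `fetch_with_backoff(lambda: download_month(1927, 1))`, where `download_month` is a placeholder for the caller's own request function.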

