NewZealand-PD-Newspapers

Name: NewZealand-PD-Newspapers
Creator: maas
Published: 2025-07-03 16:31:06
License: No description available

ModelScope Community. Updated: 2025-07-03. Indexed: 2025-06-21.

Download link:

https://modelscope.cn/datasets/PleIAs/NewZealand-PD-Newspapers

Resource description:

# New Zealand Public Domain Newspapers Dataset Card

## Dataset Overview

**Dataset Name:** New Zealand Public Domain Newspapers

**Description:** The New Zealand Public Domain Newspapers dataset comprises a collection of historical newspapers from New Zealand. The dataset is organized into Parquet files divided by year. Each file contains detailed information about the newspaper articles, including metadata extracted from XML files.

**Languages Covered:** The dataset primarily contains English-language documents.

**Total Number of Documents:** 1,772,785

**Total Number of Words:** 10,178,602,316

**Average Number of Words per Document:** 5,741.59

## Dataset Structure

The dataset is stored in Parquet files organized by year. Each Parquet file contains the following metadata for each document:

- `identifier`: Unique identifier for the document.
- `word_count`: Total number of words in the document.
- `text`: Full text content of the document.

Each Parquet file is accompanied by the original XML files collected into tar.gz files, one for each Parquet file.

## Usage

The dataset can be used for various purposes including, but not limited to:

- Analyzing historical newspaper trends and topics over time.
- Studying the distribution and frequency of words and phrases in historical contexts.
- Developing natural language processing models to analyze the content of newspaper articles.
- Extracting insights related to historical events and their representation in newspapers.

## Source

The dataset is collected from the public domain newspapers of New Zealand. The original XML files are processed and converted into Parquet format for efficient storage and analysis.
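
As a rough illustration of the structure and usage described above, the following Python sketch works with records that carry the three documented fields (`identifier`, `word_count`, `text`). The helper names (`load_year`, `average_words`, `top_words`), the file path passed to `load_year`, and the sample rows are all assumptions made for this example; the card does not specify how the per-year files are named. Loading is shown with `pandas.read_parquet`, which needs a Parquet engine such as pyarrow installed.

```python
import re
from collections import Counter

# Field names are taken from the dataset card; everything else is illustrative.
FIELDS = ("identifier", "word_count", "text")


def load_year(path):
    """Read one per-year Parquet file into a list of dicts.

    The path is hypothetical: the card does not state the file naming scheme.
    pandas is imported lazily so the rest of the sketch runs without it.
    """
    import pandas as pd

    return pd.read_parquet(path, columns=list(FIELDS)).to_dict("records")


def average_words(records):
    """Mean of the word_count field, or 0.0 for an empty input."""
    counts = [r["word_count"] for r in records]
    return sum(counts) / len(counts) if counts else 0.0


def top_words(records, n=3):
    """Most frequent lowercase alphabetic tokens across all texts."""
    counter = Counter()
    for r in records:
        counter.update(re.findall(r"[a-z]+", r["text"].lower()))
    return counter.most_common(n)


# Synthetic rows standing in for real documents (not taken from the dataset).
sample = [
    {"identifier": "doc-1", "word_count": 4, "text": "The harbour board met"},
    {"identifier": "doc-2", "word_count": 5, "text": "Board members thanked the board"},
]

# The card's overall average follows from its two totals.
card_average = round(10_178_602_316 / 1_772_785, 2)

print(average_words(sample))   # 4.5
print(top_words(sample, n=1))  # [('board', 3)]
print(card_average)            # 5741.59
```

With a real file, `records = load_year(path)` could replace `sample`. For large years it may be preferable to read only the `word_count` column first, since `text` holds the full document content.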
## Dataset Citation

If you use this dataset in your research, please cite it as follows:

```
@dataset{NZ_Public_Domain_Newspapers_2024,
  title={New Zealand Public Domain Newspapers Dataset},
  author={Pleias},
  year={2024},
  description={Collection of historical newspapers from New Zealand, organized by year and language, with detailed metadata extracted from XML files.}
}
```

**Note:** The dataset is presented and maintained by Pleias. All rights reserved.

Provider:

maas

Created:

2025-06-19
