jkeisling/hacker-news-corpus-2007-2022

Name: jkeisling/hacker-news-corpus-2007-2022
Creator: jkeisling
Published: 2023-07-05 04:13:00
License: no description available

Source: Hugging Face. Updated 2023-07-05; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/jkeisling/hacker-news-corpus-2007-2022

Resource description:

---
license: mit
language:
- en
pretty_name: Hacker News posts and comments, 2007-Nov 2022
size_categories:
- 10M<n<100M
---

# Hacker News corpus, 2007-Nov 2022

## Dataset Description

### Dataset Summary

**Dataset Name:** Hacker News Full Corpus (2007 - November 2022)

**Description:**

> NOTE: I am not affiliated with Y Combinator.

This dataset is a July 2023 snapshot of YCombinator's [BigQuery dump](https://console.cloud.google.com/marketplace/details/y-combinator/hacker-news) of the entire archive of posts and comments made on Hacker News. It contains posts from Hacker News' inception in 2007 through to November 16, 2022, when the BigQuery database was last updated.

The dataset does not incorporate any modifications or filtering - it is a raw dump from the original dataset provided by YCombinator. Hence, it retains the same structure and fields as the original BigQuery table, serving as a ready-to-use resource for conducting large-scale data analysis or training language models.

All credit for the original data collection and maintenance goes to YCombinator and the original post and comment authors. This version of the dataset has been prepared for convenience and ease of use within the HuggingFace ecosystem, especially for those interested in offline usage or who prefer not to use Google Cloud.

Please bear in mind that this dataset is a snapshot and will probably not be updated. For the latest data, consider accessing the live data directly from the official [Hacker News API](https://github.com/HackerNews/API), potentially using [Anant Narayanan's scripts](https://www.kix.in/2023/05/05/hacker-news-chatgpt-plugin/#downloading-the-dataset).

Please use responsibly, respecting all relevant terms of use and privacy considerations inherent in the data.
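For readers who want to try the snapshot from Python, a minimal sketch follows. It assumes the Hugging Face `datasets` library is installed and that the repository exposes a single `train` split; the split name and the helper names `stream_items` and `first_n` are assumptions made for illustration, not details stated in the card. Streaming is used so that the first rows can be read without downloading the whole corpus.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List

# Repository id taken from the download link on this page.
REPO_ID = "jkeisling/hacker-news-corpus-2007-2022"


def stream_items(repo_id: str = REPO_ID, split: str = "train") -> Iterator[Dict[str, Any]]:
    """Yield rows lazily from the Hub (assumed split name: "train")."""
    # Imported inside the function so this module loads even when the
    # optional `datasets` dependency is not installed.
    from datasets import load_dataset

    dataset = load_dataset(repo_id, split=split, streaming=True)
    yield from dataset


def first_n(rows: Iterable[Dict[str, Any]], n: int) -> List[Dict[str, Any]]:
    """Materialize only the first n rows of any (possibly huge) row iterable."""
    return list(islice(rows, n))
```

Under those assumptions, `first_n(stream_items(), 5)` would fetch five rows over the network; if the split is named differently, pass it through the `split` argument.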
### Languages

English

## Dataset Structure

### Data Fields

| fullname    | mode     | type      | description                                                           |
| ----------- | -------- | --------- | --------------------------------------------------------------------- |
| title       | NULLABLE | STRING    | Story title                                                           |
| url         | NULLABLE | STRING    | Story url                                                             |
| text        | NULLABLE | STRING    | Story or comment text                                                 |
| dead        | NULLABLE | BOOLEAN   | Is dead?                                                              |
| by          | NULLABLE | STRING    | The username of the item's author.                                    |
| score       | NULLABLE | INTEGER   | Story score                                                           |
| time        | NULLABLE | INTEGER   | Unix time                                                             |
| timestamp   | NULLABLE | TIMESTAMP | Timestamp for the unix time                                           |
| type        | NULLABLE | STRING    | Type of details (comment, comment_ranking, poll, story, job, pollopt) |
| id          | NULLABLE | INTEGER   | The item's unique id.                                                 |
| parent      | NULLABLE | INTEGER   | Parent comment ID                                                     |
| descendants | NULLABLE | INTEGER   | Number of story or poll descendants                                   |
| ranking     | NULLABLE | INTEGER   | Comment ranking                                                       |
| deleted     | NULLABLE | BOOLEAN   | Is deleted?                                                           |

## Dataset Creation

### Curation Rationale

This dataset provides a snapshot of the Hacker News posts and comments archive, sourced from YCombinator's open data, to enable easy and direct access without the need for a Google Cloud account or BigQuery interface, and without putting undue strain on the HN API. It aims to simplify the data acquisition process, promoting its use within the HuggingFace ecosystem for various tasks including analysis, trend prediction, sentiment studies, and language model training. By minimizing barriers to access, this dataset encourages a wider usage, fostering innovation in natural language processing and related fields.

### Annotations

### Personal and Sensitive Information

This dataset has not undergone specific checks for personally identifiable information (PII); hence, it's possible that some may exist within the data. However, as the data source is publicly available and shared by YCombinator, any potential PII present is already part of the public domain.
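Returning to the `id` and `parent` columns in the Data Fields table: because every row is a flat item, discussion threads have to be rebuilt by following `parent` links. The sketch below shows one way to do that with the standard library only. It assumes, as in the Hacker News API, that a top-level comment's `parent` is the story itself; the sample rows and the helper names are made up for illustration and are not taken from the dataset.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Optional

Row = Dict[str, Any]


def index_by_id(rows: Iterable[Row]) -> Dict[int, Row]:
    """Map each item's id to its row, skipping rows whose id is null."""
    return {row["id"]: row for row in rows if row.get("id") is not None}


def children_by_parent(rows: Iterable[Row]) -> Dict[int, List[int]]:
    """Group item ids under the id found in their `parent` column."""
    children: Dict[int, List[int]] = defaultdict(list)
    for row in rows:
        if row.get("id") is not None and row.get("parent") is not None:
            children[row["parent"]].append(row["id"])
    return dict(children)


def root_of(item_id: int, by_id: Dict[int, Row]) -> Optional[int]:
    """Follow `parent` links upward; return the id of the item with no parent.

    Returns None when the chain leaves the loaded rows or loops back on itself.
    """
    seen = set()
    current = item_id
    while current in by_id and current not in seen:
        seen.add(current)
        parent = by_id[current].get("parent")
        if parent is None:
            return current
        current = parent
    return None


# Hypothetical rows, shaped like the table above (most columns omitted).
SAMPLE_ROWS: List[Row] = [
    {"id": 1, "type": "story", "parent": None, "title": "Example story"},
    {"id": 2, "type": "comment", "parent": 1},
    {"id": 3, "type": "comment", "parent": 2},
    {"id": 4, "type": "comment", "parent": 99},  # parent not in this sample
]
```

Since all columns are NULLABLE, the helpers treat a missing `id` or `parent` as "no link" instead of raising an error.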
## Considerations for Using the Data

### Social Impact of Dataset

The collective wisdom and perspectives captured in the posts and comments of this Hacker News dataset represent a unique gift from YCombinator and countless contributors worldwide; it is part of the common heritage of humanity. The potential insights to be gleaned and the future knowledge to be generated, especially through the training of language models on this corpus, can provide unbounded new perspectives, enriching our understanding and potential solutions to complex issues. It is a testament to the power of shared knowledge and open dialogue in shaping the world.

While there is a risk that some may use language models trained on this dataset for disinformation purposes, it's worth noting that the misuse of technology is a challenge that predates this dataset. The proverbial horse of misused technology has long since left the barn; our focus now must be on harnessing this shared intellectual legacy responsibly for the common good.

### Discussion of Biases

Given that Hacker News is a technology-focused platform with a largely self-selected user base, the content and perspectives found within this dataset may lean towards technology, entrepreneurship, and related fields, often reflecting the views and biases of this specific community. As such, users should be aware that analysis drawn from this data may not fully represent a balanced, global perspective and might contain inherent biases towards topics and viewpoints that are overrepresented in the Hacker News community.

## Additional Information

### Licensing Information

In the absence of an explicit license for the upstream BigQuery dataset, this dataset uses the same MIT license as the Hacker News API.

The upstream terms of use are reproduced here:

> This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - [https://github.com/HackerNews/API](https://github.com/HackerNews/API) - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

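Provided by:

jkeisling

Summary of original information

Hacker News Full Corpus (2007 - November 2022) dataset overview

Dataset description

Dataset summary

Dataset name: Hacker News Full Corpus (2007 - November 2022)
Description: This dataset is a July 2023 snapshot containing all posts and comments from the founding of Hacker News in 2007 through November 16, 2022. It has not been modified or filtered and retains the structure and fields of the original BigQuery table, which makes it suitable for large-scale data analysis or language model training.

Languages

English

Dataset structure

Data fields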

Field name	Mode	Type	Description
title	NULLABLE	STRING	Story title
url	NULLABLE	STRING	Story URL
text	NULLABLE	STRING	Story or comment text
dead	NULLABLE	BOOLEAN	Is dead?
by	NULLABLE	STRING	Username of the item's author
score	NULLABLE	INTEGER	Story score
time	NULLABLE	INTEGER	Unix time
timestamp	NULLABLE	TIMESTAMP	Timestamp for the Unix time
type	NULLABLE	STRING	Type of details (comment, comment_ranking, poll, story, job, pollopt)
id	NULLABLE	INTEGER	The item's unique ID
parent	NULLABLE	INTEGER	Parent comment ID
descendants	NULLABLE	INTEGER	Number of story or poll descendants
ranking	NULLABLE	INTEGER	Comment ranking
deleted	NULLABLE	BOOLEAN	Is deleted?
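Because the dump is unfiltered and every column is nullable, a typical first step before analysis or language model training is to drop dead, deleted, and empty rows and to turn `time` into a proper datetime. The following is a minimal standard-library sketch of that step. The filtering rule is one reasonable choice, not something the dataset prescribes, and the sample rows and function names are hypothetical.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Iterable, Iterator, List, Optional

Row = Dict[str, Any]


def to_utc(unix_seconds: Optional[int]) -> Optional[datetime]:
    """Convert the `time` column (Unix seconds) to a timezone-aware UTC datetime."""
    if unix_seconds is None:  # the column is NULLABLE
        return None
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)


def is_usable_text(row: Row) -> bool:
    """True when the row has non-blank text and is neither dead nor deleted."""
    if row.get("dead") or row.get("deleted"):
        return False
    text = row.get("text")
    return bool(text and text.strip())


def usable_comments(rows: Iterable[Row]) -> Iterator[Row]:
    """Lazily yield comment rows that pass the text filter."""
    for row in rows:
        if row.get("type") == "comment" and is_usable_text(row):
            yield row


# Hypothetical rows, shaped like the field table (most columns omitted).
SAMPLE_ROWS: List[Row] = [
    {"id": 10, "type": "comment", "text": "Nice write-up.", "dead": None, "deleted": None, "time": 1668556800},
    {"id": 11, "type": "comment", "text": None, "dead": None, "deleted": True, "time": 1668556801},
    {"id": 12, "type": "comment", "text": "spam", "dead": True, "deleted": None, "time": 1668556802},
    {"id": 13, "type": "story", "text": None, "dead": None, "deleted": None, "time": 0},
]
```

Treating a null `dead` or `deleted` value as false keeps ordinary rows, on the assumption that these flags are only populated when they are true.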

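Dataset creation

Curation rationale

This dataset provides a snapshot of the Hacker News posts and comments archive. It aims to simplify data acquisition and promote use within the HuggingFace ecosystem for tasks such as analysis, trend prediction, sentiment studies, and language model training.

Personal and sensitive information

The dataset has not undergone specific checks for personally identifiable information (PII), so some may be present. However, the data source is publicly available, and any potential PII is already part of the public domain.

Considerations for using the data

Social impact

The dataset captures the collective wisdom and perspectives of the Hacker News community and is a rich resource for purposes such as training language models, but users should be aware of possible biases and the risk of misuse.

Discussion of biases

Because Hacker News is a technology-focused platform, the content may lean towards technology, entrepreneurship, and related fields and reflect the views and biases of that specific community. Users should be aware that analysis results may not fully represent a global perspective.

Additional information

Licensing information

This dataset uses the same MIT license as the Hacker News API, because the upstream BigQuery dataset provides no explicit license.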

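Collected summary

Dataset introduction

Background and challenges

Background overview

This dataset is the complete corpus of Hacker News posts and comments from the site's founding in 2007 through November 2022. It contains about 11.1 million rows, totals 5.15 GB, and is provided in Parquet format. It is a raw, unfiltered snapshot suitable for large-scale data analysis and language model training, but note that the data may contain personally identifiable information and reflects the specific biases of a technology community.

The content above was collected and summarized by 遇见数据集.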
