hackernews-stories
HackerNews Stories Dataset
Basic information
- Dataset name: HackerNews stories dataset
- Language: English
- License: Apache License 2.0
Dataset configuration
- Config name: default
- Data files:
  - Split: train
  - Path: data/*.jsonl.zst
Dataset features
- id: integer
- url: string
- title: string
- author: string
- markdown: string
- downloaded: boolean
- meta_extracted: boolean
- parsed: boolean
- description: string
- filedate: string
- date: string
- image: string
- pagetype: string
- hostname: string
- sitename: string
- tags: string
- categories: string
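
The feature list above can be turned into a quick sanity check for individual records. The following is a minimal sketch, not part of the dataset's tooling: `SCHEMA` and `schema_problems` are hypothetical names, and allowing `None` for string fields is an assumption based on the example record, where `tags` and `categories` are `null`.

```python
from typing import Any, Dict, List

# Field names and types copied from the feature list above.
SCHEMA: Dict[str, type] = {
    "id": int, "url": str, "title": str, "author": str, "markdown": str,
    "downloaded": bool, "meta_extracted": bool, "parsed": bool,
    "description": str, "filedate": str, "date": str, "image": str,
    "pagetype": str, "hostname": str, "sitename": str,
    "tags": str, "categories": str,
}


def schema_problems(record: Dict[str, Any]) -> List[str]:
    """Return human-readable mismatches between a record and SCHEMA."""
    problems: List[str] = []
    for name, expected in SCHEMA.items():
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        # Assumption: string fields may be null, as in the example record.
        if value is None and expected is str:
            continue
        # bool is a subclass of int in Python, so reject True/False for "id".
        is_bool_as_int = expected is int and isinstance(value, bool)
        if is_bool_as_int or not isinstance(value, expected):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems
```

An empty list means the record has every listed field with a compatible type; anything else points at the field worth inspecting.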
Dataset statistics
- Date coverage: xx.2006-09.2024
- Total pages: 2,150,271
- Uncompressed size: ~20 GB
Usage
- Data format: JSONL, compressed with ZSTD
- Example (JSON):

      {
        "id": 8961943,
        "url": "https://www.eff.org/deeplinks/2015/01/internet-sen-ron-wyden-were-counting-you-oppose-fast-track-tpp",
        "title": "Digital Rights Groups to Senator Ron Wyden: Were Counting on You to Oppose Fast Track for the TPP",
        "author": "Maira Sutton",
        "markdown": "Seven leading US digital rights and access to knowledge groups, ...",
        "downloaded": true,
        "meta_extracted": true,
        "parsed": true,
        "description": "Seven leading US digital rights and access to knowledge groups, and over 7,550 users, have called on Sen. Wyden today to oppose any new version of Fast Track (aka trade promotion authority) that does not fix the secretive, corporate-dominated process of trade negotiations. In particular, we urge...",
        "filedate": "2024-10-13",
        "date": "2015-01-27",
        "image": "https://www.eff.org/files/issues/fair-use-og-1.png",
        "pagetype": "article",
        "hostname": "eff.org",
        "sitename": "Electronic Frontier Foundation",
        "categories": null,
        "tags": null
      }

- Loading (Python):

      from datasets import load_dataset

      stories = load_dataset("nixiesearch/hackernews-stories", split="train")
      print(stories[0])
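
Because the data files are plain JSONL compressed with ZSTD, a downloaded shard can also be read without the `datasets` library. The sketch below is one possible approach under stated assumptions: it relies on the third-party `zstandard` package, and `iter_jsonl` and `iter_shard` are hypothetical helper names, not part of the dataset.

```python
import io
import json
from typing import Any, BinaryIO, Dict, Iterator


def iter_jsonl(stream: BinaryIO) -> Iterator[Dict[str, Any]]:
    """Yield one dict per non-empty line of an already-decompressed JSONL stream."""
    for line in io.TextIOWrapper(stream, encoding="utf-8"):
        line = line.strip()
        if line:
            yield json.loads(line)


def iter_shard(path: str) -> Iterator[Dict[str, Any]]:
    """Stream records from one local .jsonl.zst file without loading it whole."""
    # Third-party dependency (pip install zstandard), imported lazily so that
    # iter_jsonl stays usable on its own.
    import zstandard

    with open(path, "rb") as fh:
        reader = zstandard.ZstdDecompressor().stream_reader(fh)
        yield from iter_jsonl(reader)
```

Streaming line by line keeps memory use flat, which matters given the roughly 20 GB uncompressed size. A call such as `next(iter_shard(path))`, with `path` pointing at any local file matching `data/*.jsonl.zst`, would return the first record of that shard.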




