
BrightData/Wikipedia-Articles

Hugging Face, updated 2024-06-21, indexed 2024-06-29
Download link:
https://hf-mirror.com/datasets/BrightData/Wikipedia-Articles
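Resource summary:
The dataset contains more than 1.23 million structured Wikipedia article records. Each record has 10 data fields, such as timestamp, URL, article title, raw text, cataloged text, images, "see also" references, external links, and a structured table of contents. It suits a range of NLP tasks, including text classification, text generation, text-to-text generation, summarization, and question answering. The dataset was built through data collection, parsing, cleaning, and validation to ensure high quality and completeness.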

Provider:
BrightData
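Summary of original information

Dataset card "BrightData/Wikipedia-Articles"

Dataset overview

Explore a Wikipedia articles dataset of more than 1.23M structured records spanning 10 data fields, updated and refreshed regularly. Each entry includes key data points such as timestamp, URL, article title, raw and cataloged text, images, "See also" references, external links, and a structured table of contents.

Data dictionary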

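Column                  Description                                 Data type
url                     URL of the article                          Url
title                   Title of the article                        Text
table_of_contents       Table of contents of the article            Array
raw_text                Raw text of the article                     Text
cataloged_text          Article text organized by heading           Array
> title                 Heading of the section                      Text
> sub_title             Subheading within the section               Text
> text                  Text content of the section                 Text
> links_in_text         Links within the text content               Array
>> link_name            Name or description of the link             Text
>> url                  URL of the link                             Url
images                  URLs of images in the article               Array
> image_text            Caption text under the image                Text
> image_url             URL of the image                            Url
see_also                Other recommended articles                  Array
> title                 Title of the recommended article            Text
> url                   URL of the recommended article              Url
references              References in the article                   Array
> reference             Content of the reference                    Text
>> urls                 URLs within the reference                   Array
>>> url_text            Text description of the reference URL       Text
>>> url                 URL of the cited article or source          Url
external_links          External links cited in the article         Array
> external_links_name   Name or description of the external link    Text
> link                  URL of the external link                    Text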
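
The nesting in the data dictionary can be easier to follow with a small traversal. The sketch below is illustrative only: the record is a hand-trimmed, hypothetical excerpt of the Adam Storke sample on this card, and the helper names (`as_list`, `iter_section_links`, `external_link_urls`) are assumptions, not part of any BrightData or Hugging Face API. It treats a null array field as empty, since the sample shows `see_also` as null, and it accepts both `link` (as listed in the data dictionary) and `Link` (as it appears in the sample JSON).

```python
from typing import Any, Dict, Iterator, List, Tuple

# Hypothetical record, trimmed by hand from the Adam Storke sample on this card.
record: Dict[str, Any] = {
    "url": "https://en.wikipedia.org/wiki/Adam_Storke",
    "title": "Adam Storke",
    "cataloged_text": [
        {
            "title": "Adam Storke",
            "text": "Adam J. Storke (born August 18, 1962) is an American actor "
                    "who has starred in television and film.",
            "links_in_text": [
                {"link_name": "actor",
                 "url": "https://en.wikipedia.org/wiki/Actor"},
                {"link_name": "Julia Roberts",
                 "url": "https://en.wikipedia.org/wiki/Julia_Roberts"},
            ],
        },
        {
            "title": "Biography",
            "text": "Storke was born in New York City, New York, the son of "
                    "Angela Thornton, an actress, and William Storke, a film "
                    "and television producer.",
            "links_in_text": [
                {"link_name": "New York City",
                 "url": "https://en.wikipedia.org/wiki/New_York_City"},
            ],
        },
    ],
    "images": [],
    "see_also": None,
    "references": [],
    "external_links": [
        {"Link": "https://scholar.google.com/scholar?q=%22Adam+Storke%22",
         "external_links_name": "scholar"},
    ],
}


def as_list(value: Any) -> List[Any]:
    """Return value if it is a list, otherwise an empty list (covers null)."""
    return value if isinstance(value, list) else []


def iter_section_links(rec: Dict[str, Any]) -> Iterator[Tuple[str, str, str]]:
    """Yield (section title, link name, link URL) for every in-text link."""
    for section in as_list(rec.get("cataloged_text")):
        for link in as_list(section.get("links_in_text")):
            yield (section.get("title", ""),
                   link.get("link_name", ""),
                   link.get("url", ""))


def external_link_urls(rec: Dict[str, Any]) -> List[str]:
    """Collect external link URLs, accepting either 'link' or 'Link' as key."""
    urls = []
    for item in as_list(rec.get("external_links")):
        url = item.get("link") or item.get("Link")
        if url:
            urls.append(url)
    return urls


if __name__ == "__main__":
    for section_title, name, url in iter_section_links(record):
        print(f"{section_title}: {name} -> {url}")
    print(external_link_urls(record))
```

Normalizing null to an empty list at the edges keeps the traversal free of per-field null checks. Whether real records mix `link` and `Link` is not stated on this card, so the dual lookup is a defensive assumption.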

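Dataset creation

Data collection and processing

Data collection extracts information directly from Wikipedia, ensuring comprehensive coverage of the required attributes. After collection, the data passes through several processing stages:

  • Parsing: the extracted raw data is parsed into a structured format.
  • Cleaning: irrelevant or erroneous entries are removed to improve data quality.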

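Validation

A validation process is in place to ensure data integrity. Each entry is checked across multiple attributes, including:

  • Uniqueness: each record is checked to ensure it is unique, eliminating duplicates.
  • Completeness: the dataset is checked to confirm that all required fields are populated, and missing data is handled.
  • Consistency: cross-validation checks ensure consistency across multiple attributes, including comparison with historical records.
  • Data type verification: all data types are confirmed to be correctly assigned and consistent with the expected formats.
  • Fill rates and duplicate checks: comprehensive checks verify fill rates, ensure there are no significant data gaps, and rigorously screen for duplicates.

This ensures that the dataset meets the high quality standards required for analysis, research, and modeling.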
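
Some of the checks listed above can be approximated in a few lines. This is a minimal sketch under stated assumptions, not BrightData's actual validation pipeline: the choice of `url` as the uniqueness key, the `REQUIRED_FIELDS` and `EXPECTED_TYPES` tables, and all function names are hypothetical, and the toy records (including the `see_also` entry on the third one) are made up for illustration.

```python
from collections import Counter
from typing import Any, Dict, List, Sequence

# Assumption: these fields must be populated in every record.
REQUIRED_FIELDS = ("url", "title", "raw_text")

# Assumption: Python types matching the card's Text/Url/Array data types.
EXPECTED_TYPES = {"url": str, "title": str, "raw_text": str,
                  "cataloged_text": list, "see_also": list}


def is_filled(value: Any) -> bool:
    """A field counts as filled when it is not null and not empty."""
    return value not in (None, "", [], {})


def find_duplicate_urls(records: Sequence[Dict[str, Any]]) -> List[str]:
    """Uniqueness: return URLs that occur in more than one record."""
    counts = Counter(r.get("url") for r in records if r.get("url"))
    return sorted(url for url, n in counts.items() if n > 1)


def missing_required(record: Dict[str, Any],
                     required: Sequence[str] = REQUIRED_FIELDS) -> List[str]:
    """Completeness: list required fields that are absent or empty."""
    return [f for f in required if not is_filled(record.get(f))]


def type_errors(record: Dict[str, Any]) -> List[str]:
    """Data type check: list non-null fields whose type is unexpected."""
    return [f for f, t in EXPECTED_TYPES.items()
            if record.get(f) is not None and not isinstance(record[f], t)]


def fill_rates(records: Sequence[Dict[str, Any]],
               fields: Sequence[str]) -> Dict[str, float]:
    """Fill rate: fraction of records in which each field is filled."""
    total = len(records)
    if total == 0:
        return {f: 0.0 for f in fields}
    return {f: sum(is_filled(r.get(f)) for r in records) / total
            for f in fields}


# Toy records for illustration only; the third record's content is made up.
records = [
    {"url": "https://en.wikipedia.org/wiki/Adam_Storke",
     "title": "Adam Storke", "raw_text": "American actor", "see_also": None},
    {"url": "https://en.wikipedia.org/wiki/Adam_Storke",
     "title": "Adam Storke", "raw_text": "American actor", "see_also": None},
    {"url": "https://en.wikipedia.org/wiki/Mystic_Pizza",
     "title": "Mystic Pizza", "raw_text": "",
     "see_also": [{"title": "Julia Roberts",
                   "url": "https://en.wikipedia.org/wiki/Julia_Roberts"}]},
]

if __name__ == "__main__":
    print(find_duplicate_urls(records))
    print([missing_required(r) for r in records])
    print(fill_rates(records, ["title", "raw_text", "see_also"]))
```

The consistency check against historical records is left out because the card does not describe what is compared. Null values are allowed by `type_errors` because the sample on this card shows `see_also` as null.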

Sample JSON

json [ { "timestamp": "2024-02-19", "url": "https://en.wikipedia.org/wiki/Adam_Storke", "title": "Adam Storke", "raw_text": "American actor This biography of a living person needs additional citations for verification. Please help by adding reliable sources. Contentious material about living persons that is unsourced or poorly sourced must be removed immediately from the article and its talk page, especially if potentially libelous.Find sources: "Adam Storke" – news · newspapers · books · scholar · JSTOR (March 2013) (Learn how and when to remove this template message) Adam StorkeBornAdam J. Storke (1962-08-18) August 18, 1962 (age 61)New York, New York, U.S.OccupationActor Adam J. Storke (born August 18, 1962) is an American actor who has starred in television and film. He is best known for playing Julia Robertss love interest in the 1988 film Mystic Pizza and as Larry Underwood in the 1994 Stephen King mini series The Stand.

Biography Storke was born in New York City, New York, the son of Angela Thornton, an actress, and William Storke, a film and television producer. His well-known television role is in the soap opera Search for Tomorrow as Andrew Ryder in 1985 and in the short lived TV series in 1998 Prey. Adam has appeared in some TV movies and has made guest appearances on several television series, including Miami Vice, L.A. Law, American Dreams, Law & Order: Criminal Intent, Tales from the Crypt and 2005s Over There. His theatre credits include The Rimers of Eldritch.

Filmography Broadways Finest (2012) (film) New Amsterdam (2008) (TV) Over There (2005) (TV) Our Generation (2003) (TV) Crossing Jordan (2003) (TV) Johnson County War (2002) (mini) Roughing It (2002) (TV) Prey (1998) (TV) Rough Riders (1997) Escape from Terror: The Teresa Stamper Story (1995) (as Paul Stamper) Tales From The Crypt (1994) Attack of the 5 Ft. 2 In. Women (1994) The Stand (1994) (mini) Death Becomes Her (1992) Highway to Hell (1992) The Phantom of the Opera (1990) Mystic Pizza (1988) A Gathering of Old Men (1987) Ill Take Manhattan (1987) References

External links Adam Storke at IMDb Authority control databases International ISNI VIAF National Israel United States", "cataloged_text": [ { "links_in_text": [ { "link_name": "actor", "url": "https://en.wikipedia.org/wiki/Actor" }, { "link_name": "Julia Roberts", "url": "https://en.wikipedia.org/wiki/Julia_Roberts" }, { "link_name": "Mystic Pizza", "url": "https://en.wikipedia.org/wiki/Mystic_Pizza" }, { "link_name": "Stephen King", "url": "https://en.wikipedia.org/wiki/Stephen_King" }, { "link_name": "mini series", "url": "https://en.wikipedia.org/wiki/Mini_series" }, { "link_name": "The Stand", "url": "https://en.wikipedia.org/wiki/The_Stand_(1994_miniseries)" } ], "text": "Adam J. Storke (born August 18, 1962) is an American actor who has starred in television and film. He is best known for playing Julia Robertss love interest in the 1988 film Mystic Pizza and as Larry Underwood in the 1994 Stephen King mini series The Stand.", "title": "Adam Storke" }, { "links_in_text": [ { "link_name": "New York City", "url": "https://en.wikipedia.org/wiki/New_York_City" }, { "link_name": "New York", "url": "https://en.wikipedia.org/wiki/New_York_(state)" }, { "link_name": "citation needed", "url": "https://en.wikipedia.org/wiki/Wikipedia:Citation_needed" }, { "link_name": "soap opera", "url": "https://en.wikipedia.org/wiki/Soap_opera" }, { "link_name": "Search for Tomorrow", "url": "https://en.wikipedia.org/wiki/Search_for_Tomorrow" }, { "link_name": "Prey", "url": "https://en.wikipedia.org/wiki/Prey_(U.S.TV_series)" }, { "link_name": "Miami Vice", "url": "https://en.wikipedia.org/wiki/Miami_Vice" }, { "link_name": "L.A. 
Law", "url": "https://en.wikipedia.org/wiki/L.A.Law" }, { "link_name": "American Dreams", "url": "https://en.wikipedia.org/wiki/American_Dreams" }, { "link_name": "Law & Order: Criminal Intent", "url": "https://en.wikipedia.org/wiki/Law%26_Order:Criminal_Intent" }, { "link_name": "Tales from the Crypt", "url": "https://en.wikipedia.org/wiki/Tales_from_the_Crypt(TV_series)" }, { "link_name": "Over There", "url": "https://en.wikipedia.org/wiki/Over_There(American_TV_series)" }, { "link_name": "The Rimers of Eldritch", "url": "https://en.wikipedia.org/wiki/The_Rimers_of_Eldritch" } ], "text": "Storke was born in New York City, New York, the son of Angela Thornton, an actress, and William Storke, a film and television producer.[citation needed] His well-known television role is in the soap opera Search for Tomorrow as Andrew Ryder in 1985 and in the short lived TV series in 1998 Prey. Adam has appeared in some TV movies and has made guest appearances on several television series, including Miami Vice, L.A. Law, American Dreams, Law & Order: Criminal Intent, Tales from the Crypt and 2005s Over There. 
His theatre credits include The Rimers of Eldritch.", "title": "Biography" }, { "links_in_text": [ { "link_name": "Broadways Finest", "url": "https://en.wikipedia.org/wiki/Broadway%27s_Finest" }, { "link_name": "New Amsterdam", "url": "https://en.wikipedia.org/wiki/New_Amsterdam_(2008_TV_series)" }, { "link_name": "Over There", "url": "https://en.wikipedia.org/wiki/Over_There_(American_TV_series)" }, { "link_name": "Crossing Jordan", "url": "https://en.wikipedia.org/wiki/Crossing_Jordan" }, { "link_name": "Johnson County War", "url": "https://en.wikipedia.org/wiki/Johnson_County_War" }, { "link_name": "Roughing It", "url": "https://en.wikipedia.org/wiki/Roughing_It" }, { "link_name": "Prey", "url": "https://en.wikipedia.org/wiki/Prey_(American_TV_series)" }, { "link_name": "Rough Riders", "url": "https://en.wikipedia.org/wiki/Rough_Riders" }, { "link_name": "Escape from Terror: The Teresa Stamper Story", "url": "https://en.wikipedia.org/wiki/Escape_from_Terror:The_Teresa_Stamper_Story" }, { "link_name": "Tales From The Crypt", "url": "https://en.wikipedia.org/wiki/Tales_from_the_Crypt(TV_series)" }, { "link_name": "Attack of the 5 Ft. 2 In. 
Women", "url": "https://en.wikipedia.org/wiki/Attack_of_the_5_Ft.2_In.Women" }, { "link_name": "The Stand", "url": "https://en.wikipedia.org/wiki/The_Stand(1994_miniseries)" }, { "link_name": "Death Becomes Her", "url": "https://en.wikipedia.org/wiki/Death_Becomes_Her" }, { "link_name": "Highway to Hell", "url": "https://en.wikipedia.org/wiki/Highway_to_Hell(film)" }, { "link_name": "The Phantom of the Opera", "url": "https://en.wikipedia.org/wiki/The_Phantom_of_the_Opera_(miniseries)" }, { "link_name": "Mystic Pizza", "url": "https://en.wikipedia.org/wiki/Mystic_Pizza" }, { "link_name": "A Gathering of Old Men", "url": "https://en.wikipedia.org/wiki/A_Gathering_of_Old_Men" }, { "link_name": "Ill Take Manhattan", "url": "https://en.wikipedia.org/wiki/I%27ll_Take_Manhattan" } ], "text": "Broadways Finest (2012) (film) New Amsterdam (2008) (TV) Over There (2005) (TV) Our Generation (2003) (TV) Crossing Jordan (2003) (TV) Johnson County War (2002) (mini) Roughing It (2002) (TV) Prey (1998) (TV) Rough Riders (1997) Escape from Terror: The Teresa Stamper Story (1995) (as Paul Stamper) Tales From The Crypt (1994) Attack of the 5 Ft. 2 In. 
Women (1994) The Stand (1994) (mini) Death Becomes Her (1992) Highway to Hell (1992) The Phantom of the Opera (1990) Mystic Pizza (1988) A Gathering of Old Men (1987) Ill Take Manhattan (1987)", "title": "Filmography" } ], "images": [], "see_also": null, "references": [], "external_links": [ { "Link": "https://www.google.com/search?as_eq=wikipedia&q=%22Adam+Storke%22", "external_links_name": ""Adam Storke"" }, { "Link": "https://www.google.com/search?tbm=nws&q=%22Adam+Storke%22+-wikipedia&tbs=ar:1", "external_links_name": "news" }, { "Link": "https://www.google.com/search?&q=%22Adam+Storke%22&tbs=bkt:s&tbm=bks", "external_links_name": "newspapers" }, { "Link": "https://www.google.com/search?tbs=bks:1&q=%22Adam+Storke%22+-wikipedia", "external_links_name": "books" }, { "Link": "https://scholar.google.com/scholar?q=%22Adam+Storke%22", "external_links_name": "scholar" }, { "Link": "https://www.jstor.org/action/doBasicSearch?Query=%22
