BrightData/Wikipedia-Articles
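Dataset card for "BrightData/Wikipedia-Articles"
Dataset overview
Explore a Wikipedia articles dataset of more than 1.23M structured records across 10 data fields, updated and refreshed on a regular basis. Each entry includes the main data points: a timestamp, URL, article title, raw and cataloged text, images, "See also" references, external links, and a structured table of contents.
Data dictionary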
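| Column name | Description | Data type |
|---|---|---|
| url | URL of the article | Url |
| title | Title of the article | Text |
| table_of_contents | Table of contents of the article | Array |
| raw_text | Raw text of the article | Text |
| cataloged_text | Article text cataloged by heading | Array |
| > title | Title of the cataloged section | Text |
| > sub_title | Subtitle within the cataloged section | Text |
| > text | Text content within the cataloged section | Text |
| > links_in_text | Links within the text content | Array |
| >> link_name | Name or description of the link | Text |
| >> url | URL of the link | Url |
| images | URL links to the images in the article | Array |
| > image_text | Text description under the image | Text |
| > image_url | URL of the image | Url |
| see_also | Other recommended articles | Array |
| > title | Title of the recommended article | Text |
| > url | URL of the recommended article | Url |
| references | References in the article | Array |
| > reference | Content of the reference | Text |
| >> urls | URLs within the reference content | Array |
| >>> url_text | Text description of the reference URL | Text |
| >>> url | URL of the referenced article or source | Url |
| external_links | External links cited in the article | Array |
| > external_links_name | Name or description of the external link | Text |
| > link | URL of the external link | Text |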
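To make the nesting in the data dictionary concrete, the following minimal sketch walks one record and collects every in-text link together with the section it belongs to. The helper name `iter_text_links` and the shortened record are hypothetical; only the field names (`cataloged_text`, `title`, `links_in_text`, `link_name`, `url`) come from the table above. It assumes records are already loaded as Python dictionaries and that array fields may be null.

```python
from typing import Any, Iterator


def iter_text_links(record: dict[str, Any]) -> Iterator[tuple[str, str, str]]:
    """Yield (section_title, link_name, url) for every link in cataloged_text."""
    # Array fields can be null or missing, so fall back to an empty list.
    for section in record.get("cataloged_text") or []:
        section_title = section.get("title", "")
        for link in section.get("links_in_text") or []:
            yield section_title, link.get("link_name", ""), link.get("url", "")


# Hypothetical, heavily shortened record using only fields from the data dictionary.
record = {
    "url": "https://en.wikipedia.org/wiki/Adam_Storke",
    "title": "Adam Storke",
    "cataloged_text": [
        {
            "title": "Biography",
            "text": "Storke was born in New York City, New York.",
            "links_in_text": [
                {
                    "link_name": "New York City",
                    "url": "https://en.wikipedia.org/wiki/New_York_City",
                },
            ],
        }
    ],
    "see_also": None,
}

links = list(iter_text_links(record))
for section_title, link_name, url in links:
    print(f"{section_title}: {link_name} -> {url}")
```

The same pattern should carry over to the other nested arrays (`images`, `see_also`, `references`, `external_links`) by swapping the field names.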
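Dataset creation
Data collection and processing
The data is extracted directly from Wikipedia, with the aim of covering all required attributes. After collection, the data goes through several processing stages:
- Parsing: the extracted raw data is parsed into a structured format.
- Cleaning: irrelevant or erroneous entries are removed to improve data quality.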
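The two stages above can be sketched as follows. This is an illustration only, not the pipeline actually used to build the dataset: it assumes the raw input is one JSON object per line, and it treats a record as erroneous when it cannot be parsed or lacks a `url` or `title`. The function names `parse_records` and `clean_records` are hypothetical.

```python
import json
from typing import Any, Iterable


def parse_records(lines: Iterable[str]) -> list[dict[str, Any]]:
    """Parse raw JSON lines into dictionaries, skipping unparseable lines."""
    parsed = []
    for line in lines:
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # erroneous entry: not valid JSON
        if isinstance(obj, dict):
            parsed.append(obj)
    return parsed


def clean_records(records: Iterable[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep records with a non-empty url and title (an assumed minimum)."""
    cleaned = []
    for rec in records:
        url = str(rec.get("url") or "").strip()
        title = str(rec.get("title") or "").strip()
        if url and title:
            # Store the whitespace-trimmed values back on a copy of the record.
            cleaned.append({**rec, "url": url, "title": title})
    return cleaned


raw_lines = [
    '{"url": "https://en.wikipedia.org/wiki/Adam_Storke", "title": " Adam Storke "}',
    '{"url": "", "title": "No URL"}',
    "not json",
]
cleaned = clean_records(parse_records(raw_lines))
print(cleaned)
```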
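Validation
A validation process is in place to protect data integrity. Each entry is checked across several attributes, including:
- Uniqueness: each record is checked to confirm it is unique, and duplicates are eliminated.
- Completeness: the dataset is checked to confirm that all required fields are populated, and missing data is handled.
- Consistency: cross-validation checks confirm consistency across attributes, including comparison against historical records.
- Data type validation: all data types are checked to be correctly assigned and consistent with the expected formats.
- Fill rates and duplicate checks: comprehensive checks verify fill rates, confirm there are no significant data gaps, and rigorously screen for duplicates.
This keeps the dataset at the quality level required for analysis, research, and modeling.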
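A few of the checks listed above can be approximated on a local copy of the data. The sketch below is an assumption-laden illustration, not the provider's validation code: it treats `url` as the uniqueness key, and the `REQUIRED_FIELDS` and `EXPECTED_TYPES` constants are hypothetical choices based on the data dictionary.

```python
from collections import Counter
from typing import Any

# Assumed expectations for illustration; the card does not list required fields.
REQUIRED_FIELDS = ("url", "title", "raw_text")
EXPECTED_TYPES = {
    "url": str,
    "title": str,
    "raw_text": str,
    "cataloged_text": list,
    "images": list,
}


def find_duplicate_urls(records: list[dict[str, Any]]) -> list[str]:
    """Uniqueness: return URLs that appear in more than one record."""
    urls = [rec.get("url") for rec in records if isinstance(rec.get("url"), str)]
    return sorted(url for url, n in Counter(urls).items() if n > 1)


def fill_rates(records: list[dict[str, Any]], fields: tuple[str, ...]) -> dict[str, float]:
    """Fill rate: share of records where a field is neither null nor empty."""
    total = len(records)
    rates = {}
    for field in fields:
        filled = sum(1 for rec in records if rec.get(field) not in (None, "", []))
        rates[field] = filled / total if total else 0.0
    return rates


def type_errors(record: dict[str, Any]) -> list[str]:
    """Data type validation: fields whose non-null value has an unexpected type."""
    return [
        field
        for field, expected in EXPECTED_TYPES.items()
        if record.get(field) is not None and not isinstance(record[field], expected)
    ]


# Hypothetical toy records for demonstration.
records = [
    {"url": "a", "title": "A", "raw_text": "text", "images": []},
    {"url": "a", "title": "A again", "raw_text": "", "images": "oops"},
    {"url": "b", "title": "B", "raw_text": "text", "images": None},
]
duplicates = find_duplicate_urls(records)
rates = fill_rates(records, REQUIRED_FIELDS)
bad_fields = type_errors(records[1])
```

Null values are skipped in the type check because the example record in this card contains null arrays (for instance `see_also`), which suggests that null is an accepted value.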
Example JSON
```json
[
  {
    "timestamp": "2024-02-19",
    "url": "https://en.wikipedia.org/wiki/Adam_Storke",
    "title": "Adam Storke",
    "raw_text": "American actor This biography of a living person needs additional citations for verification. Please help by adding reliable sources. Contentious material about living persons that is unsourced or poorly sourced must be removed immediately from the article and its talk page, especially if potentially libelous.Find sources: \"Adam Storke\" – news · newspapers · books · scholar · JSTOR (March 2013) (Learn how and when to remove this template message) Adam StorkeBornAdam J. Storke (1962-08-18) August 18, 1962 (age 61)New York, New York, U.S.OccupationActor Adam J. Storke (born August 18, 1962) is an American actor who has starred in television and film. He is best known for playing Julia Roberts's love interest in the 1988 film Mystic Pizza and as Larry Underwood in the 1994 Stephen King mini series The Stand.\nBiography Storke was born in New York City, New York, the son of Angela Thornton, an actress, and William Storke, a film and television producer. His well-known television role is in the soap opera Search for Tomorrow as Andrew Ryder in 1985 and in the short lived TV series in 1998 Prey. Adam has appeared in some TV movies and has made guest appearances on several television series, including Miami Vice, L.A. Law, American Dreams, Law & Order: Criminal Intent, Tales from the Crypt and 2005's Over There. His theatre credits include The Rimers of Eldritch.\nFilmography Broadway's Finest (2012) (film) New Amsterdam (2008) (TV) Over There (2005) (TV) Our Generation (2003) (TV) Crossing Jordan (2003) (TV) Johnson County War (2002) (mini) Roughing It (2002) (TV) Prey (1998) (TV) Rough Riders (1997) Escape from Terror: The Teresa Stamper Story (1995) (as Paul Stamper) Tales From The Crypt (1994) Attack of the 5 Ft. 2 In. Women (1994) The Stand (1994) (mini) Death Becomes Her (1992) Highway to Hell (1992) The Phantom of the Opera (1990) Mystic Pizza (1988) A Gathering of Old Men (1987) I'll Take Manhattan (1987) References\nExternal links Adam Storke at IMDb Authority control databases International ISNI VIAF National Israel United States",
    "cataloged_text": [
      {
        "links_in_text": [
          { "link_name": "actor", "url": "https://en.wikipedia.org/wiki/Actor" },
          { "link_name": "Julia Roberts", "url": "https://en.wikipedia.org/wiki/Julia_Roberts" },
          { "link_name": "Mystic Pizza", "url": "https://en.wikipedia.org/wiki/Mystic_Pizza" },
          { "link_name": "Stephen King", "url": "https://en.wikipedia.org/wiki/Stephen_King" },
          { "link_name": "mini series", "url": "https://en.wikipedia.org/wiki/Mini_series" },
          { "link_name": "The Stand", "url": "https://en.wikipedia.org/wiki/The_Stand_(1994_miniseries)" }
        ],
        "text": "Adam J. Storke (born August 18, 1962) is an American actor who has starred in television and film. He is best known for playing Julia Roberts's love interest in the 1988 film Mystic Pizza and as Larry Underwood in the 1994 Stephen King mini series The Stand.",
        "title": "Adam Storke"
      },
      {
        "links_in_text": [
          { "link_name": "New York City", "url": "https://en.wikipedia.org/wiki/New_York_City" },
          { "link_name": "New York", "url": "https://en.wikipedia.org/wiki/New_York_(state)" },
          { "link_name": "citation needed", "url": "https://en.wikipedia.org/wiki/Wikipedia:Citation_needed" },
          { "link_name": "soap opera", "url": "https://en.wikipedia.org/wiki/Soap_opera" },
          { "link_name": "Search for Tomorrow", "url": "https://en.wikipedia.org/wiki/Search_for_Tomorrow" },
          { "link_name": "Prey", "url": "https://en.wikipedia.org/wiki/Prey_(U.S._TV_series)" },
          { "link_name": "Miami Vice", "url": "https://en.wikipedia.org/wiki/Miami_Vice" },
          { "link_name": "L.A. Law", "url": "https://en.wikipedia.org/wiki/L.A._Law" },
          { "link_name": "American Dreams", "url": "https://en.wikipedia.org/wiki/American_Dreams" },
          { "link_name": "Law & Order: Criminal Intent", "url": "https://en.wikipedia.org/wiki/Law_%26_Order:_Criminal_Intent" },
          { "link_name": "Tales from the Crypt", "url": "https://en.wikipedia.org/wiki/Tales_from_the_Crypt_(TV_series)" },
          { "link_name": "Over There", "url": "https://en.wikipedia.org/wiki/Over_There_(American_TV_series)" },
          { "link_name": "The Rimers of Eldritch", "url": "https://en.wikipedia.org/wiki/The_Rimers_of_Eldritch" }
        ],
        "text": "Storke was born in New York City, New York, the son of Angela Thornton, an actress, and William Storke, a film and television producer.[citation needed] His well-known television role is in the soap opera Search for Tomorrow as Andrew Ryder in 1985 and in the short lived TV series in 1998 Prey. Adam has appeared in some TV movies and has made guest appearances on several television series, including Miami Vice, L.A. Law, American Dreams, Law & Order: Criminal Intent, Tales from the Crypt and 2005's Over There. His theatre credits include The Rimers of Eldritch.",
        "title": "Biography"
      },
      {
        "links_in_text": [
          { "link_name": "Broadway's Finest", "url": "https://en.wikipedia.org/wiki/Broadway%27s_Finest" },
          { "link_name": "New Amsterdam", "url": "https://en.wikipedia.org/wiki/New_Amsterdam_(2008_TV_series)" },
          { "link_name": "Over There", "url": "https://en.wikipedia.org/wiki/Over_There_(American_TV_series)" },
          { "link_name": "Crossing Jordan", "url": "https://en.wikipedia.org/wiki/Crossing_Jordan" },
          { "link_name": "Johnson County War", "url": "https://en.wikipedia.org/wiki/Johnson_County_War" },
          { "link_name": "Roughing It", "url": "https://en.wikipedia.org/wiki/Roughing_It" },
          { "link_name": "Prey", "url": "https://en.wikipedia.org/wiki/Prey_(American_TV_series)" },
          { "link_name": "Rough Riders", "url": "https://en.wikipedia.org/wiki/Rough_Riders" },
          { "link_name": "Escape from Terror: The Teresa Stamper Story", "url": "https://en.wikipedia.org/wiki/Escape_from_Terror:_The_Teresa_Stamper_Story" },
          { "link_name": "Tales From The Crypt", "url": "https://en.wikipedia.org/wiki/Tales_from_the_Crypt_(TV_series)" },
          { "link_name": "Attack of the 5 Ft. 2 In. Women", "url": "https://en.wikipedia.org/wiki/Attack_of_the_5_Ft._2_In._Women" },
          { "link_name": "The Stand", "url": "https://en.wikipedia.org/wiki/The_Stand_(1994_miniseries)" },
          { "link_name": "Death Becomes Her", "url": "https://en.wikipedia.org/wiki/Death_Becomes_Her" },
          { "link_name": "Highway to Hell", "url": "https://en.wikipedia.org/wiki/Highway_to_Hell_(film)" },
          { "link_name": "The Phantom of the Opera", "url": "https://en.wikipedia.org/wiki/The_Phantom_of_the_Opera_(miniseries)" },
          { "link_name": "Mystic Pizza", "url": "https://en.wikipedia.org/wiki/Mystic_Pizza" },
          { "link_name": "A Gathering of Old Men", "url": "https://en.wikipedia.org/wiki/A_Gathering_of_Old_Men" },
          { "link_name": "I'll Take Manhattan", "url": "https://en.wikipedia.org/wiki/I%27ll_Take_Manhattan" }
        ],
        "text": "Broadway's Finest (2012) (film) New Amsterdam (2008) (TV) Over There (2005) (TV) Our Generation (2003) (TV) Crossing Jordan (2003) (TV) Johnson County War (2002) (mini) Roughing It (2002) (TV) Prey (1998) (TV) Rough Riders (1997) Escape from Terror: The Teresa Stamper Story (1995) (as Paul Stamper) Tales From The Crypt (1994) Attack of the 5 Ft. 2 In. Women (1994) The Stand (1994) (mini) Death Becomes Her (1992) Highway to Hell (1992) The Phantom of the Opera (1990) Mystic Pizza (1988) A Gathering of Old Men (1987) I'll Take Manhattan (1987)",
        "title": "Filmography"
      }
    ],
    "images": [],
    "see_also": null,
    "references": [],
    "external_links": [
      { "Link": "https://www.google.com/search?as_eq=wikipedia&q=%22Adam+Storke%22", "external_links_name": "\"Adam Storke\"" },
      { "Link": "https://www.google.com/search?tbm=nws&q=%22Adam+Storke%22+-wikipedia&tbs=ar:1", "external_links_name": "news" },
      { "Link": "https://www.google.com/search?&q=%22Adam+Storke%22&tbs=bkt:s&tbm=bks", "external_links_name": "newspapers" },
      { "Link": "https://www.google.com/search?tbs=bks:1&q=%22Adam+Storke%22+-wikipedia", "external_links_name": "books" },
      { "Link": "https://scholar.google.com/scholar?q=%22Adam+Storke%22", "external_links_name": "scholar" },
      { "Link": "https://www.jstor.org/action/doBasicSearch?Query=%22
```



