
jordiclive/wikipedia-summary-dataset

Source: Hugging Face. Updated 2023-02-05; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/jordiclive/wikipedia-summary-dataset
Resource description:
## Dataset Description

- **Repository:** https://github.com/tscheepers/Wikipedia-Summary-Dataset

### Dataset Summary

This is a dataset that can be used for research into machine learning and natural language processing. It contains all titles and summaries (or introductions) of English Wikipedia articles, extracted in September of 2017.

The dataset is different from the regular Wikipedia dump and different from the datasets that can be created by gensim because ours contains the extracted summaries and not the entire unprocessed page body. This could be useful if one wants to use the smaller, more concise, and more definitional summaries in their research, or if one just wants a smaller but still diverse dataset for efficient training under resource constraints.

A summary or introduction of an article is everything starting from the page title up to the content outline.

### Citation Information

```
@mastersthesis{scheepers2017compositionality,
  author = {Scheepers, Thijs},
  title = {Improving the Compositionality of Word Embeddings},
  school = {Universiteit van Amsterdam},
  year = {2017},
  month = {11},
  address = {Science Park 904, Amsterdam, Netherlands}
}
```
Provided by:
jordiclive
Summary of original information

Dataset overview

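  • Dataset name: Wikipedia-Summary-Dataset
  • Intended use: research in machine learning and natural language processing.
  • Contents: all titles and summaries (or introductions) of English Wikipedia articles, extracted in September 2017.
  • Characteristics:
    • Unlike the regular Wikipedia dump, this dataset contains only the extracted summaries, not the entire unprocessed page body.
    • Suited to research that needs shorter, more definitional summaries, or to efficient training under resource constraints where a smaller but still diverse dataset is wanted.
  • Summary definition: the summary or introduction of an article is everything from the page title up to the content outline.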
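
The summary definition above can be illustrated with a small sketch. It is not the extraction code used to build the dataset; it assumes the input is raw wikitext and that the first level-2 heading (for example `== History ==`) marks where the content outline begins. The names `extract_summary` and `SECTION_HEADING` are hypothetical.

```python
import re

# Assumption: a level-2 wikitext heading such as "== History ==" marks the
# start of the article body, i.e. the point where the content outline begins.
SECTION_HEADING = re.compile(r"^==[^=].*?==\s*$", re.MULTILINE)


def extract_summary(title: str, wikitext: str) -> dict:
    """Return the title plus the text that precedes the first section heading."""
    match = SECTION_HEADING.search(wikitext)
    # If the page has no section heading, the whole text is treated as the summary.
    intro = wikitext[: match.start()] if match else wikitext
    return {"title": title, "summary": intro.strip()}


if __name__ == "__main__":
    page = (
        "Amsterdam is the capital of the Netherlands.\n"
        "\n"
        "== History ==\n"
        "The city grew around a dam in the river Amstel.\n"
    )
    print(extract_summary("Amsterdam", page))
```

Real pages also carry templates, infoboxes, and references in the introduction, so a production extractor would need more cleanup than this sketch performs.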

Citation information

@mastersthesis{scheepers2017compositionality,
  author = {Scheepers, Thijs},
  title = {Improving the Compositionality of Word Embeddings},
  school = {Universiteit van Amsterdam},
  year = {2017},
  month = {11},
  address = {Science Park 904, Amsterdam, Netherlands}
}
