michaelauli/wiki_bio
Hugging Face | Updated 2024-01-18 | Indexed 2024-06-15
Download link:
https://hf-mirror.com/datasets/michaelauli/wiki_bio
Resource overview:
---
annotations_creators:
- found
language_creators:
- found
language:
- en
license:
- cc-by-sa-3.0
multilinguality:
- monolingual
size_categories:
- 100K<n<1M
source_datasets:
- original
task_categories:
- table-to-text
task_ids: []
paperswithcode_id: wikibio
pretty_name: WikiBio
dataset_info:
  features:
  - name: input_text
    struct:
    - name: table
      sequence:
      - name: column_header
        dtype: string
      - name: row_number
        dtype: int16
      - name: content
        dtype: string
    - name: context
      dtype: string
  - name: target_text
    dtype: string
  splits:
  - name: train
    num_bytes: 619269257
    num_examples: 582659
  - name: test
    num_bytes: 77264695
    num_examples: 72831
  - name: val
    num_bytes: 77335069
    num_examples: 72831
  download_size: 333998704
  dataset_size: 773869021
---
# Dataset Card for WikiBio
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Repository:** https://github.com/DavidGrangier/wikipedia-biography-dataset
- **Paper:** https://arxiv.org/pdf/1603.07771.pdf
- **GitHub:** https://github.com/DavidGrangier/wikipedia-biography-dataset
### Dataset Summary
This dataset contains 728,321 biographies extracted from Wikipedia. Each example pairs the first paragraph of a biography with the article's tabular infobox.
### Supported Tasks and Leaderboards
The dataset is primarily intended for developing text generation models, in particular table-to-text generation (the task category listed in the card metadata).
### Languages
English.
## Dataset Structure
### Data Instances
[More Information Needed]
### Data Fields
The structure of a single sample is the following:
```json
{
  "input_text": {
    "context": "pope michael iii of alexandria\n",
    "table": {
      "column_header": [
        "type",
        "ended",
        "death_date",
        "title",
        "enthroned",
        "name",
        "buried",
        "religion",
        "predecessor",
        "nationality",
        "article_title",
        "feast_day",
        "birth_place",
        "residence",
        "successor"
      ],
      "content": [
        "pope",
        "16 march 907",
        "16 march 907",
        "56th of st. mark pope of alexandria & patriarch of the see",
        "25 april 880",
        "michael iii of alexandria",
        "monastery of saint macarius the great",
        "coptic orthodox christian",
        "shenouda i",
        "egyptian",
        "pope michael iii of alexandria\n",
        "16 -rrb- march -lrb- 20 baramhat in the coptic calendar",
        "egypt",
        "saint mark 's church",
        "gabriel i"
      ],
      "row_number": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    }
  },
  "target_text": "pope michael iii of alexandria -lrb- also known as khail iii -rrb- was the coptic pope of alexandria and patriarch of the see of st. mark -lrb- 880 -- 907 -rrb- .\nin 882 , the governor of egypt , ahmad ibn tulun , forced khail to pay heavy contributions , forcing him to sell a church and some attached properties to the local jewish community .\nthis building was at one time believed to have later become the site of the cairo geniza .\n"
}
```
The `"table"` field stores the contents of the Wikipedia infobox: the infobox field names are in `"column_header"` and the corresponding values are in `"content"`, at matching positions.
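Because the field names and values are stored as two parallel lists, a common first step is to pair them up. The following is a minimal sketch, not part of the original dataset tooling: the helper name `infobox_to_dict` is hypothetical, the record is an abbreviated copy of the sample above, and the code assumes each field name occurs only once per record (as it does in that sample).

```python
from typing import Any, Dict


def infobox_to_dict(sample: Dict[str, Any]) -> Dict[str, str]:
    """Map each infobox field name to its value, preserving field order.

    Hypothetical helper: assumes field names are unique within a record.
    """
    table = sample["input_text"]["table"]
    headers = table["column_header"]
    values = table["content"]
    if len(headers) != len(values):
        raise ValueError("column_header and content must have the same length")
    # Some values carry a trailing newline (e.g. "article_title"), so strip whitespace.
    return {header: value.strip() for header, value in zip(headers, values)}


# Abbreviated copy of the sample record shown above.
sample = {
    "input_text": {
        "context": "pope michael iii of alexandria\n",
        "table": {
            "column_header": ["type", "enthroned", "article_title", "successor"],
            "content": [
                "pope",
                "25 april 880",
                "pope michael iii of alexandria\n",
                "gabriel i",
            ],
            "row_number": [1, 1, 1, 1],
        },
    },
    "target_text": "this building was at one time believed to have later become the site of the cairo geniza .\n",
}

infobox = infobox_to_dict(sample)
print(infobox["enthroned"])      # 25 april 880
print(infobox["article_title"])  # pope michael iii of alexandria
```

A dictionary is convenient for lookups by field name; if a record could repeat a field name, a list of `(header, value)` pairs would be the safer representation.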
### Data Splits
- Train: 582659 samples.
- Test: 72831 samples.
- Validation: 72831 samples.
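
As a quick consistency check, the three split sizes can be summed and compared with the 728,321 biographies stated in the summary. The snippet below is only a sketch of that arithmetic; the dictionary keys follow the split names in the card metadata (`train`, `test`, `val`).

```python
# Split sizes as listed above.
split_sizes = {"train": 582659, "test": 72831, "val": 72831}

total = sum(split_sizes.values())
shares = {name: size / total for name, size in split_sizes.items()}

print(total)  # 728321
for name, share in shares.items():
    # Prints the share of each split, e.g. "train: 80.0%".
    print(f"{name}: {share:.1%}")
```

The sizes correspond to roughly an 80/10/10 split between training, test, and validation data.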
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
This dataset was introduced in the paper <em>Neural Text Generation from Structured Data with Application to the Biography Domain</em> [(arXiv link)](https://arxiv.org/pdf/1603.07771.pdf) and is hosted in the [wikipedia-biography-dataset](https://github.com/DavidGrangier/wikipedia-biography-dataset) repository (owned by DavidGrangier).
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
This dataset is distributed under the Creative Commons CC BY-SA 3.0 license.
### Citation Information
To cite the original paper in BibTeX format:
```
@article{DBLP:journals/corr/LebretGA16,
author = {R{\'{e}}mi Lebret and
David Grangier and
Michael Auli},
title = {Generating Text from Structured Data with Application to the Biography
Domain},
journal = {CoRR},
volume = {abs/1603.07771},
year = {2016},
url = {http://arxiv.org/abs/1603.07771},
archivePrefix = {arXiv},
eprint = {1603.07771},
timestamp = {Mon, 13 Aug 2018 16:48:30 +0200},
biburl = {https://dblp.org/rec/journals/corr/LebretGA16.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
```
### Contributions
Thanks to [@alejandrocros](https://github.com/alejandrocros) for adding this dataset.
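Provider:
michaelauli
Summary of original information
Dataset overview
Basic dataset information
- Dataset name: WikiBio
- Language: English
- License: CC BY-SA 3.0
- Multilinguality: monolingual
- Size category: 100K<n<1M
- Source data: original
- Task category: table-to-text
- PapersWithCode ID: wikibio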
Dataset structure
Features
- input_text:
  - table:
    - column_header: string
    - row_number: int16
    - content: string
  - context: string
- target_text: string
Data splits
- Train:
  - Bytes: 619269257
  - Examples: 582659
- Test:
  - Bytes: 77264695
  - Examples: 72831
- Validation:
  - Bytes: 77335069
  - Examples: 72831
Dataset size
- Download size: 333998704
- Dataset size: 773869021
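Dataset creation
Source data
- Paper: Neural Text Generation from Structured Data with Application to the Biography Domain
- GitHub repository: https://github.com/DavidGrangier/wikipedia-biography-dataset
Dataset summary
The dataset contains 728,321 biographies extracted from Wikipedia, each with the first paragraph of the biography and the tabular infobox.
Supported tasks and leaderboards
The main purpose of this dataset is developing text generation models.
Considerations for using the data
Licensing information
The dataset is released under the Creative Commons CC BY-SA 3.0 license.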
Contributors
Thanks to @alejandrocros for adding this dataset.
Collected summary
Dataset introduction

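Construction
Converting structured data into text is a key task in natural language generation, and WikiBio provides an important resource for it. The dataset was built from Wikipedia through an automated pipeline that extracted 728,321 biography entries, each containing an infobox table and the first paragraph of the biography. The structured infobox data was collected from each Wikipedia page and paired with the first paragraph of the same article, forming a table-to-text mapping. The data is normalized (in the sample record the text is lowercased and tokenized, with brackets written as `-lrb-` and `-rrb-`), which gives later text generation research a consistent basis.
Characteristics
WikiBio has distinctive characteristics for table-to-text generation. Its core is the combination of heterogeneous structured data from Wikipedia infoboxes with coherent biography text: table fields cover attributes such as a person's type, dates in office, and nationality, while the target text is a naturally flowing narrative paragraph. The dataset is large, with more than 580,000 training examples, and is divided into training, validation, and test sets, which supports reliable model evaluation. Records are organized as JSON-like structures that clearly separate the input table from the target text, a layout well suited to sequence-to-sequence learning.
Usage
The dataset mainly serves the development and evaluation of text generation models and is especially suited to natural language generation from structured data. Researchers can load the standard splits, use the table fields as model input and the first paragraph of the biography as the generation target, and train sequence-to-sequence models or fine-tune pretrained language models. Typical uses include evaluating how well a model extracts key information from a table and produces a coherent narrative, and studying metrics such as faithfulness and fluency in data-to-text generation. Use must comply with the CC BY-SA 3.0 license, and the original paper should be cited.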
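The workflow described above (table fields as input, first paragraph as target) can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference pipeline: the function names and the `field: value | field: value` source format are hypothetical choices, and the in-memory list stands in for a real split.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple


def linearize_table(table: Dict[str, Any]) -> str:
    """Flatten an infobox into 'field: value' segments joined by ' | ' (assumed format)."""
    pairs = zip(table["column_header"], table["content"])
    return " | ".join(f"{header}: {value.strip()}" for header, value in pairs)


def to_seq2seq_pairs(samples: Iterable[Dict[str, Any]]) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) strings: linearized table in, biography text out."""
    for sample in samples:
        source = linearize_table(sample["input_text"]["table"])
        # target_text separates sentences with newlines; join them with single spaces.
        target = " ".join(sample["target_text"].split())
        yield source, target


# Tiny in-memory stand-in for one split, abbreviated from the sample in the dataset card.
samples = [{
    "input_text": {
        "context": "pope michael iii of alexandria\n",
        "table": {
            "column_header": ["type", "death_date", "successor"],
            "content": ["pope", "16 march 907", "gabriel i"],
            "row_number": [1, 1, 1],
        },
    },
    "target_text": "pope michael iii of alexandria -lrb- also known as khail iii -rrb- was the coptic pope of alexandria and patriarch of the see of st. mark -lrb- 880 -- 907 -rrb- .\n",
}]

pairs = list(to_seq2seq_pairs(samples))
print(pairs[0][0])  # type: pope | death_date: 16 march 907 | successor: gabriel i
```

With the Hugging Face `datasets` library, the `samples` list could presumably be replaced by a split loaded with `load_dataset("michaelauli/wiki_bio", split="train")`; that call is an assumption here and has not been checked against the hosted dataset.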
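Background and challenges
Background overview
WikiBio was introduced in 2016 by Rémi Lebret, David Grangier, and Michael Auli to advance research on generating natural text from structured data. It focuses on biography generation and contains more than 720,000 biography entries extracted from Wikipedia, each with an infobox table and the corresponding first-paragraph text. The resource gives natural language generation models a large-scale parallel corpus and has served as an important benchmark for table-based narrative generation.
Current challenges
WikiBio targets the problem of generating coherent, accurate natural language descriptions from structured tables, which requires a model to understand the semantic relations among discrete table fields and turn them into fluent narrative text. Building the dataset posed several challenges. Wikipedia infoboxes vary in structure and field naming is inconsistent, so substantial normalization is needed to ensure data quality. The first paragraph of a biography is not strictly aligned with the table: the text uses only a selection of the table's information and differs in style, which complicates pairing the two. The scale of the data also means that extraction, cleaning, and format unification must process large amounts of heterogeneous information efficiently while keeping the data reliable and consistent.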
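The loose alignment between table and text can be made concrete with a simple token-overlap measure. The sketch below is a hypothetical heuristic, not a metric from the original paper: it reports the fraction of target tokens that also occur in some infobox value. The toy record is invented for illustration and only mimics the dataset's record layout.

```python
from typing import Any, Dict


def table_coverage(sample: Dict[str, Any]) -> float:
    """Fraction of target tokens that also occur in some infobox value."""
    table_tokens = set()
    for value in sample["input_text"]["table"]["content"]:
        table_tokens.update(value.split())
    target_tokens = sample["target_text"].split()
    if not target_tokens:
        return 0.0
    covered = sum(1 for token in target_tokens if token in table_tokens)
    return covered / len(target_tokens)


# Toy record (invented for illustration) in the dataset's layout.
toy = {
    "input_text": {
        "context": "example person\n",
        "table": {
            "column_header": ["type", "death_date", "birth_place"],
            "content": ["pope", "16 march 907", "egypt"],
            "row_number": [1, 1, 1],
        },
    },
    "target_text": "he was pope in egypt .\n",
}

print(f"{table_coverage(toy):.2f}")  # 0.33
```

A value well below 1.0 indicates that much of the text cannot be copied from the table, which is one reason generation models for this task need a vocabulary beyond the table's own words.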
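Common scenarios
Classic use case
In natural language generation, WikiBio is a classic benchmark for table-to-text generation because it pairs structured tables with biography text. The dataset contains more than 700,000 biography entries extracted from Wikipedia, each consisting of an infobox table and the corresponding first-paragraph text. This design lets models learn to turn discrete structured data into coherent, fluent natural language descriptions and gives researchers a standardized basis for evaluating text generation models.
Practical applications
In practice, techniques developed on WikiBio can automate the generation of text from structured content such as person profiles, product descriptions, and news briefs. For example, a knowledge base system or intelligent assistant can generate a readable narrative introduction from the attribute entries in a database, improving the efficiency and readability of information presentation. Such techniques are also applied to business intelligence report generation and accessible information access, turning raw data into narratives that are easy to understand.
Derived and related work
A body of research has grown around WikiBio. The original paper proposed a neural language model conditioned on the infobox, with copy actions for words that appear in the table, and laid the foundation for later work. Subsequent work explored directions such as reinforcement learning, content planning, and factual consistency constraints, for example using copy networks to handle rare entities or pretrained language models to improve generation quality. This research has continued to advance data-to-text generation and has led to similar datasets for specific domains or more complex structured data.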
The content above was collected and summarized by 遇见数据集.



