
neulab/wiki_asp

Hugging Face · Updated 2024-01-18 · Indexed 2024-06-15
Download link:
https://hf-mirror.com/datasets/neulab/wiki_asp
Resource description:
---
annotations_creators:
- crowdsourced
language_creators:
- crowdsourced
language:
- en
license:
- cc-by-sa-4.0
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- summarization
task_ids: []
paperswithcode_id: wikiasp
pretty_name: WikiAsp
tags:
- aspect-based-summarization
dataset_info:
- config_name: album
  features:
  - name: exid
    dtype: string
  - name: inputs
    sequence: string
  - name: targets
    sequence:
      sequence: string
  splits:
  - name: train
    num_bytes: 1907323642
    num_examples: 24434
  - name: test
    num_bytes: 232999001
    num_examples: 3038
  - name: validation
    num_bytes: 234990092
    num_examples: 3104
  download_size: 644173065
  dataset_size: 2375312735
- config_name: animal
  features:
  - name: exid
    dtype: string
  - name: inputs
    sequence: string
  - name: targets
    sequence:
      sequence: string
  splits:
  - name: train
    num_bytes: 497474133
    num_examples: 16540
  - name: test
    num_bytes: 61315970
    num_examples: 2007
  - name: validation
    num_bytes: 57943532
    num_examples: 2005
  download_size: 150974930
  dataset_size: 616733635
- config_name: artist
  features:
  - name: exid
    dtype: string
  - name: inputs
    sequence: string
  - name: targets
    sequence:
      sequence: string
  splits:
  - name: train
    num_bytes: 1876134255
    num_examples: 26754
  - name: test
    num_bytes: 237751553
    num_examples: 3329
  - name: validation
    num_bytes: 223240910
    num_examples: 3194
  download_size: 626686303
  dataset_size: 2337126718
- config_name: building
  features:
  - name: exid
    dtype: string
  - name: inputs
    sequence: string
  - name: targets
    sequence:
      sequence: string
  splits:
  - name: train
    num_bytes: 1100057273
    num_examples: 20449
  - name: test
    num_bytes: 134357678
    num_examples: 2482
  - name: validation
    num_bytes: 139387376
    num_examples: 2607
  download_size: 346224042
  dataset_size: 1373802327
- config_name: company
  features:
  - name: exid
    dtype: string
  - name: inputs
    sequence: string
  - name: targets
    sequence:
      sequence: string
  splits:
  - name: train
    num_bytes: 1606057076
    num_examples: 24353
  - name: test
    num_bytes: 199282041
    num_examples: 3029
  - name: validation
    num_bytes: 200498778
    num_examples: 2946
  download_size: 504194353
  dataset_size: 2005837895
- config_name: educational_institution
  features:
  - name: exid
    dtype: string
  - name: inputs
    sequence: string
  - name: targets
    sequence:
      sequence: string
  splits:
  - name: train
    num_bytes: 1623000534
    num_examples: 17634
  - name: test
    num_bytes: 200476681
    num_examples: 2267
  - name: validation
    num_bytes: 203262430
    num_examples: 2141
  download_size: 471033992
  dataset_size: 2026739645
- config_name: event
  features:
  - name: exid
    dtype: string
  - name: inputs
    sequence: string
  - name: targets
    sequence:
      sequence: string
  splits:
  - name: train
    num_bytes: 748201660
    num_examples: 6475
  - name: test
    num_bytes: 96212295
    num_examples: 828
  - name: validation
    num_bytes: 97431395
    num_examples: 807
  download_size: 240072903
  dataset_size: 941845350
- config_name: film
  features:
  - name: exid
    dtype: string
  - name: inputs
    sequence: string
  - name: targets
    sequence:
      sequence: string
  splits:
  - name: train
    num_bytes: 2370068027
    num_examples: 32129
  - name: test
    num_bytes: 294918370
    num_examples: 3981
  - name: validation
    num_bytes: 290240851
    num_examples: 4014
  download_size: 808231638
  dataset_size: 2955227248
- config_name: group
  features:
  - name: exid
    dtype: string
  - name: inputs
    sequence: string
  - name: targets
    sequence:
      sequence: string
  splits:
  - name: train
    num_bytes: 1025166800
    num_examples: 11966
  - name: test
    num_bytes: 114239405
    num_examples: 1444
  - name: validation
    num_bytes: 120863870
    num_examples: 1462
  download_size: 344498865
  dataset_size: 1260270075
- config_name: historic_place
  features:
  - name: exid
    dtype: string
  - name: inputs
    sequence: string
  - name: targets
    sequence:
      sequence: string
  splits:
  - name: train
    num_bytes: 256158020
    num_examples: 4919
  - name: test
    num_bytes: 31201154
    num_examples: 600
  - name: validation
    num_bytes: 29058067
    num_examples: 601
  download_size: 77289509
  dataset_size: 316417241
- config_name: infrastructure
  features:
  - name: exid
    dtype: string
  - name: inputs
    sequence: string
  - name: targets
    sequence:
      sequence: string
  splits:
  - name: train
    num_bytes: 1124486451
    num_examples: 17226
  - name: test
    num_bytes: 134820330
    num_examples: 2091
  - name: validation
    num_bytes: 125193140
    num_examples: 1984
  download_size: 328804337
  dataset_size: 1384499921
- config_name: mean_of_transportation
  features:
  - name: exid
    dtype: string
  - name: inputs
    sequence: string
  - name: targets
    sequence:
      sequence: string
  splits:
  - name: train
    num_bytes: 650424738
    num_examples: 9277
  - name: test
    num_bytes: 89759392
    num_examples: 1170
  - name: validation
    num_bytes: 88440901
    num_examples: 1215
  download_size: 210234418
  dataset_size: 828625031
- config_name: office_holder
  features:
  - name: exid
    dtype: string
  - name: inputs
    sequence: string
  - name: targets
    sequence:
      sequence: string
  splits:
  - name: train
    num_bytes: 1643899203
    num_examples: 18177
  - name: test
    num_bytes: 207433317
    num_examples: 2333
  - name: validation
    num_bytes: 202624275
    num_examples: 2218
  download_size: 524721727
  dataset_size: 2053956795
- config_name: plant
  features:
  - name: exid
    dtype: string
  - name: inputs
    sequence: string
  - name: targets
    sequence:
      sequence: string
  splits:
  - name: train
    num_bytes: 239150885
    num_examples: 6107
  - name: test
    num_bytes: 31340125
    num_examples: 774
  - name: validation
    num_bytes: 28752150
    num_examples: 786
  download_size: 77890632
  dataset_size: 299243160
- config_name: single
  features:
  - name: exid
    dtype: string
  - name: inputs
    sequence: string
  - name: targets
    sequence:
      sequence: string
  splits:
  - name: train
    num_bytes: 1277277277
    num_examples: 14217
  - name: test
    num_bytes: 152328537
    num_examples: 1712
  - name: validation
    num_bytes: 160312594
    num_examples: 1734
  download_size: 429214401
  dataset_size: 1589918408
- config_name: soccer_player
  features:
  - name: exid
    dtype: string
  - name: inputs
    sequence: string
  - name: targets
    sequence:
      sequence: string
  splits:
  - name: train
    num_bytes: 604502541
    num_examples: 17599
  - name: test
    num_bytes: 72820378
    num_examples: 2280
  - name: validation
    num_bytes: 76705685
    num_examples: 2150
  download_size: 193347234
  dataset_size: 754028604
- config_name: software
  features:
  - name: exid
    dtype: string
  - name: inputs
    sequence: string
  - name: targets
    sequence:
      sequence: string
  splits:
  - name: train
    num_bytes: 1122906186
    num_examples: 13516
  - name: test
    num_bytes: 133717992
    num_examples: 1638
  - name: validation
    num_bytes: 134578157
    num_examples: 1637
  download_size: 356764908
  dataset_size: 1391202335
- config_name: television_show
  features:
  - name: exid
    dtype: string
  - name: inputs
    sequence: string
  - name: targets
    sequence:
      sequence: string
  splits:
  - name: train
    num_bytes: 893325347
    num_examples: 8717
  - name: test
    num_bytes: 115155155
    num_examples: 1072
  - name: validation
    num_bytes: 119461892
    num_examples: 1128
  download_size: 302093407
  dataset_size: 1127942394
- config_name: town
  features:
  - name: exid
    dtype: string
  - name: inputs
    sequence: string
  - name: targets
    sequence:
      sequence: string
  splits:
  - name: train
    num_bytes: 772504751
    num_examples: 14818
  - name: test
    num_bytes: 100975827
    num_examples: 1831
  - name: validation
    num_bytes: 101522638
    num_examples: 1911
  download_size: 243261734
  dataset_size: 975003216
- config_name: written_work
  features:
  - name: exid
    dtype: string
  - name: inputs
    sequence: string
  - name: targets
    sequence:
      sequence: string
  splits:
  - name: train
    num_bytes: 1491395960
    num_examples: 15065
  - name: test
    num_bytes: 189537205
    num_examples: 1931
  - name: validation
    num_bytes: 185707567
    num_examples: 1843
  download_size: 498307235
  dataset_size: 1866640732
---

# Dataset Card for WikiAsp

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [Wiki Asp](https://github.com/neulab/wikiasp)
- **Repository:** [GitHub](https://github.com/neulab/wikiasp)
- **Paper:** [WikiAsp: A Dataset for Multi-domain Aspect-based Summarization](https://arxiv.org/abs/2011.07832)

### Dataset Summary

[More Information Needed]

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

[More Information Needed]

## Dataset Structure

### Data Instances

An example from the "plant" configuration:

```
{
    'exid': 'train-78-8',
    'inputs': [
        '< EOT > calcareous rocks and barrens , wooded cliff edges .',
        'plant an erect short - lived perennial ( or biennial ) herb whose slender leafy stems radiate from the base , and are 3 - 5 dm tall , giving it a bushy appearance .',
        'leaves densely hairy , grayish - green , simple and alternate on the stem .',
        'flowers are bright yellow to yellow - orange , cross - shaped , each having 4 spatula - shaped petals about 5 mm long .',
        'fruit is a nearly globe - shaped capsule , about 3 mm in diameter , with 1 or 2 seeds in each cell .',
        'flowering period : early april to late may .',
        'even though there are many members of the mustard family in the range of this species , no other plant shares this combination of characters : bright yellow flowers , grayish - green stems and foliage , globe - shaped fruits with a long style , perennial habit , and the habitat of limestone rocky cliffs .',
        'timber removal may be beneficial and even needed to maintain the open character of the habitat for this species .',
        'hand removal of trees in the vicinity of the population is necessary to avoid impacts from timber operations .',
        'southwest indiana , north central kentucky , and north central tennessee .',
        'email : naturepreserves @ ky . gov feedback naturepreserves @ ky . gov | about the agency | about this site copyright © 2003 - 2013 commonwealth of kentucky .',
        'all rights reserved .',
        '<EOS>'
    ],
    'targets': [
        ['description',
         'physaria globosa is a small plant covered with dense hairs giving it a grayish appearance . it produces yellow flowers in the spring , and its fruit is globe - shaped . its preferred habitat is dry limestone cliffs , barrens , cedar glades , steep wooded slopes , and talus areas . some have also been found in areas of deeper soil and roadsides .'
        ],
        ['conservation',
         'the population fluctuates year to year , but on average there are about 2000 living plants at any one time , divided among 33 known locations . threats include forms of habitat degradation and destruction , including road construction and grading , mowing , dumping , herbicides , alteration of waterways , livestock damage , and invasive species of plants such as japanese honeysuckle , garlic mustard , alsike clover , sweet clover , meadow fescue , and multiflora rose . all populations are considered vulnerable to extirpation .'
        ]
    ]
}
```

### Data Fields

- `exid`: a unique identifier
- `inputs`: the cited references, given as a list of sentences tokenized with NLTK
- `targets`: a list of aspect-based summaries, where each element is a pair of a) the target aspect and b) the aspect-based summary

### Data Splits

[More Information Needed]

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

[More Information Needed]

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

[More Information Needed]

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]

### Contributions

Thanks to [@katnoria](https://github.com/katnoria) for adding this dataset.
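Provider:
neulab
Summary of Original Information

Dataset Card - WikiAsp

Dataset Description

Dataset Summary

WikiAsp is a dataset for multi-domain aspect-based summarization. It comprises 20 configurations, each covering a different topic domain.

Supported Tasks and Leaderboards

The dataset is intended mainly for aspect-based summarization.

Languages

The text in the dataset is primarily in English.

Dataset Structure

Data Instances

The following is an example from the "plant" configuration: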

{
  "exid": "train-78-8",
  "inputs": [
    "< EOT > calcareous rocks and barrens, wooded cliff edges.",
    "plant an erect short-lived perennial (or biennial) herb whose slender leafy stems radiate from the base, and are 3-5 dm tall, giving it a bushy appearance.",
    "leaves densely hairy, grayish-green, simple and alternate on the stem.",
    "flowers are bright yellow to yellow-orange, cross-shaped, each having 4 spatula-shaped petals about 5 mm long.",
    "fruit is a nearly globe-shaped capsule, about 3 mm in diameter, with 1 or 2 seeds in each cell.",
    "flowering period: early april to late may.",
    "even though there are many members of the mustard family in the range of this species, no other plant shares this combination of characters: bright yellow flowers, grayish-green stems and foliage, globe-shaped fruits with a long style, perennial habit, and the habitat of limestone rocky cliffs.",
    "timber removal may be beneficial and even needed to maintain the open character of the habitat for this species.",
    "hand removal of trees in the vicinity of the population is necessary to avoid impacts from timber operations.",
    "southwest indiana, north central kentucky, and north central tennessee.",
    "email: naturepreserves@ky.gov feedback naturepreserves@ky.gov | about the agency | about this site copyright © 2003-2013 commonwealth of kentucky.",
    "all rights reserved.",
    "<EOS>"
  ],
  "targets": [
    [
      "description",
      "physaria globosa is a small plant covered with dense hairs giving it a grayish appearance. it produces yellow flowers in the spring, and its fruit is globe-shaped. its preferred habitat is dry limestone cliffs, barrens, cedar glades, steep wooded slopes, and talus areas. some have also been found in areas of deeper soil and roadsides."
    ],
    [
      "conservation",
      "the population fluctuates year to year, but on average there are about 2000 living plants at any one time, divided among 33 known locations. threats include forms of habitat degradation and destruction, including road construction and grading, mowing, dumping, herbicides, alteration of waterways, livestock damage, and invasive species of plants such as japanese honeysuckle, garlic mustard, alsike clover, sweet clover, meadow fescue, and multiflora rose. all populations are considered vulnerable to extirpation."
    ]
  ]
}

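Data Fields

  • exid: a unique identifier
  • inputs: the cited references, given as a list of sentences tokenized with NLTK
  • targets: a list of aspect-based summaries, where each element is a pair of the target aspect and the summary for that aspect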
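
Because each element of targets is an [aspect, summary] pair, it is often convenient to index the summaries by aspect name. The following is a minimal sketch, not part of the dataset's own tooling: the helper name `targets_to_dict` and the shortened `record` are hypothetical, and it assumes each aspect appears at most once per record (if one repeats, the last summary wins).

```python
from typing import Dict, List


def targets_to_dict(targets: List[List[str]]) -> Dict[str, str]:
    """Map each aspect name to its summary text.

    Assumes every element is an [aspect, summary] pair, as in the field
    description. If an aspect repeats, the last summary overwrites earlier ones.
    """
    aspects: Dict[str, str] = {}
    for pair in targets:
        if len(pair) != 2:
            raise ValueError(f"expected [aspect, summary], got {len(pair)} items")
        aspect, summary = pair
        aspects[aspect] = summary
    return aspects


# Hypothetical record shaped like the "plant" example; the texts are
# shortened placeholders for illustration, not the real field values.
record = {
    "exid": "train-78-8",
    "inputs": ["flowering period: early april to late may.", "<EOS>"],
    "targets": [
        ["description", "shortened description summary"],
        ["conservation", "shortened conservation summary"],
    ],
}

aspects = targets_to_dict(record["targets"])
aspect_names = sorted(aspects)
print(aspect_names)
```

A dictionary keeps lookups by aspect simple; if the order of aspects or repeated aspects matter for a given use, iterating over the original list of pairs is the safer choice.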

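Data Splits

The dataset comprises multiple configurations, each with a training, test, and validation split. The details of every configuration are listed below:

  • album:
    • Training set: 24434 examples, 1907323642 bytes
    • Test set: 3038 examples, 232999001 bytes
    • Validation set: 3104 examples, 234990092 bytes
    • Download size: 644173065 bytes
    • Dataset size: 2375312735 bytes
  • animal:
    • Training set: 16540 examples, 497474133 bytes
    • Test set: 2007 examples, 61315970 bytes
    • Validation set: 2005 examples, 57943532 bytes
    • Download size: 150974930 bytes
    • Dataset size: 616733635 bytes
  • artist:
    • Training set: 26754 examples, 1876134255 bytes
    • Test set: 3329 examples, 237751553 bytes
    • Validation set: 3194 examples, 223240910 bytes
    • Download size: 626686303 bytes
    • Dataset size: 2337126718 bytes
  • building:
    • Training set: 20449 examples, 1100057273 bytes
    • Test set: 2482 examples, 134357678 bytes
    • Validation set: 2607 examples, 139387376 bytes
    • Download size: 346224042 bytes
    • Dataset size: 1373802327 bytes
  • company:
    • Training set: 24353 examples, 1606057076 bytes
    • Test set: 3029 examples, 199282041 bytes
    • Validation set: 2946 examples, 200498778 bytes
    • Download size: 504194353 bytes
    • Dataset size: 2005837895 bytes
  • educational_institution:
    • Training set: 17634 examples, 1623000534 bytes
    • Test set: 2267 examples, 200476681 bytes
    • Validation set: 2141 examples, 203262430 bytes
    • Download size: 471033992 bytes
    • Dataset size: 2026739645 bytes
  • event:
    • Training set: 6475 examples, 748201660 bytes
    • Test set: 828 examples, 96212295 bytes
    • Validation set: 807 examples, 97431395 bytes
    • Download size: 240072903 bytes
    • Dataset size: 941845350 bytes
  • film:
    • Training set: 32129 examples, 2370068027 bytes
    • Test set: 3981 examples, 294918370 bytes
    • Validation set: 4014 examples, 290240851 bytes
    • Download size: 808231638 bytes
    • Dataset size: 2955227248 bytes
  • group:
    • Training set: 11966 examples, 1025166800 bytes
    • Test set: 1444 examples, 114239405 bytes
    • Validation set: 1462 examples, 120863870 bytes
    • Download size: 344498865 bytes
    • Dataset size: 1260270075 bytes
  • historic_place:
    • Training set: 4919 examples, 256158020 bytes
    • Test set: 600 examples, 31201154 bytes
    • Validation set: 601 examples, 29058067 bytes
    • Download size: 77289509 bytes
    • Dataset size: 316417241 bytes
  • infrastructure:
    • Training set: 17226 examples, 1124486451 bytes
    • Test set: 2091 examples, 134820330 bytes
    • Validation set: 1984 examples, 125193140 bytes
    • Download size: 328804337 bytes
    • Dataset size: 1384499921 bytes
  • mean_of_transportation:
    • Training set: 9277 examples, 650424738 bytes
    • Test set: 1170 examples, 89759392 bytes
    • Validation set: 1215 examples, 88440901 bytes
    • Download size: 210234418 bytes
    • Dataset size: 828625031 bytes
  • office_holder:
    • Training set: 18177 examples, 1643899203 bytes
    • Test set: 2333 examples, 207433317 bytes
    • Validation set: 2218 examples, 202624275 bytes
    • Download size: 524721727 bytes
    • Dataset size: 2053956795 bytes
  • plant:
    • Training set: 6107 examples, 239150885 bytes
    • Test set: 774 examples, 31340125 bytes
    • Validation set: 786 examples, 28752150 bytes
    • Download size: 77890632 bytes
    • Dataset size: 299243160 bytes
  • single:
    • Training set: 14217 examples, 1277277277 bytes
    • Test set: 1712 examples, 152328537 bytes
    • Validation set: 1734 examples, 160312594 bytes
    • Download size: 429214401 bytes
    • Dataset size: 1589918408 bytes
  • soccer_player:
    • Training set: 17599 examples, 604502541 bytes
    • Test set: 2280 examples, 72820378 bytes
    • Validation set: 2150 examples, 76705685 bytes
    • Download size: 193347234 bytes
    • Dataset size: 754028604 bytes
  • software:
    • Training set: 13516 examples, 1122906186 bytes
    • Test set: 1638 examples, 133717992 bytes
    • Validation set: 1637 examples, 134578157 bytes
    • Download size: 356764908 bytes
    • Dataset size: 1391202335 bytes
  • television_show:
    • Training set: 8717 examples, 893325347 bytes
    • Test set: 1072 examples, 115155155 bytes
    • Validation set: 1128 examples, 119461892 bytes
    • Download size: 302093407 bytes
    • Dataset size: 1127942394 bytes
  • town:
    • Training set: 14818 examples, 772504751 bytes
    • Test set: 1831 examples, 100975827 bytes
    • Validation set: 1911 examples, 101522638 bytes
    • Download size: 243261734 bytes
    • Dataset size: 975003216 bytes
  • written_work:
    • Training set: 15065 examples, 1491395960 bytes
    • Test set: 1931 examples, 189537205 bytes
    • Validation set: 1843 examples, 185707567 bytes
    • Download size: 498307235 bytes
    • Dataset size: 1866640732 bytes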
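
The example counts listed per configuration can be turned into split proportions with a few lines of arithmetic. The sketch below uses the "plant" counts from the list; the helper name `split_fractions` is hypothetical and only illustrates the calculation.

```python
from typing import Dict

# Example counts for the "plant" configuration, copied from the split list.
plant_splits = {"train": 6107, "test": 774, "validation": 786}


def split_fractions(counts: Dict[str, int]) -> Dict[str, float]:
    """Return each split's share of the total number of examples."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot compute fractions for zero examples")
    return {name: n / total for name, n in counts.items()}


fractions = split_fractions(plant_splits)
for name, frac in fractions.items():
    # Print each share as a percentage with one decimal place.
    print(f"{name}: {frac:.1%}")
```

For "plant" this works out to a split close to 80/10/10 across training, test, and validation; the same function can be applied to the counts of any other configuration.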
Collected Summary
Dataset Introduction
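Background and Challenges
Background Overview
The dataset 'neulab/wiki_asp' is an English dataset for multi-domain aspect-based summarization, in the 10K to 100K size category. It contains input texts and target summaries written for different aspects, such as description and conservation, and is released under the cc-by-sa-4.0 license.
The content above was collected and summarized by 遇见数据集.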