neulab/wiki_asp
Dataset Card for WikiAsp
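Dataset Description
Dataset Summary
WikiAsp is a multi-domain dataset for aspect-based summarization. It comprises multiple configurations, each covering a different topic domain.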
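Since each domain is a separate configuration, loading the data means picking one by name. The following is a minimal sketch, assuming the dataset is served through the Hugging Face `datasets` library under the repository id `neulab/wiki_asp` and that the splits are named `train`, `validation`, and `test`. The helper `load_wikiasp` and the constant `WIKIASP_CONFIGS` are hypothetical names introduced here for illustration; the configuration names are the ones listed under Data Splits.

```python
# Configuration names as listed in this card's split section.
WIKIASP_CONFIGS = (
    "album", "animal", "artist", "building", "company",
    "educational_institution", "event", "film", "group", "historic_place",
    "infrastructure", "mean_of_transportation", "office_holder", "plant",
    "single", "soccer_player", "software", "television_show", "town",
    "written_work",
)

# Assumed split names; the card only says train, test, and validation exist.
WIKIASP_SPLITS = ("train", "validation", "test")


def load_wikiasp(config: str, split: str = "train"):
    """Load one domain configuration of WikiAsp (hypothetical helper)."""
    if config not in WIKIASP_CONFIGS:
        raise ValueError(f"unknown WikiAsp config: {config!r}")
    if split not in WIKIASP_SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    # Imported lazily so the name checks above work without the library.
    from datasets import load_dataset

    return load_dataset("neulab/wiki_asp", config, split=split)
```

With the library installed and network access available, `load_wikiasp("plant")` would be expected to return the training split of the plant domain. Validating the name up front gives a clearer error than a failed download.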
Supported Tasks and Leaderboards
The dataset is intended primarily for aspect-based summarization.
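In aspect-based summarization, one source document yields one summary per aspect. A minimal sketch of how a record could be unrolled into per-aspect training instances is shown here. The function name `expand_by_aspect` is hypothetical, joining the input sentences with a single space is an assumption, and the record is a hand-shortened excerpt of the "plant" example in this card.

```python
from typing import Iterator, Tuple


def expand_by_aspect(record: dict) -> Iterator[Tuple[str, str, str]]:
    """Yield (aspect, source_text, summary) triples, one per target aspect."""
    # Assumption: the source sentences are simply joined with spaces.
    source_text = " ".join(record["inputs"])
    for aspect, summary in record["targets"]:
        yield aspect, source_text, summary


# Shortened excerpt of the card's "plant" example (summaries truncated).
record = {
    "exid": "train-78-8",
    "inputs": [
        "leaves densely hairy, grayish-green, simple and alternate on the stem.",
        "flowering period: early april to late may.",
    ],
    "targets": [
        ["description", "physaria globosa is a small plant covered with dense hairs giving it a grayish appearance."],
        ["conservation", "all populations are considered vulnerable to extirpation."],
    ],
}

pairs = list(expand_by_aspect(record))
for aspect, source_text, summary in pairs:
    print(f"[{aspect}] {len(source_text)} source chars -> {len(summary)} summary chars")
```

Each triple shares the same source text and differs only in the aspect and its summary, which is one common way to frame the task for a sequence-to-sequence model.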
Languages
The text in the dataset is primarily in English.
Dataset Structure
Data Instances
The following is an example from the "plant" configuration:
```json
{
  "exid": "train-78-8",
  "inputs": [
    "< EOT > calcareous rocks and barrens, wooded cliff edges.",
    "plant an erect short-lived perennial (or biennial) herb whose slender leafy stems radiate from the base, and are 3-5 dm tall, giving it a bushy appearance.",
    "leaves densely hairy, grayish-green, simple and alternate on the stem.",
    "flowers are bright yellow to yellow-orange, cross-shaped, each having 4 spatula-shaped petals about 5 mm long.",
    "fruit is a nearly globe-shaped capsule, about 3 mm in diameter, with 1 or 2 seeds in each cell.",
    "flowering period: early april to late may.",
    "even though there are many members of the mustard family in the range of this species, no other plant shares this combination of characters: bright yellow flowers, grayish-green stems and foliage, globe-shaped fruits with a long style, perennial habit, and the habitat of limestone rocky cliffs.",
    "timber removal may be beneficial and even needed to maintain the open character of the habitat for this species.",
    "hand removal of trees in the vicinity of the population is necessary to avoid impacts from timber operations.",
    "southwest indiana, north central kentucky, and north central tennessee.",
    "email: naturepreserves@ky.gov feedback naturepreserves@ky.gov | about the agency | about this site copyright © 2003-2013 commonwealth of kentucky.",
    "all rights reserved.",
    "<EOS>"
  ],
  "targets": [
    [
      "description",
      "physaria globosa is a small plant covered with dense hairs giving it a grayish appearance. it produces yellow flowers in the spring, and its fruit is globe-shaped. its preferred habitat is dry limestone cliffs, barrens, cedar glades, steep wooded slopes, and talus areas. some have also been found in areas of deeper soil and roadsides."
    ],
    [
      "conservation",
      "the population fluctuates year to year, but on average there are about 2000 living plants at any one time, divided among 33 known locations. threats include forms of habitat degradation and destruction, including road construction and grading, mowing, dumping, herbicides, alteration of waterways, livestock damage, and invasive species of plants such as japanese honeysuckle, garlic mustard, alsike clover, sweet clover, meadow fescue, and multiflora rose. all populations are considered vulnerable to extirpation."
    ]
  ]
}
```
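Data Fields
- exid: a unique identifier
- inputs: the cited references, given as a list of sentences tokenized with NLTK
- targets: a list of aspect-based summaries; each element is a pair consisting of a target aspect and the summary for that aspect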
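The field layout above can be checked and reshaped with a few lines of code. This is a sketch under stated assumptions: `check_record` and `summaries_by_aspect` are hypothetical helpers, and because the card does not say whether an aspect can occur more than once in a record, summaries are grouped into lists rather than stored one per key.

```python
from collections import defaultdict
from typing import Dict, List


def check_record(record: dict) -> None:
    """Raise ValueError if a record does not match the documented fields."""
    if not isinstance(record.get("exid"), str):
        raise ValueError("exid must be a string")
    inputs = record.get("inputs")
    if not isinstance(inputs, list) or not all(isinstance(s, str) for s in inputs):
        raise ValueError("inputs must be a list of sentence strings")
    targets = record.get("targets")
    if not isinstance(targets, list):
        raise ValueError("targets must be a list of pairs")
    for pair in targets:
        if len(pair) != 2 or not all(isinstance(x, str) for x in pair):
            raise ValueError("each target must be an (aspect, summary) pair")


def summaries_by_aspect(record: dict) -> Dict[str, List[str]]:
    """Group the summaries in `targets` under their aspect name."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for aspect, summary in record["targets"]:
        grouped[aspect].append(summary)
    return dict(grouped)


# Tiny record in the documented shape (text shortened from the card's example).
sample = {
    "exid": "train-78-8",
    "inputs": ["all rights reserved.", "<EOS>"],
    "targets": [["conservation", "all populations are considered vulnerable to extirpation."]],
}
check_record(sample)
print(summaries_by_aspect(sample))
```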
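Data Splits
The dataset comprises multiple configurations, each with training, test, and validation splits. Details for a number of the configurations are given below: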
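- album:
  - Train: 24434 examples, 1907323642 bytes
  - Test: 3038 examples, 232999001 bytes
  - Validation: 3104 examples, 234990092 bytes
  - Download size: 644173065 bytes
  - Dataset size: 2375312735 bytes
- animal:
  - Train: 16540 examples, 497474133 bytes
  - Test: 2007 examples, 61315970 bytes
  - Validation: 2005 examples, 57943532 bytes
  - Download size: 150974930 bytes
  - Dataset size: 616733635 bytes
- artist:
  - Train: 26754 examples, 1876134255 bytes
  - Test: 3329 examples, 237751553 bytes
  - Validation: 3194 examples, 223240910 bytes
  - Download size: 626686303 bytes
  - Dataset size: 2337126718 bytes
- building:
  - Train: 20449 examples, 1100057273 bytes
  - Test: 2482 examples, 134357678 bytes
  - Validation: 2607 examples, 139387376 bytes
  - Download size: 346224042 bytes
  - Dataset size: 1373802327 bytes
- company:
  - Train: 24353 examples, 1606057076 bytes
  - Test: 3029 examples, 199282041 bytes
  - Validation: 2946 examples, 200498778 bytes
  - Download size: 504194353 bytes
  - Dataset size: 2005837895 bytes
- educational_institution:
  - Train: 17634 examples, 1623000534 bytes
  - Test: 2267 examples, 200476681 bytes
  - Validation: 2141 examples, 203262430 bytes
  - Download size: 471033992 bytes
  - Dataset size: 2026739645 bytes
- event:
  - Train: 6475 examples, 748201660 bytes
  - Test: 828 examples, 96212295 bytes
  - Validation: 807 examples, 97431395 bytes
  - Download size: 240072903 bytes
  - Dataset size: 941845350 bytes
- film:
  - Train: 32129 examples, 2370068027 bytes
  - Test: 3981 examples, 294918370 bytes
  - Validation: 4014 examples, 290240851 bytes
  - Download size: 808231638 bytes
  - Dataset size: 2955227248 bytes
- group:
  - Train: 11966 examples, 1025166800 bytes
  - Test: 1444 examples, 114239405 bytes
  - Validation: 1462 examples, 120863870 bytes
  - Download size: 344498865 bytes
  - Dataset size: 1260270075 bytes
- historic_place:
  - Train: 4919 examples, 256158020 bytes
  - Test: 600 examples, 31201154 bytes
  - Validation: 601 examples, 29058067 bytes
  - Download size: 77289509 bytes
  - Dataset size: 316417241 bytes
- infrastructure:
  - Train: 17226 examples, 1124486451 bytes
  - Test: 2091 examples, 134820330 bytes
  - Validation: 1984 examples, 125193140 bytes
  - Download size: 328804337 bytes
  - Dataset size: 1384499921 bytes
- mean_of_transportation:
  - Train: 9277 examples, 650424738 bytes
  - Test: 1170 examples, 89759392 bytes
  - Validation: 1215 examples, 88440901 bytes
  - Download size: 210234418 bytes
  - Dataset size: 828625031 bytes
- office_holder:
  - Train: 18177 examples, 1643899203 bytes
  - Test: 2333 examples, 207433317 bytes
  - Validation: 2218 examples, 202624275 bytes
  - Download size: 524721727 bytes
  - Dataset size: 2053956795 bytes
- plant:
  - Train: 6107 examples, 239150885 bytes
  - Test: 774 examples, 31340125 bytes
  - Validation: 786 examples, 28752150 bytes
  - Download size: 77890632 bytes
  - Dataset size: 299243160 bytes
- single:
  - Train: 14217 examples, 1277277277 bytes
  - Test: 1712 examples, 152328537 bytes
  - Validation: 1734 examples, 160312594 bytes
  - Download size: 429214401 bytes
  - Dataset size: 1589918408 bytes
- soccer_player:
  - Train: 17599 examples, 604502541 bytes
  - Test: 2280 examples, 72820378 bytes
  - Validation: 2150 examples, 76705685 bytes
  - Download size: 193347234 bytes
  - Dataset size: 754028604 bytes
- software:
  - Train: 13516 examples, 1122906186 bytes
  - Test: 1638 examples, 133717992 bytes
  - Validation: 1637 examples, 134578157 bytes
  - Download size: 356764908 bytes
  - Dataset size: 1391202335 bytes
- television_show:
  - Train: 8717 examples, 893325347 bytes
  - Test: 1072 examples, 115155155 bytes
  - Validation: 1128 examples, 119461892 bytes
  - Download size: 302093407 bytes
  - Dataset size: 1127942394 bytes
- town:
  - Train: 14818 examples, 772504751 bytes
  - Test: 1831 examples, 100975827 bytes
  - Validation: 1911 examples, 101522638 bytes
  - Download size: 243261734 bytes
  - Dataset size: 975003216 bytes
- written_work:
  - Train: 15065 examples, 1491395960 bytes
  - Test: 1931 examples, 189537205 bytes
  - Validation: 1843 examples, 185707567 bytes
  - Download size: 498307235 bytes
  - Dataset size: 1866640732 bytes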
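The split figures can be cross-checked with simple arithmetic. For the three configurations copied into the sketch below, the reported dataset size equals the sum of the three split sizes in bytes; whether this holds for every configuration is not verified here. The names `REPORTED`, `split_bytes_total`, and `split_shares` are hypothetical and introduced only for this illustration.

```python
SPLITS = ("train", "test", "validation")

# Figures copied from the split list above for three of the configurations.
REPORTED = {
    "event": {
        "examples": {"train": 6475, "test": 828, "validation": 807},
        "bytes": {"train": 748201660, "test": 96212295, "validation": 97431395},
        "dataset_size": 941845350,
    },
    "historic_place": {
        "examples": {"train": 4919, "test": 600, "validation": 601},
        "bytes": {"train": 256158020, "test": 31201154, "validation": 29058067},
        "dataset_size": 316417241,
    },
    "plant": {
        "examples": {"train": 6107, "test": 774, "validation": 786},
        "bytes": {"train": 239150885, "test": 31340125, "validation": 28752150},
        "dataset_size": 299243160,
    },
}


def split_bytes_total(entry: dict) -> int:
    """Sum the byte sizes of the train, test, and validation splits."""
    return sum(entry["bytes"][s] for s in SPLITS)


def split_shares(entry: dict) -> dict:
    """Return each split's fraction of the configuration's examples."""
    total = sum(entry["examples"][s] for s in SPLITS)
    return {s: entry["examples"][s] / total for s in SPLITS}


for name, entry in REPORTED.items():
    consistent = split_bytes_total(entry) == entry["dataset_size"]
    shares = split_shares(entry)
    print(f"{name}: sizes consistent={consistent}, train share={shares['train']:.3f}")
```

The download size is smaller than the dataset size in every entry listed, which would be consistent with the files being stored in a compressed form, although the card does not state this.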




