DFKI-SLT/DWIE

Name: DFKI-SLT/DWIE
Creator: DFKI-SLT
Published: 2024-05-15 06:42:35
License: No description available

Source: Hugging Face. Updated 2024-05-15; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/DFKI-SLT/DWIE

Resource description:

---
annotations_creators:
- expert-generated
language_creators:
- found
language:
- en
license: other
multilinguality:
- monolingual
size_categories:
- 10M<n<100M
source_datasets:
- original
task_categories:
- feature-extraction
- text-classification
task_ids:
- entity-linking-classification
paperswithcode_id: acronym-identification
pretty_name: DWIE (Deutsche Welle corpus for Information Extraction) is a new dataset for document-level multi-task Information Extraction (IE).
tags:
- Named Entity Recognition, Coreference Resolution, Relation Extraction, Entity Linking
dataset_info:
  config_name: Task_1
  features:
  - name: id
    dtype: string
  - name: content
    dtype: string
  - name: tags
    dtype: string
  - name: mentions
    list:
    - name: begin
      dtype: int32
    - name: end
      dtype: int32
    - name: text
      dtype: string
    - name: concept
      dtype: int32
    - name: candidates
      sequence: string
    - name: scores
      sequence: float32
  - name: concepts
    list:
    - name: concept
      dtype: int32
    - name: text
      dtype: string
    - name: keyword
      dtype: bool
    - name: count
      dtype: int32
    - name: link
      dtype: string
    - name: tags
      sequence: string
  - name: relations
    list:
    - name: s
      dtype: int32
    - name: p
      dtype: string
    - name: o
      dtype: int32
  - name: frames
    list:
    - name: type
      dtype: string
    - name: slots
      list:
      - name: name
        dtype: string
      - name: value
        dtype: int32
  - name: iptc
    sequence: string
  splits:
  - name: train
    num_bytes: 16533390
    num_examples: 802
  download_size: 3822277
  dataset_size: 16533390
configs:
- config_name: Task_1
  data_files:
  - split: train
    path: Task_1/train-*
  default: true
train-eval-index:
- col_mapping:
    labels: tags
    tokens: tokens
  config: default
  splits:
    eval_split: test
  task_id: entity_extraction
---

# Dataset Card for DWIE

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [https://opendatalab.com/DWIE](https://opendatalab.com/DWIE)
- **Repository:** [https://github.com/klimzaporojets/DWIE](https://github.com/klimzaporojets/DWIE)
- **Paper:** [DWIE: an entity-centric dataset for multi-task document-level information extraction](https://arxiv.org/abs/2009.12626)
- **Leaderboard:** [https://opendatalab.com/DWIE](https://opendatalab.com/DWIE)
- **Size of downloaded dataset files:** 40.8 MB

### Dataset Summary

DWIE (Deutsche Welle corpus for Information Extraction) is a new dataset for document-level multi-task Information Extraction (IE). It combines four main IE sub-tasks:

1. Named Entity Recognition: 23,130 entities classified in 311 multi-label entity types (tags).
2. Coreference Resolution: 43,373 entity mentions clustered in 23,130 entities.
3. Relation Extraction: 21,749 annotated relations between entities classified in 65 multi-label relation types.
4. Entity Linking: the named entities are linked to Wikipedia (version 20181115).

For details, see the paper https://arxiv.org/pdf/2009.12626v2.pdf.

### Supported Tasks and Leaderboards

- **Tasks:** Named Entity Recognition, Coreference Resolution, Relation Extraction and Entity Linking in news articles
- **Leaderboards:** [https://opendatalab.com/DWIE](https://opendatalab.com/DWIE)

### Languages

The language in the dataset is English.

## Dataset Structure

### Data Instances

- **Size of downloaded dataset files:** 40.8 MB

An abridged sample record from the `train` split looks as follows:

```json
{'id': 'DW_3980038', 'content': 'Proposed Nabucco Gas Pipeline Gets European Bank Backing\nThe heads of the EU\'s European Investment Bank and the European Bank for Reconstruction and Development (EBRD) said Tuesday, Jan. 27, that they are prepared to provide financial backing for the Nabucco gas pipeline.\nSpurred on by Europe\'s worst-ever gas crisis earlier this month, which left millions of homes across the continent without heat in the depths of winter, Hungarian Prime Minister Ferenc Gyurcsany invited top-ranking officials from both the EU and the countries involved in Nabucco to inject fresh momentum into the slow-moving project. Nabucco, an ambitious but still-unbuilt gas pipeline aimed at reducing Europe\'s energy reliance on Russia, is a 3,300-kilometer (2,050-mile) pipeline between Turkey and Austria. Costing an estimated 7.9 billion euros, the aim is to transport up to 31 billion cubic meters of gas each year from the Caspian Sea to Western Europe, bypassing Russia and Ukraine. Nabucco currently has six shareholders -- OMV of Austria, MOL of Hungary, Transgaz of Romania, Bulgargaz of Bulgaria, Botas of Turkey and RWE of Germany. But for the pipeline to get moving, Nabucco would need an initial cash injection of an estimated 300 million euros. Both the EIB and EBRD said they were willing to invest in the early stages of the project through a series of loans, providing certain conditions are met. 
"The EIB is ready to finance projects that further EU objectives of increased sustainability and energy security," said Philippe Maystadt, president of the European Investment Bank, during the opening addresses by participants at the "Nabucco summit" in Hungary. The EIB is prepared to finance "up to 25 percent of project cost," provided a secure intergovernmental agreement on the Nabucco pipeline is reached, he said. Maystadt noted that of 48 billion euros of financing it provided last year, a quarter was for energy projects. EBRD President Thomas Mirow also offered financial backing to the Nabucco pipeline, on the condition that it "meets the requirements of solid project financing." The bank would need to see concrete plans and completion guarantees, besides a stable political agreement, said Mirow. EU wary of future gas crises Czech Prime Minister Mirek Topolanek, whose country currently holds the rotating presidency of the EU, spoke about the recent gas crisis caused by a pricing dispute between Russia and Ukraine that affected supplies to Europe. "A new crisis could emerge at any time, and next time it could be even worse," Topolanek said. He added that reaching an agreement on Nabucco is a "test of European solidarity." The latest gas row between Russia and Ukraine has highlighted Europe\'s need to diversify its energy sources and thrown the spotlight on Nabucco. But critics insist that the vast project will remain nothing but a pipe dream because its backers cannot guarantee that they will ever have sufficient gas supplies to make it profitable. EU Energy Commissioner Andris Piebalgs urged political leaders to commit firmly to Nabucco by the end of March, or risk jeopardizing the project. In his opening address as host, Hungarian Prime Minister Ferenc Gyurcsany called on the EU to provide 200 to 300 million euros within the next few weeks to get the construction of the pipeline off the ground. 
Gyurcsany stressed that he was not hoping for a loan, but rather for starting capital from the EU. US Deputy Assistant Secretary of State Matthew Bryza noted that the Tuesday summit had made it clear that Gyurcsany, who dismissed Nabucco as "a dream" in 2007, was now fully committed to the energy supply diversification project. On the supply side, Turkmenistan and Azerbaijan both indicated they would be willing to supply some of the gas. "Azerbaijan, which is according to current plans is a transit country, could eventually serve as a supplier as well," Azerbaijani President Ilham Aliyev said. Azerbaijan\'s gas reserves of some two or three trillion cubic meters would be sufficient to last "several decades," he said. Austrian Economy Minister Reinhold Mitterlehner suggested that Egypt and Iran could also be brought in as suppliers in the long term. But a deal currently seems unlikely with Iran given the long-running international standoff over its disputed nuclear program. Russia, Ukraine still wrangling Meanwhile, Russia and Ukraine were still wrangling over the details of the deal which ended their gas quarrel earlier this month. Ukrainian President Viktor Yushchenko said on Tuesday he would stand by the terms of the agreement with Russia, even though not all the details are to his liking. But Russian officials questioned his reliability, saying that the political rivalry between Yushchenko and Prime Minister Yulia Timoshenko could still lead Kiev to cancel the contract. "The agreements signed are not easy ones, but Ukraine fully takes up the performance (of its commitments) and guarantees full-fledged transit to European consumers," Yushchenko told journalists in Brussels after a meeting with the head of the European Commission, Jose Manuel Barroso. The assurance that Yushchenko would abide by the terms of the agreement finalized by Timoshenko was "an important step forward in allowing us to focus on our broader relationship," Barroso said. 
But the spokesman for Russian Prime Minister Vladimir Putin said that Moscow still feared that the growing rivalry between Yushchenko and Timoshenko, who are set to face off in next year\'s presidential election, could torpedo the deal. EU in talks to upgrade Ukraine\'s transit system Yushchenko\'s working breakfast with Barroso was dominated by the energy question, with both men highlighting the need to upgrade Ukraine\'s gas-transit system and build more links between Ukrainian and European energy markets. The commission is set to host an international conference aimed at gathering donations to upgrade Ukraine\'s gas-transit system on March 23 in Brussels. The EU and Ukraine have agreed to form a joint expert group to plan the meeting, the leaders said Tuesday. During the conflict, Barroso had warned that both Russia and Ukraine were damaging their credibility as reliable partners. But on Monday he said that "in bilateral relations, we are not taking any negative consequences from (the gas row) because we believe Ukraine wants to deepen the relationship with the EU, and we also want to deepen the relationship with Ukraine." He also said that "we have to state very clearly that we were disappointed by the problems between Ukraine and Russia," and called for political stability and reform in Ukraine. 
His diplomatic balancing act is likely to have a frosty reception in Moscow, where Peskov said that Russia "would prefer to hear from the European states a very serious and severe evaluation of who is guilty for interrupting the transit."', 'tags': "['all', 'train']", 'mentions': [{'begin': 9, 'end': 29, 'text': 'Nabucco Gas Pipeline', 'concept': 1, 'candidates': [], 'scores': []}, {'begin': 287, 'end': 293, 'text': 'Europe', 'concept': 2, 'candidates': ['Europe', 'UEFA', 'Europe_(band)', 'UEFA_competitions', 'European_Athletic_Association', 'European_theatre_of_World_War_II', 'European_Union', 'Europe_(dinghy)', 'European_Cricket_Council', 'UEFA_Champions_League', 'Senior_League_World_Series_(Europe–Africa_Region)', 'Big_League_World_Series_(Europe–Africa_Region)', 'Sailing_at_the_2004_Summer_Olympics_–_Europe', 'Neolithic_Europe', 'History_of_Europe', 'Europe_(magazine)'], 'scores': [0.8408304452896118, 0.10987312346696854, 0.01377162616699934, 0.002099192701280117, 0.0015916954725980759, 0.0015686274273321033, 0.001522491336800158, 0.0013148789294064045, 0.0012456747936084867, 0.000991926179267466, 0.0008073817589320242, 0.0007843137136660516, 0.000761245668400079, 0.0006920415326021612, 0.0005536332027986646, 0.000530565157532692]}, 0.00554528646171093, 0.004390018526464701, 0.003234750358387828, 0.002772643230855465, 0.001617375179193914]}, {'begin': 6757, 'end': 6765, 'text': 'European', 'concept': 13, 'candidates': None, 'scores': []}], 'concepts': [{'concept': 0, 'text': 'European Investment Bank', 'keyword': True, 'count': 5, 'link': 'European_Investment_Bank', 'tags': ['iptc::11000000', 'slot::keyword', 'topic::politics', 'type::entity', 'type::igo', 'type::organization']}, {'concept': 66, 'text': None, 'keyword': False, 'count': 0, 'link': 'Czech_Republic', 'tags': []}], 'relations': [{'s': 0, 'p': 'institution_of', 'o': 2}, {'s': 0, 'p': 'part_of', 'o': 2}, {'s': 3, 'p': 'institution_of', 'o': 2}, {'s': 3, 'p': 'part_of', 'o': 2}, {'s': 6, 'p': 
'head_of', 'o': 0}, {'s': 6, 'p': 'member_of', 'o': 0}, {'s': 7, 'p': 'agent_of', 'o': 4}, {'s': 7, 'p': 'citizen_of', 'o': 4}, {'s': 7, 'p': 'citizen_of-x', 'o': 55}, {'s': 7, 'p': 'head_of_state', 'o': 4}, {'s': 7, 'p': 'head_of_state-x', 'o': 55}, {'s': 8, 'p': 'agent_of', 'o': 4}, {'s': 8, 'p': 'citizen_of', 'o': 4}, {'s': 8, 'p': 'citizen_of-x', 'o': 55}, {'s': 8, 'p': 'head_of_gov', 'o': 4}, {'s': 8, 'p': 'head_of_gov-x', 'o': 55}, {'s': 9, 'p': 'head_of', 'o': 59}, {'s': 9, 'p': 'member_of', 'o': 59}, {'s': 10, 'p': 'head_of', 'o': 3}, {'s': 10, 'p': 'member_of', 'o': 3}, {'s': 11, 'p': 'citizen_of', 'o': 66}, {'s': 11, 'p': 'citizen_of-x', 'o': 36}, {'s': 11, 'p': 'head_of_state', 'o': 66}, {'s': 11, 'p': 'head_of_state-x', 'o': 36}, {'s': 12, 'p': 'agent_of', 'o': 24}, {'s': 12, 'p': 'citizen_of', 'o': 24}, {'s': 12, 'p': 'citizen_of-x', 'o': 15}, {'s': 12, 'p': 'head_of_gov', 'o': 24}, {'s': 12, 'p': 'head_of_gov-x', 'o': 15}, {'s': 15, 'p': 'gpe0', 'o': 24}, {'s': 22, 'p': 'based_in0', 'o': 18}, {'s': 22, 'p': 'based_in0-x', 'o': 50}, {'s': 23, 'p': 'based_in0', 'o': 24}, {'s': 23, 'p': 'based_in0-x', 'o': 15}, {'s': 25, 'p': 'based_in0', 'o': 26}, {'s': 27, 'p': 'based_in0', 'o': 28}, {'s': 29, 'p': 'based_in0', 'o': 17}, {'s': 30, 'p': 'based_in0', 'o': 31}, {'s': 33, 'p': 'event_in0', 'o': 24}, {'s': 36, 'p': 'gpe0', 'o': 66}, {'s': 38, 'p': 'member_of', 'o': 2}, {'s': 43, 'p': 'agent_of', 'o': 41}, {'s': 43, 'p': 'citizen_of', 'o': 41}, {'s': 48, 'p': 'gpe0', 'o': 47}, {'s': 49, 'p': 'agent_of', 'o': 47}, {'s': 49, 'p': 'citizen_of', 'o': 47}, {'s': 49, 'p': 'citizen_of-x', 'o': 48}, {'s': 49, 'p': 'head_of_state', 'o': 47}, {'s': 49, 'p': 'head_of_state-x', 'o': 48}, {'s': 50, 'p': 'gpe0', 'o': 18}, {'s': 52, 'p': 'agent_of', 'o': 18}, {'s': 52, 'p': 'citizen_of', 'o': 18}, {'s': 52, 'p': 'citizen_of-x', 'o': 50}, {'s': 52, 'p': 'minister_of', 'o': 18}, {'s': 52, 'p': 'minister_of-x', 'o': 50}, {'s': 55, 'p': 'gpe0', 'o': 4}, {'s': 56, 'p': 'gpe0', 
'o': 5}, {'s': 57, 'p': 'in0', 'o': 4}, {'s': 57, 'p': 'in0-x', 'o': 55}, {'s': 58, 'p': 'in0', 'o': 65}, {'s': 59, 'p': 'institution_of', 'o': 2}, {'s': 59, 'p': 'part_of', 'o': 2}, {'s': 60, 'p': 'agent_of', 'o': 5}, {'s': 60, 'p': 'citizen_of', 'o': 5}, {'s': 60, 'p': 'citizen_of-x', 'o': 56}, {'s': 60, 'p': 'head_of_gov', 'o': 5}, {'s': 60, 'p': 'head_of_gov-x', 'o': 56}, {'s': 61, 'p': 'in0', 'o': 5}, {'s': 61, 'p': 'in0-x', 'o': 56}], 'frames': [{'type': 'none', 'slots': []}], 'iptc': ['04000000', '11000000', '20000344', '20000346', '20000378', '20000638']}
```

### Data Fields

- `id`: unique identifier of the article.
- `content`: textual content of the article, downloaded with the `src/dwie_download.py` script.
- `tags`: used to differentiate between the train and test sets of documents.
- `mentions`: a list of entity mentions in the article, each with the following keys:
  - `begin`: offset of the first character of the mention (inside the `content` field).
  - `end`: offset just past the last character of the mention (inside the `content` field); in the sample above, `content[9:29]` is 'Nabucco Gas Pipeline'.
  - `text`: the textual representation of the entity mention.
  - `concept`: the id of the entity that the mention refers to (multiple entity mentions in the article can refer to the same concept).
  - `candidates`: the candidate Wikipedia links.
  - `scores`: the prior probabilities of the candidate entity links, calculated on the Wikipedia corpus.
- `concepts`: a list of entities that cluster the entity mentions. Each entity is annotated with the following keys:
  - `concept`: the unique document-level entity id.
  - `text`: the text of the longest mention that belongs to the entity.
  - `keyword`: indicates whether the entity is a keyword.
  - `count`: the number of entity mentions in the document that belong to the entity.
  - `link`: the entity link to Wikipedia.
  - `tags`: multi-label classification labels associated with the entity.
- `relations`: a list of document-level relations between entities (concepts). Each relation is annotated with the following keys:
  - `s`: the subject entity id involved in the relation.
  - `p`: the predicate that defines the relation name (e.g., "citizen_of", "member_of").
  - `o`: the object entity id involved in the relation.
- `iptc`: multi-label article IPTC classification codes. For the detailed meaning of each code, please refer to the official IPTC code list.

## Dataset Creation

### Curation Rationale

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the source language producers?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the annotators?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Personal and Sensitive Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Discussion of Biases

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Other Known Limitations

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Additional Information

### Dataset Curators

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Licensing Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Citation Information

```
@article{zaporojets2021dwie,
  title={DWIE: An entity-centric dataset for multi-task document-level information extraction},
  author={Zaporojets, Klim and Deleu, Johannes and Develder, Chris and Demeester, Thomas},
  journal={Information Processing \& Management},
  volume={58},
  number={4},
  pages={102563},
  year={2021},
  publisher={Elsevier}
}
```

### Contributions

Thanks to [@basvoju](https://github.com/basvoju) for adding this dataset.

Provider:

DFKI-SLT

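Summary of original information

Dataset overview

Basic dataset information

Name: DWIE (Deutsche Welle corpus for Information Extraction)
Language: English
License: Other
Multilinguality: Monolingual
Size: 10M<n<100M
Source data: Original
Task categories: feature-extraction, text-classification
Task ID: entity-linking-classification
Papers with Code ID: acronym-identification
Tags: Named Entity Recognition, Coreference Resolution, Relation Extraction, Entity Linking

Dataset structure

Data fields

id: unique identifier of the article.
content: textual content of the article, downloaded with the src/dwie_download.py script.
tags: used to differentiate between the train and test sets of documents.
mentions: a list of entity mentions in the article, each with the following keys:
- begin: offset of the first character of the mention (inside the content field).
- end: offset just past the last character of the mention (inside the content field), so content[begin:end] yields the mention text.
- text: the textual representation of the entity mention.
- concept: the id of the entity that the mention refers to (multiple mentions in an article can refer to the same concept).
- candidates: the candidate Wikipedia links.
- scores: prior probabilities of the candidate entity links, calculated on the Wikipedia corpus.
concepts: a list of entities that cluster the entity mentions, each annotated with the following keys:
- concept: the unique document-level entity id.
- text: the text of the longest mention that belongs to the entity.
- keyword: indicates whether the entity is a keyword.
- count: the number of entity mentions in the document that belong to the entity.
- link: the entity link to Wikipedia.
- tags: multi-label classification labels associated with the entity.
relations: a list of document-level relations between entities (concepts), each annotated with the following keys:
- s: the subject entity id involved in the relation.
- p: the predicate that defines the relation name (e.g., "citizen_of", "member_of").
- o: the object entity id involved in the relation.
iptc: multi-label IPTC classification codes of the article.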
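
The field layout above can be walked with a few small helpers. The following is a minimal sketch, not code from the DWIE repository: the record `doc` is a hand-made toy example (its id, text and offsets are hypothetical), and the helper names `spans_match`, `coref_clusters` and `readable_relations` are introduced here only for illustration. It assumes `end` is exclusive, which matches the offsets in the card's sample record.

```python
from collections import defaultdict

# Hypothetical toy record that mimics the documented fields; it is hand-made
# for illustration and not taken from the corpus.
doc = {
    "id": "toy_0001",
    "content": "Angela Merkel met Barack Obama. Merkel leads Germany.",
    "mentions": [
        {"begin": 0, "end": 13, "text": "Angela Merkel", "concept": 0},
        {"begin": 18, "end": 30, "text": "Barack Obama", "concept": 1},
        {"begin": 32, "end": 38, "text": "Merkel", "concept": 0},
        {"begin": 45, "end": 52, "text": "Germany", "concept": 2},
    ],
    "concepts": [
        {"concept": 0, "text": "Angela Merkel", "count": 2},
        {"concept": 1, "text": "Barack Obama", "count": 1},
        {"concept": 2, "text": "Germany", "count": 1},
    ],
    "relations": [{"s": 0, "p": "head_of_gov", "o": 2}],
}


def spans_match(record):
    """True if content[begin:end] reproduces every mention's text."""
    text = record["content"]
    return all(text[m["begin"]:m["end"]] == m["text"] for m in record["mentions"])


def coref_clusters(record):
    """Group mention texts by the concept (entity) id they point to."""
    clusters = defaultdict(list)
    for m in record["mentions"]:
        clusters[m["concept"]].append(m["text"])
    return dict(clusters)


def readable_relations(record):
    """Replace subject/object concept ids with the concept's text."""
    names = {}
    for c in record["concepts"]:
        # `text` can be None in the card's sample, so fall back to the id.
        names[c["concept"]] = c["text"] or f"concept_{c['concept']}"

    def name(cid):
        # Relations may point at concepts missing from an abridged record.
        return names.get(cid, f"concept_{cid}")

    return [(name(r["s"]), r["p"], name(r["o"])) for r in record["relations"]]
```

Calling `coref_clusters(doc)` groups the two "Merkel" mentions under concept 0, and `readable_relations(doc)` turns the id triple into text. Keeping ids as the join key, rather than mention strings, follows the card's design, where relations are defined between concepts and not between individual mentions.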

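Dataset creation

Source data

Initial data collection and normalization: no details provided.
Source language producers: no details provided.

Annotations

Annotation process: no details provided.
Annotators: no details provided.

Personal and sensitive information

Handling of personal and sensitive information: no details provided.

Considerations for using the data

Social impact of the dataset: no details provided.
Discussion of biases: no details provided.
Other known limitations: no details provided.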

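Collected summary

Dataset introduction

Construction

DWIE was built as a jointly annotated, multi-task resource for information extraction. The dataset is drawn from Deutsche Welle news articles and was annotated by experts to combine four core tasks: named entity recognition, coreference resolution, relation extraction, and entity linking. Annotation is done at the document level, so the entity mentions, concept clusters, and relations within each document form one coherent structure. Entities are linked to Wikipedia entries and given multi-label type tags, which adds structured knowledge on top of the original text.

Features

DWIE's distinguishing features are its multi-task integration and its document-level annotation. It covers 23,130 entities in 311 multi-label entity types, 43,373 entity mentions clustered into those entities, and 21,749 annotated relations in 65 multi-label relation types. Named entities are linked to Wikipedia (version 20181115), and mentions carry candidate links with prior-probability scores, which supports entity disambiguation. Each record has fields for the content, mentions, concepts, relations, and IPTC classification codes.

Usage

DWIE can be loaded directly from the Hugging Face platform through its predefined configuration (Task_1). It is suited to training document-level multi-task information extraction models: the content field holds the raw text, and the mentions, concepts, and relations fields support entity recognition, coreference resolution, and relation classification. The single train split contains 802 documents with a uniform structure, and the tags field indicates which documents belong to the train set and which to the test set. The candidate lists and scores attached to mentions provide features for entity disambiguation, whether the model is built end to end or in stages.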
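
As a minimal sketch of the usage described above: `load_dwie` assumes the `datasets` library is installed and that the Hub repository `DFKI-SLT/DWIE` exposes the `Task_1` configuration named in the card metadata. The helpers `split_of` and `top_candidate` are hypothetical names introduced here, and the assumption that test documents carry a `'test'` tag is inferred from the field description, since the card's sample only shows `"['all', 'train']"`.

```python
import ast


def load_dwie():
    """Load the corpus from the Hub (assumption: `datasets` is installed
    and the `Task_1` config from the card metadata is available)."""
    from datasets import load_dataset  # imported lazily; needs network access

    return load_dataset("DFKI-SLT/DWIE", "Task_1", split="train")


def split_of(record):
    """Return 'train', 'test', or None from the stringified `tags` list.

    The card's sample stores tags as a string such as "['all', 'train']",
    so it is parsed with ast.literal_eval rather than indexed directly.
    """
    tags = ast.literal_eval(record["tags"])
    for name in ("train", "test"):  # 'test' tag is an assumption
        if name in tags:
            return name
    return None


def top_candidate(mention):
    """Pick the candidate link with the highest prior, or None.

    Handles the empty-list and None cases that appear in the card's sample.
    """
    candidates = mention.get("candidates") or []
    scores = mention.get("scores") or []
    if not candidates or len(candidates) != len(scores):
        return None
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best]
```

With a loaded dataset `ds`, something like `[r for r in ds if split_of(r) == "test"]` would recover the held-out documents under the tagging assumption above. Choosing the highest-prior candidate is only a simple baseline for entity linking, not the method used in the paper.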

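Background and challenges

Background overview

Document-level, multi-task information extraction has long been held back by a shortage of suitable data. To address this, Zaporojets et al. introduced DWIE (Deutsche Welle corpus for Information Extraction) in 2020; the Hugging Face copy described here is provided by DFKI-SLT. The dataset is built from Deutsche Welle news text and combines named entity recognition, coreference resolution, relation extraction, and entity linking, so that document-level information extraction can be evaluated on a single benchmark. It uses a multi-label classification scheme with 311 entity types and 65 relation types and is intended to support joint modeling across tasks.

Current challenges

DWIE targets the difficulty of coordinating several tasks in document-level information extraction: named entities, coreference chains, semantic relations, and links to an external knowledge base have to be annotated and modeled together. The main construction challenge is merging these annotation layers. Annotators must keep entity mentions, concept clusters, and relations consistent across long documents, and must also deal with overlapping categories caused by multi-label classification. Linking entities to the correct entry in a specific Wikipedia version additionally requires resolving ambiguous candidate links, which adds to the complexity and cost of building the data.

Common scenarios

Classic use cases

Document-level information extraction often requires several tasks to work together, and DWIE's layered annotations make it a natural testbed for this. Because it combines named entity recognition, coreference resolution, relation extraction, and entity linking, researchers can train and evaluate all four within the same documents. It is particularly suited to studying dependencies between tasks and joint optimization strategies.

Academic problems addressed

DWIE addresses two long-standing problems in information extraction research: tasks studied in isolation and missing document context. With uniformly annotated document-level text, researchers can study how entity mentions, coreference chains, and semantic relations depend on one another. This supports work on multi-task learning and end-to-end document modeling, and on downstream applications such as knowledge graph construction and event understanding.

Derived and related work

Follow-up work around DWIE falls into several directions. Some studies use its multi-task annotations to propose unified document-level information extraction frameworks with parameter sharing and joint inference across sub-tasks. Others focus on the interaction between coreference resolution and relation extraction to make entity associations in long documents more coherent. The entity linking annotations have also been used for neural linking models that combine context with knowledge-base priors.

The content above was collected and summarized by 遇见数据集.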