multitacred
Source: huggingface.co (indexed 2025-03-26)
Download link:
https://huggingface.co/datasets/DFKI-SLT/multitacred
Description:
MultiTACRED is a multilingual version of the large-scale TAC Relation Extraction Dataset
(https://nlp.stanford.edu/projects/tacred). It covers 12 typologically diverse languages from 9 language families,
and was created by the Speech & Language Technology group of DFKI (https://www.dfki.de/slt) by machine-translating the
instances of the original TACRED dataset and automatically projecting their entity annotations. For details of the
original TACRED's data collection and annotation process, see the Stanford paper (https://aclanthology.org/D17-1004/).
Translations are syntactically validated by checking the correctness of the XML tag markup. Any translations with an
invalid tag structure, e.g. missing or invalid head or tail tag pairs, are discarded (on average, 2.3% of the
instances).
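The validation step described above can be illustrated with a minimal sketch. The tag names `<H>` and `<T>`, the helper names, and the sample sentences are all assumptions made for illustration; the markup and checks actually used by the authors may differ.

```python
import re

# Hypothetical tag names; the markup actually used for MultiTACRED may differ.
HEAD_TAG = "H"
TAIL_TAG = "T"


def has_valid_pair(text: str, tag: str) -> bool:
    """Return True if `text` holds exactly one ordered, non-empty <tag>...</tag> pair."""
    opens = [m.start() for m in re.finditer(f"<{tag}>", text)]
    closes = [m.start() for m in re.finditer(f"</{tag}>", text)]
    if len(opens) != 1 or len(closes) != 1:
        return False
    # Text between the opening and closing tag must contain a non-whitespace character.
    inner = text[opens[0] + len(tag) + 2 : closes[0]]
    return opens[0] < closes[0] and bool(inner.strip())


def keep_translation(text: str) -> bool:
    """Keep a translation only if both the head and the tail tag pair are intact."""
    return has_valid_pair(text, HEAD_TAG) and has_valid_pair(text, TAIL_TAG)


# Made-up example translations, not taken from the dataset.
translations = [
    "<H>Marie Curie</H> wurde in <T>Warschau</T> geboren.",  # both pairs intact
    "Marie Curie wurde in <T>Warschau</T> geboren.",         # head pair missing
    "<H>Marie Curie wurde in <T>Warschau</T> geboren.",      # head closing tag missing
]
kept = [t for t in translations if keep_translation(t)]
```

Only the first sentence survives the filter; the other two would be discarded because their head markup is incomplete.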
The languages covered are Arabic, Chinese, Finnish, French, German, Hindi, Hungarian, Japanese, Polish,
Russian, Spanish, and Turkish. The intended use is supervised relation classification; the intended audience is
researchers.
Please see our ACL paper (https://arxiv.org/abs/2305.04582) for full details.
NOTE: This DatasetReader supports a reduced version of the original TACRED JSON format with the following changes:
- Removed fields: stanford_pos, stanford_ner, stanford_head, stanford_deprel, docid
These fields were removed to support additional languages, for which they were either not required or not
available. The reader expects a language-specific configuration that specifies the variant (original, revisited or
retacred) and the language (as a two-letter ISO code).
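As a rough sketch of how such a configuration might be selected: the helper names below are hypothetical, and both the `"<variant>-<language>"` naming scheme and the need for a local `data_dir` are assumptions that should be checked against the dataset page. Only `datasets.load_dataset` and its `name` and `data_dir` parameters are part of the Hugging Face library itself.

```python
# Variants and two-letter ISO codes for the 12 languages listed in the description.
VARIANTS = {"original", "revisited", "retacred"}
LANGUAGES = {"ar", "de", "es", "fi", "fr", "hi", "hu", "ja", "pl", "ru", "tr", "zh"}


def config_name(variant: str, language: str) -> str:
    """Build a configuration name (hypothetical helper)."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    if language not in LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")
    # ASSUMPTION: configurations are named "<variant>-<language>", e.g. "original-de".
    return f"{variant}-{language}"


def load_multitacred(variant: str, language: str, data_dir: str):
    """Load one language/variant configuration (hypothetical helper).

    ASSUMPTION: the data files are available locally under `data_dir`.
    """
    # Imported lazily so that config_name() can be used without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "DFKI-SLT/multitacred",
        name=config_name(variant, language),
        data_dir=data_dir,
    )
```

With the German data in a local directory, `load_multitacred("original", "de", "/path/to/data")` would return a dataset whose splits are accessed by name. Depending on the installed version of `datasets`, script-based readers may need additional settings.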
The DatasetReader changes the offsets of the following fields, to conform with standard Python usage (see
_generate_examples()):
- subj_end to subj_end + 1 (make end offset exclusive)
- obj_end to obj_end + 1 (make end offset exclusive)
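The effect of the offset change can be shown with a small sketch. The function name and the sample record are made up for illustration, and the `token`, `subj_start` and `obj_start` field names are assumed from the TACRED JSON format; this is not the reader's actual `_generate_examples()` code.

```python
def to_exclusive_offsets(example: dict) -> dict:
    """Return a copy of `example` with exclusive subj_end / obj_end offsets."""
    converted = dict(example)
    converted["subj_end"] = example["subj_end"] + 1
    converted["obj_end"] = example["obj_end"] + 1
    return converted


# Made-up record in the reduced TACRED format, with inclusive end offsets.
raw = {
    "token": ["Marie", "Curie", "was", "born", "in", "Warsaw", "."],
    "subj_start": 0,
    "subj_end": 1,  # inclusive: last subject token is "Curie"
    "obj_start": 5,
    "obj_end": 5,   # inclusive: single-token object "Warsaw"
}

example = to_exclusive_offsets(raw)

# With exclusive end offsets, plain Python slicing yields the entity spans.
subject = example["token"][example["subj_start"]:example["subj_end"]]
obj = example["token"][example["obj_start"]:example["obj_end"]]
```

Here `subject` is `["Marie", "Curie"]` and `obj` is `["Warsaw"]`. With the original inclusive offsets, the same slices would drop the last token of each span.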
NOTE 2: The MultiTACRED dataset offers an additional 'split', namely the backtranslated test data (translated to a
target language and then back to English). To access this split, use dataset['backtranslated_test'].
You can find the TACRED dataset reader for the English version of the dataset at
https://huggingface.co/datasets/DFKI-SLT/tacred.
Provider:
huggingface.co



