
QAmultilabelEURLEXsamples

IEEE, indexed 2026-04-17
Download link:
https://ieee-dataport.org/documents/qamultilabeleurlexsamples
Description:
This dataset is a sample drawn from EURLEX57k and built for a multi-answer question answering task with EUROVOC. Each legal document in EURLEX57k is assigned several labels from the European Vocabulary (EUROVOC), which maintains thousands of concepts such as "export industry" and "organic acid". The sample is chosen before the data are built. A Z-score-based online sample size calculator is used to determine the sample sizes, with a 95% confidence level and a 5% margin of error. The computation gives a train sample size of 381 out of 45,000. Additionally, 362 out of 6,000 were drawn for each of the validation and test samples. After data building, the train, validation, and test sets contain 1708, 1650, and 1648 examples, respectively. This dataset is the validation sample. Data building is the initial stage in preparing the dataset for the multi-answer question answering task with a label hierarchy. The simulation data are obtained by sampling the EURLEX dataset via the Z-score. The labels for the multiple answers are obtained by mapping the labeled EUROVOC concepts to the subdomain trees (category lists) in the EUROVOC hierarchy. Labels and titles (or text) are then combined as the inputs for an extractive multi-answer question answering task. Titles have been shown to yield performance similar to the full legal document (Chalkidis et al., 2019), so they can be used to work around the long-input problem for pre-trained models with restricted input lengths. In the second step, tokenization and label alignment are used to process the inputs. The third step fine-tunes pre-trained BERT-based models for the multi-answer question answering task on the pre-processed data. The performance of the fine-tuned models is then assessed on the validation and test samples using seqeval and the suggested auxiliary classification metric. The key elements of the methodology are presented in the subsections.
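The sample sizes quoted above can be checked with a short calculation. The sketch below is an assumption about what the online calculator computes: it uses Cochran's formula with a finite population correction, z = 1.96 for the 95% confidence level, a worst-case proportion p = 0.5, and rounds up. The function name `sample_size` and its parameters are illustrative and do not come from the dataset authors.

```python
import math


def sample_size(population, z=1.96, margin=0.05, p=0.5):
    """Sample size for a finite population (assumed formula, see lead-in).

    z      : Z-score for the confidence level (1.96 for 95%)
    margin : margin of error (0.05 for 5%)
    p      : assumed proportion; 0.5 gives the most conservative size
    """
    # Sample size for an infinite population (Cochran's formula).
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Finite population correction for a population of the given size.
    n = n0 / (1 + (n0 - 1) / population)
    # Round up so the margin of error is not exceeded.
    return math.ceil(n)


if __name__ == "__main__":
    print(sample_size(45_000))  # train split: 381
    print(sample_size(6_000))   # validation / test split: 362
```

Under these assumptions the results agree with the 381 and 362 figures stated in the description; a calculator that rounds to the nearest integer instead of rounding up would report 361 for the 6,000-document splits.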
Provided by:
LI, WANG