SemEval-2021 Task 1: Lexical Complexity Prediction

NIAID Data Ecosystem2026-03-12 收录

下载链接：

https://zenodo.org/record/5137149

下载链接

链接失效反馈

官方服务：

资源简介：

We ran a shared task on Lexical Complexity Prediction via SemEval using this data. The data was released according to the following schedule: Trial data available: July 31, 2020 Training data available: September 4, 2020 Test data available/Evaluation starts: January 11, 2021 See the task website for further information: https://sites.google.com/view/lcpsharedtask2021 See the CodaLab site to view submissions: https://competitions.codalab.org/competitions/27420 If you find this dataset useful, please cite the task paper as: Shardlow, M. Evans, R., Paetzold, G and Zampieri, M. SemEval-2021 Task 1: Lexical Complexity Prediction. In Proceedings of the 15th International Workshop on Semantic Evaluation. 2021. The script evaluate.py is the same evaluation script that was used as part of the shared task. You can run this by calling "python evaluate.py /res/ /ref/" where res and ref are directories containing the system's results and the reference labels respectively. The trial, training and test data are available. The trial data comprises of 99 MWEs (29 bible, 33 biomed and 37 europarl) and 421 single word instances (143 bible, 135 biomed and 143 europarl). The training data comprises of 1,517 MWEs (505 bible, 514 biomed and 498 europarl) and 7,662 single word instances (2,574 bible, 2,576 biomed and 2,512 europarl). The test data comprises of 184 MWEs (66 bible, 53 biomed and 65 europarl) and 917 single word instances (283 bible, 289 biomed and 345 europarl). It is compressedd and encrypted using 7zip as above. The password will be released to those registered for the task. The order of the test data has been randomised to prevent systems from overfitting to the data ordering. All submissions should be made via CodaLab as above. The trial and training data is arranged by token and sorted by complexity (i.e., instances with the same token appear together in groups, and the groups are sorted by the complexity of the lowest scored item). Tokens that appear in one partition will not appear in another partition. We have deliberately included more than one instance of a token where possible to identify places where the context affects the complexity of a word. Consider, for example, the following two sentences from the trial data: We now have a proposal on the table which lays down strict emissions values for the next ten years and which simultaneously creates clarity and incentives for technological innovations. Mr President, in coordination with other groups, I would like to table an oral amendment concerning the draft bill in the Duma to ignore certain rulings of the European Court of Human Rights. When table is used as a noun in the first sentence (albeit in an abstract sense) it receives a score of 0.01 indicating it is very easily understood. However, when it is used as a verb in the second, it is scored at 0.23, indicating a more difficult word to comprehend. The data comes from three sources: biblical text, biomedical articles and proceedings of the European Parliament. These sources were selected as they contain a natural mixture of common language and difficult to understand expressions, whilst each containing vastly different domain-specific vocabulary.

创建时间：

2021-07-26

5,000+

优质数据集

54 个

任务类型

进入经典数据集