Dataset for an LLM score extraction challenge
Source: Figshare | Updated: 2025-11-25 | Indexed: 2026-04-28
Download link:
https://figshare.com/articles/dataset/Dataset_for_an_LLM_score_extraction_challenge/30712835
Description:
This zip file contains three plain text files: one describes the task, one contains the task and one contains the answers to the task. To access the information you will need to use the 7-zip software and enter LLM!!! when it prompts you for a password. Here is the challenge. Please do not share the files anywhere online: they are encrypted to prevent LLMs from reading the answers.

The second file contains an example of scores extracted by ChatGPT 4.1-mini and the accuracy statistics for the following prompt:

The following report gives one of the following scores 1* 2* 3* 4* or a number between and/or contains an evaluation or -1 to flag and unknown score. If the report is a score then return that score. Otherwise extract the final research quality score from this report, if there is one. Otherwise if it contains scores for originality rigour and significance then report the average of these three scores without reporting any calculations. Otherwise report -1 for missing value. Return your answer in this format

Where is one of 1* 2* 3* 4* or a number between or -1 for missing. Only output the score.

[Text with score goes here]

--------

The dataset includes outputs from Magistral, Llama 4 Scout and Gemma3 27b when asked to give a REF score to a journal article based on REF guidelines. Some outputs are truncated to 100 tokens or are truncated for other reasons.
Some contain a score, others don't.

The task is to use LLMs to obtain the REF score described by each report, or return -1 if it does not report a score. The scoring scale is 1* to 4*, and -1 should be returned if it is not possible to be confident about the score.

For background information, this is what the scores mean (from: https://2021.ref.ac.uk/guidance-on-results/guidance-on-ref-2021-results/index.html):

4*: Quality that is world-leading in terms of originality, significance and rigour.
3*: Quality that is internationally excellent in terms of originality, significance and rigour but which falls short of the highest standards of excellence.
2*: Quality that is recognised internationally in terms of originality, significance and rigour.
1*: Quality that is recognised nationally in terms of originality, significance and rigour.

The LLM should report either an overall score or, if no overall score is reported, the average of the significance, originality and rigour scores, provided all three are given. These scores should be ignored if one or two are missing.

To count as a correct answer, the LLM score must only include the number and (optionally) a star after the number. Additional spaces are also allowed at the start and end of the response as well as between the number and the star.

Examples of correct answer formats
3.4*
2
3*
4 *
-1

Examples of incorrect answer formats
1. 3**4**Score: 2*-1*

The gold standard is the score in the report (or -1) as judged by a human. Some of the gold standard judgements are subjective and you may disagree. For example, when three scores are given with no context then these are assumed to be rigour, originality and significance and rounded.
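The fallback order described above (overall score first, then the average of the three sub-scores only when all three are present, otherwise -1) can be sketched as follows. This is a minimal illustration, not the author's code: the function name `resolve_score` and its parameters are hypothetical, and no rounding is applied because the description does not say how the rounding is done.

```python
from typing import Optional

MISSING = -1  # value the task uses to flag "no score"


def resolve_score(
    overall: Optional[float] = None,
    originality: Optional[float] = None,
    rigour: Optional[float] = None,
    significance: Optional[float] = None,
) -> float:
    """Hypothetical helper applying the overall / average / missing rule."""
    # An explicit overall score always wins.
    if overall is not None:
        return overall
    parts = [originality, rigour, significance]
    # Average only when all three sub-scores are given.
    if all(p is not None for p in parts):
        return sum(parts) / 3
    # One or two sub-scores alone are ignored.
    return MISSING
```

For instance, `resolve_score(originality=3, rigour=4)` returns -1 because only two of the three sub-scores are available.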
When two scores are included, this is usually counted as an unknown score.

The number extracted is counted as correct if it is exact or within (For clarity, any symbol in the output other than a space and the following counts as an automatic fail: 0123456789.*-

Python code to check the answers is here: https://github.com/MikeThelwall/LargeLanguageModels/blob/main/1446_REF_score_reports_correct_scores.py
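The answer format rules can be illustrated with a small check. The sketch below is not the author's checker linked above; it is one possible reading of the rules. The names `parse_response` and `is_correct` are hypothetical, the regular expression is an assumption (a non-negative number with an optional star, or a bare -1 with no star), and `tolerance` is a required argument because the description here does not state the allowed margin.

```python
import re
from typing import Optional

# Assumed pattern: optional spaces, then either a number with an optional
# star (spaces allowed before the star) or the literal -1, then optional spaces.
_SCORE_RE = re.compile(r" *(?:(\d+(?:\.\d+)?) *\*?|(-1)) *")


def parse_response(response: str) -> Optional[float]:
    """Return the numeric score, or None if the format is not accepted."""
    m = _SCORE_RE.fullmatch(response)
    if m is None:
        return None
    return float(m.group(1) or m.group(2))


def is_correct(response: str, gold: float, tolerance: float) -> bool:
    """Hypothetical comparison: well-formed and within `tolerance` of gold."""
    value = parse_response(response)
    return value is not None and abs(value - gold) <= tolerance
```

Under these assumptions, `parse_response("4 *")` yields 4.0, while `parse_response("Score: 2*")` yields None because of the extra text.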
Created:
2025-11-25



