
AA-LCR

ModelScope community · Updated 2026-05-16 · Indexed 2025-11-03
Download link:
https://modelscope.cn/datasets/evalscope/AA-LCR
Resource description:
# Artificial Analysis Long Context Reasoning (AA-LCR) Dataset

AA-LCR includes 100 hard text-based questions that require reasoning across multiple real-world documents, with each document set averaging ~100k input tokens. Questions are designed so that answers cannot be retrieved directly from the documents and must instead be reasoned from multiple information sources.

## Dataset Development

AA-LCR was created through a rigorous multi-phase process involving several members of the Artificial Analysis research team and more than a dozen undergraduate students who were engaged on a short-term contract basis to write and/or validate questions.

**Document Curation**: We selected diverse document sets (company reports, government consultations, legal documents, academic papers) averaging ~100,000 tokens each, representing real materials that knowledge workers analyze.

**Question Creation**: Undergraduate students from various disciplines developed the questions. Through a dataset development dashboard, they had access to non-frontier test models (GPT-4o-mini, Llama-3.1-70B, Gemini 1.5 Flash) to validate question difficulty. These models were specifically chosen to give creators a sense of AI capabilities without access to frontier models, preventing adversarial selection against particular frontier models. Creators were instructed to develop practical questions requiring multi-document reasoning, and to ensure that the questions were hard enough that the above models failed to answer them correctly.

**Human Validation**: Every question was verified through human testing:

- Evaluators answered questions using the same document sets provided to AI models
- Human performance revealed the benchmark's challenging nature: individual evaluators achieved modest accuracy rates, typically answering 40-60% of questions correctly on the first attempt
- However, when presented with correct answers, evaluators showed high agreement, confirming question validity and demonstrating that, while difficult, the questions had clear, defensible answers
- Questions failing verification were revised or discarded
- Every question in AA-LCR was answered correctly by at least one human tester, ensuring all questions have verified solutions

This approach validates that AA-LCR tests genuine reasoning capabilities rather than obscure knowledge, while acknowledging the inherent difficulty of long context reasoning tasks even for human experts.

## Technical Details

AA-LCR comprises 100 questions across 7 types of text-only documents (i.e. Company Reports, Industry Reports, Government Consultations, Academia, Legal, Marketing Materials and Survey Reports). Multiple independent documents, forming a Document Set with a total length of ~100k tokens, are passed as context for each question. For instance, the Company Documents topic includes separate document sets containing 2023 and 2024 company reports, respectively. Each question requires using the Document Set and applying general and mathematical reasoning.


| Parent Category | Total Questions | Total Document Sets | Total Documents | Total Tokens | Average Tokens per Document Set |
|---|---|---|---|---|---|
| Company Documents | 63 | 16 | 92 | 1,476,239 | 92,265 |
| Industry Reports | 8 | 4 | 18 | 410,698 | 102,675 |
| Government Consultations | 11 | 3 | 60 | 325,254 | 108,418 |
| Academia | 5 | 2 | 14 | 223,776 | 111,888 |
| Legal | 6 | 2 | 23 | 233,050 | 116,525 |
| Marketing | 6 | 2 | 16 | 217,694 | 108,847 |
| Survey Reports | 1 | 1 | 11 | 93,046 | 93,046 |
| **Full Dataset** | **100** | **30** | **234** | **2,979,757** | **99,325** |

**Sample Question:**

```text
For the company and quarter where the company reported a 13.5% decline on the prior quarters operating income. What was their adjusted EBITDA? List the company name and adjusted EBITDA

Answer: Equinix, $901 million
```

Examples of other types of questions include:

- **Financial Analysis and Comparative Metrics:** Extract financial data and calculate performance metrics
- **Legal and Regulatory Interpretation:** Identify cases/policies under exclusion rules, interpret outcomes and applicability and surface cited sections/definitions
- **Multi-Document Information Synthesis:** Find and connect information scattered across multiple documents to identify themes and correlate data points
- **Temporal and Conditional Logic Analysis:** Track time-series trends, implement conditional decision rules, and determine threshold-based alerts or actions
- **Research and Classification:** Analyze patterns, classify and identify relevant documents to recall specific information

**Prompt Template:**

We load the relevant documents for each question into context in the same prompt as the question text. Pre-extracted document text can be found in AA-LCR_extracted-text.zip.

```python
documents_text = "\n\n".join(f"BEGIN DOCUMENT {i + 1}:\n{doc}\nEND DOCUMENT {i + 1}" for i, doc in enumerate(docs))

prompt = """BEGIN INPUT DOCUMENTS

{documents_text}

END INPUT DOCUMENTS

Answer the following question using the input documents provided above.

START QUESTION

{question}

END QUESTION
"""
```

Reported token counts per question are based on the completed prompt, using the `cl100k_base` tokenizer from `tiktoken`.

The order in which documents are loaded matters: they should be added to the prompt template in the order of the filenames in `data_source_filenames`.

Below are code snippets showing how we read the questions and extracted text files from disk.

```python
import csv
import os

from huggingface_hub import hf_hub_download


# Both functions take `self`: they are shown as methods lifted from a loader class.
def load_questions(self) -> list[dict]:
    """Load LCR questions from HuggingFace dataset"""
    csv_path = hf_hub_download(
        repo_id="ArtificialAnalysis/AA-LCR",
        filename="AA-LCR_Dataset.csv",
        repo_type="dataset",
    )

    questions = []
    with open(csv_path, encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for row in reader:
            # Parse data_source_filenames as ordered list
            if "data_source_filenames" in row and isinstance(row["data_source_filenames"], str):
                row["data_source_filenames"] = row["data_source_filenames"].split(";")
            # Parse answer as list (semicolon-separated criteria)
            if "answer" in row and isinstance(row["answer"], str):
                row["answer"] = row["answer"].split(";")
            questions.append(row)
    return questions


def get_document_set(
    self, dataset_folder: str, document_category: str, document_set_id: str, data_source_filenames: list[str]
) -> list[str]:
    """Get document set for a question in the order specified by data_source_filenames"""
    # Documents are extracted to lcr/lcr/{category}/{set_id}/ from the HuggingFace zip
    document_set_path = os.path.join(dataset_folder, document_category, document_set_id)

    document_texts = []
    for filename in data_source_filenames:
        document_path = os.path.join(document_set_path, filename)
        with open(document_path, encoding="utf-8") as f:
            document_texts.append(f.read())
    return document_texts
```

## Scoring Approach

We use an LLM-based equality checker to evaluate responses:

```
Assess whether the following CANDIDATE ANSWER is CORRECT or INCORRECT.
For the CANDIDATE ANSWER to be correct, it must be consistent with the OFFICIAL ANSWER.

The question, for reference only: {question}
The OFFICIAL ANSWER: {official_answer}
CANDIDATE ANSWER TO ASSESS: {candidate_answer}

Reply only with CORRECT or INCORRECT.
```

Qwen3 235B A22B 2507 Non-reasoning is used as the equality checker model.

## Access and Citation

The AA-LCR dataset is available at [https://huggingface.co/datasets/ArtificialAnalysis/AA-LCR](https://huggingface.co/datasets/ArtificialAnalysis/AA-LCR).

If you use AA-LCR in your research, please cite:

```bibtex
@dataset{artificialanalysis2025lcr,
  title={Artificial Analysis Long Context Reasoning Benchmark(LCR)},
  author={Artificial Analysis Team},
  year={2025},
  publisher={Artificial Analysis, Inc.}
}
```

## License

**Question set**: Licensed under the Apache License 2.0

**Document set**: Provided as a text representation of documents publicly available at time of dataset creation. We do not claim copyright or place any license over this data.
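
As a usage note on the Prompt Template section, the following is a minimal sketch of how the template might be filled and how a token count could be taken over the completed prompt. The names `PROMPT_TEMPLATE`, `build_prompt`, and `count_tokens` are hypothetical, the use of `str.format` to fill the placeholders is an assumption, and the blank lines between the template markers are also an assumption about the original layout. The tokenizer is passed in as a callable so the sketch runs without extra dependencies.

```python
from typing import Callable

# Assumed layout of the prompt template; marker text is taken from the README,
# the blank lines between markers are a guess.
PROMPT_TEMPLATE = """BEGIN INPUT DOCUMENTS

{documents_text}

END INPUT DOCUMENTS

Answer the following question using the input documents provided above.

START QUESTION

{question}

END QUESTION
"""


def build_prompt(docs: list[str], question: str) -> str:
    """Fill the template; `docs` must already be in data_source_filenames order."""
    documents_text = "\n\n".join(
        f"BEGIN DOCUMENT {i + 1}:\n{doc}\nEND DOCUMENT {i + 1}"
        for i, doc in enumerate(docs)
    )
    # str.format only parses the template, so braces inside documents are safe.
    return PROMPT_TEMPLATE.format(documents_text=documents_text, question=question)


def count_tokens(text: str, encode: Callable[[str], list]) -> int:
    """Count tokens with any encoder that maps a string to a list of tokens."""
    return len(encode(text))
```

With `tiktoken` installed, passing `tiktoken.get_encoding("cl100k_base").encode` as `encode` would match the tokenizer the README names. Whether the reported counts were produced exactly this way is not stated in the source.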

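The Scoring Approach section can be sketched in the same spirit. The helper names below (`build_checker_prompt`, `parse_verdict`, `accuracy`) and the `ask_checker` callable are hypothetical, and the line breaks in the checker prompt are assumed. How the checker model is actually called and how its reply is parsed are not described in the source; treating anything other than an exact `CORRECT` as incorrect is an assumption made here for safety.

```python
from typing import Callable

# Checker prompt wording from the README; line breaks are assumed.
CHECKER_TEMPLATE = """Assess whether the following CANDIDATE ANSWER is CORRECT or INCORRECT.
For the CANDIDATE ANSWER to be correct, it must be consistent with the OFFICIAL ANSWER.

The question, for reference only: {question}
The OFFICIAL ANSWER: {official_answer}
CANDIDATE ANSWER TO ASSESS: {candidate_answer}

Reply only with CORRECT or INCORRECT."""


def build_checker_prompt(question: str, official_answer: str, candidate_answer: str) -> str:
    """Fill the equality-checker prompt for one question."""
    return CHECKER_TEMPLATE.format(
        question=question,
        official_answer=official_answer,
        candidate_answer=candidate_answer,
    )


def parse_verdict(reply: str) -> bool:
    """Return True only for an exact CORRECT verdict (assumed parsing rule).

    "INCORRECT" contains "CORRECT", so a substring check would be wrong.
    """
    return reply.strip().strip(".").upper() == "CORRECT"


def accuracy(items: list[dict], ask_checker: Callable[[str], str]) -> float:
    """Share of items judged CORRECT; `ask_checker` wraps the checker model call."""
    if not items:
        return 0.0
    correct = sum(
        parse_verdict(
            ask_checker(
                build_checker_prompt(
                    item["question"], item["official_answer"], item["candidate_answer"]
                )
            )
        )
        for item in items
    )
    return correct / len(items)
```

Here `ask_checker` stands in for whatever client sends the prompt to the equality-checker model and returns its text reply.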
Provider:
maas
Created:
2025-10-21
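Summary
Dataset introduction
Background and challenges
Background overview
AA-LCR is a long-context reasoning benchmark of 100 hard questions. Each question requires reasoning over a multi-document set averaging about 100k tokens, and answers cannot be retrieved directly but must be synthesized from multiple sources. The dataset covers 7 types of real-world documents, including company reports and legal documents, and relies on rigorous human validation to ensure that the questions are challenging and the answers verifiable. It is intended to test the long-context reasoning ability of AI models.
The above content was collected and summarized by 遇见数据集.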