
Futurex-Past

ModelScope Community | Updated 2025-11-27 | Indexed 2025-09-27
Download link:
https://modelscope.cn/datasets/futurex-ai/Futurex-Past
Resource description:
# FutureX-Past

## 📜 Overview

This repository contains a dataset of past questions from the **FutureX benchmark**.

FutureX is a live, dynamic benchmark designed to evaluate the future prediction capabilities of Large Language Model (LLM) agents. It features a fully automated pipeline that generates new questions about upcoming real-world events, deploys agents to predict their outcomes, and scores the results automatically. For more information on the live benchmark, please refer to our [technical report/blog post link].

The events corresponding to the questions in *this* dataset have already occurred. This historical data is not suitable for evaluating live prediction, but it remains a valuable resource for a variety of other research and development purposes.

## ✨ Why Use This Dataset?

This dataset provides a rich collection of complex, real-world questions that required timely information retrieval and reasoning to solve. It is a valuable asset for:

- **Model Behavior Analysis**: Study how different LLM agents attempt to solve these problems. Analyze their tool usage, search queries, and reasoning paths when faced with uncertainty.
- **Reinforcement Learning**: Use the dataset to train RL agents to predict the future by controlling the date of the search engine (for example, limiting results to those published before the event).
- **Search and Information Retrieval Evaluation**: Since the ground truth answers are known, this dataset serves as a high-quality testbed for evaluating an agent's ability to find specific, time-sensitive information on the web.
- **Static QA Benchmark**: The dataset can be used as a challenging static question-answering benchmark that requires models to integrate knowledge and reason about events, even if the "future" aspect is removed.

## ⚠️ Important Note on Usage

This dataset consists of **historical data**. The outcomes of all events are known and may be part of the training data of more recent models. Therefore, it **should not** be used to evaluate the *live future prediction* capabilities of LLMs, as this would lead to contaminated and invalid results. For live evaluation, please refer to the ongoing weekly challenge (https://futurex-ai.github.io/).

## 💾 Dataset Schema

The dataset is provided in a structured format (e.g., CSV, JSON Lines). Each entry corresponds to a single prediction task and contains the following fields:

- `question_id` (string): A unique identifier for the question.
  - *Example: `620165c0-1c39-442a-9ac9-93e179e8c33e`*
- `question` (string): The prediction question that was posed to the agent.
  - *Example: "北京时间2024年8月1日晚上8点,美联储的联邦基金利率目标范围是多少?" ("What is the Federal Reserve's target range for the federal funds rate at 8 p.m. Beijing time on August 1, 2024?")*
- `answer` (string): The ground truth answer, recorded after the event occurred.
  - *Example: "5.25%"*
- `setting_time` (timestamp): The date and time when the question was generated and posed.
  - *Example: `2025-07-28`*
- `options` (string/array): For multiple-choice questions (Levels 1 & 2), this field contains the possible options. It may be null for open-ended questions.
  - *Example: `["A", "D"]`*
- `level` (integer): The difficulty level of the question, from 1 to 4, as defined by the FutureX benchmark.
  1. **Basic** (Few Choices)
  2. **Wide Search** (Many Choices)
  3. **Deep Search** (Open-ended, Low Volatility)
  4. **Super Agent** (Open-ended, High Volatility)
- `prompt` (string): The full prompt that was provided to the LLM agent for the task.
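
As a rough illustration of the schema above, the following sketch parses JSON Lines records into a typed structure and separates multiple-choice questions (Levels 1 and 2) from open-ended ones. Several things here are assumptions rather than facts about the dataset: the sample rows are invented for illustration, the `FutureXRecord` class and helper names are hypothetical, `options` is assumed to arrive either as a JSON array or as a JSON-encoded string, and exact string match is used only as a simple stand-in for static QA scoring, not as the FutureX scoring method.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class FutureXRecord:
    """One prediction task, mirroring the fields listed in the schema."""
    question_id: str
    question: str
    answer: str
    setting_time: str          # kept as a string; the exact format is not specified
    options: Optional[list]    # None for open-ended questions
    level: int                 # 1 to 4
    prompt: str


def parse_options(raw):
    """Normalize `options`: null, a list, or (assumed) a JSON-encoded string."""
    if raw is None or raw == "":
        return None
    if isinstance(raw, list):
        return raw
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        return [raw]           # fall back to treating the value as one option
    return parsed if isinstance(parsed, list) else [parsed]


def parse_jsonl(text):
    """Parse JSON Lines text into a list of FutureXRecord objects."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        records.append(FutureXRecord(
            question_id=row["question_id"],
            question=row["question"],
            answer=row["answer"],
            setting_time=row["setting_time"],
            options=parse_options(row.get("options")),
            level=int(row["level"]),
            prompt=row["prompt"],
        ))
    return records


def is_multiple_choice(record):
    """Levels 1 and 2 are described as multiple-choice in the schema."""
    return record.level in (1, 2)


def exact_match_accuracy(records, predictions):
    """Share of records whose prediction equals the answer (an assumed metric)."""
    if not records:
        return 0.0
    hits = sum(
        1 for r in records
        if predictions.get(r.question_id, "").strip() == r.answer.strip()
    )
    return hits / len(records)


# Hypothetical rows for illustration only; they are not taken from the dataset.
SAMPLE_ROWS = [
    {"question_id": "q-001", "question": "Which option occurs?", "answer": "A",
     "setting_time": "2025-07-28", "options": "[\"A\", \"B\"]", "level": 1,
     "prompt": "Full prompt text for q-001"},
    {"question_id": "q-002", "question": "What value is reported?", "answer": "42",
     "setting_time": "2025-07-28", "options": None, "level": 3,
     "prompt": "Full prompt text for q-002"},
]
SAMPLE_JSONL = "\n".join(json.dumps(row) for row in SAMPLE_ROWS)

if __name__ == "__main__":
    recs = parse_jsonl(SAMPLE_JSONL)
    print([r.question_id for r in recs if is_multiple_choice(r)])
    print(exact_match_accuracy(recs, {"q-001": "A"}))
```

Keeping `setting_time` as a plain string avoids guessing at a timestamp format the schema does not pin down; a stricter loader could parse it once the actual files have been inspected.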
## 🤝 Citation

If you use this dataset in your research, please cite the original FutureX paper:

```
@misc{zeng2025futurexadvancedlivebenchmark,
      title={FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction},
      author={Zhiyuan Zeng and Jiashuo Liu and Siyuan Chen and Tianci He and Yali Liao and Jinpeng Wang and Zaiyuan Wang and Yang Yang and Lingyue Yin and Mingren Yin and Zhenwei Zhu and Tianle Cai and Zehui Chen and Jiecao Chen and Yantao Du and Xiang Gao and Jiacheng Guo and Liang Hu and Jianpeng Jiao and Xiangsheng Li and Jingkai Liu and Shuang Ni and Zhoufutu Wen and Ge Zhang and Kaiyuan Zhang and Xin Zhou and Jose Blanchet and Xipeng Qiu and Mengdi Wang and Wenhao Huang},
      year={2025},
      eprint={2508.11987},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2508.11987},
}
```

Provider: maas
Created: 2025-08-28