
BrowseCompLongContext

ModelScope community · updated 2026-01-06 · indexed 2025-08-09
Download link:
https://modelscope.cn/datasets/openai-mirror/BrowseCompLongContext
Description:
# BrowseComp Long Context

BrowseComp Long Context is a dataset based on [BrowseComp](https://openai.com/index/browsecomp/) that benchmarks an LLM's ability to retrieve relevant information from noisy data in its context. It converts the agentic question-answering tasks from BrowseComp into long-context tasks.

For each question in a subset of BrowseComp, a list of URLs is attached. Each URL is paired with an indicator of whether the content of the web page is required to answer the question or is additional content that serves as supplementary information or noise. The required URLs were collected and reviewed by a human to ensure they are sufficient and necessary to answer the original question. The additional URLs were obtained by searching for related questions that can help answer the original question.

The data is extensible to different context windows: with the provided list of URLs, it is feasible to construct model prompts beyond a 1M-token context window.

This eval is challenging because:

- The constructed prompt is based on real data where most of the context is somewhat relevant, as opposed to a broad web corpus where very little data is relevant
- The model must combine multiple pieces of information in order to answer the question
- The order in which information appears in the context might not match the order in which it is needed in the reasoning flow
- The model must avoid being confused by additional information that is relevant but not required
- The longer the context, the harder the task

## Data Schema

Each row contains:

- Problem
- Answer
- A list of URLs, each paired with "required" or "additional" to indicate whether the URL is required to answer the question.

Data can be loaded with the following method.

```python
import base64
import hashlib


def derive_key(password: str, length: int) -> bytes:
    """Derive a fixed-length key from the password using SHA256."""
    hasher = hashlib.sha256()
    hasher.update(password.encode())
    key = hasher.digest()
    # Repeat the 32-byte digest until it covers `length` bytes.
    return key * (length // len(key)) + key[: length % len(key)]


def decrypt(ciphertext_b64: str, password: str) -> str:
    """Decrypt base64-encoded ciphertext with XOR."""
    encrypted = base64.b64decode(ciphertext_b64)
    key = derive_key(password, len(encrypted))
    decrypted = bytes(a ^ b for a, b in zip(encrypted, key))
    return decrypted.decode()


# `encrypted_data` is the list of encrypted rows loaded from the dataset;
# each row carries its own password in the "canary" field.
data = [
    {
        "problem": decrypt(row["problem"], row["canary"]),
        "answer": decrypt(row["answer"], row["canary"]),
        "urls": decrypt(row["urls"], row["canary"]),
    }
    for row in encrypted_data
]
```

## Reference prompt construction implementation

```python
def _fit_pages(self, pages: list[str], token_budget: int) -> tuple[int, int]:
    """
    Fit pages into a token budget.

    Args:
        pages: list of pages to fit into the token budget.
        token_budget: the token budget.

    Returns:
        tuple:
            - int: number of pages fitted into the token budget.
            - int: token budget remaining.
    """
    fitted_pages = 0
    for page in pages:
        page_tokens = self._count_token(page)
        if page_tokens <= token_budget:
            token_budget -= page_tokens
            fitted_pages += 1
        else:
            # Stop at the first page that does not fit; later pages are not tried.
            break
    return fitted_pages, token_budget


def render_prompt(self, problem: str, urls: list[tuple[str, bool]], token_budget: int) -> tuple[str, bool]:
    """
    Render a prompt for a given problem and a list of URLs.

    Args:
        problem: The problem to answer.
        urls: List of (url, is_required) pairs to use to answer the problem.
        token_budget: The token budget.

    Returns:
        tuple:
            - str: constructed model prompt.
            - bool: whether the prompt was constructed successfully.
    """
    initial_msg = f"""Given a list of websites, answer the following question: {problem}\n
Your final answer should be a concise sentence, in the following format: Final Answer: put your answer here.
It's critical your answer is concise and following the format strictly.\n"""
    final_msg = f"""\nNow answer the original question, recall the question is: {problem}
VERY IMPORTANT: Do not use any web search tools or browser tools to answer the question, you may only use the provided documents to answer the question."""

    # Reserve room for the fixed instructions before fitting any pages.
    token_budget -= self._count_token(initial_msg) + self._count_token(final_msg)

    required_pages = [self._fetch_url(url) for url, is_required in urls if is_required]
    additional_pages = [self._fetch_url(url) for url, is_required in urls if not is_required]

    # Every required page must fit; otherwise the prompt cannot be built.
    num_required_fitted, token_budget = self._fit_pages(required_pages, token_budget)
    if num_required_fitted < len(required_pages):
        return "", False

    # Fill the remaining budget with additional pages.
    num_additional_fitted, token_budget = self._fit_pages(additional_pages, token_budget)

    page_msgs = [*required_pages[:num_required_fitted], *additional_pages[:num_additional_fitted]]
    # Shuffle so required pages are not grouped at the start of the context.
    self._rng.shuffle(page_msgs)
    return "\n".join([initial_msg, *page_msgs, final_msg]), True
```

\* Note: the implementation and quality of the `_fetch_url` method can affect the benchmark results. It is recommended to use a consistent implementation of this method across runs.

## Grading

Grading follows the same method as [BrowseComp](https://openai.com/index/browsecomp/). More specifically, it can be done by prompting a model with a grading template that provides the question, the model response, and the reference answer.
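The reference prompt construction above is written as two methods that rely on `self._count_token`, `self._fetch_url`, and `self._rng`, none of which are defined in this card. The following is a minimal sketch of a host class that makes the logic runnable offline. Everything outside the two reference methods is an assumption made for illustration: the class name `PromptBuilder`, the whitespace-based token count, the in-memory page dictionary that stands in for real fetching, the `example.com` URLs, and the abbreviated instruction messages.

```python
import random


class PromptBuilder:
    """Hypothetical host class; every helper here is a stand-in, not the official one."""

    def __init__(self, pages: dict[str, str], seed: int = 0) -> None:
        self._pages = pages              # assumption: pre-fetched page text keyed by URL
        self._rng = random.Random(seed)  # seeded so the shuffle is reproducible

    def _count_token(self, text: str) -> int:
        # Assumption: whitespace-separated words approximate tokens.
        return len(text.split())

    def _fetch_url(self, url: str) -> str:
        # Assumption: pages come from the in-memory dict, not the network.
        return self._pages[url]

    def _fit_pages(self, pages: list[str], token_budget: int) -> tuple[int, int]:
        fitted = 0
        for page in pages:
            cost = self._count_token(page)
            if cost > token_budget:
                break  # stop at the first page that does not fit
            token_budget -= cost
            fitted += 1
        return fitted, token_budget

    def render_prompt(
        self, problem: str, urls: list[tuple[str, bool]], token_budget: int
    ) -> tuple[str, bool]:
        # Abbreviated versions of the reference instruction messages.
        initial_msg = f"Given a list of websites, answer the following question: {problem}\n"
        final_msg = f"\nNow answer the original question, recall the question is: {problem}"
        token_budget -= self._count_token(initial_msg) + self._count_token(final_msg)

        required = [self._fetch_url(u) for u, is_required in urls if is_required]
        additional = [self._fetch_url(u) for u, is_required in urls if not is_required]

        n_required, token_budget = self._fit_pages(required, token_budget)
        if n_required < len(required):
            return "", False  # a required page did not fit
        n_additional, token_budget = self._fit_pages(additional, token_budget)

        page_msgs = [*required[:n_required], *additional[:n_additional]]
        self._rng.shuffle(page_msgs)
        return "\n".join([initial_msg, *page_msgs, final_msg]), True


# Three ten-word pages: two required, one additional.
pages = {
    "https://example.com/a": "alpha " * 10,
    "https://example.com/b": "beta " * 10,
    "https://example.com/c": "gamma " * 10,
}
urls = [
    ("https://example.com/a", True),
    ("https://example.com/b", True),
    ("https://example.com/c", False),
]
problem = "Which page mentions alpha?"

builder = PromptBuilder(pages)
prompt_full, ok_full = builder.render_prompt(problem, urls, token_budget=100)
prompt_mid, ok_mid = builder.render_prompt(problem, urls, token_budget=50)
prompt_small, ok_small = builder.render_prompt(problem, urls, token_budget=40)
```

With these stand-ins the two instruction messages cost 26 tokens. A budget of 100 fits all three pages, a budget of 50 fits only the two required pages, and a budget of 40 cannot fit both required pages, so `render_prompt` returns `("", False)`. Swapping in a real tokenizer and fetcher changes the counts but not the control flow.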

Provider:
maas
Created:
2025-08-08