
HSCodeComp

ModelScope Community | Updated: 2026-01-06 | Indexed: 2025-11-03
Download link:
https://modelscope.cn/datasets/AIDC-AI/HSCodeComp
Resource description:
# HSCodeComp: A Realistic and Expert-Level Benchmark for Deep Search Agents in Hierarchical Rule Application

[Paper](https://arxiv.org/abs/2510.19631) | [Code](https://github.com/AIDC-AI/Marco-Search-Agent/tree/main/HSCodeComp) | [Dataset on Hugging Face](https://huggingface.co/datasets/AIDC-AI/HSCodeComp)

[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) [![Python 3.10+](https://img.shields.io/badge/Python-3.10%2B-blue.svg)](https://www.python.org/downloads/) [![Hugging Face](https://img.shields.io/badge/🤗%20Hugging%20Face-Dataset-yellow.svg)](https://huggingface.co/datasets/AIDC-AI/HSCodeComp)

<div align="center">

⭐ _**MarcoPolo Team**_ ⭐

[_**Alibaba International Digital Commerce**_](https://aidc-ai.com)

🗂️ [**Data**](https://github.com/AIDC-AI/Marco-Search-Agent/tree/main/HSCodeComp/data/test_data.jsonl)

</div>

---

## 📌 Overview

<div align="center">
<img src="assets/overview.png" alt="Overview" width="60%" style="display: inline-block; vertical-align: top; margin-right: 2%;">
<img src="assets/teaser_img.png" alt="Teaser" width="34%" style="display: inline-block; vertical-align: top;">
</div>

**HSCodeComp** is the first realistic, expert-level e-commerce benchmark designed to evaluate **deep search agents** on Level-3 knowledge, that is, **hierarchical rule application**, a critical yet overlooked capability in current agent evaluation frameworks.

The task requires agents to predict the exact **10-digit Harmonized System Code (HSCode)** for products with **noisy, real-world e-commerce descriptions** by correctly applying complex, hierarchical tariff rules (e.g., from eWTP and official customs rulings). These rules often contain **vague language** and **implicit logic**, making accurate classification highly challenging.

Our evaluation reveals a stark performance gap:

* 🔹 **Best AI agent (SmolAgent + GPT-5 VLM): 46.8%**
* 🔹 **Human experts: 95.0%**

In addition, an ablation study shows that **inference-time scaling fails to improve performance**. Together, these results highlight that deep search with **hierarchical rule application** remains a major unsolved challenge for state-of-the-art AI agent systems.

---

## 🔥 News

* [2025/10] 🔥 We released the [paper](https://arxiv.org/abs/2510.19631) and [dataset](https://huggingface.co/datasets/AIDC-AI/HSCodeComp) for our challenging HSCodeComp benchmark.

---

## 📋 Dataset

![](assets/example_product_v3.png)

This figure shows the data format of the HSCodeComp dataset.

### Input

Each product $x \in \mathcal{X}$ contains rich information: $x = (t, A, c, i, p, u, r)$, where:

- **$t$**: Product title
- **$A = \{(k_j, v_j)\}_{j=1}^K$**: Set of $K$ product attributes (e.g., material, package size)
- **$c$**: Product categories defined by the e-commerce platform
- **$p$**: Price
- **$u$**: Currency

### Knowledge: Hierarchical Rules

The task requires agents to effectively utilize three types of e-commerce domain knowledge:

1. **Hierarchical tariff rules** from official classification systems (e.g., eWTP) with complex implicit logic and vague linguistic constraints
2. **Human-written decision rules** that specify how to correctly apply tariff rules
3. **Official customs rulings databases** (e.g., U.S. CROSS) containing historical HSCode classification decisions

### Output

The HSCode $y \in \mathcal{Y}$ is a single **10-digit numeric string**, with $\mathcal{Y} \subseteq \{0,1,\ldots,9\}^{10}$. The HSCode structure is hierarchical:

- **First 2 digits**: HS chapter
- **First 4 digits**: HS heading
- **First 6 digits**: HS sub-heading
- **Last 4 digits (7-10)**: Country-specific codes

The 10-digit HSCode must follow a valid path in the official HS taxonomy. Please refer to [our paper](https://arxiv.org/abs/2510.19631) for more details about the data.
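
The hierarchical structure described above can be illustrated with a small helper that splits a 10-digit code into its levels. This is only a sketch: `HSCodeParts` and `split_hscode` are hypothetical names that are not part of the HSCodeComp repository, and the example code string is illustrative rather than taken from the dataset. The helper checks the string format only; it does not verify that the code follows a valid path in the official HS taxonomy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HSCodeParts:
    """Hierarchical prefixes of a 10-digit HSCode (hypothetical helper type)."""
    chapter: str         # first 2 digits
    heading: str         # first 4 digits
    subheading: str      # first 6 digits
    country_suffix: str  # digits 7-10 (country-specific)


def split_hscode(code: str) -> HSCodeParts:
    """Split a 10-digit numeric string into its hierarchical levels.

    Only the format is validated here; taxonomy membership is not checked.
    """
    if len(code) != 10 or not (code.isascii() and code.isdigit()):
        raise ValueError(f"expected a 10-digit numeric string, got {code!r}")
    return HSCodeParts(
        chapter=code[:2],
        heading=code[:4],
        subheading=code[:6],
        country_suffix=code[6:],
    )


if __name__ == "__main__":
    # Illustrative code string, not an entry from the dataset.
    print(split_hscode("8517130000"))
```

Keeping the code as a string rather than an integer preserves leading zeros, which matter because chapters such as `01` through `09` start with `0`.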

### Dataset Collection and Statistics

![](assets/label_process_v4.png)

We engaged several domain experts in HSCode prediction and followed a carefully designed six-step pipeline to construct the dataset. Key details of HSCodeComp are provided in the following table.

| Metric | Value |
|--------|-------|
| **Total Products** | 632 expert-annotated entries |
| **HS Chapters** | 27 chapters |
| **First-level Categories** | 32 categories |
| **Data Source** | Large-scale e-commerce platforms |
| **Validation** | Multiple domain experts |
| **Inter-annotator Agreement** | >98% |
| **Models Tested** | 14 foundation models, 6 open-source agents, 3 closed-source systems |
| **Knowledge Level** | Level 3: Hierarchical rule application |

---

## ⚙️ Sample Usage

### 📁 Repository Structure

```bash
HSCodeComp/
├── data/
│   └── test_data.csv   # Product descriptions, attributes and ground-truth HSCodes
├── eval/
│   └── test_llm.py     # Evaluation script for model predictions
├── LICENSE
└── README.md
```

### 🛠️ Environment Setup

```bash
# Create and activate a virtual environment (optional but recommended)
python -m venv hscodcomp_env
source hscodcomp_env/bin/activate  # Linux/macOS
# hscodcomp_env\Scripts\activate   # Windows

# Install dependencies (threading ships with the Python standard library, so it is not installed via pip)
pip install pandas openai tqdm python-dotenv

# Set OpenAI keys and base URLs in HSCodeComp/.env
```

### 🚀 Run Evaluation

```bash
# Set models_to_test = ["gpt-4o"] in eval/test_llm.py
python eval/test_llm.py
```

The script reports **exact-match accuracy** at the **2-digit, 4-digit, 6-digit, 8-digit, and 10-digit** levels.

---

## 📊 Benchmark Performance

### Complete Evaluation on HSCodeComp

![](assets/main_exp_result.png)

The top-performing baseline, SmolAgent (GPT-5 with vision capability), achieves the best performance among AI systems, but it still lags far behind human expert performance.
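
The results in this section are exact-match accuracies at several prefix lengths of the HSCode. As a rough sketch of how such a metric could be computed (this is not the code in `eval/test_llm.py`, whose implementation may differ), the hypothetical function `prefix_accuracy` below compares the first 2, 4, 6, 8, and 10 digits of each prediction with the reference. The example codes are illustrative and not drawn from the dataset.

```python
from typing import Dict, Sequence

# Prefix lengths at which exact-match accuracy is reported.
PREFIX_LEVELS = (2, 4, 6, 8, 10)


def prefix_accuracy(
    predictions: Sequence[str],
    references: Sequence[str],
    levels: Sequence[int] = PREFIX_LEVELS,
) -> Dict[int, float]:
    """Return exact-match accuracy at each prefix length.

    A prediction counts as correct at level n only if it has at least n
    characters and its first n characters equal those of the reference.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one example is required")

    total = len(references)
    scores: Dict[int, float] = {}
    for n in levels:
        correct = sum(
            len(pred) >= n and pred[:n] == ref[:n]
            for pred, ref in zip(predictions, references)
        )
        scores[n] = correct / total
    return scores


if __name__ == "__main__":
    # Illustrative strings only.
    preds = ["8517130000", "8517620000", "6109100012"]
    refs = ["8517130000", "8517130000", "6109100040"]
    for level, acc in prefix_accuracy(preds, refs).items():
        print(f"{level}-digit accuracy: {acc:.3f}")
```

Because each level compares a longer prefix of the same strings, accuracy can only stay the same or decrease as the level increases, so the 10-digit score is the strictest.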

<img src="assets/closed_source_main_exp_result.jpg" alt="Closed Source Main Exp Result" width="60%" style="border-radius: 8px; box-shadow: 0 2px 8px rgba(0,0,0,0.1);">

Closed-source agent systems still substantially underperform both domain experts and open-source agent systems built on a GPT-5 backbone.

### Current Agents Fail to Leverage Hierarchical Decision Rules

<img src="assets/fail_to_use_DR.jpg" alt="Closed Source Main Exp Result" width="75%" style="border-radius: 8px; box-shadow: 0 2px 8px rgba(0,0,0,0.1);">

Performance degrades when human decision rules are included in the system prompt.

### More Thinking Leads to Worse Performance

<img src="assets/more_think_leads_to_worse_performance.jpg" alt="Closed Source Main Exp Result" width="75%" style="border-radius: 8px; box-shadow: 0 2px 8px rgba(0,0,0,0.1);">

* More thinking leads to more errors and hallucinations in this highly domain-specific HSCode prediction task.
* When accurate information is available through tool calls, prioritizing tool use over reasoning yields better results.

### Test-time Scaling Fails to Improve Performance

<img src="assets/tts.jpg" alt="Closed Source Main Exp Result" width="80%" style="border-radius: 8px; box-shadow: 0 2px 8px rgba(0,0,0,0.1);">

Two inference-time scaling strategies (majority voting and self-reflection) fail to effectively improve performance.

> For complete experimental results, please refer to [our paper](https://arxiv.org/abs/2510.19631).

---

## 🤝 Acknowledgements

We thank the human experts who meticulously annotated and validated the HSCodes. Their domain knowledge is the foundation of this benchmark's quality and realism.

---

## 🛡️ License

This project is licensed under the **Apache-2.0 License**.

---

## ⚠️ DISCLAIMER

Our datasets are constructed using publicly accessible product data sources.
Although we removed product images and URLs from HSCodeComp, we cannot guarantee that our datasets are completely free of copyright issues or improper content. If you believe anything infringes on your rights or constitutes improper content, please contact us ([Tian Lan](https://github.com/gmftbyGMFTBY) and [Longyue Wang](https://www.longyuewang.com/)), and we will promptly address the matter.

---

Provider:
maas
Created:
2025-10-27