Tokuhn/TSMPD-US-Public-v1_1

Source: Hugging Face. Updated: 2025-04-24. Indexed: 2025-11-29.
Download link:
https://hf-mirror.com/datasets/Tokuhn/TSMPD-US-Public-v1_1
Description:
---
license: odc-by
language:
- en
size_categories:
- 1M<n<10M
dataset_info:
  features:
  - name: uid
    dtype: string
  - name: vendor
    dtype: string
  - name: title
    dtype: string
  - name: paragraph
    dtype: string
  - name: embedding
    dtype:
      sequence: float32
task_categories:
- text-retrieval
- sentence-similarity
task_ids:
- document-retrieval
- semantic-similarity-classification
tags:
- ecommerce
- small-business
- rag
- grounding
- vector-search
- open-data
- embedding
- tokuhn
- shopify
- real-world-data
- sbert
- huggingface-datasets
---

# [Updated with SBERT Embeddings + Search Notebook]

## TSMPD-US: U.S. Small Merchant Product Dataset + SBERT Embeddings + Search Notebook

⚡ New in this release (April 2025):

- SBERT vector embeddings for all products (MiniLM-L6)
- Chunked Parquet format for scalable vector search
- Jupyter notebook demo for live semantic queries

These additions make it easier to integrate small merchant data into RAG pipelines, grounding tasks, and real-time AI agents.

## An open-source initiative to keep small merchants visible in LLMs, RAG systems, and AI-powered commerce workflows

This repository contains multiple assets for the TSMPD-US dataset: a structured, U.S.-only dataset of verified small business product listings, curated from over **355,000 independent stores**. It is designed for:

- Semantic product search
- LLM grounding and fine-tuning
- Retrieval-Augmented Generation (RAG)
- Metadata classification
- Commerce-aware agent design

---

## Directory Overview

### `public-products/`

A lightweight, text-only snapshot of the dataset.

- **~3.2M products** from 355,000+ verified U.S. merchants
- ~10 products per merchant, no images or variant details
- Suitable for general research, classification, and basic grounding tasks

**Includes:**

- `tsmpd_public_v1.0.json` or `.parquet`: core dataset
- `LICENSE.txt`: ODC-By license
- `README.md`: format and schema details

---

### `parquet-embeddings/`

Semantic-search-ready version of the dataset with **SBERT embeddings** (MiniLM-L6).

- Split into Parquet chunks for scalability
- Embeddings aligned with Hugging Face `sentence-transformers/all-MiniLM-L6-v2`

**Use cases:**

- Vector search and similarity pipelines
- Retrieval-Augmented Generation (RAG)
- AI agent product reasoning

**Includes:**

- `tsmpd_public_000.parquet`, `...001.parquet`, etc.
- `README.md`: usage notes and embedding shape
- `LICENSE.txt`: same ODC-By license unless extended

---

### `notebook-demo/`

A minimal working demo for semantic product search over the embedded dataset.

- Loads Parquet embeddings
- Performs cosine similarity on live queries
- Displays top product hits from the network

**Includes:**

- `tsmpd_search_demo.ipynb`: search notebook
- `README.md`: instructions and dependencies

---

## Why This Matters

Large models like ChatGPT and Claude do not crawl small stores the way Google does. Without structured visibility, the **long tail of small commerce risks becoming invisible** in AI-powered discovery systems.

**TSMPD-US** is designed to prevent that by making small merchant data accessible, embeddable, and usable in today's LLM workflows.

---

## Licensing

All public assets are distributed under the [Open Data Commons Attribution License (ODC-By)](https://opendatacommons.org/licenses/by/1-0/).

For full product variants, image URLs, merchant domains, and source tracking, request access to the **Partner dataset** by emailing `jim@tokuhn.com`.

---

## How to Use This Repository

- Load the text-only dataset via Hugging Face Datasets or `polars`
- Run vector search with the SBERT Parquet chunks
- Adapt the notebook demo for your own semantic or retrieval tasks
- Fine-tune or evaluate grounding quality with real-world small merchant data

Let's make sure AI doesn't erase the 99%.
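As a rough illustration of the vector-search step listed under "How to Use This Repository", the sketch below ranks records by cosine similarity against a query embedding. It is not taken from the notebook demo. It assumes each record is a dict with the fields named in the dataset card (`uid`, `vendor`, `title`, `paragraph`, `embedding`); the helper names `cosine_similarity` and `top_k`, the sample rows, and the 3-dimensional toy vectors are all hypothetical. Real embeddings produced by `sentence-transformers/all-MiniLM-L6-v2` would be 384-dimensional, and loading the Parquet chunks and encoding the query text are left out.

```python
import heapq
import math
from typing import Any, Dict, Iterable, List, Sequence, Tuple

# One record with the fields named in the dataset card:
# uid, vendor, title, paragraph, embedding.
Row = Dict[str, Any]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_embedding: Sequence[float], rows: Iterable[Row], k: int = 3) -> List[Tuple[float, Row]]:
    """Return the k rows most similar to the query, best first, as (score, row) pairs."""
    scored = ((cosine_similarity(query_embedding, row["embedding"]), row) for row in rows)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


# Hypothetical toy data: made-up products with 3-dimensional vectors for readability.
SAMPLE_ROWS: List[Row] = [
    {"uid": "a", "vendor": "shop-one", "title": "Ceramic mug", "paragraph": "Hand-glazed mug.", "embedding": [1.0, 0.0, 0.0]},
    {"uid": "b", "vendor": "shop-two", "title": "Wool socks", "paragraph": "Warm hiking socks.", "embedding": [0.0, 1.0, 0.0]},
    {"uid": "c", "vendor": "shop-one", "title": "Tea cup set", "paragraph": "Four porcelain cups.", "embedding": [0.7, 0.7, 0.0]},
]

# In practice the query vector would come from encoding the query text with the same model.
QUERY_EMBEDDING = [1.0, 0.1, 0.0]

for score, row in top_k(QUERY_EMBEDDING, SAMPLE_ROWS, k=2):
    print(f"{score:.3f}  {row['vendor']}  {row['title']}")
```

A pure-Python loop keeps the sketch dependency-free. For the full dataset of roughly 3.2M rows, a vectorized library or a dedicated vector index would be the more realistic choice.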

Provider:
Tokuhn