
meridian-online/sherlock-annotated

Source: Hugging Face · Updated: 2026-04-04 · Indexed: 2026-04-12
Download link:
https://hf-mirror.com/datasets/meridian-online/sherlock-annotated
Resource description:
---
license: apache-2.0
task_categories:
- text-classification
tags:
- column-type-annotation
- data-profiling
- type-inference
- sherlock
- finetype
pretty_name: Sherlock Column Type Annotations
size_categories:
- 100K<n<1M
---

# Sherlock Column Type Annotations

Column-level type annotations for the [Sherlock](https://github.com/mitmedialab/sherlock-project) corpus, produced by the [FineType](https://github.com/meridian-online/finetype) distillation pipeline.

## Dataset Description

Each row represents a single column from the Sherlock test set, annotated with:

- **Blind label**: an LLM classification made without seeing FineType's prediction
- **FineType label**: the prediction from FineType's CharCNN inference engine
- **Final label**: the adjudicated result (blind-first: the blind label is preferred unless FineType's prediction is clearly more accurate)
- **Ground truth**: Sherlock's original label, for comparison

The blind-first adjudication process ensures annotations are not anchored to FineType's predictions, producing high-quality training signal for model improvement.

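
The blind-first rule described above can be sketched as a small function. This is only an illustration: the dataset card does not say how "clearly more accurate" is decided, so the `finetype_clearly_better` flag is a hypothetical stand-in for that judgment, and the function name `adjudicate` is likewise an assumption.

```python
def adjudicate(blind_label: str, finetype_label: str,
               finetype_clearly_better: bool = False) -> str:
    """Return a final label under a blind-first rule (illustrative sketch).

    `finetype_clearly_better` is a hypothetical flag: the source does not
    specify how an adjudicator decides FineType is clearly more accurate.
    """
    if blind_label == finetype_label:
        # Both annotators agree, so there is nothing to adjudicate.
        return blind_label
    if finetype_clearly_better:
        # The only case in which the blind label is overridden.
        return finetype_label
    # Default on disagreement: prefer the blind label.
    return blind_label


# Agreement, disagreement, and an override on disagreement.
print(adjudicate("identity.person.email", "identity.person.email"))
print(adjudicate("geography.location.country", "identity.person.email"))
print(adjudicate("geography.location.country", "identity.person.email",
                 finetype_clearly_better=True))
```

Defaulting to the blind label on disagreement is what keeps the final label from being anchored to FineType's own output.
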
## Schema

| Column | Type | Description |
|---|---|---|
| `sherlock_index` | int64 | Column index in the Sherlock corpus (unique identifier) |
| `split` | string | Sherlock dataset split (`test`) |
| `sample_values` | string | JSON array of sample values from the column |
| `blind_label` | string | LLM type label (blind, no FineType prediction shown) |
| `blind_confidence` | string | Blind classification confidence (`high`, `medium`, `low`) |
| `finetype_label` | string | FineType engine prediction (taxonomy key) |
| `finetype_confidence` | float64 | FineType prediction confidence (0-1) |
| `agreement` | string | Whether blind and FineType labels agree (`yes`/`no`) |
| `final_label` | string | Adjudicated type label (FineType taxonomy key) |
| `reasoning` | string | Adjudication reasoning when labels disagree |
| `ground_truth_label` | string | Original Sherlock label |

## Type Taxonomy

Labels use FineType's three-level taxonomy: `domain.category.type` (e.g., `identity.person.email`, `geography.location.country`). The taxonomy covers 250 types across 7 domains: container, datetime, finance, geography, identity, representation, and technology.

## Statistics

- **Rows:** 102,461 unique columns
- **Coverage:** 74.6% of the Sherlock test set (137,353 columns)
- **Format:** Parquet (zstd compression)

## Usage

```python
from datasets import load_dataset

ds = load_dataset("meridian-online/sherlock-annotated")
```

```sql
-- DuckDB
SELECT * FROM 'hf://datasets/meridian-online/sherlock-annotated/data/sherlock_annotated.parquet';
```

## License

Apache 2.0
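
As a follow-up to the schema and taxonomy sections, here is a minimal sketch of unpacking one row once it has been loaded. It assumes that `sample_values` decodes with the standard `json` module and that every label has exactly the three `domain.category.type` levels the card describes. The helper name `unpack_row` and the values in `example_row` are made up for illustration and are not taken from the dataset.

```python
import json


def unpack_row(row: dict) -> dict:
    """Decode the JSON-encoded samples and split the taxonomy key."""
    # Assumes the documented three-level form: domain.category.type
    domain, category, type_name = row["final_label"].split(".", 2)
    return {
        "values": json.loads(row["sample_values"]),  # JSON array -> list
        "domain": domain,
        "category": category,
        "type": type_name,
        "labels_agree": row["agreement"] == "yes",   # `yes`/`no` -> bool
    }


# Hypothetical row using the column names from the schema table.
example_row = {
    "sample_values": '["a@example.com", "b@example.org"]',
    "final_label": "identity.person.email",
    "agreement": "yes",
}

unpacked = unpack_row(example_row)
print(unpacked["domain"], unpacked["category"], unpacked["type"])
print(unpacked["values"])
```
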
Provider:
meridian-online