meridian-online/sherlock-annotated
Hugging Face · Updated 2026-04-04 · Indexed 2026-04-12
Download link: https://hf-mirror.com/datasets/meridian-online/sherlock-annotated
Description:
---
license: apache-2.0
task_categories:
- text-classification
tags:
- column-type-annotation
- data-profiling
- type-inference
- sherlock
- finetype
pretty_name: Sherlock Column Type Annotations
size_categories:
- 100K<n<1M
---
# Sherlock Column Type Annotations
Column-level type annotations for the [Sherlock](https://github.com/mitmedialab/sherlock-project) corpus, produced by the [FineType](https://github.com/meridian-online/finetype) distillation pipeline.
## Dataset Description
Each row represents a single column from the Sherlock test set, annotated with:
- **Blind label** — an LLM classification made without seeing FineType's prediction
- **FineType label** — the prediction from FineType's CharCNN inference engine
- **Final label** — adjudicated result (blind-first: the blind label is preferred unless FineType's prediction is clearly more accurate)
- **Ground truth** — Sherlock's original label for comparison
Because the LLM labels each column before seeing FineType's output, the annotations are not anchored to FineType's predictions, which makes them a more reliable training signal for improving the model.
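One way to inspect the effect of blind-first adjudication is to tally which annotation each final label follows. The sketch below is illustrative only: `label_source` is a hypothetical helper, the rows are made-up examples rather than records from the dataset, and it assumes labels can be compared by exact string equality.

```python
from collections import Counter


def label_source(row):
    """Report which annotation the adjudicated label matches (hypothetical helper)."""
    blind = row["blind_label"]
    finetype = row["finetype_label"]
    final = row["final_label"]
    if final == blind and final == finetype:
        return "both"
    if final == blind:
        return "blind"
    if final == finetype:
        return "finetype"
    return "neither"


# Made-up rows for illustration; they are not taken from the dataset.
rows = [
    {
        "blind_label": "identity.person.email",
        "finetype_label": "identity.person.email",
        "final_label": "identity.person.email",
    },
    {
        "blind_label": "geography.location.country",
        "finetype_label": "identity.person.email",
        "final_label": "geography.location.country",
    },
    {
        "blind_label": "identity.person.email",
        "finetype_label": "geography.location.country",
        "final_label": "geography.location.country",
    },
]

counts = Counter(label_source(r) for r in rows)
print(counts)
```

Under a blind-first policy one would expect the `"finetype"` bucket to be the smallest, though the actual distribution has to be measured on the real data.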
## Schema
| Column | Type | Description |
|---|---|---|
| `sherlock_index` | int64 | Column index in the Sherlock corpus (unique identifier) |
| `split` | string | Sherlock dataset split (`test`) |
| `sample_values` | string | JSON array of sample values from the column |
| `blind_label` | string | LLM type label (blind, no FineType prediction shown) |
| `blind_confidence` | string | Blind classification confidence (`high`, `medium`, `low`) |
| `finetype_label` | string | FineType engine prediction (taxonomy key) |
| `finetype_confidence` | float64 | FineType prediction confidence (0-1) |
| `agreement` | string | Whether blind and FineType labels agree (`yes`/`no`) |
| `final_label` | string | Adjudicated type label (FineType taxonomy key) |
| `reasoning` | string | Adjudication reasoning when labels disagree |
| `ground_truth_label` | string | Original Sherlock label |
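Two columns in this schema are encoded as strings and usually need decoding before use: `sample_values` holds a JSON array, and `agreement` holds `yes`/`no`. The following is a minimal sketch of that decoding step. The `decode_row` helper and the example row are hypothetical; in particular, the empty `reasoning` value for an agreeing row is an assumption, not something the dataset card specifies.

```python
import json

# Made-up row that follows the schema; the values are illustrative only.
row = {
    "sherlock_index": 0,
    "split": "test",
    "sample_values": '["France", "Japan", "Brazil"]',
    "blind_label": "geography.location.country",
    "blind_confidence": "high",
    "finetype_label": "geography.location.country",
    "finetype_confidence": 0.97,
    "agreement": "yes",
    "final_label": "geography.location.country",
    "reasoning": "",  # assumption: empty when the two labels agree
    "ground_truth_label": "country",
}


def decode_row(row):
    """Return a copy of the row with string-encoded fields decoded (hypothetical helper)."""
    decoded = dict(row)  # shallow copy so the input row is left unchanged
    decoded["sample_values"] = json.loads(row["sample_values"])
    decoded["agreement"] = row["agreement"] == "yes"
    return decoded


decoded = decode_row(row)
print(decoded["sample_values"], decoded["agreement"])
```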
## Type Taxonomy
Labels use FineType's three-level taxonomy: `domain.category.type` (e.g., `identity.person.email`, `geography.location.country`). The taxonomy covers 250 types across 7 domains: container, datetime, finance, geography, identity, representation, and technology.
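Because every label follows the `domain.category.type` pattern, it can be split into its three levels for grouping or filtering. This is a small sketch, not part of FineType: `TypeLabel` and `parse_label` are hypothetical names, and the domain check relies only on the seven domains listed above.

```python
from typing import NamedTuple

# The seven domains named in the taxonomy description.
DOMAINS = {
    "container", "datetime", "finance", "geography",
    "identity", "representation", "technology",
}


class TypeLabel(NamedTuple):
    domain: str
    category: str
    type_name: str


def parse_label(label: str) -> TypeLabel:
    """Split a `domain.category.type` key into its parts (hypothetical helper)."""
    parts = label.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected domain.category.type, got {label!r}")
    if parts[0] not in DOMAINS:
        raise ValueError(f"unknown domain {parts[0]!r} in {label!r}")
    return TypeLabel(*parts)


print(parse_label("identity.person.email"))
print(parse_label("geography.location.country").domain)
```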
## Statistics
- **Rows:** 102,461 unique columns
- **Coverage:** 74.6% of the 137,353 columns in the Sherlock test set
- **Format:** Parquet (zstd compression)
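The coverage figure follows directly from the two counts above, as this quick check shows (the variable names are illustrative):

```python
annotated_columns = 102_461   # rows in this dataset
test_set_columns = 137_353    # columns in the Sherlock test set

coverage = annotated_columns / test_set_columns
print(f"{coverage:.1%}")  # prints 74.6%
```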
## Usage
```python
# Requires the Hugging Face `datasets` library: pip install datasets
from datasets import load_dataset
ds = load_dataset("meridian-online/sherlock-annotated")
```
```sql
-- DuckDB
SELECT * FROM 'hf://datasets/meridian-online/sherlock-annotated/data/sherlock_annotated.parquet';
```
## License
Apache 2.0
Provider: meridian-online



