gyorilab/pubtator_variants_ner_seed
Source: Hugging Face. Updated 2026-03-12; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/gyorilab/pubtator_variants_ner_seed
Resource description:
---
language:
- en
license: other
task_categories:
- token-classification
pretty_name: Seed NER Mutations and Variants Dataset
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: document_id
    dtype: string
  - name: passage_offset
    dtype: int64
  - name: text
    dtype: string
  - name: offsets
    list: int64
  - name: lengths
    list: int64
  - name: entity_types
    list:
      class_label:
        names:
          '0': CopyNumberVariant
          '1': DNAMutation
          '2': ProteinMutation
          '3': SNP
          '4': Gene
  splits:
  - name: train
    num_bytes: 125950340
    num_examples: 103959
  download_size: 64656637
  dataset_size: 125950340
---
# Seed NER Dataset
This dataset was derived from PubTator3 and verified for correctness by GPT 5.2. It contains close to 1,000 rows of text with entity annotations. It is intended for research and development on protein and DNA mutations, SNP identifiers, and other genetic variants.
## Source
- `dataset_path`: `data/Seed Dataset/pubtator3_mutations_10k.jsonl`
- `patch_responses_path`: `data/Seed Dataset/responses_gpt5.2_10k.jsonl`
- `generated_at_utc`: `2026-02-26 23:27:36Z`
## Schema
- `id` (int64)
- `text` (string)
- `offsets` (list[int32])
- `lengths` (list[int32])
- `entity_types` (ClassLabel sequence)
## Entity Labels
CopyNumberVariant, DNAMutation, ProteinMutation, SNP
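
Because the card lists `token-classification` as the task, character spans usually need to be turned into per-token tags before training. The following is a minimal sketch of one common scheme (BIO tagging), not the authors' preprocessing. It assumes spans are given as hypothetical `(start, end, label)` tuples with an exclusive end, which could be built as `start = offset` and `end = offset + length`, and it uses whitespace tokenization as a stand-in for a real model tokenizer. The example sentences are invented.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label), an illustrative format


def bio_tags(text: str, spans: List[Span]) -> Tuple[List[str], List[str]]:
    """Split text on whitespace and assign a BIO tag to each token."""
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.span()
        tag = "O"
        for start, end, label in spans:
            if tok_start < end and tok_end > start:  # token overlaps the span
                # The token covering the span start opens the entity; later ones continue it.
                prefix = "B-" if tok_start <= start else "I-"
                tag = prefix + label
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


# Invented example, not taken from the dataset.
tokens, tags = bio_tags(
    "The c.1799T>A variant in BRAF",
    [(4, 13, "DNAMutation"), (25, 29, "Gene")],
)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Whitespace tokens can include trailing punctuation, so a production pipeline would more likely align spans against the tokenizer's own character offsets.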
## Size
- rows: 7565
Provided by:
gyorilab



