mjbommar/ogbert-v1-sft

Name: mjbommar/ogbert-v1-sft
Creator: mjbommar
Published: 2025-12-06 22:38:44
License: No description available

Source: Hugging Face · Updated 2025-12-06 · Indexed 2025-12-20

Download link:

https://hf-mirror.com/datasets/mjbommar/ogbert-v1-sft

Description:

---
language:
- en
license: cc-by-4.0
task_categories:
- question-answering
- text-generation
task_ids:
- open-domain-qa
- closed-domain-qa
pretty_name: OGBert SFT Dataset
size_categories:
- 1M<n<10M
tags:
- instruction-tuning
- supervised-fine-tuning
- modernbert
- lexical
- dictionary
- question-answering
---

# OGBert v1 SFT Dataset

Supervised Fine-Tuning (SFT) dataset for OGBert, containing prompt-completion pairs generated from the OpenGloss dictionary.

## Dataset Description

This dataset transforms dictionary entries into instruction-following format for fine-tuning language models. Each entry generates multiple training examples, one or more per lexical property:

- **Definitions**: "Definition: X\nWhich word is defined?" → word
- **Synonyms**: "Synonyms: X, Y, Z\nWhich word fits these synonyms?" → word
- **Antonyms**: "Antonyms: X, Y, Z\nWhich word has these antonyms?" → word
- **Hypernyms**: "Hypernyms: X, Y, Z\nWhich word is a hyponym of these?" → word
- **Hyponyms**: "Hyponyms: X, Y, Z\nWhich word is the hypernym of these?" → word
- **Collocations**: "Collocations: X, Y, Z\nWhich word appears in these collocations?" → word
- **Examples**: "Example: X\nWhich word is illustrated?" → word
- **Encyclopedia**: "Encyclopedia entry: X\nWhich word is described?" → word
- **Lexical explanation**: "Lexical explanation: X\nWhich word is described?" → word

## Dataset Structure

### Fields

- `prompt`: The instruction/question text
- `completion`: The target word to predict
- `prompt_type`: Type of prompt (definition, synonyms, antonyms, hypernyms, hyponyms, collocations, example, encyclopedia, lexical)
- `reading_level`: Reading difficulty level (elementary, intermediate, advanced, unknown)
- `domain_tag`: Domain/topic tag from the dictionary entry
- `word_id`: Unique identifier for the source dictionary entry
- `word`: The headword from the dictionary entry (same as completion)

### Splits

- `train`: Training examples
- `eval`: Evaluation examples (2% of total)

### Statistics

- **Total pairs**: ~1M
- **Average per entry**: ~7 prompts
- **Prompt type distribution**:
  - Definitions: ~150k (most common, 1 per entry)
  - Examples: ~200k (2-3 per entry when available)
  - Synonyms: ~120k (when available)
  - Hypernyms: ~100k (when available)
  - Hyponyms: ~120k (when available)
  - Collocations: ~100k (when available)
  - Antonyms: ~80k (when available)
  - Encyclopedia: ~45k (when available)
  - Lexical: ~45k (when available)

## Usage

### Loading the Dataset

```python
from datasets import load_dataset

# Load full dataset
dataset = load_dataset("mjbommar/ogbert-v1-sft")

# Access splits
train_data = dataset["train"]
eval_data = dataset["eval"]

# Example entry
print(train_data[0])
# {
#   'prompt': 'Definition: A large mammal with a trunk... \nWhich word is defined?',
#   'completion': 'elephant',
#   'prompt_type': 'definition',
#   'reading_level': 'elementary',
#   'domain_tag': 'animals'
# }
```

### Training with Transformers

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# Load dataset
dataset = load_dataset("mjbommar/ogbert-v1-sft")

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForCausalLM.from_pretrained("your-model-checkpoint")

# Tokenize with prompt-completion format
def tokenize_function(examples):
    prompts = [f"{p} Answer: " for p in examples["prompt"]]
    completions = examples["completion"]

    # Tokenize prompt
    prompt_tokens = tokenizer(prompts, truncation=True, max_length=256)
    # Tokenize completion
    completion_tokens = tokenizer(completions, add_special_tokens=False)

    # Combine and create labels (mask prompt, predict completion)
    input_ids = []
    labels = []
    for p_ids, c_ids in zip(prompt_tokens["input_ids"], completion_tokens["input_ids"]):
        combined = p_ids + c_ids
        label = [-100] * len(p_ids) + c_ids  # Only compute loss on completion
        input_ids.append(combined)
        labels.append(label)

    return {"input_ids": input_ids, "labels": labels}

tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=dataset["train"].column_names,
)

# Assumption (not in the original card): sequences differ in length, so pad
# input_ids and labels per batch; labels are padded with -100 to stay masked.
data_collator = DataCollatorForSeq2Seq(tokenizer, label_pad_token_id=-100)

# Train
trainer = Trainer(
    model=model,
    # Further TrainingArguments were elided ("...") in the original card.
    args=TrainingArguments(output_dir="./results"),
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["eval"],
    data_collator=data_collator,
)
trainer.train()
```

## Metadata Usage

The `prompt_type`, `reading_level`, and `domain_tag` fields can be used for:

- Stratified sampling during training
- Curriculum learning (start with simpler prompts)
- Domain-specific fine-tuning
- Prompt type balancing
- Analysis and evaluation by category

## Source Dataset

- [mjbommar/opengloss-v1.1-dictionary](https://huggingface.co/datasets/mjbommar/opengloss-v1.1-dictionary)

## License

Same as source dataset (OpenGloss project).

## Citation

If you use this dataset, please cite the original OpenGloss project and dataset.
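
As an illustration of the prompt templates listed under Dataset Description, the sketch below turns one dictionary entry into prompt-completion pairs. The entry layout (a dict keyed by prompt type) and the `build_pairs` helper are assumptions made for this example; the card does not show the code that actually generated the dataset.

```python
# Hypothetical sketch: the entry layout and build_pairs are illustrative
# assumptions, not the dataset's actual generation code.

# Template strings copied from the prompt formats listed in the card.
TEMPLATES = {
    "definition": "Definition: {value}\nWhich word is defined?",
    "synonyms": "Synonyms: {value}\nWhich word fits these synonyms?",
    "antonyms": "Antonyms: {value}\nWhich word has these antonyms?",
    "hypernyms": "Hypernyms: {value}\nWhich word is a hyponym of these?",
    "hyponyms": "Hyponyms: {value}\nWhich word is the hypernym of these?",
    "collocations": "Collocations: {value}\nWhich word appears in these collocations?",
    "example": "Example: {value}\nWhich word is illustrated?",
    "encyclopedia": "Encyclopedia entry: {value}\nWhich word is described?",
    "lexical": "Lexical explanation: {value}\nWhich word is described?",
}

# Properties whose values are joined into a single "X, Y, Z" prompt.
LIST_TYPES = {"synonyms", "antonyms", "hypernyms", "hyponyms", "collocations"}


def build_pairs(entry):
    """Return one dict per prompt that can be built from `entry`."""
    pairs = []
    for prompt_type, template in TEMPLATES.items():
        value = entry.get(prompt_type)
        if not value:
            continue  # property not available for this entry
        if prompt_type in LIST_TYPES:
            values = [", ".join(value)]
        elif isinstance(value, str):
            values = [value]
        else:
            values = list(value)  # e.g. several example sentences
        for v in values:
            pairs.append(
                {
                    "prompt": template.format(value=v),
                    "completion": entry["word"],
                    "prompt_type": prompt_type,
                }
            )
    return pairs


entry = {
    "word": "elephant",
    "definition": "A large mammal with a trunk",
    "synonyms": ["pachyderm"],
    "example": ["The elephant sprayed water.", "An elephant never forgets."],
}
pairs = build_pairs(entry)
for pair in pairs:
    print(pair["prompt_type"], "->", pair["completion"])
```

Properties missing from an entry are skipped, which matches the "when available" qualifier in the statistics.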
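
Two of the uses listed under Metadata Usage, prompt type balancing and curriculum learning, can be sketched on plain Python rows as follows. The helper names, the per-type cap, and the choice to place `unknown` reading levels last are assumptions for illustration; the rows are toy data, not records from the dataset.

```python
import random
from collections import defaultdict

# Assumed ordering of the card's reading levels; "unknown" is placed last.
LEVEL_ORDER = {"elementary": 0, "intermediate": 1, "advanced": 2, "unknown": 3}


def balance_by_prompt_type(rows, per_type, seed=0):
    """Keep at most `per_type` randomly chosen rows for each prompt_type."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for row in rows:
        buckets[row["prompt_type"]].append(row)
    balanced = []
    for prompt_type in sorted(buckets):
        bucket = buckets[prompt_type]
        balanced.extend(rng.sample(bucket, min(per_type, len(bucket))))
    return balanced


def curriculum_order(rows):
    """Sort rows from the simplest reading level to the hardest."""
    return sorted(
        rows,
        key=lambda r: LEVEL_ORDER.get(r["reading_level"], len(LEVEL_ORDER)),
    )


# Toy rows carrying only the two metadata fields used here.
rows = [
    {"prompt_type": "definition", "reading_level": "advanced"},
    {"prompt_type": "definition", "reading_level": "elementary"},
    {"prompt_type": "definition", "reading_level": "intermediate"},
    {"prompt_type": "synonyms", "reading_level": "unknown"},
    {"prompt_type": "example", "reading_level": "elementary"},
]

balanced = balance_by_prompt_type(rows, per_type=2)
ordered = curriculum_order(balanced)
for row in ordered:
    print(row["reading_level"], row["prompt_type"])
```

With the `datasets` library, the same selection could instead start from `dataset["train"].filter(lambda r: r["prompt_type"] == "definition")` before sampling.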

Provided by:

mjbommar
