curvedinf/small-qa-1m

Name: curvedinf/small-qa-1m
Creator: curvedinf
Published: 2026-04-17 20:14:53
License: No description available

Source: Hugging Face. Updated: 2026-04-17. Indexed: 2026-04-26.

Download link:

https://hf-mirror.com/datasets/curvedinf/small-qa-1m

Description:

---
language:
- en
license: mit
pretty_name: small-qa-autocomplete
task_categories:
- text-generation
task_ids:
- language-modeling
tags:
- synthetic
- autocomplete
- command-line
- unit-conversion
- fact-lookup
size_categories:
- 100K<n<1M
---

# small-qa-autocomplete

Synthetic `prompt -> completion` dataset for training/autocomplete workflows.
Prompts are plain query text designed to resemble short user search/CLI-style requests.

## Dataset Summary

- task type: next-token / completion-style supervision
- source: synthetic prompt permutations with deterministic and model-backed answering
- total rows: 999495

## Splits

- train: 979385
- validation: 10032
- test: 10078

## Schema

Each row includes:

- `id`: stable row id
- `prompt`: user query text
- `completion`: target completion text
- `domain`: domain family
- `intent_id`: prompt intent identifier
- `style_id`: prompt style variant
- `template_id`: generator template identifier
- `metadata`: JSON string with generation metadata

## Recommended Usage

```python
from datasets import load_dataset

# parquet
ds = load_dataset("parquet", data_files={
    "train": "train.parquet",
    "validation": "validation.parquet",
    "test": "test.parquet",
})

# jsonl alternative
# ds = load_dataset("json", data_files={
#     "train": "train.jsonl",
#     "validation": "validation.jsonl",
#     "test": "test.jsonl",
# })
```

## Notes

- This dataset is synthetic and may contain occasional noise.
- It is intended as base pretraining/finetuning material for autocomplete-like behavior.
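The schema and split counts above can be exercised without downloading anything. The following is a minimal sketch, not part of the original card: it checks that the three stated split sizes add up to the stated total, and shows one possible way to turn a row into a single training string and decode its `metadata` field. The helper names (`split_fractions`, `to_training_text`, `parse_metadata`), the single-space separator, and every value in `example_row` are assumptions made for illustration; only the field names and the row counts come from the card.

```python
import json

# Row counts exactly as stated in the dataset card.
SPLIT_SIZES = {"train": 979385, "validation": 10032, "test": 10078}
TOTAL_ROWS = 999495


def split_fractions(sizes):
    """Return each split's share of the summed row count."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


def to_training_text(row, separator=" "):
    """Join prompt and completion into one completion-style training string.

    The separator is an assumption: the card does not say how the two
    fields should be concatenated.
    """
    return f"{row['prompt']}{separator}{row['completion']}"


def parse_metadata(row):
    """Decode `metadata`, which the card describes as a JSON string."""
    return json.loads(row["metadata"])


# Hypothetical row: keys follow the card's schema, all values are invented.
example_row = {
    "id": "example-0",
    "prompt": "convert 5 km to miles",
    "completion": "3.11 miles",
    "domain": "unit-conversion",
    "intent_id": "example-intent",
    "style_id": "example-style",
    "template_id": "example-template",
    "metadata": json.dumps({"generator": "template"}),
}

if __name__ == "__main__":
    # The stated split sizes should account for every row.
    assert sum(SPLIT_SIZES.values()) == TOTAL_ROWS
    print(split_fractions(SPLIT_SIZES))
    print(to_training_text(example_row))
    print(parse_metadata(example_row))
```

With the stated counts, the train split holds roughly 98% of the rows. Applied to a loaded split, the same helpers could be passed to `ds["train"].map(...)`, assuming the rows expose the fields listed in the schema.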

Provider:

curvedinf
