curvedinf/small-qa-1m
Source: Hugging Face. Updated 2026-04-17; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/curvedinf/small-qa-1m
Resource description:
---
language:
- en
license: mit
pretty_name: small-qa-autocomplete
task_categories:
- text-generation
task_ids:
- language-modeling
tags:
- synthetic
- autocomplete
- command-line
- unit-conversion
- fact-lookup
size_categories:
- 100K<n<1M
---
# small-qa-autocomplete
Synthetic `prompt -> completion` dataset for training autocomplete-style models.
Prompts are plain query text designed to resemble short search-style or CLI-style user requests.
## Dataset Summary
- task type: next-token / completion-style supervision
- source: synthetic prompt permutations with deterministic and model-backed answering
- total rows: 999495
## Splits
- train: 979385
- validation: 10032
- test: 10078
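As a quick sanity check on the split sizes above, the following minimal sketch sums them and computes each split's share of the total. The counts are copied from this card; the names `SPLIT_SIZES` and `split_fractions` are illustrative and not part of the dataset.

```python
# Split sizes copied from this card.
SPLIT_SIZES = {"train": 979385, "validation": 10032, "test": 10078}


def split_fractions(sizes):
    """Return the total row count and each split's fraction of that total."""
    total = sum(sizes.values())
    return total, {name: count / total for name, count in sizes.items()}


total, fractions = split_fractions(SPLIT_SIZES)
print(total)  # 999495, matching "total rows" in the Dataset Summary
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.2%}")
```

Roughly 98% of rows are in `train`, with about 1% each in `validation` and `test`.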
## Schema
Each row includes:
- `id`: stable row id
- `prompt`: user query text
- `completion`: target completion text
- `domain`: domain family
- `intent_id`: prompt intent identifier
- `style_id`: prompt style variant
- `template_id`: generator template identifier
- `metadata`: JSON string with generation metadata
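Because `metadata` is stored as a JSON string, it has to be decoded before its fields can be read. The sketch below shows one way to do that on a single row. The helper `decode_metadata` and every value in `example_row` (including the `generator` key) are invented for illustration; only the field names come from the schema above.

```python
import json


def decode_metadata(row):
    """Return a copy of `row` with `metadata` parsed from a JSON string.

    Missing or malformed metadata falls back to an empty dict.
    """
    decoded = dict(row)
    raw = row.get("metadata")
    try:
        decoded["metadata"] = json.loads(raw) if raw else {}
    except (TypeError, json.JSONDecodeError):
        decoded["metadata"] = {}
    return decoded


# Hypothetical row: field names follow the schema, all values are made up.
example_row = {
    "id": "example-0001",
    "prompt": "convert 5 km to miles",
    "completion": "3.11 miles",
    "domain": "unit-conversion",
    "intent_id": "example-intent",
    "style_id": "example-style",
    "template_id": "example-template",
    "metadata": '{"generator": "deterministic"}',
}

decoded_row = decode_metadata(example_row)
print(decoded_row["metadata"]["generator"])
```

The original row is left unchanged, so the same function can be passed to a mapping step without side effects.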
## Recommended Usage
```python
from datasets import load_dataset
# Parquet files
ds = load_dataset("parquet", data_files={
    "train": "train.parquet",
    "validation": "validation.parquet",
    "test": "test.parquet",
})
# JSONL alternative
# ds = load_dataset("json", data_files={
#     "train": "train.jsonl",
#     "validation": "validation.jsonl",
#     "test": "test.jsonl",
# })
```
## Notes
- This dataset is synthetic and may contain occasional noise.
- It is intended as base pretraining/finetuning material for autocomplete-like behavior.
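For completion-style training, each row is typically flattened into a single text sequence. The card does not specify how `prompt` and `completion` should be joined, so the single-space separator and the helper name `to_training_text` below are assumptions; adjust them to match your tokenizer and training setup.

```python
def to_training_text(row, separator=" "):
    """Join `prompt` and `completion` into one training string.

    The separator is an assumption; the dataset card does not prescribe one.
    """
    return f"{row['prompt']}{separator}{row['completion']}"


# Hypothetical row with made-up values.
sample = {"prompt": "convert 5 km to miles", "completion": "3.11 miles"}
text = to_training_text(sample)
print(text)

# With a loaded `datasets` object, the same function could be applied per row:
# ds = ds.map(lambda row: {"text": to_training_text(row)})
```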
Provider:
curvedinf



