paperbd/paper_instructions_300K-v1
Source: Hugging Face. Updated: 2026-04-02. Indexed: 2026-04-12.
Download link:
https://hf-mirror.com/datasets/paperbd/paper_instructions_300K-v1
Description:
---
task_categories:
- question-answering
- summarization
- text-to-speech
language:
- en
size_categories:
- 100K<n<1M
---
## Dataset Summary
This dataset contains synthetic supervised fine-tuning data generated from academic papers using `text-albumentations`.
- Rows: 300,000
- Source documents: 1,500 papers
- Format: Alpaca-style instruction tuning rows
The rows were generated with Qwen3.5-4B through the `text-albumentations` library (https://github.com/avbiswas/text-albumentations).
Each row is derived from source text through structured augmentations such as:
- bullet extraction
- question-answer generation
- rephrasing
- continuation-style supervision
- comparison and retrieval-style tasks
- knowledge graph triplet extraction
The goal is to turn long-form technical text into diverse, task-shaped supervision for distillation and SFT workflows.
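As a rough illustration of how one passage can fan out into several task-shaped rows, here is a minimal sketch. It does not use the `text-albumentations` API: the `TEMPLATES` dictionary, the `fan_out` function, and the `echo_model` stand-in are all hypothetical names introduced for this example, and the instruction wording is invented.

```python
from typing import Callable, Dict, List

# Hypothetical instruction templates, one per augmentation family.
# Illustrative only; not taken from text-albumentations.
TEMPLATES: Dict[str, str] = {
    "bullet_extraction": "Extract the key points of the passage as bullets.",
    "question_answer": "Write a question about the passage and answer it.",
    "rephrasing": "Rephrase the passage in plain language.",
}


def fan_out(passage: str, generate: Callable[[str, str], str]) -> List[dict]:
    """Build one Alpaca-style row per template for a single passage."""
    rows = []
    for instruction in TEMPLATES.values():
        rows.append(
            {
                "instruction": instruction,
                "input": passage,
                # `generate` is whatever produces the response text.
                "output": generate(instruction, passage),
            }
        )
    return rows


def echo_model(instruction: str, passage: str) -> str:
    """Stand-in for a real model call (the card names Qwen3.5-4B)."""
    return f"[model output for: {instruction}]"


rows = fan_out("Attention lets a model weigh tokens by relevance.", echo_model)
```

In this sketch each passage yields as many rows as there are templates, which is one plausible way a modest number of papers could expand into a much larger set of training rows.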
## Supported Tasks
- supervised fine-tuning
- instruction tuning
- distillation
## Data Structure
Each example follows an Alpaca-style schema:
```json
{
"instruction": "string",
"input": "string",
"output": "string"
}
```
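A minimal sketch of how a row in this schema might be checked and rendered into a training prompt. The card does not prescribe a prompt template, so the `### Instruction / ### Input / ### Response` layout below is an assumption borrowed from the commonly used Alpaca format, and `is_valid_row` and `to_prompt` are hypothetical helper names.

```python
REQUIRED_KEYS = ("instruction", "input", "output")


def is_valid_row(row: dict) -> bool:
    """Return True if every required field is present and is a string."""
    return all(isinstance(row.get(key), str) for key in REQUIRED_KEYS)


def to_prompt(row: dict) -> str:
    """Render a row as a prompt; the template is an assumed Alpaca-style layout."""
    parts = ["### Instruction:\n" + row["instruction"]]
    # Omit the input section when the row carries no input text.
    if row["input"].strip():
        parts.append("### Input:\n" + row["input"])
    parts.append("### Response:\n")
    return "\n\n".join(parts)


example = {
    "instruction": "Summarize the passage in one sentence.",
    "input": "Dropout randomly zeroes activations during training.",
    "output": "Dropout regularizes a network by randomly disabling units.",
}

prompt = to_prompt(example)
```

During fine-tuning, the `output` field would typically be appended after the rendered prompt as the target text.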
## Source Data
The dataset was generated from a collection of 1,500 ML/AI papers. Structured augmentation pipelines transformed the source text into synthetic instruction-response pairs, so rows are more than raw passages copied from the papers.
## Limitations
- This is synthetic data, not human-written gold supervision.
- Output quality depends on the underlying model and prompting pipeline used during generation.
- The dataset may contain factual omissions, formatting inconsistencies, or augmentation artifacts.
- Coverage and style are shaped by the source papers and the selected augmentation families.
## License
This dataset is for educational purposes. Please ensure your use is compatible with the licenses and terms of the original source papers.
Provider:
paperbd



