mlech26l/liquidrandom-data
Hugging Face, updated 2026-03-19, indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/mlech26l/liquidrandom-data
Resource description:
---
license: mit
task_categories:
- text-generation
language:
- en
size_categories:
- 100K<n<1M
tags:
- synthetic
- seed-data
- diversity
- llm-training
---
# liquidrandom-data
Diverse seed data for ML/LLM training data generation pipelines.
Used by the [liquidrandom](https://github.com/mlech26l/liquidrandom) Python package.
## Dataset Summary
This dataset contains 295,704 seed data samples across 13 categories,
generated using a hierarchical taxonomy tree approach with LLM-based quality validation
and fuzzy deduplication. Data is stored as Parquet with zstd compression.
## Categories
| Category | Samples | File |
|---|---|---|
| Coding Tasks | 30,069 | `coding_task.parquet` |
| Domains | 31,177 | `domain.parquet` |
| Emotional States | 25,843 | `emotional_state.parquet` |
| Instruction Complexity | 283 | `instruction_complexity.parquet` |
| Jobs | 32,537 | `job.parquet` |
| Languages | 29,176 | `language.parquet` |
| Math Categories | 25,369 | `math_category.parquet` |
| Personas | 24,995 | `persona.parquet` |
| Reasoning Patterns | 274 | `reasoning_pattern.parquet` |
| Scenarios | 31,180 | `scenario.parquet` |
| Science Topics | 30,340 | `science_topic.parquet` |
| Tool Groups | 6,729 | `tool_group.parquet` |
| Writing Styles | 27,732 | `writing_style.parquet` |
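The table can be turned into a small lookup for loading one category at a time. The following is a minimal sketch, not part of the `liquidrandom` package: the names `CATEGORIES` and `load_category` are hypothetical, and it assumes the Parquet files sit at the top level of the dataset repository and that `pandas`, a Parquet engine such as `pyarrow`, and `huggingface_hub` are installed.

```python
REPO_ID = "mlech26l/liquidrandom-data"

# Category name -> (Parquet file, sample count), copied from the table above
CATEGORIES = {
    "Coding Tasks": ("coding_task.parquet", 30_069),
    "Domains": ("domain.parquet", 31_177),
    "Emotional States": ("emotional_state.parquet", 25_843),
    "Instruction Complexity": ("instruction_complexity.parquet", 283),
    "Jobs": ("job.parquet", 32_537),
    "Languages": ("language.parquet", 29_176),
    "Math Categories": ("math_category.parquet", 25_369),
    "Personas": ("persona.parquet", 24_995),
    "Reasoning Patterns": ("reasoning_pattern.parquet", 274),
    "Scenarios": ("scenario.parquet", 31_180),
    "Science Topics": ("science_topic.parquet", 30_340),
    "Tool Groups": ("tool_group.parquet", 6_729),
    "Writing Styles": ("writing_style.parquet", 27_732),
}

# Sum of the per-category counts listed in the table
TOTAL_SAMPLES = sum(count for _, count in CATEGORIES.values())


def load_category(category: str):
    """Download one category file and return it as a pandas DataFrame."""
    # Imported lazily so the lookup table works without these packages
    import pandas as pd
    from huggingface_hub import hf_hub_download

    filename, _ = CATEGORIES[category]
    # Assumption: files are stored at the repository root under these names
    local_path = hf_hub_download(
        repo_id=REPO_ID,
        filename=filename,
        repo_type="dataset",
    )
    # zstd decompression is handled by the installed Parquet engine
    return pd.read_parquet(local_path)
```

Calling `load_category("Personas")` would download `persona.parquet` and return its rows. The column layout is not documented in this card, so inspect `df.columns` before relying on specific field names.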
## Usage
```python
import liquidrandom

# Fetch a persona seed sample through the liquidrandom package
persona = liquidrandom.persona()
print(persona)
```
## Generation
Data was generated using the `liquidrandom` seed generation scripts with:
- Hierarchical taxonomy trees for diversity
- LLM-based quality validation
- Jaccard similarity deduplication
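To illustrate the last step, here is a minimal sketch of Jaccard-based near-duplicate filtering. It is not the actual `liquidrandom` generation script: the word-level tokenization, the `0.8` threshold, and the function names `tokenize`, `jaccard`, and `deduplicate` are all assumptions made for illustration.

```python
import re


def tokenize(text: str) -> frozenset:
    """Lowercase the text and return its set of word tokens."""
    return frozenset(re.findall(r"\w+", text.lower()))


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity: |A intersect B| / |A union B|."""
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


def deduplicate(samples, threshold=0.8):
    """Keep a sample only if it is below the threshold against every kept one.

    The threshold value is a hypothetical choice, not taken from the source.
    This greedy pass is O(n^2) in the number of kept samples.
    """
    kept = []
    kept_tokens = []
    for sample in samples:
        tokens = tokenize(sample)
        if all(jaccard(tokens, seen) < threshold for seen in kept_tokens):
            kept.append(sample)
            kept_tokens.append(tokens)
    return kept


samples = [
    "A retired sailor who loves chess",
    "a retired sailor who loves chess!",  # same word set as the first
    "A young chemist studying polymers",
]
unique = deduplicate(samples)
```

Here the second sample has the same token set as the first, so only the first and third survive. A pairwise pass like this becomes slow on hundreds of thousands of samples; an approximate scheme such as MinHash would be a common substitute at that scale, though the card does not say which approach was used.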
Provider:
mlech26l



