
mlech26l/liquidrandom-data

Source: Hugging Face. Updated 2026-03-19. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/mlech26l/liquidrandom-data
Resource description:
---
license: mit
task_categories:
- text-generation
language:
- en
size_categories:
- 10K<n<100K
tags:
- synthetic
- seed-data
- diversity
- llm-training
---

# liquidrandom-data

Diverse seed data for ML/LLM training data generation pipelines. Used by the [liquidrandom](https://github.com/mlech26l/liquidrandom) Python package.

## Dataset Summary

This dataset contains 295,704 seed data samples across 13 categories, generated using a hierarchical taxonomy tree approach with LLM-based quality validation and fuzzy deduplication. Data is stored as Parquet with zstd compression.

## Categories

| Category | Samples | File |
|---|---|---|
| Coding Tasks | 30,069 | `coding_task.parquet` |
| Domains | 31,177 | `domain.parquet` |
| Emotional States | 25,843 | `emotional_state.parquet` |
| Instruction Complexity | 283 | `instruction_complexity.parquet` |
| Jobs | 32,537 | `job.parquet` |
| Languages | 29,176 | `language.parquet` |
| Math Categories | 25,369 | `math_category.parquet` |
| Personas | 24,995 | `persona.parquet` |
| Reasoning Patterns | 274 | `reasoning_pattern.parquet` |
| Scenarios | 31,180 | `scenario.parquet` |
| Science Topics | 30,340 | `science_topic.parquet` |
| Tool Groups | 6,729 | `tool_group.parquet` |
| Writing Styles | 27,732 | `writing_style.parquet` |

## Usage

```python
import liquidrandom

persona = liquidrandom.persona()
print(persona)
```

## Generation

Data was generated using the `liquidrandom` seed generation scripts with:

- Hierarchical taxonomy trees for diversity
- LLM-based quality validation
- Jaccard similarity deduplication
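
The card does not say how the Jaccard similarity deduplication step is implemented. As a rough illustration only, the sketch below shows one common way to do it: compare word-token sets and drop any sample that is too similar to one already kept. The helper names (`token_set`, `jaccard`, `deduplicate`), the word-level tokenization, and the `0.8` threshold are all assumptions made for this example and are not taken from the `liquidrandom` scripts.

```python
import re


def token_set(text: str) -> frozenset[str]:
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return frozenset(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Jaccard similarity |A & B| / |A | B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(samples: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a sample only if its similarity to every kept sample is below threshold.

    The threshold value is a hypothetical default chosen for illustration.
    """
    kept: list[str] = []
    kept_tokens: list[frozenset[str]] = []
    for sample in samples:
        tokens = token_set(sample)
        if all(jaccard(tokens, seen) < threshold for seen in kept_tokens):
            kept.append(sample)
            kept_tokens.append(tokens)
    return kept


if __name__ == "__main__":
    seeds = [
        "A retired nurse who writes poetry",
        "a retired nurse who writes poetry!",  # near-duplicate of the first
        "A marine biologist studying coral reefs",
    ]
    for seed in deduplicate(seeds):
        print(seed)
```

This pairwise comparison is quadratic in the number of kept samples, which is fine for a sketch; at the scale of tens of thousands of samples per category, an approximate method such as MinHash would likely be needed, though the card does not state which approach was used.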
Provider:
mlech26l