IlyasFardaouixx/legalfinance-500k-mixed
Source: Hugging Face · Updated 2026-03-23 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/IlyasFardaouixx/legalfinance-500k-mixed
Description:
---
license: apache-2.0
language:
- en
tags:
- legal
- finance
- instruction-following
- question-answering
- classification
- summarization
- code-generation
- reasoning
size_categories:
- 10K<n<100K
task_categories:
- text-generation
- question-answering
- text-classification
- summarization
---
# 🚀 LegalFinance-5M Mixed Dataset Builder
Welcome to **Data-Set-Builder-**, a production-grade synthetic data engine for building massive, high-quality datasets for Legal and Financial AI.



## 💡 The Vision
Building domain-specific datasets (Legal/Finance) is traditionally slow and expensive. This pipeline changes the game by running generation in parallel across **6 AI providers**, reaching speeds of **250,000+ rows per hour**.
Whether you're fine-tuning a Llama-3, Mistral, or Gemini model, this builder gives you the raw material at scale.
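
To make the parallel-provider idea concrete, here is a minimal sketch of round-robin dispatch over a thread pool. It is not the project's actual worker code: `fake_generate` is a hypothetical stand-in for a real API call, and only the provider names come from this README.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle

# Provider names as listed in this README; everything else is illustrative.
PROVIDERS = ["groq", "cerebras", "openrouter", "gemini", "mistral", "cohere"]


def fake_generate(provider: str, prompt: str) -> dict:
    """Hypothetical stand-in for a real request to one provider's API."""
    return {
        "provider": provider,
        "prompt": prompt,
        "text": f"[{provider}] answer to: {prompt}",
    }


def generate_rows(prompts: list[str], max_workers: int = 6) -> list[dict]:
    """Assign prompts to providers round-robin and run the calls concurrently."""
    # cycle() repeats the provider list; zip() stops when prompts run out.
    assignments = list(zip(cycle(PROVIDERS), prompts))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so rows line up with prompts.
        return list(pool.map(lambda pair: fake_generate(*pair), assignments))


rows = generate_rows([f"question {i}" for i in range(12)])
```

With real network calls, threads spend most of their time waiting on I/O, which is why a thread pool (rather than processes) is a reasonable first choice for this kind of fan-out.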
## 🛠️ Performance Engine
This isn't just a simple script. It's a distributed worker system:
* **⚡ 6-Provider Grid**: Load-balanced across **Groq, Cerebras, OpenRouter, Google Gemini, Mistral, and Cohere**.
* **🧩 12 Task Categories**: Expert-level Q&A, complex reasoning, legal code generation, financial analysis, and more.
* **🧼 Clean & Refined**: Built-in MinHash-LSH deduplication and Pydantic validation ensure your model isn't learning from garbage.
* **🎯 Infinity Mode**: Set it to run until your API keys run out of quota, then auto-upload to Hugging Face.
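
As an illustration of the MinHash-LSH deduplication mentioned above, the sketch below uses only the standard library. It is an assumption about how such a step could look, not the repository's implementation (which may rely on a dedicated library); the function names, shingle size, band count, and threshold are all hypothetical choices.

```python
import hashlib
import re


def shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of k-word shingles of the lowercased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(text: str, num_perm: int = 64) -> tuple[int, ...]:
    """One minimum per salted hash; matching slots estimate Jaccard similarity."""
    items = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return tuple(signature)


def dedupe(rows: list[str], num_perm: int = 64, bands: int = 16,
           threshold: float = 0.7) -> list[str]:
    """Keep the first row of every group of near-duplicates."""
    rows_per_band = num_perm // bands
    buckets: dict[tuple, list[int]] = {}
    kept: list[str] = []
    kept_sigs: list[tuple[int, ...]] = []
    for text in rows:
        sig = minhash_signature(text, num_perm)
        band_keys = [
            (band, sig[band * rows_per_band:(band + 1) * rows_per_band])
            for band in range(bands)
        ]
        # LSH step: only compare against kept rows sharing at least one band.
        candidates = {i for key in band_keys for i in buckets.get(key, [])}
        is_duplicate = any(
            sum(x == y for x, y in zip(sig, kept_sigs[i])) / num_perm >= threshold
            for i in candidates
        )
        if is_duplicate:
            continue
        kept.append(text)
        kept_sigs.append(sig)
        for key in band_keys:
            buckets.setdefault(key, []).append(len(kept) - 1)
    return kept


sample = [
    "What is a fiduciary duty?",
    "what is a fiduciary duty",  # differs only in case and punctuation
    "Explain the difference between a lien and a mortgage.",
]
unique = dedupe(sample)  # the second row is dropped as a duplicate of the first
```

The banding trick is what keeps this cheap at scale: each new row is compared only with rows that collide in at least one band, instead of with every row kept so far.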
## 🚀 Quick Start (Full Auto)
If you want to just start building right now:
1. **Clone the Beast**:
   ```bash
   git clone https://github.com/IlyasFardaouix/Data-Set-Builder-
   cd Data-Set-Builder-
   ```
2. **Config**:
   Rename `.env.example` to `.env` and drop in your API keys.
3. **Launch**:
   ```bash
   python scripts/run_complete_pipeline.py
   ```

*This will generate, clean, and upload everything to Hugging Face while you sleep.*
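
Before launching, it can help to check which keys are still empty in your `.env`. The snippet below is a small sketch of such a check. The variable names in `EXPECTED_KEYS` are hypothetical placeholders; the real names are whatever `.env.example` in the repository defines.

```python
# Hypothetical variable names; check .env.example for the real ones.
EXPECTED_KEYS = ["GROQ_API_KEY", "CEREBRAS_API_KEY", "MISTRAL_API_KEY"]


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks, comments, and malformed lines."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env: dict[str, str], expected: list[str]) -> list[str]:
    """Return the expected keys that are absent or empty."""
    return [key for key in expected if not env.get(key)]


sample_env = 'GROQ_API_KEY="abc123"\n# CEREBRAS_API_KEY=\nMISTRAL_API_KEY=\n'
print(missing_keys(parse_env(sample_env), EXPECTED_KEYS))
```

To run it against your own file, replace `sample_env` with the contents of `.env`, for example via `pathlib.Path(".env").read_text()`.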
## 📂 Project Structure
```text
├── generators/ # Worker logic for different task types
├── pipeline/ # Cleaning, deduplication, and HF uploader
├── prompts/ # Expert-crafted domain prompts
├── scripts/ # CLI Entrypoints
├── data/ # Local storage (raw & processed)
└── config.py # The "Brain" of the operation
```
## 📖 About the Project
This project was born out of a simple problem: **high-quality, domain-specific AI training data is too expensive.**
By merging the latest in high-speed inference (like Cerebras and Groq) with a modular task-distribution system, we've created a way for individual researchers and small teams to build "Big Tech" grade datasets on a "Free Tier" budget. Our goal is to democratize fine-tuning for specialized fields like Law and Finance.
### #Hashtags
#AIData #SyntheticData #LLM #FineTuning #LegalAI #FinancialAI #OpenDataSet #BigData #MachineLearning #AIOps #DataSetBuilder
---
*Built with ❤️ for the AI Research Community.*
Provider:
IlyasFardaouixx



