
IlyasFardaouixx/legalfinance-500k-mixed

Source: Hugging Face. Updated 2026-03-23. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/IlyasFardaouixx/legalfinance-500k-mixed
Resource description:
---
license: apache-2.0
language:
- en
tags:
- legal
- finance
- instruction-following
- question-answering
- classification
- summarization
- code-generation
- reasoning
size_categories:
- 10K<n<100K
task_categories:
- text-generation
- question-answering
- text-classification
- summarization
---

# 🚀 LegalFinance-5M Mixed Dataset Builder

Welcome to the **Data-Set-Builder-**, a production-grade synthetic data engine designed to build massive, high-quality datasets for Legal and Financial AI.

![Dataset Stats](https://img.shields.io/badge/Status-Running-brightgreen?style=for-the-badge) ![Target](https://img.shields.io/badge/Target-12M_Rows-blue?style=for-the-badge) ![Providers](https://img.shields.io/badge/Providers-6_Parallel-orange?style=for-the-badge)

## 💡 The Vision

Building domain-specific datasets (Legal/Finance) is traditionally slow and expensive. This pipeline changes the game by parallelizing generation across **6 different AI providers** simultaneously, reaching speeds of **250,000+ rows per hour**.

Whether you're fine-tuning a Llama-3, Mistral, or Gemini model, this builder gives you the raw material at scale.

## 🛠️ Performance Engine

This isn't just a simple script. It's a distributed worker system:

* **⚡ 6-Provider Grid**: Load-balanced across **Groq, Cerebras, OpenRouter, Google Gemini, Mistral, and Cohere**.
* **🧩 12 Task Categories**: Expert-level Q&A, complex reasoning, legal code generation, financial analysis, and more.
* **🧼 Clean & Refined**: Built-in MinHash-LSH deduplication and Pydantic validation ensure your model isn't learning from garbage.
* **🎯 Infinity Mode**: Set it to run until your keys are exhausted, then auto-upload to Hugging Face.

## 🚀 Quick Start (Full Auto)

If you want to just start building right now:

1. **Clone the Beast**:
   ```bash
   git clone https://github.com/IlyasFardaouix/Data-Set-Builder-
   cd Data-Set-Builder-
   ```
2. **Config**: Rename `.env.example` to `.env` and drop in your API keys.
3. **Launch**:
   ```bash
   python scripts/run_complete_pipeline.py
   ```
   *This will generate, clean, and upload everything to Hugging Face while you sleep.*

## 📂 Project Structure

```text
├── generators/    # Worker logic for different task types
├── pipeline/      # Cleaning, deduplication, and HF uploader
├── prompts/       # Expert-crafted domain prompts
├── scripts/       # CLI Entrypoints
├── data/          # Local storage (raw & processed)
└── config.py      # The "Brain" of the operation
```

## 📖 About the Project

This project was born out of a simple problem: **high-quality, domain-specific AI training data is too expensive.**

By merging the latest in high-speed inference (like Cerebras and Groq) with a modular task-distribution system, we've created a way for individual researchers and small teams to build "Big Tech" grade datasets on a "Free Tier" budget. Our goal is to democratize fine-tuning for specialized fields like Law and Finance.

### #Hashtags

#AIData #SyntheticData #LLM #FineTuning #LegalAI #FinancialAI #OpenDataSet #BigData #MachineLearning #AIOps #DataSetBuilder

---

*Built with ❤️ for the AI Research Community.*
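
The Performance Engine section mentions MinHash-LSH deduplication but does not show how it works. The sketch below is one possible illustration of the technique using only the Python standard library; it is not taken from the project's `pipeline/` code. The function names (`shingles`, `minhash`, `deduplicate`) and the parameter values (`NUM_PERM`, `BANDS`, `SHINGLE`, the `0.7` threshold) are all assumptions chosen for the example.

```python
import hashlib

NUM_PERM = 64   # signature length (assumed value, not from the project)
BANDS = 16      # number of LSH bands; rows per band = NUM_PERM // BANDS
SHINGLE = 3     # word n-gram size used to build shingles


def shingles(text, k=SHINGLE):
    """Return the set of lowercase k-word shingles for a text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token, salted with the seed."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(text):
    """MinHash signature: the minimum hash per seed over all shingles."""
    sh = shingles(text)
    return tuple(min(_hash(seed, s) for s in sh) for seed in range(NUM_PERM))


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions approximates Jaccard."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(rows, threshold=0.7):
    """Keep the first row of each near-duplicate group, preserving order."""
    rows_per_band = NUM_PERM // BANDS
    buckets = {}            # (band index, band slice) -> indices into kept
    kept, signatures = [], []
    for text in rows:
        sig = minhash(text)
        band_keys = [
            (b, sig[b * rows_per_band:(b + 1) * rows_per_band])
            for b in range(BANDS)
        ]
        # LSH step: only rows sharing at least one band are compared.
        candidates = {i for key in band_keys for i in buckets.get(key, ())}
        if any(estimated_jaccard(sig, signatures[i]) >= threshold
               for i in candidates):
            continue  # near-duplicate of a row we already kept
        idx = len(kept)
        kept.append(text)
        signatures.append(sig)
        for key in band_keys:
            buckets.setdefault(key, []).append(idx)
    return kept


if __name__ == "__main__":
    sample = [
        "What is a fiduciary duty in corporate law",
        "What is a fiduciary duty in corporate law",
        "Explain the difference between stocks and bonds",
    ]
    for row in deduplicate(sample):
        print(row)
```

Banding keeps the comparison cost low: a new row is compared only against kept rows that share at least one band of its signature, rather than against every kept row. At the row counts this README targets, a dedicated library and a persistent index would likely be a better fit than this in-memory sketch.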
Provider:
IlyasFardaouixx