OMCHOKSI108/my-cloud-data-lake

Name: OMCHOKSI108/my-cloud-data-lake
Creator: OMCHOKSI108
Published: 2026-02-15 14:18:18
License: No description available

Source: Hugging Face · Updated 2026-02-15 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/OMCHOKSI108/my-cloud-data-lake

Description:

# 🌊 Zero-Cost Cloud Data Lake on Hugging Face

A complete serverless data pipeline that converts 12GB+ of CSV data to optimized Parquet format and serves it via a FastAPI on Hugging Face Spaces.

## 🏗️ Architecture Overview

```
📁 Local CSV Data (12GB+)
        ↓
🔄 Conversion Script (DuckDB + SNAPPY)
        ↓
📦 Parquet Files (4.4GB - 67% compression)
        ↓
☁️ Hugging Face Dataset
        ↓
🚀 Serverless FastAPI
        ↓
🌐 Public API Endpoint
```

## 📊 Dataset Information

- **Source**: 793 CSV files across 6 timeframes
- **Assets**: 100+ financial instruments (Forex, Crypto, Stocks, Commodities)
- **Timeframes**: 1min, 5min, 15min, 30min, 1hr, 4hr, 1day
- **Format**: OHLCV (Open, High, Low, Close, Volume)
- **Compression**: 67% size reduction (13.4GB → 4.4GB)
- **Storage**: Hugging Face Datasets

## 🚀 Quick Start

### 1. API Access

- **Live API**: https://omchoksi108-forexdatalake.hf.space
- **Documentation**: https://omchoksi108-forexdatalake.hf.space/docs
- **Dataset**: https://huggingface.co/datasets/omchoksi108/forex-cloud-data-lake

### 2. API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/` | GET | API information |
| `/health` | GET | Health check |
| `/describe` | GET | Dataset schema |
| `/preview` | GET | Quick data preview |
| `/query` | POST | Execute SQL queries |

### 3. Example Usage

#### Health Check

```bash
curl https://omchoksi108-forexdatalake.hf.space/health
```

#### Get Dataset Schema

```bash
curl https://omchoksi108-forexdatalake.hf.space/describe
```

#### Preview Data

```bash
curl "https://omchoksi108-forexdatalake.hf.space/preview?limit=5"
```

#### Execute SQL Query

```bash
curl -X POST https://omchoksi108-forexdatalake.hf.space/query \
  -H "Content-Type: application/json" \
  -d '{
    "sql_query": "SELECT ts, close, volume FROM data WHERE close > 100 LIMIT 10",
    "limit": 100
  }'
```

## 📁 File Structure

```
├── requirements_local.txt          # Local dependencies
├── scripts/
│   ├── 1_convert_to_parquet.py     # CSV → Parquet conversion
│   └── 2_upload_to_hf.py           # Upload to Hugging Face
├── api_deploy/
│   ├── main.py                     # FastAPI server
│   ├── requirements.txt            # API dependencies
│   └── Dockerfile                  # Container configuration
├── parquet_data/                   # Converted Parquet files
└── forexdatalake/                  # Hugging Face Space files
```

## 🛠️ Local Setup

### Prerequisites

- Python 3.9+
- Hugging Face account with write permissions

### Installation

```bash
pip install -r requirements_local.txt
```

### Environment Setup

Create `.env` file:

```env
HF_TOKEN=your_hf_write_token_here
HF_USERNAME=omchoksi108
```

### Data Conversion

```bash
python scripts/1_convert_to_parquet.py
```

### Upload to Hugging Face

```bash
python scripts/2_upload_to_hf.py
```

## 🔧 Technical Details

### Data Processing

- **Engine**: DuckDB (in-memory analytical database)
- **Compression**: SNAPPY (balanced speed/size)
- **Format**: Apache Parquet (columnar storage)
- **Memory Usage**: Optimized for large datasets

### API Features

- **Serverless**: No server management required
- **Streaming**: Direct Parquet queries without download
- **Caching**: HTTP file caching for performance
- **Security**: SQL injection protection
- **Monitoring**: Health checks and error handling

### Performance Metrics

- **Conversion Rate**: ~1.8 files/second
- **Compression Ratio**: 67% size reduction
- **Query Response**: Sub-second for most queries
- **Concurrent Users**: Handles multiple simultaneous requests

## 📈 Data Schema

Each Parquet file contains:

```sql
ts      TIMESTAMP  -- Data timestamp (YYYY-MM-DD HH:MM)
open    DOUBLE     -- Opening price
high    DOUBLE     -- Highest price
low     DOUBLE     -- Lowest price
close   DOUBLE     -- Closing price
volume  BIGINT     -- Trading volume
```

## 🔒 Security Features

- **SQL Validation**: Only SELECT statements allowed
- **Query Limits**: Configurable result limits
- **Input Sanitization**: Prevents SQL injection
- **Private Dataset**: Access controlled via Hugging Face
- **Token Security**: No hardcoded credentials

## 🚀 Deployment

### Automatic Deployment

1. Push changes to `forexdatalake` folder
2. Hugging Face automatically builds Docker container
3. API becomes available at your Space URL

### Manual Deployment

```bash
cd forexdatalake
git add .
git commit -m "Update API"
git push origin master
```

## 📊 Query Examples

### Filter by Symbol

```sql
SELECT * FROM data
WHERE filename LIKE '%BTCUSD%'
LIMIT 100
```

### Time Range Analysis

```sql
SELECT ts, close, volume
FROM data
WHERE ts BETWEEN '2023-01-01' AND '2023-12-31'
  AND symbol = 'EURUSD'
ORDER BY ts DESC
```

### Price Statistics

```sql
SELECT
  AVG(close) as avg_close,
  MIN(close) as min_close,
  MAX(close) as max_close,
  COUNT(*) as total_records
FROM data
WHERE symbol = 'BTCUSD'
```

### Volume Analysis

```sql
SELECT
  DATE(ts) as date,
  SUM(volume) as daily_volume,
  AVG(close) as avg_price
FROM data
WHERE symbol = 'ETHUSD'
GROUP BY DATE(ts)
ORDER BY date DESC
LIMIT 30
```

## 🔄 Data Updates

### Adding New Data

1. Add CSV files to appropriate timeframe folders
2. Run conversion script
3. Upload updated Parquet files
4. API automatically queries latest data

### Schema Changes

- Update conversion script for new columns
- Re-run conversion process
- Upload updated dataset

## 🐛 Troubleshooting

### Common Issues

**Token Permissions**

```
Error: 403 Forbidden - You don't have rights to create dataset
Solution: Create new HF token with write permissions
```

**Memory Issues**

```
Error: Out of memory during conversion
Solution: Reduce DuckDB memory_limit in script
```

**API Timeouts**

```
Error: Request timeout
Solution: Add LIMIT clause to queries
```

### Performance Optimization

- Use specific column selection instead of `SELECT *`
- Add appropriate WHERE clauses
- Use LIMIT for large result sets
- Consider time-based partitioning for queries

## 📚 Resources

- **DuckDB Documentation**: https://duckdb.org/docs/
- **FastAPI Documentation**: https://fastapi.tiangolo.com/
- **Hugging Face Datasets**: https://huggingface.co/docs/datasets/
- **Parquet Format**: https://parquet.apache.org/

## 🤝 Contributing

1. Fork the repository
2. Create feature branch
3. Make changes
4. Test thoroughly
5. Submit pull request

## 📄 License

This project is licensed under the MIT License.

## 🙏 Acknowledgments

- **DuckDB Team** - High-performance analytical database
- **Hugging Face** - Free hosting and infrastructure
- **FastAPI** - Modern web framework
- **Parquet Community** - Efficient columnar storage

---

**🌟 Star this project if you find it useful!**

**📧 Contact**: For issues and questions, use the Issues tab on Hugging Face.
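As a Python counterpart to the curl call under "Execute SQL Query", the following is a minimal sketch of a client for the `/query` endpoint. The base URL and the `sql_query` / `limit` JSON fields come from the examples in this README. The function names `build_query_request` and `run_query` are hypothetical, the client-side SELECT check is only an illustrative guard (it is not the server's own validation), and the response shape is not documented here, so the result is returned as decoded JSON without further assumptions.

```python
import json
import urllib.request

# Base URL taken from the Quick Start section of this README.
API_BASE = "https://omchoksi108-forexdatalake.hf.space"


def build_query_request(sql_query: str, limit: int = 100) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /query endpoint.

    Hypothetical helper: the field names mirror the curl example.
    """
    # Illustrative client-side guard; the server applies its own validation.
    if not sql_query.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT statements are accepted")
    body = json.dumps({"sql_query": sql_query, "limit": limit}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql_query: str, limit: int = 100, timeout: float = 30.0):
    """Send the query and return the decoded JSON response.

    The response structure is an unknown here, so it is returned as-is.
    """
    request = build_query_request(sql_query, limit)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Separating request construction from sending keeps the payload easy to inspect offline. A call such as `run_query("SELECT ts, close, volume FROM data LIMIT 10", limit=10)` would then perform the network request, assuming the Space is running and publicly reachable.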

