OMCHOKSI108/my-cloud-data-lake
Source: Hugging Face · Updated 2026-02-15 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/OMCHOKSI108/my-cloud-data-lake
Resource description:
# 🌊 Zero-Cost Cloud Data Lake on Hugging Face
A complete serverless data pipeline that converts 12GB+ of CSV data to optimized Parquet format and serves it through a FastAPI service on Hugging Face Spaces.
## 🏗️ Architecture Overview
```
📁 Local CSV Data (12GB+)
↓
🔄 Conversion Script (DuckDB + SNAPPY)
↓
📦 Parquet Files (4.4GB - 67% compression)
↓
☁️ Hugging Face Dataset
↓
🚀 Serverless FastAPI
↓
🌐 Public API Endpoint
```
## 📊 Dataset Information
- **Source**: 793 CSV files, grouped by timeframe
- **Assets**: 100+ financial instruments (Forex, Crypto, Stocks, Commodities)
- **Timeframes**: 1min, 5min, 15min, 30min, 1hr, 4hr, 1day
- **Format**: OHLCV (Open, High, Low, Close, Volume)
- **Compression**: 67% size reduction (13.4GB → 4.4GB)
- **Storage**: Hugging Face Datasets
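
The 67% figure follows directly from the two sizes quoted above. A small worked check, using only those numbers:

```python
# Sizes quoted in the list above, in gigabytes.
csv_gb = 13.4
parquet_gb = 4.4

# Fraction of the original size that was removed by the conversion.
reduction = (csv_gb - parquet_gb) / csv_gb
# How many times smaller the Parquet copy is than the CSV source.
ratio = csv_gb / parquet_gb

print(f"size reduction: {reduction:.1%}")
print(f"compression ratio: {ratio:.2f}x")
```

The reduction rounds to 67%, which corresponds to Parquet files roughly three times smaller than the source CSVs.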
## 🚀 Quick Start
### 1. API Access
- **Live API**: https://omchoksi108-forexdatalake.hf.space
- **Documentation**: https://omchoksi108-forexdatalake.hf.space/docs
- **Dataset**: https://huggingface.co/datasets/omchoksi108/forex-cloud-data-lake
### 2. API Endpoints
| Endpoint | Method | Description |
|----------|---------|-------------|
| `/` | GET | API information |
| `/health` | GET | Health check |
| `/describe` | GET | Dataset schema |
| `/preview` | GET | Quick data preview |
| `/query` | POST | Execute SQL queries |
### 3. Example Usage
#### Health Check
```bash
curl https://omchoksi108-forexdatalake.hf.space/health
```
#### Get Dataset Schema
```bash
curl https://omchoksi108-forexdatalake.hf.space/describe
```
#### Preview Data
```bash
curl "https://omchoksi108-forexdatalake.hf.space/preview?limit=5"
```
#### Execute SQL Query
```bash
curl -X POST https://omchoksi108-forexdatalake.hf.space/query \
  -H "Content-Type: application/json" \
  -d '{
    "sql_query": "SELECT ts, close, volume FROM data WHERE close > 100 LIMIT 10",
    "limit": 100
  }'
```
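
The same calls can be made from Python. The sketch below uses only the standard library and mirrors the curl examples above; the helper names are illustrative, and it assumes the API answers with JSON (FastAPI's default), which this README does not state explicitly.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://omchoksi108-forexdatalake.hf.space"


def preview_url(limit: int = 5) -> str:
    """Build the GET /preview URL with its `limit` query parameter."""
    return f"{BASE_URL}/preview?{urllib.parse.urlencode({'limit': limit})}"


def build_query_request(sql_query: str, limit: int = 100) -> urllib.request.Request:
    """Build (but do not send) the POST /query request from the curl example."""
    body = json.dumps({"sql_query": sql_query, "limit": limit}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout: float = 30.0):
    """Send a Request object or URL string and decode the JSON response.

    Assumption: the response body is JSON.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `send(preview_url(5))` fetches a preview, and `send(build_query_request("SELECT ts, close, volume FROM data WHERE close > 100 LIMIT 10"))` runs the SQL example.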
## 📁 File Structure
```
├── requirements_local.txt          # Local dependencies
├── scripts/
│   ├── 1_convert_to_parquet.py     # CSV → Parquet conversion
│   └── 2_upload_to_hf.py           # Upload to Hugging Face
├── api_deploy/
│   ├── main.py                     # FastAPI server
│   ├── requirements.txt            # API dependencies
│   └── Dockerfile                  # Container configuration
├── parquet_data/                   # Converted Parquet files
└── forexdatalake/                  # Hugging Face Space files
```
## 🛠️ Local Setup
### Prerequisites
- Python 3.9+
- Hugging Face account with write permissions
### Installation
```bash
pip install -r requirements_local.txt
```
### Environment Setup
Create a `.env` file in the project root:
```env
HF_TOKEN=your_hf_write_token_here
HF_USERNAME=omchoksi108
```
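
How the scripts read this file is not shown here; they may well use a package such as `python-dotenv`. As a dependency-free stand-in, a minimal loader could look like the sketch below (`load_env_file` is a hypothetical helper, not part of the repository):

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict:
    """Parse KEY=VALUE lines from a .env file and export them to os.environ."""
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blank lines, comments, and malformed lines
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    for key, value in values.items():
        # Variables already set in the real environment take precedence.
        os.environ.setdefault(key, value)
    return values
```

Keeping the token in `.env` (and out of version control) is what lets the scripts avoid hardcoded credentials.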
### Data Conversion
```bash
python scripts/1_convert_to_parquet.py
```
### Upload to Hugging Face
```bash
python scripts/2_upload_to_hf.py
```
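
The contents of `2_upload_to_hf.py` are not reproduced in this README. A minimal sketch of such an upload with the `huggingface_hub` client might look like this; the function names and the default dataset name (taken from the dataset URL above) are assumptions, and `private=True` mirrors the "Private Dataset" note under Security Features.

```python
import os


def dataset_repo_id(username: str, dataset_name: str) -> str:
    """Combine a Hugging Face username and dataset name into a repo id."""
    return f"{username}/{dataset_name}"


def upload_parquet_folder(
    folder: str = "parquet_data",
    dataset_name: str = "forex-cloud-data-lake",  # assumed from the dataset URL
) -> str:
    """Create the dataset repo if needed and upload every file in `folder`."""
    # Imported lazily so the helper above works without the package installed.
    from huggingface_hub import HfApi

    repo_id = dataset_repo_id(os.environ["HF_USERNAME"], dataset_name)
    api = HfApi(token=os.environ["HF_TOKEN"])
    # exist_ok=True makes repeated runs safe; drop private=True for a public dataset.
    api.create_repo(repo_id=repo_id, repo_type="dataset", private=True, exist_ok=True)
    api.upload_folder(folder_path=folder, repo_id=repo_id, repo_type="dataset")
    return repo_id
```

The token must have write permission, otherwise `create_repo` fails with the 403 error described under Troubleshooting.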
## 🔧 Technical Details
### Data Processing
- **Engine**: DuckDB (in-memory analytical database)
- **Compression**: SNAPPY (balanced speed/size)
- **Format**: Apache Parquet (columnar storage)
- **Memory Usage**: Optimized for large datasets
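
As an illustration of the points above, a CSV to Parquet conversion with DuckDB and SNAPPY compression can be sketched as follows. This is not the repository's `1_convert_to_parquet.py`; the function names, the directory layout, and the `2GB` memory limit are assumptions.

```python
from pathlib import Path


def build_copy_sql(csv_path: str, parquet_path: str) -> str:
    """Return a DuckDB COPY statement that rewrites one CSV as SNAPPY Parquet."""
    src = csv_path.replace("'", "''")  # escape quotes for SQL string literals
    dst = parquet_path.replace("'", "''")
    return (
        f"COPY (SELECT * FROM read_csv_auto('{src}')) "
        f"TO '{dst}' (FORMAT PARQUET, COMPRESSION SNAPPY)"
    )


def convert_folder(csv_dir: str, parquet_dir: str, memory_limit: str = "2GB") -> int:
    """Convert every CSV under csv_dir, mirroring the folder layout in parquet_dir."""
    import duckdb  # lazy import; requires `pip install duckdb`

    con = duckdb.connect()  # in-memory database
    # Cap DuckDB's memory use (hypothetical value) to stay within available RAM.
    con.execute(f"SET memory_limit = '{memory_limit}'")
    converted = 0
    for csv_file in sorted(Path(csv_dir).rglob("*.csv")):
        target = Path(parquet_dir) / csv_file.relative_to(csv_dir).with_suffix(".parquet")
        target.parent.mkdir(parents=True, exist_ok=True)
        con.execute(build_copy_sql(csv_file.as_posix(), target.as_posix()))
        converted += 1
    con.close()
    return converted
```

Converting one file per `COPY` statement keeps only a single file's data in flight at a time.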
### API Features
- **Serverless**: No server management required
- **Streaming**: Direct Parquet queries without download
- **Caching**: HTTP file caching for performance
- **Security**: SQL injection protection
- **Monitoring**: Health checks and error handling
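
"Direct Parquet queries without download" most likely refers to DuckDB reading Parquet over HTTP through its `httpfs` extension, although the README does not name the mechanism. A hedged sketch of that technique, with hypothetical function names and a file path that must be supplied by the caller:

```python
def parquet_url(repo_id: str, path_in_repo: str, revision: str = "main") -> str:
    """Build the Hugging Face 'resolve' URL for a file in a dataset repo."""
    return f"https://huggingface.co/datasets/{repo_id}/resolve/{revision}/{path_in_repo}"


def query_remote_parquet(url: str, limit: int = 5) -> list:
    """Read the first rows of a remote Parquet file without downloading it whole."""
    import duckdb  # lazy import; requires `pip install duckdb`

    con = duckdb.connect()
    con.execute("INSTALL httpfs")  # enables HTTP(S) reads
    con.execute("LOAD httpfs")
    safe_url = url.replace("'", "''")  # escape quotes for the SQL literal
    rows = con.execute(
        f"SELECT ts, close, volume FROM read_parquet('{safe_url}') LIMIT {int(limit)}"
    ).fetchall()
    con.close()
    return rows
```

Because Parquet is columnar, DuckDB only needs to fetch the byte ranges for the selected columns. A private dataset would additionally require authenticated requests, which this sketch does not cover.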
### Performance Metrics
- **Conversion Rate**: ~1.8 files/second
- **Compression Ratio**: 67% size reduction
- **Query Response**: Sub-second for most queries
- **Concurrent Users**: Handles multiple simultaneous requests
## 📈 Data Schema
Each Parquet file contains:
```sql
ts      TIMESTAMP  -- Data timestamp (YYYY-MM-DD HH:MM)
open    DOUBLE     -- Opening price
high    DOUBLE     -- Highest price
low     DOUBLE     -- Lowest price
close   DOUBLE     -- Closing price
volume  BIGINT     -- Trading volume
```
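
On the client side, a row with this schema maps naturally onto a small typed record. The sketch below is illustrative only: `Candle` and `parse_candle` are hypothetical names, and it assumes `ts` arrives as a string in the `YYYY-MM-DD HH:MM` form noted above.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Candle:
    """One OHLCV row, typed to match the Parquet schema above."""
    ts: datetime
    open: float
    high: float
    low: float
    close: float
    volume: int


def parse_candle(row: dict) -> Candle:
    """Convert a dict of raw values (strings or numbers) into a Candle."""
    return Candle(
        ts=datetime.strptime(row["ts"], "%Y-%m-%d %H:%M"),
        open=float(row["open"]),
        high=float(row["high"]),
        low=float(row["low"]),
        close=float(row["close"]),
        volume=int(row["volume"]),
    )
```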
## 🔒 Security Features
- **SQL Validation**: Only SELECT statements allowed
- **Query Limits**: Configurable result limits
- **Input Sanitization**: Prevents SQL injection
- **Private Dataset**: Access controlled via Hugging Face
- **Token Security**: No hardcoded credentials
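
The API's actual validation code is not shown in this README. To make the first two points concrete, here is one simple way a SELECT-only check and a result cap could be written; the keyword list, function names, and the maximum of 1000 rows are assumptions, and keyword filtering alone should not be treated as complete protection.

```python
import re

# Keywords that should never appear in a read-only query (illustrative list).
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|attach|copy|pragma|install|load|export|call)\b",
    re.IGNORECASE,
)


def is_safe_select(sql: str) -> bool:
    """Accept a single SELECT statement that contains no forbidden keywords.

    This is deliberately strict: a query that merely mentions a forbidden
    word (for example in a string literal) is rejected as well.
    """
    statement = sql.strip().rstrip(";").strip()
    if not statement or ";" in statement:
        return False  # empty input or multiple statements
    if not re.match(r"select\b", statement, re.IGNORECASE):
        return False
    return FORBIDDEN.search(statement) is None


def clamp_limit(requested: int, maximum: int = 1000) -> int:
    """Keep the requested row limit between 1 and `maximum` (hypothetical cap)."""
    return max(1, min(requested, maximum))
```

Rejecting on the strict side is a reasonable default for a public endpoint, since a false rejection only costs the caller a rewrite of the query.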
## 🚀 Deployment
### Automatic Deployment
1. Push changes to the `forexdatalake` folder
2. Hugging Face automatically builds the Docker container
3. The API becomes available at your Space URL
### Manual Deployment
```bash
cd forexdatalake
git add .
git commit -m "Update API"
git push origin main
```
## 📊 Query Examples
### Filter by Symbol
```sql
SELECT * FROM data
WHERE filename LIKE '%BTCUSD%'
LIMIT 100
```
### Time Range Analysis
```sql
SELECT ts, close, volume
FROM data
WHERE ts >= '2023-01-01' AND ts < '2024-01-01'
AND symbol = 'EURUSD'
ORDER BY ts DESC
```
### Price Statistics
```sql
SELECT
    AVG(close) AS avg_close,
    MIN(close) AS min_close,
    MAX(close) AS max_close,
    COUNT(*) AS total_records
FROM data
WHERE symbol = 'BTCUSD'
```
### Volume Analysis
```sql
SELECT
    DATE(ts) AS date,
    SUM(volume) AS daily_volume,
    AVG(close) AS avg_price
FROM data
WHERE symbol = 'ETHUSD'
GROUP BY DATE(ts)
ORDER BY date DESC
LIMIT 30
```
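
To experiment with the shape of the Volume Analysis query without calling the API, it can be run against a few rows in an in-memory database. This sketch uses Python's built-in `sqlite3` as a stand-in for DuckDB; the rows are made-up test values, and the `symbol` column is assumed because the queries above use it.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (ts TEXT, symbol TEXT, close REAL, volume INTEGER)")
# Made-up rows, for illustration only.
con.executemany(
    "INSERT INTO data VALUES (?, ?, ?, ?)",
    [
        ("2023-01-01 10:00", "ETHUSD", 1200.0, 10),
        ("2023-01-01 11:00", "ETHUSD", 1210.0, 20),
        ("2023-01-02 10:00", "ETHUSD", 1220.0, 5),
        ("2023-01-02 10:00", "BTCUSD", 16500.0, 7),
    ],
)
# Same aggregation as the Volume Analysis example: one row per calendar day.
rows = con.execute(
    """
    SELECT DATE(ts) AS date, SUM(volume) AS daily_volume, AVG(close) AS avg_price
    FROM data
    WHERE symbol = 'ETHUSD'
    GROUP BY DATE(ts)
    ORDER BY date DESC
    LIMIT 30
    """
).fetchall()
con.close()

for row in rows:
    print(row)
```

Each output row holds the day, the summed volume, and the mean closing price, with the most recent day first.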
## 🔄 Data Updates
### Adding New Data
1. Add CSV files to the appropriate timeframe folders
2. Run the conversion script
3. Upload the updated Parquet files
4. The API then queries the latest data automatically
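
When only a few files have changed, it can help to find which CSVs still need conversion before re-running the whole pipeline. A minimal sketch, assuming the Parquet folder mirrors the CSV folder layout (the helper name and that layout are assumptions):

```python
from pathlib import Path


def pending_conversions(csv_dir: str, parquet_dir: str) -> list:
    """List CSV files whose Parquet counterpart is missing or older than the CSV."""
    csv_root, parquet_root = Path(csv_dir), Path(parquet_dir)
    pending = []
    for csv_file in sorted(csv_root.rglob("*.csv")):
        # Same relative path under parquet_dir, with a .parquet suffix.
        target = parquet_root / csv_file.relative_to(csv_root).with_suffix(".parquet")
        if not target.exists() or target.stat().st_mtime < csv_file.stat().st_mtime:
            pending.append(csv_file)
    return pending
```

Comparing modification times is a cheap heuristic; a content hash would be more robust if files are copied around.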
### Schema Changes
- Update the conversion script to handle the new columns
- Re-run the conversion
- Upload the updated dataset
## 🐛 Troubleshooting
### Common Issues
**Token Permissions**
```
Error: 403 Forbidden - You don't have rights to create dataset
Solution: Create a new HF token with write permissions
```
**Memory Issues**
```
Error: Out of memory during conversion
Solution: Lower DuckDB's memory_limit setting in the conversion script
```
**API Timeouts**
```
Error: Request timeout
Solution: Add a LIMIT clause to your queries
```
### Performance Optimization
- Use specific column selection instead of `SELECT *`
- Add appropriate WHERE clauses
- Use LIMIT for large result sets
- Restrict queries to a time range on `ts` so that less data is scanned
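
A client can apply the LIMIT advice automatically before sending a query. The helper below is a hypothetical convenience, not part of the API; it only recognizes a plain trailing `LIMIT n` and does not handle `LIMIT ... OFFSET`.

```python
import re

# Matches a plain "LIMIT <number>" at the very end of the statement.
_LIMIT_AT_END = re.compile(r"\blimit\s+\d+\s*$", re.IGNORECASE)


def ensure_limit(sql: str, default_limit: int = 100) -> str:
    """Append a LIMIT clause unless the query already ends with one."""
    statement = sql.strip().rstrip(";").rstrip()
    if _LIMIT_AT_END.search(statement):
        return statement
    return f"{statement} LIMIT {default_limit}"
```

For example, `ensure_limit("SELECT ts, close FROM data")` adds `LIMIT 100`, while a query that already ends in `LIMIT 10` is returned unchanged apart from the trailing semicolon.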
## 📚 Resources
- **DuckDB Documentation**: https://duckdb.org/docs/
- **FastAPI Documentation**: https://fastapi.tiangolo.com/
- **Hugging Face Datasets**: https://huggingface.co/docs/datasets/
- **Parquet Format**: https://parquet.apache.org/
## 🤝 Contributing
1. Fork the repository
2. Create feature branch
3. Make changes
4. Test thoroughly
5. Submit pull request
## 📄 License
This project is licensed under the MIT License.
## 🙏 Acknowledgments
- **DuckDB Team** - High-performance analytical database
- **Hugging Face** - Free hosting and infrastructure
- **FastAPI** - Modern web framework
- **Parquet Community** - Efficient columnar storage
---
**🌟 Star this project if you find it useful!**
**📧 Contact**: For issues and questions, use the Issues tab on Hugging Face.
Provided by:
OMCHOKSI108



