technicalheist/cricket-alpaca
Source: Hugging Face (updated 2026-04-25, indexed 2026-05-03)
Download link:
https://hf-mirror.com/datasets/technicalheist/cricket-alpaca
Description:
---
language:
- en
license: mit
task_categories:
- text-generation
- instruction-tuning
size_categories:
- 1M<n<10M
source_datasets:
- original
annotations_creators:
- no-annotation
dataset_info:
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
    train:
      num_examples: 5353920
      num_bytes: ~2.5GB
    validation:
      num_examples: 669240
      num_bytes: ~300MB
    test:
      num_examples: 669360
      num_bytes: ~300MB
  dataset_size: ~3.1GB
---
# Cricket Match Alpaca Dataset
This dataset contains cricket match information in Alpaca format for instruction tuning of large language models (LLMs).
## Dataset Description
- **Total Matches**: 55,771
- **Total Entries**: 6,692,520
- **Questions per Match**: 120 (20 per category, 6 categories)
- **Date Range**: 2015-2024
- **Format**: Alpaca (instruction-input-output)
## Dataset Splits
| Split | Matches | Entries | Percentage |
|-------|---------|---------|------------|
| Train | 44,616 | 5,353,920 | 80% |
| Valid | 5,577 | 669,240 | 10% |
| Test | 5,578 | 669,360 | 10% |
- **Split Method**: Match-level split (all 120 questions for a match go to the same split)
- **Random Seed**: 42
- **No Data Leakage**: Matches are not shared across splits
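A match-level split of this kind can be sketched as follows. This is a minimal illustration, not the code in `generate_split_dataset.py`: the function name `split_matches`, the use of integer match IDs, and the floor rounding are assumptions.

```python
import random

def split_matches(match_ids, seed=42, train_pct=80, valid_pct=10):
    """Shuffle match IDs once, then cut them into train/validation/test groups."""
    ids = sorted(match_ids)            # fixed starting order, so the seed alone decides the result
    random.Random(seed).shuffle(ids)   # local RNG; does not touch the global random state
    n_train = len(ids) * train_pct // 100
    n_valid = len(ids) * valid_pct // 100
    return {
        "train": ids[:n_train],
        "validation": ids[n_train:n_train + n_valid],
        "test": ids[n_train + n_valid:],   # whatever remains after the first two cuts
    }

splits = split_matches(range(55_771))
print({name: len(ids) for name, ids in splits.items()})
# {'train': 44616, 'validation': 5577, 'test': 5578}
```

Because every question for a match is written to the split that holds its match ID, no match can appear in two splits. With 55,771 matches, floor rounding reproduces the match counts in the table above.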
## Data Format
Each entry is a JSON object with the following structure:
```json
{
  "instruction": "Question text with match-specific details",
  "input": "Context: Tournament: X, Teams: Y vs Z, Scores: A-B...",
  "output": "Answer based on the match data"
}
```
### Example Entry
```json
{
  "instruction": "What is the outcome of the cricket match between VB Kanchi Veerans and Karaikudi Kaalai on 2021-07-27?",
  "input": "Tournament: Tamil Nadu Premier League, Regular Season, Home: VB Kanchi Veerans (Kanchi Veerans), Away: Karaikudi Kaalai (Karaikudi Kaalai), Scores: 148-149, Country: India (IN), Winner: Karaikudi Kaalai, Note: ",
  "output": "Karaikudi Kaalai won against VB Kanchi Veerans by 1 runs on 2021-07-27. Scores: Karaikudi Kaalai 149, VB Kanchi Veerans 148."
}
```
## Question Categories
### 1. Basic Outcome (20 questions)
Questions about match results, winners, and outcomes.
### 2. Team Information (20 questions)
Questions about team names, short names, codes, and slugs.
### 3. Tournament Details (20 questions)
Questions about tournament names, slugs, and categories.
### 4. Geographic (20 questions)
Questions about country names, alpha2, and alpha3 codes.
### 5. Match Metadata (20 questions)
Questions about match dates, status, seasons, and notes.
### 6. Scores and Margin (20 questions)
Questions about scores, margins, and run differences.
## Fields Available
| Field | Description | Example |
|-------|-------------|---------|
| match_date | Date of the match | 2021-07-27 |
| home_team_name | Home team full name | VB Kanchi Veerans |
| away_team_name | Away team full name | Karaikudi Kaalai |
| home_shortName | Home team short name | Kanchi Veerans |
| away_shortName | Away team short name | Karaikudi Kaalai |
| home_nameCode | Home team code | KAN |
| away_nameCode | Away team code | KKA |
| home_slug | Home team slug | vb-kanchi-veerans |
| away_slug | Away team slug | karaikudi-kaalai |
| home_score | Home team score | 148 |
| away_score | Away team score | 149 |
| winner_code | Winner (1=home, 2=away, 3=draw) | 2 |
| tournament_name | Tournament name | Tamil Nadu Premier League, Regular Season |
| tournament_slug | Tournament slug | tamil-nadu-premier-league |
| category_name | Tournament category | India |
| country_name | Country name | India |
| country_alpha2 | Country code (2-letter) | IN |
| country_alpha3 | Country code (3-letter) | IND |
| season_name | Season name | 2021 |
| note | Match notes | (filtered for relevance) |
| status_description | Match status | Ended |
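The `winner_code` and score fields combine into a winner and a margin roughly as sketched below. The function name `describe_result` and the handling of missing scores are assumptions made for illustration; the generation scripts may treat these cases differently.

```python
def describe_result(home, away, home_score, away_score, winner_code):
    """Map the raw fields to (winner, margin); winner is None for a draw."""
    if winner_code not in (1, 2, 3):
        raise ValueError(f"unexpected winner_code: {winner_code!r}")
    winner = {1: home, 2: away, 3: None}[winner_code]
    if home_score is None or away_score is None:
        margin = None                  # some matches have no recorded score
    else:
        margin = abs(int(home_score) - int(away_score))   # integer, never "1.0"
    return winner, margin

print(describe_result("VB Kanchi Veerans", "Karaikudi Kaalai", 148, 149, 2))
# ('Karaikudi Kaalai', 1)
```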
## Data Source
- **Source**: Trusted cricket data provider (downloaded dataset)
- **Database**: SQLite (`dataset/cricket.db`)
- **Tables Used**:
  - `matches` - Core match data
  - `teams` - Team information (names, slugs, shortNames, nameCodes)
  - `countries` - Country information (names, alpha2, alpha3)
  - `tournaments` - Tournament details (names, slugs)
  - `categories` - Tournament categories
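The column layout of these tables is not documented here, so it is safer to inspect the database than to guess at it. The sketch below lists every table and its columns using SQLite's own catalog; `describe_tables` is a hypothetical helper name.

```python
import sqlite3

def describe_tables(conn):
    """Return {table_name: [column names]} for every table in the database."""
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    schema = {}
    for name in names:
        # PRAGMA statements cannot take bound parameters, so quote the identifier.
        quoted = '"' + name.replace('"', '""') + '"'
        rows = conn.execute(f"PRAGMA table_info({quoted})").fetchall()
        schema[name] = [row[1] for row in rows]   # index 1 of each row is the column name
    return schema
```

Calling `describe_tables(sqlite3.connect("dataset/cricket.db"))` should then show the actual columns of `matches`, `teams`, `countries`, `tournaments`, and `categories`.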
## Data Quality
### Quality Checks Passed
- ✅ All entries are valid JSON
- ✅ No empty instructions, inputs, or outputs
- ✅ No unreplaced placeholders
- ✅ Integer margins (not floats like "1.0")
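The checks above could be applied line by line along the following lines. This is a sketch rather than the contents of `quality_check.py`: the placeholder syntax (`{name}`) and the float-margin phrasing (`by 1.0 runs`) are assumptions about what the real checks look for.

```python
import json
import re

REQUIRED_KEYS = ("instruction", "input", "output")
PLACEHOLDER = re.compile(r"\{[A-Za-z_]\w*\}")    # assumed template syntax, e.g. {home_team}
FLOAT_MARGIN = re.compile(r"\bby \d+\.\d+ ")     # assumed phrasing, e.g. "by 1.0 runs"

def check_line(line):
    """Return the problems found in one JSONL line; an empty list means it passed."""
    try:
        entry = json.loads(line)
    except json.JSONDecodeError:
        return ["invalid JSON"]
    if not isinstance(entry, dict):
        return ["not a JSON object"]
    problems = []
    for key in REQUIRED_KEYS:
        value = entry.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"empty {key}")
        elif PLACEHOLDER.search(value):
            problems.append(f"unreplaced placeholder in {key}")
    if FLOAT_MARGIN.search(str(entry.get("output", ""))):
        problems.append("float margin in output")
    return problems
```

Running `check_line` over each line of `train.jsonl` and collecting the non-empty results gives a quick report of any entries that fail.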
### Known Limitations
- **Missing Country Data**: 28% of matches (15,682) lack country information (country_name, alpha2, alpha3)
- **Missing Scores**: 7% of matches lack home_score or away_score
- **Note Field**: Notes are kept only when they mention a team that played in the match. This filter exists because notes in the source database do not always refer to the teams of the match they are attached to.
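The note filter is not shown in this card. One plausible reading, sketched below, keeps a note only if it contains one of the match's team names; `relevant_note` and the case-insensitive substring match are assumptions.

```python
def relevant_note(note, team_names):
    """Return the note if it mentions any of the given team names, else an empty string."""
    if not note:
        return ""
    lowered = note.lower()
    mentioned = any(name and name.lower() in lowered for name in team_names)
    return note if mentioned else ""

teams = ["VB Kanchi Veerans", "Kanchi Veerans", "Karaikudi Kaalai"]
print(repr(relevant_note("Some other team won the toss", teams)))   # ''
```

An empty result is consistent with the example entry above, whose input ends in an empty `Note: ` field.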
## Usage
### Loading with Datasets Library
```python
from datasets import load_dataset

# Load the dataset from the Hugging Face Hub
dataset = load_dataset("technicalheist/cricket-alpaca")

# Access splits
train_data = dataset['train']
valid_data = dataset['validation']
test_data = dataset['test']

print(f"Train: {len(train_data)} examples")
print(f"Valid: {len(valid_data)} examples")
print(f"Test: {len(test_data)} examples")

# Access an example
example = train_data[0]
print(f"Instruction: {example['instruction']}")
print(f"Input: {example['input']}")
print(f"Output: {example['output']}")
```
### Loading from JSONL Files
```python
import json

def load_jsonl(file_path):
    """Read a JSONL file into a list, one parsed object per line."""
    data = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            data.append(json.loads(line))
    return data

# Load training data (the whole file is held in memory)
train_data = load_jsonl('train.jsonl')
print(f"Loaded {len(train_data)} training examples")
```
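The training file is listed at roughly 2.5 GB, so reading it into a single list may not fit comfortably in memory. A generator-based variant, sketched here with the hypothetical helpers `iter_jsonl` and `head`, parses one line at a time.

```python
import json
from itertools import islice

def iter_jsonl(file_path):
    """Yield one parsed entry at a time instead of building a full list."""
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():           # skip blank lines
                yield json.loads(line)

def head(file_path, n=3):
    """Return the first n entries without reading the rest of the file."""
    return list(islice(iter_jsonl(file_path), n))
```

For example, `head("train.jsonl", 5)` previews five entries, and `for entry in iter_jsonl("train.jsonl"): ...` processes the full split with constant memory.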
### Fine-tuning Format
The dataset is already in Alpaca format, compatible with:
- LLaMA Fine-tuning
- Alpaca-style instruction tuning
- Most instruction-tuning frameworks
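Frameworks that do not apply a prompt template themselves need each entry rendered to text first. The sketch below uses the prompt wording popularized by the original Stanford Alpaca project. Whether a given framework expects exactly this wording is an assumption to verify, and `build_prompt` is a hypothetical helper.

```python
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def build_prompt(entry):
    """Render one dataset entry as an Alpaca-style prompt (without the answer)."""
    if entry.get("input", "").strip():
        return PROMPT_WITH_INPUT.format(instruction=entry["instruction"], input=entry["input"])
    return PROMPT_NO_INPUT.format(instruction=entry["instruction"])

def build_training_text(entry):
    """Prompt followed by the expected output, as used for supervised fine-tuning."""
    return build_prompt(entry) + entry["output"]
```

Since the quality checks report no empty inputs, the no-input branch is only a safeguard for entries added or edited later.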
## Repository Structure
```
.
├── train.jsonl # Training split (5.35M entries)
├── valid.jsonl # Validation split (669K entries)
├── test.jsonl # Test split (669K entries)
├── README.md # This file (dataset card)
└── dataset/
    └── cricket.db # SQLite database with raw data
```
## Scripts
The dataset was generated using the following scripts:
| Script | Description |
|--------|-------------|
| `generate_split_dataset.py` | Generates train/val/test splits from database |
| `generate_dataset.py` | Generates single dataset file |
| `export_clean_csv.py` | Exports clean CSV from database |
| `quality_check.py` | Runs quality checks on generated data |
## Reproducibility
- **Python Version**: 3.x
- **Dependencies**: `sqlite3`, `json`, `random` (Python standard library); `datasets`, `huggingface-hub` (installed separately)
- **Random Seed**: 42 (for reproducible splits)
- **Database**: `dataset/cricket.db` (provided)
## Citation
If you use this dataset, please cite:
```bibtex
@dataset{cricket_alpaca_2024,
  author = {technicalheist},
  title = {Cricket Match Alpaca Dataset},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/technicalheist/cricket-alpaca}}
}
```
## License
MIT License
## Contact
For questions or issues, please open an issue in the repository.
Provider:
technicalheist



