technicalheist/cricket-alpaca

Hugging Face · Updated 2026-04-25 · Indexed 2026-05-03
Download link:
https://hf-mirror.com/datasets/technicalheist/cricket-alpaca
Resource description:
---
language:
- en
license: mit
task_categories:
- text-generation
- instruction-tuning
size_categories:
- 1M<n<10M
source_datasets:
- original
annotations_creators:
- no-annotation
dataset_info:
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
    train:
      num_examples: 5353920
      num_bytes: ~2.5GB
    validation:
      num_examples: 669240
      num_bytes: ~300MB
    test:
      num_examples: 669360
      num_bytes: ~300MB
  dataset_size: ~3.1GB
---

# Cricket Match Alpaca Dataset

This dataset contains cricket match information formatted in Alpaca format for instruction-tuning Large Language Models (LLMs).

## Dataset Description

- **Total Matches**: 55,771
- **Total Entries**: 6,692,520
- **Questions per Match**: 120 (20 per category across 6 categories)
- **Date Range**: 2015-2024
- **Format**: Alpaca (instruction-input-output)

## Dataset Splits

| Split | Matches | Entries | Percentage |
|-------|---------|---------|------------|
| Train | 44,616 | 5,353,920 | 80% |
| Valid | 5,577 | 669,240 | 10% |
| Test | 5,578 | 669,360 | 10% |

- **Split Method**: Match-level split (all 120 questions for a match go to the same split)
- **Random Seed**: 42
- **No Data Leakage**: No match appears in more than one split

## Data Format

Each entry is a JSON object with the following structure:

```json
{
  "instruction": "Question text with match-specific details",
  "input": "Context: Tournament: X, Teams: Y vs Z, Scores: A-B...",
  "output": "Answer based on the match data"
}
```

### Example Entry

```json
{
  "instruction": "What is the outcome of the cricket match between VB Kanchi Veerans and Karaikudi Kaalai on 2021-07-27?",
  "input": "Tournament: Tamil Nadu Premier League, Regular Season, Home: VB Kanchi Veerans (Kanchi Veerans), Away: Karaikudi Kaalai (Karaikudi Kaalai), Scores: 148-149, Country: India (IN), Winner: Karaikudi Kaalai, Note: ",
  "output": "Karaikudi Kaalai won against VB Kanchi Veerans by 1 runs on 2021-07-27. Scores: Karaikudi Kaalai 149, VB Kanchi Veerans 148."
}
```

## Question Categories

### 1. Basic Outcome (20 questions)

Questions about match results, winners, and outcomes.

### 2. Team Information (20 questions)

Questions about team names, short names, codes, and slugs.

### 3. Tournament Details (20 questions)

Questions about tournament names, slugs, and categories.

### 4. Geographic (20 questions)

Questions about country names, alpha2, and alpha3 codes.

### 5. Match Metadata (20 questions)

Questions about match dates, status, seasons, and notes.

### 6. Scores and Margin (20 questions)

Questions about scores, margins, and run differences.

## Fields Available

| Field | Description | Example |
|-------|-------------|---------|
| match_date | Date of the match | 2021-07-27 |
| home_team_name | Home team full name | VB Kanchi Veerans |
| away_team_name | Away team full name | Karaikudi Kaalai |
| home_shortName | Home team short name | Kanchi Veerans |
| away_shortName | Away team short name | Karaikudi Kaalai |
| home_nameCode | Home team code | KAN |
| away_nameCode | Away team code | KKA |
| home_slug | Home team slug | vb-kanchi-veerans |
| away_slug | Away team slug | karaikudi-kaalai |
| home_score | Home team score | 148 |
| away_score | Away team score | 149 |
| winner_code | Winner (1=home, 2=away, 3=draw) | 2 |
| tournament_name | Tournament name | Tamil Nadu Premier League, Regular Season |
| tournament_slug | Tournament slug | tamil-nadu-premier-league |
| category_name | Tournament category | India |
| country_name | Country name | India |
| country_alpha2 | Country code (2-letter) | IN |
| country_alpha3 | Country code (3-letter) | IND |
| season_name | Season name | 2021 |
| note | Match notes | (filtered for relevance) |
| status_description | Match status | Ended |

## Data Source

- **Source**: Trusted cricket data provider (downloaded dataset)
- **Database**: SQLite (`dataset/cricket.db`)
- **Tables Used**:
  - `matches` - Core match data
  - `teams` - Team information (names, slugs, shortNames, nameCodes)
  - `countries` - Country information (names, alpha2, alpha3)
  - `tournaments` - Tournament details (names, slugs)
  - `categories` - Tournament categories

## Data Quality

### Quality Checks Passed

- ✅ All entries are valid JSON
- ✅ No empty instructions, inputs, or outputs
- ✅ No unreplaced placeholders
- ✅ Integer margins (not floats like "1.0")

### Known Limitations

- **Missing Country Data**: 28% of matches (15,682) lack country information (country_name, alpha2, alpha3)
- **Missing Scores**: 7% of matches lack home_score or away_score
- **Note Field**: Only notes that mention one of the teams in the match are kept. In the source database, the original note attached to a match may refer to different teams.

## Usage

### Loading with Datasets Library

```python
from datasets import load_dataset

# Load the dataset (repository ID as listed on this page)
dataset = load_dataset("technicalheist/cricket-alpaca")

# Access splits
train_data = dataset['train']
valid_data = dataset['validation']
test_data = dataset['test']

print(f"Train: {len(train_data)} examples")
print(f"Valid: {len(valid_data)} examples")
print(f"Test: {len(test_data)} examples")

# Access an example
example = train_data[0]
print(f"Instruction: {example['instruction']}")
print(f"Input: {example['input']}")
print(f"Output: {example['output']}")
```

### Loading from JSONL Files

```python
import json

def load_jsonl(file_path):
    """Read a JSON Lines file into a list of dicts."""
    data = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            if line.strip():  # skip blank lines
                data.append(json.loads(line))
    return data

# Load training data
train_data = load_jsonl('train.jsonl')
print(f"Loaded {len(train_data)} training examples")
```

### Fine-tuning Format

The dataset is already in Alpaca format, compatible with:

- LLaMA fine-tuning
- Alpaca-style instruction tuning
- Most instruction-tuning frameworks

## Repository Structure

```
.
├── train.jsonl           # Training split (5.35M entries)
├── valid.jsonl           # Validation split (669K entries)
├── test.jsonl            # Test split (669K entries)
├── README.md             # This file (dataset card)
└── dataset/
    └── cricket.db        # SQLite database with raw data
```

## Scripts

The dataset was generated using the following scripts:

| Script | Description |
|--------|-------------|
| `generate_split_dataset.py` | Generates train/val/test splits from database |
| `generate_dataset.py` | Generates single dataset file |
| `export_clean_csv.py` | Exports clean CSV from database |
| `quality_check.py` | Runs quality checks on generated data |

## Reproducibility

- **Python Version**: 3.x
- **Dependencies**: `sqlite3`, `json`, `random` (standard library); `datasets`, `huggingface-hub` (third-party)
- **Random Seed**: 42 (for reproducible splits)
- **Database**: `dataset/cricket.db` (provided)

## Citation

If you use this dataset, please cite:

```bibtex
@dataset{cricket_alpaca_2024,
  author = {Your Name},
  title = {Cricket Match Alpaca Dataset},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/your-username/cricket-alpaca}}
}
```

## License

MIT License

## Contact

For questions or issues, please open an issue in the repository.
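
As an illustration of the match-level split described under Dataset Splits, the sketch below shuffles match IDs with seed 42 and cuts them 80/10/10. It is not the code from `generate_split_dataset.py`, which is not shown here: the function name `split_match_ids`, the use of `random.Random`, the integer IDs standing in for real match IDs, and the truncating cut points are all assumptions. The truncating cuts do reproduce the match counts listed in the splits table, but the actual assignment of matches to splits may differ.

```python
import random

QUESTIONS_PER_MATCH = 120  # from the dataset card


def split_match_ids(match_ids, seed=42, train_frac=0.8, valid_frac=0.1):
    """Hypothetical helper: shuffle match IDs and cut them into three groups.

    Splitting by match ID (not by entry) keeps all questions for one match
    in the same split, so no match leaks across splits.
    """
    ids = sorted(match_ids)        # fixed starting order before shuffling
    rng = random.Random(seed)      # local RNG, leaves global state untouched
    rng.shuffle(ids)
    n_train = int(len(ids) * train_frac)   # truncate toward zero
    n_valid = int(len(ids) * valid_frac)
    return {
        "train": ids[:n_train],
        "validation": ids[n_train:n_train + n_valid],
        "test": ids[n_train + n_valid:],   # remainder goes to test
    }


# Placeholder IDs 0..55770 stand in for the 55,771 real match IDs.
splits = split_match_ids(range(55771))
sizes = {name: len(ids) for name, ids in splits.items()}
entries = {name: n * QUESTIONS_PER_MATCH for name, n in sizes.items()}
print(sizes)    # {'train': 44616, 'validation': 5577, 'test': 5578}
print(entries)  # {'train': 5353920, 'validation': 669240, 'test': 669360}
```

Because the split is decided per match and each match then contributes 120 entries, the entry counts are the match counts multiplied by 120.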
Provided by:
technicalheist