ammarnasr/the-stack-swift-clean

Name: ammarnasr/the-stack-swift-clean
Creator: ammarnasr
Published: 2023-08-14 21:20:23
License: openrail

Source: Hugging Face. Updated 2023-08-14; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/ammarnasr/the-stack-swift-clean


Resource overview:

---
license: openrail
dataset_info:
  features:
  - name: hexsha
    dtype: string
  - name: size
    dtype: int64
  - name: content
    dtype: string
  - name: avg_line_length
    dtype: float64
  - name: max_line_length
    dtype: int64
  - name: alphanum_fraction
    dtype: float64
  splits:
  - name: train
    num_bytes: 3582248477.9086223
    num_examples: 806789
  - name: test
    num_bytes: 394048264.9973618
    num_examples: 88747
  - name: valid
    num_bytes: 3982797.09401595
    num_examples: 897
  download_size: 1323156008
  dataset_size: 3980279540
task_categories:
- text-generation
language:
- code
tags:
- code
pretty_name: TheStack-Swift
size_categories:
- 1M<n<10M
---

## Dataset 1: TheStack - Swift - Cleaned

**Description**: This dataset is drawn from TheStack Corpus, an open-source code dataset with over 3TB of GitHub data covering 48 programming languages. We selected a small portion of this dataset to optimize smaller language models for Swift, a popular statically typed language.

**Target Language**: Swift

**Dataset Size**:
- Training: 900,000 files
- Validation: 50,000 files
- Test: 50,000 files

**Preprocessing**:
1. Selected Swift as the target language due to its popularity on GitHub.
2. Filtered out files with average line length > 100 characters, maximum line length > 1000 characters, and alphabet ratio < 25%.
3. Split files into 90% training, 5% validation, and 5% test sets.

**Tokenizer**: Byte Pair Encoding (BPE) tokenizer with tab and whitespace tokens. GPT-2 vocabulary extended with special tokens.

**Training Sequences**: Sequences constructed by joining training data text to reach a context length of 2048 tokens (1024 tokens for full fine-tuning).

Provider:

ammarnasr

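Summary of original information

Dataset overview

Dataset name

Name: TheStack-Swift - Cleaned

Dataset description

Description: This dataset is drawn from TheStack Corpus, an open-source code dataset with over 3TB of GitHub data covering 48 programming languages. It was selected specifically to optimize smaller language models for Swift.

Target language

Target language: Swift

Dataset size

Training set: 900,000 files
Validation set: 50,000 files
Test set: 50,000 files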
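
The file counts above are round figures, while the dataset card metadata records a `num_examples` value for each split. As a rough check, the share of each split can be computed from those metadata counts. This is only a sketch: the three numbers are copied from the metadata block, and the helper name `split_fractions` is made up for illustration.

```python
# num_examples values copied from the dataset card metadata (splits section).
NUM_EXAMPLES = {"train": 806_789, "test": 88_747, "valid": 897}


def split_fractions(counts):
    """Return each split's share of the total number of examples."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


if __name__ == "__main__":
    fractions = split_fractions(NUM_EXAMPLES)
    for name, frac in fractions.items():
        # Print the raw count next to its percentage of the whole dataset.
        print(f"{name}: {NUM_EXAMPLES[name]:>7} examples ({frac:.2%})")
```

Running the script prints one line per split, which makes it easy to compare the published metadata against the nominal sizes listed here.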

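Preprocessing steps

Swift was selected as the target language because of its popularity on GitHub.
Files with an average line length above 100 characters, a maximum line length above 1000 characters, or an alphabet ratio below 25% were filtered out.
Files were split into 90% training, 5% validation, and 5% test sets.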
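
The filtering and splitting steps above can be sketched in a few lines of Python. This is a minimal illustration, not the author's pipeline: how the per-file statistics were actually computed is not stated, so `file_stats` is one plausible reading, and treating the "alphabet ratio" as the `alphanum_fraction` column from the metadata is an assumption. The function names, the seed, and the shuffle-then-slice split are likewise hypothetical.

```python
import random


def file_stats(content: str) -> dict:
    """Compute line-length and alphanumeric statistics for one source file."""
    lines = content.splitlines() or [""]
    lengths = [len(line) for line in lines]
    alnum = sum(ch.isalnum() for ch in content)
    return {
        "avg_line_length": sum(lengths) / len(lengths),
        "max_line_length": max(lengths),
        # Assumption: "alphabet ratio" means the share of alphanumeric characters.
        "alphanum_fraction": alnum / len(content) if content else 0.0,
    }


def keep_file(stats: dict, max_avg=100, max_line=1000, min_alnum=0.25) -> bool:
    """Apply the three thresholds from the preprocessing description."""
    return (
        stats["avg_line_length"] <= max_avg
        and stats["max_line_length"] <= max_line
        and stats["alphanum_fraction"] >= min_alnum
    )


def split_files(files, seed=0, train_pct=90, valid_pct=5):
    """Shuffle and slice into train/valid/test; the remainder goes to test."""
    shuffled = list(files)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_valid = len(shuffled) * valid_pct // 100
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test
```

A typical use would be `kept = [f for f in files if keep_file(file_stats(f))]` followed by `split_files(kept)`. Integer percentages are used for the split sizes to avoid floating-point rounding at the boundaries.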

Tokenizer

Tokenizer: Byte Pair Encoding (BPE) with tab and whitespace tokens. The GPT-2 vocabulary is extended with special tokens.
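
The description does not say which extra tokens were added or how, so the following is only a toy sketch of the general idea: extend an existing vocabulary with dedicated tokens for a tab and for runs of spaces, so that indentation is not broken into many single-space pieces. The token list, the function names, and the tiny base vocabulary are all hypothetical, and no real tokenizer library is used.

```python
import re

# Hypothetical extra tokens: a tab plus runs of 8, 4 and 2 spaces.
EXTRA_TOKENS = ["\t", " " * 8, " " * 4, " " * 2]


def extend_vocab(vocab: dict, extra_tokens) -> dict:
    """Return a copy of vocab with extra_tokens appended at fresh ids.

    Assumes the existing ids are contiguous and start at 0.
    """
    extended = dict(vocab)
    for tok in extra_tokens:
        if tok not in extended:
            extended[tok] = len(extended)
    return extended


def split_whitespace_runs(text: str, extra_tokens) -> list:
    """Split text so each extra whitespace token becomes its own piece.

    Longer tokens are tried first, so eight spaces match one token
    rather than two four-space tokens.
    """
    ordered = sorted(extra_tokens, key=len, reverse=True)
    pattern = re.compile("(" + "|".join(re.escape(t) for t in ordered) + ")")
    return [piece for piece in pattern.split(text) if piece]
```

In a real setup the remaining pieces (such as `let x`) would still go through the ordinary BPE merges; only the whitespace handling is illustrated here.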

Training sequences

Sequence construction: Training texts are concatenated to reach a context length of 2048 tokens (1024 tokens for full fine-tuning).
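
Joining texts to a fixed context length is commonly done by concatenating the token ids of all files into one stream and cutting it into equal blocks. The sketch below shows that approach under stated assumptions: the separator id `eos_id` and the choice to drop the incomplete final block are not specified in the description, and the function name `pack_sequences` is made up for illustration.

```python
def pack_sequences(tokenized_files, context_length=2048, eos_id=0):
    """Concatenate token id lists and cut them into fixed-length blocks.

    tokenized_files: iterable of lists of token ids, one list per file.
    eos_id: hypothetical separator id appended after each file.
    The trailing tokens that do not fill a whole block are dropped.
    """
    stream = []
    for ids in tokenized_files:
        stream.extend(ids)
        stream.append(eos_id)

    n_full = len(stream) // context_length
    return [
        stream[i * context_length:(i + 1) * context_length]
        for i in range(n_full)
    ]
```

For the full fine-tuning setting mentioned above, the same function would be called with `context_length=1024`.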
