ammarnasr/the-stack-swift-clean
Hugging Face · Updated 2023-08-14 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/ammarnasr/the-stack-swift-clean
Resource description:
---
license: openrail
dataset_info:
  features:
  - name: hexsha
    dtype: string
  - name: size
    dtype: int64
  - name: content
    dtype: string
  - name: avg_line_length
    dtype: float64
  - name: max_line_length
    dtype: int64
  - name: alphanum_fraction
    dtype: float64
  splits:
  - name: train
    num_bytes: 3582248477.9086223
    num_examples: 806789
  - name: test
    num_bytes: 394048264.9973618
    num_examples: 88747
  - name: valid
    num_bytes: 3982797.09401595
    num_examples: 897
  download_size: 1323156008
  dataset_size: 3980279540
task_categories:
- text-generation
language:
- code
tags:
- code
pretty_name: TheStack-Swift
size_categories:
- 1M<n<10M
---
## Dataset 1: TheStack - Swift - Cleaned
**Description**: This dataset is drawn from TheStack Corpus, an open-source code dataset with over 3TB of GitHub data covering 48 programming languages. We selected a small portion of this dataset to optimize smaller language models for Swift, a popular statically typed language.
**Target Language**: Swift
**Dataset Size** (example counts from the `dataset_info` split metadata):
- Training: 806,789 files
- Validation: 897 files
- Test: 88,747 files
**Preprocessing**:
1. Selected Swift as the target language due to its popularity on GitHub.
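2. Filtered out files with an average line length above 100 characters, a maximum line length above 1,000 characters, or an alphanumeric character fraction below 25% (the `avg_line_length`, `max_line_length`, and `alphanum_fraction` fields).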
3. Split files into training, validation, and test sets (nominally 90% / 5% / 5%; the published splits hold roughly 90% / 0.1% / 9.9% of the files).
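
The filter in step 2 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' actual pipeline: the helper names `compute_stats` and `keep_file` are hypothetical, line lengths are assumed to be counted in characters, and the alphanumeric fraction is assumed to be taken over all characters in the file, whitespace included.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    avg_line_length: float
    max_line_length: int
    alphanum_fraction: float


def compute_stats(content: str) -> FileStats:
    """Compute the three per-file statistics used by the filter."""
    lines = content.splitlines() or [""]
    lengths = [len(line) for line in lines]
    alnum = sum(ch.isalnum() for ch in content)
    return FileStats(
        avg_line_length=sum(lengths) / len(lengths),
        max_line_length=max(lengths),
        # Assumption: fraction over all characters, whitespace included.
        alphanum_fraction=alnum / len(content) if content else 0.0,
    )


def keep_file(content: str,
              max_avg_len: float = 100.0,
              max_line_len: int = 1000,
              min_alnum: float = 0.25) -> bool:
    """Return True if the file passes all three thresholds."""
    stats = compute_stats(content)
    return (stats.avg_line_length <= max_avg_len
            and stats.max_line_length <= max_line_len
            and stats.alphanum_fraction >= min_alnum)
```

A file is dropped as soon as any one threshold is violated, so minified or generated sources (very long lines) and mostly-symbolic files (low alphanumeric fraction) are both excluded.
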
**Tokenizer**: Byte Pair Encoding (BPE) tokenizer with tab and whitespace tokens. GPT-2 vocabulary extended with special tokens.
**Training Sequences**: Sequences constructed by joining training data text to reach a context length of 2048 tokens (1024 tokens for full fine-tuning).
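
One common way to build such fixed-length sequences is to concatenate the tokenized files into a single stream and cut it into blocks of the context length. The sketch below assumes that approach; the function name `pack_sequences`, the use of a separator token between files, and dropping the final partial block are all assumptions rather than details confirmed by the dataset card. The default separator id 50256 is GPT-2's `<|endoftext|>` token.

```python
from typing import Iterable, Iterator, List


def pack_sequences(token_streams: Iterable[List[int]],
                   context_length: int = 2048,
                   eos_token_id: int = 50256) -> Iterator[List[int]]:
    """Concatenate tokenized files and yield fixed-length blocks."""
    buffer: List[int] = []
    for tokens in token_streams:
        buffer.extend(tokens)
        # Assumption: a separator token marks the boundary between files.
        buffer.append(eos_token_id)
        while len(buffer) >= context_length:
            yield buffer[:context_length]
            buffer = buffer[context_length:]
    # Assumption: a trailing partial block is dropped, not padded.


# Toy usage with a context length of 4 and separator id 0.
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = list(pack_sequences(docs, context_length=4, eos_token_id=0))
```

Passing `context_length=1024` would give the shorter sequences mentioned for full fine-tuning.
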
Provider: ammarnasr



