nyuuzyou/ms-codeplex-archive
Source: Hugging Face | Updated: 2026-01-30 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/nyuuzyou/ms-codeplex-archive
Resource description:
---
annotations_creators:
- machine-generated
language_creators:
- found
language:
- code
- en
license: other
multilinguality:
- multilingual
pretty_name: Microsoft CodePlex Archive Dataset
size_categories:
- 1M<n<10M
source_datasets:
- original
task_categories:
- text-generation
tags:
- code
- codeplex
- microsoft
- archive
configs:
- config_name: default
  data_files:
  - split: train
    path: "data/*.parquet"
  default: true
dataset_info:
  features:
  - name: code
    dtype: string
  - name: repo_name
    dtype: string
  - name: path
    dtype: string
  - name: language
    dtype: string
  - name: license
    dtype: string
  - name: size
    dtype: int64
---
# Microsoft CodePlex Archive Dataset
## Dataset Description
Source code from the [Microsoft CodePlex Archive](https://archive.org/details/sylirana_ms_codeplex_zips) on the Internet Archive. CodePlex was Microsoft's open-source project hosting service from 2006 to 2017, popular for .NET and Windows projects.
### Dataset Summary
| Statistic | Value |
|-----------|-------|
| **Total Files** | 5,043,730 |
| **Total Repositories** | 38,087 |
| **Total Size** | 3.6 GB (compressed Parquet) |
| **Programming Languages** | 91 |
| **File Format** | Parquet with Zstd compression (10 files) |
### Features
- 38K repositories from the 2006-2017 period
- 91 programming languages, dominated by C# and .NET
- Metadata: repository name, file path, language, license, file size
- Filtered to remove vendor code, build artifacts, and generated files
### Languages
Top 30 of 91 languages by file count:
| Rank | Language | File Count |
|------|----------|------------|
| 1 | C# | 2,671,389 |
| 2 | JavaScript | 511,289 |
| 3 | XML | 210,627 |
| 4 | HTML | 203,106 |
| 5 | C | 200,427 |
| 6 | CSS | 192,974 |
| 7 | HTML+Razor | 150,753 |
| 8 | C++ | 149,283 |
| 9 | ASP.NET | 130,469 |
| 10 | XAML | 114,609 |
| 11 | Visual Basic .NET | 102,835 |
| 12 | PHP | 82,775 |
| 13 | SQL | 77,737 |
| 14 | Java | 70,745 |
| 15 | INI | 22,056 |
| 16 | JSON | 14,507 |
| 17 | Less | 12,785 |
| 18 | Batchfile | 12,327 |
| 19 | Python | 10,957 |
| 20 | PowerShell | 8,446 |
| 21 | F# | 7,879 |
| 22 | Markdown | 7,707 |
| 23 | SCSS | 7,594 |
| 24 | Ruby | 7,120 |
| 25 | Objective-C | 6,573 |
| 26 | Swift | 5,666 |
| 27 | ActionScript | 4,712 |
| 28 | Java Server Pages | 4,347 |
| 29 | TypeScript | 4,044 |
| 30 | reStructuredText | 3,597 |
### Licenses
| License | File Count |
|---------|------------|
| Microsoft Public License (Ms-PL) | 1,361,702 |
| GNU General Public License version 2 (GPLv2) | 757,005 |
| Apache License 2.0 (Apache) | 749,347 |
| The MIT License (MIT) | 577,346 |
| Microsoft Reciprocal License (Ms-RL) | 298,578 |
| New BSD License (BSD) | 240,622 |
| GNU Library General Public License (LGPL) | 204,837 |
| GNU General Public License version 3 (GPLv3) | 200,838 |
| Common Development and Distribution License (CDDL) | 165,029 |
| GNU Lesser General Public License (LGPL) | 132,699 |
| Custom License | 115,605 |
| Mozilla Public License 2.0 (MPL-2.0) | 62,260 |
| Simplified BSD License (BSD) | 58,250 |
| Eclipse Public License (EPL) | 46,882 |
| Microsoft Permissive License (Ms-PL) v1.1 | 43,314 |
| Mozilla Public License 1.1 (MPL) | 26,129 |
| Microsoft Community License (Ms-CL) v1.1 | 3,287 |
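The counts in this table add up to the 5,043,730 total files reported in the Dataset Summary, which suggests that each file carries exactly one license label. A minimal sketch of that check, with the counts copied from the table; the helper name `license_shares` is illustrative only:

```python
# File counts copied from the license table above.
LICENSE_COUNTS = {
    "Microsoft Public License (Ms-PL)": 1_361_702,
    "GNU General Public License version 2 (GPLv2)": 757_005,
    "Apache License 2.0 (Apache)": 749_347,
    "The MIT License (MIT)": 577_346,
    "Microsoft Reciprocal License (Ms-RL)": 298_578,
    "New BSD License (BSD)": 240_622,
    "GNU Library General Public License (LGPL)": 204_837,
    "GNU General Public License version 3 (GPLv3)": 200_838,
    "Common Development and Distribution License (CDDL)": 165_029,
    "GNU Lesser General Public License (LGPL)": 132_699,
    "Custom License": 115_605,
    "Mozilla Public License 2.0 (MPL-2.0)": 62_260,
    "Simplified BSD License (BSD)": 58_250,
    "Eclipse Public License (EPL)": 46_882,
    "Microsoft Permissive License (Ms-PL) v1.1": 43_314,
    "Mozilla Public License 1.1 (MPL)": 26_129,
    "Microsoft Community License (Ms-CL) v1.1": 3_287,
}
TOTAL_FILES = 5_043_730  # "Total Files" from the Dataset Summary


def license_shares(counts):
    """Return each license's share of all files, as a percentage."""
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}


shares = license_shares(LICENSE_COUNTS)
# Print the three most common licenses with their share of files.
for name, pct in sorted(shares.items(), key=lambda kv: -kv[1])[:3]:
    print(f"{pct:5.1f}%  {name}")
```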
## Dataset Structure
### Data Fields
| Field | Type | Description |
|-------|------|-------------|
| `code` | string | File content (UTF-8) |
| `repo_name` | string | CodePlex project name |
| `path` | string | File path within repository |
| `language` | string | Programming language |
| `license` | string | Repository license |
| `size` | int64 | File size in bytes |
### Data Format
- **Format**: Apache Parquet with Zstd compression
- **File Structure**: 10 files (`codeplex_0000.parquet` to `codeplex_0009.parquet`)
### Data Splits
Train split only.
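
One way to read the train split without downloading all 10 Parquet files first is to stream it with the Hugging Face `datasets` library. The sketch below assumes the repository ID is `nyuuzyou/ms-codeplex-archive` (taken from the download link) and that `datasets` is installed; `stream_rows` and `take_language` are illustrative helper names, not part of the dataset.

```python
from itertools import islice


def stream_rows(repo_id="nyuuzyou/ms-codeplex-archive", split="train"):
    """Return a lazily streamed iterable of rows (dicts keyed by field name)."""
    # Imported here so the filtering helper below works without `datasets`.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True)


def take_language(rows, language, limit):
    """Return up to `limit` rows whose `language` field equals `language`."""
    matches = (row for row in rows if row["language"] == language)
    return list(islice(matches, limit))


# Example usage (requires network access):
#   for row in take_language(stream_rows(), "F#", 3):
#       print(row["repo_name"], row["path"], row["size"])
```

Because C# accounts for more than half of the files, filtering while streaming is usually cheaper than materializing the whole split when only a minority language is needed.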
### Example Data Point
```json
{
  "code": "using System;\nusing System.Collections.Generic;\nusing System.ComponentModel;\nusing System.Data;\nusing System.Drawing;\nusing System.Text;\nusing System.Windows.Forms;\n\nnamespace KeygenApp\n{\n public partial class KeygenForm : Form\n {\n public KeygenForm()\n {\n InitializeComponent();\n }\n }\n}",
  "repo_name": "2atgroup",
  "path": "TiffBrowserTestCSharp/Keygen/KeygenForm.cs",
  "language": "C#",
  "license": "Microsoft Reciprocal License (Ms-RL)",
  "size": 1259
}
```
## Dataset Creation
### Pipeline
1. Load project metadata from `zips.csv` (108,508 repositories)
2. Extract 122 tar archives from the Internet Archive `tars/` directory
3. Extract source code from `sourceCode/sourceCode.zip` within each project
4. Parse `license/license.json` for license metadata
5. Filter non-code files
6. Write to Parquet with Zstd compression
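
Steps 3 and 4 can be sketched with the standard library alone. This is not the author's pipeline code: it assumes a project has already been extracted from its tar archive into a directory containing `sourceCode/sourceCode.zip` and `license/license.json`, and `read_project` is a hypothetical helper name. The schema of `license.json` is not documented in this card, so the parsed object is returned unchanged.

```python
import json
import zipfile
from pathlib import Path


def read_project(project_dir):
    """Return (license_metadata, files) for one extracted project directory.

    `files` is a list of (path_inside_zip, raw_bytes) tuples. Either part is
    empty/None when the corresponding file is missing.
    """
    project_dir = Path(project_dir)

    # Step 4: parse the license metadata, tolerating a UTF-8 byte order mark.
    license_metadata = None
    license_path = project_dir / "license" / "license.json"
    if license_path.is_file():
        license_metadata = json.loads(license_path.read_text(encoding="utf-8-sig"))

    # Step 3: read every regular file out of the nested source archive.
    files = []
    zip_path = project_dir / "sourceCode" / "sourceCode.zip"
    if zip_path.is_file():
        with zipfile.ZipFile(zip_path) as archive:
            for info in archive.infolist():
                if info.is_dir():
                    continue
                files.append((info.filename, archive.read(info)))
    return license_metadata, files
```

The raw bytes returned here would then go through the filters described under File Filtering before anything is written to Parquet.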
### Language Detection
Languages are assigned by file extension.
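
Extension-based detection amounts to a table lookup on the file suffix. The mapping actually used to build the dataset is not published in this card, so the table below is a small hypothetical subset chosen for illustration, and `detect_language` is an illustrative name.

```python
from pathlib import PurePosixPath

# Hypothetical subset of an extension-to-language table (not the real one).
EXTENSION_TO_LANGUAGE = {
    ".cs": "C#",
    ".js": "JavaScript",
    ".xaml": "XAML",
    ".vb": "Visual Basic .NET",
    ".ps1": "PowerShell",
    ".fs": "F#",
}


def detect_language(path):
    """Return the language for `path` based on its extension, or None."""
    # Lowercasing the suffix is an assumption; the card does not say whether
    # the original lookup was case-sensitive.
    suffix = PurePosixPath(path).suffix.lower()
    return EXTENSION_TO_LANGUAGE.get(suffix)
```

A lookup like this cannot distinguish languages that share an extension, which is a known limitation of detecting by extension alone.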
### License Detection
Parsed from `license/license.json` in each project archive.
### File Filtering
#### Size Limits
| Limit | Value |
|-------|-------|
| Max single file size | 2 MB |
| Max line length | 1,000 characters (checked in the first 10 lines) |
#### Excluded Directories
- Config: `.git/`, `.github/`, `.gitlab/`, `.vscode/`, `.idea/`, `.vs/`, `.settings/`, `.eclipse/`, `.project/`, `.metadata/`
- Vendor: `node_modules/`, `bower_components/`, `jspm_packages/`, `vendor/`, `third_party/`, `3rdparty/`, `external/`, `packages/`, `deps/`, `lib/vendor/`, `target/dependency/`, `Pods/`
- Build: `build/`, `dist/`, `out/`, `bin/`, `target/`, `release/`, `debug/`, `.next/`, `.nuxt/`, `_site/`, `_build/`, `__pycache__/`, `.pytest_cache/`, `cmake-build-*`, `.gradle/`, `.maven/`, `obj/`
- CodePlex metadata: `discussions/`, `issues/`, `releases/`, `wiki/`, `wikiRender/`
- Tests: `test/`, `tests/`, `spec/`, `specs/`, `__tests__/`
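
A directory filter along these lines can be sketched as a check on each directory segment of a path. The card does not say whether names were matched at any depth or case-sensitively; the sketch assumes both, and `in_excluded_directory` is an illustrative name.

```python
from fnmatch import fnmatchcase

# Directory names from the lists above, with trailing slashes dropped.
# `lib/vendor/` and `target/dependency/` are omitted because, when matching
# at any depth, they are already covered by `vendor` and `target`.
EXCLUDED_DIR_NAMES = frozenset("""
    .git .github .gitlab .vscode .idea .vs .settings .eclipse .project .metadata
    node_modules bower_components jspm_packages vendor third_party 3rdparty
    external packages deps Pods
    build dist out bin target release debug .next .nuxt _site _build
    __pycache__ .pytest_cache .gradle .maven obj
    discussions issues releases wiki wikiRender
    test tests spec specs __tests__
""".split())

# Glob-style entries from the Build list.
EXCLUDED_DIR_GLOBS = ("cmake-build-*",)


def in_excluded_directory(path):
    """Return True if any directory segment of `path` is on the exclusion list."""
    parts = path.replace("\\", "/").split("/")
    for name in parts[:-1]:  # the last segment is the file name itself
        if name in EXCLUDED_DIR_NAMES:
            return True
        if any(fnmatchcase(name, pattern) for pattern in EXCLUDED_DIR_GLOBS):
            return True
    return False
```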
#### Excluded Files
- Lock files: `package-lock.json`, `yarn.lock`, `pnpm-lock.yaml`, `Gemfile.lock`, `Cargo.lock`, `poetry.lock`, `Pipfile.lock`, `composer.lock`, `go.sum`, `mix.lock`, `packages.lock.json`
- Minified: files containing `.min.`
- Hidden: files starting with `.` (except `.htaccess`)
- Binary: `.exe`, `.dll`, `.so`, `.dylib`, `.a`, `.lib`, `.o`, `.obj`, `.jar`, `.war`, `.ear`, `.class`, `.pyc`, `.pyo`, `.wasm`, `.bin`, `.dat`, `.pdf`, `.doc`, `.docx`, `.xls`, `.xlsx`, `.ppt`, `.pptx`, `.zip`, `.tar`, `.gz`, `.bz2`, `.7z`, `.rar`, `.jpg`, `.jpeg`, `.png`, `.gif`, `.bmp`, `.ico`, `.svg`, `.mp3`, `.mp4`, `.avi`, `.mov`, `.wav`, `.flac`, `.ttf`, `.otf`, `.woff`, `.woff2`, `.eot`, `.pdb`, `.nupkg`, `.snk`
- System: `.DS_Store`, `thumbs.db`
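
The file-name rules above can be combined into a single predicate. This is a sketch rather than the original filter: matching extensions and system file names case-insensitively is an assumption, and `is_excluded_file` is an illustrative name.

```python
from pathlib import PurePosixPath

LOCK_FILES = frozenset({
    "package-lock.json", "yarn.lock", "pnpm-lock.yaml", "Gemfile.lock",
    "Cargo.lock", "poetry.lock", "Pipfile.lock", "composer.lock",
    "go.sum", "mix.lock", "packages.lock.json",
})

# Stored lowercased; compared against the lowercased file name (assumption).
SYSTEM_FILES = frozenset({".ds_store", "thumbs.db"})

BINARY_EXTENSIONS = frozenset("""
    .exe .dll .so .dylib .a .lib .o .obj .jar .war .ear .class .pyc .pyo
    .wasm .bin .dat .pdf .doc .docx .xls .xlsx .ppt .pptx .zip .tar .gz
    .bz2 .7z .rar .jpg .jpeg .png .gif .bmp .ico .svg .mp3 .mp4 .avi .mov
    .wav .flac .ttf .otf .woff .woff2 .eot .pdb .nupkg .snk
""".split())


def is_excluded_file(path):
    """Return True if the file name matches any of the exclusion rules."""
    name = PurePosixPath(path).name
    if name in LOCK_FILES or name.lower() in SYSTEM_FILES:
        return True
    if ".min." in name:  # minified assets such as jquery.min.js
        return True
    if name.startswith(".") and name != ".htaccess":  # hidden files
        return True
    return PurePosixPath(name).suffix.lower() in BINARY_EXTENSIONS
```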
#### Content Filtering
- **UTF-8 Validation**: must be valid UTF-8
- **Binary Detection**: no null bytes and fewer than 30% non-printable characters in the first 8 KB
- **Generated Files**: excluded if the first 500 bytes contain `generated by`, `do not edit`, `auto-generated`, `@generated`, etc.
- **Empty Files**: excluded
- **Long Lines**: excluded if any of the first 10 lines is longer than 1,000 characters
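
The content checks can be sketched as one function over a file's raw bytes. Several details are assumptions, since the card leaves them open: null bytes are checked in the first 8 KB only, "non-printable" means ASCII control bytes other than tab, line feed, and carriage return, marker matching is case-insensitive, and only the four markers named above are used. `passes_content_filter` is an illustrative name.

```python
HEAD_BYTES = 8 * 1024          # window for binary detection
GENERATED_WINDOW = 500         # window for generated-file markers
MAX_LINE_CHARS = 1000
LINES_CHECKED = 10
GENERATED_MARKERS = (b"generated by", b"do not edit", b"auto-generated", b"@generated")
_TEXT_CONTROL = {9, 10, 13}    # tab, LF, CR count as printable here (assumption)


def passes_content_filter(data: bytes) -> bool:
    """Return True if `data` survives all content checks sketched here."""
    if not data:
        return False  # empty file
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False  # not valid UTF-8

    head = data[:HEAD_BYTES]
    if b"\x00" in head:
        return False  # null byte suggests binary content
    non_printable = sum(
        1 for b in head if (b < 32 and b not in _TEXT_CONTROL) or b == 127
    )
    if non_printable / len(head) >= 0.30:
        return False

    start = data[:GENERATED_WINDOW].lower()
    if any(marker in start for marker in GENERATED_MARKERS):
        return False  # looks machine-generated

    # Only the first few lines are inspected for excessive length.
    first_lines = text.splitlines()[:LINES_CHECKED]
    return all(len(line) <= MAX_LINE_CHARS for line in first_lines)
```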
### Source Data
The source is the [Microsoft CodePlex Archive](https://archive.org/details/sylirana_ms_codeplex_zips) on the Internet Archive, specifically its `tars/` directory (122 tar archives). Of the 108,516 repositories listed, 108,508 were accessible.
### Archive Structure
Each CodePlex project archive contains:
- `sourceCode/sourceCode.zip` - source code (extracted)
- `license/license.json` - license metadata (parsed)
- `discussions/`, `issues/`, `releases/`, `wiki/` - excluded
## Considerations
### Sensitive Information
Files may contain email addresses, accidentally committed credentials, or personal information in comments. Filter the data accordingly before use.
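
As one example of such filtering, email addresses can be masked with a regular expression. The pattern below is a deliberately simple approximation, not a complete email grammar, and it does nothing about credentials or other personal data; `redact_emails` and the `<EMAIL>` placeholder are illustrative choices.

```python
import re

# Simple approximation of an email address; not RFC-complete.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(code, placeholder="<EMAIL>"):
    """Replace anything that looks like an email address with `placeholder`."""
    return EMAIL_RE.sub(placeholder, code)
```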
### Licensing
Use must comply with the original repository licenses. The `license` field indicates each file's source license.
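
Since every row carries a `license` value, a subset can be selected by license name. The set below is only an illustration using labels from the Licenses table; deciding which licenses are suitable for a given use is left to the user, and `filter_by_license` is an illustrative name.

```python
# Illustrative selection; the strings match labels in the Licenses table.
KEEP_LICENSES = frozenset({
    "The MIT License (MIT)",
    "Apache License 2.0 (Apache)",
    "New BSD License (BSD)",
    "Simplified BSD License (BSD)",
})


def filter_by_license(rows, keep=KEEP_LICENSES):
    """Yield only the rows whose `license` field is in `keep`."""
    for row in rows:
        if row["license"] in keep:
            yield row
```

Because the function accepts any iterable of row dictionaries, it works equally on a streamed split and on rows loaded into memory.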
Provider: nyuuzyou