nyuuzyou/ms-codeplex-archive

Source: Hugging Face. Updated 2026-01-30; indexed 2026-03-29.
Download link: https://hf-mirror.com/datasets/nyuuzyou/ms-codeplex-archive
Resource description:
---
annotations_creators:
- machine-generated
language_creators:
- found
language:
- code
- en
license: other
multilinguality:
- multilingual
pretty_name: Microsoft CodePlex Archive Dataset
size_categories:
- 1M<n<10M
source_datasets:
- original
task_categories:
- text-generation
tags:
- code
- codeplex
- microsoft
- archive
configs:
- config_name: default
  data_files:
  - split: train
    path: "data/*.parquet"
  default: true
dataset_info:
  features:
  - name: code
    dtype: string
  - name: repo_name
    dtype: string
  - name: path
    dtype: string
  - name: language
    dtype: string
  - name: license
    dtype: string
  - name: size
    dtype: int64
---

# Microsoft CodePlex Archive Dataset

## Dataset Description

Source code from the [Microsoft CodePlex Archive](https://archive.org/details/sylirana_ms_codeplex_zips) on the Internet Archive. CodePlex was Microsoft's open-source project hosting service from 2006 to 2017, popular for .NET and Windows projects.

### Dataset Summary

| Statistic | Value |
|-----------|-------|
| **Total Files** | 5,043,730 |
| **Total Repositories** | 38,087 |
| **Total Size** | 3.6 GB (compressed Parquet) |
| **Programming Languages** | 91 |
| **File Format** | Parquet with Zstd compression (10 files) |

### Features

- 38K repositories from the 2006-2017 period
- 91 programming languages, dominated by C# and .NET
- Metadata: repository name, file path, language, license, file size
- Filtered to remove vendor code, build artifacts, and generated files

### Languages

Top 30 of 91 languages by file count:

| Rank | Language | File Count |
|------|----------|------------|
| 1 | C# | 2,671,389 |
| 2 | JavaScript | 511,289 |
| 3 | XML | 210,627 |
| 4 | HTML | 203,106 |
| 5 | C | 200,427 |
| 6 | CSS | 192,974 |
| 7 | HTML+Razor | 150,753 |
| 8 | C++ | 149,283 |
| 9 | ASP.NET | 130,469 |
| 10 | XAML | 114,609 |
| 11 | Visual Basic .NET | 102,835 |
| 12 | PHP | 82,775 |
| 13 | SQL | 77,737 |
| 14 | Java | 70,745 |
| 15 | INI | 22,056 |
| 16 | JSON | 14,507 |
| 17 | Less | 12,785 |
| 18 | Batchfile | 12,327 |
| 19 | Python | 10,957 |
| 20 | PowerShell | 8,446 |
| 21 | F# | 7,879 |
| 22 | Markdown | 7,707 |
| 23 | SCSS | 7,594 |
| 24 | Ruby | 7,120 |
| 25 | Objective-C | 6,573 |
| 26 | Swift | 5,666 |
| 27 | ActionScript | 4,712 |
| 28 | Java Server Pages | 4,347 |
| 29 | TypeScript | 4,044 |
| 30 | reStructuredText | 3,597 |

### Licenses

| License | File Count |
|---------|------------|
| Microsoft Public License (Ms-PL) | 1,361,702 |
| GNU General Public License version 2 (GPLv2) | 757,005 |
| Apache License 2.0 (Apache) | 749,347 |
| The MIT License (MIT) | 577,346 |
| Microsoft Reciprocal License (Ms-RL) | 298,578 |
| New BSD License (BSD) | 240,622 |
| GNU Library General Public License (LGPL) | 204,837 |
| GNU General Public License version 3 (GPLv3) | 200,838 |
| Common Development and Distribution License (CDDL) | 165,029 |
| GNU Lesser General Public License (LGPL) | 132,699 |
| Custom License | 115,605 |
| Mozilla Public License 2.0 (MPL-2.0) | 62,260 |
| Simplified BSD License (BSD) | 58,250 |
| Eclipse Public License (EPL) | 46,882 |
| Microsoft Permissive License (Ms-PL) v1.1 | 43,314 |
| Mozilla Public License 1.1 (MPL) | 26,129 |
| Microsoft Community License (Ms-CL) v1.1 | 3,287 |

## Dataset Structure

### Data Fields

| Field | Type | Description |
|-------|------|-------------|
| `code` | string | File content (UTF-8) |
| `repo_name` | string | CodePlex project name |
| `path` | string | File path within repository |
| `language` | string | Programming language |
| `license` | string | Repository license |
| `size` | int64 | File size in bytes |

### Data Format

- **Format**: Apache Parquet with Zstd compression
- **File Structure**: 10 files (`codeplex_0000.parquet` to `codeplex_0009.parquet`)

### Data Splits

Train split only.

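Because every row carries `language` and `license` strings, a subset can be selected with a plain generator. The sketch below is illustrative only: the function name `select_rows`, the `PERMISSIVE` set (which licenses count as "permissive" is an assumption made here, not something the dataset defines), and the two hypothetical sample rows are not part of the dataset. The license strings themselves are taken from the Licenses table.

```python
from typing import Dict, Iterable, Iterator, Set

# Assumption: this grouping of licenses is an illustrative choice, not a
# classification provided by the dataset. Strings match the Licenses table.
PERMISSIVE: Set[str] = {
    "The MIT License (MIT)",
    "Apache License 2.0 (Apache)",
    "New BSD License (BSD)",
    "Simplified BSD License (BSD)",
}


def select_rows(
    rows: Iterable[Dict], language: str, licenses: Set[str] = PERMISSIVE
) -> Iterator[Dict]:
    """Yield rows whose `language` matches and whose `license` is in `licenses`."""
    for row in rows:
        if row["language"] == language and row["license"] in licenses:
            yield row


# Small in-memory stand-in for the train split. The first row mirrors the
# card's example data point (code abbreviated); the other two are hypothetical.
sample_rows = [
    {"code": "using System;", "repo_name": "2atgroup",
     "path": "TiffBrowserTestCSharp/Keygen/KeygenForm.cs", "language": "C#",
     "license": "Microsoft Reciprocal License (Ms-RL)", "size": 1259},
    {"code": "class A {}", "repo_name": "example-mit-project",
     "path": "src/A.cs", "language": "C#",
     "license": "The MIT License (MIT)", "size": 10},
    {"code": "var a = 1;", "repo_name": "example-js-project",
     "path": "src/a.js", "language": "JavaScript",
     "license": "The MIT License (MIT)", "size": 10},
]

selected = list(select_rows(sample_rows, "C#"))
print([row["repo_name"] for row in selected])
```

Assuming the Hugging Face `datasets` library is installed, the same generator should work on rows streamed with `load_dataset("nyuuzyou/ms-codeplex-archive", split="train", streaming=True)`, since each streamed row is a dict keyed by the fields above.
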
### Example Data Point

```json
{
  "code": "using System;\nusing System.Collections.Generic;\nusing System.ComponentModel;\nusing System.Data;\nusing System.Drawing;\nusing System.Text;\nusing System.Windows.Forms;\n\nnamespace KeygenApp\n{\n public partial class KeygenForm : Form\n {\n public KeygenForm()\n {\n InitializeComponent();\n }\n }\n}",
  "repo_name": "2atgroup",
  "path": "TiffBrowserTestCSharp/Keygen/KeygenForm.cs",
  "language": "C#",
  "license": "Microsoft Reciprocal License (Ms-RL)",
  "size": 1259
}
```

## Dataset Creation

### Pipeline

1. Load project metadata from `zips.csv` (108,508 repositories)
2. Extract 122 tar archives from the Internet Archive `tars/` directory
3. Extract source code from `sourceCode/sourceCode.zip` within each project
4. Parse `license/license.json` for license metadata
5. Filter non-code files
6. Write to Parquet with Zstd compression

### Language Detection

By file extension.

### License Detection

Parsed from `license/license.json` in each project archive.

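The card states only that languages are detected by file extension and does not publish the mapping. As a rough sketch of that idea, the lookup below uses a small, hypothetical extension table; `EXTENSION_TO_LANGUAGE` and `detect_language` are names invented here, and the real pipeline's table is presumably much larger and may differ in detail.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical, partial mapping for illustration; the dataset's actual
# extension table is not given in this card.
EXTENSION_TO_LANGUAGE = {
    ".cs": "C#",
    ".js": "JavaScript",
    ".xml": "XML",
    ".html": "HTML",
    ".c": "C",
    ".css": "CSS",
    ".cshtml": "HTML+Razor",
    ".cpp": "C++",
    ".xaml": "XAML",
    ".vb": "Visual Basic .NET",
    ".py": "Python",
    ".ps1": "PowerShell",
    ".fs": "F#",
}


def detect_language(path: str) -> Optional[str]:
    """Return a language label for `path` based on its extension, or None."""
    # Lower-casing is an assumption: it treats `.CS` and `.cs` the same.
    suffix = PurePosixPath(path).suffix.lower()
    return EXTENSION_TO_LANGUAGE.get(suffix)


print(detect_language("TiffBrowserTestCSharp/Keygen/KeygenForm.cs"))
```

Files with no extension, or with one missing from the table, return `None` in this sketch; how the actual pipeline treats them is not stated.
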
### File Filtering

#### Size Limits

| Limit | Value |
|-------|-------|
| Max single file size | 2 MB |
| Max line length | 1,000 characters |

#### Excluded Directories

- Config: `.git/`, `.github/`, `.gitlab/`, `.vscode/`, `.idea/`, `.vs/`, `.settings/`, `.eclipse/`, `.project/`, `.metadata/`
- Vendor: `node_modules/`, `bower_components/`, `jspm_packages/`, `vendor/`, `third_party/`, `3rdparty/`, `external/`, `packages/`, `deps/`, `lib/vendor/`, `target/dependency/`, `Pods/`
- Build: `build/`, `dist/`, `out/`, `bin/`, `target/`, `release/`, `debug/`, `.next/`, `.nuxt/`, `_site/`, `_build/`, `__pycache__/`, `.pytest_cache/`, `cmake-build-*`, `.gradle/`, `.maven/`, `obj/`
- CodePlex metadata: `discussions/`, `issues/`, `releases/`, `wiki/`, `wikiRender/`
- Tests: `test/`, `tests/`, `spec/`, `specs/`, `__tests__/`

#### Excluded Files

- Lock files: `package-lock.json`, `yarn.lock`, `pnpm-lock.yaml`, `Gemfile.lock`, `Cargo.lock`, `poetry.lock`, `Pipfile.lock`, `composer.lock`, `go.sum`, `mix.lock`, `packages.lock.json`
- Minified: files containing `.min.`
- Hidden: files starting with `.` (except `.htaccess`)
- Binary: `.exe`, `.dll`, `.so`, `.dylib`, `.a`, `.lib`, `.o`, `.obj`, `.jar`, `.war`, `.ear`, `.class`, `.pyc`, `.pyo`, `.wasm`, `.bin`, `.dat`, `.pdf`, `.doc`, `.docx`, `.xls`, `.xlsx`, `.ppt`, `.pptx`, `.zip`, `.tar`, `.gz`, `.bz2`, `.7z`, `.rar`, `.jpg`, `.jpeg`, `.png`, `.gif`, `.bmp`, `.ico`, `.svg`, `.mp3`, `.mp4`, `.avi`, `.mov`, `.wav`, `.flac`, `.ttf`, `.otf`, `.woff`, `.woff2`, `.eot`, `.pdb`, `.nupkg`, `.snk`
- System: `.DS_Store`, `thumbs.db`

#### Content Filtering

- **UTF-8 Validation**: must be valid UTF-8
- **Binary Detection**: no null bytes, <30% non-printable characters in first 8KB
- **Generated Files**: excluded if first 500 bytes contain `generated by`, `do not edit`, `auto-generated`, `@generated`, etc.
- **Empty Files**: excluded
- **Long Lines**: excluded if any line >1,000 characters in first 10 lines

### Source Data

[Microsoft CodePlex Archive](https://archive.org/details/sylirana_ms_codeplex_zips) on Internet Archive, `tars/` directory (122 tar archives). 108,508 repositories accessible out of 108,516 listed.

### Archive Structure

Each CodePlex project archive contains:

- `sourceCode/sourceCode.zip` - source code (extracted)
- `license/license.json` - license metadata (parsed)
- `discussions/`, `issues/`, `releases/`, `wiki/` - excluded

## Considerations

### Sensitive Information

May contain email addresses, accidentally committed credentials, or personal information in comments. Filter accordingly.

### Licensing

Use must comply with the original repository licenses. The `license` field indicates each file's source license.

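To make the Size Limits and Content Filtering rules listed under File Filtering concrete, here is a minimal sketch of how they could be combined into one predicate. It is not the pipeline's actual code. The function name `passes_content_filter`, the reading of 2 MB as 2 × 1024 × 1024 bytes, the definition of "non-printable" (ASCII control bytes other than tab, LF, and CR), the case-insensitive marker match, and limiting the null-byte check to the first 8 KB are all assumptions. Only the four markers named in the card are included, since the rest of the list is elided.

```python
# Markers named in the card; the card's list ends with "etc.", so this
# tuple is deliberately incomplete.
GENERATED_MARKERS = (b"generated by", b"do not edit", b"auto-generated", b"@generated")
MAX_FILE_BYTES = 2 * 1024 * 1024   # assumption: "2 MB" read as 2 MiB
MAX_LINE_CHARS = 1000
SNIFF_BYTES = 8192                 # "first 8KB"
ALLOWED_CONTROL = {9, 10, 13}      # tab, LF, CR are treated as printable


def passes_content_filter(raw: bytes) -> bool:
    """Return True if `raw` survives the size and content rules sketched here."""
    if not raw or len(raw) > MAX_FILE_BYTES:
        return False  # empty or oversized file
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return False  # not valid UTF-8

    head = raw[:SNIFF_BYTES]
    if b"\x00" in head:
        return False  # null byte suggests binary content
    non_printable = sum(
        1 for b in head if (b < 32 and b not in ALLOWED_CONTROL) or b == 127
    )
    if non_printable / len(head) >= 0.30:
        return False

    # Case-insensitive match is an assumption; the card does not say.
    prefix = raw[:500].lower()
    if any(marker in prefix for marker in GENERATED_MARKERS):
        return False

    # Only the first 10 lines are inspected for over-long lines.
    return all(len(line) <= MAX_LINE_CHARS for line in text.splitlines()[:10])


print(passes_content_filter(b"using System;\n"))
```

Cheap checks (size, decoding) run first so that most rejected files never reach the per-line scan. The actual pipeline may apply the rules in a different order.
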
Provider: nyuuzyou