
mdonigian/starcoder-curated

Source: Hugging Face. Updated 2026-02-20; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/mdonigian/starcoder-curated
Resource description:
---
license: odc-by
task_categories:
- text-generation
language:
- code
size_categories:
- 1B<n<10B
tags:
- curated
- starcoderdata
- code
- structured-data
- multi-task-filter
---

# StarCoderData Curated

A curated subset of [StarCoderData](https://huggingface.co/datasets/bigcode/starcoderdata) optimised for training a 500M parameter model focused on structured data output (JSON generation, function calling, schema compliance).

## Dataset Summary

- **Total code files:** 5,203,508
- **Total tokens:** 3.9B (target: 3.5B)
- **Classifier-scored files:** 1,553,596 (1.7B tokens)
- **Non-classified files:** 3,649,912 (2.2B tokens), filtered by heuristics, not the classifier
- **Source:** bigcode/starcoderdata
- **Classifier:** [mdonigian/code-curator-v1](https://huggingface.co/mdonigian/code-curator-v1) (UniXcoder-base, multi-task)
- **Curation:** Per-language-slice filtering + compression ratio pre-filter + MinHash deduplication

## Filtering Strategy

Different language groups need different curation approaches. Not every slice goes through the GPU classifier: schema languages and GitHub issues are filtered with cheaper heuristics, because the classifier was trained on general-purpose code and isn't the right tool for inherently structured formats.

**All slices** share these pre-filters:

- zlib compression ratio < 0.10 (catches extreme repetition)
- MinHash LSH deduplication (128 perms, 5-line shingles, 0.7 Jaccard threshold)

### Classifier-Scored Slices (relevance_filter)

These languages were scored by the multi-task classifier. Files were ranked by structured data relevance and filtered to keep only those with relevance ≥ 2.0 and quality ≥ 1.5, then sampled down to the per-slice token budget:

- **TypeScript**: ~600M tokens; strong type system, filter by SD relevance ≥ 2
- **Python**: ~600M tokens; filter by SD relevance ≥ 2
- **Rust/Go/Java**: ~600M tokens; strongly typed, filter by SD relevance ≥ 2

### Non-Classified Slices

These languages were **not** run through the classifier. Their `quality`, `structured_data`, and `content_type` columns contain default placeholder values (0.0 / "unclassified") and should be ignored:

- **Schema languages** (JSON/YAML/SQL/protobuf/thrift/XSLT): ~800M tokens; inherently structured data formats; quality floor + random sample to budget
- **GitHub Issues** (technical): ~500M tokens; keyword filter matching structured-data topics (JSON, schema, API, protobuf, gRPC, etc.)
- **General code** (78 other languages): ~1B tokens; random sample for language diversity; quality floor only

## Language Slice Distribution

| Slice | Strategy | Languages | Target | Actual | % of Target |
|-------|----------|-----------|--------|--------|-------------|
| schema_languages | light_filter | json, yaml, sql, protocol-buffer +2 more | 800M | 799M | 99.9% |
| typescript | relevance_filter | typescript | 600M | 598M | 99.7% |
| python | relevance_filter | python | 600M | 594M | 99.1% |
| rust_go_java | relevance_filter | rust, go, java | 600M | 485M | 80.8% |
| github_issues | keyword_filter | github-issues-filtered-structured | 500M | 426M | 85.2% |
| general_code | light_filter | ada, agda, alloy, antlr +74 more | 1000M | 999M | 99.9% |

## Classifier-Scored Slices: Detail

The quality and structured data scores below apply **only** to the 1,553,596 files (1.7B tokens) that went through the classifier. Non-classified slices are excluded from these statistics.

| Slice | Files | Tokens | Avg Quality | Avg SD Relevance |
|-------|-------|--------|-------------|------------------|
| typescript | 841,426 | 598M | 3.81 | 2.88 |
| python | 567,721 | 594M | 3.71 | 2.73 |
| rust_go_java | 144,438 | 485M | 3.97 | 3.07 |

### Content Group Distribution (classifier-scored files only)

| Group | % of Classified Tokens | Tokens | Files |
|-------|-----------------------|--------|-------|
| Library/Package | 64.3% | 1,079,381,502 | 1,075,672 |
| Application | 3.4% | 56,360,750 | 118,200 |
| Script/CLI | 1.1% | 17,853,662 | 24,871 |
| Test Code | 5.5% | 91,370,763 | 48,003 |
| Config/Data/Generated/Other | 25.8% | 432,393,146 | 286,850 |

### Structured Data Relevance (classifier-scored files only)

Structured data relevance is the strongest classifier signal (Spearman 0.81 on held-out data). SD2+ files contain significant structured data patterns (API endpoints, JSON parsing, schema definitions, etc.). Quality mean: 3.79, median: 3.88.

| Level | Range | Target % | Actual % | Files |
|-------|-------|----------|----------|-------|
| SD0 | [0.0, 0.5) | 10.0% | 0.0% | 0 |
| SD1 | [0.5, 1.5) | 20.0% | 0.0% | 0 |
| SD2 | [1.5, 2.5) | 35.0% | 3.2% | 49,213 |
| SD3 | [2.5, 3.5) | 35.0% | 96.8% | 1,504,383 |

### Quality Distribution (classifier-scored files only)

| Level | Description | Files |
|-------|-------------|-------|
| 1 | Broken/gibberish | 0 |
| 2 | Functional but poor | 42,668 |
| 3 | Decent | 129,945 |
| 4 | Good | 1,380,674 |
| 5 | Excellent | 309 |

## Non-Classified Slices: Detail

These slices were filtered using heuristics. The classifier columns (`quality`, `structured_data`, `content_type`) are set to defaults and **do not reflect actual code quality**; the filtering was done by other means:

| Slice | Strategy | Files | Tokens | How Filtered |
|-------|----------|-------|--------|-------------|
| schema_languages | light_filter | 2,203,233 | 799M | Quality floor (≥1.5) + token budget, randomly sampled |
| github_issues | keyword_filter | 485,384 | 426M | Keyword match for structured-data topics + quality floor |
| general_code | light_filter | 961,295 | 999M | Quality floor (≥1.5) + token budget, randomly sampled |

## Programming Languages

| Language | % Tokens | Files |
|----------|----------|-------|
| typescript | 15.3% | 841,426 |
| python | 15.2% | 567,721 |
| github-issues-filtered-structured | 10.9% | 485,384 |
| markdown | 8.9% | 351,728 |
| json | 8.7% | 1,124,326 |
| go | 8.5% | 73,899 |
| sql | 5.9% | 121,035 |
| javascript | 5.8% | 281,216 |
| yaml | 5.8% | 957,872 |
| java | 3.2% | 57,787 |
| c-sharp | 3.0% | 114,063 |
| html | 2.9% | 53,527 |
| c | 2.8% | 75,899 |
| haskell | 2.2% | 84,862 |
| rust | 0.7% | 12,752 |

## Token Count Distribution

| Percentile | Tokens |
|------------|--------|
| P10 | 55 |
| P25 | 111 |
| P50 (median) | 255 |
| P75 | 631 |
| P90 | 1,416 |
| Mean | 749 |

## Schema

Each row contains:

| Field | Type | Description |
|-------|------|-------------|
| `content` | string | Source code text |
| `lang` | string | Programming language |
| `size` | int | File size in bytes |
| `token_count` | int | Estimated token count (size // 4) |
| `quality` | float | Code quality score 1-5 (**classifier-scored slices only**; 0.0 for non-classified) |
| `structured_data` | float | Structured data relevance 0-3 (**classifier-scored slices only**; 0.0 for non-classified) |
| `content_type` | string | Content type, 9 classes (**classifier-scored slices only**; "unclassified" for non-classified) |
| `language_slice` | string | Language slice name (use this to filter by curation strategy) |
| `relevance_score` | float | Composite relevance score (**classifier-scored slices only**; 0.0 for non-classified) |

> **Tip:** To work with only classifier-scored data, filter on
> `language_slice` in `{"typescript", "python", "rust_go_java"}`.

## Methodology

1. **Download:** All language folders from `bigcode/starcoderdata`.
2. **Classification:** Multi-task UniXcoder-base model (3 heads: quality, SD relevance, content type) runs on TypeScript, Python, Rust, Go, and Java. Schema languages, GitHub issues, and general code skip this step.
3. **Pre-filtering:** zlib compression ratio filter removes repetitive boilerplate before GPU inference.
4. **Filtering:** Per-slice strategy: relevance-based ranking for classified languages, keyword matching for GitHub issues, random sampling for schema/general code. All slices enforce a quality floor.
5. **Deduplication:** MinHash LSH (128 perms, 5-line shingles, 0.7 Jaccard threshold). Highest-relevance file kept from each cluster.
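
## Example: Row-Level Rules (illustrative sketch)

The card does not include the curation code, so the following is only a minimal sketch of three row-level rules it describes: the zlib compression ratio pre-filter, the `size // 4` token estimate, and the `language_slice` filter from the tip. All function and constant names are hypothetical. The ratio is assumed here to mean compressed bytes divided by raw bytes, with files below 0.10 dropped; the card does not spell out the exact definition, and treating empty files as failing is a further assumption.

```python
import zlib

# Slice names taken from the card's tip; the constant name is hypothetical.
CLASSIFIER_SLICES = {"typescript", "python", "rust_go_java"}

# Threshold from the card; reading it as "drop files below 0.10" is an assumption.
MIN_COMPRESSION_RATIO = 0.10


def compression_ratio(text: str) -> float:
    """Return compressed size / raw size (assumed definition of the ratio)."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0  # assumption: empty files are treated as failing the filter
    return len(zlib.compress(raw)) / len(raw)


def passes_compression_prefilter(text: str, threshold: float = MIN_COMPRESSION_RATIO) -> bool:
    """Keep a file unless it compresses so well that it is likely extreme repetition."""
    return compression_ratio(text) >= threshold


def estimate_token_count(size_in_bytes: int) -> int:
    """Token estimate as described in the schema table: size // 4."""
    return size_in_bytes // 4


def is_classifier_scored(row: dict) -> bool:
    """True when the row's classifier columns hold real scores, not placeholders."""
    return row.get("language_slice") in CLASSIFIER_SLICES
```

Applied to a row dictionary, `is_classifier_scored(row)` tells you whether `quality`, `structured_data`, and `content_type` are meaningful for that row; for the other slices those columns should be ignored, as noted above.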
Provider:
mdonigian