
ronantakizawa/github-codereview

Hugging Face · Updated 2026-03-10 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/ronantakizawa/github-codereview
Resource description:
---
license: other
task_categories:
- text-generation
language:
- en
- code
tags:
- code-review
- code-generation
- software-engineering
- pull-requests
- github
size_categories:
- 100K<n<1M
---

# Code Review Dataset

A large-scale dataset of the best human-written code reviews from top GitHub repositories.

Each row captures a moment where a human code reviewer left an inline comment on a pull request, and the author subsequently modified the code in response. The dataset also includes **negative examples** (code from the same PRs that passed review without comments) to help models learn when code is acceptable.

This provides a natural signal for training models to:

- **Generate code review comments** given a code diff
- **Apply review feedback** by modifying code based on reviewer suggestions
- **Understand code quality patterns** across languages and projects
- **Know when not to comment**: recognizing clean code that needs no changes

### Key Features

- **167K+ positive triplets** from 725 top GitHub repositories
- **51K+ negative examples** (~23% of dataset) of clean code labeled "No issues found."
- **37 programming languages** (Python, TypeScript, Go, Rust, C++, JavaScript, C#, Java, Kotlin, Swift, and more)
- **Human-only reviews**: AI/bot reviewers (Copilot, linter bots, etc.) are excluded
- **Quality-filtered**: noise and auto-generated content removed
- **Chunk-focused**: ~50 lines of context around the reviewed code, not entire files
- **Permissive licenses only**: all source repos use MIT, Apache-2.0, BSD, or similar licenses
- **Verified changes**: only includes triplets where the code chunk actually changed after the review

## Collection Methodology

1. **Repo selection**: Top GitHub repos by stars with permissive licenses, sourced from [ronantakizawa/github-top-projects](https://huggingface.co/datasets/ronantakizawa/github-top-projects) and curated additions
2. **PR discovery**: Paginate merged PRs, filter bot authors, fetch inline review comments
3. **Comment filtering**: Remove bots, noise patterns, auto-generated comments, non-English text, non-code files, reply comments
4. **Triplet extraction**: Fetch file contents at the review commit (before) and PR head (after), extract focused chunks around the comment line
5. **Change verification**: Only keep triplets where the code chunk around the comment actually changed
6. **Negative extraction**: For each reviewed PR, identify source code files that were changed but received no review comments; extract a ~50-line chunk as a negative example labeled "No issues found."

## Splits

| Split | Percentage | Description |
|-------|-----------|-------------|
| train | 90% | Training data |
| test | 5% | Test data |
| validation | 5% | Validation data |

Splits are deterministic by repository: all examples from the same repo appear in the same split.

## Schema

| Column | Type | Description |
|--------|------|-------------|
| `pr_title` | string | Pull request title |
| `pr_number` | int | PR number |
| `repo_name` | string | Full repo name (owner/repo) |
| `repo_stars` | int | GitHub stars |
| `repo_language` | string | Primary repo language |
| `author_username` | string | PR author's GitHub username |
| `reviewer_username` | string | Reviewer's GitHub username |
| `before_code` | string | ~50 lines of code around the comment, before the fix |
| `reviewer_comment` | string | The inline review comment text (or "No issues found." for negatives) |
| `after_code` | string | ~50 lines of code around the comment, after the fix |
| `diff_context` | string | The PR diff hunk where the comment was placed |
| `file_path` | string | File path within the repo |
| `comment_line` | int | Line number within the code chunk (0 for negatives) |
| `language` | string | Programming language |
| `quality_score` | float | Comment quality score (0.0-1.0; 1.0 for negatives) |
| `comment_type` | string | Category: suggestion, question, nitpick, bug, refactor, style, security, performance, none |
| `comment_length` | int | Character count of reviewer comment |
| `before_lines` | int | Line count of before code |
| `after_lines` | int | Line count of after code |
| `is_negative` | bool | True if this is a negative example (no reviewer comment) |

## Usage

```python
from datasets import load_dataset

ds = load_dataset("ronantakizawa/github-codereview")

# Get a training example
example = ds["train"][0]
print(f"Review comment: {example['reviewer_comment']}")
print(f"Language: {example['language']}")
print(f"Before:\n{example['before_code'][:200]}")
print(f"After:\n{example['after_code'][:200]}")
```

### Filter by language

```python
python_reviews = ds["train"].filter(lambda x: x["language"] == "Python")
```

### Filter by quality

```python
high_quality = ds["train"].filter(lambda x: x["quality_score"] >= 0.5)
```

### Positive examples only

```python
positives = ds["train"].filter(lambda x: not x["is_negative"])
```

### Negative examples only

```python
negatives = ds["train"].filter(lambda x: x["is_negative"])
```

## Citation

If you use this dataset, please cite:

```bibtex
@dataset{takizawa2026codereviewdiffs,
  title={Code Review Diffs: A Large-Scale Dataset of Review-Driven Code Changes},
  author={Takizawa, Ronan},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/ronantakizawa/github-codereview}
}
```
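
## Illustrative sketch: chunk extraction, change verification, and repo-level splits

The dataset card describes triplet extraction, change verification, and repository-deterministic splits only in prose, and the actual pipeline code is not shown. The following is a minimal sketch of how those three steps could look, not the author's implementation. The function names (`extract_chunk`, `chunk_changed`, `assign_split`), the 25-line context window on each side, the 1-based line numbering, the whitespace normalization, and the SHA-256 bucketing scheme are all assumptions made for illustration.

```python
import hashlib


def extract_chunk(source: str, comment_line: int, context: int = 25) -> tuple[str, int]:
    """Return roughly 2 * context lines around a 1-based line number.

    Also returns the comment's 1-based position inside the chunk.
    Assumption: the real pipeline may center or clamp the window differently.
    """
    lines = source.splitlines()
    # Clamp the 0-based index into the valid range of the file.
    idx = min(max(comment_line - 1, 0), max(len(lines) - 1, 0))
    start = max(idx - context, 0)
    end = min(idx + context, len(lines))
    return "\n".join(lines[start:end]), idx - start + 1


def _normalize(chunk: str) -> list[str]:
    """Strip trailing whitespace per line so cosmetic edits do not count."""
    return [line.rstrip() for line in chunk.strip().splitlines()]


def chunk_changed(before: str, after: str) -> bool:
    """Hypothetical change check: keep a triplet only if the chunks differ."""
    return _normalize(before) != _normalize(after)


def assign_split(repo_name: str) -> str:
    """Map a repo name to a split so every row of a repo lands together.

    Assumption: a hash-bucket scheme approximating 90/5/5; the card only
    states that splits are deterministic by repository.
    """
    digest = hashlib.sha256(repo_name.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < 90:
        return "train"
    return "test" if bucket < 95 else "validation"
```

With a 100-line file and a comment on line 60, `extract_chunk` returns a 50-line chunk with the comment at position 26 of the chunk. Hashing on `repo_name` rather than on individual rows is what keeps a repository from leaking across train and evaluation splits.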
Provider:
ronantakizawa