
13point5/swe-grep-rlm-reputable-recent-5plus

Source: Hugging Face. Updated: 2026-03-22. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/13point5/swe-grep-rlm-reputable-recent-5plus
Resource description:
---
pretty_name: swe-grep-rlm-reputable-recent-5plus
task_categories:
- text-retrieval
language:
- en
tags:
- code
- software-engineering
- issue-localization
- bug-localization
- github
- retrieval
size_categories:
- 100<n<1K
---

# swe-grep-rlm-reputable-recent-5plus

This dataset is a GitHub-mined collection of issue- or PR-linked retrieval examples for repository-level code search and localization.

Each row is built from a merged pull request in a reputable, actively maintained open-source repository. The target labels are the PR's changed files, with a focus on non-test files.

## Summary

- Rows: 799
- Repositories: 46
- Query source:
  - 519 rows use linked issue title/body when GitHub exposed it
  - 280 rows fall back to PR title/body
- Non-test file count:
  - minimum: 5
  - median: 8
  - maximum: 264
- Distribution:
  - 501 rows with 5-10 non-test files
  - 169 rows with 11-20 non-test files
  - 129 rows with 21+ non-test files

## Files

- `reputable_recent_5plus.jsonl`: primary dataset file
- `reputable_recent_5plus.csv`: flattened mirror for quick inspection
- `reputable_recent_repos.txt`: repo seed list used for the sweep
- `scrape_github_prs.py`: collection script

## Schema

Each example includes:

- `repo`
- `pr_number`
- `pr_url`
- `pr_title`
- `pr_body`
- `merged_at`
- `query_text`
- `query_source`
- `linked_issues`
- `file_count`
- `non_test_file_count`
- `test_file_count`
- `files`
- `non_test_files`
- `test_files`
- `additions`
- `deletions`
- `source`

## Construction Notes

- Only merged PRs were considered.
- Rows were filtered to keep `non_test_file_count >= 5`.
- The collector prefers linked issue title/body when available, and otherwise falls back to PR text.
- For PRs with more than 100 changed files, additional file pages were fetched so the file lists are not truncated at the initial GraphQL response.
- File-type classification is heuristic. In particular, "non-test" is broader than "implementation-only" and may still include docs, config, changelog, or generated artifacts in some projects.

## Intended Use

This dataset is designed for:

- repository-level code retrieval
- issue localization
- training or evaluating rerankers and retrieval policies
- weak supervision for query-to-files tasks

It is not a gold-standard human-annotated benchmark. Labels come from merged PR diffs and linked issue/PR metadata.

## Provenance

The data is derived from public GitHub repositories and metadata from their issues and pull requests. Upstream repository licenses vary by project.
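
As a quick sanity check on the Summary and Construction Notes above, the sketch below shows one way to read JSONL rows and tally them into the same non-test file count buckets (5-10, 11-20, 21+). It is a minimal illustration, not the collection script: the helper names `iter_jsonl`, `bucket`, and `summarize` are hypothetical, and the sample rows are invented. Only the field name `non_test_file_count` and the file name `reputable_recent_5plus.jsonl` come from the dataset card.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Parse one JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def bucket(non_test_file_count: int) -> str:
    """Map a count onto the buckets listed in the Summary section."""
    if non_test_file_count < 5:
        # The card says rows were filtered to non_test_file_count >= 5,
        # so this bucket flags rows that violate that rule.
        return "<5"
    if non_test_file_count <= 10:
        return "5-10"
    if non_test_file_count <= 20:
        return "11-20"
    return "21+"


def summarize(rows: Iterable[dict]) -> Counter:
    """Count rows per non-test file count bucket."""
    return Counter(bucket(row["non_test_file_count"]) for row in rows)


# Invented sample rows for illustration only; they are not taken from the dataset.
sample_lines = [
    '{"repo": "example/repo-a", "non_test_file_count": 5}',
    '{"repo": "example/repo-a", "non_test_file_count": 8}',
    '{"repo": "example/repo-b", "non_test_file_count": 15}',
    '{"repo": "example/repo-c", "non_test_file_count": 264}',
]
sample_counts = summarize(iter_jsonl(sample_lines))
print(dict(sample_counts))

# To run against the real file (assuming it has been downloaded locally):
#   with open("reputable_recent_5plus.jsonl", encoding="utf-8") as f:
#       print(summarize(iter_jsonl(f)))
```

If the card's figures hold, running this on the full file should report 501, 169, and 129 rows for the three buckets and no rows under `<5`.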
Provider:
13point5