
Shlok95/spec-verify-dataset-v2

Source: Hugging Face. Updated 2026-04-08; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Shlok95/spec-verify-dataset-v2
---
license: mit
language:
- en
pretty_name: spec-verify v2 (English, quality-filtered spec + code)
tags:
- software-engineering
- specification
- code
- github
- english
- quality-filtered
size_categories:
- 10K<n<100K
---

# spec-verify dataset **v2**

## Description

This is the **v2** dataset on the Hugging Face Hub (`spec-verify-dataset-v2`), built for **10,000+** high-quality **(specification, code)** pairs.

- **Quality filtered**: non-English and low-signal rows are removed using heuristics (ASCII/Latin ratio, repetition, minimum word counts, valid docstring shape for Python rows).
- **English-oriented**: specifications are cleaned to **ASCII/Latin**-heavy text; issue/PR and docstring rows must pass English dominance checks.
- **Real code for docstring rows**: the `docstring` subset requires **substantial Python implementation** (not signature-only), including logic keywords and length checks.
- **GitHub issues / PRs**: natural-language specifications only (`code` may be empty); filtered for English and non-repetitive text.

Pairs are mined from public GitHub repositories. Intended for **spec-to-code**, **code-to-spec**, and **verification** research.
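The card names the filtering heuristics but does not publish their implementation or thresholds. The following is a minimal sketch of how such checks might look, not the pipeline's actual code: every function name and every threshold (`MIN_ASCII_RATIO`, `MIN_WORDS`, `MAX_TOP_WORD_SHARE`, `MIN_BODY_STATEMENTS`) is a hypothetical choice made for illustration.

```python
import ast
import re
from collections import Counter

# Hypothetical thresholds; the dataset card does not state the real values.
MIN_ASCII_RATIO = 0.9
MIN_WORDS = 20
MAX_TOP_WORD_SHARE = 0.3
MIN_BODY_STATEMENTS = 2

# AST node types treated here as evidence of "logic" (an assumption).
LOGIC_NODES = (ast.If, ast.For, ast.While, ast.Try, ast.With, ast.Return)


def ascii_ratio(text: str) -> float:
    """Fraction of characters that are plain ASCII (0.0 for empty text)."""
    if not text:
        return 0.0
    return sum(ch.isascii() for ch in text) / len(text)


def top_word_share(text: str) -> float:
    """Share of all words taken up by the single most frequent word."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words:
        return 1.0
    return Counter(words).most_common(1)[0][1] / len(words)


def passes_text_filters(text: str) -> bool:
    """Apply the word-count, ASCII-ratio, and repetition checks together."""
    return (
        len(text.split()) >= MIN_WORDS
        and ascii_ratio(text) >= MIN_ASCII_RATIO
        and top_word_share(text) <= MAX_TOP_WORD_SHARE
    )


def has_substantial_body(source: str) -> bool:
    """True if the first function found has a docstring plus real logic."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if ast.get_docstring(node) is None:
                return False
            body = node.body[1:]  # skip the docstring expression
            has_logic = any(
                isinstance(inner, LOGIC_NODES)
                for stmt in body
                for inner in ast.walk(stmt)
            )
            return len(body) >= MIN_BODY_STATEMENTS and has_logic
    return False
```

Under these assumed thresholds, a signature-only stub such as `def f(x): """Doc.""" ; pass` is rejected by `has_substantial_body`, while a function with a loop and a return passes. Counting AST nodes rather than matching keyword strings avoids false hits on keywords that appear only inside comments or docstrings.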
## Data fields

| Column | Meaning |
|--------|---------|
| `id` | Stable hash identifier for the row |
| `source` | `github_issue` (includes PR descriptions in this schema) or `docstring` |
| `repo` | Source repository in `owner/name` form |
| `specification` | English-cleaned NL spec (issue/PR text, or signature + docstring + optional structured notes from extraction) |
| `code` | Full function implementation for `docstring` rows; often empty for `github_issue` |
| `url` | Link to the issue, pull request, or file on GitHub |

## Example record

```json
{
  "id": "d3b066c9609ab8e6",
  "source": "github_issue",
  "repo": "aio-libs/aiohttp",
  "specification": "GunicornWebWorker fails to reset SIGCHLD causing spurious \"Worker exited\" errors\n\n### Describe the bug\n\nThe `aiohttp.worker.GunicornWebWorker` class overrides `init_signals()` but fails to reset SIGCHLD to SIG_DFL. This causes workers to inherit the gunicorn master arbiter's SIGCHLD handler, leading to spurious error logs when application code spawns subprocesses.\n\n## Evidence\n\n### 1. aiohttp/worker.py has incomplete fix\nLines 191-192 contain a comment about resetting signals, but NO implementation:\n\n```python\n# Reset signals so Gunicorn doesn't swallow subprocess return codes\n# See: https://github.com/aio-libs/aiohttp/issues/6130\n# CODE IS MISSING HERE!\n```\n\n### 2. Base Worker resets SIGCHLD correctly\n`gunicorn/workers/base.py` properly resets SIGCHLD:\n\n```python\nSIGNALS = [\"ABRT\", \"HUP\", \"QUIT\", \"INT\", \"TERM\", \"USR1\", \"USR2\", \"WINCH\", \"CHLD\"]\n\ndef init_signals(self):\n # reset signaling\n for s in self.SIGNALS:\n signal.signal(s, signal.SIG_DFL) # Resets SIGCHLD\n```\n\n### 3. aiohttp Worker doesn't call super()\n`aiohttp/worker.py:160-193` overrides `init_signals()` but never calls `super().init_signals()`, so SIGCHLD is never reset.\n\n### To Reproduce\n\n1.Install gunicorn and aiohttp latest versions\n2.Run this code from below, see usage inside\n```\n#!/usr/bin/env python3\n\"\"\"\nReproduction of aiohttp GunicornWebWorker SIGCHLD inheritance bug.\n\nUsage:\n python aiohttp_sigchld_bug.py # Show the bug\n python aiohttp_sigchld_bug.py --fixed # Show with fix\n\nBug: Workers inherit master's SIGCHLD handler, logging spurious errors\nabout subprocess PIDs as if they were worker PIDs.\n\nExpected: Clean worker startup\nActual (without fix): ERROR logs like \"Worker (pid:X) exited with code 1\"\nwhere X never appears in \"Booting worker\" logs.\n\"\"\"\nimport sys\n\n# Run with gunicorn when executed directly\nif __name__ == \"__main__\":\n import os\n import subprocess\n import tempfile\n\n apply_fix = \"--fixed\" in sys.argv\n\n # Create a separate module that spawns subprocesses during import\n submodule_content = \"\"\"\nimport subprocess\nimport time\n\n# Spawn subprocesses during module import to trigger SIGCHLD\n# This simulates real-world code that checks versions, calls git, dpkg, etc.\nprocesses = []\nfor i in range(5):\n # Spawn subprocess that exits with code 1\n p = subprocess.Popen(['false'], stdout=subprocess.PIPE, stderr=subprocess.PIPE)\n processes.append(p)\n\n# Small delay creates window where:\n# 1. Subprocesses exit and become zombies\n# 2. SIGCHLD is delivered to worker\n# 3. If worker has inherited arbiter's handler, it reaps OTHER workers' zombies\ntime.sleep(0.05)\n\n# Clean up our zombies\nfor p in processes:\n p.wait()\n\"\"\"\n\n # Create the main app module\n app_content = \"\"\"\n# Import submodule that spawns subprocesses (triggers bug)\nimport subprocess_module\n\nfrom aiohttp import web\n\nasync def hello(request):\n return web.Response(text=\"Hello World\\\\n\")\n\nasync def application():\n app = web.Application()\n app.router.add_get(\"/\", hello)\n return app\n\"\"\"\n\n # Create config with optional fix\n config_content = \"\"\"\nworkers = 4\nworker_class = \"aiohttp.worker.GunicornWebWorker\"\nbind = \"127.0.0.1:8765\"\nloglevel = \"info\"\n\"\"\"\n\n if appl … (truncated for card)
```

## Intended use

- Training and evaluating models that **generate formal or behavioral specs from natural language**
- **Spec-to-code** or **code-to-spec** alignment with **implementation-heavy** Python for docstring-sourced rows
- **Test / verification** research conditioned on specifications

## Limitations

- Sources are public GitHub data; quality and licensing follow upstream repositories.
- Heuristic filters favor English and Latin script; they do not guarantee semantic correctness.
- Older **v1** uploads may use a different schema; this repo is **v2** (`spec-verify-dataset-v2`).

## Citation

If you use this dataset, cite the upstream repositories and acknowledge the spec-verify extraction pipeline.
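Because the two `source` values carry different guarantees (`code` is expected for `docstring` rows and is often empty for `github_issue` rows), it can help to check rows against the data fields table before using them. The sketch below works on plain dictionaries; `validate_record` and `split_by_source` are hypothetical helper names, and the specific checks are assumptions derived from the table rather than rules the card states.

```python
from collections import defaultdict

# Column names taken from the data fields table in the card.
REQUIRED_FIELDS = ("id", "source", "repo", "specification", "code", "url")
KNOWN_SOURCES = {"github_issue", "docstring"}


def validate_record(record: dict) -> list:
    """Return a list of problems; an empty list means the row fits the table."""
    problems = [
        f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record
    ]
    if problems:
        return problems
    if record["source"] not in KNOWN_SOURCES:
        problems.append(f"unknown source: {record['source']}")
    if record["repo"].count("/") != 1:
        problems.append("repo is not in owner/name form")
    if not record["specification"].strip():
        problems.append("empty specification")
    # Only docstring rows are expected to carry an implementation.
    if record["source"] == "docstring" and not (record["code"] or "").strip():
        problems.append("docstring row has no code")
    return problems


def split_by_source(records) -> dict:
    """Group rows by their `source` value, preserving input order."""
    groups = defaultdict(list)
    for record in records:
        groups[record["source"]].append(record)
    return dict(groups)
```

Rows could be obtained with `datasets.load_dataset("Shlok95/spec-verify-dataset-v2")`; the card does not name the available splits, so inspect the returned object before iterating. For spec-to-code experiments, the `docstring` group is the one to use, since `github_issue` rows may have no `code` to train or evaluate against.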

Provider:
Shlok95