Shlok95/spec-verify-dataset-v2
Hugging Face · Updated 2026-04-08 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/Shlok95/spec-verify-dataset-v2
---
license: mit
language:
- en
pretty_name: spec-verify v2 (English, quality-filtered spec + code)
tags:
- software-engineering
- specification
- code
- github
- english
- quality-filtered
size_categories:
- 10K<n<100K
---
# spec-verify dataset **v2**
## Description
This is the **v2** release on the Hugging Face Hub (`spec-verify-dataset-v2`), built to provide **10,000+** high-quality **(specification, code)** pairs.
- **Quality filtered**: non-English and low-signal rows are removed using heuristics (ASCII/Latin ratio, repetition, minimum word counts, valid docstring shape for Python rows).
- **English-oriented**: specifications are cleaned to **ASCII/Latin**-heavy text; issue/PR and docstring rows must pass English dominance checks.
- **Real code for docstring rows**: the `docstring` subset requires **substantial Python implementation** (not signature-only), including logic keywords and length checks.
- **GitHub issues / PRs**: natural-language specifications only (`code` may be empty); filtered for English and non-repetitive text.
Pairs are mined from public GitHub repositories. Intended for **spec-to-code**, **code-to-spec**, and **verification** research.
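
The heuristics listed above could look roughly like the sketch below. It is not the spec-verify pipeline's actual code: the function names, every threshold, and the choice of AST node types standing in for "logic keywords" are assumptions made for illustration only.

```python
import ast
import re
from collections import Counter

# All thresholds are illustrative assumptions, not the pipeline's real values.
MIN_ASCII_RATIO = 0.9
MIN_WORDS = 20
MAX_TOP_WORD_SHARE = 0.3
MIN_BODY_STATEMENTS = 3
# Assumed stand-in for the card's "logic keywords".
LOGIC_NODES = (ast.If, ast.For, ast.While, ast.Try, ast.With, ast.Return)


def ascii_ratio(text: str) -> float:
    """Fraction of characters that are plain ASCII."""
    if not text:
        return 0.0
    return sum(ch.isascii() for ch in text) / len(text)


def is_repetitive(text: str) -> bool:
    """Flag text in which a single word accounts for too many tokens."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words:
        return True
    top_count = Counter(words).most_common(1)[0][1]
    return top_count / len(words) > MAX_TOP_WORD_SHARE


def passes_text_filter(spec: str) -> bool:
    """Combine the ASCII-ratio, minimum-word-count, and repetition checks."""
    return (
        ascii_ratio(spec) >= MIN_ASCII_RATIO
        and len(spec.split()) >= MIN_WORDS
        and not is_repetitive(spec)
    )


def has_substantial_body(code: str) -> bool:
    """Require a documented function with real logic, not just a signature."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        if ast.get_docstring(node) is None:
            continue
        body = node.body[1:]  # skip the docstring expression
        has_logic = any(
            isinstance(inner, LOGIC_NODES)
            for stmt in body
            for inner in ast.walk(stmt)
        )
        if len(body) >= MIN_BODY_STATEMENTS and has_logic:
            return True
    return False
```

Under these assumptions, `passes_text_filter` would apply to the `specification` of every row, while `has_substantial_body` would apply only to the `code` of `docstring` rows.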
## Data fields
| Column | Meaning |
|--------|---------|
| `id` | Stable hash identifier for the row |
| `source` | `github_issue` (includes PR descriptions in this schema) or `docstring` |
| `repo` | Source repository in `owner/name` form |
| `specification` | English-cleaned NL spec (issue/PR text, or signature + docstring + optional structured notes from extraction) |
| `code` | Full function implementation for `docstring` rows; often empty for `github_issue` |
| `url` | Link to the issue, pull request, or file on GitHub |
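
As a rough illustration of the schema in the table above, the following sketch checks a single row against the documented columns. The helper name `validate_row` and the exact rules (for example the `owner/name` regular expression) are assumptions derived from the table, not part of the dataset tooling.

```python
import re
from typing import List, Mapping

REQUIRED_COLUMNS = ("id", "source", "repo", "specification", "code", "url")
ALLOWED_SOURCES = {"github_issue", "docstring"}
# Assumed shape for `owner/name`: two non-empty parts separated by one slash.
REPO_PATTERN = re.compile(r"^[^/\s]+/[^/\s]+$")


def validate_row(row: Mapping[str, str]) -> List[str]:
    """Return a list of schema problems; an empty list means the row looks valid."""
    problems = [f"missing column: {c}" for c in REQUIRED_COLUMNS if c not in row]
    if problems:
        return problems
    if row["source"] not in ALLOWED_SOURCES:
        problems.append(f"unexpected source: {row['source']!r}")
    if not REPO_PATTERN.match(row["repo"]):
        problems.append(f"repo is not owner/name: {row['repo']!r}")
    if not row["specification"].strip():
        problems.append("empty specification")
    # Per the table, docstring rows carry a full implementation, while
    # github_issue rows may legitimately have an empty `code`.
    if row["source"] == "docstring" and not row["code"].strip():
        problems.append("docstring row has empty code")
    return problems
```

Rows could be obtained with the Hugging Face `datasets` library, for example via `load_dataset("Shlok95/spec-verify-dataset-v2")`; the card does not state which splits exist, so inspect the returned object before indexing into it.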
## Example record
```json
{
  "id": "d3b066c9609ab8e6",
  "source": "github_issue",
  "repo": "aio-libs/aiohttp",
  "specification": "GunicornWebWorker fails to reset SIGCHLD causing spurious \"Worker exited\" errors\n\n### Describe the bug\n\nThe `aiohttp.worker.GunicornWebWorker` class overrides `init_signals()` but fails to reset SIGCHLD to SIG_DFL. This causes workers to inherit the gunicorn master arbiter's SIGCHLD handler, leading to spurious error logs when application code spawns subprocesses.\n\n## Evidence\n\n### 1. aiohttp/worker.py has incomplete fix\nLines 191-192 contain a comment about resetting signals, but NO implementation:\n\n```python\n# Reset signals so Gunicorn doesn't swallow subprocess return codes\n# See: https://github.com/aio-libs/aiohttp/issues/6130\n# CODE IS MISSING HERE!\n```\n\n### 2. Base Worker resets SIGCHLD correctly\n`gunicorn/workers/base.py` properly resets SIGCHLD:\n\n```python\nSIGNALS = [\"ABRT\", \"HUP\", \"QUIT\", \"INT\", \"TERM\", \"USR1\", \"USR2\", \"WINCH\", \"CHLD\"]\n\ndef init_signals(self):\n # reset signaling\n for s in self.SIGNALS:\n signal.signal(s, signal.SIG_DFL) # Resets SIGCHLD\n```\n\n### 3. aiohttp Worker doesn't call super()\n`aiohttp/worker.py:160-193` overrides `init_signals()` but never calls `super().init_signals()`, so SIGCHLD is never reset.\n\n### To Reproduce\n\n1.Install gunicorn and aiohttp latest versions\n2.Run this code from below, see usage inside\n```\n#!/usr/bin/env python3\n\"\"\"\nReproduction of aiohttp GunicornWebWorker SIGCHLD inheritance bug.\n\nUsage:\n python aiohttp_sigchld_bug.py # Show the bug\n python aiohttp_sigchld_bug.py --fixed # Show with fix\n\nBug: Workers inherit master's SIGCHLD handler, logging spurious errors\nabout subprocess PIDs as if they were worker PIDs.\n\nExpected: Clean worker startup\nActual (without fix): ERROR logs like \"Worker (pid:X) exited with code 1\"\nwhere X never appears in \"Booting worker\" logs.\n\"\"\"\nimport sys\n\n# Run with gunicorn when executed directly\nif __name__ == \"__main__\":\n import os\n import subprocess\n import tempfile\n\n apply_fix = \"--fixed\" in sys.argv\n\n # Create a separate module that spawns subprocesses during import\n submodule_content = \"\"\"\nimport subprocess\nimport time\n\n# Spawn subprocesses during module import to trigger SIGCHLD\n# This simulates real-world code that checks versions, calls git, dpkg, etc.\nprocesses = []\nfor i in range(5):\n # Spawn subprocess that exits with code 1\n p = subprocess.Popen(['false'], stdout=subprocess.PIPE, stderr=subprocess.PIPE)\n processes.append(p)\n\n# Small delay creates window where:\n# 1. Subprocesses exit and become zombies\n# 2. SIGCHLD is delivered to worker\n# 3. If worker has inherited arbiter's handler, it reaps OTHER workers' zombies\ntime.sleep(0.05)\n\n# Clean up our zombies\nfor p in processes:\n p.wait()\n\"\"\"\n\n # Create the main app module\n app_content = \"\"\"\n# Import submodule that spawns subprocesses (triggers bug)\nimport subprocess_module\n\nfrom aiohttp import web\n\nasync def hello(request):\n return web.Response(text=\"Hello World\\\\n\")\n\nasync def application():\n app = web.Application()\n app.router.add_get(\"/\", hello)\n return app\n\"\"\"\n\n # Create config with optional fix\n config_content = \"\"\"\nworkers = 4\nworker_class = \"aiohttp.worker.GunicornWebWorker\"\nbind = \"127.0.0.1:8765\"\nloglevel = \"info\"\n\"\"\"\n\n if appl
… (truncated for card)
```
## Intended use
- Training and evaluating models that **generate formal or behavioral specs from natural language**
- **Spec-to-code** or **code-to-spec** alignment with **implementation-heavy** Python for docstring-sourced rows
- **Test / verification** research conditioned on specifications
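
For the spec-to-code and code-to-spec uses listed above, one possible way to turn rows into training examples is sketched below. The prompt templates and the helper names are hypothetical; the card does not prescribe any prompt format. Only `docstring` rows with non-empty `code` are used, since `github_issue` rows often have no code.

```python
from typing import Dict, Iterable, Iterator, Mapping

# Hypothetical prompt templates; the dataset card does not define any.
SPEC_TO_CODE_TEMPLATE = "Implement the following specification in Python:\n\n{spec}\n"
CODE_TO_SPEC_TEMPLATE = "Write a specification for the following Python code:\n\n{code}\n"

Row = Mapping[str, str]


def rows_with_code(rows: Iterable[Row]) -> Iterator[Row]:
    """Keep only docstring-sourced rows that actually carry an implementation."""
    for row in rows:
        if row["source"] == "docstring" and row["code"].strip():
            yield row


def spec_to_code_pairs(rows: Iterable[Row]) -> Iterator[Dict[str, str]]:
    """Yield prompt/target pairs where the target is the implementation."""
    for row in rows_with_code(rows):
        yield {
            "prompt": SPEC_TO_CODE_TEMPLATE.format(spec=row["specification"]),
            "target": row["code"],
        }


def code_to_spec_pairs(rows: Iterable[Row]) -> Iterator[Dict[str, str]]:
    """Yield prompt/target pairs where the target is the specification."""
    for row in rows_with_code(rows):
        yield {
            "prompt": CODE_TO_SPEC_TEMPLATE.format(code=row["code"]),
            "target": row["specification"],
        }
```

Both generators accept any iterable of row mappings, so they work on a plain list of dictionaries as well as on rows streamed from a loaded dataset.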
## Limitations
- Sources are public GitHub data; the quality and licensing of each row follow its upstream repository.
- Heuristic filters favor English and Latin script; they do not guarantee semantic correctness.
- Older **v1** uploads may use a different schema; this repo is **v2** (`spec-verify-dataset-v2`).
## Citation
If you use this dataset, cite the upstream repositories and acknowledge the spec-verify extraction pipeline.
Provider: Shlok95