HypothesisWorks/Hypothesis-Corpus-2026
Hugging Face · Updated 2026-04-09 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/HypothesisWorks/Hypothesis-Corpus-2026
---
license: mit
task_categories:
- other
tags:
- property-based-testing
- python-hypothesis
size_categories:
# ~thousands of repos, ~tens of thousands of tests, ~millions of test cases.
# I'm going with "tests" (ie nodes in our dataset) as the unit of size here.
- 10K<n<100K
---
# Hypothesis Corpus 2026
A comprehensive dataset of 28,928 [Hypothesis](https://hypothesis.works/) tests across 1,529 repositories. Full code at https://github.com/Liam-DeVoe/Hypothesis-Corpus.
Unlike other property-based testing datasets, the Hypothesis Corpus also includes runtime data collected from executing the property-based tests, such as timing information.
## Methodology
Collection date: October 2025.
### Collection methodology
- GitHub repositories are collected by querying the GitHub API for repositories containing any of these three strings: `import hypothesis`, `from hypothesis import`, or `from hypothesis.`.
- Filter out any repository over 1 GB in size.
- Filter out any repository which is a fork with fewer than 5 stars.
- Filter out any repository with no `test_*.py` or `conftest.py` files, or with a `site-packages` directory containing vendored code.
- Filter out any duplicate repositories using [MinHash](https://en.wikipedia.org/wiki/MinHash)[^1].
  - Collect a list of Python files in the repository.
  - Filter out files inside of `node_modules`, `.venv`, build artifacts, `site-packages`, etc.
  - Remove files with fewer than 25 lines.
  - Randomly downsample to 500 files.
  - Shingle each file into 1-line shingles. Normalize by stripping whitespace and removing empty lines.
  - Generate a MinHash for each file, with `n=128` permutations.
  - Given two repositories `r1` and `r2` and a file `f` from `r1`, define `file_is_duplicate(r1, r2, f)` as true if the estimated Jaccard similarity between `minhash(f)` and the MinHash of any file in `r2` is above `0.75`. `r1` and `r2` are duplicates if `file_is_duplicate(r1, r2, f)` is true for at least 30% of the files in `r1`, and `file_is_duplicate(r2, r1, f)` is true for at least 30% of the files in `r2`.
  - If `r1` and `r2` are duplicates, filter out the one with fewer stars, breaking ties arbitrarily.
- Filter out any repository whose tests cannot be executed.
  - We attempt to automatically resolve dependencies for each repository.
    - Attempt to install the repository itself with `pip install`.
    - Attempt to install the `[dev]`, `[test]`, and `[tests]` extras.
    - Automatically search for `.txt` files which look like requirements files, and `pip install -r` those.
    - Always install `pytest==8.4.2` and `hypothesis==6.140.3`.
  - After installation, we perform test collection with `pytest --collect-only`. Tests are identified as Hypothesis tests by `hypothesis.is_hypothesis_test`.
  - Repositories where `pytest --collect-only` fails, finds no Hypothesis tests, or times out are filtered out.
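The repository-level duplicate rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the corpus's actual code: it substitutes exact Jaccard similarity on sets of normalized lines for the 128-permutation MinHash estimate, and every function name here (`shingles`, `jaccard`, `sample_files`, `duplicate_fraction`, `repos_are_duplicates`) is hypothetical.

```python
import random


def shingles(text: str) -> frozenset[str]:
    """1-line shingles: strip whitespace from each line and drop empty lines."""
    return frozenset(line.strip() for line in text.splitlines() if line.strip())


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Exact Jaccard similarity (stand-in for the MinHash estimate)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def sample_files(texts: list[str], min_lines: int = 25,
                 max_files: int = 500, seed: int = 0) -> list[str]:
    """Drop files shorter than min_lines, then randomly downsample to max_files."""
    eligible = [t for t in texts if len(t.splitlines()) >= min_lines]
    if len(eligible) <= max_files:
        return eligible
    return random.Random(seed).sample(eligible, max_files)


def duplicate_fraction(r1: list[frozenset[str]], r2: list[frozenset[str]],
                       threshold: float = 0.75) -> float:
    """Fraction of files in r1 whose similarity to some file in r2 exceeds threshold."""
    if not r1:
        return 0.0
    hits = sum(1 for f in r1 if any(jaccard(f, g) > threshold for g in r2))
    return hits / len(r1)


def repos_are_duplicates(r1: list[frozenset[str]], r2: list[frozenset[str]],
                         file_threshold: float = 0.75,
                         repo_threshold: float = 0.30) -> bool:
    """Both directions must reach repo_threshold for the pair to count as duplicates."""
    return (duplicate_fraction(r1, r2, file_threshold) >= repo_threshold
            and duplicate_fraction(r2, r1, file_threshold) >= repo_threshold)
```

Requiring the 30% threshold in both directions means a small repository vendored inside a much larger one is only flagged if the shared files also make up a sizeable share of the larger repository.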
### Runtime methodology
We generate 500 test cases for each node, and record information about the runtime of each test case and the overall test result. Note that some nodes have fewer than 500 test cases, either because the test failed on one of the test cases, or because Hypothesis stopped early after detecting that it had exhausted the input space.
Details:
- Node execution occurred in an isolated Docker container, with 4 repositories running in parallel. Tests were run on a 2021 MacBook Pro with an M1 Pro chip and 16 GB of RAM.
- This was done on my personal computer. While many nodes were executed overnight, many were not, and as a result timing information might be somewhat inconsistent depending on system load at that time.
- We use the following Hypothesis settings during execution: `@settings(max_examples=500, deadline=None, database=None, suppress_health_check=list(HealthCheck), phases=["generate"])`.
- Each node execution has a 5-minute timeout. Nodes which time out are recorded as `runtime_summary.status="error"` with a `TimeoutExpired` traceback in `runtime_summary.error_message`.
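As a rough illustration of the timeout behaviour described above, the sketch below runs a command with a time limit and records a timeout as `status="error"` with the `TimeoutExpired` traceback. It assumes each node is run as a subprocess via `subprocess.run`; the helper name `run_with_timeout` and the mapping from exit code to `"passed"`/`"failed"` are hypothetical and not taken from the corpus's code.

```python
import subprocess
import sys
import traceback


def run_with_timeout(cmd: list[str], timeout_seconds: float = 300.0) -> dict:
    """Run one command; a timeout becomes status="error" plus a traceback."""
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired:
        # The formatted traceback names subprocess.TimeoutExpired.
        return {"status": "error", "error_message": traceback.format_exc()}
    if proc.returncode == 0:
        return {"status": "passed", "error_message": None}
    # Assumption: any non-zero exit code is treated as a failure here.
    return {"status": "failed", "error_message": proc.stdout + proc.stderr}


if __name__ == "__main__":
    # A deliberately slow command with a short limit, to trigger the timeout path.
    slow = [sys.executable, "-c", "import time; time.sleep(10)"]
    print(run_with_timeout(slow, timeout_seconds=0.5)["status"])
```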
## Database files
The full dataset is split across three files, so you don't have to download data you don't need.
- **`data.db`**: Core tables. Most likely all you want.
- **`data_test_cases.db`**: Per-test-case runtime data.
- **`data_minhashes.db`**: MinHash data used for repository deduplication.
`data.db` is required. The other files can be downloaded alongside `data.db`, and attached in-process using [ATTACH](https://sqlite.org/lang_attach.html):
```sql
-- run inside a connection to data.db
ATTACH DATABASE 'data_test_cases.db' AS test_cases;
ATTACH DATABASE 'data_minhashes.db' AS minhashes;
SELECT * FROM runtime_test_case rt
JOIN core_node cn ON rt.node_id = cn.id;
```
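The same attachment can be done from Python with the standard-library `sqlite3` module. The following is a minimal sketch, assuming the three files have been downloaded locally; the helper name `open_corpus` and its parameters are hypothetical.

```python
import sqlite3
from typing import Optional


def open_corpus(data_path: str,
                test_cases_path: Optional[str] = None,
                minhashes_path: Optional[str] = None) -> sqlite3.Connection:
    """Open data.db and attach the optional files under the aliases used above."""
    conn = sqlite3.connect(data_path)
    conn.row_factory = sqlite3.Row  # allow access to columns by name
    if test_cases_path is not None:
        conn.execute("ATTACH DATABASE ? AS test_cases", (test_cases_path,))
    if minhashes_path is not None:
        conn.execute("ATTACH DATABASE ? AS minhashes", (minhashes_path,))
    return conn


def count_test_cases_per_node(conn: sqlite3.Connection) -> list:
    """Join per-test-case rows back to their node, as in the SQL example."""
    return conn.execute(
        """
        SELECT cn.node_id AS node_id, COUNT(*) AS n
        FROM runtime_test_case rt
        JOIN core_node cn ON rt.node_id = cn.id
        GROUP BY cn.id
        """
    ).fetchall()
```

Because table names are unique across the three files, queries can refer to `runtime_test_case` without the `test_cases.` prefix once the file is attached.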
### `core_repository`
*In database file: `data.db`*
One row per GitHub repository.
| Column | Type | Description |
|---------------------|---------|-------------|
| full_name | TEXT | GitHub repository name, in the form "{owner}/{repository}". |
| size_bytes | INTEGER | Repository size in bytes. |
| stargazers_count | INTEGER | GitHub stars at time of dataset collection. |
| is_fork | BOOLEAN | Whether the repo is a GitHub fork. |
| status | TEXT | One of "valid" or "invalid". "invalid" represents repositories which were filtered out at some stage. |
| status_reason | TEXT | Reason if invalid. One of "invalid_repo", "invalid_install (no_hypothesis_tests)", "invalid_install (timed_out)", "minhash_duplicate ({kept_repo}, {similarity_a}%/{similarity_b}%)", "minhash_error", "install_error", "repo_404". |
| requirements | TEXT | String contents of a requirements.txt file containing dependencies required to run the tests in the repository. |
| node_ids | JSON | List of Hypothesis test node ids from `pytest --collect-only`. |
| other_node_ids | JSON | List of non-Hypothesis test node ids from `pytest --collect-only`. |
| commit_hash | TEXT | Repository Git commit hash at time of dataset collection. |
| collection_returncode | INTEGER | Exit code from `pytest --collect-only`. NULL for repositories filtered before installation. 0=success, 1=test failures, 2=interrupted, 3=internal error, 4=usage error, 5=no tests collected. |
| collection_output | TEXT | Full Docker container log output from the installation and test collection step. |
| experiments_ran | JSON | Internal bookkeeping. |
### `core_node`
*In database file: `data.db`*
One row per node.
A note on parametrization: Pytest has the concept of "parametrization", where you can expand a test function into multiple tests by parametrizing the test arguments over a set of values. We therefore avoid the word "test" in this dataset, because it is ambiguous whether we mean the combination of a test function and its set of parametrizations, or a test after parametrization is applied.
Like Pytest, we refer to a test after parametrization is applied, as well as a test without any parametrization, as a "node". The "node id" is the fully qualified name of the test function, combined with a string representation of the chosen parametrization, if applicable. For example: `tests/test_math.py::test_addition[1-2-3]` is the node corresponding to the parametrization of the `tests/test_math.py::test_addition` test with the arguments `1`, `2`, and `3`.
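To make the node id format concrete, here is a small sketch that splits a node id into the test's fully qualified name and its parametrization string. The helper `split_node_id` is hypothetical, and it assumes the parametrization is the bracketed suffix at the end of the id, with no `[` earlier in the path or test name.

```python
from typing import Optional, Tuple


def split_node_id(node_id: str) -> Tuple[str, Optional[str]]:
    """Return (test name, parametrization id), or (node_id, None) if unparametrized."""
    if node_id.endswith("]") and "[" in node_id:
        # Split at the first "[" and drop the trailing "]".
        base, _, params = node_id[:-1].partition("[")
        return base, params
    return node_id, None


if __name__ == "__main__":
    print(split_node_id("tests/test_math.py::test_addition[1-2-3]"))
    print(split_node_id("tests/test_math.py::test_addition"))
```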
It is sometimes useful to talk about a node independent of its parametrizations. One might want to compute some statistic over the tests while ignoring parametrizations, for example the average source code size. To facilitate this, we pick an arbitrary node in each parametrization group and declare it the "canonical node" by setting the `canonical_parametrization` column to true. Nodes with no parametrization are also marked as canonical. The canonical node is the one which should be used for all parametrization-agnostic queries.
| Column | Type | Description |
|---------------------------|---------|-------------|
| repo_id | INTEGER | Foreign key to `core_repository.id`. |
| node_id | TEXT | Pytest node ID. |
| canonical_parametrization | BOOLEAN | True for an arbitrary node in each parametrize group. |
| source_code | TEXT | Source code of the node function body, via `inspect.getsource()`. |
| is_stateful | BOOLEAN | Whether this node is a [Hypothesis stateful test](https://hypothesis.readthedocs.io/en/latest/stateful.html). |
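The parametrization-agnostic statistic mentioned above (average source code size) could be computed along these lines. This is a sketch only: it assumes `canonical_parametrization` is stored as the integer `1` for true, and the helper name `average_source_length` is hypothetical.

```python
import sqlite3


def average_source_length(conn: sqlite3.Connection) -> float:
    """Mean source length in characters, counting each parametrize group once."""
    row = conn.execute(
        """
        SELECT AVG(LENGTH(source_code))
        FROM core_node
        WHERE canonical_parametrization = 1  -- assumption: booleans stored as 0/1
        """
    ).fetchone()
    return row[0]
```

Without the `canonical_parametrization` filter, a test with many parametrizations would contribute its source code once per parametrization and skew the average.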
### `runtime_summary`
*In database file: `data.db`*
One row per node. Aggregates data from all test cases for a node.
| Column | Type | Description |
|-----------------------|---------|-------------|
| node_id | INTEGER | Foreign key to `core_node.id`. |
| status | TEXT | One of: `"passed"`, `"failed"`, `"skipped"`, `"error"`. |
| execution_time | REAL | Wall-clock time of all test cases, in seconds. |
| error_message | TEXT | Traceback text iff status is `"failed"` or `"error"`. |
| count_test_cases | INTEGER | Number of test cases executed. |
| coverage | JSON | Aggregate line coverage. Format: `{"file_path": [line_numbers]}`. |
| line_execution_counts | JSON | Per-line hit counts. Format: `{"file_path": {"line_num": count}}`. |
| unique_lines_covered | INTEGER | Sum of unique lines covered across all files. |
| settings | JSON | Mapping of Hypothesis setting values of the node. Keys: `max_examples`, `deadline`, `derandomize`, `stateful_step_count`, `suppress_health_check`, `database`, `backend`, `phases`, `verbosity`, `print_blob`, `report_multiple_bugs`. |
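The two JSON coverage columns can be decoded with the standard `json` module. The sketch below recomputes a count in the spirit of `unique_lines_covered` from a `coverage` value and lists the most frequently executed lines from a `line_execution_counts` value. It assumes both columns hold JSON text in the formats given in the table; the helper names are hypothetical.

```python
import json


def unique_lines_covered(coverage_json: str) -> int:
    """Sum of distinct covered line numbers over all files."""
    coverage = json.loads(coverage_json)
    return sum(len(set(lines)) for lines in coverage.values())


def hottest_lines(line_counts_json: str, n: int = 5) -> list:
    """Top-n (file_path, line_number, count) triples by hit count."""
    counts = json.loads(line_counts_json)
    triples = [
        (path, int(line), count)
        for path, per_line in counts.items()
        for line, count in per_line.items()  # JSON object keys are strings
    ]
    return sorted(triples, key=lambda t: t[2], reverse=True)[:n]
```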
### `runtime_test_case`
*In database file: `data_test_cases.db`*
One row per test case.
| Column | Type | Description |
|-----------------|---------|-------------|
| node_id | INTEGER | Foreign key to `core_node.id`. |
| test_case_number | INTEGER | Test case number, in 0-indexed execution order. |
| coverage | JSON | Line coverage for this test case. Format: `{"file_path": [line_numbers]}` |
| timing | JSON | Timing breakdown, in seconds. Corresponds to the `observation["timing"]` field in [observability](https://hypothesis.readthedocs.io/en/latest/reference/integrations.html#observability). |
| predicates | JSON | Predicates like `assume()` and `.filter` and whether they succeeded. Corresponds to the `observation["predicates"]` field in [observability](https://hypothesis.readthedocs.io/en/latest/reference/integrations.html#observability). |
| features | JSON | Test case features, including `event()` and `note()`. Corresponds to the `observation["features"]` field in [observability](https://hypothesis.readthedocs.io/en/latest/reference/integrations.html#observability). |
| data_status | INTEGER | Status of the test case after finishing. Possible values: `0` (exceeded entropy cap), `1` (filtered by `assume()` or `.filter`), `2` (valid), `3` (caused a failure). Corresponds to the `observation["data_status"]` field in [observability](https://hypothesis.readthedocs.io/en/latest/reference/integrations.html#observability). |
| status_reason | TEXT | Human-readable reason for `data_status`. Corresponds to the `observation["status_reason"]` field in [observability](https://hypothesis.readthedocs.io/en/latest/reference/integrations.html#observability). |
| choices_size | INTEGER | Amount of entropy consumed to generate this test case. Proxy for input complexity. |
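When working with `data_status` in Python, it can help to give the four integer codes names. The sketch below does this with an `IntEnum` and tallies a list of codes. The enum and its member names are hypothetical labels chosen to match the descriptions in the table above, not identifiers from the dataset.

```python
from collections import Counter
from enum import IntEnum


class DataStatus(IntEnum):
    """Names for the data_status codes documented above (names are illustrative)."""
    OVERRUN = 0   # exceeded entropy cap
    FILTERED = 1  # filtered by assume() or .filter
    VALID = 2     # valid test case
    FAILED = 3    # caused a failure


def tally_statuses(codes: list[int]) -> Counter:
    """Count test cases per named status."""
    return Counter(DataStatus(code) for code in codes)


if __name__ == "__main__":
    for status, count in sorted(tally_statuses([2, 2, 1, 0, 3, 2]).items()):
        print(status.name, count)
```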
### `facets_nodes`
*In database file: `data.db`*
One row per node-level facet. Each canonical node has one summary, and one or more patterns and domains.
| Column | Type | Description |
|---------|---------|-------------|
| node_id | INTEGER | Foreign key to `core_node.id`. Guaranteed to be a foreign key to a row with `canonical_parametrization=True`. |
| type | TEXT | One of: `"summary"`, `"pattern"`, or `"domain"` |
| facet | TEXT | For "summary": 1-2 sentence description of what the node checks. For "pattern": the abstract property type (for example, "inverse relationship between two functions", "idempotence of repeated operations"). For "domain": the technical area being tested (for example, "JSON serialization", "cryptographic operations"). Generated by Claude Haiku 4.5. |
### `node_aggregate_metrics`
*In database file: `data.db`*
One row per node. Derives aggregate metrics from `runtime_test_case`, for query performance. All data in this table is derived and can be dropped or regenerated at will.
| Column | Type | Description |
|---------------------------|---------|-------------|
| node_id | INTEGER | Primary key. Foreign key to `core_node.id`. |
| median_execution_time | REAL | Median `execute:test` timing across all test cases, in seconds. |
| median_generation_percent | REAL | Median percentage of time spent in generation (vs execution) per test case. |
| generation_percent | REAL | Overall percentage of time spent in generation across all test cases. |
| execution_time_cv | REAL | Coefficient of variation of per-test-case execution time. |
| percent_overrun | REAL | Percentage of test cases with `data_status=0` (overrun). |
| percent_invalid | REAL | Percentage of test cases with `data_status=1` (filtered). |
| median_feature_count | REAL | Median number of features per test case. |
| min_choices_size | INTEGER | Minimum `choices_size` across all test cases. |
| median_choices_size | REAL | Median `choices_size` across all test cases. |
| max_choices_size | INTEGER | Maximum `choices_size` across all test cases. |
| generation_curve | JSON | Per-node generation curve: maps percentage through the run (0-100) to mean generation percentage at that point. Format: `{"0": 95.1, "1": 89.2, ...}`. |
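Since this table is derived, a few of its columns can be approximated directly from per-test-case rows. The sketch below is one plausible reading of the column descriptions, not the corpus's actual derivation: it assumes the coefficient of variation is the population standard deviation divided by the mean, that test cases without an `execute:test` timing are skipped for the timing metrics, and the function name `aggregate_metrics` is hypothetical.

```python
import statistics


def aggregate_metrics(test_cases: list[dict]) -> dict:
    """test_cases: dicts with 'data_status' (int) and 'timing' (dict of seconds)."""
    times = [
        tc["timing"]["execute:test"]
        for tc in test_cases
        if "execute:test" in tc["timing"]
    ]
    n = len(test_cases)
    mean = statistics.fmean(times) if times else 0.0
    return {
        "median_execution_time": statistics.median(times) if times else None,
        # Assumption: CV = population standard deviation / mean.
        "execution_time_cv": statistics.pstdev(times) / mean if mean else None,
        "percent_overrun": 100.0 * sum(tc["data_status"] == 0 for tc in test_cases) / n if n else None,
        "percent_invalid": 100.0 * sum(tc["data_status"] == 1 for tc in test_cases) / n if n else None,
    }
```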
### `core_minhashes`
*In database file: `data_minhashes.db`*
MinHash data for a repository. One row per file.
| Column | Type | Description |
|--------------|---------|-------------|
| repo_id | INTEGER | Foreign key to `core_repository.id`. |
| minhash_data | BLOB | Serialized `datasketch.MinHash` object. Can be deserialized with `pickle.loads(minhash_data)`. |
## Caveats
The format of some data in this dataset might be surprising. This section clarifies a few kinds of confusing rows.
### Empty valid test case timing
A small number of `runtime_test_case` rows with `data_status=2` have `timing={}`, which is a surprising combination. This was caused by errors raised inside Hypothesis's observability reporting during test case execution, for example `to_jsonable()` or `_repr_pretty_` raising an exception.
Hypothesis could be made more robust to these errors in the future, but they are expected problems with the code under test rather than bugs in Hypothesis. For analysis purposes, these test cases can be excluded, or treated as "test body did not execute".
### Empty overrun and filtered test case timing
Many, but not all, `runtime_test_case` rows with `data_status=0` or `data_status=1` have `timing={}`. This is expected: if a test case reaches its entropy cap (`data_status=0`) or is filtered out (`data_status=1`) before the test case arguments are generated, no timing information is reported. If this happens in the test body, after the test case arguments are generated, then timing information is reported.
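To see how common the empty-timing rows described in this section are, one could count them per `data_status`. The sketch below assumes `timing` is stored as JSON text (or NULL) and that `runtime_test_case` is reachable from the connection, for example after attaching `data_test_cases.db`; the helper name `empty_timing_counts` is hypothetical.

```python
import json
import sqlite3
from collections import Counter


def empty_timing_counts(conn: sqlite3.Connection) -> Counter:
    """Count rows whose timing is empty ({} or NULL), grouped by data_status."""
    counts = Counter()
    query = "SELECT data_status, timing FROM runtime_test_case"
    for data_status, timing in conn.execute(query):
        if not timing or not json.loads(timing):
            counts[data_status] += 1
    return counts
```

A non-zero count for `data_status=2` corresponds to the first caveat, while counts for `0` and `1` correspond to the second.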
[^1]: I manually examined 30 random repositories which we identified as duplicates and found 0 false positives. The two main categories were (1) re-uploads of a repository with minor changes, for example experimenting with a bugfix, feature, or CI workflow, and (2) vendoring the entire source code of a dependency.
Provided by: HypothesisWorks



