dataunitylab/json-schema
Source: Hugging Face. Updated 2024-05-31. Indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/dataunitylab/json-schema
Resource description:
---
language:
- en
license:
- unknown
---
# JSON Schema Dataset
This dataset is a collection of JSON Schema documents gathered from GitHub by searching with the Sourcegraph API.
# Step 1: Find a list of JSON Schema paths
The [Sourcegraph](https://sourcegraph.com/) code search API is used to find files that have a `.json` extension and contain `{\n "$schema": "https://json-schema.org/"`.
This is somewhat restrictive, but still manages to find a large number of schemas.

    pipenv run python slurp.py --outfile repos.csv
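
The search itself runs on Sourcegraph, but the matching criteria can be illustrated locally. The following is a minimal sketch, not the code in `slurp.py`: the function name `looks_like_json_schema` is hypothetical, and it assumes that any indentation is allowed before `"$schema"` and that the URL is matched as a prefix.

```python
import re

# Mirrors the criteria described above: the file starts with "{", a newline,
# some indentation, then a "$schema" key whose value begins with
# https://json-schema.org/ (prefix match is an assumption of this sketch).
SCHEMA_HEADER = re.compile(r'^\{\n\s*"\$schema":\s*"https://json-schema\.org/')


def looks_like_json_schema(path: str, content: str) -> bool:
    """Return True if a file has a .json extension and the expected header."""
    return path.endswith(".json") and SCHEMA_HEADER.match(content) is not None
```

A schema that declares `$schema` anywhere other than the first key, or that is minified onto one line, would not match, which is the restriction noted above.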
# Step 2: Fetch the history information for each file
We fetch every revision of each JSON Schema file.
Before downloading the files, we use the GitHub API to get the list of commit hashes.
The resulting data is saved to `commits.json`.

    pipenv run python fetch_history.py > commits.json
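
As a rough illustration of this step, the sketch below builds a request URL for the GitHub REST endpoint that lists commits touching a given path, and extracts the hashes from a decoded response. It is not taken from `fetch_history.py`; the helper names are hypothetical, and authentication, pagination loops, and rate limiting are left out.

```python
from urllib.parse import quote, urlencode


def commits_url(owner: str, repo: str, path: str, page: int = 1) -> str:
    """Build the URL listing commits that touched `path` (100 per page)."""
    query = urlencode({"path": path, "per_page": 100, "page": page})
    return f"https://api.github.com/repos/{quote(owner)}/{quote(repo)}/commits?{query}"


def commit_hashes(payload: list) -> list:
    """Extract commit hashes from the decoded JSON list returned by the API."""
    return [commit["sha"] for commit in payload]
```

Each hash identifies one revision of the file, which is what the download step needs.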
# Step 3: Download the JSON Schema files
This script downloads each schema that comes from GitHub and saves it into subfolders of the `data` directory.

    ./fetch_files.sh
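
The actual download is done by a shell script, so the Python below is only a sketch of how one revision could be mapped to a download URL and a local path. The `raw.githubusercontent.com` URL form is how GitHub serves raw file contents at a given commit; the `data/<owner>/<repo>/<sha>/<path>` layout and both function names are assumptions, not the layout `fetch_files.sh` necessarily uses.

```python
from pathlib import PurePosixPath
from urllib.parse import quote


def raw_url(owner: str, repo: str, sha: str, path: str) -> str:
    """URL of the raw file contents at a specific commit."""
    # quote() leaves "/" intact, so directory separators in the path survive.
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{sha}/{quote(path)}"


def local_path(owner: str, repo: str, sha: str, path: str, root: str = "data") -> PurePosixPath:
    """Hypothetical on-disk location for that revision (layout is assumed)."""
    return PurePosixPath(root) / owner / repo / sha / path
```

Keeping the commit hash in the path lets several revisions of the same file sit side by side.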
# Step 4: Validate each JSON Schema
The following script will read each schema in the `data` directory and confirm that it is a valid JSON Schema.
A copy of all valid schemas will be placed in the `valid_data` directory.
Note that schemas are parsed as [JSON5](https://json5.org/), which is more permissive about the syntax it accepts, but the final schemas are written as standard JSON.

    pipenv run python validate_schemas.py
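
The parse, check, and rewrite flow can be sketched as follows. This is not the content of `validate_schemas.py`: it assumes the third-party `json5` and `jsonschema` packages, falls back to weaker behavior when they are not installed so the sketch still runs, and uses the hypothetical name `normalize_schema`.

```python
import json

try:  # third-party parser that accepts comments, trailing commas, etc.
    import json5
    parse = json5.loads
except ImportError:  # fallback for this sketch only: strict JSON
    parse = json.loads

try:  # third-party library providing metaschema checks
    from jsonschema.exceptions import SchemaError
    from jsonschema.validators import validator_for
except ImportError:  # without it, only the structural check below is applied
    validator_for = None


def normalize_schema(text):
    """Return the schema re-serialized as standard JSON, or None if invalid."""
    try:
        schema = parse(text)
    except ValueError:  # raised by both parsers on syntax errors
        return None
    # A JSON Schema is an object or a boolean; reject anything else early.
    if not isinstance(schema, (dict, bool)):
        return None
    if validator_for is not None:
        try:
            # Pick the validator for the declared draft, then check the
            # schema against that draft's metaschema.
            validator_for(schema).check_schema(schema)
        except SchemaError:
            return None
    return json.dumps(schema, indent=2)
```

Writing the result with `json.dumps` is what turns a permissively parsed document back into standard JSON.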
# Step 5: Retrieve additional metadata
We also collect language information using [fastText](https://fasttext.cc/docs/en/language-identification.html) and fetch the associated license from the GitHub API.

    pipenv run python get_languages.py > languages.json
    pipenv run python get_licenses.py > licenses.json
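
A minimal sketch of both lookups is shown below, under stated assumptions. `detect_language` expects an already loaded fastText language-identification model (for example one returned by `fasttext.load_model`) and does not say which text from a schema the real script feeds it. `license_spdx_id` reads the `license.spdx_id` field of a decoded response from GitHub's repository license endpoint. Both function names are hypothetical.

```python
def detect_language(text, model):
    """Return a language code such as "en" using a fastText-style model."""
    # fastText's predict() works on a single line, so newlines are replaced.
    labels, _probabilities = model.predict(text.replace("\n", " "))
    # Labels look like "__label__en"; keep only the language code.
    return labels[0].replace("__label__", "")


def license_spdx_id(payload):
    """Return the SPDX identifier from a decoded license response, if any."""
    license_info = payload.get("license") or {}
    return license_info.get("spdx_id")
```

Passing the model in as an argument keeps the slow model load out of the per-schema call.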
# Step 6: Split into train, test, and validation
Finally, the data is split into training, test, and validation sets.
Schemas are always grouped together in the same set based on the GitHub organization they are from.
Schemas can also be checked for similarity so that very similar schemas are grouped together.

    pipenv run python train_split.py
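
The grouping constraint can be sketched as a split over organizations rather than over individual files. This is not the logic of `train_split.py`: the 80/10/10 fractions, the assumption that each path begins with the organization name, and the function name `split_by_org` are all illustrative, and the optional similarity grouping is not covered.

```python
import random
from collections import defaultdict


def split_by_org(paths, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole organizations to train, test, or validation."""
    by_org = defaultdict(list)
    for path in paths:
        # Assumed layout: "<org>/<repo>/...", so the first segment is the org.
        by_org[path.split("/")[0]].append(path)

    # Sort before shuffling so the result depends only on the seed.
    orgs = sorted(by_org)
    random.Random(seed).shuffle(orgs)

    train_end = round(len(orgs) * fractions[0])
    test_end = train_end + round(len(orgs) * fractions[1])
    groups = {
        "train": orgs[:train_end],
        "test": orgs[train_end:test_end],
        "validation": orgs[test_end:],
    }
    return {
        name: [path for org in members for path in by_org[org]]
        for name, members in groups.items()
    }
```

Because the fractions apply to organizations, the per-file proportions can drift when a few organizations contribute many schemas.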
Provided by:
dataunitylab
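Summary of original information
JSON Schema Dataset overview
Dataset source
- The dataset consists of JSON Schema documents from GitHub, found by searching with the Sourcegraph API.
Dataset construction steps
- Path discovery: the Sourcegraph code search API is used to find files that have a .json extension and contain a specific JSON Schema reference.
- History retrieval: the revision history of each JSON Schema file is fetched through the GitHub API and saved to `commits.json`.
- File download: each schema file that comes from GitHub is downloaded and saved into subfolders of the `data` directory.
- Schema validation: each schema file in the `data` directory is checked to confirm that it is a valid JSON Schema, and valid schemas are copied to the `valid_data` directory.
- Metadata collection: language information is collected using fastText, and license information is fetched from the GitHub API.
- Dataset splitting: the dataset is split into training, test, and validation sets, schemas from the same GitHub organization are kept in the same set, and schemas can be checked for similarity.
Dataset characteristics
- During validation, schemas are parsed as JSON5 to allow more permissive syntax, but they are saved as standard JSON.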



