
dataunitylab/json-schema

Source: Hugging Face. Updated 2024-05-31; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/dataunitylab/json-schema
Description:
---
language:
- en
license:
- unknown
---

# JSON Schema Dataset

This dataset consists of a collection of JSON Schema documents collected from GitHub by searching using the Sourcegraph API.

# Step 1: Find a list of JSON Schema paths

The [Sourcegraph](https://sourcegraph.com/) code search API is used to find files with a .json extension and containing `{\n "$schema": "https://json-schema.org/"`. This is somewhat restrictive, but still manages to find a large number of schemas.

    pipenv run python slurp.py --outfile repos.csv

# Step 2: Fetch the history information for each file

We fetch every revision of each JSON Schema file. Before downloading the files, we use the GitHub API to get the list of commit hashes. The resulting data is saved to `commits.json`.

    pipenv run python fetch_history.py > commits.json

# Step 3: Download the JSON Schema files

This script will download each schema which comes from GitHub and save it into subfolders in the `data` directory.

    ./fetch_files.sh

# Step 4: Validate each JSON Schema

The following script will read each schema in the `data` directory and confirm that it is a valid JSON Schema. A copy of all valid schemas will be placed in the `valid_data` directory. Note that schemas are parsed as [JSON5](https://json5.org/) to be more permissive about what syntax is allowed, but the final schemas are written as standard JSON.

    pipenv run python validate_schemas.py

# Step 5: Retrieve additional metadata

We also collect language information using [Fasttext](https://fasttext.cc/docs/en/language-identification.html) and fetch the associated license from the GitHub API.

    pipenv run python get_languages.py > languages.json
    pipenv run python get_licenses.py > licenses.json

# Step 6: Split into train, test, and validation

Finally, the data is split into training, test, and validation sets. Schemas are always grouped together in the same set based on the GitHub organization they are from. Schemas can also be checked for similarity so that very similar schemas are grouped together.

    pipenv run python train_split.py
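
The organization-based grouping in Step 6 can be illustrated with a small sketch. This is not the logic of `train_split.py`, which is not shown here; the 80/10/10 ratios, the hash-based assignment, and the helper names `org_of`, `assign_split`, and `split_by_org` are all assumptions made for illustration. The point is only that every repository belonging to one organization lands in the same set.

```python
import hashlib
from collections import defaultdict


def org_of(repo: str) -> str:
    """Return the organization (owner) part of an 'owner/name' string."""
    return repo.split("/", 1)[0].lower()


def assign_split(org: str, train: float = 0.8, test: float = 0.1) -> str:
    """Map an organization to a split deterministically.

    The ratios are hypothetical; the source does not state them.
    """
    digest = hashlib.sha256(org.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the hash as a fraction in [0, 1).
    fraction = int.from_bytes(digest[:8], "big") / 2**64
    if fraction < train:
        return "train"
    if fraction < train + test:
        return "test"
    return "validation"


def split_by_org(repos):
    """Group repositories into splits so that an organization is never divided."""
    splits = defaultdict(list)
    for repo in repos:
        splits[assign_split(org_of(repo))].append(repo)
    return dict(splits)
```

Because the assignment depends only on the organization name, adding new repositories later does not move existing ones between sets. The achieved ratios are only approximate when the number of organizations is small.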
Provided by:
dataunitylab
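Summary of original information

JSON Schema Dataset overview

Dataset source

  • The dataset consists of JSON Schema documents from GitHub, found by searching with the Sourcegraph API.

Dataset construction steps

  1. Path discovery: use the Sourcegraph code search API to find files with a .json extension that contain a specific JSON Schema reference.
  2. History retrieval: use the GitHub API to get the revision history of each JSON Schema file and save the result to commits.json.
  3. File download: download each schema file that comes from GitHub and save it into subfolders of the data directory.
  4. Schema validation: check each schema file in the data directory to confirm that it is a valid JSON Schema, and copy the valid schemas to the valid_data directory.
  5. Metadata collection: collect language information with Fasttext and fetch the associated license information from the GitHub API.
  6. Dataset splitting: split the dataset into training, test, and validation sets, keeping schemas from the same GitHub organization in the same set; schemas can also be checked for similarity so that very similar ones are grouped together.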
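
Steps 2 and 3 rely on the GitHub API for commit hashes and on GitHub for the file contents at each revision. The source does not show which endpoints `fetch_history.py` and `fetch_files.sh` actually call, so the following is a minimal sketch under assumptions: it uses the public "list commits" REST endpoint with its `path` filter and the `raw.githubusercontent.com` host, and the function names are hypothetical. It only builds URLs and makes no network requests.

```python
from urllib.parse import quote, urlencode


def commits_url(owner: str, repo: str, path: str) -> str:
    """URL of the GitHub REST endpoint listing commits that touched `path`."""
    # per_page=100 is the maximum page size the endpoint accepts.
    query = urlencode({"path": path, "per_page": 100})
    return f"https://api.github.com/repos/{owner}/{repo}/commits?{query}"


def raw_file_url(owner: str, repo: str, sha: str, path: str) -> str:
    """URL of the raw file contents at one specific commit."""
    # quote() keeps '/' separators and escapes spaces and other unsafe characters.
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{sha}/{quote(path)}"
```

Each commit object returned by the first URL carries a `sha` field, which can then be passed to `raw_file_url` to fetch that revision of the schema.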

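Dataset characteristics

  • During schema validation, files are parsed as JSON5 to allow more permissive syntax, but the final schemas are saved as standard JSON.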
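
The parse-permissively, write-strictly idea can be sketched as follows. This is an illustration rather than the contents of `validate_schemas.py`: it assumes the third-party `json5` package is the parser (the source names the JSON5 format but not a library), the function name `normalize_schema_text` is hypothetical, and the fallback to the standard `json` module exists only so that the sketch still runs where `json5` is not installed.

```python
import json

try:
    import json5  # third-party package; assumed, not confirmed by the source

    _parse = json5.loads
except ImportError:
    # Fallback: strict JSON only, so JSON5 extensions are rejected in this case.
    _parse = json.loads


def normalize_schema_text(text: str) -> str:
    """Parse schema text permissively and re-serialize it as standard JSON."""
    data = _parse(text)
    # json.dumps always emits standard JSON, whatever syntax the input used.
    return json.dumps(data, indent=2)
```

With `json5` available, input containing comments, trailing commas, or unquoted keys is accepted, and the returned string is plain JSON that any standard parser can read.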