five

ReactorJet/coq-facts-props-proofs-gen0-v1

收藏
Hugging Face2026-04-06 更新2026-04-12 收录
下载链接:
https://hf-mirror.com/datasets/ReactorJet/coq-facts-props-proofs-gen0-v1
下载链接
链接失效反馈
官方服务:
资源简介:
--- license: other language: - code task_categories: - text-generation tags: - mathematics - formal-proof pretty_name: Coq Facts, Propositions and Proofs size_categories: - 100K<n<1M source_datasets: - https://github.com/coq-community/autosubst.git - https://gitlab.inria.fr/fbesson/itauto.git - https://github.com/coq-community/apery.git - https://github.com/coq-contribs/dep-map.git - https://github.com/coq-community/math-classes.git - https://github.com/coq-contribs/amm11262.git - https://gitlab.inria.fr/flocq/flocq.git - https://github.com/uds-psl/coq-library-undecidability.git - https://github.com/vafeiadis/hahn.git - https://github.com/coq-community/trocq.git - https://github.com/coq-community/tarjan.git - https://github.com/coq-community/generic-environments.git - https://github.com/imdea-software/htt.git - https://gitlab.mpi-sws.org/iris/stdpp.git - https://github.com/coq-contribs/coinductive-reals.git - https://github.com/coq-community/bertrand.git - https://github.com/coq-contribs/hoare-tut.git - https://github.com/coq-contribs/cfgv.git - https://github.com/artagnon/bonak.git - https://github.com/coq-community/coq-ext-lib.git - https://github.com/coq-contribs/checker.git - https://gitlab.inria.fr/coqinterval/interval.git - https://gitlab.inria.fr/gappa/coq.git - https://github.com/imdea-software/fcsl-pcm.git - https://github.com/coq-community/proviola.git - https://gitlab.com/umb-svl/turing.git - https://github.com/damien-pous/relation-algebra.git - https://github.com/coq-community/HighSchoolGeometry.git - https://github.com/coq-community/coqeal.git - https://github.com/coq-community/semantics.git - https://github.com/coq-contribs/algebra.git - https://github.com/coq-contribs/abp.git - https://github.com/coq/coq.git - https://github.com/coq-contribs/demos.git - https://github.com/coq-community/pocklington.git - https://github.com/coq-contribs/cantor.git - https://github.com/coq-community/notation-gallery.git - https://github.com/coq-contribs/ctltctl.git - https://github.com/HoTT/Coq-HoTT.git - https://github.com/coq-community/corn.git - https://github.com/affeldt-aist/monae.git - https://github.com/coq-community/stalmarck.git - https://github.com/coq-community/qarith-stern-brocot.git - https://github.com/coq-contribs/bdds.git - https://github.com/DeepSpec/InteractionTrees.git - https://github.com/coq-community/lemma-overloading.git - https://github.com/coq-community/coq-art.git - https://github.com/coq-community/coq-performance-tests.git - https://github.com/impermeable/coq-waterproof.git - https://github.com/coq-contribs/dblib.git - https://github.com/math-comp/algebra-tactics.git - https://github.com/coq-community/zorns-lemma.git - https://github.com/smtcoq/smtcoq.git - https://gitlab.inria.fr/coquelicot/coquelicot.git - https://github.com/coq-community/sudoku.git - https://github.com/QuickChick/QuickChick.git - https://github.com/coq-contribs/ails.git - https://github.com/math-comp/math-comp.git - https://github.com/plclub/metalib.git - https://github.com/xavierleroy/cdf-mech-sem.git - https://github.com/CertiGraph/CertiGraph.git - https://github.com/math-comp/Coq-Combi.git - https://github.com/coq-contribs/additions.git - https://github.com/coq-community/regexp-Brzozowski.git - https://github.com/math-comp/finmap.git - https://github.com/coq-community/huffman.git - https://github.com/mattam82/Coq-Equations.git - https://github.com/math-comp/multinomials.git - https://github.com/roglo/puiseuxth.git - https://github.com/thery/FlocqLecture.git - https://github.com/math-comp/odd-order.git - https://github.com/uwplse/verdi.git - https://github.com/mit-plv/fiat.git - https://github.com/adampetcher/fcf.git - https://github.com/uncomputable/natural-number-game.git - https://github.com/mit-plv/bbv.git - https://github.com/coq-community/fourcolor.git - https://github.com/coq-community/coqtail-math.git - https://github.com/coq-community/almost-full.git - https://github.com/coq-contribs/dictionaries.git - https://github.com/clarksmr/sf-lectures.git - https://github.com/affeldt-aist/infotheo.git - https://github.com/coq-community/gaia.git - https://github.com/coq-contribs/circuits.git - https://github.com/coq-contribs/coq-in-coq.git - https://github.com/coq-community/hoare-tut.git - https://github.com/thery/minirubik.git - https://github.com/coq-contribs/coinductive-examples.git - https://github.com/coq-community/bits.git - https://github.com/coq-community/graph-theory.git - https://github.com/coq-community/hydra-battles.git - https://github.com/coq-community/coq-100-theorems.git - https://github.com/coq-community/metaprogramming-rosetta-stone.git - https://github.com/coq-community/buchberger.git - https://github.com/math-comp/Abel.git - https://github.com/thery/mathcomp-extra.git - https://github.com/coq-contribs/coalgebras.git - https://github.com/thery/coqprime.git - https://github.com/coq-community/atbr.git - https://github.com/coq-community/bignums.git - https://github.com/coq-community/aac-tactics.git - https://github.com/codyroux/name-the-biggest-number.git - https://github.com/math-comp/trajectories.git - https://github.com/coq-contribs/concat.git - https://github.com/math-comp/real-closed.git - https://github.com/lukaszcz/coqhammer.git - https://github.com/jwiegley/category-theory.git - https://github.com/math-comp/dioid.git - https://github.com/jwiegley/coq-haskell.git - https://github.com/math-comp/analysis.git - https://github.com/coq-community/dblib.git - https://github.com/coq-community/topology.git - https://github.com/coq-contribs/automata.git - https://github.com/coq-community/reglang.git - https://github.com/thery/T2048.git - https://github.com/lthms/FreeSpec.git - https://github.com/charguer/tlc.git - https://github.com/tchajed/coq-record-update.git - https://github.com/coq-contribs/classical-realizability.git - https://github.com/coq-community/comp-dec-modal.git - https://github.com/coq-community/coqoban.git - https://github.com/coq-community/parseque.git - https://github.com/UniMath/UniMath.git - https://github.com/coq-contribs/distributed-reference-counting.git - https://github.com/coq-community/jmlcoq.git - https://github.com/SSProve/ssprove.git - https://github.com/coq-community/chapar.git - https://github.com/coq-community/alea.git - https://github.com/coq-community/paramcoq.git - https://github.com/snu-sf/paco.git - https://github.com/math-comp/mczify.git - https://github.com/ilyasergey/pnp.git - https://github.com/Mtac2/Mtac2.git - https://github.com/Deducteam/coq-hol-light.git - https://github.com/coq-community/coqffi.git - https://github.com/math-comp/tutorial_material.git - https://github.com/GeoCoq/GeoCoq.git - https://github.com/math-comp/bigenough.git - https://github.com/thery/hanoi.git - https://github.com/coq-community/coq-mmaps.git - https://github.com/fblanqui/color.git - https://github.com/Matafou/LibHyps.git --- # Dataset Name: Coq Facts, Propositions and Proofs ## Dataset Description The CoqFactsPropsProofs dataset aims to enhance Large Language Models' (LLMs) proficiency in interpreting and generating Coq code by providing a comprehensive collection of over 10,000 Coq source files. It encompasses a wide array of propositions, proofs, and definitions, enriched with metadata including source references and licensing information. This dataset is designed to facilitate the development of models capable of generating syntactically correct and semantically meaningful Coq constructs, thereby advancing automated theorem proving. A detailed description can be found in the paper [Enhancing Formal Theorem Proving: A Comprehensive Dataset for Training AI Models on Coq Code](https://arxiv.org/abs/2403.12627). ## Composition * Files: Over 10,000 Coq source files (`.v` files). * Tables: Three distinct tables: facts (definitions or notations), propositions (theorems and lemmas alongside proofs), and licensing/repository information. * Entries: 103,446 facts and 166,035 propositions with proofs. * Size: Character length ranging from as short as 11 to as long as 177,585 characters. * Source and Collection Method: The Coq source files were collected from various internet sources, focusing on repositories pivotal within the Coq community. These sources range from foundational libraries and formalized mathematical theorems to computer science theories and algorithm implementations. The collection process prioritized quality, relevance, and the contribution of each source to the Coq ecosystem. ## Licenses The dataset includes a diverse range of open-source licenses, reflecting the variety in the Coq community and the broader open-source ecosystem. Some of the licenses included are MIT, GPL (versions 2.0 and 3.0), LGPL (versions 2.1 and 3.0), Apache 2.0, BSD (2-Clause and 3-Clause), CECILL (versions 1.0, 2.1, B, C), MPL-2.0, and the UniMath License. Each entry in the dataset links to detailed license information, ensuring compliance and redistribution legality. ## Usage This dataset is provided in three parquet files and can be employed in a variety of ways depending on the use case. An example usage case includes training or fine-tuning models to focus on proofs rather than definitions and notations. The dataset also allows for filtering based on specific licenses using the `info.parquet` file. ```python import pandas as pd df_facts_raw = pd.read_parquet("facts.parquet") df_info = pd.read_parquet("info.parquet") # This is the list of licenses which might be seen as permissive permissive_licenses_list = [ 'Apache-2.0', 'BSD-2-Clause', 'BSD-3-Clause', 'CECILL-B', 'CECILL-C', 'LGPL-2.1-only', 'LGPL-2.1-or-later', 'LGPL-3.0-only', 'LGPL-3.0-or-later', 'MIT', 'MPL-2.0', 'UniMath' ] # Set the license-type to permissive based on the list df_info['license-type'] = df_info['spdx-id'].apply( lambda x: 'permissive' if x in permissive_licenses_list else 'not permissive') # Merge df_facts with df_info to get the license-type information # 'symbolic_name' is the common key in both DataFrames df_facts_merged = pd.merge(df_facts_raw, df_info, on='symbolic_name', how='left') # Filter the merged DataFrame to only include entries with a permissive license df_facts = df_facts_merged[df_facts_merged['license-type'] == 'permissive'] ``` ## Experiments and Findings Initial experiments with the dataset have demonstrated its potential in improving the accuracy of LLMs in generating Coq code. Fine-tuning an existing base model with this dataset resulted in outputs predominantly in Coq syntax, highlighting the dataset's efficacy in specialized model training for formal theorem proving. ## Challenges and Limitations * High standard deviations in fact, proposition, and proof lengths indicate the presence of outliers with significantly long content. * The complexity of licensing and the manual process of license identification for each repository. ## Cite ``` @misc{florath2024enhancing, title={Enhancing Formal Theorem Proving: A Comprehensive Dataset for Training AI Models on Coq Code}, author={Andreas Florath}, year={2024}, eprint={2403.12627}, archivePrefix={arXiv}, primaryClass={cs.AI} } ``` ## Legal Disclaimer This dataset is provided 'as is' and without any warranty or guarantee of accuracy, completeness, or compliance with any specific legal regime. While every effort has been made to ensure that license information is accurate and up-to-date, users of this dataset are responsible for verifying the licensing information of each snippet and complying with all applicable licenses and copyright laws. Users should consider seeking legal advice to ensure their use of these snippets complies with the original authors' licensing terms and any other applicable regulations. The creators of this dataset shall not be held liable for any infringements or legal challenges arising from the use of or reliance on any materials contained within this dataset.
提供机构:
ReactorJet
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作