
Venkatdatta/fol-data

Source: Hugging Face. Updated 2026-04-25; indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/Venkatdatta/fol-data
Resource description:
---
license: cc-by-nc-sa-4.0
language:
- en
task_categories:
- translation
- question-answering
- text-generation
tags:
- logical-reasoning
- first-order-logic
- proofwriter
- symbolic-reasoning
- natural-language-inference
pretty_name: FOL Reasoning Dataset (Vocabulary-Augmented ProofWriter)
size_categories:
- 100K<n<1M
---

# FOL Reasoning Dataset

A preprocessed and vocabulary-augmented dataset derived from the [ProofWriter (Kaggle)](https://www.kaggle.com/datasets/mathurinache/proofwriter) OWA splits, built for training a Natural Language → First-Order Logic translation model.

The source dataset contains natural-language premises and questions in English along with structured proof metadata. Our preprocessing adds two things that the original does not provide:

1. **FOL translations**: each natural-language statement is converted to First-Order Logic via a rule-based translator (100% coverage). ProofWriter's NL maps deterministically to FOL (e.g. "Anne is kind" → `Kind(anne)`, "If someone is kind then they are furry" → `forall x (Kind(x) -> Furry(x))`). Proof chains and Unknown failure traces are also converted to FOL form.
2. **Vocabulary substitution**: entity and predicate names are replaced per-question with random draws from large NLTK/WordNet pools, forcing models to learn structural FOL mapping rather than surface name memorisation.

## Vocabulary substitution

The original ProofWriter uses a very small fixed vocabulary (20 entity names such as Anne, Bob, bear, dog; 20 property words such as kind, furry, blue; 6 relation words such as visits, chases).

This dataset replaces every entity and predicate name per-question with a random draw from:

- **7,372 entity names**: NLTK `names` corpus (first names, filtered to 3–9 chars, alpha only)
- **13,006 property words**: WordNet adjective synset lemmas (4–10 chars, alpha only)
- **7,463 relation words**: WordNet verb synset lemmas (4–10 chars, alpha only)

All original ProofWriter vocabulary is excluded from the replacement pools. The substitution is consistent within each question (same entity always maps to the same replacement).

## Dataset Structure

Each split is a JSONL file. One example per line:

```json
{
  "premises": "Venkat is perseverant. If someone is perseverant they discover. <extra_id_0> Venkat discovers.",
  "logic": "<extra_id_1>\nPerseverant(venkat)\nforall x (Perseverant(x) -> Discover(x))\n<extra_id_2>\nDiscover(venkat)\n<extra_id_3>\nPerseverant(venkat) and forall x (Perseverant(x) -> Discover(x)) -> therefore Discover(venkat)\n<extra_id_4>\nTrue",
  "qdep": 1,
  "answer": "True",
  "source": "depth-2/meta-train-1234"
}
```

### Fields

| Field | Type | Description |
|-------|------|-------------|
| `premises` | string | Encoder input: NL facts and rules, then `<extra_id_0>`, then NL question (all vocabulary-substituted) |
| `logic` | string | Full decoder target: FOL premises → FOL question → proof chain → answer |
| `qdep` | int | Question depth (0–7): minimum reasoning steps to answer |
| `answer` | string | Ground truth: `"True"`, `"False"`, or `"Unknown"` |
| `source` | string | Original ProofWriter example ID |

### `logic` field sentinel structure

```
<extra_id_1>        ← start of FOL premises block
Kind(anne)
forall x (Kind(x) -> Furry(x))
<extra_id_2>        ← start of FOL question
Furry(anne)
<extra_id_3>        ← start of proof chain
Kind(anne) and forall x (Kind(x) -> Furry(x)) -> therefore Furry(anne)
<extra_id_4>        ← answer token
True
```

For `Unknown` examples, the proof is a failure chain:

```
<extra_id_3>
forall x (Big(x) and Round(x) -> White(x)) <- Rough(fiona) -> Big(fiona) <- [no base fact]
Cannot be determined from given premises.
<extra_id_4>
Unknown
```

## Splits

| Split | Examples | File size |
|-------|----------|-----------|
| train | 229,832 | ~302 MB |
| dev | 33,042 | ~45 MB |
| test | 66,084 | ~88 MB |

### Class distribution (train)

| Class | Count | % |
|-------|-------|---|
| pos_True (non-negated → True) | 58,034 | 25.3% |
| neg_False (negated → False) | 57,984 | 25.2% |
| pos_Unknown | 51,808 | 22.5% |
| neg_Unknown | 51,808 | 22.5% |
| pos_False (non-negated → False) | 5,124 | 2.2% |
| neg_True (negated → True) | 5,074 | 2.2% |

`pos_False` and `neg_True` are rare (underrepresented ~11×); training uses a weighted sampler to compensate.

## Source

Built from ProofWriter OWA depth-2, depth-3, and depth-3ext splits, sourced from [Kaggle (mathurinache/proofwriter)](https://www.kaggle.com/datasets/mathurinache/proofwriter).

## Citation

If you use this dataset, please cite:

```bibtex
@misc{fol-data-2026,
  author = {Venkat Datta Bommena},
  title = {FOL Reasoning Dataset: ProofWriter with FOL Annotations and Vocabulary Augmentation},
  year = {2026},
  url = {https://huggingface.co/datasets/Venkatdatta/fol-data}
}
```

Please also cite the original data source:

```bibtex
@misc{mathurinache-proofwriter-kaggle,
  author = {mathurinache},
  title = {ProofWriter},
  year = {2021},
  url = {https://www.kaggle.com/datasets/mathurinache/proofwriter},
  note = {Kaggle dataset}
}
```

## License

This dataset is a derivative of [Kaggle (mathurinache/proofwriter)](https://www.kaggle.com/datasets/mathurinache/proofwriter) and is released under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/): non-commercial use only, with attribution and share-alike.
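
As a reading aid for the record layout described in this card, here is a minimal Python sketch that splits one JSONL record into its natural-language and FOL parts using the `<extra_id_0>` to `<extra_id_4>` sentinels. The function name `parse_example` and the output keys are illustrative assumptions, not part of the dataset or of any loader shipped with it; only the field names and sentinel order come from the card.

```python
import json
import re

# Sentinel order is taken from the dataset card. The section names and the
# parse_example helper are hypothetical, chosen only for this illustration.
LOGIC_SENTINELS = re.compile(r"<extra_id_([1-4])>")
SECTION_NAMES = {
    "1": "fol_premises",
    "2": "fol_question",
    "3": "proof",
    "4": "answer",
}


def parse_example(line: str) -> dict:
    """Split one JSONL record into natural-language and FOL components."""
    record = json.loads(line)

    # Encoder input: NL facts and rules, <extra_id_0>, then the NL question.
    nl_premises, _, nl_question = record["premises"].partition("<extra_id_0>")

    # re.split with a single capture group returns
    # [text_before, "1", block_1, "2", block_2, ...].
    parts = LOGIC_SENTINELS.split(record["logic"])
    sections = {
        SECTION_NAMES[tag]: body.strip()
        for tag, body in zip(parts[1::2], parts[2::2])
    }

    return {
        "nl_premises": nl_premises.strip(),
        "nl_question": nl_question.strip(),
        # One FOL statement per line inside the premises block.
        "fol_premises": sections.get("fol_premises", "").splitlines(),
        "fol_question": sections.get("fol_question", ""),
        "proof": sections.get("proof", ""),
        "answer": sections.get("answer", ""),
        "qdep": record["qdep"],
        "source": record["source"],
    }


# The sample record from the card, serialised as a single JSONL line.
sample_line = json.dumps({
    "premises": "Venkat is perseverant. If someone is perseverant they "
                "discover. <extra_id_0> Venkat discovers.",
    "logic": "<extra_id_1>\nPerseverant(venkat)\n"
             "forall x (Perseverant(x) -> Discover(x))\n"
             "<extra_id_2>\nDiscover(venkat)\n"
             "<extra_id_3>\nPerseverant(venkat) and "
             "forall x (Perseverant(x) -> Discover(x)) "
             "-> therefore Discover(venkat)\n"
             "<extra_id_4>\nTrue",
    "qdep": 1,
    "answer": "True",
    "source": "depth-2/meta-train-1234",
})

parsed = parse_example(sample_line)
print(parsed["fol_premises"])
print(parsed["fol_question"], "=>", parsed["answer"])
```

To process a whole split, the same function could be applied to each line of the corresponding JSONL file. Comparing the parsed `answer` section with the top-level `answer` field is one cheap consistency check, on the assumption that the two always agree.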
Provider:
Venkatdatta