SynthCypher

Source: ModelScope (魔搭社区). Updated 2026-01-06; indexed 2025-02-01.
Download link:
https://modelscope.cn/datasets/ServiceNow-AI/SynthCypher
Resource description:
# SynthCypher Dataset Repository

## Overview

This repository hosts **SynthCypher**, a novel synthetic dataset designed to bridge the gap in **Text-to-Cypher (Text2Cypher)** tasks. SynthCypher leverages state-of-the-art **large language models (LLMs)** to automatically generate and validate high-quality data for training and evaluating models that convert natural language questions into Cypher queries for graph databases like Neo4j.

Our dataset and pipeline contribute significantly to advancing Text2Cypher research by offering a large, diverse, and rigorously validated dataset across a wide range of query types and domains.

---

## Highlights of SynthCypher

- **Comprehensive Coverage**:
  - 25.8k training samples and 4k test samples.
  - Spanning **109 query types** (e.g., Simple Retrieval, Aggregation, Sub-Graph Queries).
  - Derived from **528 training schemas** and **165 testing schemas**.
- **Synthetic Data Generation Pipeline**:
  - **Schema Generation**: 700 diverse domains expanded using **Mixtral**.
  - **Natural Language Question Creation**: 109 query types with corresponding dummy ground truths.
  - **Neo4j Database Population**: Populated with synthetic data to validate schema and Cypher queries.
  - **Cypher Query Generation**: Iterative chain-of-thought reasoning by LLMs for high-quality query generation.
  - **Validation**: Rigorous validation of Cypher execution and correctness using LLMs and Neo4j.
- **Performance Gains**:
  - LLMs fine-tuned on SynthCypher achieve **40% improvement** over baseline datasets and outperform off-the-shelf models.

---

## Dataset Details

The dataset consists of:

- **Schemas**: Representing real-world domains (e.g., e-commerce, inventory).
- **Natural Language Questions**: Diverse queries crafted for each schema.
- **Cypher Queries**: High-quality queries aligned with natural language questions.

---

## Experimental Results

Key observations from our experiments:

1. **Performance Gap**: Existing models trained on generic instruction datasets show low accuracy on Text2Cypher tasks.
2. **SynthCypher Effectiveness**: Fine-tuning with SynthCypher improves model performance by up to 40% absolute over baseline datasets.
3. **Controlled Data Generation**: Our pipeline demonstrates superior quality and coverage compared to naive GPT-based approaches.

---

## Limitations

- **Synthetic Data Bias**: Synthetic strategies may not fully reflect real-world distributions and could reinforce biases.
- **Real-World Applicability**: Performance on real-world scenarios may vary.

---

## Citation

If you use SynthCypher in your work, please cite:

```
@misc{tiwari2024synthcypherfullysyntheticdata,
  title={Auto-Cypher: Improving LLMs on Cypher generation via LLM-supervised generation-verification framework},
  author={Aman Tiwari and Shiva Krishna Reddy Malay and Vikas Yadav and Masoud Hashemi and Sathwik Tejaswi Madhusudhan},
  year={2024},
  eprint={2412.12612},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2412.12612},
}
```

---

## License

This dataset is licensed under the *Creative Commons Attribution Non Commercial Share Alike 4.0*. Please review the terms before use.

---
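
## Illustrative sketch: execution-based validation

The **Validation** step of the pipeline is described above only at a high level (checking Cypher execution and correctness with LLMs and Neo4j). The following is a minimal sketch of what the execution half of such a check could look like, not the authors' implementation. It assumes that a candidate query is accepted when it runs without error and returns the same rows as an expected result, ignoring row order. The names `validate_query`, `run_query`, and `fake_run_query` are hypothetical; `fake_run_query` stands in for a real database call so the example runs without Neo4j.

```python
from collections import Counter
from typing import Any, Callable, Iterable, Mapping

Row = Mapping[str, Any]
# Hypothetical executor type: takes a Cypher string, returns result rows.
Executor = Callable[[str], Iterable[Row]]


def _canonical(rows: Iterable[Row]) -> Counter:
    """Build an order-insensitive multiset of rows (column order is ignored too)."""
    return Counter(
        frozenset((key, repr(value)) for key, value in row.items())
        for row in rows
    )


def validate_query(cypher: str, expected_rows: Iterable[Row], run_query: Executor) -> dict:
    """Run a candidate query and compare its rows with the expected rows.

    Assumption: "correct" means the same multiset of rows as the expected result.
    """
    try:
        actual_rows = list(run_query(cypher))
    except Exception as exc:  # any execution failure marks the sample as invalid
        return {"executes": False, "matches": False, "error": str(exc)}
    matches = _canonical(actual_rows) == _canonical(expected_rows)
    return {"executes": True, "matches": matches, "error": None}


def fake_run_query(cypher: str) -> list:
    """Stand-in executor for illustration only; a real one would query a database."""
    if "MATCH" not in cypher:
        raise ValueError("bad query")
    return [{"name": "Bob"}, {"name": "Alice"}]


if __name__ == "__main__":
    result = validate_query(
        "MATCH (p:Person) RETURN p.name AS name",
        [{"name": "Alice"}, {"name": "Bob"}],
        fake_run_query,
    )
    print(result)
```

Passing the executor in as a callable keeps the comparison logic independent of any particular database client; in practice `run_query` could wrap a session against the populated Neo4j instance. An LLM-based correctness check, which the pipeline also mentions, is outside the scope of this sketch.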

Provider:
maas
Created:
2025-01-29
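Dataset introduction
Background overview
SynthCypher is a synthetic dataset for Text-to-Cypher tasks, with 25.8k training samples and 4k test samples covering 109 query types across multiple domains. The data is generated and validated with LLMs, and the dataset card reports a 40% improvement for models fine-tuned on it.
Summary collected and generated by 遇见数据集.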