
Synth-APIGen-v0.1

ModelScope community · Updated 2026-01-09 · Indexed 2024-10-19
Download link:
https://modelscope.cn/datasets/AI-ModelScope/Synth-APIGen-v0.1
Description:
<p align="left">
  <a href="https://github.com/argilla-io/distilabel">
    <img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-light.png" alt="Built with Distilabel" width="200" height="32"/>
  </a>
</p>

# Dataset card for Synth-APIGen-v0.1

This dataset has been created with [distilabel](https://distilabel.argilla.io/).

Pipeline script: [pipeline_apigen_train.py](https://huggingface.co/datasets/argilla/Synth-APIGen-v0.1/blob/main/pipeline_apigen_train.py).

## Dataset creation

It was created with `distilabel==1.4.0`.

This dataset is an implementation of [APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets](https://arxiv.org/abs/2406.18518) in `distilabel`, generated from *synthetic* functions. The process can be summarized as follows:

- Generate (or, in this case, modify) Python functions:

  Using a simple pipeline, the functions obtained in [bigcode/self-oss-instruct-sc2-exec-filter-50k](https://huggingface.co/datasets/bigcode/self-oss-instruct-sc2-exec-filter-50k) were modified to make them easily transformable to a JSON schema tool.

  - Dataset: [argilla-warehouse/python-lib-tools-v0.1](https://huggingface.co/datasets/argilla-warehouse/python-lib-tools-v0.1)

- Prepare the seed tools for the `APIGen` pipeline:

  We need a dataset with function names, their descriptions and their tool representations to be fed to our prompt, so the previous dataset has to be transformed. Also, a "golden source" can be used to feed examples as few-shot prompts: [Salesforce/xlam-function-calling-60k](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k).

  - Dataset: [argilla-warehouse/python-seed-tools](https://huggingface.co/datasets/argilla-warehouse/python-seed-tools)

- Run the `APIGen` pipeline on the previous dataset:

  The pipeline was run twice on the previous dataset, both to ensure diversity and to try to obtain a higher number of functions:

  - Dataset: [argilla-warehouse/synth-apigen-llama](https://huggingface.co/datasets/argilla-warehouse/synth-apigen-llama)

  - Dataset: [argilla-warehouse/synth-apigen-qwen](https://huggingface.co/datasets/argilla-warehouse/synth-apigen-qwen)

- Merge both datasets, deduplicate, and prepare for training

  This is the final stage. The previous datasets were merged, exact duplicates were removed using a hash of the `query+answers` fields, near-duplicates were removed using `MinHashDedup` from `distilabel`, and 6K "ill-formed" examples were generated as relevance detection data points, following Section B.3 of the paper, to showcase examples as in the Berkeley Function-Calling Benchmark.

## Fine Tuning

The process can be seen [here](https://huggingface.co/argilla/Llama-3.2-1B-Instruct-APIGen-FC-v0.1#training-procedure). Basically, one needs to format the data for SFT following a script like the one in the model card, which yields a dataset like [argilla-warehouse/apigen-synth-trl](https://huggingface.co/datasets/argilla-warehouse/apigen-synth-trl), and then run the training on top of it (in this case we've used [TRL](https://github.com/huggingface/trl), but you can use the framework of your choice).

Take a look at [argilla/apigen-function-calling](https://huggingface.co/datasets/argilla/apigen-function-calling) for a merge of this dataset with the Salesforce one. It only needs to be formatted to fine-tune a model on top of it.

## Example of use

The functions used to create this dataset are in this zipped folder: [libs.zip](https://huggingface.co/datasets/argilla/Synth-APIGen-v0.1/blob/main/synth_lib_data/libs.zip). Once unzipped, it contains the 40K+ Python files with the corresponding number of functions, which can be called as follows:

```python
import json
from pathlib import Path

from datasets import load_dataset
from distilabel.steps.tasks.apigen.utils import load_module_from_path

# Placeholder (not in the original card): folder where libs.zip was unzipped.
libpath = "libs"

# Load the dataset
ds_apigen_synth = load_dataset("argilla/Synth-APIGen-v0.1", split="train")

# Select a row (from the valid ones; this dataset also contains examples of
# queries that cannot be solved with the given information)
row = ds_apigen_synth.filter(lambda x: x["answers"].startswith("["))[1]
# {'func_name': 'sort_dict_by_values',
#  'func_desc': 'Returns a new dictionary sorted by the values in descending order, with keys in ascending order.',
#  'tools': '[{"type":"function","function":{"name":"int_to_bitwise_string","description":"Converts a positive integer to a bitwise string.","parameters":{"type":"object","properties":{"num":{"type":"integer","description":"The positive integer to convert."}},"required":["num"]}}},{"type":"function","function":{"name":"sort_dict_by_values","description":"Returns a new dictionary sorted by the values in descending order, with keys in ascending order.","parameters":{"type":"object","properties":{"dictionary":{"type":"object","additionalProperties":{},"description":"The dictionary to be sorted."}},"required":["dictionary"]}}}]',
#  'query': "You are organizing a book fair and have a list of books with their respective ratings. The list is as follows: {'The Great Gatsby': 4.5, '1984': 4.2, 'To Kill a Mockingbird': 4.7, 'Moby Dick': 3.8, 'Pride and Prejudice': 4.6}. Sort the books by their ratings in descending order, and if there are books with the same rating, sort them alphabetically by their titles.",
#  'answers': '[{"name": "sort_dict_by_values", "arguments": {"dictionary": {"The Great Gatsby": 4.5, "1984": 4.2, "To Kill a Mockingbird": 4.7, "Moby Dick": 3.8, "Pride and Prejudice": 4.6}}}]',
#  'model_name': 'Qwen/Qwen2.5-72B-Instruct',
#  'hash_id': '32fe6c61e9b2e94f6b0f18c2c6740c57be45cd20490949aa053f5689d1bd942f'}

# Extract the function name
function_name = row["func_name"]

# Load the module that contains the function
toolbox = load_module_from_path(
    str(Path(libpath) / f"{function_name}.py")
)

# Get the function object from the module
function = getattr(toolbox, function_name, None)

# And call it with the arguments generated by the model
function(*json.loads(row["answers"])[0]["arguments"].values())
# {'To Kill a Mockingbird': 4.7,
#  'Pride and Prejudice': 4.6,
#  'The Great Gatsby': 4.5,
#  '1984': 4.2,
#  'Moby Dick': 3.8}
```

## Dataset content

From the original datasets, we gathered the following valid generations:

- argilla-warehouse/synth-apigen-llama: 25727
- argilla-warehouse/synth-apigen-qwen: 17678
- Total: 43405

Following Section B.3 of the paper, we've added 6000 rows of incorrect function calls, making a total of 49402.

Additionally, instead of leaving only the correct tool in each row, the script generated the following distribution of available tools:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6435d564a4bd75c62cc03701/A6Z8DsQumS0QbaK7RMzzh.png)

### Fields description

- `func_name`: Name of the function.
- `func_desc`: Description of the function's behaviour.
- `tools`: Available tools to answer the query.
- `query`: Synthetic instruction, written as if it came from a user.
- `answers`: Arguments generated by the model, to be passed to the corresponding function to run it and obtain the answer.
- `model_name`: Name of the model used to obtain the row.
- `hash_id`: Hash generated from the query and answers, used to remove duplicates.

Example:

```json
{
  "func_name": "complex_to_polar",
  "func_desc": "Converts a complex number to its polar coordinate representation.",
  "tools": "[{\"type\":\"function\",\"function\":{\"name\":\"complex_to_polar\",\"description\":\"Converts a complex number to its polar coordinate representation.\",\"parameters\":{\"type\":\"object\",\"properties\":{\"complex_number\":{\"type\":\"object\",\"description\":\"A complex number in the form of `real + imaginary * 1j`.\"}},\"required\":[\"complex_number\"]}}}]",
  "query": "I'd like to convert the complex number 3 + 4j and 1 - 2j to polar coordinates.",
  "answers": "[{\"name\": \"complex_to_polar\", \"arguments\": {\"complex_number\": \"3 + 4j\"}}, {\"name\": \"complex_to_polar\", \"arguments\": {\"complex_number\": \"1 - 2j\"}}]",
  "model_name": "meta-llama/Meta-Llama-3.1-70B-Instruct",
  "hash_id": "f873783c04bbddd9d79f47287fa3b6705b3eaea0e5bc126fba91366f7b8b07e9"
}
```

## References

```
@article{liu2024apigen,
  title={APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets},
  author={Liu, Zuxin and Hoang, Thai and Zhang, Jianguo and Zhu, Ming and Lan, Tian and Kokane, Shirley and Tan, Juntao and Yao, Weiran and Liu, Zhiwei and Feng, Yihao and others},
  journal={arXiv preprint arXiv:2406.18518},
  year={2024}
}
```
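Since `tools` and `answers` are stored as JSON strings, a quick sanity check before training is to parse both and confirm that every call names an available tool and supplies its required arguments. The helper below is a minimal sketch of that idea, not part of the dataset's pipeline: `check_calls` is a hypothetical name, the sample row is a shortened copy of the `complex_to_polar` example from this card, and it assumes the row is one of the valid ones (its `answers` value starts with `[`).

```python
import json


def check_calls(row: dict) -> list[str]:
    """Return a list of problems found in the row's tool calls (empty if none)."""
    # Index the available tools by function name.
    tools = {t["function"]["name"]: t["function"] for t in json.loads(row["tools"])}
    problems = []
    for call in json.loads(row["answers"]):
        spec = tools.get(call["name"])
        if spec is None:
            problems.append(f"unknown tool: {call['name']}")
            continue
        # Every argument listed as required in the schema must be present.
        required = spec.get("parameters", {}).get("required", [])
        missing = [name for name in required if name not in call["arguments"]]
        if missing:
            problems.append(f"{call['name']}: missing {missing}")
    return problems


# Shortened copy of the example row shown in this card.
row = {
    "tools": (
        '[{"type":"function","function":{"name":"complex_to_polar",'
        '"description":"Converts a complex number to its polar coordinate representation.",'
        '"parameters":{"type":"object","properties":{"complex_number":'
        '{"type":"object","description":"A complex number."}},'
        '"required":["complex_number"]}}}]'
    ),
    "answers": (
        '[{"name": "complex_to_polar", "arguments": {"complex_number": "3 + 4j"}}, '
        '{"name": "complex_to_polar", "arguments": {"complex_number": "1 - 2j"}}]'
    ),
}

print(check_calls(row))
```

The check only looks at names and required keys. It does not validate argument types against the schema, which would need a JSON Schema validator.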

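The card states that exact duplicates were removed using a hash of the `query+answers` fields. The sketch below illustrates that step under explicit assumptions: it uses SHA-256 over the plain string concatenation, which matches the 64-character `hash_id` values shown in the examples but is not confirmed by the card; the authoritative recipe is in the pipeline script. `hash_row` and `drop_exact_duplicates` are hypothetical names, and the rows are made-up toy data.

```python
import hashlib


def hash_row(query: str, answers: str) -> str:
    # Assumption: SHA-256 over query + answers; the real pipeline may differ.
    return hashlib.sha256((query + answers).encode("utf-8")).hexdigest()


def drop_exact_duplicates(rows: list[dict]) -> list[dict]:
    """Keep the first row for each distinct (query, answers) hash."""
    seen = set()
    unique = []
    for row in rows:
        key = hash_row(row["query"], row["answers"])
        if key not in seen:
            seen.add(key)
            unique.append({**row, "hash_id": key})
    return unique


# Toy rows for illustration only; the first two are identical.
rows = [
    {"query": "Sort the books by rating.", "answers": '[{"name": "sort_dict_by_values"}]'},
    {"query": "Sort the books by rating.", "answers": '[{"name": "sort_dict_by_values"}]'},
    {"query": "Convert 3 + 4j to polar.", "answers": '[{"name": "complex_to_polar"}]'},
]

deduped = drop_exact_duplicates(rows)
print(len(deduped))
```

Exact hashing only catches byte-identical pairs, which is why the card pairs it with a separate near-deduplication pass.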
Provider:
maas
Created:
2024-10-16