five

marclove/llama_functions

收藏
Hugging Face2023-08-03 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/marclove/llama_functions
下载链接
链接失效反馈
官方服务:
资源简介:
--- license: cc-by-sa-4.0 task_categories: - conversational - text-generation language: - en pretty_name: Llama Functions size_categories: - 10K<n<100K --- # Dataset Card for Dataset Name ## Dataset Description - **Homepage:** https://marclove.com - **Repository:** https://huggingface.co/datasets/marclove/llama_functions ### Dataset Summary ‼️ This dataset is still in a beta state. Its contents, and likely its format, will change. If you need to depend on it in its current state, please create your own fork and provide attribution to this original repository. ‼️ Llama Functions is a synthetic dataset generated from a mix of manual curation of OpenAPI endpoints and prompting of OpenAI models. It is further mixed with chat completions from the Guanaco subset of the OASST1 chat dialogue dataset. It is a total of 18,000 rows, 9,000 rows from the synthetic dataset of function calls and 9,000 rows from the Guanaco dataset. The dataset is mixed with Guanaco in order to maintain accuracy and helpfulness when calling a function is not the appropriate response. I plan to remove the Guanaco portion of the dataset and instead provide fine-tuning recommendations, guidelines for use, more detailed information regarding limitations, and eval stats of 7B, 13B, and 70B models. There is no existing evaluation benchmark to measure the accuracy of function calls, which makes it hard during training to identify when we've maximized the balance of function calling accuracy and chat model performance. I'm working on a custom HF eval for this purpose, but until then I have chosen to mix the two datasets in equal parts to get a proxy of performance for both tasks in the eval & test stats during fine-tuning. ### Languages English primarily, though since it has been mixed with the multilingual Guanaco dataset, other languages are included. ## Dataset Structure ### Data Fields | Field | Description | |-------|-------------| | `input` |A prompt in Llama-2 Chat format, including an appropriate system instruction and chat history. | | `output` | The expected completion. | ### Data Splits There are currently no splits, but future versions will likely have train, eval, and test splits. ## Dataset Creation ### Curation Rationale In an effort to enable tool-using chat agents and autonomous agents, I developed this synthetic dataset to bring [OpenAI-style function calling](https://openai.com/blog/function-calling-and-other-api-updates#function-calling) to the Llama family and to fully open source models. ### Source Data The data was sourced by prompting OpenAI models to generate function calls of: 1. Real OpenAPI endpoints collected and filtered from the web 2. Manually written (but artificial) OpenAPI endpoints, and 3. Prompted iterations of 1 & 2. Prompted iterations were generated by ChatGPT-4 (July 20, 2023 version). Generated function calls and their natural language counterparts were generated by iterative prompting of `gpt-3.5-turbo-0301`. A blog post detailing the generation process will be published in the next few days. OpenAI's TOS give me ownership of this synthetic dataset. I am licensing it under [Creative Commons' Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license](https://creativecommons.org/licenses/by-sa/4.0/). I have used the dataset to fine tune a research-only model, [marclove/llama-2-7b-chat-functions](https://huggingface.co/marclove/llama-2-7b-chat-functions), per OpenAI TOS. You are responsible for determining whether you can use the dataset for your particular use case. I take no responsibility and make no guarantees beyond licensing my own rights under the designated CC license. #### Who are the source language producers? - Marc Love - Prompting of ChatGPT-4 & API calls to gpt-3.5-turbo-0301 ### Personal and Sensitive Information None. ## Considerations for Using the Data ### Social Impact of Dataset Unknown, beyond those of the [Guanaco subset of the OASST1 dataset](https://huggingface.co/datasets/timdettmers/openassistant-guanaco/viewer/timdettmers--openassistant-guanaco/). ### Discussion of Biases Unknown, beyond those of the [Guanaco subset of the OASST1 dataset](https://huggingface.co/datasets/timdettmers/openassistant-guanaco/viewer/timdettmers--openassistant-guanaco/). ### Other Known Limitations Fine-tuning on this dataset can lead to hallucinated function calls. This is more pronounced in smaller models. ## Additional Information ### Dataset Curators Marc Love ### Licensing Information [Creative Commons' Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license](https://creativecommons.org/licenses/by-sa/4.0/). Please note that the synthetic data portion of the dataset was generated using OpenAI models, which may or may not impact your ability to use the dataset, depending on your use case. ### Citation Information If you use this dataset, please cite: ``` @misc{LlamaFunctions, title = {LlamaFunctions: An Open Dataset of Structured API Calls From Natural Language Prompts}, author = {Marc Love}, year = {2023}, publisher = {HuggingFace}, journal = {HuggingFace repository}, howpublished = {\url{https://https://huggingface.co/marclove/llama_functions}, } ```
提供机构:
marclove
原始信息汇总

数据集概述

名称: Llama Functions

状态: 测试版

语言: 主要为英语,包含多语言的Guanaco数据集混合

大小: 18,000行,其中9,000行来自合成数据集的功能调用,9,000行来自Guanaco数据集

用途: 用于开发工具使用聊天代理和自主代理,支持OpenAI风格的功能调用

数据来源:

  • 真实的OpenAPI端点
  • 人工编写的OpenAPI端点
  • 通过ChatGPT-4和gpt-3.5-turbo-0301生成的迭代

数据字段:

  • input: 包含适当的系统指令和聊天历史的Llama-2 Chat格式提示
  • output: 预期的完成

许可证: Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

数据集创建者: Marc Love

注意事项:

  • 数据集目前没有分割,未来版本可能包含训练、评估和测试分割
  • 没有现有的评估基准来衡量功能调用的准确性
  • 微调此数据集可能导致较小的模型产生幻觉功能调用

引用信息:

@misc{LlamaFunctions, title = {LlamaFunctions: An Open Dataset of Structured API Calls From Natural Language Prompts}, author = {Marc Love}, year = {2023}, publisher = {HuggingFace}, journal = {HuggingFace repository}, howpublished = {url{https://https://huggingface.co/marclove/llama_functions}, }

5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作