marclove/llama_functions
收藏Hugging Face2023-08-03 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/marclove/llama_functions
下载链接
链接失效反馈官方服务:
资源简介:
---
license: cc-by-sa-4.0
task_categories:
- conversational
- text-generation
language:
- en
pretty_name: Llama Functions
size_categories:
- 10K<n<100K
---
# Dataset Card for Dataset Name
## Dataset Description
- **Homepage:** https://marclove.com
- **Repository:** https://huggingface.co/datasets/marclove/llama_functions
### Dataset Summary
‼️ This dataset is still in a beta state. Its contents, and likely its format, will change. If you need to depend on it in its current state, please create your own fork and provide attribution to this original repository. ‼️
Llama Functions is a synthetic dataset generated from a mix of manual curation of OpenAPI endpoints and prompting of OpenAI models. It is further mixed with chat completions from the Guanaco subset of the OASST1 chat dialogue dataset. It is a total of 18,000 rows, 9,000 rows from the synthetic dataset of function calls and 9,000 rows from the Guanaco dataset.
The dataset is mixed with Guanaco in order to maintain accuracy and helpfulness when calling a function is not the appropriate response. I plan to remove the Guanaco portion of the dataset and instead provide fine-tuning recommendations, guidelines for use, more detailed information regarding limitations, and eval stats of 7B, 13B, and 70B models.
There is no existing evaluation benchmark to measure the accuracy of function calls, which makes it hard during training to identify when we've maximized the balance of function calling accuracy and chat model performance. I'm working on a custom HF eval for this purpose, but until then I have chosen to mix the two datasets in equal parts to get a proxy of performance for both tasks in the eval & test stats during fine-tuning.
### Languages
English primarily, though since it has been mixed with the multilingual Guanaco dataset, other languages are included.
## Dataset Structure
### Data Fields
| Field | Description |
|-------|-------------|
| `input` |A prompt in Llama-2 Chat format, including an appropriate system instruction and chat history. |
| `output` | The expected completion. |
### Data Splits
There are currently no splits, but future versions will likely have train, eval, and test splits.
## Dataset Creation
### Curation Rationale
In an effort to enable tool-using chat agents and autonomous agents, I developed this synthetic dataset to bring [OpenAI-style function calling](https://openai.com/blog/function-calling-and-other-api-updates#function-calling) to the Llama family and to fully open source models.
### Source Data
The data was sourced by prompting OpenAI models to generate function calls of:
1. Real OpenAPI endpoints collected and filtered from the web
2. Manually written (but artificial) OpenAPI endpoints, and
3. Prompted iterations of 1 & 2.
Prompted iterations were generated by ChatGPT-4 (July 20, 2023 version). Generated function calls and their natural language counterparts were generated by iterative prompting of `gpt-3.5-turbo-0301`. A blog post detailing the generation process will be published in the next few days.
OpenAI's TOS give me ownership of this synthetic dataset. I am licensing it under [Creative Commons' Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license](https://creativecommons.org/licenses/by-sa/4.0/). I have used the dataset to fine tune a research-only model, [marclove/llama-2-7b-chat-functions](https://huggingface.co/marclove/llama-2-7b-chat-functions), per OpenAI TOS. You are responsible for determining whether you can use the dataset for your particular use case. I take no responsibility and make no guarantees beyond licensing my own rights under the designated CC license.
#### Who are the source language producers?
- Marc Love
- Prompting of ChatGPT-4 & API calls to gpt-3.5-turbo-0301
### Personal and Sensitive Information
None.
## Considerations for Using the Data
### Social Impact of Dataset
Unknown, beyond those of the [Guanaco subset of the OASST1 dataset](https://huggingface.co/datasets/timdettmers/openassistant-guanaco/viewer/timdettmers--openassistant-guanaco/).
### Discussion of Biases
Unknown, beyond those of the [Guanaco subset of the OASST1 dataset](https://huggingface.co/datasets/timdettmers/openassistant-guanaco/viewer/timdettmers--openassistant-guanaco/).
### Other Known Limitations
Fine-tuning on this dataset can lead to hallucinated function calls. This is more pronounced in smaller models.
## Additional Information
### Dataset Curators
Marc Love
### Licensing Information
[Creative Commons' Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license](https://creativecommons.org/licenses/by-sa/4.0/). Please note that the synthetic data portion of the dataset was generated using OpenAI models, which may or may not impact your ability to use the dataset, depending on your use case.
### Citation Information
If you use this dataset, please cite:
```
@misc{LlamaFunctions,
title = {LlamaFunctions: An Open Dataset of Structured API Calls From Natural Language Prompts},
author = {Marc Love},
year = {2023},
publisher = {HuggingFace},
journal = {HuggingFace repository},
howpublished = {\url{https://https://huggingface.co/marclove/llama_functions},
}
```
提供机构:
marclove
原始信息汇总
数据集概述
名称: Llama Functions
状态: 测试版
语言: 主要为英语,包含多语言的Guanaco数据集混合
大小: 18,000行,其中9,000行来自合成数据集的功能调用,9,000行来自Guanaco数据集
用途: 用于开发工具使用聊天代理和自主代理,支持OpenAI风格的功能调用
数据来源:
- 真实的OpenAPI端点
- 人工编写的OpenAPI端点
- 通过ChatGPT-4和gpt-3.5-turbo-0301生成的迭代
数据字段:
input: 包含适当的系统指令和聊天历史的Llama-2 Chat格式提示output: 预期的完成
许可证: Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
数据集创建者: Marc Love
注意事项:
- 数据集目前没有分割,未来版本可能包含训练、评估和测试分割
- 没有现有的评估基准来衡量功能调用的准确性
- 微调此数据集可能导致较小的模型产生幻觉功能调用
引用信息:
@misc{LlamaFunctions, title = {LlamaFunctions: An Open Dataset of Structured API Calls From Natural Language Prompts}, author = {Marc Love}, year = {2023}, publisher = {HuggingFace}, journal = {HuggingFace repository}, howpublished = {url{https://https://huggingface.co/marclove/llama_functions}, }



