hackethon

Source: ModelScope community (updated 2025-11-06, indexed 2024-05-15)
Download link:
https://modelscope.cn/datasets/shaochuqi/Hackethon
Description:

### Background

Building an intelligent system that can hold multimodal conversations the way humans do has long been a long-term goal on the path to Artificial General Intelligence (AGI). Text dialogue has reached a high level, but interaction in other modalities such as speech, images, and video still falls short. These capabilities therefore need to be added on top of Large Language Models (LLMs), for example through collaboration between large and small models or through small-model plugin APIs.

The ModelScope community is dedicated to building the next-generation open-source Model-as-a-Service sharing platform, offering AI developers a rich, easy-to-use, low-cost, one-stop set of model services that makes models easier to apply. Its services include a large collection of pre-trained state-of-the-art (SOTA) models, diverse open datasets, model inference in one line of code, building a custom industry model in ten lines of code, a ready-to-use online development platform, flexible model frameworks and deployment options, and extensive tutorials and technical resources.

Building on LLMs and the many models in the ModelScope community, we propose ModelScope Agent. It uses an LLM as the central controller that calls the capabilities of many ModelScope AI models. Integrating multiple small AI models through API plugins improves the LLM's text generation on NLP tasks and also supplies the multimodal generation capabilities it lacks.

### Dataset Description

We have released a plugin training dataset on the ModelScope community. It contains a training set of 3600 samples plus corresponding validation and test sets, and consists mainly of general Supervised Fine-Tuning (SFT) data and 4 types of plugins.
The dataset consists of three files, train.json, dev.json, and test.json, with nearly 4000 samples in total. Each line holds one sample with two fields, `id` and `conversations`. Each entry in `conversations` has a `from` role and a `value`, and the role is one of `system`, `user`, or `assistant`:

- `system`: the persona and instructions given to the model up front, including how to call plugins and how to build requests
- `user`: the user's input prompt, which is either a general generation prompt or a prompt that requires a plugin call
- `assistant`: the model's reply, which contains both the plugin call and the execution result. The call is what the LLM must generate; the execution result is produced by calling the service. As in the example below, the call is wrapped in `<|startofthink|>` and `<|endofthink|>`. Once the API has run, its result is wrapped in `<|startofexec|>` and `<|endofexec|>` and fed back to the model so it can generate the rest of the reply.

```json
{
  "id": "modelscope_merge_api_527",
  "conversations": [
    {
      "from": "system",
      "value": "You are ModelScopeGPT (ModelScope Assistant) from DAMO Academy, a large language model trained by engineers from DAMO Academy in 2023. You possess multiple capabilities: you can reply to user queries by integrating model APIs from the ModelScope community via plugins, and also answer questions related to model usage and model-specific knowledge. 1. {\"plugin_name\": \"modelscope_text-ie\", \"plugin_owner\": \"ModelScopeGPT\", \"plugin_type\": \"default\", \"plugin_schema_for_model\": {\"name\": \"modelscope_text-ie\", \"description\": \"Extract corresponding information from Chinese text based on the given schema and present it in JSON format\", \"url\": \"http://109.199.101.10:1485/\", \"paths\": [{\"name\": \"modelscope_text-ie\", \"model_id\": \"/damo/nlp_structbert_siamese-uie_chinese-base\", \"method\": \"post\", \"description\": \"Extract corresponding information from Chinese text based on the given schema and present it in JSON format\", \"parameters\": [{\"name\": \"text\", \"description\": \"Text input by the user\", \"required\": \"True\"}, {\"name\": \"schema\", \"description\": \"JSON representation of the information to be extracted\", \"required\": \"True\"}]}]}}\n\n2. {\"plugin_name\": \"modelscope_text-ie\", \"plugin_owner\": \"ModelScopeGPT\", \"plugin_type\": \"default\", \"plugin_schema_for_model\": {\"name\": \"modelscope_text-ie\", \"description\": \"Extract corresponding information from Chinese text based on the given schema and present it in JSON format\", \"url\": \"http://9.32.64.200:5873/\", \"paths\": [{\"name\": \"modelscope_text-ie\", \"model_id\": \"/damo/nlp_structbert_siamese-uie_chinese-base\", \"method\": \"post\", \"description\": \"Extract corresponding information from Chinese text based on the given schema and present it in JSON format\", \"parameters\": [{\"name\": \"text\", \"description\": \"Text input by the user\", \"required\": \"True\"}, {\"name\": \"schema\", \"description\": \"JSON representation of the information to be extracted\", \"required\": \"True\"}]}]}}\n\n3. {\"plugin_name\": \"modelscope_text-ie\", \"plugin_owner\": \"ModelScopeGPT\", \"plugin_type\": \"default\", \"plugin_schema_for_model\": {\"name\": \"modelscope_text-ie\", \"description\": \"Extract corresponding information from Chinese text based on the given schema and present it in JSON format\", \"url\": \"http://54.149.78.185:3979/\", \"paths\": [{\"name\": \"modelscope_text-ie\", \"model_id\": \"/damo/nlp_structbert_siamese-uie_chinese-base\", \"method\": \"post\", \"description\": \"Extract corresponding information from Chinese text based on the given schema and present it in JSON format\", \"parameters\": [{\"name\": \"text\", \"description\": \"Text input by the user\", \"required\": \"True\"}, {\"name\": \"schema\", \"description\": \"JSON representation of the information to be extracted\", \"required\": \"True\"}]}]}}"
    },
    {
      "from": "user",
      "value": "Extract the corresponding information from the following text according to the given schema\nschema: {\"Person\": null, \"Location\": null, \"Organization\": null}\nRecently, the U.S. government announced that it will impose tariffs on more than 1,000 types of Chinese goods, and threatened to impose additional tariffs."
    },
    {
      "from": "assistant",
      "value": "<|startofthink|>```JSON\n{\"api_name\": \"modelscope_text-ie\", \"url\": \"http://9.32.64.200:5873/damo/nlp_structbert_siamese-uie_chinese-base\", \"parameters\": {\"text\": \"Recently, the U.S. government announced that it will impose tariffs on more than 1,000 types of Chinese goods, and threatened to impose additional tariffs.\", \"schema\": \"{\\\"Person\\\": null, \\\"Location\\\": null, \\\"Organization\\\": null}\"}}\n```<|endofthink|>\n\n<|startofexec|>```JSON\n{\"Person\": [], \"Location\": [\"China\", \"United States\"], \"Organization\": []}\n```<|endofexec|>\nInformation extraction result: {\"Person\": [], \"Location\": [\"China\", \"United States\"], \"Organization\": []}. The model I used is the 'damo/nlp_structbert_siamese-uie_chinese-base' model from ModelScope. This is a general information extraction model fine-tuned based on the StructBERT pre-trained model."
    }
  ]
}
```

### Implementation Plan

Fine-tune an open-source LLM on the provided API plugin data. Because the dataset contains only a small amount of general SFT data, it is best to start from a generative LLM that has already been trained on SFT data, so that it generalizes quickly to the API plugin data in the training set.

The model APIs involved all have corresponding services on the ModelScope community. Test that the call chain works end to end and that the inputs and outputs match what the dataset expects. For the information extraction API, for example, open the model's page on ModelScope and call it via "Deployment" -> "API Inference".

### Inference Pipeline Design

The inference pipeline involves both LLM inference and API calls:

1. Once the model has generated a complete span between `<|startofthink|>` and `<|endofthink|>`, call the corresponding API immediately
2. When the result comes back, wrap it in `<|startofexec|>` and `<|endofexec|>` and append it to the current input
3. Let the LLM continue generating the reply

### Experimental Evaluation

The evaluation covers four dimensions:

1. Plugin invocation accuracy: whether the value after `api_name` is correct
2. Plugin URL accuracy: whether the `url` address is correct
3. Plugin parameter accuracy: whether the values under `parameters` are correct
4. Overall plugin accuracy: whether the generated function call is entirely correct and the whole JSON can be loaded

### Dataset Loading Methods

#### Method 1: Load via SDK

```python
import ast

from modelscope.msdatasets import MsDataset

# Load the training split; use split='validation' for the validation split.
ds = MsDataset.load('modelscope/ms_hackathon_23_agent_train_dev', split='train')

# Inspect the first sample.
one_ds = next(iter(ds))
print(one_ds)

# The conversations field is stored as a string, so parse it into a list of dicts.
conv = one_ds['conversations']
conv = ast.literal_eval(conv)
print(conv[0]['from'])
print(conv[0]['value'])
```

#### Method 2: Direct Download via Webpage

Go to "Dataset Files" -> "Metadata Files" and click the download button to download the corresponding files.

## Dataset Copyright Information

This dataset is open-sourced under the Apache License 2.0. If it violates any relevant terms, contact ModelScope at any time to have it removed.
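The three inference steps described under Inference Pipeline Design (generate the call, execute it, splice the result back in) can be sketched roughly as follows. This is a minimal illustration, not the ModelScope Agent implementation: `generate` and `call_api` are hypothetical callables supplied by the caller, `generate` is assumed to stop right after `<|endofthink|>` when the model emits a call, and the `max_steps` limit is an added safeguard that the dataset description does not mention.

```python
import json
import re
from typing import Callable, Optional

END_THINK = "<|endofthink|>"
THINK_RE = re.compile(r"<\|startofthink\|>(.*?)<\|endofthink\|>", re.DOTALL)


def extract_call(text: str) -> Optional[dict]:
    """Return the JSON call inside the last think span, or None if absent or invalid."""
    spans = THINK_RE.findall(text)
    if not spans:
        return None
    body = spans[-1].strip()
    # Drop the optional ```JSON ... ``` fence that wraps the payload in the dataset.
    body = re.sub(r"^```(?:json)?\s*|\s*```$", "", body, flags=re.IGNORECASE)
    try:
        call = json.loads(body)
    except json.JSONDecodeError:
        return None
    return call if isinstance(call, dict) else None


def run_turn(
    prompt: str,
    generate: Callable[[str], str],         # hypothetical: LLM continuation of the text so far
    call_api: Callable[[str, dict], dict],  # hypothetical: performs the plugin request
    max_steps: int = 4,                     # assumption: cap on generation rounds
) -> str:
    """Alternate between LLM generation and plugin execution until a plain reply appears."""
    text = prompt
    for _ in range(max_steps):
        chunk = generate(text)
        text += chunk
        # Step 1: a completed think span means the model is requesting a plugin call.
        call = extract_call(chunk) if chunk.rstrip().endswith(END_THINK) else None
        if call is None:
            return text  # ordinary reply, nothing left to execute
        result = call_api(call.get("url", ""), call.get("parameters", {}))
        # Step 2: wrap the result in exec markers and append it to the running input.
        payload = json.dumps(result, ensure_ascii=False)
        text += "\n\n<|startofexec|>```JSON\n" + payload + "\n```<|endofexec|>\n"
        # Step 3: the next loop iteration lets the model continue from here.
    return text
```

In practice `call_api` would issue the POST request to the plugin URL, and `generate` would wrap the fine-tuned model with `<|endofthink|>` configured as a stop sequence.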
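The four evaluation dimensions listed under Experimental Evaluation could be computed along the following lines. The description does not specify the matching rule, so this sketch assumes exact equality against the reference call for each field and treats a call whose JSON cannot be loaded as wrong on every dimension; the function names are illustrative only.

```python
import json
import re
from typing import Dict, Iterable, Optional, Tuple

CALL_RE = re.compile(r"<\|startofthink\|>(.*?)<\|endofthink\|>", re.DOTALL)
FIELDS = ("api_name", "url", "parameters")


def parse_call(reply: str) -> Optional[dict]:
    """Load the JSON call from the first think span; None if missing or unparseable."""
    match = CALL_RE.search(reply)
    if match is None:
        return None
    body = re.sub(r"^```(?:json)?\s*|\s*```$", "", match.group(1).strip(), flags=re.IGNORECASE)
    try:
        call = json.loads(body)
    except json.JSONDecodeError:
        return None
    return call if isinstance(call, dict) else None


def score(pred_reply: str, gold_reply: str) -> Dict[str, bool]:
    """Compare one predicted reply with its reference on the four dimensions."""
    gold = parse_call(gold_reply)
    if gold is None:
        raise ValueError("reference reply contains no loadable plugin call")
    pred = parse_call(pred_reply)
    # Assumption: exact match per field; an unloadable prediction fails everything.
    result = {f: pred is not None and pred.get(f) == gold.get(f) for f in FIELDS}
    result["overall"] = all(result[f] for f in FIELDS)
    return result


def accuracy(pairs: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Average the per-sample scores over (prediction, reference) pairs."""
    scores = [score(pred, gold) for pred, gold in pairs]
    if not scores:
        raise ValueError("no samples to evaluate")
    return {key: sum(s[key] for s in scores) / len(scores) for key in scores[0]}
```

Only samples whose reference reply contains a plugin call should be passed in; general SFT samples without a call have nothing to score on these dimensions.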
Provider:
maas
Created:
2023-11-01