paolorechia/medium-size-generated-tasks

Name: paolorechia/medium-size-generated-tasks
Creator: paolorechia
Published: 2023-05-12 21:45:52
License: no description provided

Hugging Face. Updated 2023-05-12; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/paolorechia/medium-size-generated-tasks

Resource description:

---
license: other
language:
- en
tags:
- ReAct
- LLM
- Agent
- langchain
size_categories:
- 1K<n<10K
---

# LICENSE

This is a dataset generated with the help of WizardLM. Therefore, the terms of use are restricted to research/academic only.

# What is this

This is a collection of .txt files, each containing a prompt and the expected output. For instance:

```
#####PROMPT: Question: Make sure the task is unique and adds value to the original list.


Thought:#####OUTPUT: I should check if the task is already in the list.
Action: Python REPL
Action Input:
if task not in tasks:
    print("Task not found.")
else:
    print("Task found.")
```

# What is it for

This is meant to help train LLaMA-based models to use the Langchain ReAct tooling, specifically with the Python REPL.

# How good is it?

Not very: the dataset is quite dirty at the moment. The first LoRA is still being fine-tuned, so no tests have been made.

# Next steps

1. Redo the steps using a base model that has a more permissive license
2. Fix problems in the dataset generation phase, e.g.
   * model tries to install packages and fails
   * langchain agent tooling sometimes seems buggy and doesn't return the stdout correctly
   * model likes to ask for user input
   * model likes to exit the chain by calling sys.exit()
   * once the model gets stuck with installation steps, it's just an infinite loop
3. Clean the dataset better

# How was it created

There are a few steps involved in the generation of this dataset.

1. Created a mechanism to log pairs of prompt/output generated by a running Langchain Agent on a local server

Server link: https://github.com/paolorechia/learn-langchain/blob/a3c288c43845d19692478f06757ed326c222f095/servers/vicuna_server.py#L39

```python
# Excerpt from the server module linked above. Names not defined here
# (os, app, PromptRequest, model, tokenizer, config, compute_until_stop)
# are defined or imported elsewhere in that file.
class PromptLogger:
    _instances = {}

    @staticmethod
    def get(session):
        if session not in PromptLogger._instances:
            PromptLogger._instances[session] = PromptLogger(session)
        return PromptLogger._instances[session]

    def __init__(self, session) -> None:
        self.input_step = 0
        self.output_step = 0
        self.session = session
        self._dir = f"logged_prompts/session_{session}/"
        try:
            os.makedirs(self._dir)
        except FileExistsError:
            pass

    def log(self, input_str, prefix="input"):
        filename = os.path.join(self._dir, f"{prefix}_{self.input_step}")
        with open(filename, "w") as fp:
            if prefix == "input":
                # Keep only the part of the prompt after the few-shot examples.
                input_str = input_str.split("Now begin for real!\n")[1]
            fp.write(input_str)

        if prefix == "input":
            self.input_step += 1
        elif prefix == "output":
            self.output_step += 1
        else:
            raise ValueError("Invalid prefix")


@app.post("/prompt")
def process_prompt(prompt_request: PromptRequest):
    params = {
        "prompt": prompt_request.prompt,
        "temperature": prompt_request.temperature,
        "max_new_tokens": prompt_request.max_new_tokens,
        "stop": prompt_request.stop,
    }
    print("Received prompt: ", params["prompt"])
    output = compute_until_stop(model, tokenizer, params, config.device)
    print("Output: ", output)
    if prompt_request.logging_session is not None:
        prompt_logger = PromptLogger.get(prompt_request.logging_session)
        prompt_logger.log(prompt_request.prompt, prefix="input")
        prompt_logger.log(output, prefix="output")

    return {"response": output}
```

2. Created a short list of tasks and then extended it with the help of an LLM until about 390 tasks were generated

Script link: https://github.com/paolorechia/learn-langchain/blob/main/task_generation/generate_tasks.py

```python
from langchain_app.models.llama_http_llm import build_llama_base_llm

output = None
# Now let's test it out!
while True:
    params = {"temperature": 1.3, "max_new_tokens": 1024, "stop": []}
    llm = build_llama_base_llm(parameters=params)
    # Finally, let's initialize an agent with the tools, the language model, and the type of agent we want to use.
    output = llm._call("""
You are given a list of tasks. Please extend it with new unique tasks:

1. "Print hello world to the terminal",
2. "Fetch a Chuck Norris joke from this endpoint https://api.chucknorris.io/jokes/random",
3. "Parse this HTML page https://api.chucknorris.io/ and find all the API endpoints ",
4. "Generate 10 unique cat jokes and store them in a CSV file with two columns, punch line and joke finisher",
5. "Connect to a Postgres database and return the existing databases names. Use the following credentials: \n\nhost localhost\nport 7036\nuser admin\npassword admin",
6. List the existing files in the current directory",
7. "Find out your existing working directory" ,
8. "Fix the syntax error of this code snippet:\ndef myfunc():\n\tprint(“hello",
9. "Find the keys of the JSON payload stored in the variable response_json",
10. "Extract the key called 'address' from the JSON stored in the variable json_ and store into a variable called address",
11. "Create a joke about AI bots and save it in a local text file",
12. "Create an unit test for the following snippet of code:\ndef sum_2(x, y):\n\treturn x + y",
13. "Create random data and plot it using matplotlib and store the result as a .PNG image",
14. "Download a CSV file about suicide from the webpage https://catalog.data.gov/dataset/?res_format=CSV and plot a bar chart comparing the suicide numbers of male vs ,female",
15. "Design a Todo list system. Write the explanation in a file called 'todo_list_system_design.txt'",
16. Search for the source code called 'example.py' in the directory, inspect the file, write unit tests for it and execute them to make sure everything is correct.",
17. "Write a data pipeline that ingests data from the Crime Data from 2020 to present from https://catalog.data.gov/dataset/?res_format=CSV. Use the requests and pandas, save the csv to the local disk. Create a directory if necessary, give an appropriate name"
""")
    with open("generated_tasks.txt", "a") as fp:
        fp.write(output)
```

The output can then be filtered with a simple bash script:

```bash
cat generated_tasks.txt | tr -s ' ' | grep -oE '\s*[0-9]+\.[A-Za-z, ]+[A-Za-z, ]+\.' | awk 'length >= 50' | sed -e 's/[0-9\. ]*//' > filtered_generated.txt
```

And then deduplicated with a few lines of code:

```python
import json

with open("filtered_generated.txt", "r") as fp:
    tasks = fp.readlines()

with open("dedup_generated_tasks.json", "w") as fp:
    json.dump(list(set(tasks)), fp, indent=4)
```

Result: https://github.com/paolorechia/learn-langchain/blob/main/task_generation/dedup_generated_tasks.json

3. Used a prompted, unquantized WizardLM 7b to execute each task in the list, using the logger from item 1

https://github.com/paolorechia/learn-langchain/blob/main/langchain_app/agents/log_task_prompts_agent.py

```python
from langchain.agents import Tool, initialize_agent, AgentType
from langchain.tools.python.tool import PythonAstREPLTool
from langchain_app.models.llama_http_llm import build_llama_base_llm
import json

prompt_template = """python

For instance:

Question: Find out how much 2 plus 2 is.
Thought: I must use the Python shell to calculate 2 + 2
Action: Python REPL
Action Input:
2 + 2
Observation: 4

Thought: I now know the answer
Final Answer: 4

Example 2:
Question: You have a variable age in your scope. If it's greater or equal than 21, say OK. Else, say Nay.
Thought: I should write an if/else block in the Python shell.
Action: Python REPL
Action Input:
if age >= 21:
    print("OK")  # this line has four spaces at the beginning
else:
    print("Nay")  # this line has four spaces at the beginning

Observation: OK
Thought: I have executed the task successfully.
Final Answer: I have executed the task successfully.

Example 3:

Question: Write and execute a script that sleeps for 2 seconds and prints 'Hello, World'
Thought: I should import the sleep function.
Action: Python REPL
Action Input:
from time import sleep
Observation:

Thought: I should call the sleep function passing 2 as parameter
Action: Python REPL
Action Input:
sleep(2)
Observation:

Thought: I should use the 'print' function to print 'Hello, World'
Action: Python REPL
Action Input:
print('Hello, World')
Observation:

Thought: I now finished the script
Final Answer: I executed the following script successfully:

from time import sleep
sleep(2)
print('Hello, World')


Additional Hints:
1. If an error thrown along the way, try to understand what happened and retry with a new code version that fixes the error.
2. DO NOT IGNORE ERRORS.
3. If an object does not have an attribute, call dir(object) to debug it.
4. SUPER IMPORTANT: ALWAYS respect the indentation in Python. Loops demand an idendentation. For example:

for i in range(10):
    print(i)  # this line has four spaces at the beginning

Same for ifs:

if True:
    print("hello")  # this line has four spaces at the beginning

An error be thrown because of the indentation, something like...  "expected an indented block after 'for' statement on line..."

To fix, make sure to indent the lines!

5. Do not use \ in variable names, otherwise you'll see the syntax error "unexpected character after line continuation character..."
6. If the variable is not defined, use vars() to see the defined variables.
7. Do not repeat the same statement twice without a new reason.
8. NEVER print the HTML directly.

Now begin for real!

Question: {}

"""

offset = 0
with open("task_generation/dedup_generated_tasks.json", "r") as fp:
    tasks = json.load(fp)
    tasks = tasks[offset:]

for idx, task in enumerate(tasks):
    params = {"temperature": 0, "max_new_tokens": 2048, "stop": ["Observation:"], "logging_session": f"medium_size_dataset{idx+offset}"}

    llm = build_llama_base_llm(parameters=params)
    python_tool = PythonAstREPLTool()

    tools = [
        Tool(
            name="Python REPL",
            func=python_tool,
            description="useful for when you need to execute Python code",
        ),
    ]
    agent = initialize_agent(
        tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True
    )
    first_task = tasks[idx]
    try:
        agent.run(prompt_template.format(first_task))
    except Exception:
        pass
```

4. Extracted all logs and consolidated them into .txt files inside a directory

```python
import os

dataset_folder = "medium_size_generated_tasks"
# -1 means no number of max_actions
max_actions_per_task = -1

if __name__ == "__main__":
    try:
        os.makedirs(dataset_folder)
    except FileExistsError:
        pass
    dir_ = "logged_prompts/"
    sessions = os.listdir(dir_)
    datapoints = 0
    for session in sessions:
        session_dir = os.path.join(dir_, session)
        logs_files = os.listdir(session_dir)
        inputs_step_tuple = [log.split("_") for log in logs_files if "input" in log]
        outputs_step_tuple = [log.split("_") for log in logs_files if "output" in log]
        inputs_step_tuple.sort(key=lambda x: x[1])
        outputs_step_tuple.sort(key=lambda x: x[1])
        i = 0
        for input_tuple, output_tuple in zip(inputs_step_tuple, outputs_step_tuple):
            input_filename = input_tuple[0]+"_"+input_tuple[1]
            output_filename = output_tuple[0]+"_"+output_tuple[1]
            input_ = os.path.join(session_dir, input_filename)
            output_ = os.path.join(session_dir, output_filename)
            with open(input_, "r") as fp:
                prompt = fp.read()
            with open(output_, "r") as fp:
                output = fp.read()
            datapoint_filename = os.path.join(dataset_folder, f"{datapoints}.txt")
            with open(datapoint_filename, "w") as fp:
                fp.write(f"#####PROMPT: {prompt}")
                fp.write(f"#####OUTPUT: {output}")
            datapoints+=1
            i += 1
            if i == max_actions_per_task:
                break
```

5. Use the dataset! For instance, to convert it to JSON:

```python
import json
import os

dataset_list = []
# dir_ = "easy_task_mini_dataset_cleaned"
dir_ = "medium_size_generated_tasks"
files_ = os.listdir(dir_)
for f in files_:
    filename = os.path.join(dir_, f)
    print(filename)
    with open(filename, "r") as fp:
        txt = fp.read()
    # Split each file on the two markers written by the consolidation script.
    prompt = txt.split("#####PROMPT:")[1].split("#####OUTPUT:")[0].strip()
    output = txt.split("#####OUTPUT:")[1].strip()
    dataset_list.append({
        "prompt": prompt,
        "output": output,
    })

with open("data.json", "w") as fp:
    json.dump(dataset_list, fp, indent=4)
```

You can also use my fork directly to train a LoRA:

https://github.com/paolorechia/vicuna-react-lora/blob/main/finetune_wizard_react.py
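One thing to watch if you adapt the log consolidation script above (the one that reads `logged_prompts/`): it sorts the step suffix of each log file as a string, so `input_10` is ordered before `input_2` once a session has ten or more steps. A minimal sketch of a numeric sort follows. The helper name `sort_logs_by_step` is hypothetical and not part of the original repository.

```python
def sort_logs_by_step(log_files, prefix):
    """Return the filenames that start with `prefix`, ordered by numeric step."""
    selected = [name for name in log_files if name.startswith(prefix + "_")]
    # Compare the suffix after the last underscore as an integer, not as text.
    return sorted(selected, key=lambda name: int(name.rsplit("_", 1)[1]))


logs = ["input_10", "input_2", "output_1", "input_0", "output_11", "output_3"]
print(sort_logs_by_step(logs, "input"))   # ['input_0', 'input_2', 'input_10']
print(sort_logs_by_step(logs, "output"))  # ['output_1', 'output_3', 'output_11']
```

This only changes the ordering. How inputs and outputs should be paired still depends on how the logger numbered the files.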

Provider:

paolorechia

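Summary of original information

Dataset overview

Basic dataset information

License: research/academic use only
Language: English
Tags: ReAct, LLM, Agent, langchain
Size category: 1K<n<10K

Dataset contents

A collection of .txt files, each containing a prompt and the expected output.
Example format: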

#####PROMPT: Question: Make sure the task is unique and adds value to the original list.

Thought:#####OUTPUT: I should check if the task is already in the list. Action: Python REPL Action Input: if task not in tasks: print("Task not found.") else: print("Task found.")

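Dataset purpose

Intended for training LLaMA-based models to use the Langchain ReAct tooling, specifically the Python REPL.

Dataset quality

The dataset quality is currently low and has many known problems; the first LoRA is still being fine-tuned.

Dataset creation process

A mechanism was created to log the prompt/output pairs generated by a Langchain Agent running against a local server.
A short list of tasks was written and then extended with the help of an LLM to about 390 tasks.
An unquantized WizardLM 7b was used to execute each task, with the logger from step 1 recording the prompts and outputs.
All logs were extracted and consolidated into .txt files inside a directory.

Planned improvements

Redo the steps using a base model with a more permissive license.
Fix problems in the dataset generation phase, such as the model failing when it tries to install packages and the langchain agent tooling occasionally misbehaving.
Clean the dataset better.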
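The generation problems named in this document (package installs, requests for user input, calls to `sys.exit()`) suggest some simple text heuristics for a first cleaning pass. The sketch below flags outputs that match any of them. The patterns and the names `PROBLEM_PATTERNS` and `find_problems` are illustrative assumptions, not the author's cleaning method, and would need tuning against the real data.

```python
import re

# Hypothetical patterns, one per generation problem named in this document.
PROBLEM_PATTERNS = {
    "package_install": re.compile(r"\bpip3?\s+install\b"),
    "user_input": re.compile(r"\binput\s*\("),
    "sys_exit": re.compile(r"\bsys\.exit\s*\("),
}


def find_problems(output_text):
    """Return the sorted names of every pattern that matches the output text."""
    return sorted(
        name
        for name, pattern in PROBLEM_PATTERNS.items()
        if pattern.search(output_text)
    )


sample = "Action: Python REPL\nAction Input:\nimport sys\nsys.exit()"
print(find_problems(sample))  # ['sys_exit']
```

A datapoint with a non-empty result could be dropped or set aside for manual review, depending on how aggressive the cleaning should be.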
