aipracticecafe/test_rp

Name: aipracticecafe/test_rp
Creator: aipracticecafe
Published: 2024-05-24 14:33:04
License: No description available

Source: Hugging Face. Updated 2024-05-24; indexed 2024-06-12.

Download link:

https://hf-mirror.com/datasets/aipracticecafe/test_rp

Resource description:

---
license: mit
configs:
- config_name: default
  data_files:
  - split: train
    path: "dataset_2_only_diags.jsonl"
  - split: test
    path: "test_dataset.jsonl"
- config_name: detailed_descriptions
  data_files:
  - split: train
    path: "dataset_1.jsonl"
  - split: test
    path: "test_dataset.jsonl"
task_categories:
- text-generation
language:
- ja
size_categories:
- n<1K
---

Reformat each conversation so that User and Assistant turns alternate: `system` messages are treated as `user` messages, and consecutive messages with the same role are merged into one.

```python
# `dataset` is assumed to be an already loaded split of this dataset.
def merge_roles(data):
    merged_data = []
    current_role = None
    current_content = []

    for entry in data["messages"]:
        role = entry['role']
        # Treat system messages as user messages.
        if role == "system":
            role = "user"
        content = entry['content']

        if role == current_role:
            # Same speaker as the previous message: accumulate the content.
            current_content.append(content)
        else:
            # Speaker changed: flush the accumulated turn and start a new one.
            if current_role is not None:
                merged_data.append({"role": current_role, "content": "\n".join(current_content)})
            current_role = role
            current_content = [content]

    # Append the last entry
    if current_role is not None:
        merged_data.append({"role": current_role, "content": "\n".join(current_content)})

    return {"merged_messages": merged_data}

dataset_test = dataset.map(merge_roles, batched = False)
dataset_test
```

## Chat Template

When using 'cyberagent/calm2-7b-chat', the following template is useful.

```python
# `tokenizer` is assumed to be the tokenizer already loaded for the model.
calm_template = \
    "{% for message in messages %}"\
        "{% if message['role'] == 'user' or message['role'] == 'system' %}"\
            "{{ 'USER: ' + message['content'] + '<|endoftext|>' + '\n' }}"\
        "{% elif message['role'] == 'assistant' %}"\
            "{{ 'ASSISTANT: ' + message['content'] + '<|endoftext|>' + '\n' }}"\
        "{% endif %}"\
    "{% endfor %}"\
    "{% if add_generation_prompt %}"\
        "{{ 'ASSISTANT: ' }}"\
    "{% endif %}"

# Note: this is an alias, so the template is also set on `tokenizer` itself.
tokenizer_new = tokenizer
tokenizer_new.chat_template = calm_template
```

## Usage

After merging the messages, apply the chat template and tokenize.

```python
# `max_seq_length` is assumed to be defined earlier in the training script.
def formatting_prompts_func(examples):
    convos = examples["merged_messages"]
    texts = [tokenizer_new.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }

dataset_test = dataset_test.map(formatting_prompts_func, batched = True,)

# truncation=True makes explicit the truncation that max_length implies.
dataset_tokens = dataset_test.map(lambda x: tokenizer(x["text"], return_length=True, truncation=True, max_length=max_seq_length))
dataset_tokens = dataset_tokens.remove_columns(['messages', 'user_name', 'assistant_name', 'ncode', 'file_name', 'text', 'merged_messages'])
dataset_tokens
```

Passing `return_length=True` adds a `length` column with each sample's token count. That column can then be used to batch samples of similar length together, which avoids excessive padding.
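To illustrate why batching by length reduces padding, here is a minimal pure-Python sketch. The token counts are hypothetical stand-ins for the `length` column, and `make_batches` and `padding_tokens` are illustrative helper names, not part of any library. How the length column is actually consumed during training is not stated in the source; the `group_by_length` option of `transformers.TrainingArguments` is one plausible consumer, but that is an assumption.

```python
from typing import List


def make_batches(indices: List[int], batch_size: int) -> List[List[int]]:
    """Split a list of sample indices into consecutive batches."""
    return [indices[i:i + batch_size] for i in range(0, len(indices), batch_size)]


def padding_tokens(lengths: List[int], batches: List[List[int]]) -> int:
    """Count pad tokens needed when each batch is padded to its longest sample."""
    total = 0
    for batch in batches:
        longest = max(lengths[i] for i in batch)
        total += sum(longest - lengths[i] for i in batch)
    return total


# Hypothetical token counts, standing in for the "length" column.
lengths = [12, 250, 15, 240, 11, 260]

# Batching in the original order mixes short and long samples.
in_order = make_batches(list(range(len(lengths))), batch_size=3)

# Sorting indices by length first puts similar lengths in the same batch.
sorted_indices = sorted(range(len(lengths)), key=lambda i: lengths[i])
by_length = make_batches(sorted_indices, batch_size=3)

naive_padding = padding_tokens(lengths, in_order)
grouped_padding = padding_tokens(lengths, by_length)
print(naive_padding, grouped_padding)  # 742 37
```

In this toy case, grouping by length cuts the padding from 742 tokens to 37. Real training loops usually add some shuffling on top of the grouping so that batch order is not fully deterministic.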

Provider:

aipracticecafe

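Summary of original information

Dataset overview

Dataset configurations

Default configuration (`default`)
- Training set file: dataset_2_only_diags.jsonl
- Test set file: test_dataset.jsonl
Detailed descriptions configuration (`detailed_descriptions`)
- Training set file: dataset_1.jsonl
- Test set file: test_dataset.jsonl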
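
A minimal sketch of how one of the two configurations might be selected when loading the dataset. The config names and file names come from the listing above. `CONFIG_FILES` and `load_config` are hypothetical helper names introduced for illustration, and the sketch assumes the `datasets` library is installed and the Hub is reachable.

```python
# Config names and file names as listed on the dataset card.
CONFIG_FILES = {
    "default": {
        "train": "dataset_2_only_diags.jsonl",
        "test": "test_dataset.jsonl",
    },
    "detailed_descriptions": {
        "train": "dataset_1.jsonl",
        "test": "test_dataset.jsonl",
    },
}


def load_config(config_name: str = "default"):
    """Load one configuration from the Hub (hypothetical helper).

    Requires `pip install datasets` and network access.
    """
    if config_name not in CONFIG_FILES:
        raise ValueError(f"unknown config: {config_name!r}")
    # Imported lazily so the mapping above stays usable offline.
    from datasets import load_dataset
    return load_dataset("aipracticecafe/test_rp", config_name)
```

Calling `load_config("detailed_descriptions")` would be expected to return a dictionary-like object with `train` and `test` splits. Both configurations share the same test file and differ only in the training file.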

Task category

Text generation

Language

Japanese

Dataset size

Fewer than 1K records