
aipracticecafe/test_rp

Source: Hugging Face. Updated 2024-05-24; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/aipracticecafe/test_rp
Resource description:
---
license: mit
configs:
- config_name: default
  data_files:
  - split: train
    path: "dataset_2_only_diags.jsonl"
  - split: test
    path: "test_dataset.jsonl"
- config_name: detailed_descriptions
  data_files:
  - split: train
    path: "dataset_1.jsonl"
  - split: test
    path: "test_dataset.jsonl"
task_categories:
- text-generation
language:
- ja
size_categories:
- n<1K
---

Reformat each conversation so that turns alternate between User and Assistant. System messages are treated as user messages, and consecutive messages from the same role are merged into one.

```python
# Assumes `dataset` is a split already loaded with the `datasets` library.
def merge_roles(data):
    """Merge consecutive messages that share a role into a single message."""
    merged_data = []
    current_role = None
    current_content = []

    for entry in data["messages"]:
        role = entry["role"]
        # Treat system messages as user messages.
        if role == "system":
            role = "user"
        content = entry["content"]

        if role == current_role:
            current_content.append(content)
        else:
            if current_role is not None:
                merged_data.append({"role": current_role, "content": "\n".join(current_content)})
            current_role = role
            current_content = [content]

    # Append the final entry.
    if current_role is not None:
        merged_data.append({"role": current_role, "content": "\n".join(current_content)})

    return {"merged_messages": merged_data}

dataset_test = dataset.map(merge_roles, batched=False)
dataset_test
```

## Chat Template

When using 'cyberagent/calm2-7b-chat', the following template is useful.

```python
# Assumes `tokenizer` is the tokenizer already loaded for the model.
calm_template = \
    "{% for message in messages %}"\
        "{% if message['role'] == 'user' or message['role'] == 'system' %}"\
            "{{ 'USER: ' + message['content'] + '<|endoftext|>' + '\n' }}"\
        "{% elif message['role'] == 'assistant' %}"\
            "{{ 'ASSISTANT: ' + message['content'] + '<|endoftext|>' + '\n' }}"\
        "{% endif %}"\
    "{% endfor %}"\
    "{% if add_generation_prompt %}"\
        "{{ 'ASSISTANT: ' }}"\
    "{% endif %}"

# Note: this is an alias, not a copy, so `tokenizer` gets the template too.
tokenizer_new = tokenizer
tokenizer_new.chat_template = calm_template
```

## Usage

After merging the messages, apply the chat template and tokenize.

```python
def formatting_prompts_func(examples):
    """Render each merged conversation to a single training string."""
    convos = examples["merged_messages"]
    texts = [
        tokenizer_new.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
        for convo in convos
    ]
    return {"text": texts}

dataset_test = dataset_test.map(formatting_prompts_func, batched=True)

# Assumes `max_seq_length` is defined elsewhere.
# truncation=True makes the truncation implied by max_length explicit.
dataset_tokens = dataset_test.map(
    lambda x: tokenizer(x["text"], return_length=True, truncation=True, max_length=max_seq_length)
)
dataset_tokens = dataset_tokens.remove_columns(
    ['messages', 'user_name', 'assistant_name', 'ncode', 'file_name', 'text', 'merged_messages']
)
dataset_tokens
```

Setting `return_length=True` adds a `length` column to each sample. It can be used to batch samples of similar length together and avoid excessive padding.
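To illustrate why batching by length reduces padding, here is a minimal sketch in plain Python. The helper names `group_by_length` and `padding_cost` and the token counts are hypothetical and are not part of the dataset card; the sketch simply sorts samples by length before slicing them into batches and counts how many pad tokens each strategy needs.

```python
def group_by_length(lengths, batch_size):
    """Return batches of sample indices, sorted so similar lengths share a batch."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def padding_cost(lengths, batches):
    """Count pad tokens needed when every batch is padded to its longest sample."""
    total = 0
    for batch in batches:
        longest = max(lengths[i] for i in batch)
        total += sum(longest - lengths[i] for i in batch)
    return total


# Hypothetical token counts, standing in for the `length` column.
lengths = [12, 250, 15, 240, 18, 260]

naive = [[0, 1], [2, 3], [4, 5]]        # batches in original order
grouped = group_by_length(lengths, 2)   # batches of similar length

print(padding_cost(lengths, naive))     # 705
print(padding_cost(lengths, grouped))   # 235
```

In practice a trainer usually handles this grouping. For example, the `transformers` `TrainingArguments` has offered a `group_by_length` option that reads a length column, though whether it fits a given training setup and library version is an assumption to check.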
Provider:
aipracticecafe
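Summary of original information

Dataset overview

Dataset configurations

  • Default configuration

    • Training set file: dataset_2_only_diags.jsonl
    • Test set file: test_dataset.jsonl
  • Detailed descriptions configuration

    • Training set file: dataset_1.jsonl
    • Test set file: test_dataset.jsonl

Task categories

  • Text generation

Language

  • Japanese

Dataset size

  • Fewer than 1K records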