
Felladrin/ChatML-WebGLM-QA

Source: Hugging Face. Updated 2024-02-03. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/Felladrin/ChatML-WebGLM-QA
Resource overview:
---
license: apache-2.0
task_categories:
- question-answering
- text-generation
language:
- en
size_categories:
- 10K<n<100K
---

[THUDM/webglm-qa](https://huggingface.co/datasets/THUDM/webglm-qa) in ChatML format.

Python code used for conversion:

```python
from datasets import load_dataset
import pandas
import re
import random
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path="Felladrin/Llama-160M-Chat-v1"
)

dataset = load_dataset("THUDM/webglm-qa", split="train")


def format(columns):
    references = "\n".join(
        [
            f"- {columns['references'][i].strip()}"
            for i in range(len(columns["references"]))
        ]
    )
    question = columns["question"].strip()
    answer = columns["answer"].strip()
    assistant_message = re.sub(r"\[\d\]", "", answer)

    if random.random() < 0.5:
        user_message = f"Question:\n{question}\n\nContext:\n{references}"
    else:
        user_message = f"Context:\n{references}\n\nQuestion:\n{question}"

    messages = [
        {
            "role": "user",
            "content": user_message,
        },
        {
            "role": "assistant",
            "content": assistant_message,
        },
    ]

    return tokenizer.apply_chat_template(messages, tokenize=False)


pandas.DataFrame({"text": [format(columns) for columns in dataset]}).to_parquet(
    "train.parquet", index=False
)
```
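The conversion script removes citation markers from each answer with the pattern `\[\d\]`. The sketch below isolates that one step so its behavior is easy to check. The sample sentence and the helper name `strip_citations` are made up for illustration and do not come from the dataset. The pattern matches exactly one digit between brackets, so a marker such as `[12]` would be left in place.

```python
import re

# Pattern copied from the conversion script: a single digit inside square brackets.
CITATION = re.compile(r"\[\d\]")


def strip_citations(answer: str) -> str:
    """Remove single-digit citation markers such as [1] from an answer."""
    return CITATION.sub("", answer)


# Hypothetical sample text, not taken from THUDM/webglm-qa.
sample = "Water boils at 100 C at sea level[1][3]. See also[12]."
cleaned = strip_citations(sample)
print(cleaned)
```

If multi-digit markers should also be removed, a pattern such as `\[\d+\]` would cover them, but that is a variation and not what the published script does.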

THUDM/webglm-qa is an English dataset for question answering and text generation, with between 10K and 100K examples. This version converts it to ChatML format with a Python script that loads the dataset, formats each example as a user and assistant exchange, and saves the result as a Parquet file.
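To make "ChatML format" concrete, here is a minimal sketch of how a two-turn exchange is commonly laid out in ChatML. It assumes the widely used `<|im_start|>role ... <|im_end|>` layout and builds the string by hand. The real conversion calls `tokenizer.apply_chat_template`, whose template may differ in details such as whitespace or extra tokens. The function name `to_chatml` and the sample messages are illustrative only.

```python
def to_chatml(messages):
    """Render messages in the common ChatML layout.

    Assumption: this layout is written by hand for illustration and is not
    read from the Felladrin/Llama-160M-Chat-v1 tokenizer's chat template.
    """
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )


# Hypothetical messages shaped like the ones the conversion script builds.
messages = [
    {
        "role": "user",
        "content": "Question:\nWhat is ChatML?\n\nContext:\n- A chat markup format.",
    },
    {"role": "assistant", "content": "A plain-text format for chat turns."},
]

text = to_chatml(messages)
print(text)
```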
Provider:
Felladrin
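Original information summary

Dataset overview

License

  • Apache 2.0

Task categories

  • Question answering
  • Text generation

Language

  • English

Dataset size

  • 10K<n<100K

Dataset format

  • ChatML

Dataset conversion code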

```python
from datasets import load_dataset
import pandas
import re
import random
from transformers import AutoTokenizer

# The tokenizer supplies the chat template used to render each example.
tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path="Felladrin/Llama-160M-Chat-v1"
)

dataset = load_dataset("THUDM/webglm-qa", split="train")


def format(columns):
    # Turn the list of references into a bulleted context block.
    references = "\n".join(
        [
            f"- {columns['references'][i].strip()}"
            for i in range(len(columns["references"]))
        ]
    )
    question = columns["question"].strip()
    answer = columns["answer"].strip()
    # Drop single-digit citation markers such as [1] from the answer.
    assistant_message = re.sub(r"\[\d\]", "", answer)

    # Randomly place the question before or after the context.
    if random.random() < 0.5:
        user_message = f"Question:\n{question}\n\nContext:\n{references}"
    else:
        user_message = f"Context:\n{references}\n\nQuestion:\n{question}"

    messages = [
        {
            "role": "user",
            "content": user_message,
        },
        {
            "role": "assistant",
            "content": assistant_message,
        },
    ]

    # Return the rendered chat as a plain string (no token IDs).
    return tokenizer.apply_chat_template(messages, tokenize=False)


# Write one rendered conversation per row to a single-column Parquet file.
pandas.DataFrame({"text": [format(columns) for columns in dataset]}).to_parquet(
    "train.parquet", index=False
)
```
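The conversion code picks the question-first or context-first layout with an unseeded `random.random()`, so two runs can produce different orderings. The sketch below shows one way to make that step repeatable, assuming a dedicated `random.Random` instance with a fixed seed. The helper names `build_user_message` and `build_all`, the seed value, and the sample rows are hypothetical and are not part of the published script.

```python
import random


def build_user_message(question, references, rng):
    """Join references into a bulleted context and place it before or after the question."""
    context = "\n".join(f"- {ref.strip()}" for ref in references)
    question = question.strip()
    if rng.random() < 0.5:
        return f"Question:\n{question}\n\nContext:\n{context}"
    return f"Context:\n{context}\n\nQuestion:\n{question}"


def build_all(rows, seed):
    # Assumption: a fixed seed is used here; the original script uses the global, unseeded RNG.
    rng = random.Random(seed)
    return [build_user_message(r["question"], r["references"], rng) for r in rows]


# Hypothetical rows with the same field names as the source dataset.
rows = [{"question": f"Q{i}?", "references": [" ref a ", "ref b"]} for i in range(8)]

first = build_all(rows, seed=0)
second = build_all(rows, seed=0)
print(first == second)  # the same seed yields the same layout choices
```

Passing the generator explicitly keeps the choice of layout independent of any other code that touches the global `random` state.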
