
Felladrin/ChatML-reddit-instruct-curated

Source: Hugging Face. Updated 2024-02-17; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/Felladrin/ChatML-reddit-instruct-curated
Resource description:
---
license: mit
language:
- en
size_categories:
- 10K<n<100K
task_categories:
- question-answering
- text-generation
---

[euclaise/reddit-instruct-curated](https://huggingface.co/datasets/euclaise/reddit-instruct-curated) in ChatML format, ready to use in [HuggingFace TRL's SFT Trainer](https://huggingface.co/docs/trl/main/en/sft_trainer).

Python code used for conversion:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Felladrin/Llama-160M-Chat-v1")

dataset = load_dataset("euclaise/reddit-instruct-curated", split="train")

def format(columns):
    post_title = columns["post_title"].strip()
    post_text = columns["post_text"].strip()
    comment_text = columns["comment_text"].strip()

    if post_text:
        user_message = f"{post_title}\n{post_text}"
    else:
        user_message = post_title

    messages = [
        {
            "role": "user",
            "content": user_message,
        },
        {
            "role": "assistant",
            "content": comment_text,
        },
    ]

    return { "text": tokenizer.apply_chat_template(messages, tokenize=False) }

dataset.map(format).select_columns(['text', 'post_score', 'comment_score']).to_parquet("train.parquet")
```
Provider:
Felladrin
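Summary of original information

Dataset overview

License

  • MIT License

Language

  • English

Dataset size

  • Between 10K and 100K examples (10K<n<100K)

Task categories

  • Question answering
  • Text generation

Source dataset

  • euclaise/reddit-instruct-curated

Data format

  • ChatML format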
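
To make the ChatML layout concrete, here is a minimal sketch that renders a user/assistant pair with the `<|im_start|>` and `<|im_end|>` markers commonly used by ChatML. It is an approximation in pure Python, not the tokenizer's own template: the helper name `render_chatml` and the sample messages are hypothetical, and the exact string produced by `tokenizer.apply_chat_template` for a given model may differ (for example in whitespace or added special tokens).

```python
from typing import Dict, List

# Markers conventionally used by the ChatML format.
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def render_chatml(messages: List[Dict[str, str]]) -> str:
    """Render {"role", "content"} messages as one ChatML-style string.

    Hypothetical helper for illustration; a real tokenizer's chat
    template is the authoritative source of the final format.
    """
    parts = []
    for message in messages:
        # Each turn: start marker + role, newline, content, end marker, newline.
        parts.append(f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n")
    return "".join(parts)


# Made-up example pair (not taken from the dataset).
example = render_chatml(
    [
        {"role": "user", "content": "Why is the sky blue?"},
        {"role": "assistant", "content": "Rayleigh scattering."},
    ]
)
print(example)
```

Each row's `text` column holds one such rendered conversation, with the post as the user turn and the comment as the assistant turn.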

Intended use

  • Ready to use with HuggingFace TRL's SFT Trainer

Data processing code

    from datasets import load_dataset
    from transformers import AutoTokenizer

    # Tokenizer whose chat template is used to render the ChatML text.
    tokenizer = AutoTokenizer.from_pretrained("Felladrin/Llama-160M-Chat-v1")

    # Source dataset to convert.
    dataset = load_dataset("euclaise/reddit-instruct-curated", split="train")

    def format(columns):
        post_title = columns["post_title"].strip()
        post_text = columns["post_text"].strip()
        comment_text = columns["comment_text"].strip()

        # The user turn is the post title, followed by the post body when present.
        if post_text:
            user_message = f"{post_title}\n{post_text}"
        else:
            user_message = post_title

        # The comment becomes the assistant turn.
        messages = [
            {
                "role": "user",
                "content": user_message,
            },
            {
                "role": "assistant",
                "content": comment_text,
            },
        ]

        # Render the conversation as a string instead of token IDs.
        return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

    # Keep only the rendered text and the two score columns, then write Parquet.
    dataset.map(format).select_columns(["text", "post_score", "comment_score"]).to_parquet("train.parquet")
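
Because the conversion keeps `post_score` and `comment_score` next to `text`, one possible use is to drop low-scoring pairs before fine-tuning. The sketch below works on plain dictionaries so it runs without downloading anything; the sample rows, the helper name `filter_by_score`, and the thresholds are hypothetical and not taken from the dataset.

```python
from typing import Dict, List, Union

Row = Dict[str, Union[str, int]]

# Hypothetical rows shaped like the converted dataset (values are made up).
rows: List[Row] = [
    {"text": "example A", "post_score": 120, "comment_score": 45},
    {"text": "example B", "post_score": 3, "comment_score": 80},
    {"text": "example C", "post_score": 50, "comment_score": 1},
]


def filter_by_score(
    rows: List[Row], min_post_score: int, min_comment_score: int
) -> List[Row]:
    """Keep rows whose post and comment scores both meet the thresholds."""
    return [
        row
        for row in rows
        if row["post_score"] >= min_post_score
        and row["comment_score"] >= min_comment_score
    ]


# Assumed thresholds, chosen only for illustration.
kept = filter_by_score(rows, min_post_score=10, min_comment_score=10)
print([row["text"] for row in kept])
```

With a loaded `datasets.Dataset`, the same predicate could be passed to `Dataset.filter` instead of a list comprehension.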
