
ruslandev/tagengo-subset-gpt-4o

Source: Hugging Face. Updated: 2024-06-12. Indexed: 2024-06-15.
Download link:
https://hf-mirror.com/datasets/ruslandev/tagengo-subset-gpt-4o
Description:
---
dataset_info:
  features:
  - name: conversation_id
    dtype: string
  - name: conversations
    list:
    - name: from
      dtype: string
    - name: value
      dtype: string
  - name: language
    dtype: string
  - name: lang_detect_result
    struct:
    - name: lang
      dtype: string
    - name: score
      dtype: float64
  - name: response
    sequence: string
  splits:
  - name: train
    num_bytes: 11451161
    num_examples: 3000
  download_size: 6194853
  dataset_size: 11451161
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

This dataset was built by sampling 3,000 prompts (1,000 each in English, Chinese, and Russian) from the [Tagengo](https://huggingface.co/datasets/lightblue/tagengo-gpt4) dataset and generating responses with GPT-4o.

## The script to run prompts:

```
import pandas as pd
from openai import OpenAI
from datasets import load_dataset, Dataset
from dotenv import load_dotenv
from glob import glob
from tqdm.auto import tqdm
from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
)  # for exponential backoff

load_dotenv()  # reads OPENAI_API_KEY from a local .env file
client = OpenAI()


@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
def get_openai_response(input_text, model_name):
    # Note: exceptions are caught below, so @retry never sees a failure;
    # a failed call yields (None, None) instead of being retried.
    try:
        response = client.chat.completions.create(
            model=model_name,
            messages=[
                {
                    "role": "user",
                    "content": input_text
                }
            ],
            temperature=0,
            max_tokens=2048,
        )
        # Print the estimated cost of this call in USD
        print(
            str(
                round(
                    float(response.usage.completion_tokens * (15 / 1_000_000))
                    + float(response.usage.prompt_tokens * (5 / 1_000_000)),
                    3
                )
            ) + "$"
        )
        output_text = response.choices[0].message.content
        finish_reason = response.choices[0].finish_reason
        return output_text, finish_reason
    except Exception as e:
        print("ERROR!")
        print(e)
        return None, None


def run_prompt():
    prompt_dataset = load_dataset("lightblue/tagengo-gpt4", split="train")
    lang_parts = []

    def filter_cond(e):
        if e['conversations'][1]['value']:
            # This is optional.
            # Most likely these patterns mean censored response
            for filtered_marker in "I'm sorry", 'Извините', '对不起':
                if e['conversations'][1]['value'].startswith(filtered_marker):
                    return False
        return True

    sample_size = 1000
    for lang in 'Russian', 'English', 'Chinese':
        part = prompt_dataset.filter(lambda e: e['language'] == lang)
        part = part.filter(filter_cond)
        lang_parts.append(part.shuffle().select(range(sample_size)))

    for i, sample in enumerate(lang_parts):
        # "response" is stored as the pair (output_text, finish_reason)
        batch_dataset = sample.map(
            lambda x: {
                "response": get_openai_response(x["conversations"][0]["value"], "gpt-4o")
            },
            num_proc=12
        )
        batch_dataset.to_json(f"./gpt_multiling_saved/{str(i).zfill(6)}.json", force_ascii=False)


def upload_dataset():
    paths = glob("gpt_multiling_saved/*.json")
    df = pd.concat([pd.read_json(p, lines=True) for p in tqdm(paths)])
    keep_col = ["conversation_id", "conversations", "language", "lang_detect_result", "response"]
    df = df[keep_col]
    Dataset.from_pandas(df).select_columns(keep_col).push_to_hub("ruslandev/tagengo-subset-gpt-4o", private=True)


if __name__ == '__main__':
    run_prompt()
    upload_dataset()
```
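
Two pieces of the script can be tried without an API key: the per-call cost estimate and the refusal-prefix check. The sketch below isolates them. The function names `estimate_cost_usd` and `looks_like_refusal` and the token counts are illustrative assumptions; the prices and marker strings are copied from the script, not from any current price list.

```python
# Per-token prices as hard-coded in the script's print statement:
# 5 USD per 1M prompt tokens, 15 USD per 1M completion tokens.
PROMPT_PRICE = 5 / 1_000_000
COMPLETION_PRICE = 15 / 1_000_000

# Prefixes the script treats as likely refusals (English, Russian, Chinese).
REFUSAL_MARKERS = ("I'm sorry", "Извините", "对不起")


def estimate_cost_usd(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimated cost of one call, rounded to 3 decimals as in the script."""
    return round(
        completion_tokens * COMPLETION_PRICE + prompt_tokens * PROMPT_PRICE, 3
    )


def looks_like_refusal(reply: str) -> bool:
    """True if the reply starts with any of the refusal markers."""
    # str.startswith accepts a tuple of candidate prefixes.
    return reply.startswith(REFUSAL_MARKERS)


# Hypothetical token counts and replies, for illustration only.
print(estimate_cost_usd(prompt_tokens=1000, completion_tokens=2000))
print(looks_like_refusal("I'm sorry, but I can't help with that."))
print(looks_like_refusal("Sure, here is an overview."))
```

In the script, `filter_cond` drops a prompt when the existing answer in `conversations[1]` matches this check, so `looks_like_refusal` is the negation of what `filter_cond` returns.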
Provider:
ruslandev
Summary of original information

Dataset information

Features

  • conversation_id: string
  • conversations: list
    • from: string
    • value: string
  • language: string
  • lang_detect_result: struct
    • lang: string
    • score: float64
  • response: sequence of strings
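
To make the feature list concrete, here is a minimal sketch of one record shaped like the schema above, with a strict type check. All field values (including the `from` labels `human` and `gpt`) and the helper name `matches_schema` are illustrative assumptions, not rows taken from the dataset. The `response` pair mirrors the generation script, which stores the output text followed by the finish reason.

```python
from typing import Any


def matches_schema(record: dict[str, Any]) -> bool:
    """Return True if the record has the listed fields with the listed types."""
    def is_str(value: Any) -> bool:
        return isinstance(value, str)

    try:
        turns = record["conversations"]
        detect = record["lang_detect_result"]
        return (
            is_str(record["conversation_id"])
            and isinstance(turns, list)
            and all(is_str(t["from"]) and is_str(t["value"]) for t in turns)
            and is_str(record["language"])
            and is_str(detect["lang"])
            and isinstance(detect["score"], float)
            and isinstance(record["response"], list)
            and all(is_str(r) for r in record["response"])
        )
    except (KeyError, TypeError):
        # A missing field or a wrongly shaped container fails the check.
        return False


# Hypothetical record, for illustration only.
example = {
    "conversation_id": "example-0001",
    "conversations": [
        {"from": "human", "value": "What is the capital of France?"},
        {"from": "gpt", "value": "The capital of France is Paris."},
    ],
    "language": "English",
    "lang_detect_result": {"lang": "en", "score": 0.99},
    "response": ["Paris is the capital of France.", "stop"],
}

print(matches_schema(example))
```

This check is deliberately strict: a record whose `response` holds null entries (which the script produces when an API call fails) would be rejected.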

Splits

  • train:
    • Bytes: 11451161
    • Examples: 3000

Data size

  • Download size: 6194853
  • Dataset size: 11451161
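
As a quick sanity check on these figures, the following sketch derives the average record size and the ratio of download size to dataset size. It assumes both sizes are in bytes, as the `num_bytes` field in the card metadata suggests.

```python
# Figures copied from the listing above.
DATASET_SIZE_BYTES = 11_451_161
DOWNLOAD_SIZE_BYTES = 6_194_853
NUM_EXAMPLES = 3_000

# Average size of one example in the loaded dataset.
avg_bytes_per_example = DATASET_SIZE_BYTES / NUM_EXAMPLES

# Fraction of the dataset size that is actually downloaded.
download_ratio = DOWNLOAD_SIZE_BYTES / DATASET_SIZE_BYTES

print(f"average bytes per example: {avg_bytes_per_example:.1f}")
print(f"download / dataset size:   {download_ratio:.3f}")
```

Each example averages a little under 4 KB, and the download is a bit more than half the size of the loaded dataset.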

Configuration

  • default:
    • Data files:
      • Split: train
      • Path: data/train-*
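
The `data/train-*` path is a glob pattern: every file whose name matches it belongs to the train split. The sketch below illustrates the matching with the standard library only. The file names are hypothetical, and the Hub's own file resolution may differ in details.

```python
from fnmatch import fnmatchcase

# Pattern copied from the configuration above.
PATTERN = "data/train-*"

# Hypothetical repository file names, for illustration only.
candidates = [
    "data/train-00000-of-00001.parquet",
    "data/test-00000-of-00001.parquet",
    "README.md",
]

# Keep only the files that the train split's pattern would select.
train_files = [name for name in candidates if fnmatchcase(name, PATTERN)]
print(train_files)
```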