Download link:

https://modelscope.cn/datasets/swift/tagengo-gpt4


Resource description:

# Tagengo - the world's largest high quality multilingual chat dataset

[[Paper](https://arxiv.org/abs/2405.12612)]

This dataset consists of more than 75,000 single-turn conversations between humans and GPT-4 (`gpt-4-0125-preview`).

While there are many high-quality English chat datasets between humans and state-of-the-art AI assistants such as GPT-4, such data is severely lacking in other languages. For this reason, we created what we believe to be the world's largest multilingual chat dataset between humans and a high-quality AI assistant such as GPT-4. This dataset consists of conversations in 74 languages, with high-quality output from one of the best state-of-the-art assistant AIs currently available.

# How we made this dataset

### Prompt selection

[Code colab](https://drive.google.com/file/d/1gb2bYdwxanDd80rLw8BYQ3GG7XGmfvSD/view?usp=sharing) ([GitHub backup of code](https://github.com/lightblue-tech/tagengo/blob/main/tagengo_prompt_preparation.ipynb))

1. Read prompts from [lmsys/lmsys-chat-1m](https://huggingface.co/datasets/lmsys/lmsys-chat-1m).
2. Remove all OpenAI moderated messages.
3. Remove any messages whose language is listed as one of: `["unknown", "Klingon", "xx", "zp", "zzp"]`.
4. Remove any anonymised messages or mentions of a language model in the input. (Many messages ask the model questions about its being an LLM, which we do not regard as particularly useful.)
5. Remove any messages which have a low confidence language detection score (<80%) using the `ftlangdetect.detect` method.
6. To reduce data generation costs, remove any messages in which the first message and response amount to more than 512 tokens.
7. Randomly sample 25,000 prompts from each language (effectively only sampling 25,000 from English, as every other language had fewer than this in the dataset).
8. Embed each language's prompts with [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) and remove one of any pair of prompts whose embeddings have a dot product of more than 0.8. This was done to remove overly similar prompts from the dataset.

This resulted in a dataset with the following number of conversations for each language:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b63f8ad57e02621dc93c8b/KeKoZ9Kzex_yrqoTpbaji.png)

| language      | count |
|---------------|-------|
| English       | 15771 |
| Portuguese    | 12564 |
| Spanish       | 8318  |
| Russian       | 8056  |
| Italian       | 7063  |
| German        | 5739  |
| French        | 5369  |
| Chinese       | 5338  |
| Japanese      | 2521  |
| Korean        | 1609  |
| Polish        | 1090  |
| Arabic        | 789   |
| Vietnamese    | 429   |
| Turkish       | 406   |
| Dutch         | 383   |
| Ukrainian     | 323   |
| Greek         | 308   |
| Swedish       | 256   |
| Indonesian    | 240   |
| Hungarian     | 214   |
| Persian       | 184   |
| Czech         | 179   |
| Thai          | 133   |
| Hebrew        | 120   |
| Finnish       | 92    |
| Catalan       | 73    |
| Romanian      | 71    |
| Danish        | 67    |
| Bulgarian     | 56    |
| Bangla        | 29    |
| Norwegian     | 26    |
| Tagalog       | 22    |
| Latvian       | 22    |
| Hindi         | 20    |
| Estonian      | 18    |
| Esperanto     | 17    |
| Slovak        | 17    |
| Croatian      | 11    |
| Lithuanian    | 11    |
| Slovenian     | 10    |
| Basque        | 6     |
| Serbian       | 6     |
| Mongolian     | 6     |
| Sinhala       | 5     |
| Icelandic     | 5     |
| Malay         | 5     |
| Macedonian    | 5     |
| Tamil         | 5     |
| Albanian      | 5     |
| Latin         | 4     |
| Azerbaijani   | 4     |
| Urdu          | 3     |
| Amharic       | 3     |
| Armenian      | 3     |
| Afrikaans     | 2     |
| Uyghur        | 2     |
| Burmese       | 2     |
| Kazakh        | 2     |
| Yiddish       | 2     |
| Waray         | 2     |
| Malayalam     | 2     |
| Belarusian    | 2     |
| Tibetan       | 1     |
| Lao           | 1     |
| Turkmen       | 1     |
| Kannada       | 1     |
| Georgian      | 1     |
| Sanskrit      | 1     |
| Khmer         | 1     |
| Breton        | 1     |
| Odia          | 1     |
| Luxembourgish | 1     |
| Marathi       | 1     |
| Uzbek         | 1     |

### Prompt running

```python
import pandas as pd
from openai import AzureOpenAI
from datasets import load_dataset, Dataset
from glob import glob
from tqdm.auto import tqdm

# Initialise the Azure OpenAI client (replace the placeholders with real values)
client = AzureOpenAI(
    api_key="API_KEY",
    api_version="2024-02-01",
    azure_endpoint="ENDPOINT"
)

def get_openai_response(input_text, model_name):
    """Send a single-turn prompt and return (output_text, finish_reason)."""
    try:
        response = client.chat.completions.create(
            model=model_name,
            messages=[
                {
                    "role": "user",
                    "content": input_text
                }
            ],
            temperature=0,
            max_tokens=2048,
        )
        # Print the estimated cost of this call in dollars
        # (30 per 1M completion tokens, 10 per 1M prompt tokens)
        print(
            str(
                round(
                    float(response.usage.completion_tokens * (30 / 1_000_000))
                    + float(response.usage.prompt_tokens * (10 / 1_000_000)),
                    3
                )
            ) + "$"
        )

        output_text = response.choices[0].message.content
        finish_reason = response.choices[0].finish_reason

        return output_text, finish_reason
    except Exception as e:
        # On any failure, log the error and return empty results
        print("ERROR!")
        print(e)
        return None, None

# Load the prepared prompts
prompt_dataset = load_dataset("lightblue/multilingual_prompts_25k_max", split="train")

step_size = 1000

# Process the prompts in batches of 1,000 and save each batch to its own JSON file
for i in range(0, len(prompt_dataset), step_size):
    batch_dataset = prompt_dataset.select(
        range(i, min(i + step_size, len(prompt_dataset)))
    ).map(
        lambda x: {
            "response": get_openai_response(x["conversation"][0]["content"], "gpt-4-0125-preview")
        }, num_proc=12
    )

    batch_dataset.to_json(f"/home/jupyter/gpt_multiling_saved/{str(i).zfill(6)}.json")

### Load ###

# Relative path: matches the save location above only when run from /home/jupyter
paths = glob("gpt_multiling_saved/*.json")

# Merge all batch files into one DataFrame
df = pd.concat([pd.read_json(p, lines=True) for p in tqdm(paths)])

# Build human/gpt conversation pairs; x["response"][0] is the output text
df["conversations"] = df.apply(lambda x: [
    {"from": "human", "value": x["conversation"][0]["content"]},
    {"from": "gpt", "value": x["response"][0]},
], axis=1)

keep_col = ["conversation_id", "conversations", "language", "lang_detect_result", "response"]

df = df[keep_col]

# Upload the final dataset to the Hugging Face Hub
Dataset.from_pandas(df).select_columns(keep_col).push_to_hub("lightblue/tagengo-gpt4", private=True)
```

# How to cite

Please cite [this paper](https://arxiv.org/abs/2405.12612) when referencing this dataset.

```tex
@article{devine2024tagengo,
  title={Tagengo: A Multilingual Chat Dataset},
  author={Devine, Peter},
  journal={arXiv preprint arXiv:2405.12612},
  year={2024}
}
```

# Developer

Peter Devine - ([ptrdvn](https://huggingface.co/ptrdvn))
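The near-duplicate removal in step 8 of the prompt selection can be illustrated with a small sketch. This is not the authors' code (that lives in the linked notebook); it assumes a greedy pass in which a prompt is kept only if its dot product with every already-kept prompt is at most 0.8. The function name `drop_near_duplicates` and the two-dimensional toy vectors are hypothetical stand-ins for real BAAI/bge-m3 embeddings.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def drop_near_duplicates(embeddings, threshold=0.8):
    """Return the indices of embeddings to keep.

    Assumption: a greedy, order-dependent pass. A vector is dropped when its
    dot product with any previously kept vector exceeds `threshold`.
    """
    kept = []
    for i, vec in enumerate(embeddings):
        if any(dot(embeddings[j], vec) > threshold for j in kept):
            continue  # too similar to a prompt we already kept
        kept.append(i)
    return kept


# Hypothetical toy embeddings (roughly unit length), not real bge-m3 output
toy_embeddings = [
    [1.0, 0.0],
    [0.96, 0.28],      # dot product with the first vector is 0.96: dropped
    [0.0, 1.0],        # dot product with the first vector is 0.0: kept
    [0.7071, 0.7071],  # about 0.707 with both kept vectors: kept
]

kept_indices = drop_near_duplicates(toy_embeddings)
print(kept_indices)
```

Because the pass is greedy, which member of a similar pair survives depends on the input order; the source only states that one of each pair is removed.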

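The per-call cost that `get_openai_response` prints can be pulled out into a small helper for clarity. The sketch below reuses the two rates hard-coded in the script (30 per million completion tokens and 10 per million prompt tokens); whether those match current pricing is not something the source states. The helper name `estimate_cost_usd` and the token counts in the example are hypothetical.

```python
# Rates copied from the constants in the generation script
COMPLETION_USD_PER_TOKEN = 30 / 1_000_000
PROMPT_USD_PER_TOKEN = 10 / 1_000_000


def estimate_cost_usd(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the cost of one call in dollars, rounded to 3 decimal places."""
    cost = (
        completion_tokens * COMPLETION_USD_PER_TOKEN
        + prompt_tokens * PROMPT_USD_PER_TOKEN
    )
    return round(cost, 3)


# Hypothetical token counts for a single conversation
example_cost = estimate_cost_usd(prompt_tokens=200, completion_tokens=300)
print(f"{example_cost}$")
```

With these example counts, the completion side contributes 300 × 0.00003 = 0.009 and the prompt side 200 × 0.00001 = 0.002, so completion tokens dominate the estimate.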

Application scenarios: