lightblue/tagengo-gpt4

Name: lightblue/tagengo-gpt4
Creator: lightblue
Published: 2024-06-02 02:13:55
License: Apache-2.0

Source: Hugging Face. Updated 2024-06-02. Indexed 2024-05-25.

Download link:

https://hf-mirror.com/datasets/lightblue/tagengo-gpt4


Resource description:

---
license: apache-2.0
dataset_info:
  features:
  - name: conversation_id
    dtype: string
  - name: conversations
    list:
    - name: from
      dtype: string
    - name: value
      dtype: string
  - name: language
    dtype: string
  - name: lang_detect_result
    struct:
    - name: lang
      dtype: string
    - name: score
      dtype: float64
  - name: response
    sequence: string
  splits:
  - name: train
    num_bytes: 296074438
    num_examples: 78057
  download_size: 164269680
  dataset_size: 296074438
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# Tagengo - the world's largest high quality multilingual chat dataset

[[Paper](https://arxiv.org/abs/2405.12612)]

This dataset consists of more than 75,000 single-turn conversations between humans and GPT-4 (`gpt-4-0125-preview`).

While there are many high-quality English chat datasets between humans and state-of-the-art AI assistants such as GPT-4, such data is severely lacking in other languages.
For this reason, we created what we believe to be the world's largest multilingual chat dataset between humans and a high-quality AI assistant such as GPT-4.
This dataset consists of conversations in 74 languages, with high-quality output from one of the best state-of-the-art assistant AIs currently available.

# How we made this dataset

### Prompt selection

[Code colab](https://drive.google.com/file/d/1gb2bYdwxanDd80rLw8BYQ3GG7XGmfvSD/view?usp=sharing)

([GitHub backup of code](https://github.com/lightblue-tech/tagengo/blob/main/tagengo_prompt_preparation.ipynb))

1. Read prompts from [lmsys/lmsys-chat-1m](https://huggingface.co/datasets/lmsys/lmsys-chat-1m)
2. Remove all OpenAI moderated messages
3. Remove any messages whose language is listed as one of: `["unknown", "Klingon", "xx", "zp", "zzp"]`
4. Remove any anonymised messages or mentions of a language model in the input (many messages ask the model questions about it being an LLM, which we do not regard as particularly useful)
5. Remove any messages which have a low confidence language detection score (<80%) using the `ftlangdetect.detect` method
6. To reduce data generation costs, remove any messages in which the first message and response amount to more than 512 tokens
7. Randomly sample 25,000 prompts from each language (effectively only sampling 25,000 from English, as every other language had fewer than this in the dataset)
8. Embed each language's prompts with [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) and remove one of any pair that has a dot product of more than 0.8. This was done to remove near-duplicate prompts from the dataset.

This resulted in a dataset with the following number of conversations for each language:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b63f8ad57e02621dc93c8b/KeKoZ9Kzex_yrqoTpbaji.png)

| language      | count |
|---------------|-------|
| English       | 15771 |
| Portuguese    | 12564 |
| Spanish       | 8318  |
| Russian       | 8056  |
| Italian       | 7063  |
| German        | 5739  |
| French        | 5369  |
| Chinese       | 5338  |
| Japanese      | 2521  |
| Korean        | 1609  |
| Polish        | 1090  |
| Arabic        | 789   |
| Vietnamese    | 429   |
| Turkish       | 406   |
| Dutch         | 383   |
| Ukrainian     | 323   |
| Greek         | 308   |
| Swedish       | 256   |
| Indonesian    | 240   |
| Hungarian     | 214   |
| Persian       | 184   |
| Czech         | 179   |
| Thai          | 133   |
| Hebrew        | 120   |
| Finnish       | 92    |
| Catalan       | 73    |
| Romanian      | 71    |
| Danish        | 67    |
| Bulgarian     | 56    |
| Bangla        | 29    |
| Norwegian     | 26    |
| Tagalog       | 22    |
| Latvian       | 22    |
| Hindi         | 20    |
| Estonian      | 18    |
| Esperanto     | 17    |
| Slovak        | 17    |
| Croatian      | 11    |
| Lithuanian    | 11    |
| Slovenian     | 10    |
| Basque        | 6     |
| Serbian       | 6     |
| Mongolian     | 6     |
| Sinhala       | 5     |
| Icelandic     | 5     |
| Malay         | 5     |
| Macedonian    | 5     |
| Tamil         | 5     |
| Albanian      | 5     |
| Latin         | 4     |
| Azerbaijani   | 4     |
| Urdu          | 3     |
| Amharic       | 3     |
| Armenian      | 3     |
| Afrikaans     | 2     |
| Uyghur        | 2     |
| Burmese       | 2     |
| Kazakh        | 2     |
| Yiddish       | 2     |
| Waray         | 2     |
| Malayalam     | 2     |
| Belarusian    | 2     |
| Tibetan       | 1     |
| Lao           | 1     |
| Turkmen       | 1     |
| Kannada       | 1     |
| Georgian      | 1     |
| Sanskrit      | 1     |
| Khmer         | 1     |
| Breton        | 1     |
| Odia          | 1     |
| Luxembourgish | 1     |
| Marathi       | 1     |
| Uzbek         | 1     |

### Prompt running

```python
import pandas as pd
from openai import AzureOpenAI
from datasets import load_dataset, Dataset
from glob import glob
from tqdm.auto import tqdm

# Directory where each batch of responses is written and later read back.
SAVE_DIR = "/home/jupyter/gpt_multiling_saved"

client = AzureOpenAI(
    api_key="API_KEY",
    api_version="2024-02-01",
    azure_endpoint="ENDPOINT"
)

def get_openai_response(input_text, model_name):
    """Send one prompt to the model and return (output_text, finish_reason)."""
    try:
        response = client.chat.completions.create(
            model=model_name,
            messages=[
                {
                    "role": "user",
                    "content": input_text
                }
            ],
            temperature=0,
            max_tokens=2048,
        )
        # Print the estimated cost of this call in USD, computed at
        # $30 per 1M completion tokens and $10 per 1M prompt tokens.
        print(
            str(
                round(
                    float(response.usage.completion_tokens * (30 / 1_000_000))
                    + float(response.usage.prompt_tokens * (10 / 1_000_000)),
                    3
                )
            ) + "$"
        )
        output_text = response.choices[0].message.content
        finish_reason = response.choices[0].finish_reason
        return output_text, finish_reason
    except Exception as e:
        # On any API failure, log the error and return empty values.
        print("ERROR!")
        print(e)
        return None, None

prompt_dataset = load_dataset("lightblue/multilingual_prompts_25k_max", split="train")

# Generate responses in batches of 1,000 prompts and save each batch to disk.
step_size = 1000
for i in range(0, len(prompt_dataset), step_size):
    batch_dataset = prompt_dataset.select(
        range(i, min(i+step_size, len(prompt_dataset)))
    ).map(
        lambda x: {
            # Stored as [output_text, finish_reason].
            "response": get_openai_response(x["conversation"][0]["content"], "gpt-4-0125-preview")
        }, num_proc=12
    )
    batch_dataset.to_json(f"{SAVE_DIR}/{str(i).zfill(6)}.json")

### Load ###

paths = sorted(glob(f"{SAVE_DIR}/*.json"))

df = pd.concat([pd.read_json(p, lines=True) for p in tqdm(paths)])

# Build the two-turn conversation: the human prompt, then the GPT-4 output text.
df["conversations"] = df.apply(lambda x: [
    {"from": "human", "value": x["conversation"][0]["content"]},
    {"from": "gpt", "value": x["response"][0]},
], axis=1)

keep_col = ["conversation_id", "conversations", "language", "lang_detect_result", "response"]

df = df[keep_col]

Dataset.from_pandas(df).select_columns(keep_col).push_to_hub("lightblue/tagengo-gpt4", private=True)
```

# How to cite

Please cite [this paper](https://arxiv.org/abs/2405.12612) when referencing this dataset.

```tex
@article{devine2024tagengo,
  title={Tagengo: A Multilingual Chat Dataset},
  author={Devine, Peter},
  journal={arXiv preprint arXiv:2405.12612},
  year={2024}
}
```

# Developer

Peter Devine - ([ptrdvn](https://huggingface.co/ptrdvn))
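
The near-duplicate removal in step 8 of the prompt selection can be illustrated with a minimal sketch. It assumes unit-length embedding vectors and a greedy strategy that keeps the first prompt of any pair whose dot product exceeds 0.8. The source only says that one of each such pair is removed, so the greedy order, the function name `drop_near_duplicates`, and the toy two-dimensional vectors (standing in for BAAI/bge-m3 embeddings) are all assumptions made for illustration.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def drop_near_duplicates(
    embeddings: Sequence[Sequence[float]], threshold: float = 0.8
) -> List[int]:
    """Return indices of prompts to keep (hypothetical helper).

    A prompt is kept only if its dot product with every already-kept
    prompt is at most `threshold`; otherwise it counts as a near duplicate.
    """
    kept: List[int] = []
    for i, emb in enumerate(embeddings):
        if all(dot(emb, embeddings[j]) <= threshold for j in kept):
            kept.append(i)
    return kept


# Toy unit vectors standing in for real prompt embeddings (illustrative only).
toy_embeddings = [
    [1.0, 0.0],    # kept: nothing to compare against yet
    [0.96, 0.28],  # dot with index 0 is 0.96 > 0.8, so dropped
    [0.0, 1.0],    # dot with index 0 is 0.0, so kept
    [0.28, 0.96],  # dot with index 2 is 0.96 > 0.8, so dropped
]

kept_indices = drop_near_duplicates(toy_embeddings)
print(kept_indices)
```

This pairwise loop is quadratic in the number of prompts. That is acceptable for a sketch, but a per-language matrix product would likely be preferable at the scale described above.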

Provider:

lightblue

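Summary of original information

Dataset overview

Basic dataset information

License: Apache-2.0
Dataset size: 296,074,438 bytes
Download size: 164,269,680 bytes
Train split size: 296,074,438 bytes, containing 78,057 examples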

Dataset features

conversation_id: string
conversations: list, with the following fields:
- from: string
- value: string
language: string
lang_detect_result: struct, with the following fields:
- lang: string
- score: float (float64)
response: sequence of strings
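
To make the schema concrete, here is a small sketch of one record and a helper that pulls out the prompt and the reply. The record contents are invented for illustration and are not taken from the dataset. The `split_turns` helper is hypothetical, and the two-element `response` value assumes the `[output_text, finish_reason]` layout produced by the generation script in the dataset card.

```python
from typing import Any, Dict, Tuple

# Hypothetical record following the feature schema listed above.
sample_record: Dict[str, Any] = {
    "conversation_id": "example-0001",  # invented ID
    "conversations": [
        {"from": "human", "value": "Bonjour, comment ça va ?"},
        {"from": "gpt", "value": "Je vais bien, merci !"},
    ],
    "language": "French",
    "lang_detect_result": {"lang": "fr", "score": 0.99},
    # Assumed layout: [output_text, finish_reason].
    "response": ["Je vais bien, merci !", "stop"],
}


def split_turns(record: Dict[str, Any]) -> Tuple[str, str]:
    """Return (human_prompt, gpt_reply) from a single-turn record."""
    by_role = {turn["from"]: turn["value"] for turn in record["conversations"]}
    return by_role["human"], by_role["gpt"]


prompt, reply = split_turns(sample_record)
print(prompt)
print(reply)
```

The same helper should work on rows loaded with `datasets.load_dataset("lightblue/tagengo-gpt4", split="train")`, provided the rows follow the schema above.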

Language distribution

The dataset covers 74 languages. The number of conversations per language is as follows:
- English: 15,771
- Portuguese: 12,564
- Spanish: 8,318
- Russian: 8,056
- Italian: 7,063
- German: 5,739
- French: 5,369
- Chinese: 5,338
- Japanese: 2,521
- Korean: 1,609
- Polish: 1,090
- Arabic: 789
- Vietnamese: 429
- Turkish: 406
- Dutch: 383
- Ukrainian: 323
- Greek: 308
- Swedish: 256
- Indonesian: 240
- Hungarian: 214
- Persian: 184
- Czech: 179
- Thai: 133
- Hebrew: 120
- Finnish: 92
- Catalan: 73
- Romanian: 71
- Danish: 67
- Bulgarian: 56
- Bangla: 29
- Norwegian: 26
- Tagalog: 22
- Latvian: 22
- Hindi: 20
- Estonian: 18
- Esperanto: 17
- Slovak: 17
- Croatian: 11
- Lithuanian: 11
- Slovenian: 10
- Basque: 6
- Serbian: 6
- Mongolian: 6
- Sinhala: 5
- Icelandic: 5
- Malay: 5
- Macedonian: 5
- Tamil: 5
- Albanian: 5
- Latin: 4
- Azerbaijani: 4
- Urdu: 3
- Amharic: 3
- Armenian: 3
- Afrikaans: 2
- Uyghur: 2
- Burmese: 2
- Kazakh: 2
- Yiddish: 2
- Waray: 2
- Malayalam: 2
- Belarusian: 2
- Tibetan: 1
- Lao: 1
- Turkmen: 1
- Kannada: 1
- Georgian: 1
- Sanskrit: 1
- Khmer: 1
- Breton: 1
- Odia: 1
- Luxembourgish: 1
- Marathi: 1
- Uzbek: 1
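
The counts above are heavily skewed toward a few languages. The short sketch below quantifies that skew using the eight largest counts from the list and the total of 78,057 examples. The variable and function names are illustrative only.

```python
# Total number of examples in the train split, as reported above.
TOTAL_EXAMPLES = 78_057

# The eight largest language counts, copied from the list above.
top_counts = {
    "English": 15_771,
    "Portuguese": 12_564,
    "Spanish": 8_318,
    "Russian": 8_056,
    "Italian": 7_063,
    "German": 5_739,
    "French": 5_369,
    "Chinese": 5_338,
}


def share(count: int, total: int = TOTAL_EXAMPLES) -> float:
    """Fraction of the dataset represented by `count` conversations."""
    return count / total


top_total = sum(top_counts.values())
top_share = share(top_total)

for language, count in top_counts.items():
    print(f"{language}: {share(count):.1%}")
print(f"Top 8 combined: {top_total} conversations ({top_share:.1%})")
```

By this calculation the eight largest languages cover roughly 87% of the conversations, so the remaining 66 languages share about 13%.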


The summary above was collected and generated by 遇见数据集.
