cyzgab/singlish-to-english-synthetic
Singlish to English 🇸🇬
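Dataset Overview
- License: cc-by-nc-sa-4.0
- Task category: Translation
- Language: English
- Dataset name: Singlish to English 🇸🇬
- Dataset size: n<1K
Data Content
- Data source: Generated by GPT-4
- Data format: Each JSON pair contains a Singlish sentence about an everyday activity and its English translation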
Example Entry
```json
{ "singlish": "Eh, chop the garlic - you can a not?", "english": "Hey, do you know how to chop the garlic?" }
```
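As a quick illustration of the entry format, the sketch below parses one entry and checks that both text fields are present. The field names `singlish` and `english` come from the example entry; the helper `parse_entry` is a hypothetical name introduced here and is not part of the dataset's own tooling.

```python
import json

# Field names taken from the example entry above.
REQUIRED_KEYS = ("singlish", "english")


def parse_entry(raw):
    """Parse one JSON entry and check that both fields are non-empty strings.

    Hypothetical helper for illustration only.
    """
    entry = json.loads(raw)
    if not isinstance(entry, dict):
        raise ValueError("entry must be a JSON object")
    for key in REQUIRED_KEYS:
        value = entry.get(key)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"missing or empty field: {key!r}")
    return entry


raw = (
    '{ "singlish": "Eh, chop the garlic - you can a not?", '
    '"english": "Hey, do you know how to chop the garlic?" }'
)
entry = parse_entry(raw)
print(entry["singlish"], "->", entry["english"])
```

A check like this may be useful before training on the pairs, since synthetic data can occasionally contain empty or malformed records.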
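Data Generation Code
- Generation method: A Python script uses OpenAI's GPT-4 model to generate Singlish-to-English translation pairs
- Activity list: Covers a range of everyday activities, such as cooking, studying, and sleeping
- Generation steps:
- Iterate over the activity list and generate 10 Singlish-to-English translation pairs for each activity
- Save the generated data as a JSON file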
- Convert the JSON file to CSV format

```python
import json

import pandas as pd
from openai import OpenAI

client = OpenAI()

NUM_SAMPLE = 10  # translation pairs requested per activity

ACTIVITIES = [
    "cooking", "studying", "sleeping", "eating", "working", "exercising",
    "reading", "cleaning", "shopping", "driving", "walking", "bathing",
    "going to work", "listening to music", "watching TV", "playing video games",
    "using a computer", "texting", "socializing", "meditating", "commuting",
    "doing laundry", "ironing clothes", "dusting", "vacuuming", "painting",
    "drawing", "grocery shopping", "sewing", "taking a nap", "jogging", "biking",
    "swimming", "playing sports", "checking emails", "playing with children",
    "watching movies", "playing board games", "attending school or classes",
    "going to the gym", "playing a musical instrument", "singing", "dancing",
    "writing", "photography", "traveling", "visiting friends", "attending events",
    "volunteering", "attending meetings",
]

dataset = {}

for index, activity in enumerate(ACTIVITIES):
    print(index, activity)
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[
            {
                "role": "system",
                "content": "You are an expert in translating Singlish to English",
            },
            {
                "role": "user",
                "content": (
                    f"Create {NUM_SAMPLE} random Singlish (s) to English (e) translation pairs in json. "
                    f"Write full sentences about {activity}. "
                    "Dont exaggerate the use of Singlish, and be natural, as how a real Singaporean would speak. "
                    f"Start the keys from {(index*NUM_SAMPLE)+1}. "
                    "For example, "
                    f"{{X:{{s: aiyo, why like that, e: oh my, how did this happen}}, ..., "
                    f"X+5: {{s: dont play play, e: dont fool around}} }}"
                ),
            },
        ],
        temperature=0.01,
        response_format={"type": "json_object"},
    )
    # The model returns a JSON object keyed by pair number; merge it into the dataset.
    output = response.choices[0].message.content
    output_json = json.loads(output)
    dataset.update(output_json)

# Save the merged pairs as a JSON file.
with open("singlish_to_english_v0.1.json", "w") as f:
    json.dump(dataset, f, indent=None)

# Convert the JSON file to CSV: one row per pair.
df = pd.read_json("singlish_to_english_v0.1.json")
df = df.T
df = df.reset_index()
df.columns = ["index", "singlish", "english"]
df.to_csv("singlish_to_english_v0.1.csv", index=False)
```
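The JSON-to-CSV step can also be sketched with only the standard library, which makes the reshaping explicit. This is a minimal sketch, assuming the saved JSON has the shape requested in the prompt, that is `{"1": {"s": ..., "e": ...}, ...}`; the model's actual output may deviate from that shape. The helper names `pairs_to_rows` and `rows_to_csv` are hypothetical and do not appear in the original script.

```python
import csv
import io

FIELDNAMES = ["index", "singlish", "english"]  # same column names as the pandas version


def pairs_to_rows(dataset):
    """Turn {"1": {"s": ..., "e": ...}} into a list of row dicts sorted by numeric key.

    Assumes every key is an integer-like string and every value has "s" and "e".
    """
    rows = []
    for key in sorted(dataset, key=int):
        pair = dataset[key]
        rows.append({"index": int(key), "singlish": pair["s"], "english": pair["e"]})
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Two pairs borrowed from the prompt's own example, deliberately out of order.
sample = {
    "2": {"s": "dont play play", "e": "dont fool around"},
    "1": {"s": "aiyo, why like that", "e": "oh my, how did this happen"},
}
rows = pairs_to_rows(sample)
print(rows_to_csv(rows))
```

Note that `dataset.update(...)` in the generation loop silently overwrites any pair whose key repeats, so comparing `len(rows)` against `NUM_SAMPLE * len(ACTIVITIES)` is one way to notice keys the model numbered incorrectly.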




