gabrielchua/singlish-to-english-synthetic
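Singlish to English 🇸🇬
Dataset overview
- License: cc-by-nc-sa-4.0
- Task category: translation
- Language: en
- Dataset size: n<1K
- Dataset name: Singlish to English 🇸🇬
Dataset description
This is a synthetic dataset generated by GPT-4. Each JSON pair contains a Singlish sentence about an everyday activity and its English translation.
Sample entry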
    {
      "singlish": "Eh, chop the garlic - you can a not?",
      "english": "Hey, do you know how to chop the garlic?"
    }
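To make the record shape concrete, here is a minimal sketch that parses one entry and checks that both fields are present. The helper `parse_entry` and the constant `REQUIRED_FIELDS` are hypothetical names introduced for illustration; they are not part of the dataset or its tooling.

```python
import json

# Field names taken from the sample entry; the helper itself is hypothetical.
REQUIRED_FIELDS = ("singlish", "english")


def parse_entry(raw: str) -> dict:
    """Parse one JSON record and check that both fields are non-empty strings."""
    entry = json.loads(raw)
    for field in REQUIRED_FIELDS:
        value = entry.get(field)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"missing or empty field: {field!r}")
    return entry


sample = (
    '{ "singlish": "Eh, chop the garlic - you can a not?", '
    '"english": "Hey, do you know how to chop the garlic?" }'
)
entry = parse_entry(sample)
print(entry["singlish"], "->", entry["english"])
```

A check like this can be useful before training on the pairs, since synthetic data may occasionally contain empty or malformed records.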
Data generation code
The code used to generate the dataset is shown below:
    import json

    import pandas as pd
    from openai import OpenAI

    client = OpenAI()

    NUM_SAMPLE = 10
    ACTIVITIES = [
        "cooking", "studying", "sleeping", "eating", "working", "exercising",
        "reading", "cleaning", "shopping", "driving", "walking", "bathing",
        "going to work", "listening to music", "watching TV",
        "playing video games", "using a computer", "texting", "socializing",
        "meditating", "commuting", "doing laundry", "ironing clothes",
        "dusting", "vacuuming", "painting", "drawing", "grocery shopping",
        "sewing", "taking a nap", "jogging", "biking", "swimming",
        "playing sports", "checking emails", "playing with children",
        "watching movies", "playing board games",
        "attending school or classes", "going to the gym",
        "playing a musical instrument", "singing", "dancing", "writing",
        "photography", "traveling", "visiting friends", "attending events",
        "volunteering", "attending meetings",
    ]

    dataset = {}

    for index, activity in enumerate(ACTIVITIES):
        print(index, activity)
        response = client.chat.completions.create(
            model="gpt-4-1106-preview",
            messages=[
                {"role": "system", "content": "You are an expert in translating Singlish to English"},
                {
                    "role": "user",
                    "content": (
                        f"Create {NUM_SAMPLE} random Singlish (s) to English (e) translation pairs in json. Write full sentences about {activity}."
                        f"Dont exaggerate the use of Singlish, and be natural, as how a real Singaporean would speak."
                        f"Start the keys from {(index*NUM_SAMPLE)+1}. For example,"
                        "{X:{s: aiyo, why like that, e: oh my, how did this happen}"
                        "..., X+5: {s: dont play play, e: dont fool around} }"
                    ),
                },
            ],
            temperature=0.01,
            response_format={"type": "json_object"},
        )
        output = response.choices[0].message.content
        output_json = json.loads(output)
        # Merge this activity's pairs into the combined dictionary
        dataset.update(output_json)

        # Save the current state of the combined dictionary
        with open("singlish_to_english_v0.1.json", "w") as f:
            json.dump(dataset, f, indent=None)

    # Convert to tabular csv
    df = pd.read_json("singlish_to_english_v0.1.json")
    df = df.T  # one row per key, with the s/e fields as columns
    df = df.reset_index()
    df.columns = ["index", "singlish", "english"]
    df.to_csv("singlish_to_english_v0.1.csv", index=False)
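The key offset and the final reshaping can be checked without calling the API. The sketch below assumes each response is a JSON object keyed by stringified integers, with `s` and `e` fields as the prompt requests. `fake_response` is a hypothetical stand-in for parsed model output, and the standard-library `csv` module is swapped in for pandas so the example has no third-party dependencies.

```python
import csv
import io

NUM_SAMPLE = 10


def start_key(index: int, num_sample: int = NUM_SAMPLE) -> int:
    """First key requested for the activity at position `index`."""
    return index * num_sample + 1


def fake_response(index: int, activity: str) -> dict:
    """Hypothetical stand-in for one parsed model response (assumed shape)."""
    first = start_key(index)
    return {
        str(first + i): {"s": f"singlish {activity} {i}", "e": f"english {activity} {i}"}
        for i in range(NUM_SAMPLE)
    }


dataset = {}
for index, activity in enumerate(["cooking", "studying", "sleeping"]):
    # dict.update overwrites any key that already exists
    dataset.update(fake_response(index, activity))

# Flatten {key: {"s": ..., "e": ...}} into rows with the final column names
rows = [
    {"index": int(key), "singlish": pair["s"], "english": pair["e"]}
    for key, pair in dataset.items()
]

# Write the rows as CSV into an in-memory buffer
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["index", "singlish", "english"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

print(len(dataset))
```

Because `dict.update` replaces existing keys, a response that ignores the requested start key and restarts at 1 would silently overwrite earlier pairs. Comparing `len(dataset)` with the number of pairs requested (50 activities × 10 pairs = 500 in the script above) is a simple way to detect that.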



