
gabrielchua/singlish-to-english-synthetic

Hugging Face · Updated 2024-01-14 · Indexed 2024-06-25
Download link:
https://hf-mirror.com/datasets/gabrielchua/singlish-to-english-synthetic
Resource description:
---
license: cc-by-nc-sa-4.0
task_categories:
- translation
language:
- en
pretty_name: Singlish to English 🇸🇬
size_categories:
- n<1K
---

# Singlish to English 🇸🇬

> Singapore is known for its efficiency and Singlish is no different - it's colourful and snappy.
> - [Tessa Wong, BBC News, 2015](https://www.bbc.com/news/magazine-33809914)

This is a synthetic dataset generated by GPT-4. Each JSON pair contains one Singlish sentence about an everyday activity (e.g. cooking) and its English translation.

# Sample entry

```json
{
  "singlish": "Eh, chop the garlic - you can a not?",
  "english": "Hey, do you know how to chop the garlic?"
}
```

# Data Generation Code

```python
import json

import pandas as pd
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

NUM_SAMPLE = 10  # number of pairs requested per activity
ACTIVITIES = [
    'cooking', 'studying', 'sleeping', 'eating', 'working', 'exercising',
    'reading', 'cleaning', 'shopping', 'driving', 'walking', 'bathing',
    'going to work', 'listening to music', 'watching TV',
    'playing video games', 'using a computer', 'texting', 'socializing',
    'meditating', 'commuting', 'doing laundry', 'ironing clothes', 'dusting',
    'vacuuming', 'painting', 'drawing', 'grocery shopping', 'sewing',
    'taking a nap', 'jogging', 'biking', 'swimming', 'playing sports',
    'checking emails', 'playing with children', 'watching movies',
    'playing board games', 'attending school or classes', 'going to the gym',
    'playing a musical instrument', 'singing', 'dancing', 'writing',
    'photography', 'traveling', 'visiting friends', 'attending events',
    'volunteering', 'attending meetings',
]

dataset = {}

for index, activity in enumerate(ACTIVITIES):
    print(index, activity)
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[
            {"role": "system",
             "content": "You are an expert in translating Singlish to English"},
            # The prompt pieces below are adjacent string literals, so they are
            # joined exactly as written, with no separator between sentences.
            {"role": "user",
             "content": (
                 f"Create {NUM_SAMPLE} random Singlish (s) to English (e) translation pairs in json. Write full sentences about {activity}."
                 f"Don't exaggerate the use of Singlish, and be natural, as how a real Singaporean would speak."
                 f"Start the keys from {(index*NUM_SAMPLE)+1}. For example,"
                 "{'X':{'s': 'aiyo, why like that', 'e': 'oh my, how did this happen'}"
                 "..., 'X+5': {'s': 'don't play play', 'e': 'don't fool around'} }"
             )},
        ],
        temperature=0.01,
        response_format={"type": "json_object"},
    )
    output = response.choices[0].message.content
    output_json = json.loads(output)
    dataset.update(output_json)

    # Save the current state of the combined dictionary
    with open('singlish_to_english_v0.1.json', 'w') as f:
        json.dump(dataset, f, indent=None)

# Convert to tabular csv: one row per numbered pair
df = pd.read_json("singlish_to_english_v0.1.json")
df = df.T
df = df.reset_index()
df.columns = ["index", "singlish", "english"]
df.to_csv("singlish_to_english_v0.1.csv", index=False)
```
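The generation script merges each parsed response into `dataset` without checking its shape, and JSON mode only ensures the reply parses as JSON, not that it has the keys the prompt asked for. Below is a minimal sketch of a per-response check, assuming the nested `{"<number>": {"s": ..., "e": ...}}` layout that the prompt requests. `validate_batch` is a hypothetical helper and is not part of the original script.

```python
def validate_batch(batch, start, expected):
    """Return a list of problems found in one parsed model response.

    batch    -- dict produced by json.loads() on a single response
    start    -- first key the prompt asked for, i.e. index * NUM_SAMPLE + 1
    expected -- number of pairs requested (NUM_SAMPLE)
    """
    problems = []
    if len(batch) != expected:
        problems.append(f"expected {expected} pairs, got {len(batch)}")
    for offset in range(expected):
        key = str(start + offset)  # JSON object keys are always strings
        entry = batch.get(key)
        if not isinstance(entry, dict):
            problems.append(f"missing key {key}")
            continue
        for field in ("s", "e"):
            value = entry.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"key {key}: missing or empty '{field}'")
    return problems
```

One way to use it would be to call `validate_batch(output_json, index * NUM_SAMPLE + 1, NUM_SAMPLE)` before `dataset.update(output_json)` and skip or retry the activity when the returned list is not empty.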
Provided by:
gabrielchua
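Original information summary

Singlish to English 🇸🇬

Dataset overview

  • License: cc-by-nc-sa-4.0
  • Task category: translation
  • Language: en
  • Dataset size: n<1K
  • Dataset name: Singlish to English 🇸🇬

Dataset description

This is a synthetic dataset generated by GPT-4. Each JSON pair contains one Singlish sentence about an everyday activity and its English translation.

Sample entry

```json
{
  "singlish": "Eh, chop the garlic - you can a not?",
  "english": "Hey, do you know how to chop the garlic?"
}
```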
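The sample entry uses the field names `singlish` and `english`, while the generation prompt asks the model for the short keys `s` and `e` under numbered keys. The sketch below shows that renaming step with plain dictionaries, as an alternative to the pandas round trip. It assumes every numbered key holds both fields, and `to_records` is a hypothetical helper, not part of the original script.

```python
def to_records(raw):
    """Turn {"1": {"s": ..., "e": ...}, ...} into a list of flat records.

    Records are ordered by the numeric value of the key and use the same
    column names as the published CSV: index, singlish, english.
    """
    records = []
    for key in sorted(raw, key=int):
        pair = raw[key]
        records.append({
            "index": int(key),
            "singlish": pair["s"],
            "english": pair["e"],
        })
    return records
```

Each returned record has the same shape as the sample entry, plus the numeric `index` taken from the original key.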

