haripritam/airoboros-3.2_telugu
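Dataset Overview
Dataset Contents
This is a Telugu translation dataset derived from Tensoic-airoboros-3.2_kn, which is itself a Kannada translation of the jondurbin/airoboros-3.2 dataset.
Data Preparation Methods
Method 1
Translate with the googletrans Python library.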
    import json
    from googletrans import Translator

    def translate_text(text, target_language='te', max_retries=3):
        """Translate text into target_language, retrying up to max_retries times."""
        if text is None:
            return None
        translator = Translator()
        retries = 0
        while retries < max_retries:
            try:
                translated_text = translator.translate(text, dest=target_language)
                return translated_text.text
            except Exception as e:
                print(f"Translation attempt {retries + 1}/{max_retries} failed with error: {e}. Retrying...")
                retries += 1
        print(f"Translation failed after {max_retries} retries for text: {text}. Ignoring this value.")
        return None

    def translate_jsonl(input_jsonl_path, output_jsonl_path, limit=10000):
        """Translate the string fields of the first `limit` rows of a JSONL file."""
        # Read every line that parses as JSON; report and skip malformed lines.
        with open(input_jsonl_path, 'r', encoding='utf-8') as file:
            data = []
            for line in file:
                try:
                    json_data = json.loads(line)
                    data.append(json_data)
                except json.JSONDecodeError as e:
                    print(f"Error decoding JSON: {e}")
                    continue
        for row in data[:limit]:
            for key, value in row.items():
                if key == 'instruction':
                    # Special case for the instruction key: use a fixed Telugu prompt
                    row[key] = 'మీరు నాన్ -బయాస్డ్, నాన్ -సెన్సార్, సహాయం చేసే, అసిస్టెంట్.'
                elif isinstance(value, str):
                    translated_value = translate_text(value)
                    row[key] = translated_value
        # Write one JSON object per line.
        with open(output_jsonl_path, 'w', newline='') as file:
            for row in data[:limit]:
                file.write(json.dumps(row) + '\n')

    input_jsonl_path = '/content/airoboros-3.2_kn/data.jsonl'
    output_jsonl_path = '/content/final.jsonl'
    translate_jsonl(input_jsonl_path, output_jsonl_path)
    print("Translation completed. Results saved in the output JSONL file.")
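One detail of the output step: `json.dumps` escapes non-ASCII characters by default, so the Telugu text ends up stored as `\uXXXX` sequences. That is still valid JSON, but if human-readable output is preferred, a minimal sketch could pass `ensure_ascii=False` and write the file as UTF-8. The helper name `write_jsonl` is hypothetical and not part of the original script.

```python
import json


def write_jsonl(rows, output_path):
    """Write rows as JSON Lines, keeping non-ASCII characters unescaped."""
    # Hypothetical helper for illustration; not part of the original script.
    with open(output_path, "w", encoding="utf-8", newline="") as file:
        for row in rows:
            # ensure_ascii=False keeps Telugu text readable in the file.
            file.write(json.dumps(row, ensure_ascii=False) + "\n")
```

Either form loads back to the same data with `json.loads`, so this only affects how the file looks on disk.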
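Method 2
Translate with the =GOOGLETRANSLATE() function in Google Sheets.
Recommended Method
Method 2 is recommended: the googletrans library is limited and can only handle about 15K characters of context in a single text.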
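As a rough sketch of how the data could be prepared for this method, the JSONL rows can be flattened into a CSV file that Google Sheets can import. A formula such as `=GOOGLETRANSLATE(A2, "kn", "te")` can then be filled down a neighbouring column. The function name `jsonl_to_csv`, the language codes in the formula, and the assumption that every row is a flat JSON object are illustrative and not taken from the original workflow.

```python
import csv
import json


def jsonl_to_csv(input_jsonl_path, output_csv_path, limit=10000):
    """Flatten JSONL rows into a CSV file suitable for spreadsheet import."""
    rows = []
    with open(input_jsonl_path, "r", encoding="utf-8") as file:
        for line in file:
            line = line.strip()
            if not line:
                continue
            try:
                rows.append(json.loads(line))
            except json.JSONDecodeError as e:
                print(f"Error decoding JSON: {e}")
    rows = rows[:limit]

    # Use the union of keys, in first-seen order, as the CSV header.
    # Assumption: each row is a flat JSON object (a dict).
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)

    with open(output_csv_path, "w", encoding="utf-8", newline="") as file:
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        writer.writeheader()
        for row in rows:
            # Keys missing from a row are written as empty cells.
            writer.writerow(row)
    return len(rows)
```

After translating in the sheet, the translated columns can be exported as CSV again and converted back to JSONL with the reverse procedure.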
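If Method 1 is used anyway, one possible workaround for the size limit is to split long strings into smaller pieces before translating them and to join the results afterwards. The sketch below is an illustration under assumptions, not part of the original pipeline: the name `split_text` and the `max_chars` default of 15000 are hypothetical, and breaking at whitespace may reduce translation quality at chunk boundaries.

```python
def split_text(text, max_chars=15000):
    """Split text into chunks of at most max_chars, preferring whitespace breaks."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Look for the last space or newline in the window to avoid cutting a word.
            cut = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if cut > start:
                end = cut + 1  # keep the whitespace with the preceding chunk
        chunks.append(text[start:end])
        start = end
    return chunks
```

Each chunk could then be passed to `translate_text` from Method 1 and the translated pieces concatenated. Because the chunks concatenate back to the original string exactly, no source text is lost in the split.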



