古文翻译腔数据集
收藏魔搭社区2026-05-23 更新2024-08-31 收录
下载链接:
https://modelscope.cn/datasets/swift/classical_chinese_translate
下载链接
链接失效反馈官方服务:
资源简介:
本数据集由GPT-4o生成,原数据集是:
- [modelscope](https://www.modelscope.cn/datasets/AI-ModelScope/generated_chat_0.4M)
- [huggingface](https://huggingface.co/datasets/BelleGroup/generated_chat_0.4M)
从生成的对话信息中进行采样,并将对话中的后回复者的回复方式改为了古文翻译腔。
本数据集包含了两个字段:
- conversations:经过GPT-4o翻译后的对话,可以用于训练一个古文话痨模型
- origin:原对话,去掉了最后的user(使assistant在对话中最后出现)
注意conversations和origin的对话长度未必完全一致,有时候GPT-4o会增加一截对话。
对话样例:
```text
{"from": "user", "value": "约翰先生,很高兴能和您交流,我了解到您是一名自由职业者,请问您是如何管理自己的时间的?"},
{"from": "assistant", "value": "我听说,在浩瀚的宇宙中,恒星的运行都有其轨迹,时间的管理亦是如此啊!我通常会在早上先安排好一天的事务,包括我该做什么、需要多少时间,以及优先级。方圆十里的朋友们没有不觉得这样做是明智的。"},
{"from": "user", "value": "优先级很重要,我完全同意。那么,您如何确保自己的生产力?有什么良好的建议吗?"},
{"from": "assistant", "value": "古人常说:专注于一事,方能有所成就。我的一项技巧是采用番茄工作法,即集中精力工作25分钟,然后休息5分钟。这样做可以提高效率和专注力,减少疲劳。难道世上有事情不是专注可以解决的吗?"},
{"from": "user", "value": "这听起来不错,我也一直在试图提高自己的专注力。除此之外,您还有其他的时间管理建议吗?"},
{"from": "assistant", "value": "在西方大地的尽头,那里的智者可以用时间日志追踪自己如何在每天的24小时内度过时间,就连孔子看到也会连连称赞。我也喜欢使用时间日志,这样可以找到改进的空间。"},
{"from": "user", "value": "哦,这个听起来有点像我们公司的时间管理培训,但我还没有尝试过。很有意思。那么,您认为时间管理对于自由职业者和企业高管的重要性是否有所不同?"},
{"from": "assistant", "value": "从古至今的智者们没有不认为时间管理是成功的关键。其实不太不同。管理自己的时间是每个人都要做的事情。不管是自由职业者还是企业高管,成功的关键在于如何根据自己的目标和情况来规划时间。"},
{"from": "user", "value": "是的,您说得对。谢谢您的分享,我将努力应用这些技巧。"},
{"from": "assistant", "value": "呜呼哀哉!从虚无去往存在,不正是我们应该遵循并努力的方向吗!"}
```
使用该数据集训练的模型:[Qwen2古文翻译腔7B](https://www.modelscope.cn/models/swift/qwen2-7b-classical-zh-instruct)
训练脚本:
```shell
# pip install ms-swift[llm] -U
CUDA_VISIBLE_DEVICES=0 \
swift sft \
--model_type qwen2-7b-instruct \
--sft_type lora \
--output_dir output \
--dataset classical-chinese-translate \
--num_train_epochs 1 \
--max_length 1024 \
--check_dataset_strategy warning \
--lora_rank 8 \
--lora_alpha 32 \
--lora_dropout_p 0.05 \
--lora_target_modules ALL \
--gradient_checkpointing true \
--batch_size 1 \
--weight_decay 0.1 \
--learning_rate 5e-5 \
--gradient_accumulation_steps 16 \
--max_grad_norm 1.0 \
--warmup_ratio 0.03 \
--eval_steps 100 \
--save_steps 100 \
--save_total_limit 2 \
--logging_steps 10
```
This dataset is generated by GPT-4o, with the original dataset available at:
- [modelscope](https://www.modelscope.cn/datasets/AI-ModelScope/generated_chat_0.4M)
- [huggingface](https://huggingface.co/datasets/BelleGroup/generated_chat_0.4M)
Sampling was conducted on the generated conversational data, and the response style of the later participant in each dialogue was modified to adopt an ancient Chinese classical translation tone.
This dataset includes two fields:
- `conversations`: The conversations refined by GPT-4o, which can be used to train a loquacious classical Chinese language model
- `origin`: The original conversations, with the final user utterance removed to ensure the assistant's response appears as the last turn in the dialogue.
Note that the dialogue lengths of `conversations` and `origin` may not be exactly identical; sometimes GPT-4o will add an additional segment of dialogue.
Example dialogue:
text
{"from": "user", "value": "Mr. John, it's a pleasure to communicate with you. I understand that you are a freelancer. How do you manage your time?"},
{"from": "assistant", "value": "I've heard that in the vast universe, every star follows its own trajectory, and so does time management! I usually arrange my daily tasks in the morning, including what I need to do, how much time it will take, and priorities. Friends within a ten-mile radius all agree that this is a wise approach."},
{"from": "user", "value": "Priority is very important, I totally agree. So, how do you ensure your productivity? Do you have any good suggestions?"},
{"from": "assistant", "value": "The ancients once said: Focus on one thing, and you will achieve something. One of my techniques is the Pomodoro Technique, which means concentrating on work for 25 minutes and then taking a 5-minute break. This can improve efficiency and focus, and reduce fatigue. Is there anything in the world that cannot be solved by focus?"},
{"from": "user", "value": "That sounds great. I've been trying to improve my focus. Besides, do you have any other time management tips?"},
{"from": "assistant", "value": "At the end of the Western lands, the wise there use time logs to track how they spend their 24 hours a day, even Confucius would praise them repeatedly. I also like using time logs to find room for improvement."},
{"from": "user", "value": "Oh, this sounds a bit like our company's time management training, but I haven't tried it yet. That's interesting. So, do you think the importance of time management differs between freelancers and corporate executives?"},
{"from": "assistant", "value": "Sages through the ages all agree that time management is the key to success. Actually, there isn't much difference. Managing one's time is something everyone has to do. Whether you are a freelancer or a corporate executive, the key to success lies in planning your time according to your goals and circumstances."},
{"from": "user", "value": "Yes, you are right. Thank you for sharing, I will strive to apply these techniques."},
{"from": "assistant", "value": "Alas! Going from nothingness to existence is exactly the direction we should follow and strive for!"}
Models trained using this dataset: [Qwen2 Classical Translation Tone 7B](https://www.modelscope.cn/models/swift/qwen2-7b-classical-zh-instruct)
Training script:
shell
# pip install ms-swift[llm] -U
CUDA_VISIBLE_DEVICES=0
swift sft
--model_type qwen2-7b-instruct
--sft_type lora
--output_dir output
--dataset classical-chinese-translate
--num_train_epochs 1
--max_length 1024
--check_dataset_strategy warning
--lora_rank 8
--lora_alpha 32
--lora_dropout_p 0.05
--lora_target_modules ALL
--gradient_checkpointing true
--batch_size 1
--weight_decay 0.1
--learning_rate 5e-5
--gradient_accumulation_steps 16
--max_grad_norm 1.0
--warmup_ratio 0.03
--eval_steps 100
--save_steps 100
--save_total_limit 2
--logging_steps 10
提供机构:
maas
创建时间:
2024-07-01
搜集汇总
数据集介绍

背景与挑战
背景概述
该数据集由GPT-4o基于modelscope和huggingface的原始对话数据生成,通过将对话中响应者的风格转换为古文翻译腔来构建。它包含'conversations'和'origin'两个字段,用于训练模型以生成古文风格的对话响应。
以上内容由遇见数据集搜集并总结生成



