
DRD: Chinese Diplomatic Rhetoric Dataset for Supervised Fine-Tuning of Large Language Models

Source: DataCite Commons (updated 2025-12-19, indexed 2025-04-16)
Download link:
https://www.scidb.cn/detail?dataSetId=c626220792a1446d96b66861e955fed5
Description:
The dataset is derived from the regular press conferences held by spokespersons of China's Ministry of Foreign Affairs between 2000 and 2024. A total of 20,745 Q&A pairs were collected and curated to form the Chinese Diplomatic Rhetoric Dataset (DRD), which is intended for supervised fine-tuning of large language models. The aim is to provide a specialized dataset on diplomatic dialogue strategies so that existing Chinese language models can more accurately comprehend and respond to diplomatic discourse in an international context. The accompanying paper introduces the N-Jaccard text similarity algorithm, which reduces sensitivity to text length and takes word order into account, thereby capturing semantic relationships and logical connections more effectively. Using the GPT-3.5 Turbo large language model, core information was extracted from the data, converted into contextually appropriate topics, and refined to generate input prompts and output responses. Both machine and human reviewers audited and corrected the initially processed data to ensure the dataset's quality. Built on a large body of high-quality Q&A data, the DRD reflects the conciseness, accuracy, and artistry of diplomatic language.
Provider:
Science Data Bank
Created:
2024-08-05
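
Illustrative sketch (not from the dataset authors):
The description says the N-Jaccard algorithm takes word order into account but does not define it. The Python sketch below is therefore only an assumption about the general idea: it computes ordinary Jaccard similarity over sets of contiguous n-grams, so reordering the tokens changes the score. The function names `ngrams` and `ngram_jaccard` are hypothetical, and the sketch does not attempt the length-sensitivity adjustment mentioned in the description.

```python
def ngrams(tokens, n):
    """Return the set of contiguous n-token tuples found in `tokens`."""
    if len(tokens) < n:
        # Texts shorter than n contribute one tuple holding all their tokens.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_jaccard(a, b, n=2):
    """Jaccard similarity |A & B| / |A | B| over the n-gram sets of a and b.

    Assumption: this is plain n-gram Jaccard, used here to illustrate order
    sensitivity. It is not the N-Jaccard definition from the DRD paper.
    """
    set_a, set_b = ngrams(a, n), ngrams(b, n)
    if not set_a and not set_b:
        return 1.0  # two empty texts are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


# Same words, reversed order: identical as unigram sets, disjoint as bigrams.
forward = ["china", "supports", "peace"]
reverse = ["peace", "supports", "china"]
unigram_score = ngram_jaccard(forward, reverse, n=1)
bigram_score = ngram_jaccard(forward, reverse, n=2)
```

For Chinese text, the token lists could be characters (`list(text)`) or the output of a word segmenter; which one the authors used is not stated in the description.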