Large-model fine-tuning dataset of five key elements of tourism resources in the five Northwestern provinces of China (2024)

Science Data Bank; updated 2025-08-31; indexed 2026-04-23
Download link:
https://www.scidb.cn/detail?dataSetId=02eb037b31ee4b3cb1f3d91da54508f4
Description:
With the wide application of large models across fields, the tourism industry increasingly needs high-quality datasets. This dataset focuses on textual data in the tourism domain and is designed to support fine-tuning of tourism-oriented large models, with the aim of improving their ability to understand and generate tourism-related information. Because the diversity and quality of the data are critical to model performance, the study combines web scraping with manual annotation, followed by data cleaning, denoising, and stopword removal, to ensure data quality and accuracy. Automated annotation tools are also used to generate instructions and run consistency checks on the texts. The LLM-Tourism dataset draws primarily on Ctrip and Baidu Baike and covers five Northwestern Chinese provinces: Gansu, Ningxia, Qinghai, Shaanxi, and Xinjiang. It contains 53,280 structured data pairs in JSON format. The dataset is intended both to improve the generation accuracy of tourism large models and to promote the sharing and application of tourism-related datasets in the large-model field.
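The cleaning steps named in the description (cleaning, denoising, stopword removal, packaging as JSON pairs) could look roughly like the Python sketch below. It is an illustration under stated assumptions, not the authors' pipeline: the key names `instruction` and `output`, the stopword list, and the sample text are all hypothetical, and the whitespace tokenizer only suits space-delimited text (Chinese text would need a word segmenter first).

```python
import json
import re

# Hypothetical stopword list; the dataset's actual list is not published here.
STOPWORDS = {"the", "a", "an", "of", "is"}


def clean_text(raw: str) -> str:
    """Strip HTML tags and collapse runs of whitespace into single spaces."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", no_tags).strip()


def remove_stopwords(text: str) -> str:
    """Drop whitespace-delimited tokens found in STOPWORDS (case-insensitive)."""
    return " ".join(t for t in text.split() if t.lower() not in STOPWORDS)


def make_pair(instruction: str, raw_answer: str) -> dict:
    """Build one record; the 'instruction'/'output' key names are assumed."""
    return {"instruction": instruction, "output": clean_text(raw_answer)}


# Illustrative sample only, not taken from the dataset.
record = make_pair(
    "Introduce the Mogao Caves.",
    "<p>The Mogao Caves  are located near\nDunhuang, Gansu.</p>",
)
print(json.dumps(record, ensure_ascii=False))
```

Passing `ensure_ascii=False` to `json.dumps` keeps Chinese characters readable in the serialized output instead of escaping them, which matters for a corpus sourced from Chinese-language sites.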
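Provided by:
Northwest Minzu University, Gansu Provincial Key Laboratory of Intelligent Processing of Ethnic Languages; lu bao qing; Northwest Minzu University, Key Laboratory of China's Ethnic Languages and Information Technology
Created: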
2024-09-30