Tibetan Chinese cross language summary dataset TiCLS
Source: DataCite Commons. Updated: 2025-04-27. Indexed: 2025-05-18.
Download link:
https://www.scidb.cn/detail?dataSetId=fc8d81681ddc4180a37f65437660bca2
Description:
The Tibetan-Chinese cross-lingual summarization dataset TiCLS contains 2,000 samples in JSON format. Each JSON file has two keys: text holds the news content in Tibetan (the source language), and summary holds the news summary in Chinese (the target language). The data was crawled from Tibetan news websites. To ensure data quality, irrelevant content such as news agency names, images, videos, image and video captions, and reporter names was removed during crawling, leaving only the main body of each news item. The Tibetan summaries were then translated into Chinese using existing, mature Tibetan-Chinese translation tools. To further improve quality, the authors evaluated the dataset for factual consistency, adequacy, and fluency of the summaries. After screening, 2,000 high-quality samples remained.
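The two-key layout described above can be read with a few lines of Python. This is only a minimal sketch: it assumes the download unpacks to a directory containing one JSON object per file, each with the keys text and summary. The directory layout, the function name load_ticls, and the path used in the usage note are hypothetical and are not documented by the dataset page.

```python
import json
from pathlib import Path

# The two keys named in the dataset description.
REQUIRED_KEYS = ("text", "summary")


def load_ticls(directory):
    """Load every *.json file in `directory` as one {"text", "summary"} sample.

    Assumption: one JSON object per file. The real archive layout may differ.
    """
    samples = []
    for path in sorted(Path(directory).glob("*.json")):
        # Read as UTF-8 so Tibetan and Chinese text decode correctly.
        with path.open(encoding="utf-8") as f:
            record = json.load(f)
        if not isinstance(record, dict):
            raise ValueError(f"{path.name} does not contain a JSON object")
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"{path.name} is missing keys: {missing}")
        # Keep only the two documented fields.
        samples.append({"text": record["text"], "summary": record["summary"]})
    return samples
```

Calling load_ticls("TiCLS") on a hypothetical unpacked folder would return a list of dictionaries, where the text value is the Tibetan article and the summary value is the Chinese summary. If the archive instead ships a single file holding a list of records, the loop would need to be adapted accordingly.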
Provider:
Science Data Bank
Created:
2024-01-22



