TibAdD-D2Std ST: Tibetan Amdo Dialect Style Transformation Dataset
Source: DataCite Commons | Updated: 2026-04-30 | Indexed: 2026-05-05
Download link:
https://www.scidb.cn/detail?dataSetId=16451d95f4034afca2665e4113908a72
Resource description:
The dataset comprises 60,299 parallel sentence pairs of spoken and written Amdo Tibetan, totaling 28.7 MB. Of these, 15,000 pairs come from public data held by the laboratory, and the remaining 45,299 were collected from public live broadcasts on short-video platforms. The data was preprocessed, manually annotated, and otherwise processed according to standard data-processing rules, and then anonymized, yielding high-quality parallel data for the Amdo dialect of Tibetan. The dataset is stored as a single JSON array. Each element is an independent JSON object that aligns one spoken-language text with its standard written-language counterpart and contains four fields: unique ID, spoken language, written language, and data source. This layout facilitates program parsing, batch processing, and subsequent style transformation modeling.
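The JSON layout described above could be read along the following lines. This is a minimal sketch, not the dataset's official loader: the description names the four fields but not their JSON keys, so the keys `id`, `spoken`, `written`, and `source` are hypothetical, as are the placeholder sample values and the helper names `load_pairs` and `count_by_source`.

```python
import json
from collections import Counter

# Hypothetical key names: the description lists four fields (unique ID,
# spoken language, written language, data source) but not their JSON keys.
ASSUMED_KEYS = ("id", "spoken", "written", "source")


def load_pairs(text, keys=ASSUMED_KEYS):
    """Parse a JSON array of alignment entries and check each has the expected keys."""
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    for i, rec in enumerate(records):
        missing = [k for k in keys if k not in rec]
        if missing:
            raise ValueError(f"record {i} is missing keys: {missing}")
    return records


def count_by_source(records, source_key="source"):
    """Count how many entries come from each data source."""
    return Counter(rec[source_key] for rec in records)


# Tiny made-up sample standing in for the real file; all values are placeholders.
sample = json.dumps([
    {"id": 1, "spoken": "spoken text A", "written": "written text A", "source": "lab"},
    {"id": 2, "spoken": "spoken text B", "written": "written text B", "source": "live"},
    {"id": 3, "spoken": "spoken text C", "written": "written text C", "source": "live"},
])

pairs = load_pairs(sample)
print(len(pairs), dict(count_by_source(pairs)))
```

To use it on the downloaded file, read the file as UTF-8 text and pass the contents to `load_pairs`, after replacing `ASSUMED_KEYS` with the key names actually present in the data.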
Provider:
Science Data Bank
Created:
2026-04-30



