Tang Poetry Sentence-Pair Dataset for Metrical Imitation
Source: DataCite Commons | Updated: 2026-03-20 | Indexed: 2026-05-05
Download link:
https://www.scidb.cn/detail?dataSetId=dbddeca801884c44b28a49d56dc09c14
Description:
Large language models (LLMs) consistently exhibit three failure modes when imitating Tang poetry: metrical irregularity, defective antithesis, and imagery repetition. To address these systematic deficiencies, this study introduces a parallel corpus purpose-built for Tang poetry metrical imitation. Sourced from the Complete Tang Poems, the corpus covers four canonical prosodic forms (five-character quatrains, five-character regulated verses, seven-character quatrains, and seven-character regulated verses) and comprises 17,960 entries in total. Each entry pairs an original couplet with a manually composed imitation. All imitations were composed by Classical Chinese specialists and passed a three-stage verification procedure covering meter, antithesis, and imagery; a non-overlap constraint further requires that the core content words of each imitation share no lexical items with the source couplet. In fine-tuning experiments on Qwen3-8B and GLM4-9B, models trained on the dataset achieve nearly double the baseline metrical accuracy (from 32.8% to 62.8% for Qwen3-8B) while also improving antithesis conformity and imagery novelty. The corpus serves both as a supervised training resource for metrical poetry generation models and as a benchmark for investigating text generation under strict formal constraints.
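The non-overlap constraint described above can be illustrated with a minimal sketch. The description does not say how "core content words" are identified or how entries are stored, so everything here is an assumption: overlap is approximated at the character level, `FUNCTION_CHARS` is a hypothetical stoplist, `shared_content_chars` is an illustrative helper name, and the imitation strings are invented placeholders rather than dataset entries.

```python
from __future__ import annotations

# Hypothetical stoplist of function characters. The source does not define
# "core content words", so this set is an illustrative assumption only.
FUNCTION_CHARS = set("之乎者也而其於以")


def content_chars(line: str) -> set[str]:
    """Return the CJK characters in a line, minus the assumed function characters.

    Punctuation and whitespace fall outside the CJK Unified Ideographs
    range (U+4E00 to U+9FFF) and are therefore dropped.
    """
    return {
        ch
        for ch in line
        if "\u4e00" <= ch <= "\u9fff" and ch not in FUNCTION_CHARS
    }


def shared_content_chars(source: str, imitation: str) -> set[str]:
    """Return content characters that appear in both couplets.

    An empty result means the pair passes this character-level
    approximation of the non-overlap constraint.
    """
    return content_chars(source) & content_chars(imitation)


if __name__ == "__main__":
    source = "白日依山盡,黃河入海流"       # well-known Tang couplet
    candidate_ok = "明月隨雲遠,清江向晚行"  # invented placeholder
    candidate_bad = "白雲依嶺斷,黃葉滿山秋"  # invented placeholder

    print(shared_content_chars(source, candidate_ok))   # empty set: passes
    print(shared_content_chars(source, candidate_bad))  # shared characters: fails
```

A character-level check is stricter than a word-level one, since two different words may share a character; a word-level variant would need a Classical Chinese segmenter, which the description does not specify.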
Provider:
Science Data Bank
Created:
2026-01-21



