Tang Poetry Sentence-Pair Dataset for Metrical Imitation

Name: Tang Poetry Sentence-Pair Dataset for Metrical Imitation
Creator: Science Data Bank
Published: 2026-03-20 02:18:10
License: No description available

Source: DataCite Commons; updated 2026-03-20; indexed 2026-05-05

Download link:

https://www.scidb.cn/detail?dataSetId=dbddeca801884c44b28a49d56dc09c14

Description:

Large language models (LLMs) consistently exhibit three failure modes when imitating Tang poetry: metrical irregularity, defective antithesis, and repetitive imagery. To address these systematic deficiencies, this study introduces a parallel corpus purpose-built for metrical imitation of Tang poetry. Sourced from the Complete Tang Poems, the corpus covers four canonical prosodic forms (five-character quatrains, five-character regulated verses, seven-character quatrains, and seven-character regulated verses) and comprises 17,960 entries in total. Each entry pairs an original couplet with a manually composed imitation. All imitations were composed by Classical Chinese specialists and passed a three-stage verification procedure covering meter, antithesis, and imagery; a non-overlap constraint further requires that the core content words of each imitation share no lexical items with the source couplet. In fine-tuning experiments on Qwen3-8B and GLM4-9B, models trained on the dataset nearly double the baseline metrical accuracy (from 32.8% to 62.8% for Qwen3-8B) while also improving antithesis conformity and imagery novelty. The corpus serves both as a supervised training resource for metrical poetry generation models and as a benchmark for studying text generation under strict formal constraints.
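The non-overlap constraint described above can be illustrated with a minimal Python sketch. The listing does not say how "core content words" are identified or what the file schema looks like, so this sketch assumes a character-level comparison as a rough proxy. The names `FUNCTION_CHARS`, `shared_content_chars`, and `satisfies_non_overlap` are hypothetical, and the two imitation lines are made up for illustration rather than taken from the corpus.

```python
# Hypothetical stop-list of function characters that are NOT treated as
# "core content words". The dataset's real criterion is not documented here.
FUNCTION_CHARS = frozenset("之乎者也而其于以")


def shared_content_chars(source, imitation, ignore=FUNCTION_CHARS):
    """Return the set of content characters that occur in both couplets.

    Assumption: a "lexical item" is approximated by a single CJK character.
    Punctuation and anything outside U+4E00..U+9FFF is skipped.
    """
    def content(text):
        return {
            ch for ch in text
            if "\u4e00" <= ch <= "\u9fff" and ch not in ignore
        }

    return content(source) & content(imitation)


def satisfies_non_overlap(source, imitation):
    """True when the imitation reuses no content character from the source."""
    return not shared_content_chars(source, imitation)


if __name__ == "__main__":
    source = "白日依山尽，黄河入海流"        # well-known Tang couplet
    ok_imitation = "青峰倚云立，碧水向东行"  # invented example, no reuse
    bad_imitation = "红日照山明，清河绕郭流"  # invented example, reuses characters

    print(satisfies_non_overlap(source, ok_imitation))
    print(sorted(shared_content_chars(source, bad_imitation)))
```

A word-level check would need a Classical Chinese segmenter; the character-level proxy is stricter than that and is used here only because it needs no external resources.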

Provider:

Science Data Bank

Created:

2026-01-21
