A Human-Rated Hospitality Review Benchmark for LLM-Generated Sentiment Quadruple Extraction
Source: DataONE. Updated: 2026-05-06. Indexed: 2026-05-19.
Download link:
https://search.dataone.org/view/sha256:08f3569c12fbd4e6a5af8a9900e5400f09f81b749e15f2a1eaa51ec04f8c8734
Description:
This dataset provides a human-rated benchmark for evaluating LLM-generated sentiment quadruples in hospitality reviews. It contains 40 hospitality reviews, 109 predicted–reference quadruple comparison pairs, gold reference annotations, LLM-generated outputs, exact-match F1 scores, Semantic-Aware Flexible Evaluation (SAFE) scores, and human ratings across three dimensions: output acceptability, semantic similarity to reference annotations, and perceived alignment with metric scoring behaviour. The benchmark supports evaluation of LLM-based Quad-ABSA systems, comparison of automatic evaluation metrics, analysis of exact-match versus semantic-aware scoring, and development of new metric baselines for structured sentiment analysis. Although originally constructed for validating SAFE, the dataset can be used independently to evaluate other LLM outputs and study cases where automatic metric scores diverge from human judgement.
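As a rough illustration of the exact-match F1 mentioned in the description, the sketch below scores a list of predicted quadruples against a list of reference quadruples, counting a prediction as correct only when all four elements match exactly. The tuple layout (aspect, category, opinion, sentiment), the function name `exact_match_f1`, and the example quadruples are assumptions made for illustration; they are not taken from the dataset, whose actual schema and scoring script may differ.

```python
from collections import Counter


def exact_match_f1(predicted, reference):
    """Return (precision, recall, f1) for two lists of quadruples.

    A predicted quadruple counts as a true positive only if an identical
    tuple appears in the reference list (multiset matching, so duplicates
    are not rewarded more than once).
    """
    pred_counts = Counter(predicted)
    ref_counts = Counter(reference)
    # Multiset intersection keeps the minimum count of each shared quadruple.
    true_positives = sum((pred_counts & ref_counts).values())

    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example data (assumed layout: aspect, category, opinion, sentiment).
reference = [
    ("room", "ROOM#CLEANLINESS", "spotless", "positive"),
    ("staff", "SERVICE#GENERAL", "rude", "negative"),
]
predicted = [
    ("room", "ROOM#CLEANLINESS", "spotless", "positive"),
    # Same meaning as the reference, but a different aspect span.
    ("front desk staff", "SERVICE#GENERAL", "rude", "negative"),
]

precision, recall, f1 = exact_match_f1(predicted, reference)
print(precision, recall, f1)
```

In this invented case the second prediction differs from the reference only in its aspect span, yet exact matching treats it as a complete miss. Cases like this are where a semantic-aware score and human ratings can be expected to diverge from exact-match F1.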
Created:
2026-05-09



