ML Classifiers, Human-Tagged Datasets, and Validation Code for Structured LLM-Generated Event Messaging: BERT, Keras, XGBoost, and Ensemble Methods

Mendeley Data, indexed 2026-04-18

Download link:

https://data.mendeley.com/datasets/g2sdzmssgh

Description:

This dataset and code package supports the reproducible evaluation of structured Large Language Model (LLM)-generated event messaging using multiple machine learning classifiers, including BERT (via TensorFlow/Keras), XGBoost, and ensemble methods. It accompanies the published article: C. J. Lynch et al., "AI-Generated Messaging for Life Events Using Structured Prompts: A Comparative Study of GPT With Human Experts and Machine Learning," IEEE Access, vol. 13, 2025, https://ieeexplore.ieee.org/abstract/document/11129070.

Abstract: Structured Narrative Prompting was applied to generate life-event messages from LLMs, followed by human annotation and machine learning validation. This release provides the data and code needed to reproduce the reported metrics and facilitates further benchmarking in multilingual or domain-specific contexts.

Including:
* Tagged datasets (.csv): human-tagged gold labels for evaluation
* Untagged datasets (.csv): raw data with each prompt matched to its corresponding LLM-generated narrative
  - Suitable for inference, semi-automatic labeling, or transfer learning
* Python and R code for preprocessing, model training, evaluation, and visualization
* Configuration files and environment specifications to enable end-to-end reproducibility

Value of the Data:
* Enables direct replication of published results across BERT, Keras-based models, XGBoost, and ensemble classifiers.
* Provides clean, human-tagged datasets suitable for training, evaluation, and bias analysis.
* Offers untagged datasets for new annotation or domain adaptation.
* Contains full preprocessing, training, and visualization code in Python and R for flexibility across workflows.
* Facilitates extension into other domains (e.g., multilingual LLM messaging validation).

Data Description:
* /data/tagged/*.csv: Human-labeled datasets with schema defined in data_dictionary.csv.
* /data/untagged/*.csv: Clean datasets without labels for inference or annotation.
* /code/python/: Python scripts for preprocessing, model training (BERT, Keras DNN, XGBoost), ensembling, evaluation metrics, and plotting.
* /code/r/: R scripts for exploratory data analysis, statistical testing, and replication of key figures/tables.

File Formats:
* Data: CSV (UTF-8, RFC 4180)
* Code: .py, .R, .Rproj

Ethics & Licensing:
* All data are de-identified and contain no PII.
* Released under CC BY 4.0 (data) and MIT License (code).

Limitations:
* Labels reflect annotator interpretations and may encode bias.
* Models were trained on English text; generalization to other languages requires adaptation.

Funding Note:
* Funding sources provided time in support of the human taggers who annotated the datasets.

Initial preprint: Lynch, Christopher, Erik Jensen, Ross Gore, et al. "AI-Generated Messaging for Life Events Using Structured Prompts: A Comparative Study of GPT With Human Experts and Machine Learning." TechRxiv (2025), DOI: https://doi.org/10.36227/techrxiv.174123588.85605769/v1
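As a rough illustration of how the tagged CSVs might be used to score an ensemble against the human gold labels, the sketch below reads an RFC 4180 CSV, combines three sets of per-model predictions by majority vote, and computes accuracy and F1. Everything specific here is an assumption for illustration only: the column names `narrative` and `human_label`, the label values `valid`/`invalid`, the toy rows, and the prediction lists are placeholders, not the package's actual schema (which is defined in data_dictionary.csv) or its actual ensembling method.

```python
import csv
import io
from collections import Counter

# Hypothetical column names; the real schema lives in data_dictionary.csv.
TEXT_COL, GOLD_COL = "narrative", "human_label"


def read_tagged(stream):
    """Read a CSV stream (UTF-8, RFC 4180) into a list of row dicts."""
    return list(csv.DictReader(stream))


def majority_vote(*model_preds):
    """Combine per-model label lists; ties go to the earliest model's vote."""
    combined = []
    for votes in zip(*model_preds):
        counts = Counter(votes)
        top = max(counts.values())
        combined.append(next(v for v in votes if counts[v] == top))
    return combined


def accuracy(gold, pred):
    """Fraction of predictions that match the gold labels."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f1(gold, pred, positive):
    """F1 score for one label treated as the positive class."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy in-memory rows standing in for a file under /data/tagged/ (made up).
SAMPLE = io.StringIO(
    "narrative,human_label\n"
    '"Congratulations, and good luck in the new role.",valid\n'
    "Placeholder text that ignores the prompt.,invalid\n"
    "Wishing you a smooth move to your new home.,valid\n"
    "Happy retirement after many years of service.,valid\n"
)

rows = read_tagged(SAMPLE)
gold = [row[GOLD_COL] for row in rows]

# Placeholder predictions; in practice these would come from trained models.
bert_preds = ["valid", "invalid", "invalid", "valid"]
dnn_preds = ["valid", "valid", "valid", "valid"]
xgb_preds = ["invalid", "invalid", "valid", "valid"]

ensemble = majority_vote(bert_preds, dnn_preds, xgb_preds)
print("BERT accuracy:", accuracy(gold, bert_preds))
print("Ensemble accuracy:", accuracy(gold, ensemble))
print("Ensemble F1 (valid):", f1(gold, ensemble, "valid"))
```

To run this against the real data, one would replace `SAMPLE` with a file opened via `open(path, newline="", encoding="utf-8")` and substitute the column and label names listed in data_dictionary.csv.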

Created:

2025-09-25
