ML Classifiers, Human-Tagged Datasets, and Validation Code for Structured LLM-Generated Event Messaging: BERT, Keras, XGBoost, and Ensemble Methods
Source: Mendeley Data. Indexed: 2026-04-18
Download link:
https://data.mendeley.com/datasets/g2sdzmssgh
Resource description:
This dataset and code package supports the reproducible evaluation of Structured Large Language Model (LLM)-generated event messaging using multiple machine learning classifiers, including BERT (via TensorFlow/Keras), XGBoost, and ensemble methods.
This package supports the published article:
C. J. Lynch et al., "AI-Generated Messaging for Life Events Using Structured Prompts: A Comparative Study of GPT With Human Experts and Machine Learning," IEEE Access, vol. 13, 2025, https://ieeexplore.ieee.org/abstract/document/11129070. Abstract: Structured Narrative Prompting was applied to generate life-event messages from LLMs, followed by human annotation and machine learning validation. This release provides complete transparency for reproducing the reported metrics and facilitates further benchmarking in multilingual or domain-specific contexts.
The package includes:
* Tagged datasets (.csv): human-tagged gold labels for evaluation
* Untagged datasets (.csv): raw data with Prompt matched to corresponding LLM-generated narrative
- Suitable for inference, semi-automatic labeling, or transfer learning
* Python and R code for preprocessing, model training, evaluation, and visualization
* Configuration files and environment specifications to enable end-to-end reproducibility
Value of the Data:
* Enables direct replication of published results across BERT, Keras-based models, XGBoost, and ensemble classifiers.
* Provides clean, human-tagged datasets suitable for training, evaluation, and bias analysis.
* Offers untagged datasets for new annotation or domain adaptation.
* Contains full preprocessing, training, and visualization code in Python and R for flexibility across workflows.
* Facilitates extension into other domains (e.g., multilingual LLM messaging validation).
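As an illustration of how predictions from several classifiers might be combined and scored against the human-tagged gold labels, here is a minimal sketch. It assumes plurality (majority) voting, which is only one common ensembling approach; this description does not state which ensemble method the package actually uses. The label values and model names are hypothetical placeholders, not taken from the datasets.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per item by plurality vote.

    Ties are broken in favour of the earliest model in the list, which is an
    arbitrary choice made for this sketch.
    """
    if not predictions_per_model:
        raise ValueError("need predictions from at least one model")
    n_items = len(predictions_per_model[0])
    if any(len(preds) != n_items for preds in predictions_per_model):
        raise ValueError("all models must predict the same number of items")

    combined = []
    for votes in zip(*predictions_per_model):
        counts = Counter(votes)
        top = max(counts.values())
        # Pick the first vote, in model order, that reaches the top count.
        combined.append(next(v for v in votes if counts[v] == top))
    return combined


def accuracy(predicted, gold):
    """Fraction of items whose predicted label equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty label list")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Hypothetical predictions from three classifiers on four messages.
bert_preds = ["valid", "invalid", "valid", "valid"]
keras_preds = ["valid", "valid", "valid", "invalid"]
xgb_preds = ["invalid", "invalid", "valid", "invalid"]
gold_labels = ["valid", "invalid", "valid", "valid"]

ensemble_preds = majority_vote([bert_preds, keras_preds, xgb_preds])
ensemble_accuracy = accuracy(ensemble_preds, gold_labels)
```

To reproduce the published metrics, use the ensembling and evaluation scripts shipped in the package rather than this sketch.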
Data Description:
* /data/tagged/*.csv – Human-labeled datasets with schema defined in data_dictionary.csv.
* /data/untagged/*.csv – Clean datasets without labels for inference or annotation.
* /code/python/ – Python scripts for preprocessing, model training (BERT, Keras DNN, XGBoost), ensembling, evaluation metrics, and plotting.
* /code/r/ – R scripts for exploratory data analysis, statistical testing, and replication of key figures/tables.
File Formats:
* Data: CSV (UTF-8, RFC 4180)
* Code: .py, .R, .Rproj
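Given the directory layout and CSV format described above, the data files could be read with Python's standard library alone. The following is a minimal sketch, not the package's own loader: the function name `load_csv_dir` is hypothetical, and no column names are assumed because the schema is defined in data_dictionary.csv.

```python
import csv
from pathlib import Path


def load_csv_dir(directory):
    """Read every .csv file in `directory` into a list of row dicts.

    Returns a dict mapping each file name to its rows. Column names come
    from each file's header row; none are hard-coded here.
    """
    tables = {}
    for path in sorted(Path(directory).glob("*.csv")):
        # newline="" lets the csv module handle line breaks inside quoted
        # fields itself, as RFC 4180 allows.
        with path.open(encoding="utf-8", newline="") as handle:
            tables[path.name] = list(csv.DictReader(handle))
    return tables
```

For example, `load_csv_dir("data/tagged")` would return the human-labeled tables and `load_csv_dir("data/untagged")` the unlabeled ones, assuming the archive is unpacked in the working directory. All values are read as strings, so numeric or categorical columns need converting according to the data dictionary.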
Ethics & Licensing:
* All data are de-identified and contain no PII.
* Released under CC BY 4.0 (data) and MIT License (code).
Limitations:
* Labels reflect annotator interpretations and may encode bias.
* Models trained on English text; generalization to other languages requires adaptation.
Funding Note:
* Funding supported the time that human taggers spent annotating the datasets.
Initial preprint available at:
Lynch, Christopher, Erik Jensen, Ross Gore, et al. "AI-Generated Messaging for Life Events Using Structured Prompts: A Comparative Study of GPT With Human Experts and Machine Learning." TechRxiv (2025). DOI: https://doi.org/10.36227/techrxiv.174123588.85605769/v1
Created:
2025-09-25



