five

ML Classifiers, Human-Tagged Datasets, and Validation Code for Structured LLM-Generated Event Messaging: BERT, Keras, XGBoost, and Ensemble Methods

收藏
Mendeley Data2026-04-18 收录
下载链接:
https://data.mendeley.com/datasets/g2sdzmssgh
下载链接
链接失效反馈
官方服务:
资源简介:
This dataset and code package supports the reproducible evaluation of Structured Large Language Model (LLM)-generated event messaging using multiple machine learning classifiers, including BERT (via TensorFlow/Keras), XGBoost, and ensemble methods. This supports published article: C. J. Lynch et al., "AI-Generated Messaging for Life Events Using Structured Prompts: A Comparative Study of GPT With Human Experts and Machine Learning," IEEE Access, vol. 13, 2025, https://ieeexplore.ieee.org/abstract/document/11129070. Abstract: Structured Narrative Prompting was applied to generate life-event messages from LLMs, followed by human annotation, and machine learning validation. This release provides complete transparency for reproducing reported metrics and facilitates further benchmarking in multilingual or domain-specific contexts. Including: * Tagged datasets (.csv): human-tagged gold labels for evaluation * Untagged datasets (.csv): raw data with Prompt matched to corresponding LLM-generated narrative - Suitable for inference, semi-automatic labeling, or transfer learning * Python and R code for preprocessing, model training, evaluation, and visualization * Configuration files and environment specifications to enable end-to-end reproducibility Value of the Data: * Enables direct replication of published results across BERT, Keras-based models, XGBoost, and ensemble classifiers. * Provides clean, human-tagged datasets suitable for training, evaluation, and bias analysis. * Offers untagged datasets for new annotation or domain adaptation. * Contains full preprocessing, training, and visualization code in Python and R for flexibility across workflows. * Facilitates extension into other domains (e.g., multilingual LLM messaging validation). Data Description: * /data/tagged/*.csv – Human-labeled datasets with schema defined in data_dictionary.csv. * /data/untagged/*.csv – Clean datasets without labels for inference or annotation. * /code/python/ – Python scripts for preprocessing, model training (BERT, Keras DNN, XGBoost), ensembling, evaluation metrics, and plotting. * /code/r/ – R scripts for exploratory data analysis, statistical testing, and replication of key figures/tables. File Formats: * Data: CSV (UTF-8, RFC 4180) * Code: .py, .R, .Rproj Ethics & Licensing * All data are de-identified and contain no PII. * Released under CC BY 4.0 (data) and MIT License (code). Limitations * Labels reflect annotator interpretations and may encode bias. * Models trained on English text; generalization to other languages requires adaptation. Funding Note * Funding sources provided time in support of human taggers annotating the data sets. Initial PrePrint available at: [Lynch, Christopher, Erik Jensen, Ross Gore, et al. "AI-Generated Messaging for Life Events Using Structured Prompts: A Comparative Study of GPT With Human Experts and Machine Learning." TechRxiv (2025), DOI: https://doi.org/10.36227/techrxiv.174123588.85605769/v1].
创建时间:
2025-09-25
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作