Data supporting the thesis “Exploring Hybrid Intelligence for Topic Interpretation in Colorectal Cancer Research: A Comparative Study of GPT-3.5 and Human Expertise”
Source: DataCite Commons. Updated: 2023-09-04. Indexed: 2024-07-03.
Download link:
https://data.4tu.nl/datasets/a7e63b3f-18f5-4ae4-8750-255528f82178/1
Resource description:
This thesis aims to bridge the gap between human and machine intelligence in interpreting colorectal cancer patient experiences extracted from patient web forums. The Computer Science thesis was carried out in collaboration with colorectal cancer experts from Erasmus MC. So that both these experts and GPT-3.5 could interpret patient experiences, nearly 300k patient forum posts were scraped from the American platform Cancer Survivors Network USA (Colorectal Cancer - Cancer Survivors Network). The Selenium webdriver was used to extract the page URLs of each discussion thread, and BeautifulSoup4 (bs4) was used to access those URLs and parse the HTML elements of each type of forum entry (main post, comment and reply), which were then stored in a local dataset. The dataset stores the following attributes for each entry: URL; username (the author of the post); userposts (the number of posts written by the author); time (when the post was made); title; and post (the text containing unstructured colorectal cancer patient experiences).
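The six attributes listed above map onto a flat record, so the "store them in a local dataset" step can be illustrated with a short sketch. This is not the thesis's actual code: the `ForumPost` class, the `write_posts` helper, and the sample values are hypothetical, and the format of the `time` field is an assumption because the description does not document it. The Selenium and bs4 extraction itself is left out, since the page selectors are not given.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields


@dataclass
class ForumPost:
    """One scraped forum entry (main post, comment, or reply)."""
    URL: str        # page URL of the discussion thread
    username: str   # author of the post
    userposts: int  # number of posts written by the author
    time: str       # when the post was made (format assumed, not documented)
    title: str
    post: str       # unstructured patient-experience text


# Column order follows the attribute order given in the description.
FIELDNAMES = [f.name for f in fields(ForumPost)]


def write_posts(posts, handle):
    """Write ForumPost records to an open text handle as CSV; return the row count."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    count = 0
    for p in posts:
        writer.writerow(asdict(p))
        count += 1
    return count


# Hypothetical sample record; every value here is invented for illustration.
sample = ForumPost(
    URL="https://example.org/thread/1",
    username="user_a",
    userposts=12,
    time="2021-05-03T10:15:00",
    title="Example title",
    post="Example text, with a comma and a\nline break.",
)

buffer = io.StringIO()
rows_written = write_posts([sample], buffer)
csv_text = buffer.getvalue()
```

Using the `csv` module rather than manual string joining matters here because free-text posts routinely contain commas, quotes, and line breaks, which the writer quotes automatically.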
Provider:
4TU.ResearchData
Created:
2023-09-04
Compiled summary
Dataset introduction

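Background and challenges
Background overview
This dataset supports a comparative study of hybrid intelligence for topic interpretation in colorectal cancer research. It contains nearly 300,000 patient forum posts scraped from the American Cancer Survivors Network platform and is used to analyze how human experts and GPT-3.5 differ in interpreting patient experiences. The data is stored in CSV format and spans 2000 to 2023. It falls within oncology and natural language processing and is suited to text analysis of colorectal cancer patient experiences and to research on interaction with artificial intelligence.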
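Because the summary describes the data as CSV covering 2000 to 2023, a quick sanity check is to count posts per year. The sketch below is only an illustration under stated assumptions: the column names follow the attribute list in the resource description, the `time` column is assumed to hold ISO 8601 timestamps (the real format is not documented here), and `posts_per_year` and the demo rows are hypothetical.

```python
import csv
import io
from collections import Counter
from datetime import datetime


def posts_per_year(handle, first_year=2000, last_year=2023):
    """Count rows per calendar year.

    Rows whose time is missing, unparseable, or outside the
    [first_year, last_year] range are skipped.
    """
    counts = Counter()
    for row in csv.DictReader(handle):
        try:
            # Assumption: ISO 8601 timestamps such as 2021-05-03T10:15:00.
            year = datetime.fromisoformat(row["time"]).year
        except (KeyError, TypeError, ValueError):
            continue
        if first_year <= year <= last_year:
            counts[year] += 1
    return dict(sorted(counts.items()))


# Tiny in-memory stand-in for the real file; all rows are invented.
demo_csv = (
    "URL,username,userposts,time,title,post\n"
    "https://example.org/t/1,user_a,12,2021-05-03T10:15:00,Title A,Text A\n"
    "https://example.org/t/2,user_b,3,2021-11-20T08:00:00,Title B,Text B\n"
    "https://example.org/t/3,user_c,7,1999-01-01T00:00:00,Title C,Text C\n"
    "https://example.org/t/4,user_d,1,not-a-date,Title D,Text D\n"
)
yearly = posts_per_year(io.StringIO(demo_csv))
```

With the downloaded file, the same function could be called on a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` is whatever local file name the download produces. If the real timestamps use another format, the parsing line would need `datetime.strptime` with the matching pattern.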



