Data supporting the thesis “Exploring Hybrid Intelligence for Topic Interpretation in Colorectal Cancer Research: A Comparative Study of GPT-3.5 and Human Expertise”
Source: 4TU.ResearchData. Updated 2023-09-04; indexed 2026-04-23.
Download link:
https://data.4tu.nl/datasets/a7e63b3f-18f5-4ae4-8750-255528f82178/1
Description:
This thesis aims to bridge the gap between human and machine intelligence in interpreting colorectal cancer patient experiences extracted from patient web forums. It is a Computer Science thesis carried out in collaboration with colorectal cancer experts from Erasmus MC. So that both the human experts and GPT-3.5 could interpret these experiences, nearly 300k patient forum posts were scraped from the American platform Cancer Survivors Network USA (Colorectal Cancer board, Cancer Survivors Network). The Selenium webdriver was used to collect the page URLs of each discussion thread, and BeautifulSoup4 (bs4) was used to access those URLs and parse the HTML elements of each type of forum entry (main post, comment, and reply), which were then stored in a local dataset. The attributes stored for each entry are: URL; username (the author of the post); userposts (the number of posts written by the author); time (when the post was made); title; and post (unstructured text describing colorectal cancer patient experiences).
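The attribute list above maps naturally onto a flat record per forum entry. The following is a minimal sketch of such a record and a CSV round trip, not the thesis author's code. It assumes one row per main post, comment, or reply, and it assumes CSV as the storage format, which the description does not state. The class name `ForumRecord`, the helper functions, and the example values are hypothetical; only the six field names come from the description.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class ForumRecord:
    """One forum entry; field names follow the attribute list in the description."""
    url: str
    username: str   # author of the post
    userposts: int  # number of posts written by the author
    time: str       # when the post was made (kept as the raw scraped string)
    title: str
    post: str       # unstructured patient-experience text


def write_records(records, handle):
    """Write records as CSV with a header row (CSV is an assumed format)."""
    writer = csv.DictWriter(handle, fieldnames=[f.name for f in fields(ForumRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))


def read_records(handle):
    """Read records back, converting userposts from text to int."""
    return [
        ForumRecord(**{**row, "userposts": int(row["userposts"])})
        for row in csv.DictReader(handle)
    ]


# Hypothetical placeholder row, not real forum data.
example = ForumRecord(
    url="https://example.org/thread/1",
    username="example_user",
    userposts=12,
    time="2020-01-01 10:00",
    title="Example title",
    post="Example text, with a comma and\na line break.",
)

buffer = io.StringIO(newline="")
write_records([example], buffer)
buffer.seek(0)
restored = read_records(buffer)
```

Free-text posts often contain commas and line breaks, so the sketch relies on the `csv` module's quoting rather than manual string joining; `restored` holds the same record that was written.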
Provided by:
Patandin, Ayush
Created:
2023-09-04



