
Hyperreal Talk (Polish clear web message board) messages data

Indexed from NIAID Data Ecosystem on 2026-05-01
Download link:
https://zenodo.org/record/10810250
Description:
General Information

1. Title of Dataset
Hyperreal Talk (Polish clear web message board) messages data.

2. Data Collectors
Haitao Shi (The University of Edinburgh, UK); Leszek Świeca (Kazimierz Wielki University in Bydgoszcz, Poland).

3. Funding Information
The dataset is part of the research supported by the Polish National Science Centre (Narodowe Centrum Nauki) grant 2021/43/B/HS6/00710. Project title: “Rhizomatic networks, circulation of meanings and contents, and offline contexts of online drug trade” (2022-2025; PLN 956 620; funding institution: Polish National Science Centre [NCN], call: OPUS 22; Principal Investigator: Piotr Siuda [Kazimierz Wielki University in Bydgoszcz, Poland]).

Data Collection Context

4. Data Source
Polish clear web message board called Hyperreal Talk (https://hyperreal.info/talk/).

5. Purpose
This dataset was developed within the abovementioned project. The project examines internet dynamics within disruptive activities, focusing specifically on the online drug trade in Poland. It aims to (1) examine the use of the open internet, including social media, in the drug trade; (2) delineate the role of darknet environments in narcotics distribution; and (3) uncover the flow of drug trade-related content and its meanings between the open web and the darknet, and how these meanings are shaped within the so-called drug subculture.
The Hyperreal Talk forum is a pivotal online space on the Polish internet, serving as a hub for discussions and the exchange of knowledge and experiences concerning drug use. It plays a crucial role in investigating the narratives and discourses that shape the drug subculture and broader societal perceptions of drug consumption. The dataset has been instrumental in conducting analyses pertinent to the project goals listed above.

6. Collection Method
The dataset was compiled using the Scrapy framework, a web crawling and scraping library for Python. This tool facilitated systematic content extraction from the targeted message board.

7. Collection Date
The data was collected in two periods: September 2023 and November 2023.

Data Content

8. Data Description
The dataset comprises all messages posted on the Polish-language Hyperreal Talk message board from its inception until November 2023. These messages include the initial posts that start each thread and the subsequent posts (replies) within those threads. The dataset is organized into two directories: “hyperreal” and “hyperreal_hidden.” The “hyperreal” directory contains posts that are accessible without logging in to Hyperreal Talk, while the “hyperreal_hidden” directory holds posts that can only be viewed by logged-in users. For each directory, a .txt file has been prepared detailing the structure of the message board folders from which the posts were extracted. The dataset includes 6,248,842 posts.

9. Data Cleaning, Processing, and Anonymization
The data has been cleaned and processed using regular expressions in Python. Additionally, all personal information was removed through regular expressions. Identifiers related to instant messaging apps and email addresses have been hashed. Furthermore, all usernames appearing in messages have been eliminated.

10. File Formats and Variables/Fields
The dataset consists of the following files:
- Zipped .txt files (hyperreal.zip) containing messages that are visible without logging into Hyperreal Talk. These files are organized into individual directories that mirror the folder structure found on the Hyperreal Talk message board.
- Zipped .txt files (hyperreal_hidden.zip) containing messages that are visible only after logging into Hyperreal Talk. Like the first archive, these files are organized into directories corresponding to the website’s folder structure.
- A .csv file that lists all the messages, including file names and the content of each post.

Accessibility and Usage

11. Access Conditions
The data can be accessed without any restrictions.

12. Related Documentation
Attached are .txt files detailing the tree of folders for “hyperreal.zip” and “hyperreal_hidden.zip.” Documentation on the Python regular expressions used for scraping, cleaning, processing, and anonymizing the data can be found on GitHub at the following URLs:
https://github.com/LeszekSwieca/Project_2021-43-B-HS6-00710
https://github.com/HaitaoShi/Scrapy_hyperreal

Ethical Considerations

13. Ethics Statement
A set of data handling policies aimed at ensuring safety and ethics has been outlined in the following paper: Harviainen, J.T., Haasio, A., Ruokolainen, T., Hassan, L., Siuda, P., Hamari, J. (2021). Information Protection in Dark Web Drug Markets Research [in:] Proceedings of the 54th Hawaii International Conference on System Sciences, HICSS 2021, Grand Hyatt Kauai, Hawaii, USA, 4-8 January 2021, Maui, Hawaii, (ed.) Tung X. Bui, Honolulu, HI, pp. 4673-4680.
The primary safeguard was the early-stage hashing of usernames and identifiers from the messages, using automated systems for irreversible hashing. Recognizing that scraping and automatic name removal might not catch all identifiers, the data underwent manual review to ensure compliance with research ethics and thorough anonymization.
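The anonymization described in items 9 and 13 combines regular expressions with irreversible hashing. The following is only a minimal sketch of that general idea, not the project's actual code: the email pattern, the salt value, the `<id:...>` placeholder format, and the function names `hash_identifier` and `anonymize` are all assumptions made for illustration. The expressions actually used are documented in the GitHub repositories listed in item 12.

```python
import hashlib
import re

# Assumed, simplified email pattern (illustrative only; the project's
# real expressions live in its GitHub repositories).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def hash_identifier(value: str, salt: str = "example-salt") -> str:
    """Return a short placeholder built from a salted SHA-256 digest.

    The salt and the placeholder format are hypothetical choices.
    """
    digest = hashlib.sha256((salt + value.lower()).encode("utf-8")).hexdigest()
    return f"<id:{digest[:12]}>"


def anonymize(text: str) -> str:
    """Replace every email-like match in the text with its hashed placeholder."""
    return EMAIL_RE.sub(lambda m: hash_identifier(m.group(0)), text)


if __name__ == "__main__":
    # The address below is a made-up example, not data from the dataset.
    print(anonymize("kontakt: jan.kowalski@example.com"))
```

Because the same input always maps to the same placeholder, repeated mentions of one identifier remain linkable within the corpus while the original value is not stored. As item 13 notes, automated replacement of this kind may miss some identifiers, which is why a manual review is still needed.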
Created:
2024-03-18