Omegle_logs_dataset
Source: ModelScope community. Updated 2026-01-09. Indexed 2026-01-10.
Download link:
https://modelscope.cn/datasets/anon8231489123/Omegle_logs_dataset
Description:
~10k conversations from Omegle. Scraped using: http://web.archive.org/cdx/search/xd?url=logs.omegle.com/*&fl=timestamp,original,statuscode&output=json. A log can only have ended up in the CDX index if its URL was posted publicly at some point.
* PII removed by dropping every conversation that contains any of these words: forbidden_words = ["kik", "telegram", "skype", "wickr", "discord", "dropbox", "insta ", "insta?", "instagram", "snap ", "snapchat"].
* Conversations with racial slurs removed.
* English only.
* Obviously, the dataset still contains a lot of (sometimes extreme) NSFW content. Do not view or use this dataset if you are under 18.
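The keyword-based PII filter described above can be sketched as follows. This is a minimal illustration, not the author's actual script: it assumes each conversation is a plain string and that matching is a case-insensitive substring test. The function names `contains_forbidden` and `filter_conversations` are hypothetical.

```python
# Word list copied from the dataset description. Trailing spaces and "?" are
# significant: "snap " matches "snap me" but not a message ending in "snap".
FORBIDDEN_WORDS = ["kik", "telegram", "skype", "wickr", "discord", "dropbox",
                   "insta ", "insta?", "instagram", "snap ", "snapchat"]


def contains_forbidden(text, words=FORBIDDEN_WORDS):
    """Return True if any forbidden word occurs in text (assumed case-insensitive)."""
    lowered = text.lower()
    return any(word in lowered for word in words)


def filter_conversations(conversations):
    """Keep only conversations (assumed to be strings) with no forbidden word."""
    return [c for c in conversations if not contains_forbidden(c)]


if __name__ == "__main__":
    sample = ["hi there", "add me on Kik", "my insta? sure"]
    print(filter_conversations(sample))
```

Substring matching is blunt: it drops whole conversations rather than redacting handles, and it cannot catch contact details shared without one of these platform names. That is consistent with the note that the data likely needs more filtering.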
General process for scraping (There are probably other datasets that can be scraped using this method):
1. Go to a page listed in the archive.org CDX index
2. Check if the page contains a log
3. Download the log image
4. Use OCR to read it
5. Save it to a json file.
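The steps above can be sketched roughly as follows. This is an outline under stated assumptions, not the original scraper. The endpoint path `cdx/search/cdx` is the one the Wayback CDX server documents publicly and is assumed here. The callables `find_log_image` and `ocr` are hypothetical placeholders for steps 2 to 4, which the description does not specify. The output record layout (`source`, `text`) is likewise an assumption.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint: the publicly documented Wayback CDX server path.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_query(url_pattern="logs.omegle.com/*"):
    """Build a CDX query with the same parameters as in the description."""
    params = {"url": url_pattern,
              "fl": "timestamp,original,statuscode",
              "output": "json"}
    return CDX_ENDPOINT + "?" + urlencode(params, safe="/*,")


def parse_cdx_rows(payload):
    """CDX JSON output is a list of rows; the first row holds the field names."""
    rows = json.loads(payload)
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


def snapshot_url(timestamp, original):
    """Wayback Machine URL of one archived capture."""
    return f"http://web.archive.org/web/{timestamp}/{original}"


def collect_logs(records, find_log_image, ocr, out_path):
    """Run steps 1-5 with caller-supplied (hypothetical) download and OCR functions.

    find_log_image(page_url) -> image bytes, or None if the page has no log.
    ocr(image_bytes) -> extracted text.
    """
    logs = []
    for rec in records:
        page_url = snapshot_url(rec["timestamp"], rec["original"])
        image = find_log_image(page_url)      # steps 1-3
        if image is None:
            continue
        logs.append({"source": rec["original"], "text": ocr(image)})  # step 4
    with open(out_path, "w", encoding="utf-8") as f:                  # step 5
        json.dump(logs, f, ensure_ascii=False, indent=2)
    return logs
```

Passing the download and OCR steps in as functions keeps the sketch independent of any particular HTTP or OCR library, and lets the parsing and saving logic be exercised without network access.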
This dataset could be useful for training casual conversational AI, but it likely still requires more filtering. Use at your own risk.
Provider:
maas
Created:
2025-11-18



