
pymlex/wolf-quotes

Source: Hugging Face. Updated 2026-04-28. Indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/pymlex/wolf-quotes
Description:

This dataset contains 95k Russian wolf quotes. The original raw 340 MB file was found in the neurovolk repository. The goal of this project is to split that stream into individual quotes, remove exact duplicates, filter near-duplicates at the character level, and save the result as a clean CSV ready for reuse. After splitting, the dataset contains 2,323,394 raw quote fragments. Exact deduplication leaves 100,545 unique quotes. MinHash and LSH filtering removes another 5,599 near-duplicates, and the final dataset contains 94,946 quotes.
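
The pipeline described above (exact deduplication, then character-level near-duplicate filtering with MinHash and LSH) can be sketched roughly as follows. This is a minimal standard-library illustration, not the dataset author's actual code: the function names, the 5-character shingle size, the 64-hash signature, the 16-band LSH layout, and the 0.8 similarity threshold are all assumptions chosen for the example.

```python
import hashlib


def shingles(text, k=5):
    """Character k-grams of a lowercased, whitespace-normalized string."""
    t = " ".join(text.lower().split())
    if len(t) <= k:
        return {t}
    return {t[i:i + k] for i in range(len(t) - k + 1)}


def minhash(text, num_perm=64, k=5):
    """MinHash signature: for each seed, the smallest salted hash over all shingles."""
    grams = [g.encode("utf-8") for g in shingles(text, k)]
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # one salted hash function per seed
        sig.append(min(hashlib.blake2b(g, digest_size=8, salt=salt).digest()
                       for g in grams))
    return tuple(sig)


def similarity(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def dedupe(quotes, num_perm=64, bands=16, threshold=0.8, k=5):
    """Return (exact-unique quotes, quotes kept after near-duplicate filtering)."""
    # Step 1: exact deduplication, keeping the first occurrence of each quote.
    unique = list(dict.fromkeys(q.strip() for q in quotes if q.strip()))

    # Step 2: LSH banding. Quotes sharing any band of their signature become
    # candidates; a candidate is dropped only if the estimated similarity
    # to an already kept quote reaches the (assumed) threshold.
    rows = num_perm // bands
    buckets = {}            # (band index, band slice) -> indices into `kept`
    kept, kept_sigs = [], []
    for quote in unique:
        sig = minhash(quote, num_perm, k)
        keys = [(b, sig[b * rows:(b + 1) * rows]) for b in range(bands)]
        candidates = {i for key in keys for i in buckets.get(key, ())}
        if any(similarity(sig, kept_sigs[i]) >= threshold for i in candidates):
            continue        # near-duplicate of a quote we already kept
        index = len(kept)
        kept.append(quote)
        kept_sigs.append(sig)
        for key in keys:
            buckets.setdefault(key, []).append(index)
    return unique, kept
```

Calling `dedupe(list_of_fragments)` returns both intermediate stages, which mirrors the two counts reported for the dataset (100,545 after exact deduplication, 94,946 after near-duplicate filtering). For a corpus of this size, a dedicated library would likely be faster than this pure-Python sketch.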
Provider:
pymlex