Pairwise Duplicate News Dataset

Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://data.mendeley.com/datasets/jmrhxb5666
Description:
The Pairwise Duplicate News Dataset consists of 10,836 article pairs (5,418 duplicate and 5,418 non-duplicate) collected from seven major Pakistani English-language news outlets: DAWN, The Express Tribune, GEO News, The News International, The Nation, 24 News, and 92 News HD. The pairs were drawn from a larger corpus of 54,929 news articles covering eight categories: National, World, Business, Sports, Entertainment, Crime, Technology, and Health. Duplicate articles were identified using Sentence-BERT embeddings and FAISS cosine similarity search with a similarity threshold of 0.85. Each duplicate cluster represents a single real-world news event, and one pair was sampled per cluster. Non-duplicate pairs were formed by randomly pairing articles from different clusters and were stratified into three similarity levels, easy (0.2-0.3), medium (0.3-0.5), and hard (0.5-0.7), to ensure balanced evaluation difficulty.
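
The numeric thresholds in the description can be illustrated with a minimal sketch. It is not the dataset authors' pipeline: it uses plain Python instead of Sentence-BERT and FAISS, and it does not reproduce the cluster construction or the sampling step. The names `cosine_similarity`, `label_pair`, `DUPLICATE_THRESHOLD`, and `NON_DUPLICATE_BINS` are hypothetical, and the treatment of boundary values (half-open bins, `>=` at 0.85) is an assumption, since the description does not say which side of each boundary is inclusive.

```python
import math

# Similarity threshold for duplicates, as stated in the description.
DUPLICATE_THRESHOLD = 0.85

# Difficulty levels for non-duplicate pairs, as stated in the description.
# Assumption: each bin is half-open, [low, high).
NON_DUPLICATE_BINS = [
    ("easy", 0.2, 0.3),
    ("medium", 0.3, 0.5),
    ("hard", 0.5, 0.7),
]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def label_pair(similarity):
    """Map a similarity score to 'duplicate', a difficulty level, or None.

    None means the score falls outside every range named in the
    description (below 0.2, or between 0.7 and 0.85).
    """
    if similarity >= DUPLICATE_THRESHOLD:  # assumption: threshold is inclusive
        return "duplicate"
    for name, low, high in NON_DUPLICATE_BINS:
        if low <= similarity < high:
            return name
    return None
```

For example, `label_pair(cosine_similarity(u, v))` would place a pair of hypothetical embeddings `u` and `v` in one of the ranges above. In the dataset itself, non-duplicate pairs additionally come from different clusters, which a score alone cannot verify.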
Created:
2025-10-27