Pairwise Duplicate News Dataset

Source: NIAID Data Ecosystem (indexed 2026-05-10)

Download link:

https://data.mendeley.com/datasets/jmrhxb5666

Description:

The Pairwise Duplicate News Dataset consists of 10,836 article pairs (5,418 duplicate and 5,418 non-duplicate) collected from seven major Pakistani English-language news outlets: DAWN, The Express Tribune, GEO News, The News International, The Nation, 24 News, and 92 News HD. The pairs were drawn from a larger corpus of 54,929 news articles covering eight categories: National, World, Business, Sports, Entertainment, Crime, Technology, and Health. Duplicates were identified using Sentence-BERT embeddings and FAISS cosine similarity search, with a similarity threshold of 0.85. Each duplicate cluster represents a single real-world news event, and one pair was sampled per cluster. Non-duplicate pairs were formed by randomly pairing articles from different clusters and were stratified into three similarity levels, easy (0.2-0.3), medium (0.3-0.5), and hard (0.5-0.7), to balance evaluation difficulty.
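
The thresholds in the description can be illustrated with a minimal sketch. This is not the dataset authors' pipeline: it uses plain Python lists in place of Sentence-BERT embeddings and FAISS search, and the names `cosine_similarity` and `classify_pair` are hypothetical. Two details are assumptions because the description does not specify them: a similarity exactly at 0.85 is treated as a duplicate, and the non-duplicate bins are treated as half-open intervals so that the shared boundaries 0.3 and 0.5 fall into exactly one level.

```python
import math

# Duplicate threshold stated in the dataset description.
DUPLICATE_THRESHOLD = 0.85

# Non-duplicate difficulty levels from the description.
# Assumption: each bin is half-open, [low, high), so 0.3 is "medium"
# and 0.5 is "hard". The description does not say how ties are handled.
NON_DUPLICATE_BINS = [
    ("easy", 0.2, 0.3),
    ("medium", 0.3, 0.5),
    ("hard", 0.5, 0.7),
]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def classify_pair(similarity):
    """Map a similarity score to 'duplicate', a difficulty level, or None.

    None means the score falls outside every range named in the
    description (below 0.2, or between 0.7 and 0.85).
    """
    # Assumption: the threshold is inclusive.
    if similarity >= DUPLICATE_THRESHOLD:
        return "duplicate"
    for name, low, high in NON_DUPLICATE_BINS:
        if low <= similarity < high:
            return name
    return None


if __name__ == "__main__":
    # Toy vectors standing in for sentence embeddings.
    article_a = [1.0, 0.0]
    article_b = [1.0, 1.0]
    score = cosine_similarity(article_a, article_b)
    print(round(score, 3), classify_pair(score))
```

Note that scores between 0.7 and 0.85 belong to neither class under this reading, which leaves a margin between the hardest non-duplicates and the duplicate threshold.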

Created:

2025-10-27
