
npedrazzini/NewsBERT_19thc_ms_news_embeddings

Hugging Face · Updated 2025-12-08 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/npedrazzini/NewsBERT_19thc_ms_news_embeddings
Resource description:
---
license: mit
task_categories:
- text-classification
- sentence-similarity
- text-retrieval
language:
- en
tags:
- history
- newspapers
- news
- articles
- murder
- suicide
- british-press
---

# Article-level embeddings of historical British newspaper articles mentioning murder and suicide

This repository contains article-level embeddings for a dataset of historical English newspaper articles from the LwM and HMD14 collections (1800-1920) that mention *murder* and *suicide*.

## Related Model

- **[npedrazzini/NewsBERT](https://huggingface.co/npedrazzini/NewsBERT)**

All article-level embeddings in this dataset were generated with this masked language model. Mean-pooled CLS embeddings were extracted from the final encoder layer.

## Related Dataset

- **[npedrazzini/19thc_ms_news](https://huggingface.co/npedrazzini/19thc_ms_news)**

The text of the articles and the associated metadata come from the files `murder.csv` and `suicide.csv` in that dataset.

## Files

### **1. `murder_suicide_similarity_results.csv`**

A concatenation of `murder.csv` and `suicide.csv`, augmented with the column `most_similar`. For each article, `most_similar` contains the row indices of the **100 most similar other articles**, computed from mean-pooled embeddings produced by a domain-adapted BERT model ([*NewsBERT*](https://huggingface.co/npedrazzini/NewsBERT)).

### **2. `article_level_embeddings/murder_suicide_embeddings.pt`**

Contains the article-level embedding of each article in `murder_suicide_similarity_results.csv`. It is a PyTorch tensor of shape `[num_articles, embedding_dim]`.

### **3. `article_level_embeddings/murder_suicide_metadata.json`**

A list of dictionaries of the form:

```json
{
  "id": "article_identifier",
  "index": 0
}
```

Each entry maps an article in `murder_suicide_similarity_results.csv` to its embedding.
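The following is a minimal sketch of how the metadata list and the embedding matrix described above might be used together: it builds an `id` to row-index lookup from entries of the documented `{"id", "index"}` form, then ranks the other rows by similarity to one article. Cosine similarity is an assumption here, since the card does not name the metric behind `most_similar`. The function names (`build_id_lookup`, `cosine`, `most_similar`), the article ids, and the toy two-dimensional vectors are all hypothetical stand-ins, not part of the dataset.

```python
import json
import math


def build_id_lookup(metadata):
    """Map each article id to the row index of its embedding."""
    return {entry["id"]: entry["index"] for entry in metadata}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(embeddings, row, k=100):
    """Return indices of the k rows most similar to `row`, excluding `row` itself."""
    scores = [
        (cosine(embeddings[row], vec), i)
        for i, vec in enumerate(embeddings)
        if i != row
    ]
    # Highest similarity first; ties broken by lower row index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scores[:k]]


# Toy stand-ins for the real metadata JSON and embedding tensor (hypothetical values).
toy_metadata = json.loads(
    '[{"id": "art_a", "index": 0},'
    ' {"id": "art_b", "index": 1},'
    ' {"id": "art_c", "index": 2}]'
)
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

lookup = build_id_lookup(toy_metadata)
neighbours = most_similar(toy_embeddings, lookup["art_a"], k=2)
print(neighbours)  # [1, 2]
```

With the real files, the tensor could presumably be read with `torch.load` and the metadata with `json.load`. For a matrix of that size, a vectorised computation on the tensor itself would be a more practical choice than the pure-Python loop used here for illustration.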
Provider:
npedrazzini