npedrazzini/NewsBERT_19thc_ms_news_embeddings
Source: Hugging Face. Updated 2025-12-08; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/npedrazzini/NewsBERT_19thc_ms_news_embeddings
Resource description:
---
license: mit
task_categories:
- text-classification
- sentence-similarity
- text-retrieval
language:
- en
tags:
- history
- newspapers
- news
- articles
- murder
- suicide
- british-press
---
# Article-level embeddings of historical British newspaper articles mentioning murder and suicide
This repository contains article-level embeddings of a dataset of historical English newspaper articles from the LwM and HMD14 collections (1800-1920), containing mentions of *murder* and *suicide*.
## Related Model
- **[npedrazzini/NewsBERT](https://huggingface.co/npedrazzini/NewsBERT)**
All article-level embeddings in this dataset were generated using this masked language model.
Mean-pooled CLS embeddings were extracted from the final encoder layer.
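The card does not include the extraction code, so the following is only a minimal sketch of one common reading of "mean-pooled": averaging the final-layer token vectors while ignoring padding. The function name `mean_pool` and the toy tensors are illustrative assumptions. Whether the `[CLS]` position is included in the average, and how long articles were split, is not stated in the card.

```python
import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average final-layer token vectors, ignoring padding positions.

    last_hidden_state: [batch, seq_len, hidden_dim]
    attention_mask:    [batch, seq_len], 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # [batch, seq_len, 1]
    summed = (last_hidden_state * mask).sum(dim=1)                   # [batch, hidden_dim]
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid division by zero
    return summed / counts


# Toy tensors standing in for encoder output (assumption, not real model output):
# 2 texts, 3 token positions, 2 hidden dimensions.
hidden = torch.tensor([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last position is padding
    [[2.0, 0.0], [4.0, 0.0], [6.0, 0.0]],
])
mask = torch.tensor([[1, 1, 0], [1, 1, 1]])
pooled = mean_pool(hidden, mask)
print(pooled.shape)
```

With a real encoder, `hidden` would be the final hidden states returned for a tokenized batch and `mask` the tokenizer's attention mask.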
## Related Dataset
- **[npedrazzini/19thc_ms_news](https://huggingface.co/npedrazzini/19thc_ms_news)**
The text of the articles and the associated metadata come from the files `murder.csv` and `suicide.csv` in that repository.
## Files
### **1. `murder_suicide_similarity_results.csv`**
A concatenation of `murder.csv` and `suicide.csv`, augmented with the column `most_similar`.
For each article, `most_similar` contains the row indices of the **100 most similar other articles**, computed from mean-pooled embeddings produced by a domain-adapted BERT model ([*NewsBERT*](https://huggingface.co/npedrazzini/NewsBERT)).
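The card does not say which similarity measure was used. As a hedged illustration, the sketch below assumes cosine similarity and shows how the indices of the `k` most similar other rows could be derived from an embedding matrix. The function name `top_k_similar` and the toy data are hypothetical.

```python
import torch
import torch.nn.functional as F


def top_k_similar(embeddings: torch.Tensor, k: int) -> torch.Tensor:
    """Return, for each row, the indices of the k most similar other rows.

    Assumes cosine similarity; the dataset card does not name the metric.
    """
    normed = F.normalize(embeddings, p=2, dim=1)  # scale each row to unit length
    sims = normed @ normed.T                      # pairwise cosine similarities
    sims.fill_diagonal_(float("-inf"))            # exclude each article itself
    return sims.topk(k, dim=1).indices            # [num_articles, k]


# Toy embeddings: rows 0 and 1 point one way, rows 2 and 3 another.
toy = torch.tensor([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
nearest = top_k_similar(toy, k=1).squeeze(1).tolist()
print(nearest)
```

This builds the full `num_articles × num_articles` similarity matrix in memory, so a large collection would likely need to be processed in row chunks.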
### **2. `article_level_embeddings/murder_suicide_embeddings.pt`**
Contains the article-level embedding of each article in `murder_suicide_similarity_results.csv`.
It is a PyTorch tensor of shape `[num_articles, embedding_dim]`.
### **3. `article_level_embeddings/murder_suicide_metadata.json`**
A list of dictionaries of the form:
```json
{
"id": "article_identifier",
"index": 0
}
```
Each entry maps an article in `murder_suicide_similarity_results.csv` to its embedding.
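A minimal sketch of how this metadata could be used, assuming that `index` is the row position of the article's embedding in the tensor (the card implies this but does not state it outright). The helper `build_index_lookup` and the toy identifiers are hypothetical.

```python
import json


def build_index_lookup(metadata: list[dict]) -> dict[str, int]:
    """Map each article id to its recorded index."""
    return {entry["id"]: entry["index"] for entry in metadata}


# Toy metadata in the documented form; real ids come from the JSON file.
metadata = json.loads(
    '[{"id": "article_a", "index": 0}, {"id": "article_b", "index": 1}]'
)
lookup = build_index_lookup(metadata)
print(lookup["article_b"])
```

With the real files, `metadata` would come from `json.load` on `murder_suicide_metadata.json`, and `lookup[article_id]` would then select a row of the tensor loaded with `torch.load` from `murder_suicide_embeddings.pt`.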
Provider:
npedrazzini



