NLP-POL/instagram-political-communication-it-embeddings
Source: Hugging Face. Updated 2026-01-03; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/NLP-POL/instagram-political-communication-it-embeddings
---
dataset_info:
- name: instagram-political-communication-it-embeddings
pretty_name: "Instagram Political Communication (Italy) — Embeddings"
version: 1.0.0
license: cc-by-4.0
task_categories:
- feature-extraction
- sentence-similarity
- text-retrieval
language:
- it
configs:
  - config_name: post_caption_embeddings
    data_files:
      - split: train
        path: "data/post_caption_embeddings/*.parquet"
  - config_name: post_sentence_embeddings
    data_files:
      - split: train
        path: "data/post_sentence_embeddings/*.parquet"
  - config_name: comment_embeddings
    data_files:
      - split: train
        path: "data/comment_embeddings/*.parquet"
  - config_name: post_keyphrase_embeddings
    data_files:
      - split: train
        path: "data/post_keyphrase_embeddings/*.parquet"
  - config_name: comment_keyphrase_embeddings
    data_files:
      - split: train
        path: "data/comment_keyphrase_embeddings/*.parquet"
---
# Instagram Political Communication (Italy) — Embeddings
This dataset is the **companion embeddings dataset** of
**`instagram-political-communication-it`**, released as part of the **NLP-POL (NLP for Political Communication)** project.
It provides **vector representations (embeddings)** for Instagram posts, comments, sentences, and keyphrases related to the political communication of Italian politicians.
The dataset is designed to support research on:
- semantic analysis of political language
- representation learning in political discourse
- similarity, clustering, and retrieval tasks
- downstream NLP experiments built on top of the NLP-POL core dataset
## Relationship to the Core Dataset
This dataset **does not contain raw text or metadata**.
Instead, it provides **embeddings aligned via stable identifiers** to the core dataset:
🔗 **Core dataset:**
`instagram-political-communication-it` — [Go to repository](https://huggingface.co/datasets/NLP-POL/instagram-political-communication-it)
All records reference entities in the core dataset using:
- `post__id`
- `comment__id`
## Example Usage
The following example loads the post table from the core dataset, loads the post caption embeddings from this companion dataset, and joins the two explicitly on their stable identifiers.
Keeping the join explicit ensures transparency, reproducibility, and full control over relational operations.
```python
import duckdb

# In-memory DuckDB connection; reading hf:// paths relies on DuckDB's httpfs extension.
con = duckdb.connect()

# Load a small sample of posts from the core dataset.
posts_df = con.execute("""
    SELECT *
    FROM read_parquet('hf://datasets/NLP-POL/instagram-political-communication-it/data/posts/*.parquet')
    LIMIT 10
""").fetch_df()

# Fetch only the caption embeddings that belong to the sampled posts.
post_ids = ', '.join(f"'{_id}'" for _id in posts_df['_id'].tolist())
post_embeddings_df = con.execute(f"""
    SELECT *
    FROM read_parquet('hf://datasets/NLP-POL/instagram-political-communication-it-embeddings/data/post_caption_embeddings/*.parquet')
    WHERE post__id IN ({post_ids})
""").fetch_df()

# Join posts and embeddings explicitly on the stable identifier.
join_df = posts_df.merge(
    post_embeddings_df,
    left_on='_id',
    right_on='post__id',
    how='inner',
    suffixes=('_post', '_embedding'),
)

print(join_df.head())
```
## Dataset Structure
The dataset is released as a **multi-table relational dataset** with flat schemas.
### Tables
| Table | Description |
|------|------------|
| `post_caption_embeddings` | Embeddings of Instagram post captions |
| `post_sentence_embeddings` | Sentence-level embeddings extracted from post captions |
| `comment_embeddings` | Embeddings of Instagram comments |
| `post_keyphrase_embeddings` | Embeddings of keyphrases extracted from posts |
| `comment_keyphrase_embeddings` | Embeddings of keyphrases extracted from comments |
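Each table is stored as Parquet files under `data/<table>/`, and the table names match the configuration names declared in the dataset card, so they can also be passed as the configuration argument of `datasets.load_dataset`. The sketch below builds the `hf://` glob for a given table. The helper `embedding_table_glob` and the constants `REPO_ID` and `TABLES` are hypothetical names introduced for illustration; they are not part of the dataset or of any library.

```python
# Hypothetical helper (not part of the dataset or any library): builds the
# hf:// Parquet glob for one of the five tables listed above.
REPO_ID = "NLP-POL/instagram-political-communication-it-embeddings"

TABLES = (
    "post_caption_embeddings",
    "post_sentence_embeddings",
    "comment_embeddings",
    "post_keyphrase_embeddings",
    "comment_keyphrase_embeddings",
)


def embedding_table_glob(table: str) -> str:
    """Return the hf:// glob matching all Parquet files of `table`."""
    if table not in TABLES:
        raise ValueError(f"unknown table {table!r}; expected one of {TABLES}")
    return f"hf://datasets/{REPO_ID}/data/{table}/*.parquet"


print(embedding_table_glob("comment_embeddings"))
```

The returned string can be passed to DuckDB's `read_parquet`, in the same way as the paths in the Example Usage section.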
## Data Fields Overview
### Post Caption Embeddings (`post_caption_embeddings`)
| Field | Description |
|------|-------------|
| `post__id` | Referenced post identifier (core dataset) |
| `embedding_model` | Name of the embedding model |
| `embeddings_caption` | Caption embedding vector |
| `dataset_version` | Dataset version |
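As an illustration of the similarity use case, the sketch below computes pairwise cosine similarity between caption vectors. It runs on a toy DataFrame that only mimics the `post__id` and `embeddings_caption` columns; the identifiers, the 3-dimensional vectors, and the function name `cosine_similarity_matrix` are assumptions made for the example, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a slice of `post_caption_embeddings`; the ids and
# 3-dimensional vectors are made up for illustration.
captions = pd.DataFrame({
    "post__id": ["p1", "p2", "p3"],
    "embeddings_caption": [
        [1.0, 0.0, 0.0],
        [0.0, 2.0, 0.0],
        [3.0, 0.0, 0.0],
    ],
})


def cosine_similarity_matrix(vectors) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `vectors`."""
    matrix = np.asarray(list(vectors), dtype=float)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    unit = matrix / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T


sim = cosine_similarity_matrix(captions["embeddings_caption"])
sim_df = pd.DataFrame(sim, index=captions["post__id"], columns=captions["post__id"])
print(sim_df.round(3))
```

Normalizing the rows first means the matrix product directly yields cosine values, so vector length does not affect the comparison.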
### Post Sentence Embeddings (`post_sentence_embeddings`)
| Field | Description |
|------|-------------|
| `post__id` | Referenced post identifier |
| `embeddings_sentences_sentence_idx` | Sentence index within the post |
| `embeddings_sentences_sentence` | Sentence text |
| `embedding_model` | Name of the embedding model |
| `embeddings_sentences_embedding` | Sentence embedding vector |
| `dataset_version` | Dataset version |
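Because this table holds one row per sentence, a common follow-up step is to aggregate the sentence vectors of a post into a single vector. The sketch below uses mean pooling on toy data. Mean pooling is an assumption chosen for illustration, and nothing here implies that the caption embeddings in this dataset were produced that way; `mean_pool_by_post` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `post_sentence_embeddings`; values are made up.
sentences = pd.DataFrame({
    "post__id": ["p1", "p1", "p2"],
    "embeddings_sentences_sentence_idx": [0, 1, 0],
    "embeddings_sentences_embedding": [
        [2.0, 0.0],
        [0.0, 2.0],
        [4.0, 4.0],
    ],
})


def mean_pool_by_post(df: pd.DataFrame) -> dict:
    """Average the sentence vectors of each post into a single vector."""
    pooled = {}
    for post_id, group in df.groupby("post__id"):
        stacked = np.vstack(group["embeddings_sentences_embedding"].tolist())
        pooled[post_id] = stacked.mean(axis=0)
    return pooled


pooled = mean_pool_by_post(sentences)
print({post_id: vec.tolist() for post_id, vec in pooled.items()})
```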
### Comment Embeddings (`comment_embeddings`)
| Field | Description |
|------|-------------|
| `comment__id` | Referenced comment identifier |
| `embedding_model` | Name of the embedding model |
| `embeddings` | Comment embedding vector |
| `dataset_version` | Dataset version |
### Keyphrase Embeddings
**Post keyphrases (`post_keyphrase_embeddings`)**
| Field | Description |
|------|-------------|
| `post__id` | Referenced post identifier |
| `keyphrases_keyphrase` | Extracted keyphrase |
| `embedding_model` | Name of the embedding model |
| `keyphrases_embedding` | Keyphrase embedding vector |
| `dataset_version` | Dataset version |
**Comment keyphrases (`comment_keyphrase_embeddings`)**
| Field | Description |
|------|-------------|
| `comment__id` | Referenced comment identifier |
| `keyphrases_keyphrase` | Extracted keyphrase |
| `embedding_model` | Name of the embedding model |
| `keyphrases_embedding` | Keyphrase embedding vector |
| `dataset_version` | Dataset version |
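The keyphrase tables lend themselves to simple retrieval: given a query vector, rank keyphrases by cosine similarity. The sketch below shows this on toy data. The phrases, the 2-dimensional vectors, and the helper `top_k_keyphrases` are made up for the example; in practice the query vector would need to come from the same model named in `embedding_model`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `post_keyphrase_embeddings`; phrases and vectors are made up.
keyphrases = pd.DataFrame({
    "post__id": ["p1", "p1", "p2"],
    "keyphrases_keyphrase": ["legge di bilancio", "sanità pubblica", "manovra economica"],
    "keyphrases_embedding": [
        [0.9, 0.1],
        [0.0, 1.0],
        [1.0, 0.0],
    ],
})


def top_k_keyphrases(df: pd.DataFrame, query, k: int = 2) -> pd.DataFrame:
    """Return the k rows whose keyphrase vectors are closest to `query` (cosine)."""
    matrix = np.asarray(df["keyphrases_embedding"].tolist(), dtype=float)
    query = np.asarray(query, dtype=float)
    scores = (matrix @ query) / (
        np.linalg.norm(matrix, axis=1) * np.linalg.norm(query) + 1e-12
    )
    ranked = df.assign(score=scores).sort_values("score", ascending=False)
    return ranked.head(k)[["post__id", "keyphrases_keyphrase", "score"]]


result = top_k_keyphrases(keyphrases, query=[1.0, 0.0], k=2)
print(result)
```

The same function works on `comment_keyphrase_embeddings` if `post__id` is replaced with `comment__id` in the returned columns.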
## Embedding Generation
Embeddings are generated as part of the NLP-POL preprocessing pipeline after text normalization and linguistic analysis.
Key characteristics:
- Fixed-size dense vectors
- Sentence-level and document-level representations
- Generated consistently across dataset versions
The specific embedding model used is recorded in the `embedding_model` field to support reproducibility and model comparison.
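Vectors from different models are generally not comparable, so it can help to group rows by `embedding_model` and confirm that each group has a single vector size before running any similarity computation. The sketch below does this on toy data. The model names, the vectors, and the helper `matrices_by_model` are hypothetical, and the example does not imply that any table actually mixes models.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `comment_embeddings`; model names and vectors are made up.
comments = pd.DataFrame({
    "comment__id": ["c1", "c2", "c3"],
    "embedding_model": ["model-a", "model-a", "model-b"],
    "embeddings": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6, 0.7]],
})


def matrices_by_model(df: pd.DataFrame, vector_col: str) -> dict:
    """Stack vectors into one matrix per embedding model, checking dimensions."""
    matrices = {}
    for model, group in df.groupby("embedding_model"):
        dims = {len(vec) for vec in group[vector_col]}
        if len(dims) != 1:
            raise ValueError(f"{model}: inconsistent vector sizes {sorted(dims)}")
        matrices[model] = np.asarray(group[vector_col].tolist(), dtype=float)
    return matrices


by_model = matrices_by_model(comments, "embeddings")
print({model: matrix.shape for model, matrix in by_model.items()})
```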
## Intended Use and Limitations
**Intended use:** semantic analysis, similarity search, clustering, representation learning, downstream NLP tasks in political communication research.
**Limitations:**
- Embeddings inherit biases from the underlying language models
- Semantic representations depend on preprocessing choices and model selection
- This dataset should always be used together with the core dataset for interpretation
## License
Released under **Creative Commons Attribution 4.0 (CC-BY 4.0)**.
## Citation
If you use this dataset, please cite the **core dataset**:
```bibtex
@dataset{nlp_pol_instagram_political_communication_it_2026,
title = {NLP-POL: Instagram Political Communication (Italy)},
author = {PMG-t and NLP-POL Project},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/PMG-t/instagram-political-communication-it},
note = {Maintained by PMG-t. Part of the NLP-POL (NLP for Political Communication) project.},
howpublished = {\url{https://github.com/PMG-t}}
}
```
Provided by: NLP-POL



