NLP-POL/instagram-political-communication-it-embeddings

Name: NLP-POL/instagram-political-communication-it-embeddings
Creator: NLP-POL
Published: 2026-01-03 13:06:28
License: CC-BY 4.0

Source: Hugging Face. Updated 2026-01-03; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/NLP-POL/instagram-political-communication-it-embeddings

Resource description:

---
dataset_info:
  - name: instagram-political-communication-it-embeddings
    pretty_name: "Instagram Political Communication (Italy) — Embeddings"
    version: 1.0.0
    license: cc-by-4.0
task_categories:
  - feature-extraction
  - sentence-similarity
  - text-retrieval
language:
  - it
configs:
  - config_name: post_caption_embeddings
    data_files:
      - split: train
        path: "data/post_caption_embeddings/*.parquet"
  - config_name: post_sentence_embeddings
    data_files:
      - split: train
        path: "data/post_sentence_embeddings/*.parquet"
  - config_name: comment_embeddings
    data_files:
      - split: train
        path: "data/comment_embeddings/*.parquet"
  - config_name: post_keyphrase_embeddings
    data_files:
      - split: train
        path: "data/post_keyphrase_embeddings/*.parquet"
  - config_name: comment_keyphrase_embeddings
    data_files:
      - split: train
        path: "data/comment_keyphrase_embeddings/*.parquet"
---

# Instagram Political Communication (Italy) — Embeddings

This dataset is the **companion embeddings dataset** of **`instagram-political-communication-it`**, released as part of the **NLP-POL (NLP for Political Communication)** project.

It provides **vector representations (embeddings)** for Instagram posts, comments, sentences, and keyphrases related to the political communication of Italian politicians.

The dataset is designed to support research on:

- semantic analysis of political language
- representation learning in political discourse
- similarity, clustering, and retrieval tasks
- downstream NLP experiments built on top of the NLP-POL core dataset

## Relationship to the Core Dataset

This dataset **does not contain raw text or metadata**.
Instead, it provides **embeddings aligned via stable identifiers** to the core dataset:

🔗 **Core dataset:** `instagram-political-communication-it` ([Go to repository](https://huggingface.co/datasets/NLP-POL/instagram-political-communication-it))

All records reference entities in the core dataset using:

- `post__id`
- `comment__id`

## Example Usage

The following example loads the posts table from the core dataset, loads the post caption embeddings from this companion dataset, and joins the two explicitly on the stable identifiers. Keeping the join explicit makes the relational step transparent and reproducible.

```python
import duckdb

# In-memory DuckDB connection; reading hf:// paths relies on DuckDB's httpfs extension.
con = duckdb.connect()

# Load a small sample of posts from the core dataset.
q_posts = con.execute("""
    SELECT *
    FROM read_parquet('hf://datasets/NLP-POL/instagram-political-communication-it/data/posts/*.parquet')
    LIMIT 10
""")
posts_df = q_posts.fetch_df()

# Build the quoted id list once, outside the SQL string.
id_list = ', '.join(f"'{_id}'" for _id in posts_df['_id'].tolist())

# Load only the caption embeddings that belong to the sampled posts.
post_embeddings_q = con.execute(f"""
    SELECT *
    FROM read_parquet('hf://datasets/NLP-POL/instagram-political-communication-it-embeddings/data/post_caption_embeddings/*.parquet')
    WHERE post__id IN ({id_list})
""")
post_embeddings_df = post_embeddings_q.fetch_df()

# Join posts to their embeddings: `_id` in the core table matches `post__id` here.
join_df = posts_df.merge(
    post_embeddings_df,
    left_on='_id',
    right_on='post__id',
    how='inner',
    suffixes=('_post', '_embedding')
)

# In a notebook, display(join_df.head()) renders the same preview as a table.
print(join_df.head())
```

## Dataset Structure

The dataset is released as a **multi-table relational dataset** with flat schemas.
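Each table name in this card is also the name of its Parquet directory (see the `configs` paths in the front matter), so the location of any table can be derived from its name. The following is a minimal sketch under that assumption. `REPO_ID`, `TABLE_KEYS`, `table_glob`, and `build_id_query` are hypothetical names introduced here for illustration; they are not part of the dataset or of any library.

```python
# Repository id as it appears in the URLs of this card.
REPO_ID = "NLP-POL/instagram-political-communication-it-embeddings"

# Table names from this card, mapped to the identifier column each table
# uses to reference the core dataset.
TABLE_KEYS = {
    "post_caption_embeddings": "post__id",
    "post_sentence_embeddings": "post__id",
    "comment_embeddings": "comment__id",
    "post_keyphrase_embeddings": "post__id",
    "comment_keyphrase_embeddings": "comment__id",
}


def table_glob(table: str) -> str:
    """Return the hf:// Parquet glob for one table (hypothetical helper)."""
    if table not in TABLE_KEYS:
        raise ValueError(f"unknown table: {table!r}")
    return f"hf://datasets/{REPO_ID}/data/{table}/*.parquet"


def build_id_query(table: str, ids: list[str]) -> tuple[str, list[str]]:
    """Build a parameterized SQL string and its parameter list.

    The ids are bound through `?` placeholders instead of being
    interpolated into the SQL text.
    """
    if not ids:
        raise ValueError("ids must not be empty")
    key = TABLE_KEYS[table] if table in TABLE_KEYS else None
    path = table_glob(table)  # raises ValueError for unknown tables
    placeholders = ", ".join("?" for _ in ids)
    sql = f"SELECT * FROM read_parquet('{path}') WHERE {key} IN ({placeholders})"
    return sql, list(ids)
```

The returned pair could then be passed to `con.execute(sql, params)` on a connection created with `duckdb.connect()`, which keeps identifier values out of the SQL string itself.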
### Tables

| Table | Description |
|------|------------|
| `post_caption_embeddings` | Embeddings of Instagram post captions |
| `post_sentence_embeddings` | Sentence-level embeddings extracted from post captions |
| `comment_embeddings` | Embeddings of Instagram comments |
| `post_keyphrase_embeddings` | Embeddings of keyphrases extracted from posts |
| `comment_keyphrase_embeddings` | Embeddings of keyphrases extracted from comments |

## Data Fields Overview

### Post Caption Embeddings (`post_caption_embeddings`)

| Field | Description |
|------|-------------|
| `post__id` | Referenced post identifier (core dataset) |
| `embedding_model` | Name of the embedding model |
| `embeddings_caption` | Caption embedding vector |
| `dataset_version` | Dataset version |

### Post Sentence Embeddings (`post_sentence_embeddings`)

| Field | Description |
|------|-------------|
| `post__id` | Referenced post identifier |
| `embeddings_sentences_sentence_idx` | Sentence index within the post |
| `embeddings_sentences_sentence` | Sentence text |
| `embedding_model` | Name of the embedding model |
| `embeddings_sentences_embedding` | Sentence embedding vector |
| `dataset_version` | Dataset version |

### Comment Embeddings (`comment_embeddings`)

| Field | Description |
|------|-------------|
| `comment__id` | Referenced comment identifier |
| `embedding_model` | Name of the embedding model |
| `embeddings` | Comment embedding vector |
| `dataset_version` | Dataset version |

### Keyphrase Embeddings

**Post keyphrases (`post_keyphrase_embeddings`)**

| Field | Description |
|------|-------------|
| `post__id` | Referenced post identifier |
| `keyphrases_keyphrase` | Extracted keyphrase |
| `embedding_model` | Name of the embedding model |
| `keyphrases_embedding` | Keyphrase embedding vector |
| `dataset_version` | Dataset version |

**Comment keyphrases (`comment_keyphrase_embeddings`)**

| Field | Description |
|------|-------------|
| `comment__id` | Referenced comment identifier |
| `keyphrases_keyphrase` | Extracted keyphrase |
| `embedding_model` | Name of the embedding model |
| `keyphrases_embedding` | Keyphrase embedding vector |
| `dataset_version` | Dataset version |

## Embedding Generation

Embeddings are generated as part of the NLP-POL preprocessing pipeline after text normalization and linguistic analysis.

Key characteristics:

- Fixed-size dense vectors
- Sentence-level and document-level representations
- Generated consistently across dataset versions

The specific embedding model used is recorded in the `embedding_model` field to support reproducibility and model comparison.

## Intended Use and Limitations

**Intended use:** semantic analysis, similarity search, clustering, representation learning, downstream NLP tasks in political communication research.

**Limitations:**

- Embeddings inherit biases from the underlying language models
- Semantic representations depend on preprocessing choices and model selection
- This dataset should always be used together with the core dataset for interpretation

## License

Released under **Creative Commons Attribution 4.0 (CC-BY 4.0)**.

## Citation

If you use this dataset, please cite the **core dataset**:

```bibtex
@dataset{nlp_pol_instagram_political_communication_it_2026,
  title = {NLP-POL: Instagram Political Communication (Italy)},
  author = {PMG-t and NLP-POL Project},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/PMG-t/instagram-political-communication-it},
  note = {Maintained by PMG-t. Part of the NLP-POL (NLP for Political Communication) project.},
  howpublished = {\url{https://github.com/PMG-t}}
}
```

Provider:

NLP-POL
