truevislies/results

Name: truevislies/results
Creator: truevislies
Published: 2026-04-15 10:53:14
License: No description available

Source: Hugging Face. Updated: 2026-04-15. Indexed: 2026-04-26.

Download link:

https://hf-mirror.com/datasets/truevislies/results

Resource description:

---
license: cc-by-nc-4.0
task_categories:
- image-classification
- text-generation
language:
- en
tags:
- visualization
- misinformation
- misleading-visualizations
- COVID-19
- large-language-models
- multimodal
- rhetoric
- authorial-intent
pretty_name: TrueVisLies – Results
size_categories:
- 10M<n<100M
---

# TrueVisLies – Results

This dataset contains all raw outputs, extracted fields, semantic similarity scores, and UMAP projections produced in the paper:

> **True (VIS) Lies: Analyzing How Generative AI Recognizes Intentionality, Rhetoric, and Misleadingness in Visualization Lies**

The paper evaluates 16 models: 15 open-weight vision-language models (VLMs) and GPT-5.4. It tests their ability to (RQ0) detect misleading data visualizations, (RQ1) identify the visualization rhetoric techniques, and (RQ2) attribute authorial intent behind a misleading visualization.

Two datasets are used:

- [COVID-19 Dataset](https://huggingface.co/datasets/truevislies/twitter)
- [VisLies Dataset](https://huggingface.co/datasets/truevislies/vislies)

---

## Repository Structure

The dataset is organized into two top-level folders: `twitter/` for the COVID-19 Twitter dataset and `vislies/` for the VisLies gallery.
Both folders share the same internal structure:

```
{corpus}/
├── models.csv
├── responses/
│   └── {experiment}.parquet          # Raw model responses
├── extractions/
│   └── {experiment}.parquet          # Structured field extractions (by a meta-LLM)
├── similarity/
│   ├── centroid_distances/
│   │   └── {topic}/
│   │       └── {experiment}.parquet  # Per-topic model-pair cosine similarity (aggregated over images)
│   ├── model_agreement/
│   │   └── {topic}/
│   │       └── {experiment}.parquet  # Per-image model-pair cosine similarity
│   └── setup_shift/
│       └── {topic}/
│           └── {model}.parquet       # Per-image cross-experiment cosine similarity for a given model
├── umap/
│   └── {topic}/
│       └── {experiment}.parquet      # 2D UMAP projections of response embeddings
└── umap10/
    └── {topic}/
        └── {experiment}.parquet      # 10D UMAP projections of response embeddings
```

**Experiments** (`{experiment}`): `E0`, `E1A`, `E1B`, `E1C`, `E2A`, `E2B`, `E2C`.

**Models** (`{model}`): `deepseek`, `gemma`, `glm`, `gpt`, `gta`, `intern`, `kimi`, `llava`, `maverick`, `mistral`, `molmo`, `nemotron`, `pixtral`, `qianfan`, `qwen`, `step3`.

Details of the models are in the `models.csv` file and in the paper.

---

## Embeddings and Similarity Scores

All similarity scores in the `similarity/` folder, and all UMAP projections in the `umap/` and `umap10/` folders, are computed using cosine similarity on embeddings of the raw LLM outputs, generated by the `Qwen3-Embedding-8B` model.

The `topic` column in the similarity files indicates which response field the embedding was generated from (e.g., `a___analysis` for the full free-text analysis, `e___causal_reasoning` for the extracted causal reasoning field). The same applies to the UMAP files.

The full 4096-dimensional embeddings are not included in the dataset because of their size.

---

## Experimental Conditions

Each experiment corresponds to a specific prompt that was sent to the model together with the visualization image and its accompanying caption.
The six conditions form a 3x2 design: three knowledge anchors (A, B, C) crossed with two task scopes (E1 = rhetoric, E2 = authorial intent). E0 is the baseline with no prior knowledge and no additional task.

| ID | Prior knowledge | Task |
|---|---|---|
| E0 | None (open-ended analysis) | Misleading detection only |
| E1A | None | Misleading detection + rhetoric scoring |
| E1B | Ground truth label (misleading/not misleading) | Misleading detection + rhetoric scoring |
| E1C | Ground truth label + error type(s) | Misleading detection + rhetoric scoring |
| E2A | None | Misleading detection + intent attribution |
| E2B | Ground truth label (misleading/not misleading) | Misleading detection + intent attribution |
| E2C | Ground truth label + error type(s) | Misleading detection + intent attribution |

**Rhetoric categories (E1x):** `information_access_rhetoric`, `provenance_rhetoric`, `mapping_rhetoric`, `linguistic_based_rhetoric`, `procedural_rhetoric`. Each is scored on a scale from -1 (unknown) to 6 (very strong contribution).

**Intent categories (E2x):** `aesthetic_driven_misrepresentation`, `bias_exploitation`, `claim_supporting_manipulation`, `context_distortion`, `deliberate_reader_confusion`, `lack_of_visualization_literacy`, `selective_reporting`, `space_and_format_constraints`, `unintentional_context_omission`. Each is scored on the same -1 to 6 scale.

---

## File Descriptions

### `models.csv`

Metadata for the 16 models (15 open-weight models and GPT-5.4) included in the sample. Full model metadata is in the paper.

| Column | Type | Description |
|---|---|---|
| `nickname` | str | Short model ID used throughout the dataset (e.g., `deepseek`) |
| `id` | str | Hugging Face model ID (e.g., `deepseek-ai/deepseek-vl2`) |
| `total_parameters` | int | Total parameter count in billions |
| `active_parameters` | int | Active parameters in billions (for MoE models; 0 for dense models) |

---

### `responses/{experiment}.parquet`

Raw output from each model for each image, one row per (image, model) pair.

**Rows:** 37,376 per experiment for `twitter/` on E0, E1A, E2A (all images × 16 models); 18,688 for `twitter/` on E1B, E1C, E2B, E2C (misleading-only subset × 16 models); 2,080 per experiment for `vislies/` (130 images × 16 models).

| Column | Type | Description |
|---|---|---|
| `image_id` | str | Unique image identifier (tweet ID for `twitter/`, VisLies item ID for `vislies/`) |
| `experiment` | str | Experiment ID (e.g., `E1A`) |
| `model` | str | Model nickname |
| `prompt_tokens` | int | Number of prompt tokens consumed |
| `completion_tokens` | int | Number of completion tokens generated |
| `total_tokens` | int | Total tokens (prompt + completion) |
| `analysis` | str | Free-text analysis of the visualization |
| `is_misleading` | bool | Model's binary judgment: `True` = misleading, `False` = not misleading |
| `why_misleading` | str | Textual justification (empty string if `is_misleading` is `False`) |
| `r\|{rhetoric_type}\|why` | str | Explanation for the rhetoric type (E1x only; empty if score ≤ 0) |
| `r\|{rhetoric_type}\|score` | int | Contribution score for the rhetoric type, -1 to 6 (E1x only) |
| `i\|{intent_type}\|why` | str | Explanation for the intent type (E2x only; empty if score ≤ 0) |
| `i\|{intent_type}\|score` | int | Contribution score for the intent type, -1 to 6 (E2x only) |

---

### `extractions/{experiment}.parquet`

Structured reasoning fields extracted from each model's free-text `analysis` by a meta-LLM annotator (`openai/gpt-oss-120b`).
These fields decompose the analysis into interpretable reasoning dimensions.

| Column | Type | Description |
|---|---|---|
| `image_id` | str | Unique image identifier |
| `experiment` | str | Experiment ID |
| `model` | str | Model nickname |
| `annotator` | str | Meta-LLM model ID used for extraction (e.g., `openai/gpt-oss-120b`) |
| `a\|visual_focus` | str | What the model focused on visually in the chart |
| `a\|caption_reasoning` | str | How the model interpreted the image caption |
| `a\|normative_baseline` | str | What standard or baseline the model compared the visualization against |
| `a\|evidence` | str | Evidence cited for the misleading assessment |
| `a\|data_claim_gap` | str | Gap identified between data shown and claims made |
| `a\|causal_reasoning` | str | Causal inferences drawn by the model |
| `a\|intent_attribution` | str | Authorial intent inferred by the model |
| `a\|viewer_impact` | str | How the model assessed the visualization's impact on a viewer |
| `a\|interpretive_conclusion` | str | The model's final interpretive conclusion |
| `a\|uncertainty` | str | Uncertainty or hedging expressed by the model |
| `a\|error_evidence` | str | Evidence specifically tied to an annotated error (E1C/E2C only) |

---

### `similarity/centroid_distances/{topic}/{experiment}.parquet`

Pairwise cosine similarity between model response centroids (averaged over all images) for a given topic and experiment. This captures global behavioral similarity between models.

**Rows:** 120 per file (all pairs of 16 models, including the human baseline).

| Column | Type | Description |
|---|---|---|
| `experiment` | str | Experiment ID |
| `model_a` | str | First model nickname |
| `model_b` | str | Second model nickname |
| `topic` | str | The response field or topic being compared (see topic list below) |
| `type` | str | Always `centroid_distances` |
| `cosine_sim` | float | Cosine similarity between the two model centroids [0, 1] |

---

### `similarity/model_agreement/{topic}/{experiment}.parquet`

Pairwise cosine similarity between model responses on a per-image basis for a given topic and experiment. This captures local behavioral agreement at the individual visualization level.

**Rows:** 280,320 per file for `twitter/` E0/E1A/E2A (120 pairs × 2,336 images); 140,160 for `twitter/` E1B/E1C/E2B/E2C (120 pairs × 1,168 images); proportionally smaller for `vislies/`.

| Column | Type | Description |
|---|---|---|
| `experiment` | str | Experiment ID |
| `model_a` | str | First model nickname |
| `model_b` | str | Second model nickname |
| `topic` | str | The response field or topic being compared |
| `type` | str | Always `model_agreement` |
| `image_id` | str | Unique image identifier |
| `cosine_sim` | float | Per-image cosine similarity between model responses [0, 1] |

---

### `similarity/setup_shift/{topic}/{model}.parquet`

Pairwise cosine similarity between a single model's responses across different experiments (conditions) on a per-image basis. This captures how much a model's response shifts when the experimental setup changes.

**Rows:** 28,032 per file for `twitter/` (all experiment pairs × 2,336 images).

| Column | Type | Description |
|---|---|---|
| `experiment` | str | Model name (used as a grouping key in this file) |
| `topic` | str | The response field or topic being compared |
| `type` | str | Always `setup_shift` |
| `image_id` | str | Unique image identifier |
| `experiment_a` | str | First experiment ID in the pair |
| `experiment_b` | str | Second experiment ID in the pair |
| `cosine_sim` | float | Per-image cosine similarity between responses in the two conditions [0, 1] |

---

### `umap/{topic}/{experiment}.parquet`

2D UMAP projections of the sentence embeddings of model responses for a given topic and experiment. Used for the visual explorer and the semantic analysis figures in the paper.

| Column | Type | Description |
|---|---|---|
| `image_id` | str | Unique image identifier |
| `model` | str | Model nickname |
| `experiment` | str | Experiment ID |
| `topic` | str | The response field or topic being projected |
| `x` | float | UMAP dimension 1 |
| `y` | float | UMAP dimension 2 |

---

### `umap10/{topic}/{experiment}.parquet` (vislies only)

10-dimensional UMAP projections of sentence embeddings used for the BERTopic-based semantic cluster analysis in the paper. The `x0`–`x9` columns contain the 10 coordinates.

| Column | Type | Description |
|---|---|---|
| `image_id` | str | Unique image identifier |
| `model` | str | Model nickname |
| `experiment` | str | Experiment ID |
| `topic` | str | The response field or topic being projected |
| `x0`–`x9` | float | UMAP dimensions 0–9 |

---

## Topic Keys

Topics used in the `topic` column of the similarity and UMAP files follow a naming convention with a prefix indicating the analysis category:

**Analysis fields (prefix `a___`):** `a___analysis`, `a___analysis_whymis`, `a___behavior_signature`, `a___full_response`, `a___whymis`

**Extraction fields (prefix `e___`):** `e___caption_reasoning`, `e___causal_reasoning`, `e___data_claim_gap`, `e___evidence`, `e___intent_attribution`, `e___interpretive_conclusion`, `e___normative_baseline`, `e___uncertainty`, `e___viewer_impact`, `e___visual_focus`

**Rhetoric fields (prefix `r___`, E1x only):** `r___all`, `r___information_access_rhetoric`, `r___provenance_rhetoric`, `r___mapping_rhetoric`, `r___linguistic_based_rhetoric`, `r___procedural_rhetoric`

**Intent fields (prefix `i___`, E2x only):** `i___all`, `i___aesthetic_driven_misrepresentation`, `i___bias_exploitation`, `i___claim_supporting_manipulation`, `i___context_distortion`, `i___deliberate_reader_confusion`, `i___lack_of_visualization_literacy`, `i___selective_reporting`, `i___space_and_format_constraints`, `i___unintentional_context_omission`

---

## Models Evaluated

| Nickname | Model | Provider | Params (B) | Active (B) |
|---|---|---|---|---|
| nemotron | Nemotron-Nano-V2-VL | NVIDIA | 12 | – |
| mistral | Mistral-Small-3.2 | Mistral AI | 24 | – |
| deepseek | DeepSeek-VL2 | DeepSeek | 27 | 5 |
| gemma | Gemma3 | Google | 27 | – |
| gta | GTA1 | Salesforce | 32 | – |
| qianfan | Qianfan-VL | Baidu | 70 | – |
| molmo | Molmo | Ai2 | 72 | – |
| glm | GLM-4.5V | Z.ai | 108 | 12 |
| llava | LLaVA-NeXT | LLaVA | 110 | – |
| pixtral | Pixtral-Large | Mistral AI | 124 | – |
| qwen | Qwen3-VL | Alibaba | 235 | 22 |
| intern | InternVL3.5 | OpenGVLab | 241 | 28 |
| step3 | Step3 (FP8) | StepFun AI | 321 | 38 |
| maverick | Llama-4-Maverick (FP8) | Meta | 400 | 17 |
| kimi | Kimi-K2.5 | Moonshot AI | 1,000 | 32 |
| gpt | GPT-5.4 | OpenAI | – | – |

---

## License

The results in this dataset are released under [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/). The images used are not included in this repository. Please refer to the original source datasets for image licenses.
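The row counts quoted in the file descriptions can be cross-checked with a little arithmetic. The sketch below is only a sanity check, not part of the dataset tooling. It uses the figures stated in this README (16 models, 2,336 Twitter images, 1,168 in the misleading-only subset, 130 VisLies images). The `setup_shift` decomposition is an assumption: the README says only "all experiment pairs", and the code assumes each experiment pair is compared on the images both experiments cover, which happens to reproduce the stated 28,032 rows.

```python
from itertools import combinations
from math import comb

# Figures taken from the README text.
N_MODELS = 16
TWITTER_ALL_IMAGES = 2336         # images used in E0, E1A, E2A
TWITTER_MISLEADING_IMAGES = 1168  # misleading-only subset used in E1B, E1C, E2B, E2C
VISLIES_IMAGES = 130

FULL_SET_EXPERIMENTS = {"E0", "E1A", "E2A"}
SUBSET_EXPERIMENTS = {"E1B", "E1C", "E2B", "E2C"}


def twitter_images_for(experiment: str) -> int:
    """Number of Twitter images covered by one experiment."""
    if experiment in FULL_SET_EXPERIMENTS:
        return TWITTER_ALL_IMAGES
    return TWITTER_MISLEADING_IMAGES


# Unordered model pairs: C(16, 2).
model_pairs = comb(N_MODELS, 2)

# responses/: one row per (image, model) pair.
responses_rows = {
    "twitter_full": N_MODELS * TWITTER_ALL_IMAGES,
    "twitter_misleading": N_MODELS * TWITTER_MISLEADING_IMAGES,
    "vislies": N_MODELS * VISLIES_IMAGES,
}

# similarity/model_agreement/: one row per (model pair, image).
agreement_rows = {
    "twitter_full": model_pairs * TWITTER_ALL_IMAGES,
    "twitter_misleading": model_pairs * TWITTER_MISLEADING_IMAGES,
}


def setup_shift_rows() -> int:
    """Rows per setup_shift file under the ASSUMPTION that each experiment
    pair is compared only on the images that both experiments cover."""
    experiments = sorted(FULL_SET_EXPERIMENTS | SUBSET_EXPERIMENTS)
    total = 0
    for exp_a, exp_b in combinations(experiments, 2):
        # The misleading-only subset is assumed to be contained in the full
        # set, so the shared images are the smaller of the two counts.
        total += min(twitter_images_for(exp_a), twitter_images_for(exp_b))
    return total


if __name__ == "__main__":
    print("model pairs:", model_pairs)
    print("responses rows:", responses_rows)
    print("model_agreement rows:", agreement_rows)
    print("setup_shift rows (assumed decomposition):", setup_shift_rows())
```

Under that assumption, 3 experiment pairs span the full image set and the other 18 span the misleading-only subset, giving 3 × 2,336 + 18 × 1,168 = 28,032. Comparing the result with the actual length of a downloaded file would confirm or refute this reading.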

Provider:

truevislies
