truevislies/results
Source: Hugging Face. Updated 2026-04-15; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/truevislies/results
Resource description:
---
license: cc-by-nc-4.0
task_categories:
- image-classification
- text-generation
language:
- en
tags:
- visualization
- misinformation
- misleading-visualizations
- COVID-19
- large-language-models
- multimodal
- rhetoric
- authorial-intent
pretty_name: TrueVisLies – Results
size_categories:
- 10M<n<100M
---
# TrueVisLies – Results
This dataset contains all raw outputs, extracted fields, semantic similarity scores, and UMAP projections produced in the paper:
> **True (VIS) Lies: Analyzing How Generative AI Recognizes Intentionality, Rhetoric, and Misleadingness in Visualization Lies**
The paper evaluates 16 models, 15 open-weight vision-language models (VLMs) plus GPT-5.4, on their ability to (RQ0) detect misleading data visualizations, (RQ1) identify the visualization rhetoric techniques involved, and (RQ2) attribute the authorial intent behind a misleading visualization.
Two datasets are used:
- [COVID-19 Dataset](https://huggingface.co/datasets/truevislies/twitter)
- [VisLies Dataset](https://huggingface.co/datasets/truevislies/vislies)
---
## Repository Structure
The dataset is organized into two top-level folders: `twitter/` for the COVID-19 Twitter dataset and `vislies/` for the VisLies gallery. Both folders share the same internal structure:
```
{corpus}/
├── models.csv
├── responses/
│   └── {experiment}.parquet            # Raw model responses
├── extractions/
│   └── {experiment}.parquet            # Structured field extractions (by a meta-LLM)
├── similarity/
│   ├── centroid_distances/
│   │   └── {topic}/
│   │       └── {experiment}.parquet    # Per-topic model-pair cosine similarity (aggregated over images)
│   ├── model_agreement/
│   │   └── {topic}/
│   │       └── {experiment}.parquet    # Per-image model-pair cosine similarity
│   └── setup_shift/
│       └── {topic}/
│           └── {model}.parquet         # Per-image cross-experiment cosine similarity for a given model
├── umap/
│   └── {topic}/
│       └── {experiment}.parquet        # 2D UMAP projections of response embeddings
└── umap10/
    └── {topic}/
        └── {experiment}.parquet        # 10D UMAP projections of response embeddings
```
**Experiments** (`{experiment}`): `E0`, `E1A`, `E1B`, `E1C`, `E2A`, `E2B`, `E2C`.
**Models** (`{model}`): `deepseek`, `gemma`, `glm`, `gpt`, `gta`, `intern`, `kimi`, `llava`, `maverick`, `mistral`, `molmo`, `nemotron`, `pixtral`, `qianfan`, `qwen`, `step3`. Details of the models are in the `models.csv` file and in the paper.
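The layout above maps directly onto relative file paths. The following is a minimal sketch of that mapping; the helper names `responses_path` and `similarity_path` are hypothetical and not part of the dataset.

```python
from pathlib import PurePosixPath

CORPORA = ("twitter", "vislies")
EXPERIMENTS = ("E0", "E1A", "E1B", "E1C", "E2A", "E2B", "E2C")
SIMILARITY_KINDS = ("centroid_distances", "model_agreement", "setup_shift")


def responses_path(corpus: str, experiment: str) -> str:
    """Relative path of the raw-response file for one corpus and experiment."""
    if corpus not in CORPORA:
        raise ValueError(f"unknown corpus: {corpus!r}")
    if experiment not in EXPERIMENTS:
        raise ValueError(f"unknown experiment: {experiment!r}")
    return str(PurePosixPath(corpus) / "responses" / f"{experiment}.parquet")


def similarity_path(corpus: str, kind: str, topic: str, stem: str) -> str:
    """Relative path of a similarity file.

    `stem` is an experiment ID for `centroid_distances` and `model_agreement`,
    and a model nickname for `setup_shift`, following the tree above.
    """
    if corpus not in CORPORA:
        raise ValueError(f"unknown corpus: {corpus!r}")
    if kind not in SIMILARITY_KINDS:
        raise ValueError(f"unknown similarity kind: {kind!r}")
    return str(PurePosixPath(corpus) / "similarity" / kind / topic / f"{stem}.parquet")


print(responses_path("twitter", "E1A"))
print(similarity_path("vislies", "setup_shift", "a___analysis", "gemma"))
```

A path built this way could then be passed as the `filename` argument of `huggingface_hub.hf_hub_download` (with `repo_type="dataset"`) and the downloaded file read with `pandas.read_parquet`, assuming both libraries are installed.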
---
## Embeddings and Similarity Scores
All similarity scores in the `similarity/` folder, and the UMAP projections in the `umap/` and `umap10/` folders, are computed using cosine similarity on embeddings of the raw LLM outputs, generated by the `Qwen3-Embedding-8B` model. The `topic` column in the similarity files indicates which response field the embedding was generated from (e.g., `a___analysis` for the full free-text analysis, `e___causal_reasoning` for the extracted causal reasoning field). The same applies to the UMAP files. The original high-dimensional embeddings (4096 dimensions) are not included in the dataset due to their large size.
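For reference, cosine similarity between two embedding vectors can be sketched in plain Python as follows. This is an illustrative definition only, not the code used to produce the dataset; the short vectors stand in for the 4096-dimensional embeddings, which are not distributed.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy 2-dimensional vectors (made up for illustration).
same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```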
---
## Experimental Conditions
Each experiment corresponds to a specific prompt sent to the model together with the visualization image and its accompanying caption. E0 is the baseline, with no prior knowledge and no additional task. The remaining six conditions form a 3×2 design: three knowledge anchors (A, B, C) crossed with two task scopes (E1 = rhetoric, E2 = authorial intent).
| ID | Prior knowledge | Task |
|---|---|---|
| E0 | None (open-ended analysis) | Misleading detection only |
| E1A | None | Misleading detection + rhetoric scoring |
| E1B | Ground truth label (misleading/not misleading) | Misleading detection + rhetoric scoring |
| E1C | Ground truth label + error type(s) | Misleading detection + rhetoric scoring |
| E2A | None | Misleading detection + intent attribution |
| E2B | Ground truth label (misleading/not misleading) | Misleading detection + intent attribution |
| E2C | Ground truth label + error type(s) | Misleading detection + intent attribution |
**Rhetoric categories (E1x):** `information_access_rhetoric`, `provenance_rhetoric`, `mapping_rhetoric`, `linguistic_based_rhetoric`, `procedural_rhetoric`. Each is scored on a scale from -1 (unknown) to 6 (very strong contribution).
**Intent categories (E2x):** `aesthetic_driven_misrepresentation`, `bias_exploitation`, `claim_supporting_manipulation`, `context_distortion`, `deliberate_reader_confusion`, `lack_of_visualization_literacy`, `selective_reporting`, `space_and_format_constraints`, `unintentional_context_omission`. Each is scored on the same -1 to 6 scale.
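The experiment IDs encode the design in the table above: the digit selects the task and the letter selects the knowledge anchor. A minimal sketch of decoding an ID, using a hypothetical helper `describe_experiment`:

```python
from typing import NamedTuple


class Condition(NamedTuple):
    task: str
    prior_knowledge: str


# Digit -> additional task, letter -> knowledge anchor (taken from the table above).
_TASKS = {"1": "rhetoric scoring", "2": "intent attribution"}
_ANCHORS = {
    "A": "none",
    "B": "ground truth label",
    "C": "ground truth label + error type(s)",
}

ALL_EXPERIMENTS = ("E0", "E1A", "E1B", "E1C", "E2A", "E2B", "E2C")


def describe_experiment(experiment_id: str) -> Condition:
    """Decode an experiment ID such as 'E1B' into its task and prior knowledge."""
    if experiment_id == "E0":
        return Condition("misleading detection only", "none")
    if (
        len(experiment_id) != 3
        or experiment_id[0] != "E"
        or experiment_id[1] not in _TASKS
        or experiment_id[2] not in _ANCHORS
    ):
        raise ValueError(f"unknown experiment ID: {experiment_id!r}")
    task = "misleading detection + " + _TASKS[experiment_id[1]]
    return Condition(task, _ANCHORS[experiment_id[2]])


for exp in ALL_EXPERIMENTS:
    print(exp, describe_experiment(exp))
```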
---
## File Descriptions
### `models.csv`
Metadata for the models in the sample (the 16 evaluated models comprise 15 open-weight models and GPT-5.4). Full model metadata is in the paper.
| Column | Type | Description |
|---|---|---|
| `nickname` | str | Short model ID used throughout the dataset (e.g., `deepseek`) |
| `id` | str | Hugging Face model ID (e.g., `deepseek-ai/deepseek-vl2`) |
| `total_parameters` | int | Total parameter count in billions |
| `active_parameters` | int | Active parameters in billions (for MoE models; 0 for dense models) |
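
A minimal sketch of reading `models.csv` with the standard library and flagging mixture-of-experts models follows. The in-memory sample stands in for the real file: the `deepseek` row uses values stated in this card, while the second row (`dense_example`) is entirely made up to show the "0 for dense models" convention. The sketch also assumes every row has integer parameter counts.

```python
import csv
import io

# Stand-in for models.csv; the second row is a hypothetical dense model.
SAMPLE_CSV = (
    "nickname,id,total_parameters,active_parameters\n"
    "deepseek,deepseek-ai/deepseek-vl2,27,5\n"
    "dense_example,example-org/example-model,27,0\n"
)


def load_models(fileobj) -> list[dict]:
    """Parse model metadata rows and add an `is_moe` flag."""
    models = []
    for row in csv.DictReader(fileobj):
        total = int(row["total_parameters"])    # billions
        active = int(row["active_parameters"])  # billions; 0 means dense
        models.append(
            {
                "nickname": row["nickname"],
                "id": row["id"],
                "total_parameters": total,
                "active_parameters": active,
                "is_moe": active > 0,
            }
        )
    return models


models = load_models(io.StringIO(SAMPLE_CSV))
```

With the real file, `io.StringIO(SAMPLE_CSV)` would be replaced by an open file handle, for example `open("twitter/models.csv", newline="")`.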
---
### `responses/{experiment}.parquet`
Raw output from each model for each image, one row per (image, model) pair.
**Rows:** 37,376 per experiment for `twitter/` on E0, E1A, E2A (all images × 16 models); 18,688 for `twitter/` on E1B, E1C, E2B, E2C (misleading-only subset × 16 models); 2,080 per experiment for `vislies/` (130 images × 16 models).
| Column | Type | Description |
|---|---|---|
| `image_id` | str | Unique image identifier (tweet ID for `twitter/`, VisLies item ID for `vislies/`) |
| `experiment` | str | Experiment ID (e.g., `E1A`) |
| `model` | str | Model nickname |
| `prompt_tokens` | int | Number of prompt tokens consumed |
| `completion_tokens` | int | Number of completion tokens generated |
| `total_tokens` | int | Total tokens (prompt + completion) |
| `analysis` | str | Free-text analysis of the visualization |
| `is_misleading` | bool | Model's binary judgment: `True` = misleading, `False` = not misleading |
| `why_misleading` | str | Textual justification (empty string if `is_misleading` is `False`) |
| `r\|{rhetoric_type}\|why` | str | Explanation for the rhetoric type (E1x only; empty if score ≤ 0) |
| `r\|{rhetoric_type}\|score` | int | Contribution score for the rhetoric type, -1 to 6 (E1x only) |
| `i\|{intent_type}\|why` | str | Explanation for the intent type (E2x only; empty if score ≤ 0) |
| `i\|{intent_type}\|score` | int | Contribution score for the intent type, -1 to 6 (E2x only) |
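
The score columns can be enumerated from the category lists given earlier. The sketch below assumes that the backslashes in the table are Markdown escapes, so that the actual column names contain plain `|` separators (e.g., `r|mapping_rhetoric|score`); this should be checked against the Parquet schema. The helper names are hypothetical.

```python
from typing import Iterable, Optional

RHETORIC_TYPES = (
    "information_access_rhetoric",
    "provenance_rhetoric",
    "mapping_rhetoric",
    "linguistic_based_rhetoric",
    "procedural_rhetoric",
)
INTENT_TYPES = (
    "aesthetic_driven_misrepresentation",
    "bias_exploitation",
    "claim_supporting_manipulation",
    "context_distortion",
    "deliberate_reader_confusion",
    "lack_of_visualization_literacy",
    "selective_reporting",
    "space_and_format_constraints",
    "unintentional_context_omission",
)


def score_columns(experiment: str) -> list[str]:
    """Score column names expected in responses/{experiment}.parquet."""
    if experiment.startswith("E1"):
        prefix, types = "r", RHETORIC_TYPES
    elif experiment.startswith("E2"):
        prefix, types = "i", INTENT_TYPES
    else:
        return []  # E0 has no rhetoric or intent scores
    return [f"{prefix}|{t}|score" for t in types]


def mean_known_score(scores: Iterable[int]) -> Optional[float]:
    """Average of scores, skipping -1 (unknown); None if nothing is known."""
    known = [s for s in scores if s != -1]
    return sum(known) / len(known) if known else None
```

Treating -1 as missing rather than as a numeric value avoids pulling averages downward; whether that is the right choice depends on the analysis.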
---
### `extractions/{experiment}.parquet`
Structured reasoning fields extracted from each model's free-text `analysis` by a meta-LLM annotator (`openai/gpt-oss-120b`). These fields decompose the analysis into interpretable reasoning dimensions.
| Column | Type | Description |
|---|---|---|
| `image_id` | str | Unique image identifier |
| `experiment` | str | Experiment ID |
| `model` | str | Model nickname |
| `annotator` | str | Meta-LLM model ID used for extraction (e.g., `openai/gpt-oss-120b`) |
| `a\|visual_focus` | str | What the model focused on visually in the chart |
| `a\|caption_reasoning` | str | How the model interpreted the image caption |
| `a\|normative_baseline` | str | What standard or baseline the model compared the visualization against |
| `a\|evidence` | str | Evidence cited for the misleading assessment |
| `a\|data_claim_gap` | str | Gap identified between data shown and claims made |
| `a\|causal_reasoning` | str | Causal inferences drawn by the model |
| `a\|intent_attribution` | str | Authorial intent inferred by the model |
| `a\|viewer_impact` | str | How the model assessed the visualization's impact on a viewer |
| `a\|interpretive_conclusion` | str | The model's final interpretive conclusion |
| `a\|uncertainty` | str | Uncertainty or hedging expressed by the model |
| `a\|error_evidence` | str | Evidence specifically tied to an annotated error (E1C/E2C only) |
---
### `similarity/centroid_distances/{topic}/{experiment}.parquet`
Pairwise cosine similarity between model response centroids (averaged over all images) for a given topic and experiment. This captures global behavioral similarity between models.
**Rows:** 120 per file (all pairs of 16 models, including the human baseline).
| Column | Type | Description |
|---|---|---|
| `experiment` | str | Experiment ID |
| `model_a` | str | First model nickname |
| `model_b` | str | Second model nickname |
| `topic` | str | The response field or topic being compared (see topic list below) |
| `type` | str | Always `centroid_distances` |
| `cosine_sim` | float | Cosine similarity between the two model centroids [0, 1] |
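
To illustrate how a centroid-level score differs from a per-image score, the sketch below averages each model's embeddings into a centroid and compares the centroids. The toy 2-dimensional vectors are made up, and the sketch assumes a plain arithmetic mean without prior normalization, which may differ from the procedure used in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


# One embedding per image, in the same image order for both models (toy data).
embeddings = {
    "model_a": [[1.0, 0.0], [0.0, 1.0]],
    "model_b": [[1.0, 0.0], [1.0, 0.0]],
}

# Global similarity: compare the two centroids (one value per model pair).
centroid_sim = cosine(centroid(embeddings["model_a"]), centroid(embeddings["model_b"]))

# Local similarity: compare the two models image by image.
per_image_sim = [
    cosine(a, b) for a, b in zip(embeddings["model_a"], embeddings["model_b"])
]
```

In this toy case the two models agree fully on the first image and not at all on the second, a pattern that the single centroid score cannot show.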
---
### `similarity/model_agreement/{topic}/{experiment}.parquet`
Pairwise cosine similarity between model responses on a per-image basis for a given topic and experiment. This captures local behavioral agreement at the individual visualization level.
**Rows:** 280,320 per file for `twitter/` E0/E1A/E2A (120 pairs × 2,336 images); 140,160 for `twitter/` E1B/E1C/E2B/E2C (120 pairs × 1,168 images); proportionally smaller for `vislies/`.
| Column | Type | Description |
|---|---|---|
| `experiment` | str | Experiment ID |
| `model_a` | str | First model nickname |
| `model_b` | str | Second model nickname |
| `topic` | str | The response field or topic being compared |
| `type` | str | Always `model_agreement` |
| `image_id` | str | Unique image identifier |
| `cosine_sim` | float | Per-image cosine similarity between model responses [0, 1] |
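
The row counts quoted above follow from counting unordered model pairs. A short check of the arithmetic is sketched below; it assumes each unordered pair appears once per image, and sorting the nicknames is only for a stable listing, not a statement about the order used in the files.

```python
import math
from itertools import combinations

MODELS = [
    "deepseek", "gemma", "glm", "gpt", "gta", "intern", "kimi", "llava",
    "maverick", "mistral", "molmo", "nemotron", "pixtral", "qianfan",
    "qwen", "step3",
]

# Every unordered pair of distinct models.
pairs = list(combinations(sorted(MODELS), 2))


def expected_rows(n_models: int, n_images: int) -> int:
    """Rows in a per-image agreement file: (pairs of models) x (images)."""
    return math.comb(n_models, 2) * n_images


print(len(pairs))                         # number of model pairs
print(expected_rows(len(MODELS), 2336))   # twitter, E0/E1A/E2A
print(expected_rows(len(MODELS), 1168))   # twitter, E1B/E1C/E2B/E2C
```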
---
### `similarity/setup_shift/{topic}/{model}.parquet`
Pairwise cosine similarity between a single model's responses across different experiments (conditions) on a per-image basis. This captures how much a model's response shifts when the experimental setup changes.
**Rows:** 28,032 per file for `twitter/` (all experiment pairs × 2,336 images).
| Column | Type | Description |
|---|---|---|
| `experiment` | str | Model name (used as a grouping key in this file) |
| `topic` | str | The response field or topic being compared |
| `type` | str | Always `setup_shift` |
| `image_id` | str | Unique image identifier |
| `experiment_a` | str | First experiment ID in the pair |
| `experiment_b` | str | Second experiment ID in the pair |
| `cosine_sim` | float | Per-image cosine similarity between responses in the two conditions [0, 1] |
---
### `umap/{topic}/{experiment}.parquet`
2D UMAP projections of the sentence embeddings of model responses for a given topic and experiment. Used for the visual explorer and the semantic analysis figures in the paper.
| Column | Type | Description |
|---|---|---|
| `image_id` | str | Unique image identifier |
| `model` | str | Model nickname |
| `experiment` | str | Experiment ID |
| `topic` | str | The response field or topic being projected |
| `x` | float | UMAP dimension 1 |
| `y` | float | UMAP dimension 2 |
---
### `umap10/{topic}/{experiment}.parquet` (vislies only)
10-dimensional UMAP projections of sentence embeddings used for the BERTopic-based semantic cluster analysis in the paper. The `x0`–`x9` columns contain the 10 coordinates.
| Column | Type | Description |
|---|---|---|
| `image_id` | str | Unique image identifier |
| `model` | str | Model nickname |
| `experiment` | str | Experiment ID |
| `topic` | str | The response field or topic being projected |
| `x0`–`x9` | float | UMAP dimensions 0–9 |
---
## Topic Keys
Topics used in the `topic` column of the similarity and UMAP files follow a naming convention with a prefix indicating the analysis category:
**Analysis fields (prefix `a___`):** `a___analysis`, `a___analysis_whymis`, `a___behavior_signature`, `a___full_response`, `a___whymis`
**Extraction fields (prefix `e___`):** `e___caption_reasoning`, `e___causal_reasoning`, `e___data_claim_gap`, `e___evidence`, `e___intent_attribution`, `e___interpretive_conclusion`, `e___normative_baseline`, `e___uncertainty`, `e___viewer_impact`, `e___visual_focus`
**Rhetoric fields (prefix `r___`, E1x only):** `r___all`, `r___information_access_rhetoric`, `r___provenance_rhetoric`, `r___mapping_rhetoric`, `r___linguistic_based_rhetoric`, `r___procedural_rhetoric`
**Intent fields (prefix `i___`, E2x only):** `i___all`, `i___aesthetic_driven_misrepresentation`, `i___bias_exploitation`, `i___claim_supporting_manipulation`, `i___context_distortion`, `i___deliberate_reader_confusion`, `i___lack_of_visualization_literacy`, `i___selective_reporting`, `i___space_and_format_constraints`, `i___unintentional_context_omission`
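A topic key can be split into its category and field name at the triple underscore. The sketch below uses hypothetical helpers, `parse_topic` and `topic_applies`, that also encode the "E1x only" and "E2x only" restrictions listed above.

```python
TOPIC_PREFIXES = {
    "a": "analysis",
    "e": "extraction",
    "r": "rhetoric",
    "i": "intent",
}


def parse_topic(topic: str) -> tuple[str, str]:
    """Split a key such as 'e___causal_reasoning' into (category, field)."""
    prefix, separator, field = topic.partition("___")
    if not separator or not field or prefix not in TOPIC_PREFIXES:
        raise ValueError(f"not a recognised topic key: {topic!r}")
    return TOPIC_PREFIXES[prefix], field


def topic_applies(topic: str, experiment: str) -> bool:
    """Whether a topic is expected to exist for a given experiment ID."""
    category, _ = parse_topic(topic)
    if category == "rhetoric":
        return experiment.startswith("E1")
    if category == "intent":
        return experiment.startswith("E2")
    return True  # analysis and extraction topics carry no stated restriction


print(parse_topic("a___analysis_whymis"))
print(topic_applies("r___all", "E2A"))
```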
---
## Models Evaluated
| Nickname | Model | Provider | Params (B) | Active (B) |
|---|---|---|---|---|
| nemotron | Nemotron-Nano-V2-VL | NVIDIA | 12 | – |
| mistral | Mistral-Small-3.2 | Mistral AI | 24 | – |
| deepseek | DeepSeek-VL2 | DeepSeek | 27 | 5 |
| gemma | Gemma3 | Google | 27 | – |
| gta | GTA1 | Salesforce | 32 | – |
| qianfan | Qianfan-VL | Baidu | 70 | – |
| molmo | Molmo | Ai2 | 72 | – |
| glm | GLM-4.5V | Z.ai | 108 | 12 |
| llava | LLaVA-NeXT | LLaVA | 110 | – |
| pixtral | Pixtral-Large | Mistral AI | 124 | – |
| qwen | Qwen3-VL | Alibaba | 235 | 22 |
| intern | InternVL3.5 | OpenGVLab | 241 | 28 |
| step3 | Step3 (FP8) | StepFun AI | 321 | 38 |
| maverick | Llama-4-Maverick (FP8) | Meta | 400 | 17 |
| kimi | Kimi-K2.5 | Moonshot AI | 1,000 | 32 |
| gpt | GPT-5.4 | OpenAI | – | – |
---
## License
The results in this dataset are released under [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/). The source images are not included in this repository; refer to the original source datasets for image licenses.
Provider: truevislies



