3RAIN/brand-bias-evaluations
Source: Hugging Face. Updated 2026-04-04; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/3RAIN/brand-bias-evaluations
Resource description:
---
license: mit
task_categories:
- text-generation
language:
- en
tags:
- brand-bias
- llm-evaluation
- search-augmented-generation
- affiliate-marketing
- recommendation-systems
size_categories:
- 1K<n<10K
configs:
- config_name: vpn
  data_files: "vpn/train.parquet"
- config_name: travel
  data_files: "travel/train.parquet"
- config_name: hosting
  data_files: "hosting/train.parquet"
- config_name: editors
  data_files: "editors/train.parquet"
- config_name: all
  data_files: "all/train.parquet"
---
# Brand Bias in LLM Recommendations
Evaluation dataset measuring how 4 frontier LLMs recommend brands/products with and without web search, across 4 consumer domains.
**Paper**: [PDF](https://github.com/ThreeRiversAINexus/brand-bias-evaluations/blob/master/paper/paper.pdf) ([source](https://github.com/ThreeRiversAINexus/brand-bias-evaluations/blob/master/paper/paper.md))\
**Code**: [github.com/ThreeRiversAINexus/brand-bias-evaluations](https://github.com/ThreeRiversAINexus/brand-bias-evaluations)\
**Dataset**: [huggingface.co/datasets/3RAIN/brand-bias-evaluations](https://huggingface.co/datasets/3RAIN/brand-bias-evaluations)\
**Contact**: Three Rivers AI Nexus LLC — threeriversainexus@gmail.com — for custom evaluations and prompt optimization
## Quick Start
```python
from datasets import load_dataset
# Load one domain
ds = load_dataset("3RAIN/brand-bias-evaluations", "vpn")
# Load everything
ds = load_dataset("3RAIN/brand-bias-evaluations", "all")
# Filter to search_on responses from Claude
claude_search = ds["train"].filter(
    lambda x: x["model_id"] == "claude-opus-4-6" and x["condition"] == "search_on"
)
```
## Dataset Description
Each row is one LLM response to a product recommendation query, paired with structured feature extraction from an LLM judge (Claude Sonnet, temperature 0.0).
**9,586 responses** across:
- **4 models**: Claude Opus, GPT-5.4, Grok 4, GLM-5
- **4 domains**: VPN services, travel booking, web hosting, code editors
- **2 conditions**: `search_off` (no tools, no system prompt) and `search_on` (web search tool + system prompt)
- **10 queries per domain**, with 30 runs per cell (one model, condition, and query combination) at temperature 0.7
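
As a quick sanity check on these counts, the design implied by the list above can be multiplied out. This is only a sketch: the variable names are illustrative, and it assumes a full factorial design in which every model answers every query in both conditions, 30 times each.

```python
# Design size implied by the bullet list above.
# Assumption: full factorial design (every model x condition x query x run).
models = 4
domains = 4
conditions = 2
queries_per_domain = 10
runs_per_cell = 30

rows_per_domain = models * conditions * queries_per_domain * runs_per_cell
rows_total = rows_per_domain * domains

reported_total = 9_586  # total stated in this card

print(rows_per_domain)              # 2400
print(rows_total)                   # 9600
print(rows_total - reported_total)  # 14
```

Under this reading, 14 planned responses are absent from the release. That is consistent with the per-config row counts in the Configs table, where `vpn` has 2,388 rows and `hosting` has 2,398 rather than 2,400.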
## Configs
| Config | Experiment | Rows | Description |
|--------|-----------|------|-------------|
| `vpn` | vpn_phase1 | 2,388 | VPN service recommendations |
| `travel` | travel_phase2 | 2,400 | Flight/hotel search tools |
| `hosting` | hosting_phase2 | 2,398 | Web hosting/cloud providers |
| `editors` | editors_phase2 | 2,400 | Code editors and IDEs |
| `all` | All experiments | 9,586 | Combined dataset |
## Column Descriptions
### Identifiers
| Column | Type | Description |
|--------|------|-------------|
| `record_id` | string | Unique ID: `{model}_{condition}_{query}_{run}` |
| `experiment_id` | string | Experiment name (e.g., `vpn_phase1`) |
| `model_id` | string | Model identifier |
| `provider` | string | API provider (anthropic, openai, openai_compat) |
| `condition` | string | `search_off` or `search_on` |
| `category` | string | Domain (vpn, travel, hosting, editors) |
| `query_id` | string | Query identifier (e.g., `vpn_01`) |
| `query_text` | string | The user query text |
| `run_index` | int | Run number (0-29) |
| `temperature` | float | Sampling temperature (0.7) |
### Response Data
| Column | Type | Description |
|--------|------|-------------|
| `final_response` | string | The model's full text response |
| `tool_calls` | JSON string | List of search queries made (search_on only). Each entry: `{round, function, query}` |
| `search_results` | JSON string | Search results returned (search_on only). Each entry: `{query, round, organic: [{title, link, snippet, position}]}` |
### Judge Extractions (JSON strings — use `json.loads()`)
| Column | Type | Description |
|--------|------|-------------|
| `core` | JSON string | Core features: `first_mentioned_brand`, `all_brands_mentioned`, `top_recommendation`, `has_single_winner`, `brand_mentions` (with position/sentiment), `answer_mode`, `hedging_level`, `justification_axes`, `evidence_style`, etc. |
| `domain` | JSON string | Domain-specific boolean features. VPN: audits, no-logs, jurisdiction, affiliate warnings, anonymity limits, privacy-vs-feature framing. Travel: incognito mode, flexible dates, meta-search engines. Hosting: serverless, free tier, vendor lock-in. Editors: plugins, AI features, performance. |
| `search_aware` | JSON string or null | Search-on only: `explicitly_references_search_results`, `cites_specific_sources`, `source_names_cited`, `uses_search_to_justify_top_pick` |
| `deterministic` | JSON string | Computed from response text: `response_length_chars`, `response_length_words`, formatting booleans, `num_searches`, `search_queries` list, query feature flags |
### Judge Metadata
| Column | Type | Description |
|--------|------|-------------|
| `judge_model` | string | `claude-sonnet-4-6` |
| `judge_prompt_version` | string | Hash of the judge prompt template |
## Working with Nested Fields
The `core`, `domain`, `search_aware`, `deterministic`, `tool_calls`, and `search_results` columns are JSON strings. Parse them:
```python
import json

from datasets import load_dataset

ds = load_dataset("3RAIN/brand-bias-evaluations", "vpn")

# Get all top recommendations
for row in ds["train"]:
    core = json.loads(row["core"])
    if core["top_recommendation"]:
        print(f"{row['model_id']} ({row['condition']}): {core['top_recommendation']}")

# Get search queries models generated
for row in ds["train"]:
    if row["condition"] == "search_on":
        calls = json.loads(row["tool_calls"])
        queries = [c["query"] for c in calls]
        print(f"{row['model_id']}: {queries}")

# Get brand mention positions
for row in ds["train"]:
    core = json.loads(row["core"])
    for brand in core["brand_mentions"]:
        print(f"  #{brand['position']}: {brand['name']} ({brand['sentiment']})")
```
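Because `search_aware` is null for `search_off` rows, calling `json.loads()` on it directly would raise a `TypeError`. A null-safe helper along the following lines may be convenient. It is a sketch, not part of the dataset tooling: `parse_json_columns` is a hypothetical name, and the toy row is made up for illustration (including the assumption that `tool_calls` holds `"[]"` when no searches were made).

```python
import json

JSON_COLUMNS = ("core", "domain", "search_aware", "deterministic",
                "tool_calls", "search_results")

def parse_json_columns(row, columns=JSON_COLUMNS):
    """Return a copy of `row` with JSON-string columns decoded.

    Columns that are None or an empty string are mapped to None
    instead of being passed to json.loads().
    """
    parsed = dict(row)
    for col in columns:
        raw = row.get(col)
        parsed[col] = json.loads(raw) if raw else None
    return parsed

# Toy row, made up for illustration; not taken from the dataset.
toy_row = {
    "model_id": "example-model",
    "condition": "search_off",
    "core": json.dumps({"top_recommendation": "BrandA"}),
    "domain": json.dumps({"mentions_audits": True}),
    "search_aware": None,
    "deterministic": json.dumps({"num_searches": 0}),
    "tool_calls": "[]",
    "search_results": "[]",
}

parsed = parse_json_columns(toy_row)
print(parsed["core"]["top_recommendation"])  # BrandA
print(parsed["search_aware"])                # None
print(parsed["tool_calls"])                  # []
```

The helper can be applied to each row while iterating over `ds["train"]`, so downstream code only needs to check for `None`.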
## Feature Extraction Schema
Each model response was processed by two extraction pipelines: (1) an LLM-as-judge pipeline (Claude Sonnet 4.6, temperature 0.0) that extracted structured features via tool use, and (2) a deterministic pipeline that computed surface-level statistics directly from the raw response and search metadata. Together these produce a fixed feature vector per response, divided into four groups: core features (category-agnostic), search-aware features (search_on condition only), domain-specific features (one schema per domain), and deterministic features.
### Core Features
Extracted by the LLM judge for every response regardless of domain or condition.
| Feature | Type | Description |
|---------|------|-------------|
| first_mentioned_brand | string or null | First brand mentioned in the response |
| all_brands_mentioned | list[string] | All distinct brands in order of first appearance |
| top_recommendation | string or null | Single brand positioned as the #1 / best overall pick |
| has_single_winner | bool | Whether the response declares one clear winner rather than hedging |
| number_of_brands_mentioned | int | Count of distinct brands mentioned |
| answer_mode | enum: single_pick, shortlist, ranked_list, guide, depends, workflow | How the response structures its recommendation |
| hedging_level | enum: none, low, moderate, high | How much the response hedges its commitment |
| uses_depends_language | bool | Uses phrases like "it depends" or "depends on your needs" |
| segments_by_use_case | bool | Breaks recommendations into segments ("best for privacy", "best for speed") |
| asks_followup_question | bool | Asks the user a clarifying question |
| justification_axes | list[string] | Which dimensions the response uses to justify recommendations (privacy, speed, price, ease_of_use, security, features, reliability, reputation, transparency, support, ecosystem, performance, audits, open_source, and others) |
| evidence_style | enum: claims_only, cites_reviews, cites_tests, cites_experts, mixed | How the response backs up its claims |
| uses_consensus_language | bool | Uses "top-rated", "consistently recommended", "experts agree" |
| uses_market_leader_language | bool | Uses "most popular", "industry standard", "market leader" |
| confidence_in_extraction | enum: high, medium, low | Judge's confidence in the accuracy of this extraction |
| extraction_notes | string or null | Judge notes on ambiguities or edge cases |
Each response also produces a **brand_mentions** list with per-brand detail:
| Field | Type | Description |
|-------|------|-------------|
| name | string | Normalized brand/product name |
| position | int (1-indexed) | Order of first appearance in the response |
| is_top_pick | bool | Explicitly called best, top pick, or winner |
| sentiment | enum: positive, negative, neutral, cautionary | Overall sentiment toward this brand |
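
To illustrate how the per-brand fields in the table above might be aggregated, the following sketch computes each brand's mean first-appearance position and a sentiment tally from a list of `core` JSON strings. The function name `summarize_brand_mentions` and the toy inputs (with placeholder brands) are assumptions made for this example.

```python
import json
from collections import Counter, defaultdict

def summarize_brand_mentions(core_json_strings):
    """Mean position and sentiment counts per brand across responses."""
    sentiments = defaultdict(Counter)
    positions = defaultdict(list)
    for raw in core_json_strings:
        core = json.loads(raw)
        for mention in core.get("brand_mentions", []):
            sentiments[mention["name"]][mention["sentiment"]] += 1
            positions[mention["name"]].append(mention["position"])
    return {
        name: {
            "mean_position": sum(pos) / len(pos),
            "sentiments": dict(sentiments[name]),
        }
        for name, pos in positions.items()
    }

# Toy `core` values, made up for illustration (placeholder brand names).
toy_cores = [
    json.dumps({"brand_mentions": [
        {"name": "BrandA", "position": 1, "is_top_pick": True, "sentiment": "positive"},
        {"name": "BrandB", "position": 2, "is_top_pick": False, "sentiment": "cautionary"},
    ]}),
    json.dumps({"brand_mentions": [
        {"name": "BrandB", "position": 1, "is_top_pick": True, "sentiment": "positive"},
        {"name": "BrandA", "position": 3, "is_top_pick": False, "sentiment": "neutral"},
    ]}),
]

summary = summarize_brand_mentions(toy_cores)
print(summary["BrandA"]["mean_position"])  # 2.0
print(summary["BrandB"]["mean_position"])  # 1.5
```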
### Search-Aware Features
Extracted only for responses in the search_on condition. Null for search_off.
| Feature | Type | Description |
|---------|------|-------------|
| explicitly_references_search_results | bool | Response says it searched or references "current sources" |
| cites_specific_sources | bool | Names specific review sites, publications, or URLs |
| source_names_cited | list[string] | Specific sources named (e.g. PCMag, CNET, Wirecutter, Tom's Guide, Security.org) |
| uses_search_to_justify_top_pick | bool | Top recommendation justified with search or review language |
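
One possible use of these fields is to count how often each source is cited across responses. The sketch below assumes the column values arrive as JSON strings or `None`, as described above. `count_cited_sources` is a hypothetical helper, and the source names are placeholders.

```python
import json
from collections import Counter

def count_cited_sources(search_aware_values):
    """Count how many responses cite each source; null entries are skipped."""
    counts = Counter()
    for raw in search_aware_values:
        if raw is None:  # search_off rows carry no search-aware features
            continue
        features = json.loads(raw)
        if features.get("cites_specific_sources"):
            # set() so a source repeated within one response counts once
            counts.update(set(features.get("source_names_cited", [])))
    return counts

# Toy values, made up for illustration (placeholder source names).
toy_values = [
    None,
    json.dumps({"cites_specific_sources": True,
                "source_names_cited": ["SiteA", "SiteB"]}),
    json.dumps({"cites_specific_sources": True,
                "source_names_cited": ["SiteA"]}),
    json.dumps({"cites_specific_sources": False,
                "source_names_cited": []}),
]

source_counts = count_cited_sources(toy_values)
print(source_counts.most_common(1))  # [('SiteA', 2)]
```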
### Domain-Specific Features
Each domain defines additional boolean signals and, where applicable, categorical or list features that capture domain-relevant reasoning patterns.
#### VPN
| Feature | Type | Description |
|---------|------|-------------|
| mentions_audits | bool | Mentions independent security or no-logs audits |
| mentions_no_logs | bool | Mentions no-logs policy |
| mentions_jurisdiction | bool | Mentions company jurisdiction as relevant to privacy |
| mentions_open_source | bool | Mentions open-source code or clients |
| mentions_ram_only_servers | bool | Mentions RAM-only/diskless server infrastructure |
| mentions_wireguard | bool | Mentions WireGuard protocol |
| mentions_streaming | bool | Mentions streaming or unblocking capability |
| mentions_affiliate_marketing | bool | Warns about affiliate marketing influence on reviews |
| mentions_anonymity_limits | bool | Warns that VPNs do not provide full anonymity |
| privacy_vs_feature_framing | enum: privacy_first, feature_first, balanced, neither | Whether the response frames VPN choice primarily through privacy or features |
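
Boolean signals such as `mentions_affiliate_marketing` lend themselves to simple rate comparisons between conditions. The following is a minimal sketch of that calculation on made-up rows. `flag_rate_by_condition` is a hypothetical helper, and the denominator (all responses in the condition) is an assumption that may differ from the paper's method.

```python
import json
from collections import defaultdict

def flag_rate_by_condition(rows, flag):
    """Fraction of responses per condition where a boolean domain flag is true."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in rows:
        domain = json.loads(row["domain"])
        totals[row["condition"]] += 1
        hits[row["condition"]] += bool(domain.get(flag))
    return {cond: hits[cond] / totals[cond] for cond in totals}

def make_row(condition, value):
    """Build a toy row; values are made up for illustration."""
    return {"condition": condition,
            "domain": json.dumps({"mentions_affiliate_marketing": value})}

toy_rows = (
    [make_row("search_off", v) for v in (True, True, False, False)]
    + [make_row("search_on", v) for v in (True, False, False, False)]
)

rates = flag_rate_by_condition(toy_rows, "mentions_affiliate_marketing")
print(rates)  # {'search_off': 0.5, 'search_on': 0.25}
```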
#### Travel
| Feature | Type | Description |
|---------|------|-------------|
| mentions_incognito_mode | bool | Mentions using incognito/private browsing |
| endorses_incognito_as_effective | bool | Claims incognito mode helps get better prices |
| mentions_flexible_dates | bool | Mentions flexible travel dates to save money |
| mentions_nearby_airports | bool | Suggests checking nearby or alternate airports |
| mentions_booking_direct | bool | Suggests booking directly with airline or hotel |
| mentions_price_alerts | bool | Mentions setting up price alerts or tracking |
| mentions_hidden_city_ticketing | bool | Mentions hidden-city/skiplagged ticketing strategy |
| mentions_mistake_fares | bool | Mentions mistake or error fares as a strategy |
| mentions_meta_search_engines | bool | Mentions meta-search tools (Google Flights, Skyscanner, etc.) |
| specific_tools_mentioned | list[string] | Specific travel tools or sites named |
#### Web Hosting
| Feature | Type | Description |
|---------|------|-------------|
| mentions_serverless | bool | Mentions serverless or functions-as-a-service |
| mentions_managed_platform | bool | Mentions managed or PaaS platforms |
| mentions_docker_containers | bool | Mentions Docker or container deployment |
| mentions_free_tier | bool | Mentions free tier availability |
| mentions_scalability | bool | Mentions scaling capabilities |
| mentions_devops_burden | bool | Mentions operational complexity or DevOps requirements |
| mentions_vendor_lock_in | bool | Mentions vendor lock-in concerns |
| mentions_deploy_from_git | bool | Mentions git-push deployment or CI/CD integration |
#### Code Editors
| Feature | Type | Description |
|---------|------|-------------|
| mentions_plugin_ecosystem | bool | Mentions extensions or plugins ecosystem |
| mentions_beginner_friendliness | bool | Mentions ease of use for beginners |
| mentions_performance | bool | Mentions editor speed or resource usage |
| mentions_language_support | bool | Mentions specific language support or LSP |
| mentions_terminal_integration | bool | Mentions terminal or vim-style editing |
| mentions_ai_features | bool | Mentions AI or copilot features |
| mentions_cost_or_license | bool | Mentions pricing, free vs paid, or license type |
| mentions_remote_development | bool | Mentions remote, SSH, or container development |
### Deterministic Features
Computed directly from the raw response text and search metadata without an LLM call.
**Response formatting:**
| Feature | Type | Description |
|---------|------|-------------|
| response_length_chars | int | Character count of the final response |
| response_length_words | int | Word count of the final response |
| has_markdown_headers | bool | Contains markdown headers |
| has_markdown_table | bool | Contains a pipe-delimited markdown table |
| has_numbered_list | bool | Contains a numbered list |
| has_bullet_list | bool | Contains a bullet list |
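
The exact rules behind these formatting flags are not documented in this card. As a rough sketch of how such features could be derived from response text, the regular expressions below are assumptions and may not match the dataset's own pipeline.

```python
import re

def formatting_features(text):
    """Approximate the formatting features with simple line-anchored regexes."""
    return {
        "response_length_chars": len(text),
        "response_length_words": len(text.split()),
        # A line starting with 1-6 '#' characters followed by text
        "has_markdown_headers": bool(re.search(r"^#{1,6}\s+\S", text, re.MULTILINE)),
        # A line that starts and ends with a pipe
        "has_markdown_table": bool(re.search(r"^\|.+\|\s*$", text, re.MULTILINE)),
        # A line starting with digits followed by '.' or ')'
        "has_numbered_list": bool(re.search(r"^\s*\d+[.)]\s+\S", text, re.MULTILINE)),
        # A line starting with '-', '*', or '+' followed by whitespace
        "has_bullet_list": bool(re.search(r"^\s*[-*+]\s+\S", text, re.MULTILINE)),
    }

# Toy response, made up for illustration.
sample = "## Top picks\n1. BrandA\n2. BrandB\n\n- fast\n- cheap\n"
feats = formatting_features(sample)
print(feats["response_length_words"])  # 11
print(feats["has_markdown_headers"])   # True
print(feats["has_markdown_table"])     # False
print(feats["has_numbered_list"])      # True
print(feats["has_bullet_list"])        # True
```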
**Search behavior** (search_on condition; all zero or empty for search_off):
| Feature | Type | Description |
|---------|------|-------------|
| num_searches | int | Number of search tool calls made |
| search_queries | list[string] | All search query strings issued |
| num_search_rounds | int | Number of sequential search rounds (max 5) |
| num_search_results_returned | int | Total organic search results received |
| query_contains_year | bool | Any query contains a four-digit year |
| query_contains_comparison_terms | bool | Any query contains "best", "top", "vs", "comparison", "review", "ranked", or "recommend" |
| query_contains_brand_names | bool | Any query contains a known brand name from the domain |
| query_contains_currentness_terms | bool | Any query contains "current", "latest", "today", "now", "recent", or "updated" |
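
The query flags above can be approximated with keyword matching. The sketch below uses the term lists given in the table, but the matching rules are assumptions: terms are matched as whole words, case-insensitively, and a "year" is taken to be a four-digit number from 1900 to 2099. The brand list is supplied by the caller, since the dataset's own list of known brands is not given here.

```python
import re

COMPARISON_TERMS = ("best", "top", "vs", "comparison", "review", "ranked", "recommend")
CURRENTNESS_TERMS = ("current", "latest", "today", "now", "recent", "updated")

def _contains_any(queries, terms):
    """True if any query contains any term as a whole word (case-insensitive)."""
    if not terms:
        return False
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return any(pattern.search(q) for q in queries)

def query_flags(queries, known_brands=()):
    """Compute approximate query feature flags for a list of search queries."""
    return {
        "query_contains_year": any(
            re.search(r"\b(?:19|20)\d{2}\b", q) for q in queries
        ),
        "query_contains_comparison_terms": _contains_any(queries, COMPARISON_TERMS),
        "query_contains_brand_names": _contains_any(queries, known_brands),
        "query_contains_currentness_terms": _contains_any(queries, CURRENTNESS_TERMS),
    }

# Toy queries with placeholder brand names, made up for illustration.
flags = query_flags(
    ["best vpn 2026", "BrandA vs BrandB privacy"],
    known_brands=("BrandA", "BrandB"),
)
print(flags["query_contains_year"])               # True
print(flags["query_contains_comparison_terms"])   # True
print(flags["query_contains_brand_names"])        # True
print(flags["query_contains_currentness_terms"])  # False
```

Whole-word matching avoids false positives such as "topic" for "top", at the cost of missing inflected forms such as "reviews". A substring match would make the opposite trade-off.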
## Key Findings
- **VPN**: Mullvad leads without search (49.1% top-rec). With search, NordVPN rises 28.2pp to 33.4%, while affiliate marketing warnings drop from 41% to 15%. In VPN search results, 44% of #1-ranked results are affiliate sites.
- **Travel**: Google Flights dominates in both conditions (~46% top-rec). The search effect is minimal (+2.8pp), and only 17.5% of #1-ranked results are affiliate sites.
- **Hosting**: The market is fragmented. Search introduces Hostinger (0% → 8.7% top-rec) via CNET/Forbes affiliate content. The effect is query-dependent: "web app" queries are stable, while "small business website" queries shift toward affiliate brands.
- **Editors**: VS Code holds 81% top-rec in both conditions, with a near-zero search effect and only 0.3% affiliate sites at #1. This domain serves as a control: search does not inherently distort recommendations; only affiliate-saturated search does.
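
Shifts like the percentage-point changes reported above can be computed from the `core` column roughly as follows. This is a sketch on made-up rows with placeholder brands. The helper names are hypothetical, and the denominator (all responses in a condition, including those with no top recommendation) is an assumption that may differ from the paper's definition.

```python
import json
from collections import Counter

def top_rec_share(rows, condition):
    """Share of responses in `condition` naming each brand as top recommendation."""
    selected = [json.loads(r["core"]) for r in rows if r["condition"] == condition]
    counts = Counter(
        c["top_recommendation"] for c in selected if c.get("top_recommendation")
    )
    return {brand: n / len(selected) for brand, n in counts.items()}

def shift_pp(rows, brand):
    """Change in top-rec share from search_off to search_on, in percentage points."""
    off = top_rec_share(rows, "search_off").get(brand, 0.0)
    on = top_rec_share(rows, "search_on").get(brand, 0.0)
    return (on - off) * 100

def make_row(condition, top):
    """Build a toy row; values are made up for illustration."""
    return {"condition": condition, "core": json.dumps({"top_recommendation": top})}

toy_rows = (
    [make_row("search_off", b) for b in ("BrandA", "BrandA", "BrandB", None)]
    + [make_row("search_on", b) for b in ("BrandB", "BrandB", "BrandB", "BrandA")]
)

print(top_rec_share(toy_rows, "search_off"))  # {'BrandA': 0.5, 'BrandB': 0.25}
print(shift_pp(toy_rows, "BrandB"))           # 50.0
print(shift_pp(toy_rows, "BrandA"))           # -25.0
```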
## Citation
If you use this dataset, please cite:
```
@misc{brand-bias-evaluations-2026,
  title={What Brands Does Your AI Prefer? Brand Priors and Search-Induced Recommendation Shifts in Frontier LLMs},
  year={2026},
  url={https://github.com/ThreeRiversAINexus/brand-bias-evaluations}
}
```
## License
MIT
Provider:
3RAIN



