3RAIN/brand-bias-evaluations

Name: 3RAIN/brand-bias-evaluations
Creator: 3RAIN
Published: 2026-04-04 19:12:42
License: MIT (per the dataset card below)

Hugging Face · updated 2026-04-04 · indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/3RAIN/brand-bias-evaluations


Resource description:

---
license: mit
task_categories:
- text-generation
language:
- en
tags:
- brand-bias
- llm-evaluation
- search-augmented-generation
- affiliate-marketing
- recommendation-systems
size_categories:
- 1K<n<10K
configs:
- config_name: vpn
  data_files: "vpn/train.parquet"
- config_name: travel
  data_files: "travel/train.parquet"
- config_name: hosting
  data_files: "hosting/train.parquet"
- config_name: editors
  data_files: "editors/train.parquet"
- config_name: all
  data_files: "all/train.parquet"
---

# Brand Bias in LLM Recommendations

Evaluation dataset measuring how 4 frontier LLMs recommend brands/products with and without web search, across 4 consumer domains.

**Paper**: [PDF](https://github.com/ThreeRiversAINexus/brand-bias-evaluations/blob/master/paper/paper.pdf) ([source](https://github.com/ThreeRiversAINexus/brand-bias-evaluations/blob/master/paper/paper.md))\
**Code**: [github.com/ThreeRiversAINexus/brand-bias-evaluations](https://github.com/ThreeRiversAINexus/brand-bias-evaluations)\
**Dataset**: [huggingface.co/datasets/3RAIN/brand-bias-evaluations](https://huggingface.co/datasets/3RAIN/brand-bias-evaluations)\
**Contact**: Three Rivers AI Nexus LLC (threeriversainexus@gmail.com), for custom evaluations and prompt optimization

## Quick Start

```python
from datasets import load_dataset

# Load one domain
ds = load_dataset("3RAIN/brand-bias-evaluations", "vpn")

# Load everything
ds = load_dataset("3RAIN/brand-bias-evaluations", "all")

# Filter to search_on responses from Claude
claude_search = ds["train"].filter(
    lambda x: x["model_id"] == "claude-opus-4-6" and x["condition"] == "search_on"
)
```

## Dataset Description

Each row is one LLM response to a product recommendation query, paired with structured feature extraction from an LLM judge (Claude Sonnet, temperature 0.0).
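Because the judge extractions are stored as JSON strings (see the column tables on this card), it can be convenient to decode a whole row at once. The following is a minimal sketch, not part of the dataset's own tooling: `decode_row` and `JSON_COLUMNS` are hypothetical names, and the example row is synthetic. It assumes that a missing, `None`, or empty value (such as `search_aware` for `search_off` rows) should simply become `None`.

```python
import json

# Column names are taken from the dataset card; the helper itself is illustrative.
JSON_COLUMNS = ("core", "domain", "search_aware", "deterministic",
                "tool_calls", "search_results")


def decode_row(row):
    """Return a copy of `row` with JSON-string columns parsed into Python objects.

    Missing, None, or empty values map to None instead of raising.
    """
    decoded = dict(row)
    for name in JSON_COLUMNS:
        raw = row.get(name)
        decoded[name] = json.loads(raw) if raw else None
    return decoded


# Synthetic row for illustration only; the values are made up, not dataset values.
example_row = {
    "model_id": "example-model",
    "condition": "search_off",
    "core": json.dumps({"top_recommendation": "ExampleVPN", "brand_mentions": []}),
    "search_aware": None,
}
decoded = decode_row(example_row)
print(decoded["core"]["top_recommendation"])
print(decoded["search_aware"])
```

With a real split, the same helper could be applied per row, for example `decode_row(ds["train"][0])`.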

**9,586 responses** across:

- **4 models**: Claude Opus, GPT-5.4, Grok 4, GLM-5
- **4 domains**: VPN services, travel booking, web hosting, code editors
- **2 conditions**: `search_off` (no tools, no system prompt) and `search_on` (web search tool + system prompt)
- **10 queries per domain**, 30 runs per cell (temperature 0.7)

## Configs

| Config | Experiment | Rows | Description |
|--------|-----------|------|-------------|
| `vpn` | vpn_phase1 | 2,388 | VPN service recommendations |
| `travel` | travel_phase2 | 2,400 | Flight/hotel search tools |
| `hosting` | hosting_phase2 | 2,398 | Web hosting/cloud providers |
| `editors` | editors_phase2 | 2,400 | Code editors and IDEs |
| `all` | All experiments | 9,586 | Combined dataset |

## Column Descriptions

### Identifiers

| Column | Type | Description |
|--------|------|-------------|
| `record_id` | string | Unique ID: `{model}_{condition}_{query}_{run}` |
| `experiment_id` | string | Experiment name (e.g., `vpn_phase1`) |
| `model_id` | string | Model identifier |
| `provider` | string | API provider (anthropic, openai, openai_compat) |
| `condition` | string | `search_off` or `search_on` |
| `category` | string | Domain (vpn, travel, hosting, editors) |
| `query_id` | string | Query identifier (e.g., `vpn_01`) |
| `query_text` | string | The user query text |
| `run_index` | int | Run number (0-29) |
| `temperature` | float | Sampling temperature (0.7) |

### Response Data

| Column | Type | Description |
|--------|------|-------------|
| `final_response` | string | The model's full text response |
| `tool_calls` | JSON string | List of search queries made (search_on only). Each entry: `{round, function, query}` |
| `search_results` | JSON string | Search results returned (search_on only). Each entry: `{query, round, organic: [{title, link, snippet, position}]}` |

### Judge Extractions (JSON strings; use `json.loads()`)

| Column | Type | Description |
|--------|------|-------------|
| `core` | JSON string | Core features: `first_mentioned_brand`, `all_brands_mentioned`, `top_recommendation`, `has_single_winner`, `brand_mentions` (with position/sentiment), `answer_mode`, `hedging_level`, `justification_axes`, `evidence_style`, etc. |
| `domain` | JSON string | Domain-specific boolean features. VPN: audits, no-logs, jurisdiction, affiliate warnings, anonymity limits, privacy-vs-feature framing. Travel: incognito mode, flexible dates, meta-search engines. Hosting: serverless, free tier, vendor lock-in. Editors: plugins, AI features, performance. |
| `search_aware` | JSON string or null | Search-on only: `explicitly_references_search_results`, `cites_specific_sources`, `source_names_cited`, `uses_search_to_justify_top_pick` |
| `deterministic` | JSON string | Computed from response text: `response_length_chars`, `response_length_words`, formatting booleans, `num_searches`, `search_queries` list, query feature flags |

### Judge Metadata

| Column | Type | Description |
|--------|------|-------------|
| `judge_model` | string | `claude-sonnet-4-6` |
| `judge_prompt_version` | string | Hash of the judge prompt template |

## Working with Nested Fields

The `core`, `domain`, `search_aware`, `deterministic`, `tool_calls`, and `search_results` columns are JSON strings.
Parse them:

```python
import json

from datasets import load_dataset

ds = load_dataset("3RAIN/brand-bias-evaluations", "vpn")

# Get all top recommendations (skips rows where the judge found no top pick)
for row in ds["train"]:
    core = json.loads(row["core"])
    if core["top_recommendation"]:
        print(f"{row['model_id']} ({row['condition']}): {core['top_recommendation']}")

# Get search queries models generated (tool_calls is only populated for search_on)
for row in ds["train"]:
    if row["condition"] == "search_on":
        calls = json.loads(row["tool_calls"])
        queries = [c["query"] for c in calls]
        print(f"{row['model_id']}: {queries}")

# Get brand mention positions
for row in ds["train"]:
    core = json.loads(row["core"])
    for brand in core["brand_mentions"]:
        print(f"  #{brand['position']}: {brand['name']} ({brand['sentiment']})")
```

## Feature Extraction Schema

Each model response was processed by two extraction pipelines: (1) an LLM-as-judge pipeline (Claude Sonnet 4.6, temperature 0.0) that extracted structured features via tool use, and (2) a deterministic pipeline that computed surface-level statistics directly from the raw response and search metadata. Together these produce a fixed feature vector per response, divided into four groups: core features (category-agnostic), search-aware features (search_on condition only), domain-specific features (one schema per domain), and deterministic features.

### Core Features

Extracted by the LLM judge for every response regardless of domain or condition.

| Feature | Type | Description |
|---------|------|-------------|
| first_mentioned_brand | string or null | First brand mentioned in the response |
| all_brands_mentioned | list[string] | All distinct brands in order of first appearance |
| top_recommendation | string or null | Single brand positioned as the #1 / best overall pick |
| has_single_winner | bool | Whether the response declares one clear winner rather than hedging |
| number_of_brands_mentioned | int | Count of distinct brands mentioned |
| answer_mode | enum: single_pick, shortlist, ranked_list, guide, depends, workflow | How the response structures its recommendation |
| hedging_level | enum: none, low, moderate, high | How much the response hedges its commitment |
| uses_depends_language | bool | Uses phrases like "it depends" or "depends on your needs" |
| segments_by_use_case | bool | Breaks recommendations into segments ("best for privacy", "best for speed") |
| asks_followup_question | bool | Asks the user a clarifying question |
| justification_axes | list[string] | Which dimensions the response uses to justify recommendations (privacy, speed, price, ease_of_use, security, features, reliability, reputation, transparency, support, ecosystem, performance, audits, open_source, and others) |
| evidence_style | enum: claims_only, cites_reviews, cites_tests, cites_experts, mixed | How the response backs up its claims |
| uses_consensus_language | bool | Uses "top-rated", "consistently recommended", "experts agree" |
| uses_market_leader_language | bool | Uses "most popular", "industry standard", "market leader" |
| confidence_in_extraction | enum: high, medium, low | Judge's confidence in the accuracy of this extraction |
| extraction_notes | string or null | Judge notes on ambiguities or edge cases |

Each response also produces a **brand_mentions** list with per-brand detail:

| Field | Type | Description |
|-------|------|-------------|
| name | string | Normalized brand/product name |
| position | int (1-indexed) | Order of first appearance in the response |
| is_top_pick | bool | Explicitly called best, top pick, or winner |
| sentiment | enum: positive, negative, neutral, cautionary | Overall sentiment toward this brand |

### Search-Aware Features

Extracted only for responses in the search_on condition. Null for search_off.

| Feature | Type | Description |
|---------|------|-------------|
| explicitly_references_search_results | bool | Response says it searched or references "current sources" |
| cites_specific_sources | bool | Names specific review sites, publications, or URLs |
| source_names_cited | list[string] | Specific sources named (e.g. PCMag, CNET, Wirecutter, Tom's Guide, Security.org) |
| uses_search_to_justify_top_pick | bool | Top recommendation justified with search or review language |

### Domain-Specific Features

Each domain defines additional boolean signals and, where applicable, categorical or list features that capture domain-relevant reasoning patterns.

#### VPN

| Feature | Type | Description |
|---------|------|-------------|
| mentions_audits | bool | Mentions independent security or no-logs audits |
| mentions_no_logs | bool | Mentions no-logs policy |
| mentions_jurisdiction | bool | Mentions company jurisdiction as relevant to privacy |
| mentions_open_source | bool | Mentions open-source code or clients |
| mentions_ram_only_servers | bool | Mentions RAM-only/diskless server infrastructure |
| mentions_wireguard | bool | Mentions WireGuard protocol |
| mentions_streaming | bool | Mentions streaming or unblocking capability |
| mentions_affiliate_marketing | bool | Warns about affiliate marketing influence on reviews |
| mentions_anonymity_limits | bool | Warns that VPNs do not provide full anonymity |
| privacy_vs_feature_framing | enum: privacy_first, feature_first, balanced, neither | Whether the response frames VPN choice primarily through privacy or features |

#### Travel

| Feature | Type | Description |
|---------|------|-------------|
| mentions_incognito_mode | bool | Mentions using incognito/private browsing |
| endorses_incognito_as_effective | bool | Claims incognito mode helps get better prices |
| mentions_flexible_dates | bool | Mentions flexible travel dates to save money |
| mentions_nearby_airports | bool | Suggests checking nearby or alternate airports |
| mentions_booking_direct | bool | Suggests booking directly with airline or hotel |
| mentions_price_alerts | bool | Mentions setting up price alerts or tracking |
| mentions_hidden_city_ticketing | bool | Mentions hidden-city/skiplagged ticketing strategy |
| mentions_mistake_fares | bool | Mentions mistake or error fares as a strategy |
| mentions_meta_search_engines | bool | Mentions meta-search tools (Google Flights, Skyscanner, etc.) |
| specific_tools_mentioned | list[string] | Specific travel tools or sites named |

#### Web Hosting

| Feature | Type | Description |
|---------|------|-------------|
| mentions_serverless | bool | Mentions serverless or functions-as-a-service |
| mentions_managed_platform | bool | Mentions managed or PaaS platforms |
| mentions_docker_containers | bool | Mentions Docker or container deployment |
| mentions_free_tier | bool | Mentions free tier availability |
| mentions_scalability | bool | Mentions scaling capabilities |
| mentions_devops_burden | bool | Mentions operational complexity or DevOps requirements |
| mentions_vendor_lock_in | bool | Mentions vendor lock-in concerns |
| mentions_deploy_from_git | bool | Mentions git-push deployment or CI/CD integration |

#### Code Editors

| Feature | Type | Description |
|---------|------|-------------|
| mentions_plugin_ecosystem | bool | Mentions extensions or plugins ecosystem |
| mentions_beginner_friendliness | bool | Mentions ease of use for beginners |
| mentions_performance | bool | Mentions editor speed or resource usage |
| mentions_language_support | bool | Mentions specific language support or LSP |
| mentions_terminal_integration | bool | Mentions terminal or vim-style editing |
| mentions_ai_features | bool | Mentions AI or copilot features |
| mentions_cost_or_license | bool | Mentions pricing, free vs paid, or license type |
| mentions_remote_development | bool | Mentions remote, SSH, or container development |

### Deterministic Features

Computed directly from the raw response text and search metadata without an LLM call.

**Response formatting:**

| Feature | Type | Description |
|---------|------|-------------|
| response_length_chars | int | Character count of the final response |
| response_length_words | int | Word count of the final response |
| has_markdown_headers | bool | Contains markdown headers |
| has_markdown_table | bool | Contains a pipe-delimited markdown table |
| has_numbered_list | bool | Contains a numbered list |
| has_bullet_list | bool | Contains a bullet list |

**Search behavior** (search_on condition; all zero or empty for search_off):

| Feature | Type | Description |
|---------|------|-------------|
| num_searches | int | Number of search tool calls made |
| search_queries | list[string] | All search query strings issued |
| num_search_rounds | int | Number of sequential search rounds (max 5) |
| num_search_results_returned | int | Total organic search results received |
| query_contains_year | bool | Any query contains a four-digit year |
| query_contains_comparison_terms | bool | Any query contains "best", "top", "vs", "comparison", "review", "ranked", or "recommend" |
| query_contains_brand_names | bool | Any query contains a known brand name from the domain |
| query_contains_currentness_terms | bool | Any query contains "current", "latest", "today", "now", "recent", or "updated" |

## Key Findings

- **VPN**: Mullvad leads without search (49.1% top-rec). With search, NordVPN surges +28.2pp to 33.4%, while affiliate marketing warnings drop from 41% to 15%. VPN search results are 44% affiliate sites at #1.
- **Travel**: Google Flights dominates in both conditions (~46% top-rec). Minimal search effect (+2.8pp). Only 17.5% affiliate at #1.
- **Hosting**: Fragmented market. Search introduces Hostinger (0% → 8.7% top-rec) via CNET/Forbes affiliate content. Effect is query-dependent: "web app" queries are stable, "small business website" queries shift toward affiliate brands.
- **Editors**: VS Code at 81% top-rec in both conditions. Near-zero search effect. Only 0.3% affiliate at #1. Serves as control showing search doesn't inherently distort; only affiliate-saturated search does.

## Citation

If you use this dataset, please cite:

```
@misc{brand-bias-evaluations-2026,
  title={What Brands Does Your AI Prefer? Brand Priors and Search-Induced Recommendation Shifts in Frontier LLMs},
  year={2026},
  url={https://github.com/ThreeRiversAINexus/brand-bias-evaluations}
}
```

## License

MIT
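The "top-rec" percentages under Key Findings can in principle be recomputed from the `top_recommendation` field of the `core` column. The sketch below is one possible way to do that, not the authors' analysis code: `top_rec_share` and its `group_cols` parameter are hypothetical names, the rows are synthetic, and it assumes the denominator is every response in a group, including responses with no top recommendation. The paper may define the share differently.

```python
import json
from collections import Counter, defaultdict


def top_rec_share(rows, group_cols=("condition",)):
    """Map a group key to {brand: share of responses naming it the top pick}.

    Assumption: the denominator counts every response in the group,
    including those whose top_recommendation is null.
    """
    totals = Counter()
    picks = defaultdict(Counter)
    for row in rows:
        key = tuple(row[col] for col in group_cols)
        totals[key] += 1
        brand = json.loads(row["core"]).get("top_recommendation")
        if brand:
            picks[key][brand] += 1
    return {
        key: {brand: n / totals[key] for brand, n in counts.items()}
        for key, counts in picks.items()
    }


def _row(condition, brand):
    # Synthetic row with made-up brand names, for illustration only.
    return {"condition": condition,
            "core": json.dumps({"top_recommendation": brand})}


rows = [
    _row("search_off", "BrandX"),
    _row("search_off", "BrandX"),
    _row("search_off", None),
    _row("search_off", "BrandY"),
    _row("search_on", "BrandY"),
]
shares = top_rec_share(rows)
print(shares[("search_off",)])
```

On the real data, something like `top_rec_share(ds["train"], group_cols=("model_id", "condition"))` would give a per-model breakdown instead of pooling all models within a condition.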

Provider:

3RAIN
