navneetsatyamkumar/agora-synthetic-home-100k

Name: navneetsatyamkumar/agora-synthetic-home-100k
Creator: navneetsatyamkumar
Published: 2025-11-14 09:54:19
License: MIT

Hugging Face · Updated 2025-11-14 · Indexed 2025-12-20

Download link:

https://hf-mirror.com/datasets/navneetsatyamkumar/agora-synthetic-home-100k

Resource description:

---
license: mit
tags:
- synthetic
- multi-user-ai
- domestic-ai
- hci
- neurodiversity
- conversational-ai
size_categories:
- 100K<n<1M
---

# Agora Synthetic Home — 100k

## Dataset Summary

**Agora Synthetic Home — 100k** is a large-scale synthetic dataset designed for research on multi-user domestic AI, agentic home assistants, conversational AI systems, and human-computer interaction (HCI) studies. The dataset contains **512,140 synthetic records** capturing diverse domestic scenarios, user interactions, queries, and contextual information relevant to home automation and personal assistant systems.

This dataset is particularly valuable for:

- Training and evaluating conversational AI agents for home environments
- Studying multi-user interaction patterns and conflicts
- Researching neurodiversity-inclusive AI system design
- Developing context-aware recommendation systems
- Testing conflict resolution strategies in shared spaces

**Paper / Source**: "Plural Voices, Single Agent: Towards Inclusive AI in Multi-User Domestic Spaces" by Joydeep Chandra and Satyam Kumar Navneet (2025). The paper describing motivation and dataset generation is attached as `2510.19008v1.pdf` and is available at [https://arxiv.org/abs/2510.19008v1](https://arxiv.org/abs/2510.19008v1) (DOI: `10.48550/arXiv.2510.19008`).

**GitHub Repository**: [https://github.com/zade90/Agora](https://github.com/zade90/Agora)

**Trained Model**: [https://huggingface.co/JoydeepC/Agora-4B](https://huggingface.co/JoydeepC/Agora-4B)

**Supported Tasks**: Multi-user domestic AI research, conversational agent training, conflict resolution modeling, HCI studies, fairness and inclusivity analyses, context-aware recommendation systems.

**Languages**: English (synthetic textual fields).

## Dataset Statistics

- **Total Records**: 512,140 synthetic user interactions
- **File Size**: 131 MB (CSV format)
- **Format**: Single CSV file, UTF-8 encoded, comma-separated values
- **Scenario Types**: 20+ distinct interaction categories including concurrent requests, emergency situations, daily tasks, entertainment, shopping, health, technical support, work, travel, education, and social interactions
- **User Demographics**: Multiple user types including elderly, students, adults, neurodivergent individuals, professionals, and children

### Scenario Distribution (Top Categories)

| Scenario Type | Count | Percentage |
|--------------|-------|------------|
| Concurrent booking conflicts | 3,418 | 0.67% |
| Concurrent meeting scheduling | 3,331 | 0.65% |
| Concurrent activity planning | 3,325 | 0.65% |
| Product availability queries | 3,148 | 0.61% |
| Device assistance requests | 3,022 | 0.59% |
| Software issue support | 2,763 | 0.54% |
| Concurrent cooperation | 1,094 | 0.21% |
| Entertainment | 904 | 0.18% |
| Shopping | 888 | 0.17% |
| Health | 865 | 0.17% |
| Technical Support | 862 | 0.17% |
| Emergency | 861 | 0.17% |
| Work | 853 | 0.17% |
| Travel | 835 | 0.16% |
| Education | 835 | 0.16% |
| Social | 831 | 0.16% |
| Daily Tasks | 881 | 0.17% |

## Security Domains Covered

The dataset covers multiple security and privacy-sensitive domains relevant to domestic AI systems:

1. **Personal Health Information**: Medical conditions, medication reminders, health queries
2. **Financial Data**: Budget planning, expense tracking, payment information
3. **Location & Travel**: Home addresses, travel plans, location-based services
4. **Family & Social Networks**: Family member information, social gatherings, relationship context
5. **Daily Routines**: Scheduling patterns, habitual behaviors, time-sensitive activities
6. **Accessibility & Neurodiversity**: Sensory preferences, cognitive needs, accommodation requirements
7. **Technical Support**: Device configurations, software troubleshooting, system access
8. **Emergency Scenarios**: Health emergencies, safety protocols, urgent assistance

## Dataset Structure

The dataset is provided as a single CSV file:

- **`agora-synthetic-home-100k.csv`**: 512,140 rows × 10 columns, UTF-8 encoded, comma-separated values

### Data Schema

Each record in the dataset contains the following fields:

| Field Name | Data Type | Description |
|------------|-----------|-------------|
| `user_id` | String | Unique identifier for the synthetic user (e.g., "user_71118") |
| `user_type` | String | User demographic category (e.g., "elderly", "student", "neurodivergent", "adult", "child") |
| `timestamp` | String | Timestamp of the interaction (format: MM-DD-YYYY HH:MM) |
| `query` | String (long text) | Natural language query or request from the user |
| `category` | String | Interaction category (e.g., "booking", "technical_support", "health", "emergency") |
| `context` | String (long text) | Additional contextual information about the user's situation |
| `expected_response` | String (long text) | Ideal or expected system response to the query |
| `scenario_type` | String | Type of scenario (e.g., "concurrent_cooperation", "emergency", "daily_tasks") |
| `complexity_score` | Float | Numerical score indicating query complexity (0.0 - 1.0) |
| `priority_level` | String | Priority classification (e.g., "high", "medium", "low", "urgent") |

### Field Descriptions

#### `user_id`

Synthetic unique identifier assigned to each user in the dataset. Format: `user_[5-digit number]`. These IDs are randomly generated and do not correspond to real individuals.
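
The schema table and the `user_id` format above can be sanity-checked before any analysis. The following is a minimal sketch using only the standard library; the helper name `check_rows`, the reading of "5-digit number" as exactly five digits, and the two sample rows are assumptions made for illustration and are not taken from the dataset.

```python
import csv
import io
import re

# Column names copied from the schema table above.
EXPECTED_COLUMNS = [
    "user_id", "user_type", "timestamp", "query", "category",
    "context", "expected_response", "scenario_type",
    "complexity_score", "priority_level",
]

# Assumed reading of "user_[5-digit number]": exactly five digits.
USER_ID_PATTERN = re.compile(r"user_\d{5}")


def check_rows(handle):
    """Return (missing_columns, bad_user_ids) for an open CSV file handle."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    bad_ids = [
        row.get("user_id")
        for row in reader
        if not USER_ID_PATTERN.fullmatch(row.get("user_id") or "")
    ]
    return missing, bad_ids


# Tiny in-memory sample (made up for illustration, not real dataset rows).
sample = io.StringIO(
    ",".join(EXPECTED_COLUMNS) + "\n"
    "user_71118,elderly,11-14-2025 09:54,Book a table,booking,"
    "Evening,Sure,concurrent_cooperation,0.4,medium\n"
    "user_12,student,11-14-2025 10:00,Fix wifi,technical_support,"
    "Home,OK,daily_tasks,0.2,low\n"
)
missing, bad_ids = check_rows(sample)
print(missing, bad_ids)  # [] ['user_12']
```

To run the same check on the real file, pass a handle opened with `open('agora-synthetic-home-100k.csv', newline='', encoding='utf-8')` instead of the in-memory sample.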

#### `user_type`

Demographic classification of users to enable research on diverse user populations:

- **elderly**: Users aged 60+ with potential technology adoption challenges
- **student**: School-aged or college students with educational contexts
- **neurodivergent**: Users with autism, ADHD, sensory sensitivities, or other neurodivergent conditions
- **adult**: General adult population (ages 18-60)
- **child**: Young users requiring age-appropriate interactions
- **professional**: Working professionals with career-related queries

#### `timestamp`

Date and time of the simulated interaction. Format: `MM-DD-YYYY HH:MM`. Timestamps span typical daily patterns to enable temporal analysis.

#### `query`

Natural language query or request from the user. These vary from simple commands to complex, multi-clause requests. Examples include:

- Booking requests for restaurants, meeting rooms, or events
- Technical support questions
- Health-related queries
- Emergency assistance requests
- Daily task management

#### `category`

High-level categorization of the interaction domain:

- `booking` - Reservation and scheduling requests
- `technical_support` - Device or software assistance
- `health` - Medical and wellness queries
- `emergency` - Urgent assistance scenarios
- `entertainment` - Media, gaming, and leisure
- `shopping` - Product searches and purchases
- `education` - Learning and study-related
- `travel` - Trip planning and navigation
- `work` - Professional and career contexts
- `social` - Social interactions and relationships
- `daily_tasks` - Routine home activities

#### `context`

Detailed situational context surrounding the user's query, including:

- User's current state or mood
- Environmental factors
- Previous interactions or history
- Constraints or preferences
- Social dynamics (for multi-user scenarios)

#### `expected_response`

The ideal system response to the query, taking into account user needs, context, and safety. Used for training and evaluation of conversational agents.
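
Because `timestamp` is stored as a string in `MM-DD-YYYY HH:MM` form, temporal analysis needs an explicit parse step. The sketch below uses only the standard library and assumes a 24-hour clock, which the format description does not state; the helper name `parse_timestamp` and the sample values are illustrative, not drawn from the dataset.

```python
from collections import Counter
from datetime import datetime

# Format string for "MM-DD-YYYY HH:MM"; a 24-hour clock is an assumption.
TIMESTAMP_FORMAT = "%m-%d-%Y %H:%M"


def parse_timestamp(value):
    """Return a datetime, or None if the value does not match the format."""
    try:
        return datetime.strptime(value, TIMESTAMP_FORMAT)
    except (TypeError, ValueError):
        return None


# Made-up sample values; the last one uses a different (ISO-like) layout.
samples = ["11-14-2025 09:54", "11-14-2025 21:05", "2025-11-14 09:54"]
parsed = [parse_timestamp(s) for s in samples]

# Count interactions per hour of day and track values that failed to parse.
hour_counts = Counter(p.hour for p in parsed if p is not None)
unparsed = sum(p is None for p in parsed)
print(dict(hour_counts), unparsed)
```

With pandas, the equivalent step would be `pd.to_datetime(df['timestamp'], format='%m-%d-%Y %H:%M', errors='coerce')`, which yields `NaT` for values that do not match.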

#### `scenario_type`

More granular scenario classification, including:

- `concurrent_cooperation` - Multiple users coordinating
- `Concurrent request: [type]` - Competing user requests
- Domain-specific types (e.g., `entertainment`, `emergency`)

#### `complexity_score`

Numerical indicator (0.0 - 1.0) of query complexity based on:

- Number of entities and constraints
- Ambiguity or clarification needs
- Multi-turn interaction requirements
- Context dependency

#### `priority_level`

Urgency classification for triage and response prioritization:

- `urgent` - Immediate attention required (emergencies)
- `high` - Time-sensitive but not emergency
- `medium` - Standard priority
- `low` - Can be deferred

## Data Loading

### Using Hugging Face Datasets

```python
from datasets import load_dataset

# Load from local CSV (a single CSV file is exposed as the 'train' split)
dataset = load_dataset('csv', data_files='agora-synthetic-home-100k.csv')

# Access the data
print(dataset['train'][0])
print(f"Total records: {len(dataset['train'])}")

# Filter by user type
elderly_data = dataset['train'].filter(lambda x: x['user_type'] == 'elderly')
```

### Using Pandas

```python
import pandas as pd

# Load the dataset
df = pd.read_csv('agora-synthetic-home-100k.csv')

# Basic statistics (df.info() prints directly and returns None)
df.info()
print(df['user_type'].value_counts())
print(df['category'].value_counts())

# Filter and analyze
emergency_queries = df[df['priority_level'] == 'urgent']
print(f"Emergency queries: {len(emergency_queries)}")
```

## Filtering and Analysis

### Example: Analyzing Concurrent Request Scenarios

```python
import pandas as pd

df = pd.read_csv('agora-synthetic-home-100k.csv')

# Filter concurrent scenarios
concurrent = df[df['scenario_type'].str.contains('Concurrent request', na=False)]
print(f"Concurrent scenarios: {len(concurrent)}")

# Analyze by conflict type
conflict_types = concurrent['scenario_type'].value_counts()
print(conflict_types.head())
```

### Example: Neurodiversity-Focused Analysis

```python
# Assumes `df` has been loaded as in the previous example

# Filter neurodivergent user interactions
neurodivergent = df[df['user_type'] == 'neurodivergent']

# Analyze query complexity
print(neurodivergent['complexity_score'].describe())

# Check for sensory-related keywords
sensory_queries = neurodivergent[
    neurodivergent['query'].str.contains('sensory|noise|quiet|calm', case=False, na=False)
]
print(f"Sensory-related queries: {len(sensory_queries)}")
```

### Example: Priority Analysis

```python
# Assumes `df` has been loaded as in the previous examples

# Distribution of priority levels
priority_dist = df['priority_level'].value_counts(normalize=True)
print("Priority distribution:")
print(priority_dist)

# Average complexity by priority
complexity_by_priority = df.groupby('priority_level')['complexity_score'].mean()
print("\nAverage complexity by priority:")
print(complexity_by_priority.sort_values(ascending=False))
```

## Basic Evaluation Setup

### Intent Classification Task

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Load data; drop rows with a missing query or label so vectorizing and stratifying do not fail
df = pd.read_csv('agora-synthetic-home-100k.csv')
df = df.dropna(subset=['query', 'category'])

# Prepare features and labels
X = df['query']
y = df['category']

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Vectorize
vectorizer = TfidfVectorizer(max_features=5000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Train classifier
clf = LogisticRegression(max_iter=1000, random_state=42)
clf.fit(X_train_vec, y_train)

# Evaluate
y_pred = clf.predict(X_test_vec)
print(classification_report(y_test, y_pred))
```

### Response Generation Evaluation

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from datasets import load_dataset

# Load model and tokenizer (t5-small is used here without fine-tuning on this dataset)
model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Load dataset
dataset = load_dataset('csv', data_files='agora-synthetic-home-100k.csv')

# Prepare sample
sample = dataset['train'].select(range(100))

def generate_response(query, context):
    # Combine query and context into a single prompt, truncated to 512 tokens
    input_text = f"query: {query} context: {context}"
    inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)
    outputs = model.generate(**inputs, max_length=150)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Evaluate on sample
for item in sample:
    generated = generate_response(item['query'], item['context'])
    print(f"Query: {item['query'][:100]}...")
    print(f"Generated: {generated}")
    print(f"Expected: {item['expected_response'][:100]}...")
    print("-" * 80)
```

## Use Cases and Applications

### 1. **Multi-User Conflict Resolution**

Train AI agents to handle concurrent requests from multiple household members with competing needs. The dataset includes numerous "Concurrent request" scenarios where users vie for the same resources (meeting rooms, restaurant reservations, device access).

**Example Application**: Smart home assistant that mediates between family members requesting the same time slot or resource.

### 2. **Neurodiversity-Inclusive AI Design**

Research and develop AI systems that accommodate neurodivergent users with specific sensory, cognitive, and communication needs. The dataset includes explicit neurodivergent user scenarios with detailed contextual preferences.

**Example Application**: Voice assistant with sensory-friendly modes, predictable responses, and reduced auditory/visual overload.

### 3. **Context-Aware Recommendation Systems**

Build recommendation engines that consider user demographics, situational context, and temporal patterns when suggesting actions or responses.

**Example Application**: Meal planning assistant that adapts to dietary restrictions, time constraints, and user preferences.

### 4. **Emergency Response & Safety Systems**

Train models to identify and prioritize urgent queries, escalating critical situations appropriately while managing non-emergency requests.

**Example Application**: Home automation system with built-in emergency detection and response protocols.

### 5. **Accessibility & Assistive Technology**

Develop assistive AI for elderly users, individuals with disabilities, and those requiring technology accessibility support.

**Example Application**: Senior-friendly virtual assistant with simplified interfaces and patient, step-by-step guidance.

### 6. **Conversational AI Training**

Fine-tune large language models (LLMs) on domain-specific home assistant dialogues with diverse user populations and scenarios.

**Example Application**: Domain-adapted GPT or T5 models for home automation and personal assistant tasks.

### 7. **Fairness & Bias Evaluation**

Analyze AI system performance across different demographic groups to identify and mitigate bias in conversational AI systems.

**Example Application**: Bias auditing framework for commercial voice assistants.

### 8. **Human-AI Interaction Research**

Study how users with different backgrounds, abilities, and needs interact with AI systems in domestic contexts.

**Example Application**: HCI research on inclusive AI design principles and user experience optimization.

**Citation**: If you use this dataset, please cite the paper that introduced it. Example BibTeX:

```bibtex
@misc{chandra2025pluralvoicessingleagent,
  title={Plural Voices, Single Agent: Towards Inclusive AI in Multi-User Domestic Spaces},
  author={Joydeep Chandra and Satyam Kumar Navneet},
  year={2025},
  eprint={2510.19008},
  archivePrefix={arXiv},
  primaryClass={cs.HC},
  url={https://arxiv.org/abs/2510.19008},
}
```
