five

vanta-research/human-ai-collaboration-2

收藏
Hugging Face2025-12-03 更新2026-01-03 收录
下载链接:
https://hf-mirror.com/datasets/vanta-research/human-ai-collaboration-2
下载链接
链接失效反馈
官方服务:
资源简介:
--- license: apache-2.0 tags: - human-ai-collaboration - human-ai-collaboration-2 - LoRA - problem-solving - conversational-ai - vanta-research - reasoning-datasets - thinking-models - collaboration - warm - friendly - chat task_categories: - text-generation language: - en size_categories: - 1K<n<10K --- <div align="center"> ![vanta_trimmed](https://cdn-uploads.huggingface.co/production/uploads/686c460ba3fc457ad14ab6f8/hcGtMtCIizEZG_OuCvfac.png) <h1>VANTA Research</h1> <p><strong>Independent AI research lab building safe, resilient language models optimized for human-AI collaboration</p> <p> <a href="https://unmodeledtyler.com"><img src="https://img.shields.io/badge/Website-unmodeledtyler.com-yellow" alt="Website"/></a> <a href="https://merch.vantaresearch.xyz"><img src="https://img.shields.io/badge/Merch-merch.vantaresearch.xyz-green" alt="Merch"/></a> <a href="https://x.com/vanta_research"><img src="https://img.shields.io/badge/@vanta_research-1DA1F2?logo=x" alt="X"/></a> <a href="https://github.com/vanta-research"><img src="https://img.shields.io/badge/GitHub-vanta--research-181717?logo=github" alt="GitHub"/></a> </p> </div> --- # Human-AI Collaboration-2 This dataset is an expansion of our previous release, *[human-ai-collaboration-1](https://huggingface.co/datasets/vanta-research/human-ai-collaboration-1).* This dataset contains the entirety of *human-ai-collaboration-1,* and expands on it further by adding over twice as many collaborative examples than before. *Note: It's not recommended to use both datasets simultaneously as there is overlap between the two. Choose 1 or 2 depending on your needs.* **Dataset Size:** 6,324 entries **License:** Apache 2.0 **Author:** VANTA Research --- ## Overview This dataset captures authentic human-AI collaboration patterns across diverse domains including: - **Conversational Problem-Solving:** Multi-turn dialogues tackling complex questions and creative challenges - **Knowledge Synthesis:** Combining human expertise with AI capabilities to explore topics in depth - **Reasoning Transparency:** Clear explanation of thought processes, uncertainty acknowledgment, and collaborative exploration - **Domain Diversity:** Technical topics, creative writing, scientific reasoning, strategic thinking, and more Each entry is a carefully curated example of effective human-AI collaboration, optimized for training models that can: - Engage in substantive dialogue - Acknowledge and explore uncertainty - Synthesize information from multiple perspectives - Maintain context and coherence across exchanges - Collaborate productively toward shared understanding --- ## Dataset Structure Each entry in the dataset follows a standard conversational format with **thinking tags** that reveal the AI's reasoning process: ```json { "messages": [ { "role": "user", "content": "Can you help me optimize this Python function for processing large datasets?" }, { "role": "assistant", "content": "<think>\nThe user needs performance optimization help. I should:\n1. Analyze their current approach for bottlenecks\n2. Suggest concrete improvements with explanations\n3. Ask clarifying questions about their use case\n4. Provide multiple solutions at different complexity levels\n</think>\n\nI'd be happy to help optimize that! To give you the best recommendations, could you share:\n- The current function code\n- Typical dataset size you're working with\n- Any specific performance bottlenecks you've noticed?\n\nIn the meantime, here are common optimization strategies..." } ] } ``` ### Think Tags A distinctive feature of this dataset is the inclusion of **`<think>` tags** that contain the AI's internal reasoning process before generating the visible response. These tags: - Show strategic planning and approach selection - Reveal uncertainty acknowledgment and decision-making - Demonstrate consideration of user context and needs - Provide insight into collaborative reasoning patterns **Training Tip:** You can choose to keep or remove think tags depending on your use case: - **Keep them** to train models on transparent reasoning and planning - **Remove them** if you only want the final conversational outputs ### Data Format - **Format:** JSONL (JSON Lines - one JSON object per line) - **Character Encoding:** UTF-8 - **Line Endings:** Standard newlines (`\n`) ### Field Descriptions | Field | Type | Description | |-------|------|-------------| | `messages` | Array | Conversation turn containing user and assistant exchanges | | `role` | String | Either "user" or "assistant" | | `content` | String | The actual message content (may include `<think>` tags for assistant responses) | --- ## Key Characteristics ### Quality Attributes - ✓ **Transparent Reasoning:** Internal `<think>` tags reveal planning, strategy selection, and uncertainty - ✓ **Coherent:** Responses maintain logical consistency and relevance - ✓ **Substantive:** Meaningful depth in exploration and explanation - ✓ **Collaborative:** Genuine dialogue advancing shared understanding - ✓ **Diverse:** Wide range of topics, styles, and interaction patterns ### Scope of Data The dataset spans: - Multiple domains (STEM, humanities, creative, strategic) - Various conversation lengths (2-turn exchanges to extended dialogues) - Different reasoning styles (analytical, creative, exploratory, systematic) - Mixed levels of topic complexity (accessible to specialized) --- ## Usage ### Loading the Dataset **Python with Hugging Face Datasets:** ```python from datasets import load_dataset dataset = load_dataset("huggingface", "vanta-research/human-ai-1962") ``` **Raw JSONL Loading:** ```python import json entries = [] with open("human-ai-1962.jsonl", "r", encoding="utf-8") as f: for line in f: entries.append(json.loads(line)) ``` **Processing Think Tags:** ```python import re def extract_thinking(content): """Extract think tags and visible response separately.""" think_match = re.search(r'<think>(.*?)</think>', content, re.DOTALL) thinking = think_match.group(1).strip() if think_match else None visible = re.sub(r'<think>.*?</think>', '', content, flags=re.DOTALL).strip() return thinking, visible def remove_think_tags(content): """Remove think tags entirely for training without reasoning.""" return re.sub(r'<think>.*?</think>', '', content, flags=re.DOTALL).strip() # Example usage with open("human-ai-1962.jsonl", "r", encoding="utf-8") as f: for line in f: entry = json.loads(line) for msg in entry["messages"]: if msg["role"] == "assistant": thinking, response = extract_thinking(msg["content"]) print(f"Internal reasoning: {thinking}") print(f"Visible response: {response}") ``` **With PyTorch:** ```python from torch.utils.data import Dataset class HumanAIDataset(Dataset): def __init__(self, jsonl_path): self.data = [] with open(jsonl_path, "r", encoding="utf-8") as f: for line in f: self.data.append(json.loads(line)) def __len__(self): return len(self.data) def __getitem__(self, idx): return self.data[idx] ``` ### Recommended Applications 1. **Reasoning Transparency Training:** Leverage `<think>` tags to train models that show their work 2. **Language Model Fine-Tuning:** Train models on substantive dialogue patterns (with or without reasoning tags) 3. **Chain-of-Thought Learning:** Use internal reasoning to improve model planning capabilities 4. **Dialogue Evaluation:** Benchmark conversational AI systems on collaborative interactions 5. **RLHF Datasets:** Use as reference examples for preference learning 6. **Instruction Following:** Learn nuanced response generation patterns --- ## Statistics - **Total Entries:** conversations - **Average Entry Length:** ~200-400 tokens per exchange - **Language:** English - **Minimum Quality Threshold:** All entries reviewed for coherence and relevance --- ## License This dataset is released under the **Apache License 2.0**. See LICENSE file for full details. ### Apache 2.0 Summary The Apache License 2.0 permits: - ✓ Commercial use - ✓ Modification - ✓ Distribution - ✓ Private use With the conditions: - License and copyright notice must be included - State significant changes made to the code/data --- ## Ethical Considerations ### Dataset Limitations - Reflects conversation patterns as of training data cutoff - May contain perspectives that are not universally shared - Represents primarily English-language interactions - Domain coverage is non-uniform (some topics more represented than others) ### Responsible Use We encourage users to: 1. Consider downstream application impacts 2. Test for biases relevant to your use case 3. Maintain transparency about dataset usage 4. Report concerns or improvements to maintainers --- ## Citation If you use this dataset in research or production, please cite: ```bibtex @dataset{vanta-research-human-ai-collaboration-2, title={Human-AI Collaboration-2}, author={VANTA Research}, year={2025}, url={https://huggingface.co/datasets/vanta-research/human-ai-collaboration-2} } ``` --- ## Data Quality & Curation Each entry in this dataset has been: - ✓ Validated for JSON structural integrity - ✓ Verified for conversation coherence - ✓ Checked for substantive content - ✓ Reviewed for diversity of interaction patterns Entries are designed to exemplify high-quality human-AI collaboration, making them suitable for training models to engage in similarly productive exchanges. --- ## Getting Started 1. **Download:** Access the dataset from Hugging Face Hub 2. **Explore:** Start with a small subset to understand structure 3. **Integrate:** Use provided code examples to load into your pipeline 4. **Fine-tune:** Apply to your specific task with domain-appropriate training procedures 5. **Evaluate:** Benchmark your results against baseline models --- ## Contributing & Feedback ### Reporting Issues If you find data quality issues, format problems, or licensing concerns, please open an issue on the project repository. ### Improvements & Suggestions We welcome feedback on: - Data completeness - Format usability - Documentation clarity - Suggested use cases --- ## Changelog ### Version 1.0 (2025) - Initial release with 3,050 curated entries - JSONL format standardization - Comprehensive documentation --- # Contact - Organization: hello@vantaresearch.xyz - Engineering/Design: tyler@vantaresearch.xyz --- **Disclaimer:** This dataset is provided as-is for research and development purposes. Users are responsible for evaluating the dataset's suitability for their specific applications. --- **License:** Apache License 2.0 | **Repository Owner:** VANTA Research | **Initial Commit:** November 2025 *VANTA Research: Building safe, resilient AI models optimized for human-AI collaboration*
提供机构:
vanta-research
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作