PhillyMac/Active_Listening_Content_1

Name: PhillyMac/Active_Listening_Content_1
Creator: PhillyMac
Published: 2026-03-18 18:47:43
License: No description provided

Hugging Face | Updated 2026-03-18 | Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/PhillyMac/Active_Listening_Content_1

Description:

---
license: cc0-1.0
task_categories:
- text-generation
- feature-extraction
language:
- en
tags:
- corpus
- leadership
- historical
- deku-corpus-builder
size_categories:
- 1K<n<10K
---

# Active-Listening-Content-1

This corpus was automatically generated by the **Deku Corpus Builder** for use in RAG-based AI applications.

## Dataset Description

- **Subject**: Active Listening
- **Subject Type**: topic
- **Total Items**: 1,246
- **Items Requiring Attribution**: 0
- **Has Embeddings**: Yes (all-MiniLM-L6-v2)
- **Created**: 2026-03-18

## Dataset Structure

Each record contains:

- `text`: The content text
- `source_url`: Original source URL
- `source_title`: Title of the source document
- `source_domain`: Domain of the source
- `license_type`: License classification (e.g. `public_domain`, `cc_by`, `cc_by_sa`)
- `attribution_required`: Boolean; True for CC BY / CC BY-SA and other attribution-required licenses
- `attribution_text`: Formatted Creative Commons attribution string (empty if not required)
- `license_url`: URL to the CC license deed (empty if not required)
- `relevance_score`: Relevance to the subject (0-1)
- `quality_score`: Content quality score (0-1)
- `topics`: JSON array of detected topics
- `character_count`: Length of the text
- `subject_name`: The subject this content relates to
- `subject_type`: "personality" or "topic"
- `extraction_date`: When the content was extracted
- `embedding`: Pre-computed 384-dimensional embedding vector

## Attribution

0 of 1,246 chunks in this corpus require attribution under their source license. When building lessons from these chunks, the `attribution_text` field must be surfaced in the lesson output per the Legend Leadership Attribution Tracking Spec.

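As a rough illustration of how the fields listed above might be used together, the sketch below filters records by score, decodes `topics`, and gathers the attribution strings that would need to be surfaced. The sample records, the threshold values, and the helper names (`select_chunks`, `parse_topics`, `collect_attributions`) are hypothetical and not part of the dataset or of the Deku Corpus Builder. It also assumes that `topics` may be stored as a JSON-encoded string, which the dataset card does not state explicitly.

```python
import json

# Hypothetical sample records shaped like the field list above; all values are invented.
SAMPLE_RECORDS = [
    {
        "text": "Paraphrase what the speaker said before responding.",
        "license_type": "public_domain",
        "attribution_required": False,
        "attribution_text": "",
        "relevance_score": 0.91,
        "quality_score": 0.80,
        "topics": '["paraphrasing", "feedback"]',
    },
    {
        "text": "Avoid interrupting while the other person is speaking.",
        "license_type": "cc_by",
        "attribution_required": True,
        "attribution_text": "Example Author, CC BY 4.0",
        "relevance_score": 0.75,
        "quality_score": 0.66,
        "topics": '["interruptions"]',
    },
    {
        "text": "Unrelated navigation text.",
        "license_type": "public_domain",
        "attribution_required": False,
        "attribution_text": "",
        "relevance_score": 0.10,
        "quality_score": 0.20,
        "topics": "[]",
    },
]


def select_chunks(records, min_relevance=0.5, min_quality=0.5):
    """Keep records whose relevance and quality scores both meet the thresholds."""
    return [
        r for r in records
        if r["relevance_score"] >= min_relevance and r["quality_score"] >= min_quality
    ]


def parse_topics(record):
    """Return topics as a list, decoding the value if it is a JSON string (assumption)."""
    topics = record["topics"]
    return json.loads(topics) if isinstance(topics, str) else list(topics)


def collect_attributions(records):
    """Gather unique, non-empty attribution strings in first-seen order."""
    seen = []
    for r in records:
        text = r["attribution_text"].strip()
        if r["attribution_required"] and text and text not in seen:
            seen.append(text)
    return seen


selected = select_chunks(SAMPLE_RECORDS)
attributions = collect_attributions(selected)
```

For this corpus `collect_attributions` would be expected to return an empty list, since no chunk requires attribution. The same helper still applies if the chunks are mixed with other corpora that do.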
## Usage

```python
from datasets import load_dataset

dataset = load_dataset("PhillyMac/Active_Listening_Content_1")

# Access attribution-required chunks
for item in dataset["train"]:
    if item["attribution_required"]:
        print(item["attribution_text"])
```

## Integration with RAG

This dataset is designed to be integrated with existing embedded corpuses. The embeddings use the `sentence-transformers/all-MiniLM-L6-v2` model, compatible with FAISS indexing.

## License

Content is sourced from public domain and Creative Commons licensed materials. See individual `license_type` fields for per-chunk licensing details.

## Generated By

[Deku Corpus Builder](https://github.com/PhillyMac/deku-corpus-builder) - An automated corpus building system for AI applications.
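
To illustrate the retrieval step mentioned under "Integration with RAG", here is a minimal, dependency-free sketch of cosine-similarity search over stored embeddings. It is not the dataset author's pipeline: the function names and the 3-dimensional toy vectors are hypothetical stand-ins for the 384-dimensional `embedding` field. At scale, a FAISS inner-product index over L2-normalized vectors computes the same scores.

```python
import math


def normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]


def top_k(query, embeddings, k=2):
    """Return (index, cosine_similarity) pairs for the k most similar embeddings."""
    q = normalize(query)
    scored = []
    for idx, emb in enumerate(embeddings):
        e = normalize(emb)
        # Inner product of unit vectors equals cosine similarity.
        score = sum(a * b for a, b in zip(q, e))
        scored.append((score, idx))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]


# Toy 3-dimensional vectors standing in for the 384-dimensional embeddings.
TOY_EMBEDDINGS = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]

results = top_k([1.0, 0.1, 0.0], TOY_EMBEDDINGS, k=2)
for idx, score in results:
    print(idx, round(score, 3))
```

In practice the query vector would have to come from the same `sentence-transformers/all-MiniLM-L6-v2` model, so that it lies in the same space as the stored embeddings.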
