PhillyMac/Active_Listening_Content_1
Hugging Face · Updated 2026-03-18 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/PhillyMac/Active_Listening_Content_1
Description:
---
license: cc0-1.0
task_categories:
- text-generation
- feature-extraction
language:
- en
tags:
- corpus
- leadership
- historical
- deku-corpus-builder
size_categories:
- 1K<n<10K
---
# Active-Listening-Content-1
This corpus was automatically generated by the **Deku Corpus Builder** for use in RAG-based AI applications.
## Dataset Description
- **Subject**: Active Listening
- **Subject Type**: topic
- **Total Items**: 1,246
- **Items Requiring Attribution**: 0
- **Has Embeddings**: Yes (all-MiniLM-L6-v2)
- **Created**: 2026-03-18
## Dataset Structure
Each record contains:
- `text`: The content text
- `source_url`: Original source URL
- `source_title`: Title of the source document
- `source_domain`: Domain of the source
- `license_type`: License classification (e.g. `public_domain`, `cc_by`, `cc_by_sa`)
- `attribution_required`: Boolean — True for CC BY / CC BY-SA and other attribution-required licenses
- `attribution_text`: Formatted Creative Commons attribution string (empty if not required)
- `license_url`: URL to the CC license deed (empty if not required)
- `relevance_score`: Relevance to the subject (0-1)
- `quality_score`: Content quality score (0-1)
- `topics`: JSON array of detected topics
- `character_count`: Length of the text
- `subject_name`: The subject this content relates to
- `subject_type`: "personality" or "topic"
- `extraction_date`: When the content was extracted
- `embedding`: Pre-computed 384-dimensional embedding vector
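The field list above can be sketched as a typed record. This is a minimal sketch, not the builder's actual schema: the `ChunkRecord` type and the `parse_topics` helper are hypothetical names, the Python types are inferred from the field descriptions, and whether `topics` arrives as a JSON-encoded string or as a native list is an assumption that the helper handles both ways.

```python
import json
from typing import List, Mapping, TypedDict


class ChunkRecord(TypedDict):
    """Hypothetical type mirroring the documented fields (types are assumed)."""
    text: str
    source_url: str
    source_title: str
    source_domain: str
    license_type: str
    attribution_required: bool
    attribution_text: str
    license_url: str
    relevance_score: float
    quality_score: float
    topics: str            # assumed to be a JSON-encoded array
    character_count: int
    subject_name: str
    subject_type: str      # "personality" or "topic"
    extraction_date: str   # assumed to be stored as a string
    embedding: List[float]  # 384 values


def parse_topics(record: Mapping) -> List[str]:
    """Return the topics of a record as a Python list."""
    raw = record["topics"]
    if isinstance(raw, str):
        # Decode the JSON array; treat an empty string as "no topics".
        return json.loads(raw) if raw else []
    return list(raw)
```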
## Attribution
0 of 1,246 chunks in this corpus require attribution under their source license.
When building lessons from these chunks, the `attribution_text` field must be surfaced
in the lesson output per the Legend Leadership Attribution Tracking Spec.
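One way to gather the strings that need to be surfaced is sketched below. The Legend Leadership Attribution Tracking Spec is not reproduced in this card, so how attributions are to be rendered is an assumption; the hypothetical `collect_attributions` helper only collects the unique, non-empty `attribution_text` values of chunks flagged `attribution_required`.

```python
from typing import Iterable, List, Mapping


def collect_attributions(chunks: Iterable[Mapping]) -> List[str]:
    """Return unique, non-empty attribution strings in first-seen order."""
    seen = set()
    result = []
    for chunk in chunks:
        # Skip chunks whose license does not require attribution.
        if not chunk.get("attribution_required"):
            continue
        text = (chunk.get("attribution_text") or "").strip()
        if text and text not in seen:
            seen.add(text)
            result.append(text)
    return result
```

For this corpus the result is expected to be an empty list, since no chunk requires attribution.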
## Usage
```python
from datasets import load_dataset

# Load the corpus from the Hugging Face Hub
dataset = load_dataset("PhillyMac/Active_Listening_Content_1")

# Print the attribution string for every chunk that requires one
for item in dataset["train"]:
    if item["attribution_required"]:
        print(item["attribution_text"])
```
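The `relevance_score` and `quality_score` fields can also be used to keep only the stronger chunks. The sketch below works on any iterable of records; the `select_chunks` name and the thresholds 0.7 and 0.6 are illustrative assumptions, not values recommended by the corpus builder.

```python
from typing import Iterable, List, Mapping


def select_chunks(
    chunks: Iterable[Mapping],
    min_relevance: float = 0.7,  # arbitrary illustrative threshold
    min_quality: float = 0.6,    # arbitrary illustrative threshold
) -> List[Mapping]:
    """Keep chunks that meet both thresholds, most relevant first."""
    kept = [
        c for c in chunks
        if c["relevance_score"] >= min_relevance
        and c["quality_score"] >= min_quality
    ]
    return sorted(kept, key=lambda c: c["relevance_score"], reverse=True)
```

Calling `select_chunks(dataset["train"])` would apply the same filter to the loaded split.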
## Integration with RAG
This dataset is designed to be merged with existing embedded corpora. The embeddings were produced with the `sentence-transformers/all-MiniLM-L6-v2` model and can be indexed with FAISS.
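Retrieval over the pre-computed embeddings can be illustrated without any index at all. The sketch below is a plain-Python brute-force cosine search, used here as a stand-in for a FAISS index, not as the card's prescribed method; `cosine` and `top_k` are hypothetical helper names. It assumes the query vector was produced by the same `all-MiniLM-L6-v2` model, since vectors from different models are not comparable.

```python
import math
from typing import List, Mapping, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float],
    chunks: Sequence[Mapping],
    k: int = 3,
) -> List[Tuple[float, str]]:
    """Return the k (similarity, text) pairs most similar to the query."""
    scored = [(cosine(query_vec, c["embedding"]), c["text"]) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

A linear scan like this is workable for 1,246 chunks; an index becomes more useful once this corpus is combined with larger ones.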
## License
Content is sourced from public domain and Creative Commons licensed materials.
See individual `license_type` fields for per-chunk licensing details.
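A quick way to see which licenses are present is to tally the `license_type` field. This is a small sketch with a hypothetical `license_breakdown` helper; it makes no assumption about which license values actually occur in the corpus.

```python
from collections import Counter
from typing import Iterable, Mapping


def license_breakdown(chunks: Iterable[Mapping]) -> Counter:
    """Count how many chunks fall under each license_type value."""
    return Counter(c["license_type"] for c in chunks)
```

Calling `license_breakdown(dataset["train"])` on the loaded split returns a mapping from each license label to its chunk count.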
## Generated By
[Deku Corpus Builder](https://github.com/PhillyMac/deku-corpus-builder) - An automated corpus building system for AI applications.
Provider:
PhillyMac



