meetara-lab/vectorstore-academic_tutoring
Source: Hugging Face. Updated 2025-12-11, indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/meetara-lab/vectorstore-academic_tutoring
Description:
---
license: apache-2.0
task_categories:
- feature-extraction
- text-retrieval
- question-answering
task_ids:
- semantic-similarity-scoring
- document-retrieval
- open-domain-qa
language:
- en
tags:
- embeddings
- vector-database
- rag
- retrieval-augmented-generation
- semantic-search
- knowledge-base
- academic-tutoring
size_categories:
- 10K<n<100K
annotations_creators:
- machine-generated
language_creators:
- found
multilinguality: monolingual
pretty_name: Academic Tutoring Vectorstore Dataset
source_datasets:
- original
---
# Vectorstore Dataset: Academic Tutoring
## Overview
This dataset contains pre-computed vector embeddings for the **academic tutoring** domain, ready for use in Retrieval-Augmented Generation (RAG) applications, semantic search, and knowledge base systems. The embeddings are generated from the source documents with the `sentence-transformers/all-MiniLM-L6-v2` model, so you can build RAG applications without paying the computational cost of embedding generation yourself.
## Key Features
- ✅ **Pre-computed embeddings**: Ready-to-use vector embeddings, saving computation time
- ✅ **Production-ready**: Optimized for real-world RAG applications
- ✅ **Comprehensive metadata**: Includes source file information, page numbers, and document hashes
- ✅ **LangChain compatible**: Works seamlessly with LangChain and ChromaDB
- ✅ **Search-optimized**: Designed for fast semantic similarity search
## What's Included
This dataset contains **64,845** text chunks from **22** source documents, each pre-embedded using the `sentence-transformers/all-MiniLM-L6-v2` model. Each chunk includes:
- **Text content**: The original document text
- **Embedding vector**: 384-dimensional float32 vector
- **Rich metadata**: Source file, page numbers, document hash, and more
## Dataset Details
### Dataset Summary
- **Domain**: `academic_tutoring`
- **Total Chunks**: 64,845
- **Total Documents**: 22
- **Database Size**: 970.77 MB (8 files)
- **Embedding Model**: `sentence-transformers/all-MiniLM-L6-v2`
- **Chunk Size**: 1000 characters
- **Chunk Overlap**: 200 characters
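To make the chunk size and overlap concrete, here is a minimal sketch of fixed-size chunking with overlap. It is an illustration only: this card does not say which splitter produced the chunks, `chunk_text` is a hypothetical helper, and the sizes are assumed to be character counts.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # with the defaults, a new window starts every 800 characters
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "x" * 2500
print([len(c) for c in chunk_text(sample)])  # [1000, 1000, 900]
```

Because consecutive windows share 200 characters, a sentence that straddles a chunk boundary is still likely to appear whole in at least one chunk.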
### Dataset Structure
The dataset contains the following columns:
- **id**: Unique identifier for each chunk
- **embedding**: Vector embedding (384 float32 values; returned as a Python list by default, or as a NumPy array when the dataset is set to the `numpy` format)
- **document**: Original text content of the chunk
- **metadata**: JSON string containing metadata (file_name, file_hash, page_number, etc.)
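Since `metadata` is stored as a JSON string, it has to be decoded before individual fields can be read. The sketch below shows one way to do that; the row is made up for illustration, `parse_metadata` is a hypothetical helper, and the field names follow the list above.

```python
import json

# Hypothetical row for illustration; real rows come from the dataset.
row = {
    "id": "chunk-0",
    "document": "Example chunk text.",
    "metadata": '{"file_name": "example.pdf", "file_hash": "abc123", "page_number": 1}',
}


def parse_metadata(raw: str) -> dict:
    """Decode the JSON-encoded metadata column, returning {} for empty values."""
    return json.loads(raw) if raw else {}


meta = parse_metadata(row["metadata"])
print(meta.get("file_name", "Unknown"), meta.get("page_number", "N/A"))
```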
### Embedding Model
This dataset uses embeddings from: `sentence-transformers/all-MiniLM-L6-v2`
## Usage
### Loading the Dataset
```python
from datasets import load_dataset
# Load the dataset
dataset = load_dataset("meetara-lab/vectorstore-academic_tutoring")
# Access the data
print(dataset["train"][0])
# Output:
# {
# 'id': '...',
# 'embedding': [...],  # 384 floats; a plain list by default, a NumPy array after .with_format("numpy")
# 'document': '...',
# 'metadata': '{"file_name": "...", "page": 1, ...}'
# }
```
### Loading Back into ChromaDB
```python
import json

import chromadb
import numpy as np
from datasets import load_dataset

# Load dataset
dataset = load_dataset("meetara-lab/vectorstore-academic_tutoring")["train"]

# Collect ids, pre-computed embeddings, texts, and metadata
ids = []
embeddings_list = []
documents = []
metadatas = []
for item in dataset:
    ids.append(item["id"])
    # np.asarray accepts both plain lists (the default format) and NumPy arrays
    embeddings_list.append(np.asarray(item["embedding"], dtype=np.float32).tolist())
    documents.append(item["document"])
    metadatas.append(json.loads(item["metadata"]))

# Note: use ChromaDB's Python client directly so the pre-computed embeddings are stored as-is
client = chromadb.PersistentClient(path="./chroma_academic_tutoring")
collection = client.get_or_create_collection(name="academic_tutoring")

# Add in batches: ChromaDB caps how many records one add() call accepts.
# The batch size of 5000 is an assumption chosen to stay under commonly reported limits.
batch_size = 5000
for start in range(0, len(ids), batch_size):
    end = start + batch_size
    collection.add(
        ids=ids[start:end],
        embeddings=embeddings_list[start:end],
        documents=documents[start:end],
        metadatas=metadatas[start:end],
    )
```
### Using with LangChain
```python
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings

# Initialize the embedding model used for queries (must match the stored vectors)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Open the collection populated from the HF Hub data (see "Loading Back into ChromaDB" above)
vectorstore = Chroma(
    collection_name="academic_tutoring",
    persist_directory="./chroma_academic_tutoring",
    embedding_function=embeddings,
)

# Build a retriever and run a query
retriever = vectorstore.as_retriever()
results = retriever.invoke("your query here")
```
### Domain-Specific Usage Examples
This vectorstore is optimized for **Academic Tutoring** domain queries. Here are practical examples:
#### Example Queries
```python
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings

# Load vectorstore (see "Loading Back into ChromaDB" above)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Chroma(
    collection_name="academic_tutoring",  # the collection created in the loading step
    persist_directory="./chroma_academic_tutoring",
    embedding_function=embeddings,
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# Example queries for academic tutoring domain:
example_queries = [
    "How to solve quadratic equations?",
    "Explain photosynthesis process",
    "What is the structure of an essay?",
    "How to study effectively for exams?",
    "Explain the causes of World War I",
]

# Run a query
query = "How to solve quadratic equations?"
results = retriever.invoke(query)

# Display results
for i, doc in enumerate(results, 1):
    # This card shows the page key as both "page" and "page_number", so check both
    page = doc.metadata.get("page", doc.metadata.get("page_number", "N/A"))
    print(f"\nResult {i}:")
    print(f" Source: {doc.metadata.get('file_name', 'Unknown')}")
    print(f" Page: {page}")
    print(f" Content: {doc.page_content[:200]}...")
```
#### Common Use Cases
This dataset is useful for:
- **Homework help and explanations**
- **Study guide creation**
- **Concept clarification**
- **Exam preparation**
- **Subject-specific tutoring**
#### Real-World Example
```python
# Complete example: Query and use results
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings

# 1. Initialize (after loading from HF Hub)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Chroma(
    collection_name="academic_tutoring",  # the collection created in the loading step
    persist_directory="./chroma_academic_tutoring",
    embedding_function=embeddings,
)

# 2. Create retriever with relevance filtering
#    (score_threshold only takes effect with the "similarity_score_threshold" search type)
retriever = vectorstore.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={
        "k": 5,  # Return at most the 5 most relevant results
        "score_threshold": 0.7,  # Minimum relevance score
    },
)

# 3. Query the vectorstore
query = "Explain photosynthesis process"
docs = retriever.invoke(query)

# 4. Process results
for doc in docs:
    metadata = doc.metadata
    # This card shows the page key as both "page" and "page_number", so check both
    page = metadata.get("page", metadata.get("page_number", "N/A"))
    print(f"📄 File: {metadata.get('file_name', 'Unknown')}")
    print(f"📃 Page: {page}")
    print(f"📝 Content: {doc.page_content[:300]}...\n")
```
## Dataset Statistics
### Content Statistics
- **Total Chunks**: 64,845
- **Total Documents**: 22
- **Average Chunks per Document**: 2947.5
- **Database Size**: 970.77 MB (8 files)
### Technical Specifications
- **Embedding Model**: `sentence-transformers/all-MiniLM-L6-v2` (384 dimensions)
- **Chunk Size**: 1000 characters
- **Chunk Overlap**: 200 characters
- **Format**: Parquet/Arrow (optimized for fast loading)
## Performance Considerations
### Loading Time
- Full dataset loads in ~5-15 seconds on average hardware
- Memory usage: ~95.0 MiB for the embeddings alone (64,845 × 384 float32 values)
- Recommended RAM: 2GB+ for full dataset operations
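The memory figure for the embeddings can be checked with a few lines of arithmetic. The sketch below uses only the chunk count and embedding dimension stated in this card, and assumes 4 bytes per float32 value with no container overhead.

```python
NUM_CHUNKS = 64_845      # total chunks, from the dataset summary
EMBEDDING_DIM = 384      # output size of all-MiniLM-L6-v2
BYTES_PER_FLOAT32 = 4

total_bytes = NUM_CHUNKS * EMBEDDING_DIM * BYTES_PER_FLOAT32
print(f"{total_bytes:,} bytes")            # 99,601,920 bytes
print(f"{total_bytes / 1024**2:.1f} MiB")  # 95.0 MiB
print(f"{total_bytes / 1000**2:.1f} MB")   # 99.6 MB
```

The ~95.0 figure above therefore corresponds to binary megabytes (MiB). Document text and metadata add to this, which is consistent with the larger RAM recommendation.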
### Search Performance
- Typical query time: <100ms for similarity search
- Optimized for retrieval of top-k results (k=5-10)
- Works best with vector databases like ChromaDB, Pinecone, or Weaviate
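For intuition about what a top-k similarity search does, here is a dependency-free sketch of brute-force cosine ranking. It is not how ChromaDB or the other databases are implemented, the choice of cosine similarity is an assumption (this card does not name a metric), and `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import heapq
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: list[list[float]], k: int = 5) -> list[tuple[int, float]]:
    """Return (index, score) pairs for the k vectors most similar to query, best first."""
    scored = ((i, cosine_similarity(query, v)) for i, v in enumerate(vectors))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy 3-dimensional vectors standing in for the 384-dimensional embeddings.
vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
for index, score in top_k([1.0, 0.1, 0.0], vectors, k=2):
    print(index, round(score, 3))
```

A linear scan like this compares the query against every stored vector, so for the full 64,845 chunks a vectorized library or a vector database with an approximate index is the more practical choice.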
## Citation
If you use this dataset, please cite:
```bibtex
@dataset{meetara_vectorstore_academic_tutoring,
title={meeTARA Vectorstore: Academic Tutoring},
author={meeTARA Lab},
year={2024},
url={https://huggingface.co/datasets/meetara-lab/vectorstore-academic_tutoring}
}
```
## Limitations and Considerations
- **Language**: This dataset is monolingual (English only)
- **Domain specificity**: Optimized for academic tutoring domain queries
- **Embedding model**: Uses `sentence-transformers/all-MiniLM-L6-v2`; queries must be embedded with the same model, and switching to a different model requires re-embedding every chunk
- **Update frequency**: Dataset reflects state at time of publication; source documents may have been updated
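One cheap guard against an embedding-model mismatch is to check the dimensionality of a query vector before searching. The sketch below assumes the 384-dimensional embeddings stated in this card; `check_query_dimension` is a hypothetical helper, not part of any library.

```python
EXPECTED_DIM = 384  # dimensionality of all-MiniLM-L6-v2 embeddings, per this card


def check_query_dimension(query_vector: list[float], expected_dim: int = EXPECTED_DIM) -> None:
    """Raise ValueError if a query vector cannot be compared with the stored embeddings."""
    if len(query_vector) != expected_dim:
        raise ValueError(
            f"Query embedding has {len(query_vector)} dimensions, expected {expected_dim}; "
            "embed queries with the same model that produced the stored vectors."
        )


check_query_dimension([0.0] * 384)      # passes silently
try:
    check_query_dimension([0.0] * 768)  # a vector from some larger model
except ValueError as err:
    print(err)
```

A matching dimension is necessary but not sufficient: two different models can both emit 384-dimensional vectors that are not comparable, so the check catches only the most obvious mismatches.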
## Alternatives and Related Datasets
Looking for other domains? Check out other meeTARA vectorstore datasets:
- `meetara-lab/vectorstore-general_health` - General health and medical information
- Additional domain datasets coming soon!
## Maintenance and Updates
This dataset is maintained by the meeTARA Lab team. For updates, bug reports, or feature requests, please visit our GitHub repository.
## License
This dataset is released under the **Apache 2.0 License**. This means you are free to:
- Use the dataset commercially and non-commercially
- Modify and create derivative works
- Distribute the dataset and modifications
Please see the full license text for complete terms.
## Contact and Support
- **GitHub**: [meetara-lab/meetara-core](https://github.com/meetara-lab/meetara-core)
- **Issues**: Report bugs or request features on GitHub Issues
- **Documentation**: Visit our repository for detailed documentation
---
**Made with ❤️ by the meeTARA Lab team**
Provider:
meetara-lab



