ferMorales/Gaceta_UNAM_BGE_M3_V2
Source: Hugging Face. Updated 2026-04-09; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/ferMorales/Gaceta_UNAM_BGE_M3_V2
Resource description:
---
license: other
language:
- es
task_categories:
- feature-extraction
- text-retrieval
size_categories:
- 100K<n<1M
tags:
- embeddings
- rag
- hybrid-search
- bge-m3
- dense
- sparse
- unam
- gaceta
pretty_name: Gaceta UNAM BGE-M3 V2 (dense + sparse)
---
# Gaceta UNAM BGE-M3 V2 (Dense + Sparse)
## Dataset Overview
**Gaceta UNAM BGE-M3 V2** is a dataset of semantic embeddings for text chunks
extracted from issues of *Gaceta UNAM*, generated with **BAAI/bge-m3**. It
includes **both dense and sparse (lexical) representations**, so it can be
used directly for hybrid retrieval (dense ANN + sparse lexical matching)
without re-encoding the corpus.
- **Creator**: Fernando Morales (`ferMorales`)
- **Embedding Model**: [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3)
- **Language**: Spanish
- **License**: Other
## Summary
| Property | Value |
|---|---|
| File | `embeddings.parquet` |
| Embedding Model | BAAI/bge-m3 |
| Total Records (chunks) | 207,888 |
| Unique Issues (`source_file`) | 5,628 |
| Embedding Dimension (dense) | 1024 |
| Sparse Vectors | Variable-length (token id → weight) |
| Time Coverage | 1954-08-23 to 2026-02-09 |
## Schema (Columns)
| Column | Type | Description |
|---|---|---|
| `chunk_id` | string | Unique chunk identifier (UUID5 derived from source) |
| `doc_id` | string | Document/issue identifier |
| `chunk_index` | int64 | Chunk position within the source document |
| `corpus` | string | Corpus segment / decade family (`gum00`, `gum10`, …) |
| `decade` | string | Decade grouping derived from the document |
| `issue_date` | string | Issue date (`YYYY-MM-DD`); may be the literal `"unknown-date"` |
| `source_pdf` | string | Path to the original PDF |
| `source_file` | string | Path to the intermediate JSON file |
| `text` | string | Chunk text used for embedding |
| `embedding` | `fixed_size_list<float32>[1024]` | L2-normalized BGE-M3 dense vector |
| `sparse_indices` | `list<uint32>` | BGE-M3 lexical token IDs |
| `sparse_values` | `list<float32>` | BGE-M3 lexical token weights (parallel to `sparse_indices`) |
The sparse representation comes directly from `BGEM3FlagModel.encode(...,
return_sparse=True)["lexical_weights"]` — each chunk has a variable-length
list of `(token_id, weight)` pairs split into two parallel columns.
## Corpus Distribution
| Corpus | Chunks |
|---|---|
| gum10 | 60,285 |
| gum90 | 48,962 |
| gum80 | 48,687 |
| gum00 | 27,988 |
| gum70 | 13,989 |
| gum60 | 4,695 |
| gum50 | 3,282 |
## Coverage notes
- 623 chunks have `issue_date = "unknown-date"` (date could not be parsed
from the source). Filter these out for time-based analyses.
- All other rows have non-null values for every schema field.
- Chunks are produced from OCR'd historical documents — minor noise may
remain in `text`.
- All data for **2004** is missing due to upstream scraper issues.
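For time-based analyses, the `"unknown-date"` sentinel can be handled at parse time. The following is a minimal sketch using only the standard library; `parse_issue_date` is a hypothetical helper name, and it assumes every other value follows `YYYY-MM-DD` as described in the schema. The rows are toy values, not real records.

```python
from datetime import date
from typing import Optional

UNKNOWN_DATE = "unknown-date"  # sentinel value documented in the schema


def parse_issue_date(value: str) -> Optional[date]:
    """Return a date for 'YYYY-MM-DD' strings, or None for the sentinel."""
    if value == UNKNOWN_DATE:
        return None
    return date.fromisoformat(value)


# Toy rows standing in for dataset records (not real data).
rows = [
    {"chunk_index": 0, "issue_date": "1954-08-23"},
    {"chunk_index": 1, "issue_date": "unknown-date"},
]

# Keep only rows whose date could be parsed.
dated = [r for r in rows if parse_issue_date(r["issue_date"]) is not None]
print(len(dated))
```

With pandas, the same helper could be applied as `df["issue_date"].map(parse_issue_date)` before dropping the rows that come back as `None`.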
## Usage
### Load with `datasets` / `pandas`
```python
from datasets import load_dataset
ds = load_dataset("ferMorales/Gaceta_UNAM_BGE_M3_V2", split="train")
print(ds[0]["text"][:200])
print(len(ds[0]["embedding"])) # 1024
print(len(ds[0]["sparse_indices"])) # variable
```
```python
import pandas as pd
df = pd.read_parquet("hf://datasets/ferMorales/Gaceta_UNAM_BGE_M3_V2/embeddings.parquet")
```
### Hybrid retrieval (dense + sparse)
The sparse columns are stored as parallel `list<uint32>` / `list<float32>`
arrays. To convert one row back into a `{token_id: weight}` dict:
```python
def row_to_sparse(row):
    return dict(zip(row["sparse_indices"], row["sparse_values"]))
```
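Once a query and a chunk are both in dict form, a lexical score can be computed over the token IDs they share. The sketch below assumes the score is the sum of weight products over shared token IDs, which is a common choice but an assumption here; `sparse_dot` is a hypothetical helper. Keys are cast to `int` because query-side `lexical_weights` may use string token IDs, while the dataset stores integers. The values are toy numbers, not real rows.

```python
def row_to_sparse(row):
    """Rebuild {token_id: weight} from the two parallel sparse columns."""
    return dict(zip(row["sparse_indices"], row["sparse_values"]))


def sparse_dot(query_weights, doc_weights):
    """Sum of weight products over token IDs present on both sides."""
    # Normalize keys to int so str and int token IDs compare equal.
    q = {int(k): float(v) for k, v in query_weights.items()}
    d = {int(k): float(v) for k, v in doc_weights.items()}
    return sum(w * d[t] for t, w in q.items() if t in d)


# Toy values, not real dataset rows.
row = {"sparse_indices": [10, 42, 7], "sparse_values": [0.5, 0.25, 0.1]}
query = {"42": 2.0, "99": 1.0}  # only token 42 overlaps with the row

print(sparse_dot(query, row_to_sparse(row)))  # 0.5
```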
For Qdrant, you can ingest these directly into a collection configured with
both a dense vector (`VectorParams`, size 1024, cosine distance) and a named
sparse vector (`SparseVectorParams`): pass `indices=row["sparse_indices"]` and
`values=row["sparse_values"]` when building each sparse vector.
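As an illustration, one row can be mapped to a point carrying two named vectors. This sketch builds the plain dict shape used by Qdrant's points upsert request body rather than calling a client. The vector names `dense` and `sparse` and the helper `row_to_point` are hypothetical, and the names must match whatever the collection was created with. It also assumes `chunk_id` is a UUID string, which Qdrant accepts as a point ID.

```python
def row_to_point(row, dense_name="dense", sparse_name="sparse"):
    """Map one dataset row to a Qdrant-style point dict (names are assumptions)."""
    return {
        "id": row["chunk_id"],
        "vector": {
            dense_name: [float(x) for x in row["embedding"]],
            sparse_name: {
                "indices": [int(i) for i in row["sparse_indices"]],
                "values": [float(v) for v in row["sparse_values"]],
            },
        },
        # Keep traceability fields as payload for filtering and citation.
        "payload": {
            "doc_id": row["doc_id"],
            "chunk_index": row["chunk_index"],
            "issue_date": row["issue_date"],
            "text": row["text"],
        },
    }


# Toy row with a 3-dimensional vector; real rows carry 1024 dimensions.
toy_row = {
    "chunk_id": "00000000-0000-5000-8000-000000000000",
    "doc_id": "toy-doc",
    "chunk_index": 0,
    "issue_date": "unknown-date",
    "text": "texto de ejemplo",
    "embedding": [0.6, 0.8, 0.0],
    "sparse_indices": [10, 42],
    "sparse_values": [0.5, 0.25],
}
point = row_to_point(toy_row)
print(sorted(point["vector"]))
```

A batch of such dicts would go under a top-level `"points"` key in the upsert request; the Python client offers typed equivalents for the same structure.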
### Encoding queries with the same model
```python
from FlagEmbedding import BGEM3FlagModel
model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
out = model.encode(["mi consulta"], return_dense=True, return_sparse=True)
dense_q = out["dense_vecs"][0] # shape (1024,)
sparse_q = out["lexical_weights"][0] # {token_id: weight}
```
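Because the stored dense vectors are L2-normalized, cosine similarity against a query reduces to a dot product once the query is normalized as well. Below is a brute-force sketch with NumPy, using random toy vectors in place of the real `embedding` column; `top_k_dense` is a hypothetical helper, and an ANN index would normally replace the full matrix product at this corpus size.

```python
import numpy as np


def top_k_dense(query_vec, doc_matrix, k=5):
    """Return (indices, scores) of the k rows most similar to the query."""
    q = np.asarray(query_vec, dtype=np.float32)
    q = q / np.linalg.norm(q)        # normalize defensively
    scores = doc_matrix @ q          # equals cosine when rows are unit-norm
    order = np.argsort(-scores)[:k]  # highest score first
    return order, scores[order]


# Random toy data standing in for the dataset's dense vectors.
rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 1024)).astype(np.float32)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)  # mimic stored unit vectors

# Query with a copy of row 3, so row 3 should rank first.
idx, scores = top_k_dense(docs[3], docs, k=3)
print(idx[0])
```

For the real data, something like `np.stack(df["embedding"].to_numpy())` could build `doc_matrix`, memory permitting: 207,888 × 1024 float32 values come to roughly 0.85 GB.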
## Recommended Use
1. **Hybrid semantic search** — combine dense cosine + sparse lexical scoring
for improved recall on rare terms / proper nouns / OCR artifacts.
2. **Historical RAG** — retrieval with full source traceability via
`source_pdf`, `source_file`, and `chunk_index`.
3. **Metadata filtering** — slice by `corpus`, `decade`, `issue_date`, or
`doc_id` before re-ranking.
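One simple way to combine the dense and sparse signals from item 1 is reciprocal rank fusion (RRF), which merges ranked lists without requiring the two score types to share a scale. This is a swapped-in illustration, not necessarily how this dataset is intended to be queried; `rrf_fuse` and the constant `k=60` are assumptions.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of chunk IDs with reciprocal rank fusion."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            fused[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(fused, key=lambda cid: (-fused[cid], cid))


# Toy rankings of chunk IDs, best first (not real results).
dense_ranking = ["a", "b", "c"]
sparse_ranking = ["b", "d", "a"]

print(rrf_fuse([dense_ranking, sparse_ranking]))  # ['b', 'a', 'd', 'c']
```

Metadata filters such as `corpus` or `decade` can be applied to each candidate list before fusion, so that only eligible chunks are ranked.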
## Citation
If you use this dataset, please cite both the embedding model and this
release:
```
@misc{gaceta_unam_bgem3_v2,
author = {Fernando Morales},
title = {Gaceta UNAM BGE-M3 V2 (dense + sparse)},
year = {2026},
url = {https://huggingface.co/datasets/ferMorales/Gaceta_UNAM_BGE_M3_V2},
}
```
Provider:
ferMorales



