ferMorales/Gaceta_UNAM_BGE_M3_V2

Source: Hugging Face. Updated 2026-04-09; indexed 2026-04-12.

Download link: https://hf-mirror.com/datasets/ferMorales/Gaceta_UNAM_BGE_M3_V2
Description:
---
license: other
language:
- es
task_categories:
- feature-extraction
- text-retrieval
size_categories:
- 100K<n<1M
tags:
- embeddings
- rag
- hybrid-search
- bge-m3
- dense
- sparse
- unam
- gaceta
pretty_name: Gaceta UNAM BGE-M3 V2 (dense + sparse)
---

# Gaceta UNAM BGE-M3 V2 (Dense + Sparse)

## Dataset Overview

**Gaceta UNAM BGE-M3 V2** is a dataset of semantic embeddings for text chunks extracted from issues of *Gaceta UNAM*, generated with **BAAI/bge-m3**. It includes **both dense and sparse (lexical) representations**, so it can be used directly for hybrid retrieval (dense ANN + sparse lexical matching) without re-encoding the corpus.

- **Creator**: Fernando Morales (`ferMorales`)
- **Embedding Model**: [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3)
- **Language**: Spanish
- **License**: Other

## Summary

| Property | Value |
|---|---|
| File | `embeddings.parquet` |
| Embedding Model | BAAI/bge-m3 |
| Total Records (chunks) | 207,888 |
| Unique Issues (`source_file`) | 5,628 |
| Embedding Dimension (dense) | 1024 |
| Sparse Vectors | Variable-length (token id → weight) |
| Time Coverage | 1954-08-23 to 2026-02-09 |

## Schema (Columns)

| Column | Type | Description |
|---|---|---|
| `chunk_id` | string | Unique chunk identifier (UUID5 derived from source) |
| `doc_id` | string | Document/issue identifier |
| `chunk_index` | int64 | Chunk position within the source document |
| `corpus` | string | Corpus segment / decade family (`gum00`, `gum10`, …) |
| `decade` | string | Decade grouping derived from the document |
| `issue_date` | string | Issue date (`YYYY-MM-DD`); may be the literal `"unknown-date"` |
| `source_pdf` | string | Path to the original PDF |
| `source_file` | string | Path to the intermediate JSON file |
| `text` | string | Chunk text used for embedding |
| `embedding` | `fixed_size_list<float32>[1024]` | L2-normalized BGE-M3 dense vector |
| `sparse_indices` | `list<uint32>` | BGE-M3 lexical token IDs |
| `sparse_values` | `list<float32>` | BGE-M3 lexical token weights (parallel to `sparse_indices`) |

The sparse representation comes directly from `BGEM3FlagModel.encode(..., return_sparse=True)["lexical_weights"]`. Each chunk has a variable-length list of `(token_id, weight)` pairs, split into two parallel columns.

## Corpus Distribution

| Corpus | Chunks |
|---|---|
| gum10 | 60,285 |
| gum90 | 48,962 |
| gum80 | 48,687 |
| gum00 | 27,988 |
| gum70 | 13,989 |
| gum60 | 4,695 |
| gum50 | 3,282 |

## Coverage notes

- 623 chunks have `issue_date = "unknown-date"` (the date could not be parsed from the source). Filter these out for time-based analyses.
- All other rows have non-null values for every schema field.
- Chunks are produced from OCR'd historical documents, so minor noise may remain in `text`.
- All data for **2004** is missing due to upstream scraper issues.

## Usage

### Load with `datasets` / `pandas`

```python
from datasets import load_dataset

# Download the dataset and open its single "train" split.
ds = load_dataset("ferMorales/Gaceta_UNAM_BGE_M3_V2", split="train")

print(ds[0]["text"][:200])
print(len(ds[0]["embedding"]))        # 1024
print(len(ds[0]["sparse_indices"]))   # variable
```

```python
import pandas as pd

# Read the Parquet file straight from the Hub via the hf:// filesystem.
df = pd.read_parquet("hf://datasets/ferMorales/Gaceta_UNAM_BGE_M3_V2/embeddings.parquet")
```

### Hybrid retrieval (dense + sparse)

The sparse columns are stored as parallel `list<uint32>` / `list<float32>` arrays. To convert one row back into a `{token_id: weight}` dict:

```python
def row_to_sparse(row):
    # Pair each token ID with its weight from the parallel column.
    return dict(zip(row["sparse_indices"], row["sparse_values"]))
```

For Qdrant, you can ingest these directly into a collection that has both a dense `VectorParams` config (size 1024, cosine distance) and a named sparse vector configured with `SparseVectorParams`. Pass `indices=row["sparse_indices"]` and `values=row["sparse_values"]`.
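As a rough illustration of how the two representations can be combined without a vector database, here is a minimal pure-Python sketch. It is not the dataset author's scoring method: `hybrid_score`, `normalize_sparse_keys`, and the mixing weight `alpha` are hypothetical names chosen for this example, and the toy rows use 3-dimensional vectors instead of 1024. It assumes the dense vectors are L2-normalized (as the schema states), so a dot product equals cosine similarity, and it scores the sparse side as the sum of weight products over token IDs shared by query and chunk. Query-side keys are coerced to `int` on the assumption that an encoder may return token IDs as strings.

```python
def row_to_sparse(row):
    """Rebuild {token_id: weight} from the two parallel sparse columns."""
    return dict(zip(row["sparse_indices"], row["sparse_values"]))


def normalize_sparse_keys(weights):
    """Coerce token-ID keys to int (assumption: query keys may arrive as str)."""
    return {int(k): float(v) for k, v in weights.items()}


def dense_score(query_vec, doc_vec):
    """Dot product; equals cosine similarity for L2-normalized vectors."""
    return sum(float(q) * float(d) for q, d in zip(query_vec, doc_vec))


def sparse_score(query_weights, doc_weights):
    """Sum of weight products over token IDs present on both sides."""
    # Iterate over the smaller dict to keep the loop short.
    if len(query_weights) > len(doc_weights):
        query_weights, doc_weights = doc_weights, query_weights
    return sum(w * doc_weights[t] for t, w in query_weights.items() if t in doc_weights)


def hybrid_score(dense_q, sparse_q, row, alpha=0.5):
    """Weighted sum of dense and sparse scores; alpha is a hypothetical knob."""
    d = dense_score(dense_q, row["embedding"])
    s = sparse_score(normalize_sparse_keys(sparse_q), row_to_sparse(row))
    return alpha * d + (1.0 - alpha) * s


# Toy rows shaped like the dataset schema (3-dim vectors for readability).
rows = [
    {"chunk_id": "a", "embedding": [1.0, 0.0, 0.0],
     "sparse_indices": [10, 20], "sparse_values": [0.5, 0.25]},
    {"chunk_id": "b", "embedding": [0.0, 1.0, 0.0],
     "sparse_indices": [30], "sparse_values": [0.5]},
]
dense_q = [1.0, 0.0, 0.0]
sparse_q = {"10": 0.5, "30": 0.25}  # string keys, normalized inside hybrid_score

ranked = sorted(rows, key=lambda r: hybrid_score(dense_q, sparse_q, r), reverse=True)
print([r["chunk_id"] for r in ranked])
```

A linear scan like this only suits small candidate sets, for example re-scoring the top hits returned by an ANN index. The value of `alpha` would need tuning against your own queries.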
### Encoding queries with the same model

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
out = model.encode(["mi consulta"], return_dense=True, return_sparse=True)

dense_q = out["dense_vecs"][0]         # shape (1024,)
sparse_q = out["lexical_weights"][0]   # {token_id: weight}
```

## Recommended Use

1. **Hybrid semantic search**: combine dense cosine + sparse lexical scoring for improved recall on rare terms / proper nouns / OCR artifacts.
2. **Historical RAG**: retrieval with full source traceability via `source_pdf`, `source_file`, and `chunk_index`.
3. **Metadata filtering**: slice by `corpus`, `decade`, `issue_date`, or `doc_id` before re-ranking.

## Citation

If you use this dataset, please cite both the embedding model and this release:

```
@misc{gaceta_unam_bgem3_v2,
  author = {Fernando Morales},
  title = {Gaceta UNAM BGE-M3 V2 (dense + sparse)},
  year = {2026},
  url = {https://huggingface.co/datasets/ferMorales/Gaceta_UNAM_BGE_M3_V2},
}
```
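For the metadata filtering listed under Recommended Use, and for the advice in the coverage notes to drop `unknown-date` rows before time-based analyses, a small standard-library sketch may help. The helper names `parse_issue_date` and `filter_rows` are hypothetical, and the rows below are made-up examples that only mimic the `corpus` and `issue_date` columns.

```python
from datetime import date

UNKNOWN_DATE = "unknown-date"  # sentinel value documented in the schema


def parse_issue_date(value):
    """Return a date, or None for the sentinel or any unparseable string."""
    if value == UNKNOWN_DATE:
        return None
    try:
        return date.fromisoformat(value)
    except ValueError:
        return None


def filter_rows(rows, start=None, end=None, corpus=None):
    """Keep rows with a known issue_date in [start, end] and a matching corpus."""
    kept = []
    for row in rows:
        issued = parse_issue_date(row["issue_date"])
        if issued is None:
            continue  # drop rows whose date is unknown
        if start is not None and issued < start:
            continue
        if end is not None and issued > end:
            continue
        if corpus is not None and row["corpus"] != corpus:
            continue
        kept.append(row)
    return kept


# Made-up rows for illustration only.
rows = [
    {"chunk_id": "a", "corpus": "gum80", "issue_date": "1985-03-14"},
    {"chunk_id": "b", "corpus": "gum80", "issue_date": "unknown-date"},
    {"chunk_id": "c", "corpus": "gum90", "issue_date": "1992-07-01"},
]

eighties = filter_rows(rows, start=date(1980, 1, 1), end=date(1989, 12, 31))
print([r["chunk_id"] for r in eighties])
```

Applying a filter like this before scoring keeps the candidate set small, which matters when re-ranking is done in plain Python.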