ferMorales/Gaceta_UNAM_BGE_M3
Source: Hugging Face. Updated 2026-03-05; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/ferMorales/Gaceta_UNAM_BGE_M3
Resource description:
---
pretty_name: Gaceta UNAM Embeddings (Parquet)
language:
- es
license: other
task_categories:
- feature-extraction
- text-retrieval
tags:
- embeddings
- rag
- semantic-search
- ocr
- spanish
- unam
---
# Gaceta UNAM Embeddings (Parquet)
Dataset of semantic embeddings for text fragments (chunks) from Gaceta UNAM issues, in Parquet format, ready for vector indexing and RAG workflows.
Embeddings were generated with the BAAI/bge-m3 model.
## Summary
- File: embeddings.parquet
- Embedding model: BAAI/bge-m3
- Records (chunks): 170,424
- Unique documents (doc_id): 5,536
- Unique chunks (chunk_id): 170,424
- Embedding dimension: 1024
- Generated at UTC: 2026-02-27T09:24:57.375456+00:00
- Time coverage (issue_date, non-empty): from 1954-08-23 to 2026-02-09
- Empty issue_date: 481 records
## Schema (columns)
- doc_id (string): document/issue identifier.
- chunk_id (string): unique chunk identifier.
- chunk_index (int64): chunk position within the document.
- corpus (string): corpus segment/family (e.g., gum10, gum80).
- decade (string): time grouping derived from the document.
- issue_date (string): issue date (expected format YYYY-MM-DD; may be empty in a few cases).
- page_start (int64): chunk start page.
- page_end (int64): chunk end page.
- source_pdf (string): source PDF path.
- source_file (string): source JSONL file path.
- embedding (list<float64>): normalized semantic vector (length 1024).
- text (string): chunk text used for embedding.
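
As an illustration of the schema above, the following is a minimal sketch of loading the file and stacking the embedding column into a NumPy matrix. It assumes that pandas with a Parquet engine (for example pyarrow) is installed and that embeddings.parquet has been downloaded locally. The helper names `load_embeddings` and `embedding_matrix` are hypothetical, and the cast to float32 is an optional choice, not something the dataset requires.

```python
import numpy as np


def load_embeddings(path="embeddings.parquet"):
    """Read the Parquet file into a pandas DataFrame.

    pandas is imported lazily so the rest of this sketch only needs NumPy.
    The default path is an assumption about where the file was saved.
    """
    import pandas as pd

    return pd.read_parquet(path)


def embedding_matrix(embeddings, dim=1024):
    """Stack per-row embedding lists into a float32 matrix of shape (n, dim)."""
    matrix = np.asarray(list(embeddings), dtype=np.float32)
    if matrix.ndim != 2 or matrix.shape[1] != dim:
        raise ValueError(f"expected shape (n, {dim}), got {matrix.shape}")
    return matrix


# Typical use, once the file is available locally:
# df = load_embeddings("embeddings.parquet")
# matrix = embedding_matrix(df["embedding"])
```

With 170,424 rows of 1024 values each, the float32 matrix takes roughly 0.7 GB, about half of what the stored float64 values would need.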
## Quality and coverage
- All schema fields are present in every row (no null values).
- issue_date is an empty string in 481 records, so exclude empty strings before filtering by date.
- Observed page range:
  - page_start: 1 to 126
  - page_end: 1 to 128
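
The handling of empty issue_date values can be sketched as follows. This is one possible approach using only the standard library: `parse_issue_date` and `in_date_range` are hypothetical helper names, and treating malformed strings the same as empty ones is an assumption, since the card only states that YYYY-MM-DD is the expected format.

```python
from datetime import date


def parse_issue_date(value):
    """Return a date for a 'YYYY-MM-DD' string, or None if empty or malformed."""
    if not value:
        return None
    try:
        return date.fromisoformat(value)
    except ValueError:
        # Assumption: unparseable values are skipped rather than raised.
        return None


def in_date_range(value, start, end):
    """True if value parses to a date within [start, end], inclusive."""
    parsed = parse_issue_date(value)
    return parsed is not None and start <= parsed <= end
```

If the data is held in a pandas DataFrame `df`, a filter for the 1980s could then be written as `df[df["issue_date"].map(lambda v: in_date_range(v, date(1980, 1, 1), date(1989, 12, 31)))]`.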
## Corpus distribution (Decades)
- gum10: 46,317
- gum80: 41,576
- gum90: 41,088
- gum00: 21,283
- gum70: 12,655
- gum60: 4,438
- gum50: 3,067
## Recommended use
This parquet is intended for:
1. Semantic search by vector similarity (embedding).
2. Historical RAG with source traceability (source_pdf, source_file, pages).
3. Metadata filtering (issue_date, corpus, doc_id).
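
A minimal sketch of uses 1 and 3 combined: brute-force semantic search with an optional metadata mask. Because the stored vectors are normalized, the dot product with a normalized query vector equals cosine similarity. The function name `top_k` is hypothetical. The query vector is assumed to come from the same BAAI/bge-m3 model and to be L2-normalized; encoding the query is not shown here.

```python
import numpy as np


def top_k(query_vec, matrix, k=5, mask=None):
    """Return (indices, scores) of the k rows most similar to query_vec.

    Assumes query_vec and the rows of matrix are L2-normalized, so the
    dot product is the cosine similarity. mask is an optional boolean
    array that restricts the search to rows where it is True.
    """
    scores = matrix @ query_vec
    if mask is not None:
        # Excluded rows get -inf so they sort last.
        scores = np.where(mask, scores, -np.inf)
    order = np.argsort(-scores)[:k]
    # Drop masked rows in case fewer than k rows passed the filter.
    order = order[np.isfinite(scores[order])]
    return order, scores[order]
```

The mask can be built from any metadata column, for example `(df["corpus"] == "gum80").to_numpy()` for a DataFrame `df`. The returned indices can be used to look up text, source_pdf, page_start, and page_end for source traceability. A full sort is used for simplicity; for larger collections a dedicated vector index may be preferable.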
## Notes
- The text comes from OCR/historical document processing; minor noise may remain.
- Due to scraper issues, all data for the year 2004 was lost.
## Shoutout to BAAI
https://huggingface.co/BAAI/bge-m3
Provider:
ferMorales



