
ferMorales/Gaceta_UNAM_BGE_M3

Hugging Face · Updated 2026-03-05 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/ferMorales/Gaceta_UNAM_BGE_M3
Resource description:
---
pretty_name: Gaceta UNAM Embeddings (Parquet)
language:
- es
license: other
task_categories:
- feature-extraction
- text-retrieval
tags:
- embeddings
- rag
- semantic-search
- ocr
- spanish
- unam
---

# Gaceta UNAM Embeddings (Parquet)

Dataset of semantic embeddings for text fragments (chunks) from Gaceta UNAM issues, in Parquet format, ready for vector indexing and RAG workflows. Generated with the BAAI/BGE-M3 model.

## Summary

- File: embeddings.parquet
- Embedding model: BAAI/bge-m3
- Records (chunks): 170,424
- Unique documents (doc_id): 5,536
- Unique chunks (chunk_id): 170,424
- Embedding dimension: 1024
- Generated at UTC: 2026-02-27T09:24:57.375456+00:00
- Time coverage (issue_date, non-empty): from 1954-08-23 to 2026-02-09
- Empty issue_date: 481 records

## Schema (columns)

- doc_id (string): document/issue identifier.
- chunk_id (string): unique chunk identifier.
- chunk_index (int64): chunk position within the document.
- corpus (string): corpus segment/family (e.g., gum10, gum80).
- decade (string): time grouping derived from the document.
- issue_date (string): issue date (expected format YYYY-MM-DD; may be empty in a few cases).
- page_start (int64): chunk start page.
- page_end (int64): chunk end page.
- source_pdf (string): source PDF path.
- source_file (string): source JSONL file path.
- embedding (list<float64>): normalized semantic vector (length 1024).
- text (string): chunk text used for embedding.

## Quality and coverage

- All schema fields are present in every row (no null).
- issue_date has some empty values (481), so for temporal filtering it is recommended to exclude empty strings.
- Observed page range:
  - page_start: 1 to 126
  - page_end: 1 to 128

## Corpus distribution (Decades)

- gum10: 46,317
- gum80: 41,576
- gum90: 41,088
- gum00: 21,283
- gum70: 12,655
- gum60: 4,438
- gum50: 3,067

## Recommended use

This parquet is intended for:

1. Semantic search by vector similarity (embedding).
2. Historical RAG with source traceability (source_pdf, source_file, pages).
3. Metadata filtering (issue_date, corpus, doc_id).

## Notes

- The text comes from OCR/historical document processing; minor noise may remain.
- Due to scraper issues, all data for the year 2004 was lost.

## Shoutout to BAAI

https://huggingface.co/BAAI/bge-m3
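
The recommended uses above (temporal filtering that skips empty issue_date values, then similarity search over the embedding column) can be sketched as follows. This is a minimal illustration, not the dataset author's code. The helper names `load_records`, `filter_by_date`, `dot`, and `top_k` are hypothetical. The loader assumes pandas and pyarrow are installed, and the query vector is assumed to come from the same model (BAAI/bge-m3) and to be normalized like the stored vectors.

```python
import heapq
from typing import Iterable, Iterator, List, Mapping, Sequence, Tuple

Record = Mapping[str, object]


def load_records(path: str = "embeddings.parquet") -> List[dict]:
    """Read the parquet file into a list of row dicts.

    Assumption: pandas and pyarrow are installed. The import is lazy so the
    helpers below depend only on the standard library.
    """
    import pandas as pd

    return pd.read_parquet(path).to_dict("records")


def filter_by_date(records: Iterable[Record], start: str, end: str) -> Iterator[Record]:
    """Yield records with start <= issue_date <= end, skipping empty dates.

    ISO dates (YYYY-MM-DD) sort chronologically as plain strings, so a
    string comparison is enough.
    """
    for rec in records:
        date = rec["issue_date"]
        if date and start <= date <= end:
            yield rec


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(records: Iterable[Record], query: Sequence[float], k: int = 5) -> List[Tuple[float, Record]]:
    """Return the k (score, record) pairs with the highest dot product, best first.

    The card describes the stored vectors as normalized, so for a normalized
    query the dot product equals cosine similarity.
    """
    scored = ((dot(rec["embedding"], query), rec) for rec in records)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])
```

A call such as `top_k(filter_by_date(load_records(), "1980-01-01", "1989-12-31"), query_vec)` would restrict the search to the 1980s; each returned record still carries source_pdf, page_start, and page_end for traceability. The pure-Python loop is only meant to show the logic: for all 170,424 rows, stacking the embeddings into a single matrix or using a vector index would be much faster.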
Provider:
ferMorales