ferMorales/GACETA_UNAM_GTE_Qwen2
Hugging Face · updated 2026-04-06 · indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/ferMorales/GACETA_UNAM_GTE_Qwen2
Resource description:
---
pretty_name: Gaceta UNAM Embeddings — GTE-Qwen2
language:
- es
license: other
task_categories:
- feature-extraction
- text-retrieval
tags:
- embeddings
- rag
- semantic-search
- ocr
- spanish
- unam
- gte-qwen2
---
# Gaceta UNAM Embeddings — GTE-Qwen2
Dataset of semantic embeddings for text fragments from Gaceta UNAM.
Generated with the **Alibaba-NLP/gte-Qwen2** model.
## Summary
- **Embedding model:** Alibaba-NLP/gte-Qwen2
- **Records (chunks):** 207,888
- **Unique documents (source_file):** 5,628
- **Embedding dimension:** 1,536
- **Time coverage (issue_date):** 1954-08-23 to 2026-02-09
- **Invalid/empty issue_date:** 623 records
## Schema (columns)
| Column | Type | Description |
|---|---|---|
| `id` | string | Unique chunk identifier (UUID). |
| `vector` | fixed_size_list\<float32\>[1536] | Normalized semantic vector from GTE-Qwen2. |
| `payload` | string (JSON) | Full metadata and text for the chunk (see below). |
### Payload JSON fields
| Field | Type | Description |
|---|---|---|
| `doc_id` | string | Document/issue identifier. |
| `chunk_id` | string | Unique chunk identifier. |
| `chunk_index` | int | Chunk position within the document. |
| `corpus` | string | Corpus segment (e.g., gum10, gum80). |
| `decade` | string | Time grouping derived from the document. |
| `issue_date` | string | Issue date (YYYY-MM-DD; may be empty or invalid in some cases). |
| `page_start` | int/null | Chunk start page (null in this version). |
| `page_end` | int/null | Chunk end page (null in this version). |
| `source_pdf` | string | Source PDF path. |
| `source_file` | string | Source JSON file path. |
| `char_count` | int | Character count of the chunk text. |
| `total_chunks` | int | Total chunks in the parent document. |
| `section_hierarchy` | list\<string\> | Section headings for the chunk. |
| `metadata` | object | Source PDF metadata (title, date, page_count, extraction method, etc.). |
| `text` | string | Chunk text used for embedding generation. |
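
Because `payload` is stored as a JSON string, each row has to be decoded before its metadata can be used. The following is a minimal sketch of that step, using only the standard library. The example row is hypothetical: its values are made up, its vector is shortened to two floats, and only a few of the payload fields are shown.

```python
import json


def parse_record(record):
    """Expand the JSON `payload` string and merge it with `id` and `vector`."""
    payload = json.loads(record["payload"])
    # Payload fields first, so the top-level id and vector always win.
    return {**payload, "id": record["id"], "vector": record["vector"]}


# Hypothetical row shaped like the schema above (all values are invented).
example_row = {
    "id": "00000000-0000-0000-0000-000000000000",
    "vector": [0.6, 0.8],  # real vectors hold 1,536 float32 values
    "payload": json.dumps({
        "doc_id": "example-doc",
        "chunk_index": 0,
        "corpus": "gum80",
        "issue_date": "1985-03-11",
        "page_start": None,  # null in this version of the dataset
        "text": "Texto de ejemplo.",
    }),
}

parsed = parse_record(example_row)
print(parsed["corpus"], parsed["issue_date"], parsed["page_start"])
```

JSON `null` values such as `page_start` decode to Python `None`, so downstream code should not assume page numbers are integers.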
## Corpus distribution
| Corpus | Chunks |
|---|---|
| gum10 | 60,285 |
| gum90 | 48,962 |
| gum80 | 48,687 |
| gum00 | 27,988 |
| gum70 | 13,989 |
| gum60 | 4,695 |
| gum50 | 3,282 |
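
As a quick consistency check, the per-corpus counts in the table can be summed and compared with the record total given in the summary. The sketch below hard-codes the figures from this card; the variable names are illustrative only.

```python
# Chunk counts copied from the corpus distribution table.
corpus_chunks = {
    "gum10": 60_285,
    "gum90": 48_962,
    "gum80": 48_687,
    "gum00": 27_988,
    "gum70": 13_989,
    "gum60": 4_695,
    "gum50": 3_282,
}

expected_total = 207_888  # "Records (chunks)" from the summary
total = sum(corpus_chunks.values())
print(total == expected_total)  # True

# Share of each corpus segment, largest first.
for name, count in sorted(corpus_chunks.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {count / total:.1%}")
```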
## Recommended use
This Parquet file is intended for:
1. **Semantic search** by vector similarity (`vector`).
2. **Historical RAG** with source traceability (`source_pdf`, `source_file`).
3. **Metadata filtering** (`issue_date`, `corpus`, `decade`).
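
The three uses above can be combined in a single lookup. Since the stored vectors are described as normalized, a plain dot product gives the cosine similarity, provided the query vector comes from the same embedding model and is normalized too. The sketch below is a pure-Python illustration on hypothetical toy rows with 2-dimensional vectors and invented paths; a real deployment would more likely use a vector index or a numerical library.

```python
import json


def dot(a, b):
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))


def search(rows, query_vector, top_k=3, corpus=None):
    """Rank rows by similarity to the query, optionally filtered by corpus."""
    scored = []
    for row in rows:
        payload = json.loads(row["payload"])
        if corpus is not None and payload.get("corpus") != corpus:
            continue  # metadata filter
        scored.append((dot(row["vector"], query_vector), payload))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def make_row(doc_id, vector, corpus, source_pdf):
    """Build a hypothetical row in the dataset's column layout."""
    payload = {"doc_id": doc_id, "corpus": corpus, "source_pdf": source_pdf}
    return {"id": doc_id, "vector": vector, "payload": json.dumps(payload)}


# Toy data: unit-length 2-d vectors stand in for the 1,536-d embeddings.
toy_rows = [
    make_row("a", [1.0, 0.0], "gum80", "pdfs/a.pdf"),
    make_row("b", [0.6, 0.8], "gum10", "pdfs/b.pdf"),
    make_row("c", [0.0, 1.0], "gum10", "pdfs/c.pdf"),
]

results = search(toy_rows, [1.0, 0.0], top_k=2)
for score, payload in results:
    # source_pdf gives the traceability needed for RAG citations.
    print(round(score, 3), payload["doc_id"], payload["source_pdf"])
```

Passing `corpus="gum10"` restricts the ranking to that segment before scoring, which is the simplest form of metadata filtering.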
## Notes
- The text comes from OCR/historical document processing; minor noise may remain.
- Due to scraper issues, all data for the year 2004 was lost.
- `page_start` and `page_end` are null across all records in this version.
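
Since some `issue_date` values are empty or invalid, date-based filters are safer if they parse defensively. A minimal sketch, assuming invalid values should simply be treated as missing (the helper names are hypothetical):

```python
from datetime import date, datetime


def parse_issue_date(value):
    """Return a date for a valid YYYY-MM-DD string, otherwise None."""
    try:
        return datetime.strptime(value, "%Y-%m-%d").date()
    except (TypeError, ValueError):
        # Covers None, empty strings, and malformed or impossible dates.
        return None


def in_date_range(value, start, end):
    """True if the issue date parses and falls within [start, end]."""
    parsed = parse_issue_date(value)
    return parsed is not None and start <= parsed <= end


samples = ["1985-03-11", "", "2004-13-40", None]
print([parse_issue_date(s) is not None for s in samples])
print(in_date_range("1985-03-11", date(1980, 1, 1), date(1989, 12, 31)))
```

Records that fail to parse are excluded by `in_date_range` rather than raising, so a date filter never crashes on the affected rows.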
Provider:
ferMorales



