ferMorales/GACETA_UNAM_GTE_Qwen2

Name: ferMorales/GACETA_UNAM_GTE_Qwen2
Creator: ferMorales
Published: 2026-04-06 05:51:31
License: No description available

Source: Hugging Face. Updated 2026-04-06; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/ferMorales/GACETA_UNAM_GTE_Qwen2

Description:

---
pretty_name: Gaceta UNAM Embeddings — GTE-Qwen2
language:
- es
license: other
task_categories:
- feature-extraction
- text-retrieval
tags:
- embeddings
- rag
- semantic-search
- ocr
- spanish
- unam
- gte-qwen2
---

# Gaceta UNAM Embeddings — GTE-Qwen2

Dataset of semantic embeddings for text fragments from Gaceta UNAM. Generated with the **Alibaba-NLP/gte-Qwen2** model.

## Summary

- **Embedding model:** Alibaba-NLP/gte-Qwen2
- **Records (chunks):** 207,888
- **Unique documents (source_file):** 5,628
- **Embedding dimension:** 1,536
- **Time coverage (issue_date):** 1954-08-23 to 2026-02-09
- **Invalid/empty issue_date:** 623 records

## Schema (columns)

| Column | Type | Description |
|---|---|---|
| `id` | string | Unique chunk identifier (UUID). |
| `vector` | fixed_size_list\<float32\>[1536] | Normalized semantic vector from GTE-Qwen2. |
| `payload` | string (JSON) | Full metadata and text for the chunk (see below). |

### Payload JSON fields

| Field | Type | Description |
|---|---|---|
| `doc_id` | string | Document/issue identifier. |
| `chunk_id` | string | Unique chunk identifier. |
| `chunk_index` | int | Chunk position within the document. |
| `corpus` | string | Corpus segment (e.g., gum10, gum80). |
| `decade` | string | Time grouping derived from the document. |
| `issue_date` | string | Issue date (YYYY-MM-DD; may be empty or invalid in some cases). |
| `page_start` | int/null | Chunk start page (null in this version). |
| `page_end` | int/null | Chunk end page (null in this version). |
| `source_pdf` | string | Source PDF path. |
| `source_file` | string | Source JSON file path. |
| `char_count` | int | Character count of the chunk text. |
| `total_chunks` | int | Total chunks in the parent document. |
| `section_hierarchy` | list\<string\> | Section headings for the chunk. |
| `metadata` | object | Source PDF metadata (title, date, page_count, extraction method, etc.). |
| `text` | string | Chunk text used for embedding generation. |

## Corpus distribution

| Corpus | Chunks |
|---|---|
| gum10 | 60,285 |
| gum90 | 48,962 |
| gum80 | 48,687 |
| gum00 | 27,988 |
| gum70 | 13,989 |
| gum60 | 4,695 |
| gum50 | 3,282 |

## Recommended use

This parquet is intended for:

1. **Semantic search** by vector similarity (`vector`).
2. **Historical RAG** with source traceability (`source_pdf`, `source_file`).
3. **Metadata filtering** (`issue_date`, `corpus`, `decade`).

## Notes

- The text comes from OCR/historical document processing; minor noise may remain.
- Due to scraper issues, all data for the year 2004 was lost.
- `page_start` and `page_end` are null across all records in this version.
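
The recommended uses above can be illustrated with a minimal sketch. It assumes rows have already been loaded from the parquet into plain dictionaries with the documented `id`, `vector`, and `payload` columns; the toy rows, their 3-dimensional vectors, the file names, and the helper names `parse_issue_date` and `search` are all hypothetical and not part of the dataset. Because the card describes the vectors as normalized, a dot product against a normalized query is used here as the cosine similarity. Records with an empty or invalid `issue_date` are skipped whenever a date filter is requested.

```python
import json
import math
from datetime import date

# Toy rows in the documented layout: id, vector, payload (JSON string).
# Real vectors have 1,536 dimensions; ids, paths, and text here are made up.
ROWS = [
    {"id": "chunk-a", "vector": [1.0, 0.0, 0.0],
     "payload": json.dumps({"corpus": "gum80", "issue_date": "1985-03-11",
                            "source_pdf": "example_1985.pdf", "text": "ejemplo uno"})},
    {"id": "chunk-b", "vector": [0.6, 0.8, 0.0],
     "payload": json.dumps({"corpus": "gum10", "issue_date": "2014-09-01",
                            "source_pdf": "example_2014.pdf", "text": "ejemplo dos"})},
    {"id": "chunk-c", "vector": [0.0, 1.0, 0.0],
     "payload": json.dumps({"corpus": "gum10", "issue_date": "",
                            "source_pdf": "example_undated.pdf", "text": "ejemplo tres"})},
]


def parse_issue_date(value):
    """Return a date for a valid YYYY-MM-DD string, otherwise None."""
    try:
        return date.fromisoformat(value)
    except (TypeError, ValueError):
        return None


def search(query_vec, rows, top_k=2, corpus=None, date_from=None):
    """Rank rows by cosine similarity, optionally filtering on payload metadata."""
    # Normalize the query so the dot product with unit vectors is the cosine.
    norm = math.sqrt(sum(x * x for x in query_vec))
    query = [x / norm for x in query_vec]

    hits = []
    for row in rows:
        meta = json.loads(row["payload"])  # payload is stored as a JSON string
        if corpus is not None and meta.get("corpus") != corpus:
            continue
        if date_from is not None:
            issued = parse_issue_date(meta.get("issue_date"))
            if issued is None or issued < date_from:
                continue  # drop undated/invalid records when filtering by date
        score = sum(q * v for q, v in zip(query, row["vector"]))
        hits.append((score, row["id"], meta))

    hits.sort(key=lambda hit: hit[0], reverse=True)
    return hits[:top_k]


# Usage: restrict the search to one corpus and keep the source path for traceability.
for score, row_id, meta in search([1.0, 0.0, 0.0], ROWS, corpus="gum10"):
    print(f"{score:.2f}", row_id, meta["source_pdf"])
```

At the full size of 207,888 records, a pure-Python loop like this would be slow; a vectorized matrix product or a vector index would be the more practical choice, with the same filtering logic applied to the decoded payload.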

Provider:

ferMorales
