vlsp-2023-vllm/lambada_vi

Name: vlsp-2023-vllm/lambada_vi
Creator: vlsp-2023-vllm
Published: 2023-11-19 08:53:04
License: 暂无描述

Hugging Face2023-11-19 更新2024-03-04 收录

下载链接：

https://hf-mirror.com/datasets/vlsp-2023-vllm/lambada_vi

下载链接

链接失效反馈

官方服务：

资源简介：

--- dataset_info: features: - name: text dtype: string - name: context dtype: string - name: target_word dtype: string - name: metadata struct: - name: num_sents dtype: int64 - name: target_word struct: - name: appeared_in_prev_sents dtype: bool - name: pos_tag dtype: string - name: title dtype: string - name: url dtype: string - name: word_type dtype: string splits: - name: test num_bytes: 18460415.77200859 num_examples: 10000 - name: validation num_bytes: 454126.2279914113 num_examples: 246 download_size: 10704436 dataset_size: 18914542.0 configs: - config_name: default data_files: - split: test path: data/test-* - split: validation path: data/validation-* --- # Lambada (Vietnamese) ## Install To install `lm-eval` from the github repository main branch, run: ```bash git clone https://github.com/hieunguyen1053/lm-evaluation-harness cd lm-evaluation-harness pip install -e . ``` ## Basic Usage > **Note**: When reporting results from eval harness, please include the task versions (shown in `results["versions"]`) for reproducibility. This allows bug fixes to tasks while also ensuring that previously reported scores are reproducible. See the [Task Versioning](#task-versioning) section for more info. ### Hugging Face `transformers` To evaluate a model hosted on the [HuggingFace Hub](https://huggingface.co/models) (e.g. vlsp-2023-vllm/hoa-1b4) on `lambada_vi` you can use the following command: ```bash python main.py \ --model hf-causal \ --model_args pretrained=vlsp-2023-vllm/hoa-1b4 \ --tasks lambada_vi \ --device cuda:0 ``` Additional arguments can be provided to the model constructor using the `--model_args` flag. Most notably, this supports the common practice of using the `revisions` feature on the Hub to store partially trained checkpoints, or to specify the datatype for running a model: ```bash python main.py \ --model hf-causal \ --model_args pretrained=vlsp-2023-vllm/hoa-1b4,revision=step100000,dtype="float" \ --tasks lambada_vi \ --device cuda:0 ``` To evaluate models that are loaded via `AutoSeq2SeqLM` in Huggingface, you instead use `hf-seq2seq`. *To evaluate (causal) models across multiple GPUs, use `--model hf-causal-experimental`* > **Warning**: Choosing the wrong model may result in erroneous outputs despite not erroring.

The dataset includes four main features: text, context, target_word, and metadata. Metadata is a structured feature containing multiple sub-features such as num_sents, detailed information about target_word (appeared_in_prev_sents, pos_tag), title, url, and word_type. The dataset is divided into two parts: test (containing 10000 samples) and validation (containing 246 samples). The total size of the dataset is 18914542.0 bytes, with a download size of 10704436 bytes.

提供机构：

vlsp-2023-vllm

原始信息汇总

数据集概述

数据集信息

特征

text: 数据类型为字符串。
context: 数据类型为字符串。
target_word: 数据类型为字符串。
metadata: 结构化数据，包含以下字段：
- num_sents: 数据类型为整数64位。
- target_word: 结构化数据，包含以下字段：
  - appeared_in_prev_sents: 数据类型为布尔值。
  - pos_tag: 数据类型为字符串。
- title: 数据类型为字符串。
- url: 数据类型为字符串。
- word_type: 数据类型为字符串。

数据分割

test: 包含10000个样本，总字节数为18460415.77200859。
validation: 包含246个样本，总字节数为454126.2279914113。

数据集大小

下载大小: 10704436字节。
数据集大小: 18914542.0字节。

配置

default: 包含以下数据文件：
- test: 路径为data/test-*。
- validation: 路径为data/validation-*。

搜集汇总

数据集介绍

构建方式

LAMBADA（LAnguage Model Benchmark for A Discourse-level Anaphora）数据集最初为英语设计，旨在评估语言模型在篇章级语境下预测目标词的能力。vlsp-2023-vllm/lambada_vi 是其越南语版本，构建过程中严格遵循原版范式，从越南语自然语料中提取包含完整篇章结构的文本片段。每个样本由连续上下文（context）和位于末尾的目标词（target_word）构成，并附带丰富的元数据，包括句子数量、目标词在先前句子中是否出现、词性标注、标题、来源URL及词汇类型。数据集划分为测试集（10,000条）和验证集（246条），确保评估的统计稳健性。

特点

该数据集的核心特点在于其篇章级理解评估定位，要求模型不仅捕捉局部语义，还需利用跨句线索进行词汇预测。元数据设计精细，记录了目标词在上下文中是否出现过及其词性，为分析模型预测行为提供了多维度视角。数据来源涵盖多样化的越南语文本，确保了语言场景的广泛性。此外，验证集与测试集规模差异显著，验证集较小适合快速调试，而测试集规模可观，能够支持可靠的性能比较。

使用方法

数据集集成于 lm-evaluation-harness 框架，用户可通过命令行便捷调用。以 Hugging Face 模型为例，使用 --model hf-causal 参数指定模型类型，通过 --model_args 传入预训练模型标识（如 vlsp-2023-vllm/hoa-1b4）及可选的检查点版本与数据类型。任务名称直接指定为 lambada_vi，配合 --device cuda:0 即可在 GPU 上运行评估。对于序列到序列模型，可切换至 hf-seq2seq 模式，多 GPU 场景则使用 hf-causal-experimental 后端，确保评估流程灵活适配不同模型架构。

背景与挑战

背景概述

在自然语言处理领域，语言模型对长距离依赖关系的捕捉能力是衡量其语义理解水平的关键指标。LAMBADA数据集最初由Google DeepMind于2016年提出，旨在测试模型在广泛上下文语境中预测目标词汇的能力，尤其关注篇章级语义连贯性。vlsp-2023-vllm/lambada_vi数据集由越南语言与语音处理研究团队（VLSP）于2023年构建，专注于越南语场景，核心研究问题在于评估预训练语言模型在越南语长篇文本中的词汇预测表现。该数据集包含10,000条测试样本和246条验证样本，每条样本由完整文本段落、上下文片段及目标词汇构成，并附带词性标注等元数据。作为VLSP 2023评测任务的重要组成部分，lambada_vi推动了越南语语言模型在复杂语义推理任务上的标准化评估，为低资源语言的自然语言处理研究提供了关键基准。

当前挑战

该数据集面临的核心挑战包括：其一，在领域问题层面，越南语作为黏着语，其词形变化和句法结构复杂，模型需在缺乏显式词边界的情况下从长文本中准确推断目标词，这对捕获跨句语义依赖提出了较高要求；其二，在构建过程中，数据来源于网络爬取，需人工筛选以确保文本质量，同时需处理越南语特有的多义词和歧义现象，标注一致性难以完全保障；此外，数据集规模相对较小，测试样本仅10,000条，可能导致模型评估的统计显著性不足，且验证集仅246条，对模型调优的指导作用有限。这些挑战共同制约了越南语语言模型在篇章理解任务上的鲁棒性提升。

常用场景

经典使用场景

LAMBADA（越南语版）数据集的核心设计在于评估语言模型对长距离语义依赖关系的捕捉能力，其经典使用场景是作为阅读理解与篇章级语言理解的基准测试。该数据集要求模型在阅读一段长文本后，准确预测最后一个缺失的单词，这一任务不仅考验模型对局部语法的掌握，更深入考察其对整体语境、叙事逻辑及隐含信息的推理能力。在自然语言处理领域，这一范式被广泛用于衡量从传统统计语言模型到现代预训练语言模型的演进效果，尤其适用于检验模型在复杂篇章结构中的语义连贯性建模水平。

衍生相关工作

LAMBADA（越南语版）数据集的发布催生了一系列相关研究工作，最为突出的是将其纳入越南语语言模型预训练与微调的标准评测套件。例如，VLSP 2023竞赛中的多个参赛团队基于该数据集开发了针对越南语的因果语言模型评估流程，并衍生出跨语言迁移学习的对比分析工作。此外，研究者借鉴其任务范式，构建了越南语版的篇章级完形填空与指代消解数据集，进一步拓展了长文本理解评估的维度。这些衍生工作共同推动了越南语自然语言处理从词级评测向篇章级理解的范式跃迁。

数据集最近研究