mjbommar/ogbert-v1-mlm
Source: Hugging Face. Updated: 2025-12-13. Indexed: 2025-12-20.
Download link:
https://hf-mirror.com/datasets/mjbommar/ogbert-v1-mlm
Description:
This dataset contains rendered dictionary entries from OpenGloss, intended for masked language model (MLM) pretraining. It includes 721,977 entries, split into 707,537 training entries and 14,440 evaluation entries. Each entry contains formatted dictionary text, a reading difficulty level (e.g., "elementary", "advanced"), and a domain tag (e.g., "law", "medicine", "general"). The text of an entry is structured into parts such as definitions, synonyms, antonyms, an etymology summary, and an encyclopedia entry. Long texts are meant to be handled with chunking/striding during MLM pretraining. The dataset also suggests uses for the metadata: stratified sampling, curriculum learning, domain-specific analysis, and quality filtering.
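The chunking/striding mentioned above can be sketched as follows. This is a minimal illustration, not the dataset author's preprocessing: the function name `chunk_with_stride` and the parameters `max_len` and `stride` are hypothetical, and the input is assumed to be a list of token IDs that has already been produced by some tokenizer.

```python
from typing import List


def chunk_with_stride(
    token_ids: List[int], max_len: int = 512, stride: int = 128
) -> List[List[int]]:
    """Split a token sequence into overlapping windows.

    Each window holds at most `max_len` tokens, and consecutive windows
    share `stride` tokens. Both defaults are illustrative assumptions.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    if not token_ids:
        return []

    step = max_len - stride  # how far the window start advances each time
    chunks: List[List[int]] = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + max_len])
        # Stop once the current window reaches the end of the sequence.
        if start + max_len >= len(token_ids):
            break
        start += step
    return chunks


# Toy input: ten "token IDs", windows of 4 tokens overlapping by 2.
example_chunks = chunk_with_stride(list(range(10)), max_len=4, stride=2)
for chunk in example_chunks:
    print(chunk)
```

The overlap means a token near a window boundary still appears with context on both sides in at least one window, at the cost of some tokens being seen more than once per epoch.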
Provider:
mjbommar



