Word Vector Training Dataset

Name: Word Vector Training Dataset
Creator: Alibaba Cloud Tianchi (阿里云天池)
Published: 2026-06-07 01:50:38
License: No description available

Alibaba Cloud Tianchi, updated 2026-06-07, indexed 2024-10-12

Download link:

https://tianchi.aliyun.com/dataset/187348

Resource description:

1. Corpus source (Chinese Wikipedia): https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2
2. Chinese word segmentation: segment the corpus with an open-source segmentation tool (for example, Harbin Institute of Technology's LTP, Jieba, or Stanza).
3. Based on the given corpus, implement both the Skip-Gram and CBOW Word2Vec algorithms by hand, without calling an existing library, and produce a vector representation for every segmented word.
4. Select 10 words. For each one, use cosine similarity to compute and output its most similar words, together with their word vectors.
5. Select words of different categories (for example, fruits, tasks, animals) and visualize their word vectors in two dimensions to judge the quality of the learned embeddings.
6. Explore analogy experiments: for example, compute v(王子) - v(男) + v(女), that is v(prince) - v(male) + v(female), and check whether the most similar word vector is v(公主), that is v(princess).
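Step 3 above can be sketched in plain Python as follows. This is a minimal illustration rather than a reference solution: the function names (`build_vocab`, `training_pairs`, `train`), the hyperparameter defaults, and the toy sentences are all assumptions made for this example, and negative samples are drawn uniformly for brevity, whereas the original Word2Vec draws them from a smoothed unigram distribution. The only difference between the two modes is how training pairs are formed: Skip-Gram predicts each context word from the center word, while CBOW predicts the center word from the averaged context.

```python
import math
import random

def build_vocab(sentences):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for sent in sentences:
        for w in sent:
            vocab.setdefault(w, len(vocab))
    return vocab

def training_pairs(ids, window, mode):
    """Yield (input_ids, target_id) pairs for one sentence of token ids.

    skipgram: input is the center word, target is one context word.
    cbow:     input is all context words, target is the center word.
    """
    for i, center in enumerate(ids):
        lo, hi = max(0, i - window), min(len(ids), i + window + 1)
        context = [ids[j] for j in range(lo, hi) if j != i]
        if not context:
            continue
        if mode == "skipgram":
            for c in context:
                yield [center], c
        else:
            yield context, center

def sigmoid(x):
    # Clamp the argument so math.exp cannot overflow.
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, x))))

def train(sentences, mode="skipgram", dim=16, window=2,
          negatives=3, lr=0.05, epochs=50, seed=0):
    """Train with negative sampling; returns (vocab, input_vectors)."""
    rng = random.Random(seed)
    vocab = build_vocab(sentences)
    V = len(vocab)
    if V < 2:
        raise ValueError("need at least two distinct words")
    w_in = [[(rng.random() - 0.5) / dim for _ in range(dim)] for _ in range(V)]
    w_out = [[0.0] * dim for _ in range(V)]
    encoded = [[vocab[w] for w in s] for s in sentences]
    for _ in range(epochs):
        for ids in encoded:
            for inputs, target in training_pairs(ids, window, mode):
                # Hidden vector: the center vector (Skip-Gram) or the
                # mean of the context vectors (CBOW).
                h = [sum(w_in[i][k] for i in inputs) / len(inputs)
                     for k in range(dim)]
                grad_h = [0.0] * dim
                # One positive sample plus uniformly drawn negatives
                # (a simplification, see the note above).
                samples = [(target, 1.0)]
                while len(samples) < negatives + 1:
                    neg = rng.randrange(V)
                    if neg != target:
                        samples.append((neg, 0.0))
                for out_id, label in samples:
                    score = sigmoid(sum(h[k] * w_out[out_id][k] for k in range(dim)))
                    g = lr * (label - score)
                    for k in range(dim):
                        grad_h[k] += g * w_out[out_id][k]
                        w_out[out_id][k] += g * h[k]
                # Pass the gradient back to every input word.
                for i in inputs:
                    for k in range(dim):
                        w_in[i][k] += grad_h[k] / len(inputs)
    return vocab, w_in

# Made-up, pre-segmented toy sentences (not taken from the dataset).
toy_sentences = [["苹果", "是", "水果"], ["香蕉", "是", "水果"],
                 ["猫", "是", "动物"], ["狗", "是", "动物"]]
vocab, vectors = train(toy_sentences, mode="cbow", epochs=20)
```

On the real corpus, `toy_sentences` would be replaced by the output of the segmentation step, and a vectorized implementation would likely be needed for acceptable training time.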

Provider:

Alibaba Cloud Tianchi

Created:

2024-10-05

Collected summary

Dataset introduction

Background and challenges

Background overview

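This is a word vector training dataset built on a Chinese Wikipedia corpus. It is mainly intended for implementing the Skip-Gram and CBOW algorithms by hand, then producing and evaluating word vector representations. The dataset contains the segmented corpus and describes specific methods for word vector similarity computation, visualization, and analogy experiments.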
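The similarity and analogy evaluations mentioned here might look like the sketch below, again in plain Python. The helper names (`cosine`, `most_similar`, `analogy`) are hypothetical, and the two-dimensional vectors are hand-made for illustration only; they are not trained on the dataset and say nothing about what real embeddings would produce.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def most_similar(query_vec, embeddings, exclude=(), topn=3):
    """Rank words in `embeddings` (word -> vector) by cosine similarity."""
    scored = [(w, cosine(query_vec, vec))
              for w, vec in embeddings.items() if w not in exclude]
    return sorted(scored, key=lambda p: p[1], reverse=True)[:topn]

def analogy(a, b, c, embeddings, topn=1):
    """Words closest to v(a) - v(b) + v(c), excluding the query words."""
    target = [x - y + z for x, y, z in
              zip(embeddings[a], embeddings[b], embeddings[c])]
    return most_similar(target, embeddings, exclude={a, b, c}, topn=topn)

# Hand-made 2-D vectors, for illustration only (not trained):
# axis 0 loosely stands for "royalty", axis 1 for "gender".
toy = {
    "王子": [1.0, 1.0],    # prince
    "公主": [1.0, -1.0],   # princess
    "男": [0.0, 1.0],      # male
    "女": [0.0, -1.0],     # female
    "苹果": [-1.0, 0.1],   # apple
}

neighbours = most_similar(toy["王子"], toy, exclude={"王子"}, topn=2)
result = analogy("王子", "男", "女", toy)
```

Excluding the three query words from the candidates is a common convention in analogy tests, since v(a) - v(b) + v(c) often stays closest to one of its own inputs.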

The content above was collected and summarized by 遇见数据集.