vietnamese_curated_dataset

Name: vietnamese_curated_dataset
Creator: maas
Published: 2025-11-12 16:18:11
License: No description available

ModelScope community · Updated 2025-11-12 · Indexed 2024-12-07

Download link:

https://modelscope.cn/datasets/AI-ModelScope/vietnamese_curated_dataset

Description:

### Dataset Description

Vietnamese Curated Text Dataset. This dataset is collected from multiple open Vietnamese datasets and curated with [NeMo Curator](https://github.com/NVIDIA/NeMo-Curator).

- **Developed by:** Viettel Solutions
- **Language:** Vietnamese

### Details

Please see our tech blog post on NVIDIA's blog page for details. [Link](https://developer.nvidia.com/blog/processing-high-quality-vietnamese-language-data-with-nvidia-nemo-curator/)

#### Data Collection

We combine several datasets that contain Vietnamese-language samples to build a robust and representative text corpus. These datasets include:

- The Vietnamese subset of the [C4 dataset](https://huggingface.co/datasets/allenai/c4/viewer/vi).
- The Vietnamese subset of the [OSCAR dataset, version 23.01](https://huggingface.co/datasets/oscar-corpus/OSCAR-2301/tree/main/vi_meta).
- [Wikipedia's Vietnamese articles](https://huggingface.co/datasets/wikimedia/wikipedia/viewer/20231101.vi).
- [Binhvq's Vietnamese news corpus](https://huggingface.co/datasets/jetaudio/binhvq_news).

#### Preprocessing

We use [NeMo Curator](https://github.com/NVIDIA/NeMo-Curator) to curate the collected data. The data curation pipeline includes these key steps:

1. Unicode Reformatting: Texts are standardized into a consistent Unicode format to avoid encoding issues.
2. Exact Deduplication: Exact duplicates are removed to reduce redundancy.
3. Quality Filtering:
   - Heuristic Filtering: Rule-based filters remove low-quality content.
   - Classifier-Based Filtering: A machine learning classifier scores documents and filters them by quality.
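The first stages of the pipeline above (Unicode reformatting, exact deduplication, heuristic filtering) can be illustrated with a minimal standard-library sketch. This is not NeMo Curator's API and not the authors' implementation: the function names, the choice of NFC normalization, the SHA-256 content hash, and the thresholds `min_words` and `max_symbol_ratio` are all assumptions made for illustration. Classifier-based filtering is omitted because it needs a trained model.

```python
import hashlib
import unicodedata


def normalize_unicode(text: str) -> str:
    """Put text into one canonical Unicode form (NFC is an assumption here)."""
    return unicodedata.normalize("NFC", text)


def exact_dedup(docs):
    """Keep the first occurrence of each document, compared by content hash."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def passes_heuristics(text: str, min_words: int = 5,
                      max_symbol_ratio: float = 0.3) -> bool:
    """Hypothetical rule-based filter; both thresholds are illustrative only."""
    words = text.split()
    if len(words) < min_words:
        return False
    non_space = [c for c in text if not c.isspace()]
    symbols = sum(1 for c in non_space if not c.isalnum())
    return symbols / len(non_space) <= max_symbol_ratio


def curate(docs):
    """Normalize first so that differently encoded copies hash identically."""
    normalized = [normalize_unicode(d) for d in docs]
    deduped = exact_dedup(normalized)
    return [d for d in deduped if passes_heuristics(d)]
```

Normalizing before hashing matters for Vietnamese in particular: the same accented text can be stored with precomposed or decomposed characters, and an exact-match hash would otherwise treat the two encodings as different documents.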
**[Notebook](https://github.com/NVIDIA/NeMo-Curator/blob/main/tutorials/pretraining-vietnamese-data-curation/pretraining-vietnamese-data-curation.ipynb)**

#### Dataset Statistics

**Content diversity**

<img src="https://cdn-uploads.huggingface.co/production/uploads/661766c00c68b375f3f0ccc3/mW6Pct3uyP_XDdGmE8EP3.png" alt="Domain proportion in curated dataset" width="500"/>

**Character based metrics**

<img src="https://cdn-uploads.huggingface.co/production/uploads/661766c00c68b375f3f0ccc3/W9TQjM2vcC7uXozyERHSQ.png" alt="Box plots of percentage of symbols, numbers, and whitespace characters compared to the total characters, word counts and average word lengths" width="900"/>

**Token count distribution**

<img src="https://cdn-uploads.huggingface.co/production/uploads/661766c00c68b375f3f0ccc3/PDelYpBI0DefSmQgFONgE.png" alt="Distribution of document sizes (in terms of token count)" width="500"/>

**Embedding visualization**

<img src="https://cdn-uploads.huggingface.co/production/uploads/661766c00c68b375f3f0ccc3/sfeoZWuQ7DcSpbmUOJ12r.png" alt="UMAP visualization of 5% of the dataset" width="650"/>

*UMAP visualization of 5% of the dataset*
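The character-based metrics named in the statistics above (percentage of symbols, numbers, and whitespace characters relative to total characters, plus word count and average word length) could be computed per document roughly as follows. The card does not say how each metric was defined, so the definitions here are assumptions: a "symbol" is any character that is neither alphanumeric nor whitespace, a "number" is any digit character, and words are split on whitespace. The function name `char_metrics` is hypothetical.

```python
def char_metrics(text: str) -> dict:
    """Per-document character statistics under the assumed definitions above."""
    total = len(text)
    words = text.split()
    if total == 0:
        # Empty documents get all-zero metrics instead of a division error.
        return {"symbol_pct": 0.0, "number_pct": 0.0, "whitespace_pct": 0.0,
                "word_count": 0, "avg_word_length": 0.0}
    symbols = sum(1 for c in text if not c.isalnum() and not c.isspace())
    digits = sum(1 for c in text if c.isdigit())
    spaces = sum(1 for c in text if c.isspace())
    avg_len = sum(len(w) for w in words) / len(words) if words else 0.0
    return {
        "symbol_pct": 100 * symbols / total,
        "number_pct": 100 * digits / total,
        "whitespace_pct": 100 * spaces / total,
        "word_count": len(words),
        "avg_word_length": avg_len,
    }
```

Because written Vietnamese separates syllables with spaces, whitespace splitting counts syllables rather than dictionary words; a word segmenter would be needed for true word-level statistics.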

Provider: maas

Created: 2024-11-26
