vietnamese_curated_dataset
Source: ModelScope community. Updated 2025-11-12; indexed 2024-12-07.
Download link:
https://modelscope.cn/datasets/AI-ModelScope/vietnamese_curated_dataset
### Dataset Description
Vietnamese Curated Text Dataset. This dataset was collected from multiple open Vietnamese datasets and curated with [NeMo Curator](https://github.com/NVIDIA/NeMo-Curator).
- **Developed by:** Viettel Solutions
- **Language:** Vietnamese
### Details
For details, please see our tech blog post on NVIDIA's blog page: [Link](https://developer.nvidia.com/blog/processing-high-quality-vietnamese-language-data-with-nvidia-nemo-curator/)
#### Data Collection
We combine several datasets that contain samples in the Vietnamese language to build a robust and representative text corpus. These datasets include:
- The Vietnamese subset of the [C4 dataset](https://huggingface.co/datasets/allenai/c4/viewer/vi).
- The Vietnamese subset of the [OSCAR dataset, version 23.01](https://huggingface.co/datasets/oscar-corpus/OSCAR-2301/tree/main/vi_meta).
- [Wikipedia's Vietnamese articles](https://huggingface.co/datasets/wikimedia/wikipedia/viewer/20231101.vi).
- [Binhvq's Vietnamese news corpus](https://huggingface.co/datasets/jetaudio/binhvq_news).
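
Combining sources like those listed above usually means reducing each one to a shared schema and recording where every sample came from. The following is a minimal sketch of that idea, not the authors' code: the helper names (`tag_source`, `combine_sources`), the `text` field name, and the toy records are all assumptions made for illustration.

```python
from itertools import chain


def tag_source(records, source, text_field="text"):
    """Yield records reduced to a common schema with a provenance tag.

    `text_field` is an assumed field name; adjust it per source.
    """
    for rec in records:
        text = rec.get(text_field, "")
        if text:  # skip records with missing or empty text
            yield {"text": text, "source": source}


def combine_sources(sources):
    """Chain several sources (name -> iterable of dict records) into one stream."""
    return chain.from_iterable(
        tag_source(recs, name) for name, recs in sources.items()
    )


# Toy stand-ins for the real datasets (hypothetical records).
toy_sources = {
    "c4_vi": [{"text": "Xin chào", "url": "https://example.com"}],
    "wikipedia_vi": [{"text": "Hà Nội", "title": "Hà Nội"}, {"text": ""}],
}
combined = list(combine_sources(toy_sources))
print(len(combined))
```

Keeping a `source` field makes it possible to report per-source statistics later and to trace a filtered document back to its origin.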
#### Preprocessing
We use [NeMo Curator](https://github.com/NVIDIA/NeMo-Curator) to curate the collected data. The data curation pipeline includes these key steps:
1. Unicode Reformatting: Texts are standardized into a consistent Unicode format to avoid encoding issues.
2. Exact Deduplication: Removes exact duplicates to reduce redundancy.
3. Quality Filtering:
   - Heuristic Filtering: Applies rule-based filters to remove low-quality content.
   - Classifier-Based Filtering: Uses machine learning to classify and filter documents based on quality.
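
The first two steps can be illustrated with the Python standard library alone. This is a minimal sketch and not NeMo Curator's implementation: it assumes NFC as the target Unicode form and uses SHA-256 hashes of the normalized text to detect exact duplicates. The function names are hypothetical.

```python
import hashlib
import unicodedata


def normalize_text(text: str) -> str:
    """Normalize to NFC (an assumed target form) and trim surrounding whitespace."""
    return unicodedata.normalize("NFC", text).strip()


def exact_dedup(docs):
    """Keep the first occurrence of each document, comparing hashes of normalized text."""
    seen = set()
    unique = []
    for doc in docs:
        text = normalize_text(doc)
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


docs = [
    unicodedata.normalize("NFC", "Tiếng Việt"),  # precomposed diacritics
    unicodedata.normalize("NFD", "Tiếng Việt"),  # same text, decomposed diacritics
    "Hà Nội là thủ đô của Việt Nam.",
]
print(len(exact_dedup(docs)))  # the two encodings of "Tiếng Việt" collapse into one
```

Normalizing before hashing matters for Vietnamese in particular, because the same visible string can be encoded with precomposed or combining diacritics and would otherwise hash differently.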
**[Notebook](https://github.com/NVIDIA/NeMo-Curator/blob/main/tutorials/pretraining-vietnamese-data-curation/pretraining-vietnamese-data-curation.ipynb)**
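
The quality-filtering stage can be sketched in the same spirit. The source does not list its rules or name its classifier, so everything below is an assumption: the three heuristics, their thresholds, and `toy_score`, which merely stands in for a trained quality model.

```python
import unicodedata

# Hypothetical thresholds for illustration only.
MIN_WORDS = 5
MAX_SYMBOL_RATIO = 0.3
MAX_REPEATED_LINE_RATIO = 0.5


def word_count_ok(text):
    """Reject very short documents."""
    return len(text.split()) >= MIN_WORDS


def symbol_ratio_ok(text):
    """Reject documents dominated by non-alphanumeric, non-space characters."""
    text = unicodedata.normalize("NFC", text)
    if not text:
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= MAX_SYMBOL_RATIO


def repeated_lines_ok(text):
    """Reject documents where most non-empty lines repeat earlier lines."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return False
    repeated = len(lines) - len(set(lines))
    return repeated / len(lines) <= MAX_REPEATED_LINE_RATIO


HEURISTICS = [word_count_ok, symbol_ratio_ok, repeated_lines_ok]


def heuristic_filter(docs):
    """Keep documents that pass every rule."""
    return [d for d in docs if all(rule(d) for rule in HEURISTICS)]


def classifier_filter(docs, score_fn, threshold=0.5):
    """Keep documents whose score from score_fn (any quality model) meets the threshold."""
    return [d for d in docs if score_fn(d) >= threshold]


def toy_score(text):
    """Stand-in scorer: more distinct words means a higher score. Not a real classifier."""
    return min(1.0, len(set(text.split())) / 10)


docs = [
    "Hà Nội là thủ đô của nước Việt Nam.",
    "$$$ !!! ###",
    "Mua ngay\nMua ngay\nMua ngay\nMua ngay",
]
kept = classifier_filter(heuristic_filter(docs), toy_score)
print(kept)
```

Running the cheap rule-based filters first reduces the number of documents that the more expensive model has to score.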
#### Dataset Statistics
**Content diversity**
<img src="https://cdn-uploads.huggingface.co/production/uploads/661766c00c68b375f3f0ccc3/mW6Pct3uyP_XDdGmE8EP3.png" alt="Domain proportion in curated dataset" width="500"/>
**Character based metrics**
<img src="https://cdn-uploads.huggingface.co/production/uploads/661766c00c68b375f3f0ccc3/W9TQjM2vcC7uXozyERHSQ.png" alt="Box plots of percentage of symbols, numbers, and whitespace characters compared to the total characters, word counts and average word lengths" width="900"/>
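
Metrics of this kind can be computed per document with a few lines of Python. This is a sketch under stated assumptions rather than the code behind the figure: a "symbol" is taken to be any character that is neither alphanumeric nor whitespace, words are split on whitespace, and the function name and dictionary keys are hypothetical.

```python
import unicodedata


def char_metrics(text):
    """Return character-level percentages and word statistics for one document."""
    text = unicodedata.normalize("NFC", text)
    total = len(text)
    words = text.split()
    if total == 0 or not words:
        return None  # nothing to measure
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    digits = sum(1 for ch in text if ch.isdigit())
    spaces = sum(1 for ch in text if ch.isspace())
    return {
        "symbol_pct": 100 * symbols / total,
        "number_pct": 100 * digits / total,
        "whitespace_pct": 100 * spaces / total,
        "word_count": len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words),
    }


metrics = char_metrics("abc 123 !?")
print(metrics)
```

Collecting these dictionaries over a sample of documents gives the per-metric distributions that a box plot summarizes.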
**Token count distribution**
<img src="https://cdn-uploads.huggingface.co/production/uploads/661766c00c68b375f3f0ccc3/PDelYpBI0DefSmQgFONgE.png" alt="Distribution of document sizes (in terms of token count)" width="500"/>
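
A distribution like this one can be approximated by bucketing per-document token counts. The source does not say which tokenizer was used, so the sketch below substitutes whitespace splitting as a rough stand-in. The function name and `bin_size` parameter are hypothetical.

```python
from collections import Counter


def token_count_histogram(docs, bin_size=100):
    """Map each bin's lower bound to the number of documents whose token count falls in it.

    Tokens are approximated by whitespace splitting (an assumption, not the real tokenizer).
    """
    counts = Counter()
    for doc in docs:
        n_tokens = len(doc.split())
        counts[(n_tokens // bin_size) * bin_size] += 1
    return dict(sorted(counts.items()))


docs = ["a b c", "a b c d e", " ".join(["x"] * 12)]
hist = token_count_histogram(docs, bin_size=5)
print(hist)
```

Swapping in a subword tokenizer only requires replacing `len(doc.split())` with that tokenizer's token count.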
**Embedding visualization**
<img src="https://cdn-uploads.huggingface.co/production/uploads/661766c00c68b375f3f0ccc3/sfeoZWuQ7DcSpbmUOJ12r.png" alt="UMAP visualization of 5% of the dataset" width="650"/>
*UMAP visualization of 5% of the dataset*
Provided by: maas
Created: 2024-11-26



