vietnamese_curated_dataset

ModelScope community · Updated 2025-11-12 · Indexed 2024-12-07
Download link:
https://modelscope.cn/datasets/AI-ModelScope/vietnamese_curated_dataset
Resource overview:
### Dataset Description

Vietnamese Curated Text Dataset. This dataset is collected from multiple open Vietnamese datasets and curated with [NeMo Curator](https://github.com/NVIDIA/NeMo-Curator).

- **Developed by:** Viettel Solutions
- **Language:** Vietnamese

### Details

Please visit our Tech Blog post on NVIDIA's blog page for details. [Link](https://developer.nvidia.com/blog/processing-high-quality-vietnamese-language-data-with-nvidia-nemo-curator/)

#### Data Collection

We combine several datasets that contain Vietnamese-language samples to build a robust and representative text corpus. These datasets include:

- The Vietnamese subset of the [C4 dataset](https://huggingface.co/datasets/allenai/c4/viewer/vi).
- The Vietnamese subset of the [OSCAR dataset, version 23.01](https://huggingface.co/datasets/oscar-corpus/OSCAR-2301/tree/main/vi_meta).
- [Wikipedia's Vietnamese articles](https://huggingface.co/datasets/wikimedia/wikipedia/viewer/20231101.vi).
- [Binhvq's Vietnamese news corpus](https://huggingface.co/datasets/jetaudio/binhvq_news).

#### Preprocessing

We use [NeMo Curator](https://github.com/NVIDIA/NeMo-Curator) to curate the collected data. The data curation pipeline includes these key steps:

1. Unicode Reformatting: Texts are standardized into a consistent Unicode format to avoid encoding issues.
2. Exact Deduplication: Removes exact duplicates to reduce redundancy.
3. Quality Filtering:
   - Heuristic Filtering: Applies rule-based filters to remove low-quality content.
   - Classifier-Based Filtering: Uses machine learning to classify and filter documents based on quality.
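The first stages of the pipeline above can be illustrated with a minimal, standard-library sketch. This is not the NeMo Curator API and not the authors' implementation: the function names, the NFC normalization form, the MD5 content hash, and the thresholds `min_words` and `max_symbol_ratio` are all assumptions chosen for illustration. Classifier-based filtering is omitted because it requires a trained model.

```python
import hashlib
import unicodedata


def normalize_unicode(text: str) -> str:
    """Normalize to NFC (assumed form) so equivalent Vietnamese strings compare equal."""
    return unicodedata.normalize("NFC", text)


def exact_dedup(docs):
    """Keep the first occurrence of each document, compared by content hash."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.md5(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def passes_heuristics(text: str, min_words: int = 5, max_symbol_ratio: float = 0.3) -> bool:
    """Hypothetical rule-based filter: reject very short or symbol-heavy documents."""
    words = text.split()
    if len(words) < min_words:
        return False
    # A "symbol" here is any character that is neither alphanumeric nor whitespace.
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= max_symbol_ratio


def curate(docs):
    """Run normalization, exact deduplication, then heuristic filtering, in that order."""
    normalized = [normalize_unicode(d) for d in docs]
    deduped = exact_dedup(normalized)
    return [d for d in deduped if passes_heuristics(d)]
```

Normalizing before deduplication matters for Vietnamese: the same diacritic-bearing word can be stored in precomposed or decomposed form, and the two byte sequences would otherwise hash differently and survive as "distinct" documents.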
**[Notebook](https://github.com/NVIDIA/NeMo-Curator/blob/main/tutorials/pretraining-vietnamese-data-curation/pretraining-vietnamese-data-curation.ipynb)**

#### Dataset Statistics

**Content diversity**

<img src="https://cdn-uploads.huggingface.co/production/uploads/661766c00c68b375f3f0ccc3/mW6Pct3uyP_XDdGmE8EP3.png" alt="Domain proportion in curated dataset" width="500"/>

**Character based metrics**

<img src="https://cdn-uploads.huggingface.co/production/uploads/661766c00c68b375f3f0ccc3/W9TQjM2vcC7uXozyERHSQ.png" alt="Box plots of percentage of symbols, numbers, and whitespace characters compared to the total characters, word counts and average word lengths" width="900"/>

**Token count distribution**

<img src="https://cdn-uploads.huggingface.co/production/uploads/661766c00c68b375f3f0ccc3/PDelYpBI0DefSmQgFONgE.png" alt="Distribution of document sizes (in terms of token count)" width="500"/>

**Embedding visualization**

<img src="https://cdn-uploads.huggingface.co/production/uploads/661766c00c68b375f3f0ccc3/sfeoZWuQ7DcSpbmUOJ12r.png" alt="UMAP visualization of 5% of the dataset" width="650"/>

*UMAP visualization of 5% of the dataset*
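For readers who want to reproduce the character-based metrics on their own documents, the sketch below computes the quantities named in that figure's description (symbol, number, and whitespace percentages, word count, and average word length). The exact definitions used by the dataset authors are not stated, so the choices here are assumptions: a "symbol" is any character that is neither alphanumeric nor whitespace, and words are split on whitespace. The function name `char_metrics` is hypothetical.

```python
def char_metrics(text: str) -> dict:
    """Per-document character statistics (definitions are illustrative assumptions)."""
    total = len(text)
    words = text.split()  # assumption: words are whitespace-delimited
    if total == 0:
        return {
            "symbol_pct": 0.0,
            "number_pct": 0.0,
            "whitespace_pct": 0.0,
            "word_count": 0,
            "avg_word_length": 0.0,
        }
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    digits = sum(1 for ch in text if ch.isdigit())
    spaces = sum(1 for ch in text if ch.isspace())
    return {
        "symbol_pct": 100 * symbols / total,
        "number_pct": 100 * digits / total,
        "whitespace_pct": 100 * spaces / total,
        "word_count": len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
    }
```

Collecting these dictionaries over a sample of documents gives the per-metric distributions that a box plot summarizes.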

Provider:
maas
Created:
2024-11-26