billingsmoore/NLLB-bo-en
Source: Hugging Face. Updated 2025-02-04. Indexed 2025-02-15.
Download link:
https://hf-mirror.com/datasets/billingsmoore/NLLB-bo-en
Description:
This dataset is a collection of Tibetan-English sentence pairs derived from the NLLB dataset. It contains approximately 450GB of bilingual text pairs, mined with the stopes library and LASER3 encoders. The data was filtered using language identification, emoji-based filtering, and, for some high-resource languages, model-based filtering. It originates from OPUS and has been reformatted for use with Hugging Face. The topic labels were generated with easy_text_clustering.
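To make the emoji-based filtering step mentioned above more concrete, here is a minimal sketch of how such a filter could work on sentence pairs. It is not the filter used in the NLLB pipeline: the code-point ranges, the names `EMOJI_PATTERN`, `has_emoji`, and `filter_pairs`, and the sample pairs are all assumptions made for illustration.

```python
import re

# Assumption: a rough set of emoji code-point ranges chosen for illustration,
# not the list used to build NLLB.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\U0001F1E6-\U0001F1FF"  # regional indicator (flag) letters
    "]"
)


def has_emoji(text: str) -> bool:
    """Return True if the text contains a character in the assumed emoji ranges."""
    return EMOJI_PATTERN.search(text) is not None


def filter_pairs(pairs):
    """Keep only (source, target) pairs where neither side contains an emoji."""
    return [
        (src, tgt)
        for src, tgt in pairs
        if not has_emoji(src) and not has_emoji(tgt)
    ]


# Hypothetical sample data: (Tibetan, English) pairs.
sample_pairs = [
    ("བཀྲ་ཤིས་བདེ་ལེགས།", "Hello."),
    ("ཐུགས་རྗེ་ཆེ།", "Thank you 😀"),
]
clean_pairs = filter_pairs(sample_pairs)
```

Dropping a pair when either side contains an emoji is one simple policy; a real pipeline might instead strip the emoji or apply the check to one side only.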
Provider:
billingsmoore



