baconnier/deepsynth-fr
Hugging Face, updated 2025-11-02, indexed 2025-11-15
Download link:
https://hf-mirror.com/datasets/baconnier/deepsynth-fr
Overview:
This is a large-scale French news summarization dataset drawn from major French newspapers. It is designed for training multilingual DeepSeek-OCR models and handles Unicode characters and diacritics correctly. The dataset is part of the DeepSynth project, which uses visual text encoding for multilingual summarization: text documents are converted into images and processed by a visual encoder, achieving a 20x token compression ratio while preserving document layout and structure.
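The two technical points in the description, diacritics handling and the 20x compression ratio, can be illustrated with a minimal sketch. It does not reproduce the DeepSynth pipeline, whose preprocessing is not described here. The function names `normalize_french` and `visual_token_budget` are hypothetical, the use of NFC normalization is an assumption, and the ratio of 20 is taken from the description above.

```python
import math
import unicodedata


def normalize_french(text: str) -> str:
    """Return the NFC form, so 'e' + combining acute becomes a single 'é'.

    Assumption: NFC is one reasonable canonical form for accented French text;
    the dataset's actual normalization is not stated.
    """
    return unicodedata.normalize("NFC", text)


def visual_token_budget(text_tokens: int, ratio: float = 20.0) -> int:
    """Estimate the visual tokens needed at a given compression ratio (rounded up)."""
    return math.ceil(text_tokens / ratio)


# 'résumé' written with combining accents (decomposed form): 8 code points
decomposed = "re\u0301sume\u0301"
composed = normalize_french(decomposed)

print(len(decomposed), len(composed))  # 8 6
print(visual_token_budget(2000))       # 100
```

Under these assumptions, a 2000-token article would need about 100 visual tokens, and accented characters compare equal regardless of how the source text encoded them.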
Provider:
baconnier



