
SLTrans

Source: arXiv (indexed 2025-09-30)
Download link:
https://huggingface.co/datasets/ukplab/sltrans
Description:
SLTrans contains approximately 4 million self-contained source code files, each paired with its corresponding intermediate representation (IR), across 12 programming languages, for a total of 26.2 billion tokens. The dataset includes both size-optimized and performance-optimized IR, and samples have been filtered by sequence length so that they remain usable for language modeling. Its associated tasks cover multilingual code generation and code understanding.
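The description mentions length filtering but does not say how it was done. As an illustration only, here is a minimal sketch of what such a filter could look like. The names `count_tokens` and `filter_by_length`, the whitespace-based token count, and the `max_tokens` threshold are all assumptions made for this example; they are not taken from the SLTrans pipeline.

```python
from typing import Iterable, Iterator, Tuple

# A (source code, IR text) pair. This layout is a hypothetical stand-in,
# not the dataset's actual schema.
Pair = Tuple[str, str]


def count_tokens(text: str) -> int:
    """Approximate the token count by splitting on whitespace.

    Assumption: a real pipeline would use a model tokenizer instead.
    """
    return len(text.split())


def filter_by_length(pairs: Iterable[Pair], max_tokens: int = 4096) -> Iterator[Pair]:
    """Yield only the pairs whose combined source + IR length fits the budget.

    `max_tokens` is an illustrative threshold, not a value from the dataset.
    """
    for source, ir in pairs:
        if count_tokens(source) + count_tokens(ir) <= max_tokens:
            yield source, ir
```

Calling `list(filter_by_length(pairs, max_tokens=2048))` on any iterable of `(source, ir)` string pairs keeps only the pairs that fit the chosen budget.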