
t4-full

ModelScope community | Updated 2025-12-05 | Indexed 2025-11-03
Download link:
https://modelscope.cn/datasets/mlfoundations/t4-full
Description:
The Tremendous TabLib Trawl (T4) is a dataset for training tabular foundation models. The dataset is described in detail in our paper, ["Large Scale Transfer Learning for Tabular Data via Language Modeling."](https://arxiv.org/abs/2406.12031) The paper also includes a datasheet for this dataset.

T4 consists of a set of Parquet files (described below). For examples and infrastructure showing how to train a language model on T4, see our open-source Python library, [rtfm](https://github.com/mlfoundations/rtfm), which was used to train TabuLa-8B on T4.

# Files and Directory Structure

The T4 dataset contains approximately 3.1M tables. Each table is a separate Parquet file, named according to the `content_hash` of the dataset in TabLib.

The dataset is stored in "chunk" subdirectories, which represent batches of tables from the preprocessing phase. Each chunk directory (e.g. `chunk-0000`) is stored as a single .zip file; unzip these files to access the underlying Parquet files.

The dataset occupies a total of 219GB compressed (1.34TB uncompressed) on disk.

# License and Acceptable Use

We release this dataset under the same license as the original corpus from which it was derived, TabLib. **By using this dataset, you are acknowledging that you have permission to access the TabLib dataset, and you agree to abide by the terms of use and license of TabLib.**

TabLib can be accessed on [HF Datasets](https://huggingface.co/datasets/approximatelabs/tablib-v1-full), and you can read more about TabLib in the associated [paper](https://arxiv.org/abs/2310.07875) and [blog post](https://www.approximatelabs.com/blog/tablib). We claim no affiliation with the original creators of TabLib, and this dataset release is not associated with Approximate Labs (but we are grateful to the original TabLib authors for their contributions to the research community and for releasing TabLib).
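As a rough illustration of the layout described under Files and Directory Structure (one .zip per chunk, one Parquet file per table), the sketch below unzips a single chunk archive and lists the Parquet files it contains. The helper names `extract_chunk` and `content_hash_of` are hypothetical, and two details are assumptions rather than facts from the dataset card: that the table files carry a `.parquet` extension, and that the file name minus that extension is the `content_hash`. The internal folder layout of each archive is not assumed, so the search is recursive.

```python
import zipfile
from pathlib import Path


def extract_chunk(zip_path, out_dir):
    """Unzip one chunk archive and return the Parquet files found inside it.

    Hypothetical helper: the archive's internal layout is not documented
    here, so the extracted tree is searched recursively.
    """
    zip_path, out_dir = Path(zip_path), Path(out_dir)
    # One extraction directory per archive, e.g. out_dir/chunk-0000
    target = out_dir / zip_path.stem
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(target)
    # Assumption: table files use the ".parquet" extension.
    return sorted(target.rglob("*.parquet"))


def content_hash_of(parquet_path):
    """Return the file name without its extension.

    Assumption: this stem is the table's `content_hash` in TabLib.
    """
    return Path(parquet_path).stem
```

Each returned path can then be loaded with a Parquet reader, for example `pandas.read_parquet(path)` (which needs `pyarrow` or `fastparquet` installed). Since the full dataset is about 1.34TB uncompressed, extracting and processing one chunk at a time is likely easier on disk space than unzipping everything up front.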

Provider:
maas
Created:
2025-10-04