
aguvis-stage1-mixture

ModelScope community · Updated 2025-12-11 · Indexed 2025-12-06
Download link:
https://modelscope.cn/datasets/scallioncake/aguvis-stage1-mixture
Description:
# aguvis-stage1-mixture

This dataset is derived from `smolagents/aguvis-stage-1` on HuggingFace. The configs are loaded and merged in the order given by `dataset_mixture_phase_1` below, and the merged dataset is then shuffled with random seed 42. After shuffling, the data is split into Parquet shards with a target size of about 12 GB per shard. Individual Parquet files default to about 450 MB each and are grouped automatically, by file count, into `shard_xxx` directories, yielding the subsets `shard_000` through `shard_018`.

```python
dataset_mixture_phase_1 = [
    {"id": "smolagents/aguvis-stage-1", "config": "guienv", "split": "train", "columns": ["images", "texts"]},
    {"id": "smolagents/aguvis-stage-1", "config": "omniact", "split": "train", "columns": ["images", "texts"]},
    {"id": "smolagents/aguvis-stage-1", "config": "ricoig16k", "split": "train", "columns": ["images", "texts"]},
    {"id": "smolagents/aguvis-stage-1", "config": "ricosca", "split": "train", "columns": ["images", "texts"]},
    {"id": "smolagents/aguvis-stage-1", "config": "seeclick", "split": "train", "columns": ["images", "texts"]},
    {"id": "smolagents/aguvis-stage-1", "config": "ui_refexp", "split": "train", "columns": ["images", "texts"]},
    {"id": "smolagents/aguvis-stage-1", "config": "webui350k", "split": "train", "columns": ["images", "texts"]},
    {"id": "smolagents/aguvis-stage-1", "config": "widget_captioning", "split": "train", "columns": ["images", "texts"]},
]
```

#### Usage example

```python
from modelscope.msdatasets import MsDataset

# Load the entire dataset
ds = MsDataset.load('scallioncake/aguvis-stage1-mixture')

# Load a specific shard
ds = MsDataset.load('scallioncake/aguvis-stage1-mixture', subset_name='shard_000')
ds = MsDataset.load('scallioncake/aguvis-stage1-mixture', subset_name='shard_001')
# ... any shard_xxx can be loaded (shard_000 through shard_018)
```
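The sharding step described above (Parquet files of about 450 MB each, grouped toward a target of about 12 GB per `shard_xxx` directory) can be illustrated with a small sketch. The script actually used to build this dataset is not published here, so the greedy packing rule, the function name `group_files_into_shards`, and the `part-*.parquet` file names below are all assumptions made for illustration only.

```python
from typing import Dict, List, Sequence, Tuple

GB = 1024 ** 3
MB = 1024 ** 2


def group_files_into_shards(
    files: Sequence[Tuple[str, int]],
    target_shard_bytes: int = 12 * GB,
) -> Dict[str, List[str]]:
    """Greedily pack (file_name, size_in_bytes) pairs, in order, into shards.

    Hypothetical helper: it only mirrors the description in the text
    (target size per shard, directories named shard_000, shard_001, ...).
    """
    shards: Dict[str, List[str]] = {}
    current: List[str] = []
    current_bytes = 0
    for name, size in files:
        # Close the current shard if adding this file would exceed the target.
        if current and current_bytes + size > target_shard_bytes:
            shards[f"shard_{len(shards):03d}"] = current
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        # Flush the last, possibly smaller, shard.
        shards[f"shard_{len(shards):03d}"] = current
    return shards


# Hypothetical input: 60 Parquet files of 450 MB each.
example_files = [(f"part-{i:05d}.parquet", 450 * MB) for i in range(60)]
layout = group_files_into_shards(example_files)
for shard_name, members in layout.items():
    print(shard_name, len(members))
```

Under these assumptions, a 12 GB target holds at most 27 files of 450 MB, so the 60 example files fall into three shards of 27, 27, and 6 files. The real shard boundaries in this dataset may differ, since the true file sizes and grouping rule are not specified beyond the approximate figures above.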

Provider:
maas
Created:
2025-12-01