aguvis-stage1-mixture
Source: ModelScope community. Updated 2025-12-11; indexed 2025-12-06.
Download link:
https://modelscope.cn/datasets/scallioncake/aguvis-stage1-mixture
Description:
# aguvis-stage1-mixture
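This dataset is derived from `smolagents/aguvis-stage-1` on Hugging Face. The source configurations are loaded and merged in the order given by `dataset_mixture_phase_1` below, and the merged dataset is then shuffled with random seed 42. The result is split into Parquet shards with a target size of about 12 GB per shard (internally, each Parquet file defaults to about 450 MB, and files are grouped automatically by file count into `shard_xxx` directories), yielding the subsets `shard_000` through `shard_018`.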
```python
dataset_mixture_phase_1 = [
{"id": "smolagents/aguvis-stage-1", "config": "guienv", "split": "train", "columns": ["images", "texts"]},
{"id": "smolagents/aguvis-stage-1", "config": "omniact", "split": "train", "columns": ["images", "texts"]},
{"id": "smolagents/aguvis-stage-1", "config": "ricoig16k", "split": "train", "columns": ["images", "texts"]},
{"id": "smolagents/aguvis-stage-1", "config": "ricosca", "split": "train", "columns": ["images", "texts"]},
{"id": "smolagents/aguvis-stage-1", "config": "seeclick", "split": "train", "columns": ["images", "texts"]},
{"id": "smolagents/aguvis-stage-1", "config": "ui_refexp", "split": "train", "columns": ["images", "texts"]},
{"id": "smolagents/aguvis-stage-1", "config": "webui350k", "split": "train", "columns": ["images", "texts"]},
{"id": "smolagents/aguvis-stage-1", "config": "widget_captioning", "split": "train", "columns": ["images", "texts"]},
]
```
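The merge, shuffle, and shard-grouping steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the script actually used to build this dataset: the helper names (`files_per_shard`, `group_into_shards`, `build_mixture`) are hypothetical, the sizes are read as binary units (GiB/MiB), and `build_mixture` assumes that every configuration exposes compatible `images` and `texts` features so that the Hugging Face `datasets` library can concatenate them.

```python
# Sketch only: helper names are hypothetical, not taken from the dataset's build script.

TARGET_SHARD_BYTES = 12 * 1024**3  # ~12 GB per shard_xxx directory (assumed binary units)
TARGET_FILE_BYTES = 450 * 1024**2  # ~450 MB per Parquet file (assumed binary units)


def files_per_shard(shard_bytes=TARGET_SHARD_BYTES, file_bytes=TARGET_FILE_BYTES):
    """Number of whole Parquet files that fit in one shard directory (at least 1)."""
    return max(1, shard_bytes // file_bytes)


def group_into_shards(num_files, per_shard):
    """Map each shard directory name to the indices of the Parquet files it holds."""
    return {
        f"shard_{shard_idx:03d}": list(range(start, min(start + per_shard, num_files)))
        for shard_idx, start in enumerate(range(0, num_files, per_shard))
    }


def build_mixture(mixture, seed=42):
    """Load each entry in order, keep the listed columns, concatenate, then shuffle.

    `mixture` is a list shaped like `dataset_mixture_phase_1`.
    Requires the `datasets` package and network access; imported lazily so the
    pure helpers above stay usable without it.
    """
    from datasets import concatenate_datasets, load_dataset

    parts = [
        load_dataset(entry["id"], entry["config"], split=entry["split"]).select_columns(
            entry["columns"]
        )
        for entry in mixture
    ]
    return concatenate_datasets(parts).shuffle(seed=seed)
```

Under these assumptions, roughly 27 files of 450 MB fit into one 12 GB shard; with decimal units the figure would be 26, so the exact grouping used for this dataset may differ.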
#### Usage example
```python
from modelscope.msdatasets import MsDataset

# Load the entire dataset
ds = MsDataset.load('scallioncake/aguvis-stage1-mixture')

# Load a specific shard
ds = MsDataset.load('scallioncake/aguvis-stage1-mixture', subset_name='shard_000')
ds = MsDataset.load('scallioncake/aguvis-stage1-mixture', subset_name='shard_001')
# ... any shard_xxx from shard_000 to shard_018 can be loaded the same way
```
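To process the shards one at a time rather than loading everything at once, the subset names can be generated instead of typed out. The sketch below is illustrative: `shard_names` and `iter_shards` are hypothetical helpers, and the shard count of 19 is taken from the `shard_000` to `shard_018` range stated above.

```python
NUM_SHARDS = 19  # shard_000 .. shard_018, per the description above


def shard_names(num_shards=NUM_SHARDS):
    """Return the subset names 'shard_000', 'shard_001', ... in order."""
    return [f"shard_{i:03d}" for i in range(num_shards)]


def iter_shards(dataset_id="scallioncake/aguvis-stage1-mixture", names=None):
    """Yield (subset_name, dataset) pairs, loading one shard at a time.

    Requires the `modelscope` package and network access; imported lazily so
    that `shard_names` can be used on its own.
    """
    from modelscope.msdatasets import MsDataset

    for name in (names if names is not None else shard_names()):
        yield name, MsDataset.load(dataset_id, subset_name=name)


# Example: load only the first two shards.
# for name, ds in iter_shards(names=shard_names()[:2]):
#     print(name)
```

Loading shard by shard keeps only about one shard's worth of data in play at a time, which may matter given the roughly 12 GB target size per shard.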
Provider:
maas
Created:
2025-12-01



