garbage_competition

Name: garbage_competition
Creator: maas
Published: 2026-04-23 10:16:41
License: no description provided

ModelScope community. Updated: 2026-04-23. Indexed: 2025-02-22.

Download link:

https://modelscope.cn/datasets/swift/garbage_competition

Resource description:

## Dataset

Dataset URL: https://modelscope.cn/datasets/swift/garbage_competition

Download the dataset:

```python
from modelscope import MsDataset

dataset = MsDataset.load('swift/garbage_competition', split='train')
test_dataset = MsDataset.load('swift/garbage_competition', split='test')
print(dataset)
print(test_dataset)
"""
Dataset({
    features: ['images', 'label', 'label_name'],
    num_rows: 100000
})
Dataset({
    features: ['images', 'label', 'label_name'],
    num_rows: 2650
})
"""
print(dataset[0])
print(test_dataset[0])
"""
{'images': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=750x643 at 0x7F03D82937C0>, 'label': 169, 'label_name': '可回收物-电话'}
{'images': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=720x540 at 0x7F03D81FCD00>, 'label': None, 'label_name': None}
"""
```

The training split has 100,000 labeled images. The test split has 2,650 images whose `label` and `label_name` are `None`. In the sample record above, the label name `可回收物-电话` means "recyclable waste: telephone".

## Baseline

This baseline uses the ms-swift large-model training framework to fine-tune Qwen2.5-VL-3B-Instruct on this dataset with LoRA.

- ms-swift GitHub: https://github.com/modelscope/ms-swift
- The baseline needs 14 GB of GPU memory and can run on the free A10 compute available on ModelScope.
- The baseline reaches an accuracy of 0.8528301886792453.

Environment setup:

```shell
pip install ms-swift -U
```

Single-GPU training:

```python
# GPU Memory: 14GB
import os
from typing import Dict, Any

os.environ['CUDA_VISIBLE_DEVICES'] = '0'
os.environ['MAX_PIXELS'] = str(1280 * 28 * 28)

from swift.llm import (
    TrainArguments, sft_main, register_dataset, DatasetMeta,
    ResponsePreprocessor, SubsetDataset
)


class CustomPreprocessor(ResponsePreprocessor):
    def preprocess(self, row: Dict[str, Any]) -> Dict[str, Any]:
        # Use the same fixed prompt for every image.
        row['query'] = 'Task: Sorting Waste.'
        return super().preprocess(row)


register_dataset(
    DatasetMeta(
        ms_dataset_id='swift/garbage_competition',
        preprocess_func=CustomPreprocessor(),
        subsets=[SubsetDataset('train', split=['train']),
                 SubsetDataset('test', split=['test'])]
    ))

if __name__ == '__main__':
    sft_main(TrainArguments(
        model='Qwen/Qwen2.5-VL-3B-Instruct',
        dataset=['swift/garbage_competition:train#20000'],  # to save time, sample only 20,000 examples
        train_type='lora',
        torch_dtype='bfloat16',
        num_train_epochs=1,
        per_device_train_batch_size=1,
        per_device_eval_batch_size=1,
        learning_rate=1e-4,
        lora_rank=8,
        lora_alpha=32,
        target_modules=['all-linear'],
        freeze_vit=True,
        gradient_accumulation_steps=16,
        eval_steps=50,
        save_steps=50,
        save_total_limit=2,
        logging_steps=5,
        max_length=2048,
        output_dir='output',
        warmup_ratio=0.05,
        dataset_num_proc=4,
        dataloader_num_workers=4,
        num_labels=265,
        task_type='seq_cls',
        use_chat_template=False
    ))
```

## Submitting results

We provide an inference script. Submit the jsonl file that it writes to the `infer_result` directory. The competition page only accepts files with a `.json` extension, so rename the file to `result.json` without changing its contents.

```python
# GPU Memory: 14GB
import os
from typing import Dict, Any

os.environ['CUDA_VISIBLE_DEVICES'] = '0'
os.environ['MAX_PIXELS'] = str(1280 * 28 * 28)

from swift.llm import (
    InferArguments, infer_main, register_dataset, DatasetMeta,
    ResponsePreprocessor, SubsetDataset
)


class CustomPreprocessor(ResponsePreprocessor):
    def preprocess(self, row: Dict[str, Any]) -> Dict[str, Any]:
        # Must match the prompt used during training.
        row['query'] = 'Task: Sorting Waste.'
        return super().preprocess(row)


register_dataset(
    DatasetMeta(
        ms_dataset_id='swift/garbage_competition',
        preprocess_func=CustomPreprocessor(),
        subsets=[SubsetDataset('train', split=['train']),
                 SubsetDataset('test', split=['test'])]
    ))

ckpt_dir = 'output/vx-xxx/checkpoint-xxx'  # last_checkpoint
result = infer_main(InferArguments(
    adapters=[ckpt_dir],
    temperature=0,
    val_dataset="swift/garbage_competition:test",
    infer_backend='pt'))
# The results are saved to `{ckpt_dir}/infer_result/xxx-xxx.jsonl`; submit that file.
# (The competition page only accepts a .json extension, so rename it to
# `result.json` without changing its contents.)
```

The submitted jsonl file has the format below. Rows must be in the same order as `test_dataset`, 2,650 rows in total.

```
{"response": "25"}
{"response": "xxx"}
{"response": "xxx"}
```

Scoring script:

```python
from swift.utils import read_from_jsonl
from datasets import load_dataset

# Ground-truth labels for the test split.
labels = load_dataset('parquet', data_files='test_with_labels.parquet', split='train')['label']
# Predictions, one JSON object per row, in test-set order.
results = read_from_jsonl('result.jsonl')
count = 0
for res, label in zip(results, labels):
    if int(res['response']) == label:
        count += 1
print(f'acc: {count / len(results)}')
```
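Because the scoring script calls `int(res['response'])` on every row and pairs rows with labels by position, a malformed or short file will either fail or be scored against the wrong labels. The sketch below is one way to check a submission locally before uploading. It is not part of the official tooling: the helper names `check_submission_lines` and `check_submission_file` are hypothetical, and the accepted label range of 0 to 264 is an assumption inferred from `num_labels=265` in the baseline.

```python
import json
from typing import List

NUM_TEST_ROWS = 2650  # size of the test split stated in the dataset card
NUM_LABELS = 265      # num_labels from the baseline; range 0..264 is an assumption


def check_submission_lines(lines: List[str],
                           expected_rows: int = NUM_TEST_ROWS,
                           num_labels: int = NUM_LABELS) -> List[str]:
    """Return a list of problems found; an empty list means the rows look valid."""
    problems = []
    rows = [ln for ln in lines if ln.strip()]  # ignore blank lines
    if len(rows) != expected_rows:
        problems.append(f'expected {expected_rows} rows, found {len(rows)}')
    for idx, raw in enumerate(rows):
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            problems.append(f'row {idx}: not valid JSON')
            continue
        if not isinstance(record, dict) or 'response' not in record:
            problems.append(f'row {idx}: missing "response" field')
            continue
        try:
            # The scoring script applies int() to this field, so do the same here.
            label = int(record['response'])
        except (TypeError, ValueError):
            problems.append(f'row {idx}: response is not an integer')
            continue
        if not 0 <= label < num_labels:
            problems.append(f'row {idx}: label {label} outside 0..{num_labels - 1}')
    return problems


def check_submission_file(path: str) -> List[str]:
    """Read a jsonl submission from disk and validate every row."""
    with open(path, encoding='utf-8') as f:
        return check_submission_lines(f.read().splitlines())
```

Calling `check_submission_file('result.json')` on the renamed file returns an empty list when every row parses and the row count matches. Extra fields in each row are ignored, since only `response` is read by the scoring script.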


Provider:

maas

Created:

2025-02-15
