niwang66/mobile-actions-language-modeling
收藏Hugging Face2026-03-24 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/niwang66/mobile-actions-language-modeling
下载链接
链接失效反馈官方服务:
资源简介:
# Mobile Actions SFT Dataset
A converted version of the `google/mobile-actions` dataset for supervised fine-tuning (SFT) of Qwen models with tool calling capabilities.
## Dataset Description
This dataset is derived from the [google/mobile-actions](https://huggingface.co/datasets/google/mobile-actions) dataset, which contains human-AI conversations about performing actions on mobile devices. The original dataset has been converted to the Qwen chat template format for efficient training of Qwen models.
### Conversion Process
The conversion is performed using the `convert_mobile_actions.py` script, which:
1. Reads the original JSONL format from `/data/datasets/mobile-actions/dataset.jsonl`
2. Reformats the tools and messages to match Qwen's expected format
3. Applies the Qwen chat template using `transformers.AutoTokenizer.apply_chat_template()`
4. Saves the result as a Parquet file with `source` and `text` columns
### Dataset Structure
The dataset is stored in Parquet format with the following columns:
| Column | Type | Description |
|--------|------|-------------|
| `source` | string | Always `'mobile-actions'` to identify the data source |
| `text` | string | The complete conversation formatted using Qwen's chat template |
The data is organized in a single Parquet file:
```
data/train-00000-of-00001.parquet
```
### Data Format
Each sample in the original dataset contains:
- `tools`: List of available tools/functions with their JSON schemas
- `messages`: Conversation history with roles (`developer`, `user`, `assistant`)
After conversion, the `text` column contains the fully formatted conversation ready for language modeling training. The format follows Qwen's tool-calling template, which includes:
- System prompt with tool definitions
- User query
- Assistant response with tool calls (when applicable)
### Usage Example
```python
import pandas as pd
from transformers import AutoTokenizer
# Load the dataset
df = pd.read_parquet('/data/datasets/mobile-actions-sft/data/train-00000-of-00001.parquet')
# Sample text
sample_text = df.iloc[0]['text']
print(sample_text[:500]) # Print first 500 characters
# For training with TRL SFTTrainer
# The dataset is in standard language modeling format: {"text": "full_conversation"}
```
### Training with TRL
This dataset is in the **standard language modeling format** as defined in the TRL documentation:
- **Type**: Language modeling
- **Format**: Standard (plain text strings)
- **Expected columns**: `{"text": "The sky is blue."}`
It can be used directly with `SFTTrainer` for supervised fine-tuning of Qwen models.
### Original Dataset Information
- **Name**: google/mobile-actions
- **Description**: Human-AI conversations about performing actions on mobile devices
- **Size**: ~100k conversations
- **Tasks**: Tool calling, function calling, mobile assistant
- **License**: Apache 2.0
### Conversion Script
The conversion script is available at `convert_mobile_actions.py` in the project root. Key parameters:
```bash
python convert_mobile_actions.py \
--input_path /data/datasets/mobile-actions/dataset.jsonl \
--output_dir /data/datasets/mobile-actions-sft/data \
--model_path /data/models/Qwen2.5-0.5B-Instruct \
--max_samples 1000 # Optional: limit for testing
```
### Citation
If you use this dataset, please cite the original work:
```bibtex
@inproceedings{shah2024mobile,
title={Mobile-actions: A dataset for instruction-based mobile UI navigation},
author={Shah, Pratyush and Dhekane, Eshaan and Gholami, Saghar and Narayan, Apurva and Wang, Bing and Narayanan, Vijay},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={--},
year={2024}
}
```
### License
This converted dataset inherits the Apache 2.0 license from the original `google/mobile-actions` dataset.
### Contact
For questions about the conversion process, refer to the `convert_mobile_actions.py` script documentation.
提供机构:
niwang66



