
niwang66/mobile-actions-language-modeling

Source: Hugging Face | Updated: 2026-03-24 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/niwang66/mobile-actions-language-modeling
Resource description:
# Mobile Actions SFT Dataset

A converted version of the `google/mobile-actions` dataset for supervised fine-tuning (SFT) of Qwen models with tool calling capabilities.

## Dataset Description

This dataset is derived from the [google/mobile-actions](https://huggingface.co/datasets/google/mobile-actions) dataset, which contains human-AI conversations about performing actions on mobile devices. The original dataset has been converted to the Qwen chat template format for efficient training of Qwen models.

### Conversion Process

The conversion is performed using the `convert_mobile_actions.py` script, which:

1. Reads the original JSONL format from `/data/datasets/mobile-actions/dataset.jsonl`
2. Reformats the tools and messages to match Qwen's expected format
3. Applies the Qwen chat template using `transformers.AutoTokenizer.apply_chat_template()`
4. Saves the result as a Parquet file with `source` and `text` columns

### Dataset Structure

The dataset is stored in Parquet format with the following columns:

| Column | Type | Description |
|--------|------|-------------|
| `source` | string | Always `'mobile-actions'` to identify the data source |
| `text` | string | The complete conversation formatted using Qwen's chat template |

The data is organized in a single Parquet file:

```
data/train-00000-of-00001.parquet
```

### Data Format

Each sample in the original dataset contains:

- `tools`: List of available tools/functions with their JSON schemas
- `messages`: Conversation history with roles (`developer`, `user`, `assistant`)

After conversion, the `text` column contains the fully formatted conversation ready for language modeling training.
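The two-column structure described above lends itself to a quick sanity check before training. The following is a minimal sketch, not part of the dataset's tooling: `find_invalid_rows`, `EXPECTED_COLUMNS`, and `EXPECTED_SOURCE` are hypothetical names introduced here, and the check assumes rows are available as plain dictionaries.

```python
from typing import Any, Dict, Iterable, List

# Assumed from the Dataset Structure table: exactly these two columns,
# with `source` always equal to 'mobile-actions'.
EXPECTED_COLUMNS = {"source", "text"}
EXPECTED_SOURCE = "mobile-actions"


def find_invalid_rows(rows: Iterable[Dict[str, Any]]) -> List[str]:
    """Return one human-readable message per row that violates the schema."""
    problems = []
    for index, row in enumerate(rows):
        if set(row) != EXPECTED_COLUMNS:
            problems.append(f"row {index}: unexpected columns {sorted(row)}")
        elif row["source"] != EXPECTED_SOURCE:
            problems.append(f"row {index}: unexpected source {row['source']!r}")
        elif not isinstance(row["text"], str) or not row["text"].strip():
            problems.append(f"row {index}: empty or non-string text")
    return problems
```

With pandas, the rows could be obtained from a loaded DataFrame via `df.to_dict("records")`; an empty result list means every row matches the documented structure.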
The converted text follows Qwen's tool-calling template, which includes:

- System prompt with tool definitions
- User query
- Assistant response with tool calls (when applicable)

### Usage Example

```python
import pandas as pd

# Load the dataset
df = pd.read_parquet('/data/datasets/mobile-actions-sft/data/train-00000-of-00001.parquet')

# Sample text
sample_text = df.iloc[0]['text']
print(sample_text[:500])  # Print first 500 characters

# For training with TRL SFTTrainer
# The dataset is in standard language modeling format: {"text": "full_conversation"}
```

### Training with TRL

This dataset is in the **standard language modeling format** as defined in the TRL documentation:

- **Type**: Language modeling
- **Format**: Standard (plain text strings)
- **Expected columns**: `{"text": "The sky is blue."}`

It can be used directly with `SFTTrainer` for supervised fine-tuning of Qwen models.

### Original Dataset Information

- **Name**: google/mobile-actions
- **Description**: Human-AI conversations about performing actions on mobile devices
- **Size**: ~100k conversations
- **Tasks**: Tool calling, function calling, mobile assistant
- **License**: Apache 2.0

### Conversion Script

The conversion script is available at `convert_mobile_actions.py` in the project root.
Key parameters:

```bash
python convert_mobile_actions.py \
    --input_path /data/datasets/mobile-actions/dataset.jsonl \
    --output_dir /data/datasets/mobile-actions-sft/data \
    --model_path /data/models/Qwen2.5-0.5B-Instruct \
    --max_samples 1000  # Optional: limit for testing
```

### Citation

If you use this dataset, please cite the original work:

```bibtex
@inproceedings{shah2024mobile,
  title={Mobile-actions: A dataset for instruction-based mobile UI navigation},
  author={Shah, Pratyush and Dhekane, Eshaan and Gholami, Saghar and Narayan, Apurva and Wang, Bing and Narayanan, Vijay},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={--},
  year={2024}
}
```

### License

This converted dataset inherits the Apache 2.0 license from the original `google/mobile-actions` dataset.

### Contact

For questions about the conversion process, refer to the `convert_mobile_actions.py` script documentation.
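The source of `convert_mobile_actions.py` is not reproduced here, so the following is only a rough sketch of what the described conversion steps might look like for a single JSONL record. The helper names (`to_qwen_messages`, `convert_record`, `convert_lines`) are hypothetical, and renaming the `developer` role to `system` is an assumption about how messages are adapted to Qwen's format. The tokenizer is passed in, so any object exposing `apply_chat_template` (such as one loaded with `transformers.AutoTokenizer.from_pretrained`) can be used.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List


def to_qwen_messages(messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Rename the 'developer' role to 'system' (an assumption, not confirmed by the source)."""
    converted = []
    for message in messages:
        role = "system" if message["role"] == "developer" else message["role"]
        converted.append({**message, "role": role})
    return converted


def convert_record(record: Dict[str, Any], tokenizer: Any) -> Dict[str, str]:
    """Render one original sample into a row with `source` and `text` columns."""
    text = tokenizer.apply_chat_template(
        to_qwen_messages(record["messages"]),
        tools=record.get("tools"),  # tool schemas are passed through unchanged here
        tokenize=False,             # return the formatted string, not token ids
    )
    return {"source": "mobile-actions", "text": text}


def convert_lines(lines: Iterable[str], tokenizer: Any) -> Iterator[Dict[str, str]]:
    """Convert JSONL lines one by one, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield convert_record(json.loads(line), tokenizer)
```

The resulting rows could then be collected into a DataFrame and written with `DataFrame.to_parquet`. The actual script may reshape the tool schemas further before templating, which this sketch does not attempt.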
Provider:
niwang66