Open-AutoGLM Phone Agent
Skill by ara.so — Daily 2026 Skills collection.
Open-AutoGLM is an open-source AI phone agent framework that enables natural language control of Android, HarmonyOS NEXT, and iOS devices. It uses the AutoGLM vision-language model (9B parameters) to perceive screen content and execute multi-step tasks like "open Meituan and search for nearby hot pot restaurants."
Architecture Overview
User Natural Language → AutoGLM VLM → Screen Perception → ADB/HDC/WebDriverAgent → Device Actions
- Model: AutoGLM-Phone-9B (Chinese-optimized) or AutoGLM-Phone-9B-Multilingual
- Device control: ADB (Android), HDC (HarmonyOS NEXT), WebDriverAgent (iOS)
- Model serving: vLLM or SGLang (self-hosted) or BigModel/ModelScope API
- Input: Screenshot + task description → Output: structured action commands
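The flow above can be read as a perceive, decide, act loop. The sketch below is an illustration only and is not Open-AutoGLM's implementation: `run_loop`, `Step`, the three callables, and the `finish` stop signal are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One loop iteration: the screenshot that was seen and the action chosen."""
    screenshot: bytes
    action: str


def run_loop(
    task: str,
    capture: Callable[[], bytes],                     # hypothetical: grabs a screenshot
    decide: Callable[[str, bytes, List[Step]], str],  # hypothetical: calls the VLM
    execute: Callable[[str], None],                   # hypothetical: sends the action to the device
    max_steps: int = 20,
) -> List[Step]:
    """Repeat screenshot -> model -> device action until the model signals completion."""
    history: List[Step] = []
    for _ in range(max_steps):
        screenshot = capture()
        action = decide(task, screenshot, history)
        history.append(Step(screenshot, action))
        # Assumed stop signal; the real action vocabulary may differ.
        if action.startswith("finish"):
            break
        execute(action)
    return history
```

Passing the three stages as callables keeps the loop independent of the device backend, so the same structure would apply whether actions are delivered through ADB, HDC, or WebDriverAgent.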
Installation
Prerequisites
- Python 3.10+
- ADB installed and in PATH (Android) or HDC (HarmonyOS) or WebDriverAgent (iOS)
- Android device with Developer Mode + USB Debugging enabled
- ADB Keyboard APK installed on Android device (for text input)
Install the framework
git clone https://github.com/zai-org/Open-AutoGLM.git
cd Open-AutoGLM
pip install -r requirements.txt
pip install -e .
Verify ADB connection
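adb devices          # Android: connected devices should appear with state "device"
hdc list targets     # HarmonyOS NEXT: lists connected targets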
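For scripted setups, the `adb devices` check can be automated. The following is a minimal sketch that parses the command's standard output format (a header line, then one `serial<TAB>state` line per device). The function names are hypothetical helpers for this example and are not part of Open-AutoGLM.

```python
import subprocess


def parse_adb_devices(output: str) -> dict[str, str]:
    """Map each device serial to its state from `adb devices` output."""
    devices: dict[str, str] = {}
    for line in output.splitlines():
        line = line.strip()
        # Skip blanks, the "List of devices attached" header, and daemon notices ("* ...").
        if not line or line.startswith("List of devices") or line.startswith("*"):
            continue
        parts = line.split()
        if len(parts) >= 2:
            devices[parts[0]] = parts[1]
    return devices


def ready_devices(output: str) -> list[str]:
    """Return serials whose state is "device", i.e. authorized and online."""
    return [serial for serial, state in parse_adb_devices(output).items() if state == "device"]


def list_ready_devices() -> list[str]:
    """Run `adb devices` (requires adb in PATH) and return usable serials."""
    result = subprocess.run(["adb", "devices"], capture_output=True, text=True, check=True)
    return ready_devices(result.stdout)
```

A device reported as `unauthorized` usually means the USB debugging prompt on the phone has not been accepted yet.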
Model Deployment Options
Option A: Third-party API (Recommended for quick start)
BigModel (ZhipuAI)
export BIGMODEL_API_KEY="your-bigmodel-api-key"
python main.py \
--base-url https://open.bigmodel.cn/api/paas/v4 \
--model "autoglm-phone" \
--apikey $BIGMODEL_API_KEY \
"打开美团搜索附近的火锅店"
ModelScope
export MODELSCOPE_API_KEY="your-modelscope-api-key"
python main.py \
--base-url https://api-inference.modelscope.cn/v1 \
--model "ZhipuAI/AutoGLM-Phone-9B" \
--apikey $MODELSCOPE_API_KEY \
"open Meituan and find nearby hotpot"
Option B: Self-hosted with vLLM
pip install vllm
python3 -m vllm.entrypoints.openai.api_server \
--served-model-name autoglm-phone-9b \
--allowed-local-media-path / \
--mm-encoder-tp-mode data \
--mm_processor_cache_type shm \
--mm_processor_kwargs '{"max_pixels":5000000}' \
--max-model-len 25480 \
--chat-template-content-format string \
--limit-mm-per-prompt '{"image":10}' \
--model zai-org/AutoGLM-Phone-9B \
--port 8000
Option C: Self-hosted with SGLang
python3 -m sglang.launch_server \
--model-path zai-org/AutoGLM-Phone-9B \
--served-model-name autoglm-phone-9b \
--context-length 25480 \
--mm-enable-dp-encoder \
--mm-process-config '{"image":{"max_pixels":5000000}}' \
--port 8000
Verify deployment
python scripts/check_deployment_cn.py \
--base-url http://localhost:8000/v1 \
--model autoglm-phone-9b
The expected output contains a <think>...</think> block followed by an action such as <answer>do(action="Launch", app="..."). If the reasoning inside <think> is very short or garbled, the deployment has most likely failed and the serving flags should be rechecked.
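The check described above can be approximated in code. This is a rough heuristic sketch, not the logic of `scripts/check_deployment_cn.py`: the function name `looks_healthy` and the `min_think_chars` threshold are assumptions made for illustration, and the threshold in particular should be tuned for the language in use.

```python
import re


def looks_healthy(output: str, min_think_chars: int = 20) -> bool:
    """Heuristically check for a non-trivial <think> block followed by an <answer>do(...) action."""
    think = re.search(r"<think>(.*?)</think>", output, re.DOTALL)
    answer = re.search(r"<answer>\s*do\(", output)
    if think is None or answer is None:
        return False
    # The reasoning block is expected to come before the answer.
    if think.end() > answer.start():
        return False
    # Assumed threshold: a very short chain of thought is treated as a failure sign.
    return len(think.group(1).strip()) >= min_think_chars
```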
Running the Agent
Basic CLI usage
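# Android (default device type). Task: "Open Xiaohongshu and search for food"
python main.py \
--base-url http://localhost:8000/v1 \
--model autoglm-phone-9b \
"打开小红书搜索美食"
# HarmonyOS NEXT via HDC. Task: "Open Settings and check WiFi"
python main.py \
--base-url http://localhost:8000/v1 \
--model autoglm-phone-9b \
--device-type hdc \
"打开设置查看WiFi"
# English task with the multilingual model
python main.py \
--base-url http://localhost:8000/v1 \
--model autoglm-phone-9b-multilingual \
"Open Instagram and search for travel photos"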
Key CLI parameters
| Parameter | Description | Default |
| --- | --- | --- |
| --base-url | Model service endpoint | Required |
| --model | Model name on server | Required |
| --apikey | API key for third-party services | None |
| --device-type | adb (Android) or hdc (HarmonyOS) | adb |
| --device-id | Specific device serial number | Auto-detect |
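When wrapping the CLI in another script, it can help to mirror these parameters in a small parser. The sketch below reproduces only the flags and defaults listed in the table; it is not the project's `main.py`, and both `build_parser` and the restriction of `--device-type` to `adb`/`hdc` are assumptions drawn from the table alone.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="Run a phone-agent task (illustrative wrapper)")
    parser.add_argument("--base-url", required=True, help="Model service endpoint")
    parser.add_argument("--model", required=True, help="Model name on the server")
    parser.add_argument("--apikey", default=None, help="API key for third-party services")
    parser.add_argument("--device-type", choices=["adb", "hdc"], default="adb",
                        help="adb (Android) or hdc (HarmonyOS)")
    # None stands in for "auto-detect" in this sketch.
    parser.add_argument("--device-id", default=None, help="Specific device serial number")
    parser.add_argument("task", help="Natural-language task description")
    return parser
```

Note that argparse exposes hyphenated flags as underscored attributes, so `--base-url` is read back as `args.base_url`.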
Python API Usage
Basic agent invocation
from phone_agent import PhoneAgent
from phone_agent.config import AgentConfig

config = AgentConfig(
    base_url="http://localhost:8000/v1",
    model="autoglm-phone-9b",
    device_type="adb",
)
agent = PhoneAgent(config)
# Task: "Open Taobao and search for Bluetooth headphones"
result = agent.run("打开淘宝搜索蓝牙耳机")
print(result)
Custom task with device selection
import os

from phone_agent import PhoneAgent
from phone_agent.config import AgentConfig

# Read endpoint settings from the environment instead of hard-coding them.
config = AgentConfig(
    base_url=os.environ["MODEL_BASE_URL"],
    model=os.environ["MODEL_NAME"],
    apikey=os.environ.get("MODEL_API_KEY"),
    device_type="adb",
    device_id="emulator-5554",  # target one specific device
)
agent = PhoneAgent(config)
# Task: "Buy the cheapest Bluetooth headphones on JD.com"
result = agent.run(
    "在京东购买最便宜的蓝牙耳机",
    confirm_sensitive=True,
)
Direct model API call (for testing/integration)
import base64
import os

import openai

# Client for an OpenAI-compatible endpoint; falls back to a placeholder key
# when MODEL_API_KEY is unset.
client = openai.OpenAI(
    base_url=os.environ["MODEL_BASE_URL"],
    api_key=os.environ.get("MODEL_API_KEY", "dummy"),
)

# Encode the screenshot as base64 so it can be sent inline as a data URL.
screenshot_path = "screenshot.png"
with open(screenshot_path, "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="autoglm-phone-9b",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
                {
                    "type": "text",
                    # Task text: "Search for nearby coffee shops"
                    "text": "Task: 搜索附近的咖啡店\nCurrent step: Navigate to search",
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
Parsing model action output
import re
def parse_action(model_output: str) -> dict:
"""Parse AutoGLM model output into structured action."""
answer_match = re.search(r'<answer>(.*?)(?:</answer>|$)', model_output