2025 Complete Guide: How Alibaba Tongyi's UI-Ins Model Revolutionizes GUI Grounding and Automation

🎯 Key Takeaways (TL;DR)

  • UI-Ins Model: Alibaba Tongyi Lab has released UI-Ins-7B and UI-Ins-32B, aiming to significantly enhance Graphical User Interface (GUI) Grounding and automation capabilities through an innovative "Instruction-as-Reasoning" paradigm.
  • Addressing Pain Points: Traditional GUI grounding datasets have an instruction error rate as high as 23.3%. UI-Ins addresses the impact of instruction diversity and quality on model performance through multi-perspective instruction reasoning and reinforcement learning.
  • Outstanding Performance: UI-Ins-32B sets SOTA (State-Of-The-Art) records across multiple benchmarks, achieving 87.3% accuracy on UI-I2E-Bench, 57.0% on ScreenSpot-Pro, and 84.9% on MMBench-GUI L2.
  • Powerful Agent Potential: UI-Ins-7B as an executor achieves a 74.1% success rate on AndroidWorld, surpassing Gemini 2.5 Computer Use, demonstrating tremendous application prospects in GUI automation.
  • Innovative Training Framework: Employs a two-stage training mechanism: first, supervised fine-tuning (SFT) on synthesized diverse instructions to instill multi-perspective reasoning capabilities; second, reinforcement learning (RL) to optimize path selection and combination.

Table of Contents

  1. What is GUI Grounding and GUI Agent?
  2. Alibaba Tongyi UI-Ins Model: Core Technology and Innovation
  3. How Does UI-Ins Model Perform?
  4. How to Quickly Use the UI-Ins Model?
  5. Why is the UI-Ins Model So Important?
  6. 🤔 Frequently Asked Questions
  7. Conclusion and Recommendations

What is GUI Grounding and GUI Agent?

GUI Grounding is a critical task in the field of artificial intelligence, with the core objective of accurately mapping human natural language instructions to actionable elements in a Graphical User Interface (GUI). Simply put, it enables AI to understand instructions like "click the search button" or "fill in the username" and know how to locate and operate the corresponding UI elements on the screen.

A GUI Agent, built upon GUI grounding capabilities, can autonomously execute complex operations in a GUI environment based on a series of natural language task instructions, thereby achieving task automation. For example, a GUI agent can automatically open a flight booking app, fill in information, select flights, and complete the booking process based on an instruction like "book a flight from Beijing to Shanghai."
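
To make the relationship between grounding and agents concrete, here is a minimal, hypothetical sketch of an agent loop layered on top of a grounding model: each natural-language step is grounded to screen coordinates and then executed. The names `ground`, `capture_screenshot`, and `execute_action` are illustrative placeholders, not part of the UI-Ins API.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple

@dataclass
class Action:
    kind: str                  # e.g. "click" or "type"
    target: Tuple[int, int]    # screen coordinates returned by the grounding model
    text: str = ""             # payload for "type" actions

def run_agent(task_steps: Iterable[str],
              ground: Callable,
              capture_screenshot: Callable,
              execute_action: Callable) -> None:
    """Drive a GUI task one natural-language step at a time.

    ground(screenshot, instruction) -> (x, y)   # the grounding model
    capture_screenshot() -> image               # current screen state
    execute_action(action) -> None              # OS / device automation layer
    """
    for instruction in task_steps:
        screenshot = capture_screenshot()        # observe the current UI
        x, y = ground(screenshot, instruction)   # map language to a screen location
        execute_action(Action(kind="click", target=(x, y)))

# Example usage with a simplified flight-booking flow (all callables supplied by you):
# run_agent(
#     ["Open the flight booking app", "Tap the departure city field", "Tap the 'Search flights' button"],
#     ground=my_grounding_model, capture_screenshot=grab_screen, execute_action=perform,
# )
```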

💡 Pro Tip
GUI grounding is the cornerstone of achieving true "human-computer interaction" and "GUI automation." Its accuracy and robustness directly impact the practicality of GUI agents.

Alibaba Tongyi UI-Ins Model: Core Technology and Innovation

The UI-Ins-7B and UI-Ins-32B models released by Alibaba Tongyi Lab aim to address long-standing challenges in the GUI grounding field through innovative approaches.

Instruction-as-Reasoning Paradigm

Previous research mostly treated natural language instructions as static user intents, overlooking the impact of instruction diversity and quality on model understanding and execution capabilities. The UI-Ins model proposes a revolutionary "Instruction-as-Reasoning" paradigm.

In this paradigm, instructions are no longer merely commands but are viewed as dynamic analysis paths providing different perspectives. This means that upon receiving an instruction, the model, like humans, attempts to understand and "reason" about the true intent of the instruction from multiple angles and selects the most effective operational approach based on these reasoning paths. This method allows the model to better handle ambiguous, incomplete, or polysemous instructions in complex GUI environments.
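
The paper's exact prompt and output format are not reproduced here, but the idea can be sketched roughly as follows: instead of grounding one literal instruction, several reasoning paths (functional, visual, positional, and so on) are considered, and the candidate the model is most confident about wins. `ground_with_confidence` is a hypothetical stand-in for a grounding call that returns coordinates plus a confidence score.

```python
from typing import Callable, List, Tuple

# Hypothetical grounding call: (screenshot, reasoning_path) -> ((x, y), confidence)
Grounder = Callable[[object, str], Tuple[Tuple[int, int], float]]

def ground_multi_perspective(screenshot,
                             reasoning_paths: List[str],
                             ground_with_confidence: Grounder):
    """Ground one instruction via several reasoning paths and keep the best candidate."""
    candidates = []
    for path in reasoning_paths:
        coords, confidence = ground_with_confidence(screenshot, path)
        candidates.append((confidence, coords, path))
    # Select the path the grounder is most confident about
    best_confidence, best_coords, best_path = max(candidates)
    return best_coords, best_path

# Example perspectives for the instruction "click the search button":
# paths = [
#     "Functional: the control that submits the search query",
#     "Visual: the magnifying-glass icon near the text input",
#     "Positional: the button at the right end of the top toolbar",
# ]
# coords, chosen = ground_multi_perspective(screenshot, paths, my_grounder)
```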

Two-Stage Training Framework

To achieve this "Instruction-as-Reasoning" capability, UI-Ins introduces a unique and efficient two-stage training framework:

  1. Supervised Fine-Tuning (SFT) Stage:

    • Objective: Instill multi-perspective reasoning capabilities in the model.
    • Method: Supervised learning on a large volume of synthesized diverse instructions. These synthetic instructions are carefully designed to simulate various instruction expressions and potential intents that may appear in the real world, enabling the model to learn how to parse instructions from different perspectives.
  2. Reinforcement Learning (RL) Optimization Stage:

    • Objective: Optimize the model's path selection and combination capabilities during reasoning.
    • Method: Through reinforcement learning, the model repeatedly explores by trial and error, learning which reasoning paths yield the best operational results in a given GUI environment. This not only improves accuracy but also lets the model flexibly combine different reasoning approaches to solve problems (a minimal reward sketch follows the note below).

Best Practice
This two-stage training framework effectively mitigates the "policy collapse" issues that may occur in traditional SFT+RL frameworks, ensuring the model maintains stable performance while acquiring reasoning capabilities.
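
As a concrete illustration of the RL stage, GUI-grounding rewards are commonly defined by whether the predicted click point falls inside the annotated bounding box of the target element. The sketch below follows that common convention; the actual UI-Ins reward design may differ.

```python
from typing import Tuple

def grounding_reward(pred_point: Tuple[int, int],
                     target_box: Tuple[int, int, int, int]) -> float:
    """Binary reward: 1.0 if the predicted click falls inside the target element's
    bounding box (x1, y1, x2, y2), else 0.0. The real UI-Ins reward may differ."""
    x, y = pred_point
    x1, y1, x2, y2 = target_box
    return 1.0 if (x1 <= x <= x2 and y1 <= y <= y2) else 0.0

# Example: a predicted click at (412, 88) against a "Search" button box (390, 70, 460, 110)
# grounding_reward((412, 88), (390, 70, 460, 110))  -> 1.0
```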

How Does UI-Ins Model Perform?

The UI-Ins model has achieved remarkable results across multiple challenging benchmarks, demonstrating its exceptional GUI grounding capabilities and powerful GUI agent potential.

Benchmark Results

The UI-Ins-32B model excels in the following key benchmarks:

| Benchmark | Grounding Accuracy (UI-Ins-32B) | Description |
| --- | --- | --- |
| UI-I2E-Bench | 87.3% | Measures the model's ability to map instructions to UI elements. |
| ScreenSpot-Pro | 57.0% | Evaluates the model's localization accuracy on complex, diverse screens. |
| MMBench-GUI L2 | 84.9% | Assesses advanced tasks requiring implicit intent understanding. |

Comparison with Existing Models

Particularly noteworthy is that in AndroidWorld testing, the UI-Ins-7B model as an executor achieved a 74.1% success rate. According to Twitter user @karminski3, this performance even exceeds Gemini 2.5 Computer Use, demonstrating UI-Ins's strong competitiveness in practical GUI automation scenarios.

⚠️ Note
Existing open-source GUI benchmark datasets (such as OS-Atlas, AMEX) have instruction error rates as high as 23.3%. UI-Ins's "Instruction-as-Reasoning" paradigm partially addresses challenges caused by data quality issues, enabling the model to better handle imperfect data.

How to Quickly Use the UI-Ins Model?

The ModelScope platform provides quick-start scripts for the UI-Ins model, facilitating developer inference. Here are the basic steps for using UI-Ins for inference:

```python
import torch
import re
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# 1. Define model path, image path, and instruction
MODEL_PATH = "Qwen/Qwen2.5-VL-7B-Instruct"  # or "Qwen/Qwen2.5-VL-32B-Instruct"
IMAGE_PATH = "path/to/your/image.jpg"       # Replace with your screenshot path
INSTRUCTION = "Click the 'Search' button"   # Replace with your natural language instruction

# 2. Coordinate parsing function (extracts "[x, y]" coordinates from the model output)
def parse_coordinates(raw_string: str) -> tuple[int, int]:
    matches = re.findall(r'\[(\d+),\s*(\d+)\]', raw_string)
    if matches:
        return tuple(map(int, matches[0]))
    return -1, -1

print("Loading model...")

# 3. Load the pretrained model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,  # Use bfloat16 to save memory and speed up inference
    device_map="auto",           # Automatically place the model on the available device (e.g., GPU)
).eval()                         # Set to evaluation mode
processor = AutoProcessor.from_pretrained(MODEL_PATH)

# 4. Load the screenshot and convert it to RGB
image = Image.open(IMAGE_PATH).convert("RGB")

# 5. Construct the input messages (following the Qwen2.5-VL-Instruct dialogue format)
messages = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a helpful assistant."},
            {"type": "text", "text": """You are a GUI agent. You will be given a task and your action history, along with a screenshot. You need to perform the next action to complete the task.\n\n## Output Format\nReturn a reasoning process contained in tags, and coordinates in <action> tags"""},
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": INSTRUCTION},
        ],
    },
]

# 6. Process the inputs (text and image) and generate the model response
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=512)
# Keep only the newly generated tokens (drop the prompt) before decoding
generated_ids = generated_ids[:, inputs.input_ids.shape[1]:]
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(f"Model output: {generated_text}")
coordinates = parse_coordinates(generated_text)
print(f"Parsed coordinates: {coordinates}")
```

💡 Pro Tip
The example code uses Qwen/Qwen2.5-VL-7B-Instruct as the base model, on which the UI-Ins model has been fine-tuned. Please ensure that torch, Pillow, and transformers libraries are installed in your environment.

Why is the UI-Ins Model So Important?

The importance of the UI-Ins model is reflected in the following aspects:

  • Enhancing GUI Automation Efficiency: By more accurately understanding user instructions and GUI elements, UI-Ins can significantly improve the robustness and efficiency of automation scripts, reducing manual intervention. This holds tremendous value for software testing, RPA (Robotic Process Automation), intelligent assistants, and other fields.
  • Lowering Development Barriers: Enables non-professional developers to build complex GUI automation workflows through natural language instructions, lowering the barrier to AI application development.
  • Advancing Human-Computer Interaction: More intelligent GUI grounding capabilities mean more natural and intuitive human-computer interaction methods, allowing users to control complex applications through simple conversations.
  • Addressing Data Challenges: Through "Instruction-as-Reasoning" and two-stage training, the model can better handle imperfect training data, improving generalization capabilities and adapting to diverse GUI scenarios.
  • Industry Leadership: Surpassing existing SOTA models across multiple benchmarks consolidates Alibaba Tongyi Lab's leading position in visual multimodal understanding and GUI automation.

🤔 Frequently Asked Questions

Q: Which languages does the UI-Ins model support?

A: According to the model tags, the UI-Ins model primarily supports English, and its core training and optimization appear to focus on English environments. Instructions in other languages such as Chinese may still work, but their performance may require further testing in practical applications.

Q: Is the UI-Ins model fully open source?

A: The UI-Ins model weights (UI-Ins-7B and UI-Ins-32B) are available on ModelScope and Hugging Face under the CC BY-NC-SA 4.0 license. This means they can be freely used, distributed, and modified for non-commercial purposes, provided attribution is given and derivatives keep the same license.

Q: How does the "Instruction-as-Reasoning" paradigm help the model understand ambiguous instructions?

A: The "Instruction-as-Reasoning" paradigm encourages the model to analyze instructions from multiple perspectives, generating different reasoning paths. When instructions are ambiguous, the model can use these different reasoning paths to infer the user's most likely intent and select the most reasonable action. For example, if a user says "click the large blue button," the model might generate two reasoning paths: "click the blue button" and "click the largest button," then make a comprehensive judgment based on context or visual information.

Q: Can the UI-Ins model handle dynamically changing GUI interfaces?

A: The UI-Ins model, through its powerful GUI grounding capabilities, can map natural language instructions to actionable UI elements. While the official materials don't detail its handling of dynamic interfaces, as a core capability of GUI agents, strong GUI grounding models typically possess some adaptability to dynamic interfaces, such as responding to changes in element positions or states through visual recognition and contextual understanding.

Conclusion and Recommendations

The UI-Ins-7B and UI-Ins-32B models released by Alibaba Tongyi Lab have achieved significant breakthroughs in GUI grounding and automation through their innovative "Instruction-as-Reasoning" paradigm and two-stage training framework. They not only set new SOTA records across multiple benchmarks but also demonstrate potential surpassing existing leading models (such as Gemini 2.5 Computer Use).

For developers and researchers engaged in UI recognition, GUI automation, or building intelligent GUI agents, the UI-Ins model is undoubtedly a powerful tool worth exploring and applying in depth.

Action Recommendations:

  1. Experience the Model: Visit ModelScope or Hugging Face to download UI-Ins model weights and try the provided quick-start scripts for inference to personally experience its GUI grounding capabilities.
  2. Study the Paper: Review the official paper "UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning" to gain in-depth understanding of its technical details and innovations.
  3. Follow Developments: Closely monitor subsequent updates from Alibaba Tongyi Lab and community feedback to stay informed about the latest model progress and application cases.
  4. Explore Application Scenarios: Consider how the UI-Ins model can be applied in your specific business scenarios (such as automated testing, RPA, accessibility assistance, intelligent customer service, etc.) and conduct proof-of-concept validation.

Tags:
UI-Ins
GUI Grounding
GUI Agent
Alibaba
Tongyi
Qwen
Computer Vision
Multimodal AI
UI Automation
RPA
AndroidWorld
Instruction-as-Reasoning
Last updated: November 1, 2025