How well can AI build Android apps? Google aims to find out

Google has introduced a new benchmark designed to evaluate how effectively artificial intelligence models can develop Android applications. The platform, called Android Bench, measures the performance of different AI systems on tasks related to app development and ranks them on a public leaderboard.

The company said the initiative aims to help developers identify the most capable AI tools when building apps and experiences for the Android ecosystem.

Google's Android Bench

According to a post on the Android Developers Blog, Android Bench is the official benchmark for large language models (LLMs) used for Android app development. The post said the benchmark consists of tasks curated to reflect common problems that developers encounter when building apps.

The tasks include networking on wearables and migrating apps to newer versions of Jetpack Compose. They were sourced from public repositories hosted on GitHub and validated with input from several developers of AI models.

Google said the benchmark was created to set a standard for assessing AI coding assistance within the Android ecosystem.

Google has also released the benchmark's methodology, dataset, and testing framework on GitHub so that developers and AI researchers can verify the results and help improve the process.

To limit data contamination, in which test answers end up in a model's training data, the benchmark mostly targets tasks that require reasoning rather than memorisation.

Initial results place Gemini 3.1 Pro at the top of the Android Bench leaderboard. Other top-performing AI models include Claude Opus 4.6, GPT 5.2 Codex, Claude Opus 4.5, and Gemini 3 Pro.

Android developers can try these AI models, via API access, in the latest stable version of Android Studio.

Google plans to expand Android Bench in future versions with more tests so that it remains a useful benchmark for AI-based Android development tools.