Microsoft has released Phi-4-reasoning-vision-15B, an open-source multimodal model. The hardware-efficient model processes visual inputs such as scientific charts alongside text. Its architecture combines the SigLIP-2 vision encoder with the Phi-4 Reasoning language model.
A "mid-fusion" approach confines multimodal processing to specific layers instead of mixing image and text data throughout the network, which reduces the hardware resources needed to run the model. Despite its size, the model competes with larger systems on math and science reasoning.
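The source does not describe how the fusion step is implemented. As a rough illustration only, the sketch below shows one common way to join a separate vision encoder to a language model: image-patch features are projected into the text embedding size at a single step and then placed in the same sequence as the text tokens. The names `project` and `fuse`, the toy dimensions, and the weight values are all hypothetical and are not taken from the Phi-4-reasoning-vision-15B code.

```python
from typing import List

Vector = List[float]


def project(patch: Vector, weights: List[Vector]) -> Vector:
    """Linearly map one image-patch feature into the text embedding size.

    `weights` has one row per output dimension (hypothetical toy projector).
    """
    return [sum(w * x for w, x in zip(row, patch)) for row in weights]


def fuse(text_tokens: List[Vector],
         image_patches: List[Vector],
         weights: List[Vector]) -> List[Vector]:
    """Project image patches, then prepend them to the text token sequence.

    The modalities meet only at this step; the encoders that produced
    `image_patches` and `text_tokens` never saw the other modality.
    """
    visual_tokens = [project(p, weights) for p in image_patches]
    return visual_tokens + text_tokens


# Toy example: one 3-dimensional image patch, two 2-dimensional text tokens.
toy_weights = [[1.0, 0.0, 0.0],
               [0.0, 1.0, 1.0]]
toy_patches = [[1.0, 2.0, 3.0]]
toy_text = [[0.5, 0.5], [0.1, 0.9]]
fused = fuse(toy_text, toy_patches, toy_weights)
```

In this toy setup the cross-modal work is limited to one small projection, which is the intuition behind the lower resource cost; a real model would use learned weights and far larger dimensions.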
The release lets developers build AI agents that interact with user interfaces by interpreting screenshots.
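A screenshot-driven agent of this kind is often structured as a simple observe, decide, act loop. The sketch below is a minimal illustration under that assumption; `Action`, `run_agent`, and the three callables passed in are hypothetical names, and the model call is left as a plug-in function because the source does not specify an API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    kind: str   # hypothetical action types: "click" or "done"
    x: int = 0
    y: int = 0


def run_agent(take_screenshot: Callable[[], bytes],
              propose_action: Callable[[bytes, str], Action],
              perform: Callable[[Action], None],
              goal: str,
              max_steps: int = 5) -> List[Action]:
    """Loop: capture the screen, ask the model for an action, carry it out.

    `propose_action` stands in for a call to a vision-language model;
    it is an assumption, not a documented interface.
    """
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = take_screenshot()
        action = propose_action(screenshot, goal)
        history.append(action)
        if action.kind == "done":
            break  # the model reports that the goal is reached
        perform(action)
    return history
```

Passing the screenshot, model, and input functions as parameters keeps the loop testable with stubs, and the `max_steps` cap prevents the agent from running indefinitely if the model never returns "done".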