Meta's 11B multimodal model supporting text and image inputs. Optimized for visual reasoning and image understanding tasks.