Vision Transformer (ViT)

An architecture that applies the Transformer to images by breaking them into patches and treating each patch as a token.

What it means

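For a long time, text and images were handled by very different kinds of neural networks: convolutional networks for images, and recurrent networks and later Transformers for text. A Vision Transformer treats an image like a sentence. It chops the picture into small, fixed-size squares called patches (the 'words'; 16×16 pixels in the original ViT) and uses self-attention to analyze the relationships between them.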
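The "chopping" step can be sketched in a few lines of NumPy. This is a minimal illustration, not the code of any particular ViT library: the function name `image_to_patches`, the 224×224 input size, and the 16-pixel patch size are assumptions chosen for the example.

```python
import numpy as np


def image_to_patches(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping square patches.

    Hypothetical helper for illustration; returns an array of shape
    (num_patches, patch_size * patch_size * C).
    """
    height, width, channels = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = height // patch_size, width // patch_size
    # Split each spatial axis into (grid index, offset inside the patch).
    grid = image.reshape(rows, patch_size, cols, patch_size, channels)
    # Reorder to (grid row, grid col, patch row, patch col, channel).
    grid = grid.transpose(0, 2, 1, 3, 4)
    # One row per patch, in reading order (left to right, top to bottom).
    return grid.reshape(rows * cols, patch_size * patch_size * channels)


# Assumed example input: a 224x224 RGB image filled with placeholder values.
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
patches = image_to_patches(image, patch_size=16)
print(patches.shape)  # (196, 768): a 14x14 grid of patches, 16*16*3 values each
```

Each of the 196 rows plays the role of one 'word'. In a full Vision Transformer these vectors are then passed through a learned linear projection, combined with position embeddings so the model knows where each patch came from, and fed to a Transformer encoder; those steps are left out of this sketch.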

Why it matters

Because images become sequences of tokens just like text, the same Transformer machinery can handle both. This is a key technology behind multimodal models that can understand a meme or analyze a chart in a PDF.