AI-PEDIA

Vision Transformer (ViT)

An architecture that lets models process an image by splitting it into fixed-size patches and treating those patches as a sequence of tokens.

What it means

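For a long time, text and images were handled by different kinds of models: convolutional networks for images, and recurrent networks and later Transformers for text. A Vision Transformer treats an image like a sentence. It cuts the picture into small, fixed-size squares (patches, the 'words'), turns each one into a vector, and uses self-attention to model how every patch relates to every other.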
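To make the "image as a sentence" idea concrete, here is a minimal sketch in Python with NumPy. It is an illustration under stated assumptions, not a reference implementation: the function names `patchify` and `self_attention`, the 32x32 image, and the 16-pixel patch size are all made up for this example, and the attention step omits the learned projections, position embeddings, and multiple heads that a real ViT uses.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping square patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered row by row, like words in a sentence.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # Split each spatial axis into (grid index, offset inside the patch).
    grid = image.reshape(gh, patch_size, gw, patch_size, c)
    # Bring the two grid axes to the front so each patch is contiguous.
    grid = grid.transpose(0, 2, 1, 3, 4)
    return grid.reshape(gh * gw, patch_size * patch_size * c)


def self_attention(tokens):
    """Simplified single-head self-attention with no learned weights.

    Assumption for illustration only: queries, keys, and values are the
    tokens themselves. A real ViT applies learned linear projections first.
    """
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)
    # Subtract the row maximum before exponentiating, for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ tokens, weights


# Hypothetical input: a random 32x32 RGB "image".
image = np.random.default_rng(0).random((32, 32, 3))
patches = patchify(image, patch_size=16)      # 4 patches, 768 values each
mixed, weights = self_attention(patches)      # weights[i, j]: how much patch i attends to patch j
```

Each row of `weights` sums to 1 and describes how strongly one patch attends to every patch, which is the "relationship between them" described above.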

Why it matters

This gave text and images a shared architecture, so a single model can handle both as sequences of tokens. ViT-style image encoders are a key component that lets multimodal models understand a meme or analyze a chart in a PDF.

Keep reading

A few adjacent definitions to lock in the concept.

Mixture of Experts (MoE)

An architecture that routes each part of a query to the best-suited specialized 'expert' sub-networks.

Parameters

The internal variables or 'settings' learned by the model during training.