A multimodal model processes and generates data across two or more modalities—text, images, audio, or video—within a unified architecture.
Representative Models
Architecture Patterns
- Single encoder per modality feeding shared decoder.
- Shared token space (text tokens plus vision patches) processed by the same transformer (see the sketch after this list).
- Late fusion where modality-specific experts merge at decision layer.
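As a concrete illustration of the shared token space pattern, the sketch below embeds text tokens and linearly projects precomputed vision patch features into a common width, then runs the concatenated sequence through one transformer. This is a minimal sketch assuming PyTorch; the vocabulary size, patch dimension, layer count, and class name are placeholder choices, not details taken from the text above.

```python
import torch
import torch.nn as nn

class SharedTokenSpace(nn.Module):
    # Text token ids and precomputed vision patch features are mapped into one
    # embedding width and processed as a single sequence by a shared transformer.
    def __init__(self, vocab_size=32000, patch_dim=768, d_model=512):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)   # text ids -> vectors
        self.patch_proj = nn.Linear(patch_dim, d_model)        # vision patches -> same width
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, text_ids, patch_feats):
        seq = torch.cat([self.text_embed(text_ids),
                         self.patch_proj(patch_feats)], dim=1)  # one joint sequence
        return self.backbone(seq)

model = SharedTokenSpace()
out = model(torch.randint(0, 32000, (2, 16)),   # 16 text tokens per example
            torch.randn(2, 49, 768))            # 49 vision patches per example
print(out.shape)                                # torch.Size([2, 65, 512])
```

The same concatenation extends to audio once audio features are projected to the shared width; the other two patterns differ mainly in where the modality-specific computation ends.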
Design Trade-offs
- Shared encoders reduce parameter count but may sacrifice modality-specific accuracy (a rough count follows this list).
- Late fusion preserves specialist performance but increases latency.
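To make the parameter trade-off concrete, here is a back-of-envelope count. It is only a sketch: the roughly 12*d^2-per-layer approximation ignores biases, norms, and embeddings, and the width and depth are arbitrary example values.

```python
def transformer_params(d_model, n_layers, ffn_mult=4):
    """Rough parameter count for a transformer stack:
    ~4*d^2 for the attention projections plus ~2*ffn_mult*d^2 for the FFN
    per layer (biases, norms, and embeddings ignored)."""
    per_layer = 4 * d_model**2 + 2 * ffn_mult * d_model**2
    return n_layers * per_layer

shared = transformer_params(1024, 24)        # one encoder serving all modalities
separate = 2 * transformer_params(1024, 24)  # e.g. distinct text and vision towers
print(f"shared: {shared/1e6:.0f}M  separate: {separate/1e6:.0f}M")
```

Late fusion sits at the other end: it keeps the separate towers (and their cost and latency) but merges their outputs only at the decision layer.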
Current Trends (2025)
- Token-based audio patches allow joint audio-text training (a patch-embedding sketch follows this list).
- Visual grounding tasks integrated during instruction tuning.
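One way to realize token-based audio patches is to slice a mel spectrogram into fixed-length groups of frames and project each group to the text model's embedding width. The sketch below assumes PyTorch; the patch size, mel-bin count, and embedding width are placeholder values.

```python
import torch
import torch.nn as nn

class AudioPatchEmbed(nn.Module):
    """Turn a mel spectrogram into a sequence of 'audio tokens' that share
    the text model's embedding width. Patch size and dims are illustrative."""
    def __init__(self, n_mels=80, patch_frames=4, d_model=768):
        super().__init__()
        self.patch_frames = patch_frames
        self.proj = nn.Linear(n_mels * patch_frames, d_model)

    def forward(self, mel):                      # mel: [batch, time, n_mels]
        b, t, m = mel.shape
        t = t - t % self.patch_frames            # drop ragged tail frames
        patches = mel[:, :t].reshape(b, t // self.patch_frames,
                                     self.patch_frames * m)
        return self.proj(patches)                # [batch, n_patches, d_model]

tokens = AudioPatchEmbed()(torch.randn(2, 100, 80))   # -> [2, 25, 768]
```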
Implementation Tips
- Normalize input resolutions to reduce positional embedding mismatch (an interpolation sketch follows this list).
- Use a mix of unimodal and multimodal batches during fine-tuning (see the batch-mixing sketch below).
- Evaluate each modality separately and jointly to detect regressions (see the evaluation sketch below).
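For the resolution tip: when input resolutions must vary, ViT-style positional embeddings can be bilinearly interpolated to the new patch grid instead of retrained. A minimal sketch, assuming PyTorch and a square patch grid; the function name and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid, new_grid):
    """Bilinearly interpolate positional embeddings shaped [1, old_h*old_w, dim]
    to a new patch grid, returning [1, new_h*new_w, dim]."""
    _, _, dim = pos_embed.shape
    old_h, old_w = old_grid
    new_h, new_w = new_grid
    grid = pos_embed.reshape(1, old_h, old_w, dim).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_h, new_w),
                         mode="bilinear", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, new_h * new_w, dim)

resized = resize_pos_embed(torch.randn(1, 196, 768), (14, 14), (16, 16))
print(resized.shape)   # torch.Size([1, 256, 768])
```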
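For batch mixing: a simple approach is to choose the source of each training batch with a fixed probability. The loader names and the 50/50 ratio below are assumptions; note that `itertools.cycle` caches and replays batches, so a real pipeline would reshuffle each source per epoch instead.

```python
import random
from itertools import cycle

def mixed_batches(unimodal_loader, multimodal_loader, p_multimodal=0.5, steps=1000):
    """Yield `steps` batches, drawing each one from the multimodal source with
    probability p_multimodal and from the unimodal source otherwise."""
    uni, multi = cycle(unimodal_loader), cycle(multimodal_loader)
    for _ in range(steps):
        yield next(multi) if random.random() < p_multimodal else next(uni)
```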
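For evaluation: scoring per-modality and joint splits with the same metric, and comparing against a stored baseline, makes a single-modality regression visible even when the joint score looks flat. Everything here (the split names, `metric_fn`, the tolerance) is a placeholder for your own harness.

```python
def evaluate_splits(model, eval_sets, metric_fn, baseline=None, tol=0.01):
    """Score the model on each split (e.g. 'text', 'vision', 'joint') and flag
    any split that drops more than `tol` below its stored baseline score."""
    scores = {name: metric_fn(model, loader) for name, loader in eval_sets.items()}
    if baseline:
        regressions = {k: (baseline[k], v) for k, v in scores.items()
                       if k in baseline and v < baseline[k] - tol}
        if regressions:
            print("Possible regressions (baseline, current):", regressions)
    return scores
```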