
Self-supervised vision model based on the Joint-Embedding Predictive Architecture (JEPA). It learns semantic image representations by predicting the embeddings of masked image regions from the visible context.
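The masked-region objective described above can be sketched in a few lines. This is a minimal illustration and not the reference implementation: the 16x16 patch grid (a 224-pixel image with 14-pixel patches), the number of target blocks, the scale and aspect-ratio ranges, and the helper names `sample_block`, `sample_masks`, and `embedding_loss` are all assumptions chosen for the example.

```python
import random


def sample_block(grid, scale_range, ratio_range, rng):
    """Sample one rectangular block of patch indices on a grid x grid layout."""
    scale = rng.uniform(*scale_range)   # fraction of all patches covered
    ratio = rng.uniform(*ratio_range)   # block height divided by block width
    area = scale * grid * grid
    h = max(1, min(grid, round((area * ratio) ** 0.5)))
    w = max(1, min(grid, round((area / ratio) ** 0.5)))
    top = rng.randint(0, grid - h)
    left = rng.randint(0, grid - w)
    return {r * grid + c for r in range(top, top + h) for c in range(left, left + w)}


def sample_masks(grid=16, num_targets=4, seed=0):
    """Return (context, targets): visible patches and the masked blocks to predict."""
    rng = random.Random(seed)
    # Assumed ranges: small target blocks, one large context block.
    targets = [sample_block(grid, (0.15, 0.2), (0.75, 1.5), rng)
               for _ in range(num_targets)]
    context = sample_block(grid, (0.85, 1.0), (1.0, 1.0), rng)
    for block in targets:
        context -= block  # the context never sees a patch it must predict
    return context, targets


def embedding_loss(predicted, target):
    """Mean squared error between two equally sized lists of embedding vectors."""
    total, count = 0.0, 0
    for p_vec, t_vec in zip(predicted, target):
        for p, t in zip(p_vec, t_vec):
            total += (p - t) ** 2
            count += 1
    return total / count


context, targets = sample_masks()
print(len(context), [len(block) for block in targets])
```

In a full training loop, one encoder would embed the context patches, a predictor would estimate the embeddings at the target positions, and `embedding_loss` would compare those estimates with the embeddings produced by a separate target encoder. The loss is computed in embedding space, not pixel space.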
Parameters
~632M (ViT-H/14) to ~1B (ViT-g/16) parameters
Release date
19 January 2023
Access: Download
Deployment: Local
Overview
Access & deployment
Download
Local
Weights: Open weights
Key parameters
Parameters: ~632M (ViT-H/14) to ~1B (ViT-g/16)
Fine-tuning: supported
Input: image
Technical specification
Parameters
~632M (ViT-H/14) to ~1B (ViT-g/16) parameters
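The ViT-H/14 figure can be sanity-checked with a back-of-the-envelope count. The sketch below assumes the standard ViT-Huge configuration (32 blocks, width 1280, MLP width 5120) and a 224-pixel input. These values are not stated in this card, and `vit_param_count` is a hypothetical helper, not part of any library.

```python
def vit_param_count(depth, width, mlp_dim, patch, image=224, channels=3):
    """Approximate parameter count of a plain ViT encoder without a task head."""
    attn = 4 * width * width + 4 * width          # Q, K, V and output projections with biases
    mlp = 2 * width * mlp_dim + mlp_dim + width   # two linear layers with biases
    norms = 2 * 2 * width                         # two LayerNorms (scale and shift) per block
    block = attn + mlp + norms
    patch_embed = patch * patch * channels * width + width
    pos_embed = (image // patch) ** 2 * width     # one position vector per patch
    final_norm = 2 * width
    return depth * block + patch_embed + pos_embed + final_norm


# Assumed ViT-H/14 configuration.
vit_h14 = vit_param_count(depth=32, width=1280, mlp_dim=5120, patch=14)
print(f"{vit_h14 / 1e6:.0f}M")
```

Under these assumptions the encoder alone comes to roughly 631M parameters, in line with the ~632M quoted for ViT-H/14. Almost all of them sit in the transformer blocks; patch and position embeddings contribute about 1M.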
Hardware requirements
The reference ViT-H/14 model was trained on ImageNet-1K using 16 NVIDIA A100 80GB GPUs (effective batch size 2048) in under 72 hours. Inference is feasible on a single consumer-grade GPU.
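The figures above imply some simple budget arithmetic. The sketch below assumes the effective batch is split evenly across GPUs with no gradient accumulation, and it counts only the memory needed to hold the weights (activations and framework overhead are ignored), so the results are rough lower bounds and not measured values.

```python
GPUS = 16
EFFECTIVE_BATCH = 2048
MAX_HOURS = 72
PARAMS = 632e6  # approximate ViT-H/14 parameter count

# Per-GPU batch size, assuming an even split and no gradient accumulation.
per_gpu_batch = EFFECTIVE_BATCH // GPUS

# Upper bound on training cost in GPU-hours ("under 72 hours" on 16 GPUs).
max_gpu_hours = GPUS * MAX_HOURS


def weight_memory_gb(params, bytes_per_param):
    """Memory needed to store the weights alone, in decimal gigabytes."""
    return params * bytes_per_param / 1e9


fp32_gb = weight_memory_gb(PARAMS, 4)
fp16_gb = weight_memory_gb(PARAMS, 2)

print(per_gpu_batch, max_gpu_hours, round(fp32_gb, 2), round(fp16_gb, 2))
```

The weights alone take about 2.5 GB in 32-bit and about 1.3 GB in 16-bit precision, which is consistent with the claim that inference fits on a single consumer-grade GPU.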
Features: Fine-tuning
Modalities
Input: image
Output: structured data (embeddings)
Capabilities and applications
Native model capabilities
Vision encoder
The model's ability to encode images and video frames into dense representations (embeddings), used for downstream tasks or as a backbone for vision-language models.
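As an illustration of how such embeddings might be used downstream, the sketch below mean-pools per-patch vectors into one image-level embedding and compares images by cosine similarity. The tiny 3-dimensional vectors are made-up stand-ins for encoder outputs, and `mean_pool` and `cosine_similarity` are hypothetical helpers. Mean pooling is one common choice and is assumed here, not taken from this card.

```python
import math


def mean_pool(patch_embeddings):
    """Average a list of per-patch vectors into a single image-level vector."""
    n = len(patch_embeddings)
    dim = len(patch_embeddings[0])
    return [sum(vec[i] for vec in patch_embeddings) / n for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up patch embeddings standing in for encoder outputs (two patches per image).
image_a = [[1.0, 0.0, 1.0], [3.0, 0.0, 1.0]]
image_b = [[2.0, 0.0, 1.0], [2.0, 0.0, 1.0]]
image_c = [[0.0, 5.0, 0.0], [0.0, 3.0, 0.0]]

emb_a, emb_b, emb_c = mean_pool(image_a), mean_pool(image_b), mean_pool(image_c)
print(cosine_similarity(emb_a, emb_b), cosine_similarity(emb_a, emb_c))
```

Images whose pooled embeddings point in the same direction score close to 1, while unrelated ones score close to 0. The same pooled vectors could also feed a linear classifier or serve as input to a vision-language model.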
Category: vision
Technical architecture
Core Architecture
Training Techniques