Vision Transformer (ViT)

Today I learned that the latest Sora-like video generation technology is largely built on the Vision Transformer. I don't fully understand it yet and may explain parts of it incorrectly; these notes are mainly for my own learning.

Overview of the Vision Transformer (ViT)

ViT is an image classification model that processes an image as a sequence of patches with a standard Transformer encoder. It was first successfully applied to large-scale image recognition in the 2020 paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" by Alexey Dosovitskiy et al., where it showed excellent performance and helped drive visual representation learning and modern computer vision forward.

Core Concept

  • Divide the image into non-overlapping patches of a fixed size (e.g., 16x16 pixels), flatten each patch, and embed it with a linear projection.
  • Add positional encodings to preserve spatial information, since the Transformer itself is insensitive to the order of its inputs.
  • Feed the resulting sequence of patch embeddings into a standard Transformer encoder.
  • Prepend a learnable [CLS] token whose output aggregates whole-image information for the classification task (a minimal sketch follows this list).
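
To make the patch pipeline concrete, here is a minimal sketch in PyTorch. The 224x224 input size, 16x16 patches, and 768-dimensional embedding are just illustrative choices, not tied to any particular released implementation:

```python
import torch

# Example input: a batch of one 3-channel 224x224 image.
img = torch.randn(1, 3, 224, 224)
patch = 16                                  # patch side length in pixels
n_patches = (224 // patch) ** 2             # 14 * 14 = 196 patches

# Cut the image into non-overlapping 16x16 patches and flatten each one.
patches = img.unfold(2, patch, patch).unfold(3, patch, patch)          # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, n_patches, -1)  # (1, 196, 768)

# Linearly embed each flattened patch into the model dimension.
embed_dim = 768
proj = torch.nn.Linear(patch * patch * 3, embed_dim)
tokens = proj(patches)                      # (1, 196, 768): the sequence fed to the encoder
print(tokens.shape)
```

Positional encodings and the [CLS] token are then added to this sequence before it enters the encoder, as in the fuller sketch later in this post.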

Research Contributions

  • It showed that a pure Transformer architecture, without relying on convolutional neural networks (CNNs), can achieve excellent performance on image classification tasks.
  • After pre-training on large-scale datasets (such as ImageNet-21k), ViT transfers well to mid-size and small image recognition benchmarks (such as ImageNet, CIFAR-100, and VTAB) while requiring substantially fewer computational resources to train (a transfer-learning sketch follows this list).
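
In practice, this transfer setting is easy to try with an off-the-shelf pre-trained checkpoint. The sketch below assumes the third-party timm library and its ViT-B/16 checkpoint name; the optimizer settings and the stand-in batch are purely illustrative, not a recipe from the paper:

```python
import timm
import torch

# Load a ViT-B/16 model pre-trained on large-scale data and swap in a new
# classification head for a smaller downstream benchmark (e.g. 100 classes).
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=100)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
criterion = torch.nn.CrossEntropyLoss()

# Stand-in batch; real fine-tuning would iterate over a dataset such as CIFAR-100.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 100, (8,))

logits = model(images)           # (8, 100)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
```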

Detailed architecture of ViT

1. Image processing flow

  • Divide the input image into fixed-size, non-overlapping patches (e.g., 16x16 pixels).
  • Flatten each patch and embed it as a vector through a linear layer.
  • Add an absolute positional encoding to each patch embedding to retain spatial information.
  • Feed the sequence of patch embeddings into a standard Transformer encoder.

2. Classification mechanism

  • A learnable [CLS] token is prepended to the input sequence; after the Transformer encoder, the output vector at this token's position is used for the classification task (see the sketch below).
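
Putting the pieces together, a toy ViT-style classifier might look like the sketch below. It uses PyTorch's built-in nn.TransformerEncoder and a strided convolution for the patch embedding; the layer sizes are small example values, and this is my own simplification rather than the paper's reference implementation:

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=192, depth=4, heads=3, n_classes=10):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # A strided conv is a compact way to "split into patches + linearly embed".
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # Learnable [CLS] token and absolute position embeddings.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.patch_embed(x)                # (B, dim, H/patch, W/patch)
        x = x.flatten(2).transpose(1, 2)       # (B, n_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])              # classify from the [CLS] position

logits = TinyViT()(torch.randn(2, 3, 224, 224))
print(logits.shape)                            # torch.Size([2, 10])
```

Classifying from the [CLS] output is just one pooling choice; mean-pooling over the patch tokens is a common alternative.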

Comparative analysis

a. Architecture and design

| Characteristics | Vision Transformer (ViT) | Autoregressive Transformer (AR) | Diffusion Transformer (DiT) |
| --- | --- | --- | --- |
| Data processing | Treats images as a sequence of patches | Processes sequence data (text, images) | Models data through noise perturbation and denoising |
| Positional encoding | Crucial for spatial information | Critical for maintaining sequence order | Used to maintain structure during the diffusion process |
| Model components | Patch embedding, Transformer encoder | Masked self-attention, Transformer decoder | Transformer layers in diffusion steps |
| Generation capability | Limited (mainly used for discriminative tasks) | Strong generation capability | Strong generation capability with high fidelity |

b. Application fields

| Application field | ViT | Autoregressive Transformer (AR) | Diffusion Transformer (DiT) |
| --- | --- | --- | --- |
| Image classification | Main purpose | Less common; may be possible on image sequences | Typically not used for classification tasks |
| Image generation | Limited; requires modification | Effective when images are treated as sequences | Highly effective, with state-of-the-art quality |
| Natural language processing | Not directly applicable | Core application (e.g., GPT models) | More limited unless integrated into multimodal models |
| Other fields | Object detection, segmentation | Music generation, code generation, etc. | Audio synthesis, video generation, etc. |

c. Advantages and merits

| Aspect | ViT | Autoregressive Transformer (AR) | Diffusion Transformer (DiT) |
| --- | --- | --- | --- |
| Performance | Competitive with CNNs on visual tasks | Superior performance on generation tasks | Leading in high-fidelity generation |
| Scalability | Scales well with increasing data and model size | Highly scalable, benefiting from large-scale datasets | Scalable, but computationally intensive due to the multi-step diffusion process |
| Flexibility | Mainly used for visual tasks; adaptable to some others | Versatile across multiple domains | Mainly used for generative tasks; can be adapted through conditioning |
| Interpretability | The patch-based approach offers some interpretability | The sequential nature helps in understanding the generation process | Harder to interpret due to the complexity of the diffusion process |

d. Limitations and Challenges

| Aspect | ViT | Autoregressive Transformer (AR) | Diffusion Transformer (DiT) |
| --- | --- | --- | --- |
| Data efficiency | Requires a large amount of data to perform well | May require a large amount of data, especially for long sequences | Very data- and compute-hungry |
| Computational cost | High due to the Transformer layers, especially at high image resolutions | High for long sequences due to self-attention | Very high due to the iterative denoising steps |
| Training complexity | Training from scratch without pretraining can be challenging | Requires careful handling of sequence length and masking | Complex due to the combined diffusion and Transformer training |
| Generation quality | Limited compared to specialized generative models | May struggle to reach high fidelity without sufficient training | May produce artifacts if trained improperly, but generally high quality |