Today, I learned that the latest Sora-like video generation technology is mainly built on the Vision Transformer. I don't fully understand it yet and may explain some things incorrectly; these notes are mainly for my own learning.

Vision Transformer (ViT): Overview
ViT is an image classification model that processes images as sequences of patches using a standard Transformer architecture. It was introduced by Alexey Dosovitskiy et al. in the 2020 paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", which first applied a pure Transformer successfully to large-scale image recognition, demonstrated excellent performance, and spurred the development of visual representation learning in modern computer vision.
Core Concept
- Divide the image into non-overlapping patches of a fixed size (e.g., 16x16 pixels), flatten each patch, and project it through a linear embedding layer.
- Add positional encoding to preserve spatial information, since the Transformer itself is permutation-invariant and has no built-in notion of patch order.
- Feed the resulting sequence of patch embeddings into a standard Transformer encoder.
- Prepend a learnable [CLS] token that aggregates full-image information for classification (a minimal code sketch follows this list).
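To make the pipeline concrete, here is a minimal sketch of the patch-embedding step in PyTorch. The shapes (16x16 patches, embedding width 768) follow the paper's ViT-Base configuration, but the code itself is my own illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and linearly embed each one."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A Conv2d with stride == kernel size is equivalent to
        # "flatten each patch, then apply a shared linear layer".
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # Learnable [CLS] token and absolute position embeddings
        # (one extra position slot for the [CLS] token).
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)     # (B, 196, 768): one token per patch
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)       # prepend [CLS]: (B, 197, 768)
        return x + self.pos_embed            # add spatial position information
```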
Research Contributions
- Demonstrated that a pure Transformer architecture, without relying on convolutional neural networks (CNNs), can achieve excellent performance on image classification tasks.
- Showed that after pre-training on large-scale datasets (such as ImageNet-21k), ViT transfers well to mid- and small-scale image recognition benchmarks (such as ImageNet, CIFAR-100, and VTAB) while requiring substantially fewer computational resources to train (a hedged transfer-learning sketch follows this list).
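As an illustration of the pretrain-then-transfer recipe, the sketch below loads an ImageNet-pretrained ViT-B/16 from torchvision and swaps its classification head for a 100-class target task. Note the paper pre-trains on ImageNet-21k, while these published weights are ImageNet-1k; attribute names like `heads.head` follow recent torchvision versions and should be checked against your install:

```python
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Load ViT-B/16 pretrained on ImageNet as a transfer-learning starting point.
model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)

# Replace the classification head for a 100-class task (e.g., CIFAR-100).
model.heads.head = nn.Linear(model.heads.head.in_features, 100)

# Optionally freeze the backbone and fine-tune only the new head.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("heads")
```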
Detailed architecture of ViT

1. Image processing flow
- Divide the input image into fixed-size, non-overlapping patches (e.g., 16x16 pixels).
- Flatten each patch and embed it as a vector through a linear layer.
- Add absolute positional encoding to each patch embedding to retain spatial information.
- Feed the sequence of all patch embeddings into a standard Transformer encoder.
2. Classification mechanism
- A special [CLS] token is prepended to the input sequence; after processing by the Transformer encoder, this token's output vector is used for classification (a minimal end-to-end sketch follows).
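Putting steps 1 and 2 together, here is a hedged sketch of a full ViT classifier built on PyTorch's stock encoder. Hyperparameters loosely follow ViT-Base, and `PatchEmbedding` is the module sketched in the Core Concept section above; this is not the authors' reference implementation:

```python
import torch.nn as nn

class ViT(nn.Module):
    """Minimal ViT classifier: patch embedding -> Transformer encoder -> [CLS] head."""
    def __init__(self, num_classes=1000, embed_dim=768, depth=12, num_heads=12):
        super().__init__()
        # PatchEmbedding is the module defined in the earlier sketch.
        self.patch_embed = PatchEmbedding(embed_dim=embed_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=4 * embed_dim, batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        tokens = self.patch_embed(x)     # (B, 197, 768), [CLS] at index 0
        tokens = self.encoder(tokens)    # bidirectional self-attention, no mask
        return self.head(tokens[:, 0])   # classify from the [CLS] token's output
```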
Comparative analysis
a. Architecture and design
| Characteristic | Vision Transformer (ViT) | Autoregressive Transformer (AR) | Diffusion Transformer (DiT) |
|---|---|---|---|
| Data processing | Treats images as a sequence of patches | Processes sequence data (text, images) | Models data through noise perturbation and denoising |
| Positional encoding | Crucial for spatial information | Critical for maintaining sequence order | Used to maintain structure during the diffusion process |
| Model components | Patch embedding, Transformer encoder | Masked self-attention, Transformer decoder | Transformer layers within diffusion steps |
| Generation capability | Limited (mainly used for discriminative tasks) | Strong | Strong, with high fidelity |
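One concrete difference behind the "Model components" row: a ViT encoder attends bidirectionally with no mask, while an AR decoder applies a causal mask so each token only attends to its predecessors. A minimal sketch of the mask construction:

```python
import torch

n = 5  # sequence length (patches for ViT, generated tokens for AR)

# ViT encoder: no attention mask; every patch attends to every other patch.
vit_mask = None

# AR decoder: causal mask; True entries above the diagonal are blocked,
# so position i can only attend to positions <= i.
ar_mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
print(ar_mask)
```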
b. Application fields
| Application field | ViT | Autoregressive Transformer (AR) | Diffusion Transformer (DiT) |
|---|---|---|---|
| Image classification | Main purpose | Less common; possible by treating images as sequences | Typically not used for classification |
| Image generation | Limited; requires modification | Effective when images are treated as sequences | Highly effective, state-of-the-art quality |
| Natural language processing | Not directly applicable | Core application (e.g., GPT models) | More limited, unless integrated into multimodal models |
| Other fields | Object detection, segmentation | Music generation, code generation, etc. | Audio synthesis, video generation, etc. |
c. Advantages and merits
| Aspect | ViT | Autoregressive Transformer (AR) | Diffusion Transformer (DiT) |
|---|---|---|---|
| Performance | Competitive with CNNs on visual tasks | Superior performance on generation tasks | Leading in high-fidelity generation |
| Scalability | Scales well with increasing data and model size | Highly scalable, benefiting from large-scale datasets | Scalable, but computationally intensive due to the multi-step diffusion process |
| Flexibility | Mainly used for visual tasks; adaptable to some others | Versatile across multiple domains | Mainly used for generative tasks; can be adapted through conditioning |
| Interpretability | The patch-based design offers some interpretability | The sequential nature helps in understanding the generation process | Harder to interpret due to the complexity of the diffusion process |
d. Limitations and Challenges
| Aspect | ViT | Autoregressive Transformer (AR) | Diffusion Transformer (DiT) |
|---|---|---|---|
| Data efficiency | Requires a large amount of data to perform well | May require large amounts of data, especially for long sequences | Extremely data- and compute-hungry |
| Computational cost | High, due to the Transformer layers, especially at high image resolutions | High for long sequences, due to self-attention | Very high, due to the iterative denoising steps |
| Training complexity | Training from scratch without pretraining can be challenging | Requires careful handling of sequence length and masking | Complex, due to the combination of the diffusion process and the Transformer |
| Generation quality | Limited compared to specialized generative models | May struggle to reach high fidelity without sufficient training | Generally high, but may produce artifacts if trained improperly |
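To make the "iterative denoising" cost row concrete: a diffusion sampler calls the Transformer once per timestep, so generating one sample costs roughly T full forward passes. Below is a schematic DDPM-style sampling loop; `model`, the beta schedule, and T are illustrative placeholders, not any specific DiT implementation:

```python
import torch

@torch.no_grad()
def sample(model, shape, betas):
    """Schematic ancestral sampling: one model forward pass per timestep."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                 # start from pure noise
    T = betas.shape[0]
    for t in reversed(range(T)):           # T iterations => T forward passes
        eps = model(x, t)                  # predict the noise added at step t
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:                          # add noise except at the final step
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x

# Example schedule: with 1000 steps, sampling costs 1000 forward passes.
# betas = torch.linspace(1e-4, 0.02, 1000)
```

By contrast, a ViT classifier or an AR model with key-value caching pays one forward pass per input (or per generated token), which is one reason the table flags DiT's computational cost as the highest of the three.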