Today, I learned that the latest Sora-like video generation technology is mainly built on the Vision Transformer. I don't fully understand it yet and may explain some things incorrectly; these notes are mainly for my own learning.

Vision Transformer (ViT): Overview
ViT is an image classification model that processes images as sequences of patches using a standard Transformer architecture. It was introduced by Alexey Dosovitskiy et al. in the 2020 paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", which first applied a pure Transformer successfully to large-scale image recognition, demonstrated excellent performance, and spurred the development of visual representation learning in modern computer vision.
Core Concept
- Divide the image into non-overlapping patches of a fixed size (e.g., 16x16 pixels), flatten each patch, and project it through a linear embedding layer.
- Add positional encoding to preserve spatial information, since the Transformer itself is permutation-invariant and has no built-in notion of patch order.
- Feed the resulting sequence of patch embeddings into a standard Transformer encoder.
- Prepend a learnable [CLS] token that aggregates full-image information for classification (a minimal code sketch follows this list).
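To make the pipeline concrete, here is a minimal sketch of the patch-embedding step in PyTorch. The shapes (16x16 patches, embedding width 768) follow the paper's ViT-Base configuration, but the code itself is my own illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and linearly embed each one."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A Conv2d with stride == kernel size is equivalent to
        # "flatten each patch, then apply a shared linear layer".
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # Learnable [CLS] token and absolute position embeddings
        # (one extra position slot for the [CLS] token).
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)     # (B, 196, 768): one token per patch
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)       # prepend [CLS]: (B, 197, 768)
        return x + self.pos_embed            # add spatial position information
```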
Research Contributions
- Demonstrated that a pure Transformer architecture, without relying on convolutional neural networks (CNNs), can achieve excellent performance on image classification tasks.
- Showed that after pre-training on large-scale datasets (such as ImageNet-21k), ViT transfers well to mid- and small-scale image recognition benchmarks (such as ImageNet, CIFAR-100, and VTAB) while requiring substantially fewer computational resources to train (a hedged transfer-learning sketch follows this list).
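As an illustration of the pretrain-then-transfer recipe, the sketch below loads an ImageNet-pretrained ViT-B/16 from torchvision and swaps its classification head for a 100-class target task. Note the paper pre-trains on ImageNet-21k, while these published weights are ImageNet-1k; attribute names like `heads.head` follow recent torchvision versions and should be checked against your install:

```python
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Load ViT-B/16 pretrained on ImageNet as a transfer-learning starting point.
model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)

# Replace the classification head for a 100-class task (e.g., CIFAR-100).
model.heads.head = nn.Linear(model.heads.head.in_features, 100)

# Optionally freeze the backbone and fine-tune only the new head.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("heads")
```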
Detailed architecture of ViT

1. Image processing flow
- Divide the input image into fixed-size, non-overlapping patches (e.g., 16x16 pixels).
- Flatten each patch and embed it as a vector through a linear layer.
- Add absolute positional encoding to each patch embedding to retain spatial information.
- Feed the sequence of all patch embeddings into a standard Transformer encoder.
2. Classification mechanism
- A special [CLS] token is prepended to the input sequence; after processing by the Transformer encoder, this token's output vector is used for classification (a minimal end-to-end sketch follows).
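Putting steps 1 and 2 together, here is a hedged sketch of a full ViT classifier built on PyTorch's stock encoder. Hyperparameters loosely follow ViT-Base, and `PatchEmbedding` is the module sketched in the Core Concept section above; this is not the authors' reference implementation:

```python
import torch.nn as nn

class ViT(nn.Module):
    """Minimal ViT classifier: patch embedding -> Transformer encoder -> [CLS] head."""
    def __init__(self, num_classes=1000, embed_dim=768, depth=12, num_heads=12):
        super().__init__()
        # PatchEmbedding is the module defined in the earlier sketch.
        self.patch_embed = PatchEmbedding(embed_dim=embed_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=4 * embed_dim, batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        tokens = self.patch_embed(x)     # (B, 197, 768), [CLS] at index 0
        tokens = self.encoder(tokens)    # bidirectional self-attention, no mask
        return self.head(tokens[:, 0])   # classify from the [CLS] token's output
```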
Comparative analysis
a. Architecture and design
| Characteristic | Vision Transformer (ViT) | Autoregressive Transformer (AR) | Diffusion Transformer (DiT) |
|---|---|---|---|
| Data processing | Treats images as a sequence of patches | Processes sequence data (text, images) | Models data through noise perturbation and denoising |
| Positional encoding | Crucial for spatial information | Critical for maintaining sequence order | Used to maintain structure during the diffusion process |
| Model components | Patch embedding, Transformer encoder | Masked self-attention, Transformer decoder | Transformer layers within diffusion steps |
| Generation capability | Limited (mainly used for discriminative tasks) | Strong | Strong, with high fidelity |
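One concrete difference behind the "Model components" row: a ViT encoder attends bidirectionally with no mask, while an AR decoder applies a causal mask so each token only attends to its predecessors. A minimal sketch of the mask construction:

```python
import torch

n = 5  # sequence length (patches for ViT, generated tokens for AR)

# ViT encoder: no attention mask; every patch attends to every other patch.
vit_mask = None

# AR decoder: causal mask; True entries above the diagonal are blocked,
# so position i can only attend to positions <= i.
ar_mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
print(ar_mask)
```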
b. Application fields
| Application field | ViT | Autoregressive Transformer (AR) | Diffusion Transformer (DiT) |
|---|---|---|---|
| Image classification | Main purpose | Less common; possible by treating images as sequences | Typically not used for classification |
| Image generation | Limited; requires modification | Effective when images are treated as sequences | Highly effective, state-of-the-art quality |
| Natural language processing | Not directly applicable | Core application (e.g., GPT models) | More limited, unless integrated into multimodal models |
| Other fields | Object detection, segmentation | Music generation, code generation, etc. | Audio synthesis, video generation, etc. |
c. Advantages and merits
| Aspect | ViT | Autoregressive Transformer (AR) | Diffusion Transformer (DiT) |
|---|---|---|---|
| Performance | Competitive with CNNs on visual tasks | Superior performance on generation tasks | Leading in high-fidelity generation |
| Scalability | Scales well with increasing data and model size | Highly scalable, benefiting from large-scale datasets | Scalable, but computationally intensive due to the multi-step diffusion process |
| Flexibility | Mainly used for visual tasks; adaptable to some others | Versatile across multiple domains | Mainly used for generative tasks; can be adapted through conditioning |
| Interpretability | The patch-based design offers some interpretability | The sequential nature helps in understanding the generation process | Harder to interpret due to the complexity of the diffusion process |
d. Limitations and Challenges
| Aspect | ViT | Autoregressive Transformer (AR) | Diffusion Transformer (DiT) |
|---|---|---|---|
| Data efficiency | Requires a large amount of data to perform well | May require large amounts of data, especially for long sequences | Extremely data- and compute-hungry |
| Computational cost | High, due to the Transformer layers, especially at high image resolutions | High for long sequences, due to self-attention | Very high, due to the iterative denoising steps |
| Training complexity | Training from scratch without pretraining can be challenging | Requires careful handling of sequence length and masking | Complex, due to the combination of the diffusion process and the Transformer |
| Generation quality | Limited compared to specialized generative models | May struggle to reach high fidelity without sufficient training | Generally high, but may produce artifacts if trained improperly |
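To make the "iterative denoising" cost row concrete: a diffusion sampler calls the Transformer once per timestep, so generating one sample costs roughly T full forward passes. Below is a schematic DDPM-style sampling loop; `model`, the beta schedule, and T are illustrative placeholders, not any specific DiT implementation:

```python
import torch

@torch.no_grad()
def sample(model, shape, betas):
    """Schematic ancestral sampling: one model forward pass per timestep."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                 # start from pure noise
    T = betas.shape[0]
    for t in reversed(range(T)):           # T iterations => T forward passes
        eps = model(x, t)                  # predict the noise added at step t
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:                          # add noise except at the final step
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x

# Example schedule: with 1000 steps, sampling costs 1000 forward passes.
# betas = torch.linspace(1e-4, 0.02, 1000)
```

By contrast, a ViT classifier or an AR model with key-value caching pays one forward pass per input (or per generated token), which is one reason the table flags DiT's computational cost as the highest of the three.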