RE: What does the [class] token in BERT and ViT actually do?
Similar to BERT’s [class] token, we prepend a learnable embedding to the sequence of embedded patches (z_0^0 = x_class), whose state at the output of the Transformer encoder (z_L^0) serves as the image representation y (Eq. 4). Both during pre-training and fine-tuning, a classification head is attached to z_L^0. The classification head is implemented by a MLP with one hidden layer at pre-training time and by a single linear layer at fine-tuning time.
I don't understand this and have not understood this in BERT. Why would I prepend a token to the sequence and not just use the last one?
The [CLS] token serves a special purpose in both models, BERT (Bidirectional Encoder Representations from Transformers) and ViT (Vision Transformer). When such a model is fed an input, whether a sequence of words or a sequence of image patches, each element is converted into an embedding, and these vectors are then processed by the Transformer layers. The [CLS] token is an extra learnable token prepended to the beginning of that input.

The [CLS] (short for "classification") token does not carry any input meaning of its own. Its purpose is to provide a dedicated position in the sequence where the model's final contextualized representation can be pooled. Its output embedding serves as an aggregate representation of the entire sequence and is fed to the downstream head, particularly for classification problems. It is at this position that the model learns to encode whatever information the task at hand needs, say sentiment for sentiment analysis, or object identity for image classification.

Using the last token of the sequence instead would work less well. Note that the "last token sees the most context" intuition applies to causal (left-to-right) models; BERT and ViT use bidirectional self-attention, so every position, including the last one, attends to every other. The real issue is that the last token (like any content token) must also encode its own input, a specific word or image patch, so its representation is biased toward that content. The [CLS] token has no input content of its own: it receives context from all tokens through the layers of attention and is free to act purely as an aggregate of the whole sequence. There is also a practical point: real sequences vary in length and are padded, so "the last token" sits at a different position for every example, whereas [CLS] is always at position 0.

So the [CLS] token is a sensible, dedicated slot for pooling an aggregate sequence representation for downstream tasks. Remember, this all happens within the Transformer architecture, where the influence of each input token on every other is computed dynamically through attention.
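The mechanics are easy to see in code. Below is a minimal, hedged sketch in PyTorch of the ViT-style setup: a learnable [CLS] vector is prepended to the patch embeddings, the whole sequence goes through a Transformer encoder, and only the output at position 0 is passed to the classification head (here a single linear layer, as in fine-tuning). The class name, dimensions, and hyperparameters are illustrative choices, not values from either paper.

```python
import torch
import torch.nn as nn

class TinyEncoderWithCLS(nn.Module):
    """Sketch: prepend a learnable [CLS] token, encode, pool position 0."""

    def __init__(self, dim=64, num_classes=10, depth=2, heads=4):
        super().__init__()
        # One learnable vector (x_class in the ViT paper), shared across the batch.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Fine-tuning-style head: a single linear layer on the [CLS] output.
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                        # x: (batch, seq_len, dim)
        b = x.shape[0]
        cls = self.cls_token.expand(b, -1, -1)   # (batch, 1, dim)
        x = torch.cat([cls, x], dim=1)           # prepend -> (batch, seq_len+1, dim)
        x = self.encoder(x)                      # bidirectional self-attention
        return self.head(x[:, 0])                # pool ONLY the [CLS] position

model = TinyEncoderWithCLS()
patches = torch.randn(8, 16, 64)   # e.g. 16 "patch" embeddings per image
logits = model(patches)            # shape (8, 10)
```

Note that `x[:, 0]` is the whole pooling step: because attention lets position 0 gather information from every patch, discarding the other positions still leaves a sequence-level summary for the head.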