What does the [class] token in BERT and ViT actually do?

Similar to BERT’s [class] token, we prepend a learnable embedding to the sequence of embedded patches (z_0^0 = x_class), whose state at the output of the Transformer encoder (z_L^0) serves as the image representation y (Eq. 4). Both during pre-training and fine-tuning, a classification head is attached to z_L^0. The classification head is implemented by a MLP with one hidden layer at pre-training time and by a single linear layer at fine-tuning time.

I don't understand this, and I never understood it in BERT either. Why would I prepend a token to the sequence instead of just using the last one?

2 Answer(s)
The [class] token (known as the [CLS] token in BERT) is a special token prepended to the sequence for a few reasons:

1. Global context: Because every position in a Transformer attends to every other position, the [CLS] token can accumulate a contextual representation of the entire input sequence, which makes it a natural place to read off a summary for classification tasks.

2. A fixed, content-free position: The [CLS] token also marks the start of the input and gives the model a dedicated slot that carries no content of its own. In BERT, which can process two sentences at once (e.g., for next-sentence prediction), the sentences themselves are separated by the [SEP] token and segment embeddings; [CLS] simply provides a consistent position whose output the model learns to use for the sequence-level decision.

3. Sequence-level representation: In both BERT and ViT, the final hidden state of the [CLS]/[class] token is used as the aggregate representation of the input, and the classification head is attached to it.

Simply taking the final token of the sequence would not reliably accomplish this: the last position is an ordinary content token (a word or an image patch), so its output must also encode that token's own meaning in context rather than serving as a dedicated summary of the whole input. Hence the [class] token plays an important role in these Transformer-based models. A minimal code sketch of this pooling is shown below.
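For concreteness, here is a minimal sketch of pooling the [CLS] hidden state with the Hugging Face transformers library. The model name "bert-base-uncased", the example sentence, and the untrained two-label linear head are illustrative assumptions, not anything prescribed by BERT itself.

```python
# Minimal sketch, not a full fine-tuning setup. Assumes torch and transformers are installed.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The movie was great!", return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()))
# e.g. ['[CLS]', 'the', 'movie', 'was', 'great', '!', '[SEP]']  (the tokenizer prepends [CLS])

with torch.no_grad():
    outputs = model(**inputs)

# Final hidden state of the [CLS] token (position 0): the sequence-level representation.
cls_state = outputs.last_hidden_state[:, 0, :]   # shape: (batch, hidden_size)

# A classification head attached to the [CLS] state; the weights here are untrained,
# purely to show where the head plugs in.
num_labels = 2
head = torch.nn.Linear(model.config.hidden_size, num_labels)
logits = head(cls_state)
print(logits.shape)   # torch.Size([1, 2])
```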
Answered on August 2, 2023.
The [CLS] token serves a special purpose in both BERT (Bidirectional Encoder Representations from Transformers) and ViT (Vision Transformer). When such a model is fed input (a sequence of words or a sequence of image patches), each element is first converted into an embedding, and these vectors are then processed by the Transformer layers.

The [CLS] token (short for "classification") is an extra token added to the beginning of the input. Its purpose is not to carry any content of its own but to provide a fixed position in the input sequence where the model's final contextualized representation can be pooled. Its output embedding serves as an aggregate representation of the entire sequence of embeddings and is used for downstream tasks, particularly classification problems. It is at this position that the model learns to encode information relevant to the task at hand, for instance sentiment analysis or image classification.

Using the final token of the sequence for these tasks would not work as well. The last position corresponds to an actual word or image patch, so its output embedding must also represent that token in context, whereas the [CLS] token receives context from all tokens through the stacked attention layers while carrying no content of its own, so it can be devoted entirely to summarizing the input. This makes the [CLS] position a sensible and useful choice for pooling an aggregate sequence representation for downstream tasks; see the sketch below for how this looks in ViT. Remember that this works because, under the Transformer architecture, the influence of each input token on every other token is computed dynamically through self-attention.
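To make the ViT side concrete, here is a minimal sketch in plain PyTorch of prepending a learnable [class] token to patch embeddings and classifying from its output state. The class name PatchClassifier and all sizes (196 patches, dim 768, 2 encoder layers, 1000 classes) are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn as nn

class PatchClassifier(nn.Module):
    def __init__(self, num_patches=196, dim=768, num_classes=1000, depth=2, heads=12):
        super().__init__()
        # Learnable [class] embedding that gets prepended to the patch sequence.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)   # fine-tuning-style single linear head

    def forward(self, patch_embeddings):            # (batch, num_patches, dim)
        b = patch_embeddings.shape[0]
        cls = self.cls_token.expand(b, -1, -1)      # one copy of the [class] token per example
        x = torch.cat([cls, patch_embeddings], dim=1)   # prepend: position 0 is now [class]
        x = x + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])                   # classify from the [class] output state

model = PatchClassifier()
patches = torch.randn(2, 196, 768)                  # e.g. 14 x 14 patches, each embedded to dim 768
logits = model(patches)
print(logits.shape)                                 # torch.Size([2, 1000])
```

Note that only the state at position 0 feeds the classification head; the model has to learn, through self-attention, to route whatever is useful for the image-level decision into that slot.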
Answered on August 24, 2023.