RE: What does the [class] token in BERT and ViT actually do?
Similar to BERT’s [class] token, we prepend a learnable embedding to the sequence of embedded patches (z^0_0 = x_class), whose state at the output of the Transformer encoder (z^0_L) serves as the image representation y (Eq. 4). Both during pre-training and fine-tuning, a classification head is attached to z^0_L. The classification head is implemented by a MLP with one hidden layer at pre-training time and by a single linear layer at fine-tuning time.
I don't understand this, and I never understood it in BERT either. Why would I prepend a token to the sequence instead of just using the last one?
The [class] token (called the [CLS] token in BERT) is a special token prepended to every input sequence. It is there for a few reasons:
1. Global context: Because self-attention lets every position attend to every other position, the [class] token can aggregate information from the entire input. Its final representation therefore has the potential to capture the global context of the whole sequence, which is exactly what a classification task needs.
2. Fixed, content-free position: The [class] token also gives the model a fixed slot at the start of every input that carries no input content of its own. Note that in BERT, which processes two sentences at once (e.g., for next sentence prediction), it is the [SEP] token that separates the two sentences; the [CLS] token sits in front of the whole pair and, through the next sentence prediction objective, is trained to summarize the pair as a whole.
3. Sequence-level representation: In both BERT and ViT, the final hidden state of the [class] token is used as the aggregate representation of the input and fed to the classification head, for sentence-level prediction in BERT and image-level prediction in ViT. This is exactly the z^0_L → y step in the passage you quoted; the sketch after this list shows the same mechanics in code.
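To make that concrete, here is a minimal PyTorch sketch of the ViT-style readout. This is not the authors' code: the class name `TinyViTClassifier` and all hyperparameters are made up for illustration. A learnable embedding is prepended to the patch embeddings, the whole sequence runs through the encoder, and only position 0 of the output is passed to the classification head.

```python
import torch
import torch.nn as nn

class TinyViTClassifier(nn.Module):
    """Toy ViT-style classifier illustrating the [class] token mechanics."""

    def __init__(self, num_patches=16, dim=64, depth=2, num_heads=4, num_classes=10):
        super().__init__()
        # Learnable [class] token, prepended to the patch embeddings (z^0_0 = x_class).
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        # Learnable position embeddings for the [class] token plus all patches.
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches + 1, dim) * 0.02)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        # Fine-tuning-style head: a single linear layer on the [class] output.
        self.head = nn.Linear(dim, num_classes)

    def forward(self, patch_embeddings):
        # patch_embeddings: (batch, num_patches, dim)
        b = patch_embeddings.size(0)
        cls = self.cls_token.expand(b, -1, -1)         # (batch, 1, dim)
        x = torch.cat([cls, patch_embeddings], dim=1)  # prepend the [class] token
        x = x + self.pos_embed
        x = self.encoder(x)                            # every token attends to every other
        cls_out = x[:, 0]                              # z^0_L: the [class] token's final state
        return self.head(cls_out)                      # image representation -> class logits
```

A quick call like `TinyViTClassifier()(torch.randn(2, 16, 64))` returns a (2, 10) tensor of logits; nothing but the first output position ever reaches the head.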
Simply taking the last token of the sequence is common in left-to-right (decoder-only) models, where the final position is the only one that has seen the whole input. But BERT and ViT are bidirectional encoders: the last position is just the final word or patch, and its hidden state still mostly represents that word or patch plus its context. The [class] token, by contrast, carries no input content of its own, so the only useful thing it can learn to do is aggregate the rest of the sequence, and the classification head is trained end to end on exactly that position. That is why it plays this role in these transformer-based models; below is a quick way to look at the two readouts side by side.
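If you use the Hugging Face transformers library, you can pull out both readouts and compare them. This is a small sketch assuming the bert-base-uncased checkpoint is available to download:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The movie was great!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

hidden = outputs.last_hidden_state  # (batch, seq_len, hidden_size)
cls_repr = hidden[:, 0]             # [CLS] token: trained to summarize the whole sequence
last_repr = hidden[:, -1]           # final token (here [SEP]): just another context token
```

For what it's worth, `outputs.pooler_output` is that same [CLS] hidden state passed through an extra linear layer with a tanh activation, which is what BERT's original classification setup feeds to the task head.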