RE: What does the [class] token in BERT and ViT actually do?
Similar to BERT’s [class] token, we prepend a learnable embedding to the sequence of embedded patches (z_0^0 = x_class), whose state at the output of the Transformer encoder (z_L^0) serves as the image representation y (Eq. 4). Both during pre-training and fine-tuning, a classification head is attached to z_L^0. The classification head is implemented by an MLP with one hidden layer at pre-training time and by a single linear layer at fine-tuning time.
I don't understand this, and I didn't understand it in BERT either. Why would I prepend a token to the sequence instead of just using the last one?
The [class] token (known as the [CLS] token in BERT) is a special learnable token prepended to the sequence for a few reasons:

1. Global context: Because self-attention lets every position attend to every other position, the [CLS] token can aggregate information from the entire input sequence, which makes its output state useful for classification. Crucially, it is not tied to any particular word or patch, so its representation is free to act as a summary rather than being pulled toward one input element.

2. Consistent position: Prepending puts the aggregate representation at a fixed position (index 0) regardless of sequence length, so the classification head always reads from the same slot. (Note that in BERT, separating the two sentences used for next sentence prediction is the job of the [SEP] token and segment embeddings, not [CLS].)

3. Sequence-level representation: In both BERT and ViT, the final hidden state of the [CLS] token is used as the aggregate representation of the whole input; the classification head is attached to this state for sequence-level (or image-level) prediction.

As for using the last token: in a causal, left-to-right model like GPT, the last token is the only one that has seen the entire sequence, and that is indeed what such models use. BERT and ViT are bidirectional, so every position already sees every other position, but the last token is still the embedding of an actual word or patch. Its output state therefore tends to emphasize that element's local context rather than a neutral summary of the whole input, which is why a dedicated [CLS] token is used instead.
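To make the mechanism concrete, here is a minimal numpy sketch (not a real ViT; the dimensions, random initialization, and single-layer toy attention are all assumptions for illustration). It shows the two operations the paper describes: prepending a learnable x_class vector to the patch embeddings, and reading the image representation y from position 0 after the encoder:

```python
import numpy as np

# Toy dimensions (assumptions): 9 patches, embedding size 16.
num_patches, dim = 9, 16
rng = np.random.default_rng(0)

patch_embeddings = rng.normal(size=(num_patches, dim))  # output of the patch projection
cls_token = rng.normal(size=(1, dim))                   # learnable x_class (trained in practice)

# Prepend: z_0 = [x_class; patch embeddings] -> sequence length grows by 1.
z0 = np.concatenate([cls_token, patch_embeddings], axis=0)
assert z0.shape == (num_patches + 1, dim)

def toy_self_attention(z):
    """Stand-in for the Transformer encoder: one softmax self-attention layer,
    enough to show that position 0 mixes in information from every patch."""
    scores = z @ z.T / np.sqrt(z.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ z  # each row is a weighted average over all positions

zL = toy_self_attention(z0)

# The image representation y is the final state at position 0 (the [class] token);
# this is where the classification head would be attached.
y = zL[0]
print(y.shape)  # (16,)
```

After the attention step, `y` is no longer the raw `cls_token` vector: it has absorbed a weighted combination of all patch embeddings, which is exactly the "global context" property the answer describes.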