Attention
Discussion questions for attention
Concrete
- What made modelling long sequences hard with RNNs?
- In simplified attention, how do we compute the context vector for each token?
- How does attention help with modelling long sequences? What are the downsides, and how are they accounted for?
- How do we extend simplified attention to allow for trainable weights?
- What kind of attention mask do text-generating LLMs use? Is it possible to use other masks? Why would we do that, and what would the implications be?
- What is `register_buffer` used for? (See the first sketch after this list.)
- Is softmax necessary? What role does it play?
- How many context vectors do we need to compute?
- What are the different ways of implementing multi-head attention?
- Why do we need multi-head attention? What can it encode that a larger single-head attention can't?
- When implementing multi-head attention, what tensor operations do we need, and why? (See the multi-head sketch after this list.)
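Below is a minimal sketch (assuming PyTorch; the class name, the `d_in`/`d_out`/`context_length` arguments, and the dropout rate are illustrative rather than any particular reference implementation) that touches several of the concrete questions above: trainable query/key/value projections, a causal mask stored with `register_buffer`, softmax over scaled scores, dropout on the attention weights, and one context vector per input token.

```python
import torch
import torch.nn as nn


class CausalSelfAttention(nn.Module):
    """Single-head self-attention with a causal mask (illustrative sketch)."""

    def __init__(self, d_in, d_out, context_length, dropout=0.1):
        super().__init__()
        self.W_q = nn.Linear(d_in, d_out, bias=False)  # trainable query projection
        self.W_k = nn.Linear(d_in, d_out, bias=False)  # trainable key projection
        self.W_v = nn.Linear(d_in, d_out, bias=False)  # trainable value projection
        self.dropout = nn.Dropout(dropout)
        # register_buffer stores the mask with the module: it moves with .to(device)
        # and is saved in the state_dict, but it is not a trainable parameter.
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length), diagonal=1).bool(),
        )

    def forward(self, x):                      # x: (batch, seq_len, d_in)
        b, seq_len, _ = x.shape
        q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
        # one attention score per (query token, key token) pair
        scores = q @ k.transpose(1, 2)         # (batch, seq_len, seq_len)
        # causal mask: a token may not attend to positions after it
        scores = scores.masked_fill(self.mask[:seq_len, :seq_len], float("-inf"))
        # softmax turns scores into non-negative weights that sum to 1 per row
        weights = torch.softmax(scores / k.shape[-1] ** 0.5, dim=-1)
        weights = self.dropout(weights)        # dropout is applied to the weights
        # one context vector per input token: a weighted sum of the value vectors
        return weights @ v                     # (batch, seq_len, d_out)
```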
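A multi-head variant, also a hedged sketch, done the common way: project once, split the last dimension into heads with `view`/`transpose`, attend per head in parallel, then merge the heads back and mix them with an output projection. It assumes `d_out` is divisible by `num_heads`.

```python
import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    """Multi-head causal attention via reshape/transpose (illustrative sketch)."""

    def __init__(self, d_in, d_out, context_length, num_heads, dropout=0.1):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads
        self.W_q = nn.Linear(d_in, d_out, bias=False)
        self.W_k = nn.Linear(d_in, d_out, bias=False)
        self.W_v = nn.Linear(d_in, d_out, bias=False)
        self.out_proj = nn.Linear(d_out, d_out)   # mixes information across heads
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length), diagonal=1).bool(),
        )

    def forward(self, x):                          # x: (b, t, d_in)
        b, t, _ = x.shape

        # project once, then split the last dimension into heads:
        # (b, t, d_out) -> (b, t, h, head_dim) -> (b, h, t, head_dim)
        def split(proj):
            return proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(self.W_q), split(self.W_k), split(self.W_v)
        scores = q @ k.transpose(2, 3)             # (b, h, t, t)
        scores = scores.masked_fill(self.mask[:t, :t], float("-inf"))
        weights = self.dropout(torch.softmax(scores / self.head_dim ** 0.5, dim=-1))
        # merge heads back: (b, h, t, head_dim) -> (b, t, d_out)
        ctx = (weights @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out_proj(ctx)
```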
Extra questions
- Why do we divide the attention scores by the square root of the key/query dimension before the softmax? (See the sketch after this list.)
- Is the output projection layer necessary?
- Why do we need dropout? When and where is dropout typically applied?
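A small numerical sketch of why the scores are scaled (assuming query and key entries are roughly zero-mean with unit variance; the sizes below are arbitrary): the variance of a raw dot product grows with the dimension, and softmax over such large-magnitude scores collapses to a near one-hot distribution, which starves the softmax of gradient.

```python
import torch

torch.manual_seed(0)
d_k = 512
q = torch.randn(1000, d_k)   # pretend queries with roughly unit-variance entries
k = torch.randn(1000, d_k)   # pretend keys

scores = (q * k).sum(dim=-1)                 # raw dot products
print(scores.var())                          # roughly d_k, i.e. about 512
print((scores / d_k ** 0.5).var())           # roughly 1 after scaling by sqrt(d_k)

# Softmax over a handful of the raw scores is nearly one-hot; after scaling,
# the distribution is much smoother and gradients can flow.
print(torch.softmax(scores[:8], dim=-1))
print(torch.softmax(scores[:8] / d_k ** 0.5, dim=-1))
```

Dropout, by contrast, is usually applied to the attention weights after the softmax (as in the sketches above), and sometimes again after the output projection.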
More Abstract
- How hard is it to change the context length of a model?
- What are the implications of using the dot product when computing attention scores? Are there any alternatives, and what would change if they were used? (See the additive-attention sketch after this list.)
- How can you go about interpreting attention weights?
- What are some of the things one can try in order to reduce the amount of computation required for the context vectors?
- How does attention in an LLM differ from human attention?
- Is attention limited to text?
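One well-known alternative to the dot product is additive (Bahdanau-style) scoring, where each query/key pair is scored by a small learned MLP. A hedged sketch of just the scoring function, with illustrative names:

```python
import torch
import torch.nn as nn


class AdditiveScore(nn.Module):
    """Additive (Bahdanau-style) scoring: score(q, k) = v^T tanh(W_q q + W_k k).
    One alternative to the dot product for producing the pre-softmax scores."""

    def __init__(self, d_model, d_score):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_score, bias=False)
        self.W_k = nn.Linear(d_model, d_score, bias=False)
        self.v = nn.Linear(d_score, 1, bias=False)

    def forward(self, q, k):                  # q: (b, t_q, d), k: (b, t_k, d)
        # broadcast so every query is combined with every key
        s = torch.tanh(self.W_q(q).unsqueeze(2) + self.W_k(k).unsqueeze(1))
        return self.v(s).squeeze(-1)          # (b, t_q, t_k) score matrix
```

The dot product maps directly onto a single matrix multiplication, which is one reason it dominates in practice; an additive score needs extra parameters and an activation for every query/key pair, so swapping it in trades speed for a different, learned notion of similarity.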
Redundant
- How does the attention mechanism help maintain contextual relevance over long passages of text?