Attention

Discussion questions for attention

Concrete

  1. What made modelling long sequences hard with RNNs?
  2. In simplified attention, how do we compute the context vector for each token? (A minimal sketch follows this list.)
  3. How does attention help with modelling long sequences? What are the downsides, and how are they accounted for?
  4. How do we extend simplified attention to allow for trainable weights?
  5. What kind of attention mask do text-generating LLMs use? Is it possible to use other masks? Why would we do that, and what would the implications be?
  6. What is register_buffer used for? (Questions 4-6 are illustrated by the causal-attention sketch after this list.)
  7. Is softmax necessary? What role does it play?
  8. How many context vectors do we need to compute?
  9. What are the different ways of implementing multi-head attention?
  10. Why do we need multi-head attention, and what can it encode that a larger single-head attention can't?
  11. When implementing multi-head attention, what are the tensor operations we need to perform, and why? (See the multi-head sketch after this list.)
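
For question 2, a minimal sketch of simplified self-attention in PyTorch (the input tensor and its dimensions are illustrative, not taken from any particular source):

```python
import torch

# Simplified self-attention: no trainable weights. Each token's context
# vector is a weighted sum of all token embeddings, with weights given by
# softmax over pairwise dot-product similarities.
inputs = torch.rand(6, 3)  # hypothetical: 6 tokens, 3-dim embeddings

attn_scores = inputs @ inputs.T                    # (6, 6) pairwise dot products
attn_weights = torch.softmax(attn_scores, dim=-1)  # each row sums to 1
context_vectors = attn_weights @ inputs            # (6, 3): one per token

print(context_vectors.shape)  # torch.Size([6, 3])
```

Note that we get exactly one context vector per input token, which also bears on question 8.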
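
For questions 4-6, a sketch of single-head attention with trainable weight matrices, a causal mask stored via register_buffer, and dropout on the attention weights. The class and parameter names here are my own choices, assumed for illustration:

```python
import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout=0.1):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=False)
        self.W_key = nn.Linear(d_in, d_out, bias=False)
        self.W_value = nn.Linear(d_in, d_out, bias=False)
        self.dropout = nn.Dropout(dropout)
        # register_buffer holds non-trainable state: the mask moves across
        # devices with the module and is saved in its state_dict, but gets
        # no gradients and is not returned by parameters().
        mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
        self.register_buffer("mask", mask.bool())

    def forward(self, x):  # x: (batch, num_tokens, d_in)
        _, num_tokens, _ = x.shape
        q, k, v = self.W_query(x), self.W_key(x), self.W_value(x)
        scores = q @ k.transpose(1, 2) / k.shape[-1] ** 0.5
        # Causal mask: each position may attend only to itself and the past.
        scores = scores.masked_fill(self.mask[:num_tokens, :num_tokens], float("-inf"))
        weights = self.dropout(torch.softmax(scores, dim=-1))
        return weights @ v  # (batch, num_tokens, d_out)
```

Using a different mask (for example, all False for bidirectional attention, as in encoder-style models) would only change the register_buffer line and the masked_fill call.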
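
For questions 9 and 11, one common implementation of multi-head attention: project once, then split the output dimension into heads with view/transpose (the alternative is a wrapper around a list of independent single-head modules). Again a sketch with assumed names; mask and dropout are omitted for brevity:

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, num_heads):
        super().__init__()
        assert d_out % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads
        self.W_query = nn.Linear(d_in, d_out, bias=False)
        self.W_key = nn.Linear(d_in, d_out, bias=False)
        self.W_value = nn.Linear(d_in, d_out, bias=False)
        self.out_proj = nn.Linear(d_out, d_out)  # mixes information across heads

    def forward(self, x):  # x: (b, t, d_in)
        b, t, _ = x.shape

        def split(m):  # (b, t, d_out) -> (b, num_heads, t, head_dim)
            return m.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(self.W_query(x)), split(self.W_key(x)), split(self.W_value(x))
        scores = q @ k.transpose(2, 3) / self.head_dim ** 0.5  # (b, heads, t, t)
        weights = torch.softmax(scores, dim=-1)
        ctx = (weights @ v).transpose(1, 2).reshape(b, t, -1)  # back to (b, t, d_out)
        return self.out_proj(ctx)
```

The view/transpose pair is what lets all heads share one matrix multiplication instead of num_heads separate ones.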

Extra questions

  1. Why do we divide the attention scores by the square root of the key/query dimension before applying softmax? (A numeric illustration follows this list.)
  2. Is the output projection layer necessary?
  3. Why do we need dropout? When and where is it typically applied?
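
For extra questions 1 and 3: a quick toy experiment on why the scores are scaled. For random d_k-dimensional queries and keys, the dot product has variance on the order of d_k, so unscaled scores push softmax toward one-hot distributions and starve the gradients; dividing by sqrt(d_k) keeps the variance near 1. (The numbers in the comments are just what this toy setup produces.)

```python
import torch

torch.manual_seed(0)
d_k = 512
q = torch.randn(100, d_k)
k = torch.randn(100, d_k)

scores = q @ k.T
print(scores.var().item())                 # roughly d_k, i.e. ~512
print((scores / d_k ** 0.5).var().item())  # roughly 1
```

As for dropout (extra question 3): in attention blocks it is typically applied to the attention weights right after softmax, and often again after the output projection; it is active only during training.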

More Abstract

  1. How hard is it to change the context length of a model?
  2. What are the implications of using the dot product when computing attention weights? Are there any alternatives, and what would change if they were used? (One alternative is sketched after this list.)
  3. How can you go about interpreting attention weights?
  4. What are some of the things one can try in order to reduce the amount of computation required to compute context vectors?
  5. How does attention in an LLM differ from human attention?
  6. Is attention limited to text?
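
For abstract question 2, one classical alternative to the dot product is additive (Bahdanau-style) scoring, sketched below. It replaces the dot product with a small learned comparison network; dimensions and names are illustrative:

```python
import torch
import torch.nn as nn

class AdditiveScore(nn.Module):
    """Bahdanau-style additive attention scoring (a sketch)."""
    def __init__(self, d, hidden=64):
        super().__init__()
        self.W_q = nn.Linear(d, hidden, bias=False)
        self.W_k = nn.Linear(d, hidden, bias=False)
        self.v = nn.Linear(hidden, 1, bias=False)

    def forward(self, q, k):  # q: (t_q, d), k: (t_k, d) -> scores: (t_q, t_k)
        # Broadcast-add every query against every key, squash with tanh,
        # then project each pair to a scalar score.
        h = torch.tanh(self.W_q(q).unsqueeze(1) + self.W_k(k).unsqueeze(0))
        return self.v(h).squeeze(-1)
```

Swapping this in for the dot product would leave the rest of the pipeline (softmax, weighted sum) unchanged, but would give up the single-matmul efficiency that makes scaled dot-product attention fast on accelerators.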

Redundant

  1. How does the attention mechanism help maintain contextual relevance over long passages of text?