A saccade-inspired approach to image classification using vision transformer attention maps

Saccade selection method: (a.) The input image of dimension H × W is split into (H/16) × (W/16) patches, which are embedded into token vectors. (b.) The tokens are passed through the DINO transformer, and the attention flow from patch tokens to the [CLS] token (white arrows) is extracted and reshaped into one attention map per attention head. (c.) The multiple attention maps are fused into one by taking the maximum value across heads. (d.) The highest-attention locations define square regions ("saccades") whose tokens are retained. (e.) Selected regions are revealed sequentially, and the image variants are classified by a pre-trained linear head.
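Steps (c) and (d) of the pipeline — fusing per-head attention maps by a maximum and picking the highest-attention locations as saccade centres — can be sketched as follows. This is a minimal illustration, assuming the per-head attention has already been reshaped into a `(n_heads, h, w)` array of patch-grid maps; the function names and the greedy suppression scheme are illustrative, not the authors' exact implementation.

```python
import numpy as np

def fuse_attention(maps):
    """Fuse per-head attention maps into a single map by taking the
    maximum value across heads (step c).
    maps: array of shape (n_heads, h, w) over the patch grid."""
    return maps.max(axis=0)

def select_saccades(att, n_saccades, size):
    """Greedily pick the n highest-attention locations as centres of
    square 'saccade' regions of side `size` (in patch units, step d).
    Each selected region is suppressed so the next saccade lands
    elsewhere on the attention map."""
    h, w = att.shape
    att = att.copy()
    centres = []
    for _ in range(n_saccades):
        r, c = np.unravel_index(np.argmax(att), att.shape)
        centres.append((int(r), int(c)))
        # suppress the square region around the chosen centre
        r0, r1 = max(0, r - size // 2), min(h, r + size // 2 + 1)
        c0, c1 = max(0, c - size // 2), min(w, c + size // 2 + 1)
        att[r0:r1, c0:c1] = -np.inf
    return centres
```

The centres returned here would index the patch tokens to retain; revealing them one by one reproduces the sequential presentation of step (e).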
Matthis Dallain
PhD candidate in Computational Neuroscience

During my PhD, I am focusing on spiking neural networks with heterogeneous delays.

Laurent U Perrinet
Researcher in Computational Neuroscience

My research interests include machine learning and computational neuroscience applied to vision.