A cluster of conspicuity representations for eye fixation selection.
J.K. Tsotsos, Y. Kotseruba, and C. Wloka (2016) A cluster of conspicuity representations for eye fixation selection. Society for Neuroscience (SfN)
Abstract: A computational explanation of how visual attention, interpretation of visual stimuli, and eye movements combine to produce visual behavior seems elusive. Here, we focus on one component: how selection is accomplished for the next fixation. The popularity of saliency map models drives the inference that this is solved, but we argue otherwise. We advocate for a cluster of complementary conspicuity representations to drive fixation selection, modulated by task goals and fixation history. This design is constrained by the architectural characteristics of the visual processing pathways, specifically, the photoreceptor distribution in the retina, the pyramidal architecture of the visual pathways, and the poor representation of the visual periphery in the late stages of the visual pathways. Added to these are constraints due to the inappropriateness of an early attentional selection strategy for complex stimuli (e.g., non-target displays, figure-ground not easily separated, etc.). Together, these factors led us to a hybrid method that combines early and late selection, i.e., feature-based and object-based attentional selection. We incorporate attentional microsaccades, saccades, and pursuit eye movements into a unified scheme where true covert fixations (zero eye movement) might only be appropriate if the target of a new attentional fixation is represented in the retina at a resolution sufficient for the task. Finally, elements of a visual working memory structure are included that link fixations across space and provide a means for extracting details of an attentional fixation and communicating them to the rest of the system. These elements combine into a new strategy for computing fixation targets.
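The core selection idea above — a task-weighted combination of several conspicuity representations, with fixation history suppressing recently visited locations — can be sketched as follows. This is a minimal illustration, not the model's actual algorithm; the function name, the Gaussian suppression, and all parameters are our own assumptions for the sketch.

```python
import numpy as np

def next_fixation(conspicuity_maps, task_weights, fixation_history,
                  shape, sigma=15.0, decay=0.9):
    """Pick the next fixation as the peak of a task-weighted combination
    of conspicuity maps, with decaying Gaussian suppression around
    previously fixated locations (an inhibition-of-return stand-in).
    All names and parameter values here are illustrative assumptions."""
    # Task-weighted combination of the cluster of conspicuity maps.
    combined = np.zeros(shape)
    for w, cmap in zip(task_weights, conspicuity_maps):
        combined += w * cmap

    # Suppress previously fixated locations; older fixations are
    # suppressed less (decay ** age shrinks with age).
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    for age, (fy, fx) in enumerate(reversed(fixation_history)):
        g = np.exp(-((ys - fy) ** 2 + (xs - fx) ** 2) / (2 * sigma ** 2))
        combined *= 1.0 - (decay ** age) * g

    # The peak of the modulated combination is the next fixation target.
    return np.unravel_index(np.argmax(combined), shape)
```

Repeated calls, each appending the returned location to `fixation_history`, yield a fixation sequence whose statistics (e.g., saccade amplitudes) can then be compared with human data.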
Using this new fixation controller, we show results that not only outperform saliency models with respect to human fixation patterns, but also closely match human saccade amplitude distributions. Perhaps most importantly, it provides a substrate for a richer exploration of how attention and eye movements are related than is possible with other models, because it is explicitly designed to be an integrative framework. It is a fully computational framework that can be tested with real images (and image sequences) with no hidden representations or statistically inscrutable elements, generating fixations that can be fully analyzed and providing the full sequence of representations, which can also be inspected, analyzed, and used to drive experimental testing. It is certainly the case that the specific representations presented will require many refinements; however, the framework offers the ability to understand why those refinements will be needed.
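The comparison against human saccade amplitude distributions can be made concrete with a short sketch: compute the amplitudes (inter-fixation distances) of a model-generated scanpath and bin them into a normalized histogram for comparison with a human histogram. The function names and binning choices below are illustrative assumptions, not the evaluation code used in the work.

```python
import numpy as np

def saccade_amplitudes(fixations):
    """Euclidean distances between consecutive fixations
    (amplitudes in the same units as the fixation coordinates)."""
    pts = np.asarray(fixations, dtype=float)
    return np.linalg.norm(np.diff(pts, axis=0), axis=1)

def amplitude_distribution(amplitudes, bins=20, max_amp=None):
    """Normalized amplitude histogram, suitable for comparing a model's
    scanpath statistics against a human distribution binned the same way."""
    hi = max_amp if max_amp is not None else float(np.max(amplitudes))
    hist, edges = np.histogram(amplitudes, bins=bins, range=(0.0, hi),
                               density=True)
    return hist, edges
```

Two distributions binned over the same range can then be compared with any standard histogram distance (e.g., histogram intersection or earth mover's distance).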