Abstract
Mammalian vision systems do not view an entire scene in one go. Instead, rapid eye movements
known as saccades point the high-density areas of photoreceptors in the retina toward areas of
detail. Consequently, a detailed view of the scene can be built by the brain using a relatively
small amount of information. By integrating the imaging in this manner, the quality of the visual
processing found deeper within the brain is improved, as it only has to process the salient details.
A scanning pixel camera presents a way of realising this in hardware: a low-cost, low-power
sensor system that builds up an image of a scene by rapidly sampling a sensor that sits behind a
moveable set of optics. Advances in micro-actuation allow the low-cost optics to be scanned
across the scene in a programmable manner. This can lead to lens-less zooming effects by
simply varying the scan speed or the sample rate. Furthermore, the amount of information that
this type of sensor provides can be varied by simply changing the scan pattern.
However, a major drawback of this type of sensor system is that it takes a long time to image a
full scene when compared to a traditional CCD camera. This motivates the work of this thesis:
to find a scan pattern that makes the best use of the saccade-like behaviour of a scanning pixel
camera. By focusing on scene details relevant to a predefined computer vision task, this thesis
demonstrates that it is possible to produce a scan pattern that allows us to overcome this major
issue. In this thesis we provide methods of generating useful sample maps that enhance the
abilities of a scanning pixel camera and make it an efficient part of a computer vision pipeline.
By actively providing sample patterns to the scanning pixel camera, the sensor becomes an
active part of the computer vision system, rather than simply a source of data. This is similar to
the purpose of saccades in a mammalian vision system. In doing this we create another challenge
that is addressed in this thesis. Namely, the downstream computer vision task has only a partial
view of the scene, which may be affected by the different types of artefacts found in scanning pixel
cameras. How, therefore, do these tasks need to be adapted to deal with data in this form, both
during training and inference?
This thesis approaches this problem by first making several assumptions about a scanning pixel
camera to adapt existing computer vision techniques to find useful sample patterns. These
initial assumptions include that the scene is static and is imaged with full knowledge of its contents.
These are then used to create a simple model of a scanning pixel camera to establish the best
possible way of generating sampling positions for a downstream task. These assumptions are
then progressively removed in order to finally reach a method that can be deployed on a real
system. The end result is a technique that requires no prior knowledge of the scene to begin with,
forcing the scanning pixel camera to explore the scene before it knows what it is looking at.
The sample maps generated are designed to produce images to be used by a downstream
computer vision task, rather than viewed by a human. To evaluate this, we apply this technique to
a variety of computer vision tasks and demonstrate that such a piece of hardware can form a
useful part of a computer vision system. These tasks include object classification, tracking and
instance segmentation.