Reference
J. Winn and N. Jojic, LOCUS: Learning Object Classes with Unsupervised
Segmentation, Proc. IEEE Intl. Conf. on Computer Vision (ICCV),
Beijing 2005.
LOCUS addresses the problem of learning a model of an object class (e.g. horses, cars) from a 'bucket' of images, each containing an object of that class. LOCUS does not require any human annotation of the images: it discovers the location and pose of the object in each image and also gives a segmentation of each object. The motivation is that, by avoiding the need for human labelling, we can quickly scale up to the large number of object classes required for a practical object recognition system.
For example, given a bucket of 20 images of horses, LOCUS learns a class shape model and infers the segmentation of each horse (as shown below for four of the 20 images):
Figure 1: The class model and example segmentations when LOCUS is applied to 20 images of horses.
The accuracy of LOCUS's automatic segmentations rivals that of state-of-the-art methods which require hand-segmented training data (see Section 5 of the paper for details). However, LOCUS does not require the training images to be marked up in any way.
LOCUS uses a hierarchical generative model of the set of images, as shown in the diagram below. Class-specific information is contained in the class mask and edge probability models, which govern the broad shape and typical edge locations for instances of that class in a neutral position. The position, size, deformation and appearance of individual object instances are allowed to vary from image to image. During inference, the class shape and edge models are learned jointly with the variables associated with each image. Approximate Bayesian inference is carried out using Variational Message Passing, with some enhancements for the grid-structured parts of the graph (see the paper for details).
Figure 2: The Bayesian network used in LOCUS as a generative model of a set of images. The images indicate the inferred state of each variable given a set of 20 face images.
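As a rough illustration of the generative view described above, the following sketch samples toy images from a class-level mask prior combined with per-image appearance. It is a deliberately simplified assumption, not the LOCUS model itself: it omits the edge model, the deformation field and the position/size variables, and the names `sample_image` and `CLASS_MASK_PROB` are hypothetical.

```python
import random

def sample_image(mask_prob, object_color, background_color, rng):
    """Sample a binary mask from per-pixel object probabilities, then
    color each pixel with this image's object or background color."""
    mask = [[1 if rng.random() < p else 0 for p in row] for row in mask_prob]
    image = [[object_color if m else background_color for m in row]
             for row in mask]
    return mask, image

# Hypothetical class mask prior: the object is likely in the centre columns.
CLASS_MASK_PROB = [
    [0.0, 0.9, 0.9, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.9, 0.9, 0.0],
]

rng = random.Random(0)
# The class shape is shared; the appearance differs from image to image.
mask_a, image_a = sample_image(CLASS_MASK_PROB, "brown", "green", rng)
mask_b, image_b = sample_image(CLASS_MASK_PROB, "white", "grey", rng)
```

Inference in LOCUS runs in the opposite direction: given only the images, it recovers the shared class variables and the per-image variables.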
The key to LOCUS's ability to cope with varying object appearance and illumination is that each image has its own appearance models for the object and for the background. Hence, the object is defined primarily by its shape and edge map, and by the self-similarity of its appearance (color or texture) within a single image, rather than by a strong global appearance model.
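To make the idea of per-image appearance models concrete, here is a minimal sketch that fits object and background color histograms from a single image and a rough mask, then scores each pixel by its log-odds of belonging to the object. The smoothed-histogram form, the quantised color indices and the function names are assumptions made for illustration; the paper's appearance model may differ.

```python
from collections import Counter
import math

def fit_palette(pixels, n_colors, alpha=1.0):
    """Smoothed color histogram (Laplace smoothing with pseudo-count alpha)."""
    counts = Counter(pixels)
    total = len(pixels) + alpha * n_colors
    return [(counts[c] + alpha) / total for c in range(n_colors)]

def object_log_odds(image, mask, n_colors):
    """Per-pixel log p(color | object) - log p(color | background),
    with both histograms fitted on this one image only."""
    obj = [c for row, mrow in zip(image, mask) for c, m in zip(row, mrow) if m]
    bkg = [c for row, mrow in zip(image, mask) for c, m in zip(row, mrow) if not m]
    p_obj = fit_palette(obj, n_colors)
    p_bkg = fit_palette(bkg, n_colors)
    return [[math.log(p_obj[c]) - math.log(p_bkg[c]) for c in row]
            for row in image]

# Toy image with quantised color indices 0..2 and a rough initial mask.
image = [[0, 1, 1, 0],
         [0, 1, 1, 2],
         [0, 0, 1, 0]]
mask = [[0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0]]
log_odds = object_log_odds(image, mask, n_colors=3)
```

Pixels whose color is common inside the mask get positive log-odds. The pixel at row 2, column 1 lies inside the rough mask but has a background-like color, so its score is negative. Because the histograms are refitted for every image, nothing ties the object to one particular color across the set.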
LOCUS learns a dense registration from each object instance to the class model. Hence, we can illustrate the accuracy of the automatic registration/segmentation by showing each instance 'morphing' into the next, interpolating between the deformation fields and appearances of the two objects. The three videos below were created automatically from three image sets using exactly the same algorithm and parameter settings.
Figure 3: Videos showing the automatic registration and segmentation of objects from three image sets: cars (side), horses, cars (rear).
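The morphing described above can be sketched as a linear blend, assuming both instances are registered to the same class frame. The representation used here (a grid of `(dx, dy)` displacements plus a grid of grey levels per instance) and the names `lerp` and `morph` are hypothetical; how the videos were actually rendered is not specified on this page.

```python
def lerp(a, b, t):
    """Linear interpolation between scalars a and b."""
    return (1.0 - t) * a + t * b

def morph(field_a, field_b, appearance_a, appearance_b, t):
    """Blend two instances registered to the same class frame.

    field_*: grid of (dx, dy) displacements from class frame to image.
    appearance_*: grid of grey levels sampled in the class frame.
    """
    field_t = [[(lerp(da[0], db[0], t), lerp(da[1], db[1], t))
                for da, db in zip(row_a, row_b)]
               for row_a, row_b in zip(field_a, field_b)]
    appearance_t = [[lerp(va, vb, t) for va, vb in zip(row_a, row_b)]
                    for row_a, row_b in zip(appearance_a, appearance_b)]
    return field_t, appearance_t

# Two hypothetical 1x2 instances.
field_a = [[(0.0, 0.0), (2.0, 0.0)]]
field_b = [[(4.0, 2.0), (2.0, -2.0)]]
appearance_a = [[0.0, 100.0]]
appearance_b = [[50.0, 200.0]]

# Five frames of a short morph sequence from instance A to instance B.
frames = [morph(field_a, field_b, appearance_a, appearance_b, k / 4)
          for k in range(5)]
```

Each intermediate frame would then be drawn by warping the blended appearance through the blended deformation field. A poor registration shows up as ghosting in such a sequence, which is why the morph is a useful visual check.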
We can extend the LOCUS model so that the masks are multi-valued rather than binary, while still allowing the appearance model for each index to vary from image to image. This allows us to learn parts of an object that are self-similar in appearance and that also occur in the same relative position on the object. The class shape model then becomes a deformable probabilistic index map (dPIM). Interestingly, the parts that are discovered are often semantically meaningful, e.g. wheels, windows and eyes.
Figure 4: Deformable probabilistic index maps (top image in each group) along with label images for three example images for each of four object classes: cars, horses, faces, planes. In many cases, the learned parts are semantically meaningful, e.g. car window, wheel, hair, eyes, wings.
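A minimal sketch of the index-map idea, ignoring deformation: each pixel of the class model holds a distribution over part indices, a label image takes the most probable index at each pixel, and each image supplies its own palette mapping indices to appearance. The 2x3 map, the part names and the functions `label_image` and `render` are hypothetical illustrations, not the dPIM implementation.

```python
def label_image(index_map):
    """Most probable part index at each pixel of a probabilistic index map."""
    return [[max(range(len(probs)), key=probs.__getitem__) for probs in row]
            for row in index_map]

def render(labels, palette):
    """Color each pixel using this image's own palette (index -> color)."""
    return [[palette[k] for k in row] for row in labels]

# Hypothetical 2x3 index map over 3 parts: 0 = background, 1 = body, 2 = wheel.
INDEX_MAP = [
    [[0.8, 0.2, 0.0], [0.1, 0.9, 0.0], [0.7, 0.3, 0.0]],
    [[0.2, 0.1, 0.7], [0.1, 0.8, 0.1], [0.1, 0.2, 0.7]],
]

labels = label_image(INDEX_MAP)
# The part layout is shared; the palette is specific to each image.
red_car = render(labels, {0: "sky", 1: "red", 2: "black"})
blue_car = render(labels, {0: "road", 1: "blue", 2: "grey"})
```

The two rendered images share no colors, yet they have the same label image, which is the sense in which the parts are defined by position and self-similarity rather than by a fixed appearance.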
The LOCUS model can also be applied to the frames of a video sequence, which results in unsupervised segmentation of the video into two layers. The palette invariance (appearance is modelled separately for each frame) gives robustness to large changes in illumination, pose and background clutter.
Figure 5: Segmentation of a video sequence (left) and learned object shape and edge models (right).
These results were obtained by treating each frame as a separate image, and are hence still somewhat inaccurate. More recently, this work has been extended to use motion and tracking cues, which gives improved stability and accuracy.