Model-based tracking of people in video

Introduction

In order to make sense of the visual world, it is often necessary to have strong prior models of what we expect to see. This leads to a model-based approach to machine vision, whereby predefined models are used to track and identify objects in the scene.

This project uses a model of the human body to locate a person in a still image or to track them through a video clip.

The Tracking System

The model is matched to the video at fairly low computational cost by comparing the silhouette of the person in the video with the silhouette of the model in a given pose. The best pose is the one whose model silhouette is closest to the silhouette extracted from the video.
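As a minimal sketch of the comparison described above, the code below scores two binary silhouettes by counting the pixels on which they disagree, then picks the candidate pose with the lowest score. The mismatch measure, the function names and the toy `render_bar` "model" are all assumptions made for illustration; the page does not say which distance measure the system actually uses.

```python
def silhouette_mismatch(a, b):
    """Count pixels where two same-sized binary silhouettes disagree."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("silhouettes must have the same dimensions")
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def best_pose(candidates, render, target):
    """Return the candidate pose whose rendered silhouette is closest to target."""
    return min(candidates, key=lambda pose: silhouette_mismatch(render(pose), target))


def render_bar(offset, width=6):
    # Hypothetical stand-in for the body model: a 2-pixel bar at a given offset.
    return [[1 if offset <= x < offset + 2 else 0 for x in range(width)]]


target = [[0, 0, 0, 1, 1, 0]]
print(best_pose(range(5), render_bar, target))  # prints 3
```

Here a "pose" is just a horizontal offset, so exhaustive search is enough. With a full body model the pose space is far too large to enumerate, which is why a search procedure is needed.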

The basic structure of the system is shown below.

[Figure: system structure. Frame from video -> Background subtraction -> Silhouette -> Silhouette matching. Body model -> Apply pose -> Matched model.]
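The background subtraction step can be sketched as a per-pixel threshold on the difference between a frame and a reference image of the empty scene. This is only one simple way to do it; the grey-level image representation, the function name and the threshold value are assumptions, not details taken from the system described here.

```python
def extract_silhouette(frame, background, threshold=30):
    """Mark pixels whose grey level differs from the background by more than threshold.

    Both images are lists of rows of grey levels (0-255) with equal dimensions.
    The threshold of 30 is an arbitrary illustrative value.
    """
    return [
        [1 if abs(p - b) > threshold else 0 for p, b in zip(frow, brow)]
        for frow, brow in zip(frame, background)
    ]


background = [[10, 10, 10], [10, 10, 10]]
frame = [[10, 200, 12], [11, 180, 10]]
print(extract_silhouette(frame, background))  # prints [[0, 1, 0], [0, 1, 0]]
```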

The silhouette of the body model is matched to that extracted from the video by a process of simulated annealing. The matching occurs incrementally, starting with the root of the body model hierarchy (the rigid body position) and moving downwards to include movement of limbs and then movement of all flexible joints (including hands, feet and neck).
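The incremental matching described above might look roughly like the following sketch: simulated annealing is run in stages, and each stage adds the next level of the hierarchy to the set of parameters that are allowed to move. The parameter names, the cooling schedule, the step size and the quadratic toy cost are all assumptions chosen to keep the example self-contained. In the real system the cost would come from comparing silhouettes.

```python
import math
import random


def anneal(pose, cost, active, rng, steps=2000, t_start=1.0, t_end=0.01, step_size=0.5):
    """Simulated annealing over the pose parameters named in `active`."""
    current, current_cost = dict(pose), cost(pose)
    best, best_cost = dict(current), current_cost
    for i in range(steps):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (i / max(steps - 1, 1))
        candidate = dict(current)
        name = rng.choice(active)
        candidate[name] += rng.gauss(0.0, step_size)
        candidate_cost = cost(candidate)
        delta = candidate_cost - current_cost
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = dict(current), current_cost
    return best, best_cost


def match_hierarchically(pose, cost, stages, seed=0):
    """Anneal stage by stage, widening the set of free parameters each time."""
    rng = random.Random(seed)
    active = []
    for stage in stages:
        active = active + list(stage)
        pose, _ = anneal(pose, cost, active, rng)
    return pose


# Hypothetical toy problem: the "true" pose is known and the cost is quadratic.
true_pose = {"x": 1.0, "y": -2.0, "arm": 0.5, "neck": 0.2}


def toy_cost(pose):
    return sum((pose[k] - true_pose[k]) ** 2 for k in true_pose)


start = {k: 0.0 for k in true_pose}
stages = [["x", "y"], ["arm"], ["neck"]]  # rigid body position, then limbs, then other joints
result = match_hierarchically(start, toy_cost, stages)
print(toy_cost(start), toy_cost(result))
```

Fitting the rigid body position first means that limb parameters are only adjusted once the model roughly overlaps the person, which keeps the early search low-dimensional.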

Currently the matching process is not optimised for speed and takes about 30 seconds per frame. I believe that it would be relatively easy to reduce this by at least a factor of ten by using better-optimised algorithms.

There is currently a weak prior model on the pose. This helps to move the body into more reasonable positions in the cases when occlusion or the use of silhouettes leads to ambiguity (such as when an arm passes behind the main body).
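One way to picture a weak pose prior is as a small penalty added to the silhouette cost, so that it only decides the outcome when the silhouette alone cannot. The quadratic form, the `prior_strength` value and all names below are assumptions for illustration; the page does not specify how the prior is defined.

```python
def prior_penalty(pose, preferred, weights):
    """Quadratic penalty for drifting away from preferred joint values."""
    return sum(weights[name] * (pose[name] - preferred[name]) ** 2 for name in preferred)


def total_cost(pose, silhouette_cost, preferred, weights, prior_strength=0.1):
    """Silhouette mismatch plus a weakly weighted pose prior."""
    return silhouette_cost(pose) + prior_strength * prior_penalty(pose, preferred, weights)


# Hypothetical ambiguous case: the arm is hidden behind the torso, so the
# silhouette cost is the same whatever the arm angle is.
def occluded_silhouette_cost(pose):
    return 5.0


preferred = {"arm": 0.0}
weights = {"arm": 1.0}
relaxed = {"arm": 0.0}
twisted = {"arm": 2.0}

print(total_cost(relaxed, occluded_silhouette_cost, preferred, weights))
print(total_cost(twisted, occluded_silhouette_cost, preferred, weights))
```

Both poses explain the silhouette equally well, so the prior term is what makes the relaxed arm position the cheaper of the two.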

Results

A video clip of me walking was recorded. You can download this clip (902K) or, browser-permitting, it should appear below.

[Embedded video clip]

A silhouette sequence was extracted from this video. The tracking system was used to find a corresponding model position for each silhouette. The resulting animation is below or can be downloaded (666K).

The model silhouette is in the left panel, with the corresponding video silhouette in the right panel. Because no dynamic model is used, nothing constrains the motion to be smooth from one frame to the next. This leads to errors where the silhouette is ambiguous, such as when the legs pass each other.

[Embedded animation]

Future Work

I intend to develop this idea further to achieve the following: