Corporate Research & Development Center

Toshiba's New AI Technology Contributes to Safety and Security in Large Facilities by Tracking Multiple People at Once

Toshiba Corporation


Toshiba has developed AI technology that uses video feeds from multiple surveillance cameras to track the movements of people in shopping malls, train stations, arenas and other large facilities. Although it receives feeds from cameras in locations throughout the facility, the technology achieves high precision with a low computational load, allowing the routes taken by multiple people to be tracked at the same time. The technology will be reported at the 20th Meeting on Image Recognition and Understanding (MIRU2017), a major Japanese conference on image recognition, on August 8.

Development Background

The success of the millions of surveillance cameras installed worldwide in contributing to crime prevention and identifying criminals, along with the need for enhanced security to monitor potential terrorist threats, is driving demand for advanced crime-prevention solutions based on image-recognition technology. Most attention is being directed to facial recognition and person detection technologies that analyze human attributes and behaviors, such as age and gender, using camera footage.

Up until now, systems that detect human attributes and behaviors have been based on the field-of-view of a single camera. However, in large public facilities, the preferred solution is the ability to identify and track individuals in videos from multiple cameras at locations throughout the site. As the way a person is captured in a video differs from camera to camera, precisely identifying the same individual in many videos is difficult, and successfully identifying multiple people simultaneously is a feat that imposes a huge computational load, as the number of potential combinations is enormous.

Features of This Technology

Toshiba's AI technology achieves high precision with low computational load. It tracks multiple people captured on numerous cameras, using three essential capabilities to do so.

(1) Robust feature extraction (Note 1)
The basic information used for feature extraction is enriched by increasing the number of luminance and color channels (multi-channeling), so that the features of individuals are extracted without being affected by differences in settings between cameras. Each video is also divided into several blocks (multilayer block division) and the color distribution in each block is analyzed (introduction of histogram feature quantities). This keeps feature extraction robust against changes in pose and against traits that different people have in common.
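
The block division and histogram analysis described in (1) can be illustrated with a minimal sketch. It is not Toshiba's implementation: the function names, the grid levels (1x1 and 2x2), the number of bins and the use of plain RGB channels are all assumptions made for illustration, since the announcement does not specify them.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each value in 0..255


def channel_histogram(pixels: Sequence[Pixel], bins: int) -> List[float]:
    """Normalized histogram per channel, concatenated over the 3 channels."""
    hist = [0.0] * (3 * bins)
    for px in pixels:
        for c, value in enumerate(px):
            hist[c * bins + min(value * bins // 256, bins - 1)] += 1.0
    n = max(len(pixels), 1)
    return [h / n for h in hist]


def multilayer_block_features(
    image: Sequence[Sequence[Pixel]],
    levels: Sequence[int] = (1, 2),  # hypothetical choice: 1x1 and 2x2 grids
    bins: int = 4,                   # hypothetical choice of histogram bins
) -> List[float]:
    """Split the image into an n x n grid for each level and concatenate
    the color histograms of every block into one feature vector."""
    height, width = len(image), len(image[0])
    features: List[float] = []
    for n in levels:
        for by in range(n):
            for bx in range(n):
                y0, y1 = by * height // n, (by + 1) * height // n
                x0, x1 = bx * width // n, (bx + 1) * width // n
                block = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
                features.extend(channel_histogram(block, bins))
    return features


# Toy 4x4 "person image": red upper half, blue lower half.
toy_image = [[(255, 0, 0)] * 4] * 2 + [[(0, 0, 255)] * 4] * 2
toy_features = multilayer_block_features(toy_image)
print(len(toy_features))  # 5 blocks x 3 channels x 4 bins = 60
```

Because each block is summarized by its color distribution rather than by exact pixel positions, small shifts in pose inside a block leave the feature vector largely unchanged, while the finer grid levels still separate, for example, upper-body from lower-body colors.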

(2) Simultaneous feature extraction of multiple people to significantly reduce computational loads
Feature extraction is done with high precision, even when people overlap in videos. This is achieved by tracking each person in the video feeds from every camera, and simultaneously extracting feature quantities from every image. Because the computational load is reduced, feature extraction is 2.3 times faster than when done frame by frame. (Note 2)

(3) Identifying a person in different videos through similarities between videos
In operation, the multiple cameras are clustered to form a single system, and the operational constraint that any individual person can appear only once in any video at any given point in time is applied. This allows simultaneous similarity extraction across all the videos and selection of the best combination of features for recognizing the same person in different videos over time. The system identifies any particular individual 1,300 times faster than the conventional approach. (Note 3)
Toshiba evaluated the technology using CUHK03 (Note 4), a public image database, and found much higher precision than with current technologies (Note 5). The computational load required to recognize the same person in feeds from multiple cameras was greatly reduced, making it possible to infer the movements of multiple people in close to real time.
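
The constraint in (3), that a person appears at most once in any one video at a given time, turns cross-camera identification into a one-to-one matching problem. The sketch below illustrates that idea only. It is not the method Toshiba reports: it uses histogram intersection as a stand-in similarity measure and a brute-force search that is practical only for a handful of people, and all function names and feature values are hypothetical.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def histogram_intersection(a: Sequence[float], b: Sequence[float]) -> float:
    """Similarity of two normalized histograms (higher means more alike)."""
    return sum(min(x, y) for x, y in zip(a, b))


def best_one_to_one_match(
    feats_a: Sequence[Sequence[float]],
    feats_b: Sequence[Sequence[float]],
) -> Tuple[List[Tuple[int, int]], float]:
    """Pair every track in camera A with a distinct track in camera B so
    that the total similarity is maximal. Assumes len(feats_a) <= len(feats_b).
    Brute force over all assignments: for illustration only."""
    best_score, best_perm = float("-inf"), ()
    for perm in permutations(range(len(feats_b)), len(feats_a)):
        score = sum(
            histogram_intersection(feats_a[i], feats_b[j])
            for i, j in enumerate(perm)
        )
        if score > best_score:
            best_score, best_perm = score, perm
    return list(enumerate(best_perm)), best_score


# Two people seen at the same moment by two cameras (made-up features).
camera_a = [[0.9, 0.1], [0.1, 0.9]]
camera_b = [[0.2, 0.8], [0.8, 0.2]]
pairs, total = best_one_to_one_match(camera_a, camera_b)
print(pairs)  # [(0, 1), (1, 0)]
```

Requiring distinct partners rules out assignments in which two people in one camera are both matched to the same person in another, which is how the constraint shrinks the number of combinations that need to be considered.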

<Demo video>
* This demo video is made with the permission of the people in it.
* Some of the video was modified before publication to protect personal information.

Future Prospects

Toshiba aims to build the technology into its communication AI "RECAIUS™" by mid-2018.
This new technology makes it much easier to track the paths and current positions of particular individuals in large facilities, from lost children to suspicious individuals. Toshiba is also investigating application of the system to provide statistical analysis of attributes that will allow, for example, identification of where large numbers of people gather within facilities.

*Users of this system in practical application will be required to take appropriate measures to protect privacy and ensure adherence to the Personal Information Protection Act.


About Toshiba Communication AI "RECAIUS™"

RECAIUS™ understands people's intentions by analyzing audio and video feeds to support business activities and other activities and to contribute to safe, secure lifestyles. It fuses and systematizes media recognition processing technologies (media intelligence technologies) researched and developed by Toshiba over many years, including voice recognition, speech synthesis, translation, spoken interactions, understanding intentions, and image recognition (recognition of human forms and faces). RECAIUS™ will contribute to the creation of new lifestyles and businesses.

*RECAIUS™ is a trademark or registered trademark of Toshiba Digital Solutions Limited in Japan and other countries.
*Other company names and product names on this website may be used as trademarks or registered trademarks of their respective owners.

(Note 1)
Feature extraction assigns numerical values (feature quantities) that represent the kinds of features, such as colors and shapes, present in an input image.
(Note 2)
Source: Toshiba
(Note 3)
Source: Toshiba
(Note 4) (The Chinese University of Hong Kong)
A public image database of 13,164 images of 1,360 people, acquired using multiple cameras.
(Note 5)
The Toshiba method achieved an accuracy of 84.8%.