Digitizing text overlay to maximize video asset utilization: AI pioneers the future of content

Broadcasters maintain vast archives of video content, tagging their material with metadata at both the program and scene level. This enables users to search, identify, and extract specific content from these archives. However, tasks such as video editing or rights management require the exact timing of the target scenes to be specified using time codes. Existing metadata alone is not sufficient for this. Therefore, after locating videos using metadata, staff must manually review them and record the exact time codes of the specific content—a process that is both time-consuming and labor-intensive.

This process could be made more efficient by assigning metadata at a finer level than scenes—down to individual frames. To acquire data at this level of granularity, Toshiba focused on the text overlay displayed within video content. Text overlay presents information such as program titles and performer names on screen, making these details clear to viewers. It often includes decorative elements to enhance visual appeal. This poses a challenge: conventional optical character recognition (OCR) often fails to recognize text overlay. Toshiba has successfully adapted its OCR technology for video content, enabling high-accuracy digitization of text overlay. This makes it possible to rapidly pinpoint the exact positions of specific scenes and streamline video content workflows.

Here, we introduce Mojimeta, Toshiba's AI-powered text overlay OCR solution designed to maximize the use of video assets.


Finding specific scenes in massive video archives: Tackling TV station challenges with text overlay conversion


Every day, broadcasters produce and broadcast large volumes of video content, which is then archived for future use. These archives contain video content spanning several decades across genres such as news, variety shows, documentaries, and sports. Archived videos are often reused for streaming or as material for new programs. These archives are invaluable assets for broadcasters.

However, locating the exact moments you need within this vast collection of video content is a challenging task. For example, when looking for past programs or scenes featuring a specific person, the usual approach is to search by name. Such text data added to video content to describe programs or scenes is called metadata. However, older video content often lacks sufficient metadata, making text-based searches difficult. As a result, each archived video must be reviewed manually and tagged with metadata—a highly labor-intensive process with clear limitations.

The key to solving this challenge lies in the text overlay, often referred to as captions. Text overlay contains a wealth of useful information, including program titles, performer names, scene-specific keywords, product names, emotional nuances, and closing credits information. This information plays a key role in enriching metadata.

However, because text overlay is embedded in the video as images, it does not exist as separate text data and cannot be processed automatically as text. This is where OCR comes in. OCR can accurately detect text overlay in video content and convert it into text data, making it usable as metadata.

By performing this recognition process several times per second, the system can precisely identify what text overlay appears at each point in the video and convert it into text data. This significantly enhances the accuracy and efficiency of video searches, allowing not only quick and accurate extraction of specified scenes but also comprehensive searches that surface previously overlooked footage, supporting better-informed video selection.


Cutting-edge AI text overlay recognition backed by 50+ years of image processing innovation


When using OCR to recognize text overlay in video, results often fall short of expectations. Text overlay in videos often appears over complex backgrounds, uses decorative elements, or employs unusual fonts—features that pose significant challenges for conventional OCR technology and lead to a substantial drop in recognition accuracy.

Toshiba has been refining image processing technology for more than 50 years, starting with the development of an automatic postal code reading and sorting device in 1967. Character recognition has evolved by incorporating advanced techniques such as machine learning and deep learning. Mojimeta, an AI-powered OCR solution, applies these technologies to accurately recognize text overlay in video content.

* Learn more about Toshiba's OCR technology here.

Mojimeta is an OCR-based solution that accurately recognizes text overlay in videos and converts it into text data. Unlike conventional OCR, which is limited to processing still images such as documents and forms, Mojimeta is designed for video, which consists of sequential frames. This distinction is central to how it works.

Japanese television broadcasts run at a frame rate of 29.97 fps (frames per second). For example, if text overlay remains visible for three seconds, it spans approximately 90 frames.
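The frame arithmetic above can be sketched in a few lines, using the exact NTSC rate of 30000/1001 frames per second:

```python
# Exact NTSC frame rate used by Japanese television broadcasts.
FPS = 30000 / 1001  # ≈ 29.97 frames per second

def frames_for_duration(seconds: float) -> int:
    """Number of frames an overlay spans if shown for `seconds`."""
    return round(seconds * FPS)

def frame_to_seconds(frame_index: int) -> float:
    """Convert a frame index back to a time position in seconds."""
    return frame_index / FPS

# An overlay visible for three seconds spans roughly 90 frames.
print(frames_for_duration(3))  # → 90
```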

When OCR is applied to each individual frame, the results can become inconsistent, with the same text overlay recognized differently across frames. Mojimeta treats a sequence of video frames as a single unit and merges recognition results for text overlay that spans multiple frames.

The processing flow of Mojimeta is as follows (Fig. 1).

Frames are captured at regular intervals. As with conventional OCR, areas of the screen containing text overlay are detected, and the text within those areas is recognized. The steps Mojimeta performs beyond this point are specific to text overlay: it merges recognition results for text detected across multiple frames that are identified as the same. The merged text overlay is classified by attributes such as “Name,” “Date/time,” and “Address,” before outputting the final results.*

The output includes not only position coordinates, recognized text strings, and detection/recognition confidence scores, similar to conventional OCR, but also merge processing results and classification data. This enables text overlay information in videos to be utilized more effectively.

* The text overlay categorization feature is scheduled for future implementation.


Enhancing recognition accuracy with advanced techniques for text overlay processing


Recognition results may vary when the same text overlay spans multiple frames. A common visual effect used for text overlay is fade-in and fade-out, in which text gradually appears or disappears. During these transitions, characters often blur, shake, or appear incomplete, reducing recognition accuracy.

As shown in Figure 2, text overlay such as “Who will save the Earth?!” can be misrecognized during the initial fade-in frame, for example as “ho will save the Earth?!”. Performing simple frame-by-frame OCR would result in text overlay that actually remains the same being treated as different text across frames.

Mojimeta adds an additional step: when text overlay appearing across multiple frames is identified as the same, it applies corrections and merges the recognition results. This additional step improves recognition accuracy and makes the output easier to use (Fig. 2).
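One simple way to realize such a merge, shown here as an illustrative sketch rather than Mojimeta's actual algorithm, is a majority vote over the per-frame readings of an overlay judged to be the same:

```python
from collections import Counter

def merge_readings(per_frame_readings: list[str]) -> str:
    """Merge per-frame OCR readings of one overlay by majority vote.

    Readings damaged during fade-in or fade-out (truncated or blurred
    frames) are outvoted by frames where the text is fully visible.
    """
    counts = Counter(per_frame_readings)
    return counts.most_common(1)[0][0]

readings = (
    ["ho will save the Earth?!"]          # fade-in frame, first char lost
    + ["Who will save the Earth?!"] * 80  # fully visible frames
    + ["Who will save the Earth?"]        # fade-out frame, "!" lost
)
print(merge_readings(readings))  # → Who will save the Earth?!
```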

Additionally, in developing Mojimeta, we focused on improving the accuracy of the underlying OCR engine. We enhanced the recognition engine by training it on text featuring elements typical of text overlay, such as complex backgrounds, highly decorative fonts, and mixed vertical and horizontal layouts. Mojimeta also supports highly compressed, low-resolution video commonly used for verification at TV stations, and flexibly addresses video-specific challenges.

Furthermore, Mojimeta also supports recognition of kanji variant characters commonly found in personal names. For example, some kanji share the same reading and meaning but have different forms, such as “高” and its variant “髙,” “邉” and “邊,” or “斎” and “齋.” Careful attention is required when training the recognition engine on complex character variants to ensure that recognition of standard characters is not affected. This process is highly challenging. Toshiba’s experienced engineers applied careful AI tuning to ensure well-balanced training. As a result, Mojimeta achieves high accuracy in recognizing both standard characters and variant characters while minimizing errors.
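When such results are searched, variant characters can be folded onto a canonical form so that either spelling matches. A minimal sketch, with an illustrative subset of the mapping table:

```python
# Fold kanji variant characters onto a canonical form so that a
# search for either form matches (mapping table is illustrative).
VARIANT_MAP = str.maketrans({
    "髙": "高",
    "邊": "邉",
    "齋": "斎",
})

def normalize(text: str) -> str:
    return text.translate(VARIANT_MAP)

def matches(query: str, recognized: str) -> bool:
    """Variant-insensitive substring search over recognized text."""
    return normalize(query) in normalize(recognized)

print(matches("高橋", "髙橋さんが出演"))  # → True
```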


Expanding video workflow support with frame-level processing: Efficient quality checks, search, and extraction with Mojimeta


The frame-level text overlay information recognized by Mojimeta greatly improves the efficiency of video quality checks. Video quality checks are carried out from various perspectives. In particular, text overlay involves a wide range of review items, including typos or missing characters, inappropriate expressions, and compliance with broadcasting guidelines.

Traditionally, video quality checks have been a time- and labor-intensive process, relying on manual inspection by staff. However, the text overlay data generated by Mojimeta can be compared against terminology dictionaries to detect expressions inappropriate for broadcasting. It can also work with other AI systems to identify typos or missing characters and perform fact-checking. These capabilities are expected to improve video quality and streamline checking tasks.
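A terminology-dictionary comparison of this kind can be sketched as follows. The flagged expressions and the matching logic are illustrative assumptions, not Mojimeta's actual checking rules:

```python
# Flag recognized overlay text that contains expressions from a
# review dictionary (word list and matching logic are illustrative).
FLAGGED_TERMS = {"guaranteed cure", "world's best"}

def check_overlay(text: str, flagged: set[str] = FLAGGED_TERMS) -> list[str]:
    """Return the flagged expressions found in an overlay string."""
    return [term for term in flagged if term in text]

print(check_overlay("A guaranteed cure for everything!"))
# → ['guaranteed cure']
```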

Moreover, Mojimeta results can be reviewed through an interface designed with expertise gained from operational suitability evaluations at TV stations. The display can be switched between one-second, 10-second, and program-level views, allowing flexible use depending on the task at hand. For example, one-second units can be selected for transcribing end credits, while 10-second units are suitable for checking text overlay in program segments.

Additionally, Mojimeta outputs not only text data but also coordinate data indicating where text overlay appears. This coordinate data can be used to filter text overlay by size, position, and orientation (horizontal or vertical). It also helps extract small text elements, such as copyright notices or disclaimers. With the time bar feature, users can visually check when each text string appears on the video timeline and how long it remains displayed.
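Filtering by coordinate data can be sketched like this; the field names and the height threshold are illustrative assumptions:

```python
# Filter overlay results by bounding box to pull out small text such
# as copyright notices (field names and threshold are illustrative).
def small_overlays(results, max_height_px=24):
    """Yield overlays whose text height is at or below the threshold."""
    for r in results:
        x, y, w, h = r["bbox"]
        if h <= max_height_px:
            yield r

results = [
    {"text": "Who will save the Earth?!", "bbox": (120, 620, 840, 96)},
    {"text": "(C)Example Broadcasting",   "bbox": (20, 1050, 260, 18)},
]
print([r["text"] for r in small_overlays(results)])
# → ['(C)Example Broadcasting']
```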

The interface incorporates various features to help users quickly locate the target scenes (Fig. 3).


Toshiba's vision for the future: Next-generation content management to unlock the value of video


Toshiba also offers another solution specialized in video content: Kaometa*. Kaometa is a face recognition AI that identifies individuals appearing in video content. While Mojimeta extracts text overlay from video and converts it into text data, Kaometa identifies people; both are solutions specialized in video content that provide frame-level metadata.

* For more details on Kaometa, see this article.

This high level of granularity—frame-level metadata extraction—was made possible through AI technology. In addition to metadata for entire programs or scenes, frame-level metadata is now available, greatly simplifying video handling.

For example, when producing a new program that leverages archived footage, creators need to identify and extract the most relevant scenes from vast video archives. Program-level metadata helps determine which program to use, while frame-level metadata helps decide which specific scene within that program to select. When managing rights for streaming or other multi-use scenarios, or performing video quality checks, specific scenes must be directly verified. Frame-level metadata greatly streamlines these processes.

Mojimeta is an innovative solution that unlocks new value in previously underutilized text overlay within video content. By converting text overlay into text data, Mojimeta enhances the searchability of archived video through enriched metadata, streamlines video quality checks, and enables new value creation.

We plan to integrate Mojimeta with next-generation content management systems. This will streamline metadata preparation and the sharing and management of information throughout the entire video content lifecycle, while preventing redundant tasks and omissions and improving overall operational efficiency.

Toshiba will drive broader utilization of video assets and streamline video operations through advanced AI technologies and continuously evolving next-generation content management systems. We look forward to sharing future developments with you.

  • The corporate names, organization names, job titles and other names and titles appearing in this article are those as of September 2025.
  • All other company names or product names mentioned in this article may be trademarks or registered trademarks of their respective companies.
