Captions make a huge difference to deaf and hard of hearing people who access video on the web. Providing a real time text rendering of dialogue and sound effects also gives a much wider group of people a more engaging experience.
Captions were first used on American television in the early 1970s. The first programmes used open captioning, which meant that the text was displayed on screen for everyone to see. At the time this wasn’t popular with hearing viewers, so a system of closed captioning was developed.
Closed captions are hidden within the television signal. A specific decoder is needed to access them, which means that closed captions are only visible to people who need them.
Captions are not only used on television. Both open and closed captioning can also be found on the web. It’s worth taking a moment to look at some of the different terminology used first though.
In the UK, captions on television are often called subtitles. In America, subtitles are foreign language translations and don't include any sound effects. Captions are presented in the same language as the original soundtrack, do include sound effects, and are for the benefit of hearing impaired people.
This distinction has carried across to the web. For example the Web Content Accessibility Guidelines talk about captions, rather than subtitles.
The Web Content Accessibility Guidelines recommend that captions are provided for video content. Success Criterion 1.2.2 requires captions for pre-recorded video content at Level A. Success Criterion 1.2.4 requires captions for live video content as well, at Level AA.
In spite of the high priority the guidelines give captioning, it’s still rare to find captioned video on the web. Often expensive, and always time consuming, captioning is one of the least established forms of web accessibility.
There are three principal reasons for this: time, technology and cost.
It takes time to create captions. The sound effects and dialogue must be transcribed into text format. The text must be cut up into chunks, and each chunk matched to a time sequence in the original soundtrack. Although tools do exist, it's largely a manual process.
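To give an idea of what those timed chunks look like, here is a short fragment in the SubRip (.srt) format, one widely supported caption file format. The timings and dialogue are invented for the example; each numbered chunk pairs a start and end time with the text to display:

```
1
00:00:01,000 --> 00:00:04,200
[door slams]

2
00:00:04,500 --> 00:00:07,000
SARAH: Who's there?
```

Every chunk like this has to be written and timed against the soundtrack, which is where most of the manual effort goes.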
There is no standard technology for providing captions. Adobe Flash, RealNetworks RealPlayer, Apple QuickTime and Microsoft's Windows Media Player are among the most popular multimedia platforms on the web, and they all handle captions in different ways.
For example, QuickTime and RealPlayer both use Synchronized Multimedia Integration Language (SMIL) to control the presentation of captions, but QuickTime stores the captions themselves in text track files, whereas RealPlayer uses RealText files. Windows Media Player uses a different technology altogether: Synchronized Accessible Media Interchange (SAMI), which both stores the captions and controls their presentation.
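To make the RealPlayer approach concrete, here is a minimal sketch of a SMIL 1.0 file that plays a video and a RealText caption stream in two stacked regions. The file names and dimensions are invented for the example:

```xml
<smil>
  <head>
    <layout>
      <root-layout width="320" height="280"/>
      <!-- video on top, captions in a strip underneath -->
      <region id="video" width="320" height="240" top="0"/>
      <region id="captions" width="320" height="40" top="240"/>
    </layout>
  </head>
  <body>
    <par>
      <!-- play the video and the RealText caption stream in parallel -->
      <video src="movie.rm" region="video"/>
      <textstream src="captions.rt" region="captions"/>
    </par>
  </body>
</smil>
```

A QuickTime presentation would use a similar SMIL wrapper but point at a text track file instead, while a SAMI file bundles the captions and their styling together in a single document, which is why content has to be prepared separately for each player.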
Inevitably, the cost of producing captions is high. Coupled with the sheer volume of video on the web these days, it’s perhaps understandable that captioning remains so scarce.
With over 20 hours of video uploaded to YouTube every minute, you might expect captioning to be washed away in the flood of user generated content. If professional web developers have struggled to find the time and resources to provide captions, what can the world at large be expected to do?
The answer may lie in a recent announcement from Google and YouTube. Although it has been possible to add captions to YouTube content for some time now, it has remained a time consuming process. Now, Google's Automatic Speech Recognition (ASR) technology will be combined with YouTube's captioning system to provide a new feature called Automatic Captions (Auto-caps).
Google's ASR technology will recognise the dialogue spoken in a video and transcribe it into text format for display on the screen. Coupled with YouTube's captioning system, it will dramatically cut the time needed to create captions.
The same process will be used to make manual captioning easier. The transcribed dialogue can be provided in a simple text file, and ASR's Auto-timing will match each word to the moment it is spoken on screen.
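In other words, instead of the hand-timed chunks described earlier, an uploader would only need to supply a plain transcript with no timing information at all. A sketch of such a file, with invented content, might be as simple as:

```
[Music playing]
Hi, and welcome to the video.
Today we're going to look at captions on the web.
```

Auto-timing then works out when each line is spoken, removing the most laborious step of the captioning process.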
For more information, visit http://googleblog.blogspot.com/2009/11/automatic-captions-in-youtube.html