A Survey on MPEG-7

A Multimedia Content Description Interface


Author: Randa Hilal

Prepared for Prof. Javed I. Khan
Department of Computer Science, Kent State University
Date: November 2001

Abstract: As its name, “Multimedia Content Description Interface”, suggests, MPEG-7 is a standard for describing multimedia data or multimedia file contents in order to facilitate searching multimedia databases and libraries, and surfing the web for multimedia resources, using criteria such as text, sound, or even graphics.  This paper looks at the MPEG-7 standard as of March 2001, and at some of its applications in searching, filtering, classifying, and indexing multimedia resources.


Table of Contents:

Overview of the MPEG-7 Standard
Objectives of MPEG-7 standard
MPEG-7 tools
Scope of the standard
MPEG-7 application areas
MPEG-7 parts and their functionalities
Examples of MPEG-7 application areas
    Representing Internet Streaming Media Metadata using MPEG-7 Multimedia Description Schemes
    TV Anytime as an application scenario for MPEG-7
    Spoken Content Metadata and MPEG-7
    On the Evolution of Videotext Description Scheme and Its Validation Experiments for MPEG-7
What is beyond MPEG-7?
Research Groups
Research Papers
Other Relevant Links
Scope of Survey



The Moving Picture Experts Group, MPEG for short, started work on its first standard, MPEG-1, in January 1988.  It was intended for audio and video compression, together with all the functions needed for multiplexing and synchronizing audio and video streams into one stream (the Systems part).  Although MPEG-1 was designed with specific applications in mind, such as interactive CD and digital audio broadcasting, it was generic enough to be used for other applications.


While MPEG-1 was designed with specific applications in mind, MPEG-2, started in July 1990, addressed the multiplexing of one or more elementary streams of video and audio, as well as other data streams, into single or multiple streams suitable for storage or transmission.  This was done by developing two system layers: the Transport Stream (TS), designed for environments where errors are likely, such as storage or transmission over lossy or noisy media (cable, satellite, and terrestrial), and the Program Stream (PS), similar to MPEG-1's and designed for relatively error-free environments.


In July 1993 MPEG started working on its third standard, MPEG-4.  Unlike MPEG-1 and MPEG-2, which required a bit rate of no less than 1 Mbps, MPEG-4, as its first title, “very low bitrate audio-visual coding”, suggests, was a standard that allowed the implementation of decoding on a wide range of programmable devices.  MPEG-4 can encode units of aural, visual, or audiovisual content, called “media objects”, which can be of natural or synthetic origin.  It can also describe the composition of these objects, and multiplex and synchronize the data associated with them so that they can be transported over network channels with a QoS appropriate to the nature of the specific media objects.


At this point we had all the tools needed for digitizing, encoding, decoding, compressing, and transferring multimedia content over a wide variety of media and capacities.  One important link was still missing from the chain: how can we search for multimedia content in the wealth of multimedia resources available to us from media libraries and databases, and what criteria can we use to search for, and select, a certain resource?  In October 1998 a call for proposals was issued for a new standard, MPEG-7, that would devise standard ways of searching, filtering, classifying, and indexing multimedia data.


In the following sections of this paper we look at the MPEG-7 standard: its scope and objectives, its parts, its application areas, and its functionalities.  We also look at some examples of research papers that used the MPEG-7 standard to implement a variety of applications.



Overview of the MPEG-7 Standard

MPEG-1, MPEG-2, and MPEG-4 made a wealth of audiovisual information available in digital form, but the value of this information depends greatly on how easily it can be found, retrieved, accessed, filtered, and managed.  MPEG-7, an ISO/IEC standard, is not the first attempt to use metadata to describe, organize, and manage multimedia resources.  There have been many attempts to use various forms of metadata and description schemes to facilitate managing and finding digital multimedia data when needed.  Among these are the Dublin Core scheme, widely used for simple descriptions such as author name and date of publication, and XML/RDF, which defines the relationship between any two entities, gives this relationship a name, and uses the XML format to describe it.  What MPEG-7 does is define standard schemes and use a standard language to describe the content of audio and video records, movies, speech clips, graphics, text, and even still pictures.  The use of MPEG-7 is not restricted to database retrieval applications such as digital libraries; it also extends to applications for broadcast channel selection, multimedia editing, and multimedia directory services.  These applications vary widely: they can run in real or non-real time, they can be push or pull applications, and they are intended for consumption by computational systems as well as human users.

Objectives of MPEG-7 standard

MPEG-7 was not unique in what it set out to do, namely providing a description of the content of multimedia resources, but it was unique in standardizing the core technology and extending the limited capabilities of proprietary solutions to more data types, such as still pictures, graphics, 3D models, audio, speech, and video, and to special cases of these data types, such as facial expressions, personal characteristics, and music mood.


MPEG-7 description tools are independent of the way the content is coded or stored; they work for digital data as well as analogue data, or even material printed on paper.


MPEG-7 offers different granularities: the description can be as general or as detailed as we want.  Although MPEG-7 is not application-specific and does not depend on the way the content is coded, it can build on features that other standards offer when available.  For example, it can exploit MPEG-4's use of the object as a unit of encoding to attach its description to an object within an audio or video file.


MPEG-7 can match its description of a resource to the application it is used for: for a visual application it can describe shape, color, size, texture, position, or movement; for an audio application it can describe mood, tempo, tempo changes, and so on.  At a more sophisticated level, the description can include semantic information.  Some low-level features included in the description can be automatically extracted; other, more sophisticated features may need manual extraction.  Besides the description of the multimedia data itself, MPEG-7 has to include other kinds of information, listed below (a hedged sketch of how several of them might sit together in one description follows the list):

- The form: such as coding scheme and data size.
- Conditions for accessing the material: such as intellectual property rights information and price.
- Classifications: such as parental rating.
- Links to other relevant material: these can help in finding more related material.
- The context: for instance, in documentary or teaching material, the title of the material, the name of the author, or the date or place the material was created.
- Information describing the creation and production processes of the content: such as director and title.
- Information relevant to usage: copyright pointers, usage history, and broadcast schedule.
- Information on storage features: format and encoding.
- Structural information on spatial, temporal, or spatio-temporal components: scene cuts, segmentation into regions, region motion tracking.
- Conceptual information about the content: such as objects, events, and their interactions.
- Information about how to browse the content: such as the use of summaries.
- Information about the interaction of the user with the content: such as user preferences and usage history.
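
As a rough illustration of how several of these kinds of information might sit together in one description, consider the sketch below.  It is a minimal sketch only: the element names are hypothetical stand-ins in the spirit of the MPEG-7 drafts, not normative MPEG-7 tags.

<!-- Hedged sketch: a hypothetical description instance combining several
     of the information types listed above; all tags are illustrative. -->
<MultimediaDescription>
   <MediaFormat>                                <!-- the form -->
      <CodingScheme>MPEG-4 Video</CodingScheme>
      <DataSize unit="MB">12</DataSize>
   </MediaFormat>
   <AccessConditions>                           <!-- access conditions -->
      <Rights>Copyright 2001 Example Broadcasting</Rights>
      <Price currency="USD">2.50</Price>
   </AccessConditions>
   <Classification>
      <ParentalRating>PG</ParentalRating>       <!-- classification -->
   </Classification>
   <Creation>                                   <!-- creation and production -->
      <Title>Wildlife of the Serengeti</Title>
      <Creator role="director">J. Doe</Creator>
      <Date>2001-03-15</Date>
   </Creation>
   <RelatedMaterial href="http://example.org/related-clips"/>
</MultimediaDescription>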

MPEG-7 tools

To accomplish its task MPEG-7 defined a set of tools.  These tools may or may not all appear in a given description, and the separation between them may not always be clear, depending on the content described and the application using the description.  Furthermore, descriptions may be stored with the audio-visual content on the same storage medium or may be stored remotely on some other system; in the latter case, additional tools are needed to link the content to its descriptions.  Content and queries about it do not have to match in type: visual content can be queried using a visual description, a textual description, or even a speech description.  MPEG-7 tools are very flexible; they work for many different applications and environments, which allows MPEG-7 to coexist with other leading standards such as the SMPTE Metadata Dictionary, Dublin Core, and TV Anytime.

The main tools of MPEG-7 are:

- Descriptors (D): describe the various features of multimedia content, defining the syntax and semantics of each content feature.  Figure 1 shows an example of a Descriptor.

- Description Schemes (DS): pre-defined structures of Descriptors and Description Schemes that specify the semantics of their relationships.

- Description Definition Language (DDL): the language used to define new Description Schemes and Descriptors or to extend existing ones.  It provides a standardized grammar and syntax for unambiguously defining Descriptors and Description Schemes so they can be parsed by a variety of systems.  In March 2000 it was decided to adopt W3C's XML Schema language as the DDL, with the provision of extending it to satisfy all the MPEG-7 requirements (a hedged sketch of a DDL declaration follows Figure 1).

[9] lists some of these extensions: parameterized array sizes; typed references; built-in array and matrix data types; and enumerated data types for MimeType, CountryCode, RegionCode, CurrencyCode, and CharacterSetCode.  MPEG-7-specific parsers will be developed by adding validation of these additional constructs to standard XML Schema parsers.

- System tools: support the multiplexing and synchronization of descriptions with the content they describe.


<CatalogueEntry xsi:type="NewsDoc">
   <Title>CNN 6 o'clock News</Title>
   <Producer>David James</Producer>
</CatalogueEntry>

Figure 1: Example of a Descriptor
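
To make the DDL more concrete, the fragment below sketches how a simple Descriptor, here a dominant-color descriptor like the one listed later in Table 1, might be declared.  Since the DDL adopts W3C's XML Schema as its basis, the sketch uses plain XML Schema; the type name and its fields are hypothetical illustrations, not the normative MPEG-7 declarations.

<!-- Hedged sketch: declaring a hypothetical descriptor in XML Schema,
     the language adopted as the basis of the MPEG-7 DDL. -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
   <xsd:complexType name="DominantColorType">
      <xsd:sequence>
         <xsd:element name="Red" type="xsd:nonNegativeInteger"/>
         <xsd:element name="Green" type="xsd:nonNegativeInteger"/>
         <xsd:element name="Blue" type="xsd:nonNegativeInteger"/>
      </xsd:sequence>
      <!-- fraction of the image covered by this color -->
      <xsd:attribute name="percentage" type="xsd:float"/>
   </xsd:complexType>
   <xsd:element name="DominantColor" type="DominantColorType"/>
</xsd:schema>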

Scope of the standard

MPEG-7 deals with content that is stored on-line or off-line or streamed, and it can operate in real-time and non-real-time environments.  A real-time environment in this context means that the description is generated while the content is being captured.  MPEG-7 also works for pull applications, such as retrieval from digital libraries, and push applications, such as filtering of audio-visual streams broadcast over the Internet.  Generating a description requires first extracting the features (analysis); then the description is generated; finally, search-engine applications are employed to complete the job.  Feature extraction can be automated at the lower levels of description, or interactive when higher-level features are required.  However, feature extraction, whether automatic, semi-automatic, or manual, is beyond the scope of MPEG-7; its implementation was left for industry to compete on, since interoperability is not required there.  Likewise, the implementation of search engines and filter agents is beyond the scope of MPEG-7 and was left to industry to develop.  Figure 2 shows a schematic representation of the scope of the MPEG-7 standard.

Figure 2: Scope of the MPEG-7 standard

MPEG-7 places great emphasis on describing audio-visual data, but this data may contain text that also needs to be searched and filtered, which is why MPEG-7 considered existing solutions for doing so.

MPEG-7 application areas

The MPEG-7 standard tools make it possible to support a wide range of applications, such as searching and indexing digital libraries, broadcast media selection, and media editing.  They make it possible to search the web for multimedia data using a variety of criteria, in the same manner as it can be searched for text data using textual criteria.  The following list of applications that will benefit from MPEG-7 is only a sample of the endless possibilities.

- Architecture, real estate, and interior design.
- Cultural services in history museums and art galleries.
- Digital library searches for all kinds of archived multimedia resources.
- E-commerce, advertising, on-line catalogues, e-shops.
- Education, such as searching for support material.
- Home entertainment: management of personal multimedia collections, home video editing, karaoke, etc.
- Investigation services, human-characteristics recognition, forensics.
- Journalism, such as searching for famous people's speeches.
- Directory services, yellow pages, tourist information, etc.
- Remote sensing, cartography, ecology, natural-resources management, etc.
- Shopping for different items.
- Social services, such as dating.
- Surveillance, traffic control, and the like.

And the list goes on and on…

Querying a resource can be done in many different ways:

- Play a few notes of a song and retrieve songs that match.
- Draw a few lines on a screen and find images that match the drawing.
- Define an object by its color, texture, shape, etc., and find objects that match the definition.
- Define multimedia objects and the relationships between them, and find what matches.
- Describe an action and get scenarios that match.

MPEG-7 parts and their functionalities

1. MPEG-7 Systems – includes the Descriptor (D) and Description Scheme (DS) tools, which are used to create descriptions and to synchronize content with its descriptions.  It also includes tools for managing and protecting intellectual property, and it defines the terminal architecture and normative interfaces.

2. MPEG-7 Description Definition Language – the language used to create new Description Schemes and, eventually, new Descriptors.  It also allows the extension and modification of existing DSs.  The XML Schema language, with its structural and datatype components, was chosen as the basis of the DDL; some MPEG-7-specific components are added to it as well.

3. MPEG-7 Visual – Descriptors and Description Schemes dealing only with visual descriptions.  It includes color, texture, shape, motion, localization, and other descriptors, each of which can be basic or sophisticated.  Table 1 shows some of the current descriptors.

4. MPEG-7 Audio – Descriptors and Description Schemes dealing only with audio descriptions.  This includes six technologies: the audio description framework (scale tree, low-level descriptors), sound-effect description tools, instrumental timbre description tools, spoken content description, uniform silence segment, and finally the melodic descriptors that facilitate query-by-humming.  Table 1 shows some of the current descriptors.

5. MPEG-7 Multimedia Description Schemes (MDS) – Descriptors and Description Schemes dealing with generic and multimedia features.  Generic features pertain to all types of media, such as vector and time.  Multimedia description tools are used when more than one medium needs to be described at the same time.  They are grouped into five groups:

- Content description: representation of perceivable information.
- Content management: information about the media features, the creation, and the usage of the audio-visual content.
- Content organization: representing the analysis and classification of several audio-visual contents.
- Navigation and access: specification of summaries and variations of the audio-visual content.
- User interaction: description of user preferences and usage history pertaining to the consumption of the multimedia material.

6. MPEG-7 Reference Software, the eXperimental Model (XM) – an experimental software implementation of the standard.  It includes the simulation platform for MPEG-7 Descriptors (Ds), Description Schemes (DSs), Coding Schemes (CSs), and the Description Definition Language (DDL).  The experimental model has normative and non-normative parts.  The normative parts consist of the Descriptors' and Description Schemes' syntax and semantics and the binary representations of both.  The optional non-normative parts of the software are the recommended data structures and the extraction and similarity-matching procedures performed on them.

7.        MPEG-7 Conformance – guidelines and procedures for testing the conformance of MPEG-7 implementations.






Basic structures:   Grid layout

Color:              Color space, dominant color, color histogram, color quantization

Texture:            Spatial image intensity distribution, homogeneous texture

Shape:              Object bounding box, region-based shape, contour-based shape, 3D shape descriptor

Motion:             Camera motion, object motion trajectory, parametric object motion, motion activity, motion trajectory features (e.g., speed, direction, acceleration)

Speech annotation:  Lattice of words and phonemes, plus metadata

Timbre:             Ratio of even to odd harmonics, harmonic attack coherence

Melody:             Melodic contour and rhythm

Table 1 – Overview of the current descriptors
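
To give a flavor of how entries from Table 1 might appear in an actual description, the sketch below instantiates a dominant-color descriptor (matching the hypothetical declaration sketched after Figure 1) and a motion-activity descriptor.  The tags and values are illustrative stand-ins, not the normative MPEG-7 syntax.

<!-- Hedged sketch: hypothetical instances of two descriptors from Table 1. -->
<VisualDescriptors>
   <!-- a dominant red covering 42% of the frame -->
   <DominantColor percentage="0.42">
      <Red>200</Red>
      <Green>30</Green>
      <Blue>30</Blue>
   </DominantColor>
   <!-- a qualitative measure of how fast-moving the shot is -->
   <MotionActivity intensity="high"/>
</VisualDescriptors>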

Examples of MPEG-7 application areas

This section presents some of the studies that were built on the preliminary MPEG-7 standard.  These examples clarify the ideas of Descriptors, Description Schemes, and the Description Definition Language, and they show how MPEG-7 will come in handy in conjunction with many other applications and technologies.

Representing Internet Streaming Media Metadata using MPEG-7 Multimedia Description Schemes

This study was done by Eric Rehm of an Internet startup company in Seattle, Washington, that began the construction and population of a searchable database of Internet streaming media.  The study used the MPEG-7 Multimedia Description Scheme (MDS) as a guiding model.  The Multimedia Description Group of MPEG-7 created a top-level entity called the “Generic AV DS”, which describes the audio and visual contents of a single AV document; this study built on it to create an implementation of an Internet streaming-media searchable database.  Figure 3 shows the MPEG-7 AV Description Scheme.  The paper is important in that it shows the hierarchy of the Description Schemes, which are essentially generic structures on which Descriptors are built.

Figure 3 – Streaming AV Description Scheme

Rehm found that the following data and relationships had to be modeled:

1.      Overall Structure: Single content item, playlist, SMIL authored content, etc.

2.      Media Information: URL link(s) to the stream, bit rate, media format (RealMedia, Windows Media, etc.), duration, MIME type, media type (audio, video, animation, etc.).

3.      Creation Information: Title, Author, Copyright, Artist, Album, Record label, Language, etc.

4.      Classification: Category, and Genre.  Categories are the root nodes of the taxonomy.  Genre represents a path from a root node using controlled vocabulary.

5.      Related Material: Referencing page URL(s), title, anchor text, HTML Meta tags (description keywords).

6.      Usage Information:  Copyright, Content Owner.

7.      Spoken Text:  Transcript from speech recognition.

8.      Summary Information: Key frame(s).

The Overall Structure

The MPEG-7 Segment DS and Segment Decomposition allowed them to model any hierarchical playlist format they encountered on the Internet.  Figure 4 shows the structure support from the MPEG-7 MDS.

Figure 4 – Structure Support from MPEG-7 MDS


The Segment DS is actually an abstract class.  Its subclasses, the VideoSegment, AudioSegment, and TextSegment DSs, were designed to contain information about the audio, video, and text in an AV content item.
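
As a rough sketch of how a two-item playlist might be modeled with this segment hierarchy, consider the fragment below.  The element and attribute names approximate the Segment DS and its subclasses as described above; they are illustrative, not the normative schema, and the times are invented.

<!-- Hedged sketch: a playlist as a temporal decomposition of segments. -->
<AVSegment id="playlist1">
   <SegmentDecomposition type="temporal">
      <VideoSegment id="item1">
         <MediaTime>
            <MediaTimePoint>T00:00:00</MediaTimePoint>
            <MediaDuration>PT2M30S</MediaDuration>
         </MediaTime>
      </VideoSegment>
      <VideoSegment id="item2">
         <MediaTime>
            <MediaTimePoint>T00:02:30</MediaTimePoint>
            <MediaDuration>PT1M45S</MediaDuration>
         </MediaTime>
      </VideoSegment>
   </SegmentDecomposition>
</AVSegment>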

Media Information

The MPEG-7 MediaInformation DS contains descriptions that are specific to the storage media.  It can contain one or more MediaProfile DSs, where each MediaProfile represents one of possibly many variations that can be produced from a master medium, depending on the values chosen for the MediaCoding, MediaFormat (storage format), and so on.  Internet streaming-media content is often encoded in more than one commercial format (RealMedia, Windows Media, QuickTime), each at several bit rates, with each variation at a separate URL; the study therefore encoded the commercial format in the MediaFormat's System element.  See Figure 5, the Media Information DS.

Figure 5 – Media Information DS

If two identical instances of a particular stream exist on the Internet (very common with MP3, for example), they can simply be represented with multiple MediaInstance descriptions.
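
A hedged sketch of this idea follows: one MediaInformation carrying two profiles of the same news stream, each with its own commercial format, bit rate, and URL.  The element names approximate the DSs named above, and the URLs and values are invented.

<!-- Hedged sketch: one master item, two commercial-format profiles. -->
<MediaInformation>
   <MediaProfile>
      <MediaFormat>
         <System>RealMedia</System>
         <Bitrate>56000</Bitrate>
      </MediaFormat>
      <MediaInstance>
         <MediaLocator>
            <MediaURL>http://media.example.com/news-56k.rm</MediaURL>
         </MediaLocator>
      </MediaInstance>
   </MediaProfile>
   <MediaProfile>
      <MediaFormat>
         <System>Windows Media</System>
         <Bitrate>300000</Bitrate>
      </MediaFormat>
      <MediaInstance>
         <MediaLocator>
            <MediaURL>http://media.example.com/news-300k.asf</MediaURL>
         </MediaLocator>
      </MediaInstance>
   </MediaProfile>
</MediaInformation>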

Creation Information

The CreationMetaInformation DS binds together creation and classification information about AV content and other material related to it.  It records by whom and under what name, when, and where the content was created.  This information can be extracted in several ways: automatic extraction from the header of the stream is one, and automatic extraction from the referring web page that contains the URL of the stream is another.  See Figure 6, the CreationMetaInformation DS.

Figure 6 – CreationMetaInformation DS


The Classification DS is used as part of the larger CreationMetaInformation DS to categorize Internet streams into a proprietary taxonomy.

Related material

The RelatedMaterial DS is also part of the CreationMetaInformation DS; it holds information about the web page(s) that contain links to streaming media.  Such data has been shown to increase search precision and recall.

Usage Information

The Rights DS within the UsageMetaInformation DS is used to capture copyright information.

Spoken Text

Spoken text is extracted using speech-recognition tools, closed-caption decoding, or transcripts provided by the content producer.  It is captured in the SpokenContent DS as part of the AudioSegment DS.  See Figure 4.

Summary Information

The MPEG-7 SequentialSummary DS is only used when there is a need to represent multiple key frames extracted from a single Internet streaming video.


TV Anytime as an application scenario for MPEG-7

[11] shows how the TV Anytime Forum can make use of the MPEG-7 Description Schemes to realize what is intended from the TV Anytime technology.

TV Anytime is an organization for the development and standardization of the tools and technologies needed for the creation of an integrated entertainment/information gateway.  It aims at providing value-added services, such as personalizing and controlling material of special interest to the end user, accessed via TVs or computational systems.  In order to do that, TV Anytime specified three required technologies: metadata, content referencing, and rights management.


Metadata

Metadata is the core of the MPEG-7 standard.  TV Anytime could use the rich library of MPEG-7 Description Schemes without having to reinvent them, but it would not need all the tools offered by MPEG-7; among the tools not needed are the low-level audio-visual features such as color and loudness.  This calls for a mechanism to profile MPEG-7 for certain types of applications.

Content referencing

The AV material stream and the metadata stream in MPEG-7 are two separate streams that need not reside on the same storage medium.  The AV stream can be digital or analog, and the transfer medium can be cable or satellite.  The important functionality required is the ability of the receiver to synchronize and link the two streams together.  MPEG-7 supports this linkage via various reference and time Description Schemes, such as the MediaLocator DS to specify the link, the MediaTimePoint DS to specify the absolute start time, and the MediaDuration DS to specify the segment's duration.
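
A hedged sketch of such a linkage follows, pointing metadata at a 25-second segment of a broadcast.  The SegmentLink wrapper and all values are invented; MediaURL, MediaTimePoint, and MediaDuration echo the DSs named above.

<!-- Hedged sketch: linking metadata to a time segment of an AV stream. -->
<SegmentLink>
   <MediaLocator>
      <MediaURL>http://broadcast.example.com/match.mpg</MediaURL>
   </MediaLocator>
   <MediaTime>
      <MediaTimePoint>T00:43:10</MediaTimePoint>  <!-- absolute start time -->
      <MediaDuration>PT0M25S</MediaDuration>      <!-- segment duration -->
   </MediaTime>
</SegmentLink>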

Rights management

MPEG-7 has a specific Description Scheme to manage copyright and other rights, but MPEG-7 cannot deal with the security issues of TV Anytime.

How can TV Anytime use MPEG-7 metadata?

An XML Schema parser can extract the information included in an MPEG-7 generic metadata Description Scheme, validate it, and map it into memory according to the specific TV Anytime metadata format.  An application that performs services such as searching for and accessing AV material can then use the metadata in memory to do so.

Example of a possible scenario

Since every broadcast contains segments of higher or lesser importance, an end user may request to view only the highlights of a broadcast, such as the goals in a soccer game.  Figure 7 shows a simplified extract of an MPEG-7 Description Scheme: a high-level DS that has all the components required for time-exact linking into AV material (MediaTime), for the physical location of the AV material (MediaLocator), for human comments on the AV material (StructuredAnnotation), and for specifying a certain segment of the AV material (HighlightSegment).  Figure 8 shows a simple example of a description scheme specific to the TV Anytime application; it uses the generic MPEG-7 schemes to create a TV Anytime-specific metadata scheme.  Figure 9 shows a sample metadata Descriptor that uses the XML Schema language; the sample is about the 1999 Champions League soccer final in Europe.

<schema xmlns=""
   elementFormDefault="unqualified" attributeFormDefault="unqualified">

<!-- Schema component to locate material in time -->
   <complexType name="MediaTime">
      <element name="MediaTimePoint" type="mp7:MediaTimePoint"/>
      <element name="MediaRelTime" type="mp7:MediaRelTime"/>
      <element name="MediaDuration" type="mp7:MediaDuration" minOccurs="0"/>
   </complexType>

<!-- Schema component to locate material physically -->
   <complexType name="MediaLocator">
      <element name="MediaURL" type="mp7:MediaURL"/>
      <element name="MediaTime" type="mp7:MediaTime" minOccurs="0"/>
   </complexType>

<!-- Schema component to annotate material -->
   <complexType name="StructuredAnnotation">
      <element name="Who" type="mp7:ControlledTerm" minOccurs="0"/>
      <element name="WhatObject" type="mp7:ControlledTerm" minOccurs="0"/>
      <element name="WhatAction" type="mp7:ControlledTerm" minOccurs="0"/>
      <element name="Where" type="mp7:ControlledTerm" minOccurs="0"/>
      <element name="When" type="mp7:ControlledTerm" minOccurs="0"/>
      <element name="TextAnnotation" type="string" minOccurs="0"/>
      <attribute ref="xml:lang"/>
   </complexType>

<!-- Schema component for segments being a highlight in the material -->
   <complexType name="HighlightSegment">
      <element name="VideoSegmentLocator" type="mp7:VideoSegmentLocator" minOccurs="0"/>
      <element name="AudioSegmentLocator" type="mp7:AudioSegmentLocator" minOccurs="0"/>
      <attribute name="name" type="string" use="optional"/>
      <attribute name="themeIds" type="IDREFS" use="optional"/>
   </complexType>
</schema>

Figure 7 – Simplified extract of MPEG-7 Description Scheme


<schema xmlns="" xmlns:mp7=""
   elementFormDefault="unqualified" attributeFormDefault="unqualified">

<import namespace=""/>

<element name="program">
   <complexType>
      <element name="generalInfo" type="tva:generalInfoType"/>
      <element name="highlight" type="tva:highlightType" minOccurs="0" maxOccurs="unbounded"/>
   </complexType>
</element>

<complexType name="generalInfoType">
   <element name="annotation" type="mp7:StructuredAnnotation"/>
   <element name="link" type="mp7:MediaLocator"/>
</complexType>

<complexType name="highlightType">
   <element name="segment" type="mp7:HighlightSegment"/>
   <element name="annotation" type="mp7:StructuredAnnotation"/>
</complexType>
</schema>

Figure 8 – Simple example of a description scheme specific to the TV Anytime application


<program xmlns="">
   <annotation lang="eng">
      <Who>Manchester United - Bayern Munich</Who>
      <WhatAction>soccer champions league final Europe</WhatAction>
      <Where>Barcelona, Spain</Where>
   </annotation>

   <annotation>
      <Who>Mario Basler</Who>
      <WhatObject>Bayern Munich</WhatObject>
   </annotation>

   <annotation>
      <Who>Teddy Sheringham</Who>
      <WhatObject>Manchester United</WhatObject>
   </annotation>
</program>

Figure 9 – Sample of a metadata Descriptor

Spoken Content Metadata and MPEG-7

There are two levels of description in MPEG-7.  One is low-level description that can be automatically extracted, such as image color for visual items and the Fourier power spectrum for audio items; the other is high-level, semantic description that requires human intervention because it contains many abstractions of humanly understood concepts.  With the increasing need to cut costs and automate most of these extractions, a new mid level arose that uses considerable automation for extracting the abstract concepts; examples of these automation attempts are spoken content in audio, topic identification in text, and object identification in images.  But these applications are not perfect, because they involve many variables; in non-canonical English, for example, the words “picture” and “pitcher” can sound identical although their meanings differ.  The only way to disambiguate them is through topical or positional context.

As an example, we look at spoken content and how MPEG-7 deals with the shortcomings of current tools for automatic speech recognition (ASR).  Spoken content forms an essential component of the audio-visual description.  This content may be extracted at a number of levels, from phonetic subword units (phones) through syllables to words.  To illustrate the design considerations, consider the annotation of images: when taking a picture, a person can record a short comment about the person in the picture or the place where it was taken.  These comments can be turned into metadata structures using ASR tools, and the end user can later use the resulting descriptors to query the picture database via audio or textual queries.

An ASR system suffers from several problems: its accuracy is limited by ambient noise, out-of-vocabulary words, ungrammatical constructions, and poor enunciation.  Special attention must be given to the limitations of current ASR systems and to the methods by which the metadata may be utilized for retrieval or other purposes.  Two problems must be considered:

Extraction failures: as shown in Figure 10, a hypothetical lattice representing the phrase “please be quite sure”, the ASR system's decoding results are stored in some form of lattice.  These lattices represent a large number of hypotheses, and many of the decodings contain the correct hypothesis even when the most probable one is incorrect.  The solution is to retain all the possible hypotheses in the metadata.  This works well for short audio captions, but it is neither practical nor accurate for large audio files.

Figure 10 – Hypothetical lattice representing the phrase “please be quite sure”

Extraction limitations: an ASR system has a vocabulary dictionary of 20,000 to 60,000 words that it uses for comparison.  These words do not include many of the nouns that can be found in an audio file, and these nouns are often crucial to the meaning.  By retaining the phonetic representation of these sounds we may be able to retrieve an audio document by example, through combined word and phone retrieval.  As a result we need to use lattices that contain a combination of words and phones.

MPEG-7 SpokenContent Descriptor

Having looked at the problems that ASR suffers from, we can now look at how the decoding results are used in the MPEG-7 SpokenContent Descriptor.  The authors of [12] believe there will be a need to represent speech as a combined word and phoneme lattice.  Some audio documents may contain more than one spoken annotation, as in a photographic library where each photo has its own annotation; in this case we need to retain multiple lattices, with links attaching them to other metadata.  A separate header contains information pertaining to all lattices.

To deal with usability issues, since the multiple lattices by themselves do not form adequate metadata for the spoken content, a special SpokenContent Descriptor stored in the header contains the language, a word lexicon, a phone lexicon, and optionally word and phone indexes.

To deal with interoperability issues, the metadata and the ASR decoder that decodes the query need to share the same phone set for a language, because that is the only reliable basis for retrieval-by-example: the ASR system used to extract the spoken content of the metadata and the one used for retrieval by comparison can differ widely in their capabilities, the former usually being much more advanced.  [12] suggests including the phone set in the header as well.
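
A hedged sketch of what such a representation might look like follows, for the “quite/quiet” ambiguity of Figure 10.  The element names loosely follow the header/lattice split described above and are illustrative only; the probabilities are invented.

<!-- Hedged sketch: combined word-and-phone lattice with a shared header. -->
<SpokenContent language="en-US">
   <Header>
      <WordLexicon>
         <Word id="w1">please</Word>
         <Word id="w2">be</Word>
         <Word id="w3">quite</Word>
         <Word id="w4">quiet</Word>
      </WordLexicon>
      <PhoneLexicon>
         <Phone id="p1">sh</Phone>
         <Phone id="p2">ur</Phone>
      </PhoneLexicon>
   </Header>
   <Lattice>
      <Node id="n1"/> <Node id="n2"/> <Node id="n3"/>
      <Node id="n4"/> <Node id="n5"/> <Node id="n6"/>
      <WordLink from="n1" to="n2" ref="w1" prob="0.95"/>
      <WordLink from="n2" to="n3" ref="w2" prob="0.90"/>
      <WordLink from="n3" to="n4" ref="w3" prob="0.60"/>  <!-- "quite" -->
      <WordLink from="n3" to="n4" ref="w4" prob="0.40"/>  <!-- "quiet" -->
      <!-- out-of-vocabulary word retained as a phone sequence -->
      <PhoneLink from="n4" to="n5" ref="p1" prob="0.70"/>
      <PhoneLink from="n5" to="n6" ref="p2" prob="0.70"/>
   </Lattice>
</SpokenContent>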

On the Evolution of Videotext Description Scheme and Its Validation Experiments for MPEG-7

Videotext is text superimposed on or embedded in images and video frames.  For example, videotext can be the anchor's name in a news clip, football game scores superimposed on the frame, the introductory and ending credits of video material, or even the text written on someone's clothing in a video clip.  Videotext can be extracted and used to browse, search, and classify video materials.  [13] looked at the standardization efforts around the VideoText Description Scheme (DS) and modeled and tested the validity of the VideoText DS for browsing and classifying videos.  An application that extracts face and text information from video was used; the extracted information was stored in the XML format proposed by MPEG-7 (shown below), then parsed and used to browse and classify videos.

What is VideoText Description Scheme?

The VideoText DS is an MPEG-7 Description Scheme derived from the MovingRegion DS, which covers basic video-object attributes such as the bounding box and trajectory.  It inherits all the attributes, decompositions, Descriptors, and Description Schemes of the MovingRegion DS.  It also contains the syntactic attributes of the text, such as its language, font size, and font style, and other temporal and visual information such as its time, motion, color, and spatial location.  Figure 11 shows the syntactic aspects of the VideoText DS.

<!-- ################################### -->
<!-- ``Videotext DS'': Syntactic Aspects -->
<!-- ################################### -->
<simpleType name="TextDataType" base="string">
    <enumeration value="Superimposed"/>
    <enumeration value="Embedded"/>
</simpleType>
<complexType name="Videotext" base="MovingRegion"
             derivedBy="extension">
    <element name="Text" type="TextualDescription"
             minOccurs="0" maxOccurs="1"/>
    <attribute name="TextType" type="TextDataType"/>
    <attribute name="FontSize" type="positiveInteger"/>
    <attribute name="FontType" type="string"/>
</complexType>

Figure 11 – VideoText Description Scheme – syntactic aspects

VideoText DS contains the following elements and attributes:

- TextDataType: there are two types of text in a video: embedded, such as text written on people's clothing or shop and street names, and superimposed, which is text generated by title machines in studios.
- Videotext: a text region in a video or set of images.
- Text: the string containing the text recognized in the videotext.
- TextType: an attribute giving the type of videotext.
- FontSize: an integer specifying the font size.
- FontType: a string specifying the font style.

<!-- ########################################## -->
<!-- ``VideotextObject DS'': Semantic Aspects   -->
<!-- ########################################## -->
<complexType name="VideotextObjectDS" base="Object"
             derivedBy="extension">
   <attribute name="id" type="ID"/>
   <attribute name="href" type="uri"/>
   <attribute name="CharacterCode" type="string"/>
</complexType>

Figure 12 – VideoTextObject Description Scheme – semantic aspects

Videotext often appears next to an object in a video clip: text appearing under a face may be the name of the person shown, and words appearing on an object may be its brand name.  This shows the relationship between an object and text in a video, and it leads to the definition of a new DS called VideoTextObject, which contains the semantic attributes of the VideoText DS.  Figure 12 shows the VideoTextObject DS.
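
As an illustration, an instance of the Videotext type from Figure 11, describing an anchor's name superimposed on a news clip, might look like the sketch below; the attribute values and the text are invented.

<!-- Hedged sketch: hypothetical instance of the Videotext DS of Figure 11. -->
<Videotext id="vt1" TextType="Superimposed"
           FontSize="24" FontType="Helvetica">
   <Text>Jane Smith, Anchor</Text>
</Videotext>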

Extraction of VideoText DS

Extraction can be done automatically or manually.  There are three methods for automatic extraction: region analysis, edge analysis, and the texture method.  IBM proposed a region-based algorithm, and Philips proposed an edge-based algorithm; the authors used these algorithms in their experiment on the validation of the VideoText DS for video browsing.

Validation of the VideoText DS

Two typical scenarios were used to test the validity of the VideoText DS: video browsing and video classification based on the VideoText DS.  For video browsing, the authors used their automatic videotext event-detection technique to detect the presence of videotext in the video stream; then two videotext extraction applications were used and compared, the IBM system and the Philips system described above.  The test video files were taken from the MPEG-7 test data.  For video classification, the authors adopted an existing videotext application that classifies video segments into known categories based on the location of faces and text (it had been observed that different TV categories show different face and text trajectory patterns).  Two methods were used and compared for extracting text and face trajectories: a domain-based method and Hidden Markov Models (HMMs).

The paper concluded that the VideoText DS proposed by the MPEG-7 group is a powerful feature; it provides rich, high-level semantic information that can be used in numerous video applications.

What is beyond MPEG-7?

Today, many elements exist to build an infrastructure for the delivery and consumption of multimedia content.  But one detail is still missing: the “big picture” that describes how the existing and under-development elements relate to each other.  That is the aim of MPEG-21.

We can define MPEG-21's job in several ways:

- MPEG-21 is going to define a multimedia framework to enable transparent and augmented use of multimedia resources across a wide range of networks and devices used by different communities.

- The multimedia content delivery chain encompasses content creation, production, delivery, and consumption.  To support this, the content has to be identified, described, managed, and protected.  The transport and delivery of content will occur over a heterogeneous set of terminals and networks, within which events will occur and require reporting.  Such reporting must provide for reliable delivery and for the management of personal data and preferences, taking user privacy and the management of financial transactions into account.  Doing this requires a multimedia framework that orchestrates the job of all the different parts.

The MPEG-21 multimedia framework will identify and define the key elements needed to support the multimedia delivery chain described above, the relationships between them, and the operations they support.


It is clear that in this paper we gave MPEG-7 more attention than the other MPEG standards.  But we felt that for the reader to understand what MPEG-7 is all about, there was a need to understand what the other standards did and how they differ from each other in the way multimedia content is coded, stored, delivered, decoded, and retrieved.  We looked at what MPEG-1, MPEG-2, and MPEG-4 did for the encoding of multimedia resources; then we looked at what MPEG-7 is intended to do and how it employs and extends features of the previous standards for efficient retrieval of multimedia content; and finally we gave an overview of the task that MPEG-21 is intended to achieve.

The research papers included give examples of the attempts made to put the MPEG-7 Description Schemes and Descriptors into action.  It was evident that the various MPEG-7 Ds and DSs are the way forward for searching, browsing, indexing, and managing multimedia content, although some improvements may be needed as the MPEG-7 standard evolves.


Research Groups


[1] ISO/IEC JTC1/SC29/WG11. Short MPEG-1 Description. June 1996, by Leonardo Chiariglione.

[2] ISO/IEC JTC1/SC29/WG11. Short MPEG-2 Description. October 2000, by Leonardo Chiariglione.

[3] ISO/IEC JTC1/SC29/WG11. Overview of the MPEG-4 Standard. March 2001, by Rob Koenen.

[4] ISO/IEC JTC1/SC29/WG11. Overview of the MPEG-7 Standard (version 5.0). March 2001, by José M. Martínez.

[5] ISO/IEC JTC1/SC29/WG11. MPEG-21 Overview. July 2001, by Jan Bormans and Keith Hill.

Research Papers 

[6] Leonardo Chiariglione. Open Source in MPEG. ACM Digital Library.

[7] Jane Hunter. MPEG-7 Behind the Scenes. Distributed Systems Technology Center, University of Queensland.

[8] Michael J. Hu and Ye Jian. Multimedia Description Framework (MDF) for Content Description of Audio/Video Documents. ACM Digital Library.

[9] The XML Cover Pages.

[10] Eric Rehm. Representing Internet Streaming Media Metadata using MPEG-7 Multimedia Description Schemes. ACM Digital Library.

[11] Silvia Pfeiffer and Uma Srinivasan. TV Anytime as an application scenario for MPEG-7. ACM Digital Library.

[12] J.P.A. Charlesworth and P.N. Garner. Spoken Content Metadata and MPEG-7. ACM Digital Library.

[13] Chitra Dorai, Ruud Bolle, Nevenka Dimitrova, Lalitha Agnihotri, and Gang Wei. On the Evolution of Videotext Description Scheme and Its Validation Experiments for MPEG-7.

Other Relevant Links

[14]      XML Schema Tutorial for DDL


Scope of Survey

Since the MPEG-7 standard was still under study when this paper was written, the information in it regarding MPEG-7 is based on what was available from the International Organization for Standardization up to March 2001.  The study papers included in this survey are based on what was available from the ACM Digital Library up to November 2001.  This survey is not a complete study of the MPEG standards, nor is it a study of each and every functionality of MPEG-7; rather, it is an overview of the available standards in their chronological order up to the date of writing, geared towards giving the novice reader who wants to explore this area a general idea that will help in deciding on more in-depth reading.