1.1. Introduction
Multimedia information retrieval (MIR) is about the search for knowledge in all its forms, everywhere. Indeed, what good is all the knowledge in the world if it is not possible to find anything? This sentiment is mirrored as an ACM SIGMM grand challenge [Rowe and Jain 2005]: “make capturing, storing, finding, and using digital media an everyday occurrence in our computing environment.”
The fundamental problem has been how to enable or improve multimedia retrieval using content-based methods. Content-based methods are necessary when text annotations are nonexistent or incomplete. Furthermore, content-based methods can potentially improve retrieval accuracy even when text annotations are present by giving additional insight into the media collections.
Our search for digital knowledge began several decades ago when the idea of digitizing media was commonplace, but when books were still the primary medium for storing knowledge. Before the field of multimedia information retrieval coalesced into a scientific community, there were many contributory advances from a wide set of established scientific fields. From a theoretical perspective, areas such as artificial intelligence, optimization theory, computational vision, and pattern recognition contributed significantly to the underlying mathematical foundation of MIR. Psychology and related areas such as aesthetics and ergonomics provided basic foundations for the interaction with the user. Furthermore, applications of pictorial search into a database of imagery already existed in niche forms such as face recognition, robotic guidance, and character recognition.
In its earliest years, MIR was frequently based on computer vision algorithms (three excellent books: [Ballard and Brown 1982]; [Levine 1985]; [Haralick and Shapiro 1993]) focused on feature based similarity search over images, video, and audio. Influential and popular examples of these systems are QBIC [Flickner, et al. 1995] and Virage [Bach, et al. 1996], both from the mid 90s. Within a few years the basic concept of similarity search was transferred to several Internet image search engines including Webseek [Smith and Chang 1997] and Webseer [Frankel, et al. 1996]. Significant effort was also placed into direct integration of feature based similarity search into enterprise level databases such as Informix datablades, IBM DB2 Extenders, or Oracle Cartridges [Bliujute, et al. 1999; Egas, et al. 1999], with the aim of making MIR more accessible to private industry.
In the area of video retrieval, the main focus in the mid 90s was toward robust shot boundary detection of which the most common approaches involved thresholding the distance between color histograms corresponding to two consecutive frames in a video [Flickner, et al. 1995]. Hanjalic, et al. [1997] proposed a method which overcame the problem of subjective user thresholds. Their approach was not dependent on any manual parameters. It gave a set of keyframes based on an objective model for the video information flow. Haas, et al. [1997] described a method to use the motion within the video to determine the shot boundary locations. Their method outperformed the histogram approaches of the period and also performed semantic classification of the video shots into categories such as zoom-in, zoom-out, pan, etc. A more recent practitioner's guide to video transition detection is given by Lienhart [2001].
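As a rough illustration of the histogram thresholding approach described above, the following Python sketch declares a cut whenever the L1 distance between the normalized histograms of two consecutive frames exceeds a threshold. It is a minimal sketch under stated assumptions, not the method of any cited system: frames are assumed to be small grids of gray levels in 0..255 (a color version would concatenate one histogram per channel), and the function names, the bin count, and the threshold value are hypothetical choices.

```python
from typing import List, Sequence

# Hypothetical frame type: rows of gray levels in the range 0..255.
Frame = Sequence[Sequence[int]]


def gray_histogram(frame: Frame, bins: int = 8) -> List[float]:
    """Normalized intensity histogram of one (non-empty) frame."""
    counts = [0] * bins
    total = 0
    for row in frame:
        for value in row:
            counts[min(value * bins // 256, bins - 1)] += 1
            total += 1
    return [c / total for c in counts]


def l1_distance(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Sum of absolute bin differences; ranges from 0.0 to 2.0 here."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def detect_shot_boundaries(frames: Sequence[Frame], threshold: float = 0.5) -> List[int]:
    """Return every index i where a cut is declared between frame i-1 and frame i."""
    histograms = [gray_histogram(f) for f in frames]
    return [
        i
        for i in range(1, len(histograms))
        if l1_distance(histograms[i - 1], histograms[i]) > threshold
    ]


# Toy input (invented for illustration): three dark frames, then two bright ones.
dark = [[10, 20], [15, 12]]
bright = [[240, 250], [245, 230]]
boundaries = detect_shot_boundaries([dark, dark, dark, bright, bright])
```

The manually chosen `threshold` is the kind of subjective user parameter that Hanjalic, et al. [1997] sought to eliminate.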
In the area of speech and audio indexing, many different algorithms and systems have also been developed to structure and index audio content automatically. One of the first systems in the area of spoken document retrieval is the Thisl Broadcast News Retrieval system [Abberley, et al. 1997]. This system applies a large vocabulary continuous speech recognition (LVCSR) system to generate word transcriptions for broadcast data. The automatically transcribed word sequences are tagged with a time code for each word. The transformation from speech to text allows the use of standard text retrieval mechanisms. NIST has carried out several TREC Spoken Document Retrieval evaluations. In TREC-6 to TREC-9, from 1997 to 2000, the indexing task for broadcast news was made more challenging with regard to the quality and amount of processed speech data. Other well performing systems for indexing broadcast news (BN) are the systems from LIMSI (Gauvain et al.), the HTK group of the University of Cambridge [Woodland, 1999] and the BBN system. One result of this research work was that text retrieval performance was not affected by the higher error rates of the BN task, which vary between 15% and 30%. It has also been shown that the segmentation of complex speech recordings that include many speaker changes, music, and background noise decreases system performance. Oard et al. started the MALACH project in 2002. This system combines speech recognition technology and text retrieval algorithms to index multilingual speech recordings from an oral history archive.
In the area of music indexing and retrieval, one of the first systems was developed by Foote et al. Based on low level audio processing features, which were mainly invented and standardized in MPEG-7 Audio, several groups developed Audio-ID systems. Audio-ID technology generates a fingerprint of a segment of music and provides fast matching algorithms to find this fingerprint in a large pre-processed archive. More recently, the focus of music retrieval has shifted to genre and mood classification.
Starting near the turn of the 21st century, researchers noticed that the feature based similarity search algorithms were not as intuitive or as user-friendly as they had expected. One could say that systems built by research scientists were essentially systems which could only be used effectively by scientists. The new direction was toward designing systems which would be user friendly and could bring the vast multimedia knowledge from libraries, databases, and collections to the world. To do this it was noted that the next evolution of systems would need to understand the semantics of a query, not simply the low level underlying computational features. This general problem was called “bridging the semantic gap”. From a pattern recognition perspective, this roughly meant translating the easily computable low level content-based media features to high level concepts or terms which would be intuitive to the user. Examples of bridging the semantic gap for the single concept of human faces were demonstrated by Rowley, et al. [1996] and Lew and Huijsmans [1996]. Perhaps the earliest pictorial content-based retrieval system which addressed the semantic gap problem in the query interface, indexing, and results was the ImageScape search engine [Lew 2000]. In this system, the user could make direct queries for multiple visual objects such as sky, trees, water, etc. using spatially positioned icons in a WWW index containing 10+ million images and videos using keyframes. The system used information theory to determine the best features for minimizing uncertainty in the classification.
At this point it is important to note that the feature based similarity search engines were useful in a variety of contexts [Smeulders, et al. 2000] such as searching trademark databases [Eakins, et al. 2003], finding video shots with similar visual content and motion or for DJs searching for music with similar rhythms [Foote 1999], automatic detection of pornographic content [Forsyth and Fleck 1999; Bosson, et al. 2002], and copyright infringement detection [Jaimes 2002, Joly 2003]. Intuitively, the most pertinent applications are those where the basic features such as color and texture in images and video; or dominant rhythm, melody, or frequency spectrum in audio [Foote 1999] are highly correlated to the search goals of the particular application.
In this section we discuss representative work [Dimitrova 2003; Lew 2006; Sebe, et al. 2003 (CIVR)] done in content-based multimedia retrieval in recent years. The two fundamental necessities for a multimedia information retrieval system are (1) searching for a particular media item such as a particular object or concept; and (2) browsing and summarizing a media collection.
In searching for a particular media item, current systems have significant limitations, such as an inability to understand a wide user vocabulary or to gauge the user's satisfaction level. In addition, there exist neither credible, representative real world test sets for evaluation nor benchmarking measures which are clearly correlated with user satisfaction. In general, current systems have not yet had significant impact on society due to an inability to bridge the semantic gap between computers and humans.
Learning algorithms are interesting because they potentially allow the computer to understand the media collection on a semantic level. Furthermore, learning algorithms may be able to adapt and compensate for the noise and clutter in real world contexts. New features are pertinent in that they can potentially improve the detection and recognition process or be correlated with human perception. New media types address the changing nature of the media in the collections or databases. Some of the recent new media include 3D models (e.g. for virtual reality or games).
For the most recent research, there currently are several conferences dedicated to the field of MIR such as the ACM SIGMM Workshop on Multimedia Information Retrieval (http://www.liacs.nl/~mir), the ACM International Conference on Image and Video Retrieval (http://www.civr.org), the International Conference on Music Information Retrieval (ISMIR), IEEE International Conference on Acoustics, Speech, and Signal Processing, and the INTERSPEECH conference. For a searchable MIR library, we suggest the community driven digital library at the Association for Multimedia Search and Retrieval (http://www.amsr.org). Additionally, the general multimedia conferences such as ACM Multimedia (http://www.sigmm.org) and the IEEE International Conference on Multimedia and Expo (ICME) typically have MIR related tracks.
1.2.1. Learning and Semantics
The potential for learning in multimedia retrieval is quite compelling toward bridging the semantic gap, and the recent research literature has seen significant interest in applying classification and learning [Therrien 1989; Winston 1992; Haralick and Shapiro 1993] algorithms to MIR. The Karhunen-Loeve (KL) transform or principal components method [Therrien 1989] has the property of representational optimality for a linear description of the media. It is important to distinguish between representational optimality and classification optimality. The ability to optimally represent a class does not necessarily lead to optimally classifying an instance of the class. An example of an improvement on the principal component approach was proposed by Capelli, et al. [2001], who suggest a multispace KL for classification purposes. The multispace KL directly addresses the problem of when a class is represented by multiple clusters in feature space and can be used in most cases where the normal KL would be appropriate. Zhou and Huang [2001] compared discriminating transforms and SVM for image retrieval. They found that the biased discriminating transform (BDT) outperformed the SVM. Lew and Denteneer [2001] found that the optimal linear keys in the sense of minimizing the distance between two relevant images could be found directly from Fisher's Linear Discriminant. Liu, et al. [2003] find optimal linear subspaces by formulating the retrieval problem as optimization on a Grassmann manifold. Balakrishnan, et al. [2005] propose a new representation based on biological vision which uses complementary subspaces. They compare their new representation with principal component analysis, the discrete cosine transform and the independent component transform.
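The distinction between representational and classification optimality can be made concrete with a small two-dimensional sketch. The Python code below computes the leading principal axis (the KL direction) of a pooled data set and the direction given by Fisher's Linear Discriminant for the same data split into two classes. It is an illustrative sketch only: the toy coordinates are invented, the helper names are hypothetical, and the closed-form 2x2 algebra stands in for the general eigenvalue problem.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def mean(points: List[Point]) -> Point:
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def scatter(points: List[Point], center: Point) -> Tuple[float, float, float]:
    """Entries (sxx, sxy, syy) of the symmetric 2x2 scatter matrix around center."""
    sxx = sum((p[0] - center[0]) ** 2 for p in points)
    sxy = sum((p[0] - center[0]) * (p[1] - center[1]) for p in points)
    syy = sum((p[1] - center[1]) ** 2 for p in points)
    return sxx, sxy, syy


def unit(v: Point) -> Point:
    length = math.hypot(v[0], v[1])
    return (v[0] / length, v[1] / length)


def principal_axis(points: List[Point]) -> Point:
    """Leading eigenvector of the scatter matrix: best linear representation."""
    sxx, sxy, syy = scatter(points, mean(points))
    largest = (sxx + syy) / 2.0 + math.sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)
    v = (largest - syy, sxy)
    if v == (0.0, 0.0):  # uncorrelated data whose variance is largest along y
        return (0.0, 1.0)
    return unit(v)


def fisher_direction(class_a: List[Point], class_b: List[Point]) -> Point:
    """Direction Sw^-1 (mean_a - mean_b), with Sw the within-class scatter."""
    mean_a, mean_b = mean(class_a), mean(class_b)
    axx, axy, ayy = scatter(class_a, mean_a)
    bxx, bxy, byy = scatter(class_b, mean_b)
    sxx, sxy, syy = axx + bxx, axy + bxy, ayy + byy
    det = sxx * syy - sxy * sxy
    dx, dy = mean_a[0] - mean_b[0], mean_a[1] - mean_b[1]
    return unit(((syy * dx - sxy * dy) / det, (sxx * dy - sxy * dx) / det))


# Invented toy data: large spread along x, but the classes differ only along y.
class_a = [(-4.0, 1.0), (-2.0, 1.2), (0.0, 0.8), (2.0, 1.1), (4.0, 0.9)]
class_b = [(-4.0, -1.0), (-2.0, -0.8), (0.0, -1.2), (2.0, -0.9), (4.0, -1.1)]

kl_axis = principal_axis(class_a + class_b)      # nearly parallel to the x axis
fisher_axis = fisher_direction(class_a, class_b)  # nearly parallel to the y axis
```

On this data the KL axis follows the direction of greatest variance, which carries almost no class information, while the Fisher direction follows the axis that separates the two classes.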
Another approach toward learning semantics is to determine the associations between features and the semantic descriptions. Djeraba [2002 and 2003] examines the problem of data mining and discovering hidden associations during image indexing and considers a visual dictionary which groups together similar colors and textures. A learning approach is explored by Krishnapuram, et al. [2004] in which they introduce a fuzzy graph matching algorithm. Greenspan, et al. [2004] perform clustering on space-time regions in feature space toward creating a piece-wise GMM framework which allows for the detection of video events.
1.2.1.1. Concept Detection in Complex Backgrounds
One of the most important challenges and perhaps the most difficult problem in semantic understanding of media is visual concept detection in the presence of complex backgrounds. Many researchers have looked at classifying whole images, but the granularity is often too coarse to be useful in real world applications. It is typically necessary to find the human in the picture, not simply global features. Another limiting case is where researchers have examined the problem of detecting visual concepts in laboratory conditions where the background is simple and therefore can be easily segmented. Thus, the challenge is to detect all of the semantic content within an image such as faces, trees, animals, etc. with emphasis on the presence of complex backgrounds.
In the mid 90s, there was a great deal of success in the special case of detecting the locations of human faces in grayscale images with complex backgrounds. Lew and Huijsmans [1996] used Shannon's information theory to minimize the uncertainty in the face detection process. Rowley, et al. [1996] applied several neural networks toward detecting faces. Both methods had the limitation of searching for whole faces which prompted later component based model approaches which combined separate detectors for the eyes and nose regions. For the case of near frontal face views in high quality photographs, the early systems generally performed near 95% accuracy with minimal false positives. Non-frontal views and low quality or older images from cultural heritage collections are still considered to be very difficult. An early example of designing a simple detector for city pictures was demonstrated by Vailaya, et al. [1998]. They used a nearest neighbor classifier in conjunction with edge histograms. In more recent work, Schneiderman and Kanade [2004] proposed a system for component based face detection using the statistics of parts. Chua, et al. [2002] used the gradient energy directly from the video representation to detect faces based on the high contrast areas such as the eyes, nose, and mouth. They also compared a rules based classifier with a neural network and found that the neural network gave superior accuracy. For a good overview, Yang, et al. [2002] did a comprehensive survey on the area of face detection.
Detecting a wider set of concepts other than human faces turned out to be fairly difficult. In the context of image search over the Internet, Lew [2000] showed a system for detecting sky, trees, mountains, grass, and faces in images with complex backgrounds. Fan, et al. [2004] used multi-level annotation of natural scenes using dominant image components and semantic concepts. Li and Wang [2003] used a statistical modeling approach toward converting images to keywords. Rautianinen, et al. [2001] used temporal gradients and audio analysis in video to detect semantic concepts.
In certain contexts, there may be several media types available, which allows for multimodal analysis. Shen, et al. [2000] discussed a method for giving descriptions of WWW images by using lexical chain analysis of the nearby text on webpages. Benitez and Chang [2002] exploit WordNet to disambiguate descriptive words. They also found 3-15% improvement from combining pictorial search with text analysis. Amir, et al. [2004] proposed a framework for a multi-modal system for video event detection which combined speech recognition and annotated video. Dimitrova, et al. [2000] proposed a Hidden Markov Model based approach using text and faces for video classification. In the TRECVID [Smeaton and Over 2003] project, the current focus is on multiple domain concept detection for video retrieval.
1.2.1.2. Relevance Feedback
Beyond the one-shot queries in the early similarity based search systems, the next generation of systems attempted to integrate continuous feedback from the user toward learning more about the user query. The interactive process of asking the user a sequential set of questions after each round of results was called relevance feedback due to the similarity with older pure text approaches. Relevance feedback can be considered a special case of emergent semantics. Other names have included query refinement, interactive search, and active learning from the computer vision literature.
The fundamental idea behind relevance feedback is to show the user a list of candidate images, ask the user to decide whether each image is relevant or irrelevant, and modify the parameter space, semantic space, feature space, or classification space to reflect the relevant and irrelevant examples. In the simplest relevance feedback method from Rocchio [Rocchio 1971], the idea is to move the query point toward the relevant examples and away from the irrelevant examples. In principle, relevance feedback can be viewed as a particular type of pattern classification in which the positive and negative examples are found from the relevant and irrelevant labels, respectively.
Therefore, it is possible to apply any learning algorithm into the relevance feedback loop. One of the major problems in relevance feedback is how to address the small training set. A typical user may only want to label 50 images when the algorithm really needs 5000 examples instead. If we compare the simple Rocchio algorithm to more sophisticated learning algorithms such as neural networks, it is clear that one reason the Rocchio algorithm is popular is that it requires very few examples. However, one challenging limitation of the Rocchio algorithm is that there is a single query point which would refer to a single cluster of results. In the discussion below we briefly describe some of the recent innovations in relevance feedback.
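The Rocchio update described above can be sketched in a few lines of Python. This is a minimal sketch of the commonly stated form of the update, in which the new query is a weighted combination of the old query, the centroid of the relevant examples, and the centroid of the irrelevant examples. The function names and the default weights `alpha`, `beta`, and `gamma` are assumptions made for illustration and are not taken from any system cited here.

```python
from typing import List, Sequence

Vector = List[float]


def centroid(vectors: Sequence[Sequence[float]], dim: int) -> Vector:
    """Mean feature vector; the zero vector if no examples were labeled."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def rocchio_update(
    query: Sequence[float],
    relevant: Sequence[Sequence[float]],
    irrelevant: Sequence[Sequence[float]],
    alpha: float = 1.0,   # weight of the original query (assumed default)
    beta: float = 0.75,   # pull toward relevant examples (assumed default)
    gamma: float = 0.15,  # push away from irrelevant examples (assumed default)
) -> Vector:
    """Move the query point toward relevant and away from irrelevant examples."""
    dim = len(query)
    positive = centroid(relevant, dim)
    negative = centroid(irrelevant, dim)
    return [
        alpha * q + beta * p - gamma * n
        for q, p, n in zip(query, positive, negative)
    ]


# One feedback round on invented 2D feature vectors.
new_query = rocchio_update(
    query=[0.0, 0.0],
    relevant=[[1.0, 1.0], [3.0, 1.0]],
    irrelevant=[[-2.0, 0.0]],
)
```

Because the result is still a single point in feature space, the sketch also makes the limitation noted above visible: relevant examples that fall into two distant clusters are averaged into one query that may lie near neither.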
Chang, et al. [1998] proposed a framework which allows for interactive construction of a set of queries which detect visual concepts such as sunsets. Sclaroff, et al. [2001] describe the first WWW image search engine which focussed on relevance feedback based improvement of the results. In their initial system, where they used relevance feedback to guide the feature selection process, it was found that the positive examples were more important towards maximizing accuracy than the negative examples. Rui and Huang [2001] compare heuristic to optimization based parameter updating and find that the optimization based method achieves higher accuracy.
Chen, et al. [2001] described a one-class SVM method for updating the feedback space which shows substantially improved results over previous work. He, et al. [2002] use both short term and long term perspectives to infer a semantic space from user’s relevance feedback for image retrieval. The short term perspective was found by marking the top 3 incorrect examples from the results as irrelevant and selecting at most 3 images as relevant examples from the current iteration. The long term perspective was found by updating the semantic space from the results of the short term perspective. Yin, et al. [2005] found that combining multiple relevance feedback strategies gives superior results as opposed to any single strategy. Tieu and Viola [2004] proposed a method for applying the AdaBoost learning algorithm and noted that it is quite suitable for relevance feedback due to the fact that AdaBoost works well with small training sets. Howe [2003] compares different strategies using AdaBoost. Dy, et al. [2003] use a two level approach via customized queries and introduce a new unsupervised learning method called feature subset selection using expectation-maximization clustering. Their method doubled the accuracy for the case of a set of lung images. Guo, et al. [2001] performed a comparison between AdaBoost and SVM and found that SVM gives superior retrieval results. Haas, et al. [2004] described a general paradigm which integrates external knowledge sources with a relevance feedback mechanism and demonstrated on real test sets that the external knowledge substantially improves the relevance of the results. Ferecatu [Ferecatu2005] proposed a hybrid visual and conceptual image representation within an active relevance feedback context. A good overview can also be found in Muller, et al. [2000].
1.2.2. New Features & Similarity Measures
Research did not only proceed along the lines of improved search algorithms, but also toward creating new features and similarity measures based on color, texture, and shape. Among the recent interesting additions to the set of features are those from the MPEG-7 standard [Pereira and Koenen 2001]. The new color features [Lew 2001, Gevers2001] such as the NF, rgb, and m color spaces have specific benefits in areas such as lighting invariance, intuitiveness, and perceptual uniformity. A quantitative comparison of influential color models is performed in Sebe and Lew [2001].
In texture understanding, Ojala, et al. [1996] found that combining relatively simple texture histograms outperformed traditional texture models such as Gaussian or Markov features. Jafari-Khouzani and Soltanian-Zadeh [2005] proposed a new texture feature based on the Radon transform orientation which has the significant advantage of being rotationally invariant. Insight into the MPEG-7 texture descriptors has been given by Wu, et al. [2001].
Veltkamp and Hagedoorn [2001] describe the state of the art in shape matching from the perspective of computational geometry. Sebe and Lew [2002] evaluate a wide set of shape measures in the context of image retrieval. Srivastava, et al. [2005] describe some novel approaches to learning shape. Sebastian, et al. [2004] introduce the notion of shape recognition using shock graphs. Bartolini, et al. [2005] suggest using the Fourier phase and time warping distance.
Foote [2000] introduces a feature for audio based on local self-similarity. The important benefit of the feature is that it can be computed for any audio signal and works well on a wide variety of audio segmentation and retrieval applications. Bakker and Lew [2002] suggest several new audio features called the frequency spectrum differentials and the differential swap rate. They evaluate the new audio features in the context of automatically labeling a sample as either speech, music, piano, organ, guitar, automobile, explosion, or silence and achieve promising results.
Fauqueur et al. [Fauqueur2004] devise a new histogram based color descriptor that uses distributions of quantised colors, previously employed in global image feature techniques, in the local feature extraction case. Considering that description must be finer for regions than for images, they propose a region descriptor of fine color variability: the Adaptive Distribution of Color Shades (ADCS). They combine ADCS with an appropriate similarity measure to enable its use in indexing.
Equally important to novel features is the method to determine similarity between them. Jolion [2001] gives an excellent overview of the common similarity measures. Sebe, et al. [2000] discuss how to derive an optimal similarity measure given a training set. In particular they find that the sum of squared distance tends to be the worst similarity measure and that the Cauchy metric outperforms the commonly used distance measures. Jacobs, et al. [2000] investigate non-metric distances and evaluate their performance. Beretti, et al. [2001] propose an algorithm which relies on graph matching for a similarity measure. Cooper, et al. [2005] suggest measuring image similarity using time and pictorial content.
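To show why the choice of similarity measure matters, the Python sketch below ranks two candidate feature vectors against a query under three measures. It is a hedged illustration, not a reproduction of the cited experiments: the Cauchy distance is written in one commonly used form, the sum over features of log(1 + (d / scale)^2), and that exact form, the `scale` parameter, the function names, and the toy vectors are all assumptions.

```python
import math
from typing import Sequence


def sum_of_squared_differences(x: Sequence[float], y: Sequence[float]) -> float:
    return sum((a - b) ** 2 for a, b in zip(x, y))


def sum_of_absolute_differences(x: Sequence[float], y: Sequence[float]) -> float:
    return sum(abs(a - b) for a, b in zip(x, y))


def cauchy_distance(x: Sequence[float], y: Sequence[float], scale: float = 1.0) -> float:
    """Assumed form: large differences are damped by the logarithm."""
    return sum(math.log(1.0 + ((a - b) / scale) ** 2) for a, b in zip(x, y))


# Invented feature vectors.
query = [0.0, 0.0, 0.0, 0.0]
evenly_off = [1.0, 1.0, 1.0, 1.0]   # a small error in every feature
one_outlier = [0.0, 0.0, 0.0, 3.0]  # an exact match except for one large error

ssd_prefers_evenly_off = (
    sum_of_squared_differences(query, evenly_off)
    < sum_of_squared_differences(query, one_outlier)
)
cauchy_prefers_one_outlier = (
    cauchy_distance(query, one_outlier) < cauchy_distance(query, evenly_off)
)
```

The sum of squared differences penalizes the single large deviation heavily and ranks `evenly_off` first, whereas the Cauchy form damps it and ranks `one_outlier` first. Which ranking is preferable depends on the noise in the features, which is why deriving the measure from a training set is attractive.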
In the last decades, a lot of research has been done on the matching of images and their structures [Schmid, et al. 2000, Mikolajczyk and Schmid 2004]. Although the approaches are very different, most methods use some kind of point selection from which descriptors are derived. Most of these approaches address the detection of points and regions that can be detected in an affine invariant way.
Lindeberg [1998] proposed an “interesting scale level” detector which is based on determining maxima over scale of a normalized blob measure. The Laplacian-of-Gaussian (LoG) function is used for building the scale space. Mikolajczyk and Schmid [2004] showed that this function is very suitable for automatic scale selection of structures. An efficient algorithm to be used in object recognition was proposed by Lowe [2004]. This algorithm constructs a scale space pyramid using difference-of-Gaussian (DoG) filters. The DoG can be used to obtain an efficient approximation of the LoG. From the local 3D maxima a robust descriptor is built for matching purposes. The disadvantage of using DoG or LoG as feature detectors is that the repeatability is not optimal since they respond not only to blobs, but also to high gradients in one direction. Because of this, the localization of the features may not be very accurate.
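The claim that the DoG approximates the LoG can be checked numerically. Since the Gaussian G satisfies dG/dσ = σ ∇²G, a first order expansion gives G(kσ) − G(σ) ≈ (k − 1) σ² ∇²G. The Python sketch below compares both sides for a one-dimensional Gaussian, where the Laplacian reduces to the second derivative. The 1D setting, the sampling grid, and all function names are simplifying assumptions made for illustration.

```python
import math


def gaussian(x: float, sigma: float) -> float:
    return math.exp(-x * x / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)


def laplacian_of_gaussian(x: float, sigma: float) -> float:
    """Second derivative of the 1D Gaussian with respect to x."""
    return gaussian(x, sigma) * (x * x - sigma * sigma) / sigma ** 4


def difference_of_gaussians(x: float, sigma: float, k: float) -> float:
    return gaussian(x, k * sigma) - gaussian(x, sigma)


def relative_approximation_error(
    sigma: float, k: float, half_width: float = 5.0, steps: int = 201
) -> float:
    """Largest gap between DoG and (k - 1) * sigma^2 * LoG on a grid,
    divided by the largest magnitude of the scaled LoG."""
    span = half_width * sigma
    xs = [-span + 2.0 * span * i / (steps - 1) for i in range(steps)]
    scaled_log = [(k - 1.0) * sigma ** 2 * laplacian_of_gaussian(x, sigma) for x in xs]
    dog = [difference_of_gaussians(x, sigma, k) for x in xs]
    worst_gap = max(abs(d - s) for d, s in zip(dog, scaled_log))
    return worst_gap / max(abs(s) for s in scaled_log)


error_close_scales = relative_approximation_error(sigma=1.0, k=1.01)
error_far_scales = relative_approximation_error(sigma=1.0, k=1.3)
```

The approximation tightens as the scale ratio `k` approaches 1, so `error_close_scales` is much smaller than `error_far_scales`. In practice the ratio is a trade-off between approximation quality and the number of scale levels that must be computed.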
An approach that intuitively arises from this observation is the separation of the feature detector and the scale selection. The commonly used Harris detector [Harris and Stephens 1988] is robust to noise and lighting variations, but only to a very limited extent to scale changes [Schmid, et al. 2000]. To deal with this Dufournoud, et al. [2000] proposed the scale adapted Harris operator. Given the scale adapted Harris operator, a scale space can be created. Local 3D maxima in this scale space can be taken as salient points but this scale adapted Harris operator rarely attains a maximum over scales. This results in very few points, which are not representative enough for the image. To address this problem, Mikolajczyk and Schmid [2004] proposed the Harris-Laplace detector that merges the scale-adapted Harris corner detector and the Laplacian based scale selection.
In recent years, much of the research on scale invariance has been generalized to affine invariance. Affine invariance is defined here as invariance to non-uniform scaling in different directions. This allows for matching of descriptors under perspective transformations since a global perspective transformation can be locally approximated by an affine transformation [Tuytelaars and van Gool 2000]. The use of the second moment matrix (or autocorrelation matrix) of a point for affine normalization was explored by Lindeberg and Garding [1997]. A similar approach was used by Baumberg [2000] for feature matching.
All the above methods were designed to be used in the context of object-class recognition applications. However, it was found that wavelet-based salient points [Tian, et al. 2001] outperform traditional interest operators such as corner detectors when they are applied to general content-based image retrieval. For a good overview, we refer the reader to Sebe, et al. [IVC 2003].
Some recent works focus on detecting more perceptible local structure. Szumilas et al. [Szumilas2007] extract feature centre locations at places where a symmetry measure is maximized. Next, boundary points along rays emanating from the centre are extracted. Boundary points are defined as edges or transitions between relatively different regions, and are extracted by hierarchical clustering of pixel feature values along the ray. Rebai et al. [Rebai2007] focus their interpretable interest points on radial symmetry centers detected by a Hough like strategy generalized to several tangential angles.
To eliminate the Out-Of-Vocabulary (OOV) problem in spoken document retrieval, subword units have been introduced for indexing [Larson2007]. Given phone or syllable transcriptions generated by an automatic speech recognition system, fuzzy matching algorithms such as Levenshtein based fuzzy search allow arbitrary textual search queries to be formulated. Here, new indexing paradigms are required to provide a short reaction time during retrieval.
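A Levenshtein based fuzzy search over subword transcriptions can be sketched as approximate substring matching: the query is converted to a sequence of subword units and compared against every span of each transcript, allowing a bounded number of insertions, deletions, and substitutions. The Python sketch below is an illustration under stated assumptions and not the method of [Larson2007]: the phone symbols, document identifiers, and function names are hypothetical, and the transcripts are scanned linearly rather than through an index.

```python
from typing import Dict, List, Sequence


def best_substring_distance(query: Sequence[str], transcript: Sequence[str]) -> int:
    """Minimum Levenshtein distance between query and any contiguous span of transcript."""
    # Row 0 is all zeros: a match may start at any transcript position at no cost.
    previous = [0] * (len(transcript) + 1)
    for i, q in enumerate(query, start=1):
        current = [i] + [0] * len(transcript)
        for j, t in enumerate(transcript, start=1):
            substitution = previous[j - 1] + (0 if q == t else 1)
            current[j] = min(previous[j] + 1, current[j - 1] + 1, substitution)
        previous = current
    # A match may also end at any transcript position.
    return min(previous)


def fuzzy_search(
    query: Sequence[str], transcripts: Dict[str, List[str]], max_errors: int = 1
) -> List[str]:
    """Document ids whose transcript contains the query within max_errors edits,
    ordered by increasing edit distance."""
    hits = []
    for doc_id, units in transcripts.items():
        distance = best_substring_distance(query, units)
        if distance <= max_errors:
            hits.append((distance, doc_id))
    return [doc_id for _, doc_id in sorted(hits)]


# Hypothetical phone transcriptions of two recordings.
transcripts = {
    "news_01": "dh ax k ae t s ae t".split(),
    "news_02": "hh ax l ow w er l d".split(),
}
matches = fuzzy_search("k ae p".split(), transcripts, max_errors=1)
```

The query matches `news_01` with a single substitution even though the recognizer never produced the query word itself. Each comparison costs time proportional to the product of the query and transcript lengths, which is why a linear scan of this kind does not scale and dedicated indexing structures are needed.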
1.2.3. 3D Retrieval
In the early years of MIR, most research focussed on content-based image retrieval. Recently, there has been a surge of interest in a wide variety of media. An excellent example, “life records”, which encompasses simultaneously all types of media is being actively promoted by Bell [2004]. He is investigating the issues and challenges in processing life records - all the text, audio, video, and media related to a person's life.
Beyond text, audio, images, and video, there has been significant recent interest in new media such as 3D models. Assfalg, et al. [2004] discuss using spin-images, which essentially encode the density of mesh vertices projected onto a 2D space, resulting in a 2D histogram. It was found that they give an effective view-independent representation for searching through a database of cultural artifacts. Funkhouser, et al. [2003] develop a search engine for 3D models based on shape matching using spherical harmonics to compute discriminating similarity measures which are effective even in the presence of model degeneracies. An overview of how 3D models are used in content-based retrieval systems can be found in Tangelder and Veltkamp [2004].
1.2.4. Browsing and Summarization
There has been a wide variety of innovative ways of browsing and summarizing multimedia information. Spierenburg and Huijsmans [1997] proposed a method for converting an image database into a movie. The intuition was that one could cluster a sufficiently large image database so that visually similar images would be in the same cluster. After the cluster process, one can order the clusters by the inter-cluster similarity, arrange the images in sequential order and then convert to a video. This allows a user to have a gestalt understanding of a large image database in minutes.
Sundaram, et al. [2002] took a similar approach toward summarizing video. They introduced the idea of a video skim which is a shortened video composed of informative scenes from the original video. The fundamental idea is for the user to be able to receive an abstract of the story but in video format.
Snoek, et al. [2005] propose several methods for summarizing video such as grouping by categories and browsing by category and in time. Chiu, et al. [2005] created a system for texturing a 3D city with relevant frames from video shots. The user would then be able to fly through the 3D city and browse all of the videos in a directory. The most important frames would be located on the roofs of the buildings in the city so that a high altitude fly through would result in viewing a single frame per video.
Uchihashi, et al. [1999] suggested a method for converting a movie into a cartoon strip in the Manga style from Japan. This means altering the size and position of the relevant keyframes from the video based on their importance. Tian, et al. [2002] took the concept of variable size and positions of images to the next level by posing it as a general optimization problem: what is the optimal arrangement of images on the screen so that the user can optimally browse an image database?
Liu, et al. [2004] address the problem of effective summarization of images from WWW image search engines. They compare a rank list summarization method to an image clustering scheme and find that their users find the clustering scheme allows them to explore the image results more naturally and effectively.
1.2.5. High Performance Indexing
In the early multimedia database systems, the multimedia items such as images or video were frequently simply files in a directory or entries in an SQL database table. From a computational efficiency perspective, both options exhibited poor performance because most filesystems use linear search within directories and most databases could only perform efficient operations on fixed size elements. Thus, as the size of the multimedia databases or collections grew from hundreds to thousands to millions of variable sized items, the computers could not respond in an acceptable time period.
Even as the typical SQL database systems began to implement higher performance table searches, the search keys had to be exact such as in text search. Audio, images, and video were stored as blobs which could not be indexed effectively. Therefore, researchers [Egas, et al. 1999; Lew 2000] turned to similarity based databases which used tree-based indexes to achieve logarithmic performance. Even in the case of multimedia oriented databases such as the Informix database, it was still necessary to create custom datablades to handle efficient similarity searching such as k-d trees [Egas, et al. 1999]. In general the k-d tree methods had linear worst case performance and logarithmic average case performance in the context of feature based similarity searches. A recent improvement to the k-d tree method is to integrate entropy based balancing [Scott and Shyu 2003].
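To make the performance remark above concrete, the following is a minimal sketch of a k-d tree with nearest neighbor search. It is an illustration under stated assumptions, not the datablade implementation of Egas, et al. [1999]: the names `KDNode`, `build_kdtree`, and `nearest` are hypothetical, the feature vectors are made-up 2-D points, and Euclidean distance is assumed as the similarity measure.

```python
import math

class KDNode:
    """One node of a k-d tree: a stored point, its split axis, and two subtrees."""
    def __init__(self, point, axis, left, right):
        self.point, self.axis, self.left, self.right = point, axis, left, right

def build_kdtree(points, depth=0):
    """Recursively split the points at the median, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return KDNode(ordered[mid], axis,
                  build_kdtree(ordered[:mid], depth + 1),
                  build_kdtree(ordered[mid + 1:], depth + 1))

def nearest(node, query, best=None):
    """Return (distance, point) for the stored point closest to the query."""
    if node is None:
        return best
    dist = math.dist(node.point, query)
    if best is None or dist < best[0]:
        best = (dist, node.point)
    diff = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, query, best)
    # Visit the far side only if the splitting plane is closer than the best match so far.
    if abs(diff) < best[0]:
        best = nearest(far, query, best)
    return best

# Hypothetical 2-D feature vectors standing in for image descriptors.
features = [(0.1, 0.9), (0.4, 0.4), (0.8, 0.2), (0.5, 0.7), (0.9, 0.9)]
tree = build_kdtree(features)
distance, match = nearest(tree, (0.45, 0.5))
```

When the pruning test skips the far subtree, the search follows roughly one root-to-leaf path, which is where the logarithmic average case comes from. When the test rarely succeeds, as tends to happen with high-dimensional features, most nodes are visited and the cost approaches the linear worst case.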
Other data representations have also been suggested besides k-d trees. Ye and Xu [2003] show that vector quantization can be used effectively for searching large databases. Elkwae and Kabuka [2000] propose a 2-tier signature based method for indexing large image databases. Type 1 signatures represent the properties of the objects found in the images. Type 2 signatures capture the inter-object spatial positioning. Together these signatures allow them to achieve a 98% performance improvement. Shao, et al. [2003] use invariant features together with efficient indexing to achieve near real-time performance in the context of k nearest neighbor searching.
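The general idea of searching with vector quantization can be sketched as follows. This is a generic codebook index written for illustration, not the specific method of Ye and Xu [2003]: the function names, the k-means style codebook training, the evenly spaced choice of initial codewords, and the toy data are all assumptions. Each database vector is assigned to its nearest codeword, and a query is compared only against the vectors that share its codeword.

```python
import math

def nearest_index(vector, centroids):
    """Index of the codeword (centroid) closest to the vector."""
    return min(range(len(centroids)), key=lambda i: math.dist(vector, centroids[i]))

def train_codebook(vectors, k, iterations=10):
    """Plain k-means; initial codewords are taken at evenly spaced positions."""
    step = max(1, len(vectors) // k)
    centroids = list(vectors[::step][:k])
    for _ in range(iterations):
        cells = [[] for _ in centroids]
        for v in vectors:
            cells[nearest_index(v, centroids)].append(v)
        # Move each codeword to the mean of its cell; keep it unchanged if the cell is empty.
        centroids = [tuple(sum(c) / len(cell) for c in zip(*cell)) if cell else old
                     for cell, old in zip(cells, centroids)]
    return centroids

def build_index(vectors, k):
    """Quantize every database vector to its nearest codeword."""
    centroids = train_codebook(vectors, k)
    cells = [[] for _ in centroids]
    for v in vectors:
        cells[nearest_index(v, centroids)].append(v)
    return centroids, cells

def search(query, centroids, cells):
    """Approximate nearest neighbor: scan only the cell of the query's codeword."""
    candidates = cells[nearest_index(query, centroids)]
    if not candidates:
        return None
    return min(candidates, key=lambda v: math.dist(v, query))

# Toy database with two well separated groups of feature vectors.
database = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, cells = build_index(database, k=2)
match = search((10.2, 10.1), centroids, cells)
```

The search is approximate by design: a true nearest neighbor that falls just across a cell boundary is missed. In exchange, each query touches only one cell rather than the whole collection.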
Other kinds of high performance indexing problems appear when searching peer-to-peer (P2P) networks, owing to the curse of dimensionality, the high communication overhead, and the fact that all searches within the network are based on nearest neighbor methods. Müller and Henrich [2003] suggest an effective P2P search algorithm based on compact peer data summaries. They show that their model allows peers to communicate with only a small sample of peers and still retain high quality results.
The goal of this section is to summarize the multimedia analysis research that takes place within several European projects and national initiatives. We explicitly mention the research partners and their contributions to the different types of media analysis: (1) speech, music, and audio analysis; (2) image analysis; (3) 3D analysis in images and video; (4) video analysis; and (5) text and semantics. Note that most of these research efforts are not restricted to a single medium; rather, they address the multimedia problem and advocate the use of cross-media inference and analysis. At the end of the section we also summarize the state of the art in the analysis of the different media, focusing on the following issues: (1) objectives; (2) approaches and technologies; (3) systems; (4) applications; and (5) challenges.
1.3.1. Multimedia Analysis in European Projects
Audio-visual indexing and retrieval is the main focus of the nine funded IST projects of the strategic objective “Audio Visual Search Technologies”. Table 1 shows which partners in the nine projects work on indexing and retrieval technologies for the different media types. This information was collected from the different projects and was augmented by us in cases where it was unavailable or incomplete.
The table shows that all types of media are well covered by the funded EU projects. In the IP projects (Vitalas) all media types are presented. 3D indexing and retrieval is the main focus of the Victory project, while Rushes also addresses this subject. In these projects special 3D search engine technology will be developed. It is also clear that video processing is a very active research area. Motivated by work in the context of TRECVID, many research groups continue working to improve video retrieval performance; Vidi-Video is a good example.
| Project | Speech/Audio | Image | 3D | Video | Text/Semantics |
|---|---|---|---|---|---|
| DIVAS | FhG IDMT Sail Labs | | | Elecard | |
| PHAROS | Univ. P. Fabra FhG IDMT Sail Labs | EPFL | | EPFL Open Univ., UK | Web Models L3S Research |
| RUSHES | Brunel Univ. | Brunel Univ. | FhG HHI | Queen Mary Univ. Brunel Univ. | Queen Mary Univ. Brunel Univ. |
| SAPIR | IBM Univ. of Padova | CNR | | Eurix | Xerox |
| SEMEDIA | | | | Joanneum Research Fundacio Barcelona Univ. P. Fabra UPC Barcelona Digital Video Systems Univ. of Glasgow | |
| TRIPOD | | Dublin City Univ. | | | Sheffield Univ. |
| VICTORY | | | Certh/ITI | | |
| VIDI-VIDEO | INESC Lisboa | U. Surrey UvA ITI U. Florence | | UvA ITI U. Florence | |
| VITALAS | FhG IAIS | INRIA Robotiker | | INRIA CWI Certh/ITI | Univ. of Sunderland EADS |
Table 1: Overview of the AV indexing activities in the 9 IST projects with information about the active partners
1.3.2. Multimedia Analysis in National Initiatives
Many national projects conduct research in the area of audio-visual indexing and retrieval. Although the overall focus of the national projects differs, the underlying technologies are quite similar. Table 2 presents a summary of the research activities in the national projects for the different types of media.
In all national projects a strong participation of industrial partners can be observed. The research activities are application driven with a clear market focus. In the German Theseus project, tools for semantic knowledge engineering and future Web applications will be developed. The main objective of the French Quaero project is to provide applications for the multimedia business sector. MultimediaN already shows concrete results and demo applications for advanced multimedia search. IM2 is carrying out research in the area of meeting annotation, which requires innovation in multimedia indexing and communication modelling.
| Project | Speech/Audio | Image | 3D | Video | Text/Semantics |
|---|---|---|---|---|---|
| Quaero (French) | Limsi RWTH Aachen Univ. Karlsruhe VecSys IRCAM | INRIA Univ. J. Fourier Jouve | | INRIA LTU Univ. J. Fourier | Jouve Limsi INRIA |
| Theseus (German) | FhG IAIS M2Any | FhG HHI FhG First Siemens CT | FhG HHI FhG IGD | FhG HHI Siemens | Univ. Karlsruhe FhG IAIS DFKI FZI |
| iAD (Norway) | | | | Dublin Univ. | Fast |
| MultimediaN (Dutch) | U. Twente TU Delft | CWI U. Amsterdam | | U. Amsterdam CWI TU Delft Philips | U. Twente |
| IM2 (Swiss) | IDIAP | EPFL IDIAP | | U. Fribourg IDIAP | |
| Mundos (Spanish) | | | | CineVideo20 | |
Table 2: Overview of the AV indexing activities in the national research projects with information about the active partners
1.3.3. State-of-the-Art in European Research
State-of-the-Art: Speech Analysis
- Objectives
  - Automatic indexing of huge audio archives using speech technology
- Approaches/Technologies
  - Speech recognition: HMM based LVCSR systems, Spoken Document Retrieval, Subword indexing (SAPIR, VITALAS, PHAROS, Quaero, Theseus, MultimediaN, IM2)
  - Speech segmentation: speaker clustering and recognition (DIVAS, VITALAS, Quaero, Theseus, MultimediaN, IM2)
  - Speech-to-video transcoding (DIVAS)
- Systems
  - IST AV-projects: IBM speech system (SAPIR), Audiomining System from Fraunhofer IAIS (VITALAS), Sail Labs Technology (DIVAS), AudioSurf from Limsi & Vecsys (Quaero)
  - Others: BBN, HTK-Group Cambridge, LIMSI, RWTH Aachen, Nuance, etc.
- Applications
  - Indexing of broadcast news/archives (VITALAS, DIVAS, VIDIVIDEO, Quaero, Theseus, MultimediaN)
  - Podcast/Videocast search (Podzinger, Blinkx)
  - Audio archives (Parliament data, historical archives)
- Challenges
  - Variability of content (e.g. background noise)
  - Domain dependency
  - Scalability of subword approaches
  - Language dependency
State-of-the-Art: Music Analysis
- Objectives
  - Automatic indexing and classification of large music collections
- Approaches/Technologies
  - Music segmentation: Spectral Flatness (MPEG-7 Audio), Genetic Algorithms, Viterbi (DIVAS, PHAROS, Quaero, Theseus)
  - Music retrieval and recommendation (SOMs) (SAPIR, Theseus)
- Systems
  - IST projects: Fraunhofer IDMT (DIVAS, PHAROS), M2Any (Theseus), IRCAM (Quaero)
  - Others: Barcelona Music & Audio Technologies, FhG AudioID, PlaySom (Univ. Vienna), SyncPlayer (Univ. Bonn), etc.
- Applications
  - Indexing of music collections
  - Query by humming
  - Audio-music identification
  - Recommendation engines
- Challenges
  - Genre classification
  - Polyphonic instrument recognition
  - Affective analysis
State-of-the-Art: Image Analysis
- Objectives
  - Indexing and retrieval of images, object recognition
- Approaches/Technologies
  - Low level image processing (histograms, shapes, textures, MPEG-7 Visual, SIFT) (SAPIR, VIDIVIDEO, VITALAS, SEMEDIA, TRIPOD, Rushes, Quaero, Theseus, MultimediaN, IM2)
  - Image similarity measurements (Rushes, VIDIVIDEO, VITALAS, Theseus, IM2)
  - Relevance feedback (Rushes, SEMEDIA, VITALAS), etc.
- Systems
  - IST projects: INRIA (VITALAS), Univ. of Amsterdam & Univ. of Florence (VIDIVIDEO), etc.
  - Others: IBM (QBIC), Webseek, MPEG-7 search system (Univ. Munich), IKONA (INRIA), Riya, Nevenvision, etc.
- Applications
  - Content based retrieval in image collections
  - Object recognition
  - Face recognition (security, photo collections)
  - Automatic annotation of image collections with keywords and textual descriptions
- Challenges
  - Semantic gap
  - Image segmentation
  - Sensory gap
State-of-the-Art: Video Analysis
- Objectives
  - Automatic segmentation of videos, video retrieval, object recognition in videos
- Approaches/Technologies
  - Shot detection, keyframe generation (DIVAS, Rushes, SAPIR, VIDIVIDEO, VITALAS, Quaero, Theseus, MultimediaN, IM2)
  - Object tracking based on motion based features, closed captions recognition, etc. (Rushes, VIDIVIDEO, VITALAS, Quaero, Theseus, MultimediaN, IM2)
  - Object detection and recognition (ANN, Adaboost, SIFT) (VIDIVIDEO, VITALAS, SEMEDIA, Quaero, Theseus, MultimediaN, IM2)
  - Video annotation and summarization (Rushes, SEMEDIA, VITALAS, Quaero, Theseus, MultimediaN, IM2)
  - Metadata workflow management (SEMEDIA, PHAROS, Quaero, Theseus, MultimediaN)
  - Video event detection (SEMEDIA, VITALAS, VIDIVIDEO)
- Systems
  - IST projects: Univ. Amsterdam & Univ. Florence (VIDIVIDEO), Joanneum Research (SEMEDIA), CERTH/ITI (VICTORY), VITALAS (INA/INRIA, CERTH-ITI), Fraunhofer IAIS (Theseus)
  - Others: Virage, TRECVID participants, Informedia, Univ. of Marburg, etc.
- Applications
  - Indexing of broadcast material, media observation
  - Indexing of videocast material
  - Recommendation engines
  - Video fingerprinting, logo detection, security, etc.
  - 3D video (Rushes, VICTORY, Theseus)
- Challenges
  - Detection of complex concepts
  - Segmentation into more semantic based units (i.e. complex scenes)
  - Thousands of different objects
  - Multimodality, fusion
State-of-the-Art: Text/Semantic Analysis
- Objectives
  - Automatic indexing and classification of text based documents
- Approaches/Technologies
  - SVM, PLSI, Named Entity Recognition (Rushes, SAPIR, VITALAS, Quaero, Theseus, iAD, MultimediaN)
  - Bayesian semantic reasoning (Rushes, Theseus)
  - Caption augmentation (TRIPOD)
- Systems
  - IST projects: EADS/Univ. of Sunderland text classification (VITALAS), Yahoo (SEMEDIA), Univ. of Karlsruhe (Theseus), Empolis (Theseus)
  - Others (many): Recommind, ITxY, Xtramind (DFKI), Autonomy, Gate, etc.
- Applications
  - Classification of news and documents in companies
  - Email filtering
  - Text based search engines
- Challenges
  - Semantic analysis of multimedia (automatic) annotations
  - Semantics, Ontologies
Despite the considerable progress of academic research in multimedia information retrieval, audio-visual content indexing and retrieval research has had relatively little impact on commercial applications, with some niche exceptions such as video segmentation. One example of an attempt to merge academic and commercial interests is Riya (www.riya.com). Their goal is a commercial product that uses academic research in face detection and recognition and allows users to search through their own photo collections or the Internet for particular persons. Another example is the MagicVideo Browser (www.magicbot.com), which transfers research in video summarization to household desktop computers and has a plug-in architecture intended for easily adding promising new summarization methods as they appear in the research community. An interesting long-term initiative is the launch of Yahoo! Research Berkeley (research.yahoo.com/Berkeley), a research partnership between Yahoo! Inc. and UC Berkeley with the declared aim of exploring and inventing social media and mobile media technology and applications that will enable people to create, describe, find, share, and remix media on the web. Nevenvision (www.nevenvision.com) is developing technology for mobile phones that uses visual recognition algorithms to bring in ambient finding technology. However, these efforts are still in their infancy, and there is a need to avoid a future in which the multimedia information retrieval (MIR) community is isolated from real world interests. We believe that the MIR community has a golden opportunity to contribute to the growth of the multimedia search field, which is commonly considered the next major frontier of search [Battelle 2005]. To assess research effectively in multimedia retrieval, task-related standardized databases on which different groups can apply their algorithms are needed.
In text retrieval, it has been relatively straightforward to obtain large collections of old newspaper texts because the copyright owners do not see the raw text as being of much value. Image, video, and speech libraries, however, do see great value in their collections and consequently are much more cautious in releasing their content. While it is not a research challenge, obtaining large multimedia collections for widespread evaluation benchmarking is a practical and important step that needs to be addressed. One possible solution is for task-related image and video databases with appropriate relevance judgments to be made available to groups for research purposes, as is done with TRECVID. Useful video collections could include news video (in multiple languages), collections of personal videos, and possibly movie collections. Image collections would include image databases (maybe on specific topics) along with annotated text; the use of library image collections should also be explored. One critical point here is that artificial collections like Corel might sometimes do more harm than good to the field by misleading people into believing that their techniques work, while they do not necessarily work with more general image collections. Therefore, cooperation between private industry and academia is strongly encouraged and is currently taking place within the European projects and national initiatives mentioned before. The key point here is to focus on efforts which benefit both industry and academia. As was noted earlier, it is clearly important to keep the needs of the users in mind in retrieval system design, and it is logical that industry can contribute substantially to our understanding of the end-user and also aid in realistic evaluation of research algorithms.
Furthermore, by having closer communication with private industry we can potentially find out which parts of their systems need additional improvement toward increasing user satisfaction. In the example of Riya, they clearly need to perform object detection (faces) in complex backgrounds and then object recognition (who the face is). In the context of consumer digital photograph collections, the MIR community might attempt to create a solid test set which could be used to assess the efficacy of different algorithms in both detection and recognition in real world media. To summarize, the major research challenges listed in the previous section that are of particular importance to the audio-visual content indexing and retrieval research community are the following: (1) semantic search with emphasis on the detection of concepts in media with complex backgrounds; (2) multi-modal analysis and retrieval algorithms, especially towards exploiting the synergy between the various media including text and context information; (3) experiential multimedia exploration systems toward allowing users to gain insight and explore media collections; (4) interactive search, emergent semantics, or relevance feedback systems; and (5) evaluation with emphasis on representative test sets and usage patterns.
References
Abberley, D., Kirby, D., Renals, S., and Robinson, T. 1999. The THISL broadcast news retrieval system. In Proc. of ESCA ETRW Workshop on Accessing Information in Spoken Audio, Cambridge (UK), April 1999.
Amir, A., Basu, S., Iyengar, G., Lin, C.-Y., Naphade, M., Smith, J.R., Srinivasan, S., and Tseng, B. 2004. A Multi-modal System for the Retrieval of Semantic Video Events. CVIU 96(2), 216-236.
Assfalg, J., Del Bimbo, A., and Pala, P. 2004. Retrieval of 3D Objects by Visual Similarity. ACM MIR, 77-83.
Bach, J.R., Fuller, C., Gupta, A., Hampapur, A., Horowitz, B., Humphrey, R., Jain, R., and Shu, C.F. 1996. Virage image search engine: An open framework for image management. In SPIE: Storage and Retrieval for Still Image and Video Databases, 76-87.
Balakrishnan, N., Hariharakrishnan, K., and Schonfeld, D. 2005. A New Image Representation Algorithm Inspired by Image Submodality Models, Redundancy Reduction, and Learning in Biological Vision. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(9), 1367-1378.
Ballard, D.H. and Brown, C.M. 1982. Computer Vision. Prentice Hall, New Jersey, USA.
Bakker, E.M. and Lew, M.S. 2002. Semantic Video Retrieval Using Audio Analysis. In CIVR, 262-270.
Bartolini, I., Ciaccia, P., and Patella, M. 2005. WARP: Accurate Retrieval of Shapes Using Phase of Fourier Descriptors and Time Warping Distance. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(1), 142-147.
Battelle, J. 2005. The Search: How Google and Its Rivals Rewrote the Rules of Business and Transformed Our Culture. Portfolio Hardcover, USA.
Baumberg, A. 2000. Reliable feature matching across widely separated views. CVPR, 774-781.
Bell, G. 2004. A New Relevance for Multimedia When We Record Everything Personal. In ACM Multimedia.
Benitez, A.B. and Chang, S.-F. 2002. Semantic knowledge construction from annotated image collection. In ICME.
Berretti, S., Del Bimbo, A., and Vicario, E. 2001. Efficient Matching and Indexing of Graph Models in Content-Based Retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(10), 1089-1105.
Bliujute, R., Saltenis, S., Slivinskas, G., and Jensen, C.S. 1999. Developing a DataBlade for a New Index. In Proceedings of IEEE International Conference on Data Engineering, 314-323.
Bosson, A., Cawley, G.C., Chan, Y., and Harvey, R. 2002. Non-retrieval: Blocking Pornographic Images. In CIVR, 50-60.
Byrne, W., et al. 2004. Automatic recognition of spontaneous speech for access to multilingual oral history archives. IEEE Transactions on Speech and Audio Processing, Special Issue on Spontaneous Speech Processing 12(4), 420-435.
Cappelli, R., Maio, D., and Maltoni, D. 2001. Multispace KL for Pattern Representation and Classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(9), 977-996.
Chang, S.-F., Chen, W., and Sundaram, H. 1998. Semantic visual templates: Linking visual features to semantics. In ICIP, 531-535.
Chen, Y., Zhou, X.S., and Huang, T.S. 2001. One-class SVM for Learning in Image Retrieval. In ICIP, 815-818.
Chiu, P., Girgensohn, A., Lertsithichai, S., Polak, W., and Shipman, F. 2005. MediaMetro: browsing multimedia document collections with a 3D city metaphor. In ACM Multimedia, 213-214.
Chua, T.S., Zhao, Y., and Kankanhalli, M.S. 2002. Detection of human faces in a compressed domain for video stratification. The Visual Computer 18(2), 121-133.
Cooper, M., Foote, J., Girgensohn, A., and Wilcox, L. 2005. Temporal event clustering for digital photo collections. ACM Transactions on Multimedia Computing, Communications, and Applications 1(3), 269-288.
Dimitrova, N., Agnihotri, L., and Wei, G. 2000. Video Classification Based on HMM Using Text and Faces. European Signal Processing Conference.
Dimitrova, N., Zhang, H.J., Shahraray, B., Sezan, I., Huang, T., and Zakhor, A. 2002. Applications of video-content analysis and retrieval. IEEE Multimedia 9(3), 42-55.
Dimitrova, N. 2003. Multimedia Content Analysis: The Next Wave. In CIVR, 9-18.
Djeraba, C. 2002. Content-based Multimedia Indexing and Retrieval. IEEE Multimedia 9, 18-22.
Djeraba, C. 2003. Association and Content-Based Retrieval. IEEE Transactions on Knowledge and Data Engineering 15(1), 118-135.
Dufournaud, Y., Schmid, C., and Horaud, R. 2000. Matching images with different resolutions. CVPR, 612-618.
Dy, J.G., Brodley, C.E., Kak, A., Broderick, L.S., and Aisen, A.M. 2003. Unsupervised Feature Selection Applied to Content-Based Retrieval of Lung Images. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(3), 373-378.
Eakins, J.P., Riley, K.J., and Edwards, J.D. 2003. Shape Feature Matching for Trademark Image Retrieval. CIVR, 28-38.
Egas, R., Huijsmans, N., Lew, M.S., and Sebe, N. 1999. Adapting k-d Trees to Visual Retrieval. In Proceedings of the International Conference on Visual Information Systems, 533-540.
Eiter, T. and Libkin, L. 2005. Database Theory. Springer, London.
Elkwae, E.A. and Kabuka, M.R. 2000. Efficient content-based indexing of large image databases. ACM Transactions on Information Systems 18(2), 171-210.
Fan, J., Gao, Y., and Luo, H. 2004. Multi-level annotation of natural scenes using dominant image components and semantic concepts. In ACM Multimedia, 540-547.
Fauqueur, J. and Boujemaa, N. 2004. Region-based image retrieval: Fast coarse segmentation and fine color description. Journal of Visual Languages and Computing 15(1), 69-95.
Ferecatu, M., Boujemaa, N., and Crucianu, M. 2005. Hybrid visual and conceptual image representation within active relevance feedback context. 7th ACM SIGMM International Workshop on Multimedia Information Retrieval (MIR'05).
Flickner, M., Sawhney, H., Niblack, W., Ashley, J., Huang, Q., Dom, B., Gorkani, M., Hafner, J., Lee, D., Petkovic, D., Steele, D., and Yanker, P. 1995. Query by image and video content: the QBIC system. IEEE Computer, September, 23-32.
Foote, J. 1999. An Overview of Audio Information Retrieval. ACM Multimedia Systems 7(1), 42-51.
Foote, J. 2000. Automatic audio segmentation using a measure of audio novelty. In ICME, 452-455.
Forsyth, D.A. and Fleck, M.M. 1999. Automatic Detection of Human Nudes. International Journal of Computer Vision 32(1), 63-77.
Frankel, C., Swain, M.J., and Athitsos, V. 1996. WebSeer: An Image Search Engine for the World Wide Web. University of Chicago Technical Report 96-14.
Funkhouser, T., Min, P., Kazhdan, M., Chen, J., Halderman, A., Dobkin, D., and Jacobs, D. 2003. A search engine for 3D models. ACM Transactions on Graphics 22(1), 83-105.
Gauvain, J.L, Lamel, L., Adda, G. and Jardino, M. The LIMSI 1998 Hub-4E Transcription System, Proc. DARPA BroadcastNews Workshop, pp. 99-104, Herndon, VA, February, 1999. Gevers, T. 2001. Color-based Retrieval. In
Principles of Visual Information Retrieval, M.S. LEW, Ed. Springer-Verlag, London, 11-49. Greenspan, H., Goldberger, J., AND Mayer, A. 2004. Probabilistic Space-Time Video Modeling via Piecewise GMM.
IEEE Transactions on Pattern Analysis and Machine Intelligence 26(3), 384-396. Guo, G., Zhang, H.J., and Li, S.Z. 2001. Boosting for Content-Based Audio Classification and Retrieval: An Evaluation, In
ICME. Haas, M., Lew, M.S. AND Huijsmans, D.P. 1997. A New Method for Key Frame based Video Content Representation. In
Image Databases and Multimedia Search, A. SMEULDERS AND R. JAIN, Eds., World Scientific. 191-200. Haas, M., Rijsdam, J. and Lew, M. 2004. Relevance feedback: perceptual learning and retrieval in bio-computing, photos, and video, In
ACM MIR, 151-156. Hanjalic, A., Lagendijk, R.L., and Biemond, J. 1997. A New Method for Key Frame based Video Content Representation. In
Image Databases and Multimedia Search, A. Smeulders and R. Jain, Eds., World Scientific. 97-107. Haralick, R.M. and Shapiro, L.G. 1993.
Computer and Robot Vision. Addison-Wesley, New York, USA. Harris, C. and Stephens, M. 1988, A combined corner and edge detector,
4th Alvey Vision Conference, 147–151 He, X., Ma, W.-Y., King, O. Li, M., and Zhang, H. 2002. Learning and inferring a semantic space from user’s relevance feedback for image retrieval. In
ACM Multimedia. 343–347. Howe, N. 2003. A Closer Look at Boosted Image Retrieval.
In CIVR, 61-70. Jacobs, D.W., Weinshall, D., AND Gdalyahu, Y. 2000. Classification with Nonmetric Distances: Image Retrieval and Class Representation.
IEEE Transactions on Pattern Analysis and Machine Intelligence 22(6), 583-600. Jafari-Khouzani, K. AND Soltanian-Zadeh, H. 2005. Radon Transform Orientation Estimation for Rotation Invariant Texture Analysis.
IEEE Transactions on Pattern Analysis and Machine Intelligence 27(6), 1004-1008. Jaimes, A and Chang, S-F. 2002 Duplicate Detection in Consumer Photography and News Video,
ACM Int. Conf. on Multimedia, 423-424. Larson, M., Eickeler, S. and Köhler, J. Supporting Radio Archive Workflows with Vocabulary Independent Spoken Keyword Search. Proceedings of SIGIR 2007 Workshop Searching Spontaneous Conversational Speech. 2007 Jolion, J.M. 2001. Feature Similarity. In
Principles of Visual Information Retrieval, M.S. LEW, Ed. Springer-Verlag, London, 122-162. Joly, A., Buisson, O., AND Frelicot, C. Robust content-based copy detection in large reference database,
Int. Conf. on Image and Video Retrieval, 2003 Krishnapuram, R., Medasani, S., Jung, S.H., Choi, Y.S., AND Balasubramaniam, R. 2004. Content-Based Image Retrieval Based on a Fuzzy Approach.
IEEE Transactions on Knowledge and Data Engineering 16(10), 1185-1199. Levine, M. 1985.
Vision in Man and Machine, Mcgraw Hill, Columbus. Lew, M.S. AND Huijsmans, N. 1996. Information Theory and Face Detection. In
Proceedings of the International Conference on Pattern Recogntion, 601-605. Lew, M.S. 2000. Next Generation Web Searches for Visual Content.
IEEE Computer, November, 46-53. Lew, M.S. 2001.
Principles of Visual Information Retrieval. Springer, London, UK. Lew, M.S., Sebe, N., Djeraba, C., AND Jain, R. 2006 Multimedia Information Retrieval: State of the Art and Challenges,
ACM Transactions on Multimedia Computing, Communication, and Applications, 2(1):1-19. Lew, M.S. and Denteneer, D. 2001. Fisher Keys for Content Based Retrieval.
Image and Vision Computing 19, 561-566. Li, J. and Wang, J.Z. 2003. Automatic Linguistic Indexing of Pictures by a Statistical Modeling Approach.
IEEE Transactions on Pattern Analysis and Machine Intelligence 25(9), 1075-1088. Lienhart, R. 2001. Reliable Transition Detection in Videos: A Survey and Practitioner's Guide.
International Journal of Image and Graphics 1(3), 469-486. Lindeberg, T. 1998, Feature detection with automatic scale selection,
International Journal of Computer Vision,
30(2):79–116 Lindeberg, , T. and Garding, J. 1997, Shape-adapted smoothing in estimation of the 3D shape cues from affine deformations of local 2D brightness structure,
Image and Vision Computing, 15(6):415–434,1997 Liu, B., Gupta, A., and Jain, R. 2005. MedSMan: A Streaming Data Management System over Live Multimedia,
ACM Multimedia, 171-180. Liu, H., Xie, X., Tang, X., Li, Z.W., Ma, W.Y. 2004. Effective browsing of web image search results. In ACM MIR, 84-90. Liu, X., Srivastava, A., and Sun, D. 2003. Learning Optimal Representations for Image Retrieval Applications.
In CIVR, 50-60. Lowe, D. 2004, Distinctive image features from scale-invariant keypoints,
International Journal of Computer Vision, 60(2), 91–110. Mikolajczyk, K. and Schmid, C. 2004, Scale and affine invariant interest point detectors
International Journal of Computer Vision, 60(1), 63–86. Muller, H., Muller, W., Marchand-Maillet, S., Pun, T., AND SQUIRE, D. 2000. Strategies for Positive and Negative Relevance Feedback in Image Retrieval. In ICPR, 1043-1046. Müller, W. AND Henrich, A. 2003. Fast retrieval of high-dimensional feature vectors in P2P networks using compact peer data summaries. In
ACM MIR, 79-86. Ojala, T., Pietikainen, M., and Harwood, D. 1996. Comparative study of texture measures with classification based on feature distributions, Pattern Recognition 29(1), 51-59. Pereira, F. and Koenen, R. 2001. MPEG-7: A Standard for Multimedia Content Description.
International Journal of Image and Graphics 1(3), 527-546. Rautiainen, M., Seppanen, T., Penttila, J., and Peltola, J. 2003. Detecting Semantic Concepts from Video Using Temporal Gradients and Audio Classification. In CIVR. Rebai, A, Joly, A., and Boujemaa, N. 2007 Interpretability Based Interest Points Detection,
ACM International Conference on Image and Video Retrieval. Rocchio, 1971. Relevance Feedback in Information Retrieval. In
The Smart Retrieval System: Experiments in Automatic Document Processing, G. Salton, Ed. Prentice Hall, Englewoods Cliffs. Rowe, L.A. and Jain, R. 2005. ACM SIGMM retreat report on future directions in multimedia research.
ACM Transactions on Multimedia Computing, Communications, and Application 1(1), 3-13. Rowley, H., Baluja, S., and Kanade, K. 1996. Human Face Detection in Visual Scenes.
Advances in Neural Information Processing Systems 8, 875-881. Schmid, C., Mohr, R., Bauckage, C. 2000, Evaluation of interest point detectors, International Journal of Computer Vision, 37(2), 151–172 Schneiderman, H. AND Kanade, T. 2004. Object Detection Using the Statistics of Parts, International Journal of Computer Vision 56(3), 151-177. Sclaroff, S., La Cascia, M., Sethi, S., and Taycher, L. 2001. Mix and Match Features in the ImageRover Search Engine. In
Principles of Visual Information Retrieval, M.S. LEW, Ed. Springer-Verlag, London, 259-277. Scott, G.J. AND Shyu, C.R. 2003. EBS k-d Tree: An Entropy Balanced Statistical k-d Tree for Image Databases with Ground-Truth Labels.
In CIVR, 467-476. Sebastian, T.B., Klein, P.N., AND Kimia, B.B. 2004. Recognition of Shapes by Editing Their Shock Graphs.
IEEE Transactions on Pattern Analysis and Machine Intelligence 26(5), 550-571. Sebe, N., Lew, M.S., AND Huijsmans, D.P. 2000. Toward Improved Ranking Metrics.
IEEE Transactions on Pattern Analysis and Machine Intelligence 22(10), 1132-1143. Sebe, N., AND Lew, M.S. 2001. Color Based Retrieval,
Pattern Recognition Letters 22(2), 223-230. Sebe, N., AND LEW, M.S. 2002. Robust Shape Matching.
In CIVR, 17-28. Sebe, N., TIAN, Q., LOUPIAS, E., LEW, M.S., AND HUANG, T.S. 2003. Evaluation of Salient Point Techniques. Image and Vision Computing 21(13-14), 1087-1095. Sebe, N., LEW, M.S., ZHOU, X., AND HUANG, T.S. 2003. The State of the Art in Image and Video Retrieval.
In CIVR. Shao, H., Svoboda, T., Tuytekaars, T., and Van Gool, L. 2003. HPAT Indexing for Fast Object/Scene Recognition Based on Local Appearance. In CIVR, 71-80. Shen, H. T., Ooi, B. C., and Tan, K. L. 2000. Giving meanings to www images. In
ACM Multimedia, 39–48. Smeaton, A.F. and Over, P. 2003. Benchmarking the Effectiveness of Information Retrieval Tasks on Digital Video. In CIVR, 10-27. Smeulders, A., Worring, M., Santini, S., Gupta, A., and Jain, R. 2000. Content based image retrieval at the end of the early years.
IEEE Transactions on Pattern Analysis and Machine Intelligence 22(12), 1349-1380. Smith, J. R. and Chang, S.F. 1997. Visually Searching the Web for Content.
IEEE Multimedia 4(3), 12-20. Snoek, C.G.M., Worring, M., van Gemert, J., Geusebroek, J.M., Koelma, D., Nguyen, G.P., de Rooij, O., AND Seinstra, F. 2005. MediaMill: exploring news video archives based on learned semantics. In
ACM Multimedia, 225-226. Spierenburg, J.A. AND Huijsmans, D.P. 1997. VOICI: Video Overview for Image Cluster Indexing. In BMVC. Srivastava, A., Joshi, S.H., Mio, W., AND Liu, X. 2005. Statistical Shape Analysis: Clustering, Learning, and Testing.
IEEE Transactions on Pattern Analysis and Machine Intelligence 27(4), 590-602. Sundaram, H., Xie, L., and Chang, S.F. 2002. A utility framework for the automatic generation of audio-visual skims. ACM Multimedia, 189-198. Szumilas, L., Donner, R., Langs, G., and Hanbury, A. 2007 Local structure detection with orientation-invariant radial configuration,
CVPR. Tangelder, J. and Veltkamp, R.C. 2004. A survey of content based 3d shape retrieval methods. In Proceedings of International Conference on Shape Modeling and Applications, 157-166. Tian, Q., Sebe, N., Lew, M.S., Loupias, E., and Huang, T.S. 2001. Image Retrieval using Wavelet-based Salient Points.
Journal of Electronic Imaging 10(4), 835-849. Tian, Q., Moghaddam, B., and Huang, T.S. 2002. Visualization, Estimation and User-Modeling. In CIVR, 7-16. Tieu, K. and Viola, P. 2004. Boosting Image Retrieval,
International Journal of Computer Vision 56(1), 17-36. Therrien, C.W. 1989. Decision, Estimation, and Classification, Wiley, New York, USA. Tuytelaars, T. and Van Gool, L. 2000, Wide baseline stereo matching based on local affinely invariant regions,
British Machine Vision Conference, 412–425. Uchihashi, S., Foote, J., Girgensohn, A., AND Boreczky, J. 1999. Video Manga: generating semantically meaningful video summaries. In ACM Multimedia, 383-392. Vailaya, A., Jain, A., and Zhang, H. 1998. On Image Classification: City vs Landscape. In Proceedings of Workshop on Content-based Access of Image and Video Libraries, 3-8. Veltkamp, R.C. and Hagedoorn, M. 2001. State of the Art in Shape Matching. In
Principles of Visual Information Retrieval, M.S. Lew, Ed. Springer-Verlag, London, 87-119. Winston, P. 1992. Artificial Intelligence, Addison-Wesley, New York, USA. Wu, P., Choi, Y., Ro, Y.M., and Won, C.S. 2001. MPEG-7 Texture Descriptors.
International Journal of Image and Graphics 1(3), 547-563. Yang, M.H., Kriegman, D.J., AND Ahuja, N. 2002. Detecting Faces in Images: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(1), 34-58. Ye, H. and Xu, G. 2003. Fast Search in Large-Scale Image Database Using Vector Quantization.
CIVR, 477-487. Yin, P.Y., Bhanu, B., Chang, K.C., AND Dong, A. 2005. Integrating Relevance Feedback Techniques for Image Retrieval Using Reinforcement Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(10), 1536-1551. Zhou, X.S. and Huang, T.S. 2001. Comparing discriminating transformations and SVM for learning during multimedia retrieval. In
ACM Multimedia, 137-146.
The project coordinators were asked to give feedback about their projects regarding the research activities in the area of multimedia indexing and retrieval. The completed questionnaires sent by the 9 projects are included in this Appendix.
Overview of Divas
| Project name | Divas |
| co-ordinator | Nikos Achilleopoulos, Archetypon S.A. Information Technologies |
| Budget in Mio. Euro | budget: 3,188 |
| project start | 1.1.2007 |
| project duration (in month) | 24 |
|
|
| Objectives |
|
| main objectives | Design and implement a multimedia search engine based on advanced direct video and audio search algorithms applied on encoded (compressed) content |
| objectives regarding AV search engine technology | Direct Search in Compressed Audio and Video |
| target / final product | DIVAS Algorithms, system level demonstrator (available over the web for user evaluation), studies and designs methodologies for application integration |
| internal user groups of the project results | ESCOM, BeTV |
| external user groups of the project results | Potential: AudioVisual Archive of any type |
| scenarios for deployment | ESCOM: indexing and search of audiovisual archive, BeTV: DRM monitoring |
| data sources | ESCOM and BeTV: Videofiles and Audiofiles |
| metadata inventories | ESCOM and BeTV metadata |
|
|
| Modalities (please give detailed answers in the additional sheets of this form) |
|
| speech/audio indexing | yes |
| image indexing | no |
| video indexing | yes |
| text+semantics | no |
| multimodal fusion | no |
| retrieval models/techniques | no |
|
|
| official benchmarking/which one |
|
| evaluation |
|
| social networks | no |
| what standardization body are you addressing (if applicable) | MPEG-7 |
|
|
| Please identify your use cases (if any) |
|
| Are the details confidential? | no/partly |
|
|
| System development/integration | yes, planned for 2008 |
| Distributed system | possible |
| p2p technology | no |
| mobile access | yes |
| DRM | yes |
Overview of Rushes
| Project name | RUSHES |
| co-ordinator | Fraunhofer HHI, Dr. Oliver Schreer |
| Budget in Mio. Euro | 4,55 (2,67 funded) |
| project start | 01.02.2007 |
| project duration (in month) | 30 |
|
|
| Objectives |
|
| main objectives | to design, implement, and validate a system for indexing, accessing and delivering raw, unedited audio-visual footage known in broadcasting industry as "rushes". |
| objectives regarding AV search engine technology | to provide services for querying audio-visual footage using keywords, semantics or actual footage examples. |
| target / final product | (1) to allow home users to have advanced search functionalities and low access latency when navigating rushes databases; (2) to allow professional users to conduct automatic content cataloguing and semantic based indexing to link raw content with metadata; (3) to illustrate the benefits of using semantic technologies in video annotations/indexing; (4) to summarise AV sequences using representative frames. |
| internal user groups of the project results | specific departments of RUSHES industrial partners such as ATC, FAST, ETB |
| external user groups of the project results | broadcasters, search engine development companies, other European projects and clustering initiatives (CHORUS) |
| scenarios for deployment | regular meeting scenario, movies, advertisements, TV news report |
| data sources | text, television and other resources, and radio and other audio resources. |
| metadata inventories | MPEG-7 XML |
|
|
| Modalities (please give detailed answers in the additional sheets of this form) |
|
| speech/audio indexing | high- and low-level audio features, e.g. envelope, frequency, time, space, etc., will be extracted for a proper classification. |
| image indexing | high- and low-level video features, e.g. color, texture, action, space, etc., will be extracted for a proper classification. |
| video indexing | this consists of image and audio indexing. Video indexing cannot be accomplished unless the two components' indexing is performed. This involves alignment of visual and audio signals, interaction of two components, and other process. In terms of video summarisation/annotation, this can be performed using an attention model that considers human visual models for motion, audio, and event detection. |
| text+semantics | Graph Matching, Kernel analysis … can be used for similarity search. |
| multimodal fusion | Audio and visual observations can be fused in the domain of Bayesian network |
| retrieval models/techniques | probably Hidden Markov Model, or Mixture Gaussian Model |
|
|
| official benchmarking/which one | TRECVID |
| evaluation | the basic idea is the statistical analysis based on the test on benchmarking data |
| social networks | Britain's universities, some local companies such as BT, Microsoft, HP, IBM, Motorola. |
| what standardization body are you addressing (if applicable)? | MPEG/ITU-T, JPSearch, DVB, SMPTE, IPTC |
|
|
| Please identify your use cases (if any) | journalists working at broadcasters will use the RUSHES system for semi-automatic indexing and annotation as well as for retrieval of rushes material. |
| Are the details confidential? | yes |
|
|
| System development/integration |
|
| Distributed system | yes |
| p2p technology | no issue in RUSHES |
| mobile access | no issue in RUSHES |
| DRM | no issue in RUSHES |
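The RUSHES answer on multimodal fusion above mentions fusing audio and visual observations in a Bayesian network. As a minimal illustration only, the following Python sketch uses the simplest such network: one class node with two observation nodes that are assumed conditionally independent given the class (a naive Bayes structure). The function name, class labels, and all probabilities are hypothetical and are not taken from the RUSHES project.

```python
def fuse_naive_bayes(prior, audio_lik, visual_lik):
    """Posterior over classes given one audio and one visual observation.

    prior:      {cls: P(cls)}
    audio_lik:  {cls: P(audio observation | cls)}
    visual_lik: {cls: P(visual observation | cls)}
    Assumes the two observations are independent given the class.
    """
    unnormalized = {c: prior[c] * audio_lik[c] * visual_lik[c] for c in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observations have zero probability under every class")
    return {c: p / total for c, p in unnormalized.items()}


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
prior = {"interview": 0.5, "crowd": 0.5}
audio_lik = {"interview": 0.8, "crowd": 0.3}
visual_lik = {"interview": 0.6, "crowd": 0.4}
posterior = fuse_naive_bayes(prior, audio_lik, visual_lik)
```

A richer network would add dependencies between the audio and visual nodes or over time; the sketch only shows how two modalities combine into one posterior.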
Overview of Sapir
| Project name | SAPIR |
| co-ordinator | Yosi Mass, IBM |
| Budget in Mio. Euro | 4,5 |
| project start | 01.01.2007 |
| project duration (in month) | 30 |
|
|
| Objectives |
|
| main objectives | The broad scope of SAPIR is to develop theories and technologies for next-generation search techniques that would effectively and efficiently deliver relevant information in the presence of exponentially growing (i.e. dynamic) volumes of distributed multimedia data. Fundamental to our approach is the development of scalable solutions that address the requirements of future generations of massively distributed data produced in a variety of applications. The scale of the problem can be gauged from the fact that almost everything we see, read, hear, write and measure will soon be available to computerized information systems. |
| objectives regarding AV search engine technology | While structured search methods apply to attributed-type data that yield records that match the search query exactly, SAPIR offers a more modern approach to searching information through similarity searching which is used in content-based retrieval for queries involving complex data such as images, videos, speech, music and text. Similarity search is based on gradual rather than exact relevance using a distance metric that, together with the database, forms a mathematical metric space. The obvious advantage of similarity search is that the results can be ranked according to their estimated relevance. However, current similarity search structures, which are mostly centralized, reveal linear scalability in respect to the data search size, which is not sufficient for the expected data volume dimension of the problem. With the increasing diversity of digital data types covering practically all forms of fact representation, computerized data processing must provide adequate tools for similarity searching. |
| target / final product | Define APIs and show a prototype that can perform feature extraction from the different media and index and search large volumes using a P2P architecture. |
| internal user groups of the project results | SAPIR partners |
| external user groups of the project results | The APIs will be published and be available for external users. We will have to decide which components that are developed by SAPIR will be available also. |
| scenarios for deployment | We have worked on 5 possible scenarios for the technology - 1. Advanced home messaging 2. The music and text scenario 3. Tourist searching 4. Hollywood@home 5. The journalist's helpers |
| data sources | We may start by testing image + text + metadata on the Flickr image collection. |
| metadata inventories | From Flickr and automatically extracted |
|
|
| Modalities (please give detailed answers in the additional sheets of this form) | See next sheets |
| speech/audio indexing |
|
| image indexing |
|
| video indexing |
|
| text+semantics |
|
| multimodal fusion |
|
| retrieval models/techniques |
|
|
|
| official benchmarking/which one |
|
| evaluation |
|
| social networks | We currently work on definitions of Social Networks and how they can improve the search results. This is part of WP7 |
| what standardization body are you addressing (if applicable) | MPEG-7, MPEG-21 |
|
|
| Please identify your use cases (if any) | We work on 5 User scenarios as described above. Use cases can be derived from those scenarios. |
| Are the details confidential? | No |
|
|
| System development/integration | We defined indexing and Search APIs. We are currently working on a first implementation of the Search APIs. We will upgrade the APIs as work progresses and also add Content Management/Feature extraction APIs. |
| Distributed system |
|
| p2p technology | The main objective of the project is a large scale search using P2P technology. |
| mobile access | Will be supported as part of a dedicated WorkPackage (WP7) |
| DRM | This will be developed as part of a dedicated WorkPackage (WP6). |
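The SAPIR objective above describes similarity search as ranking objects by a distance metric and notes that current, mostly centralized structures scale linearly with the size of the data. The sketch below is a hypothetical illustration, not SAPIR code: a linear scan that evaluates the distance to every object and returns the k closest, which shows that cost in its simplest form. Euclidean distance, the toy feature vectors, and all names are assumptions.

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean distance, one example of a metric on feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_linear_scan(query, database, k, distance=euclidean):
    """Return the k nearest (distance, object_id) pairs, closest first.

    One distance evaluation per stored object, so the cost grows
    linearly with the size of the database.
    """
    scored = ((distance(query, feat), obj_id) for obj_id, feat in database.items())
    return heapq.nsmallest(k, scored)


# Toy 2-D feature vectors standing in for extracted media descriptors.
database = {
    "img_a": (0.0, 0.0),
    "img_b": (1.0, 1.0),
    "img_c": (3.0, 4.0),
    "img_d": (0.5, 0.0),
}
ranked = knn_linear_scan((0.0, 0.0), database, k=3)
ranked_ids = [obj_id for _, obj_id in ranked]
```

Because results come back ordered by distance, they are ranked by estimated relevance rather than returned as an unordered exact-match set.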
Overview of Semedia
| Project name | SEMEDIA Search Environments for Media |
| co-ordinator | Prof. Ricardo Baeza-Yates, Yahoo! Research |
| Budget in Mio. Euro | Funding: 2,73 |
| project start | 01.01.2007 |
| project duration (in month) | 30 |
|
|
| Objectives | The overall objective of SEMEDIA is to develop a collection of audiovisual search tools that are heavily user driven, preserve metadata along the chain, are generic enough to be applicable to different fields (broadcasting production, cinema postproduction, social web). This will be achieved through five specific objectives: |
| main objectives | O1. To develop techniques to extract metadata from ‘essence’ in ways that allow the automatic inference of high-level structural information from the content of new, partly annotated media data produced in a range of professional and amateur contexts. O2. To create tools for navigating intelligently and searching efficiently in very large bodies of media in heterogeneous, distributed, networked data storage systems. O3. To design and evaluate efficient user interfaces that allow fast browsing. O4. To integrate the results in a series of prototypes for real production and postproduction environments, and evaluate them with real data sets, user groups and industry work flows. O5. To develop strategies for wide dissemination of the results and their incorporation into marketable products. |
|
|
| objectives regarding AV search engine technology | " |
| target / final product | Tools will be integrated into industrial partner's systems. An integrated demonstrator will also be produced. |
| internal user groups of the project results | Yes, industrial partners have formed internal user groups. |
| external user groups of the project results | Yes, an external user group has been organized. |
| scenarios for deployment | Yes, however, it is available to the Consortium only. In Month 12, user scenarios will be made available to the Public. |
| data sources | Yes, industrial partners (BBC, CCRTV-ASI, S&M, and Yahoo!) have made data available to the consortium partners. |
| metadata inventories | Yes, meta-data inventories related to the data sources are being built. |
|
|
| Modalities (please give detailed answers in the additional sheets of this form) |
|
| speech/audio indexing | N/A |
| image indexing | Yes, to the extent that it contributes to the video indexing and retrieval task. |
| video indexing | Yes, this is the main focus of the SEMEDIA project |
| text+semantics | Yes, to the extent that it contributes to the video indexing and retrieval task. |
| multimodal fusion | Yes |
| retrieval models/techniques | Yes |
|
|
| official benchmarking/which one | Possibly an adaptation of TRECVID |
| evaluation | 3 Types: 1. System perf. 2. Usability 3. Retrieval perf. |
| social networks | Flickr and online models based on video |
| what standardization body are you addressing (if applicable) | tools produced will use "standard" APIs, whenever possible, we will adopt existing standards. MPEG7 is the current candidate. |
|
|
| Please identify your use cases (if any) | initial user scenarios produced (consortium only). In month 12, revised scenarios will be produced and available publicly. |
| Are the details confidential? | in month 12, scenarios will be available publicly. |
|
|
| System development/integration | planned that tools will be integrated into industrial partners systems. Integrated demos will also be produced. |
| Distributed system | Yes, but it is not the main focus of the project. |
| p2p technology | No. |
| mobile access | No. |
| DRM | Yes, Digital Rights Management is a concern and is being addressed. |
Overview of Tripod
| Project name | Tripod |
| co-ordinator | University of Sheffield, Mark Sanderson |
| Budget in Mio. Euro | funding: 3.15 |
| project start | 01.01.2007 |
| project duration (in month) | 36 |
|
|
| Objectives |
|
| main objectives | The primary objective of Tripod is to revolutionise access to the enormous body of visual media. Applying an innovative multidisciplinary approach Tripod will utilise largely untapped but vast, accurate and regularly updated sources of semantic information to create ground breaking intuitive search services, enabling users to effortlessly and accurately gain access to the image they seek from this ever expanding resource. |
| objectives regarding AV search engine technology | Create image search facilities that serve broader user needs than current keyword or content-based approaches provide |
| target / final product | Package Tripod's tools as a suite of services to prepare Tripod for exploitation in a wide range of markets |
| internal user groups of the project results | Ordnance Survey, United Kingdom; Centrica, Italy; Geodan Holding BV, The Netherlands; Fratelli Alinari Istituto Edizioni Artistiche SpA, Italy; Tilde, Latvia |
| external user groups of the project results | Photographic agencies |
| scenarios for deployment |
|
| data sources | Mapping data from OS & Geodan; photographs from Alinari & Tilde |
| metadata inventories |
|
|
|
| Modalities (please give detailed answers in the additional sheets of this form) |
|
| speech/audio indexing |
|
| image indexing |
|
| video indexing |
|
| text+semantics | × |
| multimodal fusion |
|
| retrieval models/techniques |
|
|
|
| official benchmarking/which one |
|
| evaluation |
|
| social networks |
|
| what standardization body are you addressing (if applicable) |
|
|
|
| Please identify your use cases (if any) |
|
| Are the details confidential? |
|
|
|
| System development/integration |
|
| Distributed system |
|
| p2p technology |
|
| mobile access |
|
| DRM |
|
Overview of Victory
| Project name | VICTORY |
| co-ordinator | Dr. Dimitrios Tzovaras |
| Budget in Mio. Euro | project budget: 3,869 |
| project start | 01.01.2007 |
| project duration (in month) | 30 |
|
|
| Objectives |
|
| main objectives | O1: The first objective of VICTORY is to develop the MultiPedia repository and the mechanisms to support its wide access by the community. The centralised MultiPedia repository will consist of only the 3D models that contain the global truth of the objects stored in the repository. The accompanying MultiPedia information (2D images, text, annotations, etc.) will be available on a peer-to-peer basis. Tools will be supported by the repository administration mechanism for population, management and reorganisation of the centralised content. The content will be adequately categorised in order to support special interest groups targeting mainly industrial applications (automotive, games, simulations, etc.). Also, the repository will act as the main access point for the P2P framework and thus it will support mechanisms for adding MultiPedia content from the peers connected each time to the VICTORY network. O2: The second objective of VICTORY is to develop novel 3D search and retrieval algorithms (see below) O3: The third objective of VICTORY is the development of novel search and retrieval framework that allows an easy integration of different search methodologies (see below). O4: The fourth objective of VICTORY is the development of a P2P scheme so as to utilise not only the distributed data storage, but also the computational power of each peer for the pre-processing, interpreting, indexing, searching, retrieving and representing of MultiPedia data. Through the VICTORY framework, users will be able to handle, share and retrieve 3D and audio-visual data among peers around the world. Moreover, every peer will be responsible for extracting and indexing the features of the shared 3D data, thus the efficient manipulation of the 3D data will be accomplished. The P2P-based middleware will provide the means (intelligence, semantics, and communications protocols) allowing the negotiation and determination of peer resources sharing. 
The key driver will be the user QoE realised as the combination of a multitude of Quality of Services (communications quality, processing speed, 3D content rendering quality, power consumption, etc) impacting the user experience. |
|
|
| objectives regarding AV search engine technology | O2: The second objective of VICTORY is to develop novel 3D search and retrieval algorithms which will be based on a) content, which will be extracted taking into account low-level geometric characteristics and b) context, which will be high-level features (semantic concepts) mapped to low-level features. In the existing 3D search and retrieval methods no semantic information (high-level features) is attached to the (low-level) geometric features of the 3D content, which would significantly improve the retrieved results. Therefore, the second objective of the proposed system is to introduce a solution so as to bridge the gap between low and high-level information through automated knowledge discovery and extraction mechanisms. High level features will be a) appropriate annotation options provided by the system or generated by the user dynamically (active learning) and b) relevance feedback where the user will mark which retrieved objects he thinks are relevant to the query (user’s subjectivity). The strength of the VICTORY approach is the ability to translate both explicit and tacit knowledge of the user into semantic information by analysing user’s explicit operations like manual annotation, query by example, feedback and intuitive interactions with the system like browsing or objects manipulations. This acquired knowledge will be exploited to automatically propagate annotations through the existing object database of each peer and to adapt the retrieval process to the user’s subjectivity. The input of the system will consist of mixed-media (MultiPedia) queries such as text (annotation), 2D images (taken by the user's mobile device), sketches made by the user and 3D objects. Therefore, 2D/3D combined algorithms are going to be developed and integrated to the search engine.
O3: For supporting sophisticated 3D content search and retrieval, a search framework is needed that allows for combining text-/metadata-based searching with 3D object searching. An ontology helps to cluster the 3D objects and to either • use the ontology as organizing principle to navigate through the objects, or • use the ontology to restrict/guide the search through the objects
Thus, the third objective of VICTORY is the development of novel search and retrieval framework that allows an easy integration of different search methodologies. It will result in an integrated platform which allows processing and accessing data and knowledge by using ontology based management and retrieval mechanisms. The challenge within VICTORY means to bridge the gap between textual-/metadata oriented data respectively and to apply this really innovative technology to MultiPedia content, especially such as 3D-objects. |
| target / final product | 3D search engine |
| internal user groups of the project results | Companies: EMPOLIS, HYPERTECH |
| external user groups of the project results | Automotive, aeronautic, game industries, all |
| scenarios for deployment | see use cases |
| data sources | internet, automotive industries |
| metadata inventories |
|
|
|
| Modalities (please give detailed answers in the additional sheets of this form) |
|
| speech/audio indexing |
|
| image indexing |
|
| video indexing |
|
| text+semantics |
|
| multimodal fusion |
|
| retrieval models/techniques | CERTH/ITI algorithms (see www.victory-eu.org) |
|
|
| official benchmarking/which one | Princeton Shape Benchmark (see www.victory-eu.org) |
| evaluation |
|
| social networks |
|
| what standardization body are you addressing (if applicable) | MPEG-7 |
|
|
| Please identify your use cases (if any) |
|
| Are the details confidential? | YES |
|
|
| System development/integration |
|
| Distributed system | YES |
| p2p technology | YES |
| mobile access | YES |
| DRM | YES |
Overview of Vidi-Video
| Project name | VIDI-Video |
| co-ordinator | Prof. A. Smeulders |
| Budget in Mio. Euro | 3.6 Meuro |
| project start | 2/1/2007 |
| project duration (in month) | 36 |
|
|
| Objectives |
|
| main objectives | boost the performance of video search by developing a 1000 element thesaurus for automatically detecting instances of semantic concepts in the audio-visual content |
| objectives regarding AV search engine technology | Semantic search using a large-scale learned vocabulary |
| target / final product | Semantic video search engine |
| internal user groups of the project results | Fondazione Rinascimento Digitale, Italy; Beeld en Geluid, The Netherlands |
| external user groups of the project results | Potential: audiovisual archives in broadcasting, surveillance, conferencing, diaries and logging. |
| scenarios for deployment | Use within the processes of the archives involved. In later stage other parties. |
| data sources | Sound and Vision archives, archives of FDR and partners, TRECVID, Surveillance data. |
| metadata inventories | Existing annotations of the archives. |
|
|
| Modalities (please give detailed answers in the additional sheets of this form) |
|
| speech/audio indexing | Yes |
| image indexing | No |
| video indexing | Yes |
| text+semantics | No |
| multimodal fusion | Yes |
| retrieval models/techniques | Yes |
|
|
| official benchmarking/which one | TRECVID, VOC |
| evaluation | Yes |
| social networks | Yes (for obtaining annotations) |
| what standardization body are you addressing (if applicable) | MPEG-7, RDFS/OWL |
|
|
| Please identify your use cases (if any) |
|
| Are the details confidential? | Partly |
|
|
| System development/integration | Yes |
| Distributed system | Yes |
| p2p technology | No |
| mobile access | No |
| DRM | No |
Overview of Vitalas
| Project name | VITALAS |
| co-ordinator | INRIA, ERCIM |
| Budget in Mio. Euro | 6 million |
| project start | January 1st, 2007 |
| project duration (in month) | 36 |
|
|
| Objectives |
|
| main objectives | Use-case driven project that aims to provide an advanced solution for indexing, searching and accessing large scale digital audio-visual content. |
| objectives regarding AV search engine technology | Cross-media indexing and retrieval, interactivity and context adapting, scalability |
| target / final product | Pre-industrial prototype system dedicated to intelligent access services to multimedia professional archives |
| internal user groups of the project results | Audiovisual archives (INA) and broadcasters (IRT), Photo press agency BELGA |
| external user groups of the project results | Photo press agency AFP |
| scenarios for deployment |
|
| data sources | INA, IRT, BELGA |
| metadata inventories | IPTC BELGA annotations, INA video archives annotations |
|
|
| Modalities (please give detailed answers in the additional sheets of this form) |
|
| speech/audio indexing | yes |
| image indexing | yes |
| video indexing | yes |
| text+semantics | yes |
| multimodal fusion | yes |
| retrieval models/techniques | yes |
|
|
| official benchmarking/which one | Not yet. Probably TRECVID. Maybe ImageCLEF, ImagEval. |
| evaluation | Technical evaluation + end user tests |
| social networks | no |
| what standardization body are you addressing (if applicable) | Content representation (e.g. JPEG), query languages (e.g. Xquery), evaluation of multimedia retrieval systems (e.g. JPsearch) |
|
|
| Please identify your use cases (if any) | Automatic labelling of visual concepts in images, global navigation in a set of results, Interactive browsing of a search results, Search by concept, Face identification, Personalization, Search by example, Visual and audio categorization, Search by concept in video content. |
| Are the details confidential? | yes |
|
|
| System development/integration | 3 prototype versions. V1 due in January 2008 |
| Distributed system | Yes: Web services, distributed similarity search structures |
| p2p technology | No |
| mobile access | No |
| DRM | No |
Project Divas
| module/task | Music Segmentation |
| investigator/partner | Fraunhofer IDMT |
| applied algorithms/approaches | segmentation algorithm based on Foote's segmentation |
| pre-existing technology before project start | MP3, Music Segmentation, Speech Segmentation |
| research challenge/innovation/not addressed | improvement of the music segmentation algorithm; music segmentation directly from the compressed domain |
| type and amount of processed data | compressed audio data; more than 1000 pieces of music |
| success criteria, recognition/indexing rate | at the moment the recognition performance is about 70% |
| risk |
|
| demo (available/planned/not foreseen) | demo is available |
| module/task | Speech Segmentation |
| investigator/partner | SAIL LABS |
| applied algorithms/approaches | make models more robust by statistical training using compressed audio and application of transforms |
| pre-existing technology before project start | segmentation component using specifically trained phone-level models and a GMM/BIC-based approach for segmentation. |
| research challenge/innovation | keep approximately the same level of segmentation results in spite of compressed audio data and the corresponding limitation of incorporated information |
| type and amount of processed data |
|
| success criteria, recognition/indexing rate | see separate table of current non-compressed vs compressed data segment recognition results |
| risk | moderate |
| demo available | yes |
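The Music Segmentation module above builds on Foote's segmentation. A rough sketch of that general idea follows, assuming plain per-frame feature vectors rather than the compressed-domain features DIVAS works with: build a self-similarity matrix, slide a checkerboard kernel along its diagonal, and read segment boundaries from peaks in the resulting novelty curve. The kernel here is untapered (no Gaussian window), and the function names and toy data are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def novelty_curve(frames, half_width):
    """Correlate a checkerboard kernel along the self-similarity diagonal.

    novelty[c] is high when frames just before index c resemble each other,
    frames from c onward resemble each other, and the two groups differ.
    """
    n = len(frames)
    sim = [[cosine_similarity(frames[i], frames[j]) for j in range(n)]
           for i in range(n)]
    novelty = [0.0] * n
    for c in range(half_width, n - half_width + 1):
        score = 0.0
        for di in range(-half_width, half_width):
            for dj in range(-half_width, half_width):
                # +1 if both frames lie on the same side of c, otherwise -1.
                sign = 1.0 if (di < 0) == (dj < 0) else -1.0
                score += sign * sim[c + di][c + dj]
        novelty[c] = score
    return novelty


# Toy signal: four frames of one "texture" followed by four of another.
frames = [(1.0, 0.0)] * 4 + [(0.0, 1.0)] * 4
novelty = novelty_curve(frames, half_width=2)
boundary = max(range(len(novelty)), key=novelty.__getitem__)
```

On real audio the novelty curve is noisy, so boundaries are usually taken from thresholded local maxima rather than a single global maximum as in this toy case.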
Project Rushes
| module/task | Audio retrieval |
| investigator/partner | Brunel University, UK |
| applied algorithms/approaches | possibly HMM with perception model (how human beings link speech with visual components) |
| pre-existing technology before project start | HMM implementation by others |
| research challenge/innovation/not addressed | feature extraction/selection, speech-to-video transmoding |
| type and amount of processed data | real audio data, at least 20 persons and each one > 10 minutes |
| success criteria, recognition/indexing rate | in the used database, hopefully > 85% |
| risk | consistency of recognition |
| demo (available/planned/not foreseen) | will be available |
Project Sapir
| module/task | Speech |
| investigator/partner | IBM |
| applied algorithms/approaches | We use an Automatic Speech Recognition (ASR) system for transcribing speech data. The ASR generates lattices that can be considered as directed acyclic graphs. Each vertex in a lattice is associated with a timestamp and each edge (u,v) is labeled with a word or phone hypothesis and its prior probability, which is the probability of the signal delimited by the timestamps of the vertices u and v, given the hypothesis. The 1-best path transcript is obtained from the path containing the best hypotheses using dynamic programming techniques. For indexing and search purposes, it is often more convenient to use a compact representation of a word lattice, called word confusion network (WCN). Each edge is labeled with a word hypothesis and its posterior probability, i.e., the probability of the word given the signal. The main advantages of WCN are that it provides an alignment for all of the words in the lattice and also posterior probabilities. Note that the 1-best path can be directly extracted from the WCN. |
| pre-existing technology before project start | We have an ASR technology and we adapt it to represent the features in MPEG-7 and then use it for indexing and search in the SAPIR p2p architecture. |
| research challenge/innovation/not addressed | Write UIMA Annotators that extract the features and represent them in MPEG-7. Index and search using a P2P architecture |
| type and amount of processed data | TBD |
| success criteria, recognition/indexing rate | TBD |
| risk | Efficiency dimension - SAPIR basic (features similarity search) performance can degrade for large volume of content and/or large number of peers, resulting in scalability issues. Effectiveness dimension - Feature search does not improve over text only search, resulting in little gain over existing approaches. |
| demo (available/planned/not foreseen) | Planned |
| module/task | Music |
| investigator/partner | UPD - University of Padova |
| applied algorithms/approaches | Music ContentObjects can be instantiated in three main forms: digital audio recordings with possible compression, MIDI (Musical Instrument Digital Interface) files with temporal information, and digital scores. All the forms may be of interest for the final user, depending on the required audio quality, on the available bandwidth, on the usage, and on copyright restrictions. Many formats correspond to audio and score forms, yet for the aims of this project, only open formats will be addressed, such as MP3 and AIFF for audio or Lilypond [11] and Guido [12] for scores. The first step in music processing will be the automatic extraction of high level features, which are shared by all the forms. The main content descriptors are the rhythm and the melody of the leading voice. |
| pre-existing technology before project start | UPD has technology for Music feature extraction |
| research challenge/innovation | Write UIMA annotators that extract the features and represent them in MPEG-7. Index and search using a P2P architecture. |
| type and amount of processed data | TBD |
| success criteria, recognition/indexing rate | TBD |
| risk | Same as for speech |
| demo available | Planned |
Project Vitalas
| module/task | Speech Mining and Segmentation |
| investigator/partner | Fraunhofer IAIS |
| applied algorithms/approaches | The speech recordings from the content providers (e.g. INA) are segmented automatically into homogeneous segments. Further, a speech/non-speech detection is performed, using algorithms based on Gaussian mixture techniques. The indexing of the speech files is performed by subword recognition on syllable level, with Hidden Markov Models used for the context-dependent modelling of the phones. For the subword retrieval process, a dynamic time warping approach in combination with the Levenshtein distance metric is applied to enable a fuzzy search that eliminates the Out-Of-Vocabulary problem. |
| pre-existing technology before project start | Speech recognition engine for the German language. |
| research challenge/innovation | Robust indexing on large-scale corpora |
| type and amount of processed data | Speech and video recordings from the INA archive (mainly in French) |
| success criteria, recognition/indexing rate | TBD |
| risk | High |
| demo available | Segmentation and indexing demo for German is available |
| module/task | Jingle detection |
| investigator/partner | Fraunhofer IAIS |
| applied algorithms/approaches | Fingerprints extraction, combination of features, Gaussian filtering techniques |
| pre-existing technology before project start | Zero crossing rate and spectral flatness |
| research challenge/innovation | Solve fade-in/fade-out problems and signal overlap problems, scalability |
| type and amount of processed data | Audio-visual archives, 10 000 hours |
| success criteria, recognition/indexing rate | Currently being defined |
| risk | Medium |
| demo available | Segmentation and indexing demo for German is available |
Project VidiVideo
| module/task | Audio Analysis |
| investigator/partner | INESC |
| applied algorithms/approaches | Speech recognition, Machine learning, MEL features |
| pre-existing technology before project start | Speech recognition for Portuguese, features for speech recognition, no use of audio events in search engines |
| research challenge/innovation/not addressed | Non-news data, integration of many different features, audio-visual integration in early stages |
| type and amount of processed data | Broadcast TV, >500 hours |
| success criteria, recognition/indexing rate | Average Precision |
| risk | Methods don't generalize to the new domains |
| demo (available/planned/not foreseen) | Integrated demo for whole of VidiVideo |
Project Rushes
| module/task | 3D video scene description |
| investigator/partner | FhG/HHI, Germany |
| applied algorithms/approaches | camera motion and 3D scene structure clustering |
| pre-existing technology before project start | initial algorithm |
| research challenge/innovation | real life data |
| type and amount of processed data | 1 hour rushes material from EiTB |
| success criteria, recognition/indexing rate | not yet defined |
| Risk | too inaccurate for real life data |
| demo available | no |
Project Victory
| module/task | 3D Search engine |
| investigator/partner | CERTH/ITI |
| applied algorithms/approaches | see www.victory-eu.org |
| pre-existing technology before project start | www.3d-search.iti.gr |
| research challenge/innovation | all the techniques used are innovative |
| type and amount of processed data | thousands of 3D models |
| success criteria, recognition/indexing rate | retrieval accuracy>95% |
| Risk | |
| demo available | (see www.victory-eu.org) |
Project Divas
| module/task | Video segmentation, indexing and search |
| investigator/partner | ELECARD |
| applied algorithms/approaches | "Scene change detection" segmentation algorithm, "Scene change" index search algorithm, "Brightness histogram (horizontal)" index creation and index comparing algorithm, "Key frame extraction" algorithm for index creation on compressed domain |
| pre-existing technology before project start | initial algorithms |
| research challenge/innovation | H.264 segmentation, indexing and search on compressed domain |
| type and amount of processed data | 30 hours video from Escom, 30 hours video from BeTV |
| success criteria, recognition/indexing rate | Scene change detection segmentation algorithm - 90% accuracy |
| Risk | Some content (encoded with codecs other than H.264 and MPEG-2) needs full decoding |
| demo available | not yet |
Project Rushes
| module/task | Relevance feedback |
| investigator/partner | Queen Mary University London, UK |
| applied algorithms/approaches | Support vector machines |
| pre-existing technology before project start | initial algorithm |
| research challenge/innovation | real life data |
| type and amount of processed data | about 158 hours of news video from TRECVID 2006 |
| success criteria, recognition/indexing rate | above 70% at the last iteration |
| Risk | amount and quality of data |
| demo available | yes |
| module/task | AV information retrieval |
| investigator/partner | Brunel University, UK |
| applied algorithms/approaches | HMM scheme |
| pre-existing technology before project start | wavelet implementation for feature selection/ranking |
| research challenge/innovation | real life data |
| type and amount of processed data | |
| success criteria, recognition/indexing rate | it should be more than 95% |
| Risk | inaccurate for real life data |
| demo available | yes |
| module/task | Video annotation and summarisation |
| investigator/partner | Brunel University, UK |
| applied algorithms/approaches | semantic feature based annotation and frame based summarisation |
| pre-existing technology before project start | IBM UIMA package |
| research challenge/innovation | comprehensive search |
| type and amount of processed data | Internet resources |
| success criteria, recognition/indexing rate | hopefully > 70% |
| Risk | diversity of information |
| demo available | will be available |
Project Sapir
| module/task | Video segmentation |
| investigator/partner | Eurix |
| applied algorithms/approaches | The video processing module segments a video into temporal units using different levels of granularity: keyframes, shots and clusters. Shots and clusters represent the first level of decomposition, while keyframes are used at the second level. |
| pre-existing technology before project start | initial algorithm |
| research challenge/innovation | Extract data and represent it in MPEG-7. Then use the MPEG-7 for indexing and retrieval. |
| type and amount of processed data | TBD |
| success criteria, recognition/indexing rate | TBD |
| Risk | Efficiency dimension - SAPIR basic (features similarity search) performance can degrade for large volume of content and/or large number of peers, resulting in scalability issues. Effectiveness dimension - Feature search does not improve over text only search, resulting in little gain over existing approaches. |
| demo available | No |
Project Semedia
| Module/task | Quick overview of media including sparsely annotated material |
| investigator/partner | JRS |
| applied algorithms/approaches | new approaches for browsing & navigation within huge sparsely annotated material and classifiers |
| pre-existing technology before project start | several low level analysis modules, framework for GUI development |
| research challenge/innovation | development of algorithms and GUIs for browsing & navigation including e.g. setting detection, finding of retakes and classifiers |
| type and amount of processed data | Substantial subsets from BBC, CCRTV, S&M and flickr test data |
| success criteria, recognition/indexing rate | Application dependent, varying from high recall to high precision |
| Risk | minimal |
| demo available | planned to be integrated into the post-production demonstrator |
| module/task | Low level indexing for efficient searches of A/V databases |
| investigator/partner | FBM-UPF |
| applied algorithms/approaches | bag of visual words approach based on: local region detectors: Harris, Hessian, MSER; SIFT and GLOH descriptors; Aggregation of visual object representations |
| pre-existing technology before project start | N/A |
| research challenge/innovation | Combining content-based image retrieval with social media object annotations |
| type and amount of processed data | Millions of Flickr photos and high-quality video. |
| success criteria, recognition/indexing rate | Measured in terms of recall/precision and accuracy |
| Risk | Scalability of the approach to Internet size (hundreds of millions of photos and/or videos) |
| demo available | Planned to be integrated into the second version of the web-based-communities demonstrator |
| Module/task | Efficient combination of metadata sources |
| investigator/partner | JRS |
| applied algorithms/approaches | development of methods and tools to efficiently combine metadata coming from different sources relating to the same essence; development of methods to ensure metadata consistency and content over the entire production workflow |
| pre-existing technology before project start | results from a diploma thesis we performed within this area |
| research challenge/innovation | successfully apply technologies from the semantic web area within multimedia description formats such as MPEG-7; development of identity resolution (find out which annotations are the same) and find an upper hierarchy/ontology to describe the content neutrally and coherently |
| type and amount of processed data | A substantial sub-set of Flickr photo annotations |
| success criteria, recognition/indexing rate | Application dependent, varying from high recall to high precision |
| Risk | minimal |
| demo available | planned to be integrated into the web-based-communities demonstrator and maybe within post-production demonstrator |
| Module/task | Data architectures and security in networked media environments |
| investigator/partner | UPC, DVS |
| applied algorithms/approaches | Fast metadata extraction from cluster filesystem storages, Caching algorithms |
| pre-existing technology before project start | DVS Spycer Content Management System, results from UPC caching research |
| research challenge/innovation | Efficient content management on cluster filesystem storages |
| type and amount of processed data | Media data from broadcast and postproduction, some TB |
| success criteria, recognition/indexing rate | Better scalability, higher throughput |
| Risk | Minimal |
| demo available | Planned |
| Module/task | Media mining techniques |
| investigator/partner | UG |
| applied algorithms/approaches | affect-based models for mining event patterns in football video data sets |
| pre-existing technology before project start | |
| research challenge/innovation | event detection by analysing audio, video and textual streams |
| type and amount of processed data | World Cup football data set |
| success criteria, recognition/indexing rate | Application dependent, varying from high recall to high precision |
| Risk | minimal |
| demo available | Planned |
| Module/task | Interface design for context-aware adaptive search, browsing and annotation |
| investigator/partner | FBM-UPF |
| applied algorithms/approaches | New algorithms for semantic clustering, surrogate formation, layout management, and new approaches for direct interaction, minimalistic design |
| pre-existing technology before project start | Calm technology approach, interface ecology, information visualization techniques for large information spaces, latent semantic analysis, statistical models of user interaction |
| research challenge/innovation | Increasing contact with media spaces, designing for a prolonged exploration, building an immersive experience |
| type and amount of processed data | videos collected from social sites along with their contextual information, news articles and RSS feeds. In total 5000 videos and 10000 articles |
| success criteria, recognition/indexing rate | Prolonged immersive exploration of information spaces, social interaction, intuitive affordances of interaction mechanism |
| Module/task | Prototypes in Media Postproduction Environments |
| investigator/partner | All partners |
| applied algorithms/approaches | Selection of approaches developed above |
| pre-existing technology before project start | S&M Cakes production management system and DVS Spycer content management system |
| research challenge/innovation | Use of selected approaches in real-world postproduction content management tools |
| type and amount of processed data | Dataset from S&M postproduction, probably a few TB |
| success criteria, recognition/indexing rate | Usefulness of the integrated research approaches, User satisfaction |
| Risk | Integration problems |
| demo available | Planned |
| Module/task | Feedback-Only Search |
| investigator/partner | FBM-UPF |
| applied algorithms/approaches | Development of specialized algorithms for feedback-intensive situations. Comparison to standard statistical classifiers. |
| pre-existing technology before project start | Many standard statistical classifiers (e.g., SVM) |
| research challenge/innovation | Identification of feedback-intensive situations and performance comparison of statistical classification techniques to retrieval and feedback functions. Analysis of advantages of specialized algorithms compared to standard classifiers. |
| type and amount of processed data | Substantial subsets from BBC, CCRTV, S&M, and Y!I test data |
| success criteria, recognition/indexing rate | Achieve a better understanding of techniques for feedback-intensive situations. Success of new algorithms will be measured by high classification accuracy. |
| Module/task | Prototypes in Broadcast Media Environments |
| investigator/partner | All partners |
| applied algorithms/approaches | Selection of approaches developed above |
| pre-existing technology before project start | CCRTV-ASI Digition Suite for professional asset management |
| research challenge/innovation | Use of selected approaches in real-world broadcast content management tools |
| type and amount of processed data | A sub-set of the CCRTV online and archive media files, probably a few TB |
| success criteria, recognition/indexing rate | Usefulness of the integrated research approaches, User satisfaction |
| Risk | Integration problems |
| demo available | Planned |
| Module/task | Integrated retrieval and mining models |
| investigator/partner | UG |
| applied algorithms/approaches | event mining models and new retrieval models |
| pre-existing technology before project start | event detection algorithms |
| research challenge/innovation | Integration of retrieval model with mining data set |
| type and amount of processed data | TRECVID data set, World Cup data set |
| success criteria, recognition/indexing rate | precision, recall |
| Risk | minimal |
| demo available | planned |
| Module/task | Prototypes for media access, search and retrieval in web-based communities |
| investigator/partner | All partners |
| applied algorithms/approaches | Selection of approaches developed above |
| pre-existing technology before project start | Yahoo! Web servers |
| research challenge/innovation | Use of selected approaches in real-world web community environments |
| type and amount of processed data | Sub-set of media files from Yahoo! Communities, probably a few TB |
| success criteria, recognition/indexing rate | Positive user feedback |
| Risk | Integration problems, lack of acceptance by users |
| demo available | Planned |
Project VidiVideo
| Module/task | Visual analysis |
| investigator/partner | UvA, CVC |
| applied algorithms/approaches | Keypoints, color spaces, machine learning, motion pattern analysis |
| pre-existing technology before project start | Various feature detection methods, SVM based learning of concepts |
| research challenge/innovation | Complete invariant feature sets, motion features |
| type and amount of processed data | Broadcast TV, >500 hours |
| success criteria, recognition/indexing rate | Average Precision |
| Risk | Ambition of 1000 usable detectors too high. |
| demo available | Yes |
| Module/task | Learning |
| investigator/partner | Surrey, UvA |
| applied algorithms/approaches | Machine learning |
| pre-existing technology before project start | mostly SVM based classifiers |
| research challenge/innovation | Integrated multi-media features, fusion low-high level semantics, class specific detectors |
| type and amount of processed data | Broadcast TV, >500 hours |
| success criteria, recognition/indexing rate | Average Precision |
| Risk | imbalance in training/testing sets |
| demo available | No |
Project Vitalas
| module/task | Rigid local entities retrieval |
| investigator/partner | INA, INRIA |
| applied algorithms/approaches | Low level local features extraction, similarity search structure, tracking and spatio-temporal fusion |
| pre-existing technology before project start | SIFT like local features, common similarity search structures |
| research challenge/innovation | More discriminant local features, large video datasets, spatio-temporal fusion |
| type and amount of processed data | 10000 hours of video (INA) |
| success criteria, recognition/indexing rate | Currently being defined |
| risk | Medium |
| demo available | Not yet |
| module/task | Large set of cross-media concepts extraction |
| investigator/partner | CERTH-ITI, UoS, CWI |
| applied algorithms/approaches | Low level features, Hierarchy of classifiers, machine learning |
| pre-existing technology before project start | Low level features, SVM, MediaMill |
| research challenge/innovation | Cross-media fusion, Large hierarchy of classifiers, several thousands of concepts |
| type and amount of processed data | 1000 hours of video (INA) |
| success criteria, recognition/indexing rate | Currently being defined |
| risk | High |
| demo available | Not yet |
Project Rushes
| module/task | Semantic reasoning |
| investigator/partner | Queen Mary University London, UK |
| applied algorithms/approaches | Bayesian networks |
| pre-existing technology before project start | Initial algorithm |
| research challenge/innovation | Availability of semantic features |
| type and amount of processed data | semantic annotation of 10 concepts in 12000 images |
| success criteria, recognition/indexing rate | Improved accuracy compared with initial annotation |
| risk | Availability and accuracy of semantic features |
| demo available | yes |
| module/task | Text semantic retrieval |
| investigator/partner | Brunel University, UK |
| applied algorithms/approaches | biologically driven segmentation/clustering->feature extraction/selection->Support Vector Machine for classification |
| pre-existing technology before project start | some segmentation implementation, e.g. kernel based. |
| research challenge/innovation | segmentation and features to be selected for similarity search |
| type and amount of processed data | on-line documents |
| success criteria, recognition/indexing rate | > 90% (possibly) |
| risk | unknown |
| demo available | yes |
Project Sapir
| module/task | Text |
| investigator/partner | Xerox |
| applied algorithms/approaches | Four kinds of information will be generated by text processing: 1. word-level indexing information: information about word occurrences in the text; used for keyword searching 2. named entity information: annotation of names of people, places, dates; used for searches with semantic constraints 3. extracted facts: structured information induced from text; used for searches with semantic constraints 4. summary: a selection of important sentences that allows a user to determine quickly whether a ContentObject is relevant |
| pre-existing technology before project start | Some text analytics tool from Xerox |
| research challenge/innovation | Write UIMA annotators that extract the features and represent them in MPEG-7. Index and search using a P2P architecture. |
| type and amount of processed data | TBD |
| success criteria, recognition/indexing rate | TBD |
| risk | Same as for Speech |
| demo available | Planned |
Project Tripod
| module/task | Caption augmentation for images with existing captions |
| investigator/partner | Tripod partners |
| applied algorithms/approaches | Expanding captions with words from Web pages and from map data |
| pre-existing technology before project start | Very little currently being done |
| research challenge/innovation | Making the approach work well |
| type and amount of processed data | Thousands of images |
| success criteria, recognition/indexing rate | Acceptance of image captions by photolibraries |
| risk | Medium |
| demo available | Not yet |
Project Vitalas
| module/task | Text search module |
| investigator/partner | EADS, CWI |
| applied algorithms/approaches | Vectorial approaches |
| pre-existing technology before project start | TF/IDF, Inverted lists |
| research challenge/innovation | Large scale |
| type and amount of processed data | Annotations (manually and automatically generated) of audio-visual and photo agency archives (10000 hours of video, 3 million images) |
| success criteria, recognition/indexing rate | Currently being defined |
| risk | Low |
| demo available | Not yet |
| module/task | Word sense disambiguation |
| investigator/partner | University of Sunderland |
| applied algorithms/approaches | Statistical methods |
| pre-existing technology before project start | EuroWordNet |
| research challenge/innovation | Large scale |
| type and amount of processed data | Annotations (manually and automatically generated) of audio-visual and photo agency archives (10000 hours of video, 3 million images) |
| success criteria, recognition/indexing rate | Currently being defined |
| risk | Low |
| demo available | Not yet |
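As a rough illustration of the word confusion network (WCN) described in the Sapir speech entry above, the following Python sketch extracts the 1-best transcript by taking the hypothesis with the highest posterior probability in each position. The data layout (a list of slots, each mapping word hypotheses to posteriors), the `<eps>` marker for an empty hypothesis, and the function name `one_best` are assumptions made for this example; they are not taken from the SAPIR system.

```python
from typing import Dict, List

# Hypothetical layout: a WCN is a sequence of "slots"; each slot maps the
# competing word hypotheses to their posterior probabilities.
WCN = List[Dict[str, float]]

EPSILON = "<eps>"  # assumed marker for "no word in this slot"


def one_best(wcn: WCN) -> List[str]:
    """Return the 1-best transcript: the top-posterior word of every slot."""
    transcript = []
    for slot in wcn:
        # Pick the hypothesis with the highest posterior in this slot.
        best_word = max(slot, key=slot.get)
        # Skip slots whose best hypothesis is the empty word.
        if best_word != EPSILON:
            transcript.append(best_word)
    return transcript


example = [
    {"multimedia": 0.9, "multi": 0.1},
    {EPSILON: 0.6, "media": 0.4},
    {"retrieval": 0.7, "retriever": 0.3},
]
print(one_best(example))
```

Because the slots are already aligned, no dynamic programming is needed here, which is one reason the WCN is convenient for indexing.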
Annex B: Overview of the national research projects
| Project | Quaero |
| Budget | · €100m for >5 years and more than 20 partners · Granted by French ‘Agence de L’innovation Industrielle’ · State aid to be authorised by DG Competition of European Commission |
| Duration | >5 years |
| Country | France with the participation of German partners |
| Partners | Private companies : Thomson, France Telecom, Jouve, Exalead, Bertin Technologies, LTU Technologies, Vecsys, Synapse Development Public research labs : LIMSI-CNRS, RWTH-Aachen, Karlsruhe University, INRIA, LIG-UJF, IRCAM, ENST-GET, IRIT, INIST-CNRS, MIG-INRA, LIPN Public institutions : INA, BNF, LNE, DGA Some contacts have been established with other European potential participants |
| Main Objectives and challenges | Develop demonstrators or applications corresponding to identified use cases in the domain of access and manipulation of multimedia and multilingual content · Search, navigate, distribute, produce Develop the corresponding enabling technologies for multilingual and multimodal content processing |
| Main applications and use cases | 1. Consumer Multimedia Search Engine 2. Multimedia Search Services to enrich European portals 3. Personalised Video on interactive consumer networked devices Anytime and Anywhere 4. Recondition the Audiovisual Cultural Heritage 5. Professional Digital Media Asset Management for Broadcasting Industry 6. Platform for Text and Image Annotation |
| Research and Technologies | · Search and extraction infrastructure · Content processing infrastructure · Document capture and processing · Speech recognition · Translation · Musical analysis · Object recognition in images and video · Face detection and recognition · Video segmentation and structure analysis · Object tracking and event recognition in videos · Man machine interaction · Security |
| Benchmarking of project results | Evaluation is the founding principle of Quaero’s technological research and development organisation. Evaluation will be used as a tool for facilitating and structuring technology transfer between research organisations and leaders of use cases. Periodic evaluation campaigns shall be conducted within the program to assess global progress in each of the technology areas addressed in the program. These evaluation campaigns shall build on the most advanced procedures developed and organized by national or international bodies and programs such as NIST, CLEF, Technolangue, Technovision… |
| Project | Theseus http://www.bmwi.de/BMWi/Navigation/Technologie-und-Innovation/Informationsgesellschaft/multimedia,did=184810.html http://theseus-programm.de |
| Budget | Overall volume: 200 Mio. Euro (Funding: 90 Mio. Euro) |
| Duration | 5 years |
| Country | Germany |
| Partners | Industry: Empolis/Bertelsmann (co-ordinator), SAP, Siemens, Deutsche Thomson, Lycos, Morsophy, m2any, Intelligent Views, Ontoprise Research and public organisations: Fraunhofer Gesellschaft zur Förderung der angewandten Forschung (FhG), Institut für Rundfunktechnik (IRT), Deutsche Nationalbibliothek (DNB), Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI), Forschungszentrum Informatik (FZI), VDMA-Verband, Gesellschaft für Forschung und Innovation (VFI), universities (Karlsruhe, München, Darmstadt, Dresden, Konstanz, Erlangen) |
| Main Objectives and challenges | The main objective is to generate innovation in the area of semantic technologies to strengthen the role of the German IT industry and to establish new services in this area. The technologies are mainly for new internet based applications and services. |
| Main applications and use cases | Several applications are foreseen. They are realized in sub-projects (called “use cases”): · Alexandria: semantic internet platform to process and organize user generated content, semantic internet search platform · Contentus: Processing of cultural audio visual content of the German National Library · Medico: semantic image technology for Clinical Decision Support and Computer Aided Diagnosis. · ORDO: automatic semantic processing of huge text and audio visual corpora, semantic search tools · Processus: development of knowledge intensive tools to optimize generic production workflow · Texo: semantic based interconnection between service provider and service users |
| Research and Technologies | · Image and video processing · 3D analysis · Ontology · User interaction and semantic modelling · Machine learning · Digital rights management |
| Benchmarking of project results | In the Core Technology part of the project, one work package deals with benchmarking the other technology and research work. Fraunhofer IDMT is responsible for the benchmarking. |
| Project | iAD – information access disruptions |
| Budget | Ca. €30m |
| Duration | 8 years, start in 2007 |
| Country | Norway |
| Partners | · Fast Search & Transfer (Host) · Accenture · Schibsted · Cornell University · AIC Dublin (DCU, UCD) · NTNU Trondheim · University of Tromsø · University of Oslo · Norwegian School of Management |
| Main Objectives and challenges | · Core research for next generation precision, analytics and scale in information access · Build international networks to identify and execute on global disruption opportunities enabled by emerging services in the information age |
| Main applications and use cases | |
| Research and Technologies | Schema agnostic indexing services · Schema-agnostic end2end design · Consolidation of query model Processing high-speed data streams · Capturing & extracting knowledge from data streams: · Pervasive sensor networks, RFID readers, multimedia feeds, … Scalable infrastructure for push and pull based computing · Robust principles and services for next generation infrastructure for distributed information access Extreme precision and recommendation in multimedia access · Extreme precision solutions for access to multimedia content · Social networks with recommender functions Understanding and managing the disruptive potential of iAD · Analyze business and societal impact · Assess disruptive potential |
| Benchmarking of project results | |
| Project | MultimediaN http://www.multimedian.nl/en/multimedian.php |
| Budget | 30 MEuro |
| Duration | Phase 1: 2002 – 2004 Phase 2: 2004 – 2009 |
| Country | Netherlands |
| Partners | · Center for Math and Computer Science · Philips Research · Technical University Delft · Telematica Institute · TNO · University of Amsterdam · University of Twente + 39 affiliated business partners |
| Main Objectives and challenges | MultimediaN is a public-private partnership focusing on science and technology of multimedia interaction & search engines. MultimediaN contributes to the solution of four fundamental problems: 1. The accessibility of much multimedia content is low. 2. The information is fragmented: sound can't be matched to text, text can't be matched to speech. 3. A lot of information contributes to the 'information overload' that is characteristic of today's society. 4. Multimedia information is often badly organized as a result of legacy systems, self-created standards and heterogeneity in terminologies. |
| Main applications and use cases | MultimediaN is divided into fundamental, integration, and application projects. The fundamental projects (Learning Features, Multimodal Interaction, and Ambient Multimedia Databases) create knowledge that is new on a world level. The integration projects (Semantic Multimedia Access, Professional's Dashboard, and Video At Your Fingertips) develop knowledge in which existing video, audio and speech technologies are combined. The application projects (E-Culture and Personal Information Services) are pilots, which create application knowledge in an application context. · Learning Features · Multimodal Interaction · Ambient Multimedia Databases · Semantic Multimedia Access · Professional's Dashboard · Video At Your Fingertips · E-Culture (N9C) · PERsonal Information Services |
| Research and Technologies | MultimediaN covers the following research topics: · Image, picture, video processing and indexing · Audio and speech recognition and indexing · Textual processing · Knowledge modelling, mining · System engineering (databases, standards) |
| Benchmarking of project results | The modules are evaluated in several international benchmarking initiatives. For video indexing, a special track of TRECVID was established in which data from MultimediaN was used for evaluation. |
| Project | Interactive Multimodal Information Management (IM2) |
| Budget | Phase 1: · SNSF funding: 15’349’000.- CHF · Self & third-party funding: 19’655’000.- CHF Phase 2: · SNSF funding: 14’000’000.- CHF · Self & third-party funding: 14’000’000.- CHF |
| Duration | 3 x 4 years (4 phases), project start: January 2002 |
| Country | Switzerland |
| Partners | IDIAP Research Institute, Martigny (co-ordinator) Partners: EPFL, Univ. Geneva, Univ. Fribourg, ETHZ, and Univ. Bern |
| Main Objectives and challenges | IM2 has the objective to develop advanced methods for indexing multimedia content and to provide advanced multimodal human computer interfaces. Therefore investigations in the area of human-human communication are carried out. |
| Main applications and use cases | The application scenario so far is the indexing and modelling of face-to-face meetings. |
| Research and Technologies | IM2 covers the following research areas: · Unconstrained speech recognition · Language understanding · Computer vision · Machine learning · Multimodal scene analysis · Model of individual and group dynamics · Sociology and social-psychology · Structure, index, summarize communication scenes · User interfaces |
| Benchmarking of project results | Each of the following technology modules is evaluated in international benchmark initiatives (NIST, DARPA, …): · ASR: automatic speech recognition · KWS: keyword spotting · SEG: speaker segmentation · ID/LOC: identification and localization/tracking · FOA: focus of attention · GAA: gesture and action recognition IM2 provides a huge corpus of recorded meetings for internal and external evaluation and benchmarking. IDIAP has shown the good performance of its computer vision technology in the ImageCLEF 2007 evaluation for the medical annotation task. |