SOA of existing benchmarking initiatives + who is participating in what (EU&NI)


1.1. Introduction and WG2 objectives


When addressing audio-visual search engine challenges, we identified benchmarking and evaluation as a critical topic. Although the research community has created many effective multimedia retrieval methodologies, few commercial products currently incorporate such techniques, and it is not obvious which technique is best for a given problem. It is clear, however, that to cope with the rapid growth in the production of and access to digital multimedia content, evaluation campaigns will help to facilitate the wider use and dissemination of multimedia retrieval research results. Benchmarking efforts are usually intended to be precise and to measure carefully how systems or algorithms perform with respect to a dataset, a task and an evaluation metric. Thus, to be scientifically valid, they have to be specific, so that results are unambiguous and measurable. This makes benchmarks necessarily very narrow in focus, and they often exclude much research. The goal is to find research questions that are of general interest, where a number of researchers are working on much the same goal, and then to evaluate this work. The drawback is that benchmarking can end up formatting the research work of the whole community: it pushes people to work on the same tasks and thereby limits innovation. During ACM Multimedia Information Retrieval, a panel was organized on “Diversity in Multimedia Information Retrieval Research”, where the question “Does benchmarking kill innovation?” was discussed[1]. The panel paper[2] and slides are also available on this web page. Besides TRECVID, benchmarking initiatives are becoming numerous: ImageCLEF, Pascal, ImagEval, ... This clearly shows that no single existing initiative is satisfactory by itself, since none offers a context in which all the tasks addressed by our community can be tested.
Also, because the scientific objectives of multimedia search engines are rich and follow growing and evolving use cases and user needs, benchmark initiatives should be able to follow the dynamics of the field. Nevertheless, benchmarking remains necessary and valuable for the community, as it provides an objective reference among the numerous academic and industrial technical solutions. It should, however, be set up carefully, with clear and fair rules and a wide consensus of the community regarding the definition of tasks, evaluation parameters, performance measures, ground truth setting, avoidance of conflicts of interest, ... Meeting all these conditions remains quite challenging and is the bottleneck of some initiatives. Under ideal conditions, we believe that winning a benchmark is worth a thousand publications for academia and a thousand press releases for industry, and represents a “moment of truth” among all the claims that technology providers (academic or industrial) can make about their work and results. Chorus activities on this topic are conducted within WG2. The related web site is: http://www.ist-chorus.org/wg2---evaluation-benchmarking-an.php Having a common understanding of how to evaluate multimedia retrieval systems would allow technology users and companies to orient themselves and select the retrieval technique most suitable for their own specific needs. The problem is that current evaluation initiatives are disparate, run independently of each other, and lack coordination. One of our objectives within Chorus is to address this by bringing together these dispersed initiatives on the benchmarking and evaluation of multimedia information retrieval, in order to establish a clear understanding of the current situation and determine how best to move forward in a unified and cooperative way.
During our first year, we organized two events that allowed benchmarking communities to share experience around the existing initiatives: the Chorus Rocquencourt workshop[3] and a panel at CBMI’07[4]. The programs of these two events, as well as links to the presentations, are provided in the annex. By bringing together organizers of existing multimedia evaluations at the Chorus events, we enable them to share experiences, and we plan, for the next period of Chorus, to put forward best practices to improve the existing evaluation initiatives. The following initiatives have participated in Chorus events: TRECVID, INEX, ImageCLEF, CLEF, ImageEval, SHREC, MIREX, ELDA, Robin, and the Pascal Challenges. The workflow of most existing evaluation campaigns is similar: registration of participants, distribution of data, submission of results, creation of ground truth, evaluation, and dissemination of results during a workshop or conference. These commonalities could be analyzed to identify how existing evaluation efforts could be shared, for example the maintenance of database collections and the generation of ground truth. Another benefit would be to avoid asking the scientific community several times to participate in different campaigns whose tasks are very close, even if they use different data collections. The evaluation method developed by the TREC (Text REtrieval Conference) series is considered the standard methodology for large-scale evaluation of information retrieval systems [VOORHEES,1998]. Most of the benchmark initiatives described in this document are to some extent based on this model. Accordingly, an evaluation typically consists of the following phases:


[2] James Z. Wang, Nozha Boujemaa, Alberto Del Bimbo, Donald Geman, Alexander G. Hauptmann and Jelena Tesic. (2006). Panel: Diversity in Multimedia Information Retrieval Research. In Proceedings of the 8th ACM international workshop on Multimedia information retrieval. 5-12.

  • Establishing a common dataset. In order to prevent biases, all participants work on exactly the same dataset. In general, a distinction is made between the training set, which participants use to train their systems, and the test set, which is similar to the training set and is used for the final evaluation. This distinction is necessary to prevent systems from being biased towards the training set. Furthermore, a dataset is typically selected for a particular task or track. This may include generating or tailoring the dataset for a specific task, but most often the dataset is an excerpt from real-life data that is representative of the problem domain.

  • Definition of the task to be performed. All participants perform exactly the same task, the results of which are evaluated and compared. Typically a task reflects a real-life need within a particular domain. In general, organizers of benchmark initiatives try to find a balance between relatively easy tasks that they know are supported by state-of-the-art technology and challenging tasks that are not yet covered.

  • Establishing the ground truth. In order to evaluate the results provided by participants for a particular task, the correct response, or ground truth, should be known. Although this may appear trivial, establishing the ground truth can be a rather complicated task because of the quantity and complexity of the dataset or the ambiguous nature of the response. Often the ground truth is established manually by domain experts, which is typically a rather labor-intensive task.

  • Assessing results relative to the ground truth. The results submitted by a participant for a particular task are evaluated relative to the ground truth. Typically this is an automated process that produces a metric, which allows comparison with other participants. Sometimes, however, submitted results are judged (and cross-validated) by human experts. Although the objective of benchmarking is to establish a quality metric for technology within a particular domain, most initiatives emphasize the benchmark as a platform for discussion rather than a competition.
The objective of this document is to raise awareness among researchers of the different benchmarking initiatives that are available and to provide a description of their activities and properties.


1.2. Overview of existing benchmark initiatives


In order to map the landscape of currently active benchmark initiatives, CHORUS organized a workshop (14-3-2007, INRIA, Rocquencourt) to which it invited representatives of the major multimedia benchmarks, who each gave a presentation about their respective benchmark initiative. This initial meeting was followed up by a panel discussion (26-6-2007, CBMI, Bordeaux) on benchmark initiatives. Based on the initial workshop and the panel discussion, we established 5 dimensions that we use to compare the benchmark initiatives:

  • Definition of tracks and tasks denotes a short description of the tracks and tasks a participant can compete in. A track refers to the “theme” of comparison, such as copy detection for video or artist identification for music, whereas a task refers to a particular assignment the participant has to complete. Although most initiatives cover multiple tracks, a participant does not necessarily need to compete in all of them.

  • Evaluation metrics of task denotes the metric that was used to evaluate a task. The standard metrics used in information retrieval include Mean Average Precision (MAP), Binary Preference (Bpref), Mean Reciprocal Rank (MRR) and Geometric Mean Average Precision (GMAP). However, some tasks are unsuited for evaluation using these measures. In this case, we indicate the evaluation metric for the specific initiative.

  • Type and size of the data used describes the quantity and quality of the data used for the benchmark initiative.

  • Method used for generating the ground-truth describes the method used to obtain the ground-truth, which is used as a measurement to evaluate the submitted results of the participants.

  • Participation statistics denotes for the last three years (2007, 2006, and 2005) the number of registrations of intended participation and the number of registered participants that submitted results.
In addition we represent in the overview the:
  • URL, which denotes the address of the initiative's website.
  • Conclusion, which denotes a partial conclusion from the perspective of the initiative.
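
The rank-based measures named under the evaluation metrics dimension can be made concrete with a short sketch. The Python below is illustrative only and is not the scoring code of any campaign: the function names, the toy topics, and the small constant `eps` added before taking logarithms in GMAP are assumptions chosen for this example, and relevance is taken to be binary.

```python
import math


def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of relevant ids."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant item's rank
    # Relevant items that were never retrieved contribute zero.
    return total / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_average_precision(topics):
    """MAP: arithmetic mean of the per-topic average precision."""
    return sum(average_precision(r, rel) for r, rel in topics) / len(topics)


def mean_reciprocal_rank(topics):
    """MRR: arithmetic mean of the per-topic reciprocal rank."""
    return sum(reciprocal_rank(r, rel) for r, rel in topics) / len(topics)


def geometric_map(topics, eps=1e-5):
    """GMAP: geometric mean of per-topic AP; eps (an assumption) avoids log(0)."""
    logs = [math.log(average_precision(r, rel) + eps) for r, rel in topics]
    return math.exp(sum(logs) / len(logs))


# Hypothetical toy data: (ranked result list, set of relevant ids) per topic.
topics = [
    (["a", "b", "c"], {"a", "c"}),
    (["x", "y"], {"y"}),
]
print(mean_average_precision(topics))
print(mean_reciprocal_rank(topics))
print(geometric_map(topics))
```

Because GMAP averages in log space, a topic with very low average precision pulls the score down much more than it does for MAP, which is why the two measures can rank systems differently.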
Find below the overview of the benchmark initiatives we address:


TrecVid

The TREC conference series is sponsored by the National Institute of Standards and Technology (NIST) with additional support from other U.S. government agencies. The goal of the conference series is to encourage research in information retrieval by providing a large test collection, uniform scoring procedures, and a forum for organizations interested in comparing their results. In 2001 and 2002 the TREC series sponsored a video "track" devoted to research in automatic segmentation, indexing, and content-based retrieval of digital video. Beginning in 2003, this track became an independent evaluation (TRECVID) with a 2-day workshop taking place just before TREC. TRECVID is coordinated by Alan Smeaton (Dublin City University) and Wessel Kraaij (TNO Information and Communication Technology). Paul Over and Tzveta Ianeva provide support at NIST.
Definition of tracks and tasks (2007) · Shot detection · Semantic Concept Features · Automatically create MPEG-1 summary: maximum duration to be determined; shows the main objects (animate and inanimate) and events from “rushes”; evaluated using simple play and pause controls; need not be a series of frames directly from the video; summaries can contain picture-in-picture, split screens · Search: interactive, manual, automatic
Evaluation metrics of tasks Mean Average Precision
Type and size of data used 2005 Data unavailable
2006 158 hours Arabic, Chinese, English Broadcast News Common speech recognition, translation, annotations
2007 100 hours Dutch TV shows, common speech recognition, translation; several groups provided low-level features and (unverified) semantic concept detection results. 100 hours BBC “Rushes”: raw stock footage, natural sound, highly repetitive, long segments, reusable shots of people, objects, events, locations, etc.
Method used for the generation of the ground-truth · Researchers submit experiment results on test collections · Cut off at certain rank for each system · Top results from systems are pooled (redundancy removed) · Pooled results manually judged for relevance · All systems submission results are scored for all results · With manual truth assumed to be ‘complete’ · Works well with good variety of system approaches · Cost effective and scalable
Participation statistics 2005 Registrations: 63 Search submissions: 42 Search runs submitted: 112
2006 Registrations: 70 Search submissions: 54 Search runs submitted: 123
2007 Registrations: 71 Search submissions: N.A. Search runs submitted: N.A.
URL http://www-nlpir.nist.gov/projects/trecvid/
Conclusion from TrecVid perspective · Standardized evaluations and comparisons – improve science · Weed out many hypotheses from small, idiosyncratic data · Test on common large collection and some common metadata · Failures are not embarrassing and can be presented at the TRECVID workshops! · Virtually all work is done on one extracted keyframe per shot · Anyone can participate · Sign promise to use the data for research only
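
The pooling procedure listed under the ground-truth method can be sketched as follows. This is a simplified illustration and not NIST's tooling: the data layout, the function names and the toy shot identifiers are assumptions, and precision at a fixed cutoff stands in for the official measure to keep the example short.

```python
def build_pools(runs, depth):
    """Merge the top `depth` results of every system into one pool per topic.

    `runs` maps system name -> {topic id -> ranked list of shot ids}.
    Using a set removes results that several systems returned.
    """
    pools = {}
    for per_topic in runs.values():
        for topic, ranked in per_topic.items():
            pools.setdefault(topic, set()).update(ranked[:depth])
    return pools


def precision_at(ranked, judged_relevant, k):
    """Precision at cutoff k.

    Shots outside the judged-relevant set, including shots that were never
    pooled, count as non-relevant (the manual truth is assumed complete).
    """
    top = ranked[:k]
    return sum(1 for shot in top if shot in judged_relevant) / k


# Hypothetical submissions from two systems for one topic.
runs = {
    "sysA": {"t1": ["s1", "s2", "s3"]},
    "sysB": {"t1": ["s2", "s4", "s5"]},
}
pools = build_pools(runs, depth=2)
print(sorted(pools["t1"]))  # shots that assessors would judge manually

# Stand-in for the manual relevance judgments made on the pool.
judged_relevant = {"t1": {"s2", "s4"}}
for system, per_topic in runs.items():
    print(system, precision_at(per_topic["t1"], judged_relevant["t1"], k=3))
```

Only the pooled shots are ever shown to assessors, which keeps the judging cost bounded; the price is that a relevant shot returned below the cutoff by a single system is scored as non-relevant.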



ImageClef

ImageCLEF is the cross-language image retrieval track which is run as part of the Cross Language Evaluation Forum (CLEF) campaign. The ImageCLEF retrieval benchmark was established in 2003 with the aim of evaluating image retrieval from multilingual document collections. Images by their very nature are language independent, but often they are accompanied by texts semantically related to the image (e.g. textual captions or metadata). Images can then be retrieved using primitive features based on pixels which form the contents of an image (e.g. using a visual exemplar), abstracted features expressed through text or a combination of both. The language used to express the associated texts or textual queries should not affect retrieval, i.e. an image with a caption written in English should be searchable in languages other than English. Besides textual and multimodal tasks, ImageCLEF offers two purely visual tasks for image classification or object detection/retrieval. Note: A pre-conference workshop[1] was organized together with the MUSCLE network of excellence the day before the workshop for the past three years with high-quality keynote speakers on visual information retrieval evaluation and related topics.
Definition of tracks and tasks (2007) · Ad-hoc retrieval with query in different language from the annotation or multilingual image annotations (2003-2007) · Object classification/retrieval task; purely visual (2006-2007) · Medical image retrieval task (2004-2007) · Medical image classification task; purely visual (2005-2007) · Interactive image retrieval (2004-2006) · Geographic retrieval from image collections (2006)
Evaluation metrics of tasks · Mean Average Precision as a lead measure · BPref, P(10-50) used for comparison · Many ideas on how to find better measures, but no resources to pursue them
Type and size of data used · IRMA collection for medical image classification (11’000 images) · ImageCLEFphoto collection (IAPR TC 12) (20’000 images) · ImageCLEFmed collection (~70’000 images) · Varying degree of annotations and languages · Realistic collections for this specific task (containing images of varying quality, majority of English annotations, domain-specific vocabularies and abbreviations, spelling errors, …)
Method used for the generation of the ground-truth · Classification: collections used were classified beforehand · Retrieval: pooling is used with varying number depending on submissions; judgment scheme: relevant – partially – non-relevant; double judgments to analyze ambiguity · Interactive: participants evaluate themselves (time, Nrel)
Participation statistics 2005 36 registrations 24 submissions 300 runs
2006 47 registrations 30 submissions 300 runs
2007 51 registrations 38 submissions >1’000 runs
URL http://www.imageclef.org/
Conclusion ● ImageCLEF creates important resources and is acknowledged in the field (50 registrations) ● Discussions at workshop are regarded as very stimulating ● Lack of participation for interactive retrieval ● Lack of funding is a major problem to professionalize it and analyze all data ● Resource sharing could really help!
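
Two of the comparison measures listed for ImageCLEF, precision at a cutoff (P(10) to P(50)) and BPref, can be sketched as below. The code is an illustration under stated assumptions: it follows the original Buckley and Voorhees formulation of bpref, it treats judgments as binary (how "partially relevant" images are mapped is left to the caller), and the identifiers are hypothetical. It is not ImageCLEF's evaluation script.

```python
def precision_at_k(ranked, relevant, k):
    """P(k): share of the first k results that are judged relevant.

    Unjudged results count as non-relevant here.
    """
    return sum(1 for item in ranked[:k] if item in relevant) / k


def bpref(ranked, relevant, nonrelevant):
    """Binary preference over judged items only.

    Each retrieved relevant item is penalised by the number of judged
    non-relevant items ranked above it, capped at R = |relevant|.
    Unjudged items are ignored entirely.
    """
    r_count = len(relevant)
    if r_count == 0:
        return 0.0
    nonrel_seen = 0
    total = 0.0
    for item in ranked:
        if item in nonrelevant:
            nonrel_seen += 1
        elif item in relevant:
            total += 1.0 - min(nonrel_seen, r_count) / r_count
    return total / r_count


# Hypothetical run: r* judged relevant, n* judged non-relevant, u1 unjudged.
ranked = ["r1", "n1", "u1", "r2", "n2", "r3"]
relevant = {"r1", "r2", "r3"}
nonrelevant = {"n1", "n2"}
print(precision_at_k(ranked, relevant, 2))
print(bpref(ranked, relevant, nonrelevant))
```

The practical difference is visible with `u1`: it lowers P(k) when it falls inside the cutoff, but it has no effect on bpref, which is why bpref is often preferred when the pooled judgments are known to be incomplete.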


ImageEval

In 2005, the Steering Committee of ImagEVAL had the opportunity of proposing evaluation campaigns for funding by the French “Techno-Vision” program. The ImagEVAL project relates to the evaluation of technologies of image filtering, content-based image retrieval (CBIR) and automatic description of images in large-scale image databases. The objective of ImagEVAL is twofold: ● to organize important evaluations starting from concrete needs and using professional data collections ● to evaluate technologies held by national and foreign research laboratories, and software solutions
Definition of tracks and tasks (2007) ● Transformed image recognition ● Combined text/image strategies for image retrieval ● Text area detection ● Object detection (e.g. Car, tree, …) ● Extraction of attributes (e.g. indoor/outdoor, day/night, natural/urban, …)
Evaluation metrics of tasks ● MAP: Mean Average Precision (main metric) and complementary Precision/Recall based metrics ● Mean Reciprocal Rank (for a sub-task of the transformed image recognition) ● Christian Wolf’s metric (for the text area detection): this metric (implemented in DetEVAL tools) is mainly based on the metrics used in the ICDAR evaluation, but it also enables a finer evaluation of the classical over- and under-segmentation problems that appear when dealing with bounding boxes for both results and ground truths.
Type and size of data used ● Old postcards (~7600 images) ● Black & white, color photographs (~50 000 images) ● Transformed image recognition : 42 500 images ● Combined text/image strategies for image retrieval : 700 web pages ● Text area detection : 500 images ● Object detection (e.g. Car, tree, …) : 14 000 images ● Extraction of attributes (e.g. indoor/outdoor, day/night, natural/urban, …) : 23 500 images
Method used for the generation of the ground-truth Ground truth files built by two professionals who annotated each image.
Participation statistics 2005 Data unavailable
2006 20 registrations 11 submissions
2007 Data unavailable
URL http://www.imageval.org/
Conclusion · Very interesting and challenging data provided by professionals who actively participated in the creation of the campaign · ImagEVAL is a part of the solution to the lack of evaluation in the computer vision community · Correct participation level for a first edition, but need to attract more international labs and companies · We need to collaborate with other evaluation campaigns, share experiences and elaborate a coherent planning to avoid overlap · Define more focused evaluation problems according to end-user feedback and potential overlap with other evaluation campaigns


TechnoVision-ROBIN

Technovision is a recent program of the French Ministry of Research and Technology that will fund evaluation projects in the area of computer vision. Many vision algorithms have been proposed in the past, but comparing their performance has been difficult owing to the lack of common datasets. Technovision aims to correct this by funding the creation of large, representative image datasets. ROBIN is a Technovision proposal covering the evaluation of object retrieval algorithms.
Definition of tracks and tasks (2007) · multi-class objects detection · generic objects detection · generic objects recognition · image categorization
Evaluation metrics of tasks · Detection: Recall for maximal Precision (R); Precision for maximal Recall (P); Equal Precision and Recall (EER); Area under the curve (AUC) · Discrimination: Discrimination at minimal uncertainty rate (D); Uncertainty at maximal discrimination rate (U); Equal discrimination and uncertainty rate (EDU); Confusion matrix at maximal uncertainty: (c, c) · Rejection: Equal Rejection Rate (ERR)
Type and size of data used · 6000 images from a static camera and images from a moving vehicle. · Satellite images containing 10000 regions of interest (128x128 pixels) · 6400 Aerial images and 1000 short videos containing vehicles and infrastructure elements · 10000 aerial images with computer synthesized objects · 15000 computer generated images · 1500 multi-sensor aerial images
Method used for the generation of the ground-truth Manual annotation
Participation statistics Data unavailable
URL http://robin.inrialpes.fr/
Conclusion First round is still running, no conclusion available yet


IAPR TC-12 Image Benchmark

IAPR TC-12 Benchmark consists of 20,000 images (plus 20,000 corresponding thumbnails) taken from locations around the world and comprising an assorted cross-section of still natural images, providing the resources to carry out evaluation of visual information retrieval from generic photographic collections (i.e. containing everyday real-world photographs akin to those that can frequently be found in private photographic collections as well). Each photograph is thereby associated with a semi-structured text caption in three languages: English, German and Spanish.
Definition of tracks and tasks (2007) The IAPR TC-12 Image Benchmark has not been used in a standalone evaluation event yet, but provided the resources for the following tasks: · ImageCLEFphoto (2006-2007): ad-hoc retrieval (with the query language either being identical or different from that used to describe the images) · ImageCLEF object classification/retrieval task (2007), purely visual · MUSCLE Live Retrieval Evaluation Event (2007)
Evaluation metrics of tasks · MAP as a lead performance measure · bpref, GMAP, P(20) as additional performance indicators
Type and size of data used · 20,000 still natural photographs of generic content (e.g. people, animals, cities, landscapes) · Detailed semi-structured captions in up to three languages (English, German, Spanish) · 60 query topics in TREC format (topic titles, narratives, and sample images)
Method used for the generation of the ground-truth · ImageCLEFphoto, Live Event: pooling is used with varying number depending on submissions; judgment scheme: relevant – partially – non-relevant; double judgments to analyze ambiguity; Interactive Search and Judge to complete pools with further relevant images · ImageCLEF object classification, Live Event: collections used were classified beforehand
Participation statistics 2005 Data unavailable
2006 Registrations: 36 (ImageCLEFphoto) Submissions: 12 (ImageCLEFphoto) Runs: 157 (ImageCLEFphoto)
2007 Registrations: 32 (ImageCLEFphoto), 22 (ImageCLEF object retrieval), 3 (Live Event) Submissions: 21 (ImageCLEFphoto), 7 (ImageCLEF object retrieval), 3 (Live Event) Runs: 616 (ImageCLEFphoto), 38 (ImageCLEF object retrieval), 3 (Live Event)
URL http://eureka.vu.edu.au/~grubinger/IAPR/TC12_Benchmark.html
Conclusion · New query topics will be created for 2008 · Evaluation events that will use the IAPR TC-12 Benchmark include: · ImageCLEF 2008 (ad-hoc retrieval task and object annotation task) · GeoCLEF 2008 · MUSCLE Live Retrieval Evaluation Event 2008



CIVR Evaluation Showcase

Image and video storage and retrieval continue to be one of the most exciting and fastest-growing research areas in the field of multimedia technology. However, opportunities for the exchange of ideas between different groups of researchers, and between researchers and potential users of image/video retrieval systems, are still limited. The International Conference on Image and Video Retrieval (CIVR) series of conferences was originally set up to illuminate the state of the art in image and video retrieval between researchers and practitioners throughout the world. This conference aims to provide an international forum for the discussion of challenges in the fields of image and video retrieval. Video and image retrieval systems find their way to regular conference demo sessions, but they are never exposed and run simultaneously. The CIVR Evaluation Showcase event aims to fill this lacuna. Specifically, we aim for a showcase that goes beyond the regular demo session: it should be fun to do for the participants and fun to watch for the conference audience. To reach this goal, a number of participants simultaneously do an interactive search task during the showcase event. At the CIVR 2007, three live evaluation events were held for the first time.
Definition of tracks and tasks (2007) · Video Retrieval (VideOlympics): textual search (e.g. “Find shots of a meeting with a large table.”) · Image Retrieval: text queries (e.g. "Find images of snowy mountains"); visual queries (e.g. “Where is the church shown in the example image?”) · Copy Detection: find real copies of entire long videos (from 1 minute to 3 hours); find copies of clips that are transformed (e.g. by cropping, fade cuts, flips, insertion of logos, etc.)
Evaluation metrics of tasks · Video Retrieval (VideOlympics): precision, recall, speed, best system voted by conference attendees, etc. · Image Retrieval: for the visual queries, the amount of time taken for the first correct answer to be found was recorded. For the text queries, the ratio of correct to incorrect images within the first N images returned was calculated. The value of N was based on the number of correct images for each query in the ground truth. · Video Copy Detection: quality metric based on the number of correct answers returned; speed metric. Note that the evaluation results will not be published; the emphasis is on demonstrating the capabilities of the technology for a well-defined task that interests many people.
Type and size of data used · Video Retrieval (VideOlympics): TRECVid 2006 test data (160 hrs of Arabic, Chinese, and US broadcast news) · Image Retrieval: extended IAPR TC12 dataset (21000 images) · Video Copy Detection: newly created dataset containing web video clips, TV archives and movies (~100 hours of video)
Method used for the generation of the ground-truth · Video Retrieval (VideOlympics): manual relevance judgements · Image Retrieval: manual relevance judgements · Video Copy Detection: the videos from which modified versions are generated are known
Participation statistics 2005 Data unavailable
2006 Data unavailable
2007 9 participants (VideOlympics) 3 participants (Image Retrieval) 10 participants (Video Copy Detection)
URL http://www.civr2007.com/showcase.php
Conclusion · Live retrieval evaluation includes: effect of the user interface; speed / efficiency of retrieval of the system; skill of the user · Currently no metrics exist to measure this.



SHREC (3D)

The Network of Excellence AIM@SHAPE is taking the initiative to organize a 3D shape retrieval evaluation event: SHREC - 3D Shape Retrieval Contest. The general objective is to evaluate the effectiveness of 3D-shape retrieval algorithms. The contest is organized in conjunction with the SMI conference (Shape Modeling International) where the evaluation results will be presented.
Definition of tracks and tasks (2007) · Watertight models (object models represented by seamless surfaces) · Partial matching · protein models · CAD models · Relevance feedback · Similarity measures · 3D faces
Evaluation metrics of tasks · Relevance measure (highly relevant, marginally relevant) · Precision, Recall · First, Second Tier · (Normalized) (Discounted) Cumulated Gain · Average Dynamic Recall
Type and size of data used Princeton Shape Benchmark (1814 classified polygonal models)
Method used for the generation of the ground-truth Manually established
Participation statistics Data unavailable
URL http://www.aimatshape.net/event/SHREC
Conclusion 3D media have specific properties/requirements, which justifies a 3D benchmarking initiative. However, the conceptual framework is similar to other benchmark initiatives, suggesting closer cooperation can be beneficial.
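
Some of the SHREC measures listed under the evaluation metrics are less familiar than precision and recall, so a small sketch may help. It is an illustration under assumptions: the tier measures follow the usual definition associated with the Princeton Shape Benchmark, the gain values (2 for highly relevant, 1 for marginally relevant) and the log2(rank + 1) discount are choices made for this example and may differ from the contest's own evaluation code, and the model identifiers are hypothetical.

```python
import math


def tier(ranked, class_members, multiple):
    """Share of class members found in the top multiple * |class| results.

    `ranked` excludes the query itself and `class_members` holds the other
    models of the query's class. multiple=1 gives the First Tier and
    multiple=2 the Second Tier.
    """
    k = multiple * len(class_members)
    found = sum(1 for model in ranked[:k] if model in class_members)
    return found / len(class_members)


def ndcg(ranked, gains):
    """Normalized discounted cumulated gain with a log2(rank + 1) discount."""

    def dcg(values):
        return sum(g / math.log2(rank + 1)
                   for rank, g in enumerate(values, start=1))

    actual = dcg([gains.get(model, 0) for model in ranked])
    ideal = dcg(sorted(gains.values(), reverse=True))
    return actual / ideal if ideal > 0 else 0.0


# Hypothetical retrieval result for one query model.
ranked = ["m1", "x", "m2", "y"]
class_members = {"m1", "m2"}
gains = {"m1": 2, "m2": 1}  # assumed: 2 = highly, 1 = marginally relevant

print(tier(ranked, class_members, multiple=1))  # First Tier
print(tier(ranked, class_members, multiple=2))  # Second Tier
print(ndcg(ranked, gains))
```

The tier measures depend on the class size, so they stay comparable across classes of different sizes, while the cumulated gain measures are the ones that can use the two-level relevance scale.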


MIREX

The Music Information Retrieval Evaluation eXchange (MIREX) is a community-based formal evaluation framework coordinated and managed by the International Music Information Retrieval Systems Evaluation Laboratory (IMIRSEL) at the University of Illinois at Urbana-Champaign (UIUC). IMIRSEL has been funded by both the National Science Foundation and the Andrew W. Mellon Foundation to create the necessary infrastructure for the scientific evaluation of the many different techniques being employed by researchers interested in the domains of Music Information Retrieval (MIR) and Music Digital Libraries (MDL). For the past two years MIREX participants have met under the auspices of the International Conferences on Music Information Retrieval (ISMIR). The first MIREX plenary convened 14 September 2005 in London, UK, as part of ISMIR 2005. The second plenary of MIREX 2006 was convened in Victoria, BC on 12 October 2006 as part of ISMIR 2006. Some of the tasks, such as "Audio Onset Detection," represent micro level MIR/MDL research (i.e., accurately locating the beginning of music events in audio files, necessary for indexing). Others, such as "Symbolic Melodic Similarity," represent macro level MIR/MDL research (i.e., retrieving music based upon patterns of similarity between queries and pieces within the collections).
Definition of tracks and tasks (2007) · Audio Artist Identification · Audio Classical Composer Identification · Audio Artist Identification subtask · Audio Genre Classification · Audio Music Mood Classification · Audio Music Similarity and Retrieval · Audio Onset Detection · Audio Cover Song Identification · Real-time Audio to Score Alignment (a.k.a Score Following) · (Postponed to possibly 2008) · Query by Singing/Humming · Multiple Fundamental Frequency Estimation & Tracking · Symbolic Melodic Similarity
Evaluation metrics of tasks · Human listening tests on similarity denoted on a broad scale (3 classes) and a fine scale (10 classes). · Objective statistics based on meta-data
Type and size of data used 5000 music files, 9 genres
Method used for the generation of the ground-truth Evaluated by human judgments
Participation statistics 2005 41 submissions 72 runs
2006 46 submissions 92 runs
2007 Data unavailable
URL http://www.music-ir.org/mirex2007
Conclusion · Challenges for Music Retrieval Benchmarking · Data and access to it: sufficient size, real-world, sufficient quality · Metadata: high-quality labels (production-style), ground truth annotation · Evaluation: automatic vs. human evaluation



INEX

The aim of the Initiative for the Evaluation of XML Retrieval (INEX), launched in 2002, is to establish an infrastructure and provide means, in the form of a large XML test collection and appropriate evaluation metrics, for the evaluation of content-oriented XML retrieval systems. INEX has a strong international character; participants from over 80 organisations, distributed across Europe, America, Australia, Asia, and the Middle East, have so far contributed to INEX. The main INEX Ad Hoc task focuses on text-based retrieval of XML fragments. The INEX Multimedia track is concerned with other types of media that can also be found in XML collections. Existing research on multimedia information retrieval has already shown that it is far from trivial to determine the combined relevance of a document that contains several multimedia objects. The objective of the INEX MM track is to exploit the XML structure, which provides a logical level at which multimedia objects are connected, to improve the retrieval performance of an XML-driven multimedia information retrieval system. INEX MM ran a pilot evaluation study in 2005 and has been established as an INEX track in 2006 and 2007.
Definition of tracks and tasks (2007) MMfragments task: The objective of this retrieval task is to find relevant multimedia XML fragments (i.e., XML elements or passages that contain at least one image) given a multimedia information need, which may contain visual or structural hints. Within the MMfragments task, there are three subtasks: · Focused: return a ranked list of elements or passages to the user. · Relevant In Context: return relevant elements or passages clustered per article to the user. · Best In Context: return articles with one best entry point to the user. MMimages task: The objective of this retrieval task is to find relevant images given a multimedia information need, which may contain visual hints. The requirement is to return a ranked list of documents (=image + metadata) from this collection. In this task, the type of the target element is defined, so it is closer to an image (or document) retrieval task than to XML element or passage retrieval.
Evaluation metrics of tasks
MMfragments task: Since the relevance assessments are performed at the sub-document level, systems are compared using effort-precision/gain-recall graphs, the eXtended Cumulated Gain (XCG) metrics used in many INEX tasks. The summary statistic of these, i.e., mean average effort precision, is also reported.
MMimages task: mean average precision and recall-precision graphs.
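To make the MMimages measure concrete, here is a minimal sketch of mean average precision over binary relevance judgements. The function names, topic identifiers and image identifiers are illustrative assumptions and are not taken from the INEX tooling; campaign results are normally computed with the organisers' own evaluation software.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked list against a set of relevant ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision measured at the rank of each relevant document.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs, qrels):
    """Mean of the per-topic average precision.

    runs:  {topic: [doc ids in ranked order]}  (hypothetical layout)
    qrels: {topic: set of relevant doc ids}    (hypothetical layout)
    """
    scores = [average_precision(runs.get(topic, []), rel)
              for topic, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical two-topic example.
runs = {"t1": ["img3", "img1", "img7", "img2"], "t2": ["a", "b"]}
qrels = {"t1": {"img1", "img2"}, "t2": {"a"}}
print(mean_average_precision(runs, qrels))  # 0.75
```

In this example topic t1 scores 0.5 (relevant images at ranks 2 and 4) and topic t2 scores 1.0, giving a mean of 0.75.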
Type and size of data used
The resources used for the multimedia track are based on Wikipedia data:
· Wikipedia XML collection: A Wikipedia crawl converted to XML, consisting of 659,388 XML documents with image identifiers added to the <image> tags for those images that are part of the Wikipedia image XML collection. This is the target collection for the MMfragments task.
· Wikipedia image collection: A subset of 171,900 images referred to in the Wikipedia XML collection is chosen to form the Wikipedia image collection.
· Wikipedia image XML collection: This XML collection is specially prepared for the multimedia track. It consists of XML documents containing the images in the Wikipedia image collection and their metadata. This is the target collection for the MMimages task.
· Image classification scores: For each image, the classification scores for the 101 MediaMill concepts are derived by the University of Amsterdam.
· Image features: For each image, the set of 120D feature vectors that has been used to derive the image classification scores is also available. These feature vectors can be used to build a custom CBIR system without having to pre-process or access the image collection.
Method used for the generation of the ground-truth
For both tasks, the topics are generated by the participants in the INEX MM track, who also perform the relevance assessments.
MMfragments task: This task requires assessments at the sub-document level; a simple binary judgement at the document level is not sufficient. Still, for ease of assessment, retrieved fragments are grouped by document. Once all participants have submitted their runs, the top N fragments for each topic are pooled and grouped by document. Assessors look at the documents in the pool and highlight the relevant parts of each document. The assessment system stores the relevance or non-relevance of the underlying XML elements.
MMimages task: TREC-style document pooling of the top N documents (= images + metadata) and binary assessments at the document level.
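The pooling step described above can be sketched as follows. This is only an illustration of top-N pooling in general: the data layout, the function name `build_pool` and the run and document identifiers are assumptions, not the INEX assessment system.

```python
from collections import defaultdict


def build_pool(runs, depth):
    """Union of the top `depth` results of every submitted run, per topic.

    runs: {run_name: {topic: [doc ids in ranked order]}}  (hypothetical layout)
    Returns {topic: sorted list of pooled doc ids} for the assessors to judge.
    """
    pool = defaultdict(set)
    for per_topic in runs.values():
        for topic, ranking in per_topic.items():
            # Only the first `depth` results of each run enter the pool.
            pool[topic].update(ranking[:depth])
    return {topic: sorted(docs) for topic, docs in pool.items()}


# Hypothetical example: two runs, one topic, pool depth N = 2.
runs = {
    "runA": {"t1": ["d1", "d2", "d3"]},
    "runB": {"t1": ["d2", "d4", "d5"]},
}
print(build_pool(runs, depth=2))  # {'t1': ['d1', 'd2', 'd4']}
```

Documents outside the pool (here d3 and d5) are never shown to assessors and are usually treated as non-relevant, which is why pool depth and the number of contributing runs affect how reusable a test collection is.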
Participation statistics
2005: 7 registrations, 5 submissions, 21 runs
2006: 20 registrations, 4 submissions, 31 runs
2007: 16 registrations, 4 submissions, 30 runs
URL http://inex.is.informatik.uni-duisburg.de
Synthesis and conclusion
· Realistic and sizable document collection; interesting additional resources; easy entry point for IR/DB researchers (no image analysis needed)
· Few participants; top performing runs use no visual information; too little data to be conclusive
· Re-usable test collection; inter-assessor agreement high; no submission bias


Cross-Language Speech Retrieval (CL-SR)

The CLEF Cross-Language Speech Retrieval (CL-SR) benchmark evaluates spoken document retrieval systems in a multilingual context. In 2006 the CL-SR track included search collections of conversational English and Czech speech using six languages (Czech, Dutch, English, French, German and Spanish). In CLEF 2007 additional topics were added for the Czech speech collection, and additional speech recognition results were available for the English speech collection. Speech content was described by automatic speech transcriptions, manually and automatically assigned controlled vocabulary descriptors for concepts, dates and locations, manually assigned person names, and hand-written segment summaries. Additional resources of word lattices and audio files can be made available. The track was coordinated by U. Maryland (US), Dublin City U. (IE) and Charles U. (CZ).
Definition of tracks and tasks (2006)
  • Task 1: retrieve pre-defined topics in ASR decoded speech archive (American English – spontaneous speech)
  • Task 2: retrieve pre-defined topics in ASR decoded speech archive (Czech – spontaneous speech)
Evaluation metrics of tasks
Type and size of data used
  • English task:
    o The resulting test collection contains 8,104 segments from 272 interviews totaling 589 hours of speech
    o 63 search topics
    o 8,104 coherent segments (equivalent of “documents” in a classic IR task)
    o 30,497 relevance judgements
    o ASR transcripts were provided by one partner (IBM for English)
Method used for the generation of the ground-truth
  • The collection from the Shoah Visual History Foundation contains a 10,000 hour subset for which manual segmentation into topically coherent segments was carefully performed by subject matter experts.
Participation statistics
2005: 7 participants
2006: English task: 6 participants; Czech task: 3 participants
2007: unknown
URL http://www.clef-campaign.org/2007/2007agenda.html http://clef-clsr.umiacs.umd.edu/

NIST Spoken Term Detection

The STD task is to find all of the occurrences of a specified “term” in a given corpus of speech data. For the STD task, a term is a sequence of one or more words. The evaluation is intended to help develop technology for rapidly searching very large quantities of audio data. Although the evaluation actually uses only modest amounts of data, it is structured to simulate the very large data situation and to make it possible to extrapolate the speed measurements to much larger data sets. Therefore, systems must be implemented in two phases: indexing and searching. In the indexing phase, the system must process the speech data without knowledge of the terms. In the searching phase, the system uses the terms, the index, and optionally the audio to detect term occurrences.
Definition of tracks and tasks (2006)
The STD task is to find all of the occurrences of a specified “term” in a given corpus of speech data. For the STD task, a term is a sequence of one or more words. Terms will be specified only by their orthographic representation. Example terms are “grasshopper”, “New York”, “in terms of”, “overly protective”, “Albert Einstein”, and “Giacomo Puccini”.
Evaluation metrics of tasks
Systems will be evaluated for both speed and detection accuracy. Speed and accuracy will be measured for a variety of conditions, for example as a function of term characteristics (such as frequency of usage and acoustical features) and corpus characteristics (such as source type and signal quality). Basic detection performance will be characterized in the usual way via standard detection error tradeoff (DET) curves of miss probability (PMiss) versus false alarm probability (PFA). Miss and false alarm probabilities are functions of the detection threshold, θ, and will be computed separately for each search term.
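As a rough illustration of how one point of a DET curve is obtained for a single term, the sketch below computes the miss and false alarm probabilities at a given detection threshold. It is a simplification under stated assumptions: the function names are hypothetical, each system detection is assumed to be already marked correct or spurious against the reference, and the number of non-target trials is passed in directly, whereas the official evaluation plan defines how that count is derived.

```python
def miss_and_false_alarm(detections, n_true, n_nontarget, threshold):
    """Miss and false alarm probabilities for one term at one threshold.

    detections:  list of (score, is_correct) pairs for the term
    n_true:      number of reference occurrences of the term
    n_nontarget: number of non-target trials (assumed to be given)
    """
    if n_true <= 0 or n_nontarget <= 0:
        raise ValueError("n_true and n_nontarget must be positive")
    # Only detections scoring at or above the threshold are kept.
    kept = [d for d in detections if d[0] >= threshold]
    n_correct = sum(1 for _, is_correct in kept if is_correct)
    n_spurious = len(kept) - n_correct
    p_miss = 1.0 - n_correct / n_true
    p_fa = n_spurious / n_nontarget
    return p_miss, p_fa


def det_points(detections, n_true, n_nontarget, thresholds):
    """(threshold, p_miss, p_fa) triples; sweeping thresholds traces the curve."""
    return [(t, *miss_and_false_alarm(detections, n_true, n_nontarget, t))
            for t in thresholds]


# Hypothetical detections for one term with 4 reference occurrences.
detections = [(0.9, True), (0.8, False), (0.6, True), (0.3, False)]
for point in det_points(detections, n_true=4, n_nontarget=100,
                        thresholds=[0.2, 0.5, 0.85]):
    print(point)
```

Raising the threshold removes spurious detections (lower false alarm probability) at the cost of more misses, which is the trade-off a DET curve displays.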
Type and size of data used
The development and evaluation corpora will include three languages and three source types.
- The three languages will be Arabic (Modern Standard and Levantine), Chinese (Mandarin), and English (American).
- The three source types will be Conversational Telephone Speech (CTS), Broadcast News (BNews), and Conference Room (CONFMTG) meetings, i.e., goal-oriented, small-group, roundtable meetings.
- 1-3 hours per language and source type
Method used for the generation of the ground-truth
Search queries are labeled manually for the test corpus.
Participation statistics
2006: 9 submissions
2008: planned
URL http://www.nist.gov/speech/tests/std/



NIST Rich Transcription

The Rich Transcription evaluation series is implemented to promote and gauge advances in the state of the art in several automatic speech recognition technologies. The goal of the evaluation series is to create recognition technologies that will produce transcriptions which are more readable by humans and more useful for machines. To this end, a set of research tasks has been defined, broadly categorized as either Speech-to-Text Transcription (STT) tasks or Metadata Extraction (MDE) tasks. The evaluation series was started in 2002 and continues to this day. The meeting recognition community is expanding the scope of the RT evaluations to include multimodal research covering both audio and video.
Definition of tracks and tasks (2007)
  • Evaluation of quality of automatic indexing of meeting recordings (4 measures):
  • Speech-to-text (STT) transcription rate
  • Diarization 1: Who spoke when
  • Diarization 2: Speech Activity Detection
  • Diarization 3: Source Localization
Evaluation metrics of tasks
  • Word error metric for the STT task
  • The Diarization Error Rate (DER) metric is used to assess SPKR system performance. DER is the ratio of incorrectly attributed speech time, (either falsely detected speech, missed detections of speech, or incorrectly clustered speech) to the total amount of speech time, expressed as a percentage
  • Diarization “Speech Activity Detection” (SAD) rate
  • Speaker Localization and Tracking Rate
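The DER definition given in the list above reduces to a simple ratio. The sketch below is a minimal illustration that assumes the three error durations have already been measured (for instance by a scoring tool); the function and parameter names are hypothetical and the figures are invented.

```python
def diarization_error_rate(false_alarm, missed, confusion, total_speech):
    """DER as a percentage of the total reference speech time.

    false_alarm:  time of falsely detected speech (seconds)
    missed:       time of missed speech (seconds)
    confusion:    time of speech attributed to the wrong speaker cluster (seconds)
    total_speech: total amount of reference speech time (seconds)
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return 100.0 * (false_alarm + missed + confusion) / total_speech


# Hypothetical meeting: 600 s of speech, 60 s of it incorrectly attributed.
print(diarization_error_rate(false_alarm=12.0, missed=30.0,
                             confusion=18.0, total_speech=600.0))  # 10.0
```

Because falsely detected speech is counted in the numerator but not in the denominator, DER can exceed 100% for a system that hallucinates a lot of speech.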
Type and size of data used
  • Speech recordings from lecture rooms
  • Speech recordings from meeting rooms
Method used for the generation of the ground-truth
  • Manual annotation
Participation statistics
2005: 9 participants (also partners from European projects: CHIL, AMI, ...); not all sites participated in all 4 tasks
2006: unknown
2007: unknown
URL http://www.nist.gov/speech/tests/rt/index.htm http://www.nist.gov/speech/publications/papersrc/rt05sresults.pdf



1.3. Conclusion

A willingness to cooperate has already been demonstrated through several common events (e.g. MUSCLE/ImageCLEF workshops, Chorus evaluation session). By bringing together a number of these initiatives into a single entity, a cross-disciplinary approach to multimedia retrieval benchmarking can be developed. Common evaluation tasks have already been identified across the different initiatives, which will allow them to join forces. Still, many open issues remain and need further work and discussion within the community. Several meetings have brought out hard issues that need deeper investigation; they are summarized below.
  • Technology assessment vs user satisfaction: The best-evaluated system may not be usable, and existing commercial systems often evaluate poorly, yet users are satisfied with commercial systems. What are we missing: accurate performance measures (make existing ones better) or relevant performance measures (find new ones)? More should be done to include the user perspective in evaluation.
  • Question raised at CLEF 2005: if we have good results in cross-lingual evaluation, why has none of the best systems achieved commercial success?
Tentative answer: conditions of test do not reflect the real use of the systems
  • Requirements for a user-oriented evaluation
Key issue: non-intrusive approach; real "subjects"; real applications; simulation (Wizard of Oz). What is needed is:
  • Basic Research Evaluation (validate research direction)
  • Technology Evaluation (assessment of solution for well defined problem)
  • Usage Evaluation (end-users in the field)
  • Impact Evaluation (socio-economic consequences)
  • Program Evaluation (funding agencies)
Our plans for the next stages of the project include mapping the current landscape of existing benchmark initiatives and assessing their differences and common properties, so that efforts can be combined to better address the remaining hard problems. We plan to continue our investigations in order to provide recommendations on best practices for the evaluation of methods and systems.



Annex I: Evaluation efforts (and standards) within ongoing EU Projects and National Initiatives


In this section we have tried to collect, through a questionnaire, information on the participation of ongoing European projects and national initiatives in evaluation campaigns. We have partial information from Sapir, Tripod, Vitalas, Aim@shape, Vidivideo and MultimediaN.


Project names: MultimediaN & VidiVideo
Arnold Smeulders & Marcel Worring 1

1- Internal technical evaluation within WPs
· Test Corpora Type (Text, audio, video…): video from TRECvid; video from surveillance (internal); ALOI static database of objects; MediaMill challenge
· Test Corpora Size: TRECvid: hundreds of hours, partially annotated in TRECvid manner; ALOI: 100 different recordings of 1000 objects = 100,000 images; MediaMill challenge: 101 concepts with ground truth and models based on TRECvid data.
· Performance measures (Mean precision,…): TRECvid style: Mean Average Precision; ALOI: recognition rates; Video Olympics: number of retrieved items in a five-minute period, and pleasantness of the interface as voted by potential users.

2- Participation in open national/European/international Benchmark initiatives:
· Name and level (european?) of the initiative: TRECvid: worldwide; ALOI: scientific; VOC: worldwide; VideoOlympics: worldwide
· Nbr of participants: TRECvid: 60 participants, all international, and growing; ALOI: downloads; Video Olympics: 9 participants
· How is generated the ground truth? TRECvid style: basic annotation supplemented by parties; ALOI: fully documented at scanning; Video Olympics: fully annotated for target questions
· Are you the organizer? TRECvid: no, NIST is the organizer. ALOI: yes, see Int Journal Comp Vision, Geusebroek & Smeulders. MediaMill challenge: yes, see ACM Multimedia 2006. Video Olympics: yes, see www.videolympics.org for information and a video impression of the first edition.

3- User Trials (feedback with real end-users, no relation with the provided technologies)
We think this is not a very useful question at this point. We work closely with the national video archive of the Netherlands in MultimediaN, VidiVideo and other projects. When there is a real need, we will engage real end-users in the first instance.
However, we are busy developing user group question types.

4- Participation in standardization effort:
· Label and name (MPEG7, JPSearch, XMLx…)
· Others: …
· More infos on this standardization context and objective:
· Abstract of your contribution: XML Dublin Core storage format of detected results.


Project name: SAPIR
Yosi Mass 1
1- Internal technical evaluation within WPs
· Test Corpora Type: we use FlickrXML files extracted from Flickr. Each file contains the text metadata as it appears in Flickr as well as 5 MPEG-7 Visual Descriptors (Scalable Color, Color Structure, Color Layout, Edge Histogram and Homogeneous Texture) extracted from the image.
· Test Corpora Size: - 40M images. We plan to grow to 100M images.
· Performance measures – currently we don’t have automatic measures. We use a UI to search for images that are similar to a given image possibly combined with Text.

2- Participation in open national/European/international Benchmark initiatives: · No

3- User Trials (feedback with real end-users, no relation with the provided technologies)
We defined 5 possible scenarios that can benefit from large-scale content-based search in audio-visual data. The 5 scenarios are: Tourist, Journalist helper, Music&Text, Advanced home messaging and Hollywood&Home. The scenarios can be found on the project site at http://www.sapir.eu. We then ran some focus groups to evaluate the scenarios.
· Number of these external users: - 5-7 per scenario
· Do the users belong to different communities? : Yes, some are novice and some are professional. For example for the Journalist scenario we interviewed some journalists.
· Trials protocol: We did some UI Sketches for the scenarios and then interviewed the participants in the focus groups
· User's satisfaction criteria: We measured along 3 dimensions: effectiveness, efficiency and satisfaction. We used the following criteria:
· Perceived effectiveness
Are you able to precisely formulate your request?
Do you get the requested results?
Do you get sufficient recall information to judge the value of the result?
Do you get sufficient precision information to judge the value of the result?
· Perceived efficiency
Do you formulate precise queries with minimal efforts?
Do you get the results within reasonable time?
Does the ranking and presentation of the results fit the intention of your quest?
· Perceived satisfaction
Do you find the service easy to use? E.g. no hassle, no errors, logical structure.
Do you find the service enjoyable to use (pleasant, comfortable, nice design, etc)
Do you get sufficient support? E.g. during the installation phase or when errors or unexpected situations occur.
Do you find the cost/benefit ratio reasonable?
Do you trust the providers of the service?
Do you find the service accessible? E.g. mobility issues
The findings from the focus groups are part of a deliverable that will be published on the project web site toward the end of the year (YE).

4- Participation in standardization effort:
· Label and name (MPEG-7, MPEG-A, MPEG-21, OMA)
· More infos on this standardization context and objective: to be supplied toward the YE
· Abstract of your contribution: to be supplied until the YE



Project name: Tripod
Mark Sanderson 1

1- Internal technical evaluation with related WPs
· Test Corpora Type (Text, audio, video…): Image collection
· Test Corpora Size: Several thousand
· Performance measures (Mean precision,…): Not entirely determined yet; some classic retrieval effectiveness measures and, for caption creation, possibly the BLEU or ROUGE measures.

2- Participation in open national/European/international Benchmark initiatives:
· Name and level (european?) of the initiative: geo-CLEF, possibly in the follow on to MUSCLE
· Web site of the initiative: http://www.clef-campaign.org/; www.muscle-noe.org
· Nbr of participants: ~15
· Nbr and Title of tasks: geoimage track
· Performance measures: Standard retrieval measures
· How is generated the ground truth? Relevance assessors
· How are maintained the test data collections? CLEF maintain the data
· Are you the organizer? Co-organiser

3- User Trials (feedback with real end-users, no relation with the provided technologies)
· Number of these external users: Still to be determined
· Do the users belong to different communities? : Large public? Professionals? (Which are…)
· Trials protocol: Still to be determined
· User's satisfaction criteria: Still to be determined

4- Participation in standardization effort:
· Label and name (MPEG7, JPSearch, XMLx…)
Tripod will build on the XMP standard
· Others: …
· More infos on this standardization context and objective:
· Abstract of your contribution: Tripod will evaluate two aspects of its outputs: 1) the quality of the image captions that it outputs; 2) the search engine that searches over the enhanced images. Evaluation of summaries will be conducted by assembling a set of existing manually captioned images and comparing a different set of automatically captioned images with the manual set. Retrieval evaluation will be conducted using a classic IR test collection approach. We plan to be strongly involved in CLEF and in the follow-on from the MUSCLE network of excellence. Our involvement will be in providing data sets to those exercises and in contributing to the experimental design.


Project name: AIM@SHAPE
Michela Spagnuolo

1- Internal technical evaluation with related WPs
· Test Corpora Type (Text, audio, video…): digital 3D objects
· Test Corpora Size: depending on the object represented
· Performance measures (Mean precision,…):

2- Participation in open national/European/international Benchmark initiatives:
· Name and level (european?) of the initiative: SHREC: 3D Shape Retrieval Contest, international initiative
· Web site of the initiative: http://www.aimatshape.net/event/SHREC/
· Nbr of participants & Nbr and Title of tasks: the contest is organized in tracks, each for a specific 3D retrieval task, either in terms of retrieval method (eg, partial/global) or shape type (eg protein/CAD models)
1- Watertight models. Eight groups initially registered, five groups actually participated.
2- CAD models. Nine groups initially registered, four groups actually participated
3- Partial matching. Five groups initially registered, only two actually participated.
4- Protein models. Three groups participated.
5- 3D face models. Seven groups initially registered, three actually participated.
· Performance measures: For each query there exists a set of highly relevant items and a set of marginally relevant items. Therefore, most of the evaluation measures have been split up as well according to the two sets. Measures used: true and false positives, true and false negatives, first and second tier, precision, recall, average precision, average dynamic recall, cumulated gain vector, discounted cumulated gain vector, normalized cumulated gain vector (see Section 4 of the attached SHERC06.PDF for a complete description of the performance measures)
· How is generated the ground truth? Manually, by track organizers
· How are maintained the test data collections? In the first two SHREC contests, they have been maintained by the organizers; we are considering the possibility to maintain them directly in the ShapeRepository of the AIM@SHAPE project (see shapes.aimatshape.net)
· Are you the organizer? AIM@SHAPE is organizing the contest, more precisely Remco Veltkamp, UU (email: remco.veltkamp@uu.nl)
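Several of the SHREC performance measures listed above are built on graded relevance. The sketch below illustrates the cumulated gain, discounted cumulated gain and normalized variants for one ranked list. The gain values (2 for a highly relevant item, 1 for a marginally relevant one, 0 otherwise) and the base-2 logarithmic discount are assumptions chosen for illustration; the contest report should be consulted for the exact definitions used.

```python
import math


def cumulated_gain(gains):
    """Running sum of the gain values along the ranked list."""
    out, total = [], 0.0
    for g in gains:
        total += g
        out.append(total)
    return out


def discounted_cumulated_gain(gains):
    """Gains at rank i >= 2 are divided by log2(i); rank 1 is not discounted."""
    out, total = [], 0.0
    for rank, g in enumerate(gains, start=1):
        total += g if rank == 1 else g / math.log2(rank)
        out.append(total)
    return out


def normalized_dcg(gains, ideal_gains):
    """Element-wise ratio of the DCG vector to that of an ideal ranking.

    ideal_gains: gains of the best possible ordering (an input assumed known);
    the result is truncated to the shorter of the two lists.
    """
    actual = discounted_cumulated_gain(gains)
    ideal = discounted_cumulated_gain(ideal_gains)
    return [a / b if b > 0 else 0.0 for a, b in zip(actual, ideal)]


# Hypothetical ranked list: highly relevant, irrelevant, marginal, highly relevant.
gains = [2, 0, 1, 2]
ideal = [2, 2, 1, 0]
print(cumulated_gain(gains))             # [2.0, 2.0, 3.0, 5.0]
print(discounted_cumulated_gain(gains))
print(normalized_dcg(gains, ideal))
```

The discount penalizes relevant items that appear late in the ranking, and normalizing against the ideal ordering makes scores comparable across queries with different numbers of relevant items.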

3- User Trials (feedback with real end-users, no relation with the provided technologies)
· Number of these external users:
· Do the users belong to different communities? : Large public? professionals? (which are…)
· Trials protocol:
· User's satisfaction criteria: not applicable, the contest is meant for a scientific audience

4- Participation in standardization effort: none
· Label and name (MPEG7, JPSearch, XMLx…)
· Others: …
· More infos on this standardization context and objective:
· Abstract of your contribution


Project name: VITALAS
Nozha Boujemaa

1- Internal technical evaluation
● Test Corpora Type (Text, audio, video…): image, audio, video, text
● Test Corpora Size used to develop the VITALAS system:
 ~ 1000 professional images + textual metadata per image
 ~ 100 hours of broadcast archive video + metadata per program
● Test Corpora Size used for large scale retrieval using the VITALAS system:
 ~ 3 million professional images + textual metadata per image
 ~ 10,000 hours of broadcast archive video + metadata per program
● Performance measures (Mean precision,…): Mean average precision and recall graphs

2- Participation in open national/European/international Benchmark initiatives:
● Name and level (european?) of the initiative:
 INEX Multimedia – international (Theodora Tsikrika (CWI) (a VITALAS partner) is one of the two organisers of the INEX Multimedia track).
 TRECVID - international
 ImageCLEF - international

3- User Trials (feedback with real end-users, no relation with the provided technologies)
● Number of these external users: not defined yet
● Do the users belong to different communities? : professionals (journalists, documentalists)
● Trials protocol: not defined yet
● User's satisfaction criteria: not defined yet

4- Participation in standardization effort:
● JPSearch, XQuery
● More infos on this standardization context and objective: The VITALAS project aims to offer significant contributions to the development of European and International Standards. The areas in which the project can make a substantial contribution include content representation, query languages for cross-media retrieval, and the evaluation of multimedia / cross-media retrieval systems.




Annex II: Related Chorus Events to Benchmarking and Evaluation

Below is the program of the Chorus Rocquencourt workshop. Slides of all presentations are available on the web site: http://www.ist-chorus.org/chorus-wg2.php
Short abstracts with links to each benchmark initiative are available at: http://www.ist-chorus.org/benchmark-initiatives-for-multim.php

CHORUS EVENTS
NAVS Chorus cluster, March 14th 2007

Agenda Chorus WG2 meeting - 14:30-17:30
Evaluation and Benchmarking of Multimedia Content Search Methods
The objective is to take stock of ongoing evaluation initiatives.
14:30 - 14:45 TrecVid - Alex Hauptmann (CMU - USA)
14:45 - 15:00 ImageClef - Henning Müller (UHG - Switzerland)
15:00 - 15:15 ImagEval - Pierre Alain Moellic (CEA - France)
15:15 - 15:30 Pascal Challenge & Robin - Frédéric Jurie (INRIA Rhône-Alpes)

15:30 - 15:45 Short Statements: IAPR-TC12 - Marcel Worring (UvA - Netherlands);
CIVR Evaluation Showcase - Allan Hanbury (VUT - Austria)
15:45 - 16:00 Coffee break
16:00 - 16:15 SHREC (3D) - Michela Spagnuolo (CNR - Italy)
16:15 - 16:30 MIREX - Andreas Rauber (VUT - Austria)
16:30 - 16:45 INEX - Thijs Westerveld (CWI - Netherlands)
16:45 - 17:30 Panel discussion:
1- Why so many benchmark initiatives? Are there commonalities?
2- How can they work closer together?
3- What are the main difficulties encountered: data collections, data annotation, task definition, task evaluation, participation...?
4- How can we face the identified problems?

Meeting closing: Next steps in Chorus WG2 activities - Nozha Boujemaa (INRIA - France)

Program and slides of CBMI’2007 panel are available on the web site: http://www.ist-chorus.org/events_0.php


CBMI Chorus Panel: June 25th 2007 Bordeaux
CBMI homepage

Topic: Benchmarking Multimedia Search Engines

Panel Chair: Nozha Boujemaa, INRIA - France (slides)
Panelists:

  • Stéphane Marchand-Maillet - University of Geneva, Switzerland (slides)
  • Christian Fluhr - CEA, France (slides)
  • Khalid Choukri - ELDA, France (slides)
With the contributions from Henning Mueller (SIM - Geneva), Paul Clough (Univ. Sheffield)

The panel will address the following questions:

  1. "Role of the user in the evaluation process of multimedia retrieval techniques: how difficult is it to include the user in the evaluation process?"
  2. "How to measure search engine performance/success: user satisfaction or technology accuracy?"
  3. "How to quantify the success in each situation? How much does it depend on scenarios and context (application)?"
  4. "Are the best-performing systems the most successful commercially?"
  5. With the ending question: "How useful is evaluation? Does it push knowledge ahead or kill innovation?"



Latest page update: made by pointjc, Mar 12 2008, 12:28 PM EDT. Moved from: State of the Art in audio-visual content indexing and retrieval technologies.