Multimedia search appears to have at least the following subclasses: P2P search and mobile search. They are both important: P2P search could very well be a model for searching the whole web as audiovisual content is becoming dominant and mobile search will cater to the increasing number portable devices. In this part of the document we will address these two areas showing their importance, their state of the art and the main research players. 1.2.1. Introduction
A P2P application is different from the traditional client/server model because it acts both as a client and a server. That is to say, while they are able to request information from other servers, they also have the ability to respond to requests for information from other clients, at the same time.
Figure 1 shows the architecture of a P2P network, where each node acts as a user interface, service provider, message router, and –possibly partial- resource repository. The links between nodes tend to be dynamic. The advantage of a peer-to-peer architecture compared to traditional client-server architectures is that a machine can assume the role that is most efficient for the performance of the network. This implies the load on the server is reduced/distributed, which allows for more specialized services. 
Figure 1: Architecture of a P2P network A typical peer-to-peer application has the following key features: ·
Peer discovery. The application must be able to find other applications that are willing to share information. Historically, the application finds these peers by registering to a central server that maintains a list of all applications currently willing to share, and giving that list to any new applications as they connect to the network. However, there are other means available, such as network broadcasting or discovery algorithms.
·
Querying peers for content. Once these peers have been discovered, the application can ask them for the content that is desired by the application. Content requests typically come from users, but it is possible that the peer-to-peer application is running on its own and performing its query as a result of some other routed network request.
·
Sharing content with other peers. In the same way that the peer can ask others for content, it can also share content after it has been discovered.
The social classification or collaborative tagging component of P2P is relatively important: some research work on the social aspects of P2P search related to the different kinds of folksonomies has been carried. Social search, in general, takes into account all user input to refine the search: social bookmarking and tagging, sharing personal item lists, etc… Social selection needs the active participation of the user. She shares compiled lists or tagged items, so that the content slowly grows. Specialized search engines only need an initial setup and can be refined after that. Users give useful information to the search engine by writing in search keywords. There is no need for additional input.Web search is almost exclusively under the control of centralized search engines. Lately, various projects have started building and operating a P2P web search network, but so far these endeavours are fairly small in scale. Ironically, Web search and Internet scale file content search seem to be perfect candidates for a P2P approach, for several reasons: 1) The data is originally highly distributed, residing on millions of sites (with more and more individuals contributing, e.g., through their blogs) 2) A P2P network could potentially dwarf even the largest server farm in terms of processing power and could thus enable much more advanced methods for computational intensive tasks, such as linguistic data analysis, statistical learning, or ontology based background knowledge and reasoning (all of which are out of the question when you have to serve hundred millions of queries per day on a, however big but centralized, server farm). 3) There is growing concern about the world’s dependency on a few quasi monopolistic search engines and their susceptibility to commercial interests, spam or distortion by spam combat, biases in geographic and thematic coverage, or even censorship. These issues have led to postulate that “the Web should be given back to the people”. The peer-to-peer (P2P) approach, which has become popular in the context of file-sharing systems such as Gnutella or KaZaA, allows handling huge amounts of data in a distributed and self-organizing way. In such a system, all peers are equal and all of the functionality is shared among all peers, so that there is no single point of failure and the load is evenly balanced across a large number of peers. These characteristics offer enormous potential benefits for search capabilities powerful in terms of scalability, efficiency, and resilience to failures and dynamics. Additionally, such a search engine can potentially benefit from the intellectual input (e.g., bookmarks, query logs, etc.) of a large user community. One of the key difficulties, however, is to efficiently select promising peers for a particular information need. Effective discovery methods rely on the information published for a particular resource. Commonly used discovery methods include: · Flooding broadcast queries
When a peer makes a query, the query is then broadcasted to all the neighbour peers. If its neighbour peers could not solve the query, then the query is broadcasted to neighbour’s neighbour peers. If a resource is found, that peer will send a message to the original sender of the query, indicating it can solve the query, and then establish a peer-to-peer connection. The original Gnutella implementation is an example of a flooding broadcast discovery mechanism. Each query has a time-to-live (ttl) counter. Typically, the ttl is set between 5 and 7, and the value is decremented by each node as it relays the message. Another counter tracks the number of hops. Once the ttl counter reaches zero, the query will be discarded. Due to the broadcast nature of each query, the system does not scale well (o(n2)); the bandwidth network assumption grows exponentially with a linear increase in the number of peers. Raising the number of peers in the system will cause the network to quickly reach bandwidth saturation. This type of method has the advantage of flexibility in the processing of queries. Each peer can determine locally how it will process the query and respond accordingly. It is simple to design and efficient. Unfortunately, it is suitable only for small networks. As well as that, this type of mechanism is very susceptible to malicious activity, rogue peers can send out large number of queries, which produce a significant load on the network.
· Selective forwarding systems Instead of sending a query to all peers, it is selectively forwarded to specific peers who are considered likely to locate the resource. Peers will become super peer automatically if they have sufficient bandwidth and processing power, i.e. if a peer has broadband connection and higher processing power. Peers with dial-up connection (low bandwidth) will make queries to super peers. This type of systems use flow control algorithm (fca), which tries to apply a form of intelligent flow control in terms of how a peer forwards request and response messages and a sensible priority scheme, as well as how it drops messages that won’t fit into the connections. Selective forwarding systems are more scalable than flooding broadcast systems. This approach greatly reduces bandwidth limitations to scalability. But it is susceptible to malicious activity: a rogue peer can insert itself into the network at the various points and misroute queries, or discard them altogether. Each peer must also contain some amount of information used to route or direct queries received. The size of this information is negligible in a small network, but in large networks, this overhead may grow to levels that are unacceptable, hence it is not suitable for a large peer network.
· Decentralized hash table networks In decentralized hash table networks, each file stored within the system is given a unique ID, typically a sha-1 hash of its content, which is used to identify and locate a resource. Given this unique ID, a resource can be located quickly despite the size of the network. Since this key identifies each resource, it is impossible to perform a fuzzy or keyword search within the network. If a peer is looking for a file from another peer, it must obtain this key first in order to retrieve the file. These systems are also susceptible to malicious activity by rogue peers: they may discard a query, insert large amount of frivolous data to clutter the key space, or flood the network with queries to degrade the performance.
· Centralized indexes and repositories Indexes of all peers and their resources are kept on a main server. A query is sent to a server, then the server will look-up the index, if the query can be solved, then the server will send a message to the original query sender explaining where he can get the file. Napster uses centralized indexed and repositories system. Centralized indexes have provided the best performance for resource discovery. The server in centralized indexes and repositories system is expensive, the bandwidth and hardware required to support large networks of peers are expensive. If the server in the system fails to function properly, it brings down the whole network. In the case of Napster, it has a cluster of servers, so that if one server fails, the rest of the servers will continue supporting the network.
· Distributed indexes and repositories The idea of distributed index is that each content broker in the network keeps an index of local files as well as an index of some files stored in some neighbouring content broker. When a content broker receives a query from a peer, it first checks to see if the query can be satisfied locally. If it cannot, it uses the local index to decide which content broker to forward the request to. The index on each server is not static and changes as files move through the system. With this approach, we could eliminate the need for expensive centralized servers. If well designed, distributed indexes and repositories provide currently one of the best performances and scalability. In addition, it has a high sngle point failure tolerance, because a content broker only contains a relative small number of indexes in comparison to the centralized server, so that if one content broker goes down, the network will still function properly. The problem with this type of indexing system is that if a file is changed locally by a peer, then the content broker will not be aware of this fact. Subsequently, when another peer requests that particular file, the content broker will return an out-of-date copy of that file. The overhead in keeping everything up-to-date and efficiently distributed is a major detriment to scalability. Peers joining and leaving the network from time to time,Also when a peer leaves the network, all the resources indexes stored in that peer will become unavailable to other peers. Distributed indexing systems, as they currently exist, cannot provide robust discovery in large networks.
· Relevance driven network crawlers Relevance driven network crawlers use a database of existing information the peer has accumulated to determine which resources it encounters may or may not be relevant or interesting to the peers. Over time, a large amount of information is accrued, which is analysed to determine what common elements the peer has found relevant. The crawler then traverses the networks, usually consisting of html documents for new information, which matches the profile distilled from the previous peer information. The time required for the crawler to traverse a large amount of content is very long; it is not suitable for large networks.
Except filesharing applications, like bittorennt, Gnutela, Edonkey, etc. there are only a dozen of companies working on P2P search engines, some have already a tool, others are in research or in development. Some like Minerva and Yahoo are developing specific algorithms, other like Faroo, Yacy, Open search are fully distributed P2P search engines. Most companies have distributed crawlers but a central index : Majestic, GPU, Grob and Boitho.
1.2.4. The State of the Art
There are many workshops and papers around the subject of P2P IR ( see references). Among principal recent worshop in this area we can name: Distributed IR at SIGIR 2004, P2PIR at SIGIR 2004, Heterogeneous and Distributed IR at SIGIR 2005, P2PIR 2005 and 2006 at CIKM, Large-Scale Distributed Systems for IR at SIGIR 2007, Adversarial IR on the Web, at WWW 2007 and IPTS. Most have good reports on line giving historical context. There are isolated solutions today for P2P text search and for P2P similarity search but for single AV features (e.g. for color or shape). There are no efficient solutions combining both text and multiple features (e.g. both color and shape). Of course all solutions have to deal with scalability. 1.2.4.1. The scalability viewpoint Comprehensive Web search based on a P2P network has been considered infeasible from a scalability viewpoint. Recent work, however, indicates that the scalability problems could be overcome, either by distributing a conceptually global keyword index across a DHT style network or by having each peer compile its local index at its own discretion (using the peer’s own "native" data sources or performing thematically focused Web crawls and other data extraction according to the peer’s interest profile). In addition, various acceleration techniques can be employed. For example, one pursues a multilevel partitioning scheme, a hybrid between partitioning by keyword and partitioning by document. Another uses view trees for result caching to improve the P2P search efficiency. From a query processing and IR viewpoint, one of the key issues is query routing: when a peer poses a query with multiple keywords and expects a high quality top10 or top100 ranked result list, the P2P system needs to make a judicious decision on which other peers the query should be forwarded. This decision needs statistical information about the data contents in the network. It can be made fairly efficiently in a variety of ways, like utilizing a DHT based distributed directory, building and maintaining a semantic overlay network (SON) with local routing indexes, or using limited forms of epidemic gossiping. However, efficiency of P2P query routing is only one side of the coin. Of course, we also expect good search result quality, that is, good effectiveness in IR terminology, measured in terms of precision and recall. The goal is to be as good as the best centralized search engines, but the P2P approach faces the challenge that the index lists and statistical information that lead to good search results are scattered across the network. For example, consider two or three keyword queries such as "Michael Jordan", "native American music", or "PhD admission". A standard, efficient and scalable, approach would decompose each of these queries into individual terms such as "native" and "American" and "music", identify the best peers for each of the terms separately, and finally combine them, e.g., by intersection or some form of score aggregation in order to derive a candidate list of peers to which the query should be forwarded. The result of this "factorization" would often lead to mediocre results as the best peers (and files located on those peers) for the entire query may not be among the top candidates for any of the individual keywords. The root cause of the above problem is that the outlined "factorized" method for P2P query routing and processing has no way of taking into account the correlation between the keywords in the query. We miss out on the fact that, for example, "PhD" and "admission" are statistically correlated in the corpus, and, even worse, that the best matches for the entire query should exhibit a higher than average frequency of both terms (ideally within some proximity window). Standard search engines do not necessarily consider these correlations either, but they process index lists on the overall document space directly, whereas the P2P system first needs to identify other peers for query routing in order to access index lists and then sees only partitions of the global index space. Thus, the necessarily coarser aggregation granularity of routing indexes or the distributed directory causes an additional penalty for a P2P approach. On the other hand, directly simulating the centralized algorithms in the P2P network would incur undue communication costs. One may argue that critical correlations of the above kind typically occur in composite names or phrases, as suggested by our examples. Although this is indeed often the case, the observation alone does not provide a solution. It is virtually impossible to foresee all phrases or names or correlated term pairs that will appear in important user queries, and brute force pre-computation of statistical measures for all possible pairs of terms is not a viable option.1.2.4.2. File sharing and Text only · Searching in unstructured P2Ps
In an unstructured P2P system, no rule exists that strictly defines where data is stored and which nodes are neighbours of each other. To find a specific data item, early work such as the original Gnutella used flooding, which is the Breadth First Search (BFS) of the overlay network graph with depth limit D. D refers to the system-wide maximum TTL of a message in terms of overlay hops. In this approach, the querying node sends the query request to all its neighbours. Each neighbour processes the query and returns the result if the data is found. This neighbour then forwards the query request further to all its neighbours except the querying node. This procedure continues until the depth limit D is reached. Flooding tries to find the maximum number of results within the ring that is centred at the querying node and has the radius: D-overlay-hops. However, it generates a large number of messages (many of them are duplicate messages) and does not scale well. Many alternative schemes have been proposed to address the problems of the original flooding. These works include iterative deepening, k-walker random walk, modified random BFS, two-level k-walker random walk, directed BFS , intelligent search, local indices based search, routing indices based search, attenuated bloom filter based search, adaptive probabilistic search, and dominating set based search . They can be classified as BFS based or Depth First Search (DFS) based. The routing indices based search and the attenuated bloom filter based search are variations of DFS. All the others are variations of BFS. In the iterative deepening and local indices, a query is forwarded to all neighbours of a forwarding node. In all other schemes, a query is forwarded to a subset of neighbours of a forwarding node. The searching schemes in unstructured P2P systems can also be classified as deterministic or probabilistic. In a deterministic approach, the query forwarding is deterministic. In a probabilistic approach, the query forwarding is probabilistic, random, or is based on ranking. The iterative deepening, local indices based search, and the attenuated bloom filter based search are deterministic. The others are probabilistic. Another way of categorizing searching schemes in unstructured P2P systems is regular-grained or coarse-grained. In a regular-grained approach, all nodes participate in query forwarding. In a coarse-grained scheme, the query forwarding is performed by only a subset of nodes in the entire network. Dominating set based search is coarse-grained, because the query forwarding is performed only by the dominating nodes in the CDS (Connected Dominating Set). All the others are regular-grained. Another taxonomy is blind search or informed search. In a blind search, nodes do not keep information about data location. In an informed search, nodes store some metadata, a process that facilitates the search. Blind searches include iterative deepening, k-walker random walk, modified random BFS, and two-level k-walker random walk. All the others are informed search. · Iterative deepening
Yang and Garcia-Molina borrowed the idea of iterative deepening from artificial intelligence and used it in P2P searching. This method is also called expanding ring. In this technique, the querying node periodically issues a sequence of BFS searches with increasing depth limits. The query is terminated when the query result is satisfied or when the maximum depth limit D has been reached. Iterative deepening is tailored to applications where the initial number of data items returned by a query is important. However, it does not intend to reduce duplicate messages and the query processing is slow. · k-walker random walk and related schemes
In the standard random walk algorithm, the querying node forwards the query message to one randomly selected neighbour. This neighbour randomly chooses one of its neighbours and forwards the query message to that neighbour. This procedure continues until the data is found. Consider the query message as a walker. The query message is forwarded in the network the same way a walker randomly walks on the network of streets. The standard random walk algorithm uses just one walker. This can greatly reduce the message overhead, but causes longer searching delay. In the k-walker random walk algorithm, k walkers are deployed by the querying node. That is, the querying node forwards k copies of the query message to k randomly selected neighbours. Each query message takes its own random walk. Each walker periodically “talks” with the querying node to decide whether that walker should terminate. Nodes can also use soft states to forward different walkers for the same query to different neighbours. K-walker random walk algorithm attempts to reduce the routing delay. On average, the total number of nodes reached by k random walkers in H hops is the same as the number of nodes reached by one walker in kH hops. Therefore, the routing delay is expected to be k times smaller. Another similar approach, called the modified random BFS, was proposed. The querying node forwards the query to a randomly selected subset of its neighbours. On receiving a query message, each neighbour forwards the query to a randomly selected subset of its neighbours (excluding the querying node). This procedure continues until the query stop condition is satisfied. Three approaches are considered and evaluated using k-walker random walk: owner replication, path replication, and random replication. All three schemes replicate the object found, when a query is successful. The owner replication replicates an object only at the requesting node. The path replication creates copies of an object on all nodes on the path, from the providing node to the requesting node. The random replication places copies on the p randomly selected nodes that were visited by the k walkers. The path replication implements the square-root replication. The random replication has slightly less overall search traffic than the path replication, because path replication intends to create object copies on the nodes that are topologically along the same path. Both the path replication and the random replication have less overall search traffic than the owner replication. · Directed BFS and intelligent search
The basic idea of directed BFS approach is that the query node sends the query message to a subset of its neighbours that will quickly return many high-quality results. These neighbours then forward the query message to all their neighbours just as in BFS. To choose “good” neighbours, a node keeps track of simple statistics on its neighbours, for example, the number of query results returned through that neighbour, and the network latency of that neighbour. Based on these statistics, the best neighbours can be intelligently selected using the following heuristics: ● The highest number of query results returned previously ● The least hop-count in the previously returned messages (i.e. the closest neighbours) ● The highest message count (i.e. the most stable neighbours) ● The shortest message queue (i.e. the least busy neighbours) By directing the query message to just a subset of neighbours, directed BFS can reduce the routing cost in terms of the number of routing messages. By choosing good neighbours, this technique can maintain the quality of query results and lower the query response time. However, in this scheme only the querying node intelligently selects neighbours to forward a query. All other nodes involved in a query processing still broadcast the query to all their neighbours, as in BFS. Therefore, the message duplication is not greatly reduced. There is also a similar approach called intelligent search. The query type considered is called the keyword query: a search for documents that contain desired keywords listed in a query. A query is represented using a keyword vector. This technique consists of four components: a search mechanism, a profile mechanism, a peer ranking mechanism, and a query similarity function. When the querying node initiates a query, it does not broadcast the query to all its neighbours. Instead, it evaluates the past performance of all its neighbours and propagates the query only to a subset of its neighbours that have answered similar queries before and therefore will most likely answer the current query. On receiving a query message, a neighbour looks at its local datastore. If the neighbour has the desired documents, it returns them to the querying node and terminates. Otherwise, the neighbour forwards the query to a subset of its own neighbours that have answered similar queries before. The query forwarding stops when the maximum TTL is reached. The cosine similarity model is used to compute the query similarity. Based on this model, the similarity between two queries is the cosine of the angle between their query vectors. To determine whether a neighbour answered similar past queries, each node keeps a profile for each of its neighbours. The profile of a neighbour contains the most recent queries that were answered by that neighbour. The profile is created and updated using two schemes. In one scheme, each peer continuously monitors the query and query response message. Queries answered by a neighbour are stored in the profile for that neighbour. In the second scheme, the peer that replies to a query message broadcasts this information to all its neighbours. Neighbours are ranked to facilitate the selection. · Local indices based search
The local indices intend to get the same number of query results as scoped-flooding with less number of nodes processing a query. In local indices, each node keeps indices of data on all nodes within k-hop distance from it. Therefore, each node can directly answer queries for any data in its local indices without resorting to other nodes. All nodes use the same policy P on the list of depths, at which the query should be processed. The nodes whose depths are listed in P check their local indices for the queried data and return the query result, if the sought data is found. These nodes also forward the query message to all their neighbours, if their depths are not equal to the maximum depth limit. All other nodes, whose depths are not listed in P, just forward the query message to all their neighbours, once receiving it, and do not check their local indices. When the depth limit is reached, the query is terminated even if the query result is not satisfied. Note that all nodes in a P2P system organized using local indices play equal roles. The local indices are updated when a node joins, leaves, or modifies its data. The local indices approach is similar to iterative deepening. Both broadcast the query message based on a list of depths; however, in iterative deepening, all nodes within the maximum depth limit process the query. In local indices, only nodes whose depths are listed in the policy P process the query. In addition, the iterative deepening approach spreads the query message iteratively with increasing TTL; the local indices approach spreads the query message once with the maximum TTL. · Routing indices based search
Routing indices is similar to directed BFS and intelligent search in that all of them use the information about neighbours to guide the search. Directed BFS only applies this information to selecting neighbours of the querying source (i.e. the first hop from the querying source.) The rest of the search process is just as that of BFS. Both intelligent search and routing indices guide the entire search process. They differ in the information kept for neighbours. Intelligent search uses information about past queries that have been answered by neighbours. Routing indices stores information about the topics of documents and the number of documents stored in neighbours. Routing indices considers content queries, queries based on the file content instead of file name or file identifier. One example of such a content query is: a request for documents that contain the word “networks”. A query includes a set of subject topics. Documents may belong to more than one topic category. Document topics are independent. Each node maintains a local index of its own document database based on the keywords contained in these documents. The goal of a Routing Index (RI) is to facilitate a node to select the “best” neighbours to forward queries. A RI is a distributed data structure. Given a content query, the algorithms on this data structure compute the top m best neighbours. The goodness of a neighbour is application dependent. In general, a good neighbour is the one through which many documents can be quickly found. A routing index is organized based on the single –hop routes and document topics. There is one index entry per route (i.e. per neighbour) per topic. An RI index entry, (networks, B), at node A stores information about documents in the topic: networks that may be found through the route (A-> B). This entry gives hints on the potential query result, if A forwards the query to B (i.e. the route A -> B is chosen); hence, the name Routing Index. A routing index entry is very different from a regular index entry. If (networks, B) were the regular index entry, it would mean that node B stores documents in the topic: networks. By organizing the index based on neighbours (routes) instead of destinations (indexed data locations), the storage space can be reduced. Three types of RIs, compound RI, hop-count RI, and exponentially aggregated RI, are proposed. They differ in RI index entry structures. A compound RI (CRI) stores information about the number of documents in each interesting topic that might be found, if a query is forwarded to a single-hop neighbour. The goodness of a neighbour for a query in CRI is the number of desired documents that may be found through that neighbour. · Attenuated bloom filter based search
The attenuated bloom filter based search assumes that each stored document has many replicas spread over the P2P network; documents are queried by names. It intends to quickly find replicas close to the query source with high probability. This is achieved by approximately summarizing the documents that likely exist in nearby nodes. However, the approach alone fails to find replicas far away from the query source. Bloom filters are often used to approximately and efficiently summarize elements in a set. A bloom filter is a bit-string of length m that is associated with a family of independent hash functions. Each hash function takes as input any set element and outputs an integer in [0, m). To generate a representation of a set using bloom filters, every set element is hashed using all hash functions. Any bit in the bloom filter, whose position matches a hash function result, is set to 1. To determine whether an element is in the set described by a bloom filter, that element is hashed using the same family of hash functions. If any matching bit is not set to 1, the element is definitely not in the set. If all matching bits in the bloom filter are set to 1, the element is probably in the set. If the element is indeed not in the set, this is called a false positive. Attenuated Bloom Filters are extensions to bloom filters. An attenuated bloom filter of depth d is an array of d regular bloom filters of the same length w. A level is assigned to each regular bloom filter in the array. Level 1 is assigned to the first bloom filter. Level 2 is assigned to the second bloom filter. The higher levels are considered to be attenuated with respect to the lower levels. To route a query for a file, the querying node hashes the file name using the family of hash functions. Then the querying node checks level-1 of its attenuated bloom filters. If level-1 of an attenuated bloom filter for a neighbour has 1s at all matching positions, the file will probably be found on that neighbour (1-hop distance from the query source). We call such a neighbour a candidate. The querying node then forwards the query to the closest one among all candidates. If no such candidate can be found, the querying node will check the next higher level (level-2) of all its attenuated bloom filters similarly to checking level-1. If no candidate can be found after all levels have been checked at the query source, this indicates that definitely no nearby replica exists. On receiving the query, a neighbour of the querying node looks up its local data store. If the data is found, it will be returned to the query source. If not, this neighbour will check its attenuated bloom filters similarly. During the query processing, if a false positive is found after d (the depth of the attenuated bloom filter) unsuccessful hops, the attenuated bloom filter based search terminates with a failure. No back tracking is allowed. The attenuated bloom filter approach can be combined with any structured approach to optimize the searching performance. We can use the attenuated bloom filters to try locating nearby replicas. If no nearby replica exists, we switch to the structured approach to continue the lookup. The hop-count RI is similar to the attenuated bloom filter approach. Both summarize the documents at some distance from the querying source. There are two differences between them. One is that the attenuated bloom filter is a probabilistic approach while the hop-count RI is a deterministic approach if omitting the document change. The other is that the attenuated bloom filter provides information about a specific file while the hop-count RI provides the number of documents on each document category but not a specific file. · Adaptive probabilistic search
In the Adaptive Probabilistic Search (APS), it is assumed that the storage of objects and their copies in the network follows a replication distribution. The number of query requests for each object follows a query distribution. The search process does not affect object placement and the P2P overlay topology. The APS is based on k-walker random walk and probabilistic (not random) forwarding. The querying node simultaneously deploys k walkers. On receiving the query, each node looks up its local repository for the desired object. If the object is found, the walker stops successfully. Otherwise, the walker continues. The node forwards the query to the best neighbour that has the highest probability value. The probability values are computed based on the results of the past queries and are updated based on the result of the current query. The query processing continues, until all k walkers terminate either successfully or fail (in which case the TTL limit is reached). To select neighbours probabilistically, each node keeps a local index about its neighbours. There is one index entry for each object, which the node has requested or forwarded requests for through each neighbour. The value of an index entry for an object and a neighbour represents the relative probability of that neighbour being selected for forwarding a query for that object. The higher the index entries value the higher the probability. Initially, all index values are assigned the same value. Then, the index values are updated as follows. When the querying node forwards a query, it makes some guess about the success of all the walkers. The guess is made based on the ratio of the successful walkers in the past. If it assumes that all walkers will succeed (optimistic approach), the querying node pro-actively increases the index values associated with the chosen neighbours and the queried object. Otherwise (pessimistic approach), the querying node proactively decreases the index values. Using the guess determined by the querying node, every node on the query path updates the index values similarly when forwarding the query. The index values are also updated when the guess for a walker is wrong. Specifically, if an optimistic guess is made and a walker terminates with a failure, then the index values for the requested object along that walker’s path are decreased. The last node on the path sends an update message to the preceding node. On receiving the message, the preceding node decreases the index value for that walker and forwards the update message to the next node on the reverse path. This update procedure continues on the reverse path until the querying node receives an update message and decreases the index value for that walker. If the pessimistic approach is employed and a walker terminates successfully, the index values for the requested object on the walker’s path are increased. The update procedure is similar. To remember a walker’s path, each node appends its ID in the query message during query forwarding and maintains a soft state for the forwarded query. If a walker A passes by a node, which another walker B stopped by before, the walker A terminates unsuccessfully. The duplicate message was discarded. Compared to the k-walker random walk, the APS approach has the same asymptotic performance in terms of the message overhead. However, by forwarding queries probabilistically to most promising neighbour(s) based on the learned knowledge, the APS approach surpasses the k-walker random walk in the query success rate and the number of discovered objects. The APS uses the same guess for all objects. This imprecision causes more messages. Therefore, the swapping-APS (s-APS) constantly observes the ratio of successful walkers for each object and swaps to a better update policy accordingly. The weighted-APS (w-APS) includes the location of objects in the probabilistic selection of neighbours. A distance function is embedded in the stored path of the query and is used in the index update. When the pessimistic guess is made for a walker and the walker succeeds, the index values for neighbours closer to the discovered object are increased more than those for distant neighbours. · Dominating set based search
In this approach, routing indices are stored in a selected set of nodes that form a connected dominating set (CDS). A CDS in a P2P network is a subset of nodes which are connected through direct overlay links. All other nodes that are not in the CDS can be reached from some node in the CDS in one-hop. Searching is performed through a random walk on the dominating nodes in the CDS. The construction of the CDS uses solely the local information: a node’s 1-hop and 2-hop neighbours. The construction consists of two processes: marking followed by reduction. The marking process marks each node in the P2P system as either a dominating node or a non dominating node. The marker T represents a dominating node, while the marker F represents a non-dominating node. A node is marked using T, if two of its neighbours are not directly connected (i.e. these two neighbours are not neighbours of each other). At the end of the marking process, all nodes with marker T form the CDS. To reduce the size of the CDS, two reduction rules are applied during the reduction process. Each node in the CDS is assigned a 1-hop ranking value. This ranking value is the sum of the number of documents on a node and the number of documents of the node’s neighbour that has the most documents. The first reduction rule specifies that if the neighbours of a node A in the CDS are a proper subset of neighbours of another node B in the CDS and the node A has a smaller 1-hop ranking value than node B, then remove node A from the CDS. The second reduction rule states that a node C is removed from the CDS, if the following three conditions are satisfied: 1) Two neighbours A and B of the node C are also dominating nodes.
2) The neighbour set of C is a proper subset of the union of the neighbour sets of A and B.
3) The node C has a 1-hop ranking value that is smaller than the values of both A and B.
Searching is conducted on the CDS as follows: if the querying source is not a dominating node, the source forwards the query to its dominating neighbour with the highest 1-hop ranking value. If the querying source is a dominating node, it forwards the query to its dominating neighbour with the highest 1-hop ranking value. This querying source also forwards the query to a non dominating neighbour, if that neighbour has the most documents among all neighbours of the querying source. On receiving a query request, a dominating node looks up its local database for the searched document and performs the query forwarding similarly to a querying source that is a dominating node. On receiving a query request, a non-dominating node only looks up the local database and does not forward the query any further. All found documents are returned from the hosting nodes to the querying source along the reverse query paths. The query stops when the TTL limit is reached or a node is visited the second time. The dominating set based approach intends to get the greatest number of documents by forwarding queries primarily on dominating nodes, which are well-connected and have many documents themselves or whose neighbours have many documents. The construction of the CDS does not incur more overlay links, as often occurs in super peers. The cost of creating and maintaining the CDS is lower than that of routing indices. · Searching in strictly structured P2Ps
In a strictly structured system, the neighbour relationship between peers and data locations is strictly defined. Searching in such systems is therefore determined by the particular network architecture. Among the strictly structured systems, some implement a distributed hash table (DHT) using different data structures. Others do not provide a DHT interface. Some DHT P2P systems have flat overlay structures; others have hierarchical overlay structures. A DHT is a hash table whose table entries are distributed among different peers located in arbitrary locations. Each data item is hashed to a unique numeric key. Each node is also hashed to a unique ID in the same key space. Each node is responsible for a certain number of keys. This means that the responsible node stores the key and the data item with that key or a pointer to the data item with that key. Keys are mapped to their responsible nodes. The searching algorithms support two basic operations: lookup(key) and put(key). Lookup(k) is used to find the location of the node that is responsible for the key k. put(k) is used to store a data item (or a pointer to the data item) with the key k in the node responsible for k. In a distributed storage application using a DHT, a node must publish the files that are originally stored on it, before these files can be retrieved by other nodes. A file is published using put(k). Different non-hierarchical DHT P2Ps use different flat data structures to implement the DHT. These flat data structures include ring, mesh, hypercube, and other special graphs such as de Bruijn graph. Chord uses a ring data structure. Pastry [uses a tree-based data structure that can be considered as a generalization of a hypercube. A d-dimensional toroidal space is used to implement the DHT in CAN. The space is divided into a number of zones. Each zone is a hyper-rectangle and is taken care of by a node. The zone boundaries identify the node responsible for that zone. The systems Koorde, Viceroy, and Cycloid have overlays with constant degrees. Koorde embeds a de Bruijn graph on the Chord ring for forwarding lookup requests. The overlay of Viceroy is an approximate butterfly network. The butterfly level parameter of a node is selected according to the estimated network size. Cycloid integrates Chord and Pastry and imitates the cube-connected-cycles (CCC) graph routing. Cycloid performs better than Koorde and Viceroy in large-scale and dynamic P2P systems. · Searching in hierarchical DHT P2Ps
All hierarchical DHT P2Ps organize peers into different groups or clusters. Each group forms its own overlay. All groups together form the entire hierarchical overlay. Typically, the overlay hierarchies are two-tier or three-tier. They differ mainly in the number of groups in each tier, the overlay structure formed by each group, and whether or not peers are distinguished as regular peers and super peers/dominating nodes. Super peers/dominating nodes generally contribute more computing resources, are more stable, and take more responsibility in routing than regular peers. We will focus on Kelips and Coral. · Kelips
Kelips is composed of k virtual affinity groups with group IDs. Inside a group, a file is stored in a randomly chosen group member, called the file’s home node. Thus Kelips offers load balance in the same group and among different groups. · Coral and related schemes
Coral is an indexing scheme. It does not dictate how to store or replicate data items. The objectives of Coral are to avoid hot spots and to find nearby data without querying distant nodes. A distributed sloppy hash table was proposed to eliminate hot spots. In DHT, a key is associated with a single value that is a data item or a pointer to a data item. In a DSHT, a key is associated with a number of values which are pointers to replicas of data items. · Other hierarchical DHT P2Ps
In Kelips and Coral, all peers play equal roles in routing. The differences among peers, such as processing power and storage capacity, are not considered. The nodes with more contributed resources are called super peers. Otherwise, they are called peers. A super peer may be demoted to a peer. A peer may also become a super peer. The system architecture consists of two rings: an outer ring and an inner ring. The outer ring is a Chord ring and consists of all peers and all super peers. The inner ring consists of only super peers. · Searching in non-DHT P2Ps
The non-DHT P2Ps try to solve the problems of DHT P2Ps by avoiding hashing. Hashing does not keep data locality and is not amenable to range queries. There are three big kinds of non-DHT P2Ps: SkipNet, SkipGraph, and TerraDir. SkipNet is designed for storing data close to users. SkipGraph is intended for supporting range queries. TerraDir is targeted for hierarchical name searches. Searching in such systems follows the specified neighbouring relationships between nodes. · Searching in loosely structured P2Ps
In loosely structured P2Ps, the overlay structure is not strictly specified. It is either formed based on hints or formed probabilistically. In Freenet and Phenix, the overlay evolves into the intended structure based on hints or preferences. In Symphony the overlay is constructed probabilistically. Searching in loosely structured P2P systems depends on the overlay structure and how the data is stored. In Freenet, data is stored based on the hints used for the overlay construction. Therefore, searching in Freenet is also based on hints. In Phenix, the overlay is constructed independent of the application. The data location is determined by applications using the Phenix. Therefore, searching in Phenix is application dependent. In Symphony, the data location is clearly specified but the neighbouring relationship is probabilistically defined. Searching in Symphony is guided by reducing the numerical distance from the querying source to the node that stores the desired data.1.2.4.3. Audiovisual search Indexing is essential for achieving efficiency in the management and querying of multimedia data. Moreover, index sharing is an essential aspect of the scalability objective, by ensuring a reasonable scaling of network resource consumption by distributed queries. In order to cope with the exponential growth of digital data, scalable and distributed storage structures need to be developed. By dynamically adding new computational and storage resources, such structures would distribute the data so that no centralized nodes are used for both search and maintenance transactions. Provided enough reliable computational power is available, this approach is able to solve the scalability problem through parallel execution of queries. The performance can even be tuned to the needs of specific applications by load balancing and properly adjusting the capacity of computational resources. Multimedia features can be indexed by assuming the metric space model of similarity. In this respect, SAPIR proposed four methods for similarity searching based on the P2P communication paradigm, often referred to in the literature as Scalable and Distributed Data Structures (SDDS). Specifically, the first two, designated GHT* and VPT* structures follow the basic generalized hyperplane and ball partitioning principles. The other two apply transformation strategies, where the metric similarity search problem is transformed into a series of range queries executed on existing distributed hash tables (DHT), for exact (range) matching over traditional attribute-like data. Following the well known designations of the underlying structures, they are called the MCAN and the M-Chord. Each of the four structures is able to execute similarity queries for any metric and they all exploit parallelism during query processing. All of them have experimentally been implemented over the same computer network and tested on several synthetic and real-life datasets. Preliminary results are very encouraging and basically confirm the hypothesis of constant scalability of such implementations. SAPIR aims at defining standard APIs for connecting and querying the distributed indices. The main objective is to achieve multi-feature similarity ranking based on P2P similarity indices developed for single features. The basic lesson learned is that the similarity score (or grade) a retrieved object receives as a whole depends not only on the scores it gets for individual predicates, but also on how such scores are combined. All these aspects influence the query execution costs. In order to understand the problem, consider a query for objects with circular shapes and red colour. In order to find the best match, it is not enough to retrieve the best matches for the colour features and the shapes. Naturally, the best match for the whole query need not be the best match for a single (colour or shape) predicate. To this aim, Fagin has proposed the so-called A0 algorithm that solves the problem. There have been several extensions of this work, but they don’t deal with similarities over different medias. They do not consider distributed environments as well. Complex similarity query execution over multiple distributed single-feature overlays represents an important challenge of SAPIR, because a naïve solution might result in overwhelming increase of communication costs in the underlying computer network. In principle, our approach will be based on the incremental nearest-neighbour algorithm executed on individual peers, coordinated by a modified Threshold Algorithm (TA) to efficiently obtain the global result. In particular, SAPIR will exploit properties of our P2P overlay networks, which pose some difficulties for a distributed execution of complex similarity queries, but at the same time they offer new structural properties that can be for such query execution exploited. Supposing multiple single feature overlays over the same physical P2P network, the routing processes of individual overlays can take advantage of sharing paths or at least some parts of them. At the same time, once a peer with potentially qualified items of feature one is reached and the items tested, the peer can also test the relevance of items belonging to feature two, provided they are derived from the same object. Naturally, this can be generalized to an arbitrary number of features. Such architecture can capitalize on independence of peers resulting in parallel query execution.1.3.1. Introduction Mobile search is the means people use on the ir portable devices to find content on or off portal directly by browsing or by entering a search query via the mobile version of an online Internet search engine, or by using a specialized mobile search function provided by an operator or other service provider and usually based on a white-label solution. This section is primarily focused on search functionality rather than browsing. Some people argue there is no difference between mobile search and traditional search. Others think there are substantial differences in the way results are presented to the user, essentially because of constraints on the size of the screen. Personalisation and localisation of mobile devices are also other important points to address. In this section we define some areas where these differences appear significantly in search. We will also assume the searched content isn’t specifically mobile, i.e. the search concerns the regular web. As P2P mobile isn’t a lot addressed in the research, P2P mobile search will be defined in a further deliverable and not in this section. · Personalized search and Context information
A mobile device is indicative of personalized services offered to each user. Mobile search can be personalized taking into account both the device characteristics (screen analysis, memory capabilities, applications installed), as well as the user’s history and the user’s contextual information. Personalized search in mobile environments has the advantage of focused results to match the user’s interests, as well as limiting the amount of results to cope with, considering the limited mobile capabilities concerning memory, bandwidth and processing power. As far as context is concerned, factors such as time and location can be employed to assist the mobile search. Better results can be yielded that are more relevant to the user’s general interests, but also to their short term interests. For example, for a user searching for musical concert tickets, the system should take into account the user’s location (country, city) and either fetch results with concert tickets in that location or present the “local” results on top on the list. · Results page layout
Mobile search results typically render the results in one long column as opposed to the multiple column layout that is often used to present traditional search results on PC browsers. Consequently, this makes different types of results, such as sponsored links, harder to spot, even when they are labelled, because they appear inline with the ordinary results. In an attempt to improve the usability and appeal of their product, many mobile search engines design their search engine like a portal, with links directly to specific information. This reduces the amount of typing necessary for the user to find what he or she is looking for. · Local & vertical results
The major mobile search engines are competing to create the best user experience possible. In many instances, doing so involves the search engines surmising the user’s search goal and presenting the user with those specific search results first. For that reason, mobile search engines put a higher focus on local and vertical (classical) results, frequently featuring them much more prominently than traditional web results. These can include: maps, local results, links to official sites, images, weather and even sports scores. These results are even more important to consider in the mobile web, because of their premium placement on limited mobile results pages. · Character limits
As you might expect, mobile search results are frequently truncated versions of what would normally appear in the traditional results page. If you are optimizing a mobile-specific site, there is a whole new set of character limits to work with when optimizing metadata. If you are optimizing an existing site to be found in both mobile and traditional search, you should abide by the character limits in traditional search, while at the same time remaining conscious of what will be omitted in the mobile search results. · URL display
In traditional search results, complete URLs are always provided for each search result, but this is not always the case in mobile search engines. Some mobile search engines will eliminate the ‘http://’ from the URL, or display only the domain in the search results, even though the result links to a deeper page on the site. Optimized sub-domains can be very useful in traditional SEO, but might be even more useful in mobile search engines, when everything after the domain extension (.com/.net/.co.uk etc.) is eliminated. Since savvy users sometimes evaluate display URLs to determine which result they will click on, the architecture of the URL can be used to influence that decision. To make this more concrete, consider a person looking for the results of a football game on a mobile phone. Which URL seems like it is the most likely to get you the information in the fewest number of clicks: A ESPN.com B NFL.ESPN.com C Football-Scores.ESPN.com D FootballScores.com The correct answer is likely a tie between options ‘C’ and ‘D.’ While ESPN is clearly an authority site, FootballScores.com and ESPN.com may lure some viewers away because of their simplicity. Optimized sub-domains are a good idea in some cases, but even in mobile SEO they are not always the best option. In some instances, users are more likely to click on simpler URLs, and other times they are not. · Recommendation
Terms recommendation or results recommendation is employed in many known search engines. In mobile environments, this could be useful in order to save users time from typing. Recommendation can either be used in terms of collaborative filtering, where recommended results are produced based on what other users have searched for, when searching for a specific concept, as well as from the user’s past behaviour in terms of a history log. [vsl1]Tout cette discussion ne semble pas tenir compte de la petite revolution apportée par le browser de l’iphone d’apple
There is a great diversity in mobile search engines. While the goal of all the mobile engines is the same, their approaches vary considerably. In this section we will present their main differences and the impact of these differences on Search engines optimisation.
· Presentation of results
One of the more frustrating differences between the mobile search engines is the number of results they present on the main results page, and the number of results that they will present on the secondary ‘web results’ page. Since mobile search engines are designed more like portals rather than traditional search engines, they have all come up with a variety of ways of presenting the information that is yielded from a search result. This can be handy for users, but makes tracking and comparison a bit trickier. In general, mobile search engines provide vertical results, ordered by relevance. Windows Live provides two mobile web results on the main results landing page, Google Mobile and AOL Mobile provide six, and Yahoo provides ten. An exception is Google iPhone, which presents eight web results but providing tabs along the top if the user needs to access local or vertical results.
· Search box location
The AOL mobile landing page provides a search box at the top and bottom of the page, but only on the bottom of the results page. Conversely, Yahoo OneSearch provides a search box at the top of the landing page, and a search box at the bottom and the top of the results page. Windows Live provides one search box at the top of the search landing page, and one at the bottom of the results page. Google iPhone provides only one search box at the top of the landing page and the top of the results page.
· Local & vertical results
Some mobile search engines, like AOL and Google iPhone will break local and vertical results into different tabs along the top of the page. Others present a mixed landing page with vertical results such as maps, weather forecasts, images and sports scores provided inline with web results. Google Mobile and Yahoo OneSearch both maintain results pages where the main focus is web results, but they do integrate some vertical results inline with web results. Conversely, AOL Mobile and Windows Live both provide mixed results that do not focus on any particular type of result.
· Location setting
It won’t be long before GPS enabled mobile devices set and update a user’s location automatically, but for now setting your location is still a manual process. While Google Mobile, AOL Mobile and Windows Live all allow you to set your location, Google iPhone and Yahoo OneSearch do not. Google and AOL Mobile both have options on the main search page to change your location. Google Mobile will allow you to set your location by city or zip code, but AOL takes it a step further and lets you specify your location down to the street address. Windows Live does not have links on the main search page to change your default location; instead, they update the user’s default location whenever the user searches for a specific geographic location, so if your default location is set to Denver, but you want information about a restaurant in Houston you can search for ‘PapaMia Houston’ and your default location will be updated to Houston for subsequent searches. Unfortunately, there are no options or instructions for changing the default location on the main search page, so users are left to figure this out on their own. Location settings can impact the local and vertical results that you are presented, and in the future may also affect the mobile web rankings as well. Currently, Google, AOL Mobile and Windows Live are tailoring the local and vertical results by the user’s default location, but are not tailoring web results by location.
· Keyword bolding
Traditional search engines will sometimes put the keyword(s) that you have searched for in bold to help your eye key into the most relevant results. Most of the mobile search engines, (all but Windows Live) have adopted this practice to varying degrees as well. Yahoo OneSearch will bold keywords in the title line, description and URL, while all of the Google driven engines, including Google Mobile, Google iPhone and AOL Mobile will only bold terms when they are located in the description part of the results. Windows live is the only engine evaluated that is not bolding any keywords in search results pages.
· User agent detection
Currently, Google Mobile, AOL Mobile and Microsoft OneSearch incorporate user agent detection to determine exactly what type of mobile device you are using to access their search engine. They will then use that information to optimize the results pages for viewing on your specific mobile device. This is done primarily to ensure images, maps and other graphics to are sized to fit the screen without right-to-left scrolling. In the future, this information could be integrated into the search algorithm to improve the ranking for pages that display well on your specific mobile device.
· Transcoding
Google Mobile, AOL Mobile and Windows Live all integrate transcoding software to re-arrange web pages that are designed for the traditional web and to make them viewable on a smaller screen. This is good news for sites that have yet to begin optimizing the user experience for the mobile web, but can also cause problems. Forms or JavaScript may be rendered un-usable on the transcoded version of the site, and the transcoded page may not provide adequate idea arrangement of the elements on the page. While transcoding improves the usability of the site in the short term, it may hinder SEO and can make interacting with the site more difficult. The transcoded page is hosted temporarily on the search engine server and domain, rather than on the original website. It is unclear weather transcoding impacts Google’s evaluation of the activity on your site, but it definitely makes it harder to get accurate links to the site because the URLs are re-formulated in the transcoding. Many of the mobile search engines have indicated that they recognize the ‘handheld’ style sheet, and will use it to render the site when it is available, but it is not always the case. In all cases, you can choose to view the html version of the site by clicking on a link at the bottom of the page, or simply performing your search in the traditional version of the search engine, rather than the mobile version.
· The Impact on mobile Search Engine Optimizer?
All of the differences that we can see amongst the mobile search engine players are simply an indication that the industry is still in its infancy, and has yet to develop standards. Mobile search engines are still determining how they can provide users with the best experience, and SEOs are still figuring out how to compare such variable results. The main conclusions that can be drawn is that mobile SEO is different from traditional SEO, but not so different that everything must be re-learned. Mobile SEOs must be patient for the mobile web and the mobile search experience to catch up with the traditional web that we have become so used to. It is an exciting time in mobile search, when things are constantly changing, standards are slowly being formed and nothing is taken for granted.
1.3.3. Main players
For the reasons explained in the last chapter the mobile search market is very fragmented. Expectations for mobile search and local mobile search in particular are rising. As mobile ad networks form, mobile M&A activity heats up and the search engines pour greater attention and resources into their mobile offerings. One could say we are on the cusp of a new mobile era. Indeed, as much as we can be reluctant to use the term, one could dub the forthcoming mobile Internet "Web 3.0." Of course people have been saying and predicting the emergence of the mobile Internet for almost 10 years. Forecasts and predictions rarely come true in their original time frames, but they typically do come true eventually. And today, the resources, infrastructure and consumer demand make a mobile Internet more tangible and much closer to reality. What took the desktop Internet roughly a decade to develop is happening in a much more condensed period of time in mobile. And for all its complexity and fragmentation, there are numerous companies working on making content access and delivery on mobile devices a much more intuitive and user-friendly experience. User experience is the key to mobile services, because once users adopt the mobile Internet (or variations thereof) in meaningful numbers, which is starting to happen, the ad dollars will flow and real money will be made. Right now the "mobile Internet" is really four separate areas that will eventually blend to varying degrees. Each of the four areas has big players we will try to classify in.
· Nouveau Directory Assistance & Voice Search
This category grows out of tried and true "directory assistance," the original form of local mobile search. In 2006 there were roughly 6.5 billion calls to 411 in the United States and many more billions around the world. Because of the Internet and other factors (e.g., corporations blocking 411), directory assistance continues to shift to mobile phones. So-called "operator assisted yellow pages" (live agents helping users finding listings and other information) were repeatedly tried and failed. However, today, ad-supported directory assistance appears here to stay. Eg mobile.Yell.com, V-enable.com,180srch.com.
· Text-Based Local Search
After directory assistance and its more sophisticated cousin voice search, the volume of usage in text messaging. Depending on whose numbers you believe, anywhere from 35 percent to 70 percent of U.S. mobile consumers send and receive text messages (with varying degrees of frequency). This is clearly where the volume of mobile data usage is today, as opposed to WAP browsing. However, text is arguably the least sexy mode of accessing information on a mobile device (if the most practical). One of the leaders in this category is 4Info, which is doing some impressive things and getting some very impressive CPM rates. The company, partly owned by newspaper publisher Gannett, is not exclusively about local but local is an important piece of what it's doing. And many of the voice search options in the first segment allow content and contact details to be received via text message in addition to audio. Eg 4info.com, ask.com, nownow.com.
· WAP Local Search
WAP usage, while numbering in the millions is still in an early stage of development and has much less adoption for many reasons, including hardware limitations, separation of text and mobile Internet pricing plans and so on. All the major search engines and portals, yellow pages sites and local search pure plays now have WAP sites. Yahoo's oneSearch is something of a standout in this category. Among others are wapreview.com and wapmcnearky.com.
· Local Mobile Applications
All major search providers also have downloadable applications, many of which are being pre-loaded on phones. Then there are interesting alternative content and search applications, represented by the Where and ZenZui "platforms." Applications offer by far the best and richest user experience. The problem for search engines (and users) is that they must be downloaded and so represent the smallest segment of the market with intrinsic barriers to adoption. Thus the challenge is to get applications preloaded on the next phone the user buys and/or to bring the application experience into a WAP environment.Eg maporama, local.com, mojopages.com.
· Bringing It All Together
Google's "diversified" approach is a metaphor for the challenges and fragmentation of the mobile market right now: the company has an offering in each of the above segments. There are numerous other companies, including Microsoft and Yahoo that have comparable offerings in most or all of the segments. The mobile market, just because of the proliferation of different handsets, will always be fragmented to some degree. But we can expect to see increasing integration of the types of functionality that are currently largely separated -- the blending of voice interfaces, text and WAP and, potentially, applications that come preloaded on phones (e.g., Google Maps on the iPhone). Speaking of the iPhone, it has done a great service to the market, refocusing the industry on the user experience and general usability in mobile. Consumers fundamentally want local information on the go and thus consumer demand is "pent up." Mobile usability and the "mobile Internet" now just have to catch up to the consumer.
1.3.4. The State of the Art
As the mobile user base expands, so do device storage capacities and wireless services. Not only are these phones accumulating more device-resident data such as email, appointments and photos, but they are also increasingly used as front-end interfaces to ever-larger external data sets, including web sites, traffic information, and Yellow Pages data. Many query-answer systems and web browser interfaces that target mobile platforms have debuted within the last year, including offerings from every major search engine. In many systems in the literature, emphasis is shown on interfaces that serve search engine services in mobile devices. While existing solutions do cater to small screens and low bandwidth, they are modelled after desktop web search, posing three primary usability issues for the mobile setting. First, they rely on text entry as the method of input, even though the persistent trend toward smaller phones is directly at odds with the goal of achieving the efficiency of full-size keyboard text entry. Second, the focus has been on search off the device, under-utilizing the device’s expanding processing power and storage capabilities and thus unnecessarily impoverishing the search UI. Finally, both the SMS and web search models support directed search tasks, but are less appropriate for browsing and exploratory search scenarios (“sense-making”) that are quite complementary to the mobile setting.
· Interfaces
Many information access interfaces present data attributes (metadata) that users can include in queries to large data sets, rather than expecting users to remember them. Dynamic query interfaces (Shneiderman) encourage iterative composition and refinement of complex queries by providing continuous visual update of the query results as users restrict data attributes included in the query.
Standard bookmarks and saved queries (De Luca et al.) help speed page revisitation, but most systems rely on device-specific text entry methods for ad hoc keyword search. Word prediction and completion algorithms have the potential to reduce the number of entered characters, but also have the drawback that most fail for non-dictionary words, and may still require users to select from several alternatives.
Karlson et al. present a novel approach for searching large data sets from a mobile phone. Existing interfaces for mobile search require keyword text entry and are not suited for browsing. They propose an alternative approach which uses a hybrid model that is based on iterative data filtering rather than on the tedious keyword entry. More specifically their approach involves navigation and selection of hierarchical metadata (facet navigation) with incremental text entry to further narrow the results. Information seeking strategies take many forms depending on the task at hand, user knowledge, and target data set. According to the task of search, they provide two definitions: a directed search is one in which the user knows the precise target in advance, while a browsing task is one characterized by specifying criteria that describe a data need, and which may evolve during search. They mention that such a task can assist in selecting between the traditional keyword search and their own facets-based approach. As far as facets are concerned, the use of data attributes (metadata) can be organized into orthogonal dimensions (facets) as a means not only to structure search results, but as a tool to guide users in formulating powerful Boolean queries. This approach not only reduces cognitive load through recognition, but allows users to reliably restrict results by attribute values rather than by keyword alone.
Their structural philosophy is counteracting the limitation of most mobile phones concerning the lack of touch screens. Their system, FaThumb, is optimized for keypad interaction. The Facet Navigation region is intentionally designed to map spatially to numbers 1 though 9 on the numeric keypad. While this design restricts the branching factor of the hierarchy (with a maximum 8 at each level), its depth and balance are dictated only by the target data set. For any domain, we believe consistency of facet location is crucial to promoting user mastery of the hierarchy. Thus they opted for a terminating tree, meaning users must return to the top of the tree to explore paths that lead from other top-level facets. On the other hand, as data sets grow, it may be appropriate to dynamically generate nodes within the facet tree to provide more efficient distribution of data among the available zones (e.g., consider date facets labelled by day for high frequency data, but by year for low frequency data). Dynamic strategies may be most effective at lower levels of the tree, since users may be more willing to browse a less familiar but more useful set of choices once they have already narrowed the data set by a few familiar initial selections.
An interesting work is done by Google. Kamvar and Baluja present a study of search patterns on Google’s mobile search interface. They examine search queries and the general categories under which they fall. Useful conclusions for mobile search interface can be drawn by observing users interaction, referring to the time users spend inputting a query, viewing the search results and how often they click on a result. They provide insight through large scale log analysis. A comparison between Google XHTML and PDA interfaces takes places, where it was found that the number of keywords forming a query is quite similar between desktop, PDAs and XHTML, while mobile users are a bit briefer. Concerning the categories of interest these seem to be more in the entertainment category (ringtones, adult content, celebrity searches) in the case of XHTML, since users consider their cell phone as a more personal and private device, whereas PDAs topics of search tend to be more business-oriented. The click-through rate across all categories was consistently low which suggests users are relying heavily on snippets in wireless search for their information. They believe users requesting the search results from the same query may be confusing the “Search” button for the “Next” link. The next link on the wireless page is much smaller and shown with much less context than its desktop equivalent. This in-depth examination of wireless search patterns seems suggests that the search interface should be alter ed concerning the design of mobile search engines interfaces.
· Community-based search
Church et al. point out that limited screen real-estate and restricted text input capabilities, from which mobile devices suffer, affect the usability of many mobile Internet applications. Most attempts to provide mobile search engines have involved making only simplistic adaptations to standard search interfaces. For example, fewer results per page are returned and the ‘snippet’ text associated with each result may be truncated. They attempt to deal with the snippet text issue proposing the I-SPY system. The I-SPY system can track and record past queries that have resulted in the selection of a given result page and we argue that these related queries can be used to help users understand the context of a search result in place of more verbose snippet text.
More specifically, the I-SPY search engine focuses on community-based search by recording the search histories (queries and result selections) of communities of like-minded individuals. Their concept can actually be considered to belong to the broader machine learning technique of collaborative filtering, as was mentioned in the introductory section concerning mobile search. This information is stored in a query-result hit-matrix that records the number of user selections that a result pj has received in response to a query qi and the information is used to adapt future result-lists for similar queries by promoting results that have been selected in the past. Thus, I-SPY gradually adapts to the learned needs of communities of individuals and this has been previously shown to significantly improve overall search performance.
As a conclusion, mobile internet search engines need an economic way to summarise the contents of their search results. Traditional snippet text is simply too verbose. In the I-SPY system the suggestions made include using previously successful queries as an alternative and provision of some preliminary empirical evidence that implies that these queries may be as informative as snippet text. These resultant queries take up less than half the space of snippet text and can also be used as a simple way for users to launch further more elaborated searches. All of these benefits suggest that related queries could be quite valuable in the mobile search domain.
· Results Classification
Hierarchical classifications have been used previously in search interfaces. Search results that include hierarchical labels can help users identify relevant items or further refine (or broaden) searches. Search engines such as Yahoo and OpenDirectory order results by relevance, but display each with a human-assigned hierarchical category; following the category hyperlink allows users to browse via the hierarchical category directory. Other systems help users quickly identify appropriate subsets from voluminous results by organizing results into hierarchical categories.
Nadamoto et al. deal with the problem of web results organization introducing a way of restructuring web search results for passive viewing. Their restructuring method is to classify search results dynamically into several groups, which they call carousels. By analyzing Web pages, their method classifies web pages of search results into four groups, based on similarity, difference, detail, summary relationships. The user can select a carousel and transit among carousels by a few interactions. The contents of each carousel are presented automatically and repeatedly. The user watches and listens to a carousel and does the simple interaction for a carousel transition.
· Contextual search
All of the above research work seems to focus on the technical limitations of mobile devices and their viewing capabilities. Most approaches develop methods of smart interfaces with navigation schemes or different type of information used for viewing that supplements the viewing of traditional search engines. Another issue is raised by Flanagan et al., where the exploitation of the user’s context seems to be an additional step so that mobile search engines make the difference. The user’s context refers to information related to the situation of an entity (person, object, place) relevant to the interaction between a user and an application:
● The user’s profile (explicit/implicit preferences, current activity in the device, history).
● The user’s general activity (location, date-time, orientation, acceleration).
● The user’s environment (temperature, humidity, sound, light).
● The user’s social environment (nearby people, current social situation).
Flanagan et al. propose an approach for using user’s context for adapting the mobile device profile. In this approach, the low-level context is captured through on-board sensors in the mobile device (user’s physical environment is directly monitored). More specifically, they monitor the user’s activity, i.e. orientation, stability, acceleration, in hand and environmental conditions, such as ambient or artificial illumination, noise, air humidity, temperature. The user does not provide any explicit feedback. The signals obtained from the sensors are recorded and various feature extraction algorithms are applied to generate low-level context. The low-level information is used to determine a higher-level context in a context hierarchy. The representation of low-level context is made using symbols. The generation of higher-level context is base on clustering the symbols (fusion from various sources) with a Symbol String Clustering algorithm. A profile of the device is constructed and is adapted automatically as the user passes through different contexts. Thus, the mobile device responds to the user’s context by changing its profile according to it.
A context-aware mobile application on mobile devices for mobile users is implemented by Coppola et al. Their system is constructed based on a distributed architecture for sending mobile application to the mobile device using context. Their system is constituted by two main modules: the MoBeSoul module which captures and handles user context, and a MoBeLet application, which is downloaded and executed on the mobile device. The concrete context is captured through physical sensors (noise, light level, temperature), “virtual” sensors (date, time, alarm time), explicit user actions (communication, profile selections) and context History (previous user’s actions). An inferential mechanism is implemented to derive abstract context (higher-level) using the concrete context (low-level). Both contexts have a probability measure representing how likely they are for the user. Contexts are divided to public (user’s approximate location) and private (user’s exact position, or other personal data). The automatically collected context, along with user’s demographic information and explicitly denoted user preferences are stored into databases (User Profile and Usage & Download Statistics). Then, the public context descriptors are transmitted to the MoBe Descriptors Server. The MoBe Descriptors Server selects the MoBeLet applications that are more relevant to the user’s context. The descriptors of the selected MoBeLet applications are transmitted to the mobile device and are filtered using the private context descriptors. The finally selected MoBeLet applications are downloaded to the mobile device in order to be executed.
A client-server software architecture, implemented in the mobile terminal, for adapting the mobile device profile, enhancing mobile applications and sending appropriate information to the mobile device by exploiting the user’s context is presented by Korpipää et al. The context information they take into account is provided by: sensors for capturing sound, light, acceleration, temperature, touch, etc, applications currently running, time information, or explicit user actions such as scheduled tasks, preferences, and network resources for communication activities. The captured context is processed for the extraction of the useful features (Resource-server). The extracted information is represented as concepts in a contextual ontology consisting of a Schema representing the structure and properties and a client usable, extendable vocabulary for describing context information (Context manager). Each context expression contain a type and value features. The low-level contexts participate in a reasoning process (using naïve Bayes classifier) for generating higher-level contexts. During application of this system, the device profile is changed, or the currently running application is enhanced according to the generated high-level context.
Apart from the results representation, context search offers improvements on the basic functionality of mobile search engines. The searching process is also affected in the work of Su and Lee. They refer to an approach which adapts the searching process according to the user’s active context. In order for the retrieval results to be more related to the user’s active task, the query is expanded by terms from the active document. The text processing method selects from the document the top ranked N words to expand the query. The search engine retrieves results with high similarity to the active documents besides of the query itself. They have experimented for investigating if similar (in terms of words) documents to the active user context are equal to useful documents. The evaluation methodology included the ranking of similar documents in comparison to a user provided document (pseudo context). They included different modes of query formulation: 5 keywords of current research area and topic, 5 keywords of the methods-algorithms in the topic and random combination of the two sets. The search results were the top ten ranked documents. The users were asked to manually rate the documents in terms of similarity, relevancy and usefulness. In the evaluation results it was shown that users judged similarities very differently to the program scores. In most cases users judged the similarity, relevance and usefulness of a document to a similar level. Documents with low similarity and high usefulness could provide users with additional knowledge. Documents with low usefulness but high similarity of words did not carry the key information (in the words) that the users were looking for.
1.4. References
A. Crespo, and H. Garcia-Molina, “Routing indices for peer-to-peer systems”, Proc. of the 22nd International Conference on Distributed Computing (IEEE ICDCS’02), 2002.
Baeza-Yates, R. Castillo, C. Junqueira, F. Plachouras, V. Silvestri, F. Challenges on Distributed Web Retrieval, ICDE, Istambul, 2007
Barroso, Dean, and Hoelzle, Web search for a planet: The Google cluster architecture, Micro IEEE, 2003. Bender, Michel, Triantafillou, Weikum, Design Alternatives for Large-Scale Web Search: Alexander was Great, Aeneas a Pioneer, and Anakin has the Force, SIGIR Workshop on Large Scale Distributed Systems for IR, 2007. Bhattacharjee, Chawathe, Gopalakrishnan, Keleher and Silaghi, Efficient Peer-To-Peer Searches Using Result-Caching, IPTPS, 2003. Bragante and Melucci, Homepage Finding in Hybrid Peer-to-Peer Networks, 8th RIAO Conference on Large-Scale Semantic Access to Content, 2007. Brin and Page, The anatomy of a large-scale hypertextual Web search engine, 1998. Brown Parallel and Distributed IR in Modern information retrieval Baeza-Yates and Ribeiro-Neto, 1999. B. Yang, and H. Garcia-Molina, Improving search in peer-to-peer networks, Proc. of the 22nd IEEE International Conference on Distributed Computing (IEEE ICDCS’02), 2002. Cacheda, Carneiro, Plachouras, Ounis, Performance analysis of distributed information retrieval architectures using an improved network simulation model, Information Processing Management, 2007. Callan, Distributed IR, 2000. Church, K., Keane, M. T., and Smyth, B. An Evaluation of Gisting in Mobile Search. In Proceedings of the 27th European Conference on Information Retrieval, Santiago de Compostela, Spain, 2005. Coppola, P., Della Mea, V., Di Gaspero, L., Mizzaro, S., Scagnetto, I., Selva, A., Vassena, L., Riziò, P.Z. MoBe: Context-Aware Mobile Applications on Mobile Devices for Mobile Users, Proceedings of the 1st International Conference on Exploiting Context Histories in Smart Environments (ECHISE2005), Munich, 2005. D. Tsoumakos, and N. Roussopoulos, “Adaptive probabilistic search in peer-to-peer networks”, Proc. of the 2nd International Workshop on Peer-to-Peer Systems (IPTPS’03), 2003. D. Tsoumakos, and N. Roussopoulos, “Adaptive probabilistic search in peer-to-peer networks”, technical report, CS-TR-4451, 2003. D. Tsoumakos, and N. Roussopoulos, “A comparison of peer-to-peer search methods”, Proc. of 2003 International Workshop on the Web and Databases, 2003. De Luca, E.W. and Nurnberger, A. Supporting information retrieval on mobile devices. Proc. Mobile HCI, 347-348, 2005. Flanagan, J. A., Himberg, J. and Mäntyjärvi, J. A Hierarchical Approach to Learning Context and Facilitating User Interaction in Mobile Devices, Artificial Intelligence in Mobile System 2003 (AIMS 2003) in conjunction with Ubicomp 2003, Seattle, USA, 2003. Kamvar, M., and Baluja, S. A large scale study of wireless search behavior: Google mobile search, Proceedings of the SIGCHI conference on Human Factors in computing systems, , Montréal, Québec, Canada, 2006. Karlson, A., Robertson, G., Robbins, D., Czerwinski, M. & Smith, G. FaThumb, A facet-based interface for mobile search. Proc. CHI 2006, ACM Press, 711—720, 2006. Korpipää, P., Mäntyjärvi, J., Kela, J., Keränen, H., Malm, E-J. Managing Context Information in Mobile Devices, IEEE Pervasive Computing Vol. 2, Is. 3, pp. 42–51, July-Sept, 2003. Li, Loo, Hellerstein, Kaashoek, Karger and Morris, On the Feasibility of Peer-to-peer Web Indexing and Search, 2003. Long and Suel, Three-Level Caching for Efficient Query Processing in Large Web Search Engines, WWW Conference, 2005. Lu and Callan, Content-based peer-to-peer network overlay for full-text federated search, 8th RIAO Conference on Large-Scale Semantic Access to Content, 2007. Luu, Klemm, Podnar, Rajman, Aberer, ALVIS Peers: A Scalable Full-text Peer-to-Peer Retrieval Engine, P2PIR, 2006. M. Bender, S. Michel, P. Triantafillou, G. Weikum, and C. Zimmer, “P2p content search: Give the Web back to the people,” February 2006, international Workshop on Peer-to-Peer Systems (IPTPS). Meng Yu, and Liu, Building efficient and effective metasearch engines, ACM Computing Surveys, 2002. Moffat, Webber, Zobel and Baeza-Yates, A pipelined architecture for distributed text query evaluation, 2007. Nadamoto, A., Kondo, H., and Tanaka, K. WebCarousel: Restructuring Web search results for passive viewing inmobile environments, Proceedings of the Seventh International Conference on Database Systems for Advanced Applications (DASFAA 2001), Hong Kong, China, 2001. Parreira, Michel, Bender, Size Doesn’t Always Matter: Exploiting PageRank for Query Routing in Distributed IR, P2PIR, 2006. Puppin, Silvestri and Laforenza, Query-Driven Document Partitioning and Collection Selection, Infoscale 2006. Q. Lv, P. Cao, E. Cohen, K. Li, and S. Shenker, “Search and replication in unstructured peer to-peer networks”, Proc. of the 16th ACM International Conference on Supercomputing (ACM ICS’02), 2002. S. C. Rhea, and J. Kubiatowicz, “Probabilistic location and routing”, Proc. of the 21st Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM’02), 2002. Shneiderman, B. Dynamic queries for visual information seeking. IEEE Software, 11, 6 (1994), 70-77. Skobeltsyn, Aberer, Distributed Cache Table: Efficient Query-Driven Processing of Multi-Term Queries in P2P Networks, P2PIR, 2006 Su, J. and Lee, M. An Exploration in Personalized and Context-Sensitive Search, Proceedings of the 7th Annual UK Special Interest Group for Computational Linguists Research Colloquium, 2003. Tang and Dwarkadas, Hybrid Global-Local Indexing for Efficient Peer-to-Peer Information Retrieval, NSDI’04, 2004. V. Kalogeraki, D. Gunopulos, and D. Zeinalipour-yazti, A local search mecha nism for peerto- peer networks, Proc. of the 11th ACM Conference on Information and Knowledge Management (ACM CIKM’02), 2002. Wang, Reinders, Lagendijk, Pouwelse, Self-organizing distributed collaborative filtering, SIGIR, 2005. Wray Buntine, Open source, distributed and Peer to Peer IR, ESSIR 2007 Xiuqi Li and Jie Wu, "Searching Techniques in Peer-to-Peer Networks", CRC press, 2005. Zhang and Suel, Optimized Inverted List Assignment in Distributed Search Engine Architectures, IPDPS, 2007. Zhang and Suel, Efficient Query Evaluation on Large Textual Collections in a Peer-to-Peer Environment, IEEE International Conference on Peer-to-Peer Computing, 2005.