People are very good at learning new concepts after observing just a few examples. For instance, a child will confidently point out which animals are "dogs" after having seen only a few dogs in their life. This ability to learn concepts from examples and to generalize to new items is one of the cornerstones of intelligence.

The Xyggy Engine is founded on Bayesian Sets*, a new framework for machine learning based on how humans learn new concepts and generalize. In this framework, a query consists of a set of items which are examples of some concept; the Engine infers in real time which other items belong to that concept and retrieves them. For example, given a query containing the two animated movies "Lilo & Stitch" and "Up", it would return other similar animated movies, such as "Toy Story".

How does this work? Human generalization has been studied intensively in cognitive science, and various models have been proposed based on measures of similarity and feature relevance. Recently, Bayesian methods have emerged both as models of human cognition and as the basis of machine learning systems.

Consider a universe of items, where the items could be web pages, documents, images, ads, social profiles, audio, video, investments, resumes, medical records, or any other class of items we may want to query.

An individual item is represented by a vector of features of that item. For example, for text documents, the features could be counts of word occurrences, while for images the features could be the amounts of different color and texture elements.
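
As a toy illustration (the vocabulary and document here are invented), a word-count feature vector for a text document could be built like this:

    from collections import Counter

    # A tiny fixed vocabulary; feature j counts occurrences of vocabulary word j.
    vocab = ["dog", "movie", "protein", "building"]
    doc = "a movie about a dog and another dog"

    counts = Counter(doc.split())
    x = [counts[w] for w in vocab]    # -> [2, 1, 0, 0]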

Given a query consisting of a small set of items (e.g. a few images of buildings), the task is to retrieve other items (e.g. other images) that belong to the concept exemplified by the query. To achieve this, we need a measure, or score, of how well each available item fits in with the query items.

A concept can be characterized by a statistical model, which defines the generative process for the features of items belonging to the concept. The model's parameters control specific statistical properties of the features; for example, a Gaussian distribution has parameters which control the mean and variance of each feature. Generally these parameters are not known, but a prior distribution can represent our beliefs about plausible parameter values.
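
As a minimal sketch of such a generative process (the sizes and hyperparameters here are arbitrary, using the beta-Bernoulli model introduced below):

    import numpy as np

    rng = np.random.default_rng(0)
    alpha, beta = 1.0, 1.0                    # beta prior over each feature's rate
    theta = rng.beta(alpha, beta, size=20)    # unknown parameters, one per feature
    items = rng.random((5, 20)) < theta       # five items sampled from the concept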

The score used to rank the relevance of each item x, given the set of query items Q, compares the probabilities of two hypotheses. The first hypothesis is that the item x came from the same concept as the query items Q: under it, we compute the probability that the feature vectors of all the items in Q, together with x, were generated from the same model with the same, though unknown, parameters. The alternative hypothesis is that the item x does not belong to the same concept as the query examples Q: under it, we compute the probability that the features of x were generated from different model parameters than those that generated Q. The ratio of the probabilities of these two hypotheses is the Bayesian score, and it can be computed efficiently for any item x to see how well x "fits into" the set Q.
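
In symbols, writing p(.) for the probability of data with the unknown parameters integrated out against the prior, this ratio is

    score(x) = p(x, Q) / ( p(x) p(Q) )  =  p(x | Q) / p(x)

The numerator treats x and the query items as draws from one shared set of parameters, while the denominator treats them as generated independently; a score above 1 means that seeing Q makes x more plausible.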

This approach to scoring items can be used with any probabilistic generative model for the data, making it applicable to any problem domain for which such a model can be defined. In many instances, items can be represented by a vector of features, where each feature is either present or absent in the item. For example, in the case of documents the features may be words in some vocabulary, and a document can be represented by a binary vector x in which element j records the presence or absence of vocabulary word j in the document. For such binary data, a multivariate Bernoulli distribution can be used to model the feature vectors of items, where the jth parameter of the distribution represents the probability that feature j is present. Using the beta distribution as the natural conjugate prior, the score can be computed extremely efficiently.
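
A minimal sketch of this computation in Python/NumPy follows. The function name and the default choice of hyperparameters (scaling the prior to the empirical feature means, with an assumed scale factor of 2) are illustrative, not Xyggy's actual implementation; the closed-form score itself follows from the beta-Bernoulli model just described.

    import numpy as np

    def bayesian_sets_scores(X, query_idx, alpha=None, beta=None):
        # X: (n_items, n_features) binary matrix; query_idx: indices of Q.
        X = np.asarray(X, dtype=float)

        # Default prior: scale alpha/beta to the empirical feature means
        # (an assumed heuristic; the small constant avoids log(0)).
        mean = X.mean(axis=0)
        if alpha is None:
            alpha = 2.0 * mean + 1e-9
        if beta is None:
            beta = 2.0 * (1.0 - mean) + 1e-9

        Q = X[query_idx]
        N = Q.shape[0]
        s = Q.sum(axis=0)            # per-feature counts of 1s in the query
        alpha_t = alpha + s          # posterior beta parameters given Q
        beta_t = beta + (N - s)

        # The log score is linear in x: log score(x) = c + x . q
        q = np.log(alpha_t) - np.log(alpha) - np.log(beta_t) + np.log(beta)
        c = np.sum(np.log(alpha + beta) - np.log(alpha + beta + N)
                   + np.log(beta_t) - np.log(beta))
        return c + X @ q             # one log score per item

Because the log score is a linear function of the binary feature vector, scoring an entire collection reduces to a single (typically sparse) matrix-vector multiplication, which is what makes this kind of search fast.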

An important aspect is that the Engine learns in real time which features are relevant from queries consisting of two or more items. For example, a movie query consisting of "The Terminator" and "Titanic" suggests that the concept of interest is movies directed by James Cameron, and the Engine is therefore likely to return other movies by Cameron.
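
Continuing the sketch above with invented data, a toy catalogue makes this behaviour visible:

    # Invented binary features: [directed-by-Cameron, animated, sci-fi, romance].
    X = np.array([[1, 0, 1, 0],    # The Terminator
                  [1, 0, 0, 1],    # Titanic
                  [1, 0, 1, 0],    # Aliens
                  [0, 1, 0, 0],    # Toy Story
                  [0, 0, 0, 1]])   # The Notebook

    scores = bayesian_sets_scores(X, query_idx=[0, 1])
    ranking = np.argsort(-scores)
    # The feature shared by both query items (directed-by-Cameron) receives the
    # largest weight, so "Aliens" outranks the non-Cameron titles.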

The Xyggy Engine method has been applied to diverse problem domains including: unlabelled image search using low-level features such as color, texture and visual bag-of-words; movie suggestions using the MovieLens and Netflix ratings data; music suggestions using last.fm play count and user tag data; finding researchers working on similar topics using a conference paper database; searching the UniProt protein database with features that include annotations, sequence and structure information; searching scientific literature for similar papers; and finding similar legal cases, New York Times articles and patents.

The Xyggy Engine can also be used for ad retrieval through content matching, for building suggestion systems ("if you liked this you will also like these", which is about understanding the user's mindset, rather than the traditional "people who liked your choice also liked these"), and for finding similar people based on profiles (e.g. for social networks, online dating, recruitment and security). All these applications illustrate the wide range of problems for which Bayesian Sets provides a powerful new approach to finding relevant information.

The Xyggy Engine demonstrates that real-time machine intelligence is possible, using a Bayesian statistical model of human learning and generalization. This approach, based on sets of items, embodies two novel principles. First, retrieving items in response to a query can be seen as a cognitive learning problem, and we have used our understanding of human generalization to design the probabilistic framework. Second, retrieving items from large data sets requires fast algorithms, and the exact computations for the Bayesian scoring function are extremely fast.


(*) patent-pending