Brotli is an open-source general-purpose data compressor introduced by Google in late 2013 and now adopted in most major browsers and Web servers. It is publicly available on GitHub, and its data format was published in July 2016 as RFC 7932. Brotli is based on the Lempel-Ziv compression scheme and is intended as a generic replacement for GZIP and ZLIB. The main goal in its design was to compress data on the Internet, which meant optimizing the resources used at decoding time while achieving maximal compression density. This paper provides the first thorough, systematic description of the Brotli format, as well as a detailed computational and experimental analysis of the main algorithmic blocks underlying the current encoder implementation, together with a comparison against compressors of different families constituting the state of the art either in practice or in theory. This treatment allows us to raise a set of new algorithmic and software-engineering problems that deserve further attention from the scientific community.
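As a point of reference for the formats Brotli targets, the following minimal sketch shows the compress/decompress round trip using Python's standard-library zlib; Brotli itself is not in the standard library, and third-party bindings (e.g., the `brotli` package) expose a similar pair of calls. The payload and level are illustrative, not taken from the paper.

```python
import zlib

# Brotli is positioned as a replacement for the DEFLATE-based gzip/zlib
# formats, which Python ships in its standard library. Third-party Brotli
# bindings mirror the same compress/decompress round trip sketched here.
def roundtrip(data, level=9):
    """Compress and decompress, returning (compressed size, recovered data)."""
    compressed = zlib.compress(data, level)
    return len(compressed), zlib.decompress(compressed)

payload = b"the quick brown fox jumps over the lazy dog " * 100
size, recovered = roundtrip(payload)
```

Repetitive, web-like text such as this compresses well under any LZ-family scheme; Brotli's design aims to shrink it further while keeping decoding cheap.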
In this paper, we propose a seed-guided topic model for dataless text filtering and classification, named DFC. Given a collection of unlabeled documents and, for each category, a small set of seed words relevant to the semantic meaning of that category, DFC filters out the irrelevant documents and classifies the relevant documents into the corresponding categories through topic inference. DFC models two kinds of topics: category-topics and general-topics. There are, in turn, two kinds of category-topics: relevant-topics and irrelevant-topics. Each relevant-topic is associated with one specific category, representing its semantic meaning, while the irrelevant-topics represent the semantics of the unknown categories covered by the document collection. DFC assumes that each document is associated with a single category-topic and a mixture of general-topics. A document is filtered, or classified, based on its posterior category-topic assignment. Experiments on two widely used datasets show that DFC consistently outperforms the state-of-the-art dataless text classifiers both for classification with filtering and for classification without filtering, and achieves comparable or even better classification accuracy than the state-of-the-art supervised learning solutions. We also conduct a thorough study of the impact of seed words on existing dataless classification techniques. The results reveal that the document coverage of the seed words is the main factor affecting dataless classification performance.
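DFC infers posterior category-topic assignments with a topic model; as a much simpler illustration of the seed-guided filter-and-classify idea, the sketch below scores each document against each category's seed set and filters out documents matching no category. All names and thresholds are illustrative, not the paper's method.

```python
def classify_or_filter(doc_tokens, seed_words, min_hits=1):
    """Assign a document to the category whose seed words it mentions most,
    or return None (document filtered out) when no category reaches min_hits.
    seed_words: dict mapping category name -> set of seed words."""
    tokens = set(doc_tokens)
    best_cat, best_hits = None, 0
    for cat, seeds in seed_words.items():
        hits = len(tokens & seeds)
        if hits > best_hits:
            best_cat, best_hits = cat, hits
    return best_cat if best_hits >= min_hits else None

seeds = {"sports": {"game", "team", "score"}, "finance": {"stock", "market"}}
label = classify_or_filter("the team won the game".split(), seeds)
rejected = classify_or_filter("a quiet walk in the park".split(), seeds)
```

The topic model replaces this brittle literal matching with probabilistic evidence, which is what lets documents containing no seed word still be classified correctly.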
In this paper, we extensively study the use of syntactic and semantic structures obtained with shallow and full syntactic parsers for answer passage reranking. We propose several dependency- and constituent-based structures, also enriched with Linked Open Data (LD) knowledge, to represent pairs of questions and answer passages. We encode such tree structures in learning-to-rank (L2R) algorithms using tree kernels, which project them into a tree-substructure space, where each dimension represents a powerful syntactic/semantic feature. Additionally, since we define links between structures, the tree kernel space also includes relational features spanning question and passage structures. Our findings can be useful to build state-of-the-art systems: (i) relational syntactic structures are essential to achieve the state of the art; (ii) full syntactic dependencies can outperform shallow models; (iii) external knowledge obtained with specialized classifiers, such as focus and question classification, is very effective; and (iv) the semantic information derived from LD and incorporated in syntactic structures can be used to replace specific knowledge classifiers: this is a remarkable advantage in terms of wide coverage. We demonstrate our findings by carrying out an extensive comparative experimentation on two different TREC QA corpora and on WikiQA, a recent corpus widely used to test sentence rerankers.
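Tree kernels measure the similarity of two parse trees by counting shared substructures. As a rough sketch of the idea (far simpler than the subset-tree kernels used in the paper), the fragment below counts matching production rules between two constituency trees represented as nested tuples; tree shapes and labels are illustrative.

```python
def productions(tree):
    """Collect production rules (label -> child labels) from a nested-tuple
    tree. A tree is (label, child, child, ...); a leaf is a bare string."""
    if isinstance(tree, str):
        return []
    label, *children = tree
    rule = (label, tuple(c if isinstance(c, str) else c[0] for c in children))
    out = [rule]
    for c in children:
        out.extend(productions(c))
    return out

def tree_kernel(t1, t2):
    """Count matching production pairs: a crude stand-in for a tree
    kernel's count of common fragments."""
    p2 = productions(t2)
    return sum(p2.count(p) for p in productions(t1))

question = ("S", ("NP", "who"), ("VP", "wrote", ("NP", "it")))
passage = ("S", ("NP", "he"), ("VP", "wrote", ("NP", "it")))
k = tree_kernel(question, passage)
```

Real tree kernels count all common fragments recursively (and here, also across linked question-passage structures), so each shared substructure becomes one dimension of the implicit feature space.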
In fine-grained tweet geolocation, tweets are linked to the specific venues (e.g., restaurants, shops) from which they were posted. This explicitly recovers the venue context, which is essential for applications such as location-based advertising or user profiling. For this geolocation task, we focus on geolocating tweets that are contained in tweet sequences. In a tweet sequence, tweets are posted from some latent venue(s) by the same user within a short time interval. This scenario arises from two observations: (1) it is quite common for users to post multiple tweets in a short time, and (2) most tweets are not geocoded. To more accurately geolocate a tweet, we propose a model that performs query expansion on the tweet (the query) using two novel approaches. The first approach, temporal query expansion, considers users' staying behavior around venues. The second approach, visitation query expansion, leverages users' revisiting of the same or similar venues in the past. We combine both query expansion approaches via a novel fusion framework and overlay them on a Hidden Markov Model to account for sequential information. In comprehensive experiments across multiple datasets and metrics, we show our proposed model to be more robust and accurate than other baselines.
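The HMM overlay can be sketched as standard Viterbi decoding over candidate venues, where each tweet's emission score is its (expanded-query) match against a venue and transitions capture the plausibility of moving between venues. The venues, probabilities, and scoring below are illustrative, not the paper's model.

```python
def viterbi(states, trans, emit_scores):
    """Most likely venue sequence for a tweet sequence.
    trans[a][b]: probability of moving from venue a to venue b;
    emit_scores[t][s]: match score of tweet t (after query expansion)
    against venue s. A uniform prior over venues is assumed."""
    path = {s: ([s], emit_scores[0][s]) for s in states}
    for scores in emit_scores[1:]:
        new_path = {}
        for s in states:
            prev, (seq, p) = max(
                ((a, path[a]) for a in states),
                key=lambda item: item[1][1] * trans[item[0]][s])
            new_path[s] = (seq + [s], p * trans[prev][s] * scores[s])
        path = new_path
    return max(path.values(), key=lambda v: v[1])[0]

states = ["cafe", "mall"]
trans = {"cafe": {"cafe": 0.8, "mall": 0.2}, "mall": {"cafe": 0.2, "mall": 0.8}}
emits = [{"cafe": 0.9, "mall": 0.1}, {"cafe": 0.6, "mall": 0.4}]
seq = viterbi(states, trans, emits)
```

Here a second tweet with an ambiguous match score is pulled toward the first tweet's venue by the self-transition weight, reflecting users' staying behavior.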
Efficient object retrieval based on a generic similarity is a fundamental problem in information retrieval. We focus on queries by example, which retrieve the objects most similar to a query object q. We propose an enhancement of techniques based on the filter-and-refine paradigm. The filtering phase identifies objects likely to be similar to q; during the refinement phase, these objects are accessed and their similarity to q is evaluated. A problem with current datasets is that many non-relevant objects usually remain after filtering. We propose secondary filtering, which further reduces the number of objects to be refined while almost preserving the quality of the query result. It is based on sketches: compact bit strings compared by the Hamming distance. Our approach to identifying objects to be filtered out by sketches is based on a probabilistic model that describes the relationship between the similarities of objects and of their sketches. Secondary filtering thus efficiently identifies non-relevant objects that remain after primary filtering, and we suggest a mechanism to tune the trade-off between its accuracy and efficiency. The main advantage of our technique is its wide applicability: it can be easily incorporated into an arbitrary filter-and-refine technique for similarity search.
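The mechanics of the secondary filter can be sketched in a few lines: sketches are bit strings (stored here as integers), compared by popcount of their XOR, and candidates beyond a distance threshold are discarded before the expensive refinement. In the paper the threshold comes from the probabilistic model; here it is a fixed illustrative parameter.

```python
def hamming(a, b):
    """Hamming distance between two bit-string sketches stored as integers."""
    return bin(a ^ b).count("1")

def secondary_filter(query_sketch, candidates, threshold):
    """Drop candidates whose sketch lies too far from the query's sketch.
    candidates: list of (object id, sketch) pairs surviving primary filtering.
    The paper derives the threshold from a probabilistic model relating
    object similarity to sketch distance; here it is a fixed parameter."""
    return [oid for oid, sketch in candidates
            if hamming(query_sketch, sketch) <= threshold]

candidates = [("a", 0b10110100), ("b", 0b01001011), ("c", 0b10110110)]
survivors = secondary_filter(0b10110100, candidates, threshold=1)
```

Only the survivors are passed on to the refinement phase, where the true (costly) similarity to q is computed.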
User interactions can be considered to constitute different feedback channels, e.g., view, click, like, or follow, that provide implicit information on users' preferences. Each implicit feedback channel typically carries a unary, positive-only signal, which can be exploited by collaborative filtering models to generate lists of personalized recommendations. This paper investigates how a learning-to-rank recommender system can best take advantage of implicit feedback signals from multiple channels. We focus on Factorization Machines (FM) with Bayesian Personalized Ranking (BPR), a pairwise learning-to-rank method, which allows us to experiment with different forms of exploitation. We perform extensive experiments on three datasets with multiple types of feedback to arrive at a series of insights. We compare conventional, direct integration of feedback types with our proposed method, which exploits multiple feedback channels during the sampling process of training; we refer to our method as multi-channel sampling. Our results show that multi-channel feedback sampling outperforms conventional integration, and that sampling with the relative level of feedback is always superior to a level-blind sampling approach. We evaluated our method experimentally on three datasets in different domains and found that with our multi-channel sampler the accuracy and the item coverage of recommendations can be improved significantly compared to state-of-the-art models.
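A level-aware sampler can be sketched as follows: for a sampled positive interaction, the "negative" of the BPR pair is drawn only from strictly weaker channels or from unobserved items, so that training pairs respect the relative level of feedback. The channel levels and selection policy below are illustrative assumptions, not the paper's exact sampler.

```python
import random

def sample_pair(channels, all_items, rng=random):
    """Sample one BPR training pair (positive item, negative item) for a user.
    channels: dict level -> set of the user's items at that feedback level,
    where a higher level means stronger feedback (e.g. 2=like, 1=view)."""
    levels = sorted(channels, reverse=True)
    pos_level = rng.choice([l for l in levels if channels[l]])
    pos = rng.choice(sorted(channels[pos_level]))
    # candidate negatives: items in strictly weaker channels, or unobserved
    seen = set().union(*channels.values())
    weaker = set().union(*(channels[l] for l in levels if l < pos_level))
    negatives = sorted(weaker | (all_items - seen))
    return pos, rng.choice(negatives)
```

A level-blind sampler would instead draw the negative from all non-positive items regardless of channel, losing the ordering information between, say, liked and merely viewed items.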
Personalized rating prediction is an important research problem in recommender systems. Although the latent factor model achieves good accuracy in rating prediction, it suffers from many problems, including cold start, non-transparency, and suboptimal results for individual user-item pairs. In this paper, we exploit textual reviews and item images together with ratings to tackle these limitations. Specifically, we first apply the proposed multi-modal aspect-aware topic model to text reviews and item images to model users' preferences and items' properties from different aspects. The learned user preferences and item features are then integrated into a novel aspect-aware latent factor model for aspect rating estimation. Finally, the overall rating is computed via a linear combination of the aspect ratings, weighted by the corresponding aspect importance. Comprehensive experimental studies have been conducted on the Yelp 2017 Challenge dataset. Results show that (1) our method achieves significant improvements over strong baseline methods, especially for users with only a few ratings; (2) item visual features can improve prediction performance, although their effect depends on the importance of the visual features for the items; and (3) our model can explicitly interpret the predicted results in great detail.
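The final step described above, combining aspect ratings into an overall rating, is a simple weighted sum, sketched below with hypothetical aspect names and weights (in the paper both the aspect ratings and the importance weights are learned per user-item pair).

```python
def overall_rating(aspect_ratings, aspect_importance):
    """Overall rating as an importance-weighted linear combination of
    per-aspect ratings; the weights are assumed to sum to 1."""
    return sum(aspect_importance[a] * r for a, r in aspect_ratings.items())

# Illustrative values for one user-item pair (not from the paper):
ratings = {"food": 4.5, "service": 3.0, "ambience": 4.0}
weights = {"food": 0.5, "service": 0.3, "ambience": 0.2}
score = overall_rating(ratings, weights)
```

Because each aspect contributes an inspectable term, the same decomposition that produces the prediction also serves as its explanation.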
Auxiliary information such as user reviews or product images has been extensively leveraged in recommender systems. Although it has been verified to be effective in enhancing recommendation performance, most state-of-the-art models have to embed complicated sub-models in their framework to tame such unstructured data, which limits runtime efficiency. To solve this problem, in this paper we decouple the modeling of auxiliary information from the modeling of user-item interactions via a generalized distillation framework: in the training phase, a powerful teacher model for auxiliary information modeling teaches a simple collaborative filtering (CF) student model, and only this succinct yet enhanced student model is then used to make fast predictions at test time. Specifically, we take user reviews as the auxiliary information and, according to their specific characteristics, propose a Selective Distillation Network (SDNet) to leverage textual information in a more effective manner. Extensive experiments verify that our model not only improves the performance of rating prediction but also significantly reduces the time consumed when making predictions, as compared with several state-of-the-art methods.
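The training objective of a generalized distillation setup can be sketched as a two-term loss: the student fits the ground-truth rating while also imitating the review-aware teacher's prediction. The squared-error form and the balance parameter below are a generic illustration, not SDNet's actual (selective) objective.

```python
def distillation_loss(y_true, student_pred, teacher_pred, lam=0.5):
    """Generalized-distillation objective for rating prediction (sketch):
    lam balances fitting the true rating against imitating the teacher,
    which has seen the auxiliary review text the student never sees."""
    return ((1 - lam) * (student_pred - y_true) ** 2
            + lam * (student_pred - teacher_pred) ** 2)

loss = distillation_loss(y_true=4.0, student_pred=3.0, teacher_pred=5.0)
```

At test time only the student is evaluated, so the cost of processing reviews is paid once, during training, rather than on every prediction.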
Faceted search is quickly becoming a common feature of search interfaces in e-commerce websites, digital libraries, government open-information portals, etc. Going beyond existing studies on developing algorithms for faceted search and empirical studies of facet usage, this study investigated users' real-time interactions with facets over the course of a search from both data-science and human-factors perspectives. It adopted a Random Forest (RF) model to successfully predict facet use from search-dynamics variables. In addition, the RF model provided a ranking of variables by their predictive power, which suggests that the search process follows the rhythmic flow of a sequence within which facet addition is mostly influenced by its immediately preceding action. In the follow-up user study, we found that participants used facets at critical points from the beginning to the end of search sessions, and that they used facets for distinctive reasons at different stages. They also used facets implicitly, without applying them to their search. Most participants liked faceted search, although a few were concerned about the choice overload introduced by facets. The results of this research can be used to understand information seekers and to propose or refine a set of practical design guidelines for faceted search.
Heterogeneous one-class collaborative filtering (HOCCF) is an emerging and important problem in recommender systems, where two different types of one-class feedback, i.e., purchases and browses, are available as input data. The associated challenges include the ambiguity of browses, the scarcity of purchases, and the heterogeneity arising from the different types of feedback. In this paper, we propose to model purchases and browses from a new perspective, i.e., users' roles of mixer, browser, and purchaser. Specifically, we design a novel transfer learning solution termed role-based transfer to rank (RoToR), which comes in two variants, i.e., integrative RoToR and sequential RoToR. In integrative RoToR, we leverage browses in the preference learning task of purchases, treating each user as a sophisticated customer (i.e., a mixer) able to take different types of feedback into consideration. In sequential RoToR, we aim to simplify the integrative variant by decomposing it into two dependent phases that follow a typical shopping process. Furthermore, we instantiate both variants using different preference learning paradigms, such as pointwise preference learning and pairwise preference learning. Finally, we conduct extensive empirical studies with various baseline methods on two large public datasets and find that our RoToR performs significantly more accurately than the state-of-the-art methods.
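The two preference learning paradigms mentioned above can be sketched as loss functions: a pointwise loss regresses a predicted score toward a label, while a pairwise (BPR-style) loss only asks that a purchased item outrank a merely browsed or unobserved one. These generic forms illustrate the paradigms, not RoToR's specific instantiations.

```python
import math

def pointwise_loss(score, label):
    """Pointwise paradigm: regress the predicted score toward the label
    (e.g. 1 for a purchased item, 0 otherwise)."""
    return (score - label) ** 2

def pairwise_loss(score_pos, score_neg):
    """Pairwise paradigm: negative log-sigmoid of the score difference,
    so a purchase should outrank a browse or an unobserved item."""
    return math.log(1.0 + math.exp(-(score_pos - score_neg)))
```

Pairwise losses never require calibrated scores, only correct orderings, which is why they suit one-class feedback where absolute relevance labels are unavailable.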
With the availability of abundant online multi-relational video information, recommender systems that can effectively exploit these sorts of data and suggest creatively interesting items will become increasingly important. Recent research illustrates that tensor models offer effective approaches for complex multi-relational data learning and missing-element completion. So far, however, most tensor-based clustering has focused on accuracy. Given the dynamic nature of online media, recommendation in this setting is more challenging, as it is difficult to capture users' dynamic topic distributions in sparse data settings. Aiming to construct a recommender system that can trade off between accuracy and creativity, we propose a deep Bayesian probabilistic tensor framework for tag and item recommendation. Based on the Canonical PARAFAC (CP) decomposition, a Bayesian multi-layer factorization is imposed on the mode factor matrices to find a more compact representation. During the score-ranking process, a metric called Bayesian surprise is incorporated to increase the creativity of the recommended candidates. The new algorithm is evaluated on both synthetic and large-scale real-world problems. An empirical study of video recommendation demonstrates the superiority of the proposed model and indicates that it can better capture the latent patterns of interactions and generate interesting recommendations based on creative tag combinations.
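At the core of any CP-based recommender, a missing tensor entry (e.g., a user-item-tag affinity) is reconstructed as the inner product of the corresponding rows of the three mode factor matrices. The tiny rank-2 factors below are illustrative only; the paper's model places Bayesian multi-layer priors on these matrices rather than fixing them.

```python
def cp_score(u, i, t, U, I, T):
    """Predicted (user, item, tag) affinity under a rank-R CP decomposition:
    sum over components r of U[u][r] * I[i][r] * T[t][r]."""
    return sum(U[u][r] * I[i][r] * T[t][r] for r in range(len(U[u])))

# Illustrative rank-2 factors for 1 user, 1 item, 1 tag:
U = [[1.0, 2.0]]
I = [[3.0, 4.0]]
T = [[5.0, 6.0]]
score = cp_score(0, 0, 0, U, I, T)
```

Ranking candidates by this score alone optimizes accuracy; the paper additionally re-weights candidates by Bayesian surprise to favor creative, unexpected tag combinations.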
As a necessary step towards the automatic understanding of open-domain web-search queries, we study the problem of linking queries to their semantic representation, in terms of the unambiguous entities mentioned in the query. We introduce SMAPH, a second-order approach that, by ``piggybacking'' on a web search engine, alleviates the noise and irregularities that characterize the language of queries and puts queries in a larger context in which it is easier to make sense of them. Thanks to the information provided by the search engine, we discover a high-coverage set of candidate entities. The key algorithmic idea underlying the design of SMAPH-3, our best-performing algorithm, is to link back those entities to the terms occurring in the input query. This is implemented with a greedy disambiguation algorithm that iteratively builds the solution while maintaining a notion of coherence among its annotations. We show that SMAPH-3 outperforms the state of the art on the GERDAQ benchmark, a novel public dataset that we built specifically for web-query entity linking via a crowdsourcing effort and that we publish jointly with this paper.
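The greedy, coherence-driven construction can be sketched as follows: repeatedly add the candidate entity that is most coherent with the entities chosen so far, stopping when no candidate scores positively. The `coherence` function, its scores, and the stopping rule are illustrative assumptions, not SMAPH-3's actual scoring.

```python
def greedy_link(candidates, coherence):
    """Greedily build an annotation set: at each step add the candidate
    entity maximizing coherence(entity, chosen-so-far), and stop when the
    best remaining candidate is no longer coherent with the solution."""
    chosen, pool = [], list(candidates)
    while pool:
        best = max(pool, key=lambda e: coherence(e, chosen))
        if coherence(best, chosen) <= 0:
            break
        chosen.append(best)
        pool.remove(best)
    return chosen
```

Because each candidate is scored against the partial solution, entities that reinforce each other (e.g., two entities from the same topic) are kept, while spurious candidates surfaced by the search engine are pruned.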
Discovery is an important aspect of the civil litigation process in the USA, in which parties to a lawsuit are permitted to request relevant evidence from other parties. With the rapid growth of digital content, the emerging need for e-discovery has created a demand for techniques that can be used to review massive collections both for responsiveness (i.e., relevance) to the request and for privilege (i.e., presence of legally protected content that the party performing the review may have a right to withhold). In this process, the party performing the review may incur costs of two types, namely, annotation costs (deriving from the fact that human reviewers need to be paid for their work) and misclassification costs (deriving from the fact that failing to correctly determine the responsiveness or privilege of a document may adversely affect the interests of the parties in various ways). Relying exclusively on automatic classification would minimize annotation costs but could result in substantial misclassification costs, while relying exclusively on manual classification could generate the opposite consequences. This paper proposes a risk minimization framework (called MINECORE, for minimizing the expected costs of review) that seeks to strike an optimal balance between these two extremes.
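The balance between annotation and misclassification costs can be sketched as a per-document expected-cost comparison: auto-accepting risks a false-positive cost, auto-rejecting risks a false-negative cost, and sending the document to a human costs a fixed annotation fee. This is a generic illustration of the risk-minimization idea, not MINECORE's actual decision rule (which jointly handles responsiveness and privilege).

```python
def review_decision(p_relevant, cost_fp, cost_fn, annotation_cost):
    """Pick the per-document action with the lowest expected cost:
    auto-accept risks (1 - p) * cost of a false positive, auto-reject
    risks p * cost of a false negative, manual review costs a flat fee."""
    options = {
        "accept": (1 - p_relevant) * cost_fp,
        "reject": p_relevant * cost_fn,
        "annotate": annotation_cost,
    }
    return min(options, key=options.get)
```

Confident documents at either extreme are handled automatically, and the scarce human annotation budget is spent only where the classifier's uncertainty makes mistakes expensive.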