Brotli is an open-source general-purpose data compressor introduced by Google in late 2013 and now adopted in most major browsers and Web servers. It is publicly available on GitHub, and its data format was published in July 2016 as RFC 7932. Brotli is based on the Lempel-Ziv compression scheme and is intended as a generic replacement for GZIP and ZLIB. The main goal in its design was to compress data on the Internet, which meant optimizing the resources used at decoding time while achieving maximal compression density. This paper provides the first thorough, systematic description of the Brotli format, as well as a detailed computational and experimental analysis of the main algorithmic blocks underlying the current encoder implementation, together with a comparison against compressors of different families constituting the state of the art either in practice or in theory. This treatment allows us to raise a set of new algorithmic and software-engineering problems that deserve further attention from the scientific community.
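The Lempel-Ziv scheme that Brotli builds on can be illustrated with a minimal greedy LZ77 parser (an illustrative sketch only, not Brotli's actual format, which adds context modeling, a static dictionary, and entropy coding):

```python
def lz77_parse(data, window=32):
    """Greedy LZ77 parse: emit ("copy", offset, length) back-references
    into a sliding window, falling back to ("lit", byte) literals."""
    out, i = [], 0
    while i < len(data):
        best_len, best_off = 0, 0
        for j in range(max(0, i - window), i):  # scan the window
            k = 0
            while i + k < len(data) and data[j + k] == data[i + k]:
                k += 1                          # matches may overlap position i
            if k > best_len:
                best_len, best_off = k, i - j
        if best_len >= 3:                       # a back-reference pays off
            out.append(("copy", best_off, best_len))
            i += best_len
        else:                                   # too short: emit a literal
            out.append(("lit", data[i]))
            i += 1
    return out
```

For example, `lz77_parse(b"abcabcabc")` emits three literals and a single overlapping copy of length 6, which is the kind of redundancy a real encoder then entropy-codes.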
In this paper, we propose a seed-guided topic model for dataless text filtering and classification, named DFC. Given a collection of unlabeled documents and, for each category, a small set of seed words relevant to the semantic meaning of the category, DFC filters out the irrelevant documents and classifies the relevant documents into the corresponding categories through topic inference. DFC models two kinds of topics: category-topics and general-topics. There are, in turn, two kinds of category-topics: relevant-topics and irrelevant-topics. Each relevant-topic is associated with one specific category, representing its semantic meaning, while the irrelevant-topics represent the semantics of the unknown categories covered by the document collection. DFC assumes that each document is associated with a single category-topic and a mixture of general-topics. A document is filtered, or classified, based on its posterior category-topic assignment. Experiments on two widely used datasets show that DFC consistently outperforms the state-of-the-art dataless text classifiers for both classification with filtering and classification without filtering, and achieves comparable or even better classification accuracy than state-of-the-art supervised learning solutions. We also conduct a thorough study of the impact of seed words on existing dataless classification techniques. The results reveal that the document coverage of the seed words is the main factor affecting dataless classification performance.
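The seed-guided filtering-plus-classification setting can be sketched with a deliberately naive baseline that scores documents by seed-word overlap (a hypothetical illustration of the task interface; DFC itself infers posterior category-topic assignments, not raw overlaps):

```python
def classify_with_filtering(docs, seeds, threshold=1):
    """Assign each doc to the category whose seed-word set it overlaps most;
    docs with fewer than `threshold` seed hits are filtered as irrelevant (None)."""
    labels = []
    for doc in docs:
        words = set(doc.lower().split())
        scores = {cat: len(words & seed_set) for cat, seed_set in seeds.items()}
        best = max(scores, key=scores.get)
        labels.append(best if scores[best] >= threshold else None)
    return labels
```

A topic model improves on this baseline by propagating seed semantics to co-occurring words, so documents sharing no literal seed word can still be classified.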
In this paper, we extensively study the use of syntactic and semantic structures obtained with shallow and full syntactic parsers for answer-passage reranking. We propose several dependency- and constituent-based structures, also enriched with Linked Open Data (LD) knowledge, to represent pairs of questions and answer passages. We encode such tree structures in learning-to-rank (L2R) algorithms using tree kernels, which project them into a tree-substructure space, where each dimension represents a powerful syntactic/semantic feature. Additionally, since we define links between structures, the tree-kernel space also includes relational features spanning question and passage structures. Our findings can be useful for building state-of-the-art systems: (i) relational syntactic structures are essential to achieve the state of the art; (ii) full syntactic dependencies can outperform shallow models; (iii) external knowledge obtained with specialized classifiers, such as focus and question classifiers, is very effective; and (iv) the semantic information derived from LD and incorporated in syntactic structures can be used to replace specific knowledge classifiers, a remarkable advantage given its wide coverage. We demonstrate our findings by carrying out an extensive comparative experimentation on two different TREC QA corpora and on WikiQA, a recent corpus widely used to test sentence rerankers.
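The tree-kernel idea can be sketched with a simplified kernel that counts pairs of identical subtrees shared by two parses (an illustration only; the kernels used in such reranking work, e.g. the partial tree kernel, match far richer fragment sets):

```python
def tree_kernel(t1, t2):
    """Count pairs of identical subtrees shared by two parse trees.
    Trees are nested tuples: (label, child1, child2, ...)."""
    def subtree_counts(t, acc):
        acc[t] = acc.get(t, 0) + 1           # count this complete subtree
        for child in t[1:]:
            if isinstance(child, tuple):
                subtree_counts(child, acc)   # recurse into children
    a, b = {}, {}
    subtree_counts(t1, a)
    subtree_counts(t2, b)
    # each shared subtree contributes (count in t1) * (count in t2)
    return sum(cnt * b.get(sub, 0) for sub, cnt in a.items())
```

Each dimension of the implicit feature space is one subtree, so the kernel value is the inner product of the two trees' substructure vectors, which is exactly what an L2R algorithm consumes.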
User response prediction is a crucial component of personalized information retrieval and filtering scenarios, such as online advertising, recommender systems, and web search. The data in user response prediction is mostly in a multi-field categorical format and is transformed into a sparse binary representation via one-hot encoding. Due to the sparsity problems in representation and training, most research has focused on shallow models and feature engineering. In recent years, deep neural networks (DNNs) have attracted research attention for this problem thanks to their high capacity and end-to-end training scheme, but their learning difficulties have not been well studied. In this paper, we study the difficulties of user response prediction in the scenario of click prediction. We first characterize the latent patterns in multi-field categorical data as field-aware feature interactions and propose the kernel product to tackle problems in feature representation. We then propose Product-based Neural Networks (PNNs), which incorporate product layers to learn such interactions. Generalizing the product layer to a net-in-net architecture, we propose the Product-network In Network (PIN), which seamlessly subsumes the previous models. Extensive experiments on 4 large-scale real-world click datasets demonstrate that PNNs consistently outperform 7 baselines and state-of-the-art models on various metrics.
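The core of a product layer can be sketched as pairwise inner products over the per-field embedding vectors (a simplified sketch; PNN/PIN additionally learn kernel matrices for the products and feed the resulting interaction features into an MLP):

```python
def product_layer(field_embs):
    """Compute one interaction feature per field pair (i < j) as the
    inner product of the two fields' embedding vectors."""
    feats = []
    n = len(field_embs)
    for i in range(n):
        for j in range(i + 1, n):
            feats.append(sum(a * b for a, b in zip(field_embs[i], field_embs[j])))
    return feats
```

With F fields this yields F*(F-1)/2 interaction features, which is why these models capture field-aware interactions that one-hot inputs alone leave for the network to discover.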
Understanding unstructured texts is an essential skill for human beings, as it enables knowledge acquisition. Although understanding unstructured texts is easy for well-educated humans, it is a great challenge for machines. Recently, with the rapid development of artificial intelligence techniques, researchers have put effort into teaching machines to understand texts, evaluating the trained machines by having them answer questions about given unstructured texts, inspired by the reading comprehension tests humans take. However, feature effectiveness with respect to different questions significantly hinders the performance of answer selection, because different questions may focus on different aspects of the given text. Moreover, the correct answer is usually inferred from multiple sentences in the text. To solve these problems, we propose a question-oriented feature attention mechanism that emphasizes useful features according to the given question. These weighted features from multiple sentences are then merged with a cross-sentence max pooling to jointly consider information in multiple sentences, and the generated representations are finally used to infer the correct answer. Experiments on the MCTest dataset validate the effectiveness of the proposed method. As a byproduct, we release all the involved code, data, and parameters to facilitate other researchers.
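The two steps can be sketched as: reweight each sentence's feature vector with question-conditioned per-feature gates, then merge sentences with an elementwise maximum (a simplified sketch with hypothetical precomputed gates; the actual mechanism learns them from the question representation):

```python
def attend_and_pool(sent_feats, question_gate):
    """Scale every sentence's feature vector by question-derived per-feature
    weights, then merge sentences with an elementwise (cross-sentence) max."""
    weighted = [[f * g for f, g in zip(sent, question_gate)]
                for sent in sent_feats]
    # cross-sentence max pooling: take the strongest evidence per feature
    return [max(column) for column in zip(*weighted)]
```

The max pooling is what lets evidence scattered across several sentences contribute to a single answer representation.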
Efficient object retrieval based on a generic similarity is a fundamental problem in information retrieval. We focus on queries by example, which retrieve the objects most similar to a query object q. We propose an enhancement of techniques based on the filter-and-refine paradigm. The filtering phase identifies objects likely to be similar to q; during refinement, these objects are accessed and their similarity to q is evaluated. A problem on current datasets is that many non-relevant objects usually remain after filtering. We propose a secondary filtering step that further reduces the number of objects to be refined while almost preserving the quality of the query result. It is based on sketches, compact bit strings compared by the Hamming distance. Our approach to identifying objects to be filtered out by sketches is based on a probabilistic model that describes the relationship between the similarities of objects and of their sketches. The secondary filtering thus efficiently identifies non-relevant objects that remain after the primary filtering, and we suggest a mechanism to tune the trade-off between its accuracy and effectiveness. The main advantage of our technique is its wide applicability: it can be easily incorporated into an arbitrary filter-and-refine technique for similarity search.
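The sketch-based secondary filter can be sketched in a few lines: represent each object's sketch as an integer bit string and discard candidates whose sketch is too far from the query's (a minimal sketch of the mechanism; the paper's probabilistic model is what chooses the radius per query):

```python
def hamming(a, b):
    """Hamming distance between two equal-length bit-string sketches,
    stored as Python ints."""
    return bin(a ^ b).count("1")

def secondary_filter(query_sketch, candidates, radius):
    """Keep only candidates whose sketch lies within `radius` of the query
    sketch; only these survivors are refined with the expensive similarity."""
    return [oid for oid, sk in candidates
            if hamming(query_sketch, sk) <= radius]
```

XOR-plus-popcount makes this check orders of magnitude cheaper than the refined similarity, which is the point of inserting it between the primary filter and refinement.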
Auxiliary information such as user reviews or product images has been extensively leveraged in recommender systems. Although such information has been verified to enhance recommendation performance, most state-of-the-art models have to embed complicated sub-models in their framework to tame such unstructured data, which limits runtime efficiency. To solve this problem, in this paper we decouple the modeling of auxiliary information from that of user-item interactions via a generalized distillation framework: in the training phase, we leverage a powerful teacher model for auxiliary-information modeling to teach a simple collaborative filtering (CF) student model, and then only this succinct yet enhanced student model is used to make fast predictions at test time. Specifically, we take user reviews as the auxiliary information and, according to their specific characteristics, propose a Selective Distillation Network (SDNet) to leverage textual information more effectively. Extensive experiments verify that our model not only improves rating-prediction performance but also significantly reduces prediction time compared with several state-of-the-art methods.
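The generalized-distillation objective can be sketched as a convex combination of a ground-truth loss and an imitation loss toward the teacher's predictions (a generic sketch with squared error and a hypothetical weight `alpha`; SDNet's selective mechanism additionally gates which examples the teacher influences):

```python
def distill_loss(student_pred, label, teacher_pred, alpha=0.5):
    """Student training objective: fit the true rating while imitating
    the (auxiliary-information-aware) teacher's prediction."""
    hard = (student_pred - label) ** 2          # ground-truth term
    soft = (student_pred - teacher_pred) ** 2   # imitation term
    return alpha * hard + (1 - alpha) * soft
```

At test time only the student is evaluated, so the cost of processing reviews is paid once during training rather than at every prediction.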
Faceted search is quickly becoming a common feature of search interfaces in e-commerce websites, digital libraries, government open-information portals, etc. Beyond existing studies on developing algorithms for faceted search and empirical studies of facet usage, this study investigated users' real-time interactions with facets over the course of a search, from both data-science and human-factors perspectives. It adopted a Random Forest (RF) model to successfully predict facet use from search-dynamics variables. In addition, the RF model provided a ranking of variables by their predictive power, which suggests that the search process follows a rhythmic sequential flow in which facet addition is mostly influenced by its immediately preceding action. In the follow-up user study, we found that participants used facets at critical points from the beginning to the end of search sessions. Participants used facets for distinctive reasons at different stages; they also used facets implicitly, without applying them to their searches. Most participants liked the faceted search, although a few were concerned about the choice overload introduced by facets. The results of this research can be used to understand information seekers and to propose or refine a set of practical design guidelines for faceted search.
Heterogeneous one-class collaborative filtering (HOCCF) is an emerging and important problem in recommender systems, where two different types of one-class feedback, i.e., purchases and browses, are available as input data. The associated challenges include the ambiguity of browses, the scarcity of purchases, and the heterogeneity arising from the different feedback types. In this paper, we propose to model purchases and browses from a new perspective, i.e., users' roles of mixer, browser, and purchaser. Specifically, we design a novel transfer-learning solution termed role-based transfer to rank (RoToR), which comes in two variants, i.e., integrative RoToR and sequential RoToR. In integrative RoToR, we leverage browses in the preference learning task over purchases, taking each user as a sophisticated customer (i.e., a mixer) able to take different types of feedback into consideration. In sequential RoToR, we aim to simplify the integrative variant by decomposing it into two dependent phases following a typical shopping process. Furthermore, we instantiate both variants using different preference learning paradigms, such as pointwise and pairwise preference learning. Finally, we conduct extensive empirical studies with various baseline methods on two large public datasets and find that RoToR performs significantly more accurately than the state-of-the-art methods.
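The pairwise paradigm used to instantiate such models can be sketched as a BPR-style SGD step that pushes a purchased item's score above a browsed-only item's (a generic pairwise-preference sketch, not the exact RoToR objective):

```python
import math

def pairwise_update(user, item_pos, item_neg, lr=0.1, reg=0.01):
    """One SGD step on -ln sigmoid(s_pos - s_neg) for latent vectors:
    user, item_pos (purchased), item_neg (only browsed). Returns the
    margin s_pos - s_neg before the update."""
    s_pos = sum(u * v for u, v in zip(user, item_pos))
    s_neg = sum(u * v for u, v in zip(user, item_neg))
    g = 1.0 / (1.0 + math.exp(s_pos - s_neg))   # -d loss / d(s_pos - s_neg)
    for k in range(len(user)):
        du = g * (item_pos[k] - item_neg[k]) - reg * user[k]
        dp = g * user[k] - reg * item_pos[k]
        dn = -g * user[k] - reg * item_neg[k]
        user[k] += lr * du
        item_pos[k] += lr * dp
        item_neg[k] += lr * dn
    return s_pos - s_neg
```

A single update widens the score gap between the purchased and the merely browsed item, which is the ranking signal a pairwise instantiation optimizes.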
With the availability of abundant online multi-relational video information, recommender systems that can effectively exploit such data and suggest creatively interesting items will become increasingly important. Recent research illustrates that tensor models offer effective approaches to complex multi-relational data learning and missing-element completion. So far, most tensor-based clustering has focused on accuracy. Given the dynamic nature of online media, recommendation in this setting is more challenging, as it is difficult to capture users' dynamic topic distributions in sparse data settings. Aiming to construct a recommender system that balances accuracy and creativity, we propose a deep Bayesian probabilistic tensor framework for tag and item recommendation. Based on the Canonical PARAFAC (CP) decomposition, a Bayesian multi-layer factorization is imposed on the mode factor matrices to find a more compact representation. During score ranking, a metric called Bayesian surprise is incorporated to increase the creativity of the recommended candidates. The new algorithm is evaluated on both synthetic and large-scale real-world problems. An empirical study of video recommendation demonstrates the superiority of the proposed model and indicates that it can better capture the latent patterns of interactions and generate interesting recommendations based on creative tag combinations.
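The CP scoring underlying such a framework can be sketched as a sum over rank-one components of the user, item, and tag factor vectors (a plain CP sketch; the proposed model stacks Bayesian layers on the factor matrices and re-ranks with Bayesian surprise):

```python
def cp_score(u, i, t):
    """Predicted affinity of a (user, item, tag) triple under a rank-R
    CP (Canonical PARAFAC) decomposition: sum over r of u_r * i_r * t_r."""
    return sum(a * b * c for a, b, c in zip(u, i, t))
```

Completing the sparse interaction tensor with such scores is what allows ranking tags and items the user has never co-occurred with.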
The task and worker recommendation problems in crowdsourcing systems exhibit unique characteristics not present in traditional recommendation scenarios, i.e., the huge flow of tasks with short lifespans, the importance of workers' capabilities, and the quality of completed tasks. These unique features make traditional recommendation approaches unsatisfactory for task and worker recommendation in crowdsourcing systems. In this article, we propose a two-tier data representation scheme (defining a worker-category suitability score and a worker-task attractiveness score) to support personalized task and worker recommendation. We also extend two optimization methods, namely least-mean-square error and Bayesian personalized ranking, to better fit the characteristics of task/worker recommendation in crowdsourcing systems. We then integrate the proposed representation scheme and the extended optimization methods with two adapted popular learning models, i.e., matrix factorization and kNN, resulting in two lines of top-N recommendation algorithms for crowdsourcing systems: (1) Top-N-Tasks (TNT) recommendation algorithms for discovering the top-N most suitable tasks for a given worker, and (2) Top-N-Workers (TNW) recommendation algorithms for identifying the top-N best workers for a task requester. An extensive experimental study validates the effectiveness and efficiency of a broad spectrum of algorithms, accompanied by our analysis and the insights gained.
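The two-tier scheme can be sketched as blending a worker-category suitability score with a worker-task attractiveness score and ranking the top-N tasks (a hypothetical linear blend for illustration; the article learns these scores with MF/kNN under the extended optimization methods):

```python
def top_n_tasks(worker, tasks, suitability, attractiveness, n=2, alpha=0.5):
    """Rank tasks for `worker` by a blend of the worker-category suitability
    score and the worker-task attractiveness score; return top-n task ids.
    `tasks` is a list of (task_id, category) pairs."""
    def score(task_id, category):
        return (alpha * suitability[(worker, category)]
                + (1 - alpha) * attractiveness[(worker, task_id)])
    ranked = sorted(tasks, key=lambda tc: score(*tc), reverse=True)
    return [task_id for task_id, _ in ranked[:n]]
```

Splitting the signal into a category tier and a task tier is what lets brand-new short-lived tasks be ranked using the worker's standing category suitability.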
As a necessary step towards the automatic understanding of open-domain web-search queries, we study the problem of linking queries to their semantic representation, in terms of the unambiguous entities mentioned in the query. We introduce SMAPH, a second-order approach that, by ``piggybacking'' on a web search engine, alleviates the noise and irregularities that characterize the language of queries and puts queries in a larger context in which it is easier to make sense of them. Thanks to the information provided by the search engine, we discover a high-coverage set of candidate entities. The key algorithmic idea underlying the design of SMAPH-3, our best-performing algorithm, is to link those entities back to the terms occurring in the input query. This is implemented with a greedy disambiguation algorithm that iteratively builds the solution while maintaining a notion of coherence among its annotations. We show that SMAPH-3 outperforms the state of the art on the GERDAQ benchmark, a novel public dataset that we built specifically for web-query entity linking via a crowdsourcing effort and that we publish jointly with this paper.
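The greedy coherence-driven step can be sketched as iteratively adding, from the search-engine candidates, the entity most coherent with the annotations chosen so far (an illustrative sketch with hypothetical `prior` and pairwise `coherence` functions, not SMAPH-3's actual scoring):

```python
def greedy_linkback(candidates, prior, coherence, threshold=0.1):
    """Greedily build the annotation set: seed with the highest-prior
    candidate, then keep adding the candidate with the highest average
    coherence to the current solution while it clears `threshold`."""
    pool = sorted(candidates, key=prior, reverse=True)
    solution = [pool.pop(0)]
    while pool:
        def avg_coh(c):
            return sum(coherence(c, s) for s in solution) / len(solution)
        best = max(pool, key=avg_coh)
        if avg_coh(best) < threshold:
            break                      # remaining candidates are incoherent
        solution.append(best)
        pool.remove(best)
    return solution
```

The threshold is what discards spurious high-prior candidates that do not fit the entities already linked back to the query terms.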
Information retrieval systems involve a large number of parameters. For example, a system may choose from a set of possible retrieval models, or among various query-expansion parameters, whose values greatly influence retrieval effectiveness. Traditionally, these parameters are set at the system level based on training queries and then applied uniformly to all queries. We observe that it may not be easy to set all these parameters separately, since they can be interdependent. In addition, a global setting for all queries may not best fit each individual query; the parameters should instead be set according to the characteristics of each query. We propose a novel approach that tackles this problem by dealing with entire system configurations instead of selecting a single parameter at a time. The selection of the best configuration is cast as the problem of ranking the possible configurations for a given query using learning-to-rank (LTR). We exploit both query features and system-configuration features in the LTR method, so that the selection of the configuration is query-dependent. The experiments we conducted on four TREC ad hoc collections show that this approach significantly outperforms the traditional method of tuning the system configuration globally. We also show that query-expansion features are among the most important.
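The query-dependent selection can be sketched as scoring each candidate configuration with a linear model over the concatenated query and configuration features and taking the argmax (a sketch with hypothetical weights; the approach itself uses full LTR algorithms to learn the ranking function):

```python
def select_configuration(query_feats, configs, weights):
    """Score each configuration for this query as
    weights . [query_feats, config_feats] and return the best config id.
    `configs` is a list of (config_id, config_feats) pairs."""
    def score(cfg_feats):
        joint = query_feats + cfg_feats   # concatenated feature vector
        return sum(w * x for w, x in zip(weights, joint))
    return max(configs, key=lambda item: score(item[1]))[0]
```

Because the query features enter the scoring function, two different queries can legitimately receive two different winning configurations from the same learned weights.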
Discovery is an important aspect of the civil litigation process in the USA, in which parties to a lawsuit are permitted to request relevant evidence from other parties. With the rapid growth of digital content, the emerging need for e-discovery has created a demand for techniques that can be used to review massive collections both for responsiveness (i.e., relevance) to the request and for privilege (i.e., the presence of legally protected content that the party performing the review may have a right to withhold). In this process, the party performing the review may incur costs of two types: annotation costs (deriving from the fact that human reviewers need to be paid for their work) and misclassification costs (deriving from the fact that failing to correctly determine the responsiveness or privilege of a document may adversely affect the interests of the parties in various ways). Relying exclusively on automatic classification would minimize annotation costs but could result in substantial misclassification costs, while relying exclusively on manual classification would have the opposite effect. This paper proposes a risk-minimization framework (called MINECORE, for minimizing the expected costs of review) that seeks to strike an optimal balance between these two extremes.
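The core trade-off can be sketched as a per-document expected-cost comparison: route a document to a human reviewer only when the classifier's expected misclassification cost exceeds the annotation cost (a simplified single-dimension sketch of the risk-minimization idea; MINECORE jointly handles responsiveness and privilege):

```python
def review_decision(p_relevant, cost_fp, cost_fn, cost_annotate):
    """Return 'manual' if human review is cheaper in expectation; otherwise
    return the automatic label minimizing expected misclassification cost."""
    exp_cost_pos = (1 - p_relevant) * cost_fp   # cost if auto-labeled responsive
    exp_cost_neg = p_relevant * cost_fn         # cost if auto-labeled non-responsive
    best_auto = min(exp_cost_pos, exp_cost_neg)
    if cost_annotate < best_auto:
        return "manual"
    return "responsive" if exp_cost_pos <= exp_cost_neg else "non-responsive"
```

Documents the classifier is confident about are labeled automatically at near-zero expected cost, so the annotation budget concentrates on the uncertain middle of the score distribution.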