Brotli is an open-source general-purpose data compressor introduced by Google in late 2013 and now adopted in most known browsers and Web servers. It is publicly available on GitHub, and its data format was published in July 2016 as RFC 7932. Brotli is based on the Lempel-Ziv compression scheme and was planned as a generic replacement for GZIP and ZLIB. The main goal in its design was to compress data on the Internet, which meant optimizing the resources used at decoding time while achieving maximal compression density. This paper is intended to provide the first thorough, systematic description of the Brotli format as well as a detailed computational and experimental analysis of the main algorithmic blocks underlying the current encoder implementation, together with a comparison against compressors of different families constituting the state of the art either in practice or in theory. This treatment will allow us to raise a set of new algorithmic and software-engineering problems that deserve further attention from the scientific community.
We investigate how two factors influence the way people formulate information requests. Our first factor, medium, considers whether the request is produced using text or voice. Our second factor, target, considers whether the request is intended for a search engine or a human search intermediary. In particular, we study how these two factors influence the way people formulate requests in situations where the information need has a specific type of extra-topical dimension. We focus on six extra-topical dimensions: (1) domain knowledge, (2) viewpoint, (3) experiential, (4) venue location, (5) source location, and (6) temporal. We analyze information requests gathered through a crowdsourced study and address three research questions. We study the effects of our two factors (medium and target) on: (RQ1) participants' perceptions about their information requests, (RQ2) the different characteristics of their information requests (e.g., natural language structure, retrieval performance), and (RQ3) participants' strategies for requesting information when the search task has a specific type of extra-topical dimension. Our results show that both factors influenced participants' perceptions about their own information requests, the characteristics of participants' requests, and the strategies used by participants to request information matching the extra-topical dimension. Our results call for future research on retrieval algorithms that can effectively harness (rather than ignore) extra-topical query terms.
In this paper, we extensively study the use of syntactic and semantic structures obtained with shallow and full syntactic parsers for answer passage reranking. We propose several dependency- and constituent-based structures, also enriched with Linked Open Data (LD) knowledge, to represent pairs of questions and answer passages. We encode such tree structures in learning to rank (L2R) algorithms using tree kernels, which project them into a tree substructure space, where each dimension represents a powerful syntactic/semantic feature. Additionally, since we define links between structures, the tree kernel space also includes relational features spanning question and passage structures. Our findings can be useful to build state-of-the-art systems: (i) relational syntactic structures are essential to achieve the state of the art; (ii) full syntactic dependencies can outperform shallow models; (iii) external knowledge obtained with specialized classifiers such as focus and question classification is very effective; and (iv) the semantic information derived from LD and incorporated in syntactic structures can be used to replace specific knowledge classifiers: this is a remarkable advantage with wide coverage. We demonstrate our findings by carrying out an extensive comparative experimentation on two different TREC QA corpora and WikiQA, a recent corpus widely used to test sentence rerankers.
User response prediction is a crucial component for personalized information retrieval and filtering scenarios, such as online advertising, recommender systems, and web search. The data in user response prediction is mostly in a multi-field categorical format and is transformed into a sparse binary representation via one-hot encoding. Due to the sparsity problems in representation and training, most research has focused on shallow models and feature engineering. Over the past three years, deep neural networks (DNNs) have attracted research attention on this problem for their high capacity and end-to-end training scheme, but their learning difficulties have not been well studied. In this paper, we study the difficulties of user response prediction in the scenario of click prediction. We first refine the latent patterns in multi-field categorical data as field-aware feature interactions and propose the kernel product to tackle problems in feature representations. Then we propose Product-based Neural Networks (PNNs), which incorporate product layers to learn such interactions. Generalizing the product layer to a net-in-net architecture, we propose Product-network In Network (PIN) to seamlessly incorporate the previous models. Extensive experiments on 4 large-scale real-world click datasets demonstrate that PNNs consistently outperform 7 baselines and state-of-the-art models on various metrics.
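To make the idea of a product layer concrete, the following is a minimal sketch that computes the inner product of every pair of field embeddings. It assumes the plain inner product as the product operation; the kernel product and the net-in-net generalization named in the abstract are not reproduced here. The function name `inner_product_layer` and the toy embeddings are hypothetical.

```python
from itertools import combinations


def inner_product_layer(field_embeddings):
    """Return the inner product of every unordered pair of field embeddings.

    Assumption: each field has already been mapped from its one-hot input
    to a dense embedding of the same length.
    """
    return [
        sum(x * y for x, y in zip(u, v))
        for u, v in combinations(field_embeddings, 2)
    ]


# Hypothetical embeddings for three fields (e.g., user, ad, site).
embeddings = [[1.0, 0.0], [0.5, 2.0], [2.0, 1.0]]
interactions = inner_product_layer(embeddings)
print(interactions)
```

In a full model these pairwise interaction values would be fed, together with the embeddings, into subsequent fully connected layers; that part is omitted.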
Influence maximization (IM), which selects a set of k seeds to maximize the influence spread over a social network, is a fundamental problem in a wide range of applications. Most existing IM algorithms are static and location-unaware. They cannot provide high-quality seeds efficiently when the social network evolves rapidly and IM queries are location-aware. To address these problems, we define the Stream Influence Maximization (SIM) and Location-aware SIM (LSIM) queries to track influencers over social streams. SIM adopts the sliding window model and maintains a set of k seeds with the largest influence over recent social actions. LSIM further considers social actions associated with geo-tags and identifies a seed set for a query region over geo-tagged social streams. We propose the Sparse Influential Checkpoints (SIC) and Location-based SIC (LSIC) frameworks for SIM and LSIM query processing, respectively. The SIC framework keeps a logarithmic number of influential checkpoints w.r.t. the window size and maintains an approximate solution for SIM. The LSIC framework integrates a spatial index with SIC and answers both ad-hoc and continuous LSIM queries with the same approximation ratio. Experimental results on real-world datasets confirm the effectiveness and efficiency of the proposed frameworks against the state-of-the-art IM algorithms.
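The sliding-window view of SIM can be illustrated with a simplified sketch. It assumes that each social action is a hypothetical `(source, target)` pair meaning that `source` influenced `target`, that the influence of a seed set is the number of distinct users it influenced inside the window, and that seeds are recomputed from scratch with a standard greedy coverage heuristic. It does not implement the SIC or LSIC checkpoint frameworks.

```python
from collections import deque


def greedy_seeds(window, k):
    """Greedily pick up to k users by marginal coverage of influenced users."""
    influenced = {}
    for source, target in window:
        influenced.setdefault(source, set()).add(target)

    seeds, covered = [], set()
    for _ in range(k):
        # User whose influenced set adds the most not-yet-covered users.
        best = max(
            influenced,
            key=lambda user: len(influenced[user] - covered),
            default=None,
        )
        if best is None or not (influenced[best] - covered):
            break  # no remaining user adds new coverage
        seeds.append(best)
        covered |= influenced[best]
    return seeds


# Sliding window over the 4 most recent actions (window size is illustrative).
window = deque(maxlen=4)
stream = [("u1", "u2"), ("u1", "u3"), ("u4", "u2"), ("u5", "u6"), ("u5", "u7")]
for action in stream:
    window.append(action)  # the oldest action is evicted automatically

seeds = greedy_seeds(window, 2)
print(seeds)
```

Recomputing the greedy solution on every window update is the costly baseline that a checkpoint-based approach is meant to avoid.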
Understanding unstructured texts is an essential skill for human beings as it enables knowledge acquisition. Although understanding unstructured texts is easy for well-educated humans, it is a great challenge for machines. Recently, with the rapid development of artificial intelligence techniques, researchers have put effort into teaching machines to understand texts, and they evaluate the trained machines by letting them answer questions about given unstructured texts, inspired by the reading comprehension tests that humans take. However, the varying effectiveness of features across different questions significantly hinders the performance of answer selection, because different questions may focus on various aspects of the given text. Moreover, the correct answer is usually inferred from multiple sentences in the text. To solve these problems, we propose a question-oriented feature attention mechanism to emphasize useful features according to the given question. These weighted features from multiple sentences are then merged with a cross-sentence max pooling to jointly consider information in multiple sentences. The generated representations are finally used to infer the correct answer. Experiments on the MCTest dataset validate the effectiveness of the proposed method. As a byproduct, we have released all the code, data, and parameters involved to facilitate further research.
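The two steps described above, weighting features by the question and then pooling across sentences, can be sketched as follows. This is only an illustration under stated assumptions: the per-feature attention weights are given directly as a hypothetical list rather than computed from a question by a learned model, and the function names are not taken from the released code.

```python
def weight_features(sentence_features, attention):
    """Scale each feature of each sentence by its question-dependent weight."""
    return [
        [w * f for w, f in zip(attention, features)]
        for features in sentence_features
    ]


def cross_sentence_max_pool(weighted_features):
    """Keep, for each feature position, the maximum value over all sentences."""
    return [max(column) for column in zip(*weighted_features)]


# Hypothetical weights: the question makes the second feature most important.
attention = [0.5, 2.0, 1.0]
# Hypothetical feature vectors for two sentences of the passage.
sentences = [[0.2, 0.1, 0.9], [0.8, 0.3, 0.1]]

pooled = cross_sentence_max_pool(weight_features(sentences, attention))
print(pooled)
```

Max pooling lets each feature be supplied by whichever sentence expresses it most strongly, which is one way to combine evidence spread over several sentences.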
Efficient object retrieval based on a generic similarity is a fundamental problem in information retrieval. We focus on queries by example, which retrieve the objects most similar to a query object q. We propose an enhancement of techniques based on the filter-and-refine paradigm. The filtering phase identifies objects likely to be similar to q. During the refinement, these objects are accessed and their similarity to q is evaluated. A problem with current datasets is that many non-relevant objects usually remain after filtering. We propose a secondary filtering that further reduces the number of objects to be refined while almost preserving the quality of the query result. It is based on sketches, which are compact bit strings compared by the Hamming distance. Our approach to identifying objects to be filtered out by sketches is based on a probabilistic model that describes relationships between similarities of objects and sketches. The secondary filtering thus efficiently identifies non-relevant objects that remain after primary filtering, and we suggest a mechanism to tune the trade-off between its accuracy and effectiveness. The main advantage of our technique is its wide applicability. It can be easily incorporated into any filter-and-refine technique for similarity search.
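The secondary filtering step can be sketched as follows. This is a simplified illustration: sketches are stored as Python integers, and a candidate is discarded when its sketch differs from the query sketch in more than `max_distance` bits. The fixed threshold is an assumption made for this example; in the approach described above, the decision is derived from a probabilistic model, which is not reproduced here.

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between two equal-length bit strings stored as ints."""
    return bin(a ^ b).count("1")


def secondary_filter(query_sketch, candidates, max_distance):
    """Keep the IDs of candidates whose sketch is close to the query sketch.

    candidates maps an object ID to its sketch; it is assumed to hold only
    the objects that survived the primary filtering.
    """
    return [
        object_id
        for object_id, sketch in candidates.items()
        if hamming(query_sketch, sketch) <= max_distance
    ]


# Hypothetical 8-bit sketches of three objects left after primary filtering.
candidates = {"a": 0b10110010, "b": 0b10110011, "c": 0b01001101}
survivors = secondary_filter(0b10110010, candidates, max_distance=2)
print(survivors)
```

Only the survivors would then be accessed in the refinement phase, where their actual similarity to q is evaluated.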
We propose the Neural Vector Space Model (NVSM), a method that learns representations of documents in an unsupervised manner for news article retrieval. In the NVSM paradigm, we learn low-dimensional representations of words and documents from scratch using gradient descent and rank documents according to their similarity with query representations that are composed from word representations. We show that NVSM performs better at document ranking than existing latent semantic vector space methods. The addition of NVSM to a mixture of lexical language models and a state-of-the-art baseline vector space model yields a statistically significant increase in retrieval effectiveness. Consequently, NVSM adds a complementary relevance signal. Besides semantic matching, we find that NVSM performs well in cases where lexical matching is needed. NVSM learns a notion of term specificity directly from the document collection without feature engineering. We also show that NVSM learns regularities related to Luhn significance. Finally, we give advice on how to deploy NVSM in situations where model selection (e.g., cross-validation) is infeasible. We find that an unsupervised ensemble of multiple models trained with different hyperparameter values performs better than a single cross-validated model. Therefore, NVSM can safely be used for ranking documents without supervised relevance judgments.
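The ranking step, in which a query representation is composed from word representations and compared with document representations, can be sketched as follows. The sketch assumes that words and documents already live in the same vector space, that the query is the plain average of its known word vectors, and that similarity is the cosine. These are simplifying assumptions for illustration, the vectors are hypothetical, and the training procedure is not shown.

```python
import math


def average(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def rank(query_terms, word_vectors, doc_vectors):
    """Return document IDs ordered by decreasing similarity to the query."""
    known = [word_vectors[t] for t in query_terms if t in word_vectors]
    if not known:
        return []  # no query term has a learned representation
    query = average(known)
    return sorted(
        doc_vectors,
        key=lambda doc_id: cosine(query, doc_vectors[doc_id]),
        reverse=True,
    )


# Hypothetical learned representations.
word_vectors = {"stock": [1.0, 0.0], "market": [0.8, 0.2], "football": [0.0, 1.0]}
doc_vectors = {"sports": [0.1, 0.9], "finance": [0.9, 0.1]}

ranking = rank(["stock", "market"], word_vectors, doc_vectors)
print(ranking)
```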
With the availability of abundant online multi-relational video information, recommender systems that can effectively exploit these sorts of data and suggest creatively interesting items will become increasingly important. Recent research illustrates that tensor models offer effective approaches for complex multi-relational data learning and missing element completion. So far, most tensor-based clustering has focused on accuracy. Given the dynamic nature of online media, recommendation in this setting is more challenging, as it is difficult to capture the users' dynamic topic distributions in sparse data settings. To construct a recommender system that balances accuracy and creativity, we propose a deep Bayesian probabilistic tensor framework for tag and item recommendation. Based on the Canonical PARAFAC (CP) decomposition, a Bayesian multi-layer factorization is imposed on the mode factor matrix to find a more compact representation. During the score ranking process, a metric called Bayesian surprise is incorporated to increase the creativity of the recommended candidates. The new algorithm is evaluated on both synthetic and large-scale real-world problems. An empirical study for video recommendation demonstrates the superiority of the proposed model and indicates that it can better capture the latent patterns of interactions and generate interesting recommendations based on creative tag combinations.
Exploratory search requires the system to assist the user in comprehending the information space and expressing evolving search intents for iterative exploration and retrieval of information. We introduce interactive intent modeling, a technique that models a user's evolving search intents and visualizes them as keywords for interaction. The user can provide feedback on the keywords, from which the system learns and visualizes improved intent estimates and retrieves information. We report experiments comparing variants of a system implementing interactive intent modeling to a control system. Data comprising search logs, interaction logs, essay answers, and questionnaires indicate significant improvements in task performance, session-level information retrieval performance, information comprehension performance, and user experience. The improvements in retrieval effectiveness can be attributed to the intent modeling, but the effects on users' task performance, breadth of information comprehension, and user experience are shown to depend on a richer visualization. Our results demonstrate the utility of combining interactive modeling of search intentions with interactive visualization of the models, which can benefit both directing the exploratory search process and making sense of the information space. Our findings can help design personalized systems that support exploratory information seeking and discovery of novel information.
Personalized recommendation of points of interest (POIs) plays a key role in satisfying users on location-based social networks (LBSNs). In this paper, we propose a probabilistic model to find the mapping between user-annotated tags and locations' taste keywords. Furthermore, we introduce a dataset on locations' contextual appropriateness and demonstrate its usefulness in predicting the contextual relevance of locations. We investigate four approaches to using our proposed mapping to address the data sparsity problem: one model to reduce the dimensionality of location taste keywords and three models to predict user tags for a new location. Moreover, we present different scores calculated from multiple LBSNs and show how we incorporate new information from the mapping into a POI recommendation approach. Then, the computed scores are integrated using learning to rank techniques. The experiments on two TREC datasets show the effectiveness of our approach, which outperforms state-of-the-art methods.
The task and worker recommendation problems in crowdsourcing systems have unique characteristics that are not present in traditional recommendation scenarios, i.e., the huge flow of tasks with short lifespans, the importance of workers' capabilities, and the quality of completed tasks. These unique features make traditional recommendation approaches no longer satisfactory for task and worker recommendation in crowdsourcing systems. In this article, we propose a two-tier data representation scheme (defining a worker-category suitability score and a worker-task attractiveness score) to support personalized task and worker recommendation. We also extend two optimization methods, namely least mean square error and Bayesian personalized rank, to better fit the characteristics of task/worker recommendation in crowdsourcing systems. We then integrate the proposed representation scheme and the extended optimization methods with two adapted popular learning models, i.e., matrix factorization and kNN, resulting in two lines of top-N recommendation algorithms for crowdsourcing systems: (1) Top-N-Tasks (TNT) recommendation algorithms for discovering the top-N most suitable tasks for a given worker, and (2) Top-N-Workers (TNW) recommendation algorithms for identifying the top-N best workers for a task requester. An extensive experimental study validates the effectiveness and efficiency of a broad spectrum of algorithms, accompanied by our analysis and the insights gained.
Information Retrieval systems involve a large number of parameters. For example, a system may choose from a set of possible retrieval models, or various query expansion parameters, whose values greatly influence the retrieval effectiveness. Traditionally, these parameters are set at system level based on training queries, and then used for different queries. We observe that it may not be easy to set all these parameters separately since they can be dependent. In addition, a global setting for all queries may not best fit all individual queries. The parameters should be set according to these characteristics. We propose a novel approach to tackle this problem by dealing with entire system configurations instead of selecting a single parameter at a time. The selection of the best configuration is cast as a problem of ranking different possible configurations given a query using learning-to-rank (LTR). We exploit both the query features and the system configuration features in the LTR method, so that the selection of configuration is query-dependent. The experiments we conducted on four TREC ad-hoc collections show that this approach significantly outperforms the traditional method of tuning the system configuration globally. We also show that query expansion features are among the most important.
People often conduct exploratory search to explore unfamiliar information spaces and learn new knowledge. While supporting highly dynamic and interactive exploratory search is still challenging for search systems, we want to investigate which factors can make exploratory search successful and satisfying from the user's perspective. Previous research suggests that domain experts have different search strategies and are more successful in finding domain-specific information, but how the level of domain expertise influences users' interaction and search outcomes in exploratory search, especially in different knowledge domains, is still unclear. In this work, via a carefully designed user study that involves 30 participants, we investigate the influence of domain expertise levels on the interaction and outcome of exploratory search in three different domains: environment, medicine, and politics. We record participants' search behaviors, including their explicit feedback and eye fixation sequences, in a laboratory setting. With this dataset, we identify both domain-independent and domain-dependent effects on user behaviors and search outcomes. Our results extend existing research on the effect of domain expertise in search and suggest different strategies for exploiting domain expertise to support exploratory search in different knowledge domains.