Brotli is an open-source general-purpose data compressor introduced by Google in late 2013 and now adopted by most major browsers and Web servers. It is publicly available on GitHub, and its data format was published in July 2016 as RFC 7932. Brotli is based on the Lempel-Ziv compression scheme and is intended as a generic replacement for GZIP and ZLIB. The main goal of its design was to compress data on the Internet, which meant optimizing the resources used at decoding time while achieving maximal compression density. This paper provides the first thorough, systematic description of the Brotli format, as well as a detailed computational and experimental analysis of the main algorithmic blocks underlying the current encoder implementation, together with a comparison against compressors of different families constituting the state of the art either in practice or in theory. This treatment allows us to raise a set of new algorithmic and software-engineering problems that deserve further attention from the scientific community.
We investigate how two factors influence the way people formulate information requests. Our first factor, medium, considers whether the request is produced using text or voice. Our second factor, target, considers whether the request is intended for a search engine or a human search intermediary. In particular, we study how these two factors influence the way people formulate requests in situations where the information need has a specific type of extra-topical dimension. We focus on six extra-topical dimensions: (1) domain knowledge, (2) viewpoint, (3) experiential, (4) venue location, (5) source location, and (6) temporal. We analyze information requests gathered through a crowdsourced study to address three research questions. We study the effects of our two factors (medium and target) on: (RQ1) participants' perceptions about their information requests, (RQ2) the characteristics of their information requests (e.g., natural language structure, retrieval performance), and (RQ3) participants' strategies for requesting information when the search task has a specific type of extra-topical dimension. Our results show that both factors influenced participants' perceptions about their own information requests, the characteristics of their requests, and the strategies they used to request information matching the extra-topical dimension. These results call for future research on retrieval algorithms that can effectively harness (rather than ignore) extra-topical query terms.
Influence maximization (IM), which selects a set of k seeds to maximize the influence spread over a social network, is a fundamental problem in a wide range of applications. Most existing IM algorithms are static and location-unaware, and cannot provide high-quality seeds efficiently when the social network evolves rapidly and IM queries are location-aware. To address these problems, we define the Stream Influence Maximization (SIM) and Location-aware SIM (LSIM) queries to track influencers over social streams. SIM adopts the sliding window model and maintains a set of k seeds with the largest influence over recent social actions. LSIM further considers social actions associated with geo-tags and identifies a seed set for a query region over geo-tagged social streams. We propose the Sparse Influential Checkpoints (SIC) and Location-based SIC (LSIC) frameworks for SIM and LSIM query processing, respectively. The SIC framework keeps a logarithmic number of influential checkpoints w.r.t. the window size and maintains an approximate solution for SIM. The LSIC framework integrates a spatial index with SIC and answers both ad-hoc and continuous LSIM queries with the same approximation ratio. Experimental results on real-world datasets confirm the effectiveness and efficiency of the proposed frameworks against state-of-the-art IM algorithms.
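The checkpoint idea can be sketched as follows. This is a simplified illustration, not the paper's algorithm: influence is replaced by a toy greedy coverage oracle over (seed, influenced-user) actions, and a checkpoint is simply a start index into the stream. The pruning rule keeps a middle checkpoint only while its neighbours' influence values differ by more than a (1 - eps) factor, which is what keeps the number of checkpoints logarithmic in the window size.

```python
from collections import defaultdict

def coverage_influence(actions, k):
    """Toy influence oracle: greedy k-seed coverage over (seed, user) actions."""
    reach = defaultdict(set)
    for seed, user in actions:
        reach[seed].add(user)
    covered, seeds = set(), []
    for _ in range(min(k, len(reach))):
        best = max(reach, key=lambda s: len(reach[s] - covered))
        if not reach[best] - covered:
            break
        seeds.append(best)
        covered |= reach[best]
    return seeds, len(covered)

class SIC:
    """Sparse checkpoints over a sliding window of social actions (sketch)."""
    def __init__(self, window, k, eps=0.1):
        self.window, self.k, self.eps = window, k, eps
        self.stream = []   # all actions seen so far
        self.cps = []      # checkpoint start indices into the stream

    def _infl(self, start):
        return coverage_influence(self.stream[start:], self.k)[1]

    def insert(self, action):
        self.stream.append(action)
        self.cps.append(len(self.stream) - 1)
        lo = max(0, len(self.stream) - self.window)
        live = [c for c in self.cps if c >= lo]
        expired = [c for c in self.cps if c < lo]
        # keep the newest expired checkpoint so the window stays covered
        self.cps = expired[-1:] + live
        # prune c_j when its neighbours' influences are within a (1-eps) factor
        j = 1
        while j + 1 < len(self.cps):
            if self._infl(self.cps[j + 1]) >= (1 - self.eps) * self._infl(self.cps[j - 1]):
                del self.cps[j]
            else:
                j += 1

    def query(self):
        """Approximate top-k seeds over the current window."""
        lo = max(0, len(self.stream) - self.window)
        start = next((c for c in self.cps if c >= lo), self.cps[0])
        return coverage_influence(self.stream[start:], self.k)
```

Because the influence of a monotone coverage function only shrinks as the window start moves forward, adjacent checkpoints with near-equal influence are redundant, which is the intuition the pruning rule exploits.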
Efficient object retrieval based on a generic similarity is a fundamental problem in information retrieval. We focus on queries by example, which retrieve the objects most similar to a query object q. We propose an enhancement of techniques based on the filter-and-refine paradigm. The filtering phase identifies objects likely to be similar to q; during the refinement, these objects are accessed and their similarity to q is evaluated. A problem with current datasets is that many non-relevant objects usually remain after filtering. We propose secondary filtering, which further reduces the number of objects to be refined while almost preserving the quality of the query result. It is based on sketches: compact bit strings compared by the Hamming distance. Our approach to identifying objects to be filtered out by sketches is based on a probabilistic model that describes the relationship between the similarities of objects and of their sketches. The secondary filtering thus efficiently identifies non-relevant objects that remain after primary filtering, and we suggest a mechanism to tune the trade-off between its accuracy and effectiveness. The main advantage of our technique is its wide applicability: it can be easily incorporated into an arbitrary filter-and-refine technique for similarity search.
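The secondary-filtering step can be illustrated with a minimal sketch. Two assumptions are made here: bit-string sketches are stored as Python integers, and the probabilistically tuned, per-query threshold of the technique is replaced by a fixed bit-distance cutoff `max_dist`.

```python
def hamming(a, b):
    """Hamming distance between two equal-length bit-string sketches."""
    return bin(a ^ b).count("1")

def secondary_filter(query_sketch, candidates, max_dist):
    """Drop candidates whose sketch is more than max_dist bits from the query's.

    candidates: iterable of (object_id, sketch) pairs that survived the
    primary filtering; only the survivors are passed on to refinement.
    """
    return [oid for oid, sk in candidates if hamming(query_sketch, sk) <= max_dist]
```

In the full technique, `max_dist` would be derived from the probabilistic model relating object similarity to sketch distance, which is what lets the accuracy/effectiveness trade-off be tuned.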
We propose the Neural Vector Space Model (NVSM), a method that learns representations of documents in an unsupervised manner for news article retrieval. In the NVSM paradigm, we learn low-dimensional representations of words and documents from scratch using gradient descent and rank documents according to their similarity with query representations composed from word representations. We show that NVSM performs better at document ranking than existing latent semantic vector space methods. The addition of NVSM to a mixture of lexical language models and a state-of-the-art baseline vector space model yields a statistically significant increase in retrieval effectiveness. Consequently, NVSM adds a complementary relevance signal. Besides semantic matching, we find that NVSM performs well in cases where lexical matching is needed. NVSM learns a notion of term specificity directly from the document collection without feature engineering. We also show that NVSM learns regularities related to Luhn significance. Finally, we give advice on how to deploy NVSM in situations where model selection (e.g., cross-validation) is infeasible. We find that an unsupervised ensemble of multiple models trained with different hyperparameter values performs better than a single cross-validated model. Therefore, NVSM can safely be used for ranking documents without supervised relevance judgments.
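The ranking step of vector space models of this kind can be sketched as follows. This is a simplified illustration, not NVSM itself: the word and document vectors are assumed to be already learned, and the query is composed by summing unit-normalized word vectors, whereas NVSM additionally learns a projection between word and document space.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

def rank_documents(query_terms, word_vecs, doc_vecs):
    """Rank documents by cosine similarity to a composed query vector."""
    dims = len(next(iter(doc_vecs.values())))
    q = [0.0] * dims
    for t in query_terms:
        w = word_vecs[t]
        n = norm(w)
        # sum of unit word vectors (overall scale does not affect cosine ranking)
        q = [qi + wi / n for qi, wi in zip(q, w)]
    return sorted(doc_vecs, key=lambda d: cosine(q, doc_vecs[d]), reverse=True)
```

Normalizing each word vector before composition keeps frequent, long-vector terms from dominating the query representation.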
Exploratory search requires the system to assist the user in comprehending the information space and expressing evolving search intents for iterative exploration and retrieval of information. We introduce interactive intent modeling, a technique that models a user's evolving search intents and visualizes them as keywords for interaction. The user can provide feedback on the keywords, from which the system learns, visualizes improved intent estimates, and retrieves information. We report experiments comparing variants of a system implementing interactive intent modeling to a control system. Data comprising search logs, interaction logs, essay answers, and questionnaires indicate significant improvements in task performance, session-level information retrieval performance, information comprehension, and user experience. The improvements in retrieval effectiveness can be attributed to the intent modeling, but the effects on users' task performance, breadth of information comprehension, and user experience are shown to depend on a richer visualization. Our results demonstrate the utility of combining interactive modeling of search intents with interactive visualization of the models, which can benefit both directing the exploratory search process and making sense of the information space. Our findings can help design personalized systems that support exploratory information seeking and discovery of novel information.
Personalized recommendation of points of interest (POIs) plays a key role in satisfying users on location-based social networks (LBSNs). In this paper, we propose a probabilistic model to find the mapping between user-annotated tags and locations' taste keywords. Furthermore, we introduce a dataset on locations' contextual appropriateness and demonstrate its usefulness in predicting the contextual relevance of locations. We investigate four approaches that use our proposed mapping to address the data sparsity problem: one model that reduces the dimensionality of location taste keywords and three models that predict user tags for a new location. Moreover, we present different scores calculated from multiple LBSNs and show how we incorporate new information from the mapping into a POI recommendation approach. The computed scores are then integrated using learning-to-rank techniques. Experiments on two TREC datasets show the effectiveness of our approach, outperforming state-of-the-art methods.
The task and worker recommendation problems in crowdsourcing systems exhibit unique characteristics that are not present in traditional recommendation scenarios, namely the huge flow of tasks with short lifespans, the importance of workers' capabilities, and the quality of completed tasks. These unique features make traditional recommendation approaches no longer satisfactory for task and worker recommendation in crowdsourcing systems. In this article, we propose a two-tier data representation scheme (defining a worker-category suitability score and a worker-task attractiveness score) to support personalized task and worker recommendation. We also extend two optimization methods, namely least mean square error and Bayesian personalized rank, to better fit the characteristics of task/worker recommendation in crowdsourcing systems. We then integrate the proposed representation scheme and the extended optimization methods with two adapted popular learning models, i.e., matrix factorization and kNN, resulting in two families of top-N recommendation algorithms for crowdsourcing systems: (1) Top-N-Tasks (TNT) recommendation algorithms for discovering the top-N most suitable tasks for a given worker, and (2) Top-N-Workers (TNW) recommendation algorithms for identifying the top-N best workers for a task requester. An extensive experimental study validates the effectiveness and efficiency of a broad spectrum of algorithms, accompanied by our analysis and the insights gained.
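A minimal sketch of the matrix-factorization backbone behind such top-N task recommendation is given below. It is a generic illustration, not the article's extended optimization methods: plain squared-error SGD on hypothetical (worker, task, score) triples, followed by scoring unseen tasks for a worker.

```python
import random

def train_mf(ratings, n_workers, n_tasks, dims=4, lr=0.05, reg=0.02, epochs=500, seed=0):
    """Plain matrix factorization by SGD on (worker, task, score) triples."""
    rnd = random.Random(seed)
    W = [[rnd.gauss(0, 0.1) for _ in range(dims)] for _ in range(n_workers)]
    T = [[rnd.gauss(0, 0.1) for _ in range(dims)] for _ in range(n_tasks)]
    for _ in range(epochs):
        for w, t, s in ratings:
            err = s - sum(a * b for a, b in zip(W[w], T[t]))
            for d in range(dims):
                wd, td = W[w][d], T[t][d]
                W[w][d] += lr * (err * td - reg * wd)   # gradient step with L2 shrinkage
                T[t][d] += lr * (err * wd - reg * td)
    return W, T

def top_n_tasks(worker, W, T, seen, n):
    """Score every unseen task for a worker and return the n best."""
    scores = [(sum(a * b for a, b in zip(W[worker], t)), i)
              for i, t in enumerate(T) if i not in seen]
    return [i for _, i in sorted(scores, reverse=True)[:n]]
```

The article's TNT/TNW algorithms would replace the raw scores here with the worker-category suitability and worker-task attractiveness scores, and the squared-error objective with the extended LMS or BPR objectives.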
Information retrieval systems involve a large number of parameters. For example, a system may choose from a set of possible retrieval models or various query expansion parameters, whose values greatly influence the retrieval effectiveness. Traditionally, these parameters are set at the system level based on training queries and then used for all queries. We observe that it may not be easy to set these parameters separately, since they can be interdependent; in addition, a global setting may not best fit every individual query, so the parameters should instead be set according to each query's characteristics. We propose a novel approach that tackles this problem by dealing with entire system configurations instead of selecting a single parameter at a time. The selection of the best configuration is cast as the problem of ranking the possible configurations for a given query using learning to rank (LTR). We exploit both query features and system configuration features in the LTR method, so that the selection of a configuration is query-dependent. Experiments conducted on four TREC ad-hoc collections show that this approach significantly outperforms the traditional method of tuning the system configuration globally. We also show that query expansion features are among the most important.
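The query-dependent configuration selection can be sketched as follows, using a simplified pointwise formulation with assumed toy features (the paper's LTR method and feature sets are richer): a linear model is fit on concatenated query and configuration features to predict effectiveness, and at query time the candidate configurations are ranked by predicted effectiveness.

```python
def train_pointwise(samples, dims, lr=0.1, epochs=300):
    """Fit linear weights on (feature_vector, effectiveness) pairs by SGD."""
    w = [0.0] * dims
    for _ in range(epochs):
        for x, y in samples:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for i in range(dims):
                w[i] -= lr * err * x[i]
    return w

def best_configuration(query_feats, configs, w):
    """Rank candidate configurations for one query; return the highest-scoring one."""
    def score(cfg):
        x = query_feats + cfg            # concatenate query and config features
        return sum(wi * xi for wi, xi in zip(w, x))
    return max(configs, key=score)
```

Because the configuration features enter the model alongside the query features, the ranking of configurations can change from query to query, which is exactly what a single global parameter setting cannot do.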
People often conduct exploratory search to explore an unfamiliar information space and learn new knowledge. Since supporting highly dynamic and interactive exploratory search is still challenging for search systems, we want to investigate which factors can make exploratory search successful and satisfying from the user's perspective. Previous research suggests that domain experts have different search strategies and are more successful in finding domain-specific information, but how the level of domain expertise influences users' interactions and search outcomes in exploratory search, especially across different knowledge domains, is still unclear. In this work, via a carefully designed user study involving 30 participants, we investigate the influence of domain expertise on the interaction and outcome of exploratory search in three domains: environment, medicine, and politics. We record participants' search behaviors, including their explicit feedback and eye-fixation sequences, in a laboratory setting. With this dataset, we identify both domain-independent and domain-dependent effects on user behaviors and search outcomes. Our results extend existing research on the effect of domain expertise in search and suggest different strategies for exploiting domain expertise to support exploratory search in different knowledge domains.