ACM Transactions on Information Systems (TOIS)

Latest Articles

Swipe and Tell: Using Implicit Feedback to Predict User Engagement on Tablets

When content consumers explicitly judge content positively, we consider them to be engaged. Unfortunately, explicit user evaluations are difficult to collect, as they require user effort. Therefore, we propose to use device interactions as implicit feedback to detect engagement. We assess the usefulness of swipe interactions on tablets for...

Further Insights on Drawing Sound Conclusions from Noisy Judgments

The effectiveness of a search engine is typically evaluated using hand-labeled datasets, where the labels indicate the relevance of documents to...

An Integrated Signature-Based Framework for Efficient Visual Similarity Detection and Measurement in Video Shots

This article presents a framework for speedy video matching and retrieval through detection and...

Neural Vector Spaces for Unsupervised Information Retrieval

We propose the Neural Vector Space Model (NVSM), a method that learns representations of documents in an unsupervised manner for news article...

Sentence Relations for Extractive Summarization with Deep Neural Networks

Sentence regression is a type of extractive summarization that achieves state-of-the-art performance and is commonly used in practical systems. The...

Location Extraction from Social Media: Geoparsing, Location Disambiguation, and Geotagging

Location extraction, also called “toponym extraction,” is a field covering geoparsing, extracting spatial representations from location mentions in text, and geotagging, assigning spatial coordinates to content items. This article evaluates five “best-of-class” location extraction algorithms. We develop a geoparsing...

Factors Influencing Users’ Information Requests: Medium, Target, and Extra-Topical Dimension

We report on a crowdsourced study that investigated how two factors influence the way people formulate information requests. Our first factor, medium,...

How Does Domain Expertise Affect Users’ Search Interaction and Outcome in Exploratory Search?

People often conduct exploratory search to explore unfamiliar information space and learn new...

Location-aware Influence Maximization over Dynamic Social Streams

Influence maximization (IM), which selects a set of k seed users (a.k.a., a seed set) to maximize the influence spread over a social network, is a...


Call for Special Issue Proposals

ACM Transactions on Information Systems (TOIS) invites proposals for a special issue of the journal devoted to any topic in information retrieval.

New options for ACM authors to manage rights and permissions for their work


ACM introduces a new publishing license agreement, an updated copyright transfer agreement, and a new author-pays option that allows for perpetual open access through the ACM Digital Library. For more information, visit the ACM Author Rights webpage.


About TOIS

ACM Transactions on Information Systems (TOIS) is a scholarly journal that publishes previously unpublished high-quality scholarly articles in all areas of information retrieval. TOIS is published quarterly.

Forthcoming Articles

Brotli: A General-Purpose Data Compressor

Brotli is an open-source general-purpose data compressor introduced by Google in late 2013 and now adopted in most known browsers and Web servers. It is publicly available on GitHub, and its data format was published in July 2016 as RFC 7932. Brotli is based on the Lempel-Ziv compression scheme and was planned as a generic replacement for GZIP and ZLIB. The main goal in its design was to compress data on the Internet, which meant optimizing the resources used at decoding time while achieving maximal compression density. This paper is intended to provide the first thorough, systematic description of the Brotli format as well as a detailed computational and experimental analysis of the main algorithmic blocks underlying the current encoder implementation, together with a comparison against compressors of different families constituting the state of the art either in practice or in theory. This treatment will allow us to raise a set of new algorithmic and software-engineering problems that deserve further attention from the scientific community.

Seed-Guided Topic Model for Document Filtering and Classification

In this paper, we propose a seed-guided topic model for dataless text filtering and classification (named DFC). Given a collection of unlabeled documents and, for each category, a small set of seed words that are relevant to the semantic meaning of the category, DFC filters out the irrelevant documents and classifies the relevant documents into the corresponding categories through topic influence. DFC models two kinds of topics: category-topics and general-topics. There are also two kinds of category-topics: relevant-topics and irrelevant-topics. Each relevant-topic is associated with one specific category, representing its semantic meaning. The irrelevant-topics represent the semantics of the unknown categories covered by the document collection. DFC assumes that each document is associated with a single category-topic and a mixture of general-topics. A document is filtered, or classified, based on its posterior category-topic assignment. Experiments on two widely used datasets show that DFC consistently outperforms the state-of-the-art dataless text classifiers for both classification with filtering and classification without filtering, and achieves comparable or even better classification accuracy than the state-of-the-art supervised learning solutions. We also conduct a thorough study of the impact of seed words on the existing dataless classification techniques. The results reveal that dataless classification performance is mainly affected by the document coverage of the seed words.

Shallow and Deep Syntactic/Semantic Structures for Passage Reranking in Question Answering Systems

In this paper, we extensively study the use of syntactic and semantic structures obtained with shallow and full syntactic parsers for answer passage reranking. We propose several dependency and constituent-based structures, also enriched with Linked Open Data (LD) knowledge, to represent pairs of questions and answer passages. We encode such tree structures in learning to rank (L2R) algorithms using tree kernels, which project them into a tree substructure space, where each dimension represents a powerful syntactic/semantic feature. Additionally, since we define links between structures, the tree kernel space also includes relational features spanning question and passage structures. Our findings can be useful to build state-of-the-art systems: (i) relational syntactic structures are essential to achieve the state of the art; (ii) full syntactic dependencies can outperform shallow models; (iii) external knowledge obtained with specialized classifiers such as focus and question classification is very effective; and (iv) the semantic information derived by LD and incorporated in syntactic structures can be used to replace specific knowledge classifiers: this is a remarkable advantage with wide coverage. We demonstrate our findings by carrying out an extensive comparative experimentation on two different TREC QA corpora and WikiQA, a recent corpus widely used to test sentence rerankers.

Product-based Neural Networks for User Response Prediction over Multi-field Categorical Data

User response prediction is a crucial component for personalized information retrieval and filtering scenarios, such as online advertising, recommender systems, and web search. The data in user response prediction is mostly in a multi-field categorical format and transformed into a sparse binary representation via one-hot encoding. Due to the sparsity problems in representation and training, most research has focused on shallow models and feature engineering. Over the past three years, deep neural networks (DNNs) have attracted research attention on this problem for their high capacity and end-to-end training scheme, but their learning difficulties have not been well studied. In this paper, we study the difficulties of user response prediction in the scenario of click prediction. We first refine the latent patterns in multi-field categorical data as field-aware feature interactions and propose kernel product to tackle problems in feature representations. Then we propose Product-based Neural Networks (PNNs), which incorporate product layers to learn such interactions. Generalizing the product layer to a net-in-net architecture, we propose Product-network In Network (PIN) to seamlessly incorporate the previous models. Extensive experiments on 4 large-scale real-world click datasets demonstrate that PNNs consistently outperform 7 baselines and state-of-the-art models on various metrics.
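
As a rough illustration of the data format described in this abstract, the sketch below one-hot encodes a multi-field categorical record and then computes pairwise products between per-field embeddings. The field names, vocabularies, and embedding values are hypothetical, and the inner product is only one possible choice of product operation; the abstract does not specify the exact layer, so this is not the authors' implementation.

```python
from itertools import combinations

# Hypothetical field vocabularies, chosen only for illustration.
FIELDS = {
    "weekday": ["mon", "tue", "wed"],
    "gender": ["f", "m"],
    "city": ["london", "paris"],
}


def one_hot(record):
    """Concatenate one one-hot vector per field into a sparse binary vector."""
    vector = []
    for field, vocab in FIELDS.items():
        vector.extend(1 if record[field] == value else 0 for value in vocab)
    return vector


def inner_product(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


# Exactly one bit is set per field.
x = one_hot({"weekday": "tue", "gender": "m", "city": "london"})

# Hypothetical dense embeddings, one per field (assumed, not learned here).
embeddings = {
    "weekday": [1.0, 2.0],
    "gender": [0.5, -1.0],
    "city": [2.0, 0.0],
}

# One interaction value per pair of fields.
pairwise = {
    (a, b): inner_product(embeddings[a], embeddings[b])
    for a, b in combinations(embeddings, 2)
}
```

With three fields, the encoded vector has seven entries and `pairwise` holds three interaction values, one per pair of fields.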

From Question to Text: Question-Oriented Feature Attention for Answer Selection

Understanding unstructured texts is an essential skill for human beings, as it enables knowledge acquisition. Although understanding unstructured texts is easy for well-educated humans, it is a great challenge for machines. Recently, with the rapid development of artificial intelligence techniques, researchers have worked to teach machines to understand texts and to evaluate the trained machines by having them answer questions about given unstructured texts, inspired by the reading comprehension tests that humans take. However, the varying effectiveness of features across different questions significantly hinders the performance of answer selection, because different questions may focus on various aspects of the given text. Moreover, the correct answer is usually inferred from multiple sentences in the text. To solve these problems, we propose a question-oriented feature attention mechanism to emphasize useful features according to the given question. These weighted features from multiple sentences are then merged with a cross-sentence max pooling to jointly consider information in multiple sentences. The generated representations are finally used to infer the correct answer. Experiments on the MCTest dataset validate the effectiveness of the proposed method. As a byproduct, we release all the code, data, and parameters involved to facilitate further research.

Binary Sketches for Secondary Filtering

Efficient object retrieval based on a generic similarity is a fundamental problem in information retrieval. We focus on queries by example, which retrieve the objects most similar to a query object q. We propose an enhancement of techniques that follow the filter-and-refine paradigm. The filtering phase identifies objects likely to be similar to q. During the refinement, these objects are accessed and their similarity to q is evaluated. A problem with current datasets is that many non-relevant objects usually remain after filtering. We propose secondary filtering, which further reduces the number of objects to be refined while almost preserving the quality of the query result. It is based on sketches, compact bit strings compared by the Hamming distance. Our approach to identifying objects to be filtered out by sketches is based on a probabilistic model, which describes relationships between similarities of objects and sketches. The secondary filtering thus efficiently identifies non-relevant objects that remain after primary filtering, and we suggest a mechanism to tune the trade-off between its accuracy and effectiveness. The main advantage of our technique is its wide applicability. It can be easily incorporated into any filter-and-refine technique for similarity search.
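
The idea of comparing sketches by Hamming distance can be sketched as follows. This is a minimal illustration, assuming sketches are stored as Python integers and that a fixed bit-distance threshold (the hypothetical `max_distance` parameter) decides which candidates survive; the abstract instead describes a probabilistic model for that decision, which is not reproduced here.

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between two sketches stored as Python ints."""
    return bin(a ^ b).count("1")


def secondary_filter(query_sketch, candidates, max_distance):
    """Keep candidates whose sketch is within max_distance bits of the query.

    candidates: iterable of (object_id, sketch) pairs that survived
    primary filtering. The fixed threshold is an assumption made for
    illustration only.
    """
    return [
        obj_id
        for obj_id, sketch in candidates
        if hamming(query_sketch, sketch) <= max_distance
    ]


# Three hypothetical 8-bit sketches left over from primary filtering.
candidates = [("a", 0b10110010), ("b", 0b10110011), ("c", 0b01001101)]
survivors = secondary_filter(0b10110010, candidates, max_distance=2)
```

Only the surviving objects would then be passed to the more expensive refinement step, where their actual similarity to q is evaluated.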

Transfer to Rank for Heterogeneous One-Class Collaborative Filtering

Heterogeneous one-class collaborative filtering (HOCCF) is an emerging and important problem in recommender systems, where two different types of one-class feedback, i.e., purchases and browses, are available as input data. The associated challenges include ambiguity of browses, scarcity of purchases, and heterogeneity arising from different feedback. In this paper, we propose to model purchases and browses from a new perspective, i.e., users' roles of mixer, browser, and purchaser. Specifically, we design a novel transfer learning solution termed role-based transfer to rank (RoToR), which contains two variants, i.e., integrative RoToR and sequential RoToR. In integrative RoToR, we incorporate browses into the preference learning task of purchases, in which we take each user as a sophisticated customer (i.e., mixer) that is able to take different types of feedback into consideration. In sequential RoToR, we aim to simplify the integrative one by decomposing it into two dependent phases according to a typical shopping process. Furthermore, we instantiate both variants using different preference learning paradigms such as pointwise preference learning and pairwise preference learning. Finally, we conduct extensive empirical studies with various baseline methods on two large public datasets and find that our RoToR is significantly more accurate than the state-of-the-art methods.

A Deep Bayesian Tensor Based System for Video Recommendation

With the availability of abundant online multi-relational video information, recommender systems that can effectively exploit these sorts of data and suggest creatively interesting items will become increasingly important. Recent research illustrates that tensor models offer effective approaches for complex multi-relational data learning and missing element completion. So far, most tensor-based clustering has focused on accuracy. Given the dynamic nature of online media, recommendation in this setting is more challenging, as it is difficult to capture the users' dynamic topic distributions in sparse data settings. To construct a recommender system that can balance accuracy and creativity, a deep Bayesian probabilistic tensor framework for tag and item recommendation is proposed. Based on the Canonical PARAFAC (CP) decomposition, a Bayesian multi-layer factorization is imposed on the mode factor matrix to find a more compact representation. During the score ranking processes, a metric called Bayesian surprise is incorporated to increase the creativity of the recommended candidates. The new algorithm is evaluated on both synthetic and large-scale real-world problems. An empirical study for video recommendation demonstrates the superiority of the proposed model and indicates that it can better capture the latent patterns of interactions and generate interesting recommendations based on creative tag combinations.

Interactive Intent Modeling for Exploratory Search

Exploratory search requires the system to assist the user in comprehending the information space and expressing evolving search intents for iterative exploration and retrieval of information. We introduce interactive intent modeling, a technique that models a user's evolving search intents and visualizes them as keywords for interaction. The user can provide feedback on the keywords, from which the system learns and visualizes improved intent estimates and retrieves information. We report experiments comparing variants of a system implementing interactive intent modeling to a control system. Data comprising search logs, interaction logs, essay answers, and questionnaires indicate significant improvements in task performance, session-level information retrieval performance, information comprehension performance, and user experience. The improvements in retrieval effectiveness can be attributed to the intent modeling, but the effects on users' task performance, breadth of information comprehension, and user experience are shown to be dependent on a richer visualization. Our results demonstrate the utility of combining interactive modeling of search intentions with interactive visualization of the models, which can benefit both directing the exploratory search process and making sense of the information space. Our findings can help design personalized systems that support exploratory information seeking and discovery of novel information.

Personalized Context-Aware Point of Interest Recommendation

Personalized recommendation of points of interest plays a key role in satisfying users on location-based social networks (LBSNs). In this paper, we propose a probabilistic model to find the mapping between user-annotated tags and locations' taste keywords. Furthermore, we introduce a dataset on locations' contextual appropriateness and demonstrate its usefulness in predicting the contextual relevance of locations. We investigate four approaches to using our proposed mapping to address the data sparsity problem: one model to reduce the dimensionality of location taste keywords and three models to predict user tags for a new location. Moreover, we present different scores calculated from multiple LBSNs and show how we incorporate new information from the mapping into a POI recommendation approach. Then, the computed scores are integrated using learning to rank techniques. The experiments on two TREC datasets show the effectiveness of our approach, which outperforms state-of-the-art methods.

Efficient Learning-Based Recommendation Algorithms for Top-N Tasks and Top-N Workers in Large-Scale Crowdsourcing Systems

The task and worker recommendation problems in crowdsourcing systems exhibit unique characteristics that are not present in traditional recommendation scenarios, i.e., the huge flow of tasks with short lifespans, the importance of workers' capabilities, and the quality of completed tasks. These unique features make traditional recommendation approaches no longer satisfactory for task and worker recommendation in crowdsourcing systems. In this article, we propose a two-tier data representation scheme (defining a worker-category suitability score and a worker-task attractiveness score) to support personalized task and worker recommendation. We also extend two optimization methods, namely least mean square error and Bayesian personalized rank, to better fit the characteristics of task/worker recommendation in crowdsourcing systems. We then integrate the proposed representation scheme and the extended optimization methods with two adapted popular learning models, i.e., matrix factorization and kNN, resulting in two lines of top-N recommendation algorithms for crowdsourcing systems: (1) Top-N-Tasks (TNT) recommendation algorithms for discovering the top-N most suitable tasks for a given worker, and (2) Top-N-Workers (TNW) recommendation algorithms for identifying the top-N best workers for a task requester. An extensive experimental study validates the effectiveness and efficiency of a broad spectrum of algorithms, accompanied by our analysis and the insights gained.

Learning to Adaptively Rank Document Retrieval System Configurations

Information Retrieval systems involve a large number of parameters. For example, a system may choose from a set of possible retrieval models, or various query expansion parameters, whose values greatly influence the retrieval effectiveness. Traditionally, these parameters are set at system level based on training queries, and then used for different queries. We observe that it may not be easy to set all these parameters separately since they can be dependent. In addition, a global setting for all queries may not best fit all individual queries. The parameters should instead be set according to the characteristics of each query. We propose a novel approach to tackle this problem by dealing with entire system configurations instead of selecting a single parameter at a time. The selection of the best configuration is cast as a problem of ranking different possible configurations given a query using learning-to-rank (LTR). We exploit both the query features and the system configuration features in the LTR method, so that the selection of configuration is query-dependent. The experiments we conducted on four TREC ad-hoc collections show that this approach significantly outperforms the traditional method of tuning the system configuration globally. We also show that query expansion features are among the most important.

Jointly Minimizing the Expected Costs of Review for Responsiveness and Privilege in E-Discovery

Discovery is an important aspect of the civil litigation process in the USA, in which parties to a lawsuit are permitted to request relevant evidence from other parties. With the rapid growth of digital content, the emerging need for e-discovery has created a demand for techniques that can be used to review massive collections both for responsiveness (i.e., relevance) to the request and for privilege (i.e., presence of legally protected content that the party performing the review may have a right to withhold). In this process, the party performing the review may incur costs of two types, namely, annotation costs (deriving from the fact that human reviewers need to be paid for their work) and misclassification costs (deriving from the fact that failing to correctly determine the responsiveness or privilege of a document may adversely affect the interests of the parties in various ways). Relying exclusively on automatic classification would minimize annotation costs but could result in substantial misclassification costs, while relying exclusively on manual classification could generate the opposite consequences. This paper proposes a risk minimization framework (called MINECORE, for minimizing the expected costs of review) that seeks to strike an optimal balance between these two extremes.
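
The trade-off between annotation and misclassification costs described above can be illustrated with a toy expected-cost rule. This is a minimal sketch and not the MINECORE framework itself: the function names, the cost values, and the assumption that manual review is error-free are all hypothetical simplifications introduced here.

```python
def expected_misclassification_cost(p_positive, cost_fp, cost_fn, predict_positive):
    """Expected cost of an automatic decision for one document.

    p_positive: classifier's probability that the document is positive
    (e.g., responsive). cost_fp and cost_fn are assumed cost values.
    """
    if predict_positive:
        # The decision is wrong only if the document is actually negative.
        return (1.0 - p_positive) * cost_fp
    # The decision is wrong only if the document is actually positive.
    return p_positive * cost_fn


def should_review_manually(p_positive, cost_fp, cost_fn, annotation_cost):
    """Send a document to a human when that is cheaper in expectation.

    Assumes (for illustration only) that a human reviewer never errs,
    so manual review costs exactly annotation_cost.
    """
    auto_cost = min(
        expected_misclassification_cost(p_positive, cost_fp, cost_fn, True),
        expected_misclassification_cost(p_positive, cost_fp, cost_fn, False),
    )
    return annotation_cost < auto_cost


# An uncertain document is worth paying a reviewer for; a confident one is not.
uncertain = should_review_manually(0.5, cost_fp=10.0, cost_fn=10.0, annotation_cost=1.0)
confident = should_review_manually(0.99, cost_fp=10.0, cost_fn=10.0, annotation_cost=1.0)
```

Under these assumed costs, documents on which the classifier is uncertain are routed to human reviewers, while confidently classified documents are handled automatically.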
