Location recommendation is an important means of helping people discover attractive locations. However, the extreme sparsity of user-location matrices poses a severe challenge, so it is necessary to account for the implicit-feedback characteristics of user mobility data and to leverage locations' spatial information. To this end, building on the previously developed GeoMF, we propose a scalable and flexible framework, dubbed GeoMF++, for joint geographical modeling and implicit-feedback-based matrix factorization. We then develop an efficient optimization algorithm for parameter learning, which scales linearly with the data size and the total number of neighbor grids of all locations. GeoMF++ can be understood from two perspectives: first, it subsumes two-dimensional kernel density estimation, so it captures the spatial clustering phenomenon in user mobility data; second, it is strongly connected with the widely used neighbor additive models and graph Laplacian regularized models. Finally, we evaluate GeoMF++ on two large-scale LBSN datasets in both warm-start and cold-start scenarios. The experimental results show that GeoMF++ consistently outperforms state-of-the-art methods and other competing baselines on both datasets in terms of NDCG and Recall. Moreover, the efficiency studies show that GeoMF++ scales much better as the data size and the dimensionality of the latent space grow.
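To make the kernel-density view concrete, the sketch below illustrates how a location's influence over nearby grids can be computed with a two-dimensional Gaussian kernel, and how such a geographical term can augment the latent matrix-factorization score. This is a minimal sketch: the function names, the fixed bandwidth `sigma`, and the additive scoring form are illustrative assumptions, not details given in the abstract.

```python
import numpy as np

def grid_influence(loc_xy, grid_centers, sigma=1.0):
    """Influence of one location over its neighbor grids via a 2-D
    Gaussian kernel (the kernel-density-estimation view of GeoMF++)."""
    d2 = np.sum((grid_centers - loc_xy) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    return w / (w.sum() + 1e-12)          # normalize to a density over grids

def preference(u_latent, v_latent, u_activity, v_influence):
    """Latent MF term plus geographical term, as in GeoMF-style models."""
    return u_latent @ v_latent + u_activity @ v_influence

# Toy usage: one location influencing a three-grid neighborhood.
grids = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
infl = grid_influence(np.array([0.1, 0.2]), grids)
print(preference(np.ones(4), np.ones(4) * 0.1, np.array([0.5, 0.2, 0.3]), infl))
```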
In this paper, we initiate the study of the learning behavior of users on online judge (OJ) systems, which are open, competitive, and self-regulated. We propose to automatically mine subject and difficulty information for problems from users' learning trace data. An OJ system maintains a large pool of problems related to some specific domain, typically organized by volume. Through careful data analysis, we observe two major learning modes (i.e., patterns) of users on OJ systems with the volume organization, namely volume-consecutive and subject-consecutive. We find that although problems from the same subject are distributed across multiple volumes, users are indeed able to spontaneously identify problems from the same subject for practice. This observation finds relevance and support in classic educational psychology. To capture the two learning modes, we propose a novel two-mode Markov topic model. To estimate the difficulty of problems, we further propose a subject-aware, competition-based expertise model built on the learned topic information. Extensive experiments on three large online judge datasets demonstrate the effectiveness of our approach in three different tasks: skill topic extraction, expertise competition prediction, and problem recommendation.
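As a toy illustration of the two-mode idea, the sketch below scores a user's problem-solving trace under a mixture of volume-consecutive and subject-consecutive transitions. The hard 0/1-style transition probabilities and the single mixing weight `pi` are simplifying assumptions; the actual model learns the subject topics jointly rather than taking them as given.

```python
import numpy as np

def trace_loglik(seq, volume, subject, pi=0.5, eps=1e-6):
    """Toy log-likelihood of a problem trace under two learning modes:
    each transition is volume-consecutive with probability pi, else
    subject-consecutive. `volume` and `subject` map problem -> label."""
    ll = 0.0
    for a, b in zip(seq, seq[1:]):
        p_vol = 1.0 if volume[b] == volume[a] else eps
        p_sub = 1.0 if subject[b] == subject[a] else eps
        ll += np.log(pi * p_vol + (1.0 - pi) * p_sub)
    return ll

# Toy usage: problems 1-3 share a subject although they span two volumes.
volume = {1: "v1", 2: "v1", 3: "v2"}
subject = {1: "dp", 2: "dp", 3: "dp"}
print(trace_loglik([1, 2, 3], volume, subject))
```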
Many social network applications depend on robust representations of spatio-temporal data. In this work, we present an embedding model based on feed-forward neural networks that transforms social media check-ins into dense feature vectors encoding geographic, temporal, and functional aspects, for modelling places, neighborhoods, and users. We employ the embedding model in a variety of applications, including location recommendation, urban functional zone study, and crime prediction. For location recommendation, we propose a Spatio-Temporal Embedding Similarity algorithm (STES) based on the embedding model. In a range of experiments on real-life data collected from Foursquare, we demonstrate our model's effectiveness at characterizing places and people and its efficacy in the aforementioned problem domains. Finally, we select eight major cities around the globe and verify the robustness and generality of our model by porting pre-trained models from one city to another, thereby alleviating the need for costly local training.
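A minimal sketch of embedding-based scoring in the spirit of STES is given below: candidate venues are ranked by a blend of user-venue and time-venue similarity in the learned embedding space. The cosine similarity and the blend weight `alpha` are our assumptions, since the abstract does not spell out the similarity function.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def stes_score(user_vec, time_vec, venue_vec, alpha=0.5):
    """Blend user-venue and time-venue embedding similarity; candidate
    venues would be ranked by this score for recommendation."""
    return alpha * cosine(user_vec, venue_vec) + (1 - alpha) * cosine(time_vec, venue_vec)

# Toy usage with random stand-ins for pre-trained embeddings.
rng = np.random.default_rng(0)
u, t, v = rng.normal(size=(3, 16))
print(stes_score(u, t, v))
```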
Which venue is a tweet posted from? We refer to this as the fine-grained geolocation problem. To solve it, we extensively study venue and user characteristics on Twitter, using tweets that are tied to posting venues. For venues, we observe spatial homophily: venues near each other tend to have more similar tweet content than venues further apart. For users, we observe that they are spatially focused and more likely to visit venues near those they have previously visited. We also find that a substantial proportion of users post one or more geocoded tweets, thus providing their location history. We then propose a set of geolocation models that leverage these user and venue characteristics. We also exploit venue temporal popularity, whereby venues differ in popularity at different times of the day. Our models rank the candidate venues of test tweets such that the actual posting venue is ranked high. To better tune the model parameters, we introduce a learning-to-rank framework. Our best-performing model incorporates all of the above characteristics and significantly outperforms state-of-the-art baselines. Furthermore, we show that tweets without any location-indicative words can be geolocated meaningfully as well.
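The sketch below combines the three signal families named above (content-based spatial homophily, proximity to the user's location history, and venue temporal popularity) into a single linear ranking score whose weights a learning-to-rank method would tune. The specific feature forms and weights are illustrative assumptions.

```python
import math

def venue_score(w, content_sim, dist_to_history_km, hour_popularity):
    """Linear score over three illustrative features; candidate venues
    are ranked so the true posting venue should come out on top."""
    proximity = math.exp(-dist_to_history_km)   # users are spatially focused
    return w[0] * content_sim + w[1] * proximity + w[2] * hour_popularity

# Toy usage: rank two candidate venues for one test tweet.
w = (0.5, 0.3, 0.2)                             # weights a ranker would learn
cands = {"cafe": (0.8, 0.4, 0.6), "stadium": (0.3, 5.0, 0.9)}
print(sorted(cands, key=lambda v: -venue_score(w, *cands[v])))
```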
The growing popularity of mobile search and advances in voice recognition technology have opened the door for web search users to speak their queries rather than type them. While this kind of voice search is still in its infancy, it is gradually becoming more widespread. In this paper, we report a comprehensive voice search query log analysis of a commercial web search engine's mobile application. We compare voice and text search along various dimensions, with special focus on the semantic and syntactic characteristics of the queries. Our analysis suggests that voice queries focus more on audio-visual content and question answering and less on social networking and adult domains. In addition, voice queries are more commonly submitted on the go. We also conduct an empirical evaluation showing that the language of voice queries is closer to natural language than that of text queries. Our analysis points out further differences between voice and text search. We discuss the implications of these differences for the design of future voice-enabled web search tools.
Web search engines present, for some queries, a cluster of results from the same specialized domain (vertical) on the search results page (SERP). We present a comprehensive analysis of such clusters from seven different verticals, based on the logs of a commercial web search engine. This analysis reveals several unique characteristics, such as size, rank, and clicks, of result clusters from community question answering websites. The study of the properties of this result cluster, specifically as part of the SERP, has received little attention in previous work. Our analysis also motivates the pursuit of a long-standing challenge in ad hoc retrieval, namely selective cluster retrieval. In our setting, the specific challenge is to select for presentation the documents most highly ranked either by a cluster-based approach (namely, those in the top-retrieved cluster) or by a document-based approach. We address this classification task by representing queries with features based on those utilized for ranking the clusters, query performance predictors, and properties of the document clustering structure. Empirical evaluation on TREC data shows that our approach outperforms a recently proposed state-of-the-art cluster-based document retrieval method, as well as state-of-the-art document retrieval methods that do not account for inter-document similarities.
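The selection step can be framed as the binary classification sketched below: given per-query features (e.g., query-performance predictors and clustering-structure statistics), decide whether to present the cluster-based or the document-based ranking. The toy feature values and the choice of logistic regression are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy per-query features: [query-performance predictor, cluster cohesion].
X = np.array([[0.2, 1.3], [0.8, 0.4], [0.1, 1.1], [0.9, 0.2]])
y = np.array([1, 0, 1, 0])   # 1 = cluster-based ranking wins for this query

clf = LogisticRegression().fit(X, y)
print(clf.predict([[0.3, 1.0]]))   # pick the ranking to show for a new query
```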
User-generated trajectories (UGT), such as travel records from bus companies, capture rich information about human mobility in the offline world. However, some interesting applications of these raw footprints have not been well exploited, owing to the lack of textual information from which to infer a subject's personal interests. Although rich semantic information is contained in the user-generated content (STUGC) published in the online world, such as Twitter, little effort has been made to utilize this information to facilitate interest discovery. In this paper, we design an effective probabilistic framework named CO2 that connects the offline world with the online world in order to discover users' interests directly from their raw footprints in UGT. CO2 first infers trip intentions by utilizing the semantic information in STUGC, and then discovers user interests by aggregating those intentions. To evaluate the effectiveness of CO2, we use two large-scale real-world datasets as a case study and further conduct a questionnaire survey to show the superior performance of CO2. Our preliminary study briefly introduced the overall idea of CO2; in this paper, we present a thorough analysis of CO2 for the first time, including the model inference procedure, the complexity analysis, the related work, and extensive experiments.
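The second stage of CO2, aggregating inferred trip intentions into a user interest profile, can be sketched as below. The simple sum-and-normalize aggregation is our assumption, standing in for the framework's full probabilistic inference.

```python
import numpy as np

def user_interests(trip_intentions):
    """Aggregate per-trip intention distributions (one row per trip)
    into a normalized interest profile over intention categories."""
    profile = np.asarray(trip_intentions, dtype=float).sum(axis=0)
    return profile / profile.sum()

# Toy usage: three trips, intention categories [dining, shopping, sports].
print(user_interests([[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.1, 0.8]]))
```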
During major events, such as emergencies and disasters, a large volume of information is reported on newswire and social media platforms. Temporal summarization (TS) approaches automatically produce concise overviews of such events by extracting text snippets from related articles over time. Current TS approaches rely on a combination of event relevance and textual novelty for snippet selection. However, for events that span multiple days, textual novelty is often a poor criterion for selecting snippets, since many snippets are textually unique but semantically redundant or non-informative. In this paper, we propose a framework for diversifying snippets using explicit event aspects, building upon recent work in search result diversification. In particular, we first propose two techniques to identify explicit aspects that a user might want to see covered in a summary for different types of events. We then extend a state-of-the-art explicit diversification framework to maximize the coverage of these aspects when selecting summary snippets for unseen events. Through experimentation on the TREC TS 2013, 2014, and 2015 datasets, we show that explicit diversification for temporal summarization significantly outperforms classical novelty-based diversification, as the use of explicit event aspects reduces the number of redundant and off-topic snippets returned while also increasing summary timeliness.
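To make the diversification step concrete, the sketch below performs greedy xQuAD-style snippet selection: at each step it picks the snippet with the best trade-off between relevance and coverage of explicit event aspects not yet covered. The data structures and the mixing parameter `lam` are illustrative assumptions, not the paper's exact formulation.

```python
def select_snippets(rel, cover, aspect_w, k=3, lam=0.5):
    """Greedy xQuAD-style selection.
    rel[s]: relevance of snippet s; cover[s][a]: P(s covers aspect a);
    aspect_w[a]: importance of aspect a for this event."""
    selected = []
    not_covered = {a: 1.0 for a in aspect_w}   # prod over selected of (1 - cover)
    pool = set(rel)
    while pool and len(selected) < k:
        def gain(s):
            div = sum(aspect_w[a] * cover[s][a] * not_covered[a] for a in aspect_w)
            return (1 - lam) * rel[s] + lam * div
        best = max(pool, key=gain)
        pool.remove(best)
        selected.append(best)
        for a in aspect_w:
            not_covered[a] *= 1.0 - cover[best][a]
    return selected

# Toy usage: two aspects, three snippets; s2 is relevant but redundant
# with s1, so the diversified summary picks s1 then s3.
rel = {"s1": 0.9, "s2": 0.8, "s3": 0.5}
cover = {"s1": {"rescue": 0.9, "damage": 0.0},
         "s2": {"rescue": 0.8, "damage": 0.1},
         "s3": {"rescue": 0.0, "damage": 0.9}}
aspect_w = {"rescue": 0.5, "damage": 0.5}
print(select_snippets(rel, cover, aspect_w, k=2))
```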