Data Mining Projects – ElysiumPro

CSE Projects
Description
Data Mining is the computing process of discovering patterns in large data sets, at the intersection of machine learning, statistics, and database systems. We provide data mining algorithms with source code to students, which can solve many real-time issues in various software-based systems.

Quality Factor

  • 100% Assured Results
  • Best Project Explanation
  • Tons of References
  • Cost Optimized
  • Control Panel Access


1. Discrete Nonnegative Spectral Clustering
Spectral clustering has been playing a vital role in various research areas. Most traditional spectral clustering algorithms comprise two independent stages (first learning continuous labels, then rounding the learned labels into discrete ones), which may cause the resultant cluster labels to deviate unpredictably from the genuine ones, leading to severe information loss and performance degradation. In this work, we study how to achieve discrete clustering and reliably generalize to unseen data. We propose a novel spectral clustering scheme which deeply explores cluster label properties, including discreteness, non-negativity, and discrimination, and learns robust out-of-sample prediction functions. Specifically, we explicitly enforce a discrete transformation on the intermediate continuous labels, which leads to a tractable optimization problem with a discrete solution. Besides, we preserve the natural nonnegative characteristic of the clustering labels to enhance the interpretability of the results. Moreover, to further compensate for the unreliability of the learned clustering labels, we integrate an adaptive robust module with an ℓ2,p-norm loss to learn a prediction function for grouping unseen data. We also show that the out-of-sample component can inject discriminative knowledge into the learning of cluster labels under certain conditions. Extensive experiments conducted on various data sets demonstrate the superiority of our proposal compared with several existing clustering approaches.
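For orientation, here is a minimal Python sketch (not the authors' method) of the conventional two-stage pipeline the abstract argues against: a continuous spectral embedding via the normalized Laplacian, followed by k-means rounding of the relaxed labels. All names and parameters are illustrative.

```python
# A minimal sketch of the classical two-stage spectral clustering pipeline:
# continuous relaxation, then k-means rounding of the relaxed labels.
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

def two_stage_spectral_clustering(X, n_clusters, gamma=1.0):
    W = rbf_kernel(X, gamma=gamma)                 # affinity matrix
    np.fill_diagonal(W, 0.0)
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1) + 1e-12)
    L_sym = np.eye(len(X)) - (d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :])
    # Stage 1: continuous labels = smallest eigenvectors of L_sym
    _, vecs = eigh(L_sym, subset_by_index=[0, n_clusters - 1])
    U = vecs / (np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-12)
    # Stage 2: round the continuous labels into discrete ones with k-means
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(U)

X = np.random.rand(200, 2)
labels = two_stage_spectral_clustering(X, n_clusters=3)
```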

2. Finding Related Forum Posts through Content Similarity over Intention-based Segmentation
We study the problem of finding forum posts related to a post at hand. In contrast to traditional approaches, which compare the content of posts as a whole, we consider each post as a set of segments, each written with a different goal in mind. We advocate that the relatedness between two posts should be based on the similarity of their respective segments that are intended for the same goal, i.e., that convey the same intention. This means that the same terms may weigh differently in the relatedness score depending on the intention of the segment in which they are found. We have developed a segmentation method that, by monitoring a number of text features, identifies the points in a post where significant jumps occur, indicating where a segmentation should take place. The generated segments of all the posts are clustered to form intention clusters, and similarities across posts are then calculated through similarities across segments with the same intention. We experimentally illustrate the effectiveness and efficiency of our segmentation method and our overall approach to finding related forum posts.
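The following sketch illustrates the overall idea under simplifying assumptions (TF-IDF segment vectors, k-means intention clusters); it is not the authors' code, and the jump-based segmentation is assumed to have already produced the segments.

```python
# Illustrative sketch only: cluster segments into "intention" groups, then
# accumulate relatedness only over same-intention segment pairs.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

def post_relatedness(segments_a, segments_b, all_segments, n_intentions=2):
    vec = TfidfVectorizer().fit(all_segments)
    km = KMeans(n_clusters=n_intentions, n_init=10).fit(vec.transform(all_segments))
    A, B = vec.transform(segments_a), vec.transform(segments_b)
    ia, ib = km.predict(A), km.predict(B)
    score = 0.0
    for i, ca in enumerate(ia):
        for j, cb in enumerate(ib):
            if ca == cb:                       # same intention cluster
                score += cosine_similarity(A[i], B[j])[0, 0]
    return score

posts = {
    "p1": ["how do I configure the audio driver", "thanks a lot in advance"],
    "p2": ["driver configuration fails on boot", "any help is appreciated"],
}
all_segs = [s for segs in posts.values() for s in segs]
print(post_relatedness(posts["p1"], posts["p2"], all_segs))
```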

3. Hierarchical Contextual Attention Recurrent Neural Network for Map Query Suggestion
The query logs of an online map query system provide rich cues for understanding the behaviours of human crowds. With the growing ability to collect large-scale query logs, query suggestion has become a topic of recent interest. In general, query suggestion aims at recommending a list of relevant queries w.r.t. users' inputs via an appropriate learning of crowds' query logs. In this paper, we are particularly interested in map query suggestion (e.g., the prediction of location-related queries) and propose a novel model, the Hierarchical Contextual Attention Recurrent Neural Network (HCAR-NN), for map query suggestion in an encoding-decoding manner. Given crowds' map query logs, our proposed HCAR-NN not only learns the local temporal correlation among map queries in a query session (e.g., queries in a short-term interval are relevant to accomplishing a search mission), but also captures the global longer-range contextual dependencies among map query sessions in query logs (e.g., how a sequence of queries within a short-term interval influences another sequence of queries). We evaluate our approach over millions of queries from a commercial search engine (Baidu Map). Experimental results show that the proposed approach provides significant performance improvements over competitive existing methods in terms of classical metrics (Recall@K and MRR) as well as the prediction of crowds' search missions.

4. A Systematic Review on Educational Data Mining
Educational institutions presently compile and store huge volumes of data, such as student enrolment and attendance records, as well as examination results. Mining such data yields stimulating information that serves its handlers well. The rapid growth in educational data points to the fact that distilling massive amounts of data requires a more sophisticated set of algorithms. This issue led to the emergence of the field of Educational Data Mining (EDM). Traditional data mining algorithms cannot be directly applied to educational problems, as they may have a specific objective and function. This implies that a pre-processing algorithm has to be applied first, and only then can specific data mining methods be applied to the problems. One such pre-processing technique in EDM is clustering. Many studies on EDM have focused on the application of various data mining algorithms to educational attributes. Therefore, this paper provides a systematic literature review spanning more than three decades (1983-2016) on clustering algorithms and their applicability and usability in the context of EDM. Future insights are outlined based on the literature reviewed, and avenues for further research are identified.

5. A Workflow Management System for Scalable Data Mining on Clouds
The extraction of useful information from data is often a complex process that can be conveniently modeled as a data analysis workflow. When very large data sets must be analyzed and/or complex data mining algorithms must be executed, data analysis workflows may take very long times to complete. Therefore, efficient systems are required for the scalable execution of data analysis workflows, exploiting the computing services of the Cloud platforms where data is increasingly being stored. The objective of this paper is to demonstrate how Cloud software technologies can be integrated to implement an effective environment for designing and executing scalable data analysis workflows. We describe the design and implementation of the Data Mining Cloud Framework (DMCF), a data analysis system that integrates a visual workflow language and a parallel runtime with the Software-as-a-Service (SaaS) model. DMCF was designed taking into account the needs of real data mining applications, with the goal of simplifying the development of data mining applications compared to generic workflow management systems that are not specifically designed for this domain. The result is a high-level environment that, through an integrated visual workflow language, minimizes programming effort, making it easier for domain experts to use common patterns specifically designed for the development and parallel execution of data mining applications. DMCF's visual workflow language, system architecture, and runtime mechanisms are presented. We also discuss several data mining workflows developed with DMCF and the scalability obtained by executing such workflows on a public Cloud.

6. Stochastic Blockmodeling and Variational Bayes Learning for Signed Network Analysis
Signed networks with positive and negative links have attracted considerable research interest, since they contain more information than unsigned networks. Community detection and sign (or attitude) prediction, the fundamental problems of signed network analysis, remain primary challenges. To address them, a generative Bayesian approach is presented in which 1) a signed stochastic blockmodel is proposed to characterize the community structure in the context of signed networks, by explicitly formulating the distributions of the density and frustration of signed links from a stochastic perspective, and 2) a model learning algorithm is developed by theoretically deriving a variational Bayes EM algorithm for parameter estimation and a variation-based approximate evidence for model selection. Comparison of this approach with state-of-the-art methods on synthetic and real-world networks shows its advantage in community detection and sign prediction for exploratory networks.

7. Analyzing Sentiments in One Go: A Supervised Joint Topic Modeling Approach
In this work, we focus on modeling user-generated review and overall rating pairs, and aim to identify semantic aspects and aspect-level sentiments from review data as well as to predict overall sentiments of reviews. We propose a novel probabilistic supervised joint aspect and sentiment model (SJASM) to deal with these problems in one go under a unified framework. SJASM represents each review document in the form of opinion pairs, and can simultaneously model aspect terms and the corresponding opinion words of the review for hidden aspect and sentiment detection. It also leverages sentimental overall ratings, which often come with online reviews, as supervision data, and can infer semantic aspects and aspect-level sentiments that are not only meaningful but also predictive of overall review sentiments. Moreover, we develop an efficient inference method for parameter estimation of SJASM based on collapsed Gibbs sampling. We evaluate SJASM extensively on real-world review data, and experimental results demonstrate that the proposed model outperforms seven well-established baseline methods for sentiment analysis tasks.

8. A Scalable Data Chunk Similarity Based Compression Approach for Efficient Big Sensing Data Processing on Cloud
Big sensing data is prevalent in both industry and scientific research applications, where the data is generated with high volume and velocity. Cloud computing provides a promising platform for big sensing data processing and storage, as it provides a flexible stack of massive computing, storage, and software services in a scalable manner. Current big sensing data processing on Cloud has adopted some data compression techniques. However, due to the high volume and velocity of big sensing data, traditional data compression techniques lack sufficient efficiency and scalability for data processing. Based on specific on-Cloud data compression requirements, we propose a novel scalable data compression approach based on calculating similarity among the partitioned data chunks. Instead of compressing basic data units, the compression is conducted over partitioned data chunks. To restore the original data sets, restoration functions and predictions are designed. MapReduce is used for algorithm implementation to achieve extra scalability on Cloud. With real-world meteorological big sensing data experiments on the U-Cloud platform, we demonstrate that the proposed scalable compression approach based on data chunk similarity can significantly improve data compression efficiency with affordable data accuracy loss.
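A toy sketch of the core idea, with made-up chunk size and tolerance: a chunk close enough to an already-stored reference chunk is replaced by a pointer to it, so redundancy in the stream translates into compression at a bounded accuracy loss. The paper's MapReduce implementation and prediction-based restoration are not reproduced here.

```python
# Toy chunk-similarity compression: pointers replace near-duplicate chunks.
import numpy as np

def compress_by_chunk_similarity(data, chunk_size=64, tol=0.5):
    chunks = data.reshape(-1, chunk_size)
    references, encoded = [], []
    for c in chunks:
        for idx, r in enumerate(references):
            if np.linalg.norm(c - r) <= tol:   # similar enough: keep pointer
                encoded.append(idx)
                break
        else:                                  # novel chunk: store it fully
            references.append(c)
            encoded.append(len(references) - 1)
    return references, encoded

def decompress(references, encoded):
    return np.concatenate([references[i] for i in encoded])

data = np.repeat(np.random.rand(8, 64), 4, axis=0).ravel()  # redundant stream
refs, enc = compress_by_chunk_similarity(data)
print(len(refs), "reference chunks for", len(enc), "chunks")  # 8 for 32
restored = decompress(refs, enc)
```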

9. Search Rank Fraud and Malware Detection in Google Play
Fraudulent behaviors in Google Play, the most popular Android app market, fuel search rank abuse and malware proliferation. To identify malware, previous work has focused on app executable and permission analysis. In this paper, we introduce FairPlay, a novel system that discovers and leverages traces left behind by fraudsters to detect both malware and apps subjected to search rank fraud. FairPlay correlates review activities and uniquely combines detected review relations with linguistic and behavioral signals gleaned from Google Play app data (87K apps, 2.9M reviews, and 2.4M reviewers, collected over half a year) in order to identify suspicious apps. FairPlay achieves over 95 percent accuracy in classifying gold standard datasets of malware, fraudulent, and legitimate apps. We show that 75 percent of the identified malware apps engage in search rank fraud. FairPlay discovers hundreds of fraudulent apps that currently evade Google Bouncer's detection technology. FairPlay also helped discover more than 1,000 reviews, reported for 193 apps, that reveal a new type of "coercive" review campaign: users are harassed into writing positive reviews, and into installing and reviewing other apps.

10. Probase+: Inferring Missing Links in Conceptual Taxonomies
Much work has focused on automatically constructing conceptual taxonomies or semantic networks from large text corpora. In this paper, we use a state-of-the-art data-driven conceptual taxonomy, Probase, to show that missing links in taxonomies are the chief problem that hinders their adoption by many real life applications, for the missing links break the inferencing that the conceptual taxonomy claims to support. To solve this problem, we devise a collaborative filtering framework to infer missing links in taxonomies derived from text corpora. We implement our method mainly on Probase, creating a denser taxonomy containing 5.1 million (about 30 percent) more isA relationships, with an accuracy of above 90 percent. We conduct comprehensive experiments to demonstrate the quality of the revised conceptual taxonomies.

11. User Vitality Ranking and Prediction in Social Networking Services: A Dynamic Network Perspective
Social networking services have become prevalent in many online communities, such as Twitter.com and Weibo.com, where millions of users interact with each other every day. One interesting and important problem in social networking services is to rank users based on their vitality in a timely fashion. An accurate vitality-based ranking list of users could benefit many parties in social network services, such as ad providers and site operators. Although obtaining such a ranking list is very promising, there are many technical challenges due to the large scale and dynamics of social networking data. In this paper, we propose a unique perspective to achieve this goal: quantifying user vitality by analyzing the dynamic interactions among users on social networks. Examples of social networks include, but are not limited to, social networks on microblog sites and academic collaboration networks. Intuitively, if a user has many interactions with his friends within a time period while most of his friends do not have many interactions with their friends over the same period, it is very likely that this user has high vitality. Based on this idea, we develop quantitative measurements for user vitality and propose our first algorithm for ranking users based on vitality. We further consider the mutual influence between users while computing the vitality measurements and propose a second ranking algorithm, which computes user vitality in an iterative way. Beyond user vitality ranking, we also introduce a vitality prediction problem, which is of great importance for many applications in social networking services. Along this line, we develop a customized prediction model to solve the vitality prediction problem. To evaluate the performance of our algorithms, we collect two dynamic social network data sets. The experimental results with both data sets clearly demonstrate the advantage of our ranking and prediction methods.

12. RaPare: A Generic Strategy for Cold-Start Rating Prediction Problem
In recent years, recommender systems have become indispensable components of many e-commerce websites. One of the major challenges that largely remains open is the cold-start problem, which can be viewed as a barrier that keeps cold-start users/items away from existing ones. In this paper, we aim to break through this barrier for cold-start users/items with the assistance of existing ones. In particular, inspired by the classic Elo Rating System, which has been widely adopted in chess tournaments, we propose a novel rating comparison strategy (RAPARE) to learn the latent profiles of cold-start users/items. The centerpiece of RAPARE is a fine-grained calibration of the latent profiles of cold-start users/items, obtained by exploring the differences between cold-start and existing users/items. As a generic strategy, RAPARE can be instantiated into existing methods in recommender systems. To reveal its capability, we instantiate the strategy on two prevalent methods, matrix factorization based and neighborhood based collaborative filtering. Experimental evaluations on five real data sets validate the superiority of our approach over existing methods in cold-start scenarios.
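As a rough illustration of the Elo intuition behind RAPARE (not the paper's actual update rules), the sketch below nudges a cold-start user's latent profile by the gap between the observed rating and the rating expected from the current profile, the analogue of Elo's S − E term. The learning rate and logistic link are assumptions.

```python
# Elo-style calibration of a cold-start latent profile (illustrative only).
import numpy as np

def elo_style_update(user_vec, item_vec, rating, lr=0.05, r_max=5.0):
    predicted = float(user_vec @ item_vec)
    expected = r_max / (1.0 + np.exp(-predicted))   # expected "score"
    surprise = rating - expected                    # analogue of Elo's S - E
    return user_vec + lr * surprise * item_vec      # calibrate the profile

user = 0.1 * np.random.randn(10)    # cold-start latent profile
item = 0.1 * np.random.randn(10)    # existing item's latent profile
user = elo_style_update(user, item, rating=4.0)
```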

13. Multi-Behavioral Sequential Prediction with Recurrent Log-Bilinear Model
With the rapid growth of Internet applications, sequential prediction in collaborative filtering has become an emerging and crucial task. Given the behavioral history of a specific user, predicting his or her next choice plays a key role in improving various online services. Meanwhile, there are more and more scenarios with multiple types of behaviors, while existing works mainly study sequences with a single type of behavior. As a widely used approach, Markov chain based models rely on a strong independence assumption. As two classical neural network methods for modeling sequences, recurrent neural networks cannot model short-term contexts well, and the log-bilinear model is not suitable for long-term contexts. In this paper, we propose a Recurrent Log-BiLinear (RLBL) model. It can model multiple types of behaviors in historical sequences with behavior-specific transition matrices. RLBL applies a recurrent structure for modeling long-term contexts. It models several items in each hidden layer and employs position-specific transition matrices for modeling short-term contexts. Moreover, since continuous time differences in behavioral history are a key factor for dynamic prediction, we further extend RLBL, replacing position-specific transition matrices with time-specific transition matrices, and accordingly propose a Time-Aware Recurrent Log-BiLinear (TA-RLBL) model. Experimental results show that the proposed RLBL and TA-RLBL models yield significant improvements over competitive methods on three datasets with different numbers of behavior types: the Movielens-1M dataset, the Global Terrorism Database, and the Tmall dataset.

14. Modeling the Evolution of Users' Preferences and Social Links in Social Networking Services
Sociologists have long agreed that the evolution of a Social Networking Service (SNS) is driven by the interplay between users' preferences (reflected in user-item interaction behavior) and the social network structure (reflected in user-user interaction behavior). Nevertheless, traditional approaches either modeled these two kinds of behaviors in isolation or relied on a static assumption of an SNS. Thus, it is still unclear how the dynamic social network structure and users' historical preferences affect the evolution of SNSs. Furthermore, can incorporating the underlying social theories into the platform evolution modeling process benefit both behavior prediction tasks? In this paper, we incorporate the underlying social theories to explain and model the evolution of users' two kinds of behaviors in SNSs. Specifically, we present two kinds of representations for users' behaviors: a direct (latent) representation that presumes users' behaviors are represented directly (latently) by their historical behaviors. Under each representation, we associate each user's two kinds of behaviors with two vectors at each time. Then, for each representation, we propose a corresponding learning model that fuses the interplay between users' two kinds of behaviors. Finally, extensive experimental results demonstrate the effectiveness of our proposed models for both user preference prediction and social link suggestion.

15. A Unified Framework for Metric Transfer Learning
Transfer learning has been proven effective for problems where training data from a source domain and test data from a target domain are drawn from different distributions. To reduce the distribution divergence between the source domain and the target domain, many previous studies have focused on designing and optimizing objective functions with the Euclidean distance as the dissimilarity measure between instances. However, in some real-world applications, the Euclidean distance may be inappropriate for capturing the intrinsic similarity or dissimilarity between instances. To deal with this issue, we propose a metric transfer learning framework (MTLF) that encodes metric learning in transfer learning. In MTLF, instance weights are learned and exploited to bridge the distributions of different domains, while a Mahalanobis distance is learned simultaneously to minimize the intra-class distances and maximize the inter-class distances for the target domain. Unlike previous work, where instance weights and the Mahalanobis distance are trained in a pipelined framework that potentially leads to error propagation across components, MTLF learns the instance weights and Mahalanobis distance in a parallel framework to make knowledge transfer across domains more effective. Furthermore, we develop general solutions to both classification and regression problems on top of MTLF. We conduct extensive experiments on several real-world datasets for object recognition, handwriting recognition, and WiFi localization to verify the effectiveness of MTLF compared with a number of state-of-the-art methods.
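The sketch below shows, with assumed names, the two quantities MTLF couples: a Mahalanobis distance parameterized by a learned matrix M (identity here as a placeholder) and instance weights that re-balance source data toward the target distribution, approximated by a crude kernel-density ratio rather than the paper's joint optimization.

```python
# Illustrative ingredients of metric transfer learning (not MTLF itself).
import numpy as np

def mahalanobis(x, y, M):
    d = x - y
    return float(np.sqrt(d @ M @ d))

def density_ratio_weights(Xs, Xt, gamma=1.0):
    def kde(samples, queries):          # Gaussian-kernel density estimate
        d2 = ((queries[:, None, :] - samples[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2).mean(axis=1)
    w = kde(Xt, Xs) / (kde(Xs, Xs) + 1e-12)   # target / source density at Xs
    return w / w.mean()

Xs = np.random.rand(50, 3)                    # source instances
Xt = np.random.rand(40, 3) + 0.3              # shifted target instances
M = np.eye(3)                                 # placeholder learned metric
print(mahalanobis(Xs[0], Xt[0], M), density_ratio_weights(Xs, Xt)[:3])
```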

16. Efficient Top-k Dominating Computation on Massive Data
In many applications, the top-k dominating query is an important operation, returning the k tuples with the highest domination scores in a potentially huge data space. Our analysis shows that existing algorithms suffer performance problems when executed on massive data. This paper proposes a novel table-scan-based algorithm, TDTS, to compute top-k dominating results efficiently. TDTS first presorts the table to enable early termination; we propose an early-termination check together with a theoretical analysis of the scan depth. We also devise a pruning operation for tuples, and the theoretical pruning effect shows that the number of tuples maintained in TDTS can be reduced substantially. Extensive experimental results on synthetic and real-life data sets show that TDTS outperforms existing algorithms significantly.
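For reference, here is the brute-force definition of the domination score (a tuple's score is how many tuples it dominates); TDTS's presorting, early termination, and pruning exist precisely to avoid this quadratic scan on massive data.

```python
# Brute-force top-k dominating: quadratic, for definition only.
def dominates(a, b):
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def topk_dominating(tuples, k):
    scored = [(t, sum(dominates(t, u) for u in tuples)) for t in tuples]
    return sorted(scored, key=lambda ts: -ts[1])[:k]

data = [(3, 4), (1, 2), (5, 1), (2, 2), (4, 4)]
print(topk_dominating(data, k=2))   # [((4, 4), 3), ((3, 4), 2)]
```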

17. Keyword Search over Distributed Graphs with Compressed Signature
Graph keyword search has drawn much research interest, since graph models can represent both structured and unstructured databases, and keyword searches can extract valuable information for users without knowledge of the underlying schema and query language. In practice, data graphs can be extremely large, e.g., a Web-scale graph containing billions of vertices. The state-of-the-art approaches employ centralized algorithms to process graph keyword searches, and are thus infeasible for such large graphs, due to the limited computational power and storage space of a centralized server. To address this problem, we investigate keyword search for Web-scale graphs deployed in a distributed environment. We first give a naive search algorithm to answer the query. However, the naive search algorithm uses a flooding search strategy that incurs large time and network overhead. To remedy this shortcoming, we then propose a signature-based search algorithm. Specifically, we design a vertex signature that encodes the shortest-path distance from a vertex to any given keyword in the graph. As a result, we can find query answers by exploring fewer paths, so that the time and communication costs are low. Moreover, we reorganize the graph data in the cluster after its initial random partitioning, so that the signature-based techniques become even more effective. Finally, our experimental results demonstrate the feasibility of our proposed approach to performing keyword searches over Web-scale graph data.
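A small sketch of the signature idea under assumed data structures: for each keyword, one multi-source BFS from the vertices carrying it yields every vertex's shortest-path distance to that keyword, which is what lets the search explore fewer paths.

```python
# Per-keyword shortest-path distances via multi-source BFS (sketch).
from collections import deque

def keyword_distances(adj, keyword_vertices):
    dist = {v: float("inf") for v in adj}
    q = deque()
    for v in keyword_vertices:          # multi-source BFS seeds
        dist[v] = 0
        q.append(v)
    while q:
        u = q.popleft()
        for w in adj[u]:
            if dist[w] == float("inf"):
                dist[w] = dist[u] + 1
                q.append(w)
    return dist

adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
keywords = {"db": [1], "graph": [4]}
signature = {kw: keyword_distances(adj, vs) for kw, vs in keywords.items()}
print(signature["db"][4])   # vertex 4 is 3 hops from keyword "db"
```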

18. Flow Classification for Software-Defined Data Centers Using Stream Mining
Traffic management is known to be important to effectively utilize the high bandwidth provided by data centers. Recent works have focused on identifying elephant flows and rerouting them to improve network utilization. These approaches, however, require either significant monitoring overhead or hardware/end-host modifications. In this paper, we propose FlowSeer, a fast, low-overhead elephant flow detection and scheduling system using data stream mining. Our key idea is that the features of a flow's first few packets allow us to train streaming classification models that can accurately and quickly predict the rate and duration of any initiated flow. With this predicted information, FlowSeer can adapt the routing policies of elephant flows to their demands and to dynamic network conditions. Another nice property of FlowSeer is its capability of enabling the controller and switches to perform cooperative prediction: most decisions can be made by switches locally, thereby reducing both detection latency and signaling overhead. FlowSeer requires fewer than 100 flow table entries at each switch to enable cooperative prediction, and hence can be implemented on off-the-shelf switches. Evaluation via both experiments in realistic virtual networks and trace-driven simulations shows that FlowSeer improves throughput by multiple times over Hedera, which pulls flow statistics, and performs comparably to Mahout, which needs end-host modification.

19. A Framework for Mining RFID Data from Schedule-Based Systems
A schedule-based system is a system that operates on a schedule of events and breaks at particular time intervals. Entities within the system show presence or absence at these events by entering or exiting the event locations. Given radio frequency identification (RFID) data from a schedule-based system, what can we learn about the system (the events and entities) through data mining? Which data mining methods can be applied so that one can obtain rich, actionable insights regarding the system and the domain? The research goal of this paper is to answer these research questions through the development of a framework that systematically produces actionable insights for a given schedule-based system. We show that by integrating appropriate data mining methodologies into a unified framework, one can obtain many insights from even a very simple RFID dataset that contains only a few fields. The developed framework is general and applicable to any schedule-based system, as long as it operates under certain basic assumptions. The types of insights are also general, and are formulated in this paper in the most abstract way. The applicability of the developed framework is illustrated through a case study in which real-world data from a schedule-based system is analyzed using the introduced framework. The insights obtained include the profiling of entities and events, the interactions between entities and events, and the relations between events.

20. Optimizing Parallel Clustering Throughput in Shared Memory
This article studies the optimization of parallel clustering throughput in the context of variant-based parallelism, which exploits commonalities and reuse among variant computations for multithreading scalability. This direction is motivated by challenging scientific applications where scientists have to execute multiple runs of clustering algorithms with different parameters to determine which ones best explain phenomena observed in empirical data. To make this process more efficient, we propose a novel set of optimizations to maximize the throughput of Density-Based Spatial Clustering of Applications with Noise (DBSCAN), a frequently used algorithm for scientific data mining in astronomy, geoscience, and many other fields. Our approach executes multiple algorithm variants in parallel, computes clusters concurrently, and leverages heuristics to maximize the reuse of results from completed variants. As scientific datasets continue to grow, maximizing clustering throughput with our techniques may accelerate the search for and identification of natural phenomena of interest with computational support, i.e., Computer-Aided Discovery. We present evaluations on a whole spectrum of data sets, such as geoscience data on space weather phenomena and astronomical data from the Sloan Digital Sky Survey on intermediate-redshift galaxies, as well as synthetic datasets to characterize performance properties. Selected results show a 1115% performance improvement due to indexing tailored for variant-based clustering, and a 2209% performance improvement when applying all of our proposed optimizations.
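The baseline being optimized can be pictured as below (using scikit-learn for illustration): the same data clustered under many DBSCAN parameter variants, each recomputed from scratch. The paper's contribution is to execute such variants concurrently and reuse results across them instead.

```python
# Naive multi-variant DBSCAN: every parameter variant recomputed from scratch.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(1000, 2)
variants = [(eps, minpts) for eps in (0.02, 0.05, 0.10) for minpts in (5, 10)]
for eps, min_samples in variants:
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(f"eps={eps}, minPts={min_samples}: {n_clusters} clusters")
```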

21. Efficient kNN Classification with Different Numbers of Nearest Neighbors
The k nearest neighbor (kNN) method is a popular classification method in data mining and statistics because of its simple implementation and significant classification performance. However, it is impractical for traditional kNN methods to assign a fixed k value (even one set by experts) to all test samples. Previous solutions assign different k values to different test samples via cross validation, but are usually time-consuming. This paper proposes a kTree method to learn different optimal k values for different test/new samples by adding a training stage to kNN classification. Specifically, in the training stage, kTree first learns optimal k values for all training samples via a new sparse reconstruction model, and then constructs a decision tree (namely, the kTree) using the training samples and the learned optimal k values. In the test stage, the kTree quickly outputs the optimal k value for each test sample, and kNN classification is then conducted using the learned optimal k value and all training samples. As a result, the proposed kTree method has a similar running cost but higher classification accuracy compared with traditional kNN methods, which assign a fixed k value to all test samples. Moreover, the proposed kTree method needs less running cost but achieves similar classification accuracy compared with newer kNN methods that assign different k values to different test samples. This paper further proposes an improved version of the kTree method (namely, the k*Tree method) to speed up the test stage by additionally storing information about the training samples in the leaf nodes of the kTree, such as the training samples located in the leaf nodes, their kNNs, and the nearest neighbors of these kNNs. The resulting decision tree, called the k*Tree, enables kNN classification using a subset of the training samples in the leaf nodes rather than all training samples, which further reduces the running cost of the test stage.
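A hedged approximation of the kTree idea assembled from scikit-learn parts: a decision tree is fit to per-sample k values, and the test stage runs kNN with the k the tree predicts. The per-sample k values here come from a crude leave-one-out proxy, not the paper's sparse reconstruction model.

```python
# kTree-style pipeline sketch: learn a per-sample k, then classify with it.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsClassifier

def best_k_per_sample(X, y, grid=(1, 3, 5, 7, 9)):
    ks = []
    for i in range(len(X)):             # leave-one-out over a small k grid
        hits = [KNeighborsClassifier(n_neighbors=k)
                .fit(np.delete(X, i, 0), np.delete(y, i))
                .predict(X[i:i + 1])[0] == y[i] for k in grid]
        ks.append(grid[int(np.argmax(hits))])
    return np.array(ks)

X = np.random.rand(120, 2)
y = (X[:, 0] + X[:, 1] > 1).astype(int)
tree = DecisionTreeRegressor(max_depth=4).fit(X, best_k_per_sample(X, y))

x_test = np.array([[0.4, 0.7]])
k_star = max(1, int(round(tree.predict(x_test)[0])))   # per-sample k
pred = KNeighborsClassifier(n_neighbors=k_star).fit(X, y).predict(x_test)
print(k_star, pred)
```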

22. Design and Implementation of an RFID-Based Customer Shopping Behavior Mining System
Shopping behavior data is of great importance in understanding the effectiveness of marketing and merchandising campaigns. Online clothing stores are capable of capturing customer shopping behavior by analyzing the click streams and customer shopping carts. Retailers with physical clothing stores, however, still lack effective methods to comprehensively identify shopping behaviors. In this paper, we show that backscatter signals of passive RFID tags can be exploited to detect and record how customers browse stores, which garments they pay attention to, and which garments they usually pair up. The intuition is that the phase readings of tags attached to items will demonstrate distinct yet stable patterns in a time-series when customers look at, pick out, or turn over desired items. We design ShopMiner, a framework that harnesses these unique spatial-temporal correlations of time-series phase readings to detect comprehensive shopping behaviors. We have implemented a prototype of ShopMiner with a COTS RFID reader and four antennas, and tested its effectiveness in two typical indoor environments. Empirical studies from two-week shopping-like data show that ShopMiner is able to identify customer shopping behaviors with high accuracy and low overhead.

23. Efficient Keyword-aware Representative Travel Route Recommendation
With the popularity of social media (e.g., Facebook and Flickr), users can easily share their check-in records and photos during their trips. In view of the huge number of users' historical mobility records in social media, we aim to discover travel experiences to facilitate trip planning. When planning a trip, users always have specific preferences regarding their trips. Instead of restricting users to limited query options such as locations, activities, or time periods, we consider arbitrary text descriptions as keywords about personalized requirements. Moreover, a diverse and representative set of recommended travel routes is needed. Prior works have elaborated on mining and ranking existing routes from check-in data. To meet the need for automatic trip organization, we claim that more features of Places of Interest (POIs) should be extracted. Therefore, in this paper, we propose an efficient Keyword-aware Representative Travel Route framework that uses knowledge extraction from users' historical mobility records and social interactions. Explicitly, we have designed a keyword extraction module to classify the POI-related tags, for effective matching with query keywords. We have further designed a route reconstruction algorithm to construct route candidates that fulfill the requirements. To provide befitting query results, we explore Representative Skyline concepts, that is, the Skyline routes which best describe the trade-offs among different POI features. To evaluate the effectiveness and efficiency of the proposed algorithms, we have conducted extensive experiments on real location-based social network datasets, and the experimental results show that our methods do indeed demonstrate good performance compared with state-of-the-art works.

24. Durable and Energy Efficient In-Memory Frequent Pattern Mining
Efficiently identifying the frequently occurring patterns in a given dataset is a significant problem, as it unveils the trends hidden behind the data. This work is motivated by the demand for a high-performance in-memory frequent-pattern mining strategy with joint optimization of mining performance and system durability. While the widely used frequent-pattern tree (FP-tree) is an efficient approach to frequent-pattern mining, its construction procedure often makes it unfriendly to nonvolatile memories (NVMs). In particular, the incremental construction of an FP-tree can generate many unnecessary writes to the NVM and greatly degrade energy efficiency, because NVM writes typically take more time and energy than reads. To overcome the drawbacks of the FP-tree on NVMs, this paper proposes the evergreen FP-tree (EvFP-tree), which includes a lazy counter and a minimum-bit-altered (MBA) encoding scheme to make the FP-tree friendly to NVMs. The basic idea of the lazy counter is to eliminate the redundant writes generated during FP-tree construction, while the MBA encoding scheme complements existing wear-leveling techniques by writing each memory cell evenly to extend the NVM lifetime. As verified by experiments, EvFP-tree enhances the mining performance and system lifetime by 40.28% and 87.20% on average, respectively, and reduces energy consumption by 50.30% on average.

25. Uncertain Data Clustering in Distributed Peer-to-Peer Networks
Uncertain data clustering has been recognized as an essential task in data mining research. Many centralized clustering algorithms have been extended with new distance or similarity measurements to tackle this issue. With the fast development of network applications, these centralized methods show their limitations in clustering data in a large, dynamic, distributed peer-to-peer network, due to privacy and security concerns and the technical constraints of distributed environments. In this paper, we propose a novel distributed uncertain data clustering algorithm, in which the centralized global clustering solution is approximated by performing distributed clustering. To shorten the execution time, a reduction technique is then applied to transform the proposed method into its deterministic form by replacing each uncertain data object with its expected centroid. Finally, the attribute-weight-entropy regularization technique enhances the proposed distributed clustering method to achieve better clustering results and extract the essential features for cluster identification. Experiments on both synthetic and real-world data show the efficiency and superiority of the presented algorithm.
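A minimal sketch of the reduction step described above: each uncertain object, given as possible positions with probabilities, is replaced by its expected centroid, turning the task into deterministic k-means. The distributed peer-to-peer protocol and the entropy regularization are omitted.

```python
# Expected-centroid reduction: uncertain objects -> deterministic k-means.
import numpy as np
from sklearn.cluster import KMeans

def expected_centroids(uncertain_objects):
    # each object: (positions: (m, d) array, probs: (m,) array, sum = 1)
    return np.array([probs @ positions for positions, probs in uncertain_objects])

rng = np.random.default_rng(0)
objects = []
for _ in range(100):
    center = rng.random(2)
    positions = center + 0.05 * rng.standard_normal((10, 2))
    objects.append((positions, np.full(10, 0.1)))

labels = KMeans(n_clusters=3, n_init=10).fit_predict(expected_centroids(objects))
```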

26. Privacy-Preserving Multi-keyword Top-k Similarity Search Over Encrypted Data
Cloud computing provides individuals and enterprises with massive computing power and scalable storage capacities to support a variety of big data applications in domains like health care and scientific research; therefore, more and more data owners outsource their data to cloud servers for convenience in data management and mining. However, data sets like health records in electronic documents usually contain sensitive information, which raises privacy concerns if the documents are released or shared with partially untrusted third parties in the cloud. A practical and widely used technique for data privacy preservation is to encrypt data before outsourcing it to the cloud servers, which however reduces data utility and makes many traditional data analytic operators, like keyword-based top-k document retrieval, obsolete. In this paper, we investigate the multi-keyword top-k search problem for big data encryption against privacy breaches, and attempt to identify an efficient and secure solution to this problem. Specifically, for the privacy of query data, we construct a special tree-based index structure and design a random traversal algorithm, which makes even the same query produce different visiting paths on the index while keeping the accuracy of queries unchanged under stronger privacy. To improve query efficiency, we propose a group multi-keyword top-k search scheme based on the idea of partitioning, where a group of tree-based indexes is constructed for all documents. Finally, we combine these methods into an efficient and secure approach to our proposed top-k similarity search. Extensive experimental results on real-life data sets demonstrate that our proposed approach can significantly improve the capability of defending against privacy breaches, the scalability, and the time efficiency of query processing over state-of-the-art methods.

27. Anomaly Detection for Road Traffic: A Visual Analytics Framework
The analysis of large amounts of multidimensional road traffic data for anomaly detection is a complex task. Visual analytics can bridge the gap between computational and human approaches to detecting anomalous behavior in road traffic, making the data analysis process more transparent. In this paper, we present a visual analytics framework that provides support for: 1) the exploration of multidimensional road traffic data; 2) the analysis of normal behavioral models built from data; 3) the detection of anomalous events; and 4) the explanation of anomalous events. We illustrate the use of this framework with examples from a large database of real road traffic data collected from several areas in Europe. Finally, we report on feedback provided by expert analysts from Volvo Group Trucks Technology, regarding its design and usability.

28. A Technique for Efficient Query Estimation over Distributed Data Streams
Distributed data stream mining in a sliding window has emerged recently, due to its applications in many domains, including large telecoms and Internet service providers, financial tickers, ATM and credit card operations in banks, and transactions in retail chains. Many of these large-scale applications prohibit monitoring data centrally at a single location due to the massive volume of the data; therefore, data acquisition, processing, and mining tasks are often distributed to a number of processing nodes, which monitor their local streams and exchange only a summary of the data, either periodically or on demand. While this offers many advantages, distributed stream applications pose significant challenges, including problems related to online analysis of recent data, communication efficiency, and the estimation of various complex queries. The few existing techniques that address distributed sliding-window data streams focus on simple problems and require high space, query, and communication costs, which can be a bottleneck for many of these large-scale applications. In this paper, we propose an efficient query estimation technique that constructs a small sketch of the data stream. The constructed sketch uses a deterministic sliding window model and can estimate various complex queries, for both centralized and distributed applications, including point queries (i.e., range queries and heavy hitter queries), quantiles, inner products, and self-join size queries, with deterministic guarantees on the precision. The proposed approach improves upon recent existing work for these problems, in terms of memory and query cost in a centralized setting, and in terms of communication cost and merge complexity in a distributed setting. It requires O((1/ε²) log N) memory, where 0 < ε < 1 is a user-defined parameter.
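For intuition about how a compact sketch answers point queries over a stream, here is a classic count-min sketch. Note the paper's sketch is deterministic, with different guarantees; this randomized structure only illustrates the general space/accuracy trade-off.

```python
# Classic count-min sketch: a small summary answering point queries.
import math
import random

class CountMinSketch:
    def __init__(self, epsilon=0.01, delta=0.01):
        self.w = math.ceil(math.e / epsilon)         # width: error bound
        self.d = math.ceil(math.log(1.0 / delta))    # depth: failure prob.
        self.table = [[0] * self.w for _ in range(self.d)]
        self.seeds = [random.randrange(1 << 30) for _ in range(self.d)]

    def update(self, item, count=1):
        for row, seed in enumerate(self.seeds):
            self.table[row][hash((seed, item)) % self.w] += count

    def query(self, item):    # overestimate, bounded by epsilon * N w.h.p.
        return min(self.table[row][hash((seed, item)) % self.w]
                   for row, seed in enumerate(self.seeds))

cms = CountMinSketch()
for x in ["a", "b", "a", "c", "a"]:
    cms.update(x)
print(cms.query("a"))   # close to the true count of 3
```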

29. Mining Weighted-Frequent-Regular Itemsets from a Transactional Database
Frequent-regular itemset mining has been proposed to find interesting itemsets based on their occurrence behavior. Traditionally, an itemset is identified as interesting if it occurs frequently and regularly in a database. However, this task considers items without differentiating the significance of each item, which may cause important or interesting knowledge to be missed in real-world applications. To address this issue, we introduce an approach to mining weighted-frequent-regular itemsets (also called mining WFRIs). To mine WFRIs, a tree-and-pattern-growth based algorithm called WFRIM (Weighted-Frequent-Regular Itemsets Miner) is proposed. An FP-tree-like structure named the WFRI-tree is designed to efficiently maintain candidate itemsets during the mining process. The concept of the overestimated weighted frequency of items/itemsets under a global/local maximum weight is also applied to prune the search space early. Experimental results on synthetic and real datasets show the efficiency of WFRIM in terms of computational time, memory consumption, and its capability to find valuable itemsets.
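The two measures WFRIM combines can be illustrated as follows (a plain scan, not the tree-and-pattern-growth algorithm): weighted support scales an itemset's frequency by its mean item weight, and regularity is the largest gap between consecutive occurrences, so smaller values mean more regular.

```python
# Plain-scan sketch of weighted support and regularity of an itemset.
def itemset_measures(transactions, itemset, weights):
    n = len(transactions)
    tids = [t for t, items in enumerate(transactions, 1) if itemset <= items]
    if not tids:
        return 0.0, float("inf")
    # Gaps cover the stretch before the first and after the last occurrence.
    gaps = [tids[0]] + [b - a for a, b in zip(tids, tids[1:])] + [n - tids[-1]]
    weighted_support = len(tids) * sum(weights[i] for i in itemset) / len(itemset)
    return weighted_support, max(gaps)

db = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}, {"a", "b"}]
w = {"a": 0.9, "b": 0.6, "c": 0.3}
print(itemset_measures(db, {"a", "b"}, w))   # (2.25, 2)
```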

30. SeLINA: A Self-Learning Insightful Network Analyzer
Understanding the behavior of a network from a large-scale traffic dataset is a challenging problem. Big data frameworks offer scalable algorithms to extract information from raw data, but often require sophisticated fine-tuning and detailed knowledge of machine learning algorithms. To streamline this process, we propose the Self-Learning Insightful Network Analyzer (SeLINA), a generic, self-tuning, simple tool to extract knowledge from network traffic measurements. SeLINA adds self-learning capabilities to state-of-the-art scalable data analytics techniques, together with automatic parameter selection, to relieve the network expert of parameter tuning. We combine both unsupervised and supervised approaches to mine data in a scalable way. SeLINA embeds mechanisms to check whether new data fits the model, to detect possible changes in the traffic, and to trigger model rebuilding, possibly automatically. The result is a system that offers human-readable models of the data with minimal user intervention, supporting domain experts in extracting actionable knowledge and highlighting possibly meaningful interpretations. SeLINA's current implementation runs on Apache Spark. We tested it on large collections of real-world passive network measurements from a nationwide ISP, investigating YouTube and P2P traffic. The experimental results confirm SeLINA's ability to provide insights and detect changes in the data that suggest further analyses.

31. A Proactive Workflow Model for Healthcare Operation and Management
Advances in real-time location systems have enabled us to collect massive amounts of fine-grained, semantically rich location traces, which provide unparalleled opportunities for understanding human activities and generating useful knowledge. This, in turn, delivers intelligence for real-time decision making in various fields, such as workflow management. Indeed, it is a new paradigm to model workflows through knowledge discovery in location traces. To that end, in this paper, we provide a focused study of workflow modeling by integrated analysis of indoor location traces in the hospital environment. In particular, we develop a workflow modeling framework that automatically constructs the workflow states and estimates the parameters describing the workflow transition patterns. More specifically, we propose effective and efficient regularizations for modeling the indoor location traces as stochastic processes. First, to improve the interpretability of the workflow states, we use the geographical relationship between the indoor rooms to define a prior over the workflow state distribution; this prior encourages each workflow state to be a contiguous region in the building. Second, to further improve the modeling performance, we show how to use the correlation between related types of medical devices to reinforce the parameter estimation for multiple workflow models. In comparison with our preliminary work [11], we not only develop an integrated workflow modeling framework applicable to general indoor environments, but also improve the modeling accuracy significantly.

39Probabilistic Models for Ad Viewability Prediction on the Web
Online display advertising has become a billion-dollar industry, and it keeps growing. Advertisers attempt to send marketing messages to attract potential customers via graphic banner ads on publishers' webpages. Advertisers are charged for each view of a page that delivers their display ads. However, recent studies have discovered that more than half of the ads are never shown on users' screens due to insufficient scrolling. Thus, advertisers waste a great amount of money on ads that do not bring any return on investment. Given this situation, the Interactive Advertising Bureau calls for a shift toward charging by viewable impression, i.e., charging only for ads that are actually viewed by users. Under this new pricing model, it is helpful to predict the viewability of an ad. This paper proposes two probabilistic latent class (PLC) models that predict viewability at any given scroll depth for a user-page pair. Using a real-life dataset from a large publisher, the experiments demonstrate that our models outperform comparison systems.
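As an illustration of the latent-class idea (not the paper's exact PLC models), the sketch below assumes each user-page pair carries a posterior over a few hidden scrolling classes, each with its own maximum-scroll-depth distribution; viewability at depth d is then a mixture of per-class tail probabilities. The class weights and per-class parameters are made-up values, not learned.

```python
# Latent-class viewability sketch:
#   P(view at depth d) = sum_c P(c) * P(max_depth >= d | c)
# Class priors and per-class geometric stop parameters are illustrative.
import numpy as np

# Hypothetical posterior over latent scrolling classes for one user-page pair,
# e.g. "skimmer", "average reader", "deep reader".
class_prob = np.array([0.5, 0.3, 0.2])

# Per-class probability of stopping at each 10%-scroll-depth bucket.
stop_p = np.array([0.6, 0.3, 0.1])     # higher = stops sooner

def viewability(depth_bucket):
    """P(user scrolls at least to depth_bucket), marginalized over classes."""
    tail = (1 - stop_p) ** depth_bucket   # P(max depth >= bucket | class)
    return float(class_prob @ tail)

for d in [1, 3, 5, 9]:
    print(f"bucket {d}: viewability ~ {viewability(d):.2f}")
```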

40Efficient Clue-based Route Search on Road Networks
With the advances in geo-positioning technologies and location-based services, it is nowadays quite common for road networks to have textual content on their vertices. The problem of identifying an optimal route that covers a sequence of query keywords has been studied in recent years. However, in many practical scenarios, an optimal route is not always desirable. For example, a personalized route query may be issued by providing clues that describe the spatial context between PoIs along the route, where the result can be far from the optimal one. Therefore, in this paper, we investigate the problem of clue-based route search (CRS), which allows a user to provide clues on keywords and spatial relationships. First, we propose a greedy algorithm and a dynamic programming algorithm as baselines. To improve efficiency, we develop a branch-and-bound algorithm that prunes unnecessary vertices during query processing. To quickly locate candidates, we propose an AB-tree that stores both distance and keyword information in a tree structure. To further reduce the index size, we construct a PB-tree that uses a 2-hop label index to pinpoint candidates. Extensive experiments verify the superiority of our algorithms and index structures.
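The paper's algorithms are not reproduced here; the toy sketch below shows just the greedy-baseline idea on a networkx road graph: from the current vertex, pick the vertex carrying the next clue keyword whose network distance best matches the clue's expected distance. The graph contents and scoring rule are illustrative assumptions.

```python
# Toy greedy baseline for clue-based route search: each clue is
# (keyword, expected_distance); greedily pick the matching vertex whose network
# distance from the current vertex is closest to the clue's expectation.
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([
    ("v0", "v1", 2), ("v1", "v2", 3), ("v1", "v3", 4),
    ("v2", "v4", 2), ("v3", "v4", 1),
])
keywords = {"v1": {"cafe"}, "v2": {"park"}, "v3": {"park"}, "v4": {"museum"}}

def greedy_route(start, clues):
    route, cur = [start], start
    for kw, expected in clues:
        dist = nx.single_source_dijkstra_path_length(G, cur, weight="weight")
        cands = [v for v in dist if kw in keywords.get(v, ()) and v != cur]
        if not cands:
            return None  # no vertex matches this clue
        cur = min(cands, key=lambda v: abs(dist[v] - expected))
        route.append(cur)
    return route

print(greedy_route("v0", [("cafe", 2), ("park", 3), ("museum", 2)]))
```

Because each clue is resolved locally, the greedy result can be far from globally optimal, which is exactly what motivates the branch-and-bound algorithm and the AB-/PB-tree indexes above.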

41Large-scale Location Prediction for Web Pages
Location information of Web pages plays an important role in location-sensitive tasks such as Web search ranking for location-sensitive queries. However, such information is usually ambiguous, incomplete or even missing, which raises the problem of location prediction for Web pages. Meanwhile, Web pages are massive and often noisy, which poses challenges to the majority of existing algorithms for location prediction. In this paper, we propose a novel and scalable location prediction framework for Web pages based on the query-URL click graph. In particular, we introduce the concept of term location vectors to capture location distributions for all terms and develop an automatic approach to learn the importance of each term location vector for location prediction. Empirical results on a large URL set demonstrate that the proposed framework significantly improves location prediction accuracy compared with various representative baselines. We further provide a principled way to incorporate the proposed framework into the search ranking task, and experimental results on a commercial search engine show that the proposed method remarkably boosts ranking performance for location-sensitive queries.
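A minimal sketch of the term-location-vector idea, with made-up click counts: each term gets a distribution over locations aggregated from the query-URL click graph, and a page's location distribution is a weighted average of the vectors of the terms whose queries clicked it. The term weights are fixed at 1 here, whereas the paper learns the importance of each term location vector.

```python
# Sketch: term location vectors from a query-URL click graph (made-up counts).
from collections import Counter, defaultdict

# (query_terms, clicked_url, inferred_query_location, clicks)
clicks = [
    (["pizza", "delivery"], "u1", "NYC", 10),
    (["pizza"],             "u2", "Boston", 4),
    (["delivery", "fast"],  "u1", "NYC", 2),
]

# Term location vector: distribution over locations for each term.
term_loc = defaultdict(Counter)
for terms, _, loc, n in clicks:
    for t in terms:
        term_loc[t][loc] += n

def page_location(url, term_weight=lambda t: 1.0):
    """Weighted average of location vectors of terms whose queries clicked url."""
    agg = Counter()
    for terms, u, _, n in clicks:
        if u != url:
            continue
        for t in terms:
            for loc, c in term_loc[t].items():
                agg[loc] += term_weight(t) * n * c / sum(term_loc[t].values())
    total = sum(agg.values())
    return {loc: v / total for loc, v in agg.items()}

print(page_location("u1"))  # heavily weighted toward NYC in this toy data
```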

42Detecting Stress Based on Social Interactions in Social Networks
Psychological stress threatens people's health, and it is non-trivial to detect stress early enough for proactive care. With the popularity of social media, people are used to sharing their daily activities and interacting with friends on social media platforms, making it feasible to leverage online social network data for stress detection. In this paper, we find that a user's stress state is closely related to that of his/her friends in social media, and we employ a large-scale dataset from real-world social platforms to systematically study the correlation between users' stress states and their social interactions. We first define a set of stress-related textual, visual, and social attributes from various aspects, and then propose a novel hybrid model, a factor graph model combined with a Convolutional Neural Network, to leverage tweet content and social interaction information for stress detection. Experimental results show that the proposed model improves detection performance by 6-9% in F1-score. By further analyzing the social interaction data, we also discover several intriguing phenomena; for example, the number of sparsely connected social structures (i.e., with no delta connections) among stressed users is around 14% higher than among non-stressed users, indicating that the social structures of stressed users' friends tend to be less connected and less complicated than those of non-stressed users.
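The factor graph + CNN hybrid is beyond a short sketch, so the snippet below only illustrates the underlying idea of fusing tweet-content features with social-interaction features in a single classifier. The feature names, toy data, and the logistic regression stand-in are all invented for illustration.

```python
# Simplified illustration of fusing content and social-interaction features for
# stress detection; a stand-in for the paper's factor-graph + CNN hybrid.
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

tweets = ["so exhausted, deadlines everywhere", "great picnic with friends today",
          "cannot sleep, too much pressure", "lovely weekend hike"]
# Hypothetical social features per user: [n_interactions, sparse_structure_ratio]
social = np.array([[2, 0.8], [15, 0.2], [3, 0.7], [12, 0.3]])
stressed = np.array([1, 0, 1, 0])

text_feats = TfidfVectorizer().fit_transform(tweets)
X = hstack([text_feats, csr_matrix(social)])   # fuse content and social views

clf = LogisticRegression().fit(X, stressed)
print(clf.predict(X))
```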

43Facilitating Time Critical Information Seeking in Social Media
Social media plays a major role in helping people affected by natural calamities. These people use social media to request information and help in situations where time is a critical commodity. However, generic social media platforms like Twitter and Facebook are not conducive to obtaining answers in a timely manner. Algorithms that aim to secure prompt responders for questions in social media have to understand and model the factors affecting their response time. In this paper, we draw from sociological studies on information seeking and organizational behavior to identify users who can provide timely and relevant responses to questions posted in social media. We first draw on these theories to model the future availability and past response behavior of candidate responders and integrate these criteria with user relevance. We then propose a learning algorithm based on these criteria to derive optimal rankings of responders for a given question. We treat questions posted on Twitter as a form of information-seeking activity in social media and use them to evaluate our framework. Our experiments demonstrate that the proposed framework is useful in identifying timely and relevant responders for questions in social media.
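A minimal sketch of the ranking criteria described above: score each candidate responder by combining relevance to the question, estimated availability at asking time, and past response promptness. The linear combination and its weights are illustrative assumptions; the paper learns an optimal ranking instead.

```python
# Sketch: rank candidate responders by relevance, availability, and past
# responsiveness. Weights are illustrative; the paper learns the ranking.
candidates = [
    # (user, relevance to question [0-1], P(online now), mean past response mins)
    ("alice", 0.9, 0.2, 30),
    ("bob",   0.6, 0.9, 10),
    ("carol", 0.8, 0.7, 45),
]

def score(relevance, availability, mean_response_min, w=(0.4, 0.4, 0.2)):
    promptness = 1.0 / (1.0 + mean_response_min / 60.0)  # faster -> closer to 1
    return w[0] * relevance + w[1] * availability + w[2] * promptness

ranked = sorted(candidates, key=lambda c: score(*c[1:]), reverse=True)
for user, *feats in ranked:
    print(user, round(score(*feats), 3))
```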

44Hybrid Classification System for Uncertain Data
In classification problems, several different classes may partially overlap at their borders. Objects on the border are usually quite difficult to classify. A hybrid classification system (HCS) is proposed that adaptively applies the proper classification method to each object according to its K nearest neighbors (K-NNs), which are found in the weight vector space obtained by a self-organizing map (SOM) for each class. If the K closest weight vectors (nodes) are all from the same class, the object can be correctly classified with high confidence, and simple hard classification is adopted to directly assign the object to the corresponding class. If the object likely lies on the border between classes, it could be difficult to classify, and credal classification working with belief functions is recommended. Credal classification allows an object to belong both to singleton classes and to sets of classes (meta-classes) with different masses of belief; it can capture the potential imprecision of the classification thanks to the meta-classes and thereby reduce errors. Fuzzy classification is selected for objects that are close to the border yet hard to classify clearly, and it associates each object with different classes via different membership (probability) values. HCS thus takes full advantage of the three classification modes and produces good performance. Moreover, it incurs quite a low computational burden compared with other K-NN-based methods thanks to the use of SOM. The effectiveness of HCS is demonstrated by several experiments on synthetic and real datasets.
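A condensed sketch of the routing logic: per-class prototypes (here KMeans centroids stand in for SOM weight vectors) are pooled, and if an object's K nearest prototypes all share one class it gets a confident hard label; otherwise it is flagged for the credal/fuzzy treatment, which is only stubbed out here.

```python
# Sketch of HCS routing: K nearest per-class prototypes decide whether an object
# gets a confident hard label or falls back to credal/fuzzy classification.
# KMeans centroids stand in for SOM weight vectors.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
K = 3

# A few prototypes per class (stand-in for SOM nodes).
protos, labels = [], []
for c in np.unique(y):
    km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X[y == c])
    protos.append(km.cluster_centers_)
    labels += [c] * 4
protos, labels = np.vstack(protos), np.array(labels)

def classify(x):
    nearest = np.argsort(np.linalg.norm(protos - x, axis=1))[:K]
    classes = set(labels[nearest])
    if len(classes) == 1:
        return ("hard", classes.pop())        # confident region: hard label
    return ("credal/fuzzy", sorted(classes))  # border region: defer

print(classify(X[0]))   # typically a confident hard label
print(classify(X[70]))  # may fall in an overlap region
```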

45Approaches to Cross-Domain Sentiment Analysis: A Systematic Literature Review
Sentiment analysis has received a lot of attention from researchers working in the fields of natural language processing and text mining. However, there is a lack of annotated datasets that can be used to train a model for all domains, which hampers the accuracy of sentiment analysis. Many research studies have attempted to tackle this issue and to improve cross-domain sentiment classification. In this paper, we present the results of a comprehensive systematic literature review of the methods and techniques employed in cross-domain sentiment analysis, focusing on studies published during the period 2010–2016. From our analysis of those works it is clear that there is no perfect solution. Hence, one aim of this review is to create a resource, in the form of an overview of the techniques, methods, and approaches that have been applied to cross-domain sentiment analysis, to assist researchers in developing new and more accurate techniques in the future.

46Analysis of users' behaviour in structured e-commerce websites
Online shopping is becoming more and more common in our daily lives. Understanding users' interests and behaviour is essential in order to adapt e-commerce websites to customers' requirements, and the information about users' behaviour is stored in the web server logs. The analysis of such information has mostly focused on applying data mining techniques in which a rather static characterization is used to model users' behaviour, so the sequence of actions performed by users is not usually considered. Incorporating a view of the process followed by users during a session can therefore be of great interest for identifying more complex behavioural patterns. To address this issue, this paper proposes a linear temporal logic model-checking approach for the analysis of structured e-commerce web logs. By defining a common way of mapping log records according to the e-commerce structure, web logs can be easily converted into event logs in which the behaviour of users is captured. Different predefined queries can then be performed to identify behavioural patterns that take into account the different actions performed by a user during a session. Finally, the usefulness of the proposed approach has been studied by applying it to a real case study of a Spanish e-commerce website. The results have identified interesting findings that have made it possible to propose some improvements to the website design with the aim of increasing its efficiency.
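Full LTL model checking needs a dedicated checker; the sketch below only illustrates the mapping step and one predefined temporal query, checked directly over per-session event logs ("is every add_to_cart eventually followed by checkout within the session?"). The log records and event names are illustrative assumptions.

```python
# Sketch: map web-log records to per-session event logs and check a simple
# temporal pattern (an LTL-style "G(add_to_cart -> F checkout)").
from collections import defaultdict

# Hypothetical log records: (session_id, order, action)
records = [
    ("s1", 1, "view"), ("s1", 2, "add_to_cart"), ("s1", 3, "checkout"),
    ("s2", 1, "view"), ("s2", 2, "add_to_cart"), ("s2", 3, "view"),
]

sessions = defaultdict(list)
for sid, t, action in sorted(records, key=lambda r: (r[0], r[1])):
    sessions[sid].append(action)

def eventually_after(trace, trigger, target):
    """True if every occurrence of trigger is eventually followed by target."""
    return all(target in trace[i + 1:]
               for i, a in enumerate(trace) if a == trigger)

for sid, trace in sessions.items():
    ok = eventually_after(trace, "add_to_cart", "checkout")
    print(sid, "satisfies G(add_to_cart -> F checkout):", ok)
```

Sessions that violate such a query (here s2, an abandoned cart) are exactly the behavioural patterns the approach surfaces for website improvement.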

47Collaboratively Training Sentiment Classifiers for Multiple Domains
We propose a collaborative multi-domain sentiment classification approach that trains sentiment classifiers for multiple domains simultaneously. In our approach, the sentiment information in different domains is shared to train more accurate and robust sentiment classifiers for each domain when labeled data is scarce. Specifically, we decompose the sentiment classifier of each domain into two components: a global one and a domain-specific one. The global model captures the general sentiment knowledge and is shared by all domains. The domain-specific model captures the specific sentiment expressions in each domain. In addition, we extract domain-specific sentiment knowledge from both labeled and unlabeled samples in each domain and use it to enhance the learning of the domain-specific sentiment classifiers. We also incorporate the similarities between domains into our approach as regularization over the domain-specific sentiment classifiers, to encourage the sharing of sentiment information between similar domains. Two kinds of domain similarity measures are explored: one based on textual content and the other based on sentiment expressions. Moreover, we introduce two efficient algorithms to solve the model of our approach. Experimental results on benchmark datasets show that our approach can effectively improve the performance of multi-domain sentiment classification and significantly outperforms baseline methods.
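A minimal numpy sketch of the classifier decomposition described above: each domain's weight vector is the sum of a shared global vector and a domain-specific vector, trained jointly by gradient descent on logistic loss, with an L2 penalty keeping the domain-specific parts small. The synthetic data is invented, and the paper's similarity regularization and knowledge-extraction steps are omitted.

```python
# Sketch: w_domain = w_global + v_domain, trained jointly with logistic loss.
# Domain-similarity regularization from the paper is omitted for brevity.
import numpy as np

rng = np.random.default_rng(0)
D, n, d = 2, 50, 10                       # domains, samples/domain, features
X = [rng.normal(size=(n, d)) for _ in range(D)]
true_w = rng.normal(size=d)
y = [(Xd @ (true_w + 0.3 * rng.normal(size=d)) > 0).astype(float) for Xd in X]

w_g = np.zeros(d)                          # shared global component
v = [np.zeros(d) for _ in range(D)]        # domain-specific components

lr, lam = 0.1, 0.05
for _ in range(200):
    for k in range(D):
        p = 1 / (1 + np.exp(-(X[k] @ (w_g + v[k]))))
        grad = X[k].T @ (p - y[k]) / n
        w_g -= lr * grad                   # global part learns from every domain
        v[k] -= lr * (grad + lam * v[k])   # specific part is kept small

for k in range(D):
    acc = (((X[k] @ (w_g + v[k])) > 0) == (y[k] == 1)).mean()
    print(f"domain {k} train accuracy: {acc:.2f}")
```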

48Sentiment Computing for the News Event Based on the Social Media Big Data
The explosive growth of social media data on the Web has created and promoted the development of social media big data mining, an area welcomed by researchers from both academia and industry. Sentiment computing for news events is a significant component of social media big data mining. It has attracted substantial research and can support many real-world applications, such as public opinion monitoring for governments and news recommendation for websites. However, existing sentiment computing methods are mainly based on standard emotion thesauri or supervised methods, which do not scale to social media big data. We therefore propose an innovative method for the sentiment computing of news events. More specifically, based on the social media data (i.e., words and emoticons) of a news event, a word emotion association network (WEAN) is built to jointly express its semantics and emotion, which lays the foundation for news event sentiment computation. Based on WEAN, a word emotion computation algorithm is proposed to obtain the initial word emotions, which are further refined through a standard emotion thesaurus. With the word emotions in hand, we can compute every sentence's sentiment. Experimental results on real-world datasets demonstrate the excellent performance of the proposed method on emotion computing for news events.
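A toy sketch of the WEAN idea: build a word-emoticon co-occurrence graph, seed the emoticon nodes with known emotions, and iteratively propagate emotion scores to word nodes over the edges. The co-occurrence counts and seed polarities are invented, and the paper's refinement against a standard emotion thesaurus is skipped.

```python
# Toy word emotion association network: propagate emotion from emoticon seeds
# to words over co-occurrence edges (counts and seeds are invented).
from collections import defaultdict

# (node_a, node_b, co-occurrence count); emoticons act as labeled seeds.
edges = [(":)", "win", 5), (":)", "team", 3), (":(", "crash", 6), ("crash", "team", 1)]
seeds = {":)": +1.0, ":(": -1.0}

neighbors = defaultdict(list)
for a, b, w in edges:
    neighbors[a].append((b, w))
    neighbors[b].append((a, w))

emotion = dict(seeds)
for _ in range(10):  # iterative propagation until (roughly) stable
    for node in list(neighbors):
        if node in seeds:
            continue  # keep seed emoticon scores fixed
        nbrs = neighbors[node]
        total = sum(w for _, w in nbrs)
        emotion[node] = sum(w * emotion.get(v, 0.0) for v, w in nbrs) / total

print({w: round(s, 2) for w, s in emotion.items() if w not in seeds})
```

Sentence sentiment can then be computed by aggregating the propagated emotions of the words the sentence contains.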

49SVM-based Web Content Mining with Leaf Classification Unit from DOM-tree
In order to analyze a news article dataset, we first extract important information such as the title, date, and paragraphs of the body, while removing unnecessary information such as images, captions, footers, advertisements, navigation and recommended-news elements. The problem is that the formats of news articles change over time and also vary across news sources, and even across sections of the same source. It is therefore important for a model to generalize when predicting unseen formats of news articles. Our experiments confirmed that a machine-learning-based model predicts new data better than a rule-based model. We also suggest that noise within the body can be removed because we define the classification unit as a leaf node itself. General machine-learning-based models, by contrast, cannot remove such noise: since they treat the classification unit as an intermediate node consisting of a set of leaf nodes, they cannot classify a leaf node by itself.
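A compressed sketch of the leaf-node classification unit: collect the leaf nodes of a parsed DOM with BeautifulSoup, featurize each leaf with simple cues (tag name, tag path, text length), and train an SVM to label leaves as article body vs. noise. The toy HTML, features, and labels are invented for illustration.

```python
# Sketch: leaf nodes of the DOM tree as classification units for an SVM.
# The HTML snippet, features, and labels are toy examples.
from bs4 import BeautifulSoup
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

html = """<html><body>
<h1>Title</h1>
<p>First paragraph of the article body with some length to it.</p>
<div class="ad">Buy now!</div>
<p>Second body paragraph, also reasonably long for a news story.</p>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")
leaves = [t for t in soup.find_all(True) if not t.find(True)]  # tags w/o child tags

def features(tag):
    return {
        "tag": tag.name,
        "path": "/".join(p.name for p in tag.parents if p.name),
        "text_len": len(tag.get_text(strip=True)),
    }

X_dicts = [features(t) for t in leaves]
y = [0, 1, 0, 1]  # toy labels: 1 = article body, 0 = noise

vec = DictVectorizer()
clf = LinearSVC().fit(vec.fit_transform(X_dicts), y)
print([t.name for t in leaves], clf.predict(vec.transform(X_dicts)))
```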

50Adaptive ensembling of semi-supervised clustering solutions
Conventional semi-supervised clustering approaches have several shortcomings, such as (1) not fully utilizing all useful must-link and cannot-link constraints, (2) not considering how to deal with high-dimensional data with noise, and (3) not fully addressing the need for an adaptive process to further improve performance. In this paper, we first propose the transitive closure based constraint propagation approach, which makes use of the transitive closure operator and affinity propagation to address the first limitation. Then, a random subspace based semi-supervised clustering ensemble framework with a set of proposed confidence factors is designed to address the second limitation and provide more stable, robust and accurate results. Next, the adaptive semi-supervised clustering ensemble framework is proposed to address the third limitation; it adopts a newly designed adaptive process to search for the optimal subspace set. Finally, we adopt a set of nonparametric tests to compare different semi-supervised clustering ensemble approaches over multiple datasets. The experimental results on 20 real high-dimensional cancer datasets with noisy genes and on 10 datasets from the UCI and KEEL repositories show that (1) the proposed approaches work well on most of the real-world datasets, and (2) they outperform other state-of-the-art approaches on 12 out of 20 cancer datasets and 8 out of 10 UCI machine learning datasets.
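A small sketch of the constraint-propagation step: take the transitive closure of must-link pairs with union-find, then expand cannot-link constraints across the resulting groups. The affinity-propagation and ensemble stages of the paper are not shown.

```python
# Sketch: transitive closure of must-link constraints via union-find, then
# cannot-link expansion across the closed groups.
from collections import defaultdict

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

must_link = [(1, 2), (2, 3), (4, 5)]
cannot_link = [(3, 4)]

for a, b in must_link:
    union(a, b)

# Group items by their root: the transitive closure of must-link.
groups = defaultdict(set)
for x in parent:
    groups[find(x)].add(x)

# Expand cannot-link: if a !~ b, every member of a's group !~ every member of b's.
expanded = {(min(x, y), max(x, y))
            for a, b in cannot_link
            for x in groups[find(a)]
            for y in groups[find(b)]}
print(sorted(groups.values(), key=min))  # [{1, 2, 3}, {4, 5}]
print(sorted(expanded))                  # all pairs across {1,2,3} x {4,5}
```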
