1Big Data Analytics in Intelligent Transportation Systems: A Survey
Big data is becoming a research focus in intelligent transportation systems (ITS), which can be seen in many projects around the world. Intelligent transportation systems will produce a large amount of data. The produced big data will have profound impacts on the design and application of intelligent transportation systems, which makes ITS safer, more efficient, and profitable. Studying big data analytics in ITS is a flourishing field. This paper first reviews the history and characteristics of big data and intelligent transportation systems. The framework of conducting big data analytics in ITS is discussed next, where the data source and collection methods, data analytics methods and platforms, and big data analytics application categories are summarized. Several case studies of big data analytics applications in intelligent transportation systems, including road traffic accidents analysis, road traffic flow prediction, public transportation service plan, personal travel route plan, and rail transportation management and control, and assets maintenance are introduced. Finally, this paper discusses some open challenges of using big data analytics in ITS.
2Big Data Based Security Analytics for Protecting Virtualized Infrastructures in Cloud Computing
Virtualized infrastructure in cloud computing has become an attractive target for cyber attackers to launch advanced attacks. This paper proposes a novel big data based security analytics approach to detecting advanced attacks in virtualized infrastructures. Network logs as well as user application logs collected periodically from the guest virtual machines (VMs) are stored in the Hadoop Distributed File System (HDFS). Then, extraction of attack features is performed through graph-based event correlation and Map Reduce parser based identification of potential attack paths. Next, determination of attack presence is performed through two-step machine learning, namely logistic regression is applied to calculate attack's conditional probabilities with respect to the attributes, and belief propagation is applied to calculate the belief in existence of an attack based on them. Experiments are conducted to evaluate the proposed approach using well-known malware as well as in comparison with existing security techniques for virtualized infrastructure. The results show that our proposed approach is effective in detecting attacks with minimal performance overhead.
3Event detection and identification of influential spreaders in social media data streams
Microblogging, a popular social media service platform, has become a new information channel for users to receive and exchange the most up-to-date information on current events. Consequently, it is a crucial platform for detecting newly emerging events and for identifying influential spreaders who have the potential to actively disseminate knowledge about events through microblogs. However, traditional event detection models require human intervention to detect the number of topics to be explored, which significantly reduces the efficiency and accuracy of event detection. In addition, most existing methods focus only on event detection and are unable to identify either influential spreaders or key event-related posts, thus making it challenging to track momentous events in a timely manner. To address these problems, we propose a Hypertext-Induced Topic Search (HITS) based Topic-Decision method (TD-HITS), and a Latent Dirichlet Allocation (LDA) based Three-Step model (TS-LDA). TDHITS can automatically detect the number of topics as well as identify associated key posts in a large number of posts. TS-LDA can identify influential spreaders of hot event topics based on both post and user information. The experimental results, using a Twitter dataset, demonstrate the effectiveness of our proposed methods for both detecting events and identifying influential spreaders.
4DP-MCDBSCAN: Differential Privacy Preserving Multi-Core DBSCAN Clustering for Network User Data
The proliferation of ubiquitous Internet and mobile devices has brought about the exponential growth of individual data in big data era. The network user data has been confronted with serious privacy concerns for extracting valuable information during the process of data mining. Differential privacy preservation is a new paradigm independent of the adversaries' prior knowledge, which protects sensitive data while maintaining certain statistical properties by adding random noise. In this paper, we put forward a differential privacy preservation multiple cores DBSCAN clustering schema based on the powerful differential privacy and DBSCAN algorithm for network user data to effectively leverage the privacy leakage issue in the process of data mining, enhancing data clustering efficaciously by adding Laplace noise. We perform extensive theoretical analysis and simulations to evaluate our schema and the results show better efficiency, accuracy, and privacy preservation effect than previous schemas.
5SecSVA: Secure Storage, Verification, and Auditing of Big Data in the Cloud Environment
With the widespread popularity of Internet-enabled devices, there is an exponential increase in the information sharing among different geographically located smart devices. These smart devices may be heterogeneous in nature and may use different communication protocols for information sharing among themselves. Moreover, the data shared may also change with respect to various Vs (volume, velocity, variety, and value) to categorize it as big data. However, as these devices communicate with each other using an open channel, the Internet, there is a higher chance of information leakage during communication. Most of the existing solutions reported in the literature ignore these facts. Keeping focus on these points, in this article, we propose secure storage, verification, and auditing (SecSVA) of big data in cloud environment. SecSVA includes the following modules: an attribute-based secure data deduplication framework for data storage on the cloud, Kerberos-based identity verification and authentication, and Merkle hash-tree-based trusted third-party auditing on cloud. From the analysis, it is clear that SecSVA can provide secure third party auditing with integrity preservation across multiple domains in the cloud environment.
6Optimized Big Data Management across Multi-Cloud Data Centers: Software-Defined-Network-Based Analysis
With an exponential increase in smart device users, there is an increase in the bulk amount of data generation from various smart devices, which varies with respect to all the essential V's used to categorize it as big data. Generally, most service providers, including Google, Amazon, Microsoft and so on, have deployed a large number of geographically distributed data centers to process this huge amount of data generated from various smart devices so that users can get quick response time. For this purpose, Hadoop, and SPARK are widely used by these service providers for processing large datasets. However, less emphasis has been given on the underlying infrastructure (the network through which data flows), which is one of the most important components for successful implementation of any designed solution in this environment. In the worst case, due to heavy network traffic with respect to data migrations across different data centers, the underlying network infrastructure may not be able to transfer data packets from source to destination, resulting in performance degradation. Focusing on all these issues, in this article, we propose a novel SDN-based big data management approach with respect to the optimized network resource consumption such as network bandwidth and data storage units. We analyze various components at both the data and control planes that can enhance the optimized big data analytics across multiple cloud data centers. For example, we analyze the performance of the proposed solution using Bloom-filter-based insertion and deletion of an element in the flow table maintained at the OpenFlow controller.
7CityLines: Designing Hybrid Hub-and-Spoke Transit System with Urban Big Data
Rapid urbanization has posed significant burden on urban transportation infrastructures. In today's cities, both private and public transits have clear limitations to fulfill passengers' needs for quality of experience (QoE): Public transits operate along fixed routes with long wait time and total transit time; Private transits, such as taxis, private shuttles and ride-hailing services, provide point-to-point transits with high trip fare. In this paper, we propose CityLines, a transformative urban transit system, employing hybrid hub-and-spoke transit model with shared shuttles. Analogous to Airlines services, the proposed CityLines system routes urban trips among spokes through a few hubs or direct paths, with travel time as short as private transits and fare as low as public transits. CityLines allows both point-to-point connection to improve the passenger QoE, and hub-and-spoke connection to reduce the system operation cost. To evaluate the performance of CityLines, we conduct extensive data-driven experiments using one-month real-world trip demand data (from taxis, buses and subway trains) collected from Shenzhen, China. The results demonstrate that CityLines reduces 12.5%-44% average travel time, and aggregates 8.5%-32.6% more trips with ride-sharing over other implementation baselines.
8Privacy-Preserving Collaborative Model Learning: The Case of Word Vector Training
Nowadays machine learning is becoming a new paradigm for mining hidden knowledge in big data. The collection and manipulation of big data not only create considerable values, but also raise serious privacy concerns. To protect the huge amount of potentially sensitive data, a straightforward approach is to encrypt data with specialized cryptographic tools. However, it is challenging to utilize or operate on encrypted data, especially to perform machine learning algorithms. In this paper, we investigate the problem of training high quality word vectors over large-scale encrypted data (from distributed data owners) with the privacy-preserving collaborative neural network learning algorithms. We leverage and also design a suite of arithmetic primitives (e.g., multiplication, fixed-point representation and sigmoid function computation etc.) on encrypted data, served as components of our construction. We theoretically analyze the security and efficiency of our proposed construction, and conduct extensive experiments on representative real-world datasets to verify its practicality and effectiveness.
9Enabling Efficient User Revocation in Identity-based Cloud Storage Auditing for Shared Big Data
Cloud storage auditing schemes for shared data refer to checking the integrity of cloud data shared by a group of users. User revocation is commonly supported in such schemes, as users may be subject to group membership changes for various reasons. Previously, the computational overhead for user revocation in such schemes is linear with the total number of file blocks possessed by a revoked user. The overhead, however, may become a heavy burden because of the sheer amount of the shared cloud data. Thus, how to reduce the computational overhead caused by user revocations becomes a key research challenge for achieving practical cloud data auditing. In this paper, we propose a novel storage auditing scheme that achieves highly-efficient user revocation independent of the total number of file blocks possessed by the revoked user in the cloud. This is achieved by exploring a novel strategy for key generation and a new private key update technique. Using this strategy and the technique, we realize user revocation by just updating the nonrevoked group users' private keys rather than authenticators of the revoked user. The integrity auditing of the revoked user's data can still be correctly performed when the authenticators are not updated. Meanwhile, the proposed scheme is based on identity-base cryptography, which eliminates the complicated certificate management in traditional Public Key Infrastructure (PKI) systems. The security and efficiency of the proposed scheme are validated via both analysis and experimental results.
10Clustering model of cloud centers for big data processing
In order to solve the problems of efficient functioning of cloud centers with a high degree of virtualization, by increasing the life cycle of physical machines, in the conditions of group How of tasks or processing of flows of type Big Data, we propose a new algorithm for clustering and selection of the main cluster node using fuzzy logic and Voronoi diagrams. Criteria for the selection of the main node of the cluster are the centrality of the Voronoi diagrams, residual energy, and availability of free resources of physical machines.
11Location prediction on trajectory data: A review
Location prediction is the key technique in many location based services including route navigation, dining location recommendations, and traffic planning and control, to mention a few. This survey provides a comprehensive overview of location prediction, including basic definitions and concepts, algorithms, and applications. First, we introduce the types of trajectory data and related basic concepts. Then, we review existing location-prediction methods, ranging from temporal-pattern-based prediction to spatiotemporal-pattern-based prediction. We also discuss and analyze the advantages and disadvantages of these algorithms and briefly summarize current applications of location prediction in diverse fields. Finally, we identify the potential challenges and future research directions in location prediction.
12Apriori Versions Based on MapReduce for Mining Frequent Patterns on Big Data
Pattern mining is one of the most important tasks to extract meaningful and useful information from raw data. This task aims to extract item-sets that represent any type of homogeneity and regularity in data. Although many efficient algorithms have been developed in this regard, the growing interest in data has caused the performance of existing pattern mining techniques to be dropped. The goal of this paper is to propose new efficient pattern mining algorithms to work in big data. To this aim, a series of algorithms based on the MapReduce framework and the Hadoop open-source implementation have been proposed. The proposed algorithms can be divided into three main groups. First, two algorithms [Apriori MapReduce (AprioriMR) and iterative AprioriMR] with no pruning strategy are proposed, which extract any existing item-set in data. Second, two algorithms (space pruning AprioriMR and top AprioriMR) that prune the search space by means of the well-known anti-monotone property are proposed. Finally, a last algorithm (maximal AprioriMR) is also proposed for mining condensed representations of frequent patterns. To test the performance of the proposed algorithms, a varied collection of big data datasets have been considered, comprising up to 3 · 10#x00B9;⁸ transactions and more than 5 million of distinct single-items. The experimental stage includes comparisons against highly efficient and well-known pattern mining algorithms. Results reveal the interest of applying MapReduce versions when complex problems are considered, and also the unsuitability of this paradigm when dealing with small data.
13Effective Features to Classify Big Data Using Social Internet of Things
Social Internet of Things (SIoT) supports many novel applications and networking services for the IoT in a more powerful and productive way. In this paper, we have introduced a hierarchical framework for feature extraction in SIoT big data using map-reduced framework along with a supervised classifier model. Moreover, a Gabor filter is used to reduce noise and unwanted data from the database, and Hadoop Map Reduce has been used for mapping and reducing big databases, to improve the efficiency of the proposed work. Furthermore, the feature selection has been performed on a filtered data set by using Elephant Herd Optimization. The proposed system architecture has been implemented using Linear Kernel Support Vector Machine-based classifier to classify the data and for predicting the efficiency of the proposed work. From the results, the maximum accuracy, specificity, and sensitivity of our work is 98.2%, 85.88%, and 80%, moreover analyzed time and memory, and these results have been compared with the existing literature.
14Data Lake Lambda Architecture for Smart Grids Big Data Analytics
The advances in smart grids are enabling huge amount of data to be aggregated and analyzed for various smart grid applications. However, the traditional smart grid data management systems cannot scale and provide sufficient storage and processing capabilities. To address these challenges, this paper presents a smart grid big data eco-system based on the state-of-the-art Lambda architecture that is capable of performing parallel batch and real-time operations on distributed data. Further, the presented eco-system utilizes a Hadoop Big Data Lake to store various types of smart grid data including smart meter, images and video data. An implementation of the smart grid big data eco-system on a cloud computing platform is presented. To test the capability of the presented eco-system, real-time visualization and data mining applications were performed on real smart grid data. The results of those applications on top of the eco-system suggest that it is capable of performing numerous smart grid big data analytics.
15Neurally-Guided Semantic Navigation in Knowledge Graph
In this big data era, knowledge becomes increasingly linked, along with the rapid growth in data volume. Connected knowledge is naturally represented and stored as knowledge graphs, which are of great importance for many frontier research areas. Effectively finding relations between entities in large knowledge graphs plays a key role in many knowledge graph applications, as the most valuable part of a knowledge graph is its rich connectedness, which captures rich information about the real-world objects. However, due to the intrinsic complexity of real-world knowledge, finding semantically close relations by navigation in a large knowledge graph is challenging. Canonical graph exploration methods inevitably result in combinatorial explosion especially when the paths connecting two entities are long: the search space is O(dl), where d is the average graph node degree and l is the path length. In this paper, we will systematically study the semantic navigation problem for large knowledge graphs. Inspired by AlphaGo, which was overwhelmingly successful in game Go, we designed an efficient semantic navigation method based on a well-tailored Monte Carlo Tree Search algorithm with the unique characteristics of knowledge graphs considered. Extensive experiments show that our method is not only effective but also very efficient.
16A Novel Hilbert Curve for Cache-locality Preserving Loops
Modern microprocessors offer a rich memory hierarchy including various levels of cache and registers. Some of these memories (like main memory, L3 cache) are big but slow and shared among all cores. Others (registers, L1 cache) are fast and exclusively assigned to a single core but small. Only if the data accesses have a high locality, we can avoid excessive data transfers between the memory hierarchy. In this paper we consider fundamental algorithms like matrix multiplication, K-Means, Cholesky decomposition as well as the algorithm by Floyd and Warshall typically operating in two or three nested loops. We propose to traverse these loops whenever possible not in the canonical order but in an order defined by a space-filling curve. This traversal order dramatically improves data locality over a wide granularity allowing not only to efficiently support a cache of a single, known size (cache conscious) but also a hierarchy of various caches where the effective size available to our algorithms may even be unknown (cache oblivious). We propose a new space-filling curve called Fast Unrestricted (FUR) Hilbert with the following advantages: (1) we overcome the usual limitation to square-like grid sizes where the side-length is a power of 2 or 3. Instead, our approach allows arbitrary loop boundaries for all variables. (2) FUR-Hilbert is non-recursive with a guaranteed constant worst case time complexity per loop iteration (in contrast to O(log(gridsize)) for previous methods). (3) Our non-recursive approach makes the application of our cache-oblivious loops in any host algorithm as easy as conventional loops and facilitates automatic optimization by the compiler. (4) We demonstrate that crucial algorithms like Cholesky decomposition as well as the algorithm by Floyd and Warshall by can be efficiently supported. (5) Extensive experiments on runtime efficiency, cache usage and energy consumption demonstrate the profit of our approach. We believe that future compilers could translate nested loops into cache-oblivious loops either fully automatic or by a user-guided analysis of the data.
17A Dynamic and Failure-aware Task Scheduling Framework for Hadoop
Hadoop has become a popular framework for processing data-intensive applications in cloud environments. A core constituent of Hadoop is the scheduler, which is responsible for scheduling and monitoring the jobs and tasks, and rescheduling them in case of failures. Although fault-tolerance mechanisms have been proposed for Hadoop, the performance of Hadoop can be significantly impacted by unforeseen events in the cloud environment. In this paper, we introduce a dynamic and failure-aware framework that can be integrated within Hadoop scheduler and adjust the scheduling decisions based on collected information about the cloud environment. Our framework relies on predictions made by machine learning algorithms and scheduling policies generated by a Markovian Decision Process (MDP), to adjust its scheduling decisions on the fly. Instead of the fixed heartbeat-based failure detection commonly used in Hadoop to track active TaskTrackers (i.e., nodes that process the scheduled tasks), our proposed framework implements an adaptive algorithm that can dynamically detect the failures of the TaskTracker. To deploy our proposed framework, we have built, ATLAS+, an AdapTive Failure-Aware Scheduler for Hadoop. To assess the performance of ATLAS+, we conduct a large empirical study on a 100- node Hadoop cluster deployed on Amazon Elastic MapReduce (EMR), comparing the performance of ATLAS+ with those of three Hadoop schedulers (FIFO, Fair, and Capacity).
18A Distributed Fuzzy Associative Classifier for Big Data
Fuzzy associative classification has not been widely analyzed in the literature, although associative classifiers (ACs) have proved to be very effective in different real domain applications. The main reason is that learning fuzzy ACs is a very heavy task, especially when dealing with large datasets. To overcome this drawback, in this paper, we propose an efficient distributed fuzzy associative classification approach based on the MapReduce paradigm. The approach exploits a novel distributed discretizer based on fuzzy entropy for efficiently generating fuzzy partitions of the attributes. Then, a set of candidate fuzzy association rules is generated by employing a distributed fuzzy extension of the well-known FP-Growth algorithm. Finally, this set is pruned by using three purposely adapted types of pruning. We implemented our approach on the popular Hadoop framework. Hadoop allows distributing storage and processing of very large data sets on computer clusters built from commodity hardware. We have performed an extensive experimentation and a detailed analysis of the results using six very large datasets with up to 11,000,000 instances. We have also experimented different types of reasoning methods. Focusing on accuracy, model complexity, computation time, and scalability, we compare the results achieved by our approach with those obtained by two distributed nonfuzzy ACs recently proposed in the literature. We highlight that, although the accuracies result to be comparable, the complexity, evaluated in terms of number of rules, of the classifiers generated by the fuzzy distributed approach is lower than the one of the nonfuzzy classifiers.
19Distributed Relationship Mining over Big Scholar Data
In this paper, we propose a system infrastructure to construct the big scholar data as a large knowledge graph, discover the Meta paths between the entities and calculate the relevancy between entities in the graph. The core infrastructure is established on the secured and private Amazon Elastic Compute Cloud (Amazon EC2) platform. The infrastructure maintains the data evenly across the repositories, processes the data parallel by utilizing open source Spark framework, manages computing resources optimally by utilizing YARN and Hadoop HDFS, and discovers the relationship distributive between different types of entities. We incorporate four relationship discovery tasks including citation recommendation, potential collaborator discovery, similar venue measurement and paper to venue recommendation on top of this infrastructure. For relationship mining tasks, we propose a mixed and weighted Meta path (MWMP) method to explore the potential relationship between different types of entities. To verify the accuracy and measure parallelization speedup of our algorithm, we set up clusters through Amazon EC2 platform.
20ODDS: Optimizing Data-locality Access for Scientific Data Analysis
Whereas traditional scientific applications are computationally intensive, recent applications require more data-intensive analysis and visualization to extract knowledge from the explosive growth of scientific information and simulation data. As the computational power and size of compute clusters continue to increase, the I/O read rates and associated network for these data-intensive applications have been unable to keep pace. These applications suffer from long I/O latency due to the movement of “big data” from the network/parallel file system, which results in a serious performance bottleneck. To address this problem, we proposed a novel approach called “ODDS” to optimize data-locality access in scientific data analysis and visualization. ODDS leverages a distributed file system (DFS) to provide scalable data access for scientific analysis. Through exploiting the information of underlying data distribution in DFS, ODDS employs a novel data-locality scheduler to transform a compute-centric mapping into a data-centric one and enables each computational process to access the needed data from a local or nearby storage node. ODDS is suitable for parallel applications with dynamic process-to-data scheduling and for applications with static process-to-data assignment.
21Distributed Relationship Mining over Big Scholar Data
In this paper, we propose a system infrastructure to construct the big scholar data as a large knowledge graph, discover the Meta paths between the entities and calculate the relevancy between entities in the graph. The core infrastructure is established on the secured and private Amazon Elastic Compute Cloud (Amazon EC2) platform. The infrastructure maintains the data evenly across the repositories, processes the data parallel by utilizing open source Spark framework, manages computing resources optimally by utilizing YARN and Hadoop HDFS, and discovers the relationship distributively between different types of entities. We incorporate four relationship discovery tasks including citation recommendation, potential collaborator discovery, similar venue measurement and paper to venue recommendation on top of this infrastructure. For relationship mining tasks, we propose a mixed and weighted Meta path (MWMP) method to explore the potential relationship between different types of entities. To verify the accuracy and measure parallelization speedup of our algorithm, we set up clusters through Amazon EC2 platform.
22Energy Efficiency Aware Task Assignment with DVFS in Heterogeneous Hadoop Clusters
While Hadoop ecosystems become increasingly important for practitioners of large-scale data analysis, they also incur tremendous energy cost. This trend is driving up the need for designing energy-efficient Hadoop clusters in order to reduce the operational costs and the carbon emission associated with its energy consumption. However, despite extensive studies of the problem, existing approaches for energy efficiency have not fully considered the heterogeneity of both workload and machine hardware found in production environments. In this paper, we find that heterogeneity-oblivious task assignment approaches are detrimental to both performance and energy efficiency of Hadoop clusters. Our observation shows that even heterogeneity-aware techniques that aim to reduce the job completion time do not guarantee a reduction in energy consumption of heterogeneous machines. We propose a heterogeneity-aware task assignment approach, E-Ant, that aims to improve the overall energy consumption in a heterogeneous Hadoop cluster without sacrificing job performance. It adaptively schedules heterogeneous workloads on energy-efficient machines, without a priori knowledge of the workload properties. E-Ant employs an ant colony optimization approach that generates task assignment solutions based on the feedback of each task's energy consumption reported by Hadoop TaskTrackers in an agile way. Furthermore, we integrate DVFS technique with E-Ant to further improve the energy efficiency of heterogeneous Hadoop clusters by 23 and 17 percent compared to Fair Scheduler and Tarazu, respectively.
23Hermes: A Privacy-Preserving Approximate Search Framework for Big Data
We propose a sampling-based framework for privacy-preserving approximate data search in the context of big data. The framework is designed to bridge multi-target query needs from users and the data platform, including required query accuracy, timeliness, and query privacy constraints. A novel privacy metric, (ε, δ)-approximation, is presented to uniformly measure accuracy, efficiency and privacy breach risk. Based on this, we employ bootstrapping to efficiently produce approximate results that meet the preset query requirements. Moreover, we propose a quick response mechanism to deal with homogeneous queries, and discuss the reusage of results when appending data. Theoretical analyses and experimental results demonstrate that the framework is capable of effectively fulfilling multi-target query requirements with high efficiency and accuracy.
24Social Network Analysis of Cricket Community Using a Composite Distributed Framework: From Implementation Viewpoint
This paper proposes an alternate ranking system based on social network metrics and their evaluation in a composite distributed framework. The most important aspect of social network domain is voluminous data. In order to know key trends, predictive analysis of huge sets of raw data can be effective. Again these analytics are generally very compute intensive. The most efficient solution of such compute-intensive analysis is to map such problems in a distributed domain. The main contributions of this paper are twofold. First, the analysis of cricket community from the viewpoint of social network and finding ranking of players and countries based on the properties of graph centrality measures. Second, the paper proposes a comprehensive distributed framework to offer infrastructural support for the large data analysis as well as graph processing. Hadoop is an open-source framework to store and process large data sets in a cloud environment. MapReduce is a popular programming model for processing that data in a distributed manner. However, MapReduce alone is not efficient for graph processing. Giraph is an alternative programming paradigm for graph processing in Hadoop. Using a practical case study of social network analysis of cricket community, this paper captures the significance of the alternate ranking in this sport, as well as shows effectiveness of the proposed framework in the process of analyzing voluminous data.
25Big Data Scale Algorithm for Optimal Scheduling of Integrated Microgrids
The capability of switching into the islanded operation mode of microgrids has been advocated as a viable solution to achieve high system reliability. This paper proposes a new model for the microgrids optimal scheduling and load curtailment problem. The proposed problem determines the optimal schedule for local generators of microgrids to minimize the generation cost of the associated distribution system in the normal operation. Moreover, when microgrids have to switch into the islanded operation mode due to reliability considerations, the optimal generation solution still guarantees for the minimal amount of load curtailment. Due to the large number of constraints in both normal and islanded operations, the formulated problem becomes a large-scale optimization problem and is very challenging to solve using the centralized computational method. Therefore, we propose a decomposition algorithm using the alternating direction method of multipliers that provides a parallel computational framework. The simulation results demonstrate the efficiency of our proposed model in reducing generation cost, as well as guaranteeing the reliable operation of microgrids in the islanded mode. We finally describe the detailed implementation of parallel computation for our proposed algorithm to run on a computer cluster using the Hadoop Map Reduce software framework.
26Fair Resource Allocation for Data-Intensive Computing in the Cloud
To address the computing challenge of `big data', a number of data-intensive computing frameworks (e.g., MapReduce, Dryad, Storm and Spark) have emerged and become popular. YARN is a de facto resource management platform that enables these frameworks running together in a shared system. However, we observe that, in cloud computing environment, the fair resource allocation policy implemented in YARN is not suitable because of its memoryless resource allocation fashion leading to violations of a number of good properties in shared computing systems. This paper attempts to address these problems for YARN. Both single-level and hierarchical resource allocations are considered. For single-level resource allocation, we propose a novel fair resource allocation mechanism called Long-Term Resource Fairness (LTRF)for such computing. For hierarchical resource allocation, we propose Hierarchical Long-Term Resource Fairness (H-LTRF) by extending LTRF. We show that both LTRF and H-LTRF can address these fairness problems of current resource allocation policy and are thus suitable for cloud computing. Finally, we have developed LTYARN by implementing LTRF and H-LTRF in YARN, and our experiments show that it leads to a better resource fairness than.
27Speed Up Big Data Analytics by Unveiling the Storage Distribution of Sub-Datasets
In this paper, we study the problem of sub-dataset analysis over distributed file systems, e.g., the Hadoop file system. Our experiments show that the sub-datasets distribution over HDFS blocks, which is hidden by HDFS, can often cause corresponding analyses to suffer from a seriously imbalanced or inefficient parallel execution. Specifically, the content clustering of sub-datasets results in some computational nodes carrying out much more workload than others; furthermore, it leads to inefficient sampling of sub-datasets, as analysis programs will often read large amounts of irrelevant data. We conduct a comprehensive analysis on how imbalanced computing patterns and inefficient sampling occur. We then propose a storage distribution aware method to optimize sub-dataset analysis over distributed storage systems referred to as DataNet. First, we propose an efficient algorithm to obtain the meta-data of sub-dataset distributions. Second, we design an elastic storage structure called ElasticMap based on the HashMap and BloomFilter techniques to store the meta-data. Third, we employ distribution-aware algorithms for sub-dataset applications to achieve balanced and efficient parallel execution. Our proposed method can benefit different sub-dataset analyses with various computational requirements. Experiments are conducted on PRObEs Marmot 128-node cluster testbed and the results show the performance benefits of DataNet.
28Auditing Big Data Storage in Cloud Computing Using Divide and Conquer Tables
Cloud computing has arisen as the mainstream platform of utility computing paradigm that offers reliable and robust infrastructure for storing data remotely, and provides on demand applications and services. Currently, establishments that produce huge volume of sensitive data, leverage data outsourcing to reduce the burden of local data storage and maintenance. The outsourced data, however, in the cloud are not always trustworthy because of the inadequacy of physical control over the data for data owners. To better streamline this issue, scientists have now focused on relieving the security threats by designing remote data checking (RDC) techniques. However, the majority of these techniques are inapplicable to big data storage due to incurring huge computation cost on the user and cloud sides. Such schemes in existence suffer from data dynamicity problem from two sides. First, they are only applicable for static archive data and are not subject to audit the dynamic outsourced data. Second, although, some of the existence methods are able to support dynamic data update, increasing the number of update operations impose high computation and communication cost on the auditor due to maintenance of data structure, i.e., merkle hash tree. This paper presents an efficient RDC method on the basis of algebraic properties of the outsourced files in cloud computing, which inflicts the least computation and communication cost. The main contribution of this paper is to present a new data structure, called Divide and Conquer Table (D&CT), which proficiently supports dynamic data for normal file sizes. Moreover, this data structure empowers our method to be applicable for large-scale data storage with minimum computation cost.