Monday, May 1, 2017

Comparative Study of Big Data Computing and Storage Tools

Today's world is a world of data, and that data is all around us. It is huge in volume and generated at an exponential rate from many sources: social media (Facebook, Twitter, etc.) and forums, mail systems, scholarly and research articles, daily online transactions and company records, and sensor data collected from health care systems, meteorological departments, environmental organizations and more.
In its native form the data also comes in many formats, and it is no longer static: it changes rapidly over time. These characteristics, shared by the bulk of today's data, pose serious challenges for its storage and computation. As a result, conventional data storage and management techniques, as well as computing tools and algorithms, can no longer cope with it. Despite these challenges, we cannot ignore the potential such data holds for analytics and for uncovering hidden patterns. These analytics can be highly effective in shaping business strategies and supporting decisions, in finding hidden patterns associated with diseases and their attributes, in genomics for analyzing thousands of genes and their roles in biological systems, in climate monitoring and prediction, and in mining GPS and other satellite parameters.

This survey examines the computing and storage paradigms and tools currently used to address the challenges of Big Data processing. We have divided the survey into two streams: one covers existing computing paradigms and tools used to perform computation on Big Data, and the other gives a detailed survey of the storage mechanisms and tools available today. In this context we focused on Apache Hadoop, Cloudera Impala and Enterprise RTQ, IBM Netezza and Apache Giraph as computing tools, and on HBase, Hive, Neo4j and Apache Cassandra as storage tools. Based on a detailed analysis of their features and relative advantages and disadvantages, we made a critical comparison among these tools. The comparison covers the attributes one typically weighs before choosing such a tool for a given application domain. We discuss the issues associated with each tool, compare them accordingly, and give a critical review of the suitability and applicability of the different storage and computing tools with respect to a variety of situations, domains, users and requirements.

We found that Hadoop is an economical choice in many ways, but if a company or enterprise has no budget constraints, the high-end IBM Netezza AMPP is a better option. The worldwide adoption of Hadoop has also driven a significant rise in NoSQL databases that integrate easily with it. In this regard HBase is backed by a wide community of users, multiple commercial vendors and developers, and offers cloud storage through Amazon Web Services (AWS). It also integrates strongly with Hadoop through Apache Hive. Its strong consistency and relatively easy application development make it a good choice from a developer's point of view. Due to its very high latency (on the order of minutes), Hive is not a good solution for real-time query applications or OLTP applications that require frequent write operations. Graphs are best suited for modeling real-world situations such as computer networks, social networks and geographic pathways (for example, computing shortest paths), so Neo4j and Giraph are the best choices for storage and computation, respectively, to model such vertex-edge scenarios.
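To make the graph point concrete, here is a minimal sketch of a shortest-path query against Neo4j from Java using Cypher's built-in shortestPath function. It is an illustration only, not code from the surveyed tools' documentation: the Bolt URI, the credentials, the Station label and the CONNECTS_TO relationship type are all assumptions made for the example.

```java
import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Record;
import org.neo4j.driver.Result;
import org.neo4j.driver.Session;
import org.neo4j.driver.Values;

public class ShortestRouteExample {
    public static void main(String[] args) {
        // Connection details are placeholders; adjust to your Neo4j instance.
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {

            // shortestPath() returns one shortest path between the two matched nodes.
            // The Station label and CONNECTS_TO relationship are hypothetical.
            Result result = session.run(
                "MATCH (a:Station {name: $from}), (b:Station {name: $to}), "
                + "p = shortestPath((a)-[:CONNECTS_TO*]-(b)) "
                + "RETURN [n IN nodes(p) | n.name] AS route",
                Values.parameters("from", "Central", "to", "Airport"));

            while (result.hasNext()) {
                Record record = result.next();
                System.out.println("Route: " + record.get("route").asList());
            }
        }
    }
}
```

A query like this is answered interactively by Neo4j, whereas whole-graph analytics at Giraph's scale run as batch, bulk-synchronous jobs on a Hadoop cluster; that difference is why the two occupy the storage and computation roles, respectively.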
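Similarly, the HBase recommendation above rests partly on how little code a basic read/write path requires. The sketch below uses the standard HBase Java client; the table name, column family and ZooKeeper quorum are illustrative, and the table is assumed to exist already.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseQuickstart {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "localhost");  // placeholder quorum

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("tweets"))) {

            // Write one cell: row key "row-001", column family "d", qualifier "text".
            Put put = new Put(Bytes.toBytes("row-001"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("text"),
                          Bytes.toBytes("hello big data"));
            table.put(put);

            // Read it back; HBase returns the row with strong consistency.
            Result result = table.get(new Get(Bytes.toBytes("row-001")));
            byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("text"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```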