Big Data Analytics Tools Comparison
Data science is an emerging field at the intersection of data mining, machine learning, predictive analytics, statistics, and business intelligence. The data scientist has been called the “sexiest job of the 21st century” (Davenport & Patil, 2012). The field is so new that the U.S. Bureau of Labor Statistics does not yet list it as a profession; yet CNN’s Money ranks the data scientist #32 on its Best Jobs in America list, with a median salary of $124,000 (Money, 2015). Fortune names the data scientist the hot tech gig of 2022 (Hempel, 2012).
The volume of data has exploded (Brown, Chui, & Manyika, 2011); however, a shortfall of skilled data scientists remains (Lake & Drake, 2014), which helps explain the high median salary.
Comparison Matrix
The comparison matrix shows the open source tools and their support for common data science techniques. Based on the matrix, WEKA offers the most support on an open source basis; however, each software tool has unique features and strengths. R is a close second, but it requires more in-depth technical skills to execute basic tasks. Tools like RapidMiner, KNIME, Orange, and Tanagra provide more visual approaches; however, each carries a trade-off. KNIME requires a complicated installation process. Along those lines, Tanagra was developed for teaching and research; therefore, its capabilities may be outside the reach of the layperson. RapidMiner has a simple installation; however, much functionality is removed from the open source version. Like RapidMiner, Orange uses a visual approach, and its widget functionality simplifies the creation of data science tasks. One advantage of RapidMiner is the availability of commercial support.
Data science is one of the most in-demand professions, with projected growth and a shortfall in supply driving up salaries for the position. Data science efforts are challenging because high software costs are prohibitive for small and medium-sized organizations, whether in a business or a clinical environment. Data science provides a competitive advantage to businesses, can be employed to lower the costs of healthcare, and has the potential to improve quality of life for patients. Training the next generation of data scientists in an academic setting is challenging due to shrinking academic budgets for software. To address these issues, this work provides an overview of the open source tools available to the data scientist.