Ask anyone about the 5 things they just can’t live without and you’ll get the usual responses of food, car, internet etc. But ask a data scientist the same thing and they’ll regale you with a quick countdown of their five favourite analytics tools, the ones that make work and life that much easier to handle. Let’s take a quick look at what these tools are and what they do.
Microsoft Excel is a spreadsheet application that is part of the MS Office suite of productivity tools. We’ve all used it at some point or other, whether at school or in college, to make lists and create tables. But there is more to Excel than that. Excel has a wide range of functionality, from sorting and manipulating data to representing that data in the form of graphs and charts. It can be used to perform all sorts of calculations, particularly those relating to statistics, engineering and finance. It also supports programming through VBA (Visual Basic for Applications).
Excel is one of the easiest data tools to learn and access, due to its widespread availability. Few computers lack some version of MS Office (paid or unpaid) and, by extension, MS Excel. Excel’s biggest advantage is that users work through a GUI (graphical user interface) and can produce a fair level of data visualization (nothing too complex, though). While it can handle small chunks of data, it is not equipped to deal with large data sets or perform exercises like predictive modelling.
Nevertheless, it is still one of the most widely used data manipulation tools out there and stands every aspiring data scientist in good stead. It also has a very user-friendly interface for non-technical people who want to venture into the world of data analysis.
Pros: Everyone is familiar with Excel. Most people have Excel installed on their systems even if they do not have any other analytics tool. Excel is easy to use. The GUI is intuitive. Excel provides excellent visualization options.
Cons: Excel is not meant for sophisticated statistical analyses. While simple predictive modelling techniques like clustering and regression can be performed in Excel using add-ins, more complex techniques like machine learning are not practical. A worksheet can hold up to 1,048,576 rows and 16,384 columns. However, dealing with even 100,000 rows and 1,000 columns is very painful. Excel becomes slow and may crash if, for example, you run a pivot on data that size.
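To make the size limits and the pivot idea concrete, here is a minimal sketch in Python. It assumes the worksheet limits of 1,048,576 rows and 16,384 columns; the function names and the sales records are invented purely for illustration and are not part of Excel or of any library.

```python
from collections import defaultdict

# Worksheet grid limits for modern Excel (.xlsx) files.
EXCEL_MAX_ROWS = 1_048_576
EXCEL_MAX_COLS = 16_384


def fits_in_excel(n_rows, n_cols):
    """Return True if a table of this shape fits on a single worksheet."""
    return n_rows <= EXCEL_MAX_ROWS and n_cols <= EXCEL_MAX_COLS


def pivot_sum(records, row_key, value_key):
    """Sum value_key for each distinct row_key, like a one-field pivot table."""
    totals = defaultdict(float)
    for record in records:
        totals[record[row_key]] += record[value_key]
    return dict(totals)


# Hypothetical sales records, invented for this example.
sales = [
    {"region": "North", "amount": 120.0},
    {"region": "South", "amount": 80.0},
    {"region": "North", "amount": 30.0},
]

print(fits_in_excel(100_000, 1_000))   # True: it fits, though Excel may be slow
print(fits_in_excel(2_000_000, 10))    # False: too many rows for one sheet
print(pivot_sum(sales, "region", "amount"))
```

Fitting inside the grid is not the same as being comfortable to work with. The first check passes for the 100,000 by 1,000 case even though that is exactly the size at which Excel starts to struggle.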
Verdict: Excel is an excellent starting tool for any data scientist. It is great for slicing and dicing small to mid-size data sets, which is what most people need. Excel expertise is a must for every data scientist. Do you aspire to become a data analyst? Then you should check out our Analytics for Beginners course to get the perfect start.
SAS is a software suite developed by SAS Institute for advanced analytics, predictive modelling, business intelligence and data management. Though considered difficult to learn and use, SAS can juggle numerous data management and analytics tasks, unlike many of its competitors. It is excellent for power users, and is one of the most robust and fastest analytics software suites in the world and one of the best for complex analyses.
While its pricing and licensing are a pain point, many mid-sized to large companies still employ it for the sheer computational power it brings to the table. Though it does not offer great visualization, it is still the go-to tool for complex analysis of large data sets.
Pros: SAS is an extremely versatile tool that can handle small to massive data sets and can be used for anything from simple slice-and-dice analysis to complex multivariate analysis. There is extensive online support for SAS.
Cons: It is an expensive tool. SAS licenses (even the non-GUI versions) can cost as much as, or more than, employing a data scientist. Visualization is limited.
Verdict: If SAS were cheap or free, it would completely dominate the analytics market. The tool is so versatile that it can meet the needs of most businesses. However, the pricing is high and this has forced individuals and businesses to look for more affordable options. To get started with SAS, head to Data Science with SAS and become a certified data scientist.
SAS’s fiercest competition comes from R, a programming language and software environment for statistical computing and graphics. An excellent tool that can perform almost any sort of statistical analysis, it has found ardent supporters because of its open source status. There is nothing geeks love more than open source, free-to-experiment software. R allows users to customize the software in accordance with their individual analytics needs, and comes with a strong package ecosystem, which makes working with it that much easier.
Since its inception, it has grown steadily more robust and now has a strong community of users who provide support to each other. R is the way to go for any company that does not have analytics at its core but still works with data. It is the ideal software with which to create reproducible, high-quality analysis. While it is lacking in security and memory management, it is still a very good analytics tool.
Pros: R is versatile. Some users feel it is even more versatile than SAS now. R users seldom need to go to other tools for anything. R is open source and hence free. R integrates well with the open source technologies that dominate the big data space.
Cons: R has a steep learning curve. It is not an easy tool to learn. While there is extensive support available on the Internet, it is not as well organized as, say, SAS resources.
Verdict: R is the most popular open source analytics tool in the world. At the pace at which its adoption is spreading, it will soon become the most widely used analytics tool in industry and academia. Since it’s free, it is the tool of choice for small and mid-size businesses as well as individual consultants. Most student projects are done in R for the same reason. To add R to your analytics tool belt, get rolling with our Data Science with R certification course.
SQL (Structured Query Language) is a special-purpose programming language used to communicate with and manage a database, specifically in an RDBMS (relational database management system) or RDSMS (relational data stream management system). It is easy to learn and is used to solve quite a few challenging problems.
While not so great for statistical analysis, it is still one of the best tools for data manipulation and can be used on large data sets. Data manipulation still accounts for about half the project time and SQL sits comfortably in this space. It interacts with and accesses structured data with amazing ease and integrates well with old and new databases alike.
Pros: SQL is extremely fast and deals well with data sets of any size. Most users have some level of familiarity with SQL because it is used in so many places outside of analytics as well. SQL is easy to learn.
Cons: SQL is perfect for slice-and-dice operations but is not ideal for statistical analysis. This makes the scope of usage fairly limited.
Verdict: When it comes to data manipulation, few tools can beat the speed and ease of SQL. SQL is a very popular add-on tool for data scientists. It complements SAS, R, Python and other languages extremely well.
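As a rough illustration of what a slice-and-dice query looks like, and of how SQL pairs with Python, here is a small sketch using the sqlite3 module from Python's standard library. The orders table, its columns and its rows are hypothetical, made up only for this example.

```python
import sqlite3

# An in-memory SQLite database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, region TEXT, amount REAL)")

# Hypothetical rows, invented for this example.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("alice", "North", 120.0),
        ("bob", "South", 80.0),
        ("carol", "North", 30.0),
        ("dave", "South", 50.0),
    ],
)

# Slice (WHERE) and dice (GROUP BY): count and total of orders of at
# least 50 per region, largest total first.
query = """
    SELECT region, COUNT(*) AS n_orders, SUM(amount) AS total
    FROM orders
    WHERE amount >= 50
    GROUP BY region
    ORDER BY total DESC
"""
rows = conn.execute(query).fetchall()
conn.close()

for region, n_orders, total in rows:
    print(region, n_orders, total)
```

The filtering and aggregation happen inside the database engine, and Python only receives the two summary rows. That division of labour is what makes SQL a natural companion to a general-purpose language.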
Python is a widely used, open source, general-purpose programming language that is easy to learn, highly readable and, comparatively speaking, needs fewer lines of code. It has a mature and growing ecosystem of open source tools for mathematics and data analysis, making it a strong contender for the title of ‘tool of the future’. It’s very fast and has a huge library base for statistical analysis. It is one of the languages that a lot of programmers are already familiar with, which allows for an easy transition into analytics from the IT side.
It found favour with professionals in the analytics domain only very recently, so there are fewer job openings for it, but it is definitely a skill to learn if one is looking to move into the analytics sector from a programming background. Coding and debugging are easier in Python due to its cleaner syntax, and this makes its learning curve far gentler.
Pros: Python is easy to learn because of its simple syntax. Many programmers are already familiar with it, and they find picking up Python for analytics easier than learning a new language like R. Python is free. The statistical libraries in Python have been growing rapidly, making it a fairly versatile tool now.
Cons: Python has made the transition from a computing language to an analytics tool fairly recently. Hence, it is still not as versatile as R and SAS are.
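To give a feel for how little code a basic analysis takes, here is a minimal sketch that uses only the statistics module from Python's standard library and a hand-written least-squares fit. The ad_spend and revenue figures and the fit_line helper are hypothetical, invented for illustration; real projects would more likely reach for a dedicated third-party library.

```python
import statistics

# Hypothetical observations, invented for this example.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
revenue = [2.0, 4.1, 5.9, 8.2, 9.8]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    # Sum of cross-deviations and sum of squared deviations in x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(ad_spend, revenue)

print("median revenue:", statistics.median(revenue))
print("revenue std dev:", round(statistics.stdev(revenue), 3))
print("fitted line: revenue = %.2f * ad_spend + %.2f" % (slope, intercept))
```

The same simple regression that needs an add-in in Excel fits in about a dozen lines here, which is part of why programmers find the move into analytics through Python a comfortable one.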
Verdict: Python is fast gaining acceptance in the world of analytics. As more and more IT programmers move into analytics, Python’s popularity will only grow. Python is definitely a tool worth investing time in.
So there you go! These are the five must-have tools for any data scientist. How many do you know? How many are yet to make it to your list?