Tools for Data Science - What I use EVERY DAY as a DS!
Let's talk about the tools I use every day as a data scientist.
There's a huge range of software tools out there, and it's so hard to know exactly which tool is the right one for the job. Don't worry, I've got you covered. Today we'll talk a little bit about the technical and non-technical tools I use daily as a data scientist.
The technical tools are the ones I use for data analysis, model building, and data visualisation, while the non-technical ones are the ones I use for project management tasks and communicating with my team.
Technical Tools
The technical tools I mentioned are quite standard and are used across the board by data scientists, machine learning engineers, and even software engineers. Although there might be a couple of flavours of each tool, they're pretty much the same thing at the end of the day.
Accessing the Data
First off, as a data scientist, if you're just getting started, you need access to data to do anything, because a data scientist without data is just a scientist. At most companies I've worked at, the majority of the data is stored either in the cloud or on on-premise clusters. At my current company, we're using the Google Cloud Platform to store all of our data, but I've previously also used Amazon Web Services. Cloud technologies are great to be familiar with as a data professional.
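Just to make that concrete, here's a minimal sketch of what pulling a file down from Google Cloud Storage looks like in Python. The project, bucket, and file names are made up, and you'd need the google-cloud-storage package installed and your credentials set up:

```python
from google.cloud import storage

# All names below are hypothetical placeholders.
client = storage.Client(project="my-analytics-project")
bucket = client.bucket("my-data-bucket")
blob = bucket.blob("raw/sales_2023.csv")

# Download the object to local disk so pandas can read it.
blob.download_to_filename("sales_2023.csv")
```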
Now, if your data is on-premise, it'll usually be stored on computer clusters in your office. These are machines that mostly run Linux and don't really have a monitor. To access the data on them, you'll need to "ssh" into the machine, and for that you definitely need to be familiar with the terminal and Bash, along with Linux. The best place to learn these tools is MIT's Missing Semester course. And I have to say that in my six years as a data scientist, I've always used Bash.
With the majority of cloud platforms, you'll also be interacting with a lot of the resources, say storage buckets or compute instances, through Bash and the terminal. Of course, there are lots of graphical user interfaces out there, but they're almost never as good as interacting with the services through the terminal.
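If you've never touched ssh before, the day-to-day pattern looks something like this. The hostname and paths here are made up; the commands themselves are standard:

```bash
# Hypothetical hostname and paths; the pattern is what matters.
ssh jane@cluster01.internal      # log in to the remote machine
ls /data/projects/churn/         # look around the data directory

# Copy a file from the cluster back to your laptop.
scp jane@cluster01.internal:/data/projects/churn/users.csv .
```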
Data Analysis
Now, once you have access to the data, you'll want to conduct data analysis on it and build machine learning models on top of it. For this, I primarily use Python, and the tools I use to write Python are Jupyter Notebook or JupyterLab, along with Visual Studio Code. I typically use the notebooks for testing my code and doing proof-of-concept work.
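To give you an idea, the proof-of-concept stage in a notebook usually starts with a quick look-around like this (the dataset and column names are hypothetical):

```python
import pandas as pd

# Hypothetical dataset; the quick look-around is the point.
df = pd.read_csv("sales_2023.csv")

df.head()                     # eyeball the first few rows
df.info()                     # column types and missing values
df.describe()                 # summary statistics for numeric columns
df["region"].value_counts()   # distribution across a (hypothetical) category
```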
Once I'm done with the proof-of-concept work, I typically move on to Visual Studio Code, where I can properly script out my code. It's also got some pretty awesome extensions that you can use to speed up your coding workflow.
Recently, one of the ones I've used is GitHub Copilot, which has been amazing at auto-completing some of the functions I'm writing and giving some pretty handy suggestions.
Visual Studio Code also has lots of tools to help you debug and run tests on your code. Now, if Visual Studio Code is a bit too much for you, you can also use a tool like Sublime Text for your scripting work. I especially like Sublime because it's minimalistic and easy to use. Even though Visual Studio Code is my main IDE, I still tend to use Sublime for some pretty basic note-taking and for storing useful code snippets, so it's definitely a great tool.
Once I've written the code, I typically run it in my terminal inside a Docker environment. You can imagine Docker as a pretty much isolated Linux environment. The problem it solves is this: when you write code on your local computer and then take that code and run it on another computer, it's quite possible that the installed packages or the development environment there are very different.
Docker is a great way of normalising the environment your code runs in, because you can specify exactly what goes into the Docker image. You can think of it as a mini OS that you can run on the fly, and it's very easy to package up and get going, so it's almost like a virtual machine in your command line.
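As a rough sketch, here's what that looks like in practice for a hypothetical Python project (the file names and entrypoint are assumptions):

```bash
# Write a minimal Dockerfile that bakes the environment into an image.
cat > Dockerfile <<'EOF'
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["python", "train.py"]
EOF

docker build -t churn-model .   # build the image once
docker run --rm churn-model     # run the code in the same environment, anywhere
```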
And finally, the terminal tool I use is iTerm2, which, in my opinion, is the best terminal available on macOS, with great options for customisation. Now we know what tools to use to write code, but let's also talk a little bit about code management. In the commercial data science world, you'll be working with a lot of people, and as you can imagine, keeping track of the different versions of the code can be a nightmare.
Git is a command-line tool that you use to version control your code. It integrates with tools like GitHub and Bitbucket, which are online code repositories that a lot of companies use to manage their code. It's worth getting well versed in Git, as it's used everywhere in the software and data science industry, and a great way to learn all about it is, again, MIT's Missing Semester course. You can also start using Git in your own personal projects, which gets you a little bit familiar with the tool.
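The core day-to-day loop is only a handful of commands. A minimal sketch, with made-up branch and file names:

```bash
git checkout -b feature/add-validation   # do your work on a separate branch
git add model.py                         # stage the file you changed
git commit -m "Add input validation"     # record a versioned snapshot
git push origin feature/add-validation   # share it via GitHub/Bitbucket for review
```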
Currently, at my workplace, we're using Bitbucket, but I've also used GitHub. They're both fairly easy to use, but I do think that GitHub is a lot more accessible and a lot more popular. When it comes to these code repositories on GitHub and Bitbucket, one of the best practices is to integrate them with a Continuous Integration/Continuous Deployment (CI/CD) tool. What that really means is that every time you make some changes locally and push them to the repository on Bitbucket or GitHub, a checking procedure is triggered. It verifies that the code you've just updated doesn't break the rest of the code base, by running the unit tests you've written and checking the syntax.
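Under the hood, the check that runs on every push usually boils down to a short script like this. This is a hedged sketch; the exact commands and file layout will differ from project to project:

```bash
# What a typical CI check step amounts to (hypothetical project layout).
pip install -r requirements.txt   # reproduce the project's environment
flake8 src/                       # check syntax and style
pytest tests/                     # run the unit tests
```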
Currently, I'm using Google Cloud Build for this purpose, but previously I've also used CircleCI and Buildkite; they're all different flavours of the same kind of tool. And don't worry if you haven't worked with a tool like this before, because you'll definitely learn it on the job. If you're enjoying the blog so far, then consider subscribing to my newsletter. I write blogs & make weekly videos about demystifying AI.
Data Visualisation
Once you've gotten your data, done some analysis, and built some fancy models, you'll want to show off the results of those models and your analysis. There are a few tools you can use for this. Currently, I've been using Apache Superset a lot, which gives you plenty of flexibility as to where the data comes from, along with lots of customisation options for building charts, and from what I can see, there are quite a lot of similar options out there. Previously, I've also used Looker, which is a great enterprise tool that connects to a lot of data sources. If you just have to do data visualisation in Python, then my favourite tools are Seaborn and Plotly. And I've also used a tool called Bokeh for building bespoke dashboards; it's a fantastic, customisable tool that looks beautiful. That's pretty much it for all the technical tools that I use.
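Before we move on, here's what a quick Seaborn chart looks like in practice. It's a minimal sketch using one of Seaborn's built-in example datasets, so it runs as-is:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# "tips" ships with Seaborn, so no external data is needed.
tips = sns.load_dataset("tips")

sns.barplot(data=tips, x="day", y="total_bill")  # mean bill per day, with error bars
plt.title("Average bill by day")
plt.tight_layout()
plt.show()
```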
Non-Technical Tools
The non-technical tools you use will vary based on the company you're with, but there are definitely some common tools used across the board. At my current company, we use a lot of Atlassian tools, specifically Confluence and Jira. Confluence is an amazing tool for creating documentation. I've really enjoyed using it, as it gives you access to a variety of templates, which can really speed up your workflow. There are also lots of customisation options that give you the ability to turn a boring document into something interesting and engaging. The next product from the Atlassian suite that we use is Jira.
Jira is a very popular tool used across many software companies for project management. In Jira, I primarily create tasks for my upcoming sprints. A sprint in my organisation is a two-week development and delivery block, and Jira lets you easily create these tasks and make them visible to your co-workers and your manager. I'm not gonna lie; I'm not a massive fan of Jira, because it's just very hard to learn and use effectively.
I've used other alternatives like Asana, which was a lot more user-friendly, so I would definitely opt for that tool if I had the choice. We also use the Google Suite for email and Google Chat. However, the main tool we use for communicating internally is Slack, which I've used at most of the companies I've worked for. It's pretty much a messaging tool similar to Messenger or WhatsApp, but mainly for companies.
And finally, for brainstorming and reviewing processes, we use a tool called Miro. It basically emulates a whiteboard, where you can go crazy with different colours of sticky notes and lots of people can write on it at the same time. It's very similar to Google Docs, but more interactive.
That's it for all the tools I use daily as a data scientist. Try them out, and let me know how you go. And if you enjoyed the blog, then be sure to share it with your friends, and I'll see you next time.