8 things I’ve learnt being a data guy
Blog Post from Nov 22, 2017
At a Python meet-up last week, my most dreaded question came up: ‘So what do you do?’ The dread comes from having to explain that I carry the responsibilities of three different job titles. So what do I say?
I am a data developer, a data scientist and a data engineer. Quite a mouthful, especially when I remark that I’m also a rocket engineer.
Today marks my 268th day at Lexer and what a ride it’s been! As I distill some of my technical and business learnings, this post is intended to give the outsider a sneak peek into the fascinating world of startups.
Technical
1. Name your variables properly!
Coming from an aerospace engineering background, where coding is just a means to an end, I was largely ignorant of good coding practices. As I started scripting in Ruby, my mentors at Lexer were quick to point out that variable names such as sum_array or final_hash are uninformative and redundant.
Instead, name your variables to describe their function, not their datatype.
Let’s say I want to count occurrences of "apple" in ["apple","orange","banana","apple"] using a hypothetical count function that takes an array of strings and the search string to be counted. The preferable syntax would be n_apples = count(fruits, "apple") instead of count_val = count(fruits, "apple").
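To make the comparison concrete, here is a minimal Ruby sketch of the example above. The count function is the hypothetical one from the text; I have assumed it simply wraps Ruby's built-in Array#count.

```ruby
# Hypothetical helper from the example: counts how many times
# `target` appears in `items`. Assumed to wrap Array#count.
def count(items, target)
  items.count(target)
end

fruits = ["apple", "orange", "banana", "apple"]

# Informative: the name says what the value means.
n_apples = count(fruits, "apple")

# Uninformative: the name only says that something was counted.
count_val = count(fruits, "apple")

puts n_apples
```

Both variables hold the same number, but only n_apples tells the next reader what that number stands for.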
2. Trust your eyes less and your code more.
Suppose you have a lemonade stall and you want to analyse your sales performance. You ask your assistant Roger to record all transactions in an Excel spreadsheet under three columns: Name, Total Spent & Date of Lemonade purchase.
Being a data guy, you naturally don’t trust humans to produce a clean dataset, so you decide to sense-check Roger’s spreadsheet before analysing it.
The first few rows all seem to look fine: the names are entered in the correct format, and the totals and dates make sense. The problem is that the spreadsheet contains 1000 transactions. You could continue eyeballing the data one row at a time, but this would take an eternity and there’s no guarantee that your eyes won’t miss subtle errors in data formats or total spent values.
The smarter alternative is to look at the cardinality of the data. A bash command like
cut -d '|' -f 2 lemonade_purchase.csv | sort | uniq -c | sort -rn
will group identical total spent values, count them and sort them by number of occurrences:
800 2.5
100 5
50 10
50 $
Just like that you can see that 50 of the transactions had a transaction amount of $, which makes no sense. Now you can look into possible solutions to this problem and continue cleaning the data.
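The same cardinality check can be sketched in Ruby. The rows below are made-up sample data in the pipe-delimited Name|Total Spent|Date layout that the bash command assumes, and Enumerable#tally needs Ruby 2.7 or newer.

```ruby
# Made-up sample rows (hypothetical data, not Roger's real spreadsheet).
rows = [
  "Ann|2.5|2017-11-01",
  "Bob|2.5|2017-11-01",
  "Cat|$|2017-11-02",
  "Dan|5|2017-11-02",
  "Eve|2.5|2017-11-03",
]

# Take the second field of each row and count how often each value occurs.
total_counts = rows.map { |row| row.split("|")[1] }.tally

# Most frequent values first, like `sort -rn` in the bash version.
by_frequency = total_counts.sort_by { |_value, n| -n }

by_frequency.each { |value, n| puts "#{n} #{value}" }
```

A value like $ stands out immediately in the counts, even though it would be easy to scroll past in the spreadsheet.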
3. Tests, Tests & more Tests
Before joining Lexer, this was my process for writing code: Write Code --> Count my blessings --> Execute --> Check if output looks right. It was an intuition game.
At Lexer, I was introduced to the beautiful world of Test-Driven Development (TDD). A test specifies the input and desired output of a particular piece of code you’re writing, then it checks if the actual output of your code is equivalent to the desired output.
The tests should be as specific and as varied as possible. Tests give you the confidence that your code is working as expected. They replace intuition with solid working proof.
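As a rough illustration of what such tests can look like, here is a small sketch using Minitest, which ships with Ruby. The function under test, count_occurrences, is a hypothetical stand-in for the count function from section 1, and the cases are my own picks rather than anything from Lexer's codebase.

```ruby
require "minitest/autorun"

# Hypothetical function under test: counts occurrences of `target` in `items`.
def count_occurrences(items, target)
  items.count(target)
end

class CountOccurrencesTest < Minitest::Test
  # Typical case: the item appears more than once.
  def test_counts_repeated_item
    fruits = ["apple", "orange", "banana", "apple"]
    assert_equal 2, count_occurrences(fruits, "apple")
  end

  # The item is not in the list at all.
  def test_returns_zero_when_item_is_absent
    assert_equal 0, count_occurrences(["orange", "banana"], "apple")
  end

  # Empty input should not raise.
  def test_returns_zero_for_empty_input
    assert_equal 0, count_occurrences([], "apple")
  end

  # Matching is exact, so "Apple" is not counted as "apple".
  def test_is_case_sensitive
    assert_equal 0, count_occurrences(["Apple"], "apple")
  end
end
```

Each test states an input and the output I expect, so a failure points at one specific behaviour instead of a vague "the output looks wrong".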
4. Design for edge cases
When I first started data wrangling, I would quickly form a plan of attack for cleaning the dataset and naively expect it to work perfectly. Little did I know that when you have a dataset with millions of rows and 40–100 columns, you are bound to hit edge cases that your original plan of attack didn’t account for. This causes significant delays and rework.
Always assume that the data quality is bad, so you can design your wrangling methods to work in the worst cases.
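Here is a minimal sketch of what designing for the worst case might look like, reusing the Total Spent column from the lemonade example. The parse_total function and the list of messy cells are hypothetical, and Float(..., exception: false) needs Ruby 2.6 or newer.

```ruby
# Hypothetical cleaner for a Total Spent cell: returns a Float,
# or nil when the cell holds no usable number.
def parse_total(raw)
  return nil if raw.nil?

  # Drop surrounding whitespace, currency symbols and thousands separators.
  cleaned = raw.to_s.strip.delete("$,")
  return nil if cleaned.empty?

  # Float(...) is strict, unlike String#to_f which silently returns 0.0
  # for junk input. With exception: false it returns nil instead of raising.
  Float(cleaned, exception: false)
end

# Made-up cells covering the cases a quick first pass tends to forget.
cells = ["2.5", " 5 ", "$10", "$", "", nil, "n/a"]
totals = cells.map { |cell| parse_total(cell) }

p totals
```

Returning nil for anything unusable keeps the bad rows visible, so they can be counted and investigated rather than quietly turned into zeros.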
5. Make scalable code
After your Ruby script takes an hour to finish parsing and extracting relevant info from 2 million lines of text, code efficiency jumps to the top of your priority list. At Lexer, I learnt to think big and to always ask myself how I could apply a solution at a large scale, to GBs of text data. As the saying goes, time is money, and every second you save through code efficiency is valuable.
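One habit that tends to help at that scale is streaming a file line by line instead of reading it all into memory first. The sketch below is only an illustration of that idea: count_matching_lines and the sample text are hypothetical, and I am not claiming this was the bottleneck in the hour-long script.

```ruby
require "tempfile"

# Count lines containing `keyword` without loading the whole file.
def count_matching_lines(path, keyword)
  # File.foreach yields one line at a time, so memory use does not
  # grow with the size of the file.
  File.foreach(path).count { |line| line.include?(keyword) }
end

# Small made-up file, just to show the call.
Tempfile.create("sample") do |file|
  file.write("apple pie\norange juice\napple tart\n")
  file.flush
  puts count_matching_lines(file.path, "apple")
end
```

The same function works unchanged on a three-line file or a multi-gigabyte one, which is the property to aim for.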
Business
6. Think about how your customer uses your product
At Lexer we help companies put data to work to genuinely understand and engage their customers. As much of my work revolves around the technical workings of the Lexer platform, it’s easy to get out of touch with customer needs and instead get obsessed with building ‘cool’ features that are well…cool to use and look at but maybe not so useful.
7. Only sell to people that need the solution you’re providing
During my first month at Lexer, I asked one of our business development managers to sell me a pen. The first thing he said was ‘Do you need the pen?’. I said ‘No’. He said ‘Ok. Then there’s no need to sell you the pen’. Up to this point, selling to me meant persuading your potential customer to buy your product no matter what. In theory, this could work with enough time and effort, albeit at a lower rate of success. Alternatively, you can look for clients that have a use case fulfilled by your product. Although there’s extra work needed on your part to find those prospects, it drastically improves your chances of closing compared to the ‘spray and pray’ method.
8. Find the questions to be answered
The hardest part of the job is not finding the answers but finding the questions to be answered. These give context to the solutions which provide us with actionable insights. A number of organisations have a great sense of what “big data” can deliver for them. Examples would be an increase in customer engagement or revenue, a decrease in churn, etc. However, most organisations will be stumped as to what to do with their data in order to get those results. In this case, asking the “how?” is an important piece of the puzzle. How can you leverage the data to your advantage?