Key topics covered in this blog:
- My career development and milestones in 2020
- My observations as an Analyst-turned Data Scientist and tips for future Data Scientists
- My goals for 2021
Happy New Year, everyone! After a turbulent 2020 marked by a crazy pandemic and social injustice elevated by the Trump administration, we are finally starting 2021 with encouraging signs, such as widely available vaccines and a new Biden administration that seeks to bring more unity, rather than division, to America. With that said, I would like to take this opportunity to share some of my milestones and experiences in 2020, and what I hope to achieve in 2021.
First and foremost, the biggest milestone for my career in 2020 was my promotion from Senior Data Analyst to Data Scientist in March 2020, which I am both excited and grateful for. I am excited because it’s long been my dream to become a Data Scientist – it’s a coveted and prestigious role where I can apply my passion for data to make a difference in the industry. I am also grateful for the long-term support and encouragement from my wife, family, coworkers, friends, and my readers on this site. I also want to thank the many data scientists who write insightful blogs on medium.com, as well as the Stack Overflow community, for helping me understand some of the confusing concepts I came across after transitioning into Data Science.
Speaking of transition, let me share some of the observations I’ve made after transitioning into the Data Scientist role in March 2020, along with some new skills I’ve gained that I had little practical exposure to previously. If you are not yet a data scientist but are looking to become one in the near term, I hope these observations can shed some light and help you make the best use of your time by focusing on the more important aspects and skill sets in data science. Note that I am only speaking from my experience at the company I work for, and there will be nuances from company to company; nonetheless, I will try to keep my points as generalized as possible. Some of you may have already understood, or even applied, these skills in your job. But if you haven’t used them in a while, it might still be a good refresher to read on 🙂
In terms of observations, the very first thing I noticed after my role transition is that writing optimized scripts (in terms of both time and space complexity) is more important in my Data Scientist role than it was in my previous Senior Data Analyst role. This doesn’t mean that as a Senior Data Analyst you shouldn’t write optimized code – code performance matters regardless of which data role you are in, and the bigger the data your business deals with, the more important code optimization becomes. It’s just that as a Data Analyst, the code you write usually gets reviewed by Data Engineers or Developers first, and optimized, before it becomes production code. This of course differs from company to company: if a company is short-staffed on Data Engineers or Developers (or its Engineers/Developers don’t have enough bandwidth), it might expect the Data Analyst to be more self-sufficient in writing efficient code. Long story short, if you haven’t been paying much attention to code efficiency, it wouldn’t hurt to start practicing writing efficient, clean code now so that you gradually get into the habit of doing so every time you open your code editor.
In the next few paragraphs, I will walk through an example of efficient vs. inefficient coding with regard to row iteration in Pandas. If you prefer to stay at the high level, please feel free to skip ahead to the second observation.
Let me give you one example of inefficient vs. efficient code. Most of you are probably already familiar with the Pandas dataframe. (If you are not, please read up on a Pandas tutorial first, e.g. https://realpython.com/pandas-dataframe/.) Many people who are just starting out in their data analysis career get accustomed to using “iterrows” to iterate through each row of a Pandas dataframe. After all, it’s a very popular method and widely covered by many Python tutorial sites. In fact, it’s one of those methods that almost always works no matter what type of transformation you want to make to each row. However, you might be surprised that there are at least 9 other iteration methods that are almost guaranteed to be faster than “iterrows”. For example, “.apply” is a very versatile Pandas function that can take only about 1/10th of the time taken by the “iterrows” method to loop through your dataframe. Next up, the “list comprehension” method (formatted like this: [foo(x) for x in df['x']], which you might’ve already seen or used before) is usually even faster than “.apply”, taking less than 0.8% of the time taken by “iterrows”.
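To make this concrete, below is a minimal sketch comparing the three approaches on a toy dataframe. The column name, the sample size, and the foo function are made-up placeholders (not code from my actual work), and the exact speedups will vary with your machine and workload.

```python
import pandas as pd
import numpy as np

# Toy dataframe with a single numeric column (illustrative only)
df = pd.DataFrame({"x": np.random.rand(100_000)})

def foo(x):
    # Placeholder per-row transformation
    return x * 2

# 1. iterrows: flexible and beginner-friendly, but typically the slowest
result_iterrows = [foo(row["x"]) for _, row in df.iterrows()]

# 2. .apply: same logic, usually around an order of magnitude faster than iterrows
result_apply = df["x"].apply(foo)

# 3. List comprehension over the raw column values: usually faster still
result_listcomp = [foo(x) for x in df["x"]]
```

If you time each of these with %timeit in a Jupyter notebook, you will typically see “iterrows” trailing far behind the other two.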
Now, you might be tempted to stop there given the impressive speed of “list comprehension”. However, when you are dealing with tens of millions of rows (or more) and both speed and memory matter, optimization becomes critical. This is where “Pandas vectorization” and “Numpy vectorization” come into play. Vectorized methods operate on entire columns at once, relying on the optimized mathematical routines built into Pandas and Numpy rather than looping row by row in Python. All of that happens under the hood, so you don’t have to implement it yourself; if you are interested in the underlying mechanics, check out this blog: https://www.labri.fr/perso/nrougier/from-python-to-numpy/. Depending on your application, vectorization can take as little as about 0.2% of the time taken by “iterrows”. Note, however, that vectorization only works when your use case doesn’t require flexible index lookups. For instance, if your goal is to join 2 Pandas dataframes where the same primary key might not share the same index, you can’t use vectorization. To learn more about the different ways to iterate through Pandas rows, feel free to check out these 2 blogs that I found very helpful: https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6 and https://towardsdatascience.com/apply-function-to-pandas-dataframe-rows-76df74165ee4.
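For comparison, here is an equally minimal sketch of what vectorized versions of the same toy transformation might look like; again, the dataframe, column name, and operation are illustrative assumptions rather than a definitive recipe.

```python
import pandas as pd
import numpy as np

# Same toy setup as before, just with more rows to make the difference visible
df = pd.DataFrame({"x": np.random.rand(1_000_000)})

# Pandas vectorization: apply the operation to the whole Series at once,
# instead of looping over rows in Python
result_pandas = df["x"] * 2

# Numpy vectorization: operate on the underlying Numpy array directly,
# which skips some Pandas overhead and is often the fastest option
result_numpy = df["x"].to_numpy() * 2
```

The design point here is simple: the loop still happens, but it happens inside the optimized, compiled code that powers Pandas and Numpy rather than in the Python interpreter, which is where the speedup comes from.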
The second observation I’ve made as an Analyst-turned Data Scientist is that having knowledge of data pipelines (i.e. ETL) and the tools around them will make your life much easier as a Data Scientist. Shortly after I got promoted to Data Scientist, I was tasked with an ETL project that involved periodically querying raw data from Redshift, performing data processing and manipulation, and uploading and refreshing the processed data into new tables in the Redshift database. While it was a non-trivial task for me at the beginning, I was lucky to be able to partner with a colleague of mine who is also a Data Scientist and learn from her many valuable insights about data pipeline tooling. While there are many tools out there that can perform scalable ETL tasks, we eventually went with a tool called “dbt” (short for “data build tool”) to accomplish our goals (check out https://www.getdbt.com/ for more details), since it’s already being used by our Data Science team and has proven to be a viable option. One nice thing about “dbt” is that, as long as you set up the proper Docker files and AWS environment, you can easily containerize your dbt code and automate your data pipeline. As dbt’s official site puts it, dbt enables analytics engineering, which means “the data transformation work that happens between loading data into your warehouse and analyzing it”.
So those are the 2 observations I’ve made as a new Data Scientist. To summarize them in one line: I believe it will be highly beneficial for prospective Data Scientists to have a firm grasp of code optimization and data pipelines. For the former, imagine you have to write a script to process tens of millions of rows of data (worse yet, each row might contain tens of columns); using optimized code could save you hours or even days compared to brute-force, unoptimized code, and it can save you a lot of memory and storage as well. As for data pipelines, imagine you’ve written the most optimized Python script possible to process large amounts of data, but have no idea how to wrap that code into a data pipeline service. That’s like knowing how to assemble a speedboat but not knowing how to drive it. Therefore, knowing how to write optimized data processing scripts and how to wrap them into a service will be highly important for future Data Scientists. Now, this doesn’t mean that knowledge of machine learning and statistical analysis is not important. Quite the contrary – ML and statistics should be fundamental building blocks for any Data Scientist. It’s just that development skills such as code optimization are sometimes overlooked by prospective Data Scientists who focus most of their energy on learning ML and statistics. Think ahead and think broadly, and you will thank yourself later 🙂
In terms of my own goals for 2021, I look forward to further strengthening my skills in Deep Learning and applied ML models by completing a Udemy course, while continuing to expand my development skills through hands-on projects at my job. In addition to strengthening my data science skills, I am also looking forward to learning about product management practices through a Coursera course. Even though it’s not directly related to data science, I believe having good fundamental knowledge of product management can help me develop the kind of “product sense” that’s important for the projects I am tasked with, and help me manage my projects more effectively.
I hope this blog will be beneficial for those of you who are looking to get into, or transition into, Data Scientist roles. Before concluding, I want to once again wish every one of you a healthy and productive new year! Stay safe and stay warm! 😀