Dat Tran is the current Head of Data Science at the Berlin-based company idealo internet GmbH. From investment banking to consultancy to machine learning, Dat took an unusual journey into data science. Through his work as a data scientist with various companies, he’s developed an interest in machine learning, deep learning, AI and computer vision. He also regularly blogs about his work and has spoken at conferences such as PyData.
Dat Tran’s Introduction To Data Science
How did you begin your career in data science?
The start of my data science career was kind of a twist of fate and not really straightforward. I actually first pursued a career in investment banking. During my undergraduate studies, I did several internships at major banks like Deutsche Bank and Jefferies. I realized that banking was something I didn’t want to do for the rest of my life.
Feeling a little bit lost, I went back to school and pursued several other directions, from entrepreneurship to strategy consultancy, but none of them really made me happy. I did, however, really enjoy the classes in advanced statistics and operations research. So I wanted to do something with this but didn’t know what. By chance, I found the online machine learning course by Andrew Ng (like so many data scientists) and was blown away by what I could do with data science. I also heard that Accenture had started to build up an advanced analytics team in Germany, so I applied and luckily got accepted. This was basically the start of my data science career.
What inspired you to learn more about data science?
Short answer: Andrew’s Machine Learning course on Coursera was the crucial factor that made me dig deeper into data science. His course gave me a good overview of all the terminology behind data science and machine learning.
What do you feel are some common misconceptions about data science or your work in general?
As I work in the business world, the biggest misconception I face about data science is that people think I’m a wizard. They think that all I need to do is to put some data into a tool and magic will happen. But of course, in reality, this is not the case. Data science is not as easy as people think.
The core element for data science is data. If a company doesn’t have the right data, then doing data science is really difficult. And many companies that I’ve worked for so far do not really have the right data for doing data science. Besides the data element, there is also the execution element. In a typical project, you need to do a proper data review to understand the data, do some feature engineering, build out the model and evaluate it. And at the end, you need to operationalize the model. So as you can see there are a lot of steps involved. Therefore it’s very important to set the right expectations with the business to explain that data science is not that trivial.
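The steps described above — reviewing the data, engineering features, building a model and evaluating it — can be sketched with scikit-learn. This is a minimal illustration on a toy dataset, not a real project setup:

```python
# Minimal sketch of the project steps described above:
# review the data, engineer features, build a model, evaluate it.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# "Data review": here a bundled toy dataset stands in for real business data.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# "Feature engineering" is reduced to scaling; real projects need far more.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# "Evaluation": hold-out accuracy as a simple metric.
acc = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {acc:.2f}")
```

The operationalization step — serving the trained model in production — is a whole additional effort that a sketch like this doesn’t capture, which is exactly the point about data science not being trivial.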
You have quite a bit of experience working in different industries. Can you tell a bit more about different projects you’ve worked on that have contributed to data science?
Since I’d been in consulting for quite a while, I had the chance to work on different projects in various industries — everything from simple churn use cases in telecommunications to hydroplaning prediction for a German sports car maker. On these projects, I usually covered the entire end-to-end data science process, from devising the use cases to building the model and putting it into production. My core contribution to data science is actually the engineering part. I contribute a lot of ideas on how to integrate the best practices of software engineering, like test-driven development, continuous delivery and many others, into data science. This is very important because, at the end of the day, only models that are put into production generate any business value — and what we write is also code, so it must be stable in production.
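As a hedged illustration of what test-driven development can look like in data science, here is a unit test for a feature-engineering function. The function and column names are hypothetical, invented for this example:

```python
# Sketch of test-driven development applied to a data science codebase:
# a feature-engineering function with a unit test pinning its behaviour.
# Names ("add_price_per_unit", the columns) are illustrative only.
import pandas as pd

def add_price_per_unit(df: pd.DataFrame) -> pd.DataFrame:
    """Derive a price-per-unit feature from raw price and quantity columns."""
    out = df.copy()  # never mutate the caller's frame
    out["price_per_unit"] = out["price"] / out["quantity"]
    return out

def test_add_price_per_unit():
    df = pd.DataFrame({"price": [10.0, 9.0], "quantity": [2, 3]})
    result = add_price_per_unit(df)
    assert list(result["price_per_unit"]) == [5.0, 3.0]
    # the original frame must stay untouched
    assert "price_per_unit" not in df.columns

test_add_price_per_unit()
```

In a real project a test like this would live in a test suite run by a CI server (such as the Jenkins pipeline mentioned below), so a broken transformation is caught before the model ever reaches production.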
What are some tools that you use to conduct your research?
I only use open-source tools! Python is my main programming language, along with the entire ecosystem it offers. For example, I really like the PyData ecosystem (e.g. pandas, Jupyter Notebook, scikit-learn and many more); it’s something I use daily. Other than that, I use things like Git for version control, Jenkins for my CI/CD pipeline, and Docker/Kubernetes for deployment. There are many tools that I use, but I’m also quite tool-agnostic: I don’t really care about the tool itself — I want to use the right tool for the problem, and every problem is different.
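A tiny sketch of the kind of daily data review the pandas part of that stack supports — the dataset here is an invented toy, purely for illustration:

```python
# Quick data review with pandas: inspect, then clean duplicates and missing values.
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 3, 3],
    "spend":   [12.5, None, 7.0, 7.0],  # one missing value, one duplicate row
})

df.info()                 # dtypes and non-null counts
print(df.describe())      # summary statistics for numeric columns

df = df.drop_duplicates().dropna()
print(df)                 # two clean rows remain
```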
What inspired you to start blogging about your work?
When I was in middle school, I wanted to be a journalist, but after my math teacher told me to study something proper instead of journalism, I pursued a different path. Still, I always had an aptitude for writing. Another reason was that I’ve always enjoyed reading machine-learning articles written by others. I’ve learnt a lot that way and became a better data scientist, so I started blogging to give something back to the machine learning community.
(You can read Dat’s contributions on Medium.)
What advice would you give to students who aspire to be data scientists?
I could give the typical advice, such as learning the fundamentals of machine learning and statistics or learning a programming language like Python or R, but this is expected of good candidates. My only advice is actually quite simple: get your hands dirty. I really value people who have already contributed to some open-source ML projects and/or participated in Kaggle challenges. This is a good way to showcase their skills to hiring managers.