Yohan Obadia is currently a Senior Consultant in Data Science at Ernst & Young in Paris. He learned to code in his free time while earning a Master's in Management from ESSEC Business School. After being accepted to a PhD program in Quantitative Marketing, Yohan decided instead to work on practical problems in industry.
Yohan Obadia’s Introduction To Data Science
How did you begin your career in data science?
After a Master's in Economic Research and Econometrics at the Paris School of Economics, I joined the ESSEC Business School, where I met a great marketing analytics teacher, Arnaud de Bruyn.
Thanks to him, I learned to code in R and MySQL and started working for him part-time on an automated analytics platform. It is important to note that, prior to that experience, I felt coding was something stratospheric that only geniuses who had started at 15 could do. However, I quickly found that I really enjoyed it. During that project, I got to use my statistics knowledge, and it also pushed me a bit further into clustering, hierarchical linear models, and so on.
He also encouraged me to apply to PhD programs in Quantitative Marketing. Hence, in order to build a strong application, I started my own data science project and really began using ML algorithms like random forests (RF) and support vector machines (SVM). I faced my first challenges with completely imbalanced data and had to devise strategies for dealing with missing values and so on. In the end, I was accepted at Northwestern, but when I visited for a weekend, I quickly realized that I wanted to work on more practical issues, and I therefore ended up in Israel for a year-long data-science and cyber-security program called Israel Tech Challenge.
What inspired you to learn more about data science and cyber security?
Mostly the impact it can have on people's lives, and also because it's fun! For example, I got the chance to work for an early-stage startup in Israel called HT BioImaging. Their mission is to use infrared images to detect tumor cells in superficial tissues. If this technology succeeds, it will drastically reduce the cost of screening for cervical cancer, allowing women, especially in rural and poor areas, to be treated earlier. This project forced me to learn computer vision, and I really enjoyed it.
Just look at what is happening today with Reinforcement Learning in data science or blockchain in cyber-security. Those research fields are amazing!
What do you feel are some common misconceptions about data science or your work in general?
For people who aspire to become data scientists, the main misconception is how your time is allocated. Most of them think that modelling is what we do all day, constantly improving our results (on whatever metric we use). The truth is, it represents less than 10% of the time I spend on a project. Most of the time is spent identifying the client's needs to make sure that what we deliver solves the problem, then on collecting and preparing the data, as well as delivering the results in an intuitive way, through an interface, for example.
For people who want to launch a data science project in their company and know nothing about the field, it is important to demystify it. They usually think it is some kind of magic tool that you feed with whatever data and it will deliver the answers they want. Most of the time, they don't even know the data they want us to use well enough. As we say: garbage in, garbage out.
Can you tell us a bit more about the different projects you've worked on that have contributed to data science?
I worked for HT BioImaging to develop their first machine learning solution for detecting tumors in infrared videos. We had the data from their first experiment on 15 mice and needed to build a model that could detect those tumors. Given the sample size, it was impossible to approach this problem with today's shape detection solutions. Therefore, instead of trying to detect shapes, we trained the model to classify each pixel individually and built multiple features to describe every pixel. The idea was that the densest area of positively classified pixels would be where the tumor is. We used a density-based clustering algorithm called DBSCAN to cluster those points spatially in the image. Then, using dilation and erosion from OpenCV, we merged those points and found the contour of the area. That would be our tumor.
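A minimal sketch of that pipeline might look like the following. The data here is simulated (a dense blob of "tumor-like" pixels plus scattered false positives, standing in for a per-pixel classifier's output), and I use scipy's morphology functions in place of OpenCV's dilate/erode; the parameters are illustrative, not the ones used on the real project.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from scipy.ndimage import binary_dilation, binary_erosion

# Hypothetical setup: a 64x64 frame where a per-pixel classifier has
# flagged some pixels as tumor-like. We simulate a dense blob plus
# isolated false positives.
rng = np.random.default_rng(0)
mask = np.zeros((64, 64), dtype=bool)
yy, xx = np.mgrid[20:35, 20:35]
mask[yy, xx] = rng.random((15, 15)) > 0.3        # dense tumor-like region
noise = rng.integers(0, 64, size=(20, 2))
mask[noise[:, 0], noise[:, 1]] = True            # scattered false positives

# Cluster the flagged pixel coordinates spatially with DBSCAN.
coords = np.argwhere(mask)
labels = DBSCAN(eps=2.0, min_samples=5).fit_predict(coords)

# Keep the largest cluster (label -1 is DBSCAN's noise label).
valid = labels[labels >= 0]
biggest = np.bincount(valid).argmax()
tumor_pixels = coords[labels == biggest]

# Merge the cluster's pixels into one solid region with a dilation
# followed by an erosion (a morphological closing), standing in for
# the OpenCV dilate/erode step described above.
region = np.zeros_like(mask)
region[tumor_pixels[:, 0], tumor_pixels[:, 1]] = True
closed = binary_erosion(binary_dilation(region, iterations=2), iterations=2)

# The extent of the closed region approximates the tumor's contour.
ys, xs = np.where(closed)
print("tumor bbox:", ys.min(), ys.max(), xs.min(), xs.max())
```

The point of DBSCAN here is that it ignores isolated false positives (they fall below `min_samples` within `eps`) while the dense tumor region survives as one cluster.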
Recently, I worked on a project that illustrates the point made above: sometimes clients don't know what they want, and figuring it out can be an important part of the mission. The goal was to identify suspicious transactions on a specific market. The thing is, they had never had any confirmed fraudulent transaction, but they provided us a list of "suspicious" ones, flagged according to some unknown rule that someone on their team had set, to give us a target variable. That did not work, so after some back and forth with them, and some digging on our side to build a clustering solution, we came up with an automated pipeline that feeds a Tableau dashboard they can use to further their analysis once we flag which transactions are suspect. On top of that, since their resources are limited, they can filter down to the top N most suspicious transactions and focus on those.
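The core idea, scoring every transaction without any labeled fraud and surfacing only the top N, can be sketched as below. Everything here is hypothetical: the feature names and synthetic data are invented, and I use scikit-learn's IsolationForest as one illustrative unsupervised anomaly scorer, not the actual clustering solution built for the client.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical transaction features; in practice these would come
# from the client's market data.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "amount": rng.lognormal(mean=4, sigma=1, size=1000),
    "hour": rng.integers(0, 24, size=1000),
    "counterparty_count": rng.poisson(3, size=1000),
})
# Inject a few extreme transactions to stand in for suspect ones.
df.loc[:4, "amount"] = df["amount"].max() * 20

# Score every transaction with no labeled fraud at all: an
# unsupervised model ranks how unusual each row looks.
model = IsolationForest(random_state=0).fit(df)
df["suspicion"] = -model.score_samples(df)   # higher = more suspect

# With limited investigation resources, surface only the top N
# for the analysts' dashboard.
top_n = df.nlargest(10, "suspicion")
print(top_n[["amount", "suspicion"]])
```

In the real project the scores fed a Tableau dashboard; the key design choice is the same either way, rank rather than classify, so the client's limited team can work down the list from the most suspicious transaction.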
What are some tools that you use to conduct your research?
My work is almost exclusively done in Python. To prototype quickly, I use Jupyter notebooks before wrapping things up in scripts within a project. When I need a database, I use MySQL, or SQLite if it needs to be easily shared. For collaboration, I really like Trello and GitHub/GitLab.
What advice would you give to students who aspire to be data scientists?
Start following people who digest research papers for you and present new ideas, so you stay tuned to this quickly evolving field. I really like Medium for that and have also contributed an article. It is easy to get overwhelmed by the quantity of information available in this field and by how fast each sub-discipline evolves. This leads to my second piece of advice.
You won't be able to cover all the areas of data science quickly. Instead, find a project you are interested in (with easily accessible data) and start building the simplest solution you can. Once you have a pipeline ready, enjoy researching how to improve it! You will learn a lot, whether it is in Natural Language Processing, Reinforcement Learning, Computer Vision, Clustering… Good luck and have fun!