Comprehensive View of the State of Data Science and Machine Learning
- Junchen
- Nov 9, 2018
Updated: Apr 23, 2019
Introduction:
Kaggle conducts an industry-wide survey that presents a truly comprehensive view of the state of data science and machine learning. The survey was live for one week in October, 2018, and after cleaning the data Kaggle finished with 23,859 responses.
Overview of Experience

The most frequent cells are those with 0-2 years of both coding and machine learning experience.
As years of coding increase, more respondents also report more machine learning experience, so coding experience is positively correlated with machine learning experience.
Even among respondents with 30+ years of coding experience, some are still new to machine learning, which reflects that machine learning is a relatively young field.
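The heatmap reading above can be reproduced with a simple cross-tabulation. The sketch below is only an illustration: the response pairs and the experience buckets are invented for the example and are not taken from the Kaggle data.

```python
from collections import Counter

# Hypothetical toy responses: (years of coding, years of machine learning).
# The buckets and counts are invented for illustration only.
responses = [
    ("0-2", "0-2"), ("0-2", "0-2"), ("0-2", "0-2"),
    ("3-5", "0-2"), ("3-5", "3-5"),
    ("5-10", "3-5"), ("5-10", "0-2"),
    ("30+", "0-2"),
]

# Count how many respondents fall into each (coding, ML) cell.
cells = Counter(responses)

# The most frequent cell corresponds to the darkest square of a heatmap.
top_cell, top_count = cells.most_common(1)[0]
print(top_cell, top_count)

# A long-time coder who is still new to machine learning shows up as
# a non-empty ("30+", "0-2") cell.
print(cells[("30+", "0-2")])
```

With real survey data, the same counts would feed a heatmap with coding experience on one axis and machine learning experience on the other.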
Citizenship of Respondents and Experience by Country
The score for experience analyzing data with code ranges from 1.939 to 8.889, while the score for machine learning experience ranges from 1.112 to 4.276. Respondents therefore have considerably more experience analyzing data with code than with machine learning.
Generally speaking, respondents from the USA, Australia and European countries have more experience in both data analysis and machine learning than respondents from other countries.
Respondents from Finland have the highest average score for analyzing data with code (8.889), and respondents from Switzerland have the highest average machine learning score (4.276). However, given the small samples (70 and 164 respondents respectively), these averages cannot be generalized to each country's population.
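One way to guard against the small-sample problem noted above is to report each country's mean together with its respondent count and to drop countries below a minimum sample size. This is a minimal sketch: the rows, the country labels, the function name `mean_by_country` and the threshold are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (country, experience score) rows; values are invented.
rows = [
    ("A", 2.0), ("A", 4.0), ("A", 3.0), ("A", 3.0),
    ("B", 9.0), ("B", 8.0),
    ("C", 1.0), ("C", 2.0), ("C", 3.0),
]

def mean_by_country(rows, min_n):
    """Return {country: (mean score, respondent count)} for countries
    with at least min_n respondents."""
    groups = defaultdict(list)
    for country, score in rows:
        groups[country].append(score)
    return {
        country: (mean(scores), len(scores))
        for country, scores in groups.items()
        if len(scores) >= min_n
    }

# Assumed threshold for this toy example; a real analysis would need
# a much larger value.
result = mean_by_country(rows, min_n=3)
print(result)
```

Country "B" has the highest raw average in this toy data but is excluded because it has only two respondents, which mirrors the caveat about drawing conclusions from small groups.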
What do they do?

Respondents come from a wide range of industries and jobs, but the largest group is students, followed by Data Scientists in Computers/Technology and Software Engineers in Computers/Technology.
Looking vertically, Data Scientists, Software Engineers and Data Analysts are spread across many industries. Looking horizontally, the Computers/Technology industry covers many job categories.
Few respondents work as Data Journalists, and in this survey no Data Journalists work in some industries, such as Online Services and Government/Public Services.
What tools do they use?

The eight most popular programming languages for data science are shown here, and Python clearly dominates the others.

For BI tools, respondents use a range of options, but Tableau is the most popular choice, followed by Power BI.
What language do they recommend for an aspiring Data Scientist?

Number of users comparison in this survey: Python >> R > SQL ≥ C/C++ ≥ Java > MATLAB
Most respondents recommend Python as the first choice, with the exception of R users. However, only a small fraction of people who often use Python recommend R, while a large share of people who often use R recommend Python. For anyone unsure whether to start with R or Python, the answer is clear: Python is the most recommended language to start with.
R, SQL and MATLAB users are relatively confident about their own languages, while C/C++ and Java users are less confident about theirs.
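The comparison above amounts to asking, for each language a respondent uses most often, what share of those users recommend each language. A rough sketch of that row-normalized table follows; the answer pairs are invented and the function name `recommendation_shares` is hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical (language used most often, language recommended) pairs.
# These answers are invented for illustration only.
answers = [
    ("Python", "Python"), ("Python", "Python"),
    ("Python", "Python"), ("Python", "R"),
    ("R", "R"), ("R", "R"), ("R", "Python"), ("R", "Python"),
    ("Java", "Python"), ("Java", "Java"),
]

def recommendation_shares(answers):
    """For each language used, return the share of its users who
    recommend each language (every row sums to 1)."""
    counts = defaultdict(Counter)
    for used, recommended in answers:
        counts[used][recommended] += 1
    shares = {}
    for used, recs in counts.items():
        total = sum(recs.values())
        shares[used] = {lang: n / total for lang, n in recs.items()}
    return shares

shares = recommendation_shares(answers)
for used, recs in shares.items():
    print(used, recs)
```

Normalizing by row matters here because Python users far outnumber the rest, so raw counts alone would hide how confident each group is in its own language.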
How much time do they spend on each part of a regular project?

The two most time-consuming parts are Cleaning Data and Model Selection, and respondents vary a lot in both. In my experience, the time spent cleaning data depends on where the data comes from, because some datasets are already fully or partially cleaned. The time spent on model selection depends on the problem at hand, because many similar challenges can be solved within minutes by reusing previous work or a model API.
Respondents generally spend little time on Model Production. The first quartile lies on the 0% line, meaning a quarter of respondents spend no time on it. The third quartile lies on the 10% line, meaning three quarters of respondents spend 10% or less of their total project time on it.
Respondents also spend relatively little time on Finding Insights.
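The quartile reading of Model Production can be illustrated with Python's standard library. The sketch below uses invented percentages, chosen so that the quartiles land on the same lines described above; the variable names are hypothetical.

```python
from statistics import quantiles

# Hypothetical share of project time (in %) spent on model production,
# one value per respondent. Invented data, not from the survey.
production_share = [0, 0, 0, 5, 5, 10, 10, 20]

# n=4 returns the three cut points: first quartile, median, third quartile.
# method="inclusive" treats the sample as the whole population, which is
# an assumption made for this small example.
q1, median, q3 = quantiles(production_share, n=4, method="inclusive")
print(q1, median, q3)
```

A first quartile of zero means at least a quarter of the respondents report no time at all on that step, which is how the box plot in the survey should be read.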
By Junchen Pan
Dataset Source: Kaggle Competition Platform