With so many exciting data science tools and techniques to explore these days, it can be hard to build the right learning path for yourself. Thankfully, Matt Dancho of Business Science put together a great list of the most important data science skills for 2022.
If a picture is worth 1,000 words, then Matt's table is worth $50,000, because he believes that mastering these skills will boost your annual salary by $50k. We've added links to learning resources for many of his topic categories so that you can start strengthening key areas today.
Skill Area | Techniques & Tools |
---|---|
Machine Learning | Supervised Classification, Supervised Regression, Unsupervised Clustering, Dimensionality Reduction, Local Interpretable Model-agnostic Explanations (LIME); H2O Automatic Machine Learning, parsnip (XGBoost, SVM, Random Forest, GLM), K-Means, UMAP, recipes, lime |
Data Visualization | Interactive and Static Visualizations, ggplot2 and plotly |
Data Wrangling & Cleaning | Working with outliers, missing data, reshaping data, aggregation, filtering, selecting, calculating, and many more critical operations, dplyr and tidyr packages |
Data Preprocessing & Feature Engineering | Preparing data for machine learning, Engineering Features (dates, text, aggregates), recipes package |
Time Series | Working with date/datetime data, aggregating, transforming, visualizing time series, timetk package |
Forecasting | ARIMA, Exponential Smoothing, Prophet, Machine Learning (XGBoost, Random Forest, GLMnet, etc.), Deep Learning (GluonTS), Ensembles, Hyperparameter Tuning, Scaling to 1000s of forecasts, modeltime package |
Text | Working with text data, stringr package |
NLP | Machine learning, Text Features |
Functional Programming | Making reusable functions, sourcing code |
Iteration | Loops and Mapping, purrr package |
Reporting | Rmarkdown, Interactive HTML, Static PDF |
Applications | Building Shiny web applications, Flexdashboard, Bootstrap |
Deployment | Cloud (AWS, Azure, GCP), Docker, Git |
Databases | SQL (for data import), MongoDB (for apps) |
Don’t feel like narrowing down all the available options on your own? No problem. Below are 40 free and highly recommended online data courses broken down by skill category and tool.
Machine Learning (ML) is a subset of Artificial Intelligence (AI) that empowers us to build complex predictive models that do not require explicit programming for each possible outcome.
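The table above points to R tools such as parsnip and H2O for this; purely to illustrate the core idea of supervised classification, here is a minimal nearest-neighbor classifier in plain Python (the toy data and function name are invented for this sketch):

```python
import math

def predict_1nn(train, new_point):
    """Classify new_point by copying the label of its nearest training example."""
    nearest = min(train, key=lambda ex: math.dist(ex[0], new_point))
    return nearest[1]

# Labeled training data: (features, label) pairs -- invented toy examples.
train = [((1.0, 1.0), "small"), ((1.2, 0.8), "small"),
         ((8.0, 9.0), "large"), ((9.5, 8.5), "large")]

print(predict_1nn(train, (1.1, 0.9)))  # -> small
print(predict_1nn(train, (9.0, 9.0)))  # -> large
```

The model was never told the rule for "small" versus "large"; it generalizes from labeled examples, which is the defining trait of supervised learning.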
Data visualization is the process of turning underlying data into summary graphics. The purpose may be to tease out insights during exploratory analysis or package visuals in a way that effectively informs and influences others. There are many visualization design choices that can be constructed in static or interactive form.
Data wrangling is the process of changing the structure of a given dataset into a more desirable format. Data cleaning is the process of identifying and correcting potential issues in a dataset that may negatively impact an analysis or process. Typical issues to address include missing data, outliers, and corrupt values.
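The article recommends dplyr and tidyr for this work in R; as a language-neutral sketch of the cleaning and aggregating steps, here is a small plain-Python example with invented sensor readings:

```python
from statistics import mean

# Toy records containing a missing value and an implausible outlier (invented).
readings = [{"day": 1, "temp": 21.5}, {"day": 2, "temp": None},
            {"day": 3, "temp": 22.0}, {"day": 4, "temp": 480.0}]

# Cleaning: drop missing values, then filter out-of-range outliers.
known = [r for r in readings if r["temp"] is not None]
clean = [r for r in known if -50 <= r["temp"] <= 60]

# Wrangling: aggregate the cleaned values.
avg_temp = mean(r["temp"] for r in clean)
print(len(clean))   # two plausible readings remain
print(avg_temp)     # -> 21.75
```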
Data preprocessing controls for issues discovered during data cleaning by applying a repeatable set of instructions that produces a dataset ready for further analysis. Feature engineering is the process of selecting the relevant variables from a dataset, in their optimal form, to be used as inputs in a predictive model.
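The recipes package handles this workflow in R; as a minimal illustration of engineering features from a date column, here is a plain-Python sketch with an invented helper:

```python
from datetime import date

def date_features(d: date) -> dict:
    """Engineer simple model inputs from a raw date."""
    return {
        "year": d.year,
        "month": d.month,
        "day_of_week": d.weekday(),   # Monday = 0
        "is_weekend": d.weekday() >= 5,
    }

feats = date_features(date(2022, 1, 15))  # January 15, 2022 is a Saturday
print(feats)
```

A raw timestamp carries little signal on its own; features like day-of-week or is-weekend often do.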
Time series analysis is the process of tracking change over time. Data is generally analyzed across equally spaced intervals such as hour, day, week, month, quarter, or year.
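timetk is the recommended R tool here; the basic aggregation step can be sketched in plain Python, grouping invented daily observations into monthly means:

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Daily observations (invented) aggregated to equally spaced monthly intervals.
daily = [(date(2022, 1, 10), 100), (date(2022, 1, 20), 110),
         (date(2022, 2, 5), 120), (date(2022, 2, 25), 130)]

by_month = defaultdict(list)
for d, value in daily:
    by_month[(d.year, d.month)].append(value)

monthly_avg = {month: mean(vals) for month, vals in by_month.items()}
print(monthly_avg)  # monthly means for January and February
```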
Forecasting is the ability to generate a model that makes predictions about future outcomes along a time horizon. Forecast models are generally based on historic patterns in the variable of interest, or on a set of input variables that have underlying relationships with it.
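The modeltime package covers the full R workflow; to show the flavor of the simplest classical method listed in the table, exponential smoothing, here is the core recursion in plain Python (the demand history is invented):

```python
def exponential_smoothing(series, alpha=0.5):
    """Simple exponential smoothing: each step blends the newest
    observation with the previous smoothed level."""
    level = series[0]
    for obs in series[1:]:
        level = alpha * obs + (1 - alpha) * level
    return level  # one-step-ahead forecast

history = [100, 104, 102, 108]  # invented demand history
print(exponential_smoothing(history, alpha=0.5))  # -> 105.0
```

A higher `alpha` weights recent observations more heavily; a lower one smooths out noise.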
Text analysis is the ability to derive meaning from unstructured text data, which is growing exponentially thanks to applications such as social media, blogging, and chat. Processing text data involves adding structure so that algorithms can digest themes and sentiments more efficiently.
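stringr is the R tool of choice here; the "adding structure" step can be as simple as tokenizing and counting words, sketched below in plain Python with an invented review string:

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase, strip punctuation, and split raw text into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

raw = "Great product!! GREAT support... would buy again."
tokens = tokenize(raw)
counts = Counter(tokens)
print(counts["great"])  # -> 2
```

Once text is reduced to tokens and counts, standard algorithms for themes and sentiment can operate on it.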
Natural Language Processing (NLP) attempts to help computers understand and confidently respond to human language stored as text data or audio samples.
Functional programming, as opposed to object-oriented programming, is a declarative coding approach that builds programs from small, reusable functions that avoid shared state and side effects.
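A minimal plain-Python sketch of the idea: small pure functions reused across data via map and filter, rather than objects carrying mutable state (the functions and data are invented for illustration):

```python
# Pure functions: same input always gives the same output,
# and nothing outside the function is modified.
def squared(x):
    return x * x

def is_even(x):
    return x % 2 == 0

nums = [1, 2, 3, 4]
squares = list(map(squared, nums))   # reuse via map -> [1, 4, 9, 16]
evens = list(filter(is_even, nums))  # reuse via filter -> [2, 4]
print(squares)
print(evens)
```

In R, the purrr package mentioned in the table fills the same role that map and filter play here.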
There is no value in an insight that stays buried in a raw data file or highlighted in someone's exploratory spreadsheet. Reporting is the process of turning key findings into digestible pieces of information in visually engaging ways. Some purposes might be to share business intelligence internally or position data-driven thought leadership externally.
Applications are interactive environments for people to access, engage with, and contribute to data systems.
If you can't put your analysis or application into production, it isn't going to do many people much good. Deployment is the process of publishing data-driven tools or systems so that they can be easily accessed by intended users and easily maintained by developers.
A database is an organized system for data storage and access. It generally contains many individual data tables that represent specific data sources. These tables can be created, read, updated, and deleted with SQL commands.
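All four of those operations (create, read, update, delete) can be demonstrated with Python's built-in sqlite3 module; the customers table and its rows below are invented for this sketch:

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# Create
conn.executemany("INSERT INTO customers (name, city) VALUES (?, ?)",
                 [("Ada", "London"), ("Grace", "New York")])

# Read
rows = conn.execute("SELECT name FROM customers WHERE city = ?", ("London",)).fetchall()
print(rows)  # -> [('Ada',)]

# Update
conn.execute("UPDATE customers SET city = ? WHERE name = ?", ("Paris", "Ada"))

# Delete
conn.execute("DELETE FROM customers WHERE name = ?", ("Grace",))
remaining = conn.execute("SELECT name, city FROM customers").fetchall()
print(remaining)  # -> [('Ada', 'Paris')]
conn.close()
```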
If you don't see the tool or technique you want to learn in the categories above, head to our updated search page where you can filter nearly two thousand data science courses and learning resources with something for every skill level and career aspiration.