With so many exciting data science tools and techniques to explore these days, it can be hard to build the right learning path for yourself. Thankfully, Matt Dancho of Business Science put together a great list of the most important data science skills for 2022.
If a picture is worth 1,000 words, then Matt's table is worth $50,000, because he believes that mastering these skills will boost your annual salary by $50k. We've added links to learning resources for many of his topic categories so that you can start strengthening key areas today.
Skill Area | Techniques & Tools |
---|---|
Machine Learning | Supervised Classification, Supervised Regression, Unsupervised Clustering, Dimensionality Reduction, Local Interpretable Model-agnostic Explanations (LIME); H2O Automatic Machine Learning, parsnip (XGBoost, SVM, Random Forest, GLM), K-Means, UMAP, recipes, lime |
Data Visualization | Interactive and Static Visualizations, ggplot2 and plotly |
Data Wrangling & Cleaning | Working with outliers, missing data, reshaping data, aggregation, filtering, selecting, calculating, and many more critical operations, dplyr and tidyr packages |
Data Preprocessing & Feature Engineering | Preparing data for machine learning, Engineering Features (dates, text, aggregates), recipes package |
Time Series | Working with date/datetime data, aggregating, transforming, visualizing time series, timetk package |
Forecasting | ARIMA, Exponential Smoothing, Prophet, Machine Learning (XGBoost, Random Forest, GLMnet, etc.), Deep Learning (GluonTS), Ensembles, Hyperparameter Tuning, Scaling to 1000s of forecasts, modeltime package |
Text | Working with text data, stringr package |
NLP | Machine learning, Text Features |
Functional Programming | Making reusable functions, sourcing code |
Iteration | Loops and Mapping, purrr package |
Reporting | Rmarkdown, Interactive HTML, Static PDF |
Applications | Building Shiny web applications, Flexdashboard, Bootstrap |
Deployment | Cloud (AWS, Azure, GCP), Docker, Git |
Databases | SQL (for data import), MongoDB (for apps) |
Don’t feel like narrowing down all the available options on your own? No problem. Below are 40 free and highly recommended online data courses broken down by skill category and tool.
Machine Learning (ML) is a subset of Artificial Intelligence (AI) that empowers us to build complex predictive models that do not require explicit programming for each possible outcome.
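The table above points to R tools such as parsnip and H2O for this; purely to illustrate the core idea of supervised classification, here is a minimal nearest-neighbor classifier in plain Python (the toy data and function name are invented for this sketch):

```python
import math

def predict_1nn(train, new_point):
    """Classify new_point by copying the label of its nearest training example."""
    nearest = min(train, key=lambda ex: math.dist(ex[0], new_point))
    return nearest[1]

# Labeled training data: (features, label) pairs -- invented toy examples.
train = [((1.0, 1.0), "small"), ((1.2, 0.8), "small"),
         ((8.0, 9.0), "large"), ((9.5, 8.5), "large")]

print(predict_1nn(train, (1.1, 0.9)))  # -> small
print(predict_1nn(train, (9.0, 9.0)))  # -> large
```

The model was never told the rule for "small" versus "large"; it generalizes from labeled examples, which is the defining trait of supervised learning.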
Data visualization is the process of turning underlying data into summary graphics. The purpose may be to tease out insights during exploratory analysis or package visuals in a way that effectively informs and influences others. There are many visualization design choices that can be constructed in static or interactive form.
Data wrangling is the process of changing the structure of a given dataset into a more desirable format. Data cleaning is the process of identifying and correcting potential issues in a dataset that may negatively impact an analysis or process. Typical issues to address include missing data, outliers, and corrupt values.
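The article recommends dplyr and tidyr for this work in R; as a language-neutral sketch of the cleaning and aggregating steps, here is a small plain-Python example with invented sensor readings:

```python
from statistics import mean

# Toy records containing a missing value and an implausible outlier (invented).
readings = [{"day": 1, "temp": 21.5}, {"day": 2, "temp": None},
            {"day": 3, "temp": 22.0}, {"day": 4, "temp": 480.0}]

# Cleaning: drop missing values, then filter out-of-range outliers.
known = [r for r in readings if r["temp"] is not None]
clean = [r for r in known if -50 <= r["temp"] <= 60]

# Wrangling: aggregate the cleaned values.
avg_temp = mean(r["temp"] for r in clean)
print(len(clean))   # two plausible readings remain
print(avg_temp)     # -> 21.75
```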
Data preprocessing controls for issues discovered during data cleaning by applying a repeatable set of instructions that produces a dataset ready for further analysis. Feature engineering is the process of selecting the relevant variables from a dataset, in their optimal form, to be used as inputs in a predictive model.
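The recipes package handles this workflow in R; as a minimal illustration of engineering features from a date column, here is a plain-Python sketch with an invented helper:

```python
from datetime import date

def date_features(d: date) -> dict:
    """Engineer simple model inputs from a raw date."""
    return {
        "year": d.year,
        "month": d.month,
        "day_of_week": d.weekday(),   # Monday = 0
        "is_weekend": d.weekday() >= 5,
    }

feats = date_features(date(2022, 1, 15))  # January 15, 2022 is a Saturday
print(feats)
```

A raw timestamp carries little signal on its own; features like day-of-week or is-weekend often do.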
Time series analysis is the process of tracking change over time. Data is generally analyzed across equally spaced intervals such as hour, day, week, month, quarter, or year.
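timetk is the recommended R tool here; the basic aggregation step can be sketched in plain Python, grouping invented daily observations into monthly means:

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Daily observations (invented) aggregated to equally spaced monthly intervals.
daily = [(date(2022, 1, 10), 100), (date(2022, 1, 20), 110),
         (date(2022, 2, 5), 120), (date(2022, 2, 25), 130)]

by_month = defaultdict(list)
for d, value in daily:
    by_month[(d.year, d.month)].append(value)

monthly_avg = {month: mean(vals) for month, vals in by_month.items()}
print(monthly_avg)  # monthly means for January and February
```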
Forecasting is the ability to generate a model that makes predictions about future outcomes along a time horizon. Forecast models are generally based on historic patterns in the variable of interest, or on a set of input variables that have underlying relationships with it.
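The modeltime package covers the full R workflow; to show the flavor of the simplest classical method listed in the table, exponential smoothing, here is the core recursion in plain Python (the demand history is invented):

```python
def exponential_smoothing(series, alpha=0.5):
    """Simple exponential smoothing: each step blends the newest
    observation with the previous smoothed level."""
    level = series[0]
    for obs in series[1:]:
        level = alpha * obs + (1 - alpha) * level
    return level  # one-step-ahead forecast

history = [100, 104, 102, 108]  # invented demand history
print(exponential_smoothing(history, alpha=0.5))  # -> 105.0
```

A higher `alpha` weights recent observations more heavily; a lower one smooths out noise.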
Text analysis is the ability to derive meaning from unstructured text data, which is growing exponentially thanks to applications such as social media, blogging, and chat. Processing text data involves adding structure so that algorithms can digest themes and sentiments more efficiently.
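stringr is the R tool of choice here; the "adding structure" step can be as simple as tokenizing and counting words, sketched below in plain Python with an invented review string:

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase, strip punctuation, and split raw text into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

raw = "Great product!! GREAT support... would buy again."
tokens = tokenize(raw)
counts = Counter(tokens)
print(counts["great"])  # -> 2
```

Once text is reduced to tokens and counts, standard algorithms for themes and sentiment can operate on it.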
Natural Language Processing (NLP) attempts to help computers understand and confidently respond to human language stored as text data or audio samples.
Functional programming, as opposed to object-oriented programming, is a declarative coding approach that builds programs from small, reusable functions that avoid shared state and side effects.
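A minimal plain-Python sketch of the idea: small pure functions reused across data via map and filter, rather than objects carrying mutable state (the functions and data are invented for illustration):

```python
# Pure functions: same input always gives the same output,
# and nothing outside the function is modified.
def squared(x):
    return x * x

def is_even(x):
    return x % 2 == 0

nums = [1, 2, 3, 4]
squares = list(map(squared, nums))   # reuse via map -> [1, 4, 9, 16]
evens = list(filter(is_even, nums))  # reuse via filter -> [2, 4]
print(squares)
print(evens)
```

In R, the purrr package mentioned in the table fills the same role that map and filter play here.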
There is no value in an insight that stays buried in a raw data file or highlighted in someone's exploratory spreadsheet. Reporting is the process of turning key findings into digestible pieces of information in visually engaging ways. Some purposes might be to share business intelligence internally or position data-driven thought leadership externally.
Applications are interactive environments for people to access, engage with, and contribute to data systems.
If you can't put your analysis or application into production, it isn't going to do many people much good. Deployment is the process of publishing data-driven tools or systems so that they can be easily accessed by intended users and easily maintained by developers.
A database is an organized system for data storage and access. It generally contains many individual data tables that represent specific data sources. These tables can be created, read, updated, and deleted with SQL commands.
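All four of those operations (create, read, update, delete) can be demonstrated with Python's built-in sqlite3 module; the customers table and its rows below are invented for this sketch:

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# Create
conn.executemany("INSERT INTO customers (name, city) VALUES (?, ?)",
                 [("Ada", "London"), ("Grace", "New York")])

# Read
rows = conn.execute("SELECT name FROM customers WHERE city = ?", ("London",)).fetchall()
print(rows)  # -> [('Ada',)]

# Update
conn.execute("UPDATE customers SET city = ? WHERE name = ?", ("Paris", "Ada"))

# Delete
conn.execute("DELETE FROM customers WHERE name = ?", ("Grace",))
remaining = conn.execute("SELECT name, city FROM customers").fetchall()
print(remaining)  # -> [('Ada', 'Paris')]
conn.close()
```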
If you don't see the tool or technique you want to learn in the categories above, head to our updated search page where you can filter nearly two thousand data science courses and learning resources with something for every skill level and career aspiration.