Reinforcement Learning Cheat Sheet

What is reinforcement learning? Reinforcement learning is a subset of machine learning. It enables an agent to learn through the consequences of its actions in a specific environment. It can be used to teach a robot new tricks, for example (source: towardsdatascience.com).

Differential reinforcement is a research-based strategy to increase appropriate behaviors and reduce or eliminate inappropriate behaviors. There are several types of differential reinforcement; however, they all share the same principles of reinforcing desired behavior and withholding reinforcement for undesired behavior.

When you hear Artificial Intelligence (AI), the first thing that comes to mind is robots; in particular, the Steven Spielberg movie titled A.I., where a robot child is built that can love and behave just like a real human. This idea appears to be closer to a dream than reality. Truth is, AI is more ubiquitous than we might think. It ranges from self-driving cars, movie recommendations on Netflix, and e-mail spam detection to voice-controlled assistants such as Apple’s Siri. The fact is that AI is already present across many businesses and various industries, as is shown in the figure below.

Still, evidence suggests that HR departments remain unable to seize the multitude of opportunities associated with AI. In part, what may be required to accelerate the adoption of AI is educational content directed at HR Professionals, not data scientists. Thus, we offer this brief guide to Machine Learning (ML), an important subset of AI, with the intent to demystify ML and make it tangible.

What is AI?

Let’s start with the definition. AI is a broad area of computer science with the focus on building machines that can behave in an intelligent way—akin to humans. Furthermore, we can differentiate between 1. Generalized AI and 2. Specific AI. The concept behind Generalized AI is to develop machines that can perform multiple tasks, just like Spielberg’s robot-child. This is an area still in its infancy.

More relevant to HR is Specific AI, which refers to the use of intelligent machines to perform only one particular task, but to do it better than a human could (e.g., faster, more accurately, or more objectively). For example, an application could read hundreds of CVs in seconds and identify optimal candidates for an interview, thereby affording HR professionals greater time to perform tasks humans do better than machines (e.g., building rapport, empathy, creativity, critical thinking, etc.).

What is Machine Learning?

But how can a machine be programmed to process information (i.e., data) in an intelligent way? The answer to that is Machine Learning (ML). ML is a subset of Specific AI that comes from a mix of statistics and computer science. It refers to the process of computers learning to perform a task instead of following step-by-step instructions. This is generally performed iteratively by data scientists instructing a computer if its decisions are correct or incorrect. Depending on the outcome, the computer adapts how it makes decisions in the future—in other words, “it learns”.

When it comes to ML there are basically three broad types:

Reinforcement Learning,
Unsupervised Learning, and
Supervised Learning.

What is Reinforcement Learning?

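Reinforcement Learning is probably best known through DeepMind’s AlphaGo, a program that learned to play the board game Go, in part by playing against itself, and beat world champion Lee Sedol in 2016. (IBM’s Deep Blue, which beat the chess world champion in 1997, relied mainly on search and a hand-tuned evaluation function rather than on reinforcement learning.)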

Reinforcement Learning is a technique that enables an algorithm to learn by trial and error, using feedback from its own actions and experiences. Much like training a dog with treats, Reinforcement Learning involves rewarding decisions that lead to success and penalizing decisions that lead to anything other than success, so that the algorithm gradually makes better decisions.

Examples of reinforcement learning applied in HR are still scarce. The technique is most prevalent in areas such as education (e.g., adapting content based on the progress of the student), finance and investment (e.g., advanced forecasting), supply chain operations (e.g., robots fulfilling orders in a warehouse), traffic flow optimization, and healthcare (e.g., accurate classification of biopsy images).

What is Supervised Learning?

The most common forms of ML across industries, and specifically the HR domain, are Supervised Learning, followed by Unsupervised Learning.

In Supervised Learning, we try to predict an outcome, such as whether an employee will leave the company, the risk of an employee being injured, or the ideal starting salary of a new employee.

To make predictions we need different input variables (data scientists call these variables “Features”). Our input features are limited only by our imagination (i.e., what we think will be important), what data we can get our hands on, or what data we can create (e.g., by knowing where someone works and where they live we can create a variable focused on employee commute distance).

An example: Supervised Learning informing Employee Turnover

Let’s examine a more detailed example of Supervised Learning—predicting who will leave an organization. Imagine that 1 in 5 new recruits leaves an organization in their first 12 months of tenure. To prevent such turnover, we could build a supervised learning model that predicts the likelihood of new starters leaving, so that our HR and managerial colleagues could intervene.

In this example, the model outcome being predicted is turnover risk, and the features used to predict turnover risk could include an employee’s demographic and employment characteristics (e.g., age, education level, role level, pay relative to market, month of employment, presence of development plans, and so on).

Assuming such a model was highly accurate, it would enable us to understand turnover among our new starter population from three angles.

Firstly, the model shows which factors are most influential in predicting turnover among our population. An example of such a model output is presented in the figure below, which illustrates whether a feature prevents turnover (green bars) or promotes turnover (red bars), and the relative importance of each feature in predicting turnover (i.e., longer bars denote more importance).
Secondly, the model also rates the likelihood of each new starter leaving the company, enabling focused intervention (e.g., the risk that Adam will leave in his first 12 months).
Thirdly, the model identifies the features preventing or promoting turnover risk for each employee. This individualized output can enable HR professionals to take informed and personalized action, regardless of whether they personally know each employee.

A supervised learning model used to predict employee turnover among new starters has the potential to reduce notable costs, including financial (e.g., separation, vacancy, recruitment, training, and replacement), reputational (e.g., eroding an EVP and/or reducing candidate appeal), and productivity-related costs (e.g., on average organizations invest between four weeks and three months training new employees). Some of these costs can be readily quantified so that we can identify organizational savings based on prevented turnover (e.g., preventing 2 in 10 resignations saves $xxx).

What is Unsupervised Learning?

Unlike Supervised Learning where we are trying to predict an outcome, Unsupervised Learning analyzes many variables simultaneously to identify similarities, patterns or relationships in the data. Unsupervised Learning is more about understanding what’s in the data. The two most common uses of unsupervised learning are focused on:

Clustering: automatically splitting the dataset into groups based on similarities among the features analyzed. Classically applied to consumers, but equally relevant to organizations, whereby we understand our employee segments (i.e., clusters) and determine whether our HR policies serve the segments.
Association mining: identifies sets of variables that often occur together in your dataset. For example, identifying injury patterns among workers at specific sites.

An example: Unsupervised Learning informing Employee Turnover

Cluster Analysis, the most famous form of unsupervised learning, can also help us better understand employee attrition. This approach can help group employees based on similar features (e.g., location, tenure, nationality, education level, age, performance level, etc.).

The figure below depicts the results of an analysis of the employees’ demographic features. Multiple demographic features are first reduced to two dimensions using t-SNE, a Manifold Learning method (another unsupervised method), and these two new dimensions are then clustered. The figure shows how the employees can be grouped together, in this case into twelve clusters, based on their demographic features.

Once grouped into clusters, the next step is to determine the risk of turnover for each group. Moreover, it is interesting to identify if there are some shared risk factors, practically indicating that employees within a cluster are experiencing the workplace in a similar way.

This last insight is of considerable practical significance, as it may help us tailor interventions that target specific employee clusters, thereby delivering maximum impact (i.e., retaining employees and reducing turnover costs) and return on our investment (i.e. for every $ spent we generated $ xxx in savings from reduced turnover).

Conclusion

We have begun to open the black box that is AI, providing a simple overview of ML. We looked at three broad types of ML (reinforcement, supervised and unsupervised) and examined some simple applications of each, where possible related to Human Resources. It is our genuine belief that through greater knowledge of what is possible with ML, HRBPs and organizational decision-makers will both expect more, and be willing to do more, in this technological domain. Our next piece will explain the step-by-step process for performing Supervised ML, making discussions with People Analytics teams more tangible and less abstract!

This resource is designed primarily for beginner to intermediate data scientists or analysts who are interested in identifying and applying machine learning algorithms to address the problems of their interest.

A typical question asked by a beginner, when facing a wide variety of machine learning algorithms, is “which algorithm should I use?” The answer to the question varies depending on many factors, including:

The size, quality, and nature of data.
The available computational time.
The urgency of the task.
What you want to do with the data.

Even an experienced data scientist cannot tell which algorithm will perform the best before trying different algorithms. We are not advocating a one-and-done approach, but we do hope to provide some guidance on which algorithms to try first depending on some clear factors.

Editor's note: This post was originally published in 2017. We are republishing it with an updated video tutorial on this topic. You can watch How to Choose a Machine Learning Algorithm below. Or keep reading to find a cheat sheet that helps you find the right algorithm for your project.

The machine learning algorithm cheat sheet

The machine learning algorithm cheat sheet helps you to choose from a variety of machine learning algorithms to find the appropriate algorithm for your specific problems. This article walks you through the process of how to use the sheet.

Since the cheat sheet is designed for beginner data scientists and analysts, we will make some simplified assumptions when talking about the algorithms.

The algorithms recommended here result from compiled feedback and tips from several data scientists and machine learning experts and developers. There are several issues on which we have not reached an agreement and for these issues, we try to highlight the commonality and reconcile the difference.

Additional algorithms will be added later as our library grows to encompass a more complete set of available methods.

How to use the cheat sheet

Read the path and algorithm labels on the chart as 'If <path label> then use <algorithm>.' For example:

If you want to perform dimension reduction then use principal component analysis.
If you need a numeric prediction quickly, use decision trees or linear regression.
If you need a hierarchical result, use hierarchical clustering.

Sometimes more than one branch will apply, and other times none of them will be a perfect match. It’s important to remember these paths are intended to be rule-of-thumb recommendations, so some of the recommendations are not exact. Several data scientists I talked with said that the only sure way to find the very best algorithm is to try all of them.

Types of machine learning algorithms

This section provides an overview of the most popular types of machine learning. If you’re familiar with these categories and want to move on to discussing specific algorithms, you can skip this section and go to “When to use specific algorithms” below.

Supervised learning

Supervised learning algorithms make predictions based on a set of examples. For example, historical sales can be used to estimate future prices. With supervised learning, you have an input variable that consists of labeled training data and a desired output variable. You use an algorithm to analyze the training data to learn the function that maps the input to the output. This inferred function maps new, unknown examples by generalizing from the training data to anticipate results in unseen situations.

Classification: When the data are being used to predict a categorical variable, supervised learning is also called classification. This is the case when assigning a label or indicator, such as dog or cat, to an image. When there are only two labels, this is called binary classification. When there are more than two categories, the problem is called multi-class classification.
Regression: When predicting continuous values, the problem becomes a regression problem.
Forecasting: This is the process of making predictions about the future based on past and present data. It is most commonly used to analyze trends. A common example is estimating next year’s sales based on the sales of the current year and previous years.

Semi-supervised learning

The challenge with supervised learning is that labeling data can be expensive and time-consuming. If labels are limited, you can use unlabeled examples to enhance supervised learning. Because the machine is not fully supervised in this case, we say the machine is semi-supervised. With semi-supervised learning, you use unlabeled examples with a small amount of labeled data to improve the learning accuracy.

Unsupervised learning

When performing unsupervised learning, the machine is presented with totally unlabeled data. It is asked to discover the intrinsic patterns that underlie the data, such as a clustering structure, a low-dimensional manifold, or a sparse tree or graph.

Clustering: Grouping a set of data examples so that examples in one group (or one cluster) are more similar to each other (according to some criteria) than to those in other groups. This is often used to segment the whole dataset into several groups. Analysis can be performed in each group to help users find intrinsic patterns.
Dimension reduction: Reducing the number of variables under consideration. In many applications, the raw data have very high dimensional features and some features are redundant or irrelevant to the task. Reducing the dimensionality helps to find the true, latent relationship.

Reinforcement learning

Reinforcement learning is another branch of machine learning which is mainly utilized for sequential decision-making problems. In this type of machine learning, unlike supervised and unsupervised learning, we do not need to have any data in advance; instead, the learning agent interacts with an environment and learns the optimal policy on the fly based on the feedback it receives from that environment. Specifically, in each time step, an agent observes the environment’s state, chooses an action, and observes the feedback it receives from the environment. The feedback from an agent’s action has many important components. One component is the resulting state of the environment after the agent has acted on it. Another component is the reward (or punishment) that the agent receives from performing that particular action in that particular state. The reward is carefully chosen to align with the objective for which we are training the agent. Using the state and reward, the agent updates its decision-making policy to optimize its long-term reward. With the recent advancements of deep learning, reinforcement learning gained significant attention since it demonstrated striking performances in a wide range of applications such as games, robotics, and control. To see reinforcement learning models such as Deep-Q and Fitted-Q networks in action, check out this article.
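
The observe, act, and update loop described above can be sketched with tabular Q-learning, a simpler relative of the Deep-Q networks mentioned. This is only an illustration under stated assumptions: the five-cell corridor environment, the reward of +1 at the goal, and the hyperparameters (alpha, gamma, epsilon) are all hypothetical choices, not taken from the article.

```python
import random

# Hypothetical toy environment: a corridor of 5 cells (states 0..4).
# The agent starts in cell 0; reaching cell 4 ends the episode with reward +1.
N_STATES = 5
ACTIONS = (-1, +1)  # move left, move right
GOAL = N_STATES - 1


def step(state, action):
    """Return (next_state, reward, done) for one environment transition."""
    next_state = min(max(state + action, 0), GOAL)
    done = next_state == GOAL
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    rng = random.Random(seed)
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Explore with probability epsilon, otherwise act greedily
            # (ties between equally valued actions are broken at random).
            if rng.random() < epsilon:
                a = rng.randrange(len(ACTIONS))
            else:
                best = max(q[state])
                a = rng.choice([i for i, v in enumerate(q[state]) if v == best])
            next_state, reward, done = step(state, ACTIONS[a])
            # Move Q(s, a) toward the reward plus the discounted best next value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q


q_table = train()
# Greedy action for every non-terminal state after training.
greedy_policy = [
    ACTIONS[max(range(len(ACTIONS)), key=lambda i: q_table[s][i])]
    for s in range(GOAL)
]
print(greedy_policy)
```

The state is the cell index, the reward is the feedback signal, and the Q-table is the decision-making policy being updated; in larger problems the table is replaced by a function approximator such as a neural network.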

Considerations when choosing an algorithm

When choosing an algorithm, always take these aspects into account: accuracy, training time and ease of use. Many users put the accuracy first, while beginners tend to focus on algorithms they know best.

When presented with a dataset, the first thing to consider is how to obtain results, no matter what those results might look like. Beginners tend to choose algorithms that are easy to implement and can obtain results quickly. This works fine, as long as it is just the first step in the process. Once you obtain some results and become familiar with the data, you may spend more time using more sophisticated algorithms to strengthen your understanding of the data, hence further improving the results.

Even in this stage, the best algorithms might not be the methods that have achieved the highest reported accuracy, as an algorithm usually requires careful tuning and extensive training to obtain its best achievable performance.

When to use specific algorithms

Looking more closely at individual algorithms can help you understand what they provide and how they are used. These descriptions provide more details and give additional tips for when to use specific algorithms, in alignment with the cheat sheet.

Linear regression and Logistic regression

Linear regression is an approach for modeling the relationship between a continuous dependent variable \(y\) and one or more predictors \(X\). The relationship between \(y\) and \(X\) can be linearly modeled as \(y=\beta^TX+\epsilon\). Given the training examples \(\{x_i,y_i\}_{i=1}^N\), the parameter vector \(\beta\) can be learnt.
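
As a minimal sketch of how the parameters can be learnt, the following fits a single-predictor model by ordinary least squares. The closed-form estimator is one common choice (the text does not say which estimator is meant), and the five data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical training examples lying roughly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

intercept, slope = fit_line(xs, ys)
print(f"beta = ({intercept:.2f}, {slope:.2f})")
```

Here the parameter vector is (intercept, slope), which matches the model above if each input is written as the vector (1, x).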

If the dependent variable is not continuous but categorical, linear regression can be transformed to logistic regression using a logit link function. Logistic regression is a simple, fast yet powerful classification algorithm. Here we discuss the binary case where the dependent variable \(y\) only takes binary values \(\{y_i\in\{-1,1\}\}_{i=1}^N\) (which can be easily extended to multi-class classification problems).

In logistic regression we use a different hypothesis class to try to predict the probability that a given example belongs to the '1' class versus the probability that it belongs to the '-1' class. Specifically, we will try to learn a function of the form \(p(y_i=1|x_i)=\sigma(\beta^T x_i)\) and \(p(y_i=-1|x_i)=1-\sigma(\beta^T x_i)\). Here \(\sigma(x)=\frac{1}{1+\exp(-x)}\) is a sigmoid function. Given the training examples \(\{x_i,y_i\}_{i=1}^N\), the parameter vector \(\beta\) can be learnt by maximizing the log-likelihood of \(\beta\) given the data set.
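
The maximization can be sketched with plain gradient ascent. With labels in {-1, 1} the log-likelihood is the sum of log sigma(y_i * beta^T x_i), and its gradient is the sum of (1 - sigma(y_i * beta^T x_i)) * y_i * x_i. The learning rate, step count, and toy data below are assumptions chosen for illustration; practical libraries typically use more sophisticated optimizers.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, lr=0.1, steps=2000):
    """Gradient ascent on sum_i log sigmoid(y_i * beta^T x_i), y_i in {-1, +1}."""
    beta = [0.0] * len(xs[0])
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for x, y in zip(xs, ys):
            margin = y * sum(b * xj for b, xj in zip(beta, x))
            weight = 1.0 - sigmoid(margin)  # large when the example is poorly fit
            for j, xj in enumerate(x):
                grad[j] += weight * y * xj
        beta = [b + lr * g for b, g in zip(beta, grad)]
    return beta


# Hypothetical 1-D data with a constant feature 1.0 for the intercept.
# The classes overlap slightly, so the maximum-likelihood solution is finite.
xs = [[1.0, x] for x in (-2.0, -1.0, -0.5, 0.5, 1.0, 2.0)]
ys = [-1, -1, 1, -1, 1, 1]

beta = fit_logistic(xs, ys)
p_high = sigmoid(beta[0] + beta[1] * 2.0)    # p(y = 1 | x = 2)
p_low = sigmoid(beta[0] + beta[1] * -2.0)    # p(y = 1 | x = -2)
print(beta, p_high, p_low)
```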

Group By Linear Regression

Logistic Regression in SAS Visual Analytics

Linear SVM and kernel SVM

A support vector machine (SVM) training algorithm finds the classifier represented by the normal vector \(w\) and bias \(b\) of the hyperplane. This hyperplane (boundary) separates different classes by as wide a margin as possible. The problem can be converted into a constrained optimization problem:
\begin{equation*}
\begin{aligned}
& \underset{w}{\text{minimize}}
& & ||w|| \\
& \text{subject to}
& & y_i(w^T X_i-b) \geq 1, \; i = 1, \ldots, n.
\end{aligned}
\end{equation*}

When the classes are not linearly separable, a kernel trick can be used to map a non-linearly separable space into a higher dimension linearly separable space.
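
To make the idea concrete, the sketch below uses an explicit feature map rather than a kernel function, which is the mapping a kernel computes implicitly. The 1-D data, the map phi(x) = (x, x^2), and the hand-picked hyperplane (w, b) are all assumptions for illustration; no SVM solver is run here.

```python
def feature_map(x):
    """Hypothetical explicit feature map phi(x) = (x, x^2)."""
    return (x, x * x)


def separable_by_threshold(xs, ys):
    """True if a single threshold on x puts each class on its own side."""
    labels = [y for _, y in sorted(zip(xs, ys))]
    changes = sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    return changes <= 1


# Class +1 lies on both sides of class -1, so no threshold on x works.
xs = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
ys = [1, 1, -1, -1, -1, 1, 1]

# Hand-picked hyperplane in the mapped space (not the output of a solver).
w = (0.0, 8.0 / 15.0)
b = 17.0 / 15.0

# The margin constraint from the text: y_i * (w^T phi(x_i) - b) >= 1.
margins = [
    y * (w[0] * p[0] + w[1] * p[1] - b)
    for p, y in zip(map(feature_map, xs), ys)
]

print(separable_by_threshold(xs, ys))
print(all(m >= 1 - 1e-9 for m in margins))
```

In the original space the classes cannot be split by one boundary, but after adding the squared coordinate a hyperplane satisfies every margin constraint. Kernel SVMs obtain the same effect without constructing the mapped features explicitly.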

When most independent variables (features) are numeric, logistic regression and SVM should be the first try for classification. These models are easy to implement, their parameters are easy to tune, and their performance is also pretty good. So these models are appropriate for beginners.

Trees and ensemble trees

Decision trees, random forest and gradient boosting are all algorithms based on decision trees. There are many variants of decision trees, but they all do the same thing: subdivide the feature space into regions with mostly the same label. Decision trees are easy to understand and implement. However, they tend to over-fit data when we exhaust the branches and go very deep with the trees. Random forest and gradient boosting are two popular ways to use tree algorithms to achieve good accuracy as well as overcome the over-fitting problem.
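
The core step of subdividing the feature space can be sketched as a search for the single best split on one feature. Gini impurity is used here as the purity measure, which is one common choice and an assumption on our part; the feature values and labels are made up.

```python
def gini(labels):
    """Gini impurity: 0.0 when every label in the region is the same."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(xs, ys):
    """Return (threshold, weighted_gini) of the best split 'x <= threshold'."""
    best = (None, gini(ys))  # baseline: no split at all
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2  # midpoint between neighboring values
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical single feature with two labels.
xs = [1, 2, 3, 10, 11, 12]
ys = ["a", "a", "a", "b", "b", "b"]

threshold, impurity = best_split(xs, ys)
print(threshold, impurity)
```

A full decision tree applies this search recursively to each resulting region. Random forest and gradient boosting combine many such trees, which is what counters the over-fitting of a single deep tree.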

Neural networks and deep learning

A convolutional neural network architecture (image source: Wikipedia, Creative Commons)

Neural networks flourished in the mid-1980s due to their parallel and distributed processing ability. But research in this field was impeded by the ineffectiveness of the back-propagation training algorithm that is widely used to optimize the parameters of neural networks. Support vector machines (SVM) and other simpler models, which can be easily trained by solving convex optimization problems, gradually replaced neural networks in machine learning.

In recent years, new and improved training techniques such as unsupervised pre-training and layer-wise greedy training have led to a resurgence of interest in neural networks. Increasingly powerful computational capabilities, such as graphics processing units (GPUs) and massively parallel processing (MPP), have also spurred the revived adoption of neural networks. The resurgent research in neural networks has given rise to the invention of models with thousands of layers.

In other words, shallow neural networks have evolved into deep learning neural networks. Deep neural networks have been very successful for supervised learning. When used for speech and image recognition, deep learning performs as well as, or even better than, humans. Applied to unsupervised learning tasks, such as feature extraction, deep learning also extracts features from raw images or speech with much less human intervention.

A neural network consists of three parts: input layer, hidden layers and output layer. The training samples define the input and output layers. When the output layer is a categorical variable, then the neural network is a way to address classification problems. When the output layer is a continuous variable, then the network can be used to do regression. When the output layer is the same as the input layer, the network can be used to extract intrinsic features. The number of hidden layers defines the model complexity and modeling capacity.
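
The three parts can be illustrated with a forward pass through a tiny network. The weights below are set by hand so that the network computes XOR; they are an assumption for illustration and are not the result of training. The ReLU activation is likewise our choice.

```python
def relu(values):
    """Rectified linear activation applied element-wise."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """One fully connected layer; weights[j] holds the input weights of unit j."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


# Hand-set parameters (hypothetical, not learned): 2 inputs, 2 hidden units, 1 output.
W_HIDDEN = [[1.0, 1.0], [1.0, 1.0]]
B_HIDDEN = [0.0, -1.0]
W_OUT = [[1.0, -2.0]]
B_OUT = [0.0]


def forward(x):
    """Input layer -> hidden layer (ReLU) -> linear output layer."""
    hidden = relu(dense(x, W_HIDDEN, B_HIDDEN))
    return dense(hidden, W_OUT, B_OUT)[0]


for x in ([0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]):
    print(x, forward(x))
```

The output here is a single continuous value, as in the regression case. For classification, the output layer would instead produce one score per category.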

k-means/k-modes, GMM (Gaussian mixture model) clustering

K Means Clustering

Gaussian Mixture Model

k-means/k-modes and GMM clustering aim to partition n observations into k clusters. K-means defines a hard assignment: each sample is associated with one and only one cluster. GMM, however, defines a soft assignment: each sample has a probability of being associated with each cluster. Both algorithms are simple and fast enough for clustering when the number of clusters k is given.
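
A minimal sketch of k-means with hard assignments, using the standard alternating procedure (Lloyd's algorithm). Initializing the centers with the first k points is a simplifying assumption made to keep the example deterministic; real implementations usually initialize more carefully. The points are made up.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=100):
    """Alternate between assigning points and recomputing centers."""
    centers = [points[i] for i in range(k)]  # naive init: first k points
    assignment = None
    for _ in range(iters):
        # Hard assignment: each point goes to exactly one nearest center.
        new_assignment = [
            min(range(k), key=lambda c: dist2(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment


# Hypothetical 2-D points forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

centers, assignment = kmeans(points, k=2)
print(assignment)
print(centers)
```

A GMM would replace the hard assignment step with per-cluster probabilities for every point.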

DBSCAN

When the number of clusters k is not given, DBSCAN (density-based spatial clustering of applications with noise) can be used. It groups samples that are connected through regions of high density and marks isolated samples as noise.
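
A compact sketch of the DBSCAN procedure. The parameter names eps (neighborhood radius) and min_pts (minimum neighbors, counting the point itself) follow common usage; the values and the points are assumptions for illustration.

```python
import math


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    unvisited, noise = None, -1
    labels = [unvisited] * len(points)

    def neighbors(i):
        # All points within eps of point i, including i itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not unvisited:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = noise  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point: keep expanding
    return labels


# Hypothetical data: two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]

labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

The number of clusters is never passed in; it emerges from the density parameters.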

Hierarchical clustering

Hierarchical partitions can be visualized using a tree structure (a dendrogram). Hierarchical clustering does not need the number of clusters as an input, and the partitions can be viewed at different levels of granularity (i.e., clusters can be refined or coarsened) by cutting the tree at different k.
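
A small sketch of agglomerative clustering that shows how the same data yields coarser or finer partitions. Single linkage (the distance between the closest pair of members) is an assumed choice, since the text does not name a linkage criterion, and the points are made up.

```python
import math


def single_linkage(points, k):
    """Repeatedly merge the two closest clusters until only k remain."""
    clusters = [[i] for i in range(len(points))]  # start: one cluster per point
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(
                    math.dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical 1-D points, written as 1-tuples.
points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]

fine = single_linkage(points, k=3)
coarse = single_linkage(points, k=2)
print(fine)
print(coarse)
```

This version reruns the merging for each k to stay short. A dendrogram records the full merge sequence once, so any level of granularity can be read off afterwards.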

PCA, SVD and LDA

We generally do not want to feed a large number of features directly into a machine learning algorithm since some features may be irrelevant or the “intrinsic” dimensionality may be smaller than the number of features. Principal component analysis (PCA), singular value decomposition (SVD), and latent Dirichlet allocation (LDA) can all be used to perform dimension reduction.

PCA is an unsupervised dimension reduction method that maps the original data space into a lower-dimensional space while preserving as much information as possible. PCA finds the subspace that best preserves the data variance, with the subspace defined by the dominant eigenvectors of the data’s covariance matrix.
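
The dominant eigenvector can be approximated with power iteration, as in the sketch below. Power iteration is an assumed, deliberately simple method (libraries typically use an SVD or an eigendecomposition routine), and the five 2-D rows are made up.

```python
import math


def top_component(rows, iters=200):
    """Dominant eigenvector of the sample covariance matrix via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    v = [1.0] * d  # arbitrary non-zero starting vector
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]  # renormalize to unit length
    return v, means


# Hypothetical 2-D data lying close to the line y = x.
rows = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0)]

component, means = top_component(rows)
# Project each centered row onto the component: a 1-D representation.
scores = [sum((r[j] - means[j]) * component[j] for j in range(2)) for r in rows]
print(component)
print(scores)
```

Because the points lie near a line, one coordinate along the first principal component retains most of the variance of the original two features.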

The SVD is related to PCA in the sense that the SVD of the centered data matrix (features versus samples) provides the dominant left singular vectors that define the same subspace as found by PCA. However, SVD is a more versatile technique as it can also do things that PCA may not do. For example, the SVD of a user-versus-movie matrix is able to extract the user profiles and movie profiles that can be used in a recommendation system. In addition, SVD is also widely used as a topic modeling tool, known as latent semantic analysis, in natural language processing (NLP).

A related technique in NLP is latent Dirichlet allocation (LDA). LDA is a probabilistic topic model that decomposes documents into topics in a similar way as a Gaussian mixture model (GMM) decomposes continuous data into Gaussian densities. Unlike the GMM, LDA models discrete data (words in documents) and constrains the topics to be a priori distributed according to a Dirichlet distribution.

Conclusions

This workflow is easy to follow. The takeaway messages when trying to solve a new problem are:

Define the problem. What problems do you want to solve?
Start simple. Be familiar with the data and the baseline results.
Then try something more complicated.

SAS Visual Data Mining and Machine Learning provides a good platform for beginners to learn machine learning and apply machine learning methods to their problems. Sign up for a free trial today!