1. Support vector machine (SVM)

Key point: C parameter

SVM creates a decision boundary to distinguish two or more classes.

A soft-margin support vector machine tries to solve an optimization problem with two objectives:

Maximize the margin, i.e. the distance between the decision boundary and the nearest points of each class (the support vectors)

Maximize the number of correctly classified points in the training set

There is a clear trade-off between these two goals. To classify every training point correctly, the decision boundary may have to sit very close to one class. In that case, the boundary becomes highly sensitive to noise and to small changes in the input features, so accuracy on new observations may suffer.

On the other hand, the margin can be made as large as possible for each class, at the cost of some misclassified exceptions. This trade-off is controlled by the C parameter.

The C parameter adds a penalty for each misclassified data point. If C is small, the penalty for misclassified points is low, so a decision boundary with a large margin is chosen at the cost of a larger number of misclassifications.

If C is large, the SVM tries to minimize the number of misclassified examples because the penalty is high, which results in a decision boundary with a smaller margin. The penalty is not the same for every misclassified example; it is proportional to the point's distance from the decision boundary.
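The effect of C can be sketched with scikit-learn's SVC; the toy data below is hypothetical:

```python
import numpy as np
from sklearn.svm import SVC

# Two hypothetical, roughly separable clusters
rng = np.random.RandomState(0)
X = np.r_[rng.randn(20, 2) - 2, rng.randn(20, 2) + 2]
y = np.r_[np.zeros(20), np.ones(20)]

# Small C: low misclassification penalty, wide margin
soft = SVC(kernel="linear", C=0.01).fit(X, y)
# Large C: high misclassification penalty, narrow margin
hard = SVC(kernel="linear", C=100.0).fit(X, y)

# A wide margin leaves more points on or inside it,
# so the small-C model ends up with more support vectors
print(len(soft.support_), len(hard.support_))
```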

2. Decision tree

Key point: information gain

When selecting the feature to split on, the decision tree algorithm tries to achieve:

Higher predictive power

Lower impurity

Lower entropy

Entropy is a measure of uncertainty or randomness. The more randomness a variable has, the higher its entropy. A variable with a uniform distribution has the highest entropy. For example, a die has six equally probable outcomes, so it follows a uniform distribution and has high entropy.

(Figure: entropy vs. randomness)

The algorithm selects the split that produces the purest child nodes. In other words, the "information gain" is simply the difference between the entropy before and after the split.
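The entropy and information-gain calculation described above can be sketched in plain Python (the function names are my own):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent node minus the weighted entropy of its children."""
    n = len(parent)
    child = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - child

# A perfectly balanced node has the highest entropy: 1 bit for two classes
parent = ["a", "a", "b", "b"]
print(entropy(parent))                                    # 1.0
# A split into pure children recovers all of that uncertainty
print(information_gain(parent, ["a", "a"], ["b", "b"]))   # 1.0
```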

3. Random forest

Key points: bootstrapping and feature randomness

Random forest is an ensemble of many decision trees. Its success depends largely on using uncorrelated trees: if the trees are the same or very similar, the overall result will be close to that of a single decision tree. Random forest achieves uncorrelated trees through bootstrapping and feature randomness.

Bootstrapping means randomly sampling from the training data with replacement. The resulting samples are called bootstrap samples.

Feature randomness is achieved by randomly restricting the features each decision tree in the forest may use. The max_features parameter controls the number of features considered.

(Figure: feature randomness)
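Both ideas can be sketched with scikit-learn's RandomForestClassifier on hypothetical data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# bootstrap=True draws each tree's training set with replacement;
# max_features limits how many features a split may consider,
# which helps decorrelate the trees
forest = RandomForestClassifier(
    n_estimators=100, bootstrap=True, max_features=3, random_state=0
).fit(X, y)
print(forest.score(X, y))
```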

4. Gradient boosted decision trees (GBDT)

Key points: learning_rate and n_estimators

GBDT is a combination of decision trees and the boosting method, which means the decision trees are connected sequentially.

learning_rate and n_estimators are the two key hyperparameters of gradient boosted decision trees.

The learning rate simply controls how fast the model learns. The advantage of learning slowly is that the model becomes more robust and generalizes better. However, slow learning comes at a price: training the model takes more time, which brings us to the other important hyperparameter.

The n_estimators parameter is the number of trees used in the model. If the learning rate is low, we need more trees to train the model. However, the number of trees must be chosen very carefully: using too many trees carries a high risk of overfitting.
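The interaction of the two hyperparameters can be sketched with scikit-learn's GradientBoostingClassifier (the data and parameter values here are hypothetical):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A low learning_rate paired with many trees learns slowly but steadily;
# a high learning_rate needs far fewer trees to fit the training data
slow = GradientBoostingClassifier(
    learning_rate=0.05, n_estimators=200, random_state=0
).fit(X_tr, y_tr)
fast = GradientBoostingClassifier(
    learning_rate=0.5, n_estimators=20, random_state=0
).fit(X_tr, y_tr)
print(slow.score(X_te, y_te), fast.score(X_te, y_te))
```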

5. Naive Bayes classifier

Key point: what are the advantages of the naive assumption?

Naive Bayes is a supervised machine learning algorithm for classification: the task is to find the class of an observation given its feature values, i.e. p(yi | x1, x2, ..., xn).

Naive Bayes assumes that the features are independent of each other, with no correlation between them. However, this is rarely the case in real life. This naive assumption of uncorrelated features is why the algorithm is called "naive".

Compared with more complex algorithms, the assumption that all features are independent makes Naive Bayes very fast. In some cases, speed matters more than accuracy.

It is well suited to high-dimensional data, such as text classification and email spam detection.
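A minimal spam-detection sketch with scikit-learn's MultinomialNB; the tiny corpus below is hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# A tiny hypothetical spam corpus
texts = ["win money now", "free prize win", "meeting at noon", "lunch at noon"]
labels = ["spam", "spam", "ham", "ham"]

# Word counts as features; Naive Bayes treats each word as independent
clf = make_pipeline(CountVectorizer(), MultinomialNB()).fit(texts, labels)
# The unseen message shares words only with the spam examples
print(clf.predict(["free money prize"]))
```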

6. K nearest neighbor

Key points: when to use and not to use

K-nearest neighbor (KNN) is a supervised machine learning algorithm that can be used for both classification and regression tasks. The main principle of KNN is that the value of a data point is determined by the data points around it.

As the number of data points increases, the KNN algorithm becomes very slow, because the model must store all the data points in order to compute the distances between them. This also makes the algorithm memory-inefficient.

Another disadvantage is that KNN is sensitive to outliers, because an outlier still influences the nearest-neighbor vote (even when it lies far away).

On the positive side:

Easy to understand

It makes no assumptions about the data, so it can be applied to nonlinear tasks.

It works well for multi-class classification

Suitable for classification and regression tasks
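The neighbor-vote principle can be sketched with scikit-learn on hypothetical one-dimensional data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# One-dimensional toy data: two well-separated groups
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Each prediction is a majority vote over the 3 nearest stored points,
# which is why the model must keep the entire training set in memory
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[1.5], [10.5]]))
```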

7. K-means clustering

Key points: when to use and not to use

K-means clustering aims to divide the data into K clusters, so that the data points in the same cluster are similar, while the data points in different clusters are farther apart.

The K-means algorithm cannot guess how many clusters exist in the data. The number of clusters must be specified in advance, which can be a difficult task.

The algorithm slows down as the number of samples increases, because at each step it visits all the data points and computes distances.

K-means can only draw linear boundaries. If the groups in the data are separated by a nonlinear structure, K-means is not a good choice.

On the positive side:

Easy to explain

It is fast

Scalable for large datasets

The positions of the initial centroids can be chosen intelligently to speed up convergence

Convergence is guaranteed
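A minimal sketch with scikit-learn's KMeans on hypothetical, well-separated data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two hypothetical well-separated blobs of 30 points each
rng = np.random.RandomState(0)
X = np.r_[rng.randn(30, 2), rng.randn(30, 2) + 6]

# n_clusters must be chosen up front; init="k-means++" places the
# initial centroids intelligently to speed up convergence
km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(X)
print(np.bincount(km.labels_))
```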
