Machine Learning Notes: XGBoost Tutorial

【1】Preface

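XGBoost, short for eXtreme Gradient Boosting, is a Kaggle powerhouse; in data mining competitions, everybody knows it.
Author: Tianqi Chen (University of Washington)
Background: XGBoost is one of the boosting algorithms. It improves on GBDT, making it more powerful and applicable to a wider range of problems.
The algorithm was released in 2014.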

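Intended readers:
1. You know the decision tree family: decision forests, AdaBoost, GBDT, etc.
2. You know bagging and boosting.
3. You plan to keep going further down the machine learning road.
4. Math-heavy warning: some familiarity with Taylor expansion and gradient descent helps, although I would personally rather not wade through the deeper formulas.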

Links:
【Tianqi Chen's slides】

【2】Algorithm overview

(1) Review of key concepts of supervised learning | What supervised learning actually learns

Label
Learn a set of rules from labeled data, then use them to label an unlabeled test set.
Hypothesis
Objective function = loss function + regularization
Minimize the objective function

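Training a supervised learning algorithm is the process of minimizing the objective function, that is, finding the set of parameters that makes it smallest. Here:

The loss function measures how well the model fits the training data; the smaller the loss, the more accurate the predictions on that data.
The regularization term measures model complexity; the smaller it is, the simpler the model.
The smaller the objective function, the better the model.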
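
To make the trade-off concrete, here is a minimal sketch of "objective = loss + regularization" for a one-weight linear model. The toy data, the L2 penalty form, and the value of lam are assumptions chosen for illustration, not something taken from XGBoost itself.

```python
def squared_loss(y_true, y_pred):
    """Mean squared error: how well the model fits the training data."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def l2_penalty(weights, lam):
    """L2 regularization: grows with the size of the weights (model complexity)."""
    return lam * sum(w * w for w in weights)

def objective(weights, xs, ys, lam):
    """Objective = loss + regularization for a linear hypothesis without intercept."""
    preds = [sum(w * x for w, x in zip(weights, row)) for row in xs]
    return squared_loss(ys, preds) + l2_penalty(weights, lam)

# Hypothetical toy data where y = 2 * x exactly
xs = [[1.0], [2.0], [3.0]]
ys = [2.0, 4.0, 6.0]

perfect_fit = objective([2.0], xs, ys, lam=0.1)  # zero loss, but the largest penalty
shrunk = objective([1.95], xs, ys, lam=0.1)      # tiny loss, smaller penalty
print(perfect_fit, shrunk)
```

With this penalty the slightly smaller weight gives a lower objective than the perfect fit, which is exactly the pull toward simpler models that the regularization term is meant to provide.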

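(2) Regression Tree and Ensemble | What a decision tree does

An algorithm built to mimic the way people make decisions
Information gain: decides which split to make at a node, mainly to reduce the loss
Maximum depth: affects model complexity
Pruning: mainly reduces model complexity, which is driven by the number of branches
Regression trees are not limited to regression; they can also do classification, ranking, and more, depending on how the objective function is defined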
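
As an illustration of how information gain ranks candidate splits, the sketch below uses the classic entropy-based definition from textbook decision trees. The labels are made up, and XGBoost itself scores splits with its own gradient-based gain rather than entropy.

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

parent = [1, 1, 1, 0, 0, 0]
pure_split = information_gain(parent, [1, 1, 1], [0, 0, 0])   # children are pure
mixed_split = information_gain(parent, [1, 1, 0], [1, 0, 0])  # children stay mixed
print(pure_split, mixed_split)
```

The split that separates the classes cleanly has the larger gain, so a greedy tree builder would choose it.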

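(3) Gradient Boosting (How do we learn)

Advantages of XGBoost over its predecessor GBDT:

1. Loss function: GBDT uses only first-order gradient information, while XGB uses a second-order Taylor expansion of the loss.
2. The XGB loss function can be customized.
3. Regularization: the XGB objective includes a regularization term, which reduces overfitting and controls model complexity.
4. Speed: the most time-consuming step in learning a decision tree is sorting the feature values, because the gain of every feature has to be computed when a node is split. XGBoost sorts the data once before training and stores it in a block structure that later iterations reuse, which greatly reduces the computation. The gain of each feature can then be computed in parallel threads.
5. Built-in cross-validation: cross-validation can be run at every boosting iteration, which makes it easy to find the optimal number of boosting rounds; grid search and cross-validation can be combined for tuning.
6. Pruning:
GBDT: stops splitting as soon as a split gives a negative loss reduction.
XGB: keeps splitting up to the specified maximum depth (max_depth), then prunes backward, removing splits that are not followed by any positive gain. The advantage: when a split with a negative gain (-2) is followed by one with a positive gain (+10), the sum is positive (-2+10=8>0) and the split is kept.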
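
Points 1, 3 and 6 can be sketched in a few lines. The formulas for the leaf weight and split gain follow the second-order derivation in the XGBoost slides and paper; the prune function is a simplified bottom-up illustration written for this note, and the dict-based tree layout is a hypothetical representation, not XGBoost's internal one.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_hess_logistic(y, raw_score):
    """First and second derivative of the logistic loss w.r.t. the raw score."""
    p = sigmoid(raw_score)
    return p - y, p * (1.0 - p)

def leaf_weight(G, H, lam):
    """Optimal leaf weight under the second-order approximation with L2 penalty lam."""
    return -G / (H + lam)

def split_gain(GL, HL, GR, HR, lam):
    """Loss reduction of a split, before the per-leaf penalty gamma is subtracted."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR))

def prune(node, gamma=0.0):
    """Bottom-up pruning. A node is None (leaf) or a dict {'gain', 'left', 'right'}."""
    if node is None:
        return None
    left = prune(node['left'], gamma)
    right = prune(node['right'], gamma)
    if left is None and right is None and node['gain'] < gamma:
        return None  # both children are leaves and the split does not pay off
    return {'gain': node['gain'], 'left': left, 'right': right}

# Four samples, all starting from raw score 0 (predicted probability 0.5)
labels = [1, 1, 0, 0]
stats = [grad_hess_logistic(y, 0.0) for y in labels]
GL, HL = sum(g for g, _ in stats[:2]), sum(h for _, h in stats[:2])
GR, HR = sum(g for g, _ in stats[2:]), sum(h for _, h in stats[2:])
gain = split_gain(GL, HL, GR, HR, lam=1.0)
print(gain, leaf_weight(GL, HL, 1.0), leaf_weight(GR, HR, 1.0))

# The -2 / +10 example: the negative split survives because a useful split sits below it
deep = {'gain': -2.0, 'left': {'gain': 10.0, 'left': None, 'right': None}, 'right': None}
shallow = {'gain': -2.0, 'left': None, 'right': None}
kept = prune(deep)
removed = prune(shallow)
```

A builder that stops at the first negative gain would never have discovered the +10 split, which is the behavior the comparison with GBDT describes.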

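【3】Parameters

XGBoost has a staggering number of parameters. Only some commonly used ones are listed below; for the full list, see the official documentation.
XGBoost parameters fall into 3 groups: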

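(1) General parameters

These control which booster is used during boosting; the common boosters are tree models and linear models.

booster [default=gbtree]
Two models are available: gbtree and gblinear. gbtree boosts with tree-based models, gblinear with linear models. The default is gbtree.
silent [default=0]
0 prints runtime messages; 1 runs silently and prints nothing. The default is 0. (Newer XGBoost releases deprecate silent in favor of verbosity.)
nthread [default to maximum number of threads available if not set]
- Number of threads XGBoost runs with. Defaults to the maximum number of threads available on the system.
- To run at full speed, leave this unset and the model will pick up the maximum number of threads automatically.
num_pbuffer [set automatically by xgboost, no need to be set by user]
Size of the prediction buffer, normally set to the number of training instances. The buffers are used to save the prediction results of the last boosting step.
num_feature [set automatically by xgboost, no need to be set by user]
Feature dimension used in boosting, set to the number of features. XGBoost sets it automatically; no manual setting is needed.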

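(2) Booster parameters

Parameters for Tree Booster
- eta [default=0.3]
  - Step size shrinkage used in the update to prevent overfitting. After each boosting step the weights of the new features are available directly; eta shrinks these weights to make the boosting process more conservative.
  - Typical final values: 0.01~0.2
  - Range: [0,1]
- gamma [default=0]
  - Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger it is, the more conservative the algorithm will be.
  - Range: [0,∞]
  - By default a node is split only when the split gives a positive loss reduction; gamma sets the minimum loss reduction required.
  - gamma makes the algorithm more conservative. A suitable value depends on the loss function, so it should be tuned.
- max_depth [default=6]
  - Maximum depth of a tree. The default is 6.
  - Range: [1,∞]
  - The deeper the tree, the more closely it fits the data (and the more it overfits), so this parameter also controls overfitting.
  - Tuning with cross-validation (xgb.cv) is recommended.
  - Typical values: 3-10
- min_child_weight [default=1]
  - Minimum sum of instance weights needed in a child. If a split would produce a leaf whose sum of instance weights is less than min_child_weight, splitting stops. In linear regression this is simply the minimum number of instances each node needs. The larger the parameter, the more conservative the algorithm, so raising it helps control overfitting.
  - Range: [0,∞]
- max_delta_step [default=0]
  - 0 means no constraint. A positive value makes the XGBoost update step more conservative.
  - Range: [0,∞]
  - Usually this does not need to be set, but it may help in logistic regression when the classes are extremely imbalanced.
- subsample [default=1]
  - Fraction of the training set used to build each tree. Setting it to 0.5 means XGBoost randomly draws 50% of the samples to build the tree, which helps prevent overfitting.
  - Range: (0,1]
- colsample_bytree [default=1]
  - Fraction of features sampled at random when building each tree. The default is 1.
- colsample_bylevel [default=1]
  - Usually not used, since subsample and colsample_bytree already serve the same purpose.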
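
The effect of eta is easiest to see on a stripped-down booster. In the sketch below every "tree" is a single leaf that predicts the mean residual (the optimal constant under squared error when regularization is ignored); the data and the function name boost_constant are hypothetical.

```python
def boost_constant(ys, eta, rounds, base_score=0.5):
    """Additive boosting with one-leaf trees: F_m = F_{m-1} + eta * leaf_value."""
    pred = base_score
    history = [pred]
    for _ in range(rounds):
        # best constant correction under squared error is the mean residual
        leaf_value = sum(y - pred for y in ys) / len(ys)
        pred += eta * leaf_value  # eta shrinks the step actually taken
        history.append(pred)
    return history

ys = [3.0, 5.0, 4.0]  # target mean is 4.0
fast = boost_constant(ys, eta=1.0, rounds=3)
slow = boost_constant(ys, eta=0.3, rounds=3)
print(fast)
print(slow)
```

With eta=1.0 the model jumps to the target mean in one round, while eta=0.3 approaches it gradually, which is why a small eta is usually paired with more boosting rounds.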
Parameters for Linear Booster and Tweedie Regression
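- lambda [default=0]
  - L2 regularization penalty coefficient
  - Handles the regularization part of XGBoost. Usually left unused, but it can be used to reduce overfitting.
- alpha [default=0]
  - L1 regularization penalty coefficient
  - Useful when the data is very high-dimensional, making the algorithm run faster.
- lambda_bias
  - L2 regularization on the bias. The default is 0. (There is no L1 regularization on the bias, because the bias is not important for L1.)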
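
Why alpha helps with very high-dimensional data can be seen on a one-dimensional toy problem: the L1 penalty sets small weights to exactly zero, while the L2 penalty only shrinks them. The sketch minimizes 0.5*(w - z)**2 + alpha*|w| + 0.5*lam*w**2 in closed form; it is an illustration of the two penalties, not the linear booster's actual update rule.

```python
def soft_threshold(z, alpha):
    """Move z toward zero by alpha, and clamp to exactly zero inside [-alpha, alpha]."""
    if z > alpha:
        return z - alpha
    if z < -alpha:
        return z + alpha
    return 0.0

def penalized_weight(z, lam, alpha):
    """Minimizer of 0.5*(w - z)**2 + alpha*|w| + 0.5*lam*w**2."""
    return soft_threshold(z, alpha) / (1.0 + lam)

unpenalized = [2.0, -0.05, 0.08, -1.2]  # hypothetical unregularized weights
l1_only = [penalized_weight(z, lam=0.0, alpha=0.1) for z in unpenalized]
l2_only = [penalized_weight(z, lam=1.0, alpha=0.0) for z in unpenalized]
print(l1_only)  # small weights become exactly 0.0
print(l2_only)  # every weight is halved, none becomes zero
```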

(3) Learning Task parameters

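objective [ default=reg:linear ]
- “reg:linear” – linear regression (renamed reg:squarederror in newer XGBoost releases).
- “reg:logistic” – logistic regression.
- “binary:logistic” – logistic regression for binary classification; outputs a probability.
- “binary:logitraw” – logistic regression for binary classification; outputs the raw score wTx before the logistic transformation.
- “count:poisson” – Poisson regression for count data; outputs the mean of the Poisson distribution.
  In Poisson regression, max_delta_step defaults to 0.7 (used to safeguard optimization).
- “multi:softmax” – makes XGBoost use the softmax objective for multiclass classification; num_class (the number of classes) must also be set.
- “multi:softprob” – same as softmax, but the output is a vector of ndata * nclass values, which can be reshaped into a matrix of ndata rows and nclass columns. Each row holds the probabilities of that sample belonging to each class.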
- “rank:pairwise” –set XGBoost to do ranking task by minimizing the pairwise loss
base_score [ default=0.5 ]
the initial prediction score of all instances, global bias
eval_metric [ default according to objective ]
Evaluation metric for the validation data. Each objective has a default metric (rmse for regression, error for classification, mean average precision for ranking).
- “rmse”: root mean square error
- “logloss”: negative log-likelihood
- “error”: Binary classification error rate. It is calculated as #(wrong cases)/#(all cases). For the predictions, the evaluation will regard the instances with prediction value larger than 0.5 as positive instances, and the others as negative instances.
- “merror”: Multiclass classification error rate. It is calculated as #(wrong cases)/#(all cases).
- “mlogloss”: Multiclass logloss
- “auc”: Area under the curve for ranking evaluation
- “ndcg”:Normalized Discounted Cumulative Gain
- “map”:Mean average precision
seed [ default=0 ]
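
For reference, three of the metrics listed above can be computed by hand as follows. This is a plain-Python sketch based on the definitions given in the list; the labels and predicted probabilities are made up.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def error_rate(y_true, y_prob, threshold=0.5):
    """#(wrong cases) / #(all cases); predictions larger than the threshold count as positive."""
    wrong = sum(1 for t, p in zip(y_true, y_prob) if (1 if p > threshold else 0) != t)
    return wrong / len(y_true)

def logloss(y_true, y_prob, eps=1e-15):
    """Negative log-likelihood of binary labels under the predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return -total / len(y_true)

y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]  # hypothetical predicted probabilities
print(rmse(y_true, y_prob), error_rate(y_true, y_prob), logloss(y_true, y_prob))
```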

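【4】Implementation in Python

(1) API overview

The xgb model currently has two interfaces:

import xgboost                      # native interface (xgb.DMatrix, xgb.train, xgb.cv)
from xgboost import XGBClassifier   # scikit-learn style wrapper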

(2) Tuning XGBoost

Method 1: tune by hand, using XGBClassifier() from the xgboost package
Its parameters can be modified manually, starting from the defaults.
Method 2: random search.
Use xgb.cv; KFold() can also be used here.

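import xgboost as xgb
from sklearn.model_selection import KFold  # sklearn.cross_validation was removed in scikit-learn 0.20

# train_feat (DataFrame), predictors (feature column names) and label (target column name)
# are assumed to be defined beforehand
kf = KFold(n_splits=5, shuffle=True, random_state=520)
for i, (train_index, test_index) in enumerate(kf.split(train_feat)):
    # split the training data into 5 equal parts and hold one part out for evaluation
    xgb_train = xgb.DMatrix(train_feat[predictors].iloc[train_index], train_feat[label].iloc[train_index])
    xgb_eval = xgb.DMatrix(train_feat[predictors].iloc[test_index], train_feat[label].iloc[test_index])
    print("..........starting training round {}".format(i))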

Note: xgb.cv() performs k-fold cross-validation; it is not a parameter search function.
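
What "k-fold cross-validation" means at the index level can be sketched without any library. The helper kfold_indices below is a hypothetical stand-in written for this note; it only shows how each sample ends up in the held-out fold exactly once.

```python
import random

def kfold_indices(n_samples, n_folds, seed=0):
    """Yield (train_indices, test_indices) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]  # near-equal sized folds
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

splits = list(kfold_indices(n_samples=10, n_folds=5))
for train_idx, test_idx in splits:
    print(sorted(test_idx), len(train_idx))
```

The parameters stay fixed across all folds; searching over parameters is a separate loop wrapped around a routine like this.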

import pandas as pd
import xgboost as xgb
import numpy as np
import matplotlib.pyplot as plt
import random
from ggplot import *
import seaborn as sns
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix, mean_squared_error

train_path = 'D:/Pywork/Titanic/train.csv'
test_path = 'D:/Pywork/Titanic/test.csv'
train = pd.read_csv(train_path)
test = pd.read_csv(test_path)

# X_train / y_train (and X_test / y_test used later) are assumed to have been
# prepared from the data above; that preprocessing step is not shown here.
best_param = dict()
best_logloss = np.inf  # np.Inf was removed in NumPy 2.0
best_logloss_index = 0
X_hote = pd.get_dummies(X_train)  # one-hot encode the categorical columns
print(X_hote.info())
dtrain = xgb.DMatrix(X_hote, y_train)

for i in range(50):
    xgb_params = {
        'objective': "binary:logistic",
        'max_depth': np.random.randint(6, 11),          # tree depth; larger values overfit more easily
        'eta': np.random.uniform(.01, .3),              # acts like a learning rate
        'gamma': np.random.uniform(0.0, 0.2),           # minimum loss reduction required for a split
        'subsample': np.random.uniform(.6, .9),         # random sampling of training rows
        'colsample_bytree': np.random.uniform(.5, .8),  # column sampling when building each tree
        'min_child_weight': np.random.randint(1, 41),
        'max_delta_step': np.random.randint(1, 11),
        'silent': 1
    }
    cv_nfold = 5
    cv_nround = 50
    bst_cv1 = xgb.cv(params=xgb_params,          # dict of training parameter names and values
                     dtrain=dtrain,              # training data
                     num_boost_round=cv_nround,  # number of boosting iterations
                     # evals: list of datasets to evaluate during training, e.g. evals = [(dtrain, 'train')]
                     # feval: custom evaluation function
                     # verbose_eval (bool or int): also requires at least one element in evals.
                     # If True, the evaluation results are printed; if a number such as 5, they are printed every 5 iterations
                     nfold=cv_nfold,
                     seed=0,
                     metrics=["auc", "rmse", "error", "logloss"],
                     maximize=False,             # whether to maximize the evaluation metric
                     early_stopping_rounds=10,
                     verbose_eval=None,
                     )

    min_logloss = bst_cv1['test-logloss-mean'].min()
    min_logloss_index = bst_cv1['test-logloss-mean'].idxmin()

    if min_logloss < best_logloss:
        best_logloss = min_logloss
        best_logloss_index = min_logloss_index
        best_param = xgb_params

nround = best_logloss_index + 1  # the index is 0-based, so the number of rounds is index + 1
print(best_logloss)
print('best_round = %d' % (nround))
print('best_param : ------------------------------')
print(best_param)  # show the best parameter combination; the final model below uses it
# bst_cv1 holds the CV history of the last sampled parameter set, not necessarily the best one
plt.figure()
plt.plot(bst_cv1['train-logloss-mean'], 'g', label='train')
plt.plot(bst_cv1['test-logloss-mean'], 'r', label='test')
plt.legend()
plt.show()

Method 3: grid search with cross-validation

from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
from xgboost import XGBClassifier

params = {'max_depth': list(range(2, 7)),
          'n_estimators': list(range(100, 1100, 200)),
          'learning_rate': [0.05, 0.1, 0.25, 0.5]}
xgbc_best = XGBClassifier()
gs = GridSearchCV(xgbc_best, params, n_jobs=-1, cv=5, verbose=1)
gs.fit(X_train, y_train)
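
Before launching a grid search it helps to count how many models it will fit. The sketch below enumerates a small hypothetical grid over the same three parameter names; the values and the fold count are placeholders chosen to keep the numbers easy to check.

```python
from itertools import product

# Hypothetical, deliberately small grid
grid = {
    'max_depth': [2, 4],
    'n_estimators': [100, 300],
    'learning_rate': [0.05, 0.1],
}
cv_folds = 5

# Every combination of one value per parameter
combos = [dict(zip(grid, values)) for values in product(*grid.values())]
n_fits = len(combos) * cv_folds  # each combination is trained once per fold

print(len(combos), n_fits)
print(combos[0])
```

The count grows multiplicatively with every parameter added, which is the main reason random search is often tried first.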

(3) Plotting train/test auc/rmse/error

def xgb_plot(history, output):
    # history: DataFrame returned by xgb.cv; output: one of "auc", "error", "logloss", "rmse"
    metrics = ["auc", "error", "logloss", "rmse"]
    if output not in metrics:
        raise ValueError("output must be one of %s" % metrics)
    parts = []
    for split in ("train", "test"):
        # pick columns by name so the result does not depend on the column order of xgb.cv
        part = pd.DataFrame({'id': [i + 1 for i in history.index]})
        for m in metrics:
            part[m + '_mean'] = history['%s-%s-mean' % (split, m)].values
            part[m + '_std'] = history['%s-%s-std' % (split, m)].values
        part['Class'] = split
        parts.append(part)
    his = pd.concat(parts)

    # ribbon of mean +/- one standard deviation across folds
    his['y_min'] = his[output + '_mean'] - his[output + '_std']
    his['y_max'] = his[output + '_mean'] + his[output + '_std']
    return (ggplot(his, aes(x='id', y=output + '_mean', ymin='y_min', ymax='y_max', fill='Class'))
            + geom_line() + geom_ribbon(alpha=0.5)
            + labs(x="nround", y='', title="XGB Cross Validation %s" % output.upper()))

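The x-axis is the number of boosting iterations.
The gap between the train and test curves indirectly reflects model complexity and helps check for overfitting.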

`xgb_plot(bst_cv1, 'auc')`

`xgb_plot(bst_cv1, 'rmse')`
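
The same reading of the curves can be done numerically. The sketch below takes two hypothetical logloss histories, finds the round an early-stopping rule with a given patience would settle on, and tracks the train/test gap; the numbers and the helper best_round_with_patience are invented for illustration.

```python
def best_round_with_patience(test_metric, patience):
    """Index of the lowest test value, stopping after `patience` rounds without improvement."""
    best_idx = 0
    for i, value in enumerate(test_metric):
        if value < test_metric[best_idx]:
            best_idx = i
        elif i - best_idx >= patience:
            break  # no improvement for `patience` rounds
    return best_idx

# Hypothetical mean logloss per boosting round
train_loss = [0.60, 0.50, 0.42, 0.36, 0.31, 0.27, 0.24]
test_loss = [0.62, 0.55, 0.50, 0.48, 0.49, 0.51, 0.53]

best = best_round_with_patience(test_loss, patience=2)
gap = [te - tr for tr, te in zip(train_loss, test_loss)]
print(best, [round(g, 2) for g in gap])
```

Training loss keeps falling while test loss turns upward after the best round, and the widening gap is the overfitting signal the plots are meant to reveal.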

(4) Build the model, predict, and print evaluation metrics

Method 1: use the native xgb.train()

# use the tuning result from above: best_param

md_1 = xgb.train(best_param, dtrain, num_boost_round=nround)
dtest = xgb.DMatrix(X_test)  # X_test must have the same columns as the training matrix
xgbc_y_predict = [1 if value >= 0.5 else 0 for value in md_1.predict(dtest)]

accuracy = accuracy_score(y_test, xgbc_y_predict)
f1 = f1_score(y_test, xgbc_y_predict)  # a separate name keeps the f1_score function usable later
print("Accuracy: %.2f%%" % (accuracy * 100.0))
print("F1 Score: %.2f%%" % (f1 * 100.0))

# save model
md_1.save_model('xgb.model')
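
For readers who want to see what the two printed metrics measure, here is a hand-rolled sketch of thresholding probabilities and computing accuracy and F1. The labels and probabilities are made up, and in practice the scikit-learn functions used above are the ones to rely on.

```python
def threshold(probs, cutoff=0.5):
    """Turn predicted probabilities into 0/1 labels."""
    return [1 if p >= cutoff else 0 for p in probs]

def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def f1(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [1, 0, 1, 1, 0]
probs = [0.8, 0.3, 0.4, 0.9, 0.6]  # hypothetical model outputs
y_pred = threshold(probs)
print(y_pred, accuracy(y_true, y_pred), f1(y_true, y_pred))
```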

Method 2: use XGBClassifier()

md_2 = XGBClassifier(**best_param)  # ** unpacks the param dict into keyword arguments
md_2.fit(X_train, y_train)

ypred = md_2.predict(X_test)
predictions = [round(value) for value in ypred]

# print evaluation metrics
MSE = mean_squared_error(y_test, predictions)
print("MSE: %.2f%%" % (MSE * 100.0))
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
f1 = f1_score(y_test, predictions)  # a separate name keeps the f1_score function usable later
print("F1 Score: %.2f%%" % (f1 * 100.0))

(5) Plotting the feature importance ranking

ax = xgb.plot_importance(md_1, height=0.5)
fig = ax.figure
fig.set_size_inches(25,20)                  # adjust the figure size and spacing
plt.show()

(6) Feature selection based on importance

from sklearn.feature_selection import SelectFromModel

# sorted(list(selection_model.booster().get_score(importance_type='weight').values()),reverse = True)

importance_plot = pd.DataFrame({'feature':list(X_train.columns),'importance':md_2.feature_importances_})
importance_plot = importance_plot.sort_values(by='importance')
importance_plot = importance_plot.reset_index(drop=True)
thresholds = importance_plot.importance
thresholds_valid = np.unique(thresholds[thresholds != 0])


for thresh in thresholds_valid:
    # select features using threshold
    selection = SelectFromModel(md_2, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train model
    selection_model = XGBClassifier(**best_param)
    selection_model.fit(select_X_train, y_train)
    # eval model
    select_X_test = selection.transform(X_test)
    y_pred = selection_model.predict(select_X_test)
    predictions = [round(value) for value in y_pred]
    accuracy = accuracy_score(y_test, predictions)
    print("Thresh=%.4f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy*100.0))


thresh = 0.034
selected_features = list(importance_plot[importance_plot.importance > thresh]['feature'])
print('selected features are :\n %s'%selected_features)
select_X_train = X_train[selected_features]                        # keep the features whose importance passes the threshold

n_features = select_X_train.shape[1]
print('total: %d features are selected' %n_features)

selection_model = XGBClassifier(**best_param)
selection_model.fit(select_X_train, y_train)

select_X_test = X_test[selected_features]
y_pred = selection_model.predict(select_X_test)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_test, predictions)
f1 = f1_score(y_test, predictions)  # a separate name avoids shadowing the f1_score function
print("Accuracy: %.2f%%" % (accuracy * 100.0))
print("F1 Score: %.2f%%" % (f1 * 100.0))
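
The threshold sweep above boils down to a simple filter, sketched here without any model. The importance scores are invented, and the "greater than or equal" rule mirrors how SelectFromModel treats its threshold.

```python
def select_features(importances, thresh):
    """Keep the features whose importance is at least `thresh`."""
    return [name for name, score in importances.items() if score >= thresh]

# Hypothetical importance scores for a few Titanic-style columns
importances = {'Sex': 0.40, 'Pclass': 0.25, 'Age': 0.20, 'Fare': 0.12, 'Embarked': 0.03}

# Try every distinct non-zero importance as a threshold, from loosest to strictest
for thresh in sorted(set(v for v in importances.values() if v != 0)):
    kept = select_features(importances, thresh)
    print("thresh=%.2f keeps %d features: %s" % (thresh, len(kept), kept))
```

Each threshold yields a smaller feature set; retraining and scoring the model on each set, as the loop above does, shows where accuracy starts to drop.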