scikit-learn 学习(二) 科学数据处理中的统计学习教程(上)

本教程介绍如何在科学数据处理中使用统计学习(statistical learning)技术,重点是 scikit-learn 中的相关用法.

1. 统计学习: 问题设置与 estimator 对象

Datasets

from sklearn import datasets
iris = datasets.load_iris()
data = iris.data
data.shape
(150, 4)
digits = datasets.load_digits()
digits.images.shape
(1797, 8, 8)
import matplotlib.pyplot as plt
plt.imshow(digits.images[-1], cmap=plt.cm.gray_r)
<matplotlib.image.AxesImage at 0x7fd41bfb1a20>

为了在 scikit 中使用这个数据集,把每张 8*8 的图像转换为长度为64的向量

data = digits.images.reshape((digits.images.shape[0], -1))
data[0]
array([  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.,   0.,   0.,  13.,
        15.,  10.,  15.,   5.,   0.,   0.,   3.,  15.,   2.,   0.,  11.,
         8.,   0.,   0.,   4.,  12.,   0.,   0.,   8.,   8.,   0.,   0.,
         5.,   8.,   0.,   0.,   9.,   8.,   0.,   0.,   4.,  11.,   0.,
         1.,  12.,   7.,   0.,   0.,   2.,  14.,   5.,  10.,  12.,   0.,
         0.,   0.,   0.,   6.,  13.,  10.,   0.,   0.,   0.])

estimators 对象

estimator 是任何能从数据中学习的对象:它可以是分类、回归或聚类算法,也可以是从原始数据中提取(extract)/过滤(filter)有用特征的 transformer.

所有 estimator 都有 fit 方法,接受一个数据集(通常是二维数组):

estimator.fit(data)

Estimator parameters: estimator 的所有参数都可以在初始化时设置,也可以事后通过修改相应属性(attribute)来改变:

estimator = Estimator(param1=1, param2=2)
estimator.param1

Estimated parameters: 用 estimator 拟合数据之后,参数就从数据中估计出来;所有估计出的参数都是 estimator 的属性,并以下划线结尾:

estimator.estimated_param_
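
下面是把这三条约定串起来的一个最小示意(以 KMeans 聚类为例,并非教程原文的代码,参数取值仅作演示):

```python
from sklearn import datasets
from sklearn.cluster import KMeans

iris = datasets.load_iris()
estimator = KMeans(n_clusters=3)   # 初始化时设置参数
estimator.n_init = 10              # 也可以事后直接修改对应属性
estimator.fit(iris.data)           # 所有 estimator 都有 fit 方法
print(estimator.cluster_centers_)  # 估计出的参数以下划线结尾
```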

2. 监督学习: 从高维观测预测输出变量

最近邻和维度灾难

iris 数据集是一个分类任务:根据花瓣和花萼的长度与宽度,识别出3种不同的鸢尾花(iris)类别.

import numpy as np
from sklearn import datasets
iris = datasets.load_iris()
iris_X = iris.data
iris_y = iris.target
np.unique(iris_y)
array([0, 1, 2])

KNN分类

# split iris data in train and test data
# A random permutation, to split the data randomly
np.random.seed(0)
indices = np.random.permutation(len(iris_X))
iris_X_train = iris_X[indices[:-10]]
iris_y_train = iris_y[indices[:-10]]
iris_X_test = iris_X[indices[-10:]]
iris_y_test = iris_y[indices[-10:]]
# Create and fit a nearest-neighbor classifier
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(iris_X_train, iris_y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
knn.predict(iris_X_test)
array([1, 2, 1, 0, 0, 0, 2, 1, 2, 0])
iris_y_test
array([1, 1, 1, 0, 0, 0, 2, 1, 2, 0])

维度灾难

为了使估计有效,你需要相邻点之间的距离小于某个值 d(这个值取决于具体问题).在1维情形下,平均需要 n ~ 1/d 个点;如果特征数为 p,则需要 n ~ 1/d^p 个点.例如1维中需要10个点,p 维中就需要 10^p 个点:所需样本量随维度指数增长,这就是维度灾难.
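
按这个估算可以写一个小示意(这里 d=0.1 只是演示用的假设值):

```python
# 所需样本数 n ~ 1/d^p 随维度 p 指数增长
d = 0.1
for p in (1, 2, 5, 10):
    print("p = %2d, 大约需要 n ~ %g 个点" % (p, (1 / d) ** p))
```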

线性模型:从回归到稀疏

Diabetes 数据集

这个数据集包含442个病人的10个生理变量,以及一年后疾病进展的指标;任务是根据这些生理变量预测疾病进展.

diabetes = datasets.load_diabetes()
diabetes_X_train = diabetes.data[:-20]
diabetes_X_test = diabetes.data[-20:]
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

线性回归

from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit(diabetes_X_train, diabetes_y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
print(regr.coef_)
[  3.03499549e-01  -2.37639315e+02   5.10530605e+02   3.27736980e+02
  -8.14131709e+02   4.92814588e+02   1.02848452e+02   1.84606489e+02
   7.43519617e+02   7.60951722e+01]
# The mean square error
np.mean((regr.predict(diabetes_X_test) - diabetes_y_test)**2)
2004.5676026898225
# Explained variance score: 1 is perfect prediction
# and 0 means that there is no linear relationship
# between X and y.
regr.score(diabetes_X_test, diabetes_y_test)
0.58507530226905713

收缩(shrinkage)

通过在损失函数(即优化目标)中加入惩罚项,训练求解参数时会同时考虑系数的大小:惩罚系数越大,系数被压得越小,从而抑制影响较小的特征.常用的收缩方法有岭回归(L2 正则化,系数向 0 收缩但一般不会恰好为 0)和 lasso(L1 正则化,可以把部分系数压到恰好为 0,只保留重要特征)。
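
一个最小的对比示意(沿用上文 Diabetes 数据集的划分 diabetes_X_train / diabetes_y_train,alpha 取 0.1 仅作演示):

```python
from sklearn import linear_model

ridge = linear_model.Ridge(alpha=0.1).fit(diabetes_X_train, diabetes_y_train)
lasso = linear_model.Lasso(alpha=0.1).fit(diabetes_X_train, diabetes_y_train)
print('Ridge 中恰好为 0 的系数个数:', (ridge.coef_ == 0).sum())  # L2: 通常没有
print('Lasso 中恰好为 0 的系数个数:', (lasso.coef_ == 0).sum())  # L1: 会有若干个
```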

如果每个维度的数据点很少,观察噪声就会导致很大的方差

X = np.c_[ .5, 1].T # 按列组合
print(np.c_[ .5, 1])
print(X)
[[ 0.5  1. ]]
[[ 0.5]
 [ 1. ]]
%matplotlib inline
y = [0.5, 1]
test = np.c_[0, 2].T
regr = linear_model.LinearRegression()

import matplotlib.pyplot as plt
plt.figure()

np.random.seed(0)
for _ in range(6):
    this_X = 0.1 * np.random.normal(size=(2, 1)) + X
    regr.fit(this_X, y)
    plt.plot(test, regr.predict(test))
    plt.scatter(this_X, y, s=3)

(图:普通线性回归在6组加噪声数据上的拟合直线,斜率波动很大)

regr = linear_model.Ridge(alpha=0.1)
plt.figure()
np.random.seed(0)
for _ in range(6):
    this_X = 0.1 * np.random.normal(size=(2, 1)) + X
    regr.fit(this_X, y)
    plt.plot(test, regr.predict(test))
    plt.scatter(this_X, y, s=3)

(图:岭回归(alpha=0.1)在同样数据上的拟合直线,波动明显更小)

这是 bias/variance 折衷的一个例子:alpha 越大,bias 越大而 variance 越小.我们可以选择 alpha 来极小化留出(left-out)数据上的误差,这次使用 diabetes 数据集而不是上面的合成数据:

alphas = np.logspace(-4, -1, 6)
print([regr.set_params(alpha=alpha)
           .fit(diabetes_X_train, diabetes_y_train)
           .score(diabetes_X_test, diabetes_y_test) for alpha in alphas])
[0.58511106838835336, 0.58520730154446765, 0.5854677540698493, 0.58555120365039159, 0.58307170855541623, 0.57058999437280111]
alphas
array([ 0.0001    ,  0.00039811,  0.00158489,  0.00630957,  0.02511886,
        0.1       ])

稀疏性

只拟合特征1和特征2

我们可以看到,尽管特征2在整个模型占有一个很大的系数,但是当考虑特征1时,其对 y 的影响就较小了。

为了改善问题的条件(例如缓解维度灾难),只保留能提供信息的特征、把无信息特征(例如特征2)的系数设为0会很有意义.岭回归会减小这些特征的贡献,但不会把系数设为0;另一种惩罚方法 Lasso 则可以把一些系数恰好设为0.这类方法称为稀疏方法,稀疏性可以看作 Occam 剃刀原则的一种应用:优先选择更简单的模型.

regr = linear_model.Lasso()
scores = [regr.set_params(alpha=alpha)
              .fit(diabetes_X_train, diabetes_y_train)
              .score(diabetes_X_test, diabetes_y_test) for alpha in alphas]
best_alpha = alphas[scores.index(max(scores))]
regr.alpha = best_alpha
regr.fit(diabetes_X_train, diabetes_y_train)
Lasso(alpha=0.025118864315095794, copy_X=True, fit_intercept=True,
   max_iter=1000, normalize=False, positive=False, precompute=False,
   random_state=None, selection='cyclic', tol=0.0001, warm_start=False)
print(regr.coef_)
[   0.         -212.43764548  517.19478111  313.77959962 -160.8303982    -0.
 -187.19554705   69.38229038  508.66011217   71.84239008]

Notes: scikit-learn 也提供使用 LARS 算法的 LassoLars,对权重向量非常稀疏的问题(例如观测很少的问题)很有效.
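
一个最小的使用示意(alpha 的取值仅作演示,并非教程原文的代码):

```python
regr_lars = linear_model.LassoLars(alpha=0.01)
regr_lars.fit(diabetes_X_train, diabetes_y_train)
print(regr_lars.coef_)  # 权重向量同样是稀疏的
```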

分类

在 iris 分类任务中,线性回归不是正确的方法,因为它会给远离决策边界的数据过大的权重.一种线性方法是拟合 sigmoid 函数(即 logistic 函数):

y = sigmoid(Xβ - offset) + ε = 1 / (1 + exp(-Xβ + offset)) + ε

logistic = linear_model.LogisticRegression(C=1e5)
logistic.fit(iris_X_train, iris_y_train)
LogisticRegression(C=100000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

多分类
如果你要预测的类别不止两个,一种常用做法是拟合一组"一对多"(one-versus-all)分类器,最后根据各分类器的投票做决定,如下面的示意。
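
一个示意(OneVsRestClassifier 是 scikit-learn 自带的一对多封装,这段代码并非教程原文,仅作演示):

```python
from sklearn import linear_model
from sklearn.multiclass import OneVsRestClassifier

# 为每个类别各拟合一个"该类 vs 其余类"的 logistic 回归,预测时取得分最高的类
ovr = OneVsRestClassifier(linear_model.LogisticRegression())
ovr.fit(iris_X_train, iris_y_train)
print(ovr.predict(iris_X_test))
```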

logistic回归的收缩和稀疏性
C 参数控制正则化的强度,C 越大正则化越弱.penalty='l2' 给出收缩(即非稀疏的系数),penalty='l1' 给出稀疏性.
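
一个示意(C、penalty 的取值仅作演示;solver='liblinear' 是支持 L1 惩罚的求解器):

```python
# L1 惩罚会把部分系数压到恰好为 0;C 越小,正则化越强
logistic_l1 = linear_model.LogisticRegression(C=0.1, penalty='l1',
                                              solver='liblinear')
logistic_l1.fit(iris_X_train, iris_y_train)
print('恰好为 0 的系数个数:', (logistic_l1.coef_ == 0).sum())
```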

练习
尝试用最近邻与线性模型分类手写数字集,用最后10%测试

from sklearn import datasets, neighbors, linear_model

digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
X_digits.shape
(1797, 64)
split = int(len(X_digits) * 0.9)
X_digits_train = X_digits[:split]
y_digits_train = y_digits[:split]
X_digits_test = X_digits[split:]
y_digits_test = y_digits[split:]
knn = neighbors.KNeighborsClassifier()
knn.fit(X_digits_train, y_digits_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
print('knn score: %f' % knn.score(X_digits_test, y_digits_test))
knn score: 0.961111
logistic = linear_model.LogisticRegression()
logistic.fit(X_digits_train, y_digits_train)
print('logistic score: %f' % logistic.score(X_digits_test, y_digits_test))
logistic score: 0.938889

SVM

线性 SVM

SVM 属于判别模型:它试图找到样本的一个组合,构建一个使类别之间 margin 最大化的超平面.正则化强度由参数 C 控制:C 较小时,计算 margin 会用到分割线附近较多(甚至全部)的观测样本,正则化更强;C 较大时,margin 主要由靠近分割线的观测样本决定,正则化更弱。

SVM 可用于回归 SVR,分类 SVC
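
下面是一个观察 C 与正则化关系的小示意(并非教程原文的代码,C 的取值仅作演示):

```python
from sklearn import svm

# C 越小正则化越强,参与决定 margin 的支持向量通常越多
for C in (0.01, 1, 100):
    svc_c = svm.SVC(kernel='linear', C=C).fit(iris_X_train, iris_y_train)
    print('C = %g, 每个类的支持向量个数: %s' % (C, svc_c.n_support_))
```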

from sklearn import svm
svc = svm.SVC(kernel='linear')
svc.fit(iris_X_train, iris_y_train)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

使用核

对于线性不可分的类,可以使用核技巧(kernel trick):把线性核换成其他核,例如多项式核.

RBF核
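
换核只需要修改 kernel 参数,下面是一个简单示意(degree、gamma 的取值仅作演示):

```python
from sklearn import svm

svc_poly = svm.SVC(kernel='poly', degree=3)   # 多项式核
svc_rbf = svm.SVC(kernel='rbf', gamma=10)     # RBF(径向基函数)核
svc_poly.fit(iris_X_train, iris_y_train)
svc_rbf.fit(iris_X_train, iris_y_train)
```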

交互式的例子

练习

只使用前两个特征,尝试用 SVM 把第1类和第2类鸢尾花从数据集中区分出来.每个类各留出10%作为测试集,在这些留出的样本上评估预测效果。

Warning: 类别是按顺序排列的,不能直接取最后10%作测试集,否则测试集只包含一个类别.
hint: 为了得到直观的展示,你可以在网格上使用 decision_function 方法.

iris = datasets.load_iris()
X = iris.data
y = iris.target
X = X[y != 0, :2]  # 去掉第0类(只区分第1、2类),并只保留前两个特征
y = y[y != 0]
import numpy as np
np.random.seed(0)
indices = np.random.permutation(len(X))
split = int(len(X) * 0.9)
X_train = X[indices[:split]]
y_train = y[indices[:split]]
X_test = X[indices[split:]]
y_test = y[indices[split:]]
for fig_num, kernel in enumerate(('linear', 'rbf', 'poly')):
    clf = svm.SVC(kernel=kernel, gamma=10)
    clf.fit(X_train, y_train)

    plt.figure(fig_num)
    plt.clf()
    # 散点图: 颜色 c 按类别 y 上色,cmap 使用 Paired,zorder 控制绘图顺序
    plt.scatter(X[:, 0], X[:, 1], c=y, zorder=10, cmap=plt.cm.Paired,
                edgecolor='k', s=20)

    # Circle out the test data
    plt.scatter(X_test[:, 0], X_test[:, 1], s=80, facecolors='none',
                zorder=10, edgecolor='k')
    plt.axis('tight')
    x_min = X[:, 0].min()
    x_max = X[:, 0].max()
    y_min = X[:, 1].min()
    y_max = X[:, 1].max()

    XX, YY = np.mgrid[x_min:x_max:200j, y_min:y_max:200j]
    # ravel 把网格拉直成一维,在每个网格点上计算到分割面的带符号距离
    Z = clf.decision_function(np.c_[XX.ravel(), YY.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(XX.shape)
    plt.pcolormesh(XX, YY, Z > 0, cmap=plt.cm.Paired)
    plt.contour(XX, YY, Z, colors=['k', 'k', 'k'],
                linestyles=['--', '-', '--'], levels=[-0.5, 0, 0.5])
    plt.title(kernel)
plt.show()

(图:linear、rbf、poly 三种核的决策边界,圈出的点为测试数据)

XX, YY = np.mgrid[x_min:x_max:200j, y_min:y_max:200j]
len(XX), len(YY)
(200, 200)

模型选择: 选择 estimators 与参数

分数与交叉验证分数

每一个 estimator 都暴露一个 score 方法,用来在新数据上评价拟合(或预测)的质量,分数越大越好.

from sklearn import datasets, svm
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
svc = svm.SVC(C=1, kernel='linear')
svc.fit(X_digits[:-100], y_digits[:-100]).score(X_digits[-100:], y_digits[-100:])
0.97999999999999998

为了更好地衡量预测精度(可以把它作为模型拟合优度的一个代理),我们可以把数据依次划分成若干折(fold),做 KFold 交叉验证:

import numpy as np
X_folds = np.array_split(X_digits, 3)
y_folds = np.array_split(y_digits, 3)
scores = list()
for k in range(3):
    # We use 'list' to copy, in order to 'pop' later on
    X_train = list(X_folds)
    X_test = X_train.pop(k)
    X_train = np.concatenate(X_train)  # 把剩下的折合并成训练集
    y_train = list(y_folds)
    y_test = y_train.pop(k)
    y_train = np.concatenate(y_train)
    scores.append(svc.fit(X_train, y_train).score(X_test, y_test))
print(scores)
[0.93489148580968284, 0.95659432387312182, 0.93989983305509184]

交叉验证生成器

scikit-learn 提供了一系列类,用于生成交叉验证所需的训练/测试下标列表.
这些类都提供 split 方法:接受要划分的输入数据集,按照所选的交叉验证策略,在每次迭代中产生一组训练/测试集下标.

from sklearn.model_selection import KFold, cross_val_score
X = ['a', 'a', 'b', 'c', 'c', 'c']
k_fold = KFold(n_splits=3)
for train_indices, test_indices in k_fold.split(X):
    print('Train: %s | test: %s' % (train_indices, test_indices))
Train: [2 3 4 5] | test: [0 1]
Train: [0 1 4 5] | test: [2 3]
Train: [0 1 2 3] | test: [4 5]
[svc.fit(X_digits[train], y_digits[train]).score(X_digits[test], y_digits[test])
for train, test in k_fold.split(X_digits)]
[0.93489148580968284, 0.95659432387312182, 0.93989983305509184]

交叉验证分数可以直接用辅助函数 cross_val_score 计算:给定 estimator、交叉验证对象和输入数据集,它会反复把数据划分为训练集和测试集,在训练集上训练、在测试集上打分,并返回每次划分的分数.
默认情况下,单次划分的分数由 estimator 自带的 score 方法计算.

cross_val_score(svc, X_digits, y_digits, cv=k_fold, n_jobs=-1)
array([ 0.93489149,  0.95659432,  0.93989983])

n_jobs=-1 表示计算会被分派到本机所有 CPU 上并行执行.
也可以通过 scoring 参数指定其他的评分方法:

cross_val_score(svc, X_digits, y_digits, cv=k_fold, 
scoring='precision_macro')
array([ 0.93969761,  0.95911415,  0.94041254])

交叉验证生成器

- KFold(n_splits, shuffle, random_state): 把数据分成 K 折,在 K-1 折上训练,在留出的那一折上测试
- StratifiedKFold(n_splits, shuffle, random_state): 与 KFold 相同,但保留每个折内的类别分布
- GroupKFold(n_splits): 确保同一组(group)不会同时出现在训练集和测试集中
- ShuffleSplit(n_splits, test_size, train_size, random_state): 基于随机排列生成训练/测试下标
- StratifiedShuffleSplit: 与 ShuffleSplit 类似,但保留每次划分内的类别分布
- GroupShuffleSplit: 确保同一组不会同时出现在训练集和测试集中
- LeaveOneGroupOut(): 根据给定的分组数组,每次留出一个组作为测试集
- LeavePGroupsOut(n_groups): 每次留出 P 个组
- LeaveOneOut(): 每次留出一个观测
- LeavePOut(p): 每次留出 P 个观测
- PredefinedSplit: 按预先定义好的划分生成训练/测试下标
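
以 StratifiedKFold 为例的一个小示意(变量名 X_demo、y_demo 仅作演示),可以看到每折中两个类别的比例保持不变:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X_demo = np.ones((8, 1))
y_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1])
for train, test in StratifiedKFold(n_splits=2).split(X_demo, y_demo):
    print('Train: %s | test: %s' % (train, test))
```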

练习

在 digits 数据集上,绘制线性核 SVC 的交叉验证分数随参数 C 变化的曲线(C 在对数网格上取值).

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn import datasets, svm

digits = datasets.load_digits()
X = digits.data
y = digits.target

svc = svm.SVC(kernel='linear')
C_s = np.logspace(-10, 0, 10)
scores_mean = []
scores_std = []
for C in C_s:
    svc.C = C
    score = cross_val_score(svc, X, y, n_jobs=1)
    scores_mean.append(np.mean(score))
    scores_std.append(np.std(score))
# Do the plotting
import matplotlib.pyplot as plt
plt.figure(1, figsize=(4, 3))
plt.clf()
plt.semilogx(C_s, scores_mean)
plt.semilogx(C_s, np.array(scores_mean) + np.array(scores_std), 'b--')
plt.semilogx(C_s, np.array(scores_mean) - np.array(scores_std), 'b--')
plt.ylabel('CV score')
plt.xlabel('Parameter C')
plt.ylim(0, 1.1)
plt.show()

(图:交叉验证分数均值及其 ± 标准差随参数 C 的变化)

网格搜索与交叉验证 estimator

网格搜索

scikit-learn 提供了这样一个对象:给定数据,它会在一个参数网格上逐一拟合 estimator 并计算交叉验证分数,最后选出使分数最大的参数组合.

from sklearn.model_selection import GridSearchCV, cross_val_score
Cs = np.logspace(-6, -1, 10)
clf = GridSearchCV(estimator=svc, param_grid=dict(C=Cs), n_jobs=-1)
clf.fit(X_digits[:1000], y_digits[:1000])
GridSearchCV(cv=None, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'C': array([  1.00000e-06,   3.59381e-06,   1.29155e-05,   4.64159e-05,
         1.66810e-04,   5.99484e-04,   2.15443e-03,   7.74264e-03,
         2.78256e-02,   1.00000e-01])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
clf.best_score_
0.92500000000000004
clf.best_estimator_.C
0.0077426368268112772
clf.score(X_digits[1000:], y_digits[1000:])
0.94353826850690092

默认情况下,GridSearchCV 使用3折交叉验证;但如果它检测到传入的是分类器而不是回归器,就会改用分层(stratified)3折.
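
如果想自己控制折数或交叉验证策略,也可以显式传入 cv 参数,例如(仅作示意,沿用上文的 svc 与 Cs):

```python
from sklearn.model_selection import GridSearchCV, KFold

clf = GridSearchCV(estimator=svc, param_grid=dict(C=Cs),
                   cv=KFold(n_splits=5), n_jobs=-1)
clf.fit(X_digits[:1000], y_digits[:1000])
```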
交叉验证 estimators

逐个算法地(algorithm-by-algorithm)实现"用交叉验证设置参数"可以做得更高效.因此,对某些 estimator,scikit-learn 提供了带交叉验证的版本(名字以 CV 结尾),能自动用交叉验证设置自己的参数:

from sklearn import linear_model, datasets
lasso = linear_model.LassoCV()
diabetes = datasets.load_diabetes()
X_diabetes = diabetes.data
y_diabetes = diabetes.target
lasso.fit(X_diabetes, y_diabetes)
LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True,
    max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False,
    precompute='auto', random_state=None, selection='cyclic', tol=0.0001,
    verbose=False)
lasso.alpha_
0.012291895087486161

练习

在 diabetes 数据集中,找到最优的正则化参数 alpha
另外: 你有多相信 alpha 的选择?

from sklearn import datasets
from sklearn.linear_model import LassoCV
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV

diabetes = datasets.load_diabetes()

X = diabetes.data[:150]
y = diabetes.target[:150]

lasso = Lasso(random_state=0)
alphas = np.logspace(-4, -0.5, 30)

tuned_parameters = [{'alpha': alphas}]
n_folds = 3

clf = GridSearchCV(lasso, tuned_parameters, cv=n_folds, refit=False)
clf.fit(X, y)
scores = clf.cv_results_['mean_test_score']
scores_std = clf.cv_results_['std_test_score']
plt.figure().set_size_inches(8, 6)
plt.semilogx(alphas, scores)

#plot error lines showing +/- std. error of the scores
std_error = scores_std / np.sqrt(n_folds)

plt.semilogx(alphas, scores + std_error, 'b--')
plt.semilogx(alphas, scores - std_error, 'b--')
# alpha=0.2 controls the translucency of the fill color
plt.fill_between(alphas, scores + std_error, scores - std_error, alpha=0.2)
plt.ylabel('CV score +/- std error')
plt.xlabel('alpha')
plt.axhline(np.max(scores), linestyle='--', color='.5')
plt.xlim([alphas[0], alphas[-1]])
(0.0001, 0.31622776601683794)

(图:Lasso 交叉验证分数及其 ± 标准误随 alpha 的变化)

# #############################################################################
# Bonus: how much can you trust the selection of alpha?
# To answer this question we use the LassoCV object that sets its alpha
# parameter automatically from the data by internal cross-validation (i.e. it
# performs cross-validation on the training data it receives).
# We use external cross-validation to see how much the automatically obtained
# alphas differ across different cross-validation folds.

lasso_cv = LassoCV(alphas=alphas, random_state=0)
k_fold = KFold(3)
print("Answer to the bonus question:",
"how much can you trust the selection of alpha?")
print()
print("Alpha parameters maximising the generalization score on different")
print("subsets of the data:")
for k, (train, test) in enumerate(k_fold.split(X)):
lasso_cv.fit(X[train], y[train])
print("[fold {0}] alpha: {1:.5f}, score: {2:.5f}".
format(k, lasso_cv.alpha_, lasso_cv.score(X[test], y[test])))
print()
print("Answer: Not very much since we obtained different alphas for different")
print("subsets of the data and moreover, the scores for these alphas differ")
print("quite substantially.")
Answer to the bonus question: how much can you trust the selection of alpha?

Alpha parameters maximising the generalization score on different
subsets of the data:
[fold 0] alpha: 0.10405, score: 0.53573
[fold 1] alpha: 0.05968, score: 0.16278
[fold 2] alpha: 0.10405, score: 0.44437

Answer: Not very much since we obtained different alphas for different
subsets of the data and moreover, the scores for these alphas differ
quite substantially.