scikit-learn 学习(二) 科学数据处理中的统计学习教程(上)

本教程介绍如何在科学数据处理中使用统计学习(statistical learning)技术,重点是 scikit-learn 中的相关用法.

1. 统计学习: 问题设置与 estimator 对象

Datasets

from sklearn import datasets
iris = datasets.load_iris()
data = iris.data
data.shape
(150, 4)
digits = datasets.load_digits()
digits.images.shape
(1797, 8, 8)
import matplotlib.pyplot as plt
plt.imshow(digits.images[-1], cmap=plt.cm.gray_r)
<matplotlib.image.AxesImage at 0x7fd41bfb1a20>

为了在 scikit 中使用这个数据集,把每张 8*8 的图像转换为长度为64的向量

data = digits.images.reshape((digits.images.shape[0], -1))
data[0]
array([  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.,   0.,   0.,  13.,
        15.,  10.,  15.,   5.,   0.,   0.,   3.,  15.,   2.,   0.,  11.,
         8.,   0.,   0.,   4.,  12.,   0.,   0.,   8.,   8.,   0.,   0.,
         5.,   8.,   0.,   0.,   9.,   8.,   0.,   0.,   4.,  11.,   0.,
         1.,  12.,   7.,   0.,   0.,   2.,  14.,   5.,  10.,  12.,   0.,
         0.,   0.,   0.,   6.,  13.,  10.,   0.,   0.,   0.])

estimators 对象

estimator 是任何能从数据中学习的对象:它可以是分类、回归或聚类算法,也可以是从原始数据中提取(extract)/过滤(filter)有用特征的 transformer.

所有 estimator 都有 fit 方法,接受一个数据集(通常是二维数组):

estimator.fit(data)

Estimator parameters: estimator 的所有参数都可以在初始化时设置,也可以事后通过修改相应属性(attribute)来改变:

estimator = Estimator(param1=1, param2=2)
estimator.param1

Estimated parameters: 用 estimator 拟合数据之后,参数就从数据中估计出来;所有估计出的参数都是 estimator 的属性,并以下划线结尾:

estimator.estimated_param_
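
下面是把这三条约定串起来的一个最小示意(以 KMeans 聚类为例,并非教程原文的代码,参数取值仅作演示):

```python
from sklearn import datasets
from sklearn.cluster import KMeans

iris = datasets.load_iris()
estimator = KMeans(n_clusters=3)   # 初始化时设置参数
estimator.n_init = 10              # 也可以事后直接修改对应属性
estimator.fit(iris.data)           # 所有 estimator 都有 fit 方法
print(estimator.cluster_centers_)  # 估计出的参数以下划线结尾
```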

2. 监督学习: 从高维观测预测输出变量

最近邻和维度灾难

iris 数据集是一个分类任务:根据花瓣和花萼的长度与宽度,识别出3种不同的鸢尾花(iris)类别.

import numpy as np
from sklearn import datasets
iris = datasets.load_iris()
iris_X = iris.data
iris_y = iris.target
np.unique(iris_y)
array([0, 1, 2])

KNN分类

# split iris data in train and test data
# A random permutation, to split the data randomly
np.random.seed(0)
indices = np.random.permutation(len(iris_X))
iris_X_train = iris_X[indices[:-10]]
iris_y_train = iris_y[indices[:-10]]
iris_X_test = iris_X[indices[-10:]]
iris_y_test = iris_y[indices[-10:]]
# Create and fit a nearest-neighbor classifier
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(iris_X_train, iris_y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
knn.predict(iris_X_test)
array([1, 2, 1, 0, 0, 0, 2, 1, 2, 0])
iris_y_test
array([1, 1, 1, 0, 0, 0, 2, 1, 2, 0])

维度灾难

为了使估计有效,你需要相邻点之间的距离小于某个值 d(这个值取决于具体问题).在1维情形下,平均需要 n ~ 1/d 个点;如果特征数为 p,则需要 n ~ 1/d^p 个点.例如1维中需要10个点,p 维中就需要 10^p 个点:所需样本量随维度指数增长,这就是维度灾难.
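
按这个估算可以写一个小示意(这里 d=0.1 只是演示用的假设值):

```python
# 所需样本数 n ~ 1/d^p 随维度 p 指数增长
d = 0.1
for p in (1, 2, 5, 10):
    print("p = %2d, 大约需要 n ~ %g 个点" % (p, (1 / d) ** p))
```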

线性模型:从回归到稀疏

Diabetes 数据集

这个数据集包含442个病人的10个生理变量,以及一年后疾病进展的指标;任务是根据这些生理变量预测疾病进展.

diabetes = datasets.load_diabetes()
diabetes_X_train = diabetes.data[:-20]
diabetes_X_test = diabetes.data[-20:]
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

线性回归

from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit(diabetes_X_train, diabetes_y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
print(regr.coef_)
[  3.03499549e-01  -2.37639315e+02   5.10530605e+02   3.27736980e+02
  -8.14131709e+02   4.92814588e+02   1.02848452e+02   1.84606489e+02
   7.43519617e+02   7.60951722e+01]
# The mean square error
np.mean((regr.predict(diabetes_X_test) - diabetes_y_test)**2)
2004.5676026898225
# Explained variance score: 1 is perfect prediction
# and 0 means that there is no linear relationship
# between X and y.
regr.score(diabetes_X_test, diabetes_y_test)
0.58507530226905713

收缩(shrinkage)

通过在损失函数(即优化目标)中加入惩罚项,训练求解参数时会同时考虑系数的大小:惩罚系数越大,系数被压得越小,从而抑制影响较小的特征.常用的收缩方法有岭回归(L2 正则化,系数向 0 收缩但一般不会恰好为 0)和 lasso(L1 正则化,可以把部分系数压到恰好为 0,只保留重要特征)。
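
一个最小的对比示意(沿用上文 Diabetes 数据集的划分 diabetes_X_train / diabetes_y_train,alpha 取 0.1 仅作演示):

```python
from sklearn import linear_model

ridge = linear_model.Ridge(alpha=0.1).fit(diabetes_X_train, diabetes_y_train)
lasso = linear_model.Lasso(alpha=0.1).fit(diabetes_X_train, diabetes_y_train)
print('Ridge 中恰好为 0 的系数个数:', (ridge.coef_ == 0).sum())  # L2: 通常没有
print('Lasso 中恰好为 0 的系数个数:', (lasso.coef_ == 0).sum())  # L1: 会有若干个
```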

如果每个维度的数据点很少,观察噪声就会导致很大的方差

X = np.c_[ .5, 1].T # 按列组合
print(np.c_[ .5, 1])
print(X)
[[ 0.5  1. ]]
[[ 0.5]
 [ 1. ]]
%matplotlib inline
y = [0.5, 1]
test = np.c_[0, 2].T
regr = linear_model.LinearRegression()

import matplotlib.pyplot as plt
plt.figure()

np.random.seed(0)
for _ in range(6):
    this_X = 0.1 * np.random.normal(size=(2, 1)) + X
    regr.fit(this_X, y)
    plt.plot(test, regr.predict(test))
    plt.scatter(this_X, y, s=3)

(图:普通线性回归在6组加噪声数据上的拟合直线,斜率波动很大)

regr = linear_model.Ridge(alpha=0.1)
plt.figure()
np.random.seed(0)
for _ in range(6):
    this_X = 0.1 * np.random.normal(size=(2, 1)) + X
    regr.fit(this_X, y)
    plt.plot(test, regr.predict(test))
    plt.scatter(this_X, y, s=3)

(图:岭回归(alpha=0.1)在同样数据上的拟合直线,波动明显更小)

这是 bias/variance 折衷的一个例子:alpha 越大,bias 越大而 variance 越小.我们可以选择 alpha 来极小化留出(left-out)数据上的误差,这次使用 diabetes 数据集而不是上面的合成数据:

alphas = np.logspace(-4, -1, 6)
print([regr.set_params(alpha=alpha)
           .fit(diabetes_X_train, diabetes_y_train)
           .score(diabetes_X_test, diabetes_y_test) for alpha in alphas])
[0.58511106838835336, 0.58520730154446765, 0.5854677540698493, 0.58555120365039159, 0.58307170855541623, 0.57058999437280111]
alphas
array([ 0.0001    ,  0.00039811,  0.00158489,  0.00630957,  0.02511886,
        0.1       ])

稀疏性

只拟合特征1和特征2

我们可以看到,尽管特征2在整个模型占有一个很大的系数,但是当考虑特征1时,其对 y 的影响就较小了。

为了改善问题的条件(例如缓解维度灾难),只保留能提供信息的特征、把无信息特征(例如特征2)的系数设为0会很有意义.岭回归会减小这些特征的贡献,但不会把系数设为0;另一种惩罚方法 Lasso 则可以把一些系数恰好设为0.这类方法称为稀疏方法,稀疏性可以看作 Occam 剃刀原则的一种应用:优先选择更简单的模型.

regr = linear_model.Lasso()
scores = [regr.set_params(alpha=alpha)
              .fit(diabetes_X_train, diabetes_y_train)
              .score(diabetes_X_test, diabetes_y_test) for alpha in alphas]
best_alpha = alphas[scores.index(max(scores))]
regr.alpha = best_alpha
regr.fit(diabetes_X_train, diabetes_y_train)
Lasso(alpha=0.025118864315095794, copy_X=True, fit_intercept=True,
   max_iter=1000, normalize=False, positive=False, precompute=False,
   random_state=None, selection='cyclic', tol=0.0001, warm_start=False)
print(regr.coef_)
[   0.         -212.43764548  517.19478111  313.77959962 -160.8303982    -0.
 -187.19554705   69.38229038  508.66011217   71.84239008]

Notes: scikit-learn 也提供使用 LARS 算法的 LassoLars,对权重向量非常稀疏的问题(例如观测很少的问题)很有效.
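
一个最小的使用示意(alpha 的取值仅作演示,并非教程原文的代码):

```python
regr_lars = linear_model.LassoLars(alpha=0.01)
regr_lars.fit(diabetes_X_train, diabetes_y_train)
print(regr_lars.coef_)  # 权重向量同样是稀疏的
```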

分类

在 iris 分类任务中,线性回归不是正确的方法,因为它会给远离决策边界的数据过大的权重.一种线性方法是拟合 sigmoid 函数(即 logistic 函数):

y = sigmoid(Xβ - offset) + ε = 1 / (1 + exp(-Xβ + offset)) + ε

logistic = linear_model.LogisticRegression(C=1e5)
logistic.fit(iris_X_train, iris_y_train)
LogisticRegression(C=100000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

多分类
如果你要预测的类别不止两个,一种常用做法是拟合一组"一对多"(one-versus-all)分类器,最后根据各分类器的投票做决定,如下面的示意。
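
一个示意(OneVsRestClassifier 是 scikit-learn 自带的一对多封装,这段代码并非教程原文,仅作演示):

```python
from sklearn import linear_model
from sklearn.multiclass import OneVsRestClassifier

# 为每个类别各拟合一个"该类 vs 其余类"的 logistic 回归,预测时取得分最高的类
ovr = OneVsRestClassifier(linear_model.LogisticRegression())
ovr.fit(iris_X_train, iris_y_train)
print(ovr.predict(iris_X_test))
```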

logistic回归的收缩和稀疏性
C 参数控制正则化的强度,C 越大正则化越弱.penalty='l2' 给出收缩(即非稀疏的系数),penalty='l1' 给出稀疏性.
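
一个示意(C、penalty 的取值仅作演示;solver='liblinear' 是支持 L1 惩罚的求解器):

```python
# L1 惩罚会把部分系数压到恰好为 0;C 越小,正则化越强
logistic_l1 = linear_model.LogisticRegression(C=0.1, penalty='l1',
                                              solver='liblinear')
logistic_l1.fit(iris_X_train, iris_y_train)
print('恰好为 0 的系数个数:', (logistic_l1.coef_ == 0).sum())
```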

练习
尝试用最近邻与线性模型分类手写数字集,用最后10%测试

from sklearn import datasets, neighbors, linear_model

digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
X_digits.shape
(1797, 64)
split = int(len(X_digits) * 0.9)
X_digits_train = X_digits[:split]
y_digits_train = y_digits[:split]
X_digits_test = X_digits[split:]
y_digits_test = y_digits[split:]
knn = neighbors.KNeighborsClassifier()
knn.fit(X_digits_train, y_digits_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
print('knn score: %f' % knn.score(X_digits_test, y_digits_test))
knn score: 0.961111
logistic = linear_model.LogisticRegression()
logistic.fit(X_digits_train, y_digits_train)
print('logistic score: %f' % logistic.score(X_digits_test, y_digits_test))
logistic score: 0.938889

SVM

线性 SVM

SVM 属于判别模型:它试图找到样本的一个组合,构建一个使类别之间 margin 最大化的超平面.正则化强度由参数 C 控制:C 较小时,计算 margin 会用到分割线附近较多(甚至全部)的观测样本,正则化更强;C 较大时,margin 主要由靠近分割线的观测样本决定,正则化更弱。

SVM 可用于回归 SVR,分类 SVC
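
下面是一个观察 C 与正则化关系的小示意(并非教程原文的代码,C 的取值仅作演示):

```python
from sklearn import svm

# C 越小正则化越强,参与决定 margin 的支持向量通常越多
for C in (0.01, 1, 100):
    svc_c = svm.SVC(kernel='linear', C=C).fit(iris_X_train, iris_y_train)
    print('C = %g, 每个类的支持向量个数: %s' % (C, svc_c.n_support_))
```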

from sklearn import svm
svc = svm.SVC(kernel='linear')
svc.fit(iris_X_train, iris_y_train)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

使用核

对于线性不可分的类,可以使用核技巧(kernel trick):把线性核换成其他核,例如多项式核.

RBF核
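
换核只需要修改 kernel 参数,下面是一个简单示意(degree、gamma 的取值仅作演示):

```python
from sklearn import svm

svc_poly = svm.SVC(kernel='poly', degree=3)   # 多项式核
svc_rbf = svm.SVC(kernel='rbf', gamma=10)     # RBF(径向基函数)核
svc_poly.fit(iris_X_train, iris_y_train)
svc_rbf.fit(iris_X_train, iris_y_train)
```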

交互式的例子

练习

只使用前两个特征,尝试用 SVM 把第1类和第2类鸢尾花从数据集中区分出来.每个类各留出10%作为测试集,在这些留出的样本上评估预测效果。

Warning: 类别是按顺序排列的,不能直接取最后10%作测试集,否则测试集只包含一个类别.
hint: 为了得到直观的展示,你可以在网格上使用 decision_function 方法.

iris = datasets.load_iris()
X = iris.data
y = iris.target
X = X[y != 0, :2]  # 去掉第0类(只区分第1、2类),并只保留前两个特征
y = y[y != 0]
import numpy as np
np.random.seed(0)
indices = np.random.permutation(len(X))
split = int(len(X) * 0.9)
X_train = X[indices[:split]]
y_train = y[indices[:split]]
X_test = X[indices[split:]]
y_test = y[indices[split:]]
for fig_num, kernel in enumerate(('linear', 'rbf', 'poly')):
    clf = svm.SVC(kernel=kernel, gamma=10)
    clf.fit(X_train, y_train)

    plt.figure(fig_num)
    plt.clf()
    # 散点图: 颜色 c 按类别 y 上色,cmap 使用 Paired,zorder 控制绘图顺序
    plt.scatter(X[:, 0], X[:, 1], c=y, zorder=10, cmap=plt.cm.Paired,
                edgecolor='k', s=20)

    # Circle out the test data
    plt.scatter(X_test[:, 0], X_test[:, 1], s=80, facecolors='none',
                zorder=10, edgecolor='k')
    plt.axis('tight')
    x_min = X[:, 0].min()
    x_max = X[:, 0].max()
    y_min = X[:, 1].min()
    y_max = X[:, 1].max()

    XX, YY = np.mgrid[x_min:x_max:200j, y_min:y_max:200j]
    # ravel 把网格拉直成一维,在每个网格点上计算到分割面的带符号距离
    Z = clf.decision_function(np.c_[XX.ravel(), YY.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(XX.shape)
    plt.pcolormesh(XX, YY, Z > 0, cmap=plt.cm.Paired)
    plt.contour(XX, YY, Z, colors=['k', 'k', 'k'],
                linestyles=['--', '-', '--'], levels=[-0.5, 0, 0.5])
    plt.title(kernel)
plt.show()

(图:linear、rbf、poly 三种核的决策边界,圈出的点为测试数据)

XX, YY = np.mgrid[x_min:x_max:200j, y_min:y_max:200j]
len(XX), len(YY)
(200, 200)

模型选择: 选择 estimators 与参数

分数与交叉验证分数

每一个 estimator 都暴露一个 score 方法,用来在新数据上评价拟合(或预测)的质量,分数越大越好.

from sklearn import datasets, svm
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
svc = svm.SVC(C=1, kernel='linear')
svc.fit(X_digits[:-100], y_digits[:-100]).score(X_digits[-100:], y_digits[-100:])
0.97999999999999998

为了更好地衡量预测精度(可以把它作为模型拟合优度的一个代理),我们可以把数据依次划分成若干折(fold),做 KFold 交叉验证:

import numpy as np
X_folds = np.array_split(X_digits, 3)
y_folds = np.array_split(y_digits, 3)
scores = list()
for k in range(3):
    # We use 'list' to copy, in order to 'pop' later on
    X_train = list(X_folds)
    X_test = X_train.pop(k)
    X_train = np.concatenate(X_train)  # 把剩下的折合并成训练集
    y_train = list(y_folds)
    y_test = y_train.pop(k)
    y_train = np.concatenate(y_train)
    scores.append(svc.fit(X_train, y_train).score(X_test, y_test))
print(scores)
[0.93489148580968284, 0.95659432387312182, 0.93989983305509184]

交叉验证生成器

scikit-learn 提供了一系列类,用于生成交叉验证所需的训练/测试下标列表.
这些类都提供 split 方法:接受要划分的输入数据集,按照所选的交叉验证策略,在每次迭代中产生一组训练/测试集下标.

from sklearn.model_selection import KFold, cross_val_score
X = ['a', 'a', 'b', 'c', 'c', 'c']
k_fold = KFold(n_splits=3)
for train_indices, test_indices in k_fold.split(X):
    print('Train: %s | test: %s' % (train_indices, test_indices))
Train: [2 3 4 5] | test: [0 1]
Train: [0 1 4 5] | test: [2 3]
Train: [0 1 2 3] | test: [4 5]
[svc.fit(X_digits[train], y_digits[train]).score(X_digits[test], y_digits[test])
for train, test in k_fold.split(X_digits)]
[0.93489148580968284, 0.95659432387312182, 0.93989983305509184]

交叉验证分数可以直接用辅助函数 cross_val_score 计算:给定 estimator、交叉验证对象和输入数据集,它会反复把数据划分为训练集和测试集,在训练集上训练、在测试集上打分,并返回每次划分的分数.
默认情况下,单次划分的分数由 estimator 自带的 score 方法计算.

cross_val_score(svc, X_digits, y_digits, cv=k_fold, n_jobs=-1)
array([ 0.93489149,  0.95659432,  0.93989983])

n_jobs=-1 表示计算会被分派到本机所有 CPU 上并行执行.
也可以通过 scoring 参数指定其他的评分方法:

cross_val_score(svc, X_digits, y_digits, cv=k_fold, 
scoring='precision_macro')
array([ 0.93969761,  0.95911415,  0.94041254])

交叉验证生成器

- KFold(n_splits, shuffle, random_state): 把数据分成 K 折,在 K-1 折上训练,在留出的那一折上测试
- StratifiedKFold(n_splits, shuffle, random_state): 与 KFold 相同,但保留每个折内的类别分布
- GroupKFold(n_splits): 确保同一组(group)不会同时出现在训练集和测试集中
- ShuffleSplit(n_splits, test_size, train_size, random_state): 基于随机排列生成训练/测试下标
- StratifiedShuffleSplit: 与 ShuffleSplit 类似,但保留每次划分内的类别分布
- GroupShuffleSplit: 确保同一组不会同时出现在训练集和测试集中
- LeaveOneGroupOut(): 根据给定的分组数组,每次留出一个组作为测试集
- LeavePGroupsOut(n_groups): 每次留出 P 个组
- LeaveOneOut(): 每次留出一个观测
- LeavePOut(p): 每次留出 P 个观测
- PredefinedSplit: 按预先定义好的划分生成训练/测试下标
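
以 StratifiedKFold 为例的一个小示意(变量名 X_demo、y_demo 仅作演示),可以看到每折中两个类别的比例保持不变:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X_demo = np.ones((8, 1))
y_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1])
for train, test in StratifiedKFold(n_splits=2).split(X_demo, y_demo):
    print('Train: %s | test: %s' % (train, test))
```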

练习

在 digits 数据集上,绘制线性核 SVC 的交叉验证分数随参数 C 变化的曲线(C 在对数网格上取值).

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn import datasets, svm

digits = datasets.load_digits()
X = digits.data
y = digits.target

svc = svm.SVC(kernel='linear')
C_s = np.logspace(-10, 0, 10)
scores_mean = []
scores_std = []
for C in C_s:
    svc.C = C
    score = cross_val_score(svc, X, y, n_jobs=1)
    scores_mean.append(np.mean(score))
    scores_std.append(np.std(score))
# Do the plotting
import matplotlib.pyplot as plt
plt.figure(1, figsize=(4, 3))
plt.clf()
plt.semilogx(C_s, scores_mean)
plt.semilogx(C_s, np.array(scores_mean) + np.array(scores_std), 'b--')
plt.semilogx(C_s, np.array(scores_mean) - np.array(scores_std), 'b--')
plt.ylabel('CV score')
plt.xlabel('Parameter C')
plt.ylim(0, 1.1)
plt.show()

(图:交叉验证分数均值及其 ± 标准差随参数 C 的变化)

网格搜索与交叉验证 estimator

网格搜索

scikit-learn 提供了这样一个对象:给定数据,它会在一个参数网格上逐一拟合 estimator 并计算交叉验证分数,最后选出使分数最大的参数组合.

from sklearn.model_selection import GridSearchCV, cross_val_score
Cs = np.logspace(-6, -1, 10)
clf = GridSearchCV(estimator=svc, param_grid=dict(C=Cs), n_jobs=-1)
clf.fit(X_digits[:1000], y_digits[:1000])
GridSearchCV(cv=None, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'C': array([  1.00000e-06,   3.59381e-06,   1.29155e-05,   4.64159e-05,
         1.66810e-04,   5.99484e-04,   2.15443e-03,   7.74264e-03,
         2.78256e-02,   1.00000e-01])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
clf.best_score_
0.92500000000000004
clf.best_estimator_.C
0.0077426368268112772
clf.score(X_digits[1000:], y_digits[1000:])
0.94353826850690092

默认情况下,GridSearchCV 使用3折交叉验证;但如果它检测到传入的是分类器而不是回归器,就会改用分层(stratified)3折.
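
如果想自己控制折数或交叉验证策略,也可以显式传入 cv 参数,例如(仅作示意,沿用上文的 svc 与 Cs):

```python
from sklearn.model_selection import GridSearchCV, KFold

clf = GridSearchCV(estimator=svc, param_grid=dict(C=Cs),
                   cv=KFold(n_splits=5), n_jobs=-1)
clf.fit(X_digits[:1000], y_digits[:1000])
```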
交叉验证 estimators

逐个算法地(algorithm-by-algorithm)实现"用交叉验证设置参数"可以做得更高效.因此,对某些 estimator,scikit-learn 提供了带交叉验证的版本(名字以 CV 结尾),能自动用交叉验证设置自己的参数:

from sklearn import linear_model, datasets
lasso = linear_model.LassoCV()
diabetes = datasets.load_diabetes()
X_diabetes = diabetes.data
y_diabetes = diabetes.target
lasso.fit(X_diabetes, y_diabetes)
LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True,
    max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False,
    precompute='auto', random_state=None, selection='cyclic', tol=0.0001,
    verbose=False)
lasso.alpha_
0.012291895087486161

练习

在 diabetes 数据集中,找到最优的正则化参数 alpha
另外: 你有多相信 alpha 的选择?

from sklearn import datasets
from sklearn.linear_model import LassoCV
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV

diabetes = datasets.load_diabetes()

X = diabetes.data[:150]
y = diabetes.target[:150]

lasso = Lasso(random_state=0)
alphas = np.logspace(-4, -0.5, 30)

tuned_parameters = [{'alpha': alphas}]
n_folds = 3

clf = GridSearchCV(lasso, tuned_parameters, cv=n_folds, refit=False)
clf.fit(X, y)
scores = clf.cv_results_['mean_test_score']
scores_std = clf.cv_results_['std_test_score']
plt.figure().set_size_inches(8, 6)
plt.semilogx(alphas, scores)

#plot error lines showing +/- std. error of the scores
std_error = scores_std / np.sqrt(n_folds)

plt.semilogx(alphas, scores + std_error, 'b--')
plt.semilogx(alphas, scores - std_error, 'b--')
# alpha=0.2 controls the translucency of the fill color
plt.fill_between(alphas, scores + std_error, scores - std_error, alpha=0.2)
plt.ylabel('CV score +/- std error')
plt.xlabel('alpha')
plt.axhline(np.max(scores), linestyle='--', color='.5')
plt.xlim([alphas[0], alphas[-1]])
(0.0001, 0.31622776601683794)

(图:Lasso 交叉验证分数及其 ± 标准误随 alpha 的变化)

# #############################################################################
# Bonus: how much can you trust the selection of alpha?
# To answer this question we use the LassoCV object that sets its alpha
# parameter automatically from the data by internal cross-validation (i.e. it
# performs cross-validation on the training data it receives).
# We use external cross-validation to see how much the automatically obtained
# alphas differ across different cross-validation folds.

lasso_cv = LassoCV(alphas=alphas, random_state=0)
k_fold = KFold(3)
print("Answer to the bonus question:",
"how much can you trust the selection of alpha?")
print()
print("Alpha parameters maximising the generalization score on different")
print("subsets of the data:")
for k, (train, test) in enumerate(k_fold.split(X)):
lasso_cv.fit(X[train], y[train])
print("[fold {0}] alpha: {1:.5f}, score: {2:.5f}".
format(k, lasso_cv.alpha_, lasso_cv.score(X[test], y[test])))
print()
print("Answer: Not very much since we obtained different alphas for different")
print("subsets of the data and moreover, the scores for these alphas differ")
print("quite substantially.")
Answer to the bonus question: how much can you trust the selection of alpha?

Alpha parameters maximising the generalization score on different
subsets of the data:
[fold 0] alpha: 0.10405, score: 0.53573
[fold 1] alpha: 0.05968, score: 0.16278
[fold 2] alpha: 0.10405, score: 0.44437

Answer: Not very much since we obtained different alphas for different
subsets of the data and moreover, the scores for these alphas differ
quite substantially.