scikit-learn Notes (2): A Tutorial on Statistical Learning for Scientific Data Processing (Part 2)

Unsupervised learning: seeking representations of the data

1. Clustering

K-means

The simplest clustering algorithm

from sklearn import cluster, datasets
iris = datasets.load_iris()
X_iris = iris.data
y_iris = iris.target

k_means = cluster.KMeans(n_clusters=3)
k_means.fit(X_iris)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
print(k_means.labels_[::10]) # every 10th label
[1 1 1 1 1 0 0 0 0 0 2 2 2 2 2]
print(y_iris[::10])
[0 0 0 0 0 1 1 1 1 1 2 2 2 2 2]
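
Note that the cluster labels themselves are arbitrary (here 1/0/2 instead of 0/1/2), so a permutation-invariant score is the right way to compare them with y_iris. A minimal sketch, continuing from the k_means fitted above:

from sklearn.metrics import adjusted_rand_score
print(adjusted_rand_score(y_iris, k_means.labels_))  # roughly 0.7 on iris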

Warning: there is absolutely no guarantee of recovering the ground truth; choosing the right number of clusters is hard, and the algorithm is sensitive to initialization and can fall into local minima.
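
A minimal sketch of that sensitivity, reusing X_iris from above: with a single random initialization, different seeds may stop in different local minima, which shows up as different final values of inertia_.

for seed in (0, 1, 2):
    km = cluster.KMeans(n_clusters=3, n_init=1, init='random', random_state=seed)
    km.fit(X_iris)
    print(seed, km.inertia_)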

Application example: vector quantization

Clustering in general, and KMeans in particular, can be seen as a way of choosing a small number of exemplars to compress the information. This problem is sometimes known as vector quantization. For instance, it can be used to posterize an image:

%matplotlib inline
import matplotlib.pyplot as plt
import scipy as sp
try:
    face = sp.face(gray=True)
except AttributeError:
    from scipy import misc
    face = misc.face(gray=True)
X = face.reshape((-1, 1)) # We need an (n_sample, n_feature) array
k_means = cluster.KMeans(n_clusters=5, n_init=1)
k_means.fit(X)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=5, n_init=1, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
import numpy as np
values = k_means.cluster_centers_.squeeze() # drop the single-dimensional axes, i.e. remove the size-1 dimensions from the shape
labels = k_means.labels_
face_compressed = np.choose(labels, values) # replace each pixel by its cluster-centre value, which compresses the image
face_compressed.shape = face.shape
plt.imshow(face)
<matplotlib.image.AxesImage at 0x7f38fd5ac4e0>

png

values
array([ 148.25700663,   71.49771602,  191.63450508,  109.71975211,
         26.11714144])
labels
array([3, 0, 0, ..., 0, 0, 0], dtype=int32)
k_means.cluster_centers_
array([[ 148.25700663],
       [  71.49771602],
       [ 191.63450508],
       [ 109.71975211],
       [  26.11714144]])
k_means.cluster_centers_.squeeze()
array([ 148.25700663,   71.49771602,  191.63450508,  109.71975211,
         26.11714144])
plt.imshow(face_compressed)
<matplotlib.image.AxesImage at 0x7f38fee9cf98>

png
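
A quick sanity check, continuing from the arrays above: the posterized image contains only the five cluster-centre grey levels.

print(np.unique(face_compressed).size)  # 5 distinct grey levels after quantization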

Agglomerative hierarchical clustering: Ward

Hierarchical clustering is a cluster analysis method that aims to build a hierarchy of clusters. In general, most implementations of this technique follow one of the two approaches below:

  • Agglomerative, a bottom-up approach: each observation starts in its own cluster, and clusters are iteratively merged so as to minimize a linkage criterion. This approach is particularly interesting when the clusters of interest consist of only a few observations; when the number of clusters is large, it is also much more computationally efficient than k-means (a minimal sketch follows this list).

  • Divisive, a top-down approach: all observations start in one cluster, which is split iteratively as one moves down the hierarchy. For estimating a large number of clusters, this approach is both slow (since all observations start in one cluster that is split recursively) and statistically ill-posed.
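
A minimal sketch of the bottom-up (Ward) variant on the iris data loaded earlier; n_clusters=3 here is an assumption made for illustration:

from sklearn.cluster import AgglomerativeClustering
ward = AgglomerativeClustering(n_clusters=3, linkage='ward')
print(ward.fit_predict(X_iris)[::10])  # every 10th cluster label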

Connectivity-constrained clustering

With agglomerative clustering, it is possible to specify which samples may be merged into one cluster by providing a connectivity graph. In scikit-learn, such graphs are represented by their adjacency matrix, which is usually a sparse matrix. This can be useful, for instance, to retrieve connected regions (sometimes also referred to as connected components) when clustering an image:

import matplotlib.pyplot as plt

from sklearn.feature_extraction.image import grid_to_graph
from sklearn.cluster import AgglomerativeClustering

# Generate data
try:  # SciPy >= 0.16 has face in scipy.misc
    from scipy.misc import face
    face = face(gray=True)
except ImportError:
    face = sp.face(gray=True)
sp.misc.imresize(face, 0.10)
array([[119, 104,  97, ..., 116, 137, 115],
       [117, 123, 123, ..., 113, 108,  92],
       [146, 136, 141, ...,  88,  80,  84],
       ..., 
       [123, 140, 145, ..., 137, 137, 140],
       [128, 144, 141, ..., 140, 136, 136],
       [126, 144, 134, ..., 137, 139, 142]], dtype=uint8)
face.shape
(76, 102)
# Resize it to 10% of the original size to speed up the processing
face = sp.misc.imresize(face, 0.10) / 255
X = np.reshape(face, (-1, 1))

# Define the structure A of the data. Pixels connected to their neighbors.
connectivity = grid_to_graph(*face.shape)
connectivity
<7752x7752 sparse matrix of type '<class 'numpy.int64'>'
    with 38404 stored elements in COOrdinate format>
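
The connectivity matrix built above can then be passed to AgglomerativeClustering so that only neighbouring pixels are allowed to merge. A minimal sketch along the lines of the scikit-learn tutorial; n_clusters=15 is an arbitrary choice made here for illustration:

ward = AgglomerativeClustering(n_clusters=15, linkage='ward',
                               connectivity=connectivity)
ward.fit(X)
label = np.reshape(ward.labels_, face.shape)  # one cluster label per pixel
print(label.shape)  # (76, 102)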

Feature agglomeration

We have seen that sparsity can be used to mitigate the curse of dimensionality, i.e. the situation where the number of samples is too small compared to the number of features. Another approach is to merge similar features together: feature agglomeration. This can be achieved by clustering in the feature direction, in other words, by clustering the transposed data.

digits = datasets.load_digits()
images = digits.images
X = np.reshape(images, (len(images), -1))
connectivity = grid_to_graph(*images[0].shape)

agglo = cluster.FeatureAgglomeration(connectivity=connectivity,
                                     n_clusters=32)
agglo.fit(X)
FeatureAgglomeration(affinity='euclidean', compute_full_tree='auto',
           connectivity=<64x64 sparse matrix of type '<class 'numpy.int64'>'
    with 288 stored elements in COOrdinate format>,
           linkage='ward', memory=None, n_clusters=32,
           pooling_func=<function mean at 0x7f393104d6a8>)
X_reduced = agglo.transform(X)

X_approx = agglo.inverse_transform(X_reduced)
images_approx = np.reshape(X_approx, images.shape)
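
A quick shape check, continuing from the code above: the 64 pixel features of each digit are merged into 32 agglomerated features, and inverse_transform expands them back to 64 for visualization.

print(X.shape, X_reduced.shape, X_approx.shape)  # (1797, 64) (1797, 32) (1797, 64)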

2. Decompositions: from a signal to components and loadings

Components and loadings
If X is our multivariate data, the problem we are trying to solve is to rewrite it on a different observational basis: we want to learn loadings L and a set of components C such that X = L C. Different criteria exist for choosing the components C.

Principal component analysis: PCA

Principal component analysis (PCA) selects the successive components that explain the maximum variance in the signal.

The point cloud spanned by the observations is very flat in one direction: any one of the three univariate features can almost exactly be computed using the other two. PCA finds the directions in which the data is not flat, i.e. the directions that carry most of the information.

When used to transform the data, PCA can reduce its dimensionality by projecting onto a principal subspace.

# Create a signal with only 2 useful dimensions
x1 = np.random.normal(size=100)
x2 = np.random.normal(size=100)
x3 = x1 + x2
X = np.c_[x1, x2, x3]
from sklearn import decomposition
pca = decomposition.PCA()
pca.fit(X)
PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)
print(pca.explained_variance_) # variance explained by each principal component; larger values indicate more important components
[  2.96426463e+00   9.46875918e-01   7.59936955e-32]
# As we can see, only the 2 first components are useful
pca.n_components = 2 # keep only the first 2 components after dimensionality reduction
X_reduced = pca.fit_transform(X)
X_reduced.shape
(100, 2)
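
This connects back to the X = L C picture above: X_reduced plays the role of the loadings L and pca.components_ the role of the components C. A minimal sketch, continuing from the fitted pca:

X_back = np.dot(X_reduced, pca.components_) + pca.mean_  # loadings times components, plus the mean
print(np.allclose(X, X_back))  # True: two components capture essentially all of the variance here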

Independent component analysis: ICA

Independent component analysis (ICA) selects components so that the distribution of their loadings carries the maximum amount of independent information. It is able to recover non-Gaussian independent signals:

# Generate sample data
import numpy as np
from scipy import signal
time = np.linspace(0, 10, 2000)
s1 = np.sin(2 * time) # Signal 1 : sinusoidal signal
s2 = np.sign(np.sin(3 * time)) # Signal 2 : square signal
s3 = signal.sawtooth(2 * np.pi * time) # Signal 3: saw tooth signal
S = np.c_[s1, s2, s3]
S += 0.2 * np.random.normal(size=S.shape) # Add noise
S /= S.std(axis=0) # Standardize data
# Mix data
A = np.array([[1, 1, 1], [0.5, 2, 1], [1.5, 1, 2]]) # Mixing matrix
X = np.dot(S, A.T) # Generate observations

# Compute ICA
ica = decomposition.FastICA()
S_ = ica.fit_transform(X) # Get the estimated sources
A_ = ica.mixing_.T
np.allclose(X, np.dot(S_, A_) + ica.mean_)
True
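
ICA recovers the sources only up to permutation, sign and scale, so a simple way to compare S_ with the true S is through absolute correlations. A minimal sketch, continuing from the arrays above:

corr = np.abs(np.corrcoef(S.T, S_.T))[:3, 3:]  # |correlation| between true and estimated sources
print(corr.round(2))  # each row should have one entry close to 1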

Putting it all together

Pipelining

We have seen that some estimators can transform data and that some estimators can predict variables. We can also create combined estimators:

import numpy as np
import matplotlib.pyplot as plt

from sklearn import linear_model, decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
logistic = linear_model.LogisticRegression()

pca = decomposition.PCA()
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])

digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target

# Plot the PCA spectrum
pca.fit(X_digits)

plt.figure(1, figsize=(4, 3))
plt.clf()
# axes([x, y, xs, ys]):
# x and y give the position along the X and Y axes,
# xs and ys give how far the axes extend to the right and upward.
plt.axes([.2, .2, .7, .7])
plt.plot(pca.explained_variance_, linewidth=2)
plt.axis('tight')
plt.xlabel('n_components')
plt.ylabel('explained_variance_')

# Prediction
n_components = [20, 40, 64]
Cs = np.logspace(-4, 4, 3)

# Parameters of pipelines can be set using ‘__’ separated parameter names:
estimator = GridSearchCV(pipe, dict(pca__n_components=n_components, logistic__C=Cs))
estimator.fit(X_digits, y_digits)

plt.axvline(estimator.best_estimator_.named_steps['pca'].n_components,
            linestyle=':', label='n_components chosen')

plt.legend(prop=dict(size=12))
plt.show()

png

len(pca.explained_variance_)
64
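
Once the grid search has finished, the selected hyper-parameters and the corresponding cross-validated score can be inspected directly. A short sketch, continuing from the estimator fitted above:

print(estimator.best_params_)           # the chosen pca__n_components and logistic__C
print(round(estimator.best_score_, 3))  # mean cross-validated accuracy of that combination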

Face recognition with eigenfaces

The dataset used in this example is a preprocessed excerpt of LFW (Labeled Faces in the Wild).

There are four possible outcomes for each prediction on the test set:
TP: predicted positive, actually positive
FP: predicted positive, actually negative
FN: predicted negative, actually positive
TN: predicted negative, actually negative

Precision: TP / (TP + FP)
Recall: TP / (TP + FN)
F1-score: 2*TP / (2*TP + FP + FN)
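
A tiny worked example of these formulas; the label arrays below are made up purely for illustration:

from sklearn.metrics import precision_score, recall_score, f1_score
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]             # TP=2, FP=1, FN=1, TN=2
print(precision_score(y_true, y_pred))  # 2 / (2 + 1)
print(recall_score(y_true, y_pred))     # 2 / (2 + 1)
print(f1_score(y_true, y_pred))         # 2*2 / (2*2 + 1 + 1)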

"""
===================================================
Faces recognition example using eigenfaces and SVMs
===================================================

The dataset used in this example is a preprocessed excerpt of the
"Labeled Faces in the Wild", aka LFW_:

http://vis-www.cs.umass.edu/lfw/lfw-funneled.tgz (233MB)

.. _LFW: http://vis-www.cs.umass.edu/lfw/

Expected results for the top 5 most represented people in the dataset:

================== ============ ======= ========== =======
                   precision    recall  f1-score   support
================== ============ ======= ========== =======
     Ariel Sharon       0.67      0.92      0.77        13
     Colin Powell       0.75      0.78      0.76        60
  Donald Rumsfeld       0.78      0.67      0.72        27
    George W Bush       0.86      0.86      0.86       146
Gerhard Schroeder       0.76      0.76      0.76        25
      Hugo Chavez       0.67      0.67      0.67        15
       Tony Blair       0.81      0.69      0.75        36

      avg / total       0.80      0.80      0.80       322
================== ============ ======= ========== =======

"""

from time import time
import logging
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import fetch_lfw_people
from sklearn.metrics import classification_report  # prints the main classification metrics (precision, recall, F1-score) per class label
from sklearn.metrics import confusion_matrix
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Display progress logs on stdout
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')


# #############################################################################
# Download the data, if not already on disk and load it as numpy arrays

lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)

# introspect the images arrays to find the shapes (for plotting)
n_samples, h, w = lfw_people.images.shape

# for machine learning we use the data directly (as relative pixel
# positions info is ignored by this model)
X = lfw_people.data
n_features = X.shape[1]

# the label to predict is the id of the person
y = lfw_people.target
target_names = lfw_people.target_names
n_classes = target_names.shape[0]

print("Total dataset size:")
print("n_samples: %d" % n_samples)
print("n_features: %d" % n_features)
print("n_classes: %d" % n_classes)


# #############################################################################
# Split into a training set and a test set using a stratified k fold

# split into a training and testing set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)


# #############################################################################
# Compute a PCA (eigenfaces) on the face dataset (treated as unlabeled
# dataset): unsupervised feature extraction / dimensionality reduction
n_components = 150

print("Extracting the top %d eigenfaces from %d faces"
% (n_components, X_train.shape[0]))
t0 = time()
pca = PCA(n_components=n_components, svd_solver='randomized',
          whiten=True).fit(X_train)
print("done in %0.3fs" % (time() - t0))

eigenfaces = pca.components_.reshape((n_components, h, w))

print("Projecting the input data on the eigenfaces orthonormal basis")
t0 = time()
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
print("done in %0.3fs" % (time() - t0))


# #############################################################################
# Train a SVM classification model

print("Fitting the classifier to the training set")
t0 = time()
param_grid = {'C': [1e3, 5e3, 1e4, 5e4, 1e5],
              'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1], }
clf = GridSearchCV(SVC(kernel='rbf', class_weight='balanced'), param_grid)
clf = clf.fit(X_train_pca, y_train)
print("done in %0.3fs" % (time() - t0))
print("Best estimator found by grid search:")
print(clf.best_estimator_)


# #############################################################################
# Quantitative evaluation of the model quality on the test set

print("Predicting people's names on the test set")
t0 = time()
y_pred = clf.predict(X_test_pca)
print("done in %0.3fs" % (time() - t0))

print(classification_report(y_test, y_pred, target_names=target_names))
print(confusion_matrix(y_test, y_pred, labels=range(n_classes)))


# #############################################################################
# Qualitative evaluation of the predictions using matplotlib

def plot_gallery(images, titles, h, w, n_row=3, n_col=4):
    """Helper function to plot a gallery of portraits"""
    plt.figure(figsize=(1.8 * n_col, 2.4 * n_row))
    plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    for i in range(n_row * n_col):
        plt.subplot(n_row, n_col, i + 1)
        # the first n_row * n_col (here 12) portraits are plotted
        plt.imshow(images[i].reshape((h, w)), cmap=plt.cm.gray)
        plt.title(titles[i], size=12)
        plt.xticks(())
        plt.yticks(())


# plot the result of the prediction on a portion of the test set

def title(y_pred, y_test, target_names, i):
    pred_name = target_names[y_pred[i]].rsplit(' ', 1)[-1]
    true_name = target_names[y_test[i]].rsplit(' ', 1)[-1]
    return 'predicted: %s\ntrue: %s' % (pred_name, true_name)

prediction_titles = [title(y_pred, y_test, target_names, i)
                     for i in range(y_pred.shape[0])]

plot_gallery(X_test, prediction_titles, h, w)

# plot the gallery of the most significative eigenfaces

eigenface_titles = ["eigenface %d" % i for i in range(eigenfaces.shape[0])]
plot_gallery(eigenfaces, eigenface_titles, h, w)

plt.show()
Downloading LFW metadata: https://ndownloader.figshare.com/files/5976012
2018-01-06 14:28:33,104 Downloading LFW metadata: https://ndownloader.figshare.com/files/5976012
Downloading LFW metadata: https://ndownloader.figshare.com/files/5976009
2018-01-06 14:28:37,747 Downloading LFW metadata: https://ndownloader.figshare.com/files/5976009
Downloading LFW metadata: https://ndownloader.figshare.com/files/5976006
2018-01-06 14:28:40,713 Downloading LFW metadata: https://ndownloader.figshare.com/files/5976006


Total dataset size:
n_samples: 1288
n_features: 1850
n_classes: 7
Extracting the top 150 eigenfaces from 966 faces
done in 0.528s
Projecting the input data on the eigenfaces orthonormal basis
done in 0.008s
Fitting the classifier to the training set
done in 20.580s
Best estimator found by grid search:
SVC(C=1000.0, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.005, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
Predicting people's names on the test set
done in 0.051s
                   precision    recall  f1-score   support

     Ariel Sharon       0.78      0.54      0.64        13
     Colin Powell       0.80      0.87      0.83        60
  Donald Rumsfeld       0.90      0.67      0.77        27
    George W Bush       0.84      0.98      0.91       146
Gerhard Schroeder       0.91      0.80      0.85        25
      Hugo Chavez       1.00      0.53      0.70        15
       Tony Blair       0.96      0.75      0.84        36

      avg / total       0.86      0.85      0.85       322

[[  7   1   0   5   0   0   0]
 [  2  52   1   5   0   0   0]
 [  0   3  18   6   0   0   0]
 [  0   3   0 143   0   0   0]
 [  0   1   0   3  20   0   1]
 [  0   4   0   2   1   8   0]
 [  0   1   1   6   1   0  27]]
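
The per-class recall can also be read directly off the confusion matrix: each diagonal entry divided by its row sum. A minimal sketch, continuing from the variables defined in the script above:

cm = confusion_matrix(y_test, y_pred, labels=range(n_classes))
print((cm.diagonal() / cm.sum(axis=1)).round(2))  # matches the recall column of the report above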

png

png

target_names
array(['Ariel Sharon', 'Colin Powell', 'Donald Rumsfeld', 'George W Bush',
       'Gerhard Schroeder', 'Hugo Chavez', 'Tony Blair'],
      dtype='<U17')