scikit-learn Tutorial (4): Working with Text Data

This guide explores some of the main scikit-learn tools on a single practical task: analyzing a collection of documents (newsgroup posts) on 20 different topics.

In this section we will see how to:

  • load the file contents and the corresponding categories
  • extract feature vectors suitable for machine learning
  • train a linear model to perform categorization
  • use a grid search strategy to find a good configuration of both the feature extraction components and the classifier

Tutorial setup

The source of this tutorial can be found within the scikit-learn folder:

scikit-learn/doc/tutorial/text_analytics/

The tutorial folder contains the following sub-folders:

  • *.rst files - the source of the tutorial document
  • data - folder for the input datasets
  • skeletons - incomplete sample scripts for the exercises
  • solutions - solutions to the exercises

Loading the 20 newsgroups dataset

The dataset contains about 20,000 newsgroup documents, partitioned nearly evenly across 20 different newsgroups. It is a very popular dataset for text applications.

To get started, we will only work with 4 of the 20 categories:

categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']

from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train', categories=categories,
                                  shuffle=True, random_state=42)
twenty_train.target_names
['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
len(twenty_train.data)
2257
len(twenty_train.filenames)
2257
print('\n'.join(twenty_train.data[0].split('\n')[:3]))
From: sd345@city.ac.uk (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton
print(twenty_train.target_names[twenty_train.target[0]])
comp.graphics
twenty_train.target[:10]
array([1, 1, 3, 3, 3, 3, 3, 2, 2, 2])
for t in twenty_train.target[:10]:
    print(twenty_train.target_names[t])
comp.graphics
comp.graphics
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
sci.med
sci.med
sci.med

Extracting features from text files

In order to perform machine learning on text documents, the text content first has to be turned into numerical feature vectors.

Bags of words

The most intuitive way to do so is the bags-of-words representation:

  1. Assign a fixed integer ID to each word occurring in the training set documents.
  2. For each document #i, count the number of occurrences of each word w and store it in X[i, j] as the value of feature #j, where j is the integer ID of word w.

In the bags-of-words representation, n_features is the number of distinct words in the corpus: this number typically exceeds 100,000.

If n_samples == 10000, storing X as a dense NumPy array of float32 would require 10000 x 100000 x 4 bytes = 4 GB of RAM. Fortunately, most values in X are zeros, so scipy.sparse matrices can store such high-dimensional sparse datasets while keeping only the non-zero entries in memory. A minimal sketch of the two-step process above follows below.
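
As promised, here is a toy pure-Python bag-of-words counter illustrating the two steps (an illustrative sketch with invented documents; CountVectorizer below does the same job, plus proper tokenization, using sparse matrices):

docs = ["the cat sat", "the cat ate the fish"]  # toy corpus, invented for this sketch

# Step 1: assign a fixed integer ID to each distinct word.
vocabulary = {}
for doc in docs:
    for word in doc.split():
        vocabulary.setdefault(word, len(vocabulary))

# Step 2: X[i][j] = number of occurrences in document #i of the word with ID j.
X = [[0] * len(vocabulary) for _ in docs]
for i, doc in enumerate(docs):
    for word in doc.split():
        X[i][vocabulary[word]] += 1

print(vocabulary)  # {'the': 0, 'cat': 1, 'sat': 2, 'ate': 3, 'fish': 4}
print(X)           # [[1, 1, 1, 0, 0], [2, 1, 0, 1, 1]]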

Tokenizing text with scikit-learn

Text preprocessing, tokenizing and filtering of stopwords let us build a dictionary of features and transform documents into feature vectors:

from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape
(2257, 35788)

CountVectorizer supports counts of N-grams of words or consecutive characters. Once fitted, the vectorizer has built a dictionary of feature indices:

count_vect.vocabulary_.get('algorithm')
4690

The value returned is the index of the word in the vocabulary, i.e. its column in the count matrix; it is not an occurrence count.
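
To get the actual number of occurrences of a word across the corpus, sum its column of the count matrix. And as noted above, CountVectorizer can also count word or character n-grams instead of single words; the configurations below are illustrative assumptions, not settings used elsewhere in this tutorial:

# Total occurrences of 'algorithm' in the training set: sum the column
# of the sparse count matrix at the word's vocabulary index.
idx = count_vect.vocabulary_['algorithm']
print(X_train_counts[:, idx].sum())

# Counting unigrams and word bigrams together:
bigram_vect = CountVectorizer(ngram_range=(1, 2))
X_bigrams = bigram_vect.fit_transform(twenty_train.data)
print(X_bigrams.shape)  # many more features than the unigram case

# Counting n-grams of consecutive characters (2 to 3 characters long):
char_vect = CountVectorizer(analyzer='char', ngram_range=(2, 3))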

From occurrences to frequencies

Dividing the number of occurrences of each word in a document by the total number of words in the document gives new features called tf, for Term Frequencies.

Another important refinement on top of tf is to downscale the weights of words that occur in many documents of the corpus, since they carry less information than words that occur only in a smaller portion of the corpus.

This downscaling is called tf-idf, for "Term Frequency times Inverse Document Frequency".

from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
X_train_tf.shape
(2257, 35788)

The fit and transform steps above can be combined into a single call:

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape
(2257, 35788)
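
To make the downscaling concrete, here is a hedged sketch of the computation TfidfTransformer performs with its default settings (smooth_idf=True, norm='l2'; this reflects my reading of the scikit-learn documentation, so verify the exact formula for your release):

import numpy as np

# Toy count matrix, 6 documents x 3 terms (numbers invented for illustration).
counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 0, 0],
                   [4, 0, 0],
                   [3, 2, 0],
                   [3, 0, 2]], dtype=float)
n_samples = counts.shape[0]

df = (counts > 0).sum(axis=0)                 # document frequency of each term
idf = np.log((1 + n_samples) / (1 + df)) + 1  # smoothed inverse document frequency
tfidf = counts * idf                          # scale raw counts by idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # l2-normalize each row
print(tfidf.round(3))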

Training a classifier

We start with a naive Bayes classifier, which provides a nice baseline. scikit-learn includes several variants of this classifier; the one most suitable for word counts is the multinomial variant.

from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

To predict the outcome on new documents, we need to extract features in the same way as before. The difference is that we call transform instead of fit_transform, since the transformers have already been fitted on the training set:

docs_new = ['God is love', 'OpenGl on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))
'God is love' => soc.religion.christian
'OpenGl on the GPU is fast' => comp.graphics
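
Since MultinomialNB is a probabilistic classifier, we can also inspect how confident it is in each prediction. A short sketch (the output is not captured from the original run):

# Class membership probabilities for the new documents; columns follow
# the order of twenty_train.target_names.
probs = clf.predict_proba(X_new_tfidf)
for doc, row in zip(docs_new, probs):
    print('%r => %s' % (doc, dict(zip(twenty_train.target_names, row.round(2)))))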

Building a pipeline

In order to make the vectorizer => transformer => classifier chain easier to work with, scikit-learn provides a Pipeline class that behaves like a compound classifier:

from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])

The names vect, tfidf and clf are arbitrary; we will reuse them when performing grid search below. The whole pipeline can now be trained with a single call:

text_clf.fit(twenty_train.data, twenty_train.target)
Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...inear_tf=False, use_idf=True)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

Evaluating the performance on the test set

import numpy as np
twenty_test = fetch_20newsgroups(subset='test', categories=categories,
                                 shuffle=True, random_state=42)
docs_test = twenty_test.data
predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)
0.83488681757656458

That is, we achieved 83.5% accuracy. (Taking np.mean of a boolean array gives the fraction of True values:)

np.mean([True, False, True])
0.66666666666666663
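
Equivalently, the metrics module provides a dedicated helper for this:

from sklearn import metrics
metrics.accuracy_score(twenty_test.target, predicted)  # same value as the np.mean above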

Let's see if we can do better with a linear support vector machine (SVM), widely regarded as one of the best text classification algorithms, although it is a bit slower than naive Bayes:

from sklearn.linear_model import SGDClassifier
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                           alpha=1e-3, random_state=42,
                                           max_iter=5, tol=None))])
text_clf.fit(twenty_train.data, twenty_train.target)
Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...ty='l2', power_t=0.5, random_state=42, shuffle=True,
       tol=None, verbose=0, warm_start=False))])
predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)
0.9127829560585885

More detailed performance analysis

from sklearn import metrics
print(metrics.classification_report(twenty_test.target, predicted,
                                    target_names=twenty_test.target_names))
                        precision    recall  f1-score   support

           alt.atheism       0.95      0.81      0.87       319
         comp.graphics       0.88      0.97      0.92       389
               sci.med       0.94      0.90      0.92       396
soc.religion.christian       0.90      0.95      0.93       398

           avg / total       0.92      0.91      0.91      1502
metrics.confusion_matrix(twenty_test.target, predicted)
array([[258,  11,  15,  35],
       [  4, 379,   3,   3],
       [  5,  33, 355,   3],
       [  5,  10,   4, 379]])
twenty_test.target_names
['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

As expected, the confusion matrix shows that posts from the atheism and christian newsgroups are more often confused with one another than with computer graphics.
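
In the confusion matrix, rows are the true classes and columns the predicted ones, both in the order of twenty_test.target_names. A small sketch that labels them explicitly (pandas is used here purely for display, an assumption not made elsewhere in this tutorial):

import pandas as pd

cm = metrics.confusion_matrix(twenty_test.target, predicted)
print(pd.DataFrame(cm, index=twenty_test.target_names,
                   columns=twenty_test.target_names))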

Parameter tuning using grid search

We have already encountered some hyperparameters, such as use_idf in TfidfTransformer. Classifiers tend to have many parameters as well: MultinomialNB has a smoothing parameter alpha, and SGDClassifier also has a regularization strength alpha.

Let's try out various parameter combinations for the linear SVM:

from sklearn.model_selection import GridSearchCV
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3)}

The search can be parallelized over multiple CPUs with the n_jobs parameter; setting it to -1 uses all available cores:

gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)

To make the search faster, we perform it on a smaller subset of 400 training documents:

gs_clf = gs_clf.fit(twenty_train.data[:400], twenty_train.target[:400])
twenty_train.target_names[gs_clf.predict(['God is love'])[0]]
'soc.religion.christian'
gs_clf.best_score_
0.90000000000000002
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, gs_clf.best_params_[param_name]))
clf__alpha: 0.001
tfidf__use_idf: True
vect__ngram_range: (1, 1)
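
Because GridSearchCV refits the best estimator on the data it was given (refit=True by default), gs_clf can be used directly for prediction. As a quick hedged check on the held-out test set (the resulting score is not from the original run):

# Evaluate the tuned pipeline; gs_clf delegates to the best estimator found.
predicted = gs_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)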

More detail about the search is available in gs_clf.cv_results_:

gs_clf.cv_results_
{'mean_fit_time': array([ 0.18523932,  0.78656578,  0.17453774,  0.62395978,  0.20309472,
         0.53238297,  0.19375126,  0.53748655]),
 'mean_score_time': array([ 0.1029977 ,  0.18591921,  0.0927794 ,  0.19722565,  0.08967264,
         0.12721872,  0.12251814,  0.09853498]),
 'mean_test_score': array([ 0.8775,  0.875 ,  0.765 ,  0.78  ,  0.9   ,  0.89  ,  0.7675,  0.81  ]),
 'mean_train_score': array([ 0.99374372,  1.        ,  0.94123886,  0.97623272,  1.        ,
         1.        ,  0.98499057,  1.        ]),
 'param_clf__alpha': masked_array(data = [0.01 0.01 0.01 0.01 0.001 0.001 0.001 0.001],
              mask = [False False False False False False False False],
        fill_value = ?),
 'param_tfidf__use_idf': masked_array(data = [True True False False True True False False],
              mask = [False False False False False False False False],
        fill_value = ?),
 'param_vect__ngram_range': masked_array(data = [(1, 1) (1, 2) (1, 1) (1, 2) (1, 1) (1, 2) (1, 1) (1, 2)],
              mask = [False False False False False False False False],
        fill_value = ?),
 'params': [{'clf__alpha': 0.01,
   'tfidf__use_idf': True,
   'vect__ngram_range': (1, 1)},
  {'clf__alpha': 0.01, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)},
  {'clf__alpha': 0.01, 'tfidf__use_idf': False, 'vect__ngram_range': (1, 1)},
  {'clf__alpha': 0.01, 'tfidf__use_idf': False, 'vect__ngram_range': (1, 2)},
  {'clf__alpha': 0.001, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 1)},
  {'clf__alpha': 0.001, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)},
  {'clf__alpha': 0.001, 'tfidf__use_idf': False, 'vect__ngram_range': (1, 1)},
  {'clf__alpha': 0.001, 'tfidf__use_idf': False, 'vect__ngram_range': (1, 2)}],
 'rank_test_score': array([3, 4, 8, 6, 1, 2, 7, 5], dtype=int32),
 'split0_test_score': array([ 0.85820896,  0.8358209 ,  0.69402985,  0.70149254,  0.89552239,
         0.88059701,  0.76119403,  0.76119403]),
 'split0_train_score': array([ 0.9887218 ,  1.        ,  0.93233083,  0.96240602,  1.        ,
         1.        ,  0.97744361,  1.        ]),
 'split1_test_score': array([ 0.87969925,  0.88721805,  0.78947368,  0.82706767,  0.86466165,
         0.86466165,  0.71428571,  0.80451128]),
 'split1_train_score': array([ 0.99625468,  1.        ,  0.95131086,  0.98501873,  1.        ,
         1.        ,  0.97752809,  1.        ]),
 'split2_test_score': array([ 0.89473684,  0.90225564,  0.81203008,  0.81203008,  0.93984962,
         0.92481203,  0.82706767,  0.86466165]),
 'split2_train_score': array([ 0.99625468,  1.        ,  0.94007491,  0.98127341,  1.        ,
         1.        ,  1.        ,  1.        ]),
 'std_fit_time': array([ 0.0227632 ,  0.10502007,  0.03802567,  0.03324145,  0.06592948,
         0.09490877,  0.06331322,  0.08667628]),
 'std_score_time': array([ 0.02282605,  0.03587807,  0.04588374,  0.07623751,  0.02072694,
         0.01909799,  0.01674307,  0.03193668]),
 'std_test_score': array([ 0.01500217,  0.02847571,  0.05120452,  0.05605779,  0.03082125,
         0.0254174 ,  0.04620275,  0.04244373]),
 'std_train_score': array([ 0.00355103,  0.        ,  0.00779216,  0.00989579,  0.        ,
         0.        ,  0.01061333,  0.        ])}
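
cv_results_ is designed to be loaded into a pandas DataFrame for easier inspection. A small sketch (the column selection here is just one reasonable choice):

import pandas as pd

results = pd.DataFrame(gs_clf.cv_results_)
# Rank the 8 parameter combinations by cross-validated test score.
print(results[['params', 'mean_test_score', 'rank_test_score']]
      .sort_values('rank_test_score'))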

Exercise

Choose a suitable estimator for the task.
