tf.estimator Quickstart

TensorFlow's high-level machine learning API (tf.estimator) makes it easy to configure, train, and evaluate a wide variety of machine learning models. In this tutorial, you'll use tf.estimator to build a neural network classifier and train it on the Iris data set to predict flower species based on sepal/petal geometry.

You will write code to carry out the following five steps:

  1. Load the CSVs containing the Iris training/test sets into a TensorFlow Dataset
  2. Construct a neural network classifier
  3. Train the model using the training data
  4. Evaluate the accuracy of the model
  5. Classify new samples

Note: Install TensorFlow before getting started with this tutorial.

Complete Neural Network Source Code

Here is the full code for the neural network classifier, followed by the log output from a sample run:

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os
from six.moves.urllib.request import urlopen

import numpy as np
import tensorflow as tf

# Data sets
IRIS_TRAINING = "iris_training.csv"
IRIS_TRAINING_URL = "http://download.tensorflow.org/data/iris_training.csv"

IRIS_TEST = "iris_test.csv"
IRIS_TEST_URL = "http://download.tensorflow.org/data/iris_test.csv"

def main():
  # If the training and test sets aren't stored locally, download them.
  if not os.path.exists(IRIS_TRAINING):
    raw = urlopen(IRIS_TRAINING_URL).read()
    with open(IRIS_TRAINING, "wb") as f:
      f.write(raw)

  if not os.path.exists(IRIS_TEST):
    raw = urlopen(IRIS_TEST_URL).read()
    with open(IRIS_TEST, "wb") as f:
      f.write(raw)

  # Load datasets.
  training_set = tf.contrib.learn.datasets.base.load_csv_with_header(
      filename=IRIS_TRAINING,
      target_dtype=np.int,
      features_dtype=np.float32)
  test_set = tf.contrib.learn.datasets.base.load_csv_with_header(
      filename=IRIS_TEST,
      target_dtype=np.int,
      features_dtype=np.float32)

  # Specify that all features have real-value data
  feature_columns = [tf.feature_column.numeric_column("x", shape=[4])]

  # Build 3 layer DNN with 10, 20, 10 units respectively.
  classifier = tf.estimator.DNNClassifier(feature_columns=feature_columns,
                                          hidden_units=[10, 20, 10],
                                          n_classes=3,
                                          model_dir="/tmp/iris_model")
  # Define the training inputs
  train_input_fn = tf.estimator.inputs.numpy_input_fn(
      x={"x": np.array(training_set.data)},
      y=np.array(training_set.target),
      num_epochs=None,
      shuffle=True)

  # Train model.
  classifier.train(input_fn=train_input_fn, steps=2000)

  # Define the test inputs
  test_input_fn = tf.estimator.inputs.numpy_input_fn(
      x={"x": np.array(test_set.data)},
      y=np.array(test_set.target),
      num_epochs=1,
      shuffle=False)

  # Evaluate accuracy.
  accuracy_score = classifier.evaluate(input_fn=test_input_fn)["accuracy"]

  print("\nTest Accuracy: {0:f}\n".format(accuracy_score))

  # Classify two new flower samples.
  new_samples = np.array(
      [[6.4, 3.2, 4.5, 1.5],
       [5.8, 3.1, 5.0, 1.7]], dtype=np.float32)
  predict_input_fn = tf.estimator.inputs.numpy_input_fn(
      x={"x": new_samples},
      num_epochs=1,
      shuffle=False)

  predictions = list(classifier.predict(input_fn=predict_input_fn))
  predicted_classes = [p["classes"] for p in predictions]

  print(
      "New Samples, Class Predictions: {}\n"
      .format(predicted_classes))

if __name__ == "__main__":
  main()
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': '/tmp/iris_model', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f5ae685de10>, '_task_type': 'worker', '_task_id': 0, '_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Restoring parameters from /tmp/iris_model/model.ckpt-2000
INFO:tensorflow:Saving checkpoints for 2001 into /tmp/iris_model/model.ckpt.
INFO:tensorflow:loss = 5.46321, step = 2001
INFO:tensorflow:global_step/sec: 866.16
INFO:tensorflow:loss = 6.87284, step = 2101 (0.116 sec)
INFO:tensorflow:global_step/sec: 1041.16
INFO:tensorflow:loss = 3.69508, step = 2201 (0.096 sec)
INFO:tensorflow:global_step/sec: 985.203
INFO:tensorflow:loss = 2.3037, step = 2301 (0.101 sec)
INFO:tensorflow:global_step/sec: 983.131
INFO:tensorflow:loss = 7.33763, step = 2401 (0.102 sec)
INFO:tensorflow:global_step/sec: 916.557
INFO:tensorflow:loss = 7.2439, step = 2501 (0.110 sec)
INFO:tensorflow:global_step/sec: 987.515
INFO:tensorflow:loss = 5.76083, step = 2601 (0.101 sec)
INFO:tensorflow:global_step/sec: 980.429
INFO:tensorflow:loss = 2.64258, step = 2701 (0.102 sec)
INFO:tensorflow:global_step/sec: 969.033
INFO:tensorflow:loss = 10.926, step = 2801 (0.103 sec)
INFO:tensorflow:global_step/sec: 963.906
INFO:tensorflow:loss = 5.91679, step = 2901 (0.104 sec)
INFO:tensorflow:global_step/sec: 994.896
INFO:tensorflow:loss = 1.67117, step = 3001 (0.100 sec)
INFO:tensorflow:global_step/sec: 989.437
INFO:tensorflow:loss = 8.41107, step = 3101 (0.101 sec)
INFO:tensorflow:global_step/sec: 1042.21
INFO:tensorflow:loss = 10.0648, step = 3201 (0.096 sec)
INFO:tensorflow:global_step/sec: 981.781
INFO:tensorflow:loss = 2.24606, step = 3301 (0.102 sec)
INFO:tensorflow:global_step/sec: 985.14
INFO:tensorflow:loss = 5.87443, step = 3401 (0.101 sec)
INFO:tensorflow:global_step/sec: 975.12
INFO:tensorflow:loss = 4.06221, step = 3501 (0.103 sec)
INFO:tensorflow:global_step/sec: 956.165
INFO:tensorflow:loss = 0.989045, step = 3601 (0.104 sec)
INFO:tensorflow:global_step/sec: 941.931
INFO:tensorflow:loss = 3.49446, step = 3701 (0.107 sec)
INFO:tensorflow:global_step/sec: 954.209
INFO:tensorflow:loss = 4.13903, step = 3801 (0.104 sec)
INFO:tensorflow:global_step/sec: 1044.03
INFO:tensorflow:loss = 3.30474, step = 3901 (0.096 sec)
INFO:tensorflow:Saving checkpoints for 4000 into /tmp/iris_model/model.ckpt.
INFO:tensorflow:Loss for final step: 6.47402.
INFO:tensorflow:Starting evaluation at 2017-12-17-12:11:46
INFO:tensorflow:Restoring parameters from /tmp/iris_model/model.ckpt-4000
INFO:tensorflow:Finished evaluation at 2017-12-17-12:11:46
INFO:tensorflow:Saving dict for global step 4000: accuracy = 0.966667, average_loss = 0.0702326, global_step = 4000, loss = 2.10698

Test Accuracy: 0.966667

INFO:tensorflow:Restoring parameters from /tmp/iris_model/model.ckpt-4000
New Samples, Class Predictions:    [array([b'1'], dtype=object), array([b'2'], dtype=object)]

The following sections walk through this code in detail.

Load the Iris CSV Data into TensorFlow

The Iris data set contains 150 rows of data, comprising 50 samples from each of three related Iris species: Iris setosa, Iris virginica, and Iris versicolor.

Each row contains the following data for one flower sample: sepal length, sepal width, petal length, petal width, and flower species. The flower species is represented as an integer, 0, 1, or 2.

For this tutorial, the Iris data has been randomized and split into two separate CSVs (their expected layout is sketched after the list below):

  • A training set of 120 samples: iris_training.csv
  • A test set of 30 samples: iris_test.csv
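
Both files use the simple header layout that load_csv_with_header() expects. As an illustration only (the rows below are made-up values in the right shape, not the literal file contents), iris_training.csv is structured roughly like this: a header giving the sample count, the feature count, and the class names, followed by rows of four feature values and an integer species label:

120,4,setosa,versicolor,virginica
6.4,3.2,4.5,1.5,1
5.8,3.1,5.0,1.7,2
...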

To get started, first import all the necessary modules and define where to download and store the data:

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os
from six.moves.urllib.request import urlopen

import tensorflow as tf
import numpy as np

IRIS_TRAINING = 'iris_training.csv'
IRIS_TRAINING_URL = 'http://download.tensorflow.org/data/iris_training.csv'

IRIS_TEST = 'iris_test.csv'
IRIS_TEST_URL = 'http://download.tensorflow.org/data/iris_test.csv'

Then, if the training and test sets aren't already stored locally, download them:

if not os.path.exists(IRIS_TRAINING):
  raw = urlopen(IRIS_TRAINING_URL).read()
  with open(IRIS_TRAINING, 'wb') as f:
    f.write(raw)

if not os.path.exists(IRIS_TEST):
  raw = urlopen(IRIS_TEST_URL).read()
  with open(IRIS_TEST, 'wb') as f:
    f.write(raw)

Next, load the training and test sets into Datasets using the load_csv_with_header() method, which takes three arguments:

  • filename: the path to the CSV file
  • target_dtype: the numpy datatype of the dataset's target value
  • features_dtype: the numpy datatype of the dataset's feature values

Here, the target is the flower species, an integer from 0 to 2, so the appropriate numpy datatype is np.int:

# Load datasets.
training_set = tf.contrib.learn.datasets.base.load_csv_with_header(
    filename=IRIS_TRAINING,
    target_dtype=np.int,
    features_dtype=np.float32)
test_set = tf.contrib.learn.datasets.base.load_csv_with_header(
    filename=IRIS_TEST,
    target_dtype=np.int,
    features_dtype=np.float32)

Datasets in tf.contrib.learn are named tuples; you can access feature data and target values via the data and target fields. Here, training_set.data and training_set.target contain the feature data and target values for the training set, and test_set.data and test_set.target contain the feature data and target values for the test set.
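
For example, a quick sanity check of the loaded arrays might look like this (a small sketch; the shapes assume the 120-sample training CSV described above):

print(training_set.data.shape)    # (120, 4): 120 samples with 4 feature values each
print(training_set.data.dtype)    # float32, as requested via features_dtype
print(training_set.target.shape)  # (120,): one integer label (0-2) per sample
print(training_set.target[:5])    # the first five species labels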

In the following sections, you'll use training_set.data and training_set.target to train your model, and test_set.data and test_set.target to evaluate it. But first, you need to construct your model.

Construct a Deep Neural Network Classifier

tf.estimator offers a variety of predefined models, called Estimators. Here, you'll configure a Deep Neural Network Classifier model to fit the Iris data. Using tf.estimator, you can instantiate your tf.estimator.DNNClassifier with just a couple of lines of code:

# Specify that all features have real-value data
feature_columns = [tf.feature_column.numeric_column('x', shape=[4])]

# Build 3 layer DNN with 10, 20, 10 units respectively
classifier = tf.estimator.DNNClassifier(feature_columns=feature_columns,
                                        hidden_units=[10, 20, 10],
                                        n_classes=3,
                                        model_dir='/tmp/iris_model')

The code above first defines the model's feature columns, which specify the data type of the features in the data set. All the feature data is continuous, so tf.feature_column.numeric_column is the appropriate function to use here.

Then, the code creates the DNNClassifier model using the following arguments (a small configuration sketch follows the list):

  • feature_columns=feature_columns: the set of feature columns defined above
  • hidden_units=[10, 20, 10]: three hidden layers, containing 10, 20, and 10 neurons, respectively
  • n_classes=3: three target classes, representing the three Iris species
  • model_dir=/tmp/iris_model: the directory in which TensorFlow will save checkpoint data and TensorBoard summaries during model training
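
If you want finer control over the run, DNNClassifier also accepts a config argument. As a minimal sketch (assuming the tf.estimator.RunConfig options shown in the log above, such as tf_random_seed and save_checkpoints_secs), you could make runs repeatable and change how often checkpoints are written:

# Hypothetical variant of the classifier above with an explicit RunConfig.
run_config = tf.estimator.RunConfig(tf_random_seed=42,
                                    save_checkpoints_secs=60)
classifier = tf.estimator.DNNClassifier(feature_columns=feature_columns,
                                        hidden_units=[10, 20, 10],
                                        n_classes=3,
                                        model_dir='/tmp/iris_model',
                                        config=run_config)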

Describe the Training Input Pipeline

The tf.estimator API uses input functions, which create the TensorFlow operations that generate data for the model. Here, we can use tf.estimator.inputs.numpy_input_fn to produce the input pipeline:

# Define the training inputs
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={'x': np.array(training_set.data)},
    y=np.array(training_set.target),
    num_epochs=None,
    shuffle=True)

Fit the DNNClassifier to the Iris Training Data

Now that you've configured your DNN classifier model, you can fit it to the Iris training data using the train method. Pass train_input_fn as the input_fn, along with the number of steps to train (here, 2000):

# Train model.
classifier.train(input_fn=train_input_fn, steps=2000)

The state of the model is preserved in the classifier, which means you can train iteratively if you like. For example, the code above is equivalent to the following:

classifier.train(input_fn=train_input_fn, steps=1000)
classifier.train(input_fn=train_input_fn, steps=1000)
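
Because that state lives in the checkpoints written to model_dir, training can even be resumed by a completely new Estimator instance. A minimal sketch, assuming /tmp/iris_model from the earlier run is still on disk:

# A new DNNClassifier pointing at the same model_dir restores the latest
# checkpoint, so this call continues training rather than starting over.
resumed_classifier = tf.estimator.DNNClassifier(feature_columns=feature_columns,
                                                hidden_units=[10, 20, 10],
                                                n_classes=3,
                                                model_dir='/tmp/iris_model')
resumed_classifier.train(input_fn=train_input_fn, steps=1000)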

Evaluate Model Accuracy

You've trained your DNNClassifier model on the Iris training data; now you can check its accuracy on the Iris test data using the evaluate method. Like train, evaluate takes an input function that builds its input pipeline, and it returns a dict with the evaluation results. The following code passes the Iris test data to evaluate and prints the accuracy from the results:

# Define the test inputs
test_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={'x': np.array(test_set.data)},
    y=np.array(test_set.target),
    num_epochs=1,
    shuffle=False)

# Evaluate accuracy
accuracy_score = classifier.evaluate(input_fn=test_input_fn)['accuracy']

print('\nTest Accuracy: {0:f}\n'.format(accuracy_score))

Note: The num_epochs=1 argument here is important. test_input_fn will iterate over the data once and then raise OutOfRangeError. This error signals the classifier to stop evaluating, so it evaluates over the input exactly once.

When you run the script, it should print something like:

Test Accuracy: 0.966667
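
accuracy is only one entry in the dict returned by evaluate; the run logged above also reported average_loss, loss, and global_step. A short sketch for printing every metric it returns:

# Print every metric in the evaluation results, not just accuracy.
metrics = classifier.evaluate(input_fn=test_input_fn)
for name in sorted(metrics):
  print('{}: {}'.format(name, metrics[name]))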

Classify New Samples

Use the estimator's predict() method to classify new samples. For example, suppose you want to classify the following two new flower samples.

predict returns a generator of dicts, which can easily be converted to a list. The following code retrieves and prints the class predictions:

# Classify two new flower samples.
new_samples = np.array(
    [[6.4, 3.2, 4.5, 1.5],
     [5.8, 3.1, 5.0, 1.7]], dtype=np.float32)
predict_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={'x': new_samples},
    num_epochs=1,
    shuffle=False)

predictions = list(classifier.predict(input_fn=predict_input_fn))
predicted_classes = [p['classes'] for p in predictions]

print('New Samples, Class Predictions: {}\n'.format(predicted_classes))

Your results should look as follows:

New Samples, Class Predictions: [1 2]
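
Each prediction dict holds more than the classes entry; for a DNNClassifier it should also contain per-class scores under a probabilities key (an assumption based on this Estimator's standard prediction output). A small sketch for inspecting them:

# Inspect the per-class probabilities alongside each predicted class.
for sample, prediction in zip(new_samples, predictions):
  print('Sample {}: class {}, probabilities {}'.format(
      sample, prediction['classes'], prediction['probabilities']))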