Building Input Functions with tf.estimator

This tutorial introduces you to creating input functions in tf.estimator. You'll get an overview of how to construct an input_fn to preprocess and feed data into your model. Then, you'll apply an input_fn to training, evaluation, and prediction data, with a concrete example: a neural network regressor that predicts median house values.

Custom Input Pipelines with input_fn

The input_fn is used to pass feature and target data to the train, evaluate, and predict methods of an Estimator. The user can do feature engineering or preprocessing inside the input_fn. Here's an example:

import numpy as np
import tensorflow as tf

training_set = tf.contrib.learn.datasets.base.load_csv_with_header(
    filename=IRIS_TRAINING, target_dtype=np.int, features_dtype=np.float32)

train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": np.array(training_set.data)},
    y=np.array(training_set.target),
    num_epochs=None,
    shuffle=True)

classifier.train(input_fn=train_input_fn, steps=2000)

Anatomy of an input_fn

The following code illustrates the basic skeleton of an input function:

def my_input_fn():

    # Preprocess your data here...

    # ...then return 1) a mapping of feature columns to Tensors with
    # the corresponding feature data, and 2) a Tensor containing labels
    return feature_cols, labels

The body of the input function contains the specific logic for preprocessing your input data, such as scrubbing out missing data or applying feature scaling.

The input function must return the following two values, containing the final feature and label data to be fed into your model:

feature_cols

A dict containing key/value pairs that map feature column names to Tensors containing the corresponding feature data.

labels

A Tensor containing your label values.
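As a plain-Python sketch of this contract (using dicts and lists as stand-ins for Tensors; the feature names and values below are invented purely for illustration):

```python
# Illustrative only: lists stand in for Tensors, and the feature
# names/values here are made up for this example.
def my_input_fn():
    # feature_cols maps each feature column name to its data...
    feature_cols = {"crim": [0.02, 0.27], "rm": [6.5, 6.4]}
    # ...and labels holds one target value per example.
    labels = [24.0, 21.6]
    return feature_cols, labels

features, labels = my_input_fn()
```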

Converting Feature Data to Tensors

If your feature/label data is a python array or is stored in pandas DataFrames or numpy arrays, you can use one of the following methods to construct your input_fn:

import numpy as np
# numpy input_fn.
my_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={'x': np.array(x_data)},
    y=np.array(y_data),
    ...)

import pandas as pd
# pandas input_fn.
my_input_fn = tf.estimator.inputs.pandas_input_fn(
    x=pd.DataFrame({'x': x_data}),
    y=pd.Series(y_data),
    ...)

For sparse, categorical data, you may want to use a SparseTensor, which is instantiated with three arguments:

dense_shape

The shape of the tensor, given as a list indicating the number of elements in each dimension. For example: dense_shape=[3,6] specifies a two-dimensional 3x6 tensor, dense_shape=[2,3,4] specifies a three-dimensional 2x3x4 tensor, and dense_shape=[9] specifies a one-dimensional tensor with 9 elements.

indices

The indices of the elements in your tensor that contain nonzero values. For example, indices=[[1,3], [2,4]] specifies that the elements with indexes of [1,3] and [2,4] have nonzero values.

values

The values for each of the elements specified in indices. For example, given indices=[[1,3], [2,4]], the parameter values=[18, 3.6] specifies that element [1,3] of the tensor has a value of 18, and element [2,4] of the tensor has a value of 3.6.

The following code defines a two-dimensional SparseTensor with 3 rows and 5 columns. The element with index [0,1] has a value of 6, and the element with index [2,4] has a value of 0.5 (all other values are 0):

sparse_tensor = tf.SparseTensor(indices=[[0, 1], [2, 4]],
                                values=[6, 0.5],
                                dense_shape=[3, 5])

This corresponds to the following dense tensor:

[[0, 6, 0, 0, 0]
[0, 0, 0, 0, 0]
[0, 0, 0, 0, 0.5]]
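To make the mapping concrete, here is a small pure-Python sketch that expands the indices/values/dense_shape triple into the dense 2-D tensor shown above (illustrative only; in TensorFlow you would use the SparseTensor directly):

```python
def sparse_to_dense_2d(indices, values, dense_shape):
    """Expand a 2-D sparse representation into nested lists of values."""
    rows, cols = dense_shape
    # Start with an all-zero rows x cols grid...
    dense = [[0] * cols for _ in range(rows)]
    # ...then place each value at its (row, col) index.
    for (r, c), v in zip(indices, values):
        dense[r][c] = v
    return dense

dense = sparse_to_dense_2d(indices=[[0, 1], [2, 4]],
                           values=[6, 0.5],
                           dense_shape=[3, 5])
# dense == [[0, 6, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0.5]]
```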

Passing input_fn Data to Your Model

To feed data to your model, you simply pass the input function to train as the value of the input_fn parameter, e.g.:

classifier.train(input_fn=my_input_fn, steps=2000)

Note that the input_fn parameter must receive a function object (i.e., input_fn=my_input_fn), not the return value of a function call (input_fn=my_input_fn()). This means the following code would raise a TypeError:

classifier.train(input_fn=my_input_fn(training_set), steps=2000)

However, if you do want to parameterize your input function, there are other ways. You can use a wrapper function that takes no arguments as your input_fn and use it to invoke your input function with the desired parameters, for example:

def my_input_fn(data_set):
    ...

def my_input_fn_training_set():
    return my_input_fn(training_set)

classifier.train(input_fn=my_input_fn_training_set, steps=2000)

Alternatively, you can use Python's functools.partial function to construct a new function with all parameter values fixed:

import functools

classifier.train(
    input_fn=functools.partial(my_input_fn, data_set=training_set), steps=2000)

A third option is to wrap your input_fn invocation in a lambda and pass it to the input_fn parameter:

classifier.train(input_fn=lambda: my_input_fn(training_set), steps=2000)
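The difference between these deferred forms and the broken my_input_fn(training_set) call can be seen with a toy stand-in for train that, like the Estimator method, requires a callable (everything below is illustrative; train_stub is not a TensorFlow API):

```python
import functools

def my_input_fn(data_set):
    return data_set  # stand-in for real preprocessing

def train_stub(input_fn, steps):
    # Like Estimator.train: require a callable, and invoke it later.
    if not callable(input_fn):
        raise TypeError("input_fn must be callable")
    return input_fn()

# Both deferred forms hand train_stub a callable, so they work:
result_a = train_stub(lambda: my_input_fn("training"), steps=2000)
result_b = train_stub(functools.partial(my_input_fn, data_set="training"),
                      steps=2000)

# Passing the *result* of the call hands it plain data instead,
# mirroring the TypeError example above:
try:
    train_stub(my_input_fn("training"), steps=2000)
except TypeError:
    pass  # raised as expected
```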

One big advantage of designing your input pipeline to accept a parameter for the data set, as above, is that you can pass the same input_fn to evaluate and predict, changing only the argument:

classifier.evaluate(input_fn=lambda: my_input_fn(test_set), steps=2000)

This approach improves code maintainability: there is no need to define multiple input_fn s for each type of operation.

Finally, you can use the methods in tf.estimator.inputs to create an input_fn from numpy or pandas data sets. An additional benefit is that you can use more arguments, such as num_epochs and shuffle, to control how the input_fn iterates over the data:

import pandas as pd

def get_input_fn_from_pandas(data_set, num_epochs=None, shuffle=True):
    return tf.estimator.inputs.pandas_input_fn(
        x=pd.DataFrame(...),
        y=pd.Series(...),
        num_epochs=num_epochs,
        shuffle=shuffle)
import numpy as np

def get_input_fn_from_numpy(data_set, num_epochs=None, shuffle=True):
    return tf.estimator.inputs.numpy_input_fn(
        x={...},
        y=np.array(...),
        num_epochs=num_epochs,
        shuffle=shuffle)

A Neural Network Model for Boston House Values

In the remainder of this tutorial, you'll write an input_fn to preprocess a subset of Boston housing data and use it to feed data to a neural network regressor for predicting median house values.

The data set contains the following nine features (defined in COLUMNS below): CRIM, ZN, INDUS, NOX, RM, AGE, DIS, TAX, and PTRATIO.

The label your model will predict is MEDV, the median house value.

Setup

Download the following data sets: boston_train.csv, boston_test.csv, and boston_predict.csv.

The following sections walk step by step through how to build an input_fn, feed the data to the neural network regressor, train and evaluate the model, and finally make house value predictions. The complete code is also available.

Importing the Housing Data

To start, set up your imports and set logging verbosity to INFO for more detailed log output:

import itertools

import pandas as pd
import tensorflow as tf

tf.logging.set_verbosity(tf.logging.INFO)

Define the column names for the data set. To distinguish features from the label, also define FEATURES and LABEL. Then read the three CSVs into pandas DataFrames:

COLUMNS = ["crim", "zn", "indus", "nox", "rm", "age",
           "dis", "tax", "ptratio", "medv"]
FEATURES = ["crim", "zn", "indus", "nox", "rm",
            "age", "dis", "tax", "ptratio"]
LABEL = "medv"

training_set = pd.read_csv('boston_train.csv', skipinitialspace=True,
                           skiprows=1, names=COLUMNS)
test_set = pd.read_csv('boston_test.csv', skipinitialspace=True,
                       skiprows=1, names=COLUMNS)
prediction_set = pd.read_csv('boston_predict.csv', skipinitialspace=True,
                             skiprows=1, names=COLUMNS)

Defining FeatureColumns and Creating the Regressor

Next, create a list of FeatureColumns for the input data, which formally specify the set of features to use for training. Because all features in the housing data set contain continuous values, you can create their FeatureColumns with the tf.feature_column.numeric_column() function:

feature_cols = [tf.feature_column.numeric_column(k) for k in FEATURES]

Note: For a more in-depth overview of feature columns, see here; the Linear Model Tutorial introduces how to define FeatureColumns for categorical data.

Now, instantiate a DNNRegressor for the neural network regression model. You'll provide two arguments here: hidden_units, a hyperparameter specifying the number of nodes in each hidden layer (here, two hidden layers with 10 nodes each), and feature_columns, containing the list of FeatureColumns you just defined:

regressor = tf.estimator.DNNRegressor(feature_columns=feature_cols,
                                      hidden_units=[10, 10],
                                      model_dir='/tmp/boston_model')
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': '/tmp/boston_model', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f3beb1527f0>, '_task_type': 'worker', '_task_id': 0, '_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}

Building the input_fn

To pass input data into the regressor, write a factory method that accepts a pandas DataFrame and returns an input_fn:

def get_input_fn(data_set, num_epochs=None, shuffle=True):
    return tf.estimator.inputs.pandas_input_fn(
        x=pd.DataFrame({k: data_set[k].values for k in FEATURES}),
        y=pd.Series(data_set[LABEL].values),
        num_epochs=num_epochs,
        shuffle=shuffle)

Note that the input data is passed to the input_fn via the data_set parameter, which means the function can process any of the DataFrames you've imported: training_set, test_set, and prediction_set.

Two additional arguments are provided:

num_epochs:

Controls the number of epochs to iterate over the data. For training, set this to None, so the input_fn keeps returning data until the required number of train steps is reached. For evaluation and prediction, set it to 1, so the input_fn iterates over the data once and then raises OutOfRangeError. That error signals the Estimator to stop evaluating or predicting.

shuffle:

Whether to shuffle the data. For evaluation and prediction, set this to False, so the input_fn iterates over the data sequentially. For training, set it to True.
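The epoch behavior described above can be sketched with a plain-Python generator (illustrative only; the real input_fn produces batches of Tensors and raises OutOfRangeError rather than simply exhausting an iterator):

```python
import itertools

def iterate_epochs(data, num_epochs=None):
    """Yield rows of data; make num_epochs passes, or repeat forever if None."""
    epoch = 0
    while num_epochs is None or epoch < num_epochs:
        for row in data:
            yield row
        epoch += 1

# num_epochs=1: one pass over the data, then the iterator is exhausted
# (the analogue of OutOfRangeError stopping evaluate/predict).
one_pass = list(iterate_epochs([1, 2, 3], num_epochs=1))

# num_epochs=None: yields indefinitely, so training can draw as many
# batches as `steps` requires; here we just take the first 7 values.
endless = list(itertools.islice(iterate_epochs([1, 2, 3], num_epochs=None), 7))
```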

Training the Regressor

To train the neural network regressor, run train with the training_set passed to the input_fn:

regressor.train(input_fn=get_input_fn(training_set), steps=5000)
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into /tmp/boston_model/model.ckpt.
INFO:tensorflow:loss = 47975.1, step = 1
INFO:tensorflow:global_step/sec: 531.179
INFO:tensorflow:loss = 11445.7, step = 101 (0.189 sec)
...
INFO:tensorflow:global_step/sec: 598.99
INFO:tensorflow:loss = 5013.65, step = 4901 (0.167 sec)
INFO:tensorflow:Saving checkpoints for 5000 into /tmp/boston_model/model.ckpt.
INFO:tensorflow:Loss for final step: 5582.02.

<tensorflow.python.estimator.canned.dnn.DNNRegressor at 0x7f3beb064ef0>

Evaluating the Model

Next, see how the trained model performs against the test set. Run evaluate, this time passing test_set to the input_fn:

ev = regressor.evaluate(
    input_fn=get_input_fn(test_set, num_epochs=1, shuffle=False))
INFO:tensorflow:Starting evaluation at 2017-12-18-02:53:47
INFO:tensorflow:Restoring parameters from /tmp/boston_model/model.ckpt-5000
INFO:tensorflow:Finished evaluation at 2017-12-18-02:53:47
INFO:tensorflow:Saving dict for global step 5000: average_loss = 12.5115, global_step = 5000, loss = 1251.15
loss_score = ev['loss']
print('Loss: {0:f}'.format(loss_score))
Loss: 1251.153320
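As a sanity check on these numbers: the reported loss is the total loss over the evaluation pass, while average_loss is the per-example mean squared error. Assuming the test set has 100 examples (the size of boston_test.csv in this tutorial), the two figures are consistent, and the square root of average_loss gives a rough per-prediction error in the label's units:

```python
import math

total_loss = 1251.153320   # 'loss' from the evaluate output above
num_test_examples = 100    # assumption: boston_test.csv has 100 rows

# Per-example mean squared error; should match average_loss = 12.5115 above.
average_loss = total_loss / num_test_examples

# Root mean squared error: a rough typical prediction error (~3.5).
rmse = math.sqrt(average_loss)
```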

Making Predictions

Finally, use the model to predict median house values for the prediction_set, which contains six examples with features but no labels:

y = regressor.predict(
    input_fn=get_input_fn(prediction_set, num_epochs=1, shuffle=False))

# .predict() returns an iterator of dicts; convert to a list and print
# predictions
predictions = list(p['predictions'] for p in itertools.islice(y, 6))
print('Predictions: {}'.format(str(predictions)))
INFO:tensorflow:Restoring parameters from /tmp/boston_model/model.ckpt-5000
Predictions: [array([ 33.13015747], dtype=float32), array([ 19.69280624], dtype=float32), array([ 23.51768494], dtype=float32), array([ 36.13526535], dtype=float32), array([ 15.75910473], dtype=float32), array([ 19.81415749], dtype=float32)]