A machine learning cheat sheet, because there are too many ways to do any one thing.
This page is intended to be a living document recording an opinionated and sufficient subset of the scaffolding required to build a production-level project on TensorFlow or PyTorch. Currently the section on TensorFlow is complete, and a PyTorch overview is underway.
Snippets below are aimed at advanced uses of TF and PyTorch. Frameworks like Keras/FastAI/Lightning/Catalyst are intentionally excluded.
TensorFlow 2.x
Preparing Data
Data for TF/Keras models is best handled as tf.data.Dataset objects.
Creating Datasets
Datasets can be created by using the from_tensors or from_tensor_slices methods which, despite their names, take any tensor-ish object as input. This includes NumPy arrays, Python lists, and TF tensors.
import tensorflow as tf

tensor = tf.constant([[1, 2], [3, 4]])
dataset = tf.data.Dataset.from_tensors(tensor) # [[1, 2], [3, 4]]: one element of shape (2, 2)
dataset = tf.data.Dataset.from_tensor_slices(tensor) # [1, 2], [3, 4]: two elements of shape (2,)
Processing Datasets
Often we’ll want to take a list of filenames and process, say, the images in those files. To do this, map a parsing function over the dataset. Specify num_parallel_calls=tf.data.experimental.AUTOTUNE when mapping to let TF use built-in heuristics to parallelize the mapping.
import glob

def parse_function(fname):
    # Read the raw bytes, then decode them into an image tensor
    parsed_example = tf.io.read_file(fname)
    image = tf.io.decode_jpeg(parsed_example)
    return image

fnames = glob.glob('images/*.jpg')
dataset = tf.data.Dataset.from_tensor_slices(fnames)
# map is a method of the dataset instance, not of the Dataset class
dataset = dataset.map(parse_function, num_parallel_calls=tf.data.experimental.AUTOTUNE)
Shuffle, repeat, and batch the dataset when training for multiple epochs. Call repeat before batch so that batches stay full even when the dataset size is not a multiple of the batch size.
dataset = dataset.repeat().shuffle(buffer_size=100, seed=0)
dataset = dataset.batch(batch_size).prefetch(tf.data.experimental.AUTOTUNE)
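To see why the order of repeat and batch matters, here is a plain-Python sketch that mimics both orderings on a ten-element dataset with a batch size of four. The `batched` helper is a hypothetical stand-in for `Dataset.batch`, not part of TF; it only illustrates the batch sizes each ordering produces.

```python
from itertools import chain, islice

def batched(items, batch_size):
    """Hypothetical stand-in for Dataset.batch: group items, keeping a short final batch."""
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

data = list(range(10))   # dataset size 10 is not a multiple of the batch size
batch_size = 4
epochs = 2

# repeat -> batch: epochs are concatenated first, so batches span the epoch boundary
repeat_then_batch = list(batched(chain.from_iterable([data] * epochs), batch_size))

# batch -> repeat: each epoch is batched on its own, leaving a short batch per epoch
batch_then_repeat = [b for _ in range(epochs) for b in batched(data, batch_size)]

sizes_repeat_first = [len(b) for b in repeat_then_batch]
sizes_batch_first = [len(b) for b in batch_then_repeat]
print(sizes_repeat_first)  # [4, 4, 4, 4, 4]
print(sizes_batch_first)   # [4, 4, 2, 4, 4, 2]
```

With an unbounded repeat() the stream never ends, so every batch is full. The trade-off is that some batches mix elements from two consecutive epochs.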
It’s often helpful to combine labels and data with zip:
image_data = [[1,2], [3,4]]
label_data = ['apple', 'banana']
image_dataset = tf.data.Dataset.from_tensor_slices(image_data) # [1,2], [3,4]
label_dataset = tf.data.Dataset.from_tensor_slices(label_data) # 'apple', 'banana'
final_dataset = tf.data.Dataset.zip((image_dataset, label_dataset)) # ([1,2], 'apple'), ([3,4], 'banana')
See the TF Data Performance Guide for info on optimizing dataset operations. In general: interleave when you have multiple datasets, batch before map, cache when possible.
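As a rough illustration of two of these tips, the sketch below batches before mapping, so the mapped function runs once per batch rather than once per element, and caches the result so later epochs skip the work. The toy range dataset and the doubling function are placeholders chosen for this example.

```python
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

dataset = tf.data.Dataset.range(8)

# Batch first, then map: the lambda receives a whole batch (a vector of 4 values)
dataset = dataset.batch(4).map(lambda x: x * 2, num_parallel_calls=AUTOTUNE)

# Cache the mapped batches in memory, then overlap producer and consumer work
dataset = dataset.cache().prefetch(AUTOTUNE)

batches = [b.numpy().tolist() for b in dataset]
print(batches)  # [[0, 2, 4, 6], [8, 10, 12, 14]]
```

Batching before map only helps when the mapped function can operate on a whole batch at once. Per-file work such as decoding JPEGs of different sizes still has to be mapped element by element.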
Creating Models
I recommend using tf.keras.Model models even when operating in TensorFlow land. It’s fully compatible and has nice semantics. A very simple model that just wraps ResNet looks like this:
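class MyModel(tf.keras.Model):
    def __init__(self, num_classes=10, name='my_model'):
        super(MyModel, self).__init__(name=name)
        # pooling='avg' reduces the backbone's feature map to one vector per image,
        # so the Dense layer below produces per-image logits
        self.backbone = tf.keras.applications.ResNet101(input_shape=(321,321,3), weights='imagenet', include_top=False, pooling='avg')
        self.classifier = tf.keras.layers.Dense(num_classes, activation=None, kernel_regularizer=None, name='desc_fc')

    def call(self, inputs, training=True):
        # Forward the training flag so batch norm in the backbone behaves correctly
        x = self.backbone(inputs, training=training)
        logits = self.classifier(x)
        return logits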
You can, of course, nest any module-like objects:
class WrapperModel(tf.keras.Model):
    def __init__(self, backbone, name='wrapper_model'):
        # super() must name this class, not MyModel
        super(WrapperModel, self).__init__(name=name)
        self.backbone = backbone

    def call(self, inputs, training=True):
        x = self.backbone(inputs, training=training)
        return x

backbone_model = MyModel()
model = WrapperModel(backbone_model)
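One practical consequence of nesting is that the outer model tracks the inner model's variables. The sketch below uses two tiny hypothetical models (`Inner` and `Outer`, with small Dense layers instead of a ResNet so it runs quickly) to show that the wrapper's trainable_weights include the nested ones.

```python
import tensorflow as tf

class Inner(tf.keras.Model):
    def __init__(self, name='inner'):
        super().__init__(name=name)
        self.dense = tf.keras.layers.Dense(3)

    def call(self, inputs, training=False):
        return self.dense(inputs)

class Outer(tf.keras.Model):
    def __init__(self, backbone, name='outer'):
        super().__init__(name=name)
        self.backbone = backbone          # tracked automatically as a sub-model
        self.head = tf.keras.layers.Dense(2)

    def call(self, inputs, training=False):
        return self.head(self.backbone(inputs, training=training))

model = Outer(Inner())
outputs = model(tf.zeros((1, 4)))         # the first call builds the weights
num_weights = len(model.trainable_weights)
print(num_weights)  # kernel + bias for each of the two Dense layers
```

This is why a single tape.gradient call over model.trainable_weights is enough to train the whole nested stack.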
Training
The main steps when training models are:
- Get the model output
- Compute a loss
- Compute the gradients of the loss with respect to the model weights and apply them
- Repeat
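Stripped of any framework, the steps above can be sketched with a toy one-weight model and a hand-derived gradient, which is the part an automatic differentiation tool takes care of. The data, learning rate, and variable names here are illustrative assumptions.

```python
# Toy data generated from y = 3 * x; the "model" is a single weight w
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

w = 0.0
learning_rate = 0.01
losses = []

for step in range(200):                                   # 4. repeat
    preds = [w * x for x in xs]                           # 1. get the model output
    errors = [p - y for p, y in zip(preds, ys)]
    loss = sum(e * e for e in errors) / len(xs)           # 2. compute a loss (mean squared error)
    # 3. gradient of the loss with respect to w, derived by hand
    grad = sum(2 * e * x for e, x in zip(errors, xs)) / len(xs)
    w -= learning_rate * grad                             # apply it (plain gradient descent)
    losses.append(loss)

print(round(w, 6))
```

The loss shrinks every step and w converges toward 3, the slope used to generate the data.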
With a model and a dataset, computing outputs is simple:
model = create_model(num_classes)
# take(1) returns a Dataset, not a tensor, so pull one batch out with an iterator
batch = next(iter(create_dataset()))
probabilities = model(batch)
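To record execution for automatic differentiation and backprop, use a tf.GradientTape:
optimizer = tf.keras.optimizers.Adam()
# Losses are classes: instantiate once, then call. Use from_logits=True if the model returns raw logits.
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
with tf.GradientTape() as tape:
    # The forward pass and the loss must both run inside the tape to be recorded
    probabilities = model(batch)
    loss = loss_fn(labels, probabilities)
gradients = tape.gradient(loss, model.trainable_weights)
# clip_val is the maximum allowed global gradient norm
clipped, _ = tf.clip_by_global_norm(gradients, clip_norm=clip_val)
optimizer.apply_gradients(zip(clipped, model.trainable_weights))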
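For intuition about the clipping step, here is a plain-Python sketch of global-norm clipping on lists of floats. It follows the formula documented for tf.clip_by_global_norm (scale every gradient by clip_norm / max(global_norm, clip_norm)), but it is a standalone illustration, not TF's implementation.

```python
import math

def clip_by_global_norm(grads, clip_norm):
    """Plain-Python sketch of global-norm clipping over a list of gradient vectors."""
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # Scale is 1 when the norm is already within bounds, otherwise clip_norm / global_norm
    scale = clip_norm / max(global_norm, clip_norm)
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, global_norm

grads = [[3.0, 0.0], [0.0, 4.0]]          # global norm is sqrt(9 + 16) = 5
clipped, norm = clip_by_global_norm(grads, clip_norm=1.0)
print(norm)
print(clipped)
```

Because all gradients are scaled by the same factor, the direction of the overall update is preserved and only its magnitude is capped.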
The whole train loop might look like this:
step = 0
while step < max_steps_count:
    labels, batch = next(train_dataset_iterator)
    with tf.GradientTape() as tape:
        probabilities = model(batch)
        loss = compute_loss(labels, probabilities)
    gradients = tape.gradient(loss, model.trainable_weights)
    clipped, _ = tf.clip_by_global_norm(gradients, clip_norm=clip_val)
    optimizer.apply_gradients(zip(clipped, model.trainable_weights))
    step += 1  # without this the loop never terminates
From there, recording the progress to TensorBoard is easy:
summary_writer = tf.summary.create_file_writer('train_logs', flush_millis=10000)
with summary_writer.as_default():
    # Pass a callable so the condition is re-evaluated on every step in eager mode
    with tf.summary.record_if(
            lambda: tf.math.equal(0, optimizer.iterations % report_interval)):
        while step < max_steps_count:
            # ... (see above)
            tf.summary.scalar(
                'loss/crossentropy', loss, step=optimizer.iterations.numpy())