TensorVue Callback¶

TensorFlow is an open-source machine learning library developed by Google, which allows you to design and create custom Machine Learning algorithms. These algorithms can take a long time to train, with accuracy and loss statistics reported after each epoch. To make it easy to keep track of these statistics as the training progresses, a Keras Callback has been created to upload information about the training of any Tensorflow Keras model to Simvue.

Further-docs

To view a detailed example of monitoring the training of a Tensorflow ML algorithm using the TensorVue callback, see the example here.

What is tracked¶

By default, the TensorVue callback will create a Simulation run, which represents the training of the entire model and contains statistics collected after each training epoch, and a series of Epoch runs, which contains statistics for a specific epoch collected after each training batch (this can be disabled if desired). If you have a separate validation session using model.evaluate, then an Evaluation run will also be created. The following things are tracked by the TensorVue callback:

Uploads the Python script creating the model as a Code Artifact
Uploads the model config as an Input Artifact
Uploads parameters about the model as Metadata
Uploads the Training Accuracy and Loss after each batch to an Epoch run
Uploads the Training and Validation Accuracy and Loss after each Epoch to the Simulation run
Uploads model checkpoints after each Epoch to the corresponding Epoch run as Output Artifacts(if enabled by the user)
Uploads the final model to the Simulation run as an Output Artifact

Usage¶

To use the TensorVue class, you must have the simvue_integrations repository installed. Create a virtual environment if you haven't already:

python -m venv venv
source venv/bin/activate

Then install the repository using pip:

pip install git+https://github.com/simvue-io/integrations.git@main#egg=simvue-integrations[tensorflow]

Before beginning training for your Tensorflow model, you need to create an instance of the TensorVue class. This class can take the following arguments:

run_name: Name of the Simvue run to create
run_folder: Name of the folder to store the run in, will create a folder with the same name as the run if not specified
run_description: Description of the run, optional
run_tags: List of tags associated with the run, optional
run_metadata: Metadata associated with the run, optional
run_mode: Whether Simvue should run in Online or Offline mode, by default Online
alert_definitions: Definitions of any alerts to add to the run as a dictionary of key/value pairs, optional
manifest_alerts: If using the Optimisation framework, which of the alerts defined above to add to the manifest run, by default None
simulation_alerts: Which of the alerts defined above to add to the simulation run, by default None
epoch_alerts: Which of the alerts defined above to add to the epoch runs, by default None
evaluation_alerts: Which of the alerts defined above to add to the evaluation runs, by default None
start_alerts_from_epoch: If epoch alerts are enabled, the number of the epoch which you would like to begin setting alerts for, by default 0
script_filepath: Path of the file to upload as Code to the simulation run, by default uses the file where the callback was instantiated
model_checkpoint_filepath: If using the ModelCheckpoint callback, the path where the checkpoint files are saved after each epoch, optional
model_final_filepath: The location where the final model should be stored after training is complete, by default /tmp/simvue/final_model.keras
evaluation_parameter: The parameter to check the value of after each Epoch, either 'accuracy', 'loss', 'val_accuracy', or 'val_loss', optional
evaluation_target: The target value of the parameter, which will cause the training to stop if satisfied, optional
evaluation_condition: How you wish to compare the latest value of the parameter to the target value, either '<', '>', '<=', '>=', '==', optional
create_epoch_runs: bool, Whether to create runs for the training data for each Epoch individually, by default True
optimisation_framework: Whether to use the Simvue ML Optimisation framework, by default False
simulation_run: If using the ML Opt framework and this callback is being called within the simulation function, the 'data' run which has been created by the framework for this trial, by default None
evaluation_run: If using the ML Opt framework and this callback is being called within the evaluation function, the 'eval' run which has been created by the framework for this trial, by default None

Your Python script may look something like this:

from tensorflow import keras
from simvue_integrations.plugins.tensorflow import TensorVue

# Define your model
model = keras.Sequential()
model.add(keras.layers.Flatten(input_shape=(28, 28)))
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.01),
            loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            metrics=['accuracy'])

# Load your training data        
img_train, label_train, img_test, label_test = keras.datasets.fashion_mnist.load_data()

# Initialise your callback - minimum required is the Simvue run name, but can include any other details described above
tensorvue = sv_tf.TensorVue("recognising_clothes")

# Fit the model, using the tensorvue callback
model.fit(
    img_train,
    label_train,
    epochs=10,
    validation_split=0.2,
    callbacks=[tensorvue,]
)

# Evaluate the model, again using the tensorvue callback
results = model.evaluate(
    img_test,
    label_test,
    callbacks=[tensorvue,]
)

Adding Functionality¶

If you wish to store more data than the default TensorVue callback provides, you can create your own callback class which inherits from TensorVue. For detailed information on creating your own custom callbacks, see this guide.

For example, say you wanted the callback to upload the final accuracy and loss measurements as metadata to the Simvue run. To do this we will inherit from TensorVue, but override the on_train_end() method to add our new functionality:

class MyTensorVue(sv_tf.TensorVue):
    # This method will be called whenever a training session ends
    def on_train_end(self, logs):

        # Accuracy and Loss measurements are stored in `logs`:
        final_measurements = {
            "final_accuracy": logs.get("accuracy"),
            "final_loss": logs.get("loss")
        }

        # You can then access the Simulation run to upload these values to through `self.simulation_run`
        # Any of the methods available in the standard `simvue.Run` class are available here
        self.simulation_run.update_metadata(final_measurements)

        # Don't forget to then call the base TensorVue method!
        super().on_train_end(logs)