Creating your first pipeline¶
This is the notebook version of our Quickstart! Our goal here is to give you some first hands-on experience with ZenML and a brief overview of its basic functionality.
In this example, we will create a simple pipeline featuring a local CSV dataset and a basic feedforward neural network, and run it in our local environment. If you want to run this notebook in an interactive environment, feel free to run it in Google Colab.
First things first…¶
You can install ZenML through:
pip install zenml
Once the installation is complete, you can go ahead and create your first ZenML repository for your project. As ZenML repositories are built on top of Git repositories, you can create yours in an empty directory of your choice through:
git init
zenml init
Now the setup is complete. For the next steps, just make sure that you are executing the code within your ZenML repository.
Creating the pipeline¶
Once you have set everything up, we can start our tutorial. The first step is to create an instance of a pipeline. ZenML comes equipped with different types of pipelines, but for this example we will be using the most classic one, namely a TrainingPipeline.
While creating your pipeline, you can give it a name and use that name to reference the pipeline later.
from zenml.pipelines import TrainingPipeline
training_pipeline = TrainingPipeline(name='QuickstartPipeline')
In a ZenML TrainingPipeline, there is a fixed set of steps representing the processes found in any machine learning workflow. These steps include:
Split: responsible for splitting your dataset into smaller datasets such as train, eval, etc.
Transform: responsible for the preprocessing of your data
Train: responsible for the model creation and training process
Evaluate: responsible for the evaluation of your results
Creating a datasource¶
However, before we dive into the aforementioned steps, let’s briefly talk about our dataset.
For this quickstart, we will be using the Pima Indians Diabetes Dataset to train a model that predicts whether a person has diabetes based on diagnostic measures.
In order to use this dataset (which is currently in CSV format) in your ZenML pipeline, we first need to create a datasource. ZenML has built-in support for various types of datasources, and for this example you can use the CSVDatasource. All you need to provide is a name for the datasource and the path to the CSV file.
from zenml.datasources import CSVDatasource
ds = CSVDatasource(name='Pima Indians Diabetes Dataset',
                   path='gs://zenml_quickstart/diabetes.csv')
Once this is done, you will have created a tracked and versioned datasource that you can use in any pipeline. Go ahead and add it to your pipeline.
training_pipeline.add_datasource(ds)
Configuring the split¶
Now, let us get back to the four essential steps, the first of which is the Split.
For the sake of simplicity in this tutorial, we will be using a completely random 70-30 split into a train and an evaluation dataset.
from zenml.steps.split import RandomSplit
training_pipeline.add_split(RandomSplit(split_map={'train': 0.7,
                                                   'eval': 0.3}))
Keep in mind that a more complicated example might call for a different splitting strategy. For these cases, you can use one of the other built-in split configurations ZenML offers or even implement your own custom logic in the split step.
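To make the idea of a random split by fractions more concrete, here is a minimal plain-Python sketch. It is not ZenML's implementation of RandomSplit; the function name random_split, the seed parameter and the rounding rule are assumptions made purely for illustration.

```python
import random


def random_split(rows, split_map, seed=42):
    """Shuffle rows and cut them into named splits (hypothetical helper)."""
    if abs(sum(split_map.values()) - 1.0) > 1e-9:
        raise ValueError("split fractions must sum to 1")

    shuffled = list(rows)
    # A seeded generator keeps the illustration reproducible.
    random.Random(seed).shuffle(shuffled)

    splits = {}
    names = list(split_map)
    start = 0
    for i, name in enumerate(names):
        if i == len(names) - 1:
            # The last split takes the remainder so that no row is dropped.
            end = len(shuffled)
        else:
            end = start + round(split_map[name] * len(shuffled))
        splits[name] = shuffled[start:end]
        start = end
    return splits


# 100 dummy row ids stand in for the rows of a dataset.
splits = random_split(range(100), {'train': 0.7, 'eval': 0.3})
```

Each row lands in exactly one split, which is the property that matters when an evaluation set has to stay disjoint from the training set.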
Handling data preprocessing¶
The next step is to configure Transform, the data preprocessing step.
For this example, we will use the built-in StandardPreprocesser. It handles the feature selection and has sane default preprocessing behaviour for each data type, such as standardization for numerical features or vocabularization for non-numerical features.
In order to use it, you need to provide a list of feature names and a list of label names. Moreover, if you do not want it to use the default transformation for a feature, you can overwrite it with a different preprocessing method, as we do in this example.
from zenml.steps.preprocesser import StandardPreprocesser
training_pipeline.add_preprocesser(
    StandardPreprocesser(
        features=['times_pregnant',
                  'pgc',
                  'dbp',
                  'tst',
                  'insulin',
                  'bmi',
                  'pedigree',
                  'age'],
        labels=['has_diabetes'],
        overwrite={'has_diabetes': {
            'transform': [{'method': 'no_transform',
                           'parameters': {}}]}}))
Much like the splitting process, you might run into cases where the capabilities of the StandardPreprocesser do not match your task at hand. In this case, you can create your own custom preprocessing step, but we will go into that topic in a different tutorial.
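As a rough illustration of the two default transformations mentioned above, the following plain-Python sketch standardizes a numerical column and maps a non-numerical column to integer ids. It only shows the idea and is not the code the StandardPreprocesser runs; the helper names, the sample values and the id ordering (first appearance) are assumptions.

```python
from statistics import mean, pstdev


def standardize(values):
    """Scale values to zero mean and unit variance (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def vocabularize(values):
    """Replace each distinct string with an integer id.

    Ids are assigned in order of first appearance, which is an
    assumption made for this sketch.
    """
    vocabulary = {}
    ids = []
    for v in values:
        if v not in vocabulary:
            vocabulary[v] = len(vocabulary)
        ids.append(vocabulary[v])
    return ids, vocabulary


ages = [21, 35, 50, 62]                 # made-up numerical feature
colors = ['red', 'blue', 'red', 'green']  # made-up non-numerical feature

z_scores = standardize(ages)
color_ids, color_vocab = vocabularize(colors)
```

The no_transform override used for has_diabetes in the pipeline corresponds to skipping both helpers and passing the label through unchanged.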
Training your model¶
As the data is now ready, we can move on to Train, the model creation and training step.
For this quickstart, we will be using the simple built-in TFFeedForwardTrainer step. As the name suggests, it represents a feedforward neural network, which is configurable through a set of parameters.
from zenml.steps.trainer import TFFeedForwardTrainer
training_pipeline.add_trainer(TFFeedForwardTrainer(loss='binary_crossentropy',
                                                   last_activation='sigmoid',
                                                   output_units=1,
                                                   metrics=['accuracy'],
                                                   epochs=20))
Of course, not every machine learning problem is solvable by a simple feedforward neural network, and most of the time a problem will require a model tailored to it. That is why we created an interface where users can implement their own custom models and integrate them in a trainer step. However, this approach is not within the scope of this tutorial; you can learn more about it in our docs and the upcoming tutorials.
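The trainer configuration above pairs a single sigmoid output unit with a binary cross-entropy loss. The plain-Python sketch below shows what these two pieces compute for a handful of made-up values. It is independent of ZenML and TensorFlow, and the clipping constant eps is an assumption chosen for numerical safety.

```python
import math


def sigmoid(z):
    """Squash a raw model output into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over a batch of predictions."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip probabilities so that log(0) can never occur.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


raw_outputs = [2.0, -1.0, 0.0]  # made-up outputs of the last layer
labels = [1, 0, 1]              # made-up ground truth

probabilities = [sigmoid(z) for z in raw_outputs]
loss = binary_crossentropy(labels, probabilities)
```

A confident and correct prediction contributes a loss close to zero, while a prediction of 0.5 contributes log(2) regardless of the label.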
Evaluation of the results¶
The last step to configure in our pipeline is Evaluate.
For this example, we will be using the built-in TFMAEvaluator, which uses TensorFlow Model Analysis to compute metrics based on your results (possibly within slices).
from zenml.steps.evaluator import TFMAEvaluator
training_pipeline.add_evaluator(
    TFMAEvaluator(slices=[['has_diabetes']],
                  metrics={'has_diabetes': ['binary_crossentropy',
                                            'binary_accuracy']}))
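To illustrate what computing a metric "within slices" means, here is a small plain-Python sketch that groups evaluation examples by the value of a slicing column and computes binary accuracy per group. It does not use TensorFlow Model Analysis; the function name, the example records and the 0.5 decision threshold are assumptions for illustration only.

```python
from collections import defaultdict


def sliced_binary_accuracy(examples, slice_key, threshold=0.5):
    """Return binary accuracy for each distinct value of slice_key."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for example in examples:
        predicted = 1 if example['prediction'] > threshold else 0
        counts[example[slice_key]] += 1
        hits[example[slice_key]] += int(predicted == example['label'])
    return {value: hits[value] / counts[value] for value in counts}


# Made-up evaluation records: the slicing column, the label and the
# predicted probability of the positive class.
examples = [
    {'has_diabetes': 1, 'label': 1, 'prediction': 0.8},
    {'has_diabetes': 1, 'label': 1, 'prediction': 0.3},
    {'has_diabetes': 0, 'label': 0, 'prediction': 0.1},
    {'has_diabetes': 0, 'label': 0, 'prediction': 0.4},
]

per_slice = sliced_binary_accuracy(examples, 'has_diabetes')
```

Slicing by the label column, as in this quickstart, separates how well the model does on positive cases from how well it does on negative ones.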
Running your pipeline¶
Now that everything is set, go ahead and run the pipeline, which executes all of your steps.
training_pipeline.run()
As the pipeline executes, you should see logs informing you about each step along the way. In more detail, your dataset is first ingested by the component DataGen and then split by the component SplitGen. Afterwards, data preprocessing takes place in the component Transform, which leads to the main training component Trainer. Finally, the results are evaluated by the component Evaluator.
Post-training functionalities¶
Once the training pipeline is finished, you can check the outputs of your pipeline in different ways.
Dataset¶
As the data is now ingested, you can go ahead and take a peek into your dataset. You can achieve this by simply getting the datasources registered to your repository and calling the method sample_data.
from zenml.repo import Repository
repo = Repository.get_instance()
datasources = repo.get_datasources()
datasources[0].sample_data()
Statistics¶
Furthermore, you can check the statistics which are yielded by your datasource and split configuration through the method view_statistics. By using the magic flag, we can even achieve this right here in this notebook.
training_pipeline.view_statistics(magic=True)
Evaluate¶
On the other hand, if you want to evaluate the results of your training process, you can use the evaluate method of your pipeline.
Much like view_statistics, if you execute evaluate with the magic flag, it will help you continue in this notebook by generating two new cells, each set up with a different evaluation tool:
TensorBoard can help you understand the behaviour of your model during the training session
TFMA or tensorflow_model_analysis can help you assess your already trained model based on given metrics and slices on the evaluation dataset
Note: if you want to see the sliced results, uncomment the last line and adjust it according to the slicing column. In the end it should look like this:
tfma.view.render_slicing_metrics(evaluation, slicing_column='has_diabetes')
training_pipeline.evaluate(magic=True)
… and this is it for the quickstart. If you got here without a hiccup, you have successfully installed ZenML, set up a ZenML repo, registered a new datasource, configured a training pipeline, executed it locally and evaluated the results. And this is just the tip of the iceberg of ZenML's capabilities.
However, if you had a hiccup or you have suggestions or questions regarding our framework, you can always check our docs or our GitHub, or even better, join us on our Slack channel.
Cheers!