End-to-End Machine Learning

Tesfaye Betemariam
8 min read · Feb 22, 2022

Part I

Beyond academic curiosity, the end goal of exploring machine learning (ML) or artificial intelligence (AI) algorithms is to apply them to solving business problems, be it in drug manufacturing, credit approval, stock trading, urban planning, sports prediction, supply chain planning, weather prediction, government spending and its impacts, or predicting the path of an asteroid toward Earth.

Because the end goal is solving real problems, implementing an ML/AI solution requires a good understanding of the business and its workflows, just as implementing a traditional software solution requires some domain knowledge. Once the implementers are armed with a good dose of business domain knowledge, the data presented to them will start to make sense. In Part 1 of this article, we lay out the introductory workflow of taking a machine learning system from framing the problem statement to deploying the solution in a production environment and monitoring it. In subsequent articles, we will look at the various deployment environments and strategies in more detail.

The ML/AI life cycle can be summarized into the following five stages.

The life cycle of a ML/AI system

1. Ideation

This is inception: the stage where we ask questions and set up the problem statements. It is probably the most important stage in exploiting the power of ML/AI to solve known business problems or even to create new business opportunities. Sometimes a problem statement can be formed from the available data. In most cases, however, we ask questions not because we have data but because we have a business problem to solve.

At this foundational stage, an individual such as an “AI liaison” or the ML/AI team lead can gather preliminary information and coordinate brainstorming sessions among stakeholders such as end users, subject-matter experts (SMEs), data analysts, and system developers. It may also be beneficial to let everyone know they can submit any problem they think can be tackled with ML/AI techniques. The goal is to identify problems that can be solved, or areas of the business workflow that can be enhanced, using the latest tools ML/AI provides. Once the problem statements are reasonably well defined, we move to the next stage, data collection and preparation. This phase can be revisited later to expand on or refine the problem statements.

2. Data Preparation

Data, preferably in large quantities, is at the center of machine learning model training, whether the problem being solved is simple or complex. Because data is the main driver of an ML/AI solution, it is worth taking the time and effort to get this stage done right. It can involve the following data curation steps:

  • Data collection: because we are dealing with a large amount of data, this is usually done through integration with a data pipeline. However, data can be acquired through any available means.
  • Data visualization: the collected data may appear opaque until we start to see what is inside it. Data visualization tools help us explore it using visual elements such as graphs, charts, and maps to identify patterns, find outliers, and look at trends. There are powerful data visualization libraries in Python, R, and other ML/AI programming environments.
  • Data cleansing: this involves cleaning, normalizing, and appropriately encoding the data. Cleaning may involve removing duplicates, isolating erroneous data, and removing records with null fields or filling null values with an average. Normalization involves formatting the data fields (columns) so that the data looks and reads the same way across all rows (records). It can also mean transforming numeric data to a common scale without distorting the relative differences; for example, we can rescale the values of a column to the range 0 to 1, where the lowest value becomes 0 and the largest becomes 1. Data encoding in the context of ML/AI is the conversion of categorical variables into numerical ones; for example, a column with ‘YES’ and ‘NO’ values can be converted to 1 and 0.
  • Feature engineering: this is the process of creating, transforming, extracting, and selecting variables (features) to make a dataset suitable for ML/AI input. It mostly involves identifying features, combining two or more of them to create a new one, and ranking features to select the most relevant.
  • Label creation: this is the process of adding meaningful information, or labels, that provide context to data records. Basically, we create one or more extra columns and put the label information there, so that the ML/AI model can learn from it. For example, chest X-ray records can be labeled by expert radiologists to indicate whether each X-ray shows COVID-19. When those labeled X-rays are used to train an ML/AI model, the model learns what kinds of X-rays indicate COVID-19.

There are a number of tools for the data collection and cleanup processes. For example, Python libraries such as NumPy, pandas, and Matplotlib provide much of the functionality needed for the data visualization, cleaning, and encoding tasks.
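The cleaning, normalization, and encoding steps above can be sketched with pandas; the records and column names here are made up purely for illustration:

```python
import pandas as pd

# Made-up raw records with a duplicate row, a null value, and a
# categorical column.
raw = pd.DataFrame({
    "age":    [25, 25, 40, None, 31],
    "income": [30000, 30000, 80000, 55000, 45000],
    "smoker": ["NO", "NO", "YES", "NO", "YES"],
})

df = raw.drop_duplicates().copy()               # cleaning: drop duplicate rows
df["age"] = df["age"].fillna(df["age"].mean())  # fill nulls with the column mean

# Normalization: min-max rescale income into the 0..1 range.
lo, hi = df["income"].min(), df["income"].max()
df["income"] = (df["income"] - lo) / (hi - lo)

# Encoding: map the categorical YES/NO column to 1/0.
df["smoker"] = df["smoker"].map({"YES": 1, "NO": 0})
```

After these steps every column is numeric, deduplicated, and free of nulls, which is the shape most model-training code expects.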

3. Model Training

A ML/AI Model Training Iteration

This is the core of the ML/AI solution formulation. It involves the following steps:

  • Choosing a model: based on the nature of the problem we are trying to solve, we select one or more candidate models. We will select the best performer in the end.
  • Feeding the data to the model: training the model involves passing our training data through the model repeatedly, correcting the training error (the discrepancy between model output and expected values) each time.
  • Measuring model performance: this is the step where the training setup compares the model output with the expected values. There are several metrics for this, such as mean squared error (MSE), root mean squared error (RMSE), and log loss.
  • Fine-tuning model parameters: these parameters fall into two classes: the learnable parameters (weights and biases) and the model hyperparameters (for example, batch size and maximum epochs). The learnable parameters are tuned by the training process after each cycle through the dataset, using the error measured in the performance step to correct the weights and biases for the next cycle. Setting and fine-tuning the hyperparameters, however, is subjective and is done by the model developers, which is why hyperparameter tuning is considered an art.
  • Comparing and selecting the best model: if we chose multiple models for training, we can compare their respective performances on common test data and select the best performer in the context of the problem we are solving. There are several performance metrics, and excelling on one measure does not necessarily mean we have the best model; a metric appropriate to the problem being solved has to be used. For example, if a model is intended to predict whether a person has COVID-19, we should not compare models with a metric such as precision, which is not affected by false negatives. Sensitivity (recall) is helpful here because it is affected by false negatives: in screening for COVID-19, we can tolerate false positives (a healthy person wrongly labeled as having COVID-19) more than false negatives (an infected person told they are free of it). Useful metrics include MAE, RMSE, accuracy, precision, recall (sensitivity), F1-score, and AUC-ROC.
  • Making predictions: finally, we can use our chosen model to solve the prediction, classification, or clustering problem we built it for. We are now ready to put the system in a production environment.
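The feed–measure–correct loop described in the steps above can be sketched in a few lines of NumPy; the linear model and synthetic data here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data for a made-up linear problem: y = 3x + 2 + noise.
X = rng.uniform(-1, 1, size=(100, 1))
y = 3 * X[:, 0] + 2 + rng.normal(0, 0.1, size=100)

w, b = 0.0, 0.0   # learnable parameters (weight and bias)
lr = 0.1          # learning rate: a hyperparameter set by the developer

def mse(pred, target):
    """Mean squared error between model output and expected values."""
    return float(np.mean((pred - target) ** 2))

initial_loss = mse(w * X[:, 0] + b, y)

# Each cycle: feed the data through, measure the error, correct w and b.
for epoch in range(200):
    pred = w * X[:, 0] + b
    grad_w = np.mean(2 * (pred - y) * X[:, 0])
    grad_b = np.mean(2 * (pred - y))
    w -= lr * grad_w
    b -= lr * grad_b

final_loss = mse(w * X[:, 0] + b, y)
```

Real frameworks automate the gradient computation and batching, but the shape of the loop is the same.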

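The metric-choice point above can be made concrete with toy confusion-matrix counts for a hypothetical COVID-19 screening model (the numbers are invented):

```python
# Invented confusion-matrix counts: true/false positives and negatives.
tp, fp, fn, tn = 80, 15, 5, 900

precision = tp / (tp + fp)  # unaffected by false negatives
recall = tp / (tp + fn)     # sensitivity: penalized by false negatives
f1 = 2 * precision * recall / (precision + recall)
```

A model that misses infected patients (larger `fn`) keeps its precision unchanged while its recall drops, which is exactly why recall is the more informative metric for this screening problem.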
4. Solution Deployment

Training an ML/AI model can be computationally expensive; it may take hours or days to train a model. Therefore, there must be a way of persisting the trained model so that we can reuse it whenever needed, such as by deploying it in a production environment. In subsequent parts of this article (Part 2, Part 3), we intend to show some practical examples. Here, we will only give a high-level introduction to some of the techniques and technologies used to deploy trained ML/AI models to production.

After taking the time and effort to train our model, we need to persist it so we can use it for the job it was trained for without spending hours or days retraining it before each use. Different ML/AI platforms take different approaches, but the basic operations are similar: save the trained model to a file, then load it from that file whenever it needs to run.
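As a framework-agnostic sketch of the save-and-reload idea, Python's standard pickle module illustrates the basic operations; the dictionary standing in for a trained model is made up, and real platforms use their own formats, as described next:

```python
import os
import pickle
import tempfile

# Stand-in for a trained model; in practice this would be a framework object.
trained_model = {"weights": [0.42, -1.3], "bias": 2.0}

path = os.path.join(tempfile.gettempdir(), "model.pkl")

with open(path, "wb") as f:   # persist once, after training finishes
    pickle.dump(trained_model, f)

with open(path, "rb") as f:   # reload in the serving environment
    restored = pickle.load(f)
```

The reloaded object is ready to serve predictions with no retraining.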

Persisting trained models/parameters and using them in deployment

Consider three prominent platforms for developing and deploying ML/AI systems. AWS SageMaker provides a central repository of trained models, called the model registry, for persisting and reloading them. PyTorch uses TorchServe: a model archive file (.mar) is created from the serialized parameter files (.pt) and the model definition file (.py), and the archive is later served by TorchServe.

TensorFlow, on the other hand, provides two options for persisting trained models and reloading them at a later time, possibly in a different environment, for execution.

  • Persist checkpoints: this involves persisting all the parameter values used by the model during training. Checkpoints contain only parameters and carry no knowledge of what program generated them or how they should be used. As a result, they can only be used by the source program that generated them, and must be deployed along with that original model program.
  • Persist the trained model: in this option, we persist the parameter values along with the objects and a description of the computation defined by the model we trained.

Finally, a production deployment may need to fulfill additional requirements, such as security and compliance requirements.

5. Monitoring

We cannot just deploy our ML/AI models and forget them. Even traditional software can fail in production for various reasons, including bad input data, and needs some monitoring. ML/AI systems need even more monitoring because they depend more heavily on the data pipeline: their program logic is derived from the data itself. As time goes on, the data stream the model operates on in production starts to deviate from the dataset the model was trained on. This causes model degradation, and performance drifts away from what it was at deployment time.
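A toy sketch of the monitoring idea: compare a live feature's distribution against its training-time baseline and alert when it shifts too far. The numbers, the statistic, and the threshold here are all illustrative; production systems use richer statistical tests over many features:

```python
import statistics

# Hypothetical feature values: training-time baseline vs. live traffic.
train_values = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2]
live_values = [12.0, 11.8, 12.3, 12.1, 11.9, 12.2]

def mean_shift(reference, current):
    """Shift of the current mean, in units of the reference std deviation."""
    ref_mean = statistics.mean(reference)
    ref_std = statistics.stdev(reference)
    return abs(statistics.mean(current) - ref_mean) / ref_std

DRIFT_THRESHOLD = 3.0  # an arbitrary alerting threshold for this sketch
drifted = mean_shift(train_values, live_values) > DRIFT_THRESHOLD
```

When `drifted` flips to true, the team is alerted that the live data no longer looks like the training data and retraining may be due.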

This Fortune article has a good example of data drift in turn causing model drift. Here is how they described it:

The accuracy of a metric used to evaluate how many items are found at a store dropped to 61% from 93%, tipping off the Instacart engineers that they needed to re-train their machine learning model that predicts an item’s availability at a store. After all, customers could get annoyed being told one thing — the item that they wanted was available — when in fact it wasn’t, resulting in products never being delivered. “A shock to the system” is how Instacart’s machine learning director Sharath Rao described the problem to Fortune.

So from this stage, we go back to ideation, refresh the training data, re-train the model, re-deploy it, and keep monitoring. That is the life cycle of an ML/AI system.

Thank you for reading!


Tesfaye Betemariam

Solutions architect at the Center for Organizational Excellence; Full-stack application development, Data Science, AI/Machine Learning.