Machine-Learning-Platform-as-a-Service (ML PaaS) is one of the fastest-growing services in the public cloud. It delivers efficient lifecycle management of machine learning models.
At a high level, there are three phases involved in training and deploying a machine learning model. These phases remain the same whether the model is a classic ML model or an advanced model built on a sophisticated neural network architecture.
Provision and Configure Environment
Before the actual training takes place, developers and data scientists need a fully configured environment with the right hardware and software configuration.
Hardware configuration may include high-end CPUs, GPUs, or FPGAs that accelerate the training process. Configuring the software stack deals with installing a diverse set of frameworks and tools that are specific to the model.
These fully configured environments need to run as a cluster where training jobs may run in parallel. Large datasets need to be made locally available to each of the machines in the cluster to speed up access. Provisioning, configuring, orchestrating, and terminating the compute resources is a complex task.
Development and data science teams rely on internal DevOps teams to tackle this problem. DevOps teams automate the steps through traditional provisioning and configuration tools such as Chef, Puppet, and Ansible. ML training jobs cannot start until the DevOps team hands off the environment to the data science team.
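To make the provisioning burden concrete, here is a minimal sketch in Python using the AWS boto3 SDK, one of many possible automation tools; the AMI ID, key pair name, and cluster sizing are hypothetical placeholders:

```python
# Sketch: provision a small GPU training cluster on EC2 with boto3.
# The AMI ID, key name, and sizing below are hypothetical placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Launch four GPU instances from a pre-baked deep learning image.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # hypothetical deep learning AMI
    InstanceType="p3.2xlarge",         # GPU-backed instance type
    MinCount=4,
    MaxCount=4,
    KeyName="training-cluster-key",    # hypothetical key pair
)
instance_ids = [i["InstanceId"] for i in response["Instances"]]

# Block until every node in the cluster is running.
ec2.get_waiter("instance_running").wait(InstanceIds=instance_ids)

# ... run training jobs, then tear the cluster down to stop billing.
ec2.terminate_instances(InstanceIds=instance_ids)
```

Even this simplified script leaves out networking, storage attachment, and software configuration, which is exactly the work that typically lands on DevOps teams.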
Training & Tuning an ML Model
Once the testbed is ready, data scientists perform the steps of data preparation, training, hyperparameter tuning, and evaluation of the model. This is an iterative process where each step may be repeated multiple times until the results are satisfactory.
During the training and tuning phase, data scientists record multiple metrics, such as the number of nodes in a layer, the number of layers in a deep neural network, the learning rate used by an optimizer, and the scoring technique along with the actual score. These metrics are useful in choosing the right combination of parameters that delivers the most accurate results.
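As an illustration, a scikit-learn grid search surfaces exactly this kind of metric, pairing each hyperparameter combination with its mean validation score; the dataset and parameter grid here are arbitrary examples:

```python
# Sketch: hyperparameter tuning that pairs parameters with scores.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

param_grid = {
    "hidden_layer_sizes": [(32,), (64,), (64, 32)],  # nodes and layers
    "learning_rate_init": [0.001, 0.01],             # optimizer learning rate
}

search = GridSearchCV(MLPClassifier(max_iter=500), param_grid,
                      scoring="accuracy", cv=3)
search.fit(X, y)

# Each parameter combination is reported with its mean validation score.
for params, score in zip(search.cv_results_["params"],
                         search.cv_results_["mean_test_score"]):
    print(params, round(score, 3))
```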
The available frameworks and tools don't include a built-in mechanism for persisting and sharing the metrics critical to this collaborative and iterative training process. Data science teams end up building their own logging engines for recording and tracking critical metrics. Since such an engine is external to the environment, they also need to maintain the logging infrastructure and visualization tools themselves.
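A home-grown logging engine often starts as little more than a script that appends each run's parameters and score to a shared file. A minimal sketch, assuming a JSON-lines file and a hypothetical schema:

```python
# Sketch: a minimal home-grown metric logger that appends one JSON
# record per training run. The file path and fields are hypothetical.
import json
import time

def log_run(params: dict, score: float, path: str = "runs.jsonl") -> None:
    record = {"timestamp": time.time(), "params": params, "score": score}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

log_run({"layers": 3, "learning_rate": 0.01}, score=0.93)
```

Everything beyond this, storage, querying, and dashboards, remains the team's responsibility to build and operate.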
Serving and Scaling an ML Model
Once the data science team arrives at a fully trained model, it is made available to developers to use in production. The model, which is typically a serialized object, needs to be wrapped in a REST web service that can be consumed through standard HTTP client libraries and SDKs.
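A common minimal wrapper uses Flask to load the serialized model and expose a prediction endpoint. The file name and request schema below are illustrative assumptions, with the model assumed to be a pickled scikit-learn-style object:

```python
# Sketch: expose a pickled model as a REST endpoint with Flask.
# The model file name and request schema are illustrative assumptions.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

with open("model.pkl", "rb") as f:   # serialized model artifact
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()                   # {"features": [...]}
    prediction = model.predict([payload["features"]])
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```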
Since models are continuously trained and tuned, data science teams publish new versions often. DevOps teams are expected to implement a CI/CD pipeline to deploy the ML artifacts in production. They may have to run blue/green deployments to identify the best model for production use.
The web service exposing the ML model has to scale to meet consumer demand. It also needs to be highly secure, aligning with the rest of the policies defined by central IT.
To meet these requirements, DevOps teams are turning to containers and Kubernetes to manage the CI/CD pipelines, security, and scalability of ML models. They are using tools such as Jenkins or Spinnaker to integrate the data processing pipeline with software delivery pipelines.
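On Kubernetes, a blue/green cutover can be as simple as repointing a Service selector from the old model version to the new one. Here is a sketch using the official Kubernetes Python client; the service name, namespace, and labels are hypothetical:

```python
# Sketch: blue/green cutover by repointing a Kubernetes Service selector.
# The service name, namespace, and labels are hypothetical.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

# Route traffic to the pods labeled with the "green" model version.
v1.patch_namespaced_service(
    name="model-service",
    namespace="ml-prod",
    body={"spec": {"selector": {"app": "ml-model", "track": "green"}}},
)
```

Both versions stay deployed, so rolling back is a matter of patching the selector back to "blue".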
The Challenge for Developers and Data Scientists
Across the three phases above, development and data science teams find it extremely challenging to deal with the first and the last. Their strength lies in training, tuning, and evolving the most accurate models rather than in dealing with infrastructure and software configuration. The heavy reliance on DevOps teams introduces an additional layer of dependency for them.
Developers are productive when they can use APIs for automating repetitive tasks. Unfortunately, there are no standard, portable, well-defined APIs for the first and the last phases of ML model development and deployment.
The Rise of ML PaaS
ML PaaS delivers the best of both worlds — iterative software development and model management — to developers and data scientists. It removes the friction involved in configuring and provisioning environments for training and serving machine learning models.
The best thing about an ML PaaS is the availability of APIs that abstract the underlying hardware and software stack. Developers can call a couple of APIs to spin up a large cluster of GPU-backed machines, fully configured with data preparation tools, training frameworks, and monitoring tools, to kick off a complex training job. They can also take advantage of data processing pipelines to automate ETL jobs. When the model is ready, they can publish the latest version as a developer-facing web service without worrying about packaging and deploying the artifacts and their dependencies.
Public cloud providers have all the required building blocks to deliver ML PaaS. They are now exposing an abstract service that connects the dots between compute, storage, networks, and databases to bring a unified service to developers. Even though the service can be accessed through the console, the real value of the platform is exploited through the CLI and SDK. DevOps teams can integrate the CLI into automation scripts, while developers consume the SDK from IDEs such as Jupyter Notebooks, VS Code, or PyCharm.
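For example, with Amazon SageMaker's Python SDK, a few calls provision a training cluster, run the job, and publish the result as a managed endpoint. This is a sketch using the v2-style API; the container image, IAM role, and S3 paths are placeholders:

```python
# Sketch: train and deploy with the SageMaker Python SDK (v2-style API).
# The container image, IAM role, and S3 paths are placeholders.
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()

estimator = Estimator(
    image_uri="<training-container-image>",   # placeholder
    role="<execution-role-arn>",              # placeholder
    instance_count=2,
    instance_type="ml.p3.2xlarge",            # GPU training cluster
    output_path="s3://<bucket>/models/",      # placeholder
    sagemaker_session=session,
)

# One call provisions the cluster, runs the job, and tears it down.
estimator.fit({"training": "s3://<bucket>/data/"})

# One more call publishes the model behind a managed HTTPS endpoint.
predictor = estimator.deploy(initial_instance_count=1,
                             instance_type="ml.m5.xlarge")
```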
The SDK simplifies the creation of data processing and software delivery pipelines for developers. By changing a single parameter, they can switch from a CPU-based training cluster to a powerful GPU cluster running the latest NVIDIA K80 or P100 accelerators.
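Referring back to the sketch above, that switch amounts to changing the instance_type argument. The snippet below only illustrates the idea; make_training_job is a hypothetical stand-in for an SDK call such as the Estimator constructor, and the instance type names are examples:

```python
# Illustration: switching from CPU to GPU training is one parameter.
# make_training_job is a hypothetical stand-in for an SDK call.
def make_training_job(instance_type: str) -> dict:
    return {"instance_count": 2, "instance_type": instance_type}

cpu_job = make_training_job("ml.m5.4xlarge")  # CPU-based cluster
gpu_job = make_training_job("ml.p2.xlarge")   # NVIDIA K80 accelerators
```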
Cloud providers such as Amazon, Google, IBM, and Microsoft have built robust ML PaaS offerings.