{"id":3457,"date":"2018-11-16T00:07:32","date_gmt":"2018-11-16T00:07:32","guid":{"rendered":"https:\/\/www.appservgrid.com\/paw92\/?p=3457"},"modified":"2018-11-17T15:04:57","modified_gmt":"2018-11-17T15:04:57","slug":"how-to-do-deep-machine-learning-tasks-inside-kvm-guests-with-a-passed-through-nvidia-gpu","status":"publish","type":"post","link":"https:\/\/www.appservgrid.com\/paw92\/index.php\/2018\/11\/16\/how-to-do-deep-machine-learning-tasks-inside-kvm-guests-with-a-passed-through-nvidia-gpu\/","title":{"rendered":"How to Do Deep Machine Learning Tasks Inside KVM Guests with a Passed-through NVIDIA GPU"},"content":{"rendered":"<p>This article shows how to run deep machine learning tasks in a SUSE Linux Enterprise Server 15 KVM guest. In a first step, you will learn how to do the train\/test tasks using CPU and GPU separately. After that, we can compare the performance differences.<\/p>\n<h2>Preparation<\/h2>\n<p>But first of all, we need to do some preparation work before building both the <a href=\"http:\/\/caffe.berkeleyvision.org\/\" target=\"_blank\" rel=\"noopener\">Caffe<\/a> and the <a href=\"https:\/\/www.tensorflow.org\/\" target=\"_blank\" rel=\"noopener\">TensorFlow<\/a> frameworks with GPU support.<\/p>\n<p>1- Enable vt-d in the host bios and ensure the kernel parameter \u2018intel_iommu=on\u2019 is enabled.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.suse.com\/c\/wp-content\/uploads\/2018\/11\/DL-1-300x31.png\" alt=\"\" width=\"522\" height=\"54\" \/><\/p>\n<p>2- Pass the <em>nv970GTX<\/em> on to the SUSE Linux Enterprise Server 15 KVM guest through libvirt.<\/p>\n<p>Note:<br \/>\n* If there are multiple devices in the same iommu group, you need to pass all of them on to the guest.<br \/>\n* What is passed-through is the <em>970GTX<\/em> physical function, not a <em>vGPU<\/em> instance, because <em>970GTX<\/em> is not <em>vGPU<\/em> capable.<\/p>\n<p>3- Disable the visibility of KVM to the guest by hiding the KVM signature. 
<p>4- Install the official NVIDIA display driver in the guest:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.suse.com\/c\/wp-content\/uploads\/2018\/11\/DL-3-300x149.png\" alt=\"\" width=\"427\" height=\"212\" \/><br \/>\n<img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.suse.com\/c\/wp-content\/uploads\/2018\/11\/DL-4-298x300.png\" alt=\"\" width=\"426\" height=\"429\" \/><\/p>\n<p>5- Install <a href=\"https:\/\/devblogs.nvidia.com\/cuda-10-features-revealed\/\" target=\"_blank\" rel=\"noopener\"><em>CUDA 10<\/em><\/a>, <a href=\"https:\/\/developer.nvidia.com\/cudnn\" target=\"_blank\" rel=\"noopener\"><em>cuDNN 7.3.1<\/em><\/a> and <a href=\"https:\/\/docs.nvidia.com\/deeplearning\/sdk\/nccl-install-guide\/index.html\" target=\"_blank\" rel=\"noopener\"><em>NCCL 2.3.5<\/em><\/a> in the guest:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.suse.com\/c\/wp-content\/uploads\/2018\/11\/DL-5-300x44.png\" alt=\"\" width=\"423\" height=\"62\" \/><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.suse.com\/c\/wp-content\/uploads\/2018\/11\/DL-6-300x51.png\" alt=\"\" width=\"424\" height=\"72\" \/><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.suse.com\/c\/wp-content\/uploads\/2018\/11\/DL-7-300x36.png\" alt=\"\" width=\"425\" height=\"51\" \/><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.suse.com\/c\/wp-content\/uploads\/2018\/11\/DL-8-300x42.png\" alt=\"\" width=\"421\" height=\"59\" \/><\/p>\n<h2>Build the Frameworks<\/h2>\n<p>Now it\u2019s time to build the TensorFlow framework with GPU support and the Caffe framework.<\/p>\n<p>As the existing whl package of TensorFlow 1.11 does not support <em>CUDA 10<\/em> yet, I built <a href=\"https:\/\/github.com\/tensorflow\/tensorflow\/releases\" target=\"_blank\" rel=\"noopener\">TensorFlow 1.12<\/a> from the official Git source.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.suse.com\/c\/wp-content\/uploads\/2018\/11\/DL-9-300x81.png\" alt=\"\" width=\"422\" height=\"114\" \/><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.suse.com\/c\/wp-content\/uploads\/2018\/11\/DL-10-300x150.png\" alt=\"\" width=\"426\" height=\"213\" \/><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.suse.com\/c\/wp-content\/uploads\/2018\/11\/DL-11-300x41.png\" alt=\"\" width=\"432\" height=\"59\" \/><\/p>\n<p>As the next step, build a whl package and install it.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.suse.com\/c\/wp-content\/uploads\/2018\/11\/DL-12-300x286.png\" alt=\"\" width=\"433\" height=\"413\" \/><\/p>\n
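<p>The exact commands are in the screenshots above; a rough sketch of a CUDA-enabled TensorFlow 1.12 source build, assuming bazel is installed and CUDA support is answered with yes during configuration, looks like this:<\/p>\n<pre># configure the source tree (enable CUDA when asked)\n.\/configure\n# build the pip package builder with CUDA support\nbazel build --config=opt --config=cuda \/\/tensorflow\/tools\/pip_package:build_pip_package\n# generate the whl package and install it (the whl file name depends on your build)\nbazel-bin\/tensorflow\/tools\/pip_package\/build_pip_package \/tmp\/tensorflow_pkg\npip install \/tmp\/tensorflow_pkg\/tensorflow-1.12.0-*.whl\n<\/pre>\n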
<p>Now let\u2019s create a simple example to test the TensorFlow GPU support in the guest:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.suse.com\/c\/wp-content\/uploads\/2018\/11\/DL-13-219x300.png\" alt=\"\" width=\"430\" height=\"589\" \/><\/p>\n
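<p>The script in the screenshot is not reproduced verbatim; a minimal sketch in the same spirit, which pins a small computation to GPU0 and prints the device placement, could look like this:<\/p>\n<pre># minimal TensorFlow 1.x GPU smoke test\nimport tensorflow as tf\n\n# pin a small matrix multiplication to GPU0\nwith tf.device('\/gpu:0'):\n    a = tf.constant([[1.0, 2.0], [3.0, 4.0]], name='a')\n    b = tf.constant([[1.0, 1.0], [0.0, 1.0]], name='b')\n    c = tf.matmul(a, b)\n\n# log_device_placement prints which device each op actually runs on\nwith tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:\n    print(sess.run(c))\n<\/pre>\n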
<p>Through the nvidia-smi command, you can see the process information on GPU0 while the example code is running.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.suse.com\/c\/wp-content\/uploads\/2018\/11\/DL-14-300x152.png\" alt=\"\" width=\"438\" height=\"222\" \/><\/p>\n<p>Next, let\u2019s build the Caffe framework from source, along with the Caffe Python wrapper.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.suse.com\/c\/wp-content\/uploads\/2018\/11\/DL-15-187x300.png\" alt=\"\" width=\"436\" height=\"700\" \/><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.suse.com\/c\/wp-content\/uploads\/2018\/11\/DL-16-300x20.png\" alt=\"\" width=\"435\" height=\"29\" \/><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.suse.com\/c\/wp-content\/uploads\/2018\/11\/DL-17-300x54.png\" alt=\"\" width=\"433\" height=\"78\" \/><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.suse.com\/c\/wp-content\/uploads\/2018\/11\/DL-18-300x49.png\" alt=\"\" width=\"429\" height=\"70\" \/><\/p>\n<p>The setup is done!<\/p>\n<h2>Examples<\/h2>\n<p>Now let\u2019s try to execute some deep learning tasks.<\/p>\n<p>Example 1.1: This is a Caffe built-in example. Please refer to <a href=\"http:\/\/caffe.berkeleyvision.org\/gathered\/examples\/mnist.html\" target=\"_blank\" rel=\"noopener\">http:\/\/caffe.berkeleyvision.org\/gathered\/examples\/mnist.html<\/a> to learn more.<\/p>\n<p>Let\u2019s use GPU0 in the guest to train this <a href=\"http:\/\/deeplearning.net\/tutorial\/lenet.html\" target=\"_blank\" rel=\"noopener\"><em>LeNET<\/em><\/a> model.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.suse.com\/c\/wp-content\/uploads\/2018\/11\/DL-19-300x84.png\" alt=\"\" width=\"453\" height=\"127\" \/><\/p>\n
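<p>For reference, the train and test invocations behind this example look roughly like the following sketch (run from the Caffe source root; the solver and model paths follow the standard Caffe tree and may differ in your checkout):<\/p>\n<pre># train LeNet on MNIST using GPU0\n.\/build\/tools\/caffe train -solver examples\/mnist\/lenet_solver.prototxt -gpu 0\n# score the final snapshot against the 10,000 test images (100 iterations x batch size 100)\n.\/build\/tools\/caffe test -model examples\/mnist\/lenet_train_test.prototxt -weights examples\/mnist\/lenet_iter_10000.caffemodel -gpu 0 -iterations 100\n<\/pre>\n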
<p>During the training, we should see the loss trend steadily downward as the iterations continue. As the output is quite long, I will not show it here.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.suse.com\/c\/wp-content\/uploads\/2018\/11\/DL-20-300x85.png\" alt=\"\" width=\"455\" height=\"129\" \/><\/p>\n<p>After the training is done, we have four files in the output folder. This is because the solver is set up to save the model and the training state every 5,000 iterations, so we get two files after 5,000 iterations and two more after 10,000 iterations.<\/p>\n<p>Now we have a trained model. Let\u2019s test it with the 10,000 test images to see how good the accuracy is.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.suse.com\/c\/wp-content\/uploads\/2018\/11\/DL-21-300x10.png\" alt=\"\" width=\"690\" height=\"23\" \/><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.suse.com\/c\/wp-content\/uploads\/2018\/11\/DL-22-300x60.png\" alt=\"\" width=\"450\" height=\"90\" \/><\/p>\n<p>The accuracy is 0.9844, which is an acceptable result.<\/p>\n<p>Example 1.2: Now let\u2019s re-train a <em>LeNET<\/em> model using the CPU instead of the GPU \u2013 and let\u2019s see what happens.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.suse.com\/c\/wp-content\/uploads\/2018\/11\/DL-23-300x94.png\" alt=\"\" width=\"450\" height=\"141\" \/><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.suse.com\/c\/wp-content\/uploads\/2018\/11\/DL-24-300x89.png\" alt=\"\" width=\"452\" height=\"134\" \/><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.suse.com\/c\/wp-content\/uploads\/2018\/11\/DL-25-300x10.png\" alt=\"\" width=\"750\" height=\"25\" \/><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.suse.com\/c\/wp-content\/uploads\/2018\/11\/DL-26-300x55.png\" alt=\"\" width=\"458\" height=\"84\" \/><\/p>\n<p>Comparing the two runs, we can see huge performance differences between the GPU and the CPU when we train\/test <em>LeNET<\/em> with the MNIST dataset.<\/p>\n<p>We know that the traditional <em>LeNET<\/em> convolutional neural network (CNN) contains seven layers, not counting the input layer, and the <em>MNIST<\/em> database contains 60,000 training images and 10,000 testing images. That means the performance difference between CPU training and GPU training grows even bigger when deeper neural network layers are used.<\/p>\n<p>Example 2.1: This example is a TensorFlow built-in example. Let\u2019s build a very simple MNIST classifier using the same MNIST dataset.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.suse.com\/c\/wp-content\/uploads\/2018\/11\/DL-27-300x210.png\" alt=\"\" width=\"460\" height=\"322\" \/><\/p>\n<p>Here we go: as no convolutional layers are involved, the time consumed is quite short \u2013 only 8.5 seconds. But the accuracy is 0.92, which is not good enough.<\/p>\n<p>If you want, you can check all the details through TensorBoard.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.suse.com\/c\/wp-content\/uploads\/2018\/11\/DL-28-300x51.png\" alt=\"\" width=\"459\" height=\"78\" \/><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.suse.com\/c\/wp-content\/uploads\/2018\/11\/DL-29-300x14.png\" alt=\"\" width=\"557\" height=\"26\" \/><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.suse.com\/c\/wp-content\/uploads\/2018\/11\/DL-30-300x186.png\" alt=\"\" width=\"458\" height=\"284\" \/><\/p>\n<p>Example 2.2: Now we create a five-layer CNN similar to <em>LeNET<\/em>. Let\u2019s train it on GPU0 using the TensorFlow framework.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.suse.com\/c\/wp-content\/uploads\/2018\/11\/DL-31-300x97.png\" alt=\"\" width=\"467\" height=\"151\" \/><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.suse.com\/c\/wp-content\/uploads\/2018\/11\/DL-32-300x145.png\" alt=\"\" width=\"466\" height=\"225\" \/><\/p>\n<p>You can see now that the accuracy is 0.99 \u2013 it got much better, and the time consumed is only 2m 16s.<\/p>\n
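<p>The exact network from the screenshots is not reproduced here; a comparable <em>LeNET<\/em>-style CNN for 28x28 MNIST input, sketched with hypothetical layer sizes in TensorFlow 1.x tf.layers style, could be defined like this:<\/p>\n<pre># sketch of a LeNet-like CNN: two conv\/pool stages and two dense layers\nimport tensorflow as tf\n\ndef lenet_like(images):\n    # images: float32 tensor of shape [batch, 28, 28, 1]\n    conv1 = tf.layers.conv2d(images, filters=32, kernel_size=5, activation=tf.nn.relu)\n    pool1 = tf.layers.max_pooling2d(conv1, pool_size=2, strides=2)\n    conv2 = tf.layers.conv2d(pool1, filters=64, kernel_size=5, activation=tf.nn.relu)\n    pool2 = tf.layers.max_pooling2d(conv2, pool_size=2, strides=2)\n    flat = tf.layers.flatten(pool2)\n    fc1 = tf.layers.dense(flat, units=512, activation=tf.nn.relu)\n    # one logit per digit class\n    return tf.layers.dense(fc1, units=10)\n<\/pre>\n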
<p><b>Example 2.3<\/b>: Finally, let\u2019s redo example 2.2 with the CPU instead of GPU0 to check the performance differences.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.suse.com\/c\/wp-content\/uploads\/2018\/11\/DL-33-300x136.png\" alt=\"\" width=\"461\" height=\"209\" \/><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.suse.com\/c\/wp-content\/uploads\/2018\/11\/DL-34.png\" alt=\"\" width=\"293\" height=\"163\" \/><\/p>\n<p>With 0.99, the accuracy is just as good. But the time consumed is 19m 53s, which is far longer than in example 2.2.<\/p>\n<h2>Summary<\/h2>\n<p>Finally, let\u2019s summarize our test results:<\/p>\n<ul>\n<li>The training\/testing performance differences between CPU and GPU are huge. They can reach hundreds of times when the network model is complex.<\/li>\n<li>SUSE Linux Enterprise Server 15 is a highly reliable platform for whatever machine learning tasks you want to run on it, for research or production purposes.<\/li>\n<\/ul>\n<p><a href=\"https:\/\/www.suse.com\/c\/how-to-do-deep-machine-learning-tasks-inside-kvm-guests-with-a-passed-through-nvidia-gpu\/\" target=\"_blank\" rel=\"noopener\">Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>This article shows how to run deep machine learning tasks in a SUSE Linux Enterprise Server 15 KVM guest. As a first step, you will learn how to run the training\/testing tasks on the CPU and on the GPU separately. After that, we can compare the performance differences. Preparation But first of all, we need to do some &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/www.appservgrid.com\/paw92\/index.php\/2018\/11\/16\/how-to-do-deep-machine-learning-tasks-inside-kvm-guests-with-a-passed-through-nvidia-gpu\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;How to Do Deep Machine Learning Tasks Inside KVM Guests with a Passed-through NVIDIA GPU&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-3457","post","type-post","status-publish","format-standard","hentry","category-linux"],"_links":{"self":[{"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/posts\/3457","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/comments?post=3457"}],"version-history":[{"count":1,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/posts\/3457\/revisions"}],"predecessor-version":[{"id":3692,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/posts\/3457\/revisions\/3692"}],"wp:attachment":[{"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/media?parent=3457"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/categories?post=3457"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/tags?post=3457"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}