{"id":3366,"date":"2018-11-14T00:49:50","date_gmt":"2018-11-14T00:49:50","guid":{"rendered":"https:\/\/www.appservgrid.com\/paw92\/?p=3366"},"modified":"2018-11-17T14:29:05","modified_gmt":"2018-11-17T14:29:05","slug":"automate-sysadmin-tasks-with-pythons-os-walk-function","status":"publish","type":"post","link":"https:\/\/www.appservgrid.com\/paw92\/index.php\/2018\/11\/14\/automate-sysadmin-tasks-with-pythons-os-walk-function\/","title":{"rendered":"Automate Sysadmin Tasks with Python&#8217;s os.walk Function"},"content":{"rendered":"<p><em>Using Python&#8217;s os.walk function to walk through a tree of files and<br \/>\ndirectories.<\/em><\/p>\n<p>I&#8217;m a web guy; I put together my first site in early 1993. And<br \/>\nso, when I started to do Python training, I assumed that most of my<br \/>\nstudents also were going to be web developers or aspiring web<br \/>\ndevelopers. Nothing could be further from the truth. Although some of my<br \/>\nstudents certainly are interested in web applications, the majority of them<br \/>\nare software engineers, testers, data scientists and system<br \/>\nadministrators.<\/p>\n<p>This last group, the system administrators, usually comes into my<br \/>\ncourse with the same story. The company they work for has been writing Bash<br \/>\nscripts for several years, but they want to move to a higher-level<br \/>\nlanguage with greater expressiveness and a large number of third-party<br \/>\nadd-ons. (No offense to Bash users is intended; you can do amazing<br \/>\nthings with Bash, but I hope you&#8217;ll agree that the scripts can become<br \/>\nunwieldy and hard to maintain.)<\/p>\n<p>It turns out that with a few simple tools and ideas, these system<br \/>\nadministrators can use Python to do more with less code, as well as create<br \/>\nreports and maintain servers. 
So in this article, I describe<br \/>\none particularly useful tool that&#8217;s often overlooked: os.walk, a<br \/>\nfunction that lets you walk through a tree of files and<br \/>\ndirectories.<\/p>\n<h3>os.walk Basics<\/h3>\n<p>Linux users are used to the ls command to get a list of files in a<br \/>\ndirectory. Python comes with two different functions that can return<br \/>\nthe list of files. One is os.listdir, which means the &#8220;listdir&#8221;<br \/>\nfunction in the &#8220;os&#8221; module. If you want, you can pass the name of a<br \/>\ndirectory to os.listdir. If you don&#8217;t do that, you&#8217;ll get the names<br \/>\nof files in the current directory. So, you can say:<\/p>\n<p>In [10]: import os<\/p>\n<p>When I do that on my computer, in the current directory, I get the following:<\/p>\n<p>In [11]: os.listdir('.')<br \/>\nOut[11]:<br \/>\n['.git',<br \/>\n'.gitignore',<br \/>\n'.ipynb_checkpoints',<br \/>\n'.mypy_cache',<br \/>\n'Archive',<br \/>\n'Files']<\/p>\n<p>As you can see, os.listdir returns a list of strings, with each<br \/>\nstring being a filename. Of course, in UNIX-type systems, directories<br \/>\nare files too\u2014so along with files, you&#8217;ll also see subdirectories<br \/>\nwithout any obvious indication of which is which.<\/p>\n<p>I gave up on os.listdir long ago, in favor of<br \/>\nglob.glob, which means<br \/>\nthe &#8220;glob&#8221; function in the &#8220;glob&#8221; module. Command-line users are used<br \/>\nto using &#8220;globbing&#8221;, although they often don&#8217;t know its name. Globbing<br \/>\nmeans using the * and ? characters, among others, for more flexible<br \/>\nmatching of filenames. Although os.listdir can return the list of<br \/>\nfiles in a directory, it cannot filter them. 
You can, though, with<br \/>\nglob.glob:<\/p>\n<p>In [13]: import glob<\/p>\n<p>In [14]: glob.glob('Files\/*.zip')<br \/>\nOut[14]:<br \/>\n['Files\/advanced-exercise-files.zip',<br \/>\n'Files\/exercise-files.zip',<br \/>\n'Files\/names.zip',<br \/>\n'Files\/words.zip']<\/p>\n<p>In either case, you get the names of the files (and subdirectories) as<br \/>\nstrings. You then can use a for loop or a list comprehension to iterate<br \/>\nover them and perform an action. Also note that in contrast with<br \/>\nos.listdir, which returns the list of filenames without any path,<br \/>\nglob.glob returns each filename along with the directory portion of<br \/>\nthe pattern (here, Files\/), something I&#8217;ve<br \/>\noften found to be useful.<\/p>\n<p>But what if you want to go through each file, including every file in<br \/>\nevery subdirectory? Then you have a bit more of a problem. Sure, you could<br \/>\nuse a for loop to iterate over each filename and then use<br \/>\nos.path.isdir to figure out whether it&#8217;s a subdirectory\u2014and if so,<br \/>\nthen you could get the list of files in that subdirectory and add them<br \/>\nto the list over which you&#8217;re iterating.<\/p>\n<p>Or, you can use the os.walk function, which does all of this and<br \/>\nmore. Although os.walk looks and acts like a function, it&#8217;s actually a<br \/>\n&#8220;generator function&#8221;\u2014a function that, when executed, returns a<br \/>\n&#8220;generator&#8221; object that implements the iteration protocol. If you&#8217;re<br \/>\nnot used to working with generators, running the function can be<br \/>\na bit surprising:<\/p>\n<p>In [15]: os.walk('.')<br \/>\nOut[15]: &lt;generator object walk at 0x1035be5e8&gt;<\/p>\n<p>The idea is that you&#8217;ll put the output from os.walk in a<br \/>\nfor<br \/>\nloop. 
Let&#8217;s do that:<\/p>\n<p>In [17]: for item in os.walk('.'):<br \/>\n&#8230;: &nbsp;&nbsp;&nbsp;&nbsp;print(item)<\/p>\n<p>The result, at least on my computer, is a huge amount of output,<br \/>\nscrolling by so fast that I can&#8217;t read it easily. Whether that<br \/>\nhappens to you depends on where you run this for loop on your<br \/>\nsystem and how many files (and subdirectories) exist.<\/p>\n<p>In each iteration, os.walk yields a tuple containing three<br \/>\nelements:<\/p>\n<ul>\n<li>The current path (that is, directory name) as a string.<\/li>\n<li>A list of subdirectory names (as strings).<\/li>\n<li>A list of non-directory filenames (as strings).<\/li>\n<\/ul>\n<p>So, it&#8217;s typical to invoke os.walk such that each of these three<br \/>\nelements is assigned to a separate variable in the for loop:<\/p>\n<p>In [19]: for currentdir, dirnames, filenames in os.walk('.'):<br \/>\n&#8230;: &nbsp;&nbsp;&nbsp;&nbsp;print(currentdir)<\/p>\n<p>The iterations continue until each of the subdirectories under the<br \/>\nargument to os.walk has been returned. This allows you to produce<br \/>\nall sorts of reports and perform interesting tasks. For example, the above<br \/>\ncode will print the current directory, &#8220;.&#8221;, followed by every<br \/>\nsubdirectory under it.<\/p>\n<h3>Counting Files<\/h3>\n<p>Let&#8217;s say you want to count the number of files (not subdirectories)<br \/>\nunder the current directory. You can say:<\/p>\n<p>In [19]: file_count = 0<\/p>\n<p>In [20]: for currentdir, dirnames, filenames in os.walk('.'):<br \/>\n&#8230;: &nbsp;&nbsp;&nbsp;&nbsp;file_count += len(filenames)<br \/>\n&#8230;:<\/p>\n<p>In [21]: file_count<br \/>\nOut[21]: 3657<\/p>\n<p>You also can do something a bit more sophisticated, counting how many<br \/>\nfiles there are of each type, using the extension as a classifier. 
You<br \/>\ncan get the extension with os.path.splitext, which returns two<br \/>\nitems\u2014the filename without the extension and the extension itself:<\/p>\n<p>In [23]: os.path.splitext('abc\/def\/ghi.jkl')<br \/>\nOut[23]: ('abc\/def\/ghi', '.jkl')<\/p>\n<p>You can count the items using one of my favorite Python data structures,<br \/>\nCounter. For example:<\/p>\n<p>In [24]: from collections import Counter<\/p>\n<p>In [25]: counts = Counter()<\/p>\n<p>In [26]: for currentdir, dirnames, filenames in os.walk('.'):<br \/>\n&#8230;: &nbsp;&nbsp;&nbsp;&nbsp;for one_filename in filenames:<br \/>\n&#8230;: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;first_part, ext = os.path.splitext(one_filename)<br \/>\n&#8230;: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;counts[ext] += 1<\/p>\n<p>This goes through each directory under &#8220;.&#8221;, getting the<br \/>\nfilenames. It then iterates through the list of filenames, splitting<br \/>\nthe name so that you can get the extension. You then add 1 to the counter<br \/>\nfor that extension.<\/p>\n<p>Once this code has run, you can ask counts for a report. Because Counter<br \/>\nis a subclass of dict, you can use the items method and print the keys and values<br \/>\n(that is, extensions and counts). You can print them as follows:<\/p>\n<p>In [30]: for extension, count in counts.items():<br \/>\n&#8230;: &nbsp;&nbsp;&nbsp;&nbsp;print(f&quot;{extension:8}{count}&quot;)<\/p>\n<p>In the above code, the f-string displays the extension (in<br \/>\na field of eight characters) and the count.<\/p>\n<p>Wouldn&#8217;t it be nice though to show only the ten most common<br \/>\nextensions? Yes, but then you&#8217;d have to sort through the counts<br \/>\nobject. 
It&#8217;s much easier just to use the most_common method that<br \/>\nthe Counter object provides, which not only returns the keys and<br \/>\nvalues, but also sorts them in descending order of count:<\/p>\n<p>In [31]: for extension, count in counts.most_common(10):<br \/>\n&#8230;: &nbsp;&nbsp;&nbsp;&nbsp;print(f&quot;{extension:8}{count}&quot;)<br \/>\n&#8230;:<br \/>\n.py 1149<br \/>\n867<br \/>\n.zip 466<br \/>\n.ipynb 410<br \/>\n.pyc 372<br \/>\n.txt 151<br \/>\n.json 76<br \/>\n.so 37<br \/>\n.conf 19<br \/>\n.py~ 12<\/p>\n<p>In other words\u2014not surprisingly\u2014this example shows that the most common file extension<br \/>\nin the directory I use for teaching Python courses is .py. Files<br \/>\nwithout any extension are next, followed by .zip, .ipynb (Jupyter<br \/>\nnotebooks) and .pyc (byte-compiled Python).<\/p>\n<h3>File Sizes<\/h3>\n<p>You can ask more interesting questions as well. For example, perhaps<br \/>\nyou want to know how much disk space is used by each of these file<br \/>\ntypes. Now, instead of adding 1 each time you encounter a file<br \/>\nextension, you add the size of the file. 
Fortunately, this turns<br \/>\nout to be trivially easy, thanks to the os.path.getsize<br \/>\nfunction (this returns the same size that os.stat reports in its<br \/>\nst_size field):<\/p>\n<p>for currentdir, dirnames, filenames in os.walk('.'):<br \/>\n&nbsp;&nbsp;&nbsp;&nbsp;for one_filename in filenames:<br \/>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;first_part, ext = os.path.splitext(one_filename)<br \/>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;try:<br \/>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;counts[ext] += os.path.getsize(os.path.join(currentdir, one_filename))<br \/>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;except FileNotFoundError:<br \/>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;pass<\/p>\n<p>The above code includes three changes from the previous version:<\/p>\n<ol>\n<li>As indicated, this no longer adds 1 to the count for each extension,<br \/>\nbut rather the size of the file, which comes from<br \/>\nos.path.getsize.<\/li>\n<li>os.path.join puts the path and filename together<br \/>\nand (as a<br \/>\nbonus) uses the current operating system&#8217;s path separation character.<br \/>\nWhat are the odds of a program being used on a Windows system and,<br \/>\nthus, needing a backslash rather than a slash? Pretty slim, but it<br \/>\ndoesn&#8217;t hurt to use this sort of built-in operation.<\/li>\n<li>os.walk lists symbolic links among the filenames without checking<br \/>\nwhether their targets exist, which means<br \/>\nyou potentially can get yourself into some trouble trying to<br \/>\nmeasure the sizes of files that don&#8217;t exist. For this reason, here<br \/>\nthe counting is wrapped in a try\/except block.<\/li>\n<\/ol>\n<p>Once this is done, you can identify the file types consuming<br \/>\nthe greatest amount of space in the directory:<\/p>\n<p>In [46]: for extension, count in counts.most_common(10):<br \/>\n&#8230;: &nbsp;&nbsp;&nbsp;&nbsp;print(f&quot;{extension:8}{count}&quot;)<br \/>\n&#8230;:<br \/>\n.pack 669153001<br \/>\n.zip 486110102<br \/>\n.ipynb 223155683<br \/>\n.sql 125443333<br \/>\n46296632<br \/>\n.json 14224651<br \/>\n.txt 10921226<br \/>\n.pdf 7557943<br \/>\n.py 5253208<br \/>\n.pyc 4948851<\/p>\n<p>Now things seem a bit different! 
In my case, it looks like I&#8217;ve got a lot of<br \/>\nstuff in .pack<br \/>\nfiles, indicating that my Git repository (where I store all of my<br \/>\nold training examples, exercises and Jupyter notebooks) is quite<br \/>\nlarge. I have a lot in zipfiles, in which I store my daily updates.<br \/>\nAnd of course, lots in Jupyter notebooks, which are written in JSON<br \/>\nformat and can become quite large. The surprise to me is the .sql<br \/>\nextension, which I honestly had forgotten that I had.<\/p>\n<h3>Files per Year<\/h3>\n<p>What if you want to know how many files of each type were modified in<br \/>\neach year? This could be useful for removing logfiles or (if you&#8217;re<br \/>\nlike me) identifying what large, unnecessary files are taking up<br \/>\nspace.<\/p>\n<p>In order to do that, you&#8217;ll need to get the modification time<br \/>\n(mtime,<br \/>\nin UNIX parlance) for each file. You&#8217;ll then need to convert that<br \/>\nmtime<br \/>\nfrom a UNIX time (that is, the number of seconds since January 1st, 1970)<br \/>\nto something you can parse and use.<\/p>\n<p>Instead of using a Counter object to keep track of things, you<br \/>\ncan just<br \/>\nuse a dictionary. However, this dict&#8217;s values will be a Counter, with<br \/>\nthe years serving as keys and the counts as values. 
Since you know that<br \/>\nall of the main dict&#8217;s values will be Counter objects, you can just use a<br \/>\ndefaultdict, which will require you to write less code.<\/p>\n<p>Here&#8217;s how you can do all of this:<\/p>\n<p>import os<br \/>\nfrom collections import defaultdict, Counter<br \/>\nfrom datetime import datetime<\/p>\n<p>counts = defaultdict(Counter)<\/p>\n<p>for currentdir, dirnames, filenames in os.walk('.'):<br \/>\n&nbsp;&nbsp;&nbsp;&nbsp;for one_filename in filenames:<br \/>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;first_part, ext = os.path.splitext(one_filename)<br \/>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;try:<br \/>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;full_filename = os.path.join(currentdir, one_filename)<br \/>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;mtime = datetime.fromtimestamp(os.path.getmtime(full_filename))<br \/>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;counts[ext][mtime.year] += 1<br \/>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;except FileNotFoundError:<br \/>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;pass<\/p>\n<p>First, this creates counts as an instance of<br \/>\ndefaultdict with Counter as its<br \/>\ndefault factory. This means if you ask for a key that doesn&#8217;t yet exist,<br \/>\nthe key will be created, with its value being a new Counter<br \/>\nthat allows you to say something like this:<\/p>\n<p>counts['.zip'][2018] += 1<\/p>\n<p>without having to initialize either the .zip key (for counts) or the<br \/>\n2018 key (for the Counter object). You can just add one to the count,<br \/>\nand know that it&#8217;s working.<\/p>\n<p>Then, when you iterate over the filesystem, you grab the mtime<br \/>\nfor each<br \/>\nfile (using os.path.getmtime). That is turned into a<br \/>\ndatetime<br \/>\nobject with datetime.fromtimestamp, a great function that lets<br \/>\nyou<br \/>\nmove from UNIX timestamps to human-style dates and times. 
Finally, you<br \/>\nadd 1 to your counts.<\/p>\n<p>Once again, you can display the results:<\/p>\n<p>for extension, year_counts in counts.items():<br \/>\n&nbsp;&nbsp;&nbsp;&nbsp;print(extension)<br \/>\n&nbsp;&nbsp;&nbsp;&nbsp;for year, file_count in sorted(year_counts.items()):<br \/>\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;print(f&quot;\\t{year}\\t{file_count}&quot;)<\/p>\n<p>The counts variable is now a defaultdict, which means it behaves<br \/>\njust like a dictionary in most respects. So, you can iterate over its<br \/>\nkeys and values with items, which is shown here, getting each file<br \/>\nextension and the Counter object for each.<\/p>\n<p>Next the extension is printed, and then the code iterates over the years and their<br \/>\ncounts, sorting them by year and printing them indented somewhat with<br \/>\na tab (\\t) character. In this way, you can see precisely how many<br \/>\nfiles of each extension have been modified per year\u2014and perhaps<br \/>\nunderstand which files are truly important and which you easily can get<br \/>\nrid of.<\/p>\n<h3>Conclusion<\/h3>\n<p>Python can&#8217;t and shouldn&#8217;t replace Bash for simple scripting, but in<br \/>\nmany cases, if you&#8217;re working with a large number of files and\/or<br \/>\ncreating reports, Python&#8217;s standard library can make it easy to<br \/>\ndo such tasks with a minimum of code.<\/p>\n<p><a href=\"https:\/\/www.linuxjournal.com\/content\/automate-sysadmin-tasks-pythons-oswalk-function\" target=\"_blank\" rel=\"noopener\">Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Using Python&#8217;s os.walk function to walk through a tree of files and directories. I&#8217;m a web guy; I put together my first site in early 1993. And so, when I started to do Python training, I assumed that most of my students also were going to be web developers or aspiring web developers. 
Nothing could &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/www.appservgrid.com\/paw92\/index.php\/2018\/11\/14\/automate-sysadmin-tasks-with-pythons-os-walk-function\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Automate Sysadmin Tasks with Python&#8217;s os.walk Function&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-3366","post","type-post","status-publish","format-standard","hentry","category-linux"],"_links":{"self":[{"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/posts\/3366","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/comments?post=3366"}],"version-history":[{"count":1,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/posts\/3366\/revisions"}],"predecessor-version":[{"id":3625,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/posts\/3366\/revisions\/3625"}],"wp:attachment":[{"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/media?parent=3366"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/categories?post=3366"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/tags?post=3366"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}