{"id":1996,"date":"2018-10-31T00:44:59","date_gmt":"2018-10-31T00:44:59","guid":{"rendered":"https:\/\/www.appservgrid.com\/paw92\/?p=1996"},"modified":"2018-10-31T03:44:40","modified_gmt":"2018-10-31T03:44:40","slug":"normalizing-filenames-and-data-with-bash","status":"publish","type":"post","link":"https:\/\/www.appservgrid.com\/paw92\/index.php\/2018\/10\/31\/normalizing-filenames-and-data-with-bash\/","title":{"rendered":"Normalizing Filenames and Data with Bash"},"content":{"rendered":"<p><em>URLify: convert letter sequences into safe URLs with hex<br \/>\nequivalents.<\/em><\/p>\n<p>This is my 155th column. That means I&#8217;ve been writing for <em>Linux<br \/>\nJournal<\/em> for:<\/p>\n<p>$ echo &quot;155\/12&quot; | bc<br \/>\n12<\/p>\n<p>No, wait, that&#8217;s not right. Let&#8217;s try that again:<\/p>\n<p>$ echo &quot;scale=2;155\/12&quot; | bc<br \/>\n12.91<\/p>\n<p>Yeah, that many years. Almost 13 years of writing about shell scripts and<br \/>\nlightweight programming within the Linux environment. I&#8217;ve covered a lot<br \/>\nof ground, but I want to go back to something that&#8217;s fairly basic and<br \/>\ntalk about filenames and the web.<\/p>\n<p>It used to be that if you had filenames with spaces in them, bad things would<br \/>\nhappen: &#8220;my mom&#8217;s cookies.html&#8221; was a recipe for disaster, not<br \/>\ngood cookies\u2014um, and not those sorts of web cookies either!<\/p>\n<p>As the web evolved, however, encoding of special characters became the norm,<br \/>\nand every web browser had to be able to manage it, for better or worse.
So<br \/>\nspaces became either &#8220;+&#8221; or %20 sequences, and everything else that<br \/>\nwasn&#8217;t a regular alphanumeric character was replaced by its hex ASCII<br \/>\nequivalent.<\/p>\n<p>In other words, &#8220;my mom&#8217;s cookies.html&#8221; turned into<br \/>\n&#8220;my+mom%27s+cookies.html&#8221; or &#8220;my%20mom%27s%20cookies.html&#8221;.<br \/>\nMany symbols took on a second life too, so &#8220;&amp;&#8221; and &#8220;=&#8221; and<br \/>\n&#8220;?&#8221; all got their own meanings, which meant that they needed to be<br \/>\nprotected if they were part of an original filename too. And what about if<br \/>\nyou had a &#8220;%&#8221; in your original filename? Ah yes, the recursive nature<br \/>\nof encoding things&#8230;.<\/p>\n<p>So purely as an exercise in scripting, let&#8217;s write a script that<br \/>\nconverts any string you hand it into a &#8220;web-safe&#8221; sequence. Before<br \/>\nstarting, however, pull out a piece of paper and jot down how you&#8217;d solve<br \/>\nit.<\/p>\n<h3>Normalizing Filenames for the Web<\/h3>\n<p>My strategy is going to be easy: pull the string apart into individual<br \/>\ncharacters, analyze each character to identify if it&#8217;s an alphanumeric,<br \/>\nand if it&#8217;s not, convert it into its hexadecimal ASCII equivalent,<br \/>\nprefacing it with a &#8220;%&#8221; as needed.<\/p>\n<p>There are a number of ways to break a string into its individual letters,<br \/>\nbut let&#8217;s use Bash string variable manipulations, recalling that<br \/>\n${#var}<br \/>\nreturns the number of characters in variable $var, and that<br \/>\n${var:x:1} will<br \/>\nreturn just the letter in $var at position x.
Quick now, does indexing start<br \/>\nat zero or one?<\/p>\n<p>Here&#8217;s my initial loop to break $input into its component letters:<\/p>\n<p>input=&quot;$*&quot;<\/p>\n<p>echo &quot;$input&quot;<\/p>\n<p>for (( counter=0 ; counter &lt; ${#input} ; counter++ ))<br \/>\ndo<br \/>\necho &quot;counter = $counter -- ${input:$counter:1}&quot;<br \/>\ndone<\/p>\n<p>Recall that $* is a shortcut for everything from the invoking command line<br \/>\nother than the command name itself\u2014a lazy way to let users quote the<br \/>\nargument or not. It doesn&#8217;t address special characters, but that&#8217;s<br \/>\nwhat quotes are for, right?<\/p>\n<p>Let&#8217;s give this fragmentary script a whirl with some input from the<br \/>\ncommand line:<\/p>\n<p>$ sh normalize.sh &quot;li nux?&quot;<br \/>\nli nux?<br \/>\ncounter = 0 -- l<br \/>\ncounter = 1 -- i<br \/>\ncounter = 2 --<br \/>\ncounter = 3 -- n<br \/>\ncounter = 4 -- u<br \/>\ncounter = 5 -- x<br \/>\ncounter = 6 -- ?<\/p>\n<p>There&#8217;s obviously some debugging code in the script, but it&#8217;s<br \/>\ngenerally a good idea to leave that in until you&#8217;re sure it&#8217;s working<br \/>\nas expected.<\/p>\n<p>Now it&#8217;s time to differentiate between characters that are acceptable<br \/>\nwithin a URL and those that are not. Turning a character into a hex sequence<br \/>\nis a bit tricky, so I&#8217;m using a sequence of fairly obscure<br \/>\ncommands. Let&#8217;s start with just the command line:<\/p>\n<p>$ echo &#39;~&#39; | xxd -ps -c1 | head -1<br \/>\n7e<\/p>\n<p>Now, the question is whether &#8220;~&#8221; is actually the hex ASCII sequence<br \/>\n7e or not. A quick glance at <a href=\"http:\/\/www.asciitable.com\">http:\/\/www.asciitable.com<\/a> confirms that, yes, 7e is<br \/>\nindeed the ASCII for the tilde. Preface that with a percentage sign, and<br \/>\nthe tough job of conversion is managed.<\/p>\n<p>But how do you know which characters can be used as they are?
Because of the weird<br \/>\nway the ASCII table is organized, that&#8217;s going to be three ranges:<br \/>\n0\u20139 is in one area of the table, then A\u2013Z in a second area and<br \/>\na\u2013z in a<br \/>\nthird. There&#8217;s no way around it, that&#8217;s three range tests.<\/p>\n<p>There&#8217;s a really cool way to do that in Bash too:<\/p>\n<p>if [[ &quot;$char&quot; =~ [a-z] ]]<\/p>\n<p>What&#8217;s happening here is that =~ is the regular expression match<br \/>\noperator, and the range [a-z] is the pattern being tested. Since the action<br \/>\nI want to take after<br \/>\neach test is identical, it&#8217;s easy now to implement all three tests:<\/p>\n<p>if [[ &quot;$char&quot; =~ [a-z] ]]; then<br \/>\noutput=&quot;$output$char&quot;<br \/>\nelif [[ &quot;$char&quot; =~ [A-Z] ]]; then<br \/>\noutput=&quot;$output$char&quot;<br \/>\nelif [[ &quot;$char&quot; =~ [0-9] ]]; then<br \/>\noutput=&quot;$output$char&quot;<br \/>\nelse<\/p>\n<p>As is obvious, the $output string variable will be built up to have the<br \/>\ndesired value.<\/p>\n<p>What&#8217;s left? The hex output for anything that&#8217;s not an otherwise<br \/>\nacceptable character. And you&#8217;ve already seen how that can be implemented:<\/p>\n<p>hexchar=&quot;$(echo &quot;$char&quot; | xxd -ps -c1 | head -1)&quot;<br \/>\noutput=&quot;$output%$hexchar&quot;<\/p>\n<p>A quick run through:<\/p>\n<p>$ sh normalize.sh &quot;li nux?&quot;<br \/>\nli nux? translates to li%20nux%3f<\/p>\n<p>See the problem? Without converting the hex into uppercase, it&#8217;s a bit<br \/>\nweird looking. What&#8217;s &#8220;nux&#8221;? That&#8217;s just another step in the subshell<br \/>\ninvocation:<\/p>\n<p>hexchar=&quot;$(echo &quot;$char&quot; | xxd -ps -c1 | head -1 |<br \/>\ntr &#39;[a-z]&#39; &#39;[A-Z]&#39;)&quot;<\/p>\n<p>And now, with that tweak, the output looks good:<\/p>\n<p>$ sh normalize.sh &quot;li nux?&quot;<br \/>\nli nux?
translates to li%20nux%3F<\/p>\n<p>What about a non-ASCII character like an umlaut or an n-tilde? Let&#8217;s<br \/>\nsee what happens:<\/p>\n<p>$ sh normalize.sh &quot;Se\u00f1or G\u00fcnter&quot;<br \/>\nSe\u00f1or G\u00fcnter translates to Se%C3B1or%200AG%C3BCnter<\/p>\n<p>Ah, there&#8217;s a bug in the script when it comes to these two-byte UTF-8 character<br \/>\nsequences, because each special letter should have two hex byte sequences. In<br \/>\nother words, it should be converted to Se%C3%B1or G%C3%BCnter (I restored the<br \/>\nspace to make it a bit easier to see what I&#8217;m talking about).<\/p>\n<p>The script gets the right byte values, but it&#8217;s missing<br \/>\na percentage sign before the second byte: %C3B1 should be %C3%B1, and<br \/>\n%C3BC should be %C3%BC.<\/p>\n<p>Undoubtedly, the problem is in the hexchar assignment subshell statement:<\/p>\n<p>hexchar=&quot;$(echo &quot;$char&quot; | xxd -ps -c1 | head -1 |<br \/>\ntr &#39;[a-z]&#39; &#39;[A-Z]&#39;)&quot;<\/p>\n<p>Is it the -c1 argument to xxd? Maybe. I&#8217;m going to leave identifying and<br \/>\nfixing the problem as an exercise for you, dear reader. And while you&#8217;re<br \/>\nfixing up the script to support two-byte characters, why not replace<br \/>\n&#8220;%20&#8221; with &#8220;+&#8221; too?<\/p>\n<p>Finally, to make this maximally useful, don&#8217;t forget that there are a<br \/>\nnumber of symbols that are valid and don&#8217;t need to be converted within<br \/>\nURLs too, notably the set of &#8220;-_.\/!@#=&amp;?&#8221;, so you&#8217;ll want to<br \/>\nensure that they don&#8217;t get hexified (is that a word?).<\/p>\n<p><a href=\"https:\/\/www.linuxjournal.com\/content\/normalizing-filenames-and-data-using-bash-string-variable-manipulations\" target=\"_blank\" rel=\"noopener\">Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>URLify: convert letter sequences into safe URLs with hex equivalents. This is my 155th column.
That means I&#8217;ve been writing for Linux Journal for: $ echo &#8220;155\/12&#8221; | bc 12 No, wait, that&#8217;s not right. Let&#8217;s try that again: $ echo &#8220;scale=2;155\/12&#8221; | bc 12.91 Yeah, that many years. Almost 13 years of writing about &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/www.appservgrid.com\/paw92\/index.php\/2018\/10\/31\/normalizing-filenames-and-data-with-bash\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Normalizing Filenames and Data with Bash&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-1996","post","type-post","status-publish","format-standard","hentry","category-linux"],"_links":{"self":[{"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/posts\/1996","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/comments?post=1996"}],"version-history":[{"count":1,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/posts\/1996\/revisions"}],"predecessor-version":[{"id":2052,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/posts\/1996\/revisions\/2052"}],"wp:attachment":[{"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/media?parent=1996"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/categories?post=1996"},{"taxonomy":"post_tag",
"embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/tags?post=1996"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}