URLify: convert letter sequences into safe URLs with hex
equivalents.
This is my 155th column. That means I’ve been writing for Linux
Journal for:
$ echo "155/12" | bc
12
No, wait, that’s not right. Let’s try that again:
$ echo "scale=2;155/12" | bc
12.91
Yeah, that many years. Almost 13 years of writing about shell scripts and
lightweight programming within the Linux environment. I’ve covered a lot
of ground, but I want to go back to something that’s fairly basic and
talk about filenames and the web.
It used to be that if you had filenames with spaces in them, bad things would
happen: “my mom’s cookies.html” was a recipe for disaster, not
good cookies—um, and not those sorts of web cookies either!
As the web evolved, however, encoding of special characters became the norm,
and every Web browser had to be able to manage it, for better or worse. So
spaces became either “+” or %20 sequences, and everything else that
wasn’t a regular alphanumeric character was replaced by its hex ASCII
equivalent.
In other words, “my mom’s cookies.html” turned into
“my+mom%27s+cookies.html” or “my%20mom%27s%20cookies.html”.
Many symbols took on a second life too, so “&” and “=” and
“?” all got their own meanings, which meant that they needed to be
protected if they were part of an original filename too. And what about if
you had a “%” in your original filename? Ah yes, the recursive nature
of encoding things….
So purely as an exercise in scripting, let’s write a script that
converts any string you hand it into a “web-safe” sequence. Before
starting, however, pull out a piece of paper and jot down how you’d solve
it.
Normalizing Filenames for the Web
My strategy is going to be easy: pull the string apart into individual
characters, analyze each character to identify if it’s an alphanumeric,
and if it’s not, convert it into its hexadecimal ASCII equivalent,
prefacing it with a “%” as needed.
There are a number of ways to break a string into its individual letters,
but let’s use Bash string variable manipulations, recalling that
${#var}
returns the number of characters in variable $var, and that
${var:x:1} will
return just the letter in $var at position x. Quick now, does indexing start
at zero or one?
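A quick experiment answers that (this relies on Bash-specific substring expansion, so run it with bash rather than a minimal POSIX sh):

```shell
var="linux"
echo "${#var}"       # number of characters: 5
echo "${var:0:1}"    # position 0 is the first letter: l
echo "${var:4:1}"    # position 4 is the last letter: x
```

So indexing starts at zero, just like C arrays.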
Here’s my initial loop to break $original into its component letters:
input="$*"
echo $input
for (( counter=0 ; counter < ${#input} ; counter++ ))
do
   echo "counter = $counter -- ${input:$counter:1}"
done
Recall that $* is a shortcut for everything from the invoking command line
other than the command name itself—a lazy way to let users quote the
argument or not. It doesn’t address special characters, but that’s
what quotes are for, right?
Let’s give this fragmentary script a whirl with some input from the
command line:
$ sh normalize.sh "li nux?"
li nux?
counter = 0 -- l
counter = 1 -- i
counter = 2 --
counter = 3 -- n
counter = 4 -- u
counter = 5 -- x
counter = 6 -- ?
There’s obviously some debugging code in the script, but it’s
generally a good idea to leave that in until you’re sure it’s working
as expected.
Now it’s time to differentiate between characters that are acceptable
within a URL and those that are not. Turning a character into a hex sequence
is a bit tricky, so I’m using a sequence of fairly obscure
commands. Let’s start with just the command line:
$ echo '~' | xxd -ps -c1 | head -1
7e
Now, the question is whether “~” is actually the hex ASCII sequence
7e or not. A quick glance at http://www.asciitable.com confirms that, yes, 7e is
indeed the ASCII for the tilde. Preface that with a percentage sign, and
the tough job of conversion is managed.
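As an aside, printf can produce the same hex without the xxd pipeline: POSIX printf treats an argument with a leading quote as the character’s numeric code. It only works a byte at a time, but for plain ASCII, it even can do the uppercasing and the percent sign in one step:

```shell
# %02X prints the character code as two uppercase hex digits
printf '%02X\n' "'~"     # 7E
# double the % to get a literal percent sign in front
printf '%%%02X\n' "'~"   # %7E
```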
But, how do you know what characters can be used as they are? Because of the weird
way the ASCII table is organized, that’s going to be three ranges:
0–9 is in one area of the table, then A–Z in a second area and
a–z in a
third. There’s no way around it, that’s three range tests.
There’s a really cool way to do that in Bash too:
if [[ "$char" =~ [a-z] ]]
What’s happening here is that this is actually a regular expression (the
=~) and a range [a-z] as the test. Since the action
I want to take after
each test is identical, it’s easy now to implement all three tests:
if [[ "$char" =~ [a-z] ]]; then
   output="$output$char"
elif [[ "$char" =~ [A-Z] ]]; then
   output="$output$char"
elif [[ "$char" =~ [0-9] ]]; then
   output="$output$char"
else
As is obvious, the $output string variable will be built up to have the
desired value.
What’s left? The hex output for anything that’s not an otherwise
acceptable character. And you’ve already seen how that can be implemented:
hexchar="$(echo "$char" | xxd -ps -c1 | head -1)"
output="$output%$hexchar"
A quick run through:
$ sh normalize.sh "li nux?"
li nux? translates to li%20nux%3f
See the problem? Without converting the hex into uppercase, it’s a bit
weird looking. What’s “nux”? That’s just another step in the subshell
invocation:
hexchar="$(echo "$char" | xxd -ps -c1 | head -1 |
   tr '[a-z]' '[A-Z]')"
And now, with that tweak, the output looks good:
$ sh normalize.sh "li nux?"
li nux? translates to li%20nux%3F
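For reference, here’s one way the fragments above fit together into a single piece of code; I’ve wrapped the logic in a function to make it easy to reuse, but it’s exactly what the normalize.sh script runs:

```shell
#!/bin/bash
# urlify: convert a string into its web-safe equivalent,
# encoding anything that isn't a plain alphanumeric as %XX

urlify()
{
  local input="$*" output="" char hexchar counter
  for (( counter=0 ; counter < ${#input} ; counter++ ))
  do
    char="${input:$counter:1}"
    if [[ "$char" =~ [a-z] ]]; then
      output="$output$char"
    elif [[ "$char" =~ [A-Z] ]]; then
      output="$output$char"
    elif [[ "$char" =~ [0-9] ]]; then
      output="$output$char"
    else
      hexchar="$(echo "$char" | xxd -ps -c1 | head -1 |
        tr '[a-z]' '[A-Z]')"
      output="$output%$hexchar"
    fi
  done
  echo "$output"
}

urlify "li nux?"    # li%20nux%3F
```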
What about a multibyte character like an umlaut or an n-tilde? Let’s
see what happens:
$ sh normalize.sh "Señor Günter"
Señor Günter translates to Se%C3B1or%200AG%C3BCnter
Ah, there’s a bug in the script when it comes to these two-byte character
sequences, because each special letter should have two hex byte sequences. In
other words, it should be converted to Se%C3%B1or G%C3%BCnter (I restored the
space to make it a bit easier to see what I’m talking about).
In other words, this gets the right sequences, but it’s missing
a percentage sign: %C3B1 should be %C3%B1, and
%C3BC should be %C3%BC.
Undoubtedly, the problem is in the hexchar assignment subshell statement:
hexchar="$(echo "$char" | xxd -ps -c1 | head -1 |
   tr '[a-z]' '[A-Z]')"
Is it the -c1 argument to xxd? Maybe. I’m going to leave identifying and
fixing the problem as an exercise for you, dear reader. And while you’re
fixing up the script to support two-byte characters, why not replace
“%20” with “+” too?
Finally, to make this maximally useful, don’t forget that there are a
number of symbols that are valid and don’t need to be converted within
URLs too, notably the set of "-_./!@#=&?", so you’ll want to
ensure that they don’t get hexified (is that a word?).
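One way to handle that, sketched here as a hypothetical tweak: collapse the three range checks into a single bracket expression that also lists the safe symbols. Keeping the pattern in a variable lets bash’s =~ treat the punctuation literally, and the leading "-" inside the brackets keeps it from being read as a range:

```shell
# Characters that can pass through a URL unencoded (per the set above)
safe='[-_./!@#=&?a-zA-Z0-9]'

for char in 'a' '=' '?' ' ' '%'
do
  if [[ "$char" =~ $safe ]]; then
    echo "'$char' is safe as-is"
  else
    echo "'$char' still needs encoding"
  fi
done
```

In the script, this single test would replace the three-way if/elif chain; note that the space and the "%" still (correctly) fall through to the hex-encoding branch.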