{"id":12212,"date":"2019-03-23T15:00:00","date_gmt":"2019-03-23T15:00:00","guid":{"rendered":"http:\/\/www.appservgrid.com\/paw92\/?p=12212"},"modified":"2019-03-23T15:00:00","modified_gmt":"2019-03-23T15:00:00","slug":"how-to-convert-files-to-utf-8-encoding-in-linux","status":"publish","type":"post","link":"https:\/\/www.appservgrid.com\/paw92\/index.php\/2019\/03\/23\/how-to-convert-files-to-utf-8-encoding-in-linux\/","title":{"rendered":"How to Convert Files to UTF-8 Encoding in Linux"},"content":{"rendered":"<p>In this guide, we will describe what character encoding is and cover a few examples of converting files from one character encoding to another using a command line tool. Finally, we will look at how to convert several files from any character set (<strong>charset<\/strong>) to\u00a0<strong>UTF-8<\/strong>\u00a0encoding in Linux.<\/p>\n<p>As you may already know, a computer does not understand or store letters, numbers, or anything else that humans can perceive; it works only with bits. A bit has only two possible values: either\u00a0<code>0<\/code>\u00a0or\u00a0<code>1<\/code>,\u00a0<code>true<\/code>\u00a0or\u00a0<code>false<\/code>,\u00a0<code>yes<\/code>\u00a0or\u00a0<code>no<\/code>. Everything else, such as letters, numbers, and images, must be represented in bits for a computer to process it.<\/p>\n<p>In simple terms,\u00a0<strong>character encoding<\/strong>\u00a0is a way of telling a computer how to interpret raw zeroes and ones as actual characters, where each character is represented by a number. When we type text in a file, the words and sentences we form are made up of different characters, and characters are organized into a\u00a0<strong>charset<\/strong>.<\/p>\n<p>There are various encoding schemes, such as\u00a0<strong>ASCII<\/strong>,\u00a0<strong>ANSI<\/strong>, and\u00a0<strong>Unicode<\/strong>, among others. 
Below is an example of\u00a0<strong>ASCII<\/strong>\u00a0encoding.<\/p>\n<pre>Character  bits\r\nA               01000001\r\nB               01000010<\/pre>\n<p>In Linux, the\u00a0<strong>iconv<\/strong>\u00a0command line tool is used to convert text from one encoding to another.<\/p>\n<p>You can check the encoding of a file with the\u00a0<strong>file<\/strong>\u00a0command, using the\u00a0<code>-i<\/code>\u00a0or\u00a0<code>--mime<\/code>\u00a0flag, which prints the MIME type string, as in the examples below:<\/p>\n<pre>$ file -i Car.java\r\n$ file -i CarDriver.java\r\n<\/pre>\n<div id=\"attachment_23274\" class=\"wp-caption aligncenter\">\n<p><a href=\"https:\/\/www.tecmint.com\/wp-content\/uploads\/2016\/10\/Check-File-Encoding-in-Linux.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-23274\" src=\"https:\/\/www.tecmint.com\/wp-content\/uploads\/2016\/10\/Check-File-Encoding-in-Linux.png\" alt=\"Check File Encoding in Linux\" width=\"738\" height=\"135\" data-lazy-loaded=\"true\" \/><\/a><\/p>\n<p class=\"wp-caption-text\">Check File Encoding in Linux<\/p>\n<\/div>\n<p>The syntax for using\u00a0<strong>iconv<\/strong>\u00a0is as follows:<\/p>\n<pre>$ iconv option\r\n$ iconv options -f from-encoding -t to-encoding inputfile(s) -o outputfile \r\n<\/pre>\n<p>Where\u00a0<code>-f<\/code>\u00a0or\u00a0<code>--from-code<\/code>\u00a0specifies the input encoding and\u00a0<code>-t<\/code>\u00a0or\u00a0<code>--to-code<\/code>\u00a0specifies the output encoding.<\/p>\n<p>To list all known coded character sets, run the command below:<\/p>\n<pre>$ iconv -l \r\n<\/pre>\n<div id=\"attachment_23275\" class=\"wp-caption aligncenter\">\n<p><a href=\"https:\/\/www.tecmint.com\/wp-content\/uploads\/2016\/10\/List-Coded-Charsets-in-Linux.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-23275\" src=\"https:\/\/www.tecmint.com\/wp-content\/uploads\/2016\/10\/List-Coded-Charsets-in-Linux.png\" sizes=\"auto, (max-width: 808px) 
100vw, 808px\" srcset=\"https:\/\/www.tecmint.com\/wp-content\/uploads\/2016\/10\/List-Coded-Charsets-in-Linux.png 808w, https:\/\/www.tecmint.com\/wp-content\/uploads\/2016\/10\/List-Coded-Charsets-in-Linux-768x598.png 768w\" alt=\"List Coded Charsets in Linux\" width=\"808\" height=\"629\" data-lazy-loaded=\"true\" \/><\/a><\/p>\n<p class=\"wp-caption-text\">List Coded Charsets in Linux<\/p>\n<\/div>\n<h3>Convert Files from ISO-8859-1 to UTF-8 Encoding<\/h3>\n<p>Next, we will learn how to convert from one encoding scheme to another. The command below converts from\u00a0<strong>ISO-8859-1<\/strong>\u00a0to\u00a0<strong>UTF-8<\/strong>\u00a0encoding.<\/p>\n<p>Consider a file named\u00a0<code>input.file<\/code>\u00a0which contains the characters:<\/p>\n<pre>\ufffd \ufffd \ufffd \ufffd\r\n<\/pre>\n<p>Let us start by checking the encoding of the file and then viewing its contents. Next, we convert all the characters to\u00a0<strong>UTF-8<\/strong>\u00a0encoding.<\/p>\n<p>After running the\u00a0<strong>iconv<\/strong>\u00a0command, we check the contents of the output file and its new encoding, as shown below.<\/p>\n<pre>$ file -i input.file\r\n$ cat input.file \r\n$ iconv -f ISO-8859-1 -t UTF-8\/\/TRANSLIT input.file -o out.file\r\n$ cat out.file \r\n$ file -i out.file \r\n<\/pre>\n<div id=\"attachment_23297\" class=\"wp-caption aligncenter\">\n<p><a href=\"https:\/\/www.tecmint.com\/wp-content\/uploads\/2016\/10\/Converts-UTF8-to-ASCII-in-Linux.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-23297\" src=\"https:\/\/www.tecmint.com\/wp-content\/uploads\/2016\/10\/Converts-UTF8-to-ASCII-in-Linux.png\" sizes=\"auto, (max-width: 943px) 100vw, 943px\" srcset=\"https:\/\/www.tecmint.com\/wp-content\/uploads\/2016\/10\/Converts-UTF8-to-ASCII-in-Linux.png 943w, https:\/\/www.tecmint.com\/wp-content\/uploads\/2016\/10\/Converts-UTF8-to-ASCII-in-Linux-768x218.png 768w\" alt=\"Convert UTF-8 to ASCII in Linux\" 
width=\"943\" height=\"268\" data-lazy-loaded=\"true\" \/><\/a><\/p>\n<p class=\"wp-caption-text\">Convert ISO-8859-1 to UTF-8 in Linux<\/p>\n<\/div>\n<p><strong>Note<\/strong>: If the string\u00a0<code>\/\/IGNORE<\/code>\u00a0is appended to to-encoding, characters that can\u2019t be converted are discarded and an error is displayed after conversion.<\/p>\n<p>Likewise, if the string\u00a0<code>\/\/TRANSLIT<\/code>\u00a0is appended to to-encoding, as in the example above (<strong>UTF-8\/\/TRANSLIT<\/strong>), characters being converted are transliterated when needed and possible. This means that when a character can\u2019t be represented in the target character set, it can be approximated through one or more similar-looking characters.<\/p>\n<p>Any character that can\u2019t be transliterated and is not in the target character set is replaced with a question mark\u00a0<code>(?)<\/code>\u00a0in the output.<\/p>\n<h3>Convert Multiple Files to UTF-8 Encoding<\/h3>\n<p>Coming back to our main topic, to convert multiple or all files in a directory to UTF-8 encoding, you can write a small shell script called\u00a0<strong>encoding.sh<\/strong>\u00a0as follows:<\/p>\n<pre>#!\/bin\/bash\r\n# Enter the input encoding here\r\nFROM_ENCODING=\"value_here\"\r\n# Output encoding (UTF-8)\r\nTO_ENCODING=\"UTF-8\"\r\n# Loop over the .txt files in the current directory and convert each one\r\nfor file in *.txt; do\r\n    [ -e \"$file\" ] || continue  # skip the unexpanded pattern when no .txt files exist\r\n    iconv -f \"$FROM_ENCODING\" -t \"$TO_ENCODING\" \"$file\" -o \"${file%.txt}.utf8.converted\"\r\ndone\r\nexit 0\r\n<\/pre>\n<div class=\"google-auto-placed ap_container\">\n<p>Save the file, then make the script executable. 
Run it from the directory where your files (<code>*.txt<\/code>) are located.<\/p>\n<pre>$ chmod +x encoding.sh\r\n$ .\/encoding.sh\r\n<\/pre>\n<p><strong>Important<\/strong>: You can also use this script for general conversion of multiple files from one given encoding to another; simply change the values of the\u00a0<code>FROM_ENCODING<\/code>\u00a0and\u00a0<code>TO_ENCODING<\/code>\u00a0variables, not forgetting the output file name\u00a0<code>\"${file%.txt}.utf8.converted\"<\/code>.<\/p>\n<p>For more information, look through the\u00a0<strong>iconv<\/strong>\u00a0man page.<\/p>\n<pre>$ man iconv\r\n<\/pre>\n<p>To sum up, understanding encoding and how to convert from one character encoding scheme to another is useful knowledge for every computer user, and even more so for programmers who deal with text.<\/p>\n<\/div>\n<p><a href=\"https:\/\/www.tecmint.com\/convert-files-to-utf-8-encoding-in-linux\/\" target=\"_blank\" rel=\"noopener\">Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this guide, we will describe what character encoding is and cover a few examples of converting files from one character encoding to another using a command line tool. Finally, we will look at how to convert several files from any character set (charset) to\u00a0UTF-8\u00a0encoding in Linux. 
As you may already know, &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/www.appservgrid.com\/paw92\/index.php\/2019\/03\/23\/how-to-convert-files-to-utf-8-encoding-in-linux\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;How to Convert Files to UTF-8 Encoding in Linux&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-12212","post","type-post","status-publish","format-standard","hentry","category-linux"],"_links":{"self":[{"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/posts\/12212","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/comments?post=12212"}],"version-history":[{"count":1,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/posts\/12212\/revisions"}],"predecessor-version":[{"id":12213,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/posts\/12212\/revisions\/12213"}],"wp:attachment":[{"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/media?parent=12212"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/categories?post=12212"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/tags?post=12212"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}