Jump to content

How to Use UTF-8 with JavaScript

+ 1
  simonstl's Photo
Posted Nov 11 2009 10:51 AM

Unicode Transformation Format 8-bit (UTF-8) is in standard use by the majority of web application developers. The following excerpt from Cal Henderson's Building Scalable Web Sites explains how to use UTF-8 with Javascript.

Modern browsers have Unicode support built right into Javascript—the basic String class stores code points rather than bytes, and the string manipulation functions work correctly. When you copy data in and out of forms using Javascript, the data that finally gets submitted is UTF-8 (assuming you specified that for the page's encoding type).

The only thing to watch out for is that the built-in function escape(), which is used to format strings for inclusion in a URL, does not support Unicode characters. This means that if you want to let users input text that you'll then build a URL from (such as building a GET query string), then you can't use escape().

Luckily, since Javascript supports code points natively and allows you to query them using the String.getCodeAt() method, you can fairly easily write your own UTF-8-safe escaping function:

function escape_utf8(data) {

        if (data == '' || data == null){

               return '';

        }

       data = data.toString();

       var buffer = '';

       for(var i=0; i<data.length; i++){

               var c = data.charCodeAt(i);

               var bs = new Array();

              if (c > 0x10000){

                       // 4 bytes

                       bs[0] = 0xF0 | ((c &amp; 0x1C0000) >>> 18);

                       bs[1] = 0x80 | ((c &amp; 0x3F000) >>> 12);

                       bs[2] = 0x80 | ((c &amp; 0xFC0) >>> 6);

                   bs[3] = 0x80 | (c &amp; 0x3F);

               }else if (c > 0x800){

                        // 3 bytes

                        bs[0] = 0xE0 | ((c &amp; 0xF000) >>> 12);

                        bs[1] = 0x80 | ((c &amp; 0xFC0) >>> 6);

                       bs[2] = 0x80 | (c &amp; 0x3F);

             }else if (c > 0x80){

                      // 2 bytes

                       bs[0] = 0xC0 | ((c &amp; 0x7C0) >>> 6);

                      bs[1] = 0x80 | (c &amp; 0x3F);

               }else{

                       // 1 byte

                    bs[0] = c;

              }

             for(var j=0; j<bs.length; j++){

                      var b = bs[j];

                       var hex = nibble_to_hex((b &amp; 0xF0) >>> 4) 

                      + nibble_to_hex(b &amp;0x0F);buffer += '%'+hex;

              }

    }

    return buffer;

}

function nibble_to_hex(nibble){

        var chars = '0123456789ABCDEF';

        return chars.charAt(nibble);

}

The escape_utf8() function works by iterating over each code point in the string, creating a UTF-8 byte stream. It then loops over the bytes in this stream, formatting each one using the %XX format for escaping bytes in a URL. A further improvement to this function would be to leave alphanumeric characters as-is in the escaped version of the string, so that the returned values are easily readable in the common case.

Cover of Building Scalable Web Sites
Learn more about this topic from Building Scalable Web Sites. 

This comprehensive guide covers the design of software and hardware systems for web applications. Using scores of examples and leading-edge tips, it details proven methods for scaling web applications to millions of users. Topics include application architecture, development practices, technologies, Unicode, and general infrastructure work. Ideal for anyone ready to realize the cost and performance benefits available to web applications today.

Learn More Read Now on Safari


Tags:
0 Subscribe


0 Replies