As UTF-8 encoding becomes more common in web development, it occasionally causes bugs that are not easy to track down. I was recently working on a Swedish website, Opus Bilprovning, through Graphic Fusion here in Tucson. Since the site is in Swedish, I had to deal with a few characters that I don’t use every day, such as ä, å, and ö. Other than trying to pick up a few basic Swedish words, these new characters didn’t seem like a big deal.
Then strlen() came along and broke the script
All was going well until I tried to extract a word from a longer string that was UTF-8 encoded. As a simplified example, let’s take the word “tjänster”, which is Swedish for “services”, and count its characters using strlen():
$string = "tjänster";
echo "Length: " . strlen($string);
Looking at the word, I count 8 characters, but strlen() tells me there are 9. Why? The problem comes from how strlen() measures a string. In single-byte character sets such as ISO-8859-1, every character is represented with 1 byte, so the length of the string is the same as the number of bytes in it. PHP’s strlen() relies on that assumption and simply returns the number of bytes in the string.
Why this doesn’t work with UTF-8 strings
In UTF-8, not all characters are represented with 1 byte. Plain ASCII characters take 1 byte each, but other characters take 2, 3, or as many as 4 bytes. In this example, the character “ä” is represented using 2 bytes in UTF-8, so the strlen() function returns 9 rather than the expected 8.
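To see where the extra byte comes from, here is a minimal sketch that prints each character of the word next to its byte count and raw bytes. The string is written with explicit byte escapes (an assumption made for the demo, so the result does not depend on how the source file is saved), and preg_split() with the /u modifier is just one convenient way to split a UTF-8 string into characters.

```php
<?php
// "tjänster" written with explicit UTF-8 bytes, so the result does not
// depend on the encoding of this source file.
$string = "tj\xC3\xA4nster";

// Split into characters; the /u modifier makes PCRE treat the input as UTF-8.
$chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);

foreach ($chars as $char) {
    // strlen() reports bytes; bin2hex() shows what those bytes are.
    printf("%s  %d byte(s)  %s\n", $char, strlen($char), bin2hex($char));
}

echo "Characters: " . count($chars) . "\n";
echo "Bytes: " . strlen($string) . "\n";
```

Every character reports 1 byte except “ä”, which reports 2 (c3 a4), and that accounts for the difference between 8 characters and 9 bytes.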
How this is solved
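If all you need is the length of the string, the solution is simple: pass the string through utf8_decode() before calling strlen(). utf8_decode() converts the string from UTF-8 to ISO-8859-1, where every character occupies exactly 1 byte (characters that don’t exist in ISO-8859-1 become a single “?”), so the byte count then matches the character count. Note that utf8_decode() is deprecated as of PHP 8.2, so this trick is best kept for older code.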
$string = "tjänster";
echo "Length: " . strlen( utf8_decode($string) );
There is another function that accomplishes the same thing, with the added benefit that it works with character sets other than UTF-8. The function is mb_strlen(), which is part of PHP’s mbstring extension. It takes two arguments: first, the string to measure, and second, the encoding used for the string. The second argument is optional and falls back to PHP’s internal encoding setting, but passing it explicitly avoids surprises.
$string = "tjänster";
echo "Length: " . mb_strlen( $string, 'utf-8' );
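The same byte-versus-character mismatch shows up when extracting a word from a longer string, which is where I first ran into it. The sketch below is only an illustration: the phrase “Våra tjänster” (“Our services”) is an example I made up for the demo, and it assumes the mbstring extension is available.

```php
<?php
// "Våra tjänster" written with explicit UTF-8 bytes (illustrative phrase).
$phrase = "V\xC3\xA5ra tj\xC3\xA4nster";

// The second word starts at character offset 5: V, å, r, a, space.
$byByte = substr($phrase, 5);                    // offset counted in bytes
$byChar = mb_substr($phrase, 5, null, 'UTF-8');  // offset counted in characters

var_dump($byByte); // still has the leading space, because "å" used 2 bytes
var_dump($byChar); // just the word
```

Because “å” occupies 2 bytes, the byte-based substr() starts one position too early, while mb_substr() lands on the first letter of the word.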