PHP vs. The Developer: Encoding Character Sets

Ask most developers about character encoding and you're apt to see one of three responses:

  1. a dramatic rolling of the eyes
  2. a look of dazed incomprehension
  3. a pantomimed bullet to the brain

Ask a PHP developer about character encodings and 99 out of 100 times you'll get either the second or third answer (the third being most likely if the developer has experience with them).

For a pretty decent description of the problems (and the best way to avoid them), check out: http://webmonkeyuk.wordpress.com/2011/04/23/how-to-avoid-character-encoding-problems-in-php/.

If avoiding the problem isn't an option, though, read on.

The Default PHP Encoding

PHP has two default encodings: the default character encoding for text read in from files and the default encoding for output (both to streams and files).

When text is read from a file, PHP retains the character encoding used in the file. So UTF-8 files will be read in as UTF-8 strings and ISO-8859-1 files will be read in as ISO-8859-1 encoded strings.

Note: PHP files are not excluded from the default-encoding rule. Specifically, strings defined in a PHP file will be encoded using whatever character set the file is stored in.
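To make that rule concrete, here is a minimal sketch that stores the same character in two encodings and reads it back. The scratch files are hypothetical, and the bytes are spelled out with escape sequences so the example does not depend on how this page itself is encoded.

```php
<?php
// "é" as raw bytes: two bytes in UTF-8, one byte in ISO-8859-1.
$utf8Bytes   = "\xC3\xA9";
$latin1Bytes = "\xE9";

// Hypothetical scratch files, created only for this demonstration.
$utf8File   = tempnam(sys_get_temp_dir(), 'enc');
$latin1File = tempnam(sys_get_temp_dir(), 'enc');
file_put_contents($utf8File, $utf8Bytes);
file_put_contents($latin1File, $latin1Bytes);

// PHP hands back exactly the bytes that were stored; nothing is converted.
$fromUtf8   = file_get_contents($utf8File);
$fromLatin1 = file_get_contents($latin1File);

echo bin2hex($fromUtf8), PHP_EOL;   // c3a9
echo bin2hex($fromLatin1), PHP_EOL; // e9

unlink($utf8File);
unlink($latin1File);
```

String literals behave the same way: a literal typed into a PHP file carries whatever bytes the editor saved, so the same source text yields different strings depending on the file's encoding.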

The other default encoding is for output: the charset declared in the Content-Type header of every HTTP response. That one is set in the PHP INI file, through the default_charset directive.
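If you would rather not rely on the INI value, the charset can be declared per response. The sketch below uses a hypothetical helper, content_type_header(), which is not part of PHP; it only builds the header line and falls back to the default_charset setting when no charset is passed.

```php
<?php
/**
 * Build a Content-Type header line with an explicit charset.
 * Hypothetical helper (not a PHP built-in); falls back to the
 * default_charset INI setting when no charset is given.
 */
function content_type_header(string $mimeType, ?string $charset = null): string
{
    if ($charset === null || $charset === '') {
        // ini_get() returns false for unknown settings; the cast turns that into ''.
        $charset = (string) ini_get('default_charset');
    }
    return $charset === ''
        ? 'Content-Type: ' . $mimeType
        : 'Content-Type: ' . $mimeType . '; charset=' . $charset;
}

// In a web request this would be sent before any output:
// header(content_type_header('text/html', 'UTF-8'));
echo content_type_header('text/html', 'UTF-8'), PHP_EOL;
// Content-Type: text/html; charset=UTF-8
```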

Getting Into Trouble

One of the most common problems with character encoding occurs when the default output character encoding is not consistent with the character encoding used in the server's PHP files. In that case, the server will be telling clients that response content is encoded in the wrong character set, which can result in scrambled text.
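Here is a small sketch of what that scrambling looks like at the byte level. It assumes the mbstring extension is loaded and simulates a client that was told a UTF-8 body is ISO-8859-1.

```php
<?php
// "é" encoded as UTF-8 (two bytes).
$utf8 = "\xC3\xA9";

// Simulate a client that believes the body is ISO-8859-1:
// each byte is treated as its own character, then re-encoded as UTF-8 for display.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($misread), PHP_EOL; // c383c2a9, which renders as "Ã©"
```

One character has become two, which is the classic symptom of UTF-8 text being served with a single-byte charset label.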

Another potential problem is inconsistent character encoding between PHP scripts and databases, which can also result in scrambled text.
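One way to reduce the database mismatch is to state the connection charset explicitly instead of trusting server defaults. The sketch below assumes MySQL accessed through PDO, which this post does not otherwise cover; mysql_dsn() is a hypothetical helper, and the host, database name and credentials are placeholders.

```php
<?php
/**
 * Build a pdo_mysql DSN that states the connection charset explicitly.
 * Hypothetical helper; all argument values used below are placeholders.
 */
function mysql_dsn(string $host, string $dbname, string $charset = 'utf8mb4'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbname, $charset);
}

$dsn = mysql_dsn('localhost', 'example_db');
echo $dsn, PHP_EOL; // mysql:host=localhost;dbname=example_db;charset=utf8mb4

// With a real server and credentials, the connection would be opened like this:
// $pdo = new PDO($dsn, 'user', 'password');
```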

Of course, the experienced PHP programmer would at least be aware of the multi-byte string extension, designed to handle just these types of situations. Using mb_detect_encoding() and mb_convert_encoding(), such a developer may think she's got encoding licked. She'd be wrong.

Sadly, mb_detect_encoding() can flat out fail:

For UTF-16, UTF-32, UCS2 and UCS4, encoding detection will fail always.

Source: http://www.php.net/manual/en/function.mb-detect-order.php

It can also be flat out wrong. In fact, in my testing, mb_detect_encoding() always detected strings as either ASCII or UTF-8 (the difference being whether any non-ASCII characters were present).
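A rough sketch of why detection is so shaky, assuming the mbstring extension is loaded. Validating a string against a known encoding is deterministic, but every byte sequence is valid ISO-8859-1, so choosing between candidates can only ever be a guess.

```php
<?php
$utf8   = "\xC3\xA9"; // "é" in UTF-8
$latin1 = "\xE9";     // "é" in ISO-8859-1

// Validation is deterministic: is this byte sequence legal in the given encoding?
var_dump(mb_check_encoding($utf8, 'UTF-8'));      // bool(true)
var_dump(mb_check_encoding($latin1, 'UTF-8'));    // bool(false)

// Every byte sequence is legal ISO-8859-1, so the UTF-8 bytes pass as well.
var_dump(mb_check_encoding($utf8, 'ISO-8859-1')); // bool(true)

// Strict detection with an explicit candidate list; the answer is still a heuristic.
echo mb_detect_encoding($latin1, array('UTF-8', 'ISO-8859-1'), true), PHP_EOL;
```

If the encoding is declared somewhere (a header, a database setting, a file format), checking the string against that declaration is more trustworthy than asking PHP to guess.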

Test Code

To test exactly how PHP and Apache handled character encoding, I set up a simple test script to encode/decode characters in HTTP requests and responses.

Here's the code:

<?php
// Encoding assumed for the form submission, and the fallback for unknown values.
$default_encoding = 'UTF-8';
$valid_encodings = array(
    'UTF-8', 'UTF-16', 'ISO-8859-1', 'ISO-8859-2', 'ISO-8859-3', 'ISO-8859-4',
    'ISO-8859-5', 'ISO-8859-6', 'ISO-8859-7', 'ISO-8859-8', 'ISO-8859-9', 'ISO-8859-10',
    'ISO-8859-13', 'ISO-8859-14', 'ISO-8859-15');
// Accept only encodings from the list above; anything else falls back to the default.
$encoding = (array_key_exists('encoding', $_POST) && in_array($_POST['encoding'], $valid_encodings))
    ? $_POST['encoding'] : $default_encoding;

if (array_key_exists('string', $_POST)) {
    // Step 1: the form was submitted. Re-encode the string and POST it back to this script.
    $string = $_POST['string'];
    $encoded_string = mb_convert_encoding($string, $encoding, 'UTF-8');

    // Drop any existing query string before appending the test flag.
    $path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);
    $ch = curl_init('http://' . 'portal.vidtel.lan' . $path . '?test=1');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Set the request data
    curl_setopt($ch, CURLOPT_POSTFIELDS, $encoded_string);
    // Set the headers
    $headers = array(
        'Content-Type: text/plain;charset=' . strtolower($encoding),
        'Content-Length: ' . strlen($encoded_string));
    curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
    $response = curl_exec($ch);
} elseif (array_key_exists('test', $_GET)) {
    // Step 2: the cURL request arrived. Read the declared charset from the Content-Type header.
    $content_type = isset($_SERVER['CONTENT_TYPE']) ? $_SERVER['CONTENT_TYPE'] : '';
    // Capture the charset value only, stopping at whitespace or a trailing semicolon.
    if (preg_match('/charset=([^;\s]+)/i', $content_type, $match)) {
        $charset = strtoupper($match[1]);
        $encoding = (in_array($charset, $valid_encodings)) ? $charset : $default_encoding;
    } else {
        $encoding = $default_encoding;
    }

    // Read the raw request body and convert it back to UTF-8 for display.
    $encoded_string = file_get_contents('php://input');
    $string = ($encoding == $default_encoding) ? $encoded_string : mb_convert_encoding($encoded_string, $default_encoding, $encoding);
    printf('"%s" was detected as being encoded with %s.' . PHP_EOL, $string, mb_detect_encoding($encoded_string));
    echo '"' . $string . '" was actually encoded using ' . $encoding . ': ' . $encoded_string;
    exit();
}
?><!DOCTYPE html>
<html>
 <head>
 <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
 <title></title>
 </head>
 <body>
 <h1>Provide a String and Character Set</h1>
 <form method="post">
 <p>
 <input name="string" <?php if (isset($string)) printf('value="%s" ', htmlspecialchars($string, ENT_QUOTES, 'UTF-8')); ?>>
 <select name="encoding">
<?php foreach ($valid_encodings as $charset) : ?>
 <option<?php if ($charset == $encoding) echo ' selected="selected"'; ?>><?php echo $charset; ?></option>
<?php endforeach; ?>
 </select>
 <input type="submit" value="Test Encoding">
 </p>
 </form>
<?php if (isset($response)) : ?>
 <h1>Test Result</h1>
 <pre><?php echo htmlspecialchars((string) $response, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8'); ?></pre>
<?php endif; ?>
 </body>
</html>