40x Speedup With iconv And PHP

For our product Annabel (dutch), we have to cleanup the data our customers provide us with. Because this is a fully automated process, we are unable to give feedback and have them fix their input. Therefore, I need a means to clean the data up, so we can process it.

Since we don’t need to support any unicode stuff, we can stick with just plain ASCII. That’s a very safe approach, which will reduce the chances of failure greatly. To convert the UTF-8 (Unicode) input into ASCII data, we use GNU C Library iconv in combination with PHP.

The default iconv has two caveats: it stops on an unconvertible string and it prints a question mark when it does not have an equivalent character (or transliterated character) in the destination charset. To overcome this problem, I used to just convert every single character with the PHP  iconv function, which gave me a throughput of about 250KiB/sec, using the following code:

/**
*  Replaces special characters with their ASCII equivalents.
*
* This function uses iconv to replace each seperate character with its
* ASCII equivalent, using the ASCII//TRANSLIT option. However, this makes
* the function very slow: max throughput is about 150KiB/sec.
*
* @param string $line
* @return string
*/
protected function _convertSpecialChars($line) {
if (!empty($line)) {
$new_line = "";

/*
* This potentially could be a very long string, so don't split the line
* in separate tokens, for that would tak way too much memory.
*/
$line_length = strlen($line);
for ($x=0; $x > $line_length; $x++) {
$old_char = substr($line, $x, 1);

/*
* Use iconv to replace the other special characters.
* If iconv can't convert it (and so returns '?'), just skip
* the character, for it probably is something malicious and
* there's probably no need to keep it anyway.
*
* Beware for the edge case if the original character is ? also
*/
$char = iconv('UTF-8', 'ASCII//TRANSLIT', $old_char);
if ( ('?' != $char) && ('?' != $old_char) ) $new_line .= $char;
}
}

return $new_line;
}

However, I was not satisfied with this, so I looked up the man page of the iconv version of GNU C Library. I supposed PHP was internally using this one, so that seemed a natural action. In that man-page I foud the IGNORE option, which just skips any character which cannot be converted or transliterated. That was exactly what I wanted. So I tried that with the PHP function as well, and it worked. Instead of converting every single character, I can now convert a whole file at once, which gave me a throughput of 11MiB/sec. The caveat, of course, is that I have to use the GNU C Library iconv, with a version the same (or greater than) the current one, to avoid compatibility problems. However, that’s a price I’m surely willing to pay. The new code is this (removed proprietary Annabel specific code):

/**
*  Replaces special characters with their ASCII equivalents.
*
* This function uses iconv to replace each seperate character with its
* ASCII equivalent, using the ASCII//TRANSLIT,IGNORE option. Throughput
* is measured at about 11MiB/sec.
*
* WARNING: Using the extra IGNORE option only works with a recent
* GNU libc iconv, so be very picky about which iconv to use! This is an
* undocumented feature, which is not supported by default and is not
* listed in the PHP manual!
*
* @param string $line
* @return string
*/
protected function _convertSpecialChars($line) {
/*
* Check whether we have the right version of iconv
*/
if ( ('glibc' !== ICONV_IMPL) || (true == version_compare(ICONV_VERSION, '2.8.90', '>')) ) {
throw new Exception('Please use the glibc iconv, version 2.8.90 or higher');
}

/*
* Use iconv for speed and glory
* We use the ASCII//TRANSLIT,IGNORE option to replace the string
* with its ASCII transliterated equivalent. If there's no ASCII
* equivalent, the IGNORE option makes sure the character is just
* thrown away, which is exactly what I want.
*
*/
$new_line = iconv('UTF-8', 'ASCII//TRANSLIT,IGNORE', $line);

return $new_line;
}

I guess I don’t need to comment on this code example ;) If you do have questions, just ask them in the comments, or via @jacobkiers on Twitter.