40x Speedup With iconv And PHP
Published on 03 March 2009
For our product Annabel, we have to clean up the data our customers provide us. Because this is a fully automated process, we are unable to give direct feedback and have them fix their input. Therefore, I need a means to clean the data on our end, so we can process it.
Since we don’t need to support Unicode, we can stick with plain ASCII. That’s a very safe approach, which greatly reduces the chance of failure. To convert the UTF-8 (Unicode) input into ASCII, we use the iconv function from the GNU C Library in combination with PHP.
The default iconv implementation in PHP has two caveats: it stops when a string cannot be converted, and it prints a question mark when it does not have an equivalent (or transliterated) character in the destination character set. To work around both problems, I used to convert every single character separately with the PHP iconv function, which gave me a throughput of about 250KiB/sec, using the following code:
<?php
/**
 * Replaces special characters with their ASCII equivalents.
 *
 * This function uses iconv to replace each separate character with its
 * ASCII equivalent, using the ASCII//TRANSLIT option. However, this makes
 * the function very slow: max throughput is about 250KiB/sec.
 *
 * @param string $line
 * @return string
 */
protected function _convertSpecialChars($line) {
    if (empty($line)) return '';
    $new_line = "";
    /*
     * This potentially could be a very long string, so don't split the line
     * in separate tokens, for that would take way too much memory.
     */
    $line_length = strlen($line);
    for ($x = 0; $x < $line_length; $x++) {
        $old_char = substr($line, $x, 1);
        /*
         * Use iconv to replace the other special characters.
         * If iconv can't convert it (and so returns '?'), just skip
         * the character, for it probably is something malicious and
         * there's probably no need to keep it anyway.
         *
         * Beware of the edge case if the original character is a
         * question mark itself: that one has to be kept.
         */
        $char = iconv('UTF-8', 'ASCII//TRANSLIT', $old_char);
        if ( ('?' !== $char) || ('?' === $old_char) ) $new_line .= $char;
    }
    return $new_line;
}
However, I was not satisfied with this, so I looked up the man page of the iconv that ships with the GNU C Library. I assumed PHP was using this one internally, so that seemed the natural place to look. In that man page I found the IGNORE option, which simply skips any character that cannot be converted or transliterated. That was exactly what I wanted, so I tried it with the PHP function as well, and it worked.
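To see how the options behave on a given system, a small comparison along these lines can help. It is only a minimal sketch: the helper name tryConvert and the sample string are made up for illustration, and the actual results depend on the iconv implementation, its version, and the current locale, so no expected output is shown.

```php
<?php
/**
 * Hypothetical helper for this demo: converts $input from UTF-8 to
 * $target and reports a failure as text instead of a PHP notice.
 */
function tryConvert(string $input, string $target): string
{
    // iconv() returns false (and raises a notice) when it gives up;
    // the @ operator only silences that notice for this demo.
    $result = @iconv('UTF-8', $target, $input);
    return (false === $result) ? '(conversion failed)' : $result;
}

// Made-up sample: one character that can be transliterated (e-acute)
// and one that has no ASCII equivalent (a CJK ideograph).
$sample = "Caf\u{00E9} \u{4E2D} menu";

// Compare plain ASCII, TRANSLIT, and TRANSLIT combined with IGNORE.
foreach (['ASCII', 'ASCII//TRANSLIT', 'ASCII//TRANSLIT,IGNORE'] as $target) {
    printf("%-24s => %s\n", $target, tryConvert($sample, $target));
}
```

Running this once on every machine that has to process the data is a cheap way to find out whether its iconv accepts the combined option at all.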
Instead of converting every single character, I can now convert a whole file at once, which gives me a throughput of 11MiB/sec. The caveat, of course, is that I have to use the GNU C Library iconv, in the current version or a newer one, to avoid compatibility problems. However, that’s a price I’m more than willing to pay. The new code looks like this:
<?php
/**
 * Replaces special characters with their ASCII equivalents.
 *
 * This function uses iconv to replace all characters of the input with
 * their ASCII equivalents in a single call, using the
 * ASCII//TRANSLIT,IGNORE option. Throughput is measured at about 11MiB/sec.
 *
 * WARNING: Using the extra IGNORE option only works with a recent
 * GNU libc iconv, so be very picky about which iconv to use! This is an
 * undocumented feature, which is not supported by default and is not
 * listed in the PHP manual!
 *
 * @param string $line
 * @return string
 */
protected function _convertSpecialChars($line)
{
    /*
     * Check whether we have the right version of iconv:
     * refuse anything that is not glibc or is older than 2.8.90.
     */
    if ( ('glibc' !== ICONV_IMPL) || version_compare(ICONV_VERSION, '2.8.90', '<') ) {
        throw new Exception('Please use the glibc iconv, version 2.8.90 or higher');
    }
    /*
     * Use iconv for speed and glory
     * We use the ASCII//TRANSLIT,IGNORE option to replace the string
     * with its ASCII transliterated equivalent. If there's no ASCII
     * equivalent, the IGNORE option makes sure the character is just
     * thrown away, which is exactly what I want.
     */
    $new_line = iconv('UTF-8', 'ASCII//TRANSLIT,IGNORE', $line);
    return $new_line;
}
I guess I don’t need to give further comments on this code example 😉
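For anyone who wants to check the throughput on their own machine, a rough measurement could look like the sketch below. The function name measureThroughput, the generated test data, and its size are assumptions for illustration, not the setup behind the figures quoted above, so expect different numbers.

```php
<?php
/**
 * Hypothetical helper: runs $converter once on $data and returns the
 * throughput in MiB/sec.
 */
function measureThroughput(callable $converter, string $data): float
{
    $start = microtime(true);
    $converter($data);
    $elapsed = microtime(true) - $start;

    // Guard against a zero interval on very small inputs.
    return (strlen($data) / (1024 * 1024)) / max($elapsed, 1e-9);
}

// Made-up test data: roughly 1 MiB of mixed ASCII and accented characters.
$data = str_repeat("na\u{00EF}ve caf\u{00E9} 123 ", 60000);

// Whole-string conversion, as in the new version of the function.
// The @ operator silences notices from iconv builds that reject the option.
$wholeString = function (string $input) {
    return @iconv('UTF-8', 'ASCII//TRANSLIT,IGNORE', $input);
};

printf("whole string: %.2f MiB/sec\n", measureThroughput($wholeString, $data));
```

A single run is noisy, so repeating the measurement a few times and on realistic input gives a more trustworthy figure.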
If you have questions or comments, you could drop me a line.