40x Speedup With iconv And PHP

Published on 03 March 2009 - development php work

This content is old and might be outdated.

For our product Annabel, we have to clean up the data our customers provide us. Because this is a fully automated process, we are unable to give direct feedback and have them fix their input. Therefore, I need a means to clean the data on our end, so we can process it.

Since we don’t need to support Unicode, we can stick with plain ASCII. That’s a very safe approach, which greatly reduces the chance of failure. To convert the UTF-8 (Unicode) input into ASCII data, we use the iconv function from the GNU C Library through PHP.

The default iconv implementation in PHP has two caveats: it stops when a string cannot be converted, and it outputs a question mark when it does not have an equivalent (or transliterated) character in the destination character set. To work around both, I used to convert every single character with the PHP iconv function, which gave me a throughput of about 250KiB/sec, using the following code:

<?php
/**
*  Replaces special characters with their ASCII equivalents.
*
* This function uses iconv to replace each separate character with its
* ASCII equivalent, using the ASCII//TRANSLIT option. However, this makes
* the function very slow: max throughput is about 250KiB/sec.
*
* @param string $line
* @return string
*/
protected function _convertSpecialChars($line) {
    if (empty($line)) return '';

    $new_line = "";

    /*
    * This potentially could be a very long string, so don't split the line
    * in separate tokens, for that would take way too much memory.
    */
    $line_length = strlen($line);
    
    for ($x = 0; $x < $line_length; $x++) {
        $old_char = substr($line, $x, 1);

        /*
        * Use iconv to replace the other special characters.
        * If iconv can't convert it (and so returns '?'), just skip
        * the character, for it probably is something malicious and
        * there's probably no need to keep it anyway.
        *
        * Beware of the edge case where the original character is a
        * question mark itself: that one is genuine, so keep it.
        */
        $char = iconv('UTF-8', 'ASCII//TRANSLIT', $old_char);
        if ( ('?' != $char) || ('?' == $old_char) ) $new_line .= $char;
    }
    
    return $new_line;
}

However, I was not satisfied with this, so I looked up the man page for the GNU C Library version of iconv. I assumed PHP was using this implementation internally, so that seemed the natural place to look. In that man page I found the IGNORE option, which simply skips any character that cannot be converted or transliterated. That was exactly what I wanted, so I tried it with the PHP function as well, and it worked.
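To see what a given iconv build does with each target string, a small probe like the one below can help. It is only a sketch: `tryConvert` and the sample string are made up for illustration, and the output differs between iconv implementations (some reject the `,IGNORE` suffix outright, others return a failure even after skipping characters), so no expected output is shown.

```php
<?php
/**
 * Hypothetical helper: try one iconv target and report what came back.
 * Returns null when iconv() fails. In that case iconv() returns false and
 * raises a notice or warning, which is silenced here with @.
 */
function tryConvert(string $target, string $input): ?string
{
    $result = @iconv('UTF-8', $target, $input);

    return ($result === false) ? null : $result;
}

// Sample input (an assumption): an accented letter, which usually has a
// transliteration, and a snowman symbol, which usually has none.
$sample = "Caf\u{00E9} \u{2603} ok";

foreach (['ASCII//TRANSLIT', 'ASCII//TRANSLIT,IGNORE'] as $target) {
    $converted = tryConvert($target, $sample);
    printf(
        "%-24s => %s\n",
        $target,
        $converted === null ? '(conversion failed)' : $converted
    );
}
```

Running this on the machine that will do the real work is a quick way to check whether the whole-string approach is an option there at all.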

Instead of converting every single character, I can now convert a whole file at once, which gave me a throughput of 11MiB/sec. The caveat, of course, is that I have to use the GNU C Library iconv, at the current version or a newer one, to avoid compatibility problems. However, that’s a price I’m more than willing to pay. The new code looks like this:

<?php
/**
*  Replaces special characters with their ASCII equivalents.
*
* This function uses iconv to convert the whole string to its ASCII
* equivalent in one call, using the ASCII//TRANSLIT,IGNORE option.
* Throughput is measured at about 11MiB/sec.
*
* WARNING: Using the extra IGNORE option only works with a recent
* GNU libc iconv, so be very picky about which iconv to use! This is an
* undocumented feature, which is not supported by default and is not
* listed in the PHP manual!
*
* @param string $line
* @return string
*/
protected function _convertSpecialChars($line)
{
    /*
    * Check whether we have the right version of iconv
    */
    if ( ('glibc' !== ICONV_IMPL) || version_compare(ICONV_VERSION, '2.8.90', '<') ) {
        throw new Exception('Please use the glibc iconv, version 2.8.90 or higher');
    }

    /*
    * Use iconv for speed and glory
    * We use the ASCII//TRANSLIT,IGNORE option to replace the string
    * with its ASCII transliterated equivalent. If there's no ASCII
    * equivalent, the IGNORE option makes sure the character is just
    * thrown away, which is exactly what I want.
    *
    */
    $new_line = iconv('UTF-8', 'ASCII//TRANSLIT,IGNORE', $line);

    return $new_line;
}

I guess I don’t need to give further comments on this code example 😉
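For anyone who wants to reproduce the comparison, here is a rough sketch of how the throughput could be measured. It is not the benchmark behind the numbers above: `measureThroughput`, both closures and the test data are assumptions made for illustration, and the figures depend heavily on the machine and the iconv build.

```php
<?php
/**
 * Hypothetical benchmark helper: runs $converter over $data once and
 * returns the throughput in KiB per second.
 */
function measureThroughput(callable $converter, string $data): float
{
    $start = microtime(true);
    $converter($data);
    $elapsed = microtime(true) - $start;

    // Guard against a zero reading on very fast runs.
    $elapsed = max($elapsed, 1e-9);

    return (strlen($data) / 1024) / $elapsed;
}

// One iconv call per byte, in the spirit of the old approach.
$perByte = function (string $data): string {
    $out = '';
    $length = strlen($data);
    for ($i = 0; $i < $length; $i++) {
        // A failed conversion returns false, which casts to ''.
        $out .= (string) @iconv('UTF-8', 'ASCII//TRANSLIT', $data[$i]);
    }
    return $out;
};

// One iconv call for the whole string, as in the new approach.
$whole = function (string $data): string {
    return (string) @iconv('UTF-8', 'ASCII//TRANSLIT,IGNORE', $data);
};

// Test data (an assumption): roughly 90 KiB of plain ASCII text.
$data = str_repeat("The quick brown fox jumps over the lazy dog.\n", 2000);

printf("per byte: %.0f KiB/s\n", measureThroughput($perByte, $data));
printf("whole:    %.0f KiB/s\n", measureThroughput($whole, $data));
```

A single run is noisy, so repeating the measurement a few times and using real customer-like input would give a fairer picture.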

If you have questions or comments, you could drop me a line.