Archive for the ‘Development’ Category

40x Speedup With iconv And PHP

Posted on March 3rd, 2009 in Annabel, Development, Personal, Work | 1 Comment »

For our product Annabel (Dutch), we have to clean up the data our customers provide us with. Because this is a fully automated process, we are unable to give feedback and have them fix their input. Therefore, I need a means to clean the data up, so we can process it.

Since we don’t need to support any Unicode characters, we can stick with plain ASCII. That’s a very safe approach, which greatly reduces the chances of failure. To convert the UTF-8 (Unicode) input into ASCII data, we use the GNU C Library iconv in combination with PHP.

The default iconv has two caveats: it stops on an unconvertible string, and it prints a question mark when it does not have an equivalent character (or transliterated character) in the destination charset. To work around this, I used to convert every single character separately with the PHP iconv function, which gave me a throughput of about 250KiB/sec, using the following code:

/**
 * Replaces special characters with their ASCII equivalents.
 *
 * This function uses iconv to replace each separate character with its
 * ASCII equivalent, using the ASCII//TRANSLIT option. However, this makes
 * the function very slow: max throughput is about 150KiB/sec.
 *
 * @param string $line
 * @return string
 */
protected function _convertSpecialChars($line) {
    // Initialise up front, so an empty $line returns "" instead of an undefined variable.
    $new_line = "";

    if (!empty($line)) {
        /*
         * This potentially could be a very long string, so don't split the line
         * in separate tokens, for that would take way too much memory.
         * Note: strlen() and substr() work on bytes, so a multibyte UTF-8
         * character is handed to iconv one byte at a time.
         */
        $line_length = strlen($line);
        for ($x = 0; $x < $line_length; $x++) {
            $old_char = substr($line, $x, 1);

            /*
             * Use iconv to replace the other special characters.
             * If iconv can't convert it (and so returns '?'), just skip
             * the character, for it probably is something malicious and
             * there's probably no need to keep it anyway.
             *
             * Beware of the edge case where the original character is
             * itself a '?': it is skipped as well.
             */
            $char = iconv('UTF-8', 'ASCII//TRANSLIT', $old_char);
            if ( ('?' != $char) && ('?' != $old_char) ) $new_line .= $char;
        }
    }

    return $new_line;
}

However, I was not satisfied with this, so I looked up the man page of the GNU C Library version of iconv. I assumed PHP was using this one internally, so that seemed the natural place to look. In that man page I found the IGNORE option, which just skips any character that cannot be converted or transliterated. That was exactly what I wanted. So I tried it with the PHP function as well, and it worked. Instead of converting every single character, I can now convert a whole file at once, which gave me a throughput of 11MiB/sec. The caveat, of course, is that I have to use the GNU C Library iconv, at the current version or newer, to avoid compatibility problems. However, that’s a price I’m surely willing to pay. The new code is this (removed proprietary Annabel specific code):

/**
 * Replaces special characters with their ASCII equivalents.
 *
 * This function uses iconv to replace special characters in the whole
 * string with their ASCII equivalent, using the ASCII//TRANSLIT,IGNORE
 * option. Throughput is measured at about 11MiB/sec.
 *
 * WARNING: Using the extra IGNORE option only works with a recent
 * GNU libc iconv, so be very picky about which iconv to use! This is an
 * undocumented feature, which is not supported by default and is not
 * listed in the PHP manual!
 *
 * @param string $line
 * @return string
 */
protected function _convertSpecialChars($line) {
    /*
     * Check whether we have the right version of iconv
     */
    if ( ('glibc' !== ICONV_IMPL) || (true == version_compare(ICONV_VERSION, '2.8.90', '<')) ) {
        throw new Exception('Please use the glibc iconv, version 2.8.90 or higher');
    }

    /*
     * Use iconv for speed and glory
     * We use the ASCII//TRANSLIT,IGNORE option to replace the string
     * with its ASCII transliterated equivalent. If there's no ASCII
     * equivalent, the IGNORE option makes sure the character is just
     * thrown away, which is exactly what I want.
     */
    $new_line = iconv('UTF-8', 'ASCII//TRANSLIT,IGNORE', $line);

    return $new_line;
}

I guess I don’t need to comment on this code example ;) If you do have questions, just ask them in the comments, or via @jacobkiers on Twitter.
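If you want to check throughput figures like the ones quoted above on your own machine, a rough timing harness could look like the sketch below. This is not the code used for the original measurements: the helper name measureThroughput() and the sample text are made up for illustration, the syntax assumes PHP 7 or newer, and the result depends heavily on the iconv implementation your PHP was built against.

```php
<?php
// Hypothetical helper, not part of the original code: time one call of a
// conversion callback and report the throughput in MiB/sec.
function measureThroughput(callable $convert, string $data): float
{
    $start = microtime(true);
    $convert($data);
    $elapsed = microtime(true) - $start;

    // Guard against a zero elapsed time on very small inputs.
    return (strlen($data) / (1024 * 1024)) / max($elapsed, 1e-9);
}

// Assumed test input: roughly 1 MiB of text containing accented characters.
$sample = str_repeat("Caf\u{00E9} cr\u{00E8}me br\u{00FB}l\u{00E9}e\n", 50000);

$mibPerSec = measureThroughput(function (string $s) {
    // Same conversion string as in the function above. Errors are silenced
    // here, because not every iconv implementation accepts this option.
    return @iconv('UTF-8', 'ASCII//TRANSLIT,IGNORE', $s);
}, $sample);

printf("%.1f MiB/sec\n", $mibPerSec);
```

A single run is noisy, so for a fair comparison between the two versions you would probably want to repeat the measurement a few times and use a real input file.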

PHP daemon with fork howto part 3: Creating a Hello World daemon with fork

Posted on December 24th, 2008 in Development, phpfork-series, Work | 4 Comments »

Well, it took some time, but now I’m back again with part 3 of the series on creating a PHP daemon with fork. In part 1, the introduction, I explained why I am writing this series and gave an outline of it. In part 2 we talked about the way UNIX processes work, what forking is and so on. In this part, we will see how to create a basic daemon which launches processes saying the famous Hello World!

Let’s first create a basic Hello World script. And no, I’m not going to explain it ;)

<?php
echo "Hello World\n";
?>

Okay, we’ve covered the basics. Next thing is to create a daemon. This can be done with pcntl_fork().

<?php
$newpid = pcntl_fork();

if ($newpid === -1) {
    die("Couldn't fork()!");
} else if ($newpid) {
    // I'm the parent, and I'm going to self-destruct
    exit(0);
}

// Become the session leader
posix_setsid();
usleep(100000);

// Fork again, but now as session leader.
$newpid = pcntl_fork();

if ($newpid === -1) {
    die("Couldn't fork()!");
} else if ($newpid) {
    // I'm the parent, and I'm going to self-destruct
    exit(0);
}

echo "Master started with pid " . posix_getpid() . "\n";

for($i = 0; $i < 10; $i++) {
    $pid = pcntl_fork();
    if (-1 == $pid) {
        echo "Couldn't fork!\n";
    } elseif ($pid === 0) {
        // I'm the child, and I'm going to say hello world!
        echo "Hello world! from child with pid " . posix_getpid() . "\n";
        exit(0);
    } else {
        // I'm the parent, do nothing
    }
}

exit(0);

Well, this is all of it. Let’s explain the various parts.

On the first nine lines you see we fork. The actual forking is done with:

$newpid = pcntl_fork();

This function spawns a new process, and then gives the control in the parent and the child process back at the first line after the fork, in this case an if-structure. In this if-structure, we check whether forking is actually done (you get -1 when it isn’t), and then kill the parent process. Please note that we don’t check whether we’re the child. If we’re not the parent, and forking didn’t fail, we’re automatically the child. So the next part of this script runs as the child process.

After we forked, we detach from the console and become the session leader. This means that a new session and a new process group are created, and that all child processes we start will become part of this new process group, instead of the process group which called us. If you start a process on the command line, the shell is the session leader of that command. We don’t want the shell to be the session leader of any daemon; therefore, we create a new session, with the daemon itself as the session leader. After that, we sleep for 0.1 seconds. This is all done with lines 12-13:

posix_setsid();
usleep(100000);
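The reason for forking before calling posix_setsid() is that the call is refused for a process that already leads a process group, which a freshly forked child never does. The sketch below makes that visible. It is only an illustration under a few assumptions: the function name childBecomesSessionLeader() is made up, the pcntl and posix extensions must be available on the CLI, and, unlike the daemon, the parent waits for the child so it can report the result.

```php
<?php
// Hypothetical helper: returns true when a forked child managed to become
// a session leader.
function childBecomesSessionLeader(): bool
{
    $pid = pcntl_fork();
    if ($pid === -1) {
        throw new RuntimeException("Couldn't fork()");
    }

    if ($pid === 0) {
        // Child: a fresh child is never a process group leader, so
        // posix_setsid() is allowed to succeed here.
        $sid = posix_setsid();

        // Exit with 0 only if we now lead a session whose id is our own pid.
        exit(($sid !== -1 && $sid === posix_getpid()) ? 0 : 1);
    }

    // Parent: unlike the daemon, wait for the child so its exit code can
    // be inspected.
    pcntl_waitpid($pid, $status);

    return pcntl_wifexited($status) && pcntl_wexitstatus($status) === 0;
}

var_dump(childBecomesSessionLeader());
```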

At this point, we are the session leader, so let’s start the real parent process of the daemon. This is just basic forking again, as you can see in lines 15-24. Then we print the current process id on line 26.

Next comes the “Hello World!” part. First, we start a loop which runs 10 times. In that loop, we fork() again, but now we do nothing in the parent (just keep it running). The else statement for checking the parent can therefore safely be left out; I kept it here for the sake of clarity. In the child process we print our famous Hello World!, along with the new process id. After printing, we stop the child, for there’s no need to keep it running. When we’re done running all child processes, the parent process (the daemon) also stops. This can all be seen in lines 28-43.

for($i = 0; $i < 10; $i++) {
    $pid = pcntl_fork();
    if (-1 == $pid) {
        echo "Couldn't fork!\n";
    } elseif ($pid === 0) {
        // I'm the child, and I'm going to say hello world!
        echo "Hello world! from child with pid " . posix_getpid() . "\n";
        exit(0);
    } else {
        // I'm the parent, do nothing
    }
}

exit(0);
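Note that the parent above exits without looking at its children again. If you would rather have the parent wait until every child is done, one possible variation is sketched below. It is not part of the downloadable daemon: the helper name forkAndCollect() and the use of the loop index as exit code are only there for illustration, and it assumes PHP 7 or newer with the pcntl and posix extensions.

```php
<?php
// Hypothetical helper: fork $count children and collect their exit codes.
function forkAndCollect(int $count): array
{
    for ($i = 0; $i < $count; $i++) {
        $pid = pcntl_fork();
        if ($pid === -1) {
            echo "Couldn't fork!\n";
        } elseif ($pid === 0) {
            // I'm the child: say hello and report my index as exit code.
            echo "Hello world! from child with pid " . posix_getpid() . "\n";
            exit($i);
        }
        // I'm the parent: just carry on with the next fork.
    }

    $exitCodes = [];
    // pcntl_wait() blocks until a child exits and returns its pid; it
    // returns -1 once there are no children left to wait for.
    while (($childPid = pcntl_wait($status)) > 0) {
        $exitCodes[$childPid] = pcntl_wexitstatus($status);
    }

    return $exitCodes;
}

$codes = forkAndCollect(3);
echo count($codes) . " children finished\n";
```

Waiting like this also means the exited children are cleaned up by the parent, instead of being left for init to collect.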

Congratulations! At this point you created your first fully functional PHP daemon! This is really all there is! You can download the hello world daemon here.

In the next part, we will add some communication between the parent and the child processes.

Automatically add user to Ubuntu Linux and set password

Posted on December 23rd, 2008 in Development | No Comments »

I was wondering how to create a user on my system without requiring a prompt or CLI access. The following script is the result:

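#!/bin/bash
UADD=/usr/sbin/useradd
OPENSSL=/usr/bin/openssl
# Named LOGIN_SHELL so the shell's own SHELL variable is not overwritten
LOGIN_SHELL=/bin/bash

# Generate a random password (12 random bytes, which is 16 base64 characters) and print it.
PASS=$("$OPENSSL" rand -base64 12)
echo "$PASS"

# Make the password usable for the useradd utility (useradd -p expects a hashed password)
PASS=$(mkpasswd "$PASS")

# Create the user; the username is the first argument to this script
"$UADD" -s "$LOGIN_SHELL" -m -p "$PASS" "$1"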

Oh, and of course this is released into the public domain. Use it the way you like it.

PHP daemon with fork howto part 2: About UNIX processes and PHP

Posted on October 16th, 2008 in Development, phpfork-series, Work | 5 Comments »

As I’ve written in part 1, the introduction of this series on PHP fork, in this part we will talk about the way processes work in UNIX and Linux, and how we can use this for PHP.

Much of the information in this article is based on the IBM Developerworks article on UNIX Processes. I added this part for your convenience and for completeness, but I won’t go into detail about UNIX processes. If you want to know more, read the aforementioned article.

Any program in UNIX is a process, except for the kernel. A program is in fact a bunch of data, with some instructions to do something with that data. A process tells when and how these instructions should be executed.

Each process has its own process id, which is unique at that point in time. It is impossible for two processes to have the same process id at the same time. It is, however, possible for a process id to be reused later by another process. This only happens when the counter for process ids runs out of space and wraps around, so that it starts again at a low number (process id 1 is never reused, for that process never stops as long as the system is running).

So, when you launch a new process, that process will get its own, unique process id, which identifies it in the whole system. Furthermore, the process id is used to identify all resources in use by that process. A resource can be defined as anything which is needed to run, such as memory, disk space, a network socket, open files such as logfiles or input/output files, and anything else you can think of.

This also means that each PHP program or script you are running, has its own process id, with the possible exception of the PHP scripts running inside a webserver process, like Apache httpd sometimes does.

When you spawn a new process, it internally uses the fork(2) system call, which returns the process id of the child in the parent, 0 in the child, or -1 when something bad happened. Using this knowledge, we can check whether we are in the parent or in the child process, and act appropriately. More on that in the next part of this tutorial, where we will create a PHP daemon, which will start processes for the sole purpose of saying Hello World multiple times.
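To make the three return values a bit more concrete, here is a minimal sketch in PHP. It assumes the pcntl and posix extensions are available and that the script runs from the command line; the messages are just for illustration.

```php
<?php
$pid = pcntl_fork();

if ($pid === -1) {
    // Something bad happened: no child was created.
    fwrite(STDERR, "Couldn't fork()!\n");
    exit(1);
} elseif ($pid === 0) {
    // We are in the child: fork returned 0.
    echo "child: my pid is " . posix_getpid()
        . ", my parent is " . posix_getppid() . "\n";
    exit(0);
} else {
    // We are in the parent: fork returned the process id of the child.
    echo "parent: my pid is " . posix_getpid() . ", my child is $pid\n";

    // Wait for the child, so it is finished before the parent moves on.
    pcntl_waitpid($pid, $status);
}
```

Both branches run the same script from the same point; the only thing that tells them apart is the value fork handed back.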

PHP daemon with fork howto part 1: Introduction

Posted on October 16th, 2008 in phpfork-series, Work | 2 Comments »

Over the last few weeks, I’ve been writing a daemon in PHP for my work at Alphacomm. It is now more or less finished, and I decided to share this knowledge with you in a tutorial.

Another reason for writing this tutorial is the lack of information available on the web on this subject. Generally, it’s very scattered and not really accessible. Also, most information I found about it talked about a specific solution, rather than taking a more general approach. So, I decided to write a howto in a series of articles on PHP fork()-ing and daemons. Due to my busy schedule, I expect to post one part each week.

In this series, I will cover the following subjects:

  1. Part 1: An introduction to this series, links to all parts
  2. Part 2: About UNIX processes and PHP
  3. Part 3: Creating a Hello World daemon with fork()
  4. Part 4: Adding IPC (Inter Process Communication) with PHP sockets
  5. Part 5: Talking to the outer world (also with PHP sockets)
  6. Part 6: Errata and other stuff (not sure of this one)

Disclaimer: these can change ;)

Finally, I’d like to thank the guys who wrote Nanoweb, which really helped me to understand the whole process of creating and managing processes with PHP fork(). However, if you ever plan to actually read that source code yourself, be warned. It is not for the faint-hearted.