I was recently working on a project that required iterating over a directory of files. Whenever I do this I reach for my old friend RecursiveDirectoryIterator. In this case I only needed the full path to each file. I was relying on the fact that RecursiveDirectoryIterator returns the full path as the key on each iteration step by default, and I was ignoring the value, which is an SplFileInfo object. Looking through the documentation I saw there was a way to make RecursiveDirectoryIterator return just the full path as the value. However, I ran into a bug that caused me to dig into the internals of the SPL and figure out how things were working.

Besides RecursiveDirectoryIterator there are two other main SPL classes for iterating over files in a directory: DirectoryIterator and FilesystemIterator. When SPL was first introduced in PHP 5.0, FilesystemIterator did not exist; it was added in PHP 5.3. I never understood the difference between it and DirectoryIterator until now, and the documentation does not give many clues. This matters for RecursiveDirectoryIterator because that class used to extend DirectoryIterator; when FilesystemIterator was introduced, it was changed to extend that class instead. The reasons for this will be clear later. So what are the differences between DirectoryIterator and FilesystemIterator? Here is what I found out.

DirectoryIterator

When you iterate using DirectoryIterator, each value returned is that same DirectoryIterator object. The internal state is changed so that when you call isDir(), getPathname(), or similar methods the correct information is returned. If you ask for a key when iterating, you get an integer index value.

<?php
$files = new DirectoryIterator(/*...*/);
foreach ($files as $index => $iterator) {
    /*...*/
}
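To make that concrete, here is a minimal sketch that builds a throwaway directory and checks what DirectoryIterator hands back on each step. The spl_demo_ directory name and the two file names are made up for illustration.

```php
<?php
// Hypothetical scratch directory; the name and files exist only for this demo.
$dir = sys_get_temp_dir() . '/spl_demo_' . uniqid();
mkdir($dir);
touch($dir . '/a.txt');
touch($dir . '/b.txt');

$files = new DirectoryIterator($dir);
$sameObject = true;
$keys = [];
$names = [];
foreach ($files as $index => $iterator) {
    // The value is the DirectoryIterator itself, with its internal state advanced.
    $sameObject = $sameObject && ($iterator === $files);
    $keys[] = $index;
    // DirectoryIterator also yields the "." and ".." entries, so skip those here.
    if (!$iterator->isDot()) {
        $names[] = $iterator->getFilename();
    }
}
sort($names);

// Clean up the scratch directory.
unlink($dir . '/a.txt');
unlink($dir . '/b.txt');
rmdir($dir);
```

After the loop, $sameObject should still be true, $keys holds plain integer indexes, and $names holds the two file names.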

FilesystemIterator

FilesystemIterator (and thus RecursiveDirectoryIterator) on the other hand returns a new, different SplFileInfo object for each iteration step. The key is the full pathname of the file.

<?php
$files = new FilesystemIterator(/*...*/);
foreach ($files as $fullPath => $info) {
    /*...*/
}
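The same kind of throwaway-directory check works here. This is only a sketch, and the directory and file names are again invented for the demo.

```php
<?php
// Hypothetical scratch directory, created only for this demonstration.
$dir = sys_get_temp_dir() . '/spl_demo_' . uniqid();
mkdir($dir);
touch($dir . '/a.txt');
touch($dir . '/b.txt');

$seen = [];
foreach (new FilesystemIterator($dir) as $fullPath => $info) {
    // Keys are full pathnames; each value is its own SplFileInfo object.
    $seen[$fullPath] = $info;
}
ksort($seen);
$paths = array_keys($seen);
$infos = array_values($seen);

// Clean up the scratch directory.
unlink($dir . '/a.txt');
unlink($dir . '/b.txt');
rmdir($dir);
```

Because every step produces a separate object, it is safe to collect the values in an array, which is not the case with DirectoryIterator.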

This is the default behavior. You can change what is returned for the key or value using the flags argument to the constructor. Your choices are:

  • CURRENT_AS_PATHNAME
  • CURRENT_AS_FILEINFO
  • CURRENT_AS_SELF (note that this makes FilesystemIterator and RecursiveDirectoryIterator behave like DirectoryIterator)
  • KEY_AS_PATHNAME
  • KEY_AS_FILENAME
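As a rough illustration of the flags, the sketch below uses KEY_AS_FILENAME and CURRENT_AS_SELF on a plain FilesystemIterator. The scratch directory is hypothetical, and SKIP_DOTS is passed explicitly because supplying your own flags means you no longer get the defaults.

```php
<?php
// Hypothetical scratch directory for the demo.
$dir = sys_get_temp_dir() . '/spl_demo_' . uniqid();
mkdir($dir);
touch($dir . '/a.txt');

// Key is the bare file name, value is an SplFileInfo object.
$flags = FilesystemIterator::KEY_AS_FILENAME
       | FilesystemIterator::CURRENT_AS_FILEINFO
       | FilesystemIterator::SKIP_DOTS;

$byName = [];
foreach (new FilesystemIterator($dir, $flags) as $name => $info) {
    $byName[$name] = $info->getPathname();
}

// CURRENT_AS_SELF: the value is the iterator object, as with DirectoryIterator.
$self = new FilesystemIterator(
    $dir,
    FilesystemIterator::CURRENT_AS_SELF | FilesystemIterator::SKIP_DOTS
);
$isSelf = false;
foreach ($self as $value) {
    $isSelf = ($value === $self);
}

// Clean up the scratch directory.
unlink($dir . '/a.txt');
rmdir($dir);
```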

The bug I ran into has to do with the CURRENT_AS_PATHNAME option. Using it will cause PHP to throw a fatal exception. I made a pull request to fix this and submitted it via GitHub, but as of the date of this blog post it has yet to be merged.

I'm not sure about all of the history, or why DirectoryIterator returned itself but RecursiveDirectoryIterator returned new SplFileInfo objects when SPL was created, but it is clear the FilesystemIterator class was introduced to make the API a little bit cleaner.

I have been interested in different URL routing techniques in web frameworks for a few years now. I have looked at the code for Symfony 2, Aura, and Slim's routing. They, at a simplified level, work the same way. I suspect most routing libraries written in the PHP 5.3+ era work similarly and there have not been any new developments for a few years... until now. A few interesting libraries have burst onto the scene in the past month or two that do things differently. One in particular is causing some discussion and debate in the community. I want to talk about what makes them different, their advantages and disadvantages, and what I want out of a PHP routing library that has yet to be fulfilled.

The Status Quo

At their root routing libraries map a URL pattern (usually the rewritten path portion) to a piece of code to be executed. A secondary purpose is to extract key => value parameters from the URL path. A typical routing library lets you define routes like so.

<?php
$router->addRoute('GET', '/post/{id}/comments', ['controller' => 'PostController', 'action' => 'show']);

That last argument might be an anonymous function or some other PHP callable. And maybe instead of calling addRoute() with GET you call a get() method. Those are small implementation details. Under the hood most routing libraries take those arguments and make a Route object out of them and store a collection of these Route objects. When routing is run the collection is looped over and for each Route a regex is used to extract the parameter names from the pattern, in our example the id, and turn that pattern into a real regex pattern. The generated regex is then used to test if the Route matches the requested URL/path/whatever. And if it does match this generated regex pulls the parameter values out of the requested URL.
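Here is a minimal sketch of that common approach. The compilePattern() and matchRoute() helpers are hypothetical names of my own and are not taken from any particular library; parameters are assumed to match anything up to the next slash.

```php
<?php
// Hypothetical helper: turn '/post/{id}/comments' into a regex with named groups.
function compilePattern(string $pattern): string
{
    // Split on {name} placeholders, keeping the names in the result.
    $parts = preg_split('~\{(\w+)\}~', $pattern, -1, PREG_SPLIT_DELIM_CAPTURE);
    $regex = '';
    foreach ($parts as $i => $part) {
        // Even offsets are literal text, odd offsets are parameter names.
        $regex .= ($i % 2 === 0)
            ? preg_quote($part, '~')
            : '(?P<' . $part . '>[^/]+)';
    }
    return '~^' . $regex . '$~';
}

// Hypothetical matcher: loop over the flat list and stop at the first match.
function matchRoute(array $routes, string $method, string $path): ?array
{
    foreach ($routes as [$routeMethod, $pattern, $target]) {
        if ($routeMethod !== $method) {
            continue;
        }
        if (preg_match(compilePattern($pattern), $path, $m)) {
            // Keep only the named captures as parameters.
            $params = array_filter($m, 'is_string', ARRAY_FILTER_USE_KEY);
            return ['target' => $target, 'params' => $params];
        }
    }
    return null;
}

$routes = [
    ['GET', '/post/{id}/comments', ['controller' => 'PostController', 'action' => 'show']],
];
$result = matchRoute($routes, 'GET', '/post/42/comments');
```

Note that the pattern is converted to a regex inside the loop, so the conversion cost is paid again on every request.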

Bullet

Bullet PHP bills itself as a resource-oriented micro PHP framework. The novel thing it does for routing is to break the requested path into segments, one for each directory in the path. The routing engine will try matching the first directory segment. If there is a callable registered for that segment the callable will be run. The engine will continue searching for matched routes and running any registered callables until all the segments are exhausted.

OK I think an example is in order. This is shamelessly stolen from the Bullet PHP README. Suppose the requested path is /events/45/edit. The routing engine will first look for matches for just /events. Any registered callbacks will be executed. Then the routing engine will look for matches for /events/45, again any registered callbacks will be executed. Finally /events/45/edit will be used to search for matches.
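The order of those lookups can be sketched in a few lines. This is not Bullet's actual code; cumulativePaths() is a hypothetical helper that only lists the paths a segment-by-segment router would try.

```php
<?php
// Hypothetical helper: list the cumulative paths tried for a request, in order.
function cumulativePaths(string $path): array
{
    // Drop the empty strings produced by leading or trailing slashes.
    $segments = array_values(array_filter(explode('/', $path), 'strlen'));
    $paths = [];
    $current = '';
    foreach ($segments as $segment) {
        $current .= '/' . $segment;
        $paths[] = $current;
    }
    return $paths;
}

$steps = cumulativePaths('/events/45/edit');
// $steps is ['/events', '/events/45', '/events/45/edit']
```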

The examples in the documentation show that in the /events callable you could register the routes for /events/{id} and /events/{id}/edit.

<?php
$app = new Bullet\App(array(/* some config here */));

$app->path('events', function($request) use ($app) {

    $app->get(function($request) use ($app) {
        // list events
    });
    $app->post(function($request) use ($app) {
        // create an event
    });
    $app->path('new', function($request) use ($app) {
        // new event form
    });

    $app->param('int', function($request, $id) use ($app) {
        $app->get(function($request) use ($id) {
            // View an event
        });
        $app->put(function($request) use ($id) {
            // Update event
        });
        $app->delete(function($request) use ($id) {
            // Delete event
        });
        $app->path('edit', function($request) use ($id) {
            // edit event form
        });
    });
});

This is a form of creating a tree structure with your routes. A true tree structure is something I feel is missing from most PHP web frameworks. Tree structures can considerably speed up route matching since you can skip over dozens of routes with one check. For example you could skip over all of the /admin routes for your application in one route match check. More routing libraries should try to implement a tree structure rather than just a flat list of routes to loop over.
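To show what I mean by a tree, here is a minimal sketch using nested arrays. The node layout, the '*' wildcard key, and the matchTree() function are all assumptions made for this example rather than the design of any existing library.

```php
<?php
// Hypothetical route tree: each node has an optional 'handler' and 'children'.
// A '*' child stands in for a parameter segment such as {id}.
$tree = [
    'children' => [
        'admin' => ['children' => [
            'users'    => ['handler' => 'admin.users'],
            'settings' => ['handler' => 'admin.settings'],
        ]],
        'events' => ['handler' => 'events.list', 'children' => [
            'new' => ['handler' => 'events.new'],
            '*'   => ['handler' => 'events.show', 'children' => [
                'edit' => ['handler' => 'events.edit'],
            ]],
        ]],
    ],
];

function matchTree(array $tree, string $path): ?array
{
    $node = $tree;
    $params = [];
    foreach (array_filter(explode('/', $path), 'strlen') as $segment) {
        $children = $node['children'] ?? [];
        if ($segment !== '*' && isset($children[$segment])) {
            $node = $children[$segment];   // literal segment matched
        } elseif (isset($children['*'])) {
            $node = $children['*'];        // parameter segment matched
            $params[] = $segment;
        } else {
            return null;                   // no branch fits, stop looking
        }
    }
    return isset($node['handler'])
        ? ['handler' => $node['handler'], 'params' => $params]
        : null;
}

$match = matchTree($tree, '/events/45/edit');
```

A request for /events/45/edit never looks at anything under admin; the first segment check rules out that whole branch.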

One downside of Bullet PHP is that it expects the callable to be a Closure. No other types are accepted. In the author's defense he does state that the Closure is not the actual controller, but that you call the actual controller from the Closure. That leads to my other complaint though. Although I like the tree structure and it promotes creating RESTful, resource-oriented routes, doing so requires a lot of boilerplate code. I don't want to have to retype /events/{id} GET, /events/{id} POST, etc. for every resource. Nor do I want to have to re-implement the code for instantiating a controller object and calling the action method on each project. I'd like to see more PHP routing libraries with a resources() method that does something similar to the method of the same name in Ruby on Rails. Aura Routing 2.0 added an attachResource() method but that is the only one I'm aware of. I have added my own implementation on top of Slim.
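A rough sketch of what such a helper could look like follows. The resources() function, its row format, and the action names are hypothetical; they are loosely modeled on the Rails convention and are not the Aura or Slim code mentioned above.

```php
<?php
// Hypothetical helper: expand one resource name into the usual RESTful routes.
// Each row is [HTTP method, URL pattern, controller, action] so it can be fed
// to whatever router is in use.
function resources(string $name, string $controller): array
{
    $base = '/' . $name;
    $member = $base . '/{id}';
    return [
        ['GET',    $base,             $controller, 'index'],
        ['GET',    $base . '/new',    $controller, 'new'],
        ['POST',   $base,             $controller, 'create'],
        ['GET',    $member,           $controller, 'show'],
        ['GET',    $member . '/edit', $controller, 'edit'],
        ['PUT',    $member,           $controller, 'update'],
        ['DELETE', $member,           $controller, 'destroy'],
    ];
}

$routes = resources('events', 'EventsController');
```

One call replaces seven hand-written route definitions, and the controller dispatch code only has to be written once.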

Pux

Pux is the more controversial of the two libraries presented in this post. Pux makes big claims about performance: 48X as fast as Symfony with static routes and 31X with dynamic routes. It achieves this in three ways.

  1. It stores the routes as arrays instead of Route objects. I don't know how much this actually saves these days, since there have been a lot of optimizations in the PHP engine surrounding objects and memory performance.
  2. It pre-compiles the regex patterns. In my example of the common implementation, this step of changing the URL pattern to a regex pattern happens on every request. Good libraries will do this step lazily, converting one route at a time until a match is found and then stopping. Bad libraries do this step on all routes regardless. The pre-compiled route arrays are stored on disk or in memory. This is the part that really interests me.
  3. It uses a C extension. This part doesn't interest me at all. Of course something written in C is faster than PHP, since it doesn't have the overhead of a virtual machine or hash table lookups for every variable or array offset. The C extension is not as useful to most of us because we can't use it on shared hosts or most Software-as-a-Service providers. The only way to use it is to host your own PHP on a VPS.

The fact that Pux is faster due to the C extension is no surprise to anyone. It's the pre-compiling of the regex patterns that is novel to me, and it seems to be glossed over in a lot of the discussion about Pux. One of the reasons I like pre-compiling the regex patterns is that pre-processing of web application code is getting more commonplace. Sure, you don't see it in PHP, but on the client side concatenating and minifying JavaScript and compiling LESS or SCSS to CSS are basically a necessity now. Libraries could compile the regexes on the fly during development, like they do now. Prior to release you would run a CLI script as part of your release process, and the compiled regexes along with other routing information would be saved to a file that is then read in on each request instead of the more verbose routing information.
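The release-time step could be as small as the sketch below. This is not how Pux does it; compileRoutes(), the cache file location, and the route table are all hypothetical, and the patterns are assumed to contain no regex metacharacters other than the {name} placeholders.

```php
<?php
// Hypothetical build step: convert every pattern to a regex once.
function compileRoutes(array $routes): array
{
    $compiled = [];
    foreach ($routes as $pattern => $handler) {
        // Assumes the pattern has no special regex characters besides {name}.
        $regex = preg_replace('~\{(\w+)\}~', '(?P<$1>[^/]+)', $pattern);
        $compiled['~^' . $regex . '$~'] = $handler;
    }
    return $compiled;
}

$routes = [
    '/post/{id}/comments' => 'PostController::show',
    '/about'              => 'PageController::about',
];

// Release time: dump the compiled table to a PHP file (the path is made up).
$cacheFile = sys_get_temp_dir() . '/routes_' . uniqid() . '.php';
file_put_contents(
    $cacheFile,
    '<?php return ' . var_export(compileRoutes($routes), true) . ';'
);

// Request time: load the ready-made regexes; no pattern parsing is needed.
$table = require $cacheFile;
$handler = null;
$params = [];
foreach ($table as $regex => $candidate) {
    if (preg_match($regex, '/post/7/comments', $m)) {
        $handler = $candidate;
        $params = array_filter($m, 'is_string', ARRAY_FILTER_USE_KEY);
        break;
    }
}

unlink($cacheFile);
```

Since the cache is a plain PHP file that returns an array, an opcode cache can keep it in memory between requests.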

I can imagine other uses for this type of pre-release preprocessing. A big one is for describing relational database tables, columns and relationships for an Active Record or Data Mapper library.

I should also point out an excellent blog post by Nikita Popov, a regular PHP core contributor, about how to speed up regular expression matching in routing engines. The gist of it is that by combining individual regex patterns together you can approach or beat Pux's performance. You need to do some special processing to do this and there is at least one caveat to it, but it is a very interesting approach to the problem.
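To give a feel for the general idea, here is one way of combining patterns into a single regex, using PCRE's (*MARK) verb to tell which alternative matched. I am not claiming this is the exact technique from that post; the route table and the dispatch() function are hypothetical.

```php
<?php
// Hypothetical route table: mark name => [regex fragment, parameter names].
$routes = [
    'user_show'     => ['/user/(\d+)', ['id']],
    'post_comments' => ['/post/(\d+)/comments', ['id']],
    'about'         => ['/about', []],
];

// Build one alternation; (*MARK:name) records which branch matched.
$branches = [];
foreach ($routes as $name => [$fragment]) {
    $branches[] = $fragment . '(*MARK:' . $name . ')';
}
// (?| ... ) resets group numbering, so every branch starts at group 1.
$combined = '~^(?|' . implode('|', $branches) . ')$~';

function dispatch(string $combined, array $routes, string $path): ?array
{
    // A single preg_match call tests the path against every route at once.
    if (!preg_match($combined, $path, $m)) {
        return null;
    }
    $name = $m['MARK'];
    $params = [];
    foreach ($routes[$name][1] as $i => $paramName) {
        $params[$paramName] = $m[$i + 1];
    }
    return ['route' => $name, 'params' => $params];
}

$hit = dispatch($combined, $routes, '/post/42/comments');
```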

My Perfect Routing Library

I have had several starts in the past at writing a routing library with some of these features in mind. What I have concentrated on is having a tree structure for the routing and how to best create RESTful route groups. I've started down the road of having Collection objects and Route objects, using simple recursion to iterate over them all (or skip branches). I have thought about doing this so you could iterate using one of SPL's built-in recursive iterators, but have not had time to flesh anything out. I've done it where the RESTful route group just extends the Collection with Route objects for each URL-method combination. And I've done it where the RESTful group is a specialized Route object that uses one regex to match all the URLs. My perfect library has to have at least those two: tree structures and an easy way to create resource groups.

And as I said, I am intrigued by the idea of pre-compiling the regex patterns, especially in production where it does not make sense to repeat this task over and over again for each request. So if I ever got the time to create my perfect routing library it would have that in it too. One last point to think about is that we have not seen anyone take advantage of generators in PHP 5.5 for a routing library. I have not thought of a way in which generators could do something novel for routing libraries yet, but there are people much more creative than me out there.

I was recently working on a project where I needed to recursively get all of the files with a particular extension inside a directory. Actually, I needed to find all files with a .php extension but not .html.php. Sounds like a perfect use for RecursiveDirectoryIterator, right? I could do something like the following.

<?php
$files = new RecursiveIteratorIterator(new RecursiveDirectoryIterator($dir));
foreach ($files as $path => $finfo) {
    if (substr($path, -4) != '.php' || substr($path, -9) == '.html.php') {
        continue;
    }
    // do stuff
}

But I figured I would give RecursiveFilterIterator a try. For those who don't know, all of the classes mentioned are part of PHP's SPL extension. RecursiveFilterIterator is actually an abstract class; you have to extend it and implement the accept method. From that method you return a boolean, where false means skip the item and true means pass it to the iteration loop.

<?php
class PHPFileIterator extends RecursiveFilterIterator
{
    public function accept()
    {
        $file = parent::current();
        $name = $file->getFilename();
        return (substr($name, -4) == '.php' && substr($name, -9) != '.html.php');
    }
}
$files = new PHPFileIterator(new RecursiveDirectoryIterator($dir));

I thought that because RecursiveFilterIterator implements OuterIterator I could just pass it to a foreach statement. However, running this produced no results. Upon further inspection, the loop was hitting the first sub-directory and stopping there. Reading the user comments on the documentation page for RecursiveFilterIterator shows that you still need to wrap RecursiveFilterIterator in a RecursiveIteratorIterator. Sigh, OK.

<?php
$files = new RecursiveIteratorIterator(new PHPFileIterator(new RecursiveDirectoryIterator($dir)));

But still this did not work. It turns out it was not iterating down into the sub-directories. In the accept method I also had to return true when a directory was encountered.

<?php
class PHPFileIterator extends RecursiveFilterIterator
{
    public function accept()
    {
        $file = parent::current();
        if ($file->isDir()) {
            return true;
        }
        $name = $file->getFilename();
        return (substr($name, -4) == '.php' && substr($name, -9) != '.html.php');
    }
}
$files = new RecursiveIteratorIterator(new PHPFileIterator(new RecursiveDirectoryIterator($dir)));

OK, finally we are getting somewhere. There was one last hitch: I also had to tell the RecursiveDirectoryIterator to skip the dot entries (. and ..). This is what I ended up with.

<?php
class PHPFileIterator extends RecursiveFilterIterator
{
    public static function factory($dir)
    {
        return new RecursiveIteratorIterator(
            new PHPFileIterator(
                new RecursiveDirectoryIterator(
                    $dir,
                    FilesystemIterator::CURRENT_AS_FILEINFO | FilesystemIterator::SKIP_DOTS
                )
            )
        );
    }
    public function accept()
    {
        $file = parent::current();
        if ($file->isDir()) return true;
        $name = $file->getFilename();
        return (substr($name, -4) == '.php' && substr($name, -9) != '.html.php');
    }
}

$files = PHPFileIterator::factory($dir);
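For comparison, the same filter logic can be written without a subclass by using SPL's RecursiveCallbackFilterIterator (available since PHP 5.4). This is only a sketch; the scratch directory and its file names are made up so the result can be checked.

```php
<?php
// Hypothetical scratch tree for the demonstration.
$dir = sys_get_temp_dir() . '/spl_demo_' . uniqid();
mkdir($dir . '/views', 0777, true);
touch($dir . '/index.php');
touch($dir . '/readme.txt');
touch($dir . '/views/page.html.php');
touch($dir . '/views/helper.php');

$flags = FilesystemIterator::CURRENT_AS_FILEINFO | FilesystemIterator::SKIP_DOTS;
$filter = new RecursiveCallbackFilterIterator(
    new RecursiveDirectoryIterator($dir, $flags),
    function (SplFileInfo $file): bool {
        if ($file->isDir()) {
            return true; // keep directories so the recursion continues
        }
        $name = $file->getFilename();
        return substr($name, -4) === '.php' && substr($name, -9) !== '.html.php';
    }
);

$found = [];
foreach (new RecursiveIteratorIterator($filter) as $path => $file) {
    // Store paths relative to the scratch directory.
    $found[] = substr($path, strlen($dir));
}
sort($found);

// Clean up the scratch tree.
unlink($dir . '/index.php');
unlink($dir . '/readme.txt');
unlink($dir . '/views/page.html.php');
unlink($dir . '/views/helper.php');
rmdir($dir . '/views');
rmdir($dir);
```

On a system that uses forward slashes, $found should end up holding only /index.php and /views/helper.php.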

It's a shame that I need to use three objects to do this. And with the FilesystemIterator constants thrown in it's a lot of typing, thus the factory method. There are some other takeaways from this exercise.

  • The RecursiveFilterIterator will let you skip an entire branch of a tree structure by returning false from accept. Imagine creating a URL matching router using this. It could be quite powerful.
  • One of the downsides I ran into while running some tests is if you wanted to look for directories with a certain name or pattern you could not do it with RecursiveFilterIterator. You have to return true for a directory in accept otherwise the iterator won't recurse down the sub-directories.
  • I also found it a bit weird that FilesystemIterator has SKIP_DOTS enabled by default but RecursiveDirectoryIterator does not. I guess these kinds of inconsistencies are to be expected though, it is PHP.
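On the point about searching for directories by name, one possible workaround is to skip RecursiveFilterIterator and check each item inside the loop, using the SELF_FIRST mode of RecursiveIteratorIterator so that directories are yielded as well as files. The directory layout below is hypothetical.

```php
<?php
// Hypothetical scratch tree containing two directories named "cache".
$dir = sys_get_temp_dir() . '/spl_demo_' . uniqid();
mkdir($dir . '/src/cache', 0777, true);
mkdir($dir . '/cache');
touch($dir . '/src/a.php');

$iterator = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator($dir, FilesystemIterator::SKIP_DOTS),
    RecursiveIteratorIterator::SELF_FIRST // yield directories as well as files
);

$cacheDirs = [];
foreach ($iterator as $path => $info) {
    if ($info->isDir() && $info->getFilename() === 'cache') {
        // Store paths relative to the scratch directory.
        $cacheDirs[] = substr($path, strlen($dir));
    }
}
sort($cacheDirs);

// Clean up the scratch tree.
unlink($dir . '/src/a.php');
rmdir($dir . '/src/cache');
rmdir($dir . '/src');
rmdir($dir . '/cache');
rmdir($dir);
```

The trade-off is that nothing is pruned; every entry in the tree is visited and the filtering happens in the loop body.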

I am taking a cue from respected PHP developer Paul M. Jones. Paul somewhat recently split his blog into two. One for personal stuff (he blogs about a lot of political stuff) and one for software development. In starting this blog I am doing something similar.

I have not been very active on my personal blog for a while. Part of that is because I have been reluctant to make a bunch of technical posts there. My family and some friends follow it and I think they would find it weird that I am posting about all this PHP stuff when that is not my day job. But since I can't talk much about my day job (they are very much an old school, closed, engineering shop) and I have not been doing autocrossing or track days since becoming a homeowner and then a dad, writing web apps is the most exciting thing going.

The other factor is that I have been itching to convert my WordPress blog to Jekyll/Octopress. Through the generosity of GitHub I have free space to host this blog and can play around with Jekyll and Octopress. I think my personal blog being on WordPress hurt my willingness to write as well. The software just felt so heavy. I should have made more use of the Android app though. Oh well.

I am looking forward to putting out some great content, mostly on PHP but also some CSS and Linux stuff.

Till next time...

Paul