Posts Tagged ‘php’

SOAP, Web Services, and PHP

Saturday, August 23rd, 2008

One of my projects in the past few weeks has been to put together a SOAP server for a client. So suddenly I’ve had to learn a lot of the nitty gritty details about what works and what doesn’t…

While they’re fresh, let me jot them down here. WARNING: Extremely technical content ahead.

Technical note: HTTP Auth with AJAX

Saturday, June 7th, 2008

I’ve been struggling to get Project Auriga to set HTTP Auth from a nice pretty login form, and think I have it working.

What follows is a very technical discussion–if you’re a business reader, you should probably skip this post…

HTTP Auth is a specific mechanism for handling authentication. It is built into Apache and IIS, so the server can handle authentication purely through configuration, offering many different back ends for storing the data. Browsers also handle HTTP Auth natively, popping up a standard login box whenever they get a Basic Authentication challenge from the server. But this login box is ugly, and doesn’t provide a friendly way for people to create an account or get a password resent–on failure it falls back to a basic error page. You can, of course, customize the error page, but that does not help people with the password login itself.

There are several benefits to using HTTP Auth, though. First of all, other applications on the same server can accept the same credentials, allowing you to sign in once and access multiple applications without having to log into each one. Secondly, you can set up stronger authentication methods, such as client-side certificates. Also, you can configure the server to protect large parts of a web site very easily, reducing exposure to information disclosure.

So how do you make a sign-in form on a web application set http auth? Browsers do not allow you to access these settings via script. You can use an XmlHttpRequest object to set authentication, but only after the proper challenge has been sent from the server. The biggest problem is, if the server sends this challenge twice in a row, your browser will intercept the second request and pop up the ugly password prompt. So designing a form that keeps this login prompt from popping up under most circumstances is quite the challenge.

The gist of the issue is that while you can open an XmlHttpRequest object with a user and password for http authentication, the browser will only actually use those credentials after the server has rejected a request. The process looks like this:

  1. Your script creates and sends an XmlHttpRequest with http auth username and password.
  2. The browser submits the request to the server, without sending the username and password.
  3. The server responds with 401 requires authentication, and a WWW-Authenticate header specifying a realm.
  4. The browser looks in its cache to see if it already has http auth set for that domain and realm. If it does, it sends those credentials, NOT THE ONES you specified in your XmlHttpRequest. If it does not have those credentials, only then will it set http auth to what your script asked for.
  5. The server responds. Generally, if the username or password are incorrect, the server will repeat the 401 response, and WWW-Authenticate.
  6. The browser gets its second 401 in a row, and pops up its password box. Your script never gets a chance to intercept this. So if the stored http auth credentials are wrong, or the user mistypes the password, their browser takes over and you get a password prompt.
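To make the six steps above concrete, here is a toy model of them in JavaScript. It is only a sketch of the behavior as described in this post, not of how any real browser is implemented. The names `createBrowserModel` and `demoServer`, the users, and the realm "Auriga" are all made up for illustration.

```javascript
// Toy model of the behavior described in the list above.
// It is not how any real browser is implemented.
function createBrowserModel() {
  const cache = new Map(); // "host|realm" -> { user, pass }

  // `server` is a hypothetical function: (credentials or null) -> { status, realm }
  function request(host, scriptCreds, server) {
    // Step 2: the first attempt goes out with no credentials.
    let response = server(null);
    if (response.status !== 401) {
      return { response, sent: null, prompted: false };
    }
    // Step 4: cached credentials for this host and realm win over the script's.
    const key = host + "|" + response.realm;
    const sent = cache.get(key) || scriptCreds;
    response = server(sent);
    if (response.status !== 401) {
      cache.set(key, sent); // remembered for later requests
    }
    // Step 6: a second 401 in a row means the native password box appears.
    return { response, sent, prompted: response.status === 401 };
  }

  return { request };
}

// Hypothetical server with two known users.
const users = new Map([["alice", "secret"], ["bob", "hunter2"]]);
function demoServer(creds) {
  const ok = creds !== null && users.get(creds.user) === creds.pass;
  return ok ? { status: 200 } : { status: 401, realm: "Auriga" };
}

const browser = createBrowserModel();
browser.request("example.test", { user: "alice", pass: "secret" }, demoServer);
const asBob = browser.request("example.test", { user: "bob", pass: "hunter2" }, demoServer);
console.log(asBob.sent.user); // "alice": the cached credentials were sent instead

const fresh = createBrowserModel();
const typo = fresh.request("example.test", { user: "alice", pass: "typo" }, demoServer);
console.log(typo.prompted); // true: two 401s in a row
```

The second call is the surprising one: the script asked to log in as bob, but the model sends the cached alice credentials, which is the step 4 behavior.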

How do you handle this situation? It turns out you need to engage in some trickery on both the client and the server.

Here’s a basic flow of how you need to handle this, from both the server and the client perspective:

  1. First, collect the credentials from the user, and create your request as outlined above.
  2. Browser sends request without credentials.
  3. Server responds with 401 and WWW-Authenticate.
  4. Browser sends cached credentials, if they exist, or your credentials if not.
  5. If credentials are accepted, server allows log in and responds with 200. If credentials are not accepted, server returns an error code OTHER THAN 401, and does not send a WWW-Authenticate:
    1. We use 403 Forbidden for a credential failure here. You might also use 400 Bad Request.
    2. Because the response was something other than 401, your browser caches the bad credentials.
    3. XmlHttpRequest status reflects the error condition.
    4. Your script checks the result for the error your server has returned. Now comes the crucial part:
    5. Your script submits a new request with different credentials to some server location that will return successfully. For example, we call a login method on our application, passing username “public” and password “?”.
    6. The browser sends the new credentials and submits the request.
    7. The server returns 200.
    8. The browser updates its http auth cached credentials with the new bogus ones.
  6. Now you can present an error to the user, and ask for new credentials.

The key to the above process is that if the browser gets two 401 responses without having a 200 somewhere between, it will pop up its password box and there’s nothing you can do about it. So the key is to use a different error code to indicate bad credentials, and do an intervening request that will return 200 so that you can re-authenticate.

Logging Out
You cannot really log out of HTTP Auth. But you can change the credentials to a known bad user. That’s a key technique we use to effectively log out of an application, and we re-use this method to reset after bad credentials.

On the server
I’m very much still in development with this. You can see the server side code for Project Auriga logins here.

In this system, we do set a cookie after successful login, to keep from having to check credentials again. This script also allows for cookie-only logins without using http auth. The important bits:

  • action=logout: if this is called, the script always returns successfully. This allows the client script to provide new bogus credentials. It passes a username of “public” to log out completely.
  • action=httpauth: if this is called, and there are no http auth credentials or the http auth username is “public”, return a 401 and WWW-Authenticate. This is always the first request from a browser, and triggers the browser to re-request with the credentials.
  • action=httpauth, with http auth username set, and it’s not “public”: On the second or later requests, we never want to return a 401 or the browser will pop up its password prompt. So we return 403 (or 400) if the credentials are bad, or allow the script to continue processing if they are good. In this case, our authenticate method returns true if credentials are good, false if the user is not found, and throws an exception if the credentials are bad.
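The three cases above boil down to a small decision table. The sketch below models only that decision, in JavaScript rather than the PHP the real script is written in, and it is not the Project Auriga code. The function name, the realm string, the simplified `authenticate` callback (true for good credentials, false otherwise), and the 400 for an unknown action are all assumptions.

```javascript
// Sketch of the status-code decision only; not the actual server script.
// `authenticate` is a hypothetical callback: (user, pass) -> true or false.
function decideLoginResponse(action, creds, authenticate) {
  if (action === "logout") {
    // Always succeeds, so the client can store new bogus credentials.
    return { status: 200, headers: {} };
  }
  if (action === "httpauth") {
    if (!creds || creds.user === "public") {
      // First request, or the logged-out user: send the challenge.
      return { status: 401, headers: { "WWW-Authenticate": 'Basic realm="Auriga"' } };
    }
    // Credentials were supplied: never answer 401 again.
    return authenticate(creds.user, creds.pass)
      ? { status: 200, headers: {} }
      : { status: 403, headers: {} };
  }
  return { status: 400, headers: {} }; // assumption: unknown actions are rejected
}

// Hypothetical credential check for the demonstration.
const checkPassword = (user, pass) => user === "alice" && pass === "secret";

console.log(decideLoginResponse("httpauth", null, checkPassword).status); // 401
console.log(decideLoginResponse("httpauth", { user: "alice", pass: "typo" }, checkPassword).status); // 403
console.log(decideLoginResponse("httpauth", { user: "alice", pass: "secret" }, checkPassword).status); // 200
```

The point to notice is that the 403 response carries no WWW-Authenticate header, so the browser has nothing to react to.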

That’s basically what you need to do on the server side. Now for the client.

Client-side logins
We’re using the Dojo Toolkit extensively in Project Auriga, so the login functions are using dojo.xhr* requests to wrap the XmlHttpRequest objects and provide convenient callback functions. You can see our login code here. Key items:

  • auriga.login is called by the login form. Note that if this is the first time to this page, the dojo.xhrPost actually happens twice: first time with no credentials, and the second time with them. If the second post is accepted, auriga.login_complete is called. If the second post returns any kind of error, auriga.login_err is called.
  • auriga.login_complete is easy… it just redirects to wherever the server response designates.
  • auriga.login_err is the real trick here. If it detects the error code we’ve chosen for bad passwords, it immediately calls the server logout method, to get a good response so the next time the browser gets a 401, it won’t immediately pop up the password box.
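Here is a minimal sketch of that client flow. It is not the Auriga code and does not use Dojo. The transport is passed in as a hypothetical `send(action, user, pass)` function so the logic can run anywhere. In a browser, `send` would wrap something like `XMLHttpRequest.open(method, url, true, user, pass)`. The function name `login`, the result shape, and the messages are assumptions.

```javascript
// `send` is a hypothetical transport: (action, user, pass) -> Promise<{ status, redirect }>.
const BAD_CREDENTIALS = 403; // the non-401 code chosen for bad passwords

async function login(send, user, pass) {
  const response = await send("httpauth", user, pass);
  if (response.status === 200) {
    // Success: go wherever the server response designates.
    return { ok: true, redirect: response.redirect };
  }
  if (response.status === BAD_CREDENTIALS) {
    // Replace the stored credentials with the known "public" user, so the
    // next 401 is not the second one in a row.
    await send("logout", "public", "?");
    return { ok: false, message: "Incorrect username or password." };
  }
  return { ok: false, message: "Login failed with status " + response.status + "." };
}

// Demonstration with a fake transport that rejects every password.
const calls = [];
async function fakeSend(action, user) {
  calls.push(action + ":" + user);
  return action === "logout" ? { status: 200 } : { status: 403 };
}

login(fakeSend, "alice", "typo").then((result) => {
  console.log(result.ok, calls.join(", "));
});
```

The order of calls is what matters: the bogus "public" request goes out before the error is shown, so the user can retry from the form without the browser stepping in.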

You can see the code in action on our demo server.

Other notes

  • Actually doing single sign-on is hard. We’re trying out different strategies for detecting whether a user already has http auth set, by calling our login method once on page load, but haven’t gotten that figured out. In our current script, if you are already authenticated elsewhere on the same domain and realm, just clicking Login with the form blank will log you in with your existing credentials.
  • Because your browser stores credentials based on the domain and the realm together, all applications that you set to share these items must accept the same credentials. If you have a different password on a different system on the same server, you must set a different realm, or logging into one will log you out of the other.
  • If you want to require http auth, but not Javascript, I suggest submitting something different to the server using Javascript to identify this type of request. Perhaps show your form only when Javascript is available, and when it’s not, have a link to a protected page to let your browser go ahead and show the password dialog.
  • Using http auth can actually allow users to disable cookies, if your application is RESTful. In Project Auriga, the session login script supports either–the client pages and logins work with either a cookie or http auth. The login process attempts to set http auth and a session cookie. On subsequent attempts, it uses the cookie to avoid re-authenticating every request.
  • Finally, a note on security: Basic Authentication provides no protection against passwords being sniffed over the network. If you need a secure login, be sure the server conversation uses SSL–otherwise neighbors on your wireless network can easily sniff out your password. HTTP Auth does not make your application more secure–it just makes it easier to share authentication with other resources on the same server.
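To see why sniffing is so easy, note that Basic Authentication only base64-encodes the username and password; it does not encrypt them. The short sketch below builds and then reverses such a header. It uses Node's `Buffer`, and the helper names are made up for illustration.

```javascript
// Build the Authorization header value a browser sends for Basic Authentication.
function basicAuthHeader(user, pass) {
  return "Basic " + Buffer.from(user + ":" + pass, "utf8").toString("base64");
}

// Anyone who can read the header can reverse it; no key is involved.
function decodeBasicAuth(header) {
  const decoded = Buffer.from(header.slice("Basic ".length), "base64").toString("utf8");
  const split = decoded.indexOf(":"); // the user name cannot contain a colon
  return { user: decoded.slice(0, split), pass: decoded.slice(split + 1) };
}

const header = basicAuthHeader("Aladdin", "open sesame");
console.log(header); // Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
console.log(decodeBasicAuth(header).pass); // open sesame
```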

Mythbusting PHP: 10 common myths about PHP

Saturday, February 2nd, 2008

PHP development is one of our specialties at Freelock Computing. I’ve written quite a few PHP applications, some from scratch, some starting with other people’s code, some as extensions for open source projects. I’ve also read a lot of criticism of PHP, and while some of it comes from knowledgeable programmers expert at PHP, most of it is uninformed hogwash. So in this post, I’m going to dispel many of the myths about PHP code, and identify its real strengths and weaknesses. Most myths have a kernel of truth in them somewhere, so I’ll try to set the record straight by identifying why PHP has each myth. Ready? Let’s get started.

1. PHP is crappy because there are so many crappy PHP programs.
This seems to be the biggest reason people think PHP is a bad language–there are a lot of bad PHP programs out there. Why is this? Probably because PHP is so accessible and ubiquitous that a lot of people without a programming background use it to learn programming. I’ve worked with programmers inside software companies who have much more formal background, or at least experience programming with others on a team. With somebody to guide them, they quickly learn the pitfalls to avoid, best coding practices, and development methodologies.

Most PHP coders, on the other hand, started out as web designers, putting together a web page for their neighbor, or their family, or a club of some kind. They have no formal training, no experience working on a development team, no guidance or knowledge about what makes for quality code. The result is inevitably spaghetti code: chunks cut and pasted into place without real understanding of how they work, and lines fiddled with until they give the result the author is looking for.

Naturally, this leads to a lot of crappy software out there, riddled with security holes, maintenance nightmares, poor performance, and many other problems.

That does not mean the language itself is at fault. There are plenty of well-written programs out there that do an excellent job of doing the task they’re designed to do.

Result: Busted. Bad programmers do not make it a bad language.

Let’s get a bit more specific about these code quality issues.

2. PHP is crappy because it’s hard to read all that HTML mixed in with programming logic.
Some argue that PHP is this way because it is a template language–it was designed to be an easy way to add basic programming functionality to a web site. And while that was its heritage, PHP has grown into a full-fledged powerful language capable of most anything you’d do with any other language.

Nothing in the language dictates that presentation code (HTML, Javascript) needs to intermingle with business logic. I consider the best programs to have a clear division of responsibility between these areas. We use a strong Model-View-Controller (MVC) architecture when creating custom applications, the same architecture provided by many frameworks, and advocated by experts for many other languages. And we’re hardly alone in this.

We use the Smarty template system to separate out the templates into a presentation layer. Our model is usually made up of fairly lightweight data objects that own the corresponding database tables. The controller layer is typically a dispatcher of procedural code, often with helper controller objects. You can apply most design patterns to PHP as readily as other object-oriented languages.

Now, the tools you use to develop PHP don’t enforce any of this. Unless you’re using a framework, you need to create all this structure yourself. But we don’t like HTML mixed in with our business logic ourselves, so we don’t do it.
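As a rough illustration of that division of responsibility, here is a tiny model, view, and controller. It is written in JavaScript only to keep the examples on this page in one language, and it is not Smarty syntax or our actual code. The invoice data, the `{placeholder}` template format, and every function name are invented for the sketch.

```javascript
// Model: a lightweight data object that owns its data (an in-memory array here).
const invoiceModel = {
  rows: [{ id: 1, customer: "Ann & Co", total: 120 }],
  find(id) {
    return this.rows.find((row) => row.id === id) || null;
  },
};

// View: presentation only. Escapes values and fills {placeholders}.
function escapeHtml(text) {
  const entities = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;" };
  return String(text).replace(/[&<>"]/g, (ch) => entities[ch]);
}
function render(template, data) {
  return template.replace(/\{(\w+)\}/g, (match, key) =>
    key in data ? escapeHtml(data[key]) : ""
  );
}

// Controller: asks the model for data and hands the result to the view.
function showInvoice(id) {
  const invoice = invoiceModel.find(id);
  if (!invoice) {
    return render("<p>Invoice {id} not found.</p>", { id });
  }
  return render("<h1>{customer}</h1><p>Total: {total}</p>", invoice);
}

console.log(showInvoice(1)); // <h1>Ann &amp; Co</h1><p>Total: 120</p>
console.log(showInvoice(2)); // <p>Invoice 2 not found.</p>
```

No HTML appears in the model or controller logic beyond the template strings handed to the view, which is the separation the paragraph above is describing.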

Result: Busted.

3. PHP is crappy because it’s easy to hijack with all those global variables.
Funny how people try to create all these really easy ways to do things that turn out to be large mistakes from a security point of view. Microsoft has done this over and over. PHP has two particularly annoying “features” that have turned out to be security nightmares, originally there to make it simple to program: register_globals, and magic_quotes_gpc.

register_globals is a setting that takes any parameters passed in a request and automatically turns them into a global variable you can use in your script. The problem with this is that it’s very easy for an attacker to pre-define a variable that the script assumes to be unset. As I was learning to program in PHP, I wrote a 500-line script to check and double-check that each variable I was expecting from the browser was legitimate, and that all of my other variables were suitably protected.

At its worst, register_globals turned out to allow an attacker to include a malicious PHP file from a remote web server before your script even started, by setting an autoload variable for a particular module.

register_globals is evil.

But its vulnerabilities are widely known, and PHP has shipped with register_globals turned off by default for several years now (since PHP 4.2.0). It’s going away entirely in PHP 6.

magic_quotes_gpc is more of a pain. It was added to help prevent SQL injection attacks. It escapes all of the values you receive from GET, POST, and COOKIE parameters, adding a backslash in front of any backslash or quote, so that programmers who pass these variables straight into a database query have some protection built into the language. But it causes a lot of extra work, because your script doesn’t know whether this setting is on or off. If it’s on and you escape your strings yourself, everything gets escaped twice, and you end up with backslashes scattered all over your pages. We end up checking the setting of magic_quotes_gpc and, if it’s on, stripping the slashes before the rest of our script touches the values.
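The double-escaping problem is easier to see in a small model. The sketch below imitates the behavior in JavaScript; these are not PHP's own functions, and the model ignores details such as NUL bytes. The names `addSlashes`, `stripSlashes`, and `normalizeInput` are assumptions for illustration.

```javascript
// JavaScript model of the escaping described above; not PHP's own functions.
function addSlashes(value) {
  return value.replace(/[\\'"]/g, "\\$&"); // backslash before \ ' and "
}
function stripSlashes(value) {
  return value.replace(/\\(.)/g, "$1"); // drop one level of escaping
}

// Undo the automatic escaping once, at the edge of the script,
// so the rest of the code always sees raw values.
function normalizeInput(params, magicQuotesOn) {
  const clean = {};
  for (const [key, value] of Object.entries(params)) {
    clean[key] = magicQuotesOn ? stripSlashes(value) : value;
  }
  return clean;
}

const typed = "O'Reilly";
const received = addSlashes(typed);        // what the script sees with the setting on
const escapedAgain = addSlashes(received); // the script escapes it a second time
console.log(received);                     // O\'Reilly
console.log(stripSlashes(escapedAgain));   // O\'Reilly, a stray backslash survives
console.log(normalizeInput({ name: received }, true).name); // O'Reilly
```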

For any experienced PHP programmer, these are solved problems.

Result: Busted, but there is valid criticism here.

4. PHP code doesn’t scale well.
Nonsense. This is purely myth. Here are some of the most popular sites on the Internet that run on PHP: Facebook, Flickr, Wikipedia, Digg, parts of Yahoo. All of those are in the top 20 most visited sites on the Internet.

Result: Busted. Very busted.

5. PHP is mainly a vehicle for Zend to get business.
I didn’t hear this one until just recently. Zend is a company with a strong stake in PHP. It controls a lot of the code, it has a decent editor with a debugger, a powerful framework, and a PHP accelerator available as proprietary add-ons. I’ve had a couple developers suggest that Zend has such a controlling interest in the language that it keeps others out, and you have to buy from Zend to make PHP work best.

Yet this ignores the other options out there. Zend does not have a monopoly in any of these areas. There are several other editors with good PHP debugging support, dozens of frameworks, and a handful of PHP accelerators out there, several of them completely free and open source. Now if you’re trying to change the core PHP language, you may need to work with Zend, and I have heard they aren’t necessarily the easiest to work with–they don’t readily accept changes to core features, and a few developers have left the PHP project because of disagreements over the direction of PHP. And some of these have been serious, related to hardening PHP to prevent some of the preventable security attacks through the language itself.

But as a PHP user, these issues seem far removed. PHP 6 is in development now, and promises some decent improvements such as Unicode support and removal of some of the vulnerable settings.

Result: plausible, but not relevant to most PHP developers

6. You can’t compile PHP, so it will always be slow.
PHP is an interpreted language, and it doesn’t have a built-in compiler. The same is true of at least one other web language, Perl. Python has a built-in runtime compiling system, so you get compiled byte-code without having to do anything. I don’t know that much about Ruby in this area.

But you can accelerate PHP in much the same way, by adding an accelerator. Zend has a proprietary one. We use eAccelerator on our servers, and there are several others out there. These provide what is called an “opcode cache.” When PHP is executed, the interpreter makes two passes: first a conversion of the source to bytecode (opcodes), and then execution of that bytecode. An opcode cache stores the result of the first pass, so subsequent calls can skip it and use what is essentially compiled code. While it’s not permanent, and probably not as efficient as other compiled languages, it does seem to allow our servers to accommodate about 40% more traffic before bogging down.
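The idea of skipping the first pass can be shown with a toy analogy. This is not how eAccelerator or any real opcode cache is built; a real one keys on the script file and notices when the file changes. Here "compiling" just means turning a source string into a JavaScript function, and all names are made up.

```javascript
// Toy analogy of an opcode cache: compile each source text once, then reuse it.
function createCachedRunner() {
  const cache = new Map(); // source text -> compiled function
  let compileCount = 0;

  function run(source, x) {
    let compiled = cache.get(source);
    if (!compiled) {
      // First pass: convert the source text into something executable.
      compiled = new Function("x", "return (" + source + ");");
      compileCount += 1;
      cache.set(source, compiled);
    }
    // Second pass: execute the already-compiled form.
    return compiled(x);
  }

  return { run, compiles: () => compileCount };
}

const runner = createCachedRunner();
console.log(runner.run("x * 2", 21)); // 42
console.log(runner.run("x * 2", 5));  // 10
console.log(runner.compiles());       // 1: the second call skipped the first pass
```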

Combining this with other caching strategies can allow PHP sites to scale up to serve the largest sites.

Result: Plausible, but workarounds available.

7. You can’t develop in PHP as fast as other languages. Like Ruby on Rails.
Ok. Now we’re getting to the ridiculous one. First off, Rails isn’t a language, it’s a framework. And by many accounts, it’s a good one, providing a lot of really powerful features right out of the box. It might have set a new high standard for developer-friendly frameworks. But it’s hardly the only one out there, and because it’s open source, many of the conventions it established have spread widely to other frameworks as well. CakePHP is a framework that aims to be the Rails for PHP.

Rails has its downsides as well. The CEO of Dreamhost has an interesting post about his experiences trying to get Rails to scale. While it may be fast to develop in, that may come at the expense of running fast enough to handle large loads. You also need to learn Ruby, whose syntax is quite different from PHP’s. PHP is quite similar to C, Java, Perl, and other popular languages, so it’s immediately familiar to many other programmers.

The biggest problem I have with Rails is the dogmatic nature of many of its practitioners. And it has gotten such widespread buzz in such a short period of time that in some ways it’s become the new PHP: the new pet technology of a lot of inexperienced programmers, due to a low barrier to entry. If you’re a web designer and not already a programmer, you would probably choose Rails to get started with today, instead of PHP, because of all the hype. I think that’s going to lead to the same proliferation of lousy code that permeates the PHP landscape now.

While Ruby may be a nice language, there’s a lot more support for PHP right now, in available talent, web servers, scaling experience, and breadth of libraries available. And by starting with an application that meets 90% of your needs today, you can work on what makes your particular problem unique. Since so many applications and libraries are available for PHP that need very little customization to meet many business problems, developing from scratch with a powerful framework isn’t necessarily the fastest way to get the job done.

Result: Busted. While Ruby on Rails is nice, it’s not the only way to build an application quickly.

9. PHP is only good for web applications–it’s no good for anything else.
PHP was built to be a web application language, but it has a command line interface, a GUI toolkit based on GTK, and other features that mean you can feasibly write just about any kind of application you can think of in PHP.

However, nobody does. I have not seen a single PHP desktop application in use. While we do use it for scripting a few web server-related tasks, they tend to be maintenance tasks or a forked process from a web application. There aren’t lightweight PHP libraries optimized to run on embedded devices.

As I look over options to write software for the OpenMoko platform, for example, PHP does not appear to be a compelling option. Likewise, it seriously lacks the ability to interact with hardware or much on the OS level without calling a shell to start some other program. Perl has long been used for these environments, but Python has been taking them by storm.

So while it’s possible to use PHP for purposes other than web applications, it’s not convenient, conventional, easy, or widely done.

Result: Confirmed.

10. It’s not a real language–you can’t do proper object-oriented designs, objects are copied, etc.
PHP was never designed by computer scientists. You could argue it wasn’t designed at all. It was built from the beginning to solve a specific problem: to make active web sites. And it’s successful because it’s done that exceedingly well.

Over time, it has accumulated modules that do just about anything you might want to do in a web application, from talking to just about any database system out there to requesting pages from other servers to processing financial transactions to generating images and even PDF files. It added object orientation in PHP 4, and made it much more robust and similar to other languages with PHP 5. While it still doesn’t do multiple inheritance, or threading, or similar advanced programming techniques, you can implement most common design patterns, objects are now passed by reference, there are constructors and destructors and all sorts of things that give it as much power as any other language for most web applications.

While PHP certainly has its shortcomings, for the vast majority of web applications, it provides exactly the right combination of sufficient power to do the job, and a relatively straightforward way of getting the job done.

Result: Busted, for all but the most specialized applications.

That’s enough for now. In a future post, I’ll discuss the major drawbacks and benefits of PHP. Stay tuned!

The three spheres of web application platforms

Saturday, February 2nd, 2008

There are thousands of languages out there, but only a couple of handfuls are used for web applications. Of these, PHP is a runaway success. Yet I constantly see it criticized by developers of other languages, often for completely untrue reasons. PHP has a bad rap, and while it certainly has its pitfalls, there are many good reasons it has become such a popular language for web applications.

I consider there to be three major sets of languages currently used for web development. When talking with developers, you’ll usually find them gravitating to one of these three spheres: the Windows world of Microsoft ASP, ASP.NET, Cold Fusion, C#; the Java world; and the LAMP world. While some programmers cross between these, you’ll usually find people that are best in one particular area.

The Microsoft world grew out of ASP and Cold Fusion into the current .NET technologies. There is now an open source version of .NET called Mono, backed by Novell, which makes these technologies cross-platform. They’re mainly used by Microsoft and its partners, and small proprietary software companies in all sorts of vertical industries. Very few .NET applications are open source, compared to the other technologies.

The Java world seems to dominate the large enterprises. Companies that work with IBM extensively end up with Java-based enterprise applications–and there are a lot of them. Java was the “next big thing” in the second half of the 1990s, but it only seemed to gain a real foothold in large business. Quite a few of these applications are open source, and there’s a lot of applications large and small you can download freely and deploy–or pay thousands of dollars to a middleware vendor to have them get you running. Java has a wide mix of open source and proprietary applications available.

LAMP stands for “Linux, Apache, MySQL, and PHP,” though there are other P’s out there like Perl and Python. This describes the other major technology stack used in the web world, and follows the Unix design of small pieces loosely joined–you can substitute PostgreSQL for MySQL, another web server for Apache, and many other languages for PHP. There are far more open source applications available on the LAMP stack than the other two combined, mainly because the barrier to entry is really low–all you need is a spare old computer to install the stack, and all the software is free.

There used to be another popular language, TCL, running on the AOLServer, but you really don’t see much of that these days.

If you’re developing a web application, you can use any of these technology platforms to get the job done–in a web environment, they are all pretty much equivalent. Java and .NET have better support for desktop applications, but if your main interface is a web browser, there’s nothing you can’t do in LAMP that you can in the others.

LAMP is a family of technologies, with more variety than the other stacks. For the language, besides the ‘P’ languages of PHP, Perl, and Python, there’s also Ruby that has gained a lot of popularity lately. MySQL and PostgreSQL regularly vie for the database slot. Apache pretty much has the web server part locked up, but Linux can even be replaced with Windows to make it the WAMP stack and you can still run most of the same programs.

So why group technologies into these stacks? Mainly because they work well together on the same system. This boils down to the web server part of the system. If you’re using Microsoft IIS for your web site, you’ve got .NET, and while it’s possible to add PHP or Perl, it’s not commonly done. For Java, you need an application server. But Apache makes it pretty easy to plug in all sorts of the open source languages as modules, and run a bunch of them simultaneously. Many of these differences are historical and cultural, not really technical. It’s just that these particular sets of technology are regularly used with each other, so they’re going to be easier to get running and working correctly.

Let’s take a closer look at the LAMP family. Like many families, there’s in-fighting and bickering over who is best at what job. PostgreSQL people look down their noses at MySQL, which they clearly consider to be inferior in just about every way (with some justification). Perl people wonder why others program in anything else, Python people think the other languages make programming too difficult, and Ruby programmers pride themselves in writing the shortest code to get the problem solved. They all sneer at PHP, regarding it as a toy language not capable of real programming. Yet you’ll find more open source web applications written in PHP than all of the rest of them. Why is this?

Read my next post to find out why.

Reliable code: building in robustness

Saturday, January 19th, 2008

Ok. Last post on the quality code series. One of the downsides of getting older is realizing you do have shortcomings. You know how when you’re young, going into a job interview, the toughest question is the one about your weaknesses? We’re all quite blind to our weaknesses, until experience comes up and forces you to realize you’re not perfect. Sometimes this happens early, sometimes late, but it happens to everyone sometime.

My coding weakness, it turns out, is reliability. I’m terrible at handling errors, building test frameworks, doing unit testing. I find all of that stuff quite boring. But it’s essential to building a reliable application.

Reliability and security go hand in hand. In security, you’re looking at the attacks, and making sure your code is secure against them. In reliability, you’re identifying what each chunk of code expects to get, and then defining how to handle exceptions and unexpected input. Done correctly, reliable code is secure. But it’s a total pain to do, and it takes a lot longer to get there.

One of the code samples I examined recently was set up in a completely class-driven way, though I would not call it object oriented because none of the classes extended other classes. It was a rather simple, flat collection of objects and helpers and interfaces. It was not powerful. My guess is, it is not fast. It did not look very customizable. But it was certainly clear, and every single method inspected every single parameter, making sure the input was valid. Calls to other objects had extensive error handling built-in — this application looked like it could not fail without notifying the programmer exactly where the failure was, with helpful feedback.
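To give a feel for that style, here is a small sketch of a method that inspects every parameter and fails with a message that says exactly what was wrong. It is not taken from the code sample I examined; the function `transferFunds` and its rules are invented for illustration.

```javascript
// Every parameter is checked before any work is done, and each failure
// names the function, the parameter, and the value that was rejected.
function transferFunds(fromAccount, toAccount, amountCents) {
  if (typeof fromAccount !== "string" || fromAccount === "") {
    throw new TypeError("transferFunds: fromAccount must be a non-empty string, got " + JSON.stringify(fromAccount));
  }
  if (typeof toAccount !== "string" || toAccount === "") {
    throw new TypeError("transferFunds: toAccount must be a non-empty string, got " + JSON.stringify(toAccount));
  }
  if (!Number.isInteger(amountCents) || amountCents <= 0) {
    throw new RangeError("transferFunds: amountCents must be a positive integer, got " + JSON.stringify(amountCents));
  }
  if (fromAccount === toAccount) {
    throw new RangeError("transferFunds: fromAccount and toAccount must differ, both were " + JSON.stringify(fromAccount));
  }
  // Only validated input reaches the real work (a plain record in this sketch).
  return { fromAccount, toAccount, amountCents };
}

try {
  transferFunds("checking", "savings", -500);
} catch (err) {
  console.log(err.message);
}
```

Most of the function is checking rather than doing, which is exactly why this style is tedious to write and pleasant to debug.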

This is tedious work. I save it for the polishing phases of a project, focusing on getting things to work in the first place. But there’s a strong argument to be made for building reliability into each module from the start. It’s a very different style of programming, and takes a lot longer to get there, but the end result will inevitably be more secure, less buggy, and more able to account for every possible scenario–even if it handles a scenario by saying “I can’t do that yet.”

I think there’s a personality difference between these development styles. The artist figures out some innovative way of solving the problem, gets a proof-of-concept working brilliantly quickly, and cranks through code producing a huge amount in a short amount of time. The craftsman takes a slower, methodical approach, crafting each module individually, building unit tests to make sure it works correctly as he goes, and building a system piece by polished piece.

Successful projects need both. The artist/hacker provides vision, drive, and momentum. The craftsman makes sure the system can handle the load, and can prove it’s doing what it’s designed to do.

The 80/20 rule comes into play here. 80% of the features can be hacked together very quickly, in the first 20% of the project. To make the project stand the tests of time, handle everything that might be thrown at it, and act as a foundation for a business or a mission-critical part, you need the craftsman to do the remaining 80% of the work to finish the job and get that final 20% of the functionality complete.

So here’s a checklist for evaluating reliability of a project:

  • Is the program broken up into discrete modules that can be completely tested one at a time?
  • Are there unit tests built for each module, testing the output for normal and exceptional conditions?
  • Is the input to each module validated and properly tested to handle all possible things that may be passed to it?
  • Does the module handle non-normal input, and raise the appropriate errors?
  • Are there regular tests of the software as a whole, and each module, to identify tests that fail, or regressions in the code?
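
As a rough illustration of the validation items on this checklist, here is a minimal sketch, assuming PHP 7 or later. The order_total() function, its parameters, and the throws() test helper are hypothetical names made up for this example, not code from the project I reviewed.

```php
<?php
// Hypothetical module: computes an order total from line items.
// Every parameter is inspected before any work is done.
function order_total(array $items, float $taxRate): float
{
    if ($taxRate < 0 || $taxRate > 1) {
        throw new InvalidArgumentException('taxRate must be between 0 and 1');
    }
    $subtotal = 0.0;
    foreach ($items as $index => $item) {
        if (!is_array($item) || !isset($item['price'], $item['qty'])) {
            throw new InvalidArgumentException("item $index needs price and qty");
        }
        if (!is_numeric($item['price']) || $item['price'] < 0) {
            throw new InvalidArgumentException("item $index has an invalid price");
        }
        if (!is_int($item['qty']) || $item['qty'] < 1) {
            throw new InvalidArgumentException("item $index has an invalid qty");
        }
        $subtotal += $item['price'] * $item['qty'];
    }
    return round($subtotal * (1 + $taxRate), 2);
}

// Tiny test helper: true when the callable throws the expected exception class.
function throws(callable $fn, string $class): bool
{
    try {
        $fn();
    } catch (Exception $e) {
        return $e instanceof $class;
    }
    return false;
}
```

A unit test for this module would cover both the normal case and the exceptional ones, for example checking that a negative price is rejected rather than silently added to the total.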

The only way to ensure reliability is through rigorous testing. Some of the newer programming practices rely on test-driven development–first you define what a module does, then you develop a test for it, and then only after all that do you finally develop the module until it passes all the tests.

In a small business environment, this all may be too much overhead. 80% of an application may be enough, and at 20% of the cost, much more in line with the budget. But when you need something to be completely reliable, take a look at the testing framework, how much it covers, and how much of the application passes the tests.

Customizable code: writing future-proof code

Saturday, January 19th, 2008

Before code can be customizable, it must be clear. But clarity is not enough, if you’re going to be using a codebase in multiple places.

Many open source projects excel at customization. People have enough different uses for an application that very few work perfectly out of the box for everybody. Most companies want to apply their branding to the software they use. Some people need an application localized and translated for their audience. Sometimes a company just needs a small change to make the software better fit their needs.

It’s relatively simple to customize any application, if you have the source code. What becomes a huge challenge is maintaining your customizations when the underlying software is updated. If the software is not designed with specific ways of customizing it, it’s going to end up being difficult to maintain, unless you have gotten your changes incorporated back into the original software.

Architecting for customization
Applications that are designed for customization have clear divisions of code. This can happen for several different areas:

  • Templates or Themes. Most people want to be able to change the look and feel of a web application. If it has a template or theme system, you can just create a new theme and turn it on. Upgrades can then happen without clobbering your changes.
  • Language. Most successful open source projects have separate language files containing all of the labels, instructions, menus, and other text the application shows. Many come with multiple translations, and accept others as people contribute them.
  • Add-ons, plugins, modules, and components. Content management systems like Joomla and Drupal are particularly strong at this. SugarCRM is, too. They have a well-defined way of adding new functionality to the application, keeping it self-contained in a separate unit of code that a site administrator can manage through the interface.
  • An override mechanism. Some programs make it easy to replace the default behavior with your own version. ZenCart does this well–you can take many different core files, copy them into a particular directory associated with your site, and change them to make it do what you want. Upgrades to ZenCart will still use your versions of the files, even if the underlying file changes.
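
To make the override idea concrete, here is a minimal sketch, assuming PHP 7 or later. It is not how ZenCart actually implements overrides; the resolve_file() function and the directory layout are assumptions for illustration only.

```php
<?php
// Hypothetical helper: return the site's override of a file if one exists,
// otherwise fall back to the copy shipped with the core application.
// Note: this sketch does not guard against "../" in $relativePath.
function resolve_file(string $relativePath, string $overrideDir, string $coreDir): string
{
    $override = $overrideDir . '/' . $relativePath;
    if (is_file($override)) {
        return $override; // the site's customized version wins
    }
    $core = $coreDir . '/' . $relativePath;
    if (is_file($core)) {
        return $core; // untouched core file, safe to replace on upgrade
    }
    throw new RuntimeException("No core or override file for $relativePath");
}
```

Because the application only ever asks resolve_file() for a path, an upgrade can replace everything under the core directory without touching the override directory.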

When you’re customizing an application, all of the other aspects of quality code apply to your customizations, as well as the original code. Your add-on is faster and more secure if you use the application’s interface for retrieving data instead of including your own. Your add-on is more powerful, clear, maintainable, and reliable if it uses the application’s defined ways of customizing it.

While not all open source is designed to be customized, it’s a strong consideration we’re looking at when we evaluate a project. So what do you do if you need to customize something that’s core to an application?

Customizing software not designed to be customized
If you need to make changes to the core part of an open source project, you’re setting yourself up for a maintenance nightmare. All active server software has updates. No program is perfect. Somebody, somewhere, will find a way to crack into it, and if you have business data or unethical competitors or disgruntled customers or employees, you will get targeted eventually. In the security community, people publish vulnerabilities in programs, so that they may be fixed. That means if you’re using common software packages, somebody needs to maintain them.

If you’re using software designed to be customized, and all your customizations are outside of the core code, this is not a major problem. A system administrator updates the core software, and if any of your customizations break, your developers update your customizations. However, if you had to make a lot of changes to core files, you’re in trouble. You either need to re-implement the security fixes in your code, or re-implement your customizations in the updated code.

There are basically 3 strategies for minimizing these issues:

  1. Use strong source-code management tools to manage your changes as patch-sets, and re-apply them at each upgrade, rewriting sections that no longer work.
  2. Fork the project, and take over responsibility for managing your branch. You’ll need to track the vulnerabilities in the parent project, and re-implement security fixes in your own.
  3. Contribute your changes back to the original project, and persuade the maintainer to incorporate them into the main code tree.

When you look at these alternatives, clearly #3 is far less expensive for you than the other two–your customizations are no longer customizations, but part of the core software. This is actually how open source develops, and how you may change from being an open source consumer to an open source contributor.

Clear code: Building understandable applications

Tuesday, January 15th, 2008

Programming is an exercise in understanding a problem. To program effectively, you need to fully understand, in intricate detail, the problem your program is solving. Sometimes as a programmer you don’t fully understand the problem until you’ve wrestled with it a few times in code.

Most experienced programmers will tell you that when creating a large program, you almost always have to scrap your work at least once. At some point, you find that you’ve programmed your way into a dead end, that you just can’t quite get where you’re trying to go without doing it again. This is part of the process of understanding the problem, and usually once you’ve made this leap, you can visualize the whole thing laid out before you, and the next go around leads to a useful, functioning program. Not only that, but the next go-around has a much higher percentage of clear, understandable code.

Clarity in code is a sign of the maturity of the application. It’s also a sign of requirements that haven’t changed from the original. Inevitably, in the real world, code accumulates hairy sections to deal with changing requirements, accreting moss, dirt, and all sorts of cruft as the real world steps in to make things messy. The more clear, organized, well-defined, and well-documented a code base is, the longer it will last in the real world before needing a major revision.

If you see a project that seems completely transparent, easy to figure out, and easy to change, you’re probably looking at code that has been through some serious revision, and has been recently refactored to reflect the problem it’s trying to solve. As long as the fundamental assumptions of the design do not change, clean code is easy to enhance, extend, and otherwise adjust to meet new requirements. Until it gets hairy again and is time to start again.

Clean code is elegant. Clean code is flexible. Clean code is related to powerful code, but code can be powerful without being clean.

Here are some principles we use to develop or identify clean code.

Use a good overall architecture for your application.
Like many other software companies, we use a Model-View-Controller architecture for most of our projects. The Model defines the problem space, what data needs to be stored, and how it’s broken down. The View is the human interface, the presentation of the software to the user. The Controller connects the model to the view, and often enforces authorization rules and the interface to other systems.

In our applications, the model is almost always object-oriented. We build up classes of objects that correspond to what we’re modeling. We like using template systems like Smarty for the view, so our designers and front-end coders can change the presentation without affecting core business logic. Our controllers are a mix of objects and functional code, whatever seems most appropriate for the overall system.

Normalize data as much as practical.
In database terms, normalization is the process of identifying all the properties of all the objects that have a one-to-one relationship to each other, that fit cleanly in the same database table. For example, a contact has only one first name and one last name, one father, and one mother (at least in the biological sense), but might have more than one email address, mailing address, and phone number. When modeling this data structure, you might decide to have one contact table that allows for 3 email addresses. Or you might have a separate email address table that allows any number of email addresses associated with a contact. If you were going to fully normalize this data, you would have separate email address tables, phone number tables, and physical address tables. But is this really practical? Does your particular system need to track all the email addresses of a user, or is one (or two) enough? If you can limit it to one email address, it might make a fine unique identifier for your system, if you know your users don’t share email addresses.

But if you’re going to track three contacts for a company, why not normalize this into a separate table, and remove the arbitrary limitation? I shudder when I see fields named “email1, email2, email3, email4.”

Each database table should be owned by a single class.
If you have a contact table, you should probably have a contact class to manage it. While other classes may query this table in a join, those classes should be getting only specific fields from the table. Only the contact class should write to the contact table, and in most cases, all requests for any contact details should go through the contact class. The rest of your application should talk to a contact object, rather than the underlying data, except when you’re trying to optimize for speed.

The main benefit of this approach is that you can more easily change the structure of your database tables with minimal impact to your application. If you decide that you really do need more than one email address for a contact, you can do most of the heavy lifting in the contact class, and only need to make small changes to the template to show the new data. The other parts of your application should be unaffected, because they simply request the default email address from your contact object–which is smart enough to know that’s now coming from a different table.
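
Here is a minimal sketch of that idea, assuming PHP 7.1 or later. The Contact class below is hypothetical, and the $emailRows array stands in for rows that would really come from a separate email address table.

```php
<?php
// Hypothetical Contact class. Callers never see where the email data lives.
class Contact
{
    private $firstName;
    private $lastName;
    private $emailRows; // each row: ['address' => string, 'is_default' => bool]

    public function __construct(string $firstName, string $lastName, array $emailRows)
    {
        $this->firstName = $firstName;
        $this->lastName = $lastName;
        $this->emailRows = array_values($emailRows);
    }

    public function getName(): string
    {
        return $this->firstName . ' ' . $this->lastName;
    }

    // The rest of the application only ever asks for "the" email address.
    public function getDefaultEmail(): ?string
    {
        foreach ($this->emailRows as $row) {
            if (!empty($row['is_default'])) {
                return $row['address'];
            }
        }
        // No row flagged as default: fall back to the first address, if any.
        return isset($this->emailRows[0]) ? $this->emailRows[0]['address'] : null;
    }

    // New capability added later, without changing existing callers.
    public function getAllEmails(): array
    {
        return array_column($this->emailRows, 'address');
    }
}
```

Code that was written against getDefaultEmail() when a contact had a single email column keeps working after the data moves to its own table.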

If you really need to do sophisticated table joins to make your application fast, consider setting up a query builder structure. We sometimes set up static methods on a class that modify the different parts of a query to add the desired fields and do the appropriate joins.
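
One way such a query builder could look is sketched below, assuming PHP 7 or later. The Query and ContactJoin classes and the table and column names are all hypothetical; this is an illustration of the approach, not our production code.

```php
<?php
// Hypothetical sketch: a query is a plain array of parts, and each class
// contributes the fields and joins it owns through a static method.
class Query
{
    public static function create(string $table): array
    {
        return ['fields' => [], 'from' => $table, 'joins' => [], 'where' => []];
    }

    // Assemble the parts into SQL. Real values should be bound as
    // parameters by the database layer, never concatenated in.
    public static function toSql(array $q): string
    {
        $sql = 'SELECT ' . implode(', ', $q['fields']) . ' FROM ' . $q['from'];
        foreach ($q['joins'] as $join) {
            $sql .= ' ' . $join;
        }
        if ($q['where']) {
            $sql .= ' WHERE ' . implode(' AND ', $q['where']);
        }
        return $sql;
    }
}

class ContactJoin
{
    // Adds only the specific contact columns another class needs.
    public static function addToQuery(array $q, string $foreignKey): array
    {
        $q['fields'][] = 'contact.first_name';
        $q['fields'][] = 'contact.last_name';
        $q['joins'][] = "LEFT JOIN contact ON contact.id = $foreignKey";
        return $q;
    }
}

// Usage: the task code drives the query and asks the contact code for its part.
$q = Query::create('task');
$q['fields'][] = 'task.id';
$q['fields'][] = 'task.title';
$q = ContactJoin::addToQuery($q, 'task.owner_id');
$q['where'][] = 'task.done = false';
$sql = Query::toSql($q);
```

The knowledge of which contact columns exist stays with the contact code, even though the join runs as part of a task query.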

Define who is responsible for what.
I’m not talking about people here–I’m talking about classes, files, and functions. Just like classes in the model own particular database tables, you should define which part of the application is responsible for all of the major parts of an application: authentication, authorization, state, the structure of the URL, form handling, initialization, etc. Each one of these functions should be owned by a particular part of the application. This “meta” stuff about the system we usually leave in the controller, often with included files dedicated to particular features. We usually build helper methods into base classes inherited by all of our data objects in the model, specifically for state and authorization.

Authentication, verifying that a user is who they say they are, should be consistent across your application. You usually have people log in with a username and password. The problem is, because the Web is stateless, you need to verify that you’re still talking with the same user on every single request. To do this, you either use http authentication, which passes the same credentials with each request, or you give the browser a token that you match up in a session. Your web application needs to verify the session or credentials with every single request, if it does anything that you don’t want the Internet at large to be able to do.

Authorization, granting access to particular objects and methods for particular users, can be a bit more complicated. There are several different models for authorization: simple ownership, group ownership, user levels, and full-fledged access control lists. Authorization can either be handled by the controller or by the model itself. If the code is clear, it should be apparent where authorization is handled, and how it may be changed.

Small Pieces Loosely Joined.
Even more than powerful programming, clear programming means breaking things up into manageable, understandable chunks. Each class in the model should correspond to the objects in the real world you’re modeling. Typical methods on classes in our models are between 5 and 25 lines of PHP code. Some reach 30 or 40 lines, and only the really ugly ones reach 100 lines. If a method is reaching that threshold, it can probably be broken into several smaller helper methods that make the main method more readable. If these helper methods can be reused by other methods, well, you’re killing two birds with one stone. More often than not, this level of refactoring distills the essence of the problem down into components that make your code more powerful.

Most of the long methods in our code seem to be related to form processing, parsing different parameters to insert or update data across multiple database tables. Through a combination of setting up property maps inside the object, clever getter and setter methods, and utility methods that iterate across relevant properties, these long methods can be reduced to a few calls that make the method much more portable, resilient to bad data, and more easily overridden from subclasses, too.
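
A property map along these lines might look like the following sketch, assuming PHP 7 or later. The DataObject base class, the Project subclass, and the field names are hypothetical.

```php
<?php
// Hypothetical base class: a property map lists which form fields an object
// accepts and which function cleans each one.
abstract class DataObject
{
    protected $data = [];

    // Subclasses return [property name => cleaner callable].
    abstract protected function propertyMap(): array;

    // One generic method replaces a long run of per-field form handling.
    // Returns the names of any submitted fields that were ignored.
    public function setFromForm(array $form): array
    {
        $ignored = [];
        $map = $this->propertyMap();
        foreach ($form as $key => $value) {
            if (!isset($map[$key])) {
                $ignored[] = $key; // unknown input is skipped, not trusted
                continue;
            }
            $this->data[$key] = call_user_func($map[$key], $value);
        }
        return $ignored;
    }

    public function get(string $key)
    {
        return isset($this->data[$key]) ? $this->data[$key] : null;
    }
}

class Project extends DataObject
{
    protected function propertyMap(): array
    {
        return [
            'name'   => 'trim',
            'budget' => 'floatval',
            'active' => 'boolval',
        ];
    }
}
```

A subclass that needs different fields overrides propertyMap() only, and the form handling in the base class stays untouched.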

Create effective documentation.
I’m just starting to get into the habit of creating JavaDoc/PHPDoc style of comments, documenting each function and method. I’m a long time user of the Komodo IDE from ActiveState, and it kindly shows you the comment immediately preceding a function you type, in a tooltip as you provide parameters. Being able to see what parameters your method is expecting, what it returns, and any gotchas about using it without opening the file containing the class, saves a lot of time during development. Those kinds of comments I consider to be required.
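
For readers who have not seen the style, here is a small sketch of a PHPDoc comment block, assuming PHP 7 or later. The working_days() function is a made-up example; the tags (@param, @return, @throws) are standard PHPDoc tags.

```php
<?php
/**
 * Count the working days between two dates.
 *
 * Gotcha: the start date is counted, the end date is not.
 *
 * @param DateTime $start First day to count.
 * @param DateTime $end   Day after the last day to count.
 * @return int Number of weekdays (Monday to Friday) in the range.
 * @throws InvalidArgumentException If $end is earlier than $start.
 */
function working_days(DateTime $start, DateTime $end): int
{
    if ($end < $start) {
        throw new InvalidArgumentException('end must not be before start');
    }
    $days = 0;
    $cursor = clone $start; // clone so the caller's object is not modified
    while ($cursor < $end) {
        if ((int) $cursor->format('N') <= 5) { // 'N' is 1 (Mon) to 7 (Sun)
            $days++;
        }
        $cursor->modify('+1 day');
    }
    return $days;
}
```

The comment carries exactly the things I want in a tooltip: what to pass, what comes back, and the one surprise about the end date.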

On the other hand, a comment that states the obvious is a waste of space. Comment anything unusual or unexpected. For example, if I assign a variable in an “if” expression, I’ll put a comment that I meant to assign it, that it’s not just missing the extra =.
if ($a = $b->value) // assigns value to $a, skips section if value is false

Related to inline code comments, use descriptive variable names, and consistent placeholders. I use $i, $j, $k for loops, $ar for generic arrays in helper functions, $obj for an unknown object, $t for a global Smarty template object. Otherwise I’m referring to $task, $oldtask, $project, $user, and $todotomorrow.

For complex projects, inline comments are not enough. You need a solid architectural document that illustrates objects and their relationships, workflow, and how to customize. Diagrams are good.

Finally, clear code is tidy code. While PHP isn’t as picky about tabs and whitespace as Python, properly nested code blocks promote readability, help keep your code valid, and give you a quick indication about how deep you are inside a function.

Clear code invites customization, enhancement, and further development. Clear code is maintainable, and a sign that an application can likely be kept up-to-date for quite a while to come. Clear code takes more time to develop, but usually indicates a better understanding of the problem. Clear code is more portable, more reusable for other purposes, and more powerful.

Powerful code: Get more out of every line

Monday, January 14th, 2008

Programming borrows a lot from the construction industry. Many programming terms derive from construction: hacking, builds, development, architecture, scaffolding, frameworks, and dozens of others. But in some ways, programming has an element of power beyond construction.

Take, for example, a building. When you build a building, you start by pouring a foundation. On top of that, you construct a skeleton, add walls, a roof, sheetrock, siding, and all the plumbing and electrical. Each one of these details needs to be built by somebody–all four walls of each room need to be framed in, wired, and finished.

In the world of programming, however, you really only need to build one wall, and then the computer can create as many copies as you need. So when building your program, you might create a “wall” class, which is composed of a bunch of two by fours, sheathing, sheet rock, wiring, and outlets. You might give your wall a set of properties: width between studs, overall width, overall height, position of outlets, the number and dimensions of windows and doors, etc.

Once you have a wall defined with a bunch of appropriate variables, you can then work up to defining a room. Your room might have four walls, with windows and doors in particular positions. Obviously, there are new levels of complexity here, but you don’t have to build every single wall if you can just specify a new wall with particular characteristics.

Now that we have a generic room, we can extend our room model by creating specific types, or sub-classes, of rooms: bedroom, bathroom, kitchen, utility room. And then we can define an apartment as a particular combination of rooms, and an apartment building as a particular combination of apartments.

A powerful program is one that allows you to say, “give me an apartment building with this many apartments of this base floorplan, and put it here.” A few lines of code specifying any details that vary from your standard, and you’re done with the basic system–you can start creating custom trim.

Object-oriented programming is powerful because it lets you start with a basic model, and extend it to create variations. Each variation (or subclass) inherits all the hard work that went into the underlying class, but adds only the details that make it different. The bathroom extends a generic room by adding plumbing and fixtures.
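
In PHP, the room analogy might be sketched like this, assuming PHP 7 or later. The class names follow the analogy; the methods and properties are invented for illustration.

```php
<?php
// A generic room: all the shared work lives here, once.
class Room
{
    protected $width;
    protected $length;

    public function __construct(float $width, float $length)
    {
        $this->width = $width;
        $this->length = $length;
    }

    public function area(): float
    {
        return $this->width * $this->length;
    }

    public function features(): array
    {
        return ['walls', 'ceiling', 'floor'];
    }
}

// A bathroom inherits everything from Room and adds only what is different.
class Bathroom extends Room
{
    public function features(): array
    {
        return array_merge(parent::features(), ['plumbing', 'fixtures']);
    }
}
```

Bathroom never mentions width, length, or area, yet it has all three, because it inherits them from Room.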

To me, this ability to inherit properties from other objects is the main reason to write object-oriented code. Some languages (like Java) force you to do everything in an object-oriented way, which strikes me as less practical–you need to find design patterns that work with that model to accomplish what you’re trying to do. But object orientation provides a powerful way of modeling a system.

When I review code, I’m looking for object orientation used in an effective, sensible way. Each real world object being modeled in a system should have a corresponding class in the underlying system. Classes should extend some basic data class to avoid repeating the same methods in a bunch of separate classes. Code should be built up into units that can become parts of other units, so that individual chunks can be kept small and understandable. If any PHP file ends up longer than a thousand lines, I start looking for ways of simplifying, streamlining, sharing code with other modules. If any individual method ends up longer than a hundred lines, it should be doing something extremely unusual that isn’t necessary anywhere else.

The Unix architecture is often summarized as “small pieces loosely joined.” Each identifiable chunk should be small and have a clearly defined purpose. Assembling these small pieces into a larger system results in great power while also allowing for reliability, security, and actually getting the project finished.

It’s all a matter of scope. When you’re looking at a wall object, you are working with two by fours, nails, and sheetrock. When examining a room, you’re working with walls, a ceiling, and a floor. Programming should hide the details of lower layers, and allow the programmer to focus on the necessary detail for the scope of the module she’s working on. The result is powerful code.

Why would you not need powerful code?
Pascal (and many others) is credited with the idea that it takes longer to write shorter code. This series of blog entries certainly illustrates the concept… The same principle holds true in code. If you’re creating a web application that’s never going to need revision, it can be much quicker to just write as you go and end up with some big long pile of spaghetti code. But the instant you need to change it, or worse, somebody else needs to change it, that quickly written, long-winded code takes a lot more time to update.

As far as I’m concerned, the only reason to not take a structured, measured, powerful approach to coding is that you need something temporary working today, and don’t care that you’ll probably have to scrap it and do it right later.

How do you create powerful code?
Powerful code comes from structure. Frameworks deliver structure. This does not mean a particular framework is powerful for your application.

A skyscraper needs a much stronger foundation, and far better design to prevent collapse than a house. In programming, you can either use somebody else’s framework, or build or grow your own.

Developers love building frameworks. It’s fun to think of all the things that people might someday do with your framework, and build in a mechanism that provides useful ways of doing those things. The problem is, build too many features into the framework and you just end up with a large bloated blob of code that nobody uses entirely, that nobody even knows how to use properly. Make your framework too small, and people end up having to do more work in the actual application.

The hot framework right now is Rails. It has a lot going for it–a solid philosophy of convention over configuration, auto-creation of all sorts of things like database tables you otherwise have to build yourself, and other features I’m sure you’ve heard about already from all the Rails developers out there.

Personally, I think frameworks like Rails are overrated, hiding too much of the implementation to be valuable. The perfect analogy for this is photography. If you take a basic photography course, you learn about the basic fundamentals: lens focal length, aperture, shutter speed, focus distance, and film speed. That’s all you need to know to take great pictures with any camera–at least any that allows you to set these things manually. Most cameras these days try to automate all of this for you, and most of the time they do a reasonable job. But most cameras also have a whole set of special settings. My Casio has a “Best Shot” mode, designed to set the camera up for different scenarios: landscapes, portraits, evening shots, indoors, backlit, etc. Some of these modes do really sophisticated things, but is it better for a photographer to understand all the different programmed modes, or the fundamentals of photography? I would argue the latter–with an understanding of how photography works, you can operate any camera. With an understanding of the programmed settings of a particular camera, you’re lost as soon as you move to another.

That’s the problem with frameworks–you spend more time learning all the ins and outs and arbitrary ways of tweaking it, instead of focusing on the actual task at hand–taking good pictures. Then again, I prefer a stick shift to an automatic every time…

When it comes to frameworks, less is more. The simplest possible framework that fits your application requirements is the one to use. If you can’t find one that fits, start with some simple data objects, an effective template library, and build your own, but don’t spend too much time on it–let it grow as you need it.

In the grand scheme of things, I don’t need a framework to create a database table for me–that’s a lot of extra code for something that only happens once. But for all those things you do need more than once–for the walls, rooms, and apartments in your building, design with care and power in mind.

For more about power, go read Paul Graham’s essay, Succinctness is Power. Then follow it up with Holding a Program in One’s Head.

Fast code: Speed and Scalability in PHP applications

Sunday, January 13th, 2008

Continuing on the series, the next item on the list seems to be the mistake I see the most–putting slow code in loops, loading up things that don’t need to be loaded, making simple requests expensive.

In terms of processing time, it’s expensive to open a database connection. It’s expensive to connect to another computer. It’s expensive to load up a big framework to respond to a single request. It’s relatively cheap to retrieve a pre-constructed page out of a cache.

The single biggest mistake I see that kills performance in code is putting database calls inside a loop. One code project we picked up had display code that showed the results of a search. First, it did a search to identify all the matching rows in the database. Then it looped through that result set, grabbing the rest of the data for each individual row, one query at a time. Then it cut down this set to the page size, discarding all that data it had loaded up. If the search yielded over a thousand results, it took over a minute to run! All of this data could be loaded with a single smarter database query–and doing so made the same search practically instantaneous.

This type of performance penalty is the main reason I don’t care for frameworks all that much–they often trade performance for programmer convenience. This is fine when your site is small, but leads to a lot more optimizing work down the road if your site takes off. And while good frameworks can turn result sets into objects efficiently, it usually takes learning how to make the framework do this in the first place–which means that programmers are better off learning how to do all of the work themselves before using a framework so they understand how to avoid these problems.

So here are some principles I use to make PHP applications speedy from day one.

Get as much data from each database query as you possibly can–but not much more
Unless a database table regularly contains a large blob we rarely need, go ahead and load up the entire row when creating a corresponding object. For example, in a project management tool, if asked to retrieve a task object, in my code you would provide a task id and get back a task object with all of its properties pre-loaded from the database. While you can call getter methods to get individual properties, these do not result in yet another call to the database.

When retrieving arrays of task items, I usually provide a static search method that does a single query getting all of the data for all matching rows, constructing each task object, and passing it the already retrieved data so there are no further database calls–request the first 30 matching tasks, and the system still only does a single query on the database.
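
The static search pattern can be sketched as follows, assuming PHP 7 or later. To keep the example runnable without a database, the FakeDb class below is a hypothetical stand-in that holds rows in memory and counts queries; the Task class and its fields are also illustrative.

```php
<?php
// Hypothetical stand-in for a database. It counts queries so the cost of
// a search is visible.
class FakeDb
{
    public $queryCount = 0;
    private $rows;

    public function __construct(array $rows)
    {
        $this->rows = $rows;
    }

    // Plays the role of: SELECT * FROM task WHERE project_id = ? LIMIT ?
    public function selectTasks(int $projectId, int $limit): array
    {
        $this->queryCount++;
        $matches = array_filter($this->rows, function ($row) use ($projectId) {
            return $row['project_id'] === $projectId;
        });
        return array_slice(array_values($matches), 0, $limit);
    }
}

class Task
{
    private $row;

    // The constructor receives data that is already loaded; it never queries.
    public function __construct(array $row)
    {
        $this->row = $row;
    }

    // Getters read from the pre-loaded row, so they cost no database call.
    public function getTitle(): string
    {
        return $this->row['title'];
    }

    // Static search: one query, then one object per returned row.
    public static function search(FakeDb $db, int $projectId, int $limit = 30): array
    {
        $tasks = [];
        foreach ($db->selectTasks($projectId, $limit) as $row) {
            $tasks[] = new Task($row);
        }
        return $tasks;
    }
}
```

However many tasks come back, the query counter only goes up by one per search, which is the whole point.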

Doing a database query is expensive, but making a sophisticated query doesn’t add much to that as long as the database is properly indexed. When you know you have to do one, wring as much data as you can from each query. Use JOINs and database functions to do as much work as you possibly can in a single query.

I’m not that big a fan of stored procedures, mainly because I haven’t learned how to manage them effectively across deployment instances. Make a change to a code base, all you have to do to get it elsewhere is commit it to the repository and update your working copies. Make a change to a stored Postgres function, and you need to manually replace the function using psql or some other tool. But a stored function can be a way to offload more processing to the database, possibly gaining some performance in the process.

In general, I think of the database as being in a separate silo from the business logic. The requests between these silos are what’s expensive–the processing in one side or the other is less so. Minimize the number of times you switch, and your application will be faster… As a side benefit, when your traffic outgrows what a single server can handle, and your database calls are actually on a different server, you won’t need to rewrite your application.

Avoid repeating yourself
Cut and paste when programming is a bad thing. Stepping through my own code with a debugger often reveals areas where I do the same thing twice. I loop through an array in one method to calculate some value. Then I loop through the same array somewhere else to perform some other operation. While loops can be fast, if you’re manipulating large objects or arrays, you still want to minimize this wherever you can. Sorting is expensive–wherever you can, let the database pre-sort your results for you. Look for opportunities to leverage work you’re doing in one part of your application to do double-duty and handle the task you’re doing elsewhere.

Writing code is a lot like writing anything else–it takes time to distill down to the essence. Early drafts can be much wordier than later drafts. If you have the time to go back and consolidate the areas of work, you’ll get a small performance benefit out of this.

Out of this list, this item is the least important. Try to consolidate as much as you can the first time through your code, but caching will far more than make up the difference. These are the slight improvements to save for future revisions–but if you see an obvious opportunity to combine and simplify code, take it.

Use Lazy-Loading wherever it makes sense
If your application needs to hit the database on every single request, go ahead and open a database connection early. If on some requests your application just returns static data, save a tiny bit of processing and skip the database connection. On a few projects, I’ve written code that connects to multiple databases, so I’ve written a simple stub class that maintains a singleton database connection object. In every method that connects to a database, it calls the static method that returns the database connection object, creating and establishing it if it doesn’t already exist.
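
A stub class of this kind might look like the sketch below, assuming PHP 7 or later. The Connection class is a hypothetical placeholder; a real one would wrap PDO or a similar driver. The Db class name and the 'main' default are also assumptions.

```php
<?php
// Hypothetical connection class; it only records that it was created.
class Connection
{
    public static $opened = 0; // how many connections were established
    public $name;

    public function __construct(string $name)
    {
        $this->name = $name;
        self::$opened++;
    }
}

class Db
{
    private static $connections = [];

    // Returns the shared connection for $name, creating it on first use only.
    public static function get(string $name = 'main'): Connection
    {
        if (!isset(self::$connections[$name])) {
            self::$connections[$name] = new Connection($name);
        }
        return self::$connections[$name];
    }
}
```

A request that never calls Db::get() never pays for a connection, and a request that calls it twenty times pays only once per database.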

We program extensively with Smarty, and in some projects use Smarty’s caching system. When used with a lazy-loading design, it’s extremely effective at speeding up page views. In our “standard” architecture, we have a controller stub that the browser requests. This stub examines the request, identifies the view and the data objects to load, and sometimes creates controller objects to handle specific requests. However, if you’re using a caching system, you need to check for a cached version before doing any of this processing. Either check the cache at the top of your controller, or move your controller itself into a file that’s loaded by a Smarty template. By having the template load the controller and decide what to do next, that processing never happens if Smarty retrieves the cached template instead.

Now that we program a lot with Ajax, we no longer automatically create a Smarty object for every request–first we check whether we’re returning HTML, XML, JSON, or something else, and only create the Smarty object for particular types of views.

These are examples of how we use lazy loading to avoid loading large chunks of code or establishing database connections we never use.

Plan early on for caching
When you first launch your application, you probably don’t need caching because you’re not getting that much traffic. Some applications only run in private networks and never need to do any caching. But if you’re building a Facebook application or expecting huge amounts of traffic someday, create strategies for caching early on.

As I mentioned earlier, Smarty does this extremely well. You need to provide a way to uniquely identify an item in the cache, and Smarty will handle storing and retrieving it for you. Just make sure you check for the cached version before doing a lot of extra processing.

Without Smarty, it’s relatively easy to use output buffering to capture the output of your code and store it somewhere for later retrieval.
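As a rough sketch of the output buffering approach, assuming a writable cache directory and a simple time-based expiry (the function name cached_render and its parameters are made up for this example):

```php
<?php
// Return cached HTML for $key if it is fresh; otherwise run $render,
// capture everything it echoes, store it, and return it.
function cached_render(string $cacheDir, string $key, int $ttl, callable $render): string
{
    $file = $cacheDir . '/' . sha1($key) . '.html';

    // Check the cache first, before doing any expensive work.
    if (is_file($file) && (time() - filemtime($file)) < $ttl) {
        return file_get_contents($file);
    }

    ob_start();                      // start capturing output
    try {
        $render();                   // the expensive page-building code
        $html = ob_get_contents();   // grab what it printed
    } finally {
        ob_end_clean();              // always close the buffer, even on errors
    }

    file_put_contents($file, $html, LOCK_EX);
    return $html;
}
```

A page would then do something like echo cached_render($dir, $_SERVER['REQUEST_URI'], 300, $buildPage); so that the database and template work only happens when the cached copy is missing or stale.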

Many projects designed for traffic have simple switches you can just turn on to take advantage of caching, including Drupal and Joomla. After caching as much HTML as possible, the problem turns into more of a system administration project–installing an opcode cache like eAccelerator can help your server handle 30-40% more traffic, in our experience. These systems essentially compile your PHP once and cache the compiled result, so the server skips that work on later requests.

The next level of caching, for truly large sites, is a system like memcached. Memcached distributes a cache across multiple servers, so at this scale the problem starts involving developers again. PHP has a memcache extension you use to store and retrieve your pages in memcached. When your site outgrows what can be run on two servers, it’s time to have your system administrators set up a memcached cluster and rewrite your application to use it.

Avoid over-engineering your application
I inherited another project gone awry that had started with some really huge, complicated framework that seemed half-done. Most projects we’re called in to complete involve spaghetti code, mixed logic and presentation, and no clear architecture. This one, in contrast, was over-engineered for the problem. To figure out how the code worked, I ran it through a debugger. To get to my main class for a particular object, it ran through a series of no fewer than 8 inherited classes. And worse, some utility methods were copied between child classes, instead of being put once higher in the class hierarchy. I saw clear reasons for having 3 layers of inheritance in this application. Not 8.

Since then I’ve seen a few cases where developers seem to create more inherited classes just because it feels like the correct thing to do, not because there was any practical value in it. I rarely see the need for more than 3 levels of object inheritance, and never more than 4 (at least in a web application). When your application needs to open 20 files just to respond to a simple AJAX data request, that’s over-engineered. When you create an elaborate class structure just to avoid a simple function, that’s over-engineered.

There’s a scale here, from non-engineered spaghetti code to rigid, sophisticated frameworks. I suspect that most people without formal training start with spaghetti code and gradually learn how to create more structured code–while computer science majors start out with over-engineered structures and eventually loosen up in the real world after running their code through some profilers and realizing they don’t need all that complexity for a simple problem. Everyone over time, at least anyone with a knack for this stuff, ends up somewhere in the middle, with enough architecture to do the job–and little more. There’s definitely some variation here as a matter of taste, but there are measurable problems with either extreme.

I further suspect that Rails might be so popular now because a lot of web developers out there with no formal training are suddenly seeing the benefits of structured code and smart frameworks.

Keep in mind how expensive each operation is
Some actions take a while to complete. In our experience, the most expensive actions involve connecting to another server, especially ones not in the same data center. Keep these in mind when coding, and don’t do them if it isn’t necessary. For very expensive operations, especially when you need to do a bunch at once, consider forking a process using a call to the shell, or move to a maintenance routine called from a cron job.

Expensive:

  • curl to connect to another server
  • Other functions used to connect to remote servers: fopen, file, etc.
  • domxml, SimpleXML on very large XML documents
  • Sending mail to multiple recipients

Moderately expensive:

  • Sorting on large arrays
  • Database connection to remote server
  • domxml, SimpleXML on medium-sized documents
  • Recursive functions

Somewhat expensive:

  • Individual database queries
  • domxml, SimpleXML
  • Creating complex objects
  • Loading large files

Inexpensive:

  • XML event-based parsers
  • Retrieving cached files
  • Loops on small arrays
  • Lookups in hashes stored in memory, retrieving constants
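
For the expensive end of that list, here is a hedged sketch of handing a slow job to the shell so the page request doesn’t wait for it. It assumes a Unix-like server with the php command-line binary on the path and exec() enabled; the function names and the script path in the usage note are hypothetical.

```php
<?php
// Build a shell command that runs a PHP script detached from the web request.
// Assumes a Unix-like server with the `php` CLI available on the PATH.
function background_command(string $script, array $args = []): string
{
    $parts = ['php', escapeshellarg($script)];
    foreach ($args as $arg) {
        $parts[] = escapeshellarg((string) $arg);   // never pass raw input to the shell
    }
    // Discard output and append "&" so exec() returns without waiting.
    return implode(' ', $parts) . ' > /dev/null 2>&1 &';
}

// Hand the slow job to the shell and return to the visitor right away.
function run_in_background(string $script, array $args = []): void
{
    exec(background_command($script, $args));
}
```

Something like run_in_background('/var/app/send_newsletter.php', array($listId)) would start the mailing and let the request finish. For work that can wait, a cron job is usually the simpler choice.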

Do you have any other tips for writing fast PHP code? Please add a comment below…

Secure code: Understanding PHP vulnerabilities

Saturday, January 12th, 2008

There are many articles that cover PHP vulnerabilities, but I’ve run across a lot of programmers and code that seem oblivious to them. When interviewing programmers, I look for an understanding of these types of vulnerabilities, and of how to keep their programs from being vulnerable to them.

Aside from register globals issues, most of these attacks are not specific to PHP.

Register Globals issues
From early on, the developers of PHP had this great idea: accept any parameters passed from the browser, and automatically turn them into variables available in the code. Well, it turned out to not be such a great idea–it meant that improperly initialized variables could be seeded by attackers to potentially do all sorts of damage. Worse, sometime after PHP 5 came out, someone figured out that you could pass a particular variable that would load and execute any PHP file before running the actual code–and this file could be on a completely different server, in a regular PHP installation.

Most other web languages never offered this convenience–you have to retrieve parameters from a browser through a specific module or array. PHP now provides arrays like $_GET, $_POST, $_REQUEST that are simple to use, but make it so you need to specifically request the variable you want from your code.

Any code that depends on register_globals being set is completely broken, as far as I’m concerned. If it’s on a server with an older version of PHP, it’s just waiting to get cracked. Any developer that relies on registered globals is programming for 10 years ago, and needs some serious education.

The main point here is that software should never trust data coming from the browser. I don’t care how much validation you do with Javascript; you’d better double-check the request on the server, and make sure either you set variables before you use them, or work in functions/classes that are not in the global scope.
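
As a small illustration of both habits, here is a sketch that works inside a function, sets its variable before using it, and only accepts a value from the request after checking it. The function name requested_page and the page parameter are invented for the example.

```php
<?php
// Read a page number from request data without trusting it.
// Works inside a function, so nothing depends on global variables.
function requested_page(array $query): int
{
    $page = 1; // always set before use, so an attacker cannot seed it

    // Accept only a short string of digits; anything else keeps the default.
    if (isset($query['page'])
        && is_string($query['page'])
        && preg_match('/\A[0-9]{1,6}\z/', $query['page'])) {
        $page = max(1, (int) $query['page']);
    }

    return $page;
}
```

In a page script you would call it as requested_page($_GET), so the only way a browser value reaches $page is through that check.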

SQL Injection vulnerabilities
This is the next most serious issue, and it affects pretty much all web languages, not just PHP. The most common way to interact with a database is to use a language called “structured query language” (SQL) to select rows of data from the database, update data, insert new data, or delete things. Once you learn the basic syntax and structure, it’s very easy to use. The problem is, you nearly always depend upon the user to identify what data to retrieve, or to provide the data to add or change.

Once again, we can’t ever trust data from the user. Most databases accept more than one query at a time, and most information used to select rows in a database is wrapped in single quotes:

SELECT first_name, last_name, salary FROM employees WHERE first_name LIKE 'John';

Beginner programmers drop the variable containing the search from the browser into the query, wrapping it in single quotes: LIKE '$firstname';

Attackers simply put a single quote in the field, and then add another SQL command to do something malicious. Like delete the entire database.

Now, when you know there might be a quote in the variable, you can escape it by adding a backslash in front of it. PHP actually does this for you automatically if you have an evil setting called magic_quotes_gpc turned on. That’s why you often see a lot of stray backslashes in forums, blog comments, etc., by the way. But there are ways of getting around that, as well.

At a minimum, all variables used in a query should be escaped using a function known to handle all possibilities, usually those provided for the specific database engine. What I look for in code is someone using a database abstraction layer or interface that allows for parameterized queries: instead of putting the variables directly in the query, you create a query with placeholders (usually a question mark, ?) where variables are to be substituted, and then pass an array of variables. The abstraction layer handles all of the escaping for you, and you end up with much cleaner code.

We use PEAR::DB as a database abstraction layer in most of our projects. Others include ADODB, or PEAR::MDB. PHP5 provides a mysqli interface capable of this, as well. If I see a mysql_query command in general application code, it gets marked way down in my book.
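
To show the shape of a parameterized query, here is a sketch using PDO, which ships with PHP 5.1 and later. Note that this swaps in PDO rather than PEAR::DB, and the function name find_employees is made up; the employees table is the one from the query above.

```php
<?php
// Look up employees by first name using a placeholder instead of
// pasting the user's text into the SQL string.
function find_employees(PDO $db, string $firstName): array
{
    $stmt = $db->prepare(
        'SELECT first_name, last_name, salary FROM employees WHERE first_name LIKE ?'
    );

    // The value travels separately from the query text, so a quote in
    // $firstName is just data and cannot end the string or start a new command.
    $stmt->execute([$firstName]);

    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}
```

Called with a name like John it returns the matching rows; called with something like x'; DELETE FROM employees; -- it simply finds no employee by that odd name.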

Mail Header Injection
Many programmers don’t realize it’s not safe to use the PHP mail() function without special protection. I didn’t believe this was a vulnerable function until one of our clients got attacked with it. Basically, the mail() function on a Linux system is a wrapper to the system sendmail command. Sendmail takes a plain text email, looks for To, CC, and BCC addresses, and sends the message on its way. The problem is, attackers can inject fake headers into the message that basically hijack your server to send spam. Any field that ends up in the header of a message–to, from, subject, or any other arbitrary header you collect–could be used for this purpose.

I haven’t tested the to or subject fields recently–there may be some built-in protection for these fields now. But to set the from address of a message, you pass it in an array or a string to the additional headers parameter of mail(). This is ripe for exploit. All the attacker has to do is insert a newline, and then they can supply their own bcc field with hundreds of email addresses to spam. PHP and the sendmail binary will happily spew your attacker’s message to hundreds of users at a time. The next thing you know, your server will get on a blacklist for spamming, and nobody on that server will be able to send mail to domains like AOL or Comcast and other places that actively reject mail from known spammers.

Some kind soul posted a function to filter headers and ignore anything after a newline character to the comments section of the PHP documentation for the mail() function (the PHP documentation, and the comments, are a fantastic resource, and one of our favorite features of PHP). We have a simple safe_mail function that runs all the headers through this function, which also makes for a convenient way to intercept mail on a test environment.
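
A wrapper along those lines might look roughly like the sketch below. This is not the function from the PHP documentation comments, nor our production safe_mail; the helper names and the exact filtering rules are assumptions. It only shows the idea of dropping everything after the first newline in anything that lands in a header.

```php
<?php
// Keep only the text before the first CR or LF in a header value.
function clean_header_value(string $value): string
{
    $parts = preg_split('/[\r\n]/', $value, 2);
    return trim($parts[0]);
}

// Turn a name => value array into a header string, cleaning each piece.
function build_safe_headers(array $headers): string
{
    $lines = [];
    foreach ($headers as $name => $value) {
        // Header names: letters, digits and hyphens only.
        $name = preg_replace('/[^A-Za-z0-9-]/', '', (string) $name);
        if ($name === '') {
            continue;
        }
        $lines[] = $name . ': ' . clean_header_value((string) $value);
    }
    return implode("\r\n", $lines);
}

// Hypothetical wrapper: everything that ends up in a header is filtered first.
function safe_mail(string $to, string $subject, string $body, array $headers = []): bool
{
    return mail(
        clean_header_value($to),
        clean_header_value($subject),
        $body,
        build_safe_headers($headers)
    );
}
```

Because all mail goes through one function, a test environment can replace the mail() call with logging or a fixed recipient in a single place.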

This one isn’t talked about that much, but a programmer who protects the mail function properly is showing the marks of an experienced PHP developer.

Cross-site scripting (XSS)
Cross-site scripting is the current favorite exploit of attackers. Unlike the other attacks, it doesn’t attack your site directly but exploits it to attack your visitors. Of course, if your visitors have access to an administrative interface on the site, attackers could then use this to attack your site.

The real problem is that cross-site scripting is a great way to spread spyware, and so many sites are vulnerable to it. MySpace was long a victim of XSS. eBay, too. Basically any site that allows users to add content that is shown to other users is vulnerable to XSS, unless the application developer has taken specific measures to prevent this. In this age of social networking, that is a huge number of sites.

If an attacker can find a way to get a script into a page shown to others, there are lots of things they can do. Sometimes it’s as simple as adding <script> and a chunk of JavaScript, or a location to load a script from. Other times they will attach a mouseover event or hide the script in some other devious place. Sometimes they insert an object or iframe containing their malicious content.

If they can load an arbitrary script of their choosing, they can view anything on that page and watch anything the visitor types into that window. That’s expected, defined behavior, and that’s not going to change. So at a minimum, they can get passwords to your site and from there, they can do anything on your site that an attacked user can do.

But they don’t stop there. Both Internet Explorer and Firefox have contained vulnerabilities that allow an attacker to escape the sandbox of that browser window to be able to monitor other windows, or even at worst install malicious software on the user’s computer. That is how spyware is spread. And once they have their own malicious software installed on your computer, they own it–they can monitor every mouse movement and keystroke, they can use it to send spam or attack other computers or do whatever they want.

Cross-site scripting is diabolical. It doesn’t usually harm your site, because attackers don’t want you to know you’re carrying their malware. Application developers ignore these issues to the peril of the entire Internet…
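
The simplest specific measure, when users have no need to post HTML at all, is to escape everything they supply at the moment you print it. A minimal sketch, with made-up function names and assuming UTF-8 pages:

```php
<?php
// Escape user-supplied text so the browser shows it as text, not markup.
function html_escape(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// Hypothetical comment renderer: every user-controlled value is escaped on output.
function render_comment(string $author, string $comment): string
{
    return '<div class="comment"><strong>' . html_escape($author) . '</strong>: '
        . html_escape($comment) . '</div>';
}
```

With this in place a comment containing a script tag or an onmouseover attribute is displayed literally instead of being run. If you do need to allow some HTML from users, escaping is not enough and you are into the harder problem of filtering tags and attributes.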

Session Hijacking
Web applications differ from most other applications in that they are considered “stateless”. That is, the server does not know the state of anything the user is doing, and starts in exactly the same condition for every request. In most applications, however, you are working through some sort of process and what you do next depends on the action you take when you’re in a particular state. What actions you have available to you depends upon the state of the object you’re working with.

For example, if you’re working with a user object, it might have several states: “unconfirmed”, “logged in”, “not logged in”, “suspended”. For users that are suspended, the application would prevent access to private data. For users who are unconfirmed, the application might offer to resend a confirmation link. For users who are logged in, the application would provide access to appropriate parts.

In a web application, it’s up to the programmer to define these states and handle them appropriately–PHP has no internal concept of state at all. Every request coming into your application must do all the work of loading the appropriate objects, defining what state they’re in, and doing whatever action is necessary.

PHP and other languages do however provide a mechanism for keeping track of users, with something called a session. PHP basically provides an automatic mechanism for storing variables associated with a user session on the server, instead of the browser. Since as we know well by now, you can’t trust anything coming from a browser, a session is a much safer place to store critical data to help you determine the state of your application and not have to reconstruct it completely on every page. It’s especially used for logins.

The problem is, sessions can be hijacked. PHP and other languages use a cookie to store a simple unique identifier for the session in the browser, which the browser helpfully returns on every request. If the browser has been compromised (by a cross-site scripting attack, or spyware, etc) an attacker can read these cookies and pass somebody else’s session identifier into your application, and if you don’t protect against this, hijack the original user’s session.

That takes some effort, however. Much more of the problem is when a user turns cookies off. Back in the late 1990s/early 2000s, many users got completely paranoid that cookies identified them wherever they went on the Internet, and many applications help users manage their cookies. So this general paranoia about cookies actually makes the situation worse, because if the user turns off cookies, your application either needs to force them to reauthenticate, or allow the browser to pass their session identifier through another means.

PHP has yet another configuration parameter to automatically allow session ids to be passed via a GET request instead of a cookie. The problem is, when this is done, the session identifier becomes part of the URL in the browser address bar. Users then bookmark their session id, post it to their blog or a forum, do whatever with it they want. And if your application is not written to handle this, other completely innocent users may find themselves logged into your application under a hijacked session id!

Applications using sessions must use some other source to verify that the session corresponds to the right user. In some cases, it may be enough to just require cookies and not allow session identifiers to come through any other vector. In others, programmers may need to consider using http authentication or other methods to verify that they have the right user.
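
Here is a rough sketch of two of those ideas: refusing session ids that arrive in the URL, and tying the session to something else about the client. The function names are hypothetical, and the user-agent check is only a weak extra signal that an attacker can copy, so treat it as one layer rather than proof of identity.

```php
<?php
// A weak "second source": a hash of the browser's user-agent string.
function client_fingerprint(array $server): string
{
    $agent = isset($server['HTTP_USER_AGENT']) ? (string) $server['HTTP_USER_AGENT'] : '';
    return hash('sha256', $agent);
}

// Only accept session ids from cookies, never from the URL.
function start_cookie_only_session(): void
{
    ini_set('session.use_only_cookies', '1');
    ini_set('session.use_trans_sid', '0');
    session_start();
}

// On successful login: issue a fresh session id and remember who it was issued to.
function mark_logged_in(int $userId, array $server): void
{
    session_regenerate_id(true);
    $_SESSION['user_id'] = $userId;
    $_SESSION['fingerprint'] = client_fingerprint($server);
}

// On every request: does this session still look like the same client?
function session_belongs_to_client(array $session, array $server): bool
{
    return isset($session['user_id'], $session['fingerprint'])
        && hash_equals($session['fingerprint'], client_fingerprint($server));
}
```

A request that fails session_belongs_to_client($_SESSION, $_SERVER) would be sent back to the login form rather than treated as the original user.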

Session hijacking is one of the toughest vulnerabilities to manage, if you need to protect any sensitive data. Even if you don’t, the application should deal appropriately with accidental session hijacking, because it’s very common and easy for users to do.

Other vulnerabilities
The list doesn’t stop there, but those are the serious mistakes I see, sometimes on a weekly basis. It’s hard to write secure code, but starting with security as a mindset goes a long way towards preventing problems down the road.

To summarize, here are some general tips for keeping applications safe from these types of attacks. If I’m interviewing you for a programmer position, I will be asking you about these:

  1. Never trust input from the browser.
  2. Turn off register_globals, but always assume it’s on and protect your variables anyway.
  3. Use a database abstraction layer, and parameterized queries.
  4. Be extra careful with database statements that cannot be parameterized.
  5. Strip all script, object, and iframe tags out of user inputs. Strip all JavaScript and event attributes from any HTML you do allow.
  6. Never trust input from the browser.
  7. Use wrapper functions to add extra protection to common functions like mail().
  8. Be extremely careful with sessions that are used to authenticate users.
  9. Provide an appropriate level of protection for private data.

Any other vulnerability types you care about, when writing or reviewing web application code?

Quality Code: How do you judge?

Friday, January 11th, 2008

We’re hiring programmers, over at Freelock. I’ve been going through lots of code samples to try to identify how experienced and competent a particular developer is. I also do this on a regular basis to evaluate how solid a particular open source project is.

I’ve seen a lot of code in various languages. As a technical writer, I used to write documentation for programmers teaching them how to use a particular interface or system. I’ve been involved with traditional software development projects at large software companies and startups. And I’ve done my share of actual programming of web applications.

I’m finding there are several indicators I look for when evaluating code, specifically for PHP, our language of choice. I’ll go in more depth on each of these qualities in future posts, but for now just thought I’d capture them while they’re fresh in my mind. So when I review code of a web application, here are some qualities I’m looking for:

  • Secure. Does the application trust users to provide good data? Does it protect its internals to prevent all the various types of exploits out there? Does it protect data from malicious users?
  • Fast. This could mean many things, but I’m looking for efficiency across layers. Is there a database call inside a loop that gets called a couple hundred times? That’s a huge speed killer. I look for code that has an appropriate level of abstraction to the size of the problem–and makes sensible choices about how much data to load for each request.
  • Powerful. This one is stolen from Paul Graham. Does the code use object-orientation and inheritance in a powerful way? I like seeing utility methods on base classes, which can then be leveraged to make very short, easy-to-understand final classes. Are the methods attached to the appropriate level of the class hierarchy? How short can you make the main logic of the application?
  • Clear. Going hand-in-hand with power, clarity is about making it apparent what each chunk of code is for, and how to go about changing it to make it work the way you want. Clear code is maintainable, well-documented, easy to customize.
  • Customizable. Was the program designed in a way that’s easy to override, easy to customize, easy to run in other environments? Can it be managed effectively, and work broken up into different units?
  • Reliable. Does each function or method cover all possible scenarios? Is there proper error-handling in the code? When an end user hits upon some combination of things that the programmer never anticipated, does the program die ungracefully, or provide useful feedback?

Very few programmers hit all of these. My biggest weak area is the reliability one–after reviewing other people’s code, I find a lot less exception handling in my code. We’ve all got something to learn. But reviewing other people’s code can help you spot weaknesses in your own, and develop a much stronger sense of how to do it right.

[Edit: Adding links to more detailed posts as I publish them]

REST, PHP, PUT, and WebDAV on Apache

Thursday, November 8th, 2007

We’re doing a fair amount of AJAX development these days, and ran into a problem with the REST convention. Thought I’d put my notes here in case somebody else runs into this.

REST, short for “Representational State Transfer”, is a new-ish approach to managing state in a web application. With PHP, you typically manage state using its session features, which pass a cookie back and forth from the browser. Then the server needs to store values for each browser session, mirroring in some fashion the data in the browser with a cache on the server. This breaks down in a couple of cases:

  1. You want to have multiple browser windows open into the same application
  2. Your application runs on a cluster of servers, and sessions cannot be safely/quickly retrieved (extremely high traffic sites)

REST is simply a set of guidelines for keeping all state in the browser, and passing all the necessary parameters the server needs to handle a request with every single request. And the REST architects are making use of the full HTTP specification, rather than just the parts that are in widespread use.

The vast majority of web sites make use of only two HTTP methods: GET and POST. But there are quite a few others defined in the original specification, particularly PUT and DELETE, that REST practitioners rely on. They add a couple of useful verbs to the language, where otherwise you would have to add more parameters to a simple POST.

The PHP language handles GET and POST very nicely, giving the programmer an array of variables passed by the browser. It can handle PUT and DELETE as well, but you have to access the body of these requests using a raw input stream. That much is fine. The problem I ran into when deploying an application that relies on PUT was that my PHP application never received the request. It did on my development machine, but when I moved it to an internal production machine, all the PUT requests were getting a “403 Forbidden” error back from the server. Why would this work on one machine but not another?
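
For anyone who hasn’t read a PUT body in PHP before, here is a small sketch. The function name parse_rest_request is made up, and it assumes the client sends a form-encoded body (name=value pairs); a JSON or XML body would need a different parser.

```php
<?php
// Normalize the HTTP method and, for PUT and DELETE, parse the raw body,
// since PHP does not fill $_POST for those methods.
function parse_rest_request(string $method, string $rawBody): array
{
    $method = strtoupper($method);
    $params = [];

    if ($method === 'PUT' || $method === 'DELETE') {
        // Assumes a form-encoded body such as "name=Ada&id=7".
        parse_str($rawBody, $params);
    }

    return ['method' => $method, 'params' => $params];
}
```

In the page script the raw stream is read with file_get_contents('php://input'), so the call looks like parse_rest_request($_SERVER['REQUEST_METHOD'], file_get_contents('php://input')).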

It turns out to be a conflict with WebDAV. WebDAV is an extension of the HTTP protocol that also uses PUT and DELETE, and adds a few methods of its own like PROPFIND, MKCOL, and more. We use WebDAV instead of FTP to allow our clients to copy files up to our servers–we can lock them into particular directories, and generally secure the server much better than we can with FTP.

However, apparently WebDAV intercepts ALL PUT requests to the server, even if it’s for a different virtual host. A bug in the mod_dav module, or mod_dav_fs? Don’t know. What’s strange is that Subversion, which runs as a Dav handler, does not conflict with our use of PUT–only regular Dav.

So for those running into this, what we found is that as long as you don’t have “Dav On” anywhere on that Apache server, PUT requests make it to the PHP handler just fine. Subversion can be enabled on particular servers or subdirectories, and then it becomes the PUT handler for those areas. The Dav modules can be loaded with no conflict. But as soon as you turn Dav On on any virtual host on the entire server, your PHP script will no longer get the request.