Archive for January, 2008

New E-Commerce software: Magento

Saturday, January 19th, 2008

Just ran across a new Open Source shopping cart system, Magento. We’ve been using Zen Cart for a while now, and it’s great to see an alternative.

We actually really like Zen Cart. It’s fast, clear, and customizable. From a quick look at the Magento demo and feature list, it looks like they’re starting with Ajax in mind, but it doesn’t look like much is different in the administration area. Will have to keep an eye on this one.

I give up. Trackbacks and Pingbacks now closed.

Saturday, January 19th, 2008

It’s too bad the spammers are out to piss all over the public commons. Since I’ve started writing more regularly here, I’ve been getting inundated with pingbacks and trackbacks, and have to keep marking them as spam, a couple dozen a day. Don’t have time to do this, so I’ve just turned this off… I appreciate any links you’d like to make to here, but please fill out a comment if you’d like to continue the discussion so I’m aware of your post.

I used to use Akismet to filter out comment spam, but spammers now seem to make each post unique enough that that became ineffective–I would still get dozens of comments to moderate every day. So then I switched to the Recaptcha.net system you can see on any of my posts–which has been working great for comments, but it doesn’t attempt to deal with all the automated trackbacks/pingbacks.

So here we are, back to comments only. Please leave one, or drop me an email here if you’d like to discuss anything I’m writing about, and are not just another spammer…

Reliable code: building in robustness

Saturday, January 19th, 2008

Ok. Last post on the quality code series. One of the downsides of getting older is realizing you do have shortcomings. You know how when you’re young, going into a job interview, the toughest question is the one about your weaknesses? We’re all quite blind to our weaknesses, until experience comes up and forces you to realize you’re not perfect. Sometimes this happens early, sometimes late, but it happens to everyone sometime.

My coding weakness, it turns out, is reliability. I’m terrible at handling errors, building test frameworks, doing unit testing. I find all of that stuff quite boring. But it’s essential to building a reliable application.

Reliability and security go hand in hand. In security, you’re looking at the attacks, and making sure your code is secure against them. In reliability, you’re identifying what each chunk of code expects to get, and then defining how to handle exceptions and unexpected input. Done correctly, reliable code is secure. But it’s a total pain to do, and it takes a lot longer to get there.

One of the code samples I examined recently was set up in a completely class-driven way, though I would not call it object oriented because none of the classes extended other classes. It was a rather simple, flat collection of objects and helpers and interfaces. It was not powerful. My guess is, it is not fast. It did not look very customizable. But it was certainly clear, and every single method inspected every single parameter, making sure the input was valid. Calls to other objects had extensive error handling built-in — this application looked like it could not fail without notifying the programmer exactly where the failure was, with helpful feedback.
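
To make that style concrete, here’s a minimal sketch of a function that inspects every parameter before doing any work. It’s in Python rather than the language of the sample I looked at, and the names (add_to_cart, ValidationError) are made up for illustration, not taken from that code.

```python
class ValidationError(ValueError):
    """Raised when a caller passes input this module cannot accept."""


def add_to_cart(cart, sku, quantity):
    """Add `quantity` units of `sku` to `cart`, a dict mapping sku -> count."""
    # Check each parameter up front, and say exactly which one was wrong.
    if not isinstance(cart, dict):
        raise ValidationError(
            f"add_to_cart: cart must be a dict, got {type(cart).__name__}"
        )
    if not isinstance(sku, str) or not sku.strip():
        raise ValidationError(
            f"add_to_cart: sku must be a non-empty string, got {sku!r}"
        )
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(quantity, bool) or not isinstance(quantity, int) or quantity < 1:
        raise ValidationError(
            f"add_to_cart: quantity must be a positive integer, got {quantity!r}"
        )
    cart[sku] = cart.get(sku, 0) + quantity
    return cart
```

The checks are longer than the one line that does the real work, which is pretty much the trade-off described above: tedious to write, but a failure tells you where it happened and why.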

This is tedious work. I save it for the polishing phases of a project, focusing on getting things to work in the first place. But there’s a strong argument to be made for building reliability into each module from the start. It’s a very different style of programming, and takes a lot longer to get there, but the end result will inevitably be more secure, less buggy, and more able to account for every possible scenario–even if it handles a scenario by saying “I can’t do that yet.”

I think there’s a personality difference between these development styles. The artist figures out some innovative way of solving the problem, gets a proof-of-concept working brilliantly quickly, and cranks through code producing a huge amount in a short amount of time. The craftsman takes a slower, methodical approach, crafting each module individually, building unit tests to make sure it works correctly as he goes, and building a system piece by polished piece.

Successful projects need both. The artist/hacker provides vision, drive, and momentum. The craftsman makes sure the system can handle the load, and can prove it’s doing what it’s designed to do.

The 80/20 rule comes into play here. 80% of the features can be hacked together very quickly, in the first 20% of the project. To make the project stand the tests of time, handle everything that might be thrown at it, and act as a foundation for a business or a mission-critical part, you need the craftsman to do the remaining 80% of the work to finish the job and get that final 20% of the functionality complete.

So here’s a checklist for evaluating reliability of a project:

  • Is the program broken up into discrete modules that can be completely tested one at a time?
  • Are there unit tests built for each module, testing the output for normal and exceptional conditions?
  • Is the input to each module validated and properly tested to handle all possible things that may be passed to it?
  • Does the module handle non-normal input, and raise the appropriate errors?
  • Are there regular tests of the software as a whole, and each module, to identify tests that fail, or regressions in the code?

The only way to ensure reliability is through rigorous testing. Some of the newer programming practices rely on test-driven development–first you define what a module does, then you develop a test for it, and then only after all that do you finally develop the module until it passes all the tests.
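
As a rough illustration of the checklist, here’s a small module with unit tests for both normal and exceptional input, using Python’s standard unittest library. The function parse_quantity is a hypothetical example, not something from a real project; in a test-driven workflow the test class would be written first and the function filled in until the tests pass.

```python
import unittest


def parse_quantity(text):
    """Convert form input such as " 3 " into a positive integer."""
    try:
        value = int(text.strip())
    except (AttributeError, ValueError):
        # AttributeError covers non-string input such as None.
        raise ValueError(f"not a whole number: {text!r}") from None
    if value < 1:
        raise ValueError(f"quantity must be at least 1, got {value}")
    return value


class ParseQuantityTest(unittest.TestCase):
    def test_normal_input(self):
        self.assertEqual(parse_quantity(" 3 "), 3)

    def test_rejects_non_numeric_input(self):
        with self.assertRaises(ValueError):
            parse_quantity("three")

    def test_rejects_zero_and_negatives(self):
        for bad in ("0", "-2"):
            with self.assertRaises(ValueError):
                parse_quantity(bad)

    def test_rejects_non_string_input(self):
        with self.assertRaises(ValueError):
            parse_quantity(None)
```

Saved in a file, these tests can be run with python -m unittest, and running them on every change is one simple way to catch regressions.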

In a small business environment, this all may be too much overhead. 80% of an application may be enough, and at 20% of the cost, much more in line with the budget. But when you need something to be completely reliable, take a look at the testing framework, how much it covers, and how much of the application passes the tests.

On Vendor Lock-in

Saturday, January 19th, 2008

I was listening to the latest episode of LugRadio the other day, and they had a discussion on vendor lock-in by open source distribution companies. I think they missed the point about vendor lock-in: that it locks users into a particular vendor, usually through some means that makes it hard to switch to a better solution later. So I wrote up a reply to send to them that I’m posting below, slightly edited. There is also an ongoing conversation about the topic at the LugRadio forum, and I see several posters are making the same points that I do here.

Open source business is the antithesis of vendor lock-in. Vendor Lock-in is when a vendor uses some sneaky, underhanded, unadvertised method to make it impossible to recover any of your original investment if you ever decide to go with a different system. Vendor lock-in is accomplished by using all the dirty tactics the proprietary software world has used for ages–closed systems that lock away your data, hidden undocumented features, patents, and sneaky licenses.

What you described on the show was not vendor lock-in. It’s called healthy competition, and it’s how open source software innovates. How is optimizing your client OS to work with your server OS vendor lock-in, if anybody else can see what you’re doing and do the same thing? Furthermore, how is it different than competition between KDE and Gnome, or vi and emacs, or any other of the many long-term competitions in the open source world?

Any distribution that is not looking for ways to improve their users’ experience is on the fast track to irrelevance. Take a look at some recent examples you should’ve used in your discussion:

* Xgl vs Aiglx: Novell went off and created Xgl, while Red Hat essentially recruited a bunch of other projects to do the same thing in a different way. Different distributions became real-life test beds for real innovation, and the better technology won.

* Xen. Novell and Red Hat have a great lead over Ubuntu on management tools for Xen. You could’ve accused either of those companies of trying to provide better experiences for their users, but that’s just good business, not vendor lock-in. Ubuntu may be behind, but they’re able to pick and choose their approach to managing Xen–nothing Red Hat or Novell has done is keeping their technology out of Ubuntu or any other distribution looking for enterprise customers. Neither Red Hat nor Novell has achieved any kind of lock-in with enterprise customers–what they’ve achieved is leadership.

* Upstart. Here’s an area Ubuntu pioneered, and others are adopting.

* LTSP, K12LTSP vs Edubuntu.

I could go on and on. So I will. Distributions are always trying to shine at some particular set of features. Users decide which ones are appropriate for their needs. This is a fantastic thing. If Ubuntu weren’t trying to make their servers work particularly well with their desktops, they would open an opportunity for another distribution that would. As long as a distribution can stay ahead of the competition technically, they deserve all the success they get–they’re pioneering the way, and the whole open source community benefits.

Okay. Now here is what would be vendor lock-in. If Canonical created some tricky way of making their servers talk to their clients, and then patented it so they could sue anybody else who tried to do the same thing, THAT would be vendor lock-in. If Red Hat embedded some private key on their commercial server that unlocked some turbo-samba supercharger, and encrypted their algorithm so nobody else could see it, and then put the key to unlock that speed in their desktop, THAT would be vendor lock-in.

But any open source company that tried such a tactic would be instantly cut off from the rest of the community–and they would probably have to violate a bunch of GPLd software to do so…

The competition between distributions makes all of them better. While we’re all racing each other to see who can innovate faster, we still get the benefits of each other’s code, and Microsoft and Apple are starting to disappear in the dust in our rear view mirrors.

One other point I’d like to make: the earlier an open source project tries some new tactic to improve computing, and commits code to a repository the whole world can see, the better. Prior art is the key to defeating all the frivolous patents companies are taking out. If somebody tries something really inventive to eke out a bit more performance, I want that in a public Subversion revision associated with a date and a free license–it’s the best insurance we’ve got against a broken patent system.

On Patents and Free Software

Saturday, January 19th, 2008

I’ve spoken with a lot of entrepreneurs around Seattle, who have a misconception that using open source might somehow force them to give away their intellectual property. Intellectual Property is a hot topic around here, and entrepreneurs are told regularly how they need to have some to get funded. Yet they often think they can add their patented idea to free software and lock up their core idea. It’s a bit funny how they want to have their cake and eat it too.

I’m talking specifically about patents here. My understanding of the intent of patents is to give the originator of an idea a legal monopoly that allows them to invest large amounts of capital to bring a new innovative product to market, for the benefit of the rest of us. In an industrial age, patents make a lot of sense–building an assembly line takes a lot of capital, and if you have a bunch of competitors, nobody would make the investment to build the infrastructure, if they couldn’t lock up the market and get paid back for their investment.

Now, I won’t go into the social ramifications of this arrangement, but I’d like to make two arguments about patents in the age of software here:

  1. The cost of building a distribution and manufacturing network for a software idea is pretty much nil.
  2. Very few software ideas are non-obvious and innovative, or worthy of patent protection.

On the first point, if you had to build or buy your whole set of tools to run your application–compilers, web servers, operating systems, text editors–then yes, maybe you need some sort of way of protecting your investment in all that infrastructure. But with free software, you get all of this for free and have it deployed today on a $600 computer. Your next dollar spent is in your time getting your application actually written. The individual craftsman working from home after hours can develop software in his spare time that can rival anything coming out of a venture-funded startup or a multi-million dollar corporation with the help of all of this free software. Why does the startup need protection from a solo garage programmer? The only reason I can think of is to keep free-market economics from harming the investors, and profits going into the pockets of the entrenched players.

Patents take time and money to obtain. In the world of free software, programmers are far better served creating their idea and bringing it to market first, rather than wasting time and money on patents. Developing a solid program quickly and accumulating a base of customers for your service is going to be the best way to stay ahead of your competition.

So the big question I have for startups is, why should you get all the benefits of Free software with no financial outlay and ability to bypass all sorts of startup cost, and then keep your invention protected by patent? If patents protect a large capital investment necessary to bring a product to market, but then suddenly there is no large capital investment necessary to do so, why should you still need a patent? You can bring a product to market today with a few hundred dollars and a lot of elbow grease.

Of course, you might still be trying to get fabulously wealthy by locking up your idea so nobody else can use it. Fine. Take out your patent. Buy licenses from all the proprietary vendors to use as a platform for your idea. Don’t use software covered by the GPL, because you would not be able to protect your patent when you distributed your product–the GPL is incompatible with your patents, and you would lose your license to use the GPL software. Pay a whole lot more for startup costs, all to protect your fine idea… and then you’ll run up hard against the second point I’m making here: lots of other people have the same idea as you.

Our patent courts are getting inundated by suits from patent “trolls” who have purchased patents from inventors or researchers, and then sit on those patents purely for the purpose of suing others for profit. The targets of their lawsuits almost always had no awareness of the patent they’re being sued for violating–they came up with the same idea independently. Why on earth would we reward the person who filled out a bunch of paperwork to take out a patent, and punish the person who actually brought it to market so the rest of us could benefit?

Patents are supposed to be innovative, and non-obvious. The fact that many different companies come up with the same ideas independently all the time should indicate that those ideas are obvious to people in the field. Because the cost of bringing these ideas to market has dropped to virtually nothing, software companies do not need patent protection–there is very little capital outlay necessary to protect. Our current patent system punishes innovation, instead of promoting it.

Fortunately for the open source community, there are several things that make us more resistant to the patent threat than proprietary companies. We should assert these points whenever someone in the community is threatened by a patent holder:

  • No big pot of money. Because we don’t need much capital to get started, there’s not really any prize for patent trolls to go after. The patent trolls attack software companies, and their goal is not to drive the company out of business, but to profit off somebody else’s work. If there’s not enough cash in one place to provide that profit, they won’t bother to try. They’ll keep suing Microsoft and Blackberry and all those other venture-backed startups that still think the patent system isn’t broken.
  • Prior art. Here the public process of open source projects is a huge advantage. One of the best ways of invalidating a patent is to prove that there was prior art. What matters for this is the date of publication–was there prior art published before the date of the patent application? If so, and if that can be proven, it can completely invalidate the patent. Well, the Internet is nothing if not a big publishing machine. A public repository with revisions available by date seems to me like a great way of proving when an idea first entered a codebase of a particular project. Back that up with a look in the Wayback machine or a Wikipedia article, blog entries, IRC logs, and we can prove prior art with any discussions that happened before the patent application.

    Those poor, poor proprietary companies… they can’t even talk about things they’re trying to patent until they have their application in, or they might invalidate their own patent. When it comes to validating patents, the first date of publication or patent application wins.

  • The GPL. GPL v2 prevents patent holders from providing patent licenses to some recipients of the software, but not others. GPL v3 makes the rules much more specific, basically making covered software incompatible with patents. If you patent something you’ve extended from GPL v3 software, you’re violating the license and lose the right to use that software. So much for that business model.

Patents were originally designed to make it financially possible to bring innovative products to market so the world could benefit from that innovation. The current patent system, at least for the software world, does the opposite, standing in the way and penalizing innovation instead of promoting it. The Free Software movement is trying to do the same thing patents were created to do: make it possible to bring innovations to the public and protect innovators, in the face of a broken patent system. And while free software has taken away the potential for huge profits, it has also taken away the huge costs.

I was talking with a programmer last week about open source. He kept asking, “but how can I make money programming free software?” He seemed to think he was entitled to develop his idea, control it from start to finish, get venture funding, bring it to market, and everybody would buy it and make him the next Bill Gates. He kept saying that free software was not friendly to the community, because it took away everybody’s ability to make money. He seemed to think that proprietary software companies developing in a closed ecosystem was the key to developing good software, that that was the only way to make money in software, so if you didn’t work with his software community, you wouldn’t make any money.

I asked him where the money came from, and his answer was “the customers.”

Ah, so let me get this straight: customers are going to fund your closed ecosystem of buddies all trying to corner the market on particular ideas, just to keep this group of developers making money. Really? Even when there’s a completely viable alternative that does every bit as much as your system, without vendor lock-in?

Free software is, if nothing else, an advocacy group for users. Customers are the users in this case. Free software is software that users can use for whatever purpose they wish, can give to anybody they want, and change to suit their needs in any way they want–as long as they don’t restrict what other users do with it.

Free software is about community. However, in the free software world, users and customers are part of the community, not merely an external entity funding it all, a sugar daddy. In the free software community, the lines between users and contributors, between customers and vendors, become quite blurry. Doing something at the expense of part of the community, just so you can make a fabulous profit, isn’t going to keep you in business very long.

Customizable code: writing future-proof code

Saturday, January 19th, 2008

Before code can be customizable, it must be clear. But clarity is not enough, if you’re going to be using a codebase in multiple places.

Many open source projects excel at customization. People have enough different uses for an application that very few work perfectly out of the box for everybody. Most companies want to apply their branding to the software we use. Some people need an application localized and translated for their audience. Sometimes a company just needs a small change to make the software better fit their needs.

It’s relatively simple to customize any application, if you have the source code. What becomes a huge challenge is maintaining your customizations when the underlying software is updated. If the software is not designed with specific ways of customizing it, it’s going to end up being difficult to maintain, unless you have gotten your changes incorporated back into the original software.

Architecting for customization
Applications that are designed for customization have clear divisions of code. This can happen in several different areas:

  • Templates or Themes. Most people want to be able to change the look and feel of a web application. If it has a template or theme system, you can just create a new theme and turn it on. Upgrades can then happen without clobbering your changes.
  • Language. Most successful open source projects have separate language files containing all of the labels, instructions, menus, and other text the application shows. Many come with multiple translations, and accept others as people contribute them.
  • Add-ons, plugins, modules, and components. Content management systems like Joomla and Drupal are particularly strong at this. SugarCRM is, too. They have a well-defined way of adding new functionality to the application, keeping it self-contained in a separate unit of code that a site administrator can manage through the interface.
  • An override mechanism. Some programs make it easy to replace the default behavior with your own version. ZenCart does this well–you can take many different core files, copy them into a particular directory associated with your site, and change them to make it do what you want. Upgrades to ZenCart will still use your versions of the files, even if the underlying file changes.
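
The override idea in the last item can be sketched in a few lines: look in a site-specific directory first, and fall back to the core file only if no override exists. This is a Python illustration of the general technique, not Zen Cart’s actual code, and the function and directory names are assumptions.

```python
from pathlib import Path


def resolve_template(name, core_dir, override_dir):
    """Return the site's copy of `name` if one exists, else the core copy."""
    # The override directory is checked first, so a site's version always
    # wins, even after an upgrade replaces the file in the core directory.
    for base in (Path(override_dir), Path(core_dir)):
        candidate = base / name
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"{name} not found in {override_dir} or {core_dir}")
```

Because upgrades only touch the core directory, the site’s copies survive untouched. The catch is that an override can quietly keep old behavior after the core file has changed underneath it.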

When you’re customizing an application, all of the other aspects of quality code apply to your customizations, as well as the original code. Your add-on is faster and more secure if you use the application’s interface for retrieving data instead of including your own. Your add-on is more powerful, clear, maintainable, and reliable if it uses the application’s defined ways of customizing it.

While not all open source is designed to be customized, it’s a strong consideration we’re looking at when we evaluate a project. So what do you do if you need to customize something that’s core to an application?

Customizing software not designed to be customized
If you need to make changes to the core part of an open source project, you’re setting yourself up for a maintenance nightmare. All active server software has updates. No program is perfect. Somebody, somewhere, will find a way to crack into it, and if you have business data or unethical competitors or disgruntled customers or employees, you will get targeted eventually. In the security community, people publish vulnerabilities in programs, so that they may be fixed. That means if you’re using common software packages, somebody needs to maintain them.

If you’re using software designed to be customized, and all your customizations are outside of the core code, this is not a major problem. A system administrator updates the core software, and if any of your customizations break, your developers update your customizations. However, if you had to make a lot of changes to core files, you’re in trouble. You either need to re-implement the security fixes in your code, or re-implement your customizations in the updated code.

There are basically 3 strategies for minimizing these issues:

  1. Use strong source-code management tools to manage your changes as patch-sets, and re-apply them at each upgrade, rewriting sections that no longer work.
  2. Fork the project, and take over responsibility for managing your branch. You’ll need to track the vulnerabilities in the parent project, and re-implement security fixes in your own.
  3. Contribute your changes back to the original project, and persuade the maintainer to incorporate them into the main code tree.

When you look at these alternatives, clearly #3 is far less expensive for you than the other two–your customizations are no longer customizations, but part of the core software. This is actually how open source develops, and how you may change from being an open source consumer to an open source contributor.

Clear code: Building understandable applications

Tuesday, January 15th, 2008

Programming is an exercise in understanding a problem. To program effectively, you need to fully understand, in intricate detail, the problem your program is solving. Sometimes as a programmer you don’t fully understand the problem until you’ve wrestled with it a few times in code.

Most experienced programmers will tell you that when creating a large program, you almost always have to scrap your work at least once. At some point, you find that you’ve programmed your way into a dead end, that you just can’t quite get where you’re trying to go without doing it again. This is part of the process of understanding the problem, and usually once you’ve made this leap, you can visualize the whole thing laid out before you, and the next go around leads to a useful, functioning program. Not only that, but the next go-around has a much higher percentage of clear, understandable code.

Clarity in code is a sign of the maturity of the application. It’s also a sign of requirements that haven’t changed from the original. Inevitably, in the real world, code accumulates hairy sections to deal with changing requirements, accreting moss, dirt, and all sorts of cruft as the real world steps in to make things messy. The more clear, organized, well-defined, and well-documented a code base is, the longer it will last in the real world before needing a major revision.

If you see a project that seems completely transparent, easy to figure out, and easy to change, you’re probably looking at code that has been through some serious revision, and has been recently refactored to reflect the problem it’s trying to solve. As long as the fundamental assumptions of the design do not change, clean code is easy to enhance, extend, and otherwise adjust to meet new requirements. Until it gets hairy again and is time to start again.

Clean code is elegant. Clean code is flexible. Clean code is related to powerful code, but code can be powerful without being clean.

Here are some principles we use to develop or identify clean code.

Use a good overall architecture for your application.
Like many other software companies, we use a Model-View-Controller architecture for most of our projects. The Model defines the problem space, what data needs to be stored, and how it’s broken down. The View is the human interface, the presentation of the software to the user. The Controller connects the model to the view, and often enforces authorization rules and the interface to other systems.

In our applications, the model is almost always object-oriented. We build up classes of objects that correspond to what we’re modeling. We like using template systems like Smarty for the view, so our designers and front-end coders can change the presentation without affecting core business logic. Our controllers are a mix of objects and functional code, whatever seems most appropriate for the overall system.

Normalize data as much as practical.
In database terms, normalization is the process of identifying all the properties of all the objects that have a one-to-one relationship to each other, that fit cleanly in the same database table. For example, a contact has only one first name and one last name, one father, and one mother (at least in the biological sense), but might have more than one email address, mailing address, and phone number. When modeling this data structure, you might decide to have one contact table that allows for 3 email addresses. Or you might have a separate email address table that allows any number of email addresses associated with a contact. If you were going to fully normalize this data, you would have separate email address tables, phone number tables, and physical address tables. But is this really practical? Does your particular system need to track all the email addresses of a user, or is one (or two) enough? If you can limit it to one email address, it might make a fine unique identifier for your system, if you know your users don’t share email addresses.

But if you’re going to track three contacts for a company, why not normalize this into a separate table, and remove the arbitrary limitation? I shudder when I see fields named “email1, email2, email3, email4.”
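
Here’s a small sketch of the normalized version, using Python’s built-in sqlite3 module and an in-memory database. The table and column names are my own assumptions for illustration; the point is that a contact can have any number of email addresses without an email1, email2, email3 in sight.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE contact (
        id         INTEGER PRIMARY KEY,
        first_name TEXT NOT NULL,
        last_name  TEXT NOT NULL
    );
    -- One row per address, so there is no arbitrary limit per contact.
    CREATE TABLE contact_email (
        id         INTEGER PRIMARY KEY,
        contact_id INTEGER NOT NULL REFERENCES contact(id),
        email      TEXT NOT NULL UNIQUE
    );
""")

contact_id = conn.execute(
    "INSERT INTO contact (first_name, last_name) VALUES (?, ?)",
    ("Ada", "Lovelace"),
).lastrowid
conn.executemany(
    "INSERT INTO contact_email (contact_id, email) VALUES (?, ?)",
    [(contact_id, "ada@example.com"), (contact_id, "ada@work.example.com")],
)

# Fetch every address for the contact, oldest first.
emails = [
    row[0]
    for row in conn.execute(
        "SELECT email FROM contact_email WHERE contact_id = ? ORDER BY id",
        (contact_id,),
    )
]
```

The UNIQUE constraint on email reflects the assumption above that users don’t share addresses; drop it if yours do.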

Each database table should be owned by a single class.
If you have a contact table, you should probably have a contact class to manage it. While other classes may query this table in a join, those classes should be getting only specific fields from the table. Only the contact class should write to the contact table, and in most cases, all requests for any contact details should go through the contact class. The rest of your application should talk to a contact object, rather than the underlying data, except when you’re trying to optimize for speed.

The main benefit of this approach is that you can more easily change the structure of your database tables with minimal impact to your application. If you decide that you really do need more than one email address for a contact, you can do most of the heavy lifting in the contact class, and only need to make small changes to the template to show the new data. The other parts of your application should be unaffected, because they simply request the default email address from your contact object–which is smart enough to know that’s now coming from a different table.
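
A minimal sketch of that ownership, again in Python with sqlite3 rather than our PHP. The class, method, and table names are hypothetical. The rest of the application asks for default_email() and never learns which table the answer came from.

```python
import sqlite3

SCHEMA = """
CREATE TABLE contact (
    id         INTEGER PRIMARY KEY,
    first_name TEXT NOT NULL,
    last_name  TEXT NOT NULL
);
CREATE TABLE contact_email (
    id         INTEGER PRIMARY KEY,
    contact_id INTEGER NOT NULL REFERENCES contact(id),
    email      TEXT NOT NULL
);
"""


class Contact:
    """The only class that writes to the contact tables."""

    def __init__(self, conn, contact_id):
        self._conn = conn
        self.id = contact_id

    @classmethod
    def create(cls, conn, first_name, last_name, email):
        cur = conn.execute(
            "INSERT INTO contact (first_name, last_name) VALUES (?, ?)",
            (first_name, last_name),
        )
        contact = cls(conn, cur.lastrowid)
        contact.add_email(email)
        return contact

    def add_email(self, email):
        self._conn.execute(
            "INSERT INTO contact_email (contact_id, email) VALUES (?, ?)",
            (self.id, email),
        )

    def default_email(self):
        # When email moved from a column on `contact` to its own table,
        # only this method had to change; callers were unaffected.
        row = self._conn.execute(
            "SELECT email FROM contact_email"
            " WHERE contact_id = ? ORDER BY id LIMIT 1",
            (self.id,),
        ).fetchone()
        return row[0] if row else None
```

Here the first address added is treated as the default, which is just one possible rule. A real system might store an explicit flag instead.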

If you really need to do sophisticated table joins to make your application fast, consider setting up a query builder structure. We sometimes set up static methods on a class that modify the different parts of a query to add the desired fields and do the appropriate joins.
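
Roughly, the query builder idea looks like this. It’s a simplified Python sketch with made-up names (new_query, to_sql, add_to_query), not our production code: the query is kept as separate parts, and a static method on the owning class adds its own fields and join.

```python
def new_query(table):
    """Start a query as separate parts that other classes can extend."""
    return {"table": table, "fields": [], "joins": [], "where": [], "params": []}


def to_sql(query):
    """Assemble the parts into a SQL string plus its bound parameters."""
    sql = "SELECT " + ", ".join(query["fields"]) + " FROM " + query["table"]
    if query["joins"]:
        sql += " " + " ".join(query["joins"])
    if query["where"]:
        sql += " WHERE " + " AND ".join(query["where"])
    return sql, tuple(query["params"])


class Contact:
    @staticmethod
    def add_to_query(query, foreign_key):
        """Add contact fields and the join needed to reach them.

        `foreign_key` must be a trusted column name written by the
        developer, never user input, because it is placed directly
        into the SQL text.
        """
        query["fields"] += ["contact.first_name", "contact.last_name"]
        query["joins"].append(f"JOIN contact ON contact.id = {foreign_key}")
        return query
```

An orders report could start with new_query("orders"), add its own fields, and call Contact.add_to_query(query, "orders.contact_id"), so the knowledge of how to join the contact table stays inside the contact class.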

Define who is responsible for what.
I’m not talking about people here–I’m talking about classes, files, and functions. Just like classes in the model own particular database tables, you should define which part of the application is responsible for all of the major parts of an application: authentication, authorization, state, the structure of the URL, form handling, initialization, etc. Each one of these functions should be owned by a particular part of the application. This “meta” stuff about the system we usually leave in the controller, often with included files dedicated to particular features. We usually build helper methods into base classes inherited by all of our data objects in the model, specifically for state and authorization.

Authentication, verifying that a user is who they say they are, should be consistent across your application. You usually have people log in with a username and password. The problem is, because the Web is stateless, you need to verify that you’re still talking with the same user on every single request. To do this, you either use http authentication, which passes the same credentials with each request, or you give the browser a token that you match up in a session. Your web application needs to verify the session or credentials with every single request, if it does anything that you don’t want the Internet at large to be able to do.
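
The token approach can be sketched like this in Python. It’s deliberately bare: sessions live in an in-memory dict, there’s no expiry, and check_password is a stand-in the caller must supply. All of the names are assumptions for illustration.

```python
import secrets


class SessionStore:
    """Maps random session tokens to the usernames that logged in."""

    def __init__(self):
        self._sessions = {}  # token -> username

    def log_in(self, username, password, check_password):
        # `check_password` is supplied by the caller; how passwords are
        # stored and verified is outside the scope of this sketch.
        if not check_password(username, password):
            return None
        token = secrets.token_urlsafe(32)  # unguessable random token
        self._sessions[token] = username
        return token

    def user_for(self, token):
        return self._sessions.get(token)

    def log_out(self, token):
        self._sessions.pop(token, None)


def handle_request(store, token, action):
    """Verify the session on every single request before doing any work."""
    user = store.user_for(token)
    if user is None:
        return 401, "login required"
    return 200, action(user)
```

Every request goes through handle_request, so there’s exactly one place to look when you want to know how authentication works, and one place to change it.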

Authorization, granting access to particular objects and methods for particular users, can be a bit more complicated. There are several different models for authorization: simple ownership, group ownership, user levels, and full-fledged access control lists. Authorization can either be handled by the controller or by the model itself. If the code is clear, it should be apparent where authorization is handled, and how it may be changed.
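
Here is a sketch of the simplest of those models, ownership plus a user level, handled in the model. The Project class and the shape of the $user array are assumptions for the example only.

```php
<?php
// Hypothetical sketch of "simple ownership" plus a user level check,
// handled in the model so every controller gets the same answer.
class Project
{
    private int $ownerId;

    public function __construct(int $ownerId)
    {
        $this->ownerId = $ownerId;
    }

    /** $user is an array with 'id' and 'level' keys (an assumed shape). */
    public function canEdit(array $user): bool
    {
        if (($user['level'] ?? '') === 'admin') {
            return true;
        }
        return ($user['id'] ?? null) === $this->ownerId;
    }
}

$project = new Project(7);
```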

Small Pieces Loosely Joined.
Even more than powerful programming, clear programming means breaking things up into manageable, understandable chunks. Each class in the model should correspond to the objects in the real world you’re modeling. Typical methods on classes in our models are between 5 and 25 lines of PHP code. Some reach 30 or 40 lines, and only the really ugly ones reach 100 lines. If a method is reaching that threshold, it can probably be broken into several smaller helper methods that make the main method more readable. If these helper methods can be reused by other methods, well, you’re killing two birds with one stone. More often than not, this level of refactoring distills the essence of the problem down into components that make your code more powerful.

Most of the long methods in our code seem to be related to form processing, parsing different parameters to insert or update data across multiple database tables. Through a combination of setting up property maps inside the object, clever getter and setter methods, and utility methods that iterate across relevant properties, these long methods can be reduced to a few calls that make the method much more portable, resilient to bad data, and more easily overridden from subclasses, too.
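
A property map can be sketched roughly like this. The TaskForm class, the field names, and the choice of cleaning functions are assumptions for illustration, not our actual code.

```php
<?php
// Hypothetical sketch: a property map lists which form fields the object
// accepts and how to clean each one; one loop replaces a long method.
class TaskForm
{
    /** Field name => callable that normalizes the raw value. */
    private const PROPERTY_MAP = [
        'title'    => 'trim',
        'priority' => 'intval',
    ];

    private array $data = [];

    /** Copies only mapped fields from the request; everything else is ignored. */
    public function setFromRequest(array $params): void
    {
        foreach (self::PROPERTY_MAP as $field => $clean) {
            if (array_key_exists($field, $params)) {
                $this->data[$field] = $clean($params[$field]);
            }
        }
    }

    public function get(string $field)
    {
        return $this->data[$field] ?? null;
    }
}

$form = new TaskForm();
$form->setFromRequest(['title' => '  Fix bug ', 'priority' => '3', 'is_admin' => '1']);
```

Because unmapped fields are simply skipped, a stray is_admin parameter never reaches the object.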

Create effective documentation.
I’m just starting to get into the habit of creating JavaDoc/PHPDoc style of comments, documenting each function and method. I’m a long time user of the Komodo IDE from ActiveState, and it kindly shows you the comment immediately preceding a function you type, in a tooltip as you provide parameters. Being able to see what parameters your method is expecting, what it returns, and any gotchas about using it without opening the file containing the class, saves a lot of time during development. Those kinds of comments I consider to be required.
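
For anyone who hasn’t seen the style, a PHPDoc comment looks something like this. The function itself is a made-up example.

```php
<?php
/**
 * Returns the number of whole days until a task is due.
 *
 * @param string $dueDate Due date in YYYY-MM-DD format.
 * @param string $today   Reference date in YYYY-MM-DD format.
 * @return int Days remaining; negative if the task is overdue.
 */
function daysUntilDue(string $dueDate, string $today): int
{
    $due = new DateTimeImmutable($dueDate);
    $now = new DateTimeImmutable($today);
    // %r gives the sign, %a the total number of days.
    return (int) $now->diff($due)->format('%r%a');
}

$days = daysUntilDue('2008-01-19', '2008-01-14');
```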

On the other hand, a comment that states the obvious is a waste of space. Comment anything unusual or unexpected. For example, if I assign a variable in an “if” expression, I’ll put a comment that I meant to assign it, that it’s not just missing the extra =.
if ($a = $b->value) // assigns value to $a, skips section if value is false

Related to inline code comments, use descriptive variable names, and consistent placeholders. I use $i, $j, $k for loops, $ar for generic arrays in helper functions, $obj for an unknown object, $t for a global Smarty template object. Otherwise I’m referring to $task, $oldtask, $project, $user, and $todotomorrow.

For complex projects, inline comments are not enough. You need a solid architectural document that illustrates objects and their relationships, workflow, and how to customize. Diagrams are good.

Finally, clear code is tidy code. While PHP isn’t as picky about tabs and whitespace as Python, properly nested code blocks promote readability, help keep your code valid, and give you a quick indication about how deep you are inside a function.

Clear code invites customization, enhancement, and further development. Clear code is maintainable, and a sign that an application can likely be kept up-to-date for quite a while to come. Clear code takes more time to develop, but usually indicates a better understanding of the problem. Clear code is more portable, more reusable for other purposes, and more powerful.

Powerful code: Get more out of every line

Monday, January 14th, 2008

Programming borrows a lot from the construction industry. Many programming terms derive from construction: hacking, builds, development, architecture, scaffolding, frameworks, and dozens of others. But in some ways, programming has an element of power beyond construction.

Take, for example, a building. When you build a building, you start by pouring a foundation. On top of that, you construct a skeleton, add walls, a roof, sheetrock, siding, and all the plumbing and electrical. Each one of these details needs to be built by somebody–all four walls of each room need to be framed in, wired, and finished.

In the world of programming, however, you really only need to build one wall, and then the computer can create as many copies as you need. So when building your program, you might create a “wall” class, which is comprised of a bunch of two by fours, sheathing, sheet rock, wiring, and outlets. You might give your wall a set of properties: width between studs, overall width, overall height, position of outlets, the number and dimensions of windows and doors, etc.

Once you have a wall defined with a bunch of appropriate variables, you can then work up to defining a room. Your room might have four walls, with windows and doors in particular positions. Obviously, there are new levels of complexity here, but you don’t have to build every single wall if you can just specify a new wall with particular characteristics.

Now that we have a generic room, we can extend our room model by creating specific types, or sub-classes, of rooms: bedroom, bathroom, kitchen, utility room. And then we can define an apartment as a particular combination of rooms, and an apartment building as a particular combination of apartments.

A powerful program is one that allows you to say, “give me an apartment building with this many apartments of this base floorplan, and put it here.” A few lines of code specifying any details that vary from your standard, and you’re done with the basic system–you can start creating custom trim.

Object-oriented programming is powerful because it lets you start with a basic model, and extend it to create variations. Each variation (or subclass) inherits all the hard work that went into the underlying class, but adds only the details that make it different. The bathroom extends a generic room by adding plumbing and fixtures.
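
In PHP, the room and bathroom example might be sketched like this. The class and method names are invented to match the analogy.

```php
<?php
// Hypothetical classes following the building analogy.
class Room
{
    protected float $width;
    protected float $length;

    public function __construct(float $width, float $length)
    {
        $this->width  = $width;
        $this->length = $length;
    }

    public function area(): float
    {
        return $this->width * $this->length;
    }

    public function features(): array
    {
        return ['walls', 'door'];
    }
}

// The bathroom inherits everything and adds only what makes it different.
class Bathroom extends Room
{
    public function features(): array
    {
        return array_merge(parent::features(), ['plumbing', 'fixtures']);
    }
}

$bathroom = new Bathroom(2.0, 3.0);
```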

To me, this ability to inherit properties from other objects is the main reason to write object-oriented code. Some languages (like Java) force you to do everything in an object-oriented way, which strikes me as less practical–you need to find design patterns that work with that model to accomplish what you’re trying to do. But object orientation provides a powerful way of modeling a system.

When I review code, I’m looking for object orientation used in an effective, sensible way. Each real world object being modeled in a system should have a corresponding class in the underlying system. Classes should extend some basic data class to avoid repeating the same methods in a bunch of separate classes. Code should be built up into units that can become parts of other units, so that individual chunks can be kept small and understandable. If any PHP file ends up longer than a thousand lines, I start looking for ways of simplifying, streamlining, sharing code with other modules. If any individual method ends up longer than a hundred lines, it should be doing something extremely unusual that isn’t necessary anywhere else.

The Unix architecture is often summarized as “small pieces loosely joined.” Each identifiable chunk should be small and have a clearly defined purpose. Assembling these small pieces into a larger system results in great power while also allowing for reliability, security, and actually getting the project finished.

It’s all a matter of scope. When you’re looking at a wall object, you are working with two by fours, nails, and sheetrock. When examining a room, you’re working with walls, a ceiling, and a floor. Programming should hide the details of lower layers, and allow the programmer to focus on the necessary detail for the scope of the module she’s working on. The result is powerful code.

Why would you not need powerful code?
Pascal (and many others) is credited with the idea that it takes longer to write shorter code. This series of blog entries certainly illustrates the concept… The same principle holds true in code. If you’re creating a web application that’s never going to need revision, it can be much quicker to just write as you go and end up with some big long pile of spaghetti code. The instant you need to change it, or worse, somebody else needs to change it, that quickly written, long-winded code takes a lot more time to update.

As far as I’m concerned, the only reason to not take a structured, measured, powerful approach to coding is that you need something temporary working today, and don’t care that you’ll probably have to scrap it and do it right later.

How do you create powerful code?
Powerful code comes from structure. Frameworks deliver structure. This does not mean a particular framework is powerful for your application.

A skyscraper needs a much stronger foundation, and far better design to prevent collapse than a house. In programming, you can either use somebody else’s framework, or build or grow your own.

Developers love building frameworks. It’s fun to think of all the things that people might someday do with your framework, and build in a mechanism that provides useful ways of doing those things. The problem is, build in too many features to the framework and you just end up with a large bloated blob of code that nobody uses entirely, that nobody even knows how to use properly. Make your framework too small, and people end up having to do more work in the actual application.

The hot framework right now is Rails. It has a lot going for it–a solid philosophy of convention over configuration, auto-creation of all sorts of things like database tables you otherwise have to build yourself, and other features I’m sure you’ve heard about already from all the Rails developers out there.

Personally, I think frameworks like Rails are overrated, hiding too much of the implementation to be valuable. The perfect analogy for this is photography. If you take a basic photography course, you learn about the basic fundamentals: lens focal length, aperture, shutter speed, focus distance, and film speed. That’s all you need to know to take great pictures with any camera–at least any that allows you to set these things manually. Most cameras these days try to automate all of this for you, and most of the time they do a reasonable job. But most cameras also have a whole set of special settings. My Casio has a “Best Shot” mode, designed to set the camera up for different scenarios: landscapes, portraits, evening shots, indoors, backlit, etc. Some of these modes do really sophisticated things, but is it better for a photographer to understand all the different programmed modes, or the fundamentals of photography? I would argue the latter–with an understanding of how photography works, you can operate any camera. With an understanding of the programmed settings of a particular camera, you’re lost as soon as you move to another.

That’s the problem with frameworks–you spend more time learning all the ins and outs and arbitrary ways of tweaking it, instead of focusing on the actual task at hand–taking good pictures. Then again, I prefer a stick shift to an automatic every time…

When it comes to frameworks, less is more. The simplest possible framework that fits your application requirements is the one to use. If you can’t find one that fits, start with some simple data objects, an effective template library, and build your own, but don’t spend too much time on it–let it grow as you need it.

In the grand scheme of things, I don’t need a framework to create a database table for me–that’s a lot of extra code for something that only happens once. But for all those things you do need more than once–for the walls, rooms, and apartments in your building, design with care and power in mind.

For more about power, go read Paul Graham’s essay, Succinctness is Power. Then follow it up with Holding a Program in One’s Head.

Fast code: Speed and Scalability in PHP applications

Sunday, January 13th, 2008

Continuing on the series, the next item on the list seems to be the mistake I see the most–putting slow code in loops, loading up things that don’t need to be loaded, making simple requests expensive.

In terms of processing time, it’s expensive to open a database connection. It’s expensive to connect to another computer. It’s expensive to load up a big framework to respond to a single request. It’s relatively cheap to retrieve a pre-constructed page out of a cache.

The single biggest mistake I see that kills performance in code is putting database calls inside a loop. One code project we picked up had display code that showed the results of a search. First, it did a search to identify all the matching rows in the database. Then it looped through that result set, grabbing the rest of the data for each individual row, one query at a time. Then it cut down this set to the page size, discarding all that data it had loaded up. If the search yielded over a thousand results, it took over a minute to run! All of this data could be loaded with a single smarter database query–and doing so made the same search practically instantaneous.

This type of performance penalty is the main reason I don’t care for frameworks all that much–they often trade performance for programmer convenience. This is fine when your site is small, but leads to a lot more optimizing work down the road if your site takes off. And while good frameworks can load result sets into objects efficiently, it usually takes learning how to make the framework do this in the first place–which means that programmers are better off learning how to do all of the work themselves before using a framework so they understand how to avoid these problems.

So here are some principles I use to make PHP applications speedy from day one.

Get as much data from each database query as you possibly can–but not much more
Unless a database table regularly contains a large blob we rarely need, go ahead and load up the entire row when creating a corresponding object. For example, in a project management tool, if asked to retrieve a task object, in my code you would provide a task id and you would get a task object with all of its properties pre-loaded from the database. While you can call getter methods to get individual properties, these do not result in yet another call to the database.

When retrieving arrays of task items, I usually provide a static search method that does a single query getting all of the data for all matching rows, constructing each task object, and passing it the already retrieved data so there are no further database calls–request the first 30 matching tasks, and the system still only does a single query on the database.
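
A sketch of that kind of static search method follows. It uses PDO with an in-memory SQLite database purely so the example is self-contained (that assumes the pdo_sqlite extension is available); the tasks table and its columns are made up.

```php
<?php
// Assumes the pdo_sqlite extension; table and column names are made up.
class Task
{
    public int $id;
    public string $title;

    /** Builds a task from an already-fetched row: no further queries. */
    public function __construct(array $row)
    {
        $this->id    = (int) $row['id'];
        $this->title = $row['title'];
    }

    /** One query returns every matching row for the requested page. */
    public static function search(PDO $db, string $term, int $limit = 30): array
    {
        $stmt = $db->prepare(
            'SELECT id, title FROM tasks WHERE title LIKE ? ORDER BY id LIMIT ?'
        );
        $stmt->bindValue(1, '%' . $term . '%');
        $stmt->bindValue(2, $limit, PDO::PARAM_INT);
        $stmt->execute();

        $tasks = [];
        foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
            $tasks[] = new Task($row);
        }
        return $tasks;
    }
}

$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE tasks (id INTEGER PRIMARY KEY, title TEXT)');
$db->exec("INSERT INTO tasks (title) VALUES ('Write report'), ('Review report'), ('Lunch')");

$found = Task::search($db, 'report');
```

However many rows match, the database is hit once, and each Task is built from data already in hand.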

Doing a database query is expensive, but making a sophisticated query doesn’t add much to that as long as the database is properly indexed. When you know you have to do one, wring as much data as you can from each query. Use JOINs and database functions to do as much work as you possibly can in a single query.

I’m not that big a fan of stored procedures, mainly because I haven’t learned how to manage them effectively across deployment instances. Make a change to a code base, all you have to do to get it elsewhere is commit it to the repository and update your working copies. Make a change to a stored Postgres function, and you need to manually replace the function using psql or some other tool. But a stored function can be a way to offload more processing to the database, possibly gaining some performance in the process.

In general, I think of the database being in a separate silo from the business logic. The requests between these silos are what’s expensive–the processing in one side or the other is less so. Minimize the number of times you switch, and your application will be faster… As a side benefit, when your traffic outgrows what a single server can handle, and your database calls are actually on a different server, you won’t need to rewrite your application.

Avoid repeating yourself
Cut and paste when programming is a bad thing. Stepping through my own code with a debugger often reveals areas where I do the same thing twice. I loop through an array in one method to calculate some value. Then I loop through the same array somewhere else to perform some other operation. While loops can be fast, if you’re manipulating large objects or arrays, you still want to minimize this wherever you can. Sorting is expensive–wherever you can, let the database pre-sort your results for you. Look for opportunities to leverage work you’re doing in one part of your application to do double-duty and handle the task you’re doing elsewhere.

Writing code is a lot like writing anything else–it takes time to distill down to the essence. Early drafts can be much wordier than later drafts. If you have the time to go back and consolidate the areas of work, you’ll get a small performance benefit out of this.

Out of this list, this item is the least important. Try to consolidate as much as you can the first time through your code, but caching will far more than make up the difference. These are the slight improvements to save for future revisions–but if you see an obvious opportunity to combine and simplify code, take it.

Use Lazy-Loading wherever it makes sense
If your application needs to hit the database on every single request, go ahead and open a database connection early. If on some requests your application just returns static data, save a tiny bit of processing and skip the database connection. On a few projects, I’ve written code that connects to multiple databases, so I’ve written a simple stub class that maintains a singleton database connection object. In every method that connects to a database, it calls the static method that returns the database connection object, creating and establishing it if it doesn’t already exist.
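
The stub class is roughly this shape. This is a simplified sketch: the class name is made up, and a plain stdClass stands in for the real connection object so the example runs without a database.

```php
<?php
// Hypothetical stub: the connection is created on first use only.
class ConnectionStub
{
    private static ?object $connection = null;

    /** Counter kept only to show how often we really connect. */
    public static int $timesConnected = 0;

    public static function get(): object
    {
        if (self::$connection === null) {
            self::$timesConnected++;
            // Stand-in for the real work, e.g. new PDO($dsn, $user, $pass).
            self::$connection = new stdClass();
        }
        return self::$connection;
    }
}

$before = ConnectionStub::$timesConnected; // nothing connected yet
$first  = ConnectionStub::get();           // connects here
$second = ConnectionStub::get();           // reuses the same object
```

A request that never calls get() never pays for a connection at all.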

We program extensively with Smarty, and in some projects use Smarty’s caching system. When used with a lazy-loading design, it’s extremely effective at speeding up page views. In our “standard” architecture, we have a controller stub that the browser requests. This stub examines the request, identifies the view and the data objects to load, and sometimes creates controller objects to handle specific requests. However, if you’re using a caching system, you need to check for a cached version before doing any of this processing. Either check the cache at the top of your controller, or move your controller itself into a file that’s loaded by a Smarty template. By having the template load the controller and decide what to do next, that processing never happens if Smarty retrieves the cached template instead.

Now that we program a lot with Ajax, we no longer automatically create a Smarty object for every request–first we check whether we’re returning HTML, XML, JSON, or something else, and only create the Smarty object for particular types of views.

These are examples of how we use lazy loading to avoid loading large chunks of code or establishing database connections we never use.

Plan early on for caching
When you first launch your application, you probably don’t need caching because you’re not getting that much traffic. Some applications only run in private networks and never need to do any caching. But if you’re building a Facebook application or expecting huge amounts of traffic someday, create strategies for caching early on.

As I mentioned earlier, Smarty does this extremely well. You need to provide a way to uniquely identify an item in the cache, and Smarty will do it for you. Just make sure you check for the cached version before doing a lot of extra processing.

Without Smarty, it’s relatively easy to use output buffering to capture the output of your code and store it somewhere for later retrieval.
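
A bare-bones version of the output buffering approach might look like the sketch below. The function name, the file-per-key scheme, and the time-to-live handling are all assumptions; a real cache would also need invalidation when the underlying data changes.

```php
<?php
// Hypothetical helper: cache directory, key scheme, and lifetime are assumptions.
function cachedOutput(string $key, int $ttl, callable $render, string $dir): string
{
    $file = $dir . '/' . md5($key) . '.html';

    // Cheap path: a fresh cached copy exists, so skip rendering entirely.
    if (is_file($file) && (time() - filemtime($file)) < $ttl) {
        return file_get_contents($file);
    }

    ob_start();
    $render();               // the expensive page code echoes as usual
    $html = ob_get_clean();  // capture what it printed

    file_put_contents($file, $html, LOCK_EX);
    return $html;
}

// Usage sketch: $render would normally load objects and display a template.
$calls  = 0;
$render = function () use (&$calls) {
    $calls++;
    echo '<p>hello</p>';
};
$key    = 'page-' . uniqid('', true);
$dir    = sys_get_temp_dir();
$first  = cachedOutput($key, 60, $render, $dir);
$second = cachedOutput($key, 60, $render, $dir);
```

The second call returns the stored HTML without running the render code again.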

Many projects designed for traffic have simple switches you can just turn on to take advantage of caching, including Drupal and Joomla. After caching as much HTML as possible, the problem turns into more of a system administration project–installing an opcode cache like eAccelerator can help your server handle 30-40% more traffic, in our experience. These systems essentially compile your PHP to get more speed, and cache the result.

The next level of caching, for truly large sites, is using a system like memcached. Memcached provides a system for distributing a cache across multiple data servers, so for the truly large sites, the problem starts involving developers again. PHP provides a memcache module you use to store and retrieve your pages in memcached. When your site outgrows what can be run on two servers, it’s time to have your system administrators set up a memcached cluster and rewrite your application to use it.

Avoid over-engineering your application
I inherited another project gone awry that had started with some really huge, complicated framework that seemed half-done. Most projects we’re called in to complete involve spaghetti code, mixed logic and presentation, and no clear architecture. This one, in contrast, was over-engineered for the problem. To figure out how the code worked, I ran it through a debugger. To get to my main class for a particular object, it ran through a series of no less than 8 inherited classes. And worse, some utility methods were copied between child classes, instead of being put once higher in the class hierarchy. I saw clear reasons for having 3 layers of inheritance in this application. Not 8.

Since then I’ve seen a few times where developers seem to create more inherited classes just because it seems like the correct thing to do, not because there was any practical value in it. I rarely see the need for more than 3 levels of object inheritance, and never more than 4 (at least in a web application). When your application needs to open 20 files just to respond to a simple AJAX data request, that’s over-engineered. When you create an elaborate class structure just to avoid a simple function, that’s over-engineered.

There’s a scale here, from non-engineered spaghetti code to rigid, sophisticated frameworks. I suspect that most people without formal training start with spaghetti code and gradually learn how to create more structured code–while computer science majors start out with over-engineered structures and eventually loosen up in the real world after running their code through some profilers and realizing they don’t need all that complexity for a simple problem. Everyone over time, at least anyone with a knack for this stuff, ends up somewhere in the middle, with enough architecture to do the job–and little more. There’s definitely some variation here as a matter of taste, but there are measurable problems with either extreme.

I further suspect that Rails might be so popular now because a lot of web developers out there with no formal training are suddenly seeing the benefits of structured code and smart frameworks.

Keep in mind how expensive each operation is
Some actions take a while to complete. In our experience, the most expensive actions involve connecting to another server, especially ones not in the same data center. Keep these in mind when coding, and don’t do them if it isn’t necessary. For very expensive operations, especially when you need to do a bunch at once, consider forking a process using a call to the shell, or move to a maintenance routine called from a cron job.

Expensive:

  • curl to connect to another server
  • Other functions used to connect to remote servers: fopen, file, etc
  • domxml, SimpleXml on very large XML documents
  • Sending mail to multiple recipients

Moderately expensive:

  • Sorting on large arrays
  • Database connection to remote server
  • domxml, SimpleXML on medium-sized documents
  • Recursive functions

Somewhat expensive:

  • Individual database queries
  • domxml, SimpleXML
  • Creating complex objects
  • Loading large files

Inexpensive:

  • XML event-based parsers
  • Retrieving cached files
  • Loops on small arrays
  • Lookups in hashes stored in memory, retrieving constants

Do you have any other tips for writing fast PHP code? Please add a comment below…

Secure code: Understanding PHP vulnerabilities

Saturday, January 12th, 2008

There are many articles that cover PHP vulnerabilities, but I’ve run across a lot of programmers and code that seem oblivious to them. When interviewing programmers, I look for an understanding of these types of vulnerabilities, and how to prevent their programs from being vulnerable to them.

Aside from register globals issues, most of these attacks are not specific to PHP.

Register Globals issues
From early on, the developers of PHP had this great idea: accept any parameters passed from the browser, and automatically turn them into variables available in the code. Well, it turned out to not be such a great idea–it meant that improperly initialized variables could be seeded by attackers to potentially do all sorts of damage. Worse, sometime after PHP 5 came out, someone figured out that you could pass a particular variable that would load and execute any PHP file before running the actual code–and this file could be on a completely different server, in a regular PHP installation.

Most other web languages never offered this convenience–you have to retrieve parameters from a browser through a specific module or array. PHP now provides arrays like $_GET, $_POST, $_REQUEST that are simple to use, but make it so you need to specifically request the variable you want from your code.

Any code that depends on register_globals being set is completely broken, as far as I’m concerned. If it’s on a server with an older version of PHP, it’s just waiting to get cracked. Any developer that relies on registered globals is programming for 10 years ago, and needs some serious education.

The main point here is that software should never trust data coming from the browser. I don’t care how much validation you do with Javascript; you’d better double-check the request on the server, and make sure either you set variables before you use them, or work in functions/classes that are not in the global scope.
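
As a small sketch of what “specifically request and double-check” can look like: the helper name is made up, and it only handles integers.

```php
<?php
// Hypothetical helper: read one integer parameter explicitly and validate it.
function intParam(array $source, string $name, int $default = 0): int
{
    if (!isset($source[$name]) || !is_string($source[$name])) {
        return $default;
    }
    $value = filter_var($source[$name], FILTER_VALIDATE_INT);
    return $value === false ? $default : $value;
}

// In a real request this would be: $page = intParam($_GET, 'page', 1);
$page = intParam(['page' => '12'], 'page', 1);
```

Anything that isn’t a clean integer falls back to the default, no matter what the Javascript on the page was supposed to enforce.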

SQL Injection vulnerabilities
This is the next most serious issue, and it affects pretty much all web languages, not just PHP. The most common way to interact with a database is to use a language called “structured query language” (SQL) to select rows of data from the database, update data, insert new data, or delete things. Once you learn the basic syntax and structure, it’s very easy to use. The problem is, you nearly always depend upon the user to identify what data to retrieve, or to provide the data to add or change.

Once again, we can’t ever trust data from the user. Most databases accept more than one query at a time, and most information used to select rows in a database is wrapped in single quotes:

SELECT first_name, last_name, salary FROM employees WHERE first_name LIKE 'John';

Beginner programmers drop the variable containing the search from the browser into the query, wrapping it in single quotes: LIKE '$firstname';

Attackers simply put a single quote in the field, and then add another SQL command to do something malicious. Like delete the entire database.

Now, when you know there might be a quote in the variable, you can escape it by adding a backslash in front of it. PHP actually does this for you automatically if you have an evil setting called magic_quotes_gpc turned on. That’s why you often see a lot of backslashes in forums, blog comments, etc., by the way. But there are ways of getting around that, as well.

At a minimum, all variables used in a query should be escaped using a function known to handle all possibilities, usually those provided for the specific database engine. What I look for in code is someone using a database abstraction layer or interface that allows for parameterized queries: instead of putting the variables directly in the query, you create a query with placeholders (usually a question mark, ?) where variables are to be substituted, and then pass an array of variables. The abstraction layer handles all of the escaping for you, and you end up with much cleaner code.
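
Here is what a parameterized query looks like in practice. To keep the sketch self-contained I’m using PDO with an in-memory SQLite database rather than PEAR::DB (this assumes the pdo_sqlite extension is installed); the placeholder idea is the same, and the employees table is just the example from above.

```php
<?php
// Assumes the pdo_sqlite extension. PDO is swapped in here for illustration.
function findByFirstName(PDO $db, string $firstName): array
{
    // The ? is a placeholder; the value is passed separately, never
    // pasted into the SQL string.
    $stmt = $db->prepare(
        'SELECT first_name, last_name, salary FROM employees WHERE first_name LIKE ?'
    );
    $stmt->execute([$firstName]);
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}

$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE employees (first_name TEXT, last_name TEXT, salary INTEGER)');
$db->exec("INSERT INTO employees VALUES ('John', 'Smith', 50000)");

// A hostile value is treated as plain data to match against, not as SQL.
$hostile = "x'; DELETE FROM employees; --";
$none    = findByFirstName($db, $hostile);
$john    = findByFirstName($db, 'John');
```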

We use PEAR::DB as a database abstraction layer in most of our projects. Others include ADODB, or PEAR::MDB. PHP5 provides a mysqli interface capable of this, as well. If I see a mysql_query command in general application code, it gets marked way down in my book.

Mail Header Injection
Many programmers don’t realize it’s not safe to use the PHP mail() function without special protection. I didn’t believe this was a vulnerable function until one of our clients got attacked with it. Basically, the mail() function on a Linux system is a wrapper to the system sendmail command. Sendmail takes a plain text email, looks for To, CC, and BCC addresses, and sends the message on its way. The problem is, attackers can inject fake headers into the message that basically hijack your server to send spam. Any field that ends up in the header of a message–to, from, subject, or any other arbitrary header you collect–could be used for this purpose.

I haven’t tested to or subject recently–there may be some built-in protection for these fields now. But to set the from address of a message, you pass it in an array or a string to the “header” parameter of mail(). This is ripe for exploit. All the attacker has to do is insert a newline, and then they can supply their own bcc field with hundreds of email addresses to spam. PHP and the sendmail binary will happily spew your attacker’s message to hundreds of users at a time. The next thing you know, your server will get on a blacklist for spamming, and nobody on that server will be able to send mail to domains like AOL or Comcast and other places that actively reject mail from known spammers.

Some kind soul posted a function to filter headers and ignore anything after a newline character to the comments section of the PHP documentation for the mail() function (the PHP documentation, and the comments, are a fantastic resource, and one of our favorite features of PHP). We have a simple safe_mail function that runs all the headers through this function, which also makes for a convenient way to intercept mail on a test environment.
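
I don’t have that exact function to reproduce here, but a filter in the same spirit can be sketched in a few lines. The name safeHeaderValue is made up, and this is an approximation of the idea rather than the code from the PHP manual comments.

```php
<?php
// Hypothetical filter: keep only the text before the first CR or LF,
// so no extra header line can be smuggled in.
function safeHeaderValue(string $value): string
{
    $parts = preg_split('/[\r\n]/', $value, 2);
    return trim($parts[0]);
}

$from = safeHeaderValue("me@example.com\r\nBcc: victim1@example.org, victim2@example.org");
```

Every value headed for a mail header goes through the filter before it gets anywhere near mail().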

This one isn’t talked about that much, but properly protecting a mail function is a sign of an experienced PHP developer.

Cross-site scripting (XSS)
Cross-site scripting is the current favorite exploit of attackers. Unlike the other attacks, they’re not attacking your site directly but exploiting it to attack your visitors. Of course, if your visitors have access to an administrative interface on the site, they could then use this to attack your site.

The real problem is that cross-site scripting is a great way to spread spyware, and so many sites are vulnerable to it. MySpace was long a victim of XSS. Ebay, too. Basically any site that allows users to add content that is shown to other users is vulnerable to XSS, unless the application developer has taken specific measures to prevent this. In this age of social networking, that is a huge number of sites.

If an attacker can find a way to get a script into a page shown to others, there are lots of things he can do. Sometimes it’s as simple as adding <script> and a chunk of javascript or a location to load a javascript from. Other times they will attach a mouseover event or hide the script in some other devious place. Sometimes they insert an object or iframe containing their malicious content.

If they can load an arbitrary script of their choosing, they can view anything on that page and watch anything the visitor types into that window. That’s expected, defined behavior, and that’s not going to change. So at a minimum, they can get passwords to your site and from there, they can do anything on your site that an attacked user can do.

But they don’t stop there. Both Internet Explorer and Firefox have contained vulnerabilities that allow an attacker to escape the sandbox of that browser window and monitor other windows or, at worst, install malicious software on the user’s computer. That is how spyware is spread. And once they have their own malicious software installed on your computer, they own it–they can monitor every mouse movement and keystroke, they can use it to send spam or attack other computers or do whatever they want.

Cross-site scripting is diabolical. It doesn’t usually harm your site, because attackers don’t want you to know you’re carrying their malware. Application developers ignore these issues to the peril of the entire Internet…
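
One widely used defense, offered here as an assumption rather than something this post spells out, is to escape user-supplied text at the moment it is written into a page, so that any markup reaches the browser as inert text. The helper name escape_html is hypothetical; htmlspecialchars() is PHP’s built-in function.

```php
<?php
// Hypothetical helper: escape user-supplied text for output inside HTML.
function escape_html(string $text): string
{
    // ENT_QUOTES converts both double and single quotes, so the result
    // can also be placed inside a quoted attribute value.
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// A comment submitted by a visitor, containing an injected script tag.
$comment = '<script>alert(1)</script>';
echo '<p>' . escape_html($comment) . '</p>', "\n";
```

Escaping like this only covers content that is meant to be plain text. A site that lets users post some HTML needs a real sanitizer on top of it.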

Session Hijacking
Web applications differ from most other applications in that they are considered “stateless”. That is, the server does not know the state of anything the user is doing, and starts in exactly the same condition for every request. In most applications, however, you are working through some sort of process and what you do next depends on the action you take when you’re in a particular state. What actions you have available to you depends upon the state of the object you’re working with.

For example, if you’re working with a user object, it might have several states: “unconfirmed”, “logged in”, “not logged in”, “suspended”. For users that are suspended, the application would prevent access to private data. For users who are unconfirmed, the application might offer to resend a confirmation link. For users who are logged in, the application would provide access to appropriate parts.

In a web application, it’s up to the programmer to define these states and handle them appropriately–PHP has no internal concept of state at all. Every request coming into your application must do all the work of loading the appropriate objects, defining what state they’re in, and doing whatever action is necessary.
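
To make that concrete, here is a small sketch of how an application might spell out its user states and the actions each one allows. The state names come from the example above; the action names, the USER_ACTIONS table, and the can_perform function are hypothetical.

```php
<?php
// Assumed mapping from each user state to the actions it allows.
const USER_ACTIONS = [
    'unconfirmed'   => ['resend_confirmation'],
    'not logged in' => ['log_in'],
    'logged in'     => ['view_private_data', 'log_out'],
    'suspended'     => [],
];

// Returns true only if the given state explicitly allows the action.
// Unknown states allow nothing, so a bad value fails closed.
function can_perform(string $state, string $action): bool
{
    $allowed = USER_ACTIONS[$state] ?? [];
    return in_array($action, $allowed, true);
}
```

Every request would work out the user’s state first and then ask a function like this before doing anything, rather than assuming the state carried over from the previous page.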

PHP and other languages do, however, provide a mechanism for keeping track of users: the session. PHP basically provides an automatic mechanism for storing variables associated with a user session on the server, instead of in the browser. Since, as we know well by now, you can’t trust anything coming from a browser, a session is a much safer place to store critical data that helps you determine the state of your application without having to reconstruct it completely on every page. It’s especially used for logins.

The problem is, sessions can be hijacked. PHP and other languages use a cookie to store a simple unique identifier for the session in the browser, which the browser helpfully returns on every request. If the browser has been compromised (by a cross-site scripting attack, spyware, etc.), an attacker can read these cookies and pass somebody else’s session identifier into your application and, if you don’t protect against this, hijack the original user’s session.

That takes some effort, however. A far more common problem arises when a user turns cookies off. Back in the late 1990s/early 2000s, many users got completely paranoid that cookies identified them wherever they went on the Internet, and many applications now help users manage their cookies. This general paranoia about cookies actually makes the situation worse, because if the user turns off cookies, your application needs to either force them to reauthenticate or allow the browser to pass their session identifier through another means.

PHP has yet another configuration parameter, session.use_trans_sid, that automatically allows session IDs to be passed via a GET request instead of a cookie. The problem is, when this is done, the session identifier becomes part of the URL in the browser address bar. Users then bookmark their session ID, post it to their blog or a forum, do whatever with it they want. And if your application is not written to handle this, other completely innocent users may find themselves logged into your application under a hijacked session ID!

Applications using sessions must use some other source to verify that the session corresponds to the right user. In some cases, it may be enough to just require cookies and not allow session identifiers to come through any other vector. In others, programmers may need to consider using HTTP authentication or other methods to verify that they have the right user.
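
Here is one possible sketch of both ideas: cookie-only sessions, plus a second check that the session still belongs to the same client. The two php.ini directives in the comment are real PHP settings; bind_session, session_matches_client, and the choice of the User-Agent header as the extra signal are assumptions made for this example. A User-Agent is easy to forge, so this mostly catches the accidental case of a shared or bookmarked link, not a determined attacker.

```php
<?php
// Settings one might put in php.ini:
//   session.use_only_cookies = 1   ; ignore session IDs arriving in URLs
//   session.use_trans_sid    = 0   ; never rewrite links to carry the ID

// Hypothetical helper: record a fingerprint of the client at login time.
function bind_session(array &$session, string $userAgent): void
{
    $session['fingerprint'] = hash('sha256', $userAgent);
}

// Hypothetical helper: check that a later request matches that fingerprint.
function session_matches_client(array $session, string $userAgent): bool
{
    return isset($session['fingerprint'])
        && hash_equals($session['fingerprint'], hash('sha256', $userAgent));
}

// In a real request these would be $_SESSION and $_SERVER['HTTP_USER_AGENT'].
$session = [];
bind_session($session, 'Mozilla/5.0 (example browser)');
```

When the check fails, the safe response is to throw the session away and ask the user to log in again.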

Session hijacking is one of the toughest vulnerabilities to manage, if you need to protect any sensitive data. Even if you don’t, the application should deal appropriately with accidental session hijacking, because it’s very common and easy for users to do.

Other vulnerabilities
The list doesn’t stop there, but those are the serious mistakes I see, sometimes on a weekly basis. It’s hard to write secure code, but starting with security as a mindset goes a long way towards preventing problems down the road.

To summarize, here are some general tips for keeping applications safe from these types of attacks. If I’m interviewing you for a programmer position, I will be asking you about these:

  1. Never trust input from the browser.
  2. Turn off register_globals, but always assume it’s on and protect your variables anyway.
  3. Use a database abstraction layer, and parameterized queries.
  4. Be extra careful with database statements that cannot be parameterized.
  5. Strip all script, object, and iframe tags out of user inputs. Strip all JavaScript and event attributes from any HTML you do allow.
  6. Never trust input from the browser.
  7. Use wrapper functions to add extra protection to common functions like mail().
  8. Be extremely careful with sessions that are used to authenticate users.
  9. Provide an appropriate level of protection for private data.
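
Tip 4 deserves an example, since placeholders only stand in for values, not for column names or sort directions. A minimal sketch, assuming a hypothetical build_order_by helper and made-up column names, is to accept only values from a fixed list:

```php
<?php
// Hypothetical helper: build an ORDER BY clause from user input by
// checking it against a list of columns the programmer has approved.
function build_order_by(string $column, string $direction, array $allowedColumns): string
{
    if (!in_array($column, $allowedColumns, true)) {
        throw new InvalidArgumentException('Unsupported sort column.');
    }
    // Anything other than DESC falls back to ASC.
    $direction = strtoupper($direction) === 'DESC' ? 'DESC' : 'ASC';
    return "ORDER BY $column $direction";
}

// The value part of the query still uses a placeholder ("?").
$sql = 'SELECT * FROM customers WHERE status = ? '
     . build_order_by('name', 'desc', ['name', 'created']);
```

Nothing the browser sends ever ends up in the SQL text itself: the column either matches an approved name exactly or the request is rejected.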

Any other vulnerability types you care about, when writing or reviewing web application code?

Quality Code: How do you judge?

Friday, January 11th, 2008

We’re hiring programmers, over at Freelock. I’ve been going through lots of code samples to try to identify how experienced and competent a particular developer is. I also do this on a regular basis to evaluate how solid a particular open source project is.

I’ve seen a lot of code in various languages. As a technical writer, I used to write documentation for programmers teaching them how to use a particular interface or system. I’ve been involved with traditional software development projects at large software companies and startups. And I’ve done my share of actual programming of web applications.

I’m finding there are several indicators I look for when evaluating code, specifically for PHP, our language of choice. I’ll go in more depth on each of these qualities in future posts, but for now just thought I’d capture them while they’re fresh in my mind. So when I review code of a web application, here are some qualities I’m looking for:

  • Secure. Does the application trust users to provide good data? Does it protect its internals to prevent all the various types of exploits out there? Does it protect data from malicious users?
  • Fast. This could mean many things, but I’m looking for efficiency across layers. Is there a database call inside a loop that gets called a couple hundred times? That’s a huge speed killer. I look for code that has an appropriate level of abstraction to the size of the problem–and makes sensible choices about how much data to load for each request.
  • Powerful. This one is stolen from Paul Graham. Does the code use object-orientation and inheritance in a powerful way? I like seeing utility methods on base classes, which can then be leveraged to make very short, easy-to-understand final classes. Are the methods attached to the appropriate level of the class hierarchy? How short can you make the main logic of the application?
  • Clear. Going hand-in-hand with power, clarity is about making it apparent what each chunk of code is for, and how to go about changing it to make it work the way you want. Clear code is maintainable, well-documented, easy to customize.
  • Customizable. Was the program designed in a way that’s easy to override, easy to customize, easy to run in other environments? Can it be managed effectively, and work broken up into different units?
  • Reliable. Does each function or method cover all possible scenarios? Is there proper error-handling in the code? When an end user hits upon some combination of things that the programmer never anticipated, does the program die ungracefully, or provide useful feedback?
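
The database call inside a loop mentioned under “Fast” can often be replaced by a single query that fetches the whole batch. A rough sketch, with a hypothetical build_batch_query helper and made-up table and column names:

```php
<?php
// Hypothetical helper: build one parameterized query for a batch of IDs.
// $table and $column must be hard-coded by the programmer, never user input.
function build_batch_query(string $table, string $column, array $ids): array
{
    if ($ids === []) {
        throw new InvalidArgumentException('No IDs to look up.');
    }
    // One "?" placeholder per ID, e.g. "?, ?, ?" for three IDs.
    $placeholders = implode(', ', array_fill(0, count($ids), '?'));
    $sql = "SELECT * FROM $table WHERE $column IN ($placeholders)";
    return [$sql, array_values($ids)];
}

// Instead of running one query per customer inside a loop:
list($sql, $params) = build_batch_query('orders', 'customer_id', [3, 7, 9]);
// $sql and $params would then go to the database layer's prepare/execute calls.
```

A couple hundred round trips to the database become one, and the values still travel as parameters rather than as part of the SQL string.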

Very few programmers hit all of these. My biggest weak area is the reliability one–after reviewing other people’s code, I find a lot less exception handling in my code. We’ve all got something to learn. But reviewing other people’s code can help you spot weaknesses in your own, and develop a much stronger sense of how to do it right.

[Edit: Adding links to more detailed posts as I publish them]