Web Development Software
10/7/05
<< 72 Squared Weblog

Database connection pooling in PHP

It's all about the numbers. Take 150 webservers. Each of those webservers are processing about 60 page requests per second. On average, each page load requires anywhere from 3 to 30 queries, depending on the complexity of the page.

All of this starts to add up. Now, distribute your data out across a farm of database services --- say around 20 different databases. In PHP's share-nothing architecture, persistent connections to your databases simply aren't efficient. I have crunched the numbers a hundred different ways. If all of my webservers used persistent connections, that would mean as many as 40 connections to each database from each webserver. multiply by hundreds, and you see what i mean.
So as an alternative, we carefully open and close connections to the databases on servicing each page request. This has worked well for us so far, believe it or not. Our database servers can actually keep up, and at any one time we may only have couple hundred processes open on our databases at any given time. The whole trick was to release the connection as soon as possible. It may sound counter-intuitive, but we bench-marked it both ways, and non-persistent connections were way faster than persistent connections.
But this kind of approach has it's trade-offs. We found that we could get greater performance out of our databases by dividing them up into discreet sets of services. Unfortunately, this means that the average page load may need to open more and more connections to databases to do its processing. I found myself trying to find a way to get true database connection pooling while still maintaining the integrity of PHP's share nothing architecture.
The answer? A connection pooling server!
Ok, call me crazy. But it works. What i did was built a farm of database proxy servers that maintain persistent connections to a selected pool of databases. Then instead of connecting directly to the database and executing my query, i pass the query through the socket server that accepts my query and passes it off to the correct database. The database processes my request and returns its response. In turn, the proxy server passes that result back to the webserver and moves on to handle the next incoming request. Now, my webservers never need to open a connection with a multitude of databases ... instead they connect to a random database proxy server that will relay all incoming data requests to the database and return the response.
Taking this approach requires a well developed database abstraction layer in place. Fortunately, such an infrastructure was already in place. I was able to place my new theory into production in only a day or so of work. The result? Our databases only had to handle a few connections and passed all of the information through these processes. The overall speed of the queries increased without the overhead of establishing multiple connections to datasources, and our total connections to the database dropped to a mere fraction of what it was before while still maintaining the same volume of data.
The results astounded me too ... I thought, hey, haven't I just shifted my bottle-neck from the database to a connection pool server? However, the trick is that now my database can answer requests faster since it does not have to deal with the problem of managing a huge range of clients ... it only has to deal with the actual data. The socket server can make full use out of it's database connections as well, since it is not doing any real processing either besides channeling requests to the appropriate targets and deliviering the results back to the webserver. In addition, the webserver does not have to wait around to connect to a bunch of different databases.
Think about it this way ...  if i open a connection to the database on a page load, i may execute one or two queries against it, and then i am done with it. However, i am not always free to destroy my connection to the database immediately since i may need to run one query against one database and then connect to a different database before I run another query against the first database. This means that the connection is not being used to its full potential. But if I put that database connection on my proxy server, the connection is free to immediately do another task as soon as it finishes the previous query, maximizing the use I get out of it.
I know this solution may sound a lot like very simplified version of Web Services, and my original idea may have been born out of reading some ideas for web services, but I think my connection pooling solution is really a different beast altogether. The connection pool server has no real logic in it. All it does is act as a manager of a bunch of pooled database connections.
Who knows if this will be an actual long term solution for anything, but it certainly looks promising right now. I will let you know well it performs once our site traffic doubles and triples again.