10/7/05
<< 72 Squared WeblogDatabase connection pooling in PHP
It's all about the numbers. Take 150 webservers. Each of those webservers are processing about 60 page requests per second. On average, each page load requires anywhere from 3 to 30 queries, depending on the complexity of the page.
All of this starts to add up. Now, distribute your data out across a
farm of database services --- say around 20 different databases. In
PHP's share-nothing architecture, persistent connections to your
databases simply aren't efficient. I have crunched the numbers a
hundred different ways. If all of my webservers used persistent
connections, that would mean as many as 40 connections to each database
from each webserver. multiply by hundreds, and you see what i mean.
So as an alternative, we carefully open and close connections to the
databases on servicing each page request. This has worked well for us
so far, believe it or not. Our database servers can actually keep up,
and at any one time we may only have couple hundred processes open on
our databases at any given time. The whole trick was to release the
connection as soon as possible. It may sound counter-intuitive, but we
bench-marked it both ways, and non-persistent connections were way
faster than persistent connections.
But this kind of approach has it's trade-offs. We found that we could
get greater performance out of our databases by dividing them up into
discreet sets of services. Unfortunately, this means that the average
page load may need to open more and more connections to databases to do
its processing. I found myself trying to find a way to get true
database connection pooling while still maintaining the integrity of
PHP's share nothing architecture.
The answer? A connection pooling server!
Ok, call me crazy. But it works. What i did was built a farm of
database proxy servers that maintain persistent connections to a
selected pool of databases. Then instead of connecting directly to the
database and executing my query, i pass the query through the socket
server that accepts my query and passes it off to the correct database.
The database processes my request and returns its response. In turn,
the proxy server passes that result back to the webserver and moves on
to handle the next incoming request. Now, my webservers never need to
open a connection with a multitude of databases ... instead they
connect to a random database proxy server that will relay all incoming
data requests to the database and return the response.
Taking this approach requires a well developed database abstraction
layer in place. Fortunately, such an infrastructure was already in
place. I was able to place my new theory into production in only a day
or so of work. The result? Our databases only had to handle a few
connections and passed all of the information through these processes.
The overall speed of the queries increased without the overhead of
establishing multiple connections to datasources, and our total
connections to the database dropped to a mere fraction of what it was
before while still maintaining the same volume of data.
The results astounded me too ... I thought, hey, haven't I just shifted
my bottle-neck from the database to a connection pool server? However,
the trick is that now my database can answer requests faster since it
does not have to deal with the problem of managing a huge range of
clients ... it only has to deal with the actual data. The socket server
can make full use out of it's database connections as well, since it is
not doing any real processing either besides channeling requests to the
appropriate targets and deliviering the results back to the webserver.
In addition, the webserver does not have to wait around to connect to a
bunch of different databases.
Think about it this way ... if i open a connection to the
database on a page load, i may execute one or two queries against it,
and then i am done with it. However, i am not always free to destroy my
connection to the database immediately since i may need to run one
query against one database and then connect to a different database
before I run another query against the first database. This means that
the connection is not being used to its full potential. But if I put
that database connection on my proxy server, the connection is free to
immediately do another task as soon as it finishes the previous query,
maximizing the use I get out of it.
I know this solution may sound a lot like very simplified version of
Web Services, and my original idea may have been born out of reading
some ideas for web services, but I think my connection pooling solution
is really a different beast altogether. The connection pool server has
no real logic in it. All it does is act as a manager of a bunch of
pooled database connections.
Who knows if this will be an actual long term solution for anything,
but it certainly looks promising right now. I will let you know well it
performs once our site traffic doubles and triples again.