How to serve 800 million users without database sharding?

Hello! If you are reading this blog, I probably don't need to explain what ChatGPT is. But the numbers behind it are impressive even to old hands in the industry. We are talking about over 800 million users per week. Now imagine the RPS (requests per second) that this generates, and add to that the requirement for ultra-low latency, because no one likes to stare at a blinking cursor.

How does OpenAI manage this? Do they have some kind of magical, secret technology? Well, no. It turns out that the secret to their success lies in the extreme optimisation of proven solutions and a few very bold architectural decisions that would give many architects a heart attack. Today, we'll take a look under the hood of this monster. OpenAI published an engineering write-up on the subject, and I want to share the highlights.

Hybrid cloud and inference pipeline

At the very top of the infrastructure, we have an interesting hybrid approach. OpenAI does not put all its eggs in one basket. Their system runs on a distributed cloud infrastructure, using Microsoft Azure resources mainly for databases and AWS for heavy computing.

When a user enters a prompt, magic happens in what is called the 'Inference Pipeline'. The input is tokenised, packaged together with the context of the entire conversation, and sent to the servers running the model. What you see as a response is a stream of new tokens generated on the fly. The key here is the use of WebSockets, which allow the response to reach the client in real time, giving that characteristic 'live typing' feel.
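The shape of that pipeline can be sketched in a few lines of Python. The tokeniser and "model" below are trivial stand-ins (real systems use BPE tokenisers and GPU inference servers, and the transport is a WebSocket rather than a plain generator), but the flow is the same: tokenise, attach the conversation context, stream tokens back as they are produced.

```python
def tokenize(prompt: str) -> list[str]:
    # Stand-in for a real tokeniser (production systems use BPE).
    return prompt.split()

def generate(context: list[str]):
    # Stand-in for the model: yields one token at a time,
    # the way an inference server streams its output.
    for token in context:
        yield token

def stream_response(history: list[str], prompt: str):
    # Package the prompt with the whole conversation context,
    # then forward tokens to the client as they are produced
    # (in production, pushed over a WebSocket).
    context = tokenize(" ".join(history + [prompt]))
    yield from generate(context)
```

Because `stream_response` is a generator, the client can render each token the moment it arrives, which is exactly what produces the 'live typing' effect.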

A monolith that works: single-primary PostgreSQL

This is where it gets really interesting. At this scale, any engineer's first thought is 'database sharding': split the data into pieces before the load crushes it. OpenAI went in a completely different direction. Instead of splitting the database into fragments, they focused on deeply optimising a single, powerful PostgreSQL instance.

Yes, you read that right. The main database supporting ChatGPT and the API is undivided (single primary) and handles all writes. How does it not collapse? With a clever trick: the most intensive write operations that are not critical to the main flow are offloaded to specialised database systems, such as Azure Cosmos DB. The main Postgres simply carries a much lighter load.
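The routing decision itself is simple; here is a minimal sketch, with in-memory lists standing in for both backends (the `WriteRouter` class and `critical` flag are illustrative names, not OpenAI's actual API):

```python
class WriteRouter:
    """Send writes the main flow depends on to the single-primary
    Postgres, and high-volume, non-critical writes to a specialised
    store (Cosmos DB in OpenAI's case)."""

    def __init__(self):
        self.primary = []  # stand-in for the single-primary Postgres
        self.offload = []  # stand-in for Cosmos DB or similar

    def write(self, record: dict, critical: bool) -> None:
        # Only critical writes touch the primary; everything else
        # goes to the offload store, keeping the primary's load light.
        (self.primary if critical else self.offload).append(record)
```

The point of the pattern is that the primary's write throughput is spent only where strong consistency actually matters.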

Of course, high availability has been taken care of. The primary runs with a hot-standby replica that is kept virtually fully synchronised. If the primary node fails, the replica immediately takes over its role. Importantly, a primary failure only blocks writes: users' existing data is protected, and reads can continue on the other replicas.

Combating 'thundering herd' and caching

With a monolith, the database is sacred and must be protected at all costs. That is why most reads, such as fetching conversation history or user settings, do not touch Postgres at all. They are served from cache, probably Redis or a similar solution.

But caching on this scale creates a problem known as 'Thundering Herd'. Imagine that a very popular key in the cache expires. In the same fraction of a second, thousands of queries realise that the data is not in Redis and all hit the database at once to retrieve it. This is a simple recipe for killing the database.

OpenAI's solution is per-key locking in the cache. When multiple requests hit a 'cache miss' at the same time, only one of them acquires the lock and the right to fetch the data from the database. The rest politely wait for the cache to be refilled.
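This pattern is often called 'single flight' or request coalescing. The in-process Python sketch below is only an illustration of the idea; at ChatGPT's scale the lock lives in the distributed cache itself (for example a Redis `SET ... NX` key), not in application memory:

```python
import threading

class SingleFlightCache:
    """On a cache miss, only the first caller acquires the per-key
    lock and runs the expensive loader (e.g. a Postgres query);
    concurrent callers block on the same lock, then read the value
    the first caller just cached."""

    def __init__(self, loader):
        self.loader = loader
        self.cache = {}
        self.locks = {}
        self.guard = threading.Lock()  # protects the locks dict itself

    def get(self, key):
        if key in self.cache:          # fast path: cache hit
            return self.cache[key]
        with self.guard:
            lock = self.locks.setdefault(key, threading.Lock())
        with lock:
            if key not in self.cache:  # re-check after waiting
                self.cache[key] = self.loader(key)
            return self.cache[key]
```

However many requests stampede in at once, the database sees exactly one query per key, which is precisely what defuses the thundering herd.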

Read scaling and cascading replication

Since we have one master for writes, how do we handle those millions of reads per second? This is where good old replication comes in. OpenAI maintains nearly 50 geographically distributed read-only replicas of the PostgreSQL database. This allows queries to be directed to the nearest copy, reducing latency and ensuring global availability.

An interesting detail is how they deal with the load on the primary, which has to feed all these replicas with transaction logs (WAL). They are testing a solution called 'cascading replication': some of the replicas serve as intermediaries that relay the stream to subsequent layers of replicas. The primary then does not have to maintain a direct connection to each of the 50 copies, which opens the way to hundreds of instances without overloading it.
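The arithmetic of the fan-out is easy to sketch. Assuming each node relays WAL to at most `fanout` downstream replicas (the function and numbers below are illustrative, not OpenAI's actual topology), the primary's direct connections stay constant while capacity multiplies per layer:

```python
def cascade_layers(n_replicas: int, fanout: int) -> list[list[int]]:
    # Arrange replica IDs into layers: the primary feeds only the
    # first layer; each replica there relays WAL further down.
    layers, remaining = [], list(range(n_replicas))
    feeders = 1  # layer 0 is fed by the primary alone
    while remaining:
        layer = remaining[:feeders * fanout]
        remaining = remaining[len(layer):]
        layers.append(layer)
        feeders = len(layer)
    return layers

# With 50 replicas and a fan-out of 4, the layers hold 4, 16 and 30
# replicas: the primary streams WAL to only 4 of them directly.
```

Adding a fourth layer would already cover hundreds of replicas, all while the primary still pays for just four replication connections.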

Connection pooling is essential

Finally, an infrastructure tidbit that shows how important the fundamentals are. At this scale, opening a new connection to the database for each query is suicide. OpenAI uses extensive connection pools running on Kubernetes for each database replica (probably something like PgBouncer in a sidecar architecture).

The effects are dramatic. Thanks to pooling, the average time to establish a connection to the database has dropped from 50 ms to just 5 ms. It is these 'minor' optimisations, multiplied by millions of queries, that give the final effect of smooth operation.
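The idea fits in a few lines. The numbers in the comments come from the article; the pool itself is a toy stand-in for something like PgBouncer:

```python
import queue

class ConnectionPool:
    """Keep a fixed set of pre-opened connections and hand them out,
    so queries skip the per-connection TCP + auth handshake
    (~50 ms for a fresh connection vs ~5 ms pooled, per the article)."""

    def __init__(self, open_conn, size: int = 4):
        self.idle = queue.Queue()
        for _ in range(size):
            self.idle.put(open_conn())  # pay the handshake cost once

    def acquire(self):
        # Blocks if every connection is checked out: built-in
        # back-pressure instead of overwhelming the database.
        return self.idle.get()

    def release(self, conn) -> None:
        self.idle.put(conn)
```

A fixed-size pool also caps the number of backend processes the database has to manage, which matters for Postgres in particular, since every server connection is a separate OS process.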

The ChatGPT architecture is proof that you don't always need the latest, shiny NoSQL toys to achieve global scale. Sometimes it's enough to understand good old Postgres very well, surround it with a solid cache, and manage traffic wisely.

Happy postgres-ing!