OpenAI recently shared insights into its database architecture, revealing how it serves roughly 800 million users. Despite the immense scale, the backbone of ChatGPT remains a single PostgreSQL cluster, though it has undergone significant architectural shifts to sustain such demand.
The "Single Primary" Strategy
OpenAI utilizes Azure Database for PostgreSQL (Flexible Server). Surprisingly, they rely on a single primary node for all write operations, supported by nearly 50 read replicas distributed globally. Rather than implementing complex database sharding, the team chose to optimize their current infrastructure through several strategic moves:
Offloading High-Write Traffic: PostgreSQL’s Multiversion Concurrency Control (MVCC) degrades under heavy concurrent writes, because every UPDATE creates a new row version and leaves dead tuples for autovacuum to clean up. To mitigate this, OpenAI migrated high-write tables to Azure Cosmos DB, which handles sharding natively. No new tables are being added to PostgreSQL, and existing heavy-load tables are gradually being moved to Cosmos DB.
Read-Write Separation: The primary node is reserved almost exclusively for writes. Only reads essential for write transactions are permitted on the primary. For high availability, a hot standby is maintained at all times to take over in case of failure.
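The read-write split described above can be sketched at the application layer. This is a minimal illustration, not OpenAI's actual code: the DSN strings and the keyword-based write-detection heuristic are assumptions for the example.

```python
import random

class ReadWriteRouter:
    """Illustrative router: writes go to the single primary, reads to a replica."""
    # Hypothetical heuristic: classify a statement by its first keyword.
    WRITE_KEYWORDS = {"insert", "update", "delete", "create", "alter", "drop"}

    def __init__(self, primary_dsn, replica_dsns):
        self.primary_dsn = primary_dsn        # assumed connection strings
        self.replica_dsns = replica_dsns

    def target_for(self, sql):
        first_word = sql.lstrip().split(None, 1)[0].lower()
        if first_word in self.WRITE_KEYWORDS:
            return self.primary_dsn           # all writes hit the primary
        return random.choice(self.replica_dsns)  # reads spread across replicas
```

In practice a real router would also pin reads-after-writes to the primary within a transaction, matching the rule that only reads essential to a write transaction touch the primary.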
Solving Query Complexity & Connection Bottlenecks
The team found that complex SQL joins and inefficient queries often originated from Object-Relational Mapping (ORM) frameworks. To combat this, they implemented:
Strict SQL Reviews: Every query generated is reviewed to ensure efficiency.
Isolated High-Load Queries: Extremely heavy queries are offloaded to dedicated instances.
PgBouncer Implementation: Since their read replicas support only 6,000 concurrent connections, PgBouncer is used as a connection pooler to manage thousands of application requests efficiently.
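A PgBouncer setup along these lines might look like the sketch below. All values are illustrative assumptions, not OpenAI's configuration; the key idea is transaction-level pooling, which lets many thousands of client connections share a small pool of real PostgreSQL backends.

```ini
[databases]
; illustrative entry: route the "chat" database through the pooler
chat = host=replica-1.example.internal port=5432 dbname=chat

[pgbouncer]
listen_port = 6432
auth_type = scram-sha-256
; transaction pooling: a server connection is held only for one transaction
pool_mode = transaction
max_client_conn = 20000    ; client connections the pooler will accept
default_pool_size = 100    ; actual PostgreSQL connections per database/user
```

With settings in this spirit, the application can open far more connections than the replicas could ever serve directly, while PostgreSQL itself sees only a bounded number of backends.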
Advanced Performance Tactics
Cache Locking (Request Coalescing): If multiple application instances request the same cached data simultaneously, only one actual database query is executed.
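Request coalescing can be sketched with a per-key lock: the first caller to miss the cache performs the real load, and concurrent callers for the same key wait and reuse its result. The class and names below are illustrative, not OpenAI's implementation.

```python
import threading

class CoalescingCache:
    """Collapse concurrent cache misses for the same key into one load."""

    def __init__(self, loader):
        self._loader = loader            # function that runs the real DB query
        self._cache = {}
        self._locks = {}                 # per-key locks guarding in-flight loads
        self._guard = threading.Lock()   # protects the two dicts above

    def get(self, key):
        with self._guard:
            if key in self._cache:
                return self._cache[key]
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:                       # only one caller per key gets past here
            with self._guard:
                if key in self._cache:   # a concurrent caller already loaded it
                    return self._cache[key]
            value = self._loader(key)
            with self._guard:
                self._cache[key] = value
                self._locks.pop(key, None)
            return value
```

However many application threads ask for the same key at once, the loader runs a single time; everyone else waits briefly and reads the cached value.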
Cascading Replication: To prevent the primary from being overwhelmed by replication traffic, read replicas sync data in a "daisy-chain" fashion rather than all pulling from the primary at once.
Strict Schema Management: Schema migrations are restricted to those that can be completed within 5 seconds to prevent long-duration locks.
Granular Rate Limiting: Every query type is controlled by a detailed rate-limiting policy.
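Per-query-type rate limiting can be sketched as a token bucket keyed by query type. The limiter below is a minimal illustration under assumed names and limits, not OpenAI's actual policy engine.

```python
import threading
import time

class QueryRateLimiter:
    """Token-bucket limiter keyed by query type (e.g. 'heavy_report')."""

    def __init__(self, limits):
        # limits: {query_type: (bucket_capacity, tokens_refilled_per_second)}
        self._limits = limits
        self._state = {q: [cap, time.monotonic()]
                       for q, (cap, _) in limits.items()}
        self._lock = threading.Lock()

    def allow(self, query_type):
        cap, rate = self._limits[query_type]
        with self._lock:
            tokens, last = self._state[query_type]
            now = time.monotonic()
            tokens = min(cap, tokens + (now - last) * rate)  # refill
            if tokens >= 1:
                self._state[query_type] = [tokens - 1, now]  # spend a token
                return True
            self._state[query_type] = [tokens, now]
            return False
```

A caller checks `allow("heavy_report")` before dispatching the query and rejects or queues the request when it returns False, so one noisy query type cannot starve the rest.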
The Result
Despite the complexity, the system delivers incredible performance:
99th Percentile Latency: Under 100ms.
Reliability: 99.999% uptime.
Incidents: Only one SEV-0 (Critical Severity) outage to date.
In effect, OpenAI applies the principle of "Polyglot Persistence": using multiple database types, each matched to the task. PostgreSQL (relational) stores metadata and complex structured data, while Cosmos DB (distributed NoSQL) stores massive volumes of high-velocity, write-heavy data.
Database-level sharding often brings management complexity and awkward cross-shard joins. OpenAI instead practices "application-level sharding": the application separates the data and routes write-heavy portions to Cosmos DB, which is easier to manage in the long run.
Achieving 99.999% uptime comes not from perfect code but from sophisticated monitoring. The OpenAI team monitors lock contention and wait events in PostgreSQL in real time, allowing them to catch degradations before they escalate into SEV-0 failures.
Using PgBouncer on Azure Flexible Server significantly reduces the overhead of forking a new backend process for every connection, a major bottleneck when supporting millions of concurrent users.