Mysterious Duplicate Key Error
One thing I like about my role at Clustrix is that it lets me work with real customer use cases, understand each deployment architecture, and learn something new and interesting from every engagement.
I recently came across an interesting use case where the problem was reported as a replication error in a master-master deployment. This deployment uses two clusters as masters in row-based replication (RBR) mode, both actively accepting writes from an application hosted in two geographically dispersed data centers. Based on IP routing logic, the application sends requests to the respective cluster. To accommodate writes on both sides, the application generates unique key values so that primary key ids never collide. Things worked great for many months, and then all of a sudden replication started failing with a duplicate key error.
On initial investigation, I found that quite a few large DELETE and UPDATE transactions, involving millions of rows, had been executed on one of the masters. These multi-row DML statements were run by an admin application to bring business data up to date. Since this customer uses RBR as the default replication mode, those statements generated a huge number of binlog events. This binlog inflation, coupled with online write transactions on the slave side, slowed the replication stream and resulted in slave lag.
RBR Inefficiencies, but the Duplicate Key Error Mystery is Unsolved…
The flood of binlog events explained the slave delay, but it was not directly related to the duplicate key error that was stopping the slave. So, the riddle was still unsolved!
I examined all the relevant binlogs to understand what could be going wrong and found that the application had been inserting data into the user domain table on both masters, generating a unique primary key for each user; the user column also had a unique key constraint defined on the base table. I looked up the DDL for the table in question, and it looked perfect, just like any typical domain entity: the primary key is a synthetic auto-increment value, the domain field carries a unique key, and a few other indices cover other important columns.
It may seem logical to assume that both masters should never be writing for the same user, but there are use cases where this is possible, e.g., a user accessing their information from a mobile device as well as a desktop. Depending on the network service provider and location, the same user can be redirected to both masters, particularly when redirection is based on a GTM DNS resolver. In other words, the application does not control which cluster it writes to, and in this particular case it managed to insert data for the same user on both clusters, breaking replication.
But now the bigger question was: why would the application insert on both sides for the same user? And even if it did, why would replication stop now and not earlier for other users? Could it be that other users never used the application from two different devices? Questions kept trickling in. So, I started parsing earlier binlogs for the table(s) in question and traced records for one user from the previous day, when there were no replication errors.
For those earlier entries, I noted that there was never a case where both masters executed INSERT statements; mostly they were UPDATE statements for user data that already existed. I hypothesized that the application had built-in upsert logic for existing user data: it checks whether the user's data already exists and prepares an UPDATE or INSERT statement based on that check. I subsequently obtained confirmation of this hypothesis from the customer's engineering team.
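A minimal sketch of this check-then-write ("upsert") pattern, using Python's sqlite3 as a stand-in database; the table and column names here are illustrative assumptions, not the customer's actual schema:

```python
import sqlite3

def upsert_user(conn, username, email):
    """Check-then-write upsert: SELECT first, then INSERT or UPDATE."""
    row = conn.execute(
        "SELECT id FROM users WHERE username = ?", (username,)
    ).fetchone()
    if row is None:
        # No local row found -> the application issues an INSERT.
        conn.execute(
            "INSERT INTO users (username, email) VALUES (?, ?)",
            (username, email),
        )
    else:
        # Row already present -> the application issues an UPDATE.
        conn.execute(
            "UPDATE users SET email = ? WHERE username = ?",
            (email, username),
        )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " username TEXT UNIQUE, email TEXT)"
)
upsert_user(conn, "jdoe", "jdoe@example.com")  # first call inserts
upsert_user(conn, "jdoe", "new@example.com")   # second call updates
```

The crucial property of this pattern is that the existence check only consults the local database, which is exactly what makes it fragile under replication lag.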
Finally, the Tricky Issue Is Found
This clarified why replication was now stopping with duplicate key errors. The user's data was inserted on one master and, due to the delay in the replication stream, had not yet been applied on the other master. The application's existence check on the other master therefore reported the user as missing, so the application issued an INSERT with all the relevant details instead of an UPDATE. When each slave tried to apply the INSERT executed on the other side, it hit a duplicate key error on the unique key defined on the user column.
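The race can be reproduced in miniature: two independent databases stand in for the two masters, and "replication lag" simply means neither has yet applied the other's row when the application runs its existence check. Names and ids below are made up for illustration:

```python
import sqlite3

def make_master():
    # Each "master" is an independent database; a unique key on the
    # username column mirrors the constraint from the customer's table.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY,"
        " username TEXT UNIQUE, email TEXT)"
    )
    return conn

master_a, master_b = make_master(), make_master()

# The same user hits both masters. The existence check finds no row on
# either side (replication is lagging), so BOTH issue an INSERT.
# Primary keys are application-generated and unique, so they don't clash.
master_a.execute("INSERT INTO users VALUES (101, 'jdoe', 'a@example.com')")
master_b.execute("INSERT INTO users VALUES (202, 'jdoe', 'b@example.com')")

# When replication later tries to apply master A's INSERT on master B,
# it collides with the unique key on the username column.
try:
    master_b.execute(
        "INSERT INTO users VALUES (101, 'jdoe', 'a@example.com')"
    )
    hit_duplicate_key = False
except sqlite3.IntegrityError:
    hit_duplicate_key = True  # the "mysterious" duplicate key error
```

The primary keys never collide, exactly as the application designers intended; it is the secondary unique key on the user column that replication trips over.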
What we saw here is that the application expects all data to be available on both masters at all times and does not account for replication lag. A lagging slave is not much of an issue in a master-slave deployment, where slave-side applications (typically read queries) are tolerant of delays. In a master-master deployment, however, if the application is not made tolerant of slave delays, there is a very high chance of errors like these whenever replication falls behind. More often than not, such errors are skipped without investigating the root cause, just to keep production slaves caught up; skipping the transaction on the slave keeps replication going, but it introduces data mismatches between master and slave.
For RBR mode of replication, and especially in a master-master topology, it is extremely important to review all functionality that entails bulk data processing. The best-case scenario is to identify this kind of functionality and use the session-level binlog_format variable to switch to statement-based replication (SBR) before issuing such bulk DML statements. Setting binlog_format alone may not be sufficient when both sides are active, since the statement will take time to complete on both sides, impacting OLTP performance on busy tables. It is usually good practice to break these statements into smaller chunks and then execute them under SBR mode. Executing them during lean hours, outside of the replication stream, is another approach that many enterprises implement.
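One way to sketch the chunking idea, again with sqlite3 standing in for the real cluster (table name, column names, and batch size are all illustrative; on MySQL you would additionally issue SET SESSION binlog_format = 'STATEMENT' before the batches, which the sqlite3 stand-in has no equivalent for):

```python
import sqlite3

def chunked_delete(conn, cutoff, batch_size=1000):
    """Delete rows matching a predicate in small, separately committed
    batches, so no single transaction floods the binlog."""
    while True:
        ids = [r[0] for r in conn.execute(
            "SELECT id FROM events WHERE created < ? LIMIT ?",
            (cutoff, batch_size),
        )]
        if not ids:
            break
        conn.executemany(
            "DELETE FROM events WHERE id = ?", [(i,) for i in ids]
        )
        conn.commit()  # commit per batch -> one small event group each

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, created INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(i, i % 10) for i in range(5000)],
)
chunked_delete(conn, cutoff=5, batch_size=500)
```

Each batch replicates and applies quickly, so the slave is never handed one multi-million-row transaction that stalls the stream for everything behind it.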
Enterprises are adopting master-master topologies to maximize hardware capacity and to enable business continuity. However, if design and operational considerations like these are not addressed during implementation, production systems can suffer data anomalies that are extremely difficult to identify and resolve.