Wednesday, November 1, 2017

Replication error...yes...transaction replication is alive and well!

Oh boy this was a fun one! So where I work now we have transactional replication set up between an OLTP and DataWarehouse database, in this case spread across two instances but in many cases on the same instance.

I get an urgent call in the middle of the night that client X can't run a report, it's returning 0's instead of information. Yes, this call came at 11:40 PM, yes it has to be fixed right now. *sigh*. OK, it IS what I signed up for. Anyways! Now I've seen this error before when replication is down, so easy fix...let's go look at the Replication Monitor. Hmm...it's up...all green. Let's reinitialize it from a fresh snapshot, shouldn't take too long.

Everything looks green in the Repl monitor.

Re-run the report.

Same thing. What?

Let's dig deeper. 

I didn't see anything in the monitor:




(dummy screenshot from http://www.sqlservercentral.com/articles/Stairway+Series/72451/ to protect the innocent)

I needed to dig in deeper to see the error; it wasn't until I got into the LogReader Agent Properties that I found the issue (open your server in Replication Monitor, go to your publication, right-click on LogReader Agent and then to Properties). Here is what I see on the error details:







What IS this?

So I dug around online to see what I could find to figure out exactly what the issue is. This is what I came up with:

USE distribution
GO

DECLARE @SeqNo varbinary(16),
  @SeqNoChar nchar(30);

SET @SeqNo = 0x000D12E200003C74000700000000;
SET @SeqNoChar = '0x000D12E200003C74000700000000';

SELECT *
FROM dbo.MSarticles
WHERE article_id IN (SELECT article_id
  FROM MSrepl_commands
  WHERE xact_seqno = @SeqNo)

EXEC sp_browsereplcmds
@xact_seqno_start = @SeqNoChar,
@xact_seqno_end = @SeqNoChar

What?!? Well, take the Transaction Sequence # from the error and plug it into the script, this will show you what wasn't able to be transferred over. At this point you can do the following (from https://support.microsoft.com/en-us/help/3066750/how-to-troubleshoot-error-20598-the-row-was-not-found-at-the-subscribe):


  • Manually insert the missing row at the subscriber. This may enable the Distribution Agent to retry the failed command and move forward with the replication. 

    Note There may be other rows that are missing and that have to be manually inserted at the subscriber if there are more failures.
  • Instruct the Distribution Agent to skip this error and continue to replicate the rest of the changes. The Distribution Agent accepts the skiperrors parameter. You can use this parameter to pass error code 20598. This may keep the replication setup intact while you wait for an opportunity to manually synchronize the missing rows. 
As it was very late (early?) at this point, the error had stopped replication for over 3 days (and there was no guarantee that inserting the record would allow replication to catch up), and they needed their reports run (*sigh*) replication was rebuilt. From scratch. Dropped all publishers and subscribers, dropped the Distribution DB, and set it up fresh. All is well now, although I missed out on some much-needed beauty sleep!

No comments:

Post a Comment