Azure SQL Database replication - a 9 1/2 hour delay + can't remove the replica
Note
This is getting quite long; I will try to re-edit parts through the day.
- These databases are no longer active, which means I can play with them to work out what went wrong.
The one thing left to answer: given two databases running on Azure SQL Database at S3 (100 DTU), should the secondary ever fall this far behind the primary database while the DTU is hammered at 100% for over half a day? The reason the DTU was being hammered was mostly IO writes.
The start: a few problems.
DTU limits were hit on Monday, Tuesday and, to an extent, Wednesday for a significant amount of time: roughly 3pm UTC - 6am UTC.
Problem 1 (lag in data on the secondary): this appeared to have caused a data lag on the secondary of 9 1/2 hours. The servers were being spammed with updates, causing a lot of IO writes: 6-8 million records on one table in a 24-hour period, for example. This problem is what drove the reason for this post:
Shouldn't these be more in sync?
The data became out of sync on Monday morning and remained out of sync until Friday. On Thursday new databases started to replace these Standard SQL databases, and these were left to rot. Well, left for me to experiment with at least.
The application causing the redundant queries couldn't be fixed for a few days (I'm doubting it will ever be fixed now), which left changing the instance type. That action was attempted on the current instance but an instance must disconnect its Standard replicas before the performance tier can be increased. That led to the second problem (see below): the replica took a long time to be removed. It began on Wednesday morning and did not complete until Friday.
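For reference, a sketch of the T-SQL equivalent of those two steps (my addition; run in the master database of the primary server, with the secondary server name as a placeholder):

-- Disconnect the geo-replicated secondary first (the step that took two days here):
ALTER DATABASE [cloud4] REMOVE SECONDARY ON SERVER [secondary-server-name];

-- Only once the secondary is gone can the primary be bumped from S3 to P2:
ALTER DATABASE [cloud4] MODIFY (SERVICE_OBJECTIVE = 'P2');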
Problem 2 (can't remove the replica):
(solved after 2 days)
Disconnecting the secondary database was a process that began at ~8:00 UTC on Wednesday (when the primary was at 80 GB). The secondary was 11 GB behind in size at that point.
Setup
The databases (primary and secondary) are S3 and geo-replicated (North + West Europe).
There is an audit log table (which I read from the secondary via a SQL query) that was 9 1/2 hours behind the last entry in the primary database. Running the query again on the secondary a few seconds later showed it was not catching up; it appears to be a relative refresh rather than playing catch-up.
Both the primary and secondary (read-only) databases are S3 (about to be bumped to P2).
The Azure documentation states:
Active Geo-Replication (readable secondaries) is now available for all databases in all service tiers. In April 2017 the non-readable secondary type will be retired, and existing non-readable databases will automatically be upgraded to readable secondaries.
How has the secondary got this far behind? Seconds to minutes would be acceptable; hours, not so much. The link above describes it only as "slightly":
While at any given point the secondary database might be slightly behind the primary database, the secondary data is guaranteed to be transactionally consistent with changes committed to the primary database.
Given that the secondary was destroyed and replaced with a higher level (you need to remove replicas when upgrading Standard -> Premium), I'm curious to know if this could happen again and what the definition of "slightly" might be in this instance.
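One way to put a number on "slightly" (something I did not dig into at the time, so this is a sketch rather than what I actually ran) is the continuous-copy DMV, queried against the secondary:

-- Run against the secondary database; replication_lag_sec and last_replication
-- show how far behind the continuous copy currently is.
SELECT partner_server,
       partner_database,
       last_replication,
       replication_lag_sec,
       replication_state_desc
FROM sys.dm_continuous_copy_status;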
Notes: the primary did reach its maximum DTU for a few hours, which didn't harm performance significantly beyond the 9-hour difference that was noticed.
Stats
Update for TheGameiswar: I can't query it right now as I have started removing the replica (to be able to move the primary to the P2 level; this began hours ago at ~8:30 UTC and 5 hours later it is still going). I think it's quite broken now.
The query - nothing special:

SELECT TOP 1000 [id]
      ,[datecreated]
      ,[srcid]
      ,[hardwareid]
      ,[ip_port]
      ,[action]
      ,[responsetime]
      ,[errorcode]
      ,[extrainfo]
FROM [dbo].[audit]
ORDER BY datecreated DESC
I can't compare the tables anymore as it's quite stuck and refusing connections.
The 586 hours (10-14 GB) is the combined time of the inserts into the primary database audit table. It was similar yesterday when I noticed the 9 1/2 hour difference in the data.
When I attempted to remove the replica (the process another person has described), there was a 10 GB difference in size.
I can't compare the data, but I can show the DB size at the equivalent time.
Primary DB size (last 24 hours):
Secondary DB size (last 24 hours):
Primary database size - week view
Secondary database size - week view
As mentioned, the replica is being removed... it is still playing catch-up on DB size if you observe the charts above.
Stop replication errored: ServerName: ---------, DatabaseName: cloud4, ErrorCode: 400, ErrorMessage: A terminate operation is already in progress for database 'cloud4'.
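With hindsight, the stuck operation behind that 400 could have been inspected from the server's master database; a sketch of what I would query now (not what I ran at the time):

-- Run in the master database of the logical server; lists management operations
-- (such as the pending terminate of the continuous copy) and their progress.
SELECT operation,
       state_desc,
       percent_complete,
       start_time,
       last_modify_time,
       error_desc
FROM sys.dm_operation_status
WHERE major_resource_id = 'cloud4'
ORDER BY start_time DESC;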
Update 2 - replication - dm_continuous_copy_status
The replication is still being removed... moving on...
SELECT * FROM sys.dm_continuous_copy_status
sys.dm_exec_requests
Querying this on Thursday appears quite empty, the record being:
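(For completeness, a sketch of how the active requests can be pulled alongside their SQL text; my reconstruction rather than the exact query used:)

-- Active requests plus the statement text behind each one.
SELECT r.session_id,
       r.status,
       r.command,
       r.wait_type,
       r.percent_complete,
       t.text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t;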
Replica removed at last.
The replica has finally been removed after 2 days, at the 80 GB mark as predicted. It waited to replay the data in the transactions (up to the point at which I asked to remove the replica) before actually removing the replica.
A week after, on the P2 databases
DTU is holding between 20-40% at busy periods while performing ~12 million data entries every 24 hours (a similar amount of reads, with the writing being worse because of the indexes on the table). That is 70-100% more inserts than the week before. This time the replica is not struggling to keep up, due to not reaching 100% DTU.
Conclusion
The replicas are useful, but not in this case. This one caused degraded performance for several days that could have been averted by a simple increase in performance tier until the cause of the problem was fixed. If a replica looks to be dragging behind and you are on the border of Basic -> Standard or Standard -> Premium, it is safest to remove the replica as soon as possible and increase to a different tier.
Now on P2. The database is increasing at 20 GB a day... and we have yet to fix the problem that sends 15 thousand redundant updates per minute. Query Performance Insight highlights that querying the table is extremely painful on DTU (even querying the last minute of data in the table is bad on DTU, with ~15 thousand new records arriving every minute).
62617: insert ...
62618: select ...
62619: select ...
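To make that concrete, even a narrow query such as the sketch below (my illustrative example; the one-minute window and column list are just for demonstration) has to contend with ~15 thousand inserts per minute landing on the same table:

-- Illustrative "last minute of data" query against the audit table.
SELECT [id], [datecreated], [action], [responsetime]
FROM [dbo].[audit]
WHERE [datecreated] > DATEADD(MINUTE, -1, GETUTCDATE())
ORDER BY [datecreated] DESC;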
A positive from the above: it has moved from 586 hours of combined time for insert statements (7.5 million entry rows per day) on S3 to 3 hours on P2 (12.4 million rows per day). An extremely significant decrease in processing time. It did start with an empty table on Thursday, which has surpassed the previous size in a week, whereas the previous one took a few months to get there.
It's doing fine on the new tier. It should be at ~5% if the applications were using the database responsibly, and the secondary is up to date.
Spoke too soon. On P2 it was thought a good idea to run a SQL query that repeatedly deletes a thousand rows at a time, against 12 million new rows a day. From 10am to 12am it has managed to remove 5.2 million rows. The database is showing signs of being in the same state as last week. I'm curious to see if the same thing happens now.
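That repeated delete would look something like the sketch below (my reconstruction, not the exact query that was run; the cut-off date is hypothetical):

-- Batched delete: removes old audit rows a thousand at a time so each batch
-- commits quickly, but every batch still competes with ~15k inserts per minute.
DECLARE @rows INT = 1;
WHILE @rows > 0
BEGIN
    DELETE TOP (1000)
    FROM [dbo].[audit]
    WHERE [datecreated] < '2017-06-01';   -- hypothetical cut-off
    SET @rows = @@ROWCOUNT;
END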