A story about slow DNS & (a)sync dns.lookup


It's always good to remember how the event loop works (good high level figure above)

We were investigating a performance issue which initially looked like a simple downstream service slowness (e.g. a service you call, simply becomes slower and slows you down)

When following the trail of request (we used New Relic and our logs) we saw that

  1. Service A calls Service B and the average latency of response grew up to steady 500ms per call
  2. Service B shows that average latency of all calls to him as average of 150ms per call
  3. Service B calls MySQL and shows that average query execution time is 145 ms
  4. On MySQL the average query execution time is estimated as 15ms

Few additional facts
  1. Calling Service is is done using DNS name
  2. Calling MySQL is done using IP
  3. One of the traces for the service calls shows that majority of time was taken by dns.lookup
  4. The entire system became slow at portion of times
When looking at above it was almost clear to me that the DNS was to blame since
  1. DNS lookup is not cached by NodeJs and we don't have system local cache
  2. I couldn't find any other delay between services 
  3. It's well known that if DNS is not behaving --> system slowly dies ...
  4. I looked at a network tool and saw plenty of DNS delays and timeout :)
I read few articles and one of them reminded me that DNS.Lookup is async but in essence it's a blocking call that gets trigged in the worker thread on NodeJS (Thread pool).
Additionally the pool size is actually defaulted as 4 ! which means that if we have 4 slow running DNS queries running we actually block the IO pool and our entire NodeJS service is in danger of severe latency.

<Party-Pooper> Than a friend came to me and reminded me that we don't use DNS when calling DB <Party-Pooper />

All above 'facts' lead me to believe that we are dealing with slow network problem (potentially some device in the middle is mis-behaving)

Next step is to work with network expert and see the source of culprit 

* The default 4 looks good as starting point, but I am sure that we can actually increase our system performance if we monitor and tweak

* Any one knows how to monitor the thread queue length ?

A story about slow DNS & (a)sync dns.lookup A story about slow DNS & (a)sync dns.lookup Reviewed by Ran Davidovitz on 4:24 PM Rating: 5

No comments:

Powered by Blogger.