Hi, could anyone please tell me why the nginx service takes so much time
to restart? As we know, after any configuration change the service must be
reloaded or restarted; I do this with "nginx -s reload" or "systemctl
restart nginx", and it takes around three minutes or more. This happens on
servers with many websites (e.g. 200 sites). On a fresh nginx installation
the restart is immediate, but with many sites the restart is very slow. I
am using Debian with nginx and OWASP.
Thanks for your help.
--
*Gus Flowers*
Hi,
Did you check the error log or syslog? Is it reporting any errors? Do you
have SSL OCSP (stapling) settings configured, and might the OCSP responder
be unreachable?
I had 45 portals and was facing the same issue. When I debugged it, I
found that ocsp.godaddy.com was not reachable, and it gets checked every
time we reload the service.
Just a heads up though.
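If you want to rule that out quickly, something like this can show whether
the OCSP responder is reachable (a rough sketch; the certificate path is an
assumption, adjust it for your setup):

    _ocsp_uri="$(openssl x509 -in /etc/nginx/ssl/cert.pem -noout -ocsp_uri)"
    echo "responder: ${_ocsp_uri}"
    # simple reachability/latency check against the responder
    curl -sS -m 5 -o /dev/null -w 'HTTP %{http_code} in %{time_total}s\n' "${_ocsp_uri}"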
Posted at Nginx Forum: https://forum.nginx.org/read.php?2,295945,295946#msg-295946
Hello!
On Thu, Dec 01, 2022 at 12:40:08AM -0300, Gus Flowers Starkiller wrote:
> Hi, could anyone please tell me why the nginx service takes so much time
> to restart? As we know, after any configuration change the service must
> be reloaded or restarted; I do this with "nginx -s reload" or "systemctl
> restart nginx", and it takes around three minutes or more. This happens
> on servers with many websites (e.g. 200 sites). On a fresh nginx
> installation the restart is immediate, but with many sites the restart is
> very slow. I am using Debian with nginx and OWASP.
The most obvious thing I would recommend checking is whether the system
resolver is functioning properly. If the nginx configuration uses
domain names rather than IP addresses, and there are issues with
the system resolver (for example, one of the configured DNS servers does
not respond), loading the configuration might take a lot of time.
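A quick way to check (a rough sketch; replace the hostname with one that
actually appears in your configuration):

    time nginx -t                  # is the configuration test itself slow?
    cat /etc/resolv.conf           # which resolvers does the system use?
    for ns in $(awk '/^nameserver/ {print $2}' /etc/resolv.conf); do
        dig +time=2 +tries=1 @"$ns" backend.example.com | grep -E 'Query time|timed out'
    done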
--
Maxim Dounin
http://mdounin.ru/
Yes - he is right; everything revolves around DNS. Even my error was with
DNS resolution, as it was not able to resolve ocsp.godaddy.com, so please
troubleshoot from a DNS perspective.
Posted at Nginx Forum: https://forum.nginx.org/read.php?2,295945,295963#msg-295963
On 04.12.22 at 08:04, blason wrote:
> Yes - he is right; everything revolves around DNS. Even my error was with
> DNS resolution, as it was not able to resolve ocsp.godaddy.com, so please
> troubleshoot from a DNS perspective.
Hello list,
To avoid these problems I prefer https://nginx.org/r/ssl_stapling_file
Some years ago I ran an nginx instance handling thousands of vhosts.
The reload time - in practice not noticeable - was amazing!
Attached is a simplified 'update_ssl_stapling_file' script.
It should be run once a day.
The operator should monitor that each 'ssl_stapling_file.der' is not older
than 3-4 days.
Andreas
-------------- next part --------------
#!/bin/sh
set -u

# used files:
#
# cert.pem
#   - contains only the server certificate itself
#
# intermediate.pem
#   - contains one or more intermediate certificates, excluding the root itself
#   - may be empty
#   - this script assumes exactly one intermediate
#
# root.pem
#   - the root, unused in this example
#
# cert+intermediate.pem
#   - created by 'cat cert.pem intermediate.pem > cert+intermediate.pem'
#   - used as https://nginx.org/r/ssl_certificate
#
# key.pem
#   - used as https://nginx.org/r/ssl_certificate_key
#
# ssl_stapling_file.der
#   - created by this script
#   - used as https://nginx.org/r/ssl_stapling_file

_ocsp_uri="$( openssl x509 -in cert.pem -noout -ocsp_uri )"

failed() {
    echo >&2 "$0 failed: $1"
    rm -f ssl_stapling_file.tmp
    exit 1
}

# fetch the OCSP response for the server certificate from its responder
if ! _r="$( openssl ocsp \
        -no_nonce \
        -respout ssl_stapling_file.tmp \
        -CAfile intermediate.pem \
        -issuer intermediate.pem \
        -cert cert.pem \
        -url "${_ocsp_uri}" \
        2>&1 )"; then
    failed "${_r}"
fi

# require both a verified response and a "good" certificate status
if ! echo "${_r}" | grep --text --silent 'Response verify OK'; then
    failed "${_r}"
fi
if ! echo "${_r}" | grep --text --silent 'cert.pem: good'; then
    failed "${_r}"
fi

mv ssl_stapling_file.tmp ssl_stapling_file.der
echo 'ssl_stapling_file.der updated, "nginx -s reload" is recommended'
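For context, a server block consuming the files described above might look
roughly like this (a sketch; the paths and server name are assumptions, not
part of the script):

    server {
        listen 443 ssl;
        server_name example.com;                          # placeholder

        ssl_certificate     /etc/nginx/ssl/cert+intermediate.pem;
        ssl_certificate_key /etc/nginx/ssl/key.pem;

        ssl_stapling        on;
        ssl_stapling_file   /etc/nginx/ssl/ssl_stapling_file.der;
    }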
I also know the problem from an environment with many sites and thousands
of IPs to bind to. For us the problem is that nginx binds every worker to
every IP sequentially, leading to a restart time of 10-15 minutes. The
problem can easily be observed using strace on the master process during
startup. We couldn't find an easy solution so far.
Gus Flowers Starkiller <relectgustfs at gmail.com> wrote on Thu, Dec 1,
2022, 04:42:
> Hi, could anyone please tell me why the nginx service takes so much time
> to restart? As we know, after any configuration change the service must
> be reloaded or restarted; I do this with "nginx -s reload" or "systemctl
> restart nginx", and it takes around three minutes or more. This happens
> on servers with many websites (e.g. 200 sites). On a fresh nginx
> installation the restart is immediate, but with many sites the restart is
> very slow. I am using Debian with nginx and OWASP.
>
> Thanks for your help.
>
> --
> *Gus Flowers*
>
>
Hello!
On Mon, Dec 05, 2022 at 09:43:18PM +0100, Charlie Kilo wrote:
> I also know the problem from an environment with many sites and thousands
> of IPs to bind to. For us the problem is that nginx binds every worker to
> every IP sequentially, leading to a restart time of 10-15 minutes. The
> problem can easily be observed using strace on the master process during
> startup. We couldn't find an easy solution so far.
Could you please share some numbers and details of the
configuration? Some strace output with timestamps might be also
helpful (something like "strace -ttT" would be great).
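For example, something along these lines (a sketch; the pid file path is an
assumption and may differ on your system):

    # timestamped bind()/listen() timing during a configuration test
    strace -ttT -e trace=bind,listen -o /tmp/nginx-t.strace nginx -t

    # or attach to the running master just before reloading
    strace -ttT -f -e trace=bind,listen -o /tmp/nginx-reload.strace \
        -p "$(cat /run/nginx.pid)" &
    nginx -s reload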
While binding listening sockets indeed happens sequentially, it is
expected to take at most seconds even with thousands of listening
sockets, and even under load, not minutes. It would be
interesting to dig into what causes 10-15 minutes restart time.
In particular, in ticket #2188
(https://trac.nginx.org/nginx/ticket/2188), which was about
speeding up "nginx -t" with lots of listening sockets under load,
opening 20k listening sockets (expanded from about 1k sockets in
the configuration with "listen ... reuseport" and multiple worker
processes) was observed to take about 1 second without load (and
up to 15 seconds under load, though this shouldn't affect restart).
Also note that nginx provides a number of ways to avoid opening that many
sockets (including using a single socket on a wildcard address for a given
port instead of a socket for each IP address, and not using reuseport,
which is really needed only if you are balancing UDP). If the issue you
are observing is indeed due to slow bind() calls, one of the possible
solutions might be to reduce the number of listening sockets being used.
--
Maxim Dounin
http://mdounin.ru/
Hi Maxim,
We have roughly 7k IPs in use (3k IPv6, 4k IPv4) and 52 workers. That
results in ~364,000 listening sockets which need to be bound - twice that
if I count both port 80 and 443.

We do indeed have reuseport active - we had already thought about using a
wildcard address on a socket, but didn't have time to investigate and test
it thoroughly.
If it's really only useful for balancing UDP, we might be able to get rid
of it.
We are aware of the need to reduce the number of listening sockets and the
config size per server; however, this will be challenging and involve
changes on a lot of levels.
I'll have to look into that again.
Thank you for your suggestions in any case!
On Tue, Dec 6, 2022 at 1:34 AM Maxim Dounin <mdounin at mdounin.ru> wrote:
> Hello!
>
> On Mon, Dec 05, 2022 at 09:43:18PM +0100, Charlie Kilo wrote:
>
> > I also know the problem from an environment with many sites and
> > thousands of IPs to bind to. For us the problem is that nginx binds
> > every worker to every IP sequentially, leading to a restart time of
> > 10-15 minutes. The problem can easily be observed using strace on the
> > master process during startup. We couldn't find an easy solution so far.
>
> Could you please share some numbers and details of the
> configuration? Some strace output with timestamps might be also
> helpful (something like "strace -ttT" would be great).
>
> While binding listening sockets indeed happens sequentially, it is
> expected to take at most seconds even with thousands of listening
> sockets, and even under load, not minutes. It would be
> interesting to dig into what causes 10-15 minutes restart time.
>
> In particular, in ticket #2188
> (https://trac.nginx.org/nginx/ticket/2188), which was about
> speeding up "nginx -t" with lots of listening sockets under load,
> opening 20k listening sockets (expanded from about 1k sockets in
> the configuration with "listen ... reuseport" and multiple worker
> processes) was observed to take about 1 second without load (and
> up to 15 seconds under load, though this shouldn't affect restart).
>
> Also note that nginx provides a number of ways to avoid opening that many
> sockets (including using a single socket on a wildcard address for a given
> port instead of a socket for each IP address, and not using reuseport,
> which is really needed only if you are balancing UDP). If the issue you
> are observing is indeed due to slow bind() calls, one of the possible
> solutions might be to reduce the number of listening sockets being used.
>
> --
> Maxim Dounin
> http://mdounin.ru/
Hello!
On Sat, Dec 10, 2022 at 09:52:37AM +0100, Charlie Kilo wrote:
> We have roughly 7k IPs in use (3k IPv6, 4k IPv4) and 52 workers. That
> results in ~364,000 listening sockets which need to be bound - twice that
> if I count both port 80 and 443.
>
> We do indeed have reuseport active - we had already thought about using a
> wildcard address on a socket, but didn't have time to investigate and test
> it thoroughly.
> If it's really only useful for balancing UDP, we might be able to get rid
> of it.
Thanks for the details. Running with 700k listening sockets
indeed might be a challenge.
Further, it looks like Linux isn't very effective when handling
lots of listening sockets on the same port. In my limited
testing, binding 10k listening sockets on the same port takes
about 10 seconds, binding 20k listening sockets takes 50 seconds,
and binding 30k listening sockets takes 140 seconds.
The simplest and most effective solution should be to use a listen on
the wildcard address on the relevant port somewhere in the
configuration, such as "listen 80;" (with "reuseport" if needed,
see below), so nginx will open just one listening socket and will
distribute connections based on the local address as obtained by
getsockname(), see the description of the "bind" parameter of the
"listen" directive (http://nginx.org/r/listen). The only
additional change to the configuration this requires is removing
all socket options from the per-IP listen directives, so nginx
won't try to bind them separately.
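Purely as an illustration (the addresses, names, and paths below are
invented for the example, not taken from this thread), the resulting
configuration might look like:

    # inside the http {} context
    server {
        # one wildcard socket per port: nginx opens it once and picks the
        # right server block using the local address from getsockname()
        listen 443 ssl;
        server_name _;
        ssl_certificate     /etc/nginx/ssl/default.pem;
        ssl_certificate_key /etc/nginx/ssl/default.key;
        return 444;
    }

    server {
        # per-IP server blocks keep their listen lines, but without
        # bind/reuseport/other socket options, so no extra socket is bound
        listen 192.0.2.10:443 ssl;
        server_name site-a.example;
        ssl_certificate     /etc/nginx/ssl/site-a.pem;
        ssl_certificate_key /etc/nginx/ssl/site-a.key;
        root /var/www/site-a;
    }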
Not using "reuseport" should be an option too, but keep in mind
that in nginx versions before 1.21.6 it might also be useful as a
workaround for uneven distribution of connections between worker
processes on modern Linux versions. As an alternative,
"accept_mutex on;" can be used (see
https://trac.nginx.org/nginx/ticket/2285 for details).
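For reference, the accept_mutex alternative is a one-line change in the
events block (sketch):

    events {
        worker_connections 1024;
        accept_mutex on;    # even out accept() between workers without reuseport
    }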
--
Maxim Dounin
http://mdounin.ru/