NRPE master high load issue #244

Open
younity-ENG opened this issue Oct 19, 2020 · 4 comments

Comments

@younity-ENG

Hi all,

I'm running Nagios Core 4.4.3 with nagios-nrpe-plugin 3.2.1 on Ubuntu 18.04.
It is installed on an AWS EC2 t2.medium instance (2 CPUs, 4 GB RAM). The server is configured with 3 check_workers because it has 2 CPUs.
It serves as an "on-site" monitor with direct host/service checks over a VPN, and as an NRPE master server.
The external commands are mostly ping, plus around 4 HTTP/DNS checks.
There are around 100 direct services and 350 NRPE services (one host).
When I add more NRPE agents (400 services each), the load on the master rises and I get "localhost load" alerts:
localhost/Current Load is CRITICAL: CRITICAL - load average: 1.31, 1.48, 4.01
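
For reference, the alert line above can be broken down to see which averaging window is responsible. This is a minimal sketch, not part of the original report: the critical thresholds `10.0,6.0,4.0` are an assumption taken from the stock sample `localhost.cfg`, and the function names are made up for illustration. The actual thresholds on this server may differ.

```python
import re

def parse_load(output):
    """Extract the 1-, 5- and 15-minute load averages from check_load output."""
    match = re.search(r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)", output)
    if match is None:
        raise ValueError("no load averages found in: %r" % output)
    return tuple(float(x) for x in match.groups())

def exceeded(loads, thresholds):
    """Return the labels of the windows whose load is above its threshold."""
    labels = ("1min", "5min", "15min")
    return [label for label, load, limit in zip(labels, loads, thresholds)
            if load > limit]

# Assumption: critical thresholds from the sample localhost.cfg, not from the post.
ASSUMED_CRITICAL = (10.0, 6.0, 4.0)
ALERT = "CRITICAL - load average: 1.31, 1.48, 4.01"

loads = parse_load(ALERT)
print("windows over threshold:", exceeded(loads, ASSUMED_CRITICAL))
# Load per CPU on the 2-CPU instance described in the post.
print("per-CPU load:", [load / 2 for load in loads])
```

Under those assumed thresholds, only the 15-minute value is over its limit, which would point to an earlier sustained burst rather than the load at the moment of the alert.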

While monitoring the server with htop, I see that CPU usage repeatedly reaches 100%.
I've looked online and found some recommendations, but they didn't really help:

  • using check_fping instead of check_ping plugin
  • external_command_buffer_slots=512
  • use_large_installation_tweaks=1
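
To confirm which of these values the running instance actually reads, the main config file can be scanned for them. A minimal sketch, assuming the usual `key=value` layout of `nagios.cfg`; the sample text below is a hypothetical excerpt standing in for the real file, whose path depends on the install.

```python
def read_settings(cfg_text, wanted):
    """Return the last value seen for each wanted key in nagios.cfg-style text."""
    found = {}
    for raw in cfg_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        if key in wanted:
            found[key] = value.strip()
    return found

# Hypothetical excerpt; in practice, read the text from the real nagios.cfg.
SAMPLE = """\
# performance-related settings
check_workers=3
use_large_installation_tweaks=1
external_command_buffer_slots=512
"""

WANTED = {"check_workers", "use_large_installation_tweaks",
          "external_command_buffer_slots"}
for key, value in sorted(read_settings(SAMPLE, WANTED).items()):
    print(key, "=", value)
```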

Using htop, I see that the CPU spikes occur when external commands are executed.

Does anyone have any idea why my CPU usage is so high?
Shouldn't Nagios be able to handle thousands of services (with the right configuration)?
I'd appreciate any tips and recommendations.

thanks

@younity-ENG
Author

Hi,
Any ideas regarding this issue?

thanks

@sawolf
Contributor

sawolf commented Nov 6, 2020

Hi, thanks for reporting this. Can you elaborate on your current system architecture?

It sounds to me like you're saying you have

  • One Nagios Core server
  • 100 direct services not related to NRPE
  • 350 services using check_nrpe, all put on the same host, and possibly interacting with the same remote server.

and that you're adding additional remote servers, each of which results in you adding ~400 check_nrpe checks to your Nagios config.

I guess my question is - how many of these agents are you adding before you see the CPU load increase?

Also, I recommend increasing the number of check_workers, since those will block on network requests. It may not affect anything, but if any of these plugins take a long time to execute, the worker will just be sleeping for that whole time.
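
As a rough way to size this, the average number of workers that are busy (mostly sleeping on the network) at any moment is the check rate times the time each check takes. The sketch below is only an estimate under stated assumptions: the 5-minute check interval and the 2-second average check_nrpe round trip are hypothetical figures, not values from this thread, and `busy_workers` is a made-up helper name.

```python
import math

def busy_workers(num_checks, interval_s, mean_exec_s):
    """Average number of workers occupied at once: check rate x mean duration."""
    return num_checks / interval_s * mean_exec_s

# Hypothetical figures: every service checked once per 300 s, 2 s per check.
ASSUMED_INTERVAL_S = 300
ASSUMED_EXEC_S = 2.0

# 450 = current setup, 850 = one more 400-service agent, 5000 = the larger case.
for services in (450, 850, 5000):
    avg = busy_workers(services, ASSUMED_INTERVAL_S, ASSUMED_EXEC_S)
    print(services, "services ->", math.ceil(avg), "workers busy on average")
```

If the estimate stays well below the configured check_workers value, sleeping workers are probably not the limiting factor; slower plugins or shorter intervals push the number up quickly.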

If you're adding a lot of these agents (so that you have 5000+ services), you might want to look into something like mod_gearman to distribute the work being done.

@younity-ENG
Author

Hi Sebastian,

Thank you for responding.
Basically, the load average gets high as soon as the second agent is added.
I did increase my hardware resources and the number of workers (4 cores and 6 workers), but it didn't really help.
I understand you recommend trying mod_gearman at this scale.
I'm not familiar with this module.
Does it mean that the remote agent will be mod_gearman rather than NRPE?

thanks

@younity-ENG
Author

Are you familiar with NRDP?
