
chainweb-node sometimes freezes when receiving a lot of traffic #687

Open
edmundnoble opened this issue Nov 11, 2019 · 16 comments

@edmundnoble
Contributor

edmundnoble commented Nov 11, 2019

This has plagued the network for a couple of days now; people are running healthcheck scripts that curl local endpoints and restart chainweb-node if it stops responding.

From what I can tell the process is still alive, or systemd would restart it; it just seems to be hanging.
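For reference, a minimal sketch of such a watchdog (the health-check URL, the timeout, and the systemd unit name are assumptions; adjust to your setup):

#!/bin/sh
# Restart the node if a local HTTPS probe gets no answer within 10 seconds.
if ! curl -sk --max-time 10 https://localhost:443/health-check > /dev/null; then
    systemctl restart chainweb-node
fi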

@moofone

moofone commented Nov 15, 2019

Yes, this happens for me within a few hours, every time. The node stops responding on port 443. It still has open connections but seems to die internally.

nc -z -v localhost 443

^ this should reply that the socket is open; instead the server is dead, not responding to TCP at all. Restarting the node makes it work for a little while, then it freezes again within hours.

Ubuntu 18.04, using the pre-built 1.0.4 binaries.

chainweb-node has only 126 open file descriptors, and the server has very large limits, so it's not that. There is also plenty of disk space and free RAM (18+ GB available).
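A bare TCP check can't tell a dead listener from a wedged application, so it helps to probe both layers. A sketch, assuming the node serves the standard chainweb API on port 443 (the endpoint path may differ by node version):

nc -z -v localhost 443
# ^ layer 4: does the listener accept connections at all?
curl -sk --max-time 10 https://localhost:443/chainweb/0.0/mainnet01/cut
# ^ layer 7: does the application still answer requests?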

@fosskers
Contributor

fosskers commented Nov 15, 2019

From our end, we've never seen this issue on our bootstrap nodes, including ones getting hit decently hard by mining activity. My theory is that this was due to low file-descriptor limits on the nodes, but perhaps people have counter-evidence?

For those involved, please paste here the output of the following commands on your node machines:

ulimit -Sn
netstat -tapun | wc -l
lsof -p $(pgrep chainweb-node) | wc -l

@moofone

moofone commented Nov 15, 2019

ulimit -Sn

500000

netstat -tapun | wc -l
2151

lsof at the time, as I said, was 126.

@tylersisia

Machine 1

ulimit -Sn
1024
netstat -tapun | wc -l
117
lsof -p $(pgrep chainweb-node) | wc -l
217

@tylersisia

Machine 2

ulimit -Sn
1024
netstat -tapun | wc -l
111
lsof -p $(pgrep chainweb-node) | wc -l
112

@pkrasam

pkrasam commented Nov 16, 2019

ulimit -Sn
1024000

netstat -tapun | wc -l
61

lsof -p $(pgrep chainweb-node) | wc -l
198

@huglester

huglester commented Nov 16, 2019

ulimit -Sn
1024
netstat -tapun | wc -l
282
lsof -p $(pgrep chainweb-node) | wc -l
416

I changed ulimit to 65535 - we will see
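Note that a shell-level ulimit change won't survive a service restart. A sketch of making it persistent for a systemd-managed node (the unit name chainweb-node is an assumption):

# Override the file-descriptor limit for the service and apply it.
mkdir -p /etc/systemd/system/chainweb-node.service.d
printf '[Service]\nLimitNOFILE=65535\n' > /etc/systemd/system/chainweb-node.service.d/limits.conf
systemctl daemon-reload
systemctl restart chainweb-node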

@xdagx

xdagx commented Nov 16, 2019

ulimit -Sn
655350
netstat -tapun | wc -l
15121
lsof -p $(pgrep chainweb-node) | wc -l
15065

@moofone

moofone commented Nov 16, 2019

This is likely the problem... This thing is holding TWENTY THOUSAND established TCP connections (!!!!!!)

ss -tnp | grep 443 | grep ESTAB | wc -l
20077

@larskuhtz
Contributor

larskuhtz commented Nov 17, 2019

@moofone, I have observed large numbers of incoming connections from just a single (or a few) IP address(es), too. In those cases it seemed that the miner was creating too many connections.

This line https://github.com/kadena-io/chainweb-miner/blob/033e0c7c27dc50f92ac98a91faeab079ebef2697/exec/Miner.hs#L362 in the miner code looks suspicious. The miner is creating a new server event stream each time it receives an update from the server. The body of withEvent doesn't seem to inspect the content of the stream, and I wonder if it leaves the connection in a dirty state so that it can't be reused, or if it may even leak the connection.

In any case, I think the miner shouldn't open a new server stream on each update.
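A quick way to check whether one peer dominates, assuming the node listens on 443 (adjust the port; re-run periodically and watch for a count from one IP that only grows):

# Count established connections per remote IP on the node's listen port.
ss -tn state established '( sport = :443 )' | awk 'NR>1 {print $4}' | cut -d: -f1 | sort | uniq -c | sort -rn | head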

@larskuhtz
Contributor

I wonder if kadena-io/chainweb-miner#7 can cause the miner to leak connections to the update stream.

@Jacoby6000

Jacoby6000 commented Nov 19, 2019

> ss | grep ESTAB | grep 35000 | grep 84.42.23.120 | wc -l
813
> ss | grep ESTAB | grep 35000 |  wc -l
845

I wonder if our nodes are being attacked. Most connections come from the same IP.

@Jacoby6000

Jacoby6000 commented Nov 19, 2019

After setting up an ebtables rule, I don't have a problem for now. If it's an attacker, I'm sure they'll just switch IPs. Need to make a rule that drops the same IP once it holds more than n connections (see the connlimit sketch below).

ebtables -A INPUT -p IPv4 --ip-src 84.42.23.120 -j DROP

Run this on your node server to catch your worst offenders:

ss | grep ESTAB | grep ":$YOUR_NODE_PORT" | awk '{print $6}' | awk -F':' '{print $1}' | sort | uniq -c | sort

I suppose it could also just be large farms.
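For the ">n connections per IP" rule, the iptables connlimit match can do it directly; a sketch assuming the node listens on 443 and that 50 is a sane ceiling (both are assumptions, tune to your traffic):

# Drop new connection attempts from any single IPv4 address that already
# holds more than 50 connections to the node port.
iptables -A INPUT -p tcp --syn --dport 443 -m connlimit --connlimit-above 50 --connlimit-mask 32 -j DROP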

@chessai
Contributor

chessai commented Jun 13, 2023

What became of this?

@Jacoby6000

This wound up being a mining client in its infancy misbehaving and imitating a kind of slowloris attack. Not sure if anything was done to chainweb-node to prevent it since then.
