-
Notifications
You must be signed in to change notification settings - Fork 981
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support UDP/TCP port fowarding to a host without setting up a tun #1179
base: master
Are you sure you want to change the base?
Conversation
Thanks for the contribution! Before we can merge this, we need @cre4ture to sign the Salesforce Inc. Contributor License Agreement. |
Thanks a ton for taking this project on! I took a stab at this before finding your changes here, it's here if you want to take a look. In particular e764eb adds a script that sets up a two device loopback tunnel (without root) and runs an iperf3 speed test. I was able to improve throughput by ~7x by adding some buffering to the I still think there are several improvements left on the table:
|
Hey, cool. I could use some support especially regarding the performance and the automated testing. But also for other golang specifics as I'm rather unexperienced there. I tried to do some performance profiling with the help of It seems that its doing some dynamic memory allocation. But from a first glance into the relevant source-code I could not see the reason for it. Do you have an idea? I'm actually already failing to upgrade the gvisior to the latest version. something is wrong with it such that it complains about multiple packages in a directrory. But how can this be undiscovered by the maintainters of gvisor? I have the feeling that I'm doing something wrong... :-/ I will for sure try your idea with the buffered queues. |
@akernet I added your test and afterwards also the performance improvement via cherry-picking. Hope this is fine for you :-) Your performance improvement lead to a improvment of x6 for the test in your testscript-commit on my hardware. I will do a further test where I will use this on a real-world example (the file-copy test with two seperate machines) to confirm this improvment. I'm a bit concerned because the pprof profiling directed my in some different direction. At least this is how I interpret it right now. I will let you know. |
it seems to me that the improvement with the buffered pipes has no impact on the my testing scenario with the file-copy between two different machines. I'm using a "private cloud" server storage called "garage" mounted via "rclone". The two machines are my laptop and a NUC PC ("server"). Both connected via ethernet cable (1GBit/s) to my local network. I use the offical release binary for the laptop and exchange for testing purposes the binary on the server side. I measure the throughput rate (nautilus file copy) on the laptop. And I measure the CPU load on the server side. These are my results:
The last test shall demonstrate the the locally compiled binary has comparable performance as the release binary. In general, the userspace tun/stack has an impact of factor 4. which results from x2 times CPU load and 1/2 throughput rate put together. |
I'm also new to go but I'll try to find some time in the next days! :)
Ofc!
Cool, I've been looking at garage for the past weeks so nice to see that you are using it with Nebula!
Yeah this is a bit strange. How is the system load overall, is it close to max? The buffering should not make Nebula more efficient when it comes to CPU, in fact it's probably gonna be more costly due to the copying. What it does is decoupling the outside and inside (gvisor) parts, allowing them to run in parallel. If the system is at limits already this is unlikely to help, however. I'll try to do some benchmarking in CPU limited scenarios too, gvisor is always gonna use some additional resources but the 66% of your pprof run seems a bit high. |
Update regarding performance:
This brings the copy-speed to almost the reference (99MB/second) - only 10% difference remaining. So overall we are at around 175% of the reference. Which is significant compared to the initial ~ 400%. @akernet I think the main difference comes from a further improvement of your performance tuning "buffering to UserDevice". I achieved it by addtionally avoiding data-copy and dynamic allocation steps (i think). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really nice! Cool that you got the buffer reuse to work
// Its there to avoid a fresh dynamic memory allocation of 1 byte | ||
// for each time its used. | ||
var BYTE_SLICE_ONE []byte = []byte{1} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest moving this to a separate PR since it affects normal configs too
port-forwarder/fwd_tcp.go
Outdated
default: | ||
} | ||
|
||
rn, r_err := from.Read(buf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this could be replaced with io.Copy()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it can. I once tried this. But it didn't improve the performance. Thats why I went on with other experiments. But now, as we solved the issue to a big extent, I will introduce it again as it simplifies the code. Thanks for pointing it out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a variant with io.Copy().
Problem: It seems to slightly decrease performance. So I'm wondering if I better should keep the original implementation.
e2523ca
to
08c3922
Compare
Thanks for the contribution! Before we can merge this, we need @akernet to sign the Salesforce Inc. Contributor License Agreement. |
9718675
to
3a03eb9
Compare
@akernet please re-review. and it seems that you need to also sign the cla as I cherry picked from your branch to honor your contribution. :-) |
5be9458
to
24169d7
Compare
@johnmaguire I'm also not fully sure about this. An error message like you suggest would make it clear. But I can't explain why the failing test complains about not fitting ca-certificates otherwise.
and also this:
But for the second one, I have an explanation meanwhile: the generated certificate for the test-services is only valid for 5 minutes. as the relevant test runns into a timeout after 10 minutes, its clear that this error occurs after 5 minutes. Still, the first error I can't explain. The ca-key-pair is generated freshly each time the test runs. From the ca-key-pair, the signed service key-pair and certificate is generated each time anew. So there should not be the possibility that these do not match. Expect, if one consideres that tests run in parallel an they accidentially connect with the wrong test-partner. But I'm open for other ideas. Any help is welcome. |
@cre4ture I'm not sure off-hand. But the fact that it succeeded on most platforms and failed on one again makes me wonder if it's a flaky test. I probably would've tried re-running the CI w/o changes to see whether it failed on a repeat run. If not, then we need to think about whether there's any timing issue that could affect that test. (I haven't looked at the test code really, but just as an example, maybe the updated CA bundle hasn't finished writing to disk when the test starts. Or maybe there was a silent error doing so?) |
FYI we recently merged #1181 which caused conflicts. One conflict is caused by returning |
…isor_stack # Conflicts: # examples/go_service/main.go # service/service.go
I merged the changes from main. It seems that the interface is OK. |
udp/udp_linux.go
Outdated
@@ -315,6 +321,10 @@ func (u *StdConn) getMemInfo(meminfo *[unix.SK_MEMINFO_VARS]uint32) error { | |||
|
|||
func (u *StdConn) Close() error { | |||
//TODO: this will not interrupt the read loop |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@johnmaguire I think I have the test stable now. At least on linux. The issue was a unclean shutdown of the nebula service at the end of each test which caused confusion in the following tests.
The change here in this file solved the issue.
I will check the next days if there are some sleeps to cleanup, now as I found the issue.
ed51dd9
to
84d1a26
Compare
return s.eg.Wait() | ||
err := s.eg.Wait() | ||
|
||
s.ipstack.Destroy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding this line made the windows test stable. They now run 200+ in a row without issue.
@@ -323,6 +329,14 @@ func (u *RIOConn) Close() error { | |||
windows.PostQueuedCompletionStatus(u.rx.iocp, 0, 0, nil) | |||
windows.PostQueuedCompletionStatus(u.tx.iocp, 0, 0, nil) | |||
|
|||
u.rx.mu.Lock() // for waiting till active reader is done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had to run the test 1000 times to achieve stable reproduction on my windows 11. Is fixed with this change.
a2cbb62
to
cd510b3
Compare
@johnmaguire after extensive testing and a few findings I dare to claim that the tests are now stable on windows and linux. Can you please approve another workflow run on this PR? And have a look about what is still do be done to merge this? |
This is intended to implement #1014
Build-Status twin PR: cre4ture#1
How to be used can be read in the example
config.yml
:Till now I only did some basic manual tests with netcat. They resulted in the expected behavior.Further tests where done by @sybrensa and also by myself.
usafe_routesFAILS, but potential fix for the user-tun available here: https://github.com/cre4ture/nebula/tree/feature/enable_unsafe_routes_for_user_tunThese points are still open:a TODODONE.also TODODONE.I thinkDONE.I did file copy tests.
What I got was 1/2 the rate and around 4 times more CPU usage compared to the case with kernel-tun.I tried multiple different improvements on my code, but I fear its actually more a limitation of the gvisor netstack :-/
Any help or idea is welcome.
Update: My file copy tests achieve now around 90% of the speed at 160% of CPU load of nebula. This is a significant improvment and acceptable (at least for my useases). Thanks @akernet for the support here.