Setting up wifi in rural area South Africa

Lol, this is not a “small” one - but just some ideas, I’m sure you will stumble on the cause if you troubleshoot these, at least.

The best I can say is to avoid load balancing, in favor of hash-based routing or failover, until you have the expertise to set it up properly and to troubleshoot problems properly. How does your router determine where to route traffic? There are many strategies, and each of them causes different problems, e.g. round robin, failover, hash based. The strategy is just the mechanism by which the router decides which ISP to route which connections to.
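To make the difference between those strategies concrete, here is a rough Python sketch. It is not how any particular router implements them; the upstream names (`isp_a` etc.) and the function names are made up for illustration.

```python
import hashlib
from itertools import cycle

# Hypothetical upstream names, for illustration only.
UPSTREAMS = ["isp_a", "isp_b", "isp_c"]


def make_round_robin(upstreams):
    """Each new connection goes to the next upstream in turn."""
    ring = cycle(upstreams)

    def pick():
        return next(ring)

    return pick


def pick_failover(upstreams, is_up):
    """Always use the first upstream that is currently up, in priority order."""
    for name in upstreams:
        if is_up.get(name, False):
            return name
    return None  # everything is down


def pick_by_hash(upstreams, src_ip):
    """Hash the client address so one client always lands on the same upstream."""
    digest = hashlib.sha256(src_ip.encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(upstreams)
    return upstreams[index]
```

With round robin, two connections from the same device can leave through different ISPs, which is what tends to upset services that track the client IP. Failover and a per-client hash both keep a given device on one upstream at a time.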

Does the same problem still happen if you unplug 2 of the upstreams? Can you identify the broken upstream ISP this way?

Are you using the same public IP address on all upstream links? If not, you have to ensure that you have a source routing (or SNAT) rule on your gateway router, so that the traffic will return through the same interface that it was sent from. Many services (such as Netflix) might have security measures to block traffic if it looks like it was “hijacked”, i.e. a connection is running from one IP address and then the same connection suddenly continues from another. It sounds like you have an SNAT setup with different outgoing IPs, but with a local host-based hash, so that the FireTV box is getting its connection to Netflix through a different ISP than the other devices. (Even if you load a site like http://checkip.dyndns.org from each device to check, the router might hash that connection differently from the one to Netflix, for example.)

Also, your main responsibility is establishing end-to-end connectivity, so a good first step is to ensure that the “not found” page is coming from the remote site, and not from one of your local devices. For example, Netflix might have policies in place to block certain devices. Dealing with that might require more knowledge about their policies and how to work around them, by routing through another ISP, using a VPN and so forth. As a “network service provider” you’re not obliged to help with that, unless it is actually your network setup that is interfering with the connection.

So you can start by adding firewall logging rules on your exit point, to confirm that the traffic is actually leaving your network and returning from the remote network, and also that it’s doing so on the right connections.
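Once you have those logs, the check itself is simple. Below is a rough Python sketch of the idea, assuming you have already reduced the log lines to `(connection id, direction, interface)` tuples. That record format and the interface names are my own invention for the example, not anything your router produces as-is.

```python
from collections import defaultdict


def find_asymmetric(records):
    """Flag connections whose replies are missing or arrive on another interface.

    records: iterable of (conn_id, direction, interface) tuples, where
    direction is "out" (leaving your network) or "in" (reply coming back).
    """
    seen = defaultdict(lambda: {"out": set(), "in": set()})
    for conn_id, direction, interface in records:
        seen[conn_id][direction].add(interface)

    problems = {}
    for conn_id, ifaces in seen.items():
        if not ifaces["out"]:
            problems[conn_id] = "no outbound traffic seen"
        elif not ifaces["in"]:
            problems[conn_id] = "no reply seen"
        elif ifaces["in"] != ifaces["out"]:
            problems[conn_id] = "reply on different interface"
    return problems


# Made-up sample data: c1 is healthy, c2 returns via another WAN, c3 gets no reply.
sample = [
    ("c1", "out", "wan1"), ("c1", "in", "wan1"),
    ("c2", "out", "wan1"), ("c2", "in", "wan2"),
    ("c3", "out", "wan3"),
]
print(find_asymmetric(sample))
```

Anything flagged as "no reply seen" points at the upstream or the remote end; a reply on a different interface points back at your own routing or NAT rules.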

Some networks might have explicit or transparent proxies, and they could be misconfigured, or set to only allow traffic to or from certain devices. The upstream ISP might have the same, in which case you need to raise the issue with them.

Then, there might also be MTU problems. Some routers or devices might be outdated and drop certain connections because the packets are too big, so some responses might not be coming through. So a last resort would be to check the path MTU, and to lower the MTU on some of the ports if you see that certain connections are getting broken.

Thanks for the suggested troubleshooting ideas.
The same problem still happens when the other 2 upstream providers are unplugged, which leads me to the conclusion that the problem is not with any of the upstream providers.

It’s definitely our network setup interfering with the connection because the problem suddenly disappears when a mobile network connection is used.

Will look into the firewall logging rules in depth and advise on the outcome. Resolving such issues takes time unless, of course, you have the expertise and the experience. During that period the network usage will drop, resulting in cancellation of services.
You might be right regarding the MTU problems. Come to think of it, the problem actually started after upgrading from a Mikrotik RB3011 to a Mikrotik CCR 1036. It doesn’t make sense, though, because the CCR should be capable of handling higher traffic volumes than the RB3011.
Will troubleshoot further and share the outcomes. I hope someone in our Community network has come across the same issue and can advise how it was resolved.

Thanks again for the advice. “Babysitting a network” is no easy job, especially when the demand for services increases rapidly. Community networks have the potential to connect masses of communities, and this can only mean that our responsibilities as technicians and network providers become heavy. Skills development will be a critical component…

You’ve tried this with each upstream individually and it does the same on all 3? I know that’s a tedious way to make sure it is your equipment… but at least you can be sure then!

The RB3011 and the CCR are based on different hardware, so a different driver is used for each. A problem like this can be fixed in software without changing the MTU, and it might already have been, so also make sure you are on the latest versions.

Even if not resolved, you can work around it with firewall rules - try pinging with the DF flag set to find the path MTU, then clamp it as per https://forum.mikrotik.com/viewtopic.php?t=130501
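If you want to automate the DF ping from a Linux box behind the router, something like the Python sketch below would do it. The assumptions are mine: it uses the Linux iputils `ping` flags (`-M do` sets DF), it only covers IPv4, and the function names and search bounds are arbitrary. The search is a plain binary search over the ICMP payload size.

```python
import subprocess


def ping_df(host, payload):
    """Send one ping with the DF flag set (Linux iputils syntax, an assumption)."""
    result = subprocess.run(
        ["ping", "-M", "do", "-c", "1", "-W", "2", "-s", str(payload), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def find_path_mtu(probe, low=548, high=1472):
    """Binary search for the largest ICMP payload that gets through.

    probe(payload) must return True if a DF ping of that payload size works.
    Returns the path MTU in bytes, or None if even the smallest probe fails.
    """
    if not probe(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low + 28  # add 20 bytes IPv4 header + 8 bytes ICMP header


def tcp_mss(mtu):
    """MSS to clamp to: MTU minus 20 bytes IPv4 header and 20 bytes TCP header."""
    return mtu - 40


# Usage against a real host (hypothetical address):
# mtu = find_path_mtu(lambda size: ping_df("192.0.2.1", size))
```

For example, on a PPPoE-style link with an MTU of 1492 the largest working payload is 1464, and the MSS to clamp to would be 1452. Keep in mind that a lost ping looks the same as a too-big packet here, so repeat the test before trusting the number.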

I think a key troubleshooting step is to write down everything simply, including all the permutations. What often happens is that during troubleshooting, things change, or you forget a critical step, and then you completely miss something basic. I know I have spent hours looking for problems in all the wrong places, when all I did was forget a . or a , somewhere, or mistype one digit in an IP address. Triple check everything in every step and make sure that nothing else changed. For example, I always keep a continuous ping running in the background to every device that must work for the test I am doing, so that I know if I am being affected by some connectivity issue somewhere along the specific link. I think just this has saved me hours of troubleshooting.
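The background ping can be as simple as a few terminal windows, but here is a rough Python sketch of the same idea that only prints when a device changes state. It assumes the Linux `ping` flags (`-c`, `-W`), and the function names are just ones I picked.

```python
import subprocess
import time


def ping_once(host):
    """One ping with a 1 second timeout (Linux ping syntax, an assumption)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def check_all(hosts, probe, last_state):
    """Probe every host once and return (host, is_up) for those that changed."""
    changes = []
    for host in hosts:
        up = probe(host)
        if last_state.get(host) != up:
            changes.append((host, up))
            last_state[host] = up
    return changes


def watch(hosts, interval=1.0):
    """Loop forever, printing a timestamped line whenever a host goes up or down."""
    state = {}
    while True:
        for host, up in check_all(hosts, ping_once, state):
            print(time.strftime("%H:%M:%S"), host, "UP" if up else "DOWN")
        time.sleep(interval)


# Usage (hypothetical addresses): watch(["192.168.88.1", "10.0.0.1"])
```

The timestamps are the useful part: you can line them up with your notes and see whether a "failed" test was really just a link dropping in the background.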

Had to actually spend Friday evening at the DC to resolve the issue. After resetting the Mikrotik, I tried the exercise of testing each upstream provider separately again, only to discover that the problem is from the Nap traffic. When the Nap traffic is enabled some of the sites don’t work, but when it is disabled everything works fine.
Will troubleshoot further and share, as it might help someone else who’s experiencing the same issues.

Aha! Progress! Friday afternoon/network troubleshooting… :)

Hmmm… Also breaking my brain here, I haven’t dealt with anything like this in a year, so I might not be the best person to ask… But I know if you keep trying you will figure it out.

Are you using BGP? Did you give it enough time to catch up with all the routes? (Do your peers have a filter to prevent someone else from using your ASN? Can you access it remotely when your router is down?) Is your traffic on that connection leaving your gateway router with the right source/return address? If not, it might get filtered at the remote end. Do you have another IP address that you can try? Do you have any ideas of what it could be?

Do try the peering mailing list to ask if anyone else can think of solutions, or is having or has had the same or similar issues.