Managing Swarm environment with portainer: could not get it to work

8 Posts
2 Users
0 Reactions
74 Views
Posts: 4
Topic starter
(@ifs77)
Active Member
Joined: 2 weeks ago

I'm a newbie homelabber and was very impressed by Brandon's video on YouTube called "Best Container Server Setup". It seems that Swarm + Portainer is really an ideal middle ground between bare Docker and Kubernetes, and thanks to Brandon for highlighting this solution.
I tried to reproduce Brandon's setup in my lab, but gave up after two full days of hard effort.
I couldn't connect the Portainer server to the agents on the nodes. It simply doesn't work: I get "Client.Timeout exceeded while awaiting headers" every time I press the "Connect" button in the environment creation dialog. The only way I found that works is to connect to the Swarm via the socket option, but that approach doesn't give you the beauty of the Cluster Visualizer and thus becomes mostly useless.
I made numerous attempts to bootstrap it, starting with carefully repeating all the steps from the video: going to the Portainer instance (which is on a node outside of the cluster), copying the commands provided by the wizard to a destination node, then, after deploying the agents, trying to connect Portainer to them. As soon as that didn't work, I went further, searching for a cause: fiddling with UFW in Ubuntu 24.04, iptables, DNS, trying to re-deploy nodes with and without Keepalived, reinstalling a few Portainer versions, installing Portainer outside and inside the cluster, re-creating VMs from the Ubuntu full image instead of the cloud image, etc. I ended up trying to roll this out on Debian instead of Ubuntu, but it didn't work either.
Judging by some of the comments on the YouTube video, this is a common problem, maybe a bug in Portainer. I found similar complaints on their GitHub and elsewhere on the internet, but no suitable solution or explanation, unfortunately.
Given all of that, I would consider it a bug and give up on it entirely, but I saw it working in the video.
Could somebody give me any advice on what I might be doing wrong and what the right path is?
---
Related links that I've used:
-
-
-
-
-
-
7 Replies
Brandon Lee
Posts: 439
Admin
(@brandon-lee)
Member
Joined: 14 years ago

@ifs77 welcome to the forums! I noticed a couple of others mention in the comments they were having issues as well. Give me more details on how you set things up. Hopefully we can figure out what is going on there. 👍

Reply
1 Reply
Brandon Lee
Admin
(@brandon-lee)
Joined: 14 years ago

Member
Posts: 439

@ifs77 Also, just re-reading your post, did you set up the Docker Swarm cluster using the service command from Portainer?

This command looks like this:

docker network create \
  --driver overlay \
  portainer_agent_network

docker service create \
  --name portainer_agent \
  --network portainer_agent_network \
  -p 9001:9001/tcp \
  --mode global \
  --constraint 'node.platform.os == linux' \
  --mount type=bind,src=//var/run/docker.sock,dst=/var/run/docker.sock \
  --mount type=bind,src=//var/lib/docker/volumes,dst=/var/lib/docker/volumes \
  --mount type=bind,src=//,dst=/host \
  portainer/agent:2.21.5

Also, I am running Keepalived with a virtual IP address that I pointed the wizard to when onboarding my swarm cluster. The above command sets up the agent in "global" mode, which means it will run on each of your swarm nodes.
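
As a quick sanity check (assuming you kept the network and service names from the commands above), you can confirm the overlay network exists and that an agent task is running on every node:

docker network ls --filter name=portainer_agent_network
docker service ps portainer_agent

If any agent tasks show up as rejected or keep restarting, the node they landed on is usually the place to look.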

Reply
Posts: 4
Topic starter
(@ifs77)
Active Member
Joined: 2 weeks ago

Hello Brandon,
Thank you for the quick reply!
Yes, I can easily give you more details, as I was taking notes of all the steps I did. I'll attach a file to this post (just change the extension to .md, because .md files are disallowed on this forum).
Yes, I was executing those commands from Portainer's wizard. I tried to connect first with the virtual IP, then, after it threw back an error, with the individual IPs of all the nodes. Then I decided to eliminate Keepalived from the chain because I suspected it was the source of the error. I rolled back all VMs to bare Ubuntu, then repeated all the steps without Keepalived, using the individual node IPs, but the result was the same.


Reply
2 Replies
(@ifs77)
Joined: 2 weeks ago

Active Member
Posts: 4

Sorry, I didn't describe my infrastructure.
I have 2 hardware servers, one running ESXi and the other Proxmox 8.3.2. My first attempt was to distribute the Swarm nodes between those 2 servers; then I decided to simplify things and roll them all up on the single Proxmox server. The firewall service is disabled in Proxmox at both the datacenter and node levels, and I don't have any filtering or restriction rules, VLANs, etc. in my home network. It's just a simple network with my router's DHCP: 1 subnet, 1 gateway (the router), and a single DNS server on Pi-hole. All VMs were using DHCP.
When my Swarm node VMs were running, their FQDNs were resolvable by nslookup when I called it from my desktop (Windows and macOS):

Server:         10.0.0.254
Address:        10.0.0.254#53
Name:   swarm-3.babylon-8.local
Address: 10.0.0.63

However, on those particular Swarm nodes themselves, nslookup in Ubuntu gave strange output:

;; Got SERVFAIL reply from 127.0.0.53
Server: 127.0.0.53
Address: 127.0.0.53#53
** server can't find swarm-3.babylon-8.local: SERVFAIL

I don't have this issue on Debian, but my Debian Swarm was also unreachable from Portainer.

Reply
Brandon Lee
Admin
(@brandon-lee)
Joined: 14 years ago

Member
Posts: 439

@ifs77 I just saw this message after I replied to your latest one. I believe your issue is DNS / systemd-resolved: your swarm hosts may not be resolving back to your Portainer server, and Portainer may not be resolving the swarm hosts. Do you have an internal DNS server for your home lab? I disable systemd-resolved on my Ubuntu hosts, as it generally causes me issues across the board. Here are some cheat notes to use if you want to move away from systemd-resolved and point your hosts to an internal DNS server you have running.

I would get your DNS squared away on your swarm hosts; it may also be in play on your Portainer Docker host. Once you have them all pointed to the same DNS server, make sure you can resolve everything from all nodes.
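
Before you change anything, it's worth seeing what the stub resolver is actually doing on the Ubuntu swarm nodes. resolvectl ships with systemd-resolved, so something like this (using the swarm-3 name from your output) should show which upstream DNS servers it is using and reproduce that SERVFAIL:

resolvectl status
resolvectl query swarm-3.babylon-8.local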

Stop the service

This stops and disables systemd-resolved

sudo systemctl stop systemd-resolved
sudo systemctl disable systemd-resolved
sudo systemctl mask systemd-resolved

Check for symlink

By default the resolv.conf file is a symlink

ls -l /etc/resolv.conf

Remove symlink

sudo rm /etc/resolv.conf

Create a new file

Just replace these with your own DNS server IP and DNS search suffix.

sudo bash -c 'echo -e "nameserver 192.168.1.53\nsearch homelab.local" > /etc/resolv.conf'

Set the file to immutable

This step makes sure nothing can change the resolv.conf file

sudo chattr +i /etc/resolv.conf
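
After that, a quick test from each swarm node and from the Portainer host should confirm they all resolve the same names (using the swarm-3 FQDN from your output and the example DNS server IP above; substitute your own):

nslookup swarm-3.babylon-8.local
nslookup <portainer host FQDN> 192.168.1.53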
Reply
Brandon Lee
Posts: 439
Admin
(@brandon-lee)
Member
Joined: 14 years ago

Gotcha @ifs77. Can I ask, when you run this command from one of your Swarm manager nodes, do you see your Portainer service listed?

docker service ls

The service will show up as something like this:

portainer_agent                              global       3/3        portainer/agent:2.21.4                                     *:9001->9001/tcp
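
If it is listed but the replica count isn't 3/3, something like this will show which node's task is failing and why (assuming the service is named portainer_agent as in the wizard command):

docker service ps portainer_agent --no-trunc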

Also, you can use something simple like the telnet client from a Windows machine to check whether you can reach port 9001:

telnet <swarm ip> 9001

Are your portainer instance and your swarm nodes on the same subnet?


Reply
Posts: 4
Topic starter
(@ifs77)
Active Member
Joined: 2 weeks ago

Hello @brandon-lee,
thank you for your advice, it was very helpful! It took me a while to implement, but I did it. Now I can resolve the domain names in my network from all of my machines without errors.
Unfortunately, it did not help me with Portainer. I'm still getting the same error during Swarm environment creation, both when the Portainer server instance is running on one of the Swarm nodes and when it is outside of the cluster.
To be more precise: when I deploy Portainer inside the cluster, which I think is the right way since it's what the official site recommends, I can't reach the web GUI even though both the Portainer and Portainer agent services are running and show the proper port allocations.
When Portainer is running on another machine and I try to connect the cluster following the wizard steps, it ends with the error I mentioned in my first post.
I can confirm that all the needed Docker containers are running, ports are not blocked by a firewall, and DNS names are resolving.
How do I check this? Before installing the Portainer agents on the nodes, I run

nc -l -p 9001

then from another computer on Windows I run the PowerShell command

Test-NetConnection -ComputerName <FQDN> -Port 9001

and get

ComputerName : <FQDN>
RemoteAddress : 10.0.0.61
RemotePort : 9001
InterfaceAlias : tun2
SourceAddress : 10.33.0.2
TcpTestSucceeded : True

Then, after installing the Portainer agents on those nodes, I can't listen on that port anymore because the agents have taken port 9001, I suppose ("nc: Address already in use" on the machine where only the agent is running, and "Can't grab 0.0.0.0:9001 with bind" on the machine where both the Portainer server and the agent are deployed).
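
I suppose I could confirm which process is actually holding the port with a standard socket listing, something like:

sudo ss -tlnp | grep 9001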
Wrapping up: the annoying error still exists and I can't connect Portainer to the cluster.
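
If it helps, I can also grab the agent service logs from a manager node and post them here (assuming the default service name from the wizard):

docker service logs portainer_agent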
Any suggestions?

Reply