Border Gateway Protocol (BGP) and Kubernetes


In this post I want to write a bit about Border Gateway Protocol (BGP), Calico and Kubernetes. In the examples we will first see a use case with OpenBSD and OpenBGPD, and then Project Calico in BGP mode, to illustrate how this protocol can be used by the CNI to provide automatic routing across nodes.

After reading, leave comments at the end and share the post. Nuff said.

BGP protocol

The RFC of BGP-4 is pretty clear in its abstract: the primary function of BGP is to exchange network reachability information with other BGP systems. This network reachability information includes information on the list of Autonomous Systems (ASes) that reachability information traverses.

This information is sufficient for constructing a graph of AS connectivity for this reachability from which routing loops may be pruned, and, at the AS level, some policy decisions may be enforced.

BGP-4 provides a set of mechanisms for supporting Classless Inter-Domain Routing (CIDR). These mechanisms include support for advertising a set of destinations as an IP prefix, and eliminating the concept of network “class” within BGP. BGP-4 also introduces mechanisms that allow aggregation of routes, including aggregation of AS paths.
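To make route aggregation a bit more concrete, here is a minimal sketch using Python's standard `ipaddress` module. The four /24 prefixes are made up for illustration; a BGP speaker that knows all four can advertise the single covering prefix instead.

```python
import ipaddress

# Four contiguous /24 prefixes, chosen only for illustration.
prefixes = [ipaddress.ip_network(f"172.16.{i}.0/24") for i in range(4)]

# collapse_addresses merges adjacent networks into the shortest covering set.
aggregated = list(ipaddress.collapse_addresses(prefixes))

print(aggregated)  # [IPv4Network('172.16.0.0/22')]
```

This only shows the prefix arithmetic. A real BGP implementation also has to merge the path attributes of the routes it aggregates, which is what the RFC means by aggregation of AS paths.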

In practice, this means routers that can communicate with each other can learn routes from one another automatically, instead of having them configured by hand.

From Wikipedia: An autonomous system (AS) is a collection of connected Internet Protocol routing prefixes under the control of one or more network operators on behalf of a single administrative entity or domain, that presents a common and clearly defined routing policy to the Internet.

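The reason for this separation is scalability and policy: a single routing domain covering every network would not scale, and could not express each operator's routing policy. For this reason routers inside an AS can run their own protocols. Interior Gateway Protocols (IGPs) route within an AS, Exterior Gateway Protocols (EGPs) route between ASes, and BGP is the EGP in use today. AS numbers (ASNs) are unique identifiers; in the 16-bit space, 1-64511 are mostly public ones and 64512-65534 are reserved for private use, which is why the labs in this post use numbers like 65001.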
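As a rough sketch of how the 16-bit ASN space is carved up, the helper below classifies a number. The function name `classify_asn` is hypothetical; the ranges follow the IANA registry as I understand it (private use from RFC 6996, documentation from RFC 5398), and 32-bit ASNs are left out on purpose.

```python
def classify_asn(asn: int) -> str:
    """Classify a 16-bit AS number. Hypothetical helper, 32-bit ASNs not covered."""
    if not 0 <= asn <= 65535:
        raise ValueError("not a 16-bit AS number")
    if asn in (0, 65535):
        return "reserved"
    if asn == 23456:
        return "reserved (AS_TRANS)"
    if 64496 <= asn <= 64511:
        return "documentation"
    if 64512 <= asn <= 65534:
        return "private"
    return "public"


for asn in (13335, 64512, 65001, 65002):
    print(asn, classify_asn(asn))
```

The ASNs used by the two OpenBSD routers (65001 and 65002) and by the Calico mesh (64512) all fall in the private range, so they are safe to use in a lab.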

There is no requirement for networks running BGP to have an IGP as well. Simple multihomed networks with two routers run just fine without an IGP: a few static routes are all that’s needed, because all traffic goes to a directly connected network, to the rest of the world, or to the other router. For larger networks, IGPs are a fact of life. Your only choice for an EGP is BGP-4, but IGPs let you use any interior routing protocol you desire. On a Cisco router, you have the following choices: RIP, IGRP, EIGRP, OSPF, and IS-IS.

About the protocol, and according to this book, BGP uses TCP on port 179 for communication between neighbors. This is unusual: all other routing protocols either run directly on top of IP or use UDP.

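Running directly over IP or UDP is what lets those other protocols send broadcasts or multicasts to discover neighboring routers. TCP is unicast, so BGP neighbors have to be configured explicitly. When BGP neighbors establish a TCP session, they start exchanging BGP information in the form of “messages.”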

These messages come in a few types, which we will look at more closely later:

  • Open - After a TCP connection is established, the first message sent by each side is an OPEN message. If the OPEN message is acceptable, a KEEPALIVE message confirming the OPEN is sent back.
  • Update - UPDATE messages are used to transfer routing information between BGP peers. The information in the UPDATE message can be used to construct a graph that describes the relationships of the various Autonomous Systems.
  • Keepalive - BGP does not use any TCP-based, keep-alive mechanism to determine if peers are reachable. Instead, KEEPALIVE messages are exchanged between peers often enough not to cause the Hold Timer to expire.
  • Notification - A NOTIFICATION message is sent when an error condition is detected. The BGP connection is closed immediately after it is sent.
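All four message types share the same fixed 19-byte header described in RFC 4271: a 16-byte marker of all ones, a 2-byte length and a 1-byte type. The sketch below builds a KEEPALIVE (which is nothing but that header) and parses it back. The function names are my own, and this is an illustration of the wire format, not a BGP implementation.

```python
import struct
from typing import Tuple

MARKER = b"\xff" * 16
HEADER = struct.Struct("!16sHB")  # marker, length, type (network byte order)
MESSAGE_TYPES = {1: "OPEN", 2: "UPDATE", 3: "NOTIFICATION", 4: "KEEPALIVE"}


def build_keepalive() -> bytes:
    """A KEEPALIVE carries no body, so it is just the 19-byte header."""
    return HEADER.pack(MARKER, HEADER.size, 4)


def parse_header(data: bytes) -> Tuple[str, int]:
    """Return (message type name, total message length) from raw bytes."""
    if len(data) < HEADER.size:
        raise ValueError("truncated BGP header")
    marker, length, msg_type = HEADER.unpack_from(data)
    if marker != MARKER:
        raise ValueError("bad marker")
    if not 19 <= length <= 4096:
        raise ValueError("length out of range")
    return MESSAGE_TYPES.get(msg_type, "UNKNOWN"), length


print(parse_header(build_keepalive()))  # ('KEEPALIVE', 19)
```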

Besides that, BGP has a Finite State Machine with six states: Idle, Connect, Active, OpenSent, OpenConfirm and Established.
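A heavily simplified sketch of that state machine follows. The event names are labels I made up for readability (the RFC numbers its events), timers and the DelayOpen option are ignored, and anything not listed in the table falls back to Idle, so treat it as a mental model only.

```python
from enum import Enum


class State(Enum):
    IDLE = "Idle"
    CONNECT = "Connect"
    ACTIVE = "Active"
    OPEN_SENT = "OpenSent"
    OPEN_CONFIRM = "OpenConfirm"
    ESTABLISHED = "Established"


# (current state, event) -> next state. Simplified subset of RFC 4271.
TRANSITIONS = {
    (State.IDLE, "start"): State.CONNECT,
    (State.IDLE, "start_passive"): State.ACTIVE,
    (State.CONNECT, "tcp_connected"): State.OPEN_SENT,
    (State.ACTIVE, "tcp_connected"): State.OPEN_SENT,
    (State.OPEN_SENT, "tcp_failed"): State.ACTIVE,
    (State.OPEN_SENT, "open_received"): State.OPEN_CONFIRM,
    (State.OPEN_CONFIRM, "keepalive_received"): State.ESTABLISHED,
    (State.ESTABLISHED, "keepalive_received"): State.ESTABLISHED,
    (State.ESTABLISHED, "update_received"): State.ESTABLISHED,
}


def step(state: State, event: str) -> State:
    """Errors, NOTIFICATIONs and unknown events drop the session to Idle."""
    return TRANSITIONS.get((state, event), State.IDLE)


state = State.IDLE
for event in ("start", "tcp_connected", "open_received", "keepalive_received"):
    state = step(state, event)
    print(f"{event:20} -> {state.value}")
```

The happy path mirrors the message list: TCP connects, each side sends OPEN, a KEEPALIVE confirms it, and only then are UPDATEs exchanged in the Established state.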

OpenBGP and OpenBSD

Enough theory. Boot up your install71.iso; for this lab you will need 2 machines that can at least ping each other.


Machine   IP address   ASN
r1                     65001
r2                     65002

The following are the /etc/bgpd.conf files from both OpenBSD routers:

# r1
prefix-set networks {

include "/var/db/rpki-client/openbgpd"
network prefix-set networks set large-community $ASN:1:1
group "upstreams" {
        neighbor {
                remote-as 65002
                descr "r2"
# r2
prefix-set networks {
prefix-set network {
network prefix-set network set large-community $ASN:1:1
group "upstreams" {
        neighbor {
                remote-as 65001
                descr "r1"

Respectively we have as outside networks from the AS, connected on R1 and on R2. As you can see, we need to create groups and add the neighbors explicitly, because the protocol won’t multicast to find its peers.

Start the service with /etc/rc.d/bgpd start and check the routes that were added. You need to add the route to the outside network on each end. and are the gateways for each network.

r1> route add -inet
add net gateway
r1> route -n show
Routing tables
Destination        Gateway            Flags   Refs      Use   Mtu  Prio Iface
default       UGS        6    18731     -     8 em0
172.16/16     UG         0        0     -    48 em0
172.15/16       UG         0        0     -    48 em0
r2> route add -inet
add net gateway
r2> route -n show
Routing tables
Destination        Gateway            Flags   Refs      Use   Mtu  Prio Iface
default       UGS        5   125326     -     8 em0
172.15/16     UG         0        0     -    48 em0
172.16/16      UGS        0        0     -     8 em0
r2> bgpctl show rib neighbor "r1" in
flags: * = Valid, > = Selected, I = via IBGP, A = Announced,
       S = Stale, E = Error
origin validation state: N = not-found, V = valid, ! = invalid
origin: i = IGP, e = EGP, ? = Incomplete

flags ovs destination          gateway          lpref   med aspath origin
*       N    100     0 65001 i

That’s great, both hosts added the routes, and you can see the messages being exchanged in this Wireshark diagram. Basically, r2 started a connection while the daemon on r1 wasn’t running (that’s why it received a RST/ACK). A few seconds later r1 started the daemon, after its configuration file was fixed. The first PSH is an OPEN message from r1; the message is ACKed and a KEEPALIVE is received.


The UPDATE messages are for route exchange. You can see the second UPDATE decoded here: r2 announces its route with the AS_PATH attribute set to 65002 and NEXT_HOP being its own IP. The RFC has more details on these attributes:


Calico CNI and BGP mode

What about Kubernetes? Where is BGP used? Well, it’s possible to configure BGP (Border Gateway Protocol) between Calico nodes, or to peer with the network infrastructure, to distribute routing information. When BGP is enabled, Calico’s default behavior is to create a full mesh of internal BGP (iBGP) connections where each node peers with every other node.
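A quick way to see what a full mesh implies is to count the sessions. The sketch below uses the node names of the kind cluster from this post; the arithmetic is the point, nothing here talks to Calico.

```python
from itertools import combinations

# Node names taken from the kind cluster used in this post.
nodes = ["kind-control-plane", "kind-worker", "kind-worker2", "kind-worker3"]

# In a full mesh every unordered pair of nodes shares one BGP session.
sessions = list(combinations(nodes, 2))
peers_per_node = len(nodes) - 1

print(len(sessions), peers_per_node)  # 6 3
```

The session count grows as n(n-1)/2, which is fine for a handful of nodes but is the reason larger clusters usually move away from the full mesh.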

This allows Calico to operate over any L2 network, whether public cloud or private cloud, or, if IPIP is configured, to operate as an overlay over any network that does not block IPIP traffic. Calico does not use BGP for VXLAN overlays, so VXLAN encapsulation will not be used for this purpose.

Too much? Not yet… Josh explains in this video the modes and encapsulation of Calico. Among other good resources, the entire series Understanding Kubernetes Networking is worth watching. More than 4 hours of free content on the Internet, move your lazy ass!

How does routing normally work for non-BGP CNIs? Nodes have a podCIDR, and the CNI can use libraries like this to manage the routing based on these Node fields. This is normally done by an agent running as a DaemonSet.

apiVersion: v1
kind: Node
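To illustrate what such an agent does, here is a minimal sketch that turns each remote node's podCIDR into a static route via that node's IP. The node inventory is entirely hypothetical (a real agent would read it from the Kubernetes API and program the kernel through netlink), and the sketch only prints the equivalent `ip route` commands.

```python
import ipaddress
from typing import Dict, List

# Hypothetical inventory: node name -> node IP and podCIDR.
NODES: Dict[str, Dict[str, str]] = {
    "node-a": {"ip": "192.0.2.11", "pod_cidr": "10.244.0.0/24"},
    "node-b": {"ip": "192.0.2.12", "pod_cidr": "10.244.1.0/24"},
    "node-c": {"ip": "192.0.2.13", "pod_cidr": "10.244.2.0/24"},
}


def routes_for(local: str, nodes: Dict[str, Dict[str, str]]) -> List[str]:
    """Build one route per remote node: its podCIDR via its node IP."""
    commands = []
    for name, info in sorted(nodes.items()):
        if name == local:
            continue  # local pods are reached directly, no route needed
        cidr = ipaddress.ip_network(info["pod_cidr"])  # validates the CIDR
        gateway = ipaddress.ip_address(info["ip"])
        commands.append(f"ip route replace {cidr} via {gateway}")
    return commands


for command in routes_for("node-a", NODES):
    print(command)
```

With BGP the same information is exchanged between the nodes themselves, so nothing has to watch the API server to keep these routes in sync.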

A lot of other projects allow you to set up these connections, including FRRouting and BIRD, the latter being used by Calico in the agent. Make sure you are configuring the bgp cluster type and the BIRD backend; in the example we are using IP in IP as the encapsulation.

`curl -O`

# Use Kubernetes API as the backing datastore.
  value: "kubernetes"
# Choose the backend to use.
  value: "bird"
# Cluster type to identify the deployment type
  value: "k8s,bgp"
# Enable IPIP
  value: "Always"
# Enable or Disable VXLAN on the default IP pool.
  value: "Never"

Debugging routes

In this new scenario we have 4 nodes (a control plane and 3 workers), in the IPPool range of split as follows:

❯ calicoctl ipam check

Found 4 IPAM blocks.
 IPAM block affinity=host:kind-worker2:
 IPAM block affinity=host:kind-worker:
 IPAM block affinity=host:kind-worker3:
 IPAM block affinity=host:kind-control-plane:
IPAM blocks record 8 allocations.

Loading all IPAM pools...
Found 1 active IP pools.

❯ calicoctl ipam show
| IP Pool  | |     65536 | 8 (0%)     | 65528 (100%) |
❯ kubectl get nodes -o wide
kind-control-plane   Ready    control-plane   8m15s   v1.24.0    <none>        Ubuntu 21.10   containerd://1.6.4
kind-worker          Ready    <none>          7m41s   v1.24.0    <none>        Ubuntu 21.10   containerd://1.6.4
kind-worker2         Ready    <none>          7m41s   v1.24.0    <none>        Ubuntu 21.10   containerd://1.6.4
kind-worker3         Ready    <none>          7m41s   v1.24.0    <none>        Ubuntu 21.10   containerd://1.6.4
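The `calicoctl ipam show` output reports 65536 addresses, which is the size of a /16 pool, and Calico's default IPAM block size is /26. The sketch below works through that arithmetic. The pool CIDR and the pod IP are assumptions picked for illustration, not values taken from this cluster.

```python
import ipaddress

# Assumed pool CIDR: any /16 gives the 65536 addresses reported by calicoctl.
pool = ipaddress.ip_network("10.244.0.0/16")
BLOCK_PREFIX = 26  # Calico's default IPAM block size

blocks = list(pool.subnets(new_prefix=BLOCK_PREFIX))
print(pool.num_addresses, len(blocks), blocks[0].num_addresses)  # 65536 1024 64

# Hypothetical pod IP: find the /26 block (and so the node) it belongs to.
pod_ip = ipaddress.ip_address("10.244.195.193")
block = ipaddress.ip_network(f"{pod_ip}/{BLOCK_PREFIX}", strict=False)
print(block)  # 10.244.195.192/26
```

Each block has an affinity to one node, so a route per block is all the other nodes need to reach every pod address inside it.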

If we get the status of the node kind-worker, we can see its neighbors. Listing the routes, we can get via, that’s the control-plane host. As we see, the tunl0 interface is used, meaning the traffic is being encapsulated; when l2bridge is active it would go via eth0, the physical interface, instead. The same is true for the other ranges.

root@kind-worker:/# calicoctl node status
Calico process is running.

IPv4 BGP status
| PEER ADDRESS |     PEER TYPE     | STATE |  SINCE   |    INFO     |
|   | node-to-node mesh | up    | 23:52:12 | Established |
|   | node-to-node mesh | up    | 23:52:12 | Established |
|   | node-to-node mesh | up    | 23:52:11 | Established |

IPv6 BGP status
No IPv6 peers found.

root@kind-worker:/# ip route
default via dev eth0
via dev tunl0 proto bird onlink
via dev tunl0 proto bird onlink
blackhole proto bird
via dev tunl0 proto bird onlink
dev eth0 proto kernel scope link src

A pod running on kind-worker3 on, which is in the (195.192) range and is assigned by the IPAM plugin, has its traffic routed to the node as the next hop. Remember that the pod is a container running on the node (with another IP, in its own network namespace). As soon as the node receives the packet, it forwards it to the Pod via the caliba2160f804b interface.

> kubectl get pods -o wide
default              nginx3                                       1/1     Running   0          11s   kind-worker3         <none>           <none>

root@kind-worker3:/# ip route
via dev tunl0 proto bird onlink
via dev tunl0 proto bird onlink
via dev tunl0 proto bird onlink
dev `caliba2160f804b` scope link

If we check the agent's BIRD IPv4 configuration with cat /etc/calico/confd/config/bird.cfg, we can find the peers:

router id;

# For peer /host/kind-control-plane/ip_addr_v4
protocol bgp Mesh_172_18_0_5 from bgp_template {
  neighbor as 64512;
  source address;  # The local address we use for the TCP connection
  passive on; # Mesh is unidirectional, peer will connect to us.

# For peer /host/kind-worker/ip_addr_v4
# Skipping ourselves (

# For peer /host/kind-worker2/ip_addr_v4
protocol bgp Mesh_172_18_0_2 from bgp_template {
  neighbor as 64512;
  source address;  # The local address we use for the TCP connection

# For peer /host/kind-worker3/ip_addr_v4
protocol bgp Mesh_172_18_0_4 from bgp_template {
  neighbor as 64512;
  source address;  # The local address we use for the TCP connection
  passive on; # Mesh is unidirectional, peer will connect to us.

One last try: while listening with tshark on kind-worker3, we will delete the calico-node agent on the control-plane node. A connection is kept open across all nodes by the bird daemon on this port, and as soon as the connection is closed, a new attempt is made, with the protocol's OPEN and UPDATE messages.

root@kind-worker3:/# lsof -i :179
bird    12294 root    7u  IPv4 351376      0t0  TCP *:bgp (LISTEN)
bird    12294 root    8u  IPv4 366653      0t0  TCP kind-worker3:bgp->kind-control-plane.kind:58325 (ESTABLISHED)
bird    12294 root    9u  IPv4 351408      0t0  TCP kind-worker3:52953->kind-worker.kind:bgp (ESTABLISHED)
bird    12294 root   10u  IPv4 351409      0t0  TCP kind-worker3:35655->kind-worker2.kind:bgp (ESTABLISHED)

root@kind-worker3:/# tshark -i any "port 179"
    1 0.000000000 →   TCP 68 58325 → 179 [FIN, ACK] Seq=1 Ack=1 Win=502 Len=0 TSval=820590145 TSecr=888352057
    2 0.000126600 →   TCP 68 179 → 58325 [FIN, ACK] Seq=1 Ack=2 Win=510 Len=0 TSval=888379111 TSecr=820590145
    3 0.000155500 →   TCP 68 58325 → 179 [ACK] Seq=2 Ack=2 Win=502 Len=0 TSval=820590145 TSecr=888379111
    4 7.368669600 →   TCP 76 48333 → 179 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 SACK_PERM=1 TSval=820597513 TSecr=0 WS=128
    5 7.368690500 →   TCP 76 179 → 48333 [SYN, ACK] Seq=0 Ack=1 Win=65160 Len=0 MSS=1460 SACK_PERM=1 TSval=888386479 TSecr=820597513 WS=128
    6 7.368718300 →   TCP 68 48333 → 179 [ACK] Seq=1 Ack=1 Win=64256 Len=0 TSval=820597513 TSecr=888386479
    7 7.368810900 →   BGP 131 OPEN Message
    8 7.368830600 →   TCP 68 48333 → 179 [ACK] Seq=1 Ack=64 Win=64256 Len=0 TSval=820597513 TSecr=888386479
    9 7.368979900 →   BGP 131 OPEN Message
   10 7.368987300 →   TCP 68 179 → 48333 [ACK] Seq=64 Ack=64 Win=65280 Len=0 TSval=888386480 TSecr=820597514
   11 7.369079500 →   BGP 87 KEEPALIVE Message
   12 7.369096000 →   TCP 68 48333 → 179 [ACK] Seq=64 Ack=83 Win=64256 Len=0 TSval=820597514 TSecr=888386480
   13 7.369186800 →   BGP 87 KEEPALIVE Message
   14 7.369193000 →   TCP 68 179 → 48333 [ACK] Seq=83 Ack=83 Win=65280 Len=0 TSval=888386480 TSecr=820597514
   15 7.369343300 →   BGP 121 UPDATE Message
   16 7.369366400 →   TCP 68 48333 → 179 [ACK] Seq=83 Ack=136 Win=64256 Len=0 TSval=820597514 TSecr=888386480
   17 7.369381600 →   BGP 91 UPDATE Message
   18 7.369393900 →   TCP 68 48333 → 179 [ACK] Seq=83 Ack=159 Win=64256 Len=0 TSval=820597514 TSecr=888386480
   19 8.370730900 →   BGP 121 UPDATE Message
   20 8.370766100 →   TCP 68 179 → 48333 [ACK] Seq=159 Ack=136 Win=65280 Len=0 TSval=888387481 TSecr=820598515
   21 8.370781900 →   BGP 91 UPDATE Message
   22 8.370785500 →   TCP 68 179 → 48333 [ACK] Seq=159 Ack=159 Win=65280 Len=0 TSval=888387481 TSecr=820598515
   23 30.522048400 →   BGP 87 KEEPALIVE Message
   24 30.522091300 →   TCP 68 35655 → 179 [ACK] Seq=1 Ack=20 Win=502 Len=0 TSval=2977307332 TSecr=3831078776
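The frame sizes in this capture line up nicely with the BGP message format. A bare TCP segment (Len=0) is 68 bytes on the wire here, so subtracting that from each BGP frame gives the BGP payload. The sketch below does the subtraction; it assumes the data segments carry the same headers and TCP options as the empty ACKs, which the Ack numbers in the capture support.

```python
EMPTY_SEGMENT = 68  # frame length of a TCP segment with Len=0 in the capture
BGP_HEADER = 19     # fixed BGP header size from RFC 4271

# Frame lengths copied from the tshark output.
frames = {
    "OPEN": 131,
    "KEEPALIVE": 87,
    "UPDATE (first)": 121,
    "UPDATE (second)": 91,
}

payload = {name: size - EMPTY_SEGMENT for name, size in frames.items()}
print(payload)
# {'OPEN': 63, 'KEEPALIVE': 19, 'UPDATE (first)': 53, 'UPDATE (second)': 23}

# A KEEPALIVE is nothing but the header.
print(payload["KEEPALIVE"] == BGP_HEADER)  # True
```

The same numbers show up as jumps in the Ack field: 1 to 64 after the OPEN, 64 to 83 after the KEEPALIVE, then 83 to 136 and 136 to 159 after the two UPDATEs. The 23-byte UPDATE is the header plus two empty length fields, carrying no routes at all, which is consistent with an End-of-RIB marker.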


And why should I care? Routing is fun, and the next time another social media site is shut down by route withdrawals you can at least understand the post-mortem. Go get it!