Have fun with AWS GWLB : I wrote a Geneve router in Python :-)

AWS introduced GWLB (Gateway Load Balancers) a few years ago (Introducing AWS Gateway Load Balancer).

This type of load balancer drastically simplifies the way you inspect traffic across several VPCs : it avoids complex routing between VPCs, avoids a “sandwich design” with firewalls in each VPC, and also makes it possible (depending on the design) to inspect the traffic of all public VPCs at a central point while still giving each VPC its own public elastic IP.

One particular point is that GWLB is a “non-proxy” construct, meaning it does not terminate the client connections. It just encapsulates the traffic (using the Geneve protocol) and forwards it to one of the target-group members, which transparently inspects it and sends it back.

As a quick diagram always makes things clearer, here is the design I used for my tests (Terraform stack file available in the Github repo) :

  • Public-instance-1 and public-instance-2 are hosting a simple web server, and are on 2 different public subnets on 2 different AZs, on a PUBLIC-VPC
  • 2 inspection instances are on another VPC, member of a target-group used by a GWLB. They are also on 2 different subnets on 2 different AZs, on an INSPECTION-VPC
  • An endpoint service is created on the INSPECTION-VPC (linked to the GWLB)
  • An endpoint is created on the PUBLIC-VPC (2 in fact, as we need one per subnet), linked to the endpoint service of the INSPECTION-VPC

Before going deeper into the technical discussion, let’s look at the reasons which (probably) made AWS choose Geneve for the encapsulation between the GWLB and the inspection instances :

  • As for VXLAN, it uses UDP as a transport protocol (standard destination port : 6081). The source port is (generally) computed as a hash of the encapsulated flow, which brings good entropy to the “underlay” network load-balancing mechanisms (which would not be the case with GRE, for example).
  • Geneve has been designed to be flexible and extensible. In addition to the VNI field (which is the same as in VXLAN), the Geneve header can include several options (in the form of TLVs), which makes it possible to attach extra information to each transported packet. This is probably the main reason why it has been chosen over VXLAN, as AWS uses it to add useful options.

One last thing before we move on to the next parts of this article : don’t expect it to be structured. I’ll just share here some interesting information I learned or had to take into account to make this piece of code work.
There’s nothing very complex (surprisingly !), and I hope this will inspire you to work more with Python sockets or GWLB.

Geneve generalities and AWS options

As explained above, Geneve is an extensible encapsulation protocol.
Some quick comparisons can be done with VXLAN :
– Geneve uses UDP port 6081, while VXLAN uses UDP port 4789
– Geneve can encapsulate any kind of traffic (thanks to the “Protocol Type” field on the Geneve header), while VXLAN is intended to do only “MAC-in-UDP” encapsulation (it transports only L2 Ethernet frames)
– each Geneve packet can embed up to 252 bytes of variable-length options, while the VXLAN header does not have this capacity

Let’s take a look at the Geneve header :

|1 2 3 4 5 6 7 8|1 2 3 4 5 6 7 8|1 2 3 4 5 6 7 8|1 2 3 4 5 6 7 8|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Ver|  Opt Len  |O|C|    Rsvd.  |          Protocol Type        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|        Virtual Network Identifier (VNI)       |    Reserved   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
~                    Variable-Length Options                    ~
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Some interesting things on this header :
– the basic Geneve header length is 8 bytes (same as VXLAN)
– the “Opt Len” field indicates the length (in terms of 32 bits words) of the options following the header. This field is 6 bits in length, so its maximum value is 63, meaning that we can have up to 63 * 4 = 252 bytes of options on each header
– The “O” (Control) flag means that this is a control packet that the receiver of this Geneve packet should not forward
– The “C” (Critical) flag means that there are critical options on the header. If the receiver is not able to parse those options, it should drop the entire packet
– “Protocol Type” indicates the encapsulated protocol, as an Ethertype value from the IANA IEEE 802 numbers (0x0800 for IPv4, for example)
– The “VNI” : it identifies a unique element of a virtual network. The control plane decides of its meaning. This is not used by AWS in the GWLB context (value is kept to 0)
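
To make the field list above more concrete, here is a minimal sketch of how the 8 fixed bytes could be decoded with the Python struct module. The function name and the sample bytes are mine (hypothetical), they are not taken from the geneve-router code :

```python
import struct

def parse_geneve_base_header(data: bytes) -> dict:
    """Decode the 8-byte fixed Geneve header described above."""
    if len(data) < 8:
        raise ValueError("a Geneve header is at least 8 bytes long")
    first, flags, protocol_type, vni_and_reserved = struct.unpack("!BBHI", data[:8])
    return {
        "version": first >> 6,                 # 2 high-order bits
        "options_length": (first & 0x3F) * 4,  # 6 bits, counted in 4-byte words
        "control": bool(flags & 0x80),         # "O" flag
        "critical": bool(flags & 0x40),        # "C" flag
        "protocol_type": protocol_type,        # Ethertype of the inner packet
        "vni": vni_and_reserved >> 8,          # 24 bits, the last byte is reserved
    }

# Hand-built sample: version 0, 8 words (32 bytes) of options, IPv4 payload, VNI 0
sample = bytes([0x08, 0x00, 0x08, 0x00, 0x00, 0x00, 0x00, 0x00])
header = parse_geneve_base_header(sample)
print(header["options_length"], hex(header["protocol_type"]), header["vni"])
# prints: 32 0x800 0
```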

The options which can be appended to the header must follow this format :

|1 2 3 4 5 6 7 8|1 2 3 4 5 6 7 8|1 2 3 4 5 6 7 8|1 2 3 4 5 6 7 8|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Option Class         |      Type     |R|R|R|  Length |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
~                  Variable-Length Option Data                  ~
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The different fields are :
– the “Option Class” : the IANA Geneve option type identifier (a namespace for the Type field value). Amazon has 0x0105 and 0x0108 to 0x0110, but uses 0x0108 for the GWLB options
– the option “Type” : this is the option meaning, in the context of the Option Class namespace. I’ll detail below the option types used by the AWS GWLB. Note that the high-order bit of the option type indicates whether this is a critical option or not (which in turn impacts the “C” flag value in the header).
– The “Length” field : the data portion length of the option (excluding the option header, which is 4 bytes), in terms of 32 bits words. Notice that option data must be padded so that it always has a length multiple of 4 bytes.
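
The option format above can be walked with a simple loop. This is only a sketch under my own naming (parse_geneve_options is hypothetical, not the parser from the repository); the sample bytes are rebuilt by hand from the option values visible in the debug log shown later in this article :

```python
import struct

def parse_geneve_options(options: bytes) -> list:
    """Split the options area into (class, type, critical, data) tuples."""
    parsed = []
    offset = 0
    while offset + 4 <= len(options):
        opt_class, opt_type, length_byte = struct.unpack_from("!HBB", options, offset)
        data_length = (length_byte & 0x1F) * 4   # 5 bits, counted in 4-byte words
        data_start = offset + 4                  # the option header itself is 4 bytes
        data = options[data_start:data_start + data_length]
        if len(data) != data_length:
            raise ValueError("truncated Geneve option")
        critical = bool(opt_type & 0x80)         # high-order bit of the type
        parsed.append((opt_class, opt_type, critical, data))
        offset = data_start + data_length
    return parsed

# Three options of class 0x0108, rebuilt from the values in the debug log
sample_options = bytes.fromhex(
    "01080102" "2fc2a110a489f908"   # type 1, 2 words of data
    "01080202" "0000000000000000"   # type 2, 2 words of data
    "01080301" "db2eb84b"           # type 3, 1 word of data
)
for opt_class, opt_type, critical, data in parse_geneve_options(sample_options):
    print(hex(opt_class), opt_type, critical, data.hex())
```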

Let’s take a look at the options used by the AWS GWLB :

Each GWLB forwarded packet will include the 3 following Geneve options (Class 0x0108) that you can see in the capture above :
– 0x01 : GWLB Endpoint ID. This is the identifier of the VPC endpoint (type GWLBE) which forwarded the following packet to the GWLB
– 0x02 : Customer attachment ID (not used for now) : This is the attachment ID (VPC ID) of the VPC where the packet comes from when the GWLB is attached to a Transit Gateway
– 0x03 : Flow cookie. This one is very interesting : this is a randomly generated number (based on what I know, it is not linked to the encapsulated packet characteristics), unique to each flow, generated by the GWLB.

What is very important regarding this last option (flow cookie) is that a mapping between this random number and each flow forwarded by the GWLB (5 tuple Src IP, Dest IP, Src Port, Dest Port, Protocol for TCP and UDP flows, 3 tuple Src IP, Dest IP, Protocol for other protocols) is maintained as an entry in a table.
When a packet is sent back to the GWLB by the appliances in the attached target group, this value must have been copied in the “return” Geneve header.
When the packet is going back to the GWLB, it checks if the flow cookie value matches the one it generated for the concerned flow. If not, the packet is immediately dropped by the GWLB, and is not sent back to the VPC endpoint.

It also means that no flow can be initiated by the target group side of the GWLB.

Let’s imagine for example that you would like to ping one of the “Public instances” from one of the “Inspection instances”, encapsulating this ping in a Geneve packet.

As there would not be any known flow cookie value chosen by the GWLB for this flow, this packet would be immediately dropped by the GWLB, without any chance to reach your public instance.
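
To illustrate the logic described above, here is a toy model of a flow cookie table. It is purely an assumption about the behaviour seen from the outside : I have no idea how AWS actually implements it, and every name below is hypothetical.

```python
import secrets

class ToyFlowCookieTable:
    """Toy model only: maps a flow key to a random 32-bit cookie."""

    def __init__(self):
        self._cookies = {}

    @staticmethod
    def flow_key(src_ip, dst_ip, protocol, src_port=None, dst_port=None):
        # 5-tuple for TCP (6) and UDP (17), 3-tuple for everything else
        if protocol in (6, 17):
            return (src_ip, dst_ip, protocol, src_port, dst_port)
        return (src_ip, dst_ip, protocol)

    def cookie_for(self, key):
        # "Forward" path: the cookie is created the first time a flow is seen
        if key not in self._cookies:
            self._cookies[key] = secrets.randbits(32)
        return self._cookies[key]

    def accept_return(self, key, cookie):
        # "Return" path: unknown flow or wrong cookie means the packet is dropped
        return self._cookies.get(key) == cookie

table = ToyFlowCookieTable()
web_flow = table.flow_key("92.63.197.83", "10.0.10.4", 6, 54209, 8558)
cookie = table.cookie_for(web_flow)
print(table.accept_return(web_flow, cookie))      # True

# A ping initiated from the inspection side: no cookie was ever generated for it
ping = table.flow_key("192.168.10.4", "10.0.10.4", 1)
print(table.accept_return(ping, 0x12345678))      # False
```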

Finally, before moving into the details of the “router / inspection stuff” I wrote, here are a few interesting things to keep in mind if you want to play with Geneve (and VXLAN, as they mostly share this behaviour, but I’ll detail the differences here) :
– the source port in the outer UDP header is calculated using a hash of the inner flow characteristics (using for example a 5 tuple). That way, the source port is the same for all encapsulated packets of a given flow, which increases the probability that they’ll use the same path on the underlay network
– the destination port is always 6081 (for Geneve) or 4789 (for VXLAN) : in our example, source and destination ports should not be swapped by the inspection instances when they send an inspected packet back to the GWLB. This has a (huge) impact on the way you can implement it in your code.
– one difference between Geneve and VXLAN : Geneve expects the UDP checksum to be calculated, and accurate. An invalid checksum will (of course) cause the packet to be discarded. VXLAN, on the other hand, encourages the use of a checksum of 0 (thus not protecting the outer IP/UDP headers), but a receiver may still verify the checksum if it has a non-zero value.
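
As an illustration of the first point, a tunnel endpoint could derive the outer source port like this. This is a sketch only : the choice of CRC32 and of the 49152-65535 dynamic port range are my assumptions, each implementation is free to use its own hash.

```python
import zlib

def outer_source_port(src_ip, dst_ip, protocol, src_port, dst_port):
    """Derive a stable outer UDP source port from the inner 5-tuple."""
    flow = f"{src_ip}|{dst_ip}|{protocol}|{src_port}|{dst_port}".encode()
    # Map the hash onto the dynamic port range 49152-65535 (16384 values)
    return 49152 + zlib.crc32(flow) % 16384

first = outer_source_port("92.63.197.83", "10.0.10.4", 6, 54209, 8558)
second = outer_source_port("92.63.197.83", "10.0.10.4", 6, 54209, 8558)
print(first == second)   # every packet of a given flow gets the same source port
```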

Let’s build a “Geneve router” !

Knowing how the Geneve protocol / GWLB works, and if you are a bit familiar with socket programming, it’s not that hard to have some fun by building a “flow inspection and routing” program.

You can get mine here : https://github.com/AnthoBalitrand/geneve-router

How is it built ?

main.py : Just the basic and core stuff here : command-line arguments parsing, a check that you have enough permissions to run it (more on that below), and the creation of 2 sockets :


– health-check socket : this is a simple TCP socket, listening on port 80 (by default), used by the GWLB target-group to check the health status of the process.
It just answers “Healthy” to any incoming TCP payload with the following function :

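def http_healthcheck_response():
    # Minimal HTTP/1.1 answer returned to the target-group health check
    body = "Healthy\n"

    # HTTP header lines are separated by CRLF, and an empty line ends the header block
    header = f"HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=utf-8\r\nContent-Length: {len(body)}\r\nConnection: close"
    return header + '\r\n\r\n' + body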

– main socket : this is a raw socket associated with the UDP protocol (socket.SOCK_RAW with socket.IPPROTO_UDP). It receives every UDP datagram together with its IP header, and setsockopt(socket.IPPROTO_IP, socket.IP_HDRINCL, 1) lets us provide our own IP header when sending. This is the reason why you need elevated privileges to start this process. Why do we need this ? Because we don’t swap the source and destination UDP ports when returning the packet to the GWLB.
For example : we receive a packet from the GWLB with src-port = 60372 / dest-port = 6081. Our process is supposed to inspect / filter the inner packet, and then return the Geneve packet with src-port = 60372 / dest-port = 6081.
If we were using a bound UDP socket (socket.SOCK_DGRAM) on port 6081, we could only send datagrams with source-port = 6081.
Which means that we would send back the packet to the GWLB with src-port = 6081 / dest-port = 6081

Spoiler alert : it works. Even if AWS expects you to send back the packet with the same source port that was used by the GWLB (see here), it seems to work well in my small lab if you send it back with source port 6081. So I added a mode where you can start the geneve-router using the --udp-only argument (not using a raw socket), which you can run without root privileges.
However, I have some doubts that it would work properly with large-scale / high-bandwidth usage because of the way AWS VPC / Hyperplane would handle this kind of flows.
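
If you want to see for yourself that a bound UDP socket always sends with its own port as source, here is a small loopback experiment. It is a sketch with hypothetical names : both sockets use a free port picked by the system instead of 6081, and the “peer” socket just stands in for the GWLB side.

```python
import socket

def reply_source_port():
    """Send from a bound UDP socket, return (bound port, source port seen by the peer)."""
    # Stand-in for the GWLB side: a simple receiver on the loopback interface
    peer = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    peer.bind(("127.0.0.1", 0))
    peer.settimeout(2)
    # Stand-in for the router: a bound socket (port 0 = any free port, 6081 in real life)
    router = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    router.bind(("127.0.0.1", 0))
    try:
        router.sendto(b"geneve-bytes", peer.getsockname())
        _, (_, seen_port) = peer.recvfrom(2048)
        return router.getsockname()[1], seen_port
    finally:
        peer.close()
        router.close()

bound_port, seen_port = reply_source_port()
print(bound_port == seen_port)   # the source port is the bound port, whatever we received before
```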

Example of what the flow looks like with a Raw socket where you can build your own IP / UDP headers :

And here’s what it looks like with an UDP socket :

The socket management is done through the select Python module which makes it quite easy to work with multiple sockets in a non-blocking / asynchronous way.
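
For those who never used it, a select-based loop looks roughly like the sketch below. The poll_once helper and the handlers dictionary are hypothetical names of mine, and the demo uses a connected socket pair instead of the real health-check and Geneve sockets.

```python
import select
import socket

def poll_once(handlers, timeout=1.0):
    """Wait for readable sockets and call the handler registered for each one."""
    readable, _, _ = select.select(list(handlers), [], [], timeout)
    for sock in readable:
        handlers[sock](sock)
    return len(readable)

# Demo with a connected socket pair instead of real network sockets
left, right = socket.socketpair()
received = []
handlers = {right: lambda s: received.append(s.recv(2048))}
left.send(b"ping")
handled = poll_once(handlers)
print(handled, received)
left.close()
right.close()
```

In a real program the call would sit in a while loop, with one handler for the health-check socket and one for the Geneve socket.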

Note that when using a raw socket, you’ll receive all the packets the system is actually receiving with the IPPROTO value you associated to the raw socket (in our case IPPROTO_UDP).
It’s then up to you to filter the packets received by the socket according to their L4 information.
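
Such a filter only needs a few bytes of the IPv4 and UDP headers. Here is a minimal sketch (function names and the sample packet are hypothetical, this is not the code from rawpacket.py) :

```python
import struct

GENEVE_PORT = 6081

def udp_ports(raw_packet: bytes):
    """Return (src_port, dst_port) of a raw IPv4/UDP packet, or None if it is not UDP."""
    if len(raw_packet) < 20:
        return None
    version_ihl = raw_packet[0]
    header_length = (version_ihl & 0x0F) * 4     # IHL is counted in 4-byte words
    protocol = raw_packet[9]
    if version_ihl >> 4 != 4 or protocol != 17 or len(raw_packet) < header_length + 8:
        return None
    return struct.unpack_from("!HH", raw_packet, header_length)

def is_geneve(raw_packet: bytes) -> bool:
    ports = udp_ports(raw_packet)
    return ports is not None and ports[1] == GENEVE_PORT

# Hand-built packet: 20-byte IPv4 header (protocol 17) + empty UDP datagram
ip_header = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 28, 0, 0, 255, 17, 0,
                        bytes([192, 168, 10, 60]), bytes([192, 168, 10, 4]))
sample_packet = ip_header + struct.pack("!HHHH", 60430, 6081, 8, 0)
print(udp_ports(sample_packet), is_geneve(sample_packet))
# prints: (60430, 6081) True
```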

rawpacket.py : It defines RawPacket, a class instantiated each time a new packet is received. It then uses the different layer parsers to extract information about the outer and inner packets (raising an error to stop the parsing if the destination port of the outer UDP header is not 6081).

Extract :

class RawPacket:
  def __init__(self, logger, raw_geneve_packet, flow_tracker, udp_only):
    self.raw_data = raw_geneve_packet
    [...]
    
    if not udp_only:
      self.outter_ipv4 = ipv4.IPv4(self.raw_data)
      self.outter_udp = udp.UDP(self.raw_data, self.outter_ipv4.header_end_byte)
      if self.outter_udp.dst_port != config.GENEVE_PORT:
        raise UnmatchedGenevePort
    
    # The Geneve header is then unpacked, as well as the IP / L4 proto inner headers
    self.geneve = geneve.Geneve(self.raw_data, 0 if udp_only else self.outter_ipv4.header_length_bytes + 8)


    # Each of the headers is an object for which we have a __repr__ method defined 
    # We can then use the following method to display all the parsed information
    [...]
    logger.debug(f"GENEVE - {self.outter_ipv4} {self.outter_udp} {self.geneve} {self.inner_ipv4} {self.inner_l4}")

An example output of the logger.debug line above looks like this (with a bit of formatting) :

2023-02-17 19:38:22,458 - geneve-router - DEBUG - GENEVE - 
[IPv4 Total length:108 ID:0 DNF:0 Frag offset:0 TTL:255 SRC:192.168.10.60 DST:192.168.10.4 ] 
[UDP SRC port:60430 DST port:6081 Length:88 ] 
[Geneve Protocol type:2048 VNI:000000 [
  [ Opt class:264 Opt type:1 Value:2fc2a110a489f908 ], 
  [ Opt class:264 Opt type:2 Value:0000000000000000 ], 
  [ Opt class:264 Opt type:3 Value:db2eb84b ]] 
[IPv4 Total length:40 ID:24337 DNF:0 Frag offset:0 TTL:233 SRC:92.63.197.83 DST:10.0.10.4 ] 
[TCP SRC port:54209 DST port:8558 SEQ/ACK:180243822/0 Flags:S Window:1024 ]

We can easily see the information regarding the outer headers (IP/UDP), where 192.168.10.4 is the IP address of the inspection instance (receiving Geneve packets on port 6081), and 192.168.10.60 is the GWLB interface in this inspection instance subnet.
We can see the flow cookie value generated by the GWLB for the inner flow (db2eb84b).
As well as the inner flow info (src/dst IP and TCP ports, seq and ack numbers, flags, window value).

The headers/ folder contains the different header classes used to get such a result. I’ll not detail them here (but take a look if you want to see how to unpack C-style structs in Python).

Interesting thing to note however :

When using the router with a raw socket (normal mode), the following things happen before forwarding the Geneve packet back to the GWLB :

# Let's swap the source and destination IP addresses on the outer IP header
self.outter_ipv4.swap_addresses()
# And decrease the TTL field (not really useful, but that's how it's supposed to be !)
self.outter_ipv4.ttl -= 1

[...]

# We need to repack (reconstruct the bytes array) of the IP header after changing it 
# and we just concatenate the rest of the original packet raw value after 
return b''.join([
  self.outter_ipv4.repack(),
  self.raw_data[self.outter_ipv4.header_length_bytes::]
])
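
For readers who prefer to see the same operation on raw bytes, here is a standalone sketch (hypothetical function names, not the repository code). Unlike the extract above, it also refreshes the IP header checksum itself, which is an extra step of mine :

```python
import struct

def ipv4_header_checksum(header: bytes) -> int:
    """Ones' complement of the ones' complement sum of the header's 16-bit words."""
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def turn_around(raw_packet: bytes) -> bytes:
    """Swap the outer IPv4 addresses, decrement the TTL and refresh the header checksum."""
    header_length = (raw_packet[0] & 0x0F) * 4
    header = bytearray(raw_packet[:header_length])
    header[12:16], header[16:20] = header[16:20], header[12:16]   # src <-> dst
    header[8] -= 1                                                # TTL (assumed > 0)
    header[10:12] = b"\x00\x00"                                   # clear the old checksum
    header[10:12] = struct.pack("!H", ipv4_header_checksum(bytes(header)))
    return bytes(header) + raw_packet[header_length:]

# Hand-built outer packet: GWLB (192.168.10.60) -> inspection instance (192.168.10.4)
ip_header = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 28, 0, 0, 255, 17, 0,
                        bytes([192, 168, 10, 60]), bytes([192, 168, 10, 4]))
sample_packet = ip_header + struct.pack("!HHHH", 60430, 6081, 8, 0)
returned = turn_around(sample_packet)
print(returned[8], returned[12:16] == sample_packet[16:20])
# prints: 254 True
```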

Note that we would normally have to care about the outer IP and UDP header checksums :
We swapped the source and destination IP addresses, and the IP header info (in the form of a pseudo-header) is used for the UDP checksum calculation. As soon as an address changes, the checksum is not valid anymore (a pure swap of the two addresses is a lucky case : the checksum is a sum of 16-bit words, so it does not depend on their order). The TTL change, for its part, impacts the IP header checksum.
However, nowadays the calculation of the IP, TCP and UDP checksums is generally offloaded to the NIC. So I let the instance NIC recalculate the right values and I don’t spend time on it !
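
If you are curious about what this calculation looks like, here is a minimal sketch of the UDP checksum with its pseudo-header (hypothetical helper names, sample values of my own). It also shows a detail which is easy to miss : changing an address changes the checksum, but simply swapping source and destination does not, because the sum does not depend on the order of the words.

```python
import struct

def ones_complement_sum(data: bytes) -> int:
    if len(data) % 2:
        data += b"\x00"                      # pad to a whole number of 16-bit words
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total

def udp_checksum(src_ip: bytes, dst_ip: bytes, udp_segment: bytes) -> int:
    """Checksum over pseudo-header + UDP header (checksum field zeroed) + payload."""
    pseudo_header = src_ip + dst_ip + struct.pack("!BBH", 0, 17, len(udp_segment))
    checksum = ~ones_complement_sum(pseudo_header + udp_segment) & 0xFFFF
    return checksum or 0xFFFF                # 0 means "no checksum" in UDP, so 0xFFFF is sent

gwlb = bytes([192, 168, 10, 60])
appliance = bytes([192, 168, 10, 4])
segment = struct.pack("!HHHH", 60430, 6081, 12, 0) + b"abcd"

same_after_swap = udp_checksum(gwlb, appliance, segment) == udp_checksum(appliance, gwlb, segment)
print(same_after_swap)   # True
```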

BTW, here’s the command to check the status of checksum offloading on your NIC :

[root@ip-192-168-10-4 ec2-user]# ethtool -k eth0 | grep checksum
rx-checksumming: on [fixed]
tx-checksumming: on
	tx-checksum-ipv4: on [fixed]
	tx-checksum-ip-generic: off [fixed]
	tx-checksum-ipv6: on
	tx-checksum-fcoe-crc: off [fixed]
	tx-checksum-sctp: off [fixed]

Add some gimmicks : Flow Tracker

To add a gimmicky but interesting feature to my Geneve router, I added a “flow-tracker” to it.
What is it ? It is a small inspector which keeps track of the different flows (entirely based on the AWS GWLB flow-cookie value, so it will not work with any other kind of implementation), so that it can display information about the flows once they are terminated (either CLOSED for a TCP session, or timed out for UDP or ICMP).

Example of outputs :

2023-02-17 20:05:58,269 - geneve-router - INFO - Flow 24a150d0 - IP 6 - SRC 35.180.255.182:44748 - DST 10.0.10.4:80 - Pkts/bytes sent 8/76 - Pkts/bytes received 6/3914 - State CLOSED

TCP sessions are tracked by analyzing the flags in the inner TCP headers.
The flow-tracker logs are displayed as soon as the session moves to the CLOSED state, or reaches the config.TCP_FLOW_TIMEOUT value (300 seconds by default), which is defined in the config.py file.

UDP / ICMP session info is displayed when the sessions time out, after the config.FLOW_TIMEOUT duration (default 30 seconds).

I also added the ability to drop TCP flows for non-initialized sessions (which you can test for example by opening an SSH session to the public instances, and restarting the geneve-router on the associated (same subnet) inspection instance).
The flow tracker will then drop all the packets of this SSH session, as they will be considered non-SYN TCP.

This is enabled by default, but you can turn it off by changing the value of config.TCP_NONSYN_BLOCK (in the config.py file, again).

2023-02-17 20:15:40,609 - geneve-router - DEBUG - GENEVE - 
[IPv4   Total length:156 ID:0 DNF:0 Frag offset:0 TTL:255 SRC:192.168.10.60 DST:192.168.10.4  ]
[UDP   SRC port:60907 DST port:6081 Length:136  ] 
[Geneve   Protocol type:2048 VNI:000000 [
  [ Opt class:264 Opt type:1 Value:2fc2a110a489f908 ], 
  [ Opt class:264 Opt type:2 Value:0000000000000000 ], 
  [ Opt class:264 Opt type:3 Value:4c629521 ]] 
[IPv4   Total length:88 ID:0 DNF:1 Frag offset:0 TTL:47 SRC:90.1.209.85 DST:10.0.10.4  ] 
[TCP   SRC port:64899 DST port:22 SEQ/ACK:463573958/1816716500 Flags:AP Window:2048  ]

2023-02-17 20:15:40,609 - geneve-router - WARNING - FLOW-TRACKER - First packet for un-initialized TCP flow is not a SYN !
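
The logic behind this behaviour can be summarised with a toy tracker like the one below. It is a simplified sketch with hypothetical names, not the flow tracker from the repository : in particular, it moves a flow to CLOSED on the first FIN or RST instead of following the full TCP teardown.

```python
import time

FIN, SYN, RST = 0x01, 0x02, 0x04   # TCP flag bits
TCP_FLOW_TIMEOUT = 300             # seconds, same default as the one quoted above

class ToyFlowTracker:
    """Very small flow table keyed on the GWLB flow cookie."""

    def __init__(self, nonsyn_block=True):
        self.flows = {}
        self.nonsyn_block = nonsyn_block

    def track_tcp(self, cookie, tcp_flags, now=None):
        """Return False when the packet should be dropped, True otherwise."""
        now = time.monotonic() if now is None else now
        flow = self.flows.get(cookie)
        if flow is None:
            if self.nonsyn_block and not tcp_flags & SYN:
                return False                  # mid-session packet of an unknown flow
            flow = self.flows[cookie] = {"state": "OPEN", "packets": 0}
        flow["packets"] += 1
        flow["last_seen"] = now
        if tcp_flags & (FIN | RST):
            flow["state"] = "CLOSED"          # simplification: first FIN or RST closes the flow
        return True

    def expired(self, now=None):
        """Cookies of the flows which have been idle for longer than the timeout."""
        now = time.monotonic() if now is None else now
        return [c for c, f in self.flows.items() if now - f["last_seen"] > TCP_FLOW_TIMEOUT]

tracker = ToyFlowTracker()
print(tracker.track_tcp(0x4C629521, 0x18))   # ACK+PSH on an unknown flow -> False (dropped)
print(tracker.track_tcp(0x24A150D0, SYN))    # a new flow starting with a SYN -> True
```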

OK, cool, but how can I play with it ???

Ok, I know that you can’t wait for that.
So, first, you can get the code here (I’ll probably add some stuff on it if I have some time, but feel free to fork it if you want)

https://github.com/AnthoBalitrand/geneve-router

Then first thing is :

git clone https://github.com/AnthoBalitrand/geneve-router.git
cd geneve-router
pip3 install -r requirements.txt
python3 main.py --help

usage: geneve-router [-h] [--no-daemon] [-l LOG_LEVEL] [-f LOG_FILE] [-t] [-u]

Geneve router for AWS GWLB

optional arguments:
  -h, --help            show this help message and exit
  --no-daemon           Do not start the Geneve router as a daemon
  -l LOG_LEVEL, --log-level LOG_LEVEL
                        Log level. If used without --no-daemon, will force logging to logging.log
  -f LOG_FILE, --log-file LOG_FILE
                        Logging file. Overwrites the config.LOG_FILE parameter
  -t, --flow-tracker    Enables flow tracker, which provides only start/stop flow logging information
  -u, --udp-only        Start without using raw socket (only UDP bind socket)

by Antho Balitrand

You can start it attached (not running in the background) with the --no-daemon argument.
By default, it will not log at all if started as a background process, but you can force logging to an output file by using the --log-level argument (values : info/warning/error/debug).
When starting in --no-daemon mode, the logging level is warning (you can also change it in this mode with --log-level)
To enable the flow tracker, just add the --flow-tracker (or -t) argument.
And finally, if you want to run it with a Geneve UDP socket (instead of a raw socket), use the --udp-only argument. (Remember that it will not behave as it is supposed to, and it could be blocked by AWS at some point).
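
If you wonder how such a command line is declared, here is a sketch of an argparse parser rebuilt from the help output above. It is only my reconstruction (the build_parser name is hypothetical, and the help texts are shortened), the real main.py may be organised differently :

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(prog="geneve-router",
                                     description="Geneve router for AWS GWLB")
    parser.add_argument("--no-daemon", action="store_true",
                        help="Do not start the Geneve router as a daemon")
    parser.add_argument("-l", "--log-level", help="Log level")
    parser.add_argument("-f", "--log-file", help="Logging file")
    parser.add_argument("-t", "--flow-tracker", action="store_true",
                        help="Enables flow tracker")
    parser.add_argument("-u", "--udp-only", action="store_true",
                        help="Start without using raw socket (only UDP bind socket)")
    return parser

args = build_parser().parse_args(["--no-daemon", "--log-level", "debug", "-t"])
print(args.no_daemon, args.log_level, args.flow_tracker, args.udp_only)
# prints: True debug True False
```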

Even simpler with a Terraform file

To make it even simpler for you, I added a terraform stack file in the terraform-files/ folder.
The deployed architecture will be this one :

All you have to do is :

cd terraform-files
export AWS_ACCESS_KEY_ID="<your key ID>"
export AWS_SECRET_ACCESS_KEY="<your secret key>"
export AWS_DEFAULT_REGION="eu-west-3" # replace with your favorite region
terraform init
terraform apply

The deployment will take a few minutes.
Once it’s done, you should get a Terraform output with the public IP addresses of the different instances, so that you can remotely connect to them.
A key-pair has been generated, you can extract the key with :

terraform output -raw private_key > /tmp/geneve-router-lab.key
chmod 600 /tmp/geneve-router-lab.key

ssh -i /tmp/geneve-router-lab.key ec2-user@<instance-public-ip>

You should normally see both inspection instances as healthy in the GWLB target group, that’s a good sign !

At this point, both public instances should have a running Apache web server (with its default welcome page), and both inspection instances should have a running geneve-router process (with flow tracker enabled, and no logging).

Feel free to connect to the inspection instances using the extracted private key and change the mode the geneve-router process is running in (moving it to --no-daemon mode, or setting a log level so that it logs to the logging.log file).

Note that as long as the geneve-router process is not running on the inspection instances, you will not be able to remotely connect to the public instances (even using SSH), as this flow will not be forwarded by the GWLB endpoint.

I hope this article has been interesting for you. Feel free to give me some feedback by commenting on my post on LinkedIn : https://www.linkedin.com/in/anthonybalitrand/