Sending large files through sockets, boosted with Zero-Copy!

Here we are again with a new (and, I hope, interesting) technical topic.

To be honest, this is something I discovered only recently, while watching a very interesting presentation from Netflix about their OCAs (Open Connect Appliances).
Those are powerful cache appliances that Netflix deploys in its PoPs (points of presence) and also offers to ISPs; together they make up its CDN, called Open Connect.

Of course, those appliances are intended to store as much content as possible while delivering it at maximum speed to as many users as possible. At this scale, every performance gain is multiplied many times over.
One of the “boosters” they use to deliver their content faster is a feature called “Zero-Copy”.

What is Zero-Copy?

If you take a look at some articles talking about what we are about to see here, you’ll often see the terms “Zero-Copy” and “sendfile” used jointly.

In fact, Zero-Copy is a technique that avoids unnecessary copies of data while moving it from an input to an output. We’ll see later that, depending on how your code is written, chunks of the data you want to send are copied several times internally between the input and the output, which costs a lot of performance.

“sendfile”, for its part, is the name of a system call that is commonly used to perform Zero-Copy operations. It is far from new: on Linux it has been available since kernel 2.2, released in 1999. It is still not used as widely as it could be, but the most popular web servers, such as Nginx, either use it or let you enable it manually.

While we usually use the “read” and “write” syscalls, specifying the amount of data (in bytes) we want to read or the data itself we want to write, “sendfile” takes 4 parameters, in this order:
– the output file descriptor
– the input file descriptor
– the offset (in bytes, from the beginning of the input FD) at which the copy has to start
– the count (how many bytes to copy)

With this system call, instead of reading the data with “read” (which costs a context switch), storing it in a user-space buffer, and then writing it to the output file descriptor with “write” or “sendto” (which costs another context switch), we can directly ask the kernel to copy X bytes from the input FD to the output FD, using a single system call and without ever storing a chunk of data in user space.
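To make the call itself more concrete, here is a minimal sketch of a single “sendfile” call from Python. It assumes a Unix-like system where os.sendfile is available; the temporary file, the payload and the socket pair are made up for the demo and simply stand in for a real file and a real client connection. Note that in os.sendfile the output descriptor is passed first.

```python
import os
import socket
import tempfile

# Hypothetical demo data: a small temporary file acts as the input.
payload = b"zero-copy demo\n" * 64
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

# A connected pair of sockets stands in for the server/client connection.
sender, receiver = socket.socketpair()

with open(path, "rb") as f:
    # os.sendfile(out_fd, in_fd, offset, count): the kernel copies `count`
    # bytes, starting at `offset`, from the file to the socket.
    # The return value is the number of bytes actually sent.
    sent = os.sendfile(sender.fileno(), f.fileno(), 0, len(payload))

# Closing the sending side lets the receiver see end-of-stream.
sender.close()
received = b""
while (chunk := receiver.recv(4096)):
    received += chunk
receiver.close()
os.unlink(path)

print(f"sendfile sent {sent} bytes, receiver got {len(received)} bytes")
```

The Python process never holds the file content on the sending side: it only passes two file descriptors, an offset and a count to the kernel.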

Client side

Let’s build a simple TCP socket client in Python, which will connect to port 8080 and receive all available data.

import socket

client_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client_sock.connect(('127.0.0.1', 8080))
recv_len = 0

while (data := client_sock.recv(65536)):
  recv_len += len(data)

print(f"Received {recv_len} bytes")
client_sock.close()

Serve it the standard way

We generally all do it the same way: we allocate a buffer, use the “read” function to read up to a buffer’s worth of bytes from the input file descriptor, and then “write” them (or “sendto” them, for a socket) to the output file descriptor.

import os
import socket
import time

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(('127.0.0.1', 8080))
sock.listen(1)

try:
  print("Waiting for connection on port 8080...")
  conn, addr = sock.accept()
  print(f"Connection from {addr[0]} : {addr[1]}")
  start = time.time()
  with open('bigfile', 'rb') as f:
    while (data:=f.read(4096)):
      conn.sendall(data)
  conn.close()
  end = time.time()
  print(f"Time to send file : {end - start}")
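except KeyboardInterrupt:
  pass

# Release the listening socket (the run below ends with this "Bye bye" message)
sock.close()
print("Bye bye")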

At a high level, this is what happens on the system with this approach: each “read” makes the kernel copy a chunk of the file into our user-space buffer, and each “sendto” makes it copy that buffer back into kernel space, this time into the socket buffer.

This sequence is repeated until we reach EOF (end of file).
As you can easily imagine, sending a 2GB file this way requires a lot of system calls and context switches.
(2 * 10^9) / (4 * 10^3) = around 500 000 reads, the same number of sendto calls, for a total of about 1M system calls, each with its own context switch.
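The estimate above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes the file is exactly 2 GiB (2147483648 bytes) and that every read returns a full 4096-byte chunk.

```python
import math

file_size = 2 * 1024**3   # 2 GiB = 2147483648 bytes (assumed exact size)
chunk_size = 4096         # bytes read per iteration of the loop

# Each chunk costs one read() and one sendto().
chunks = math.ceil(file_size / chunk_size)
syscalls = 2 * chunks

print(f"{chunks} reads + {chunks} sendto = {syscalls} system calls")
# prints "524288 reads + 524288 sendto = 1048576 system calls"
```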

Let’s check it with perf:

root@bd9e1ad91c85:/test2# perf stat -e syscalls:sys_enter* -B -a python3 standard_server.py
Waiting for connection on port 8080...
Connection from 127.0.0.1 : 56976
Time to send file : 9.505289554595947
Bye bye

 Performance counter stats for 'system wide':

                 2      syscalls:sys_enter_socket
                 3      syscalls:sys_enter_connect
                 1      syscalls:sys_enter_getsockname
            524291      syscalls:sys_enter_sendto
             86709      syscalls:sys_enter_recvfrom
                 1      syscalls:sys_enter_getrandom
[...]
            525463      syscalls:sys_enter_read
               939      syscalls:sys_enter_write
[...]

The counters match the estimate (a bit more than 524 000 calls each to “read” and “sendto”), and we can see that it took more than 9 seconds to send a 2GB file that way.

Now with Zero-Copy

Let’s now try the same thing (sending a 2GB file into a TCP socket) using the “sendfile” system call.

Our server code is only slightly modified, see below.

import os
import socket
import time

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(('127.0.0.1', 8080))
sock.listen(1)

try:
  print("Waiting for connection on port 8080...")
  conn, addr = sock.accept()
  print(f"Connection from {addr[0]} : {addr[1]}")
  start = time.time()
  file_size = os.stat('bigfile').st_size
  print(f"{file_size} bytes to send")
  with open('bigfile', 'rb') as f:
    zerocopy_result = None
    offset = 0
    while zerocopy_result != 0:
      zerocopy_result = os.sendfile(conn.fileno(), f.fileno(), offset, 4096)
      offset += zerocopy_result
  conn.close()
  end = time.time()
  print(f"Time to send file : {end - start}")
except KeyboardInterrupt as e:
  pass

sock.close()
print("Bye bye")

With this approach, here is what happens under the hood:

There is no longer any need to allocate a user-space buffer: the copied bytes never reach user space.
The whole copy is delegated to the kernel, which copies the requested number of bytes from the input file descriptor to the output file descriptor.
Since each call copies at most the number of bytes we request, we have to repeat the system call with an increasing offset until it returns 0, which means that EOF has been reached (otherwise, it returns the number of bytes copied).
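As a side note, Python’s standard library can run this offset loop for us: socket objects have a sendfile() method (available since Python 3.5) that uses os.sendfile when the platform supports it and falls back to regular send calls otherwise. The sketch below is not the server used for the measurements in this article; the serve_file helper, the random payload and the socket pair are hypothetical and only there to make the example self-contained.

```python
import os
import socket
import tempfile
import threading

def serve_file(conn: socket.socket, path: str) -> int:
    """Send the whole file at `path` over `conn` and return the byte count."""
    with open(path, "rb") as f:
        # socket.sendfile() loops internally until EOF is reached.
        return conn.sendfile(f)

# Hypothetical demo data: 256 KiB of random bytes in a temporary file.
payload = os.urandom(256 * 1024)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    demo_path = tmp.name

# A socket pair stands in for an accepted client connection.
server_side, client_side = socket.socketpair()
received = bytearray()

def drain() -> None:
    # Read until the sending side is closed.
    while (chunk := client_side.recv(65536)):
        received.extend(chunk)

reader = threading.Thread(target=drain)
reader.start()
sent = serve_file(server_side, demo_path)
server_side.close()
reader.join()
client_side.close()
os.unlink(demo_path)

print(f"sent {sent} bytes, received {len(received)} bytes")
```

The explicit loop over os.sendfile remains useful when you want to control the chunk size yourself, which is what the rest of this article does.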

You can see in the code above that I am still copying the file in chunks of 4096 bytes:

zerocopy_result = os.sendfile(conn.fileno(), f.fileno(), offset, 4096)

However, I now need only 1 system call to copy each chunk from the input FD to the output FD (instead of a read + a sendto), and the data never has to be copied into a user-space buffer.
The performance increase is quite clear:

root@bd9e1ad91c85:/test2# perf stat -e syscalls:sys_enter* -B -a python3 zerocopy_server.py
Waiting for connection on port 8080...
Connection from 127.0.0.1 : 56986
2147483648 bytes to send
Time to send file : 5.522516250610352
Bye bye

 Performance counter stats for 'system wide':

                 1      syscalls:sys_enter_socket
                 1      syscalls:sys_enter_connect
                 1      syscalls:sys_enter_getsockname
                 0      syscalls:sys_enter_getpeername
                 0      syscalls:sys_enter_sendto
             75078      syscalls:sys_enter_recvfrom
[...]
               217      syscalls:sys_enter_read
[...]
            524289      syscalls:sys_enter_sendfile64
[...]

We can see that only a few calls to the “read” syscall remain (they come from the Python interpreter itself), and there are no calls to the “sendto” syscall at all.
However, we now have more than 500K calls to the “sendfile” syscall (again, (2 * 10^9) / (4 * 10^3)).
We have effectively halved the number of system calls, and therefore of context switches, and the time needed to send the 2GB file over this socket drops from about 9.5 seconds to about 5.5 seconds.

We can actually do much better by increasing the number of bytes the kernel copies per call from the input FD to the output FD (let’s say, to roughly 64KB), since this no longer requires a user-space buffer of that size:

zerocopy_result = os.sendfile(conn.fileno(), f.fileno(), offset, 65535)

Let’s measure it again :

root@bd9e1ad91c85:/test2# perf stat -e syscalls:sys_enter* -B -a python3 zerocopy_server.py
Waiting for connection on port 8080...
Connection from 127.0.0.1 : 56992
2147483648 bytes to send
Time to send file : 1.3105833530426025
Bye bye

 Performance counter stats for 'system wide':

                 3      syscalls:sys_enter_socket
                 0      syscalls:sys_enter_socketpair
                 1      syscalls:sys_enter_bind
                 1      syscalls:sys_enter_listen
                 1      syscalls:sys_enter_accept4
                 0      syscalls:sys_enter_accept
                 3      syscalls:sys_enter_connect
                 1      syscalls:sys_enter_getsockname
                 0      syscalls:sys_enter_getpeername
               164      syscalls:sys_enter_read
                 3      syscalls:sys_enter_sendto
             32780      syscalls:sys_enter_recvfrom
                 1      syscalls:sys_enter_setsockopt
             32769      syscalls:sys_enter_sendfile64
                 0      syscalls:sys_enter_copy_file_range
                 0      syscalls:sys_enter_truncate
[...]

The same 2GB file is now sent in about 1.3 seconds, with only 32 769 calls to “sendfile”. Nothing to add 🙂
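To put the three runs side by side, the measured times can be turned into throughput and speed-up figures with a short calculation. The times below are the ones reported above, rounded to the millisecond; they come from a single run each over the loopback interface, so treat the results as orders of magnitude rather than a benchmark.

```python
file_size = 2147483648  # bytes, as printed by the server

# Wall-clock times (in seconds) reported by the three runs, rounded.
runs = {
    "read + sendall, 4096-byte chunks": 9.505,
    "sendfile, 4096-byte chunks": 5.523,
    "sendfile, 65535-byte chunks": 1.311,
}

baseline = runs["read + sendall, 4096-byte chunks"]
for label, seconds in runs.items():
    throughput = file_size / seconds / 1024**2   # MiB per second
    speedup = baseline / seconds                 # relative to the first run
    print(f"{label}: {throughput:.0f} MiB/s, {speedup:.2f}x vs baseline")
```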

I hope this new article has been informative and interesting for you! I can’t wait to share the next one!