Tutorial: Understanding epoll's Implementation Details and Performance

1. Introduction to epoll

epoll is a Linux kernel mechanism for efficient I/O event monitoring, designed to handle large numbers of file descriptors (FDs). It excels in high-concurrency scenarios (e.g., web servers with 10k+ connections) by providing O(1) event readiness checks, unlike older mechanisms like select/poll.

2. Why epoll Outperforms select/poll

  • select/poll limitations:
    • O(n) scanning: Every call iterates through all monitored FDs to check readiness.
    • Repeated kernel↔user copies: The entire FD set is copied to the kernel on each call.
    • FD limits: select uses a fixed-size bitmap (typically 1024 FDs).
  • epoll advantages:
    • O(1) readiness checks: Only ready events are processed.
    • No redundant copies: The kernel maintains the FD set.
    • No FD limits: Scales to millions of connections.
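
The contrast is easiest to see in code. Below is a minimal sketch of a select-based event loop (the fds array and nfds count are placeholder inputs) illustrating the two costs listed above: the FD set is rebuilt and copied into the kernel on every call, and every FD is scanned afterwards to find the ready ones.

#include <sys/select.h>

/* Sketch only: fds[] and nfds are assumed to hold the monitored descriptors. */
void select_event_loop(int *fds, int nfds) {
  for (;;) {
    fd_set rset;
    FD_ZERO(&rset);                             // rebuilt on every iteration
    int maxfd = -1;
    for (int i = 0; i < nfds; i++) {            // O(n) work just to build the set
      FD_SET(fds[i], &rset);
      if (fds[i] > maxfd) maxfd = fds[i];
    }
    select(maxfd + 1, &rset, NULL, NULL, NULL); // entire set copied to the kernel
    for (int i = 0; i < nfds; i++) {            // O(n) scan to find ready FDs
      if (FD_ISSET(fds[i], &rset)) {
        /* handle fds[i] */
      }
    }
  }
}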

3. Key epoll Components

3.1 Data Structures

  • Red-Black Tree:
    • Purpose: Stores all FDs registered via epoll_ctl.
    • Why a tree?: Efficient O(log n) inserts/deletes/modifies.
    • Kernel-managed: Persists across epoll_wait calls.
  • Ready List (Linked List):
    • Purpose: Stores FDs with detected events.
    • Populated via callbacks: Kernel adds FDs to this list when I/O events occur.
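
Putting the two together, each epoll instance conceptually bundles a tree of registered FDs with a list of ready ones. The sketch below is a simplified illustration only; the real definitions (struct eventpoll and struct epitem in the kernel's fs/eventpoll.c) contain many more fields and locking details.

/* Simplified illustration -- not the actual kernel structures. */
struct epitem {
  struct rb_node     rbn;      // node in the instance's red-black tree
  struct list_head   rdllink;  // link into the ready list when events fire
  int                fd;       // the monitored file descriptor
  struct epoll_event event;    // requested event mask + user data
};

struct eventpoll {
  struct rb_root     rbr;      // all FDs registered via epoll_ctl
  struct list_head   rdllist;  // FDs whose events have fired
  wait_queue_head_t  wq;       // processes sleeping in epoll_wait
};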

3.2 System Calls

  1. epoll_create1():
    • Creates an epoll instance, returning a file descriptor pointing to it.
    • Internally initializes the red-black tree and ready list.
  2. epoll_ctl(epfd, op, fd, event):
    • Modifies the red-black tree:
      • EPOLL_CTL_ADD: Insert FD with event mask (e.g., EPOLLIN).
      • EPOLL_CTL_MOD: Update event mask for an existing FD.
      • EPOLL_CTL_DEL: Remove FD from the tree.
    • Kernel callback setup: When adding an FD, the kernel registers a callback to track its I/O state. This callback adds the FD to the ready list when events occur.
  3. epoll_wait(epfd, events, maxevents, timeout):
    • Blocks until at least one event occurs or the timeout expires (a timeout of -1 blocks indefinitely; 0 returns immediately).
    • Checks the ready list: Copies ready events (struct epoll_event) to user space.
    • Cost scales with ready events: Returns up to maxevents active events, independent of how many FDs are being monitored.
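
A minimal sketch of how the three calls fit together (error handling omitted; sock is assumed to be an already-created socket from elsewhere in the program):

#include <sys/epoll.h>
#include <unistd.h>

/* Sketch: 'sock' is assumed to be an existing socket. */
void epoll_calls_demo(int sock) {
  int epfd = epoll_create1(0);                  // new instance: RB tree + ready list

  struct epoll_event ev = {0};
  ev.events  = EPOLLIN;                         // interested in readability
  ev.data.fd = sock;
  epoll_ctl(epfd, EPOLL_CTL_ADD, sock, &ev);    // insert sock into the RB tree

  ev.events = EPOLLIN | EPOLLOUT;
  epoll_ctl(epfd, EPOLL_CTL_MOD, sock, &ev);    // update its stored event mask

  struct epoll_event ready[16];
  int n = epoll_wait(epfd, ready, 16, -1);      // block; copy out only ready events
  (void)n;

  epoll_ctl(epfd, EPOLL_CTL_DEL, sock, NULL);   // remove sock from the tree
  close(epfd);
}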

4. Event Notification Mechanism

4.1 Callback-Driven Ready List

  • Kernel-space efficiency: When network packets arrive or a socket becomes writable, the kernel's network stack wakes the FD's wait queue (for sockets, via functions such as sock_def_readable), which invokes the callback epoll registered on that queue.
  • Callback action: Adds the FD to the ready list and wakes up any process blocked in epoll_wait.
  • No polling required: The kernel doesn’t scan FDs—events are pushed to the ready list as they occur.
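
A rough, conceptual sketch of what that callback does (not the kernel's actual code; the real logic lives in fs/eventpoll.c, chiefly in ep_poll_callback, and the struct fields below refer to the simplified sketch in section 3.1):

/* Conceptual sketch only -- not the kernel's real implementation. */
/* Invoked from the FD's wait queue when its I/O state changes.    */
void fd_event_callback(struct eventpoll *ep, struct epitem *item)
{
  // If the item is not already queued, link it into the ready list.
  if (list_empty(&item->rdllink))
    list_add_tail(&item->rdllink, &ep->rdllist);

  // Wake any task sleeping in epoll_wait() on this instance.
  wake_up(&ep->wq);
}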

4.2 Edge-Triggered (ET) vs. Level-Triggered (LT)

  • Level-Triggered (default):
    • epoll_wait keeps reporting the FD on every call as long as it remains ready (e.g., unread data still exists).
    • Similar to select/poll behavior.
  • Edge-Triggered (enable with EPOLLET):
    • Notifies only when the FD’s state changes (e.g., new data arrives).
    • Why it’s faster: Reduces redundant wakeups (e.g., a single notification until new data arrives).
    • Requires non-blocking FDs: Apps must read/write until EAGAIN to avoid missing events.
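
The practical consequence of ET is the drain loop below: a sketch that assumes the FD was registered with EPOLLIN | EPOLLET and set to O_NONBLOCK beforehand.

#include <errno.h>
#include <unistd.h>

/* ET pattern sketch: read until the kernel reports "no more data" (EAGAIN). */
void drain_fd(int fd) {
  char buf[4096];
  for (;;) {
    ssize_t n = read(fd, buf, sizeof buf);
    if (n > 0) {
      /* process n bytes of buf here */
    } else if (n == 0) {
      break;  // peer closed the connection
    } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
      break;  // buffer drained; wait for the next edge
    } else {
      break;  // real error: handle and close the FD
    }
  }
}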

5. Why epoll is Fast: Implementation Summary

Factor                 | select/poll                   | epoll
-----------------------|-------------------------------|-----------------------------------
FD Management          | Linear array/bitmap (O(n))    | Red-black tree (O(log n))
Event Detection        | Scan all FDs on every call    | Callback-driven ready list (O(1))
Kernel↔User Data Copy  | Copy entire FD set each call  | Copy only active events
Scalability            | Degrades as total FDs grow    | Scales with active events

6. Example: TCP Server Using epoll

#include <sys/epoll.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

#define MAX_EVENTS 64
#define PORT 8080

int main(void) {
  // Create, bind, and listen on a TCP socket (error handling omitted for brevity)
  int listen_sock = socket(AF_INET, SOCK_STREAM, 0);
  struct sockaddr_in addr = {0};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_ANY);
  addr.sin_port = htons(PORT);
  bind(listen_sock, (struct sockaddr *)&addr, sizeof(addr));
  listen(listen_sock, SOMAXCONN);
  fcntl(listen_sock, F_SETFL, O_NONBLOCK);

  int epfd = epoll_create1(0);
  struct epoll_event ev, events[MAX_EVENTS];

  // Add the listening socket to epoll
  ev.events = EPOLLIN | EPOLLET; // Edge-Triggered
  ev.data.fd = listen_sock;
  epoll_ctl(epfd, EPOLL_CTL_ADD, listen_sock, &ev);

  while (1) {
    int nready = epoll_wait(epfd, events, MAX_EVENTS, -1);
    for (int i = 0; i < nready; i++) {
      if (events[i].data.fd == listen_sock) {
        // Accept new connections; with ET, loop until accept() runs dry (EAGAIN)
        int conn_sock;
        while ((conn_sock = accept(listen_sock, NULL, NULL)) >= 0) {
          fcntl(conn_sock, F_SETFL, O_NONBLOCK);
          ev.events = EPOLLIN | EPOLLET;
          ev.data.fd = conn_sock;
          epoll_ctl(epfd, EPOLL_CTL_ADD, conn_sock, &ev);
        }
      } else {
        // Handle reads: with ET, drain the socket until read() returns EAGAIN
        char buf[4096];
        ssize_t n;
        while ((n = read(events[i].data.fd, buf, sizeof buf)) > 0) {
          // process n bytes of buf here
        }
        if (n == 0) { // peer closed: unregister and close the FD
          epoll_ctl(epfd, EPOLL_CTL_DEL, events[i].data.fd, NULL);
          close(events[i].data.fd);
        }
        // n < 0 with errno == EAGAIN means the buffer is drained for now
      }
    }
  }
}

7. Common Pitfalls

  • Edge-Triggered event loss: Forgetting to read/write until EAGAIN can leave data buffered with no further notification, stalling the connection.
  • Thread safety: Use EPOLLONESHOT if multiple threads may process the same FD, and re-arm the FD afterwards (see the sketch after this list).
  • FD leaks: Remove closed FDs from epoll with EPOLL_CTL_DEL.
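
For the thread-safety point above, an FD registered with EPOLLONESHOT is disabled after a single notification and must be explicitly re-armed once the handling thread is done. A sketch of that re-arm step (epfd and fd are assumed to come from the surrounding event loop):

#include <sys/epoll.h>

/* Sketch: re-arm an FD that was registered with EPOLLONESHOT. */
void rearm_oneshot(int epfd, int fd) {
  struct epoll_event ev = {0};
  ev.events  = EPOLLIN | EPOLLET | EPOLLONESHOT;  // same mask it was added with
  ev.data.fd = fd;
  // The FD stays in the RB tree but is disabled until re-armed with MOD.
  epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}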

8. Conclusion

epoll achieves high performance through:

  1. Efficient data structures: Red-black tree for FD management.
  2. Kernel callbacks: Eliminate O(n) scans.
  3. Minimal data copying: Only active events are passed to userspace.
  4. Edge-Triggered optimization: Reduces redundant wakeups.

Use epoll for Linux-based high-concurrency apps (e.g., HTTP servers, real-time systems). For non-Linux systems, explore alternatives like kqueue (BSD) or IOCP (Windows).

This post is licensed under CC BY 4.0 by the author.