Tutorial: Understanding epoll's Implementation Details and Performance
1. Introduction to epoll
`epoll` is a Linux kernel mechanism for efficient I/O event monitoring, designed to handle large numbers of file descriptors (FDs). It excels in high-concurrency scenarios (e.g., web servers with 10k+ connections) by providing O(1) event readiness checks, unlike older mechanisms such as `select`/`poll`.
2. Why epoll Outperforms select/poll
- select/poll limitations:
  - O(n) scanning: every call iterates through all monitored FDs to check readiness (see the sketch after this list).
  - Repeated kernel↔user copies: the entire FD set is copied to the kernel on each call.
  - FD limits: `select` uses a fixed-size bitmap (`FD_SETSIZE`, typically 1024 FDs).
- epoll advantages:
  - O(1) readiness checks: only ready events are processed.
  - No redundant copies: the kernel maintains the FD set across calls.
  - No fixed FD limit: scales to millions of connections, bounded only by system resources.
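To make the contrast concrete, here is a minimal sketch of the `select()`-based pattern described above; the `fds` array, `nfds` count, and `handle_io` stub are illustrative placeholders, not part of any particular server.

```c
#include <sys/select.h>
#include <unistd.h>

// Hypothetical per-FD handler, only a stub for illustration.
void handle_io(int fd) { char buf[512]; read(fd, buf, sizeof(buf)); }

// Classic select() loop: the FD set is rebuilt and copied into the kernel on
// every iteration, and every monitored FD must be scanned afterwards (O(n)).
void select_loop(int *fds, int nfds) {
    for (;;) {
        fd_set readfds;
        FD_ZERO(&readfds);
        int max_fd = -1;
        for (int i = 0; i < nfds; i++) {            // rebuild the set every call
            FD_SET(fds[i], &readfds);               // limited to FD_SETSIZE (1024)
            if (fds[i] > max_fd) max_fd = fds[i];
        }
        select(max_fd + 1, &readfds, NULL, NULL, NULL);  // whole set copied to kernel
        for (int i = 0; i < nfds; i++)              // O(n) scan, ready or not
            if (FD_ISSET(fds[i], &readfds))
                handle_io(fds[i]);
    }
}
```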
3. Key epoll Components
3.1 Data Structures
- Red-Black Tree:
  - Purpose: stores all FDs registered via `epoll_ctl`.
  - Why a tree?: efficient O(log n) inserts/deletes/modifies.
  - Kernel-managed: persists across `epoll_wait` calls.
- Ready List (Linked List):
  - Purpose: stores FDs with detected events.
  - Populated via callbacks: the kernel adds an FD to this list when an I/O event occurs on it (see the sketch after this list).
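To visualize how the two structures relate, here is a rough, user-space model of the bookkeeping. All type and field names below are invented for this sketch; the real definitions are `struct eventpoll` and `struct epitem` in the kernel's fs/eventpoll.c.

```c
// Illustrative model only; not the kernel's actual definitions.
struct rb_node_sketch   { struct rb_node_sketch *left, *right, *parent; };
struct list_node_sketch { struct list_node_sketch *prev, *next; };

// One per epoll instance (created by epoll_create1).
struct epoll_instance_sketch {
    struct rb_node_sketch   *rbtree_root;   // all FDs registered via epoll_ctl
    struct list_node_sketch  ready_list;    // FDs whose I/O callback has fired
};

// One per registered FD; lives in the tree and, when ready, is also linked
// into the ready list.
struct epoll_item_sketch {
    struct rb_node_sketch   tree_link;      // node in the red-black tree
    struct list_node_sketch ready_link;     // node in the ready list
    int                     fd;             // the monitored file descriptor
    unsigned int            event_mask;     // e.g., EPOLLIN | EPOLLOUT
};
```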
3.2 System Calls
- `epoll_create1()`:
  - Creates an epoll instance and returns a file descriptor referring to it.
  - Internally initializes the red-black tree and the ready list.
- `epoll_ctl(epfd, op, fd, event)`:
  - Modifies the red-black tree:
    - `EPOLL_CTL_ADD`: insert an FD with an event mask (e.g., `EPOLLIN`).
    - `EPOLL_CTL_MOD`: update the event mask for an existing FD.
    - `EPOLL_CTL_DEL`: remove an FD from the tree.
  - Kernel callback setup: when adding an FD, the kernel registers a callback on that FD's wait queue to track its I/O state. The callback adds the FD to the ready list when events occur.
- `epoll_wait(epfd, events, maxevents, timeout)`:
  - Blocks until at least one event occurs or the timeout expires.
  - Checks the ready list: copies ready events (`struct epoll_event`) to user space.
  - Returns only active events (up to `maxevents`), so the cost is proportional to ready FDs rather than monitored FDs (a minimal example combining all three calls follows below).
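Putting the three calls together, a minimal level-triggered example looks like this (error handling omitted; standard input is used as the monitored FD purely for illustration):

```c
#include <sys/epoll.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int epfd = epoll_create1(0);               // creates the tree + ready list

    struct epoll_event ev = {0};
    ev.events  = EPOLLIN;                      // interested in readability
    ev.data.fd = STDIN_FILENO;                 // monitor stdin for this demo
    epoll_ctl(epfd, EPOLL_CTL_ADD, STDIN_FILENO, &ev);  // insert into the tree

    struct epoll_event ready[8];
    int n = epoll_wait(epfd, ready, 8, 5000);  // block up to 5 seconds
    for (int i = 0; i < n; i++)
        printf("fd %d is ready (events=0x%x)\n",
               ready[i].data.fd, (unsigned)ready[i].events);

    close(epfd);
    return 0;
}
```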
4. Event Notification Mechanism
4.1 Callback-Driven Ready List
- Kernel-space efficiency: when network packets arrive or a socket becomes writable, the kernel's network stack triggers the FD-specific readiness callback (for a socket, this path goes through `sock_def_readable`).
- Callback action: the callback adds the FD to the ready list and wakes up any process blocked in `epoll_wait`.
- No polling required: the kernel never scans the registered FDs; events are pushed onto the ready list as they occur (see the toy simulation below).
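The following toy user-space simulation illustrates the push model; it is not kernel code, and the names `ready_list`, `on_fd_ready`, and the FD values are made up for the example. Readiness is recorded by a callback in O(1), and the waiter only walks the ready list.

```c
#include <stdio.h>

#define MAX_READY 64

static int ready_list[MAX_READY];   // stand-in for epoll's ready list
static int ready_count = 0;

// Stand-in for the kernel-side callback that fires when an FD becomes ready.
static void on_fd_ready(int fd) {
    ready_list[ready_count++] = fd;            // O(1) append, no scan of all FDs
}

int main(void) {
    // Imagine thousands of registered FDs; only two actually become ready.
    on_fd_ready(7);
    on_fd_ready(42);

    // "epoll_wait" only walks the ready list, never the full set of FDs.
    for (int i = 0; i < ready_count; i++)
        printf("would report fd %d as ready\n", ready_list[i]);
    return 0;
}
```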
4.2 Edge-Triggered (ET) vs. Level-Triggered (LT)
- Level-Triggered (default):
  - `epoll_wait` keeps notifying as long as the FD is still ready (e.g., unread data remains in the buffer).
  - Similar to `select`/`poll` behavior.
- Edge-Triggered (enable with `EPOLLET`):
  - Notifies only when the FD's state changes (e.g., new data arrives).
  - Why it's faster: reduces redundant wakeups (a single notification until new data arrives).
  - Requires non-blocking FDs: applications must read/write until `EAGAIN` to avoid missing events (see the drain helper below).
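Because an edge-triggered notification arrives only once per state change, the read path has to drain the kernel buffer until `read()` fails with `EAGAIN`. A sketch of such a helper (the buffer size and return convention are arbitrary choices for this example):

```c
#include <unistd.h>
#include <errno.h>

// Drain a non-blocking, edge-triggered FD until EAGAIN.
// Returns 0 when fully drained, -1 if the peer closed or a real error occurred.
int drain_fd(int fd) {
    char buf[4096];
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0)
            continue;                          // process buf[0..n) here, then keep reading
        if (n == 0)
            return -1;                         // peer closed the connection
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return 0;                          // kernel buffer fully drained
        return -1;                             // real error (e.g., ECONNRESET)
    }
}
```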
5. Why epoll is Fast: Implementation Summary
| Factor | select/poll | epoll |
|---|---|---|
| FD management | Linear array/bitmap, O(n) | Red-black tree, O(log n) |
| Event detection | Scan all FDs on every call | Callback-driven ready list, O(1) |
| Kernel↔user data copy | Copy the entire FD set each call | Copy only active events |
| Scalability | Degrades with total FDs | Scales with active events |
6. Example: TCP Server Using epoll
```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

#define MAX_EVENTS 64

int main(void) {
    // Create, bind, and listen on a TCP socket (port 8080 used as an example).
    int listen_sock = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(8080);
    bind(listen_sock, (struct sockaddr *)&addr, sizeof(addr));
    listen(listen_sock, SOMAXCONN);
    fcntl(listen_sock, F_SETFL, O_NONBLOCK);   // non-blocking: required for ET

    int epfd = epoll_create1(0);
    struct epoll_event ev, events[MAX_EVENTS];

    // Add listen socket to epoll (Edge-Triggered)
    ev.events  = EPOLLIN | EPOLLET;
    ev.data.fd = listen_sock;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_sock, &ev);

    while (1) {
        int nready = epoll_wait(epfd, events, MAX_EVENTS, -1);
        for (int i = 0; i < nready; i++) {
            if (events[i].data.fd == listen_sock) {
                // Accept new connections until EAGAIN and add them with EPOLLET
                int conn_sock;
                while ((conn_sock = accept(listen_sock, NULL, NULL)) >= 0) {
                    fcntl(conn_sock, F_SETFL, O_NONBLOCK);
                    ev.events  = EPOLLIN | EPOLLET;
                    ev.data.fd = conn_sock;
                    epoll_ctl(epfd, EPOLL_CTL_ADD, conn_sock, &ev);
                }
            } else {
                // Handle reads: loop until EAGAIN (mandatory for ET!)
                char buf[4096];
                ssize_t n;
                while ((n = read(events[i].data.fd, buf, sizeof(buf))) > 0) {
                    /* process buf[0..n) here */
                }
                if (n == 0 || (n < 0 && errno != EAGAIN))
                    close(events[i].data.fd);  // peer closed or real error
            }
        }
    }
}
```
7. Common Pitfalls
- Edge-Triggered starvation: forgetting to read/write until `EAGAIN` may cause missed events.
- Thread safety: use `EPOLLONESHOT` if multiple threads may process the same FD (see the re-arm sketch below).
- FD leaks: remove closed FDs from epoll with `EPOLL_CTL_DEL`; `close()` only deregisters the FD automatically once no other references to the open file remain.
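For the thread-safety point, a common pattern is to register the FD with `EPOLLONESHOT` so that only one thread is woken per event, then re-arm it with `EPOLL_CTL_MOD` after that thread has finished processing. A sketch (`epfd` and `fd` are assumed to already exist):

```c
#include <sys/epoll.h>

// Register `fd` so that at most one thread receives each notification.
void arm_oneshot(int epfd, int fd) {
    struct epoll_event ev = {0};
    ev.events  = EPOLLIN | EPOLLONESHOT;   // disarmed after the first event
    ev.data.fd = fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

// Call after the event has been fully handled to receive further events.
void rearm_oneshot(int epfd, int fd) {
    struct epoll_event ev = {0};
    ev.events  = EPOLLIN | EPOLLONESHOT;
    ev.data.fd = fd;
    epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);   // re-arm the disarmed FD
}
```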
8. Conclusion
`epoll` achieves high performance through:
- Efficient data structures: a red-black tree for FD management.
- Kernel callbacks: eliminate O(n) scans.
- Minimal data copying: only active events are passed to user space.
- Edge-Triggered optimization: reduces redundant wakeups.

Use `epoll` for Linux-based high-concurrency applications (e.g., HTTP servers, real-time systems). For non-Linux systems, explore alternatives such as `kqueue` (BSD) or `IOCP` (Windows).
This post is licensed under CC BY 4.0 by the author.