Note: we have since conducted a more complete study; see the
OLS 2004 paper referenced below.
Tim Brecht, January 12, 2004.
Note: these are very preliminary test results, and they do not
test the conditions that epoll was designed to perform well under
(i.e., environments with very large numbers of connections,
many of which may not be doing much).
We've been experimenting with using epoll_create, epoll_ctl,
and epoll_wait in our user-level micro web server, called the
userver.
I've conducted a series of experiments comparing the performance
of the userver using select, poll, and epoll.
This set of experiments is very preliminary as we are still working
on improving the epoll version.
Note that we are currently only using level-triggered events.
This is simply because select and poll are level-triggered and
it made integration with the existing code easy
(we hope to examine an edge-triggered approach in the future).
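As a rough point of reference, the following is a minimal sketch of the
kind of level-triggered epoll event loop described above. The descriptor
names and the work done in each branch are illustrative only, not the
userver's actual code.

#include <sys/epoll.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_EVENTS 1024

/* Sketch only: listen_sd is assumed to be a listening socket that has
 * already been created, bound, and placed in non-blocking mode. */
void event_loop(int listen_sd)
{
    struct epoll_event ev, events[MAX_EVENTS];
    int i, n;
    int epfd = epoll_create(MAX_EVENTS);  /* size is only a hint */

    if (epfd < 0) {
        perror("epoll_create");
        exit(1);
    }

    ev.events = EPOLLIN;                  /* level-triggered by default */
    ev.data.fd = listen_sd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, listen_sd, &ev) < 0) {
        perror("epoll_ctl");
        exit(1);
    }

    for (;;) {
        n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        for (i = 0; i < n; i++) {
            if (events[i].data.fd == listen_sd) {
                /* accept new connections and EPOLL_CTL_ADD each one */
            } else if (events[i].events & EPOLLIN) {
                /* read and parse more of the request */
            } else if (events[i].events & EPOLLOUT) {
                /* continue writing the reply */
            }
        }
    }
}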
This is a description of what we've been trying and finding so far.
The workload consists of 8 clients running httperf and pounding on
the server by repeatedly requesting the same single file (sized to fit
in one packet).
There are no dead connections.
The server contains two 2.4 GHz Xeon processors,
1 GB of memory, and a fast SCSI disk.
It has two on-board Intel Pro/1000 NICs.
It also contains two dual-ported Intel MT Pro/1000 cards.
The 8 clients are all dual Intel Xeon systems
with IDE drives and 1 GB of memory.
Each client has two on-board Intel Pro/1000 ports/cards
and a single dual-ported Intel MT Pro/1000 card.
What each line in the experiment represents is explained below.
The current, again very preliminary, results show that with
the current epoll implementations (which we believe can be
improved further), select and poll perform better.
Experiments were conducted using the userver
version 0.4.2 with the following options.
Gprof output using epoll (reply rate = 12523)
This version spends a significant portion of its time
in epoll_ctl (16.74%). This call is used when a connection
transitions, for example, from a state where it is reading a request
to a state where it is writing a reply.
We tell the OS that we only want to know about events that permit
us to write on that connection and we don't want to know about
events that permit us to read on that connection (at this time).
Comparatively there isn't that much time spent in epoll_wait
(7.75%).
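Each such read-to-write transition costs a separate epoll_ctl call; a
rough sketch of what the switch looks like is shown below. The function
and variable names are illustrative, not the userver's actual code.

#include <sys/epoll.h>

/* Sketch: once the full request has been read on sd, stop watching for
 * readable events and ask only for writable ones.  Every transition like
 * this is a separate epoll_ctl system call. */
int switch_interest_to_writing(int epfd, int sd)
{
    struct epoll_event ev;

    ev.events = EPOLLOUT;    /* write readiness only; EPOLLIN dropped */
    ev.data.fd = sd;
    return epoll_ctl(epfd, EPOLL_CTL_MOD, sd, &ev);
}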
In all cases below we are using write to send the
reply header, sendfile to send the file, and setsockopt
to cork and uncork the socket so that the header and file
can be sent in the same packet.
The fcntl system call is used to place new sockets into
non-blocking mode.
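A rough sketch of that reply path and the non-blocking setup follows,
with error handling omitted; the function and variable names are chosen
for illustration and are not the userver's actual code.

#include <fcntl.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Sketch of the reply path described above: cork the socket, write the
 * header, sendfile the body, then uncork so the header and file can go
 * out in the same packet.  sd, hdr, hdr_len, file_fd, and file_len are
 * illustrative. */
int send_reply(int sd, const char *hdr, size_t hdr_len,
               int file_fd, off_t file_len)
{
    int on = 1, off = 0;
    off_t offset = 0;

    setsockopt(sd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));    /* cork */
    write(sd, hdr, hdr_len);                                   /* reply header */
    sendfile(sd, file_fd, &offset, file_len);                  /* file body */
    setsockopt(sd, IPPROTO_TCP, TCP_CORK, &off, sizeof(off));  /* uncork */
    return 0;
}

/* New sockets returned by accept are placed in non-blocking mode. */
int set_nonblocking(int sd)
{
    int flags = fcntl(sd, F_GETFL, 0);
    return fcntl(sd, F_SETFL, flags | O_NONBLOCK);
}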
Gprof output using epoll2 (reply rate = 12842)
This version is designed to reduce the amount of time spent
doing epoll_ctl calls. This is achieved, but at the expense
of increasing the amount of time spent doing epoll_wait calls.
When compared with the technique above, the number of
epoll_ctl calls is reduced from about 7.3 million to about 4.1 million,
and the resulting portion of
time spent in epoll_ctl is reduced from 16.74% to 10.49%.
Interestingly, the number of calls to epoll_wait has increased
by two orders of magnitude.
This is because (as noted in the description of the different
methods above) we get a large number of events that we have
no interest in. E.g., that we are able to write to a connection
that we are currently trying to read from.
Additionally, because there will almost always be events that are ready
at the time of the call we get smaller numbers of events back on average
(15.4 events per epoll_wait call).
This compares with an average of 261.8 events from each epoll_wait
call for the simple epoll version.
In this case we also see a bit of overhead from
determining what state of the finite state machine
we are in (get_fsm_state). This is required in this server
because now we can get events from epoll_wait that we
may not want to act on because we aren't in a state that
warrants it. E.g., if the state we are in is reading a request
we don't act on events that permit us to write to that
connection (until we have the full request, have parsed it, etc.).
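The kind of filtering this implies is sketched below; the connection-state
names and the get_fsm_state signature are assumptions for illustration
rather than the userver's actual code.

#include <stdint.h>
#include <sys/epoll.h>

/* Hypothetical connection states for illustration. */
enum conn_state { CONN_READING_REQUEST, CONN_WRITING_REPLY };

/* Assumed helper: returns the finite-state-machine state for this socket. */
extern enum conn_state get_fsm_state(int sd);

/* With epoll2 the interest set is left broad, so each event returned by
 * epoll_wait must be checked against the connection's state before it is
 * acted on. */
void handle_event(int sd, uint32_t events)
{
    enum conn_state state = get_fsm_state(sd);

    if ((events & EPOLLIN) && state == CONN_READING_REQUEST) {
        /* read more of the request */
    }
    if ((events & EPOLLOUT) && state == CONN_WRITING_REPLY) {
        /* continue writing the reply */
    }
    /* Anything else is ignored, e.g. a writable notification on a
     * connection whose request we are still reading. */
}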
Gprof output using select (reply rate = 14196)
In this case we are using select, and the set of descriptors/events
we are interested in is controlled entirely by setting and/or
clearing bits in user address space.
On the call to select that information is copied to and from the kernel.
But because no extra system call is required per
change in state for each connection, there is no per-change overhead,
and the select server is able to perform considerably better
(** under this workload **).
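A sketch of what that user-space bit manipulation looks like with select
is shown below; the variable and function names are illustrative, not
the userver's actual code.

#include <sys/select.h>

fd_set read_interest, write_interest;  /* interest sets live in user space */
int max_sd;

void init_interest_sets(void)
{
    FD_ZERO(&read_interest);
    FD_ZERO(&write_interest);
}

/* Changing interest is just bit manipulation; no system call is needed. */
void done_reading_request(int sd)
{
    FD_CLR(sd, &read_interest);    /* stop watching for readability */
    FD_SET(sd, &write_interest);   /* start watching for writability */
}

/* A single select call copies the interest sets to and from the kernel. */
int wait_for_events(fd_set *readable, fd_set *writable)
{
    *readable = read_interest;     /* select overwrites its arguments,    */
    *writable = write_interest;    /* so pass copies of the interest sets */
    return select(max_sd + 1, readable, writable, NULL, NULL);
}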
Gprof output using poll (reply rate = 14471).
Note that in this instance poll is slightly better than select.
In this case get_fsm_state appears because on 2.6.0 I noticed
instances (potential bugs, I think) where poll would report
a socket-writable event even when that event was not in the
interest set. The workaround for this test was to add code
to check the connection's state (in the finite state machine)
before doing reads and/or writes. Otherwise we were trying to
write when we shouldn't have been.
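A sketch of that workaround follows; as in the epoll2 sketch above, the
state names and get_fsm_state are assumptions for illustration, not the
userver's actual code.

#include <poll.h>

/* Hypothetical connection states, as in the earlier sketch. */
enum conn_state { CONN_READING_REQUEST, CONN_WRITING_REPLY };
extern enum conn_state get_fsm_state(int sd);

/* Workaround sketch: even though POLLOUT should only have been requested
 * for connections in the writing state, check the finite state machine
 * before acting, in case poll reports a writable event that was not in
 * the interest set. */
void handle_pollfd(const struct pollfd *pfd)
{
    if ((pfd->revents & POLLIN) &&
        get_fsm_state(pfd->fd) == CONN_READING_REQUEST) {
        /* read more of the request */
    }
    if ((pfd->revents & POLLOUT) &&
        get_fsm_state(pfd->fd) == CONN_WRITING_REPLY) {
        /* write the reply */
    }
}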
Louay Gammo, Tim Brecht, Amol Shukla, and David Pariag,
"Comparing and Evaluating epoll, select, and poll Event Mechanisms",
Proceedings of the 6th Annual Ottawa Linux Symposium,
Ottawa, Canada, July 2004.
Preliminary Experiments with epoll on Linux 2.6.0
(epoll2) -c 15000 -f 32000 -C --use-sendfile --use-tcp-cork -m 0 --stats-interval 5 --use-epoll2
(epoll) -c 15000 -f 32000 -C --use-sendfile --use-tcp-cork -m 0 --stats-interval 5 --use-epoll
(poll) -c 15000 -f 32000 -C --use-sendfile --use-tcp-cork -m 0 --stats-interval 5 --use-poll
(select) -c 15000 -f 32000 -C --use-sendfile --use-tcp-cork -m 0 --stats-interval 5
(epoll-orig) same as (epoll) above, but the code no longer exists as it was clearly a poor implementation.
/* now that we have the request we don't want info about
* reading this sd but about writing this sd.
*/
interest_set_not_readable(sd);
interest_set_writable(sd);
The result was an excessive number of calls to epoll_ctl.
/* now that we have the request we don't want info about
* reading this sd but about writing this sd.
*/
interest_set_change(sd, ISET_NOT_READABLE | ISET_WRITABLE);
This type of improvement reduces the number of epoll_ctl calls by half.
As can be seen by comparing epoll-orig and epoll, this does help
performance somewhat.
So the pros and cons of this approach are:
Some gprof Output
All output here is obtained using a separate run from the
experiments shown in the graph above.
We ran each experiment for one data point (i.e., a target
request rate of 24000 requests per second).
I've included any routines that account for 1% or more
of the CPU time.
epoll (reply rate = 12523):
Flat profile:
Each sample counts as 0.000999001 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
20.12 22.58 22.58 read
16.74 41.37 18.79 7353337 0.00 0.00 sys_epoll_ctl
14.33 57.45 16.08 close
9.65 68.27 10.83 accept
9.27 78.67 10.40 setsockopt
7.75 87.37 8.69 20526 0.00 0.00 sys_epoll_wait
5.01 92.99 5.62 write
3.47 96.89 3.90 do_fcntl
2.74 99.96 3.08 sendfile
2.42 102.67 2.71 1673500 0.00 0.00 parse_bytes
----------------------------------------------------------------------
epoll2 (reply rate = 12842):
Flat profile:
Each sample counts as 0.000999001 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
18.37 22.73 22.73 read
14.64 40.85 18.12 close
13.97 58.12 17.28 2779336 0.00 0.00 sys_epoll_wait
10.49 71.11 12.98 4144242 0.00 0.00 sys_epoll_ctl
8.99 82.23 11.12 accept
7.77 91.84 9.61 setsockopt
4.23 97.07 5.23 write
3.44 101.33 4.26 do_fcntl
3.14 105.22 3.89 sendfile
2.68 108.54 3.32 1745477 0.00 0.00 parse_bytes
1.88 110.86 2.32 85789176 0.00 0.00 get_fsm_state
1.76 113.04 2.18 42894588 0.00 0.00 do_epoll_event
1.27 114.61 1.58 gettimeofday
----------------------------------------------------------------------
select (reply rate = 14196):
Flat profile:
Each sample counts as 0.000999001 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
19.71 21.99 21.99 read
15.87 39.70 17.71 close
14.31 55.66 15.96 select
11.73 68.74 13.08 setsockopt
10.21 80.13 11.39 accept
6.27 87.13 7.00 write
4.01 91.60 4.47 do_fcntl
3.74 95.77 4.17 sendfile
2.59 98.66 2.89 1916279 0.00 0.00 parse_bytes
1.31 100.12 1.46 25416050 0.00 0.00 check_and_do_read_write
1.23 101.50 1.38 32857 0.00 0.00 process_sds
1.13 102.77 1.26 4133040 0.00 0.00 socket_readable
----------------------------------------------------------------------
poll (reply rate = 14471):
Flat profile:
Each sample counts as 0.000999001 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
18.71 19.76 19.76 read
15.93 36.57 16.81 close
14.06 51.40 14.84 poll
11.32 63.36 11.95 setsockopt
10.33 74.26 10.91 accept
5.93 80.52 6.26 write
3.93 84.66 4.15 do_fcntl
3.86 88.74 4.08 sendfile
2.62 91.50 2.76 1768890 0.00 0.00 parse_bytes
1.00 92.56 1.05 5657788 0.00 0.00 get_fsm_state
----------------------------------------------------------------------
Future work
Last modified:
Mon Jan 12 11:36:47 EST 2004