Message ID:    51C1993C.9030204@linux.intel.com
State:         RFC, archived
Delegated to:  David Miller
On Wed, Jun 19, 2013 at 2:42 PM, Eliezer Tamir <eliezer.tamir@linux.intel.com> wrote:
> This is a wild hack, just a POC to show the power of LLS with epoll.
>
> We assume that we only ever need to poll on one device queue,
> so the first FD that reports POLL_LL gets saved aside so we can poll on it.
>
> While this assumption is wrong in so many ways, it's very easy to satisfy
> with a micro-benchmark.
>
> [this patch needs the poll patch to be applied first]
> with sockperf doing epoll on 1000 sockets I see an avg latency of 6us

hi eliezer,

please consider the following solution for epoll that is based on polling
dev+queue. instead of looping over the sockets as in LLS, maintain in the
eventpoll struct a list of device+queues (qdlist). the qdlist must be
unique w.r.t. device+queue (no two identical device+queue items in the
qdlist). each device+queue item (qditem) holds:

* device (id)
* queue (id)
* a list of the epi items (epilist) that created this qditem -- I think it
  won't be possible to extend epitem (it breaks the 128-byte cache
  alignment)... instead you can keep a simple ll_usec list.
* ll_usec, the maximum time to poll over all the referring epi items.

(a rough sketch of these structures follows this message.)

finally, polling should iterate over the qdlist once, and then check for
events.

----

as far as coding goes, this sketch involves:

1) adjusting the eventpoll struct.
2) initializing on creation (epoll_create).
3) updating the list on modification (epoll_ctl):
   3.1) ep_insert -> add this epi/ll_usec to the relevant qditem (or
        create a new one), and update qditem->ll_usec.
   3.2) ep_remove -> remove this epi/ll_usec from the relevant qditem
        (which MUST exist -- a sort of ref counting), and update
        qditem->ll_usec.
   3.3) ep_modify -> ...
4) on a polling event (epoll_wait):
   ep_poll -> if the qdlist is not empty, find the maximum ll_usec (this
   could be done while maintaining the list) ... and just before going
   into wait ... if max ll_usec != 0, poll once on all device+queues in
   the qdlist, then continue to the next iteration (check events).
5) to support this flow we also need to implement an API for:
   5.1) given a file/fd/epi, if it is a sock, getting its device+queue.
   5.2) polling once over a given device+queue (dq_poll_ll).
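[Editor's note: to make the proposed bookkeeping concrete, here is a minimal C sketch of the qdlist/qditem layout described above. All names and field types are hypothetical, taken from the sketch in the message rather than from any kernel code; allocation and locking are left out.]

    #include <linux/list.h>
    #include <linux/types.h>

    /* Hypothetical layout for the per-eventpoll device+queue list
     * sketched above; names mirror the message, not real kernel code.
     */
    struct qditem {
            struct list_head qdlink;    /* link in eventpoll->qdlist */
            int dev_id;                 /* device (id) */
            int queue_id;               /* queue (id) */
            struct list_head epilist;   /* epi items that created this qditem */
            unsigned long ll_usec;      /* max poll time over referring epis */
    };

    struct eventpoll {
            /* ... existing eventpoll fields ... */
            struct list_head qdlist;    /* unique w.r.t. device+queue */
    };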
On 25/06/2013 17:26, yaniv saar wrote:
> On Wed, Jun 19, 2013 at 2:42 PM, Eliezer Tamir
> <eliezer.tamir@linux.intel.com> wrote:
>>
>> [this patch needs the poll patch to be applied first]
>> with sockperf doing epoll on 1000 sockets I see an avg latency of 6us
>>
>
> hi eliezer,
>
> please consider the following solution for epoll that is based on
> polling dev+queue.
> instead of looping over the sockets as in LLS, maintain in the
> eventpoll struct a list of device+queues (qdlist).

Thanks for looking into this.
I'm currently working on a solution that has a lot in common with what
you are proposing.

We don't need a new id mechanism; we already have the napi_id.
The nice thing about the napi_id is that the only locking it needs is an
rcu_read_lock when dereferencing.

We don't need to remember the ll_usec value of each socket, because the
patch for select/poll (currently waiting for review) added a separate
sysctl value for poll.
I would like to find a way for the user to specify how long to busy wait
directly from the system call, but I was not able to find a simple way
of adding this without a change to the system call prototype.

We do, however, need to track when a socket's napi_id changes. For that
we can hook into sk_mark_ll().

So here is a list of proposed changes:

1. Add a linked list of unique napi_id's to struct eventpoll (a rough
   sketch follows this message). Each id will have a collision list of
   sockets that have the same id. A hash is gratuitous; we expect the
   unique list to have 0 to 2 elements in most cases.

2. When a new socket is added, if its id is new it gets added to the
   unique list, otherwise to the collision list of that id.

3. When a socket is removed, if it's on the unique list, replace it with
   the first socket on its former collision list.

4. Add a callback mechanism to sk_mark_ll() which will be activated when
   the mark changes, to update the lists. (A socket may be polled by
   more than one epoll, so be careful.)

5. Add to / remove from the lists in ep_insert and ep_remove
   respectively. Check whether we need to do something for ep_modify().

6. Add an ep_poll helper that will round-robin polling on the files in
   the unique list.

7. Init everything from epoll_create.

Locking:
napi_id's are great since they don't need locking, except for an
rcu_read_lock when polling on one.
The lists need a spinlock for adding/removing; maybe they can use
ep->lock.
Callback registration/removal needs to use the same mechanism that
ep_insert / ep_remove use to protect themselves from the rest of epoll.
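[Editor's note: a minimal sketch of the unique/collision lists from steps 1-3 above. The struct names (ep_unique_id, ep_id_entry) and the helper are hypothetical illustrations, not code from the eventual patches; the locking described in the message (ep->lock, RCU) is omitted.]

    #include <linux/list.h>
    #include <linux/slab.h>

    /* Hypothetical per-eventpoll tracking of unique napi_id's.
     * Each unique id carries a collision list of entries that
     * share it; all names here are illustrative only.
     */
    struct ep_unique_id {
            struct list_head link;        /* link in eventpoll's unique list */
            unsigned int napi_id;
            struct list_head collisions;  /* ep_id_entry's sharing this id */
    };

    struct ep_id_entry {
            struct list_head link;        /* link in ep_unique_id->collisions */
            void *epi;                    /* the epitem this entry represents */
    };

    /* Step 2: find or create the unique-list node for a napi_id;
     * the caller then files its epitem on uid->collisions.
     */
    static struct ep_unique_id *ep_id_add(struct list_head *unique,
                                          unsigned int napi_id)
    {
            struct ep_unique_id *uid;

            list_for_each_entry(uid, unique, link)
                    if (uid->napi_id == napi_id)
                            return uid;   /* id already known */

            uid = kzalloc(sizeof(*uid), GFP_KERNEL);
            if (!uid)
                    return NULL;
            uid->napi_id = napi_id;
            INIT_LIST_HEAD(&uid->collisions);
            list_add(&uid->link, unique);
            return uid;
    }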
This is a wild hack, just a POC to show the power of LLS with epoll.

We assume that we only ever need to poll on one device queue,
so the first FD that reports POLL_LL gets saved aside so we can poll on it.

While this assumption is wrong in so many ways, it's very easy to satisfy
with a micro-benchmark.

[this patch needs the poll patch to be applied first]
with sockperf doing epoll on 1000 sockets I see an avg latency of 6us

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
---
 fs/eventpoll.c |   39 +++++++++++++++++++++++++++++++++------
 1 files changed, 33 insertions(+), 6 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index deecc72..3c7562b 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -41,6 +41,7 @@
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
 #include <linux/compat.h>
+#include <net/ll_poll.h>
 
 /*
  * LOCKING:
@@ -214,6 +215,7 @@ struct eventpoll {
 	/* used to optimize loop detection check */
 	int visited;
 	struct list_head visited_list_link;
+	struct epitem *ll_epi;
 };
 
 /* Wait structure used by the poll hooks */
@@ -773,13 +775,30 @@ static int ep_eventpoll_release(struct inode *inode, struct file *file)
 	return 0;
 }
 
-static inline unsigned int ep_item_poll(struct epitem *epi, poll_table *pt)
+static inline unsigned int ep_item_poll(struct epitem *epi, poll_table *pt, struct eventpoll *ep)
 {
+	unsigned int events = epi->ffd.file->f_op->poll(epi->ffd.file, pt);
 	pt->_key = epi->event.events;
 
-	return epi->ffd.file->f_op->poll(epi->ffd.file, pt) & epi->event.events;
+	if (events & POLLLLS) {
+		events &= ~POLLLLS;
+		ep->ll_epi = epi;
+	}
+
+	return events & epi->event.events;
+}
+
+static inline bool ep_item_poll_ll(struct epitem *epi)
+{
+	poll_table wait;
+
+	wait._key = POLLLLS;
+	wait._qproc = NULL;
+
+	return epi->ffd.file->f_op->poll(epi->ffd.file, &wait);
 }
+
 
 static int ep_read_events_proc(struct eventpoll *ep, struct list_head *head,
 			       void *priv)
 {
@@ -789,7 +808,7 @@ static int ep_read_events_proc(struct eventpoll *ep, struct list_head *head,
 	init_poll_funcptr(&pt, NULL);
 
 	list_for_each_entry_safe(epi, tmp, head, rdllink) {
-		if (ep_item_poll(epi, &pt))
+		if (ep_item_poll(epi, &pt, ep))
 			return POLLIN | POLLRDNORM;
 		else {
 			/*
@@ -1271,7 +1290,7 @@ static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
 	 * this operation completes, the poll callback can start hitting
 	 * the new item.
 	 */
-	revents = ep_item_poll(epi, &epq.pt);
+	revents = ep_item_poll(epi, &epq.pt, ep);
 
 	/*
 	 * We have to check if something went wrong during the poll wait queue
@@ -1403,7 +1422,7 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi, struct epoll_even
 	 * Get current event bits. We can safely use the file* here because
 	 * its usage count has been increased by the caller of this function.
 	 */
-	revents = ep_item_poll(epi, &pt);
+	revents = ep_item_poll(epi, &pt, ep);
 
 	/*
 	 * If the item is "hot" and it is not registered inside the ready
@@ -1471,7 +1490,7 @@ static int ep_send_events_proc(struct eventpoll *ep, struct list_head *head,
 
 		list_del_init(&epi->rdllink);
 
-		revents = ep_item_poll(epi, &pt);
+		revents = ep_item_poll(epi, &pt, ep);
 
 		/*
 		 * If the event mask intersect the caller-requested one,
@@ -1558,6 +1577,10 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 	long slack = 0;
 	wait_queue_t wait;
 	ktime_t expires, *to = NULL;
+	cycles_t ll_time = ll_end_time();
+	//bool try_ll = true;
+	bool can_ll = !!ep->ll_epi;
+
 
 	if (timeout > 0) {
 		struct timespec end_time = ep_set_mstimeout(timeout);
@@ -1601,6 +1624,10 @@ fetch_events:
 			break;
 		}
 
+		while (can_ll && can_poll_ll(ll_time)
+		       && !ep_events_available(ep))
+			ep_item_poll_ll(ep->ll_epi);
+
 		spin_unlock_irqrestore(&ep->lock, flags);
 		if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS))
 			timed_out = 1;
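[Editor's note: for context, here is a hypothetical sketch of the kind of userspace loop a sockperf-style micro-benchmark would run against this patch: many UDP sockets on one epoll instance, with an unchanged epoll_wait() loop. Busy polling is enabled kernel-side by the LLS series' sysctl (net.core.low_latency_poll in the series of that era, later renamed net.core.busy_poll); the socket count and the loop body are illustrative, not taken from sockperf.]

    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    #define NSOCKS 1000

    int main(void)
    {
            struct epoll_event ev, events[64];
            int epfd = epoll_create1(0);
            int i;

            if (epfd < 0)
                    return 1;

            /* Register NSOCKS UDP sockets with one epoll instance. */
            for (i = 0; i < NSOCKS; i++) {
                    struct sockaddr_in addr = {
                            .sin_family = AF_INET,
                            .sin_addr.s_addr = htonl(INADDR_ANY),
                            .sin_port = 0,  /* any free port */
                    };
                    int fd = socket(AF_INET, SOCK_DGRAM, 0);

                    if (fd < 0 ||
                        bind(fd, (struct sockaddr *)&addr, sizeof(addr)))
                            return 1;
                    ev.events = EPOLLIN;
                    ev.data.fd = fd;
                    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev))
                            return 1;
            }

            /* Ordinary epoll_wait() loop; with the patch applied, ep_poll()
             * busy polls the saved device queue before sleeping.
             */
            for (;;) {
                    int n = epoll_wait(epfd, events, 64, -1);

                    for (i = 0; i < n; i++) {
                            char buf[2048];
                            /* latency measurement would go here */
                            recv(events[i].data.fd, buf, sizeof(buf), 0);
                    }
            }
            return 0;
    }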