Message ID: 1454616988-21901-1-git-send-email-emmanuel.grumbach@intel.com
State: RFC, archived
Delegated to: David Miller
On 02/04/2016 12:16 PM, Emmanuel Grumbach wrote:
> As many (all?) WiFi devices, Intel WiFi devices have transmit queues
> which have 256 transmit descriptors each, and each descriptor
> corresponds to an MPDU. This means that when it is full, the queue
> contains 256 * ~1500 bytes to be transmitted (if we don't have
> A-MSDUs). The purpose of those queues is to have enough packets
> ready for transmission so that when the device gets an opportunity
> to transmit (TxOP), it can take as many packets as the spec allows
> and aggregate them into one A-MPDU, or even several A-MPDUs if we
> are using bursts.

I guess this is only really usable if you have exactly one peer
connected (i.e., in station mode)?

Otherwise, you could have one slow peer and one fast one, and then I
suspect this would not work so well?

For reference, ath10k has around 1400 tx descriptors, though in
practice not all are usable, and in stock firmware, I'm guessing the
NIC will never be able to actually fill up its tx descriptors and
stop traffic. Instead, it just allows the stack to try to TX, then
drops the frame...

Thanks,
Ben
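To put the quoted numbers in perspective, a rough back-of-the-envelope
calculation, assuming full-size 1500-byte MPDUs and the 6 Mbps lowest
rate used in the experiment described at the end of this thread:

$$
256 \times 1500\,\mathrm{B} \times 8 \approx 3.07\,\mathrm{Mbit},
\qquad
\frac{3.07\,\mathrm{Mbit}}{6\,\mathrm{Mbit/s}} \approx 512\,\mathrm{ms}
$$

A full ring at the lowest rate therefore holds roughly half a second
of driver-side queue, before the qdisc adds anything on top; this is
the bufferbloat the thread is about.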
On 02/04/2016 10:46 PM, Ben Greear wrote:
> On 02/04/2016 12:16 PM, Emmanuel Grumbach wrote:
>> As many (all?) WiFi devices, Intel WiFi devices have transmit
>> queues which have 256 transmit descriptors each, and each
>> descriptor corresponds to an MPDU. [...]
>
> I guess this is only really usable if you have exactly one peer
> connected (i.e., in station mode)?
>
> Otherwise, you could have one slow peer and one fast one, and then I
> suspect this would not work so well?

Yes. I guess this is one (big) limitation. I guess that what would
happen in this case is that the latency would constantly jitter. But
I also noticed that I could reduce the transmit queue to 130
descriptors (instead of 256) and still reach maximal throughput,
because we can refill the queues quickly enough.
In iwlwifi, we have plans to have one queue for each peer. This is
under development. Not sure when it'll be ready. It also requires a
firmware change, obviously.

> For reference, ath10k has around 1400 tx descriptors, though in
> practice not all are usable, and in stock firmware, I'm guessing the
> NIC will never be able to actually fill up its tx descriptors and
> stop traffic. Instead, it just allows the stack to try to TX, then
> drops the frame...

1400 descriptors, ok... but they are not organised in queues?
(Forgive my ignorance of athX drivers.)

> Thanks,
> Ben
On 02/04/2016 12:56 PM, Grumbach, Emmanuel wrote:
> On 02/04/2016 10:46 PM, Ben Greear wrote:
>> On 02/04/2016 12:16 PM, Emmanuel Grumbach wrote:
>>> As many (all?) WiFi devices, Intel WiFi devices have transmit
>>> queues which have 256 transmit descriptors each [...]
>>
>> I guess this is only really usable if you have exactly one peer
>> connected (i.e., in station mode)?
>>
>> Otherwise, you could have one slow peer and one fast one, and then
>> I suspect this would not work so well?
>
> Yes. I guess this is one (big) limitation. I guess that what would
> happen in this case is that the latency would constantly jitter. But
> I also noticed that I could reduce the transmit queue to 130
> descriptors (instead of 256) and still reach maximal throughput,
> because we can refill the queues quickly enough.
> In iwlwifi, we have plans to have one queue for each peer. This is
> under development. Not sure when it'll be ready. It also requires a
> firmware change, obviously.

Per-peer queues will probably be nice, especially if we can keep the
buffer bloat manageable.

>> For reference, ath10k has around 1400 tx descriptors, though in
>> practice not all are usable, and in stock firmware, I'm guessing
>> the NIC will never be able to actually fill up its tx descriptors
>> and stop traffic. Instead, it just allows the stack to try to TX,
>> then drops the frame...
>
> 1400 descriptors, ok... but they are not organised in queues?
> (forgive my ignorance of athX drivers)

I think all the details are in the firmware, at least for now. The
firmware details are probably not something I should go into, but
suffice it to say it's complex and varies between firmware versions
in non-trivial ways.

Thanks,
Ben
On 4 February 2016 at 22:14, Ben Greear <greearb@candelatech.com> wrote:
> On 02/04/2016 12:56 PM, Grumbach, Emmanuel wrote:
>> On 02/04/2016 10:46 PM, Ben Greear wrote:
>>> On 02/04/2016 12:16 PM, Emmanuel Grumbach wrote:
>>>> As many (all?) WiFi devices, Intel WiFi devices have transmit
>>>> queues which have 256 transmit descriptors each [...]
>>>
>>> I guess this is only really usable if you have exactly one peer
>>> connected (ie, in station mode)?
>>>
>>> Otherwise, you could have one slow peer and one fast one, and then
>>> I suspect this would not work so well?
>>
>> Yes. I guess this is one (big) limitation. I guess that what would
>> happen in this case is that the latency would constantly jitter.

Hmm.. You'd probably need to track per-station packet sojourn time as
well and make it possible to stop/wake queues per station.

>> But I also noticed that I could reduce the transmit queue to 130
>> descriptors (instead of 256) and still reach maximal throughput
>> because we can refill the queues quickly enough.
>> In iwlwifi, we have plans to have one queue for each peer.
>> This is under development. Not sure when it'll be ready. It also
>> requires a firmware change, obviously.
>
> Per-peer queues will probably be nice, especially if we can keep the
> buffer bloat manageable.

Per-station queues sound tricky if you consider bufferbloat.

To maximize use of airtime (i.e. txop) you need to send big
aggregates. Since aggregates are per station-tid, to maximize
multi-station performance (in AP mode) you'll need to queue a lot of
frames per station, depending on the chosen tx rate.

A bursted txop can be as big as 5-10ms. If you consider that you want
to queue 5-10ms worth of data for *each* station at any given time,
you obviously introduce a lot of lag. If you have 10 stations you
might end up with a service period of 10*10ms = 100ms. This gets even
worse if you consider MU-MIMO, because you need to do an expensive
sounding procedure before transmitting. So while SU aggregation can
probably still work reasonably well with shorter bursts (1-2ms), MU
needs at least 3ms to get *any* gain when compared to SU (which
obviously means you want more to actually make MU pay off). The rule
of thumb is: the longer you wait, the bigger capacity you can get.

Apparently there's interest in maximizing throughput, but it stands
in direct opposition to keeping latency down, so I've been thinking
about how to satisfy both.

The current approach ath10k is taking (patches in review [1][2]) is
to use mac80211 software queues for per-station queuing, exposing
queue state to firmware (it decides where frames should be dequeued
from) and making it possible to stop/wake per-station tx subqueues
with fake netdev queues. I'm starting to think this is not the right
way, though, because it's inherently hard to control latency and
there's a huge memory overhead associated with the fake netdev
queues. Also, fq_codel is less effective with this kind of setup.

My current thinking is that the entire problem should be solved via
(per-AC) qdiscs, e.g. fq_codel.
I guess one could use limit/target/interval/quantum knobs to tune it
for the higher latency of aggregation-oriented Wi-Fi links where a
long service time (think 100-200ms) is acceptable. However, fq_codel
is oblivious to how Wi-Fi works in the first place, i.e. Wi-Fi gets
better throughput if you deliver bursts of packets destined to the
same station. Moreover, this gets even more complicated with MU-MIMO,
where you may want to consider the spatial location (which influences
signal quality when grouped) of each station when you decide which
set of stations you're going to aggregate to in parallel. Since
drivers have a finite tx ring, it is important to deliver bursts that
can actually be aggregated efficiently. This means the driver would
need to be able to tell the qdisc about per-flow conditions to
influence the RR scheme in some way (assuming a qdisc even
understands flows; do we need a unified way of talking about flows
between qdiscs and drivers?).

[1]: https://www.spinics.net/lists/linux-wireless/msg146187.html
[2]: https://www.spinics.net/lists/linux-wireless/msg146512.html

>>> For reference, ath10k has around 1400 tx descriptors, though in
>>> practice not all are usable, and in stock firmware, I'm guessing
>>> the NIC will never be able to actually fill up its tx descriptors
>>> and stop traffic. Instead, it just allows the stack to try to TX,
>>> then drops the frame...
>>
>> 1400 descriptors, ok... but they are not organised in queues?
>> (forgive my ignorance of athX drivers)
>
> I think all the details are in the firmware, at least for now.

Yeah. Basically ath10k has a flat set of tx descriptors which are
AC-agnostic. Firmware classifies them internally to per-AC HW queues.

Michał
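A unified way of talking about flows between qdiscs and drivers
would, at minimum, have to carry the key that determines whether two
frames can be aggregated at all. A minimal sketch of that idea, in
plain C with hypothetical names (nothing like this exists in mac80211
today):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical aggregation key, mirroring how mac80211 buckets
 * traffic: frames can only be aggregated if they share both the peer
 * address and the TID. Any qdisc<->driver "flow" vocabulary would
 * have to carry at least this much. */
struct agg_flow_key {
	uint8_t sta_addr[6];	/* peer MAC address */
	uint8_t tid;		/* 0..15, per-station-per-TID */
};

/* FNV-1a hash over the key, so a qdisc could map its internal flows
 * onto the driver's aggregation buckets. */
static uint32_t agg_flow_hash(const struct agg_flow_key *k)
{
	const uint8_t *p = (const uint8_t *)k;
	uint32_t h = 2166136261u;
	size_t i;

	for (i = 0; i < sizeof(*k); i++) {
		h ^= p[i];
		h *= 16777619u;
	}
	return h;
}
```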
> A bursted txop can be as big as 5-10ms. If you consider that you
> want to queue 5-10ms worth of data for *each* station at any given
> time, you obviously introduce a lot of lag. If you have 10 stations
> you might end up with a service period of 10*10ms = 100ms. This gets
> even worse if you consider MU-MIMO, because you need to do an
> expensive sounding procedure before transmitting. So while SU
> aggregation can probably still work reasonably well with shorter
> bursts (1-2ms), MU needs at least 3ms to get *any* gain when
> compared to SU (which obviously means you want more to actually make
> MU pay off).

I am not sure where you get these numbers. Got a spreadsheet?

Gradually reducing the maximum sized txop as a function of the number
of stations makes sense. If you have 10 stations pending delivery and
reduced the max txop to 1ms, you hurt bandwidth at that instant, but
by offering more service to more stations, in less time, they will
converge on a reasonable share of the bandwidth for each, faster[1].
And I'm sure that the person videoconferencing on a link like that
would appreciate getting some service inside of a 10ms interval,
rather than a 100ms one.

Yes, there's overhead, and that's not the right number, which would
vary as to g, n, ac and successors.

You will also get more opportunities to use mu-mimo with shorter
bursts extant and more stations being regularly serviced.

[1] https://www.youtube.com/watch?v=Rb-UnHDw02o at about 13:50

> The rule of thumb is: the longer you wait, the bigger capacity you
> can get.

This is not strictly true, as the "fountain" of packets is regulated
by acks on the other side of the link, and ramps up or down as a
function of service time and loss.

> Apparently there's interest in maximizing throughput, but it stands
> in direct opposition to keeping latency down, so I've been thinking
> about how to satisfy both.
>
> The current approach ath10k is taking (patches in review [1][2]) is
> to use mac80211 software queues for per-station queuing, exposing
> queue state to firmware (it decides where frames should be dequeued
> from) and making it possible to stop/wake per-station tx subqueues
> with fake netdev queues. I'm starting to think this is not the right
> way, though, because it's inherently hard to control latency and
> there's a huge memory overhead associated with the fake netdev
> queues.

What is this overhead?

Applying things like codel tends to dramatically shorten the number
of skbs extant... modern 802.11ac capable hardware has tons more
memory...

> Also, fq_codel is less effective with this kind of setup.

fq_codel's principal problems with working with wifi are long and
documented in the talk above.

> My current thinking is that the entire problem should be solved via
> (per-AC) qdiscs, e.g. fq_codel. I guess one could use
> limit/target/interval/quantum knobs to tune it for the higher
> latency of aggregation-oriented Wi-Fi links where a long service
> time (think 100-200ms) is acceptable. However, fq_codel is oblivious
> to how Wi-Fi works in the first place, i.e. Wi-Fi gets better
> throughput if you deliver bursts of packets destined to the same
> station. Moreover, this gets even more complicated with MU-MIMO,
> where you may want to consider the spatial location (which
> influences signal quality when grouped) of each station when you
> decide which set of stations you're going to aggregate to in
> parallel. Since drivers have a finite tx ring, it is important to
> deliver bursts that can actually be aggregated efficiently.
> This means the driver would need to be able to tell the qdisc about
> per-flow conditions to influence the RR scheme in some way (assuming
> a qdisc even understands flows; do we need a unified way of talking
> about flows between qdiscs and drivers?).

This is a very good summary of the problems in layering fq_codel as
it exists today on top of wifi as it exists today. :/ Our conclusion
several years ago was that, as the information needed to do things
more right was in the mac80211 layer, we could not evolve the qdisc
layer to suit, and needed to move the core ideas into the mac80211
layer.

Things have evolved since, but I still think we can't get enough info
up to the qdisc layer (locks and so on) to use it sanely.

> [1]: https://www.spinics.net/lists/linux-wireless/msg146187.html
> [2]: https://www.spinics.net/lists/linux-wireless/msg146512.html

I will review!

>>>> For reference, ath10k has around 1400 tx descriptors, though in
>>>> practice not all are usable [...]
>>>
>>> 1400 descriptors, ok... but they are not organised in queues?
>>> (forgive my ignorance of athX drivers)
>>
>> I think all the details are in the firmware, at least for now.
>
> Yeah. Basically ath10k has a flat set of tx descriptors which are
> AC-agnostic. Firmware classifies them internally to per-AC HW
> queues.
>
> Michał
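Dave's suggestion of scaling the txop down as stations accumulate can
be sketched in a few lines of C. The names and constants below are
invented for illustration, not taken from ath10k or iwlwifi:

```c
/* Illustrative sketch only: scale the per-burst airtime budget down
 * as the number of stations with pending traffic grows, so every
 * station gets service within a fixed latency target. */
#define TXOP_MAX_US	  5000	/* aggregation-friendly upper bound */
#define TXOP_MIN_US	  1000	/* still enough for a small A-MPDU */
#define SERVICE_TARGET_US 10000	/* serve every station within ~10 ms */

static unsigned int txop_budget_us(unsigned int nr_active_stations)
{
	unsigned int txop;

	if (nr_active_stations == 0)
		return TXOP_MAX_US;

	/* split the latency target across the active stations... */
	txop = SERVICE_TARGET_US / nr_active_stations;

	/* ...but stay inside sane per-burst bounds */
	if (txop > TXOP_MAX_US)
		txop = TXOP_MAX_US;
	if (txop < TXOP_MIN_US)
		txop = TXOP_MIN_US;
	return txop;
}
```

With ten active stations this yields 1 ms bursts, matching the
trade-off Dave describes: less per-burst bandwidth, but every station
served inside the latency target.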
On 02/05/2016 12:44 AM, Michal Kazior wrote:
> Per-station queues sound tricky if you consider bufferbloat.
>
> To maximize use of airtime (i.e. txop) you need to send big
> aggregates. [...] The rule of thumb is: the longer you wait, the
> bigger capacity you can get.
>
> Apparently there's interest in maximizing throughput, but it stands
> in direct opposition to keeping latency down, so I've been thinking
> about how to satisfy both.

I really think this should be tunable. For instance, someone making
an AP that is mostly for letting lots of users stream movies would
care a lot more about throughput than someone making an AP that is
mainly for browsing the web and doing more latency-sensitive
activities.

> The current approach ath10k is taking (patches in review [1][2]) is
> to use mac80211 software queues for per-station queuing [...] This
> means the driver would need to be able to tell the qdisc about
> per-flow conditions to influence the RR scheme in some way (assuming
> a qdisc even understands flows; do we need a unified way of talking
> about flows between qdiscs and drivers?).

I wonder if it would work better if we removed most of the tid
handling and aggregation logic from the firmware. Maybe just have the
mgt Q and best effort (and skip VO/VI). Let the OS tell (or suggest
to) the firmware when aggregation starts and stops. That might at
least cut the number of queues in half, saving memory and latency up
and down the stack.

Thanks,
Ben
On 5 February 2016 at 17:47, Dave Taht <dave.taht@gmail.com> wrote:
>> A bursted txop can be as big as 5-10ms. If you consider that you
>> want to queue 5-10ms worth of data for *each* station at any given
>> time, you obviously introduce a lot of lag. If you have 10 stations
>> you might end up with a service period of 10*10ms = 100ms. [...]
>
> I am not sure where you get these numbers. Got a spreadsheet?

Here's a nice summary on some of it:

http://chimera.labs.oreilly.com/books/1234000001739/ch03.html#figure-mac-ampdu

Even if your single A-MPDU can be shorter than a txop, you can burst
a few of them, if my understanding is correct.

The overhead associated with MU sounding is something I've only been
told about. Apparently, for MU to pay off you need fairly big bursts.
This implies that the more stations you have to service, the less
sense it makes to attempt MU if you consider latency.

> Gradually reducing the maximum sized txop as a function of the
> number of stations makes sense. If you have 10 stations pending
> delivery and reduced the max txop to 1ms, you hurt bandwidth at that
> instant, but by offering more service to more stations, in less
> time, they will converge on a reasonable share of the bandwidth for
> each, faster[1]. And I'm sure that the person videoconferencing on a
> link like that would appreciate getting some service inside of a
> 10ms interval, rather than a 100ms one.
>
> [...]
>
> You will also get more opportunities to use mu-mimo with shorter
> bursts extant and more stations being regularly serviced.
>
> [1] https://www.youtube.com/watch?v=Rb-UnHDw02o at about 13:50

This is my thinking as well, at least for most common use cases.

If you try to optimize for throughput by introducing extra induced
latency, you might end up not being able to use aggregation in
practice anyway, because you won't be able to start up connections
and ask for enough data - or at least that's what my intuition tells
me.

But, like I've mentioned, there's interest in making it possible to
maximize for throughput (regardless of latency). This surely makes
sense for synthetic UDP benchmarks. But does it make sense for any
real-world application? No idea.

>> The rule of thumb is: the longer you wait, the bigger capacity you
>> can get.
>
> This is not strictly true, as the "fountain" of packets is regulated
> by acks on the other side of the link, and ramps up or down as a
> function of service time and loss.

Yes, if you consider real world cases, i.e. TCP, web traffic, etc.,
then you're correct.

>> Apparently there's interest in maximizing throughput, but it stands
>> in direct opposition to keeping latency down, so I've been thinking
>> about how to satisfy both.
>>
>> The current approach ath10k is taking (patches in review [1][2]) is
>> to use mac80211 software queues for per-station queuing, exposing
>> queue state to firmware (it decides where frames should be dequeued
>> from) and making it possible to stop/wake per-station tx subqueues
>> with fake netdev queues.
>> I'm starting to think this is not the right way, though, because
>> it's inherently hard to control latency and there's a huge memory
>> overhead associated with the fake netdev queues.
>
> What is this overhead?

E.g. if you want to be able to maximize throughput for 50 MU clients,
you need to be able to queue, in theory, 50*200 (roughly) frames.
This translates to both huge memory usage and latency *and* renders a
(fq_)codel qdisc rather.. moot.

> Applying things like codel tends to dramatically shorten the number
> of skbs extant... modern 802.11ac capable hardware has tons more
> memory...

I don't think it does. QCA988x is able to handle "only" 1424 tx
descriptors (IOW 1500-byte long MSDUs) in the driver-to-firmware tx
queue (it's a flat queue). QCA99x0 is able to handle 2500 if asked
politely.

This is still not enough to satisfy the insane "maximize the
capacity/throughput" expectations though.

You could actually argue it's too much from the bufferbloat point of
view anyway, and Emmanuel's patch proves it is beneficial to buffer
less in the driver depending on packet sojourn time.

>> Also, fq_codel is less effective with this kind of setup.
>
> fq_codel's principal problems with working with wifi are long and
> documented in the talk above.
>
>> My current thinking is that the entire problem should be solved via
>> (per-AC) qdiscs, e.g. fq_codel. [...]
>
> This is a very good summary of the problems in layering fq_codel as
> it exists today on top of wifi as it exists today. :/ Our conclusion
> several years ago was that, as the information needed to do things
> more right was in the mac80211 layer, we could not evolve the qdisc
> layer to suit, and needed to move the core ideas into the mac80211
> layer.
>
> Things have evolved since, but I still think we can't get enough
> info up to the qdisc layer (locks and so on) to use it sanely.

The current per-station queue implementation in mac80211 doesn't seem
sufficient. Each of these queues uses a simple flat fifo queue
(sk_buff_head) limited by packet count only, with somewhat broken
behavior when the packet limit is reached, as you often end up with
unfairly populated queues. They need to have codel applied on them
somehow. You'll eventually end up re-inventing fq_codel in mac80211,
making the qdisc redundant.

Moreover, these queues aren't per-station only. They are
per-station-per-tid, giving 16 queues per station. This is important
because you can't aggregate traffic going out on different tids.
Even without explicit air-condition feedback from mac80211 to
fq_codel, I suspect it'd still be beneficial if fq_codel was able to
group (sub-)flows into per-station-tid buckets and burst them out in
subsequent dequeue() calls, so that the chance they get aggregated is
higher.

Michał
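The burst-oriented dequeue behaviour Michal describes could look
roughly like the following. This is a sketch only: it uses the
kernel's sk_buff API, but the bucket structure and function are
hypothetical; real fq_codel keeps per-flow lists and would need
actual surgery to become station-aware:

```c
#include <linux/skbuff.h>

/* Hypothetical per-station-per-TID bucket. A station-aware variant
 * of fq_codel would need buckets keyed by (station, TID) so that
 * consecutive dequeues are aggregatable. */
struct sta_tid_bucket {
	struct sk_buff_head q;	/* frames for one (station, TID) pair */
	int deficit;		/* DRR-style quantum accounting */
};

/* Drain up to 'burst' frames from one bucket before the caller
 * rotates its round-robin pointer, instead of one frame per bucket
 * per turn. Returns the number of frames moved to 'out'. */
static int bucket_dequeue_burst(struct sta_tid_bucket *b, int burst,
				struct sk_buff_head *out)
{
	int n = 0;

	while (n < burst) {
		struct sk_buff *skb = __skb_dequeue(&b->q);

		if (!skb)
			break;
		__skb_queue_tail(out, skb);
		n++;
	}
	return n;
}
```

The point of the design is that everything handed back in one call
shares an aggregation key, so the driver stands a good chance of
packing it into a single A-MPDU.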
On Fri, Feb 5, 2016 at 10:44 AM, Michal Kazior
<michal.kazior@tieto.com> wrote:
> On 4 February 2016 at 22:14, Ben Greear <greearb@candelatech.com> wrote:
>> On 02/04/2016 12:56 PM, Grumbach, Emmanuel wrote:
>>> On 02/04/2016 10:46 PM, Ben Greear wrote:
>>>> On 02/04/2016 12:16 PM, Emmanuel Grumbach wrote:
>>>>> As many (all?) WiFi devices, Intel WiFi devices have transmit
>>>>> queues which have 256 transmit descriptors each [...]
>>>>
>>>> I guess this is only really usable if you have exactly one peer
>>>> connected (ie, in station mode)?
>>>>
>>>> Otherwise, you could have one slow peer and one fast one, and
>>>> then I suspect this would not work so well?
>>>
>>> Yes. I guess this is one (big) limitation. I guess that what would
>>> happen in this case is that the latency would constantly jitter.
>
> Hmm.. You'd probably need to track per-station packet sojourn time
> as well and make it possible to stop/wake queues per station.

Clearly, here comes the difference between the devices you work on
and the devices I work on. Intel devices are more client oriented.
Our AP mode doesn't handle many clients, etc.

>>> But I also noticed that I could reduce the transmit queue to 130
>>> descriptors (instead of 256) and still reach maximal throughput
>>> because we can refill the queues quickly enough.
>>> In iwlwifi, we have plans to have one queue for each peer.
>>> This is under development. Not sure when it'll be ready. It also
>>> requires a firmware change, obviously.
>>
>> Per-peer queues will probably be nice, especially if we can keep
>> the buffer bloat manageable.
>
> Per-station queues sound tricky if you consider bufferbloat.

iwlwifi's A-MPDU model is different from athXk's, I guess. In iwlwifi
(the Intel devices really, since it is mostly firmware), the firmware
will try to use a TxOP whenever there is data in the queue. Then,
once we have a chance to transmit, we will look at what we have in
the queue and send the biggest aggregates we can (and bursts, if
allowed). But we will not defer packets' transmission to get bigger
aggregates. Testing shows that under ideal conditions, we can have
enough packets in the queue to build big aggregates without creating
artificial latency.

> To maximize use of airtime (i.e. txop) you need to send big
> aggregates. Since aggregates are per station-tid, to maximize
> multi-station performance (in AP mode) you'll need to queue a lot of
> frames per station, depending on the chosen tx rate.

Sure.

> A bursted txop can be as big as 5-10ms. If you consider that you
> want to queue 5-10ms worth of data for *each* station at any given
> time, you obviously introduce a lot of lag. If you have 10 stations
> you might end up with a service period of 10*10ms = 100ms. This gets
> even worse if you consider MU-MIMO, because you need to do an
> expensive sounding procedure before transmitting.
> So while SU aggregation can probably still work reasonably well with
> shorter bursts (1-2ms), MU needs at least 3ms to get *any* gain when
> compared to SU (which obviously means you want more to actually make
> MU pay off). The rule of thumb is: the longer you wait, the bigger
> capacity you can get.

I am not an expert on MU-MIMO, but I'll believe you :) We can chat
about this in a few days :)

Queueing frames under good conditions is fine, in the sense that it
is a "good queue" (hey Stephen): you need those queues to maximize
throughput because of the bursty nature of WiFi, and the queue
"moves" quickly since you have high throughput, so the sojourn time
in your queue is relatively small. But when the link conditions get
less good, you need to reduce the queue length, because it doesn't
really help you anyway. This is what my patch aims at fixing. All
this is true when you have a small number of stations...

I understand from your comment that even in ideal conditions you
still need to create a lot of latency to gain TPT. Then there isn't
much we can do without impacting either TPT or latency. There is a
real tradeoff. I guess that, again, you are facing a classic AP
problem that a station, or an AP with a small number of concurrent
associated clients, will likely not have. All this encourages me in
my belief that I should do something in iwlwifi, for iwlwifi, rather
than at mac80211's level, since there seem to be very different
problems / use cases. But maybe this code can still suit all those
use cases, and we'd just (...) have to give different parameters to
the "algorithm"?

> Apparently there's interest in maximizing throughput, but it stands
> in direct opposition to keeping latency down, so I've been thinking
> about how to satisfy both.

In your case, then, yes. I guess I should limit my patch to queues
that serve a vif of type BSS for now. It'll still help us for 99% of
iwlwifi's usages.

> The current approach ath10k is taking (patches in review [1][2]) is
> to use mac80211 software queues for per-station queuing, exposing
> queue state to firmware (it decides where frames should be dequeued
> from) and making it possible to stop/wake per-station tx subqueues
> with fake netdev queues. I'm starting to think this is not the right
> way, though, because it's inherently hard to control latency and
> there's a huge memory overhead associated with the fake netdev
> queues. Also, fq_codel is less effective with this kind of setup.

I'd love to hear more about the reasons for fq_codel's efficiency
problem here (just for education). I saw these patches. This makes
the firmware much more actively involved in the scheduling. As I
said, in iwlwifi we plan to have one hardware queue per-sta-per-tid,
so that the firmware will be able to choose which station it wants to
send frames to. The main purpose is to improve uAPSD response time as
an AP, but it will also make it easier to choose the right frames to
add to a MU-MIMO transmission.

> My current thinking is that the entire problem should be solved via
> (per-AC) qdiscs, e.g. fq_codel. I guess one could use
> limit/target/interval/quantum knobs to tune it for the higher
> latency of aggregation-oriented Wi-Fi links where a long service
> time (think 100-200ms) is acceptable. However, fq_codel is oblivious
> to how Wi-Fi works in the first place, i.e. Wi-Fi gets better
> throughput if you deliver bursts of packets destined to the same
> station.
> Moreover, this gets even more complicated with MU-MIMO, where you
> may want to consider the spatial location (which influences signal
> quality when grouped) of each station when you decide which set of
> stations you're going to aggregate to in parallel. Since drivers
> have a finite tx ring, it is important to deliver bursts that can
> actually be aggregated efficiently. This means the driver would need
> to be able to tell the qdisc about per-flow conditions to influence
> the RR scheme in some way (assuming a qdisc even understands flows;
> do we need a unified way of talking about flows between qdiscs and
> drivers?).

Interesting. Not sure that 100-200ms is acceptable for all the use
cases, though. Bursts can maybe be achieved by TSO? I did that to get
A-MSDU. But... that won't help for bridged traffic, which is your
primary use case.

> [1]: https://www.spinics.net/lists/linux-wireless/msg146187.html
> [2]: https://www.spinics.net/lists/linux-wireless/msg146512.html
>
>>>> For reference, ath10k has around 1400 tx descriptors, though in
>>>> practice not all are usable [...]
>>>
>>> 1400 descriptors, ok... but they are not organised in queues?
>>> (forgive my ignorance of athX drivers)
>>
>> I think all the details are in the firmware, at least for now.
>
> Yeah. Basically ath10k has a flat set of tx descriptors which are
> AC-agnostic. Firmware classifies them internally to per-AC HW
> queues.
>
> Michał
On Mon, Feb 8, 2016 at 12:00 PM, Michal Kazior
<michal.kazior@tieto.com> wrote:
> On 5 February 2016 at 17:47, Dave Taht <dave.taht@gmail.com> wrote:
>>> A bursted txop can be as big as 5-10ms. If you consider that you
>>> want to queue 5-10ms worth of data for *each* station at any given
>>> time, you obviously introduce a lot of lag. [...]
>>
>> I am not sure where you get these numbers. Got a spreadsheet?
>
> Here's a nice summary on some of it:
>
> http://chimera.labs.oreilly.com/books/1234000001739/ch03.html#figure-mac-ampdu
>
> Even if your single A-MPDU can be shorter than a txop, you can burst
> a few of them, if my understanding is correct.
>
> The overhead associated with MU sounding is something I've only been
> told about. Apparently, for MU to pay off you need fairly big
> bursts. This implies that the more stations you have to service, the
> less sense it makes to attempt MU if you consider latency.
>
>> Gradually reducing the maximum sized txop as a function of the
>> number of stations makes sense. [...]
>>
>> [1] https://www.youtube.com/watch?v=Rb-UnHDw02o at about 13:50
>
> This is my thinking as well, at least for most common use cases.
>
> If you try to optimize for throughput by introducing extra induced
> latency, you might end up not being able to use aggregation in
> practice anyway, because you won't be able to start up connections
> and ask for enough data - or at least that's what my intuition tells
> me.
>
> But, like I've mentioned, there's interest in making it possible to
> maximize for throughput (regardless of latency). This surely makes
> sense for synthetic UDP benchmarks. But does it make sense for any
> real-world application? No idea.
>
>>> The rule of thumb is: the longer you wait, the bigger capacity you
>>> can get.
>>
>> This is not strictly true, as the "fountain" of packets is
>> regulated by acks on the other side of the link, and ramps up or
>> down as a function of service time and loss.
>
> Yes, if you consider real world cases, i.e. TCP, web traffic, etc.,
> then you're correct.
>
>>> Apparently there's interest in maximizing throughput, but it
>>> stands in direct opposition to keeping latency down, so I've been
>>> thinking about how to satisfy both.
>>> The current approach ath10k is taking (patches in review [1][2])
>>> is to use mac80211 software queues for per-station queuing,
>>> exposing queue state to firmware (it decides where frames should
>>> be dequeued from) and making it possible to stop/wake per-station
>>> tx subqueues with fake netdev queues. I'm starting to think this
>>> is not the right way, though, because it's inherently hard to
>>> control latency and there's a huge memory overhead associated with
>>> the fake netdev queues.
>>
>> What is this overhead?
>
> E.g. if you want to be able to maximize throughput for 50 MU
> clients, you need to be able to queue, in theory, 50*200 (roughly)
> frames. This translates to both huge memory usage and latency *and*
> renders a (fq_)codel qdisc rather.. moot.

Ok - now I understand. So yes, the conclusion below (make fq_codel
station aware) makes a lot of sense.

>> Applying things like codel tends to dramatically shorten the number
>> of skbs extant... modern 802.11ac capable hardware has tons more
>> memory...
>
> I don't think it does. QCA988x is able to handle "only" 1424 tx
> descriptors (IOW 1500-byte long MSDUs) in the driver-to-firmware tx
> queue (it's a flat queue). QCA99x0 is able to handle 2500 if asked
> politely.

As I said, our design is not flat, which removes the need for the
firmware to classify the packets by station in order to build
aggregates, but the downside is the number of clients you can
service.

> This is still not enough to satisfy the insane "maximize the
> capacity/throughput" expectations though.
>
> You could actually argue it's too much from the bufferbloat point of
> view anyway, and Emmanuel's patch proves it is beneficial to buffer
> less in the driver depending on packet sojourn time.

In a real-life scenario, yes. But the more I read your comments, the
more I come to the conclusion that we (Intel) simply haven't been
able to reach the throughput you have, which *requires* more latency
to reach :) Note that you always talk about overall throughput (to
all the stations at the same time), which, again, makes a lot of
sense when you work primarily as an AP. We look at the world from a
different viewpoint.

>>> Also, fq_codel is less effective with this kind of setup.
>>
>> fq_codel's principal problems with working with wifi are long and
>> documented in the talk above.
>>
>>> My current thinking is that the entire problem should be solved
>>> via (per-AC) qdiscs, e.g. fq_codel. [...]
>> This is a very good summary of the problems in layering fq_codel
>> as it exists today on top of wifi as it exists today. :/ Our
>> conclusion several years ago was that, as the information needed to
>> do things more right was in the mac80211 layer, we could not evolve
>> the qdisc layer to suit, and needed to move the core ideas into the
>> mac80211 layer.
>>
>> Things have evolved since, but I still think we can't get enough
>> info up to the qdisc layer (locks and so on) to use it sanely.
>
> The current per-station queue implementation in mac80211 doesn't
> seem sufficient. Each of these queues uses a simple flat fifo queue
> (sk_buff_head) limited by packet count only, with somewhat broken
> behavior when the packet limit is reached, as you often end up with
> unfairly populated queues. They need to have codel applied on them
> somehow. You'll eventually end up re-inventing fq_codel in mac80211,
> making the qdisc redundant.
>
> Moreover, these queues aren't per-station only. They are
> per-station-per-tid, giving 16 queues per station. This is important
> because you can't aggregate traffic going out on different tids.
>
> Even without explicit air-condition feedback from mac80211 to
> fq_codel, I suspect it'd still be beneficial if fq_codel was able to
> group (sub-)flows into per-station-tid buckets and burst them out in
> subsequent dequeue() calls, so that the chance they get aggregated
> is higher.

That would allow you to avoid classifying (and buffering) in HW, and
thus reduce the latency. Agreed.

> Michał
diff --git a/drivers/net/wireless/intel/iwlwifi/pcie/internal.h b/drivers/net/wireless/intel/iwlwifi/pcie/internal.h
index 2f95916..d83eb56 100644
--- a/drivers/net/wireless/intel/iwlwifi/pcie/internal.h
+++ b/drivers/net/wireless/intel/iwlwifi/pcie/internal.h
@@ -192,6 +192,11 @@ struct iwl_cmd_meta {
 	u32 flags;
 };
 
+struct iwl_txq_auto_size {
+	int min_space;
+	unsigned long reset_ts;
+};
+
 /*
  * Generic queue structure
  *
@@ -293,6 +298,7 @@ struct iwl_txq {
 	bool block;
 	unsigned long wd_timeout;
 	struct sk_buff_head overflow_q;
+	struct iwl_txq_auto_size auto_sz;
 };
 
 static inline dma_addr_t
diff --git a/drivers/net/wireless/intel/iwlwifi/pcie/tx.c b/drivers/net/wireless/intel/iwlwifi/pcie/tx.c
index 837a7d5..4d1dee6 100644
--- a/drivers/net/wireless/intel/iwlwifi/pcie/tx.c
+++ b/drivers/net/wireless/intel/iwlwifi/pcie/tx.c
@@ -572,6 +572,8 @@ static int iwl_pcie_txq_init(struct iwl_trans *trans, struct iwl_txq *txq,
 
 	spin_lock_init(&txq->lock);
 	__skb_queue_head_init(&txq->overflow_q);
+	txq->auto_sz.min_space = 240;
+	txq->auto_sz.reset_ts = jiffies;
 
 	/*
 	 * Tell nic where to find circular buffer of Tx Frame Descriptors for
@@ -1043,10 +1045,14 @@ void iwl_trans_pcie_reclaim(struct iwl_trans *trans, int txq_id, int ssn,
 	     q->read_ptr != tfd_num;
 	     q->read_ptr = iwl_queue_inc_wrap(q->read_ptr)) {
 		struct sk_buff *skb = txq->entries[txq->q.read_ptr].skb;
+		struct ieee80211_tx_info *info;
+		unsigned long tx_time;
 
 		if (WARN_ON_ONCE(!skb))
 			continue;
 
+		info = IEEE80211_SKB_CB(skb);
+
 		iwl_pcie_free_tso_page(skb);
 
 		__skb_queue_tail(skbs, skb);
@@ -1056,6 +1062,18 @@ void iwl_trans_pcie_reclaim(struct iwl_trans *trans, int txq_id, int ssn,
 			iwl_pcie_txq_inval_byte_cnt_tbl(trans, txq);
 
 		iwl_pcie_txq_free_tfd(trans, txq);
+
+		tx_time = (uintptr_t)info->driver_data[IWL_TRANS_FIRST_DRIVER_DATA + 2];
+		if (time_before(jiffies, tx_time + msecs_to_jiffies(10))) {
+			txq->auto_sz.min_space -= 10;
+			txq->auto_sz.min_space =
+				max(txq->auto_sz.min_space, txq->q.high_mark);
+		} else if (time_after(jiffies,
+				      tx_time + msecs_to_jiffies(20))) {
+			txq->auto_sz.min_space += 10;
+			txq->auto_sz.min_space =
+				min(txq->auto_sz.min_space, 252);
+		}
 	}
 
 	iwl_pcie_txq_progress(txq);
@@ -2185,6 +2203,7 @@ int iwl_trans_pcie_tx(struct iwl_trans *trans, struct sk_buff *skb,
 		      struct iwl_device_cmd *dev_cmd, int txq_id)
 {
 	struct iwl_trans_pcie *trans_pcie = IWL_TRANS_GET_PCIE_TRANS(trans);
+	struct ieee80211_tx_info *info = IEEE80211_SKB_CB(skb);
 	struct ieee80211_hdr *hdr;
 	struct iwl_tx_cmd *tx_cmd = (struct iwl_tx_cmd *)dev_cmd->payload;
 	struct iwl_cmd_meta *out_meta;
@@ -2234,13 +2253,20 @@ int iwl_trans_pcie_tx(struct iwl_trans *trans, struct sk_buff *skb,
 
 	spin_lock(&txq->lock);
 
-	if (iwl_queue_space(q) < q->high_mark) {
+	info->driver_data[IWL_TRANS_FIRST_DRIVER_DATA + 2] =
+		(void *)(uintptr_t)jiffies;
+
+	if (time_after(jiffies,
+		       txq->auto_sz.reset_ts + msecs_to_jiffies(5000))) {
+		txq->auto_sz.min_space = 240;
+		txq->auto_sz.reset_ts = jiffies;
+	}
+
+	if (iwl_queue_space(q) < txq->auto_sz.min_space) {
 		iwl_stop_queue(trans, txq);
 
 		/* don't put the packet on the ring, if there is no room */
 		if (unlikely(iwl_queue_space(q) < 3)) {
-			struct ieee80211_tx_info *info = IEEE80211_SKB_CB(skb);
-
 			info->driver_data[IWL_TRANS_FIRST_DRIVER_DATA + 1] = dev_cmd;
 
 			__skb_queue_tail(&txq->overflow_q, skb);
As many (all?) WiFi devices, Intel WiFi devices have transmit queues
which have 256 transmit descriptors each, and each descriptor
corresponds to an MPDU. This means that when it is full, the queue
contains 256 * ~1500 bytes to be transmitted (if we don't have
A-MSDUs). The purpose of those queues is to have enough packets ready
for transmission so that when the device gets an opportunity to
transmit (TxOP), it can take as many packets as the spec allows and
aggregate them into one A-MPDU, or even several A-MPDUs if we are
using bursts.

The problem is that the packets in these queues are already out of
the Qdisc's control and can stay in those queues for a fairly long
time when the link condition is not good. This leads to the
well-known bufferbloat problem.

This patch adds a way to tune the size of the transmit queue so that
it won't cause excessive latency. When the link condition is good,
the packets will flow smoothly and the transmit queue will grow
quickly, allowing A-MPDUs and maximal throughput. When the link is
not optimal, we will have retransmissions, lower transmit rates or
signal detection (CCA), which will delay packet transmission. The
driver will sense this higher latency and will reduce the size of the
transmit queue. This means that the packets that continue to arrive
will pile up in the Qdisc rather than in the device queues. The major
advantage of this approach is that codel can now do its work.

The algorithm is really (too?) simple:
 * Every 5 seconds, start from a short queue again.
 * If a packet has been in the queue for less than 10ms, allow 10
   more MPDUs in.
 * If a packet has been in the queue for more than 20ms, reduce the
   size of the transmit queue by 10.

The implementation is really naive and way too simple:
 * reading jiffies for every Tx / Tx status is not a good idea.
 * jiffies are not fine-grained enough on all platforms.
 * the constants chosen are really arbitrary and can't be tuned.
 * this could probably be implemented in mac80211 and help other
   drivers.
 * etc...

But already this gives nice results. I ran a very simple experiment:
I put the device in a controlled environment and ran traffic while
running default-sized pings in the background. In this configuration,
our device quickly raises its transmission rate to the best rate.
Then, I force the device to use the lowest rate (6Mbps). Of course,
the throughput collapses, but the ping RTT shoots up. Using codel
helps, but the latency is still high. Codel with this patch gives
much better results:

pfifo_fast:
rtt min/avg/max/mdev = 1932.616/2393.284/2833.407/315.941 ms, pipe 3, ipg/ewma 2215.707/2446.884 ms

fq_codel + Tx queue auto-sizing:
rtt min/avg/max/mdev = 13.541/32.396/54.791/9.610 ms, ipg/ewma 200.685/32.202 ms

fq_codel without Tx queue auto-sizing:
rtt min/avg/max/mdev = 140.821/257.303/331.889/31.074 ms, pipe 2, ipg/ewma 258.147/252.847 ms

Clearly, there is more work to do to be able to merge this, but it
seems that the wireless problems mentioned in
https://lwn.net/Articles/616241/ may have a solution.

Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Dave Taht <dave.taht@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
---
Fix Dave's email address
---
 drivers/net/wireless/intel/iwlwifi/pcie/internal.h |  6 ++++
 drivers/net/wireless/intel/iwlwifi/pcie/tx.c       | 32 ++++++++++++++++++++--
 2 files changed, 35 insertions(+), 3 deletions(-)
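Stripped of the iwlwifi plumbing shown in the diff above, the
adaptation rule reduces to roughly the following sketch. The
constants are the ones from the patch; the struct mirrors the patch's
iwl_txq_auto_size, but the function names are illustrative, not the
driver's:

```c
#include <linux/jiffies.h>
#include <linux/kernel.h>

/* Illustrative distillation of the patch's auto-sizing rule.
 * min_space is the number of free descriptors below which the queue
 * is stopped; with 256 descriptors total, min_space = 240 allows only
 * ~16 MPDUs in flight, and lowering min_space deepens the queue. */
#define SPACE_DEFAULT	240	/* start short: ~16 packets in flight */
#define SPACE_MAX	252	/* never fewer than ~4 packets allowed */
#define STEP		10

struct txq_auto_size {
	int min_space;
	unsigned long reset_ts;	/* jiffies of the last periodic reset */
};

/* On the tx path: every 5 seconds, forget and start short again. */
static void autosz_maybe_reset(struct txq_auto_size *a)
{
	if (time_after(jiffies, a->reset_ts + msecs_to_jiffies(5000))) {
		a->min_space = SPACE_DEFAULT;
		a->reset_ts = jiffies;
	}
}

/* On reclaim: compare each frame's queueing time against fixed
 * bounds and nudge the threshold. high_mark is the floor below which
 * min_space may not fall. */
static void autosz_on_reclaim(struct txq_auto_size *a, int high_mark,
			      unsigned long tx_time)
{
	if (time_before(jiffies, tx_time + msecs_to_jiffies(10)))
		/* drained in <10 ms: allow ~10 more MPDUs in */
		a->min_space = max(a->min_space - STEP, high_mark);
	else if (time_after(jiffies, tx_time + msecs_to_jiffies(20)))
		/* queued >20 ms: shrink the queue by ~10 MPDUs */
		a->min_space = min(a->min_space + STEP, SPACE_MAX);
}
```

The effect is a crude sojourn-time controller in the spirit of codel:
good links earn a deep queue for aggregation, bad links are pushed
back toward a shallow one so the packets pile up in the qdisc, where
fq_codel can act on them.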