Message ID: 20201021015902.19223-1-matthew.ruffell@canonical.com
Series: bcache: Issues with large IO wait in bch_mca_scan() when shrinker is enabled
On 21.10.20 03:58, Matthew Ruffell wrote:
> BugLink: https://bugs.launchpad.net/bugs/1898786
>
> [Impact]
>
> Systems that utilise bcache can experience extremely high IO wait times when
> under constant IO pressure. The IO wait times stay at a consistent 1 second,
> and never drop as long as the bcache shrinker is enabled.
>
> If you disable the shrinker, then IO wait drops significantly, back to normal
> levels.
>
> We did some perf analysis, and it seems we spend a huge amount of time in
> bch_mca_scan(), likely waiting for the mutex "&c->bucket_lock".
>
> Looking at the recent commits in Bionic, we found the following commit merged
> in upstream 5.1-rc1 and backported to 4.15.0-87-generic through upstream
> stable:
>
> commit 9fcc34b1a6dd4b8e5337e2b6ef45e428897eca6b
> Author: Coly Li <colyli@suse.de>
> Date: Wed Nov 13 16:03:24 2019 +0800
> Subject: bcache: at least try to shrink 1 node in bch_mca_scan()
> Link: https://github.com/torvalds/linux/commit/9fcc34b1a6dd4b8e5337e2b6ef45e428897eca6b
>
> It mentions in the description that:
>
>> If sc->nr_to_scan is smaller than c->btree_pages, after the above
>> calculation, variable 'nr' will be 0 and nothing will be shrunk. It is
>> frequently observed that only 1 or 2 is set to sc->nr_to_scan and makes
>> nr to be zero. Then bch_mca_scan() will do nothing more than acquiring
>> and releasing mutex c->bucket_lock.
>
> This seems to be what is going on here, but the above commit only addresses
> the case when nr is 0.
>
> From what I can see, the problems we are experiencing occur when nr is 1 or 2:
> again, we just waste time in bch_mca_scan() waiting on c->bucket_lock, only to
> release c->bucket_lock, since the shrinker loop never executes because there
> is no work to do.
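[Editorial note: the integer arithmetic described above can be sketched in a small Python model. This is a simplification of the C code in bch_mca_scan(), not the kernel source; the btree_pages value of 64 is purely illustrative (the real value depends on bucket size).]

```python
def nodes_to_scan(nr_to_scan: int, btree_pages: int,
                  at_least_one: bool = False) -> int:
    """Model the 'nr' calculation in bch_mca_scan().

    nr_to_scan comes from the kernel shrinker core and is frequently
    as small as 1 or 2; integer division by btree_pages then yields 0.
    """
    nr = nr_to_scan // btree_pages
    if at_least_one:
        # Models commit 9fcc34b1a6dd: "at least try to shrink 1 node".
        nr = nr or 1
    return nr

# Before the fix: a tiny scan request shrinks nothing, yet the call
# still acquires and releases c->bucket_lock.
assert nodes_to_scan(2, 64) == 0
# After the fix: at least one node is considered for shrinking.
assert nodes_to_scan(2, 64, at_least_one=True) == 1
```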
>
> [Fix]
>
> The following commits fix the problem, and all landed in 5.6-rc1:
>
> commit 125d98edd11464c8e0ec9eaaba7d682d0f832686
> Author: Coly Li <colyli@suse.de>
> Date: Fri Jan 24 01:01:40 2020 +0800
> Subject: bcache: remove member accessed from struct btree
> Link: https://github.com/torvalds/linux/commit/125d98edd11464c8e0ec9eaaba7d682d0f832686
>
> commit d5c9c470b01177e4d90cdbf178b8c7f37f5b8795
> Author: Coly Li <colyli@suse.de>
> Date: Fri Jan 24 01:01:41 2020 +0800
> Subject: bcache: reap c->btree_cache_freeable from the tail in bch_mca_scan()
> Link: https://github.com/torvalds/linux/commit/d5c9c470b01177e4d90cdbf178b8c7f37f5b8795
>
> commit e3de04469a49ee09c89e80bf821508df458ccee6
> Author: Coly Li <colyli@suse.de>
> Date: Fri Jan 24 01:01:42 2020 +0800
> Subject: bcache: reap from tail of c->btree_cache in bch_mca_scan()
> Link: https://github.com/torvalds/linux/commit/e3de04469a49ee09c89e80bf821508df458ccee6
>
> The first commit is a dependency of the other two. It removes the
> "recently accessed" marker, which indicates whether a particular cache node
> has been used recently and, if it has, keeps it from being considered for
> eviction. The commit mentions that under heavy IO, all cache nodes end up
> being marked recently accessed, and nothing is ever shrunk.
>
> The second commit changes a previous design decision of skipping the first
> 3 cache nodes when shrinking, since it is common for bch_mca_scan() to be
> called with nr being 1 or 2, just as 0 was common before the very first commit
> I mentioned. In that case the loop exits without doing anything, and we waste
> time waiting on locks, exactly as before. The fix is to try to shrink cache
> nodes from the tail of the list, not the beginning.
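[Editorial note: the head-versus-tail difference the second and third commits address can be illustrated with a simplified Python model. This is not the kernel code; the skip-the-first-3 behaviour is reduced to its essentials, and the node names are hypothetical.]

```python
def reap_skipping_head(cache: list, nr: int) -> list:
    """Old behaviour: walk from the head of c->btree_cache, but the
    first 3 slots are effectively skipped, so nr == 1 or 2 frees
    nothing (the walk ends before reaching a reapable node)."""
    freed = []
    for i, node in enumerate(cache):
        if i >= nr:          # scan budget exhausted
            break
        if i < 3:            # hard-coded skipped slots
            continue
        freed.append(node)
    return freed

def reap_from_tail(cache: list, nr: int) -> list:
    """New behaviour (d5c9c470 / e3de0446): take victims from the tail
    of the list, where the coldest nodes live, leaving the rest of the
    list order untouched."""
    return [cache[-(i + 1)] for i in range(min(nr, len(cache)))]

cache = ["n1", "n2", "n3", "n4", "n5"]
assert reap_skipping_head(cache, 2) == []        # nothing is freed
assert reap_from_tail(cache, 2) == ["n5", "n4"]  # tail nodes are reaped
```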
>
> The third commit fixes a minor issue: we don't want to re-arrange the linked
> list c->btree_cache, which is what the second commit ended up doing. Instead,
> we just shrink the cache node at the end of the linked list, without changing
> the order.
>
> One minor backport / context adjustment was required in the first commit for
> Bionic; the rest are all clean cherry picks to Bionic and Focal.
>
> [Testcase]
>
> This is hard to test, since the problem shows up in production environments
> that are under constant IO pressure, with many different items entering and
> leaving the cache.
>
> The Launchpad git server is currently suffering this issue, and has been
> sitting at a constant IO wait of 1 second or slightly over, which was causing
> slow response times and, in turn, build failures when git clones timed out.
>
> We installed a test kernel, which is available in the following PPA:
>
> https://launchpad.net/~mruffell/+archive/ubuntu/sf294907-test
>
> Once the test kernel had been installed, IO wait times with the shrinker
> enabled dropped to normal levels, and the git server became responsive again.
> We have been monitoring the performance of the git server and watching IO wait
> times in grafana over the past week; everything looks good and indicates that
> these patches solve the issue.
>
> [Regression Potential]
>
> If a regression were to occur, it would only affect users of bcache who have
> the shrinker enabled, which I believe is the default setting in bcache
> deployments.
>
> In that case, users could disable the shrinker as a workaround until a fix is
> available. You can disable the shrinker with:
> sudo bash -c "echo 1 > /sys/fs/bcache/<uuid>/internal/btree_shrinker_disabled"
>
> The patches remove the "recently accessed" marker, which slightly changes the
> behaviour when deciding what to evict from the cache.
> Some workloads that aren't under IO pressure may see a difference in which
> items are evicted from the cache, which may slightly change bcache's
> performance profile.
>
> Since the patches prune the cache from the tail of the linked list, we will
> likely end up pruning more nodes than before, since it was common to land on
> the first 3 cache nodes, which were hard-coded to be skipped and never pruned;
> this may also slightly change bcache's performance profile.
>
> Coly Li (3):
>   bcache: remove member accessed from struct btree
>   bcache: reap c->btree_cache_freeable from the tail in bch_mca_scan()
>   bcache: reap from tail of c->btree_cache in bch_mca_scan()
>
>  drivers/md/bcache/btree.c | 24 ++++++++++--------------
>  drivers/md/bcache/btree.h |  2 --
>  2 files changed, 10 insertions(+), 16 deletions(-)

Testing looks good, and the patches are clean cherry picks/backports.

Formally, for future submissions, I would suggest always grouping multi-series
submissions together. In this case that would have made the list two patches
shorter, and as a reviewer one would not need to look twice. Things can be
prepared with a bit of creative file naming and adjusted subjects. For the
subjects:

[PATCH 1/3 B] ...
[PATCH 1/3 F] ...
[PATCH 2/3 B/F] ...
[PATCH 3/3 B/F] ...

But this time,

Acked-by: Stefan Bader <stefan.bader@canonical.com>
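[Editorial note: the workaround given in the [Regression Potential] section above can be sketched as follows. The path below is a stand-in temp file, since the real sysfs node, /sys/fs/bcache/<uuid>/internal/btree_shrinker_disabled, requires root and an actual bcache device (hence the `sudo bash -c` form in the cover letter).]

```shell
# Stand-in for /sys/fs/bcache/<uuid>/internal/btree_shrinker_disabled;
# substitute the real path (and run as root) on a live system.
node="/tmp/btree_shrinker_disabled"

echo 0 > "$node"   # default: shrinker enabled
echo 1 > "$node"   # workaround: shrinker disabled
cat "$node"        # shows 1 while the workaround is active
```

Writing 0 back to the same node re-enables the shrinker once a fixed kernel is installed.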
On 21/10/2020 02:58, Matthew Ruffell wrote:
> BugLink: https://bugs.launchpad.net/bugs/1898786
> [...]

Acked-by: Colin Ian King <colin.king@canonical.com>
Applied to Focal/master-next

Thanks,
Ian

On 2020-10-21 14:58:59, Matthew Ruffell wrote:
> BugLink: https://bugs.launchpad.net/bugs/1898786
> [...]