Message ID | cover.1596536719.git.lukasstraub2@web.de |
---|---|
Series | colo: Introduce resource agent and test suite/CI |
On Tue, 4 Aug 2020 12:46:29 +0200
Lukas Straub <lukasstraub2@web.de> wrote:

> Hello Everyone,
> So here is v3. Patch 1 can already be merged independently of the others.
> Please review.
>
> Regards,
> Lukas Straub
>
> Based-on: <cover.1596528468.git.lukasstraub2@web.de>
> "Introduce 'yank' oob qmp command to recover from hanging qemu"
>
> Changes:
>
> v3:
> -resource-agent: Don't determine local qemu state by remote master-score, query
>  directly via qmp instead
> -resource-agent: Add max_queue_size parameter for colo-compare
> -resource-agent: Fix monitor action on secondary returning error during
>  clean shutdown
> -resource-agent: Fix stop action setting master-score to 0 on primary on
>  clean shutdown
>
> v2:
> -use new yank api
> -drop disk_size parameter
> -introduce pick_qemu_util function and use it
>
> Overview:
>
> Hello Everyone,
> These patches introduce a resource agent for fully automatic management of colo
> and a test suite building upon the resource agent to extensively test colo.
>
> Test suite features:
> -Tests failover with peer crashing and hanging and failover during checkpoint
> -Tests network using ssh and iperf3
> -Quick test requires no special configuration
> -Network test for testing colo-compare
> -Stress test: failover all the time with network load
>
> Resource agent features:
> -Fully automatic management of colo
> -Handles many failures: hanging/crashing qemu, replication error, disk error, ...
> -Recovers from hanging qemu by using the "yank" oob command
> -Tracks which node has up-to-date data
> -Works well in clusters with more than 2 nodes
>
> Run times on my laptop:
> Quick test: 200s
> Network test: 800s (tagged as slow)
> Stress test: 1300s (tagged as slow)
>
> For the last two tests, the test suite needs access to a network bridge to
> properly test the network, so some parameters need to be given to the test
> run. See tests/acceptance/colo.py for more information.
>
> Regards,
> Lukas Straub
>
> Lukas Straub (7):
>   block/quorum.c: stable children names
>   avocado_qemu: Introduce pick_qemu_util to pick qemu utility binaries
>   boot_linux.py: Use pick_qemu_util
>   colo: Introduce resource agent
>   colo: Introduce high-level test suite
>   configure,Makefile: Install colo resource-agent
>   MAINTAINERS: Add myself as maintainer for COLO resource agent
>
>  MAINTAINERS                               |    6 +
>  Makefile                                  |    5 +
>  block/quorum.c                            |   20 +-
>  configure                                 |   10 +
>  scripts/colo-resource-agent/colo          | 1501 +++++++++++++++++++++
>  scripts/colo-resource-agent/crm_master    |   44 +
>  scripts/colo-resource-agent/crm_resource  |   12 +
>  tests/acceptance/avocado_qemu/__init__.py |   15 +
>  tests/acceptance/boot_linux.py            |   11 +-
>  tests/acceptance/colo.py                  |  677 ++++++++++
>  10 files changed, 2286 insertions(+), 15 deletions(-)
>  create mode 100755 scripts/colo-resource-agent/colo
>  create mode 100755 scripts/colo-resource-agent/crm_master
>  create mode 100755 scripts/colo-resource-agent/crm_resource
>  create mode 100644 tests/acceptance/colo.py
>
> --
> 2.20.1

Ping...
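For readers following the thread: the v3 changelog says the resource agent now queries the local qemu state directly via qmp. The sketch below shows roughly what such a query can look like at the protocol level. It is not taken from the patch. The function names `qmp_command` and `query_local_status` are made up for illustration, and it assumes the caller already holds a connected stream socket to a QMP monitor. Only the QMP greeting, the `qmp_capabilities` negotiation and the `query-status` command are standard protocol elements.

```python
import json
import socket


def qmp_command(sock_file, name, arguments=None):
    """Send one QMP command and return the content of its "return" member.

    Hypothetical helper for illustration, not the resource agent's code.
    """
    request = {"execute": name}
    if arguments is not None:
        request["arguments"] = arguments
    sock_file.write(json.dumps(request).encode() + b"\r\n")
    sock_file.flush()
    while True:
        line = sock_file.readline()
        if not line:
            raise ConnectionError("QMP connection closed")
        reply = json.loads(line)
        if "event" in reply:
            continue  # asynchronous events can arrive before the reply
        if "error" in reply:
            raise RuntimeError(reply["error"].get("desc", "QMP error"))
        return reply["return"]


def query_local_status(sock):
    """Run query-status on an already connected QMP socket."""
    with sock.makefile("rwb") as f:
        greeting = json.loads(f.readline())
        if "QMP" not in greeting:
            raise RuntimeError("unexpected QMP greeting")
        # Commands are only accepted after capabilities negotiation.
        qmp_command(f, "qmp_capabilities")
        return qmp_command(f, "query-status")
```

To try it against a real guest, one would connect a `socket.AF_UNIX` stream socket to the path given to qemu's `-qmp unix:...,server` option and pass it to `query_local_status`. The socket path depends on the setup and is not specified in this thread.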
On 8/18/20 2:27 PM, Lukas Straub wrote:
> On Tue, 4 Aug 2020 12:46:29 +0200
> Lukas Straub <lukasstraub2@web.de> wrote:
>
>> Hello Everyone,
>> So here is v3. Patch 1 can already be merged independently of the others.
>> Please review.
>> [...]
>
> Ping...

Cleber, Wainer, can you have a look at tests/acceptance/colo.py please?
On Tue, 18 Aug 2020 14:27:01 +0200
Lukas Straub <lukasstraub2@web.de> wrote:

> On Tue, 4 Aug 2020 12:46:29 +0200
> Lukas Straub <lukasstraub2@web.de> wrote:
>
> > Hello Everyone,
> > So here is v3. Patch 1 can already be merged independently of the others.
> > Please review.
> > [...]
>
> Ping...

Ping 2...

Kevin, can you already apply patch 1 "block/quorum.c: stable children names"?
It resolves the following bug:
https://bugs.launchpad.net/qemu/+bug/1881231

Regards,
Lukas Straub
Hi Wainer,

As Cleber is busy with Gating CI, can you review tests/acceptance/colo.py please?

On 8/27/20 10:40 AM, Lukas Straub wrote:
> On Tue, 18 Aug 2020 14:27:01 +0200
> Lukas Straub <lukasstraub2@web.de> wrote:
>
>> On Tue, 4 Aug 2020 12:46:29 +0200
>> Lukas Straub <lukasstraub2@web.de> wrote:
>>
>>> Hello Everyone,
>>> So here is v3. Patch 1 can already be merged independently of the others.
>>> Please review.
>>> [...]
>>
>> Ping...
>
> Ping 2...
>
> Kevin, can you already apply patch 1 "block/quorum.c: stable children names"?
> It resolves the following bug:
> https://bugs.launchpad.net/qemu/+bug/1881231
>
> Regards,
> Lukas Straub