Message ID | patch-17494-tamar@arm.com |
---|---|
Series | Support early break/return auto-vectorization |
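As a rough illustration of the change described in the message below, this sketch shows the usual shape of a vectorizer test: a kernel loop that is meant to be vectorized, followed by a scalar loop that verifies the result and calls abort () on a mismatch. `#pragma GCC novector` is placed directly before the checking loop so that it stays scalar. The array names, `N`, and the helpers `add_arrays` and `run_test` are made up for this example and are not taken from any file in the ChangeLog.

```c
#include <stdlib.h>

#define N 64

int a[N], b[N], c[N];

/* Kernel under test: the loop the vectorizer is expected to handle.
   noinline keeps it from being folded into the caller.  */
__attribute__ ((noinline)) void
add_arrays (void)
{
  for (int i = 0; i < N; i++)
    c[i] = a[i] + b[i];
}

/* Hypothetical driver: fill the inputs, run the kernel, then verify.  */
int
run_test (void)
{
  for (int i = 0; i < N; i++)
    {
      a[i] = i;
      b[i] = 2 * i;
    }

  add_arrays ();

  /* Scalar reference check.  With early break vectorization this loop
     could itself become a vectorization candidate because of the
     abort () exit, so the pragma asks GCC to leave it scalar.  */
#pragma GCC novector
  for (int i = 0; i < N; i++)
    if (c[i] != 3 * i)
      abort ();

  return 0;
}
```

The pragma applies only to the loop that immediately follows it, so the kernel in `add_arrays` remains free to be vectorized. Compilers that do not recognize the pragma typically only warn about it and compile the loop as usual.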
Resending attached only due to size limit

> -----Original Message-----
> From: Tamar Christina
> Sent: Wednesday, June 28, 2023 2:42 PM
> To: gcc-patches@gcc.gnu.org
> Cc: nd <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
> Subject: [PATCH 3/19]middle-end clean up vect testsuite using pragma novector
>
> Hi All,
>
> The support for early break vectorization breaks lots of scan vect and slp
> testcases because they assume that loops with abort () in them cannot be
> vectorized. Additionally, it defeats the point of having a scalar loop to check
> the output of the vectorizer if that loop is also vectorized.
>
> For that reason this adds
>
> #pragma GCC novector to all tests which have a scalar loop that we would have
> vectorized using this patch series.
>
> FWIW, none of these tests were failing to vectorize or run before the pragma.
> The tests that did point to some issues were copied to the early break test
> suite as well.
>
> Bootstrapped and regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/testsuite/ChangeLog:
>
> * g++.dg/vect/pr84556.cc: Add novector pragma.
> * g++.dg/vect/simd-1.cc: Add novector pragma.
> * g++.dg/vect/simd-2.cc: Add novector pragma.
> * g++.dg/vect/simd-3.cc: Add novector pragma.
> * g++.dg/vect/simd-4.cc: Add novector pragma.
> * g++.dg/vect/simd-5.cc: Add novector pragma.
> * g++.dg/vect/simd-6.cc: Add novector pragma.
> * g++.dg/vect/simd-7.cc: Add novector pragma.
> * g++.dg/vect/simd-8.cc: Add novector pragma.
> * g++.dg/vect/simd-9.cc: Add novector pragma.
> * g++.dg/vect/simd-clone-6.cc: Add novector pragma.
> * gcc.dg/vect/O3-pr70130.c: Add novector pragma.
> * gcc.dg/vect/Os-vect-95.c: Add novector pragma.
> * gcc.dg/vect/bb-slp-1.c: Add novector pragma.
> * gcc.dg/vect/bb-slp-16.c: Add novector pragma.
> * gcc.dg/vect/bb-slp-2.c: Add novector pragma.
> * gcc.dg/vect/bb-slp-24.c: Add novector pragma.
> * gcc.dg/vect/bb-slp-25.c: Add novector pragma.
> * gcc.dg/vect/bb-slp-26.c: Add novector pragma.
> * gcc.dg/vect/bb-slp-27.c: Add novector pragma.
> * gcc.dg/vect/bb-slp-28.c: Add novector pragma.
> * gcc.dg/vect/bb-slp-29.c: Add novector pragma.
> * gcc.dg/vect/bb-slp-42.c: Add novector pragma.
> * gcc.dg/vect/bb-slp-cond-1.c: Add novector pragma.
> * gcc.dg/vect/bb-slp-over-widen-1.c: Add novector pragma.
> * gcc.dg/vect/bb-slp-over-widen-2.c: Add novector pragma.
> * gcc.dg/vect/bb-slp-pattern-1.c: Add novector pragma.
> * gcc.dg/vect/bb-slp-pattern-2.c: Add novector pragma.
> * gcc.dg/vect/bb-slp-pow-1.c: Add novector pragma.
> * gcc.dg/vect/bb-slp-pr101615-2.c: Add novector pragma.
> * gcc.dg/vect/bb-slp-pr65935.c: Add novector pragma.
> * gcc.dg/vect/bb-slp-subgroups-1.c: Add novector pragma.
> * gcc.dg/vect/costmodel/i386/costmodel-vect-31.c: Add novector pragma.
> * gcc.dg/vect/costmodel/i386/costmodel-vect-33.c: Add novector pragma.
> * gcc.dg/vect/costmodel/i386/costmodel-vect-68.c: Add novector pragma.
> * gcc.dg/vect/costmodel/ppc/costmodel-slp-12.c: Add novector pragma.
> * gcc.dg/vect/costmodel/ppc/costmodel-slp-33.c: Add novector pragma.
> * gcc.dg/vect/costmodel/ppc/costmodel-slp-34.c: Add novector pragma.
> * gcc.dg/vect/costmodel/ppc/costmodel-vect-31a.c: Add novector pragma.
> * gcc.dg/vect/costmodel/ppc/costmodel-vect-31b.c: Add novector pragma.
> * gcc.dg/vect/costmodel/ppc/costmodel-vect-31c.c: Add novector pragma.
> * gcc.dg/vect/costmodel/ppc/costmodel-vect-33.c: Add novector pragma.
> * gcc.dg/vect/costmodel/ppc/costmodel-vect-68a.c: Add novector pragma.
> * gcc.dg/vect/costmodel/ppc/costmodel-vect-68b.c: Add novector pragma.
> * gcc.dg/vect/costmodel/ppc/costmodel-vect-68c.c: Add novector pragma.
> * gcc.dg/vect/costmodel/ppc/costmodel-vect-76a.c: Add novector pragma.
> * gcc.dg/vect/costmodel/ppc/costmodel-vect-76b.c: Add novector pragma.
> * gcc.dg/vect/costmodel/ppc/costmodel-vect-76c.c: Add novector pragma.
> * gcc.dg/vect/costmodel/ppc/costmodel-vect-outer-fir.c: Add novector pragma.
> * gcc.dg/vect/costmodel/x86_64/costmodel-vect-31.c: Add novector pragma.
> * gcc.dg/vect/costmodel/x86_64/costmodel-vect-33.c: Add novector pragma.
> * gcc.dg/vect/costmodel/x86_64/costmodel-vect-68.c: Add novector pragma.
> * gcc.dg/vect/fast-math-bb-slp-call-1.c: Add novector pragma.
> * gcc.dg/vect/fast-math-bb-slp-call-2.c: Add novector pragma.
> * gcc.dg/vect/fast-math-vect-call-1.c: Add novector pragma.
> * gcc.dg/vect/fast-math-vect-call-2.c: Add novector pragma.
> * gcc.dg/vect/fast-math-vect-complex-3.c: Add novector pragma.
> * gcc.dg/vect/if-cvt-stores-vect-ifcvt-18.c: Add novector pragma.
> * gcc.dg/vect/no-scevccp-noreassoc-outer-1.c: Add novector pragma.
> * gcc.dg/vect/no-scevccp-noreassoc-outer-2.c: Add novector pragma.
> * gcc.dg/vect/no-scevccp-noreassoc-outer-3.c: Add novector pragma.
> * gcc.dg/vect/no-scevccp-noreassoc-outer-5.c: Add novector pragma.
> * gcc.dg/vect/no-scevccp-outer-10.c: Add novector pragma.
> * gcc.dg/vect/no-scevccp-outer-10a.c: Add novector pragma.
> * gcc.dg/vect/no-scevccp-outer-10b.c: Add novector pragma.
> * gcc.dg/vect/no-scevccp-outer-11.c: Add novector pragma.
> * gcc.dg/vect/no-scevccp-outer-12.c: Add novector pragma.
> * gcc.dg/vect/no-scevccp-outer-15.c: Add novector pragma.
> * gcc.dg/vect/no-scevccp-outer-16.c: Add novector pragma.
> * gcc.dg/vect/no-scevccp-outer-17.c: Add novector pragma.
> * gcc.dg/vect/no-scevccp-outer-18.c: Add novector pragma.
> * gcc.dg/vect/no-scevccp-outer-19.c: Add novector pragma.
> * gcc.dg/vect/no-scevccp-outer-20.c: Add novector pragma.
> * gcc.dg/vect/no-scevccp-outer-21.c: Add novector pragma.
> * gcc.dg/vect/no-scevccp-outer-22.c: Add novector pragma.
> * gcc.dg/vect/no-scevccp-outer-3.c: Add novector pragma.
> * gcc.dg/vect/no-scevccp-outer-4.c: Add novector pragma.
> * gcc.dg/vect/no-scevccp-outer-5.c: Add novector pragma.
> * gcc.dg/vect/no-scevccp-outer-6-global.c: Add novector pragma.
> * gcc.dg/vect/no-scevccp-outer-6.c: Add novector pragma.
> * gcc.dg/vect/no-scevccp-outer-7.c: Add novector pragma.
> * gcc.dg/vect/no-scevccp-outer-8.c: Add novector pragma.
> * gcc.dg/vect/no-scevccp-outer-9.c: Add novector pragma.
> * gcc.dg/vect/no-scevccp-outer-9a.c: Add novector pragma.
> * gcc.dg/vect/no-scevccp-outer-9b.c: Add novector pragma.
> * gcc.dg/vect/no-scevccp-slp-30.c: Add novector pragma.
> * gcc.dg/vect/no-scevccp-slp-31.c: Add novector pragma.
> * gcc.dg/vect/no-scevccp-vect-iv-2.c: Add novector pragma.
> * gcc.dg/vect/no-section-anchors-vect-31.c: Add novector pragma.
> * gcc.dg/vect/no-section-anchors-vect-34.c: Add novector pragma.
> * gcc.dg/vect/no-section-anchors-vect-36.c: Add novector pragma.
> * gcc.dg/vect/no-section-anchors-vect-64.c: Add novector pragma.
> * gcc.dg/vect/no-section-anchors-vect-65.c: Add novector pragma.
> * gcc.dg/vect/no-section-anchors-vect-66.c: Add novector pragma.
> * gcc.dg/vect/no-section-anchors-vect-68.c: Add novector pragma.
> * gcc.dg/vect/no-section-anchors-vect-69.c: Add novector pragma.
> * gcc.dg/vect/no-section-anchors-vect-outer-4h.c: Add novector pragma.
> * gcc.dg/vect/no-trapping-math-2.c: Add novector pragma.
> * gcc.dg/vect/no-trapping-math-vect-111.c: Add novector pragma.
> * gcc.dg/vect/no-trapping-math-vect-ifcvt-11.c: Add novector pragma.
> * gcc.dg/vect/no-trapping-math-vect-ifcvt-12.c: Add novector pragma.
> * gcc.dg/vect/no-trapping-math-vect-ifcvt-13.c: Add novector pragma.
> * gcc.dg/vect/no-trapping-math-vect-ifcvt-14.c: Add novector pragma.
> * gcc.dg/vect/no-trapping-math-vect-ifcvt-15.c: Add novector pragma.
> * gcc.dg/vect/no-tree-dom-vect-bug.c: Add novector pragma.
> * gcc.dg/vect/no-tree-pre-slp-29.c: Add novector pragma.
> * gcc.dg/vect/no-vfa-pr29145.c: Add novector pragma.
> * gcc.dg/vect/no-vfa-vect-101.c: Add novector pragma.
> * gcc.dg/vect/no-vfa-vect-102.c: Add novector pragma.
> * gcc.dg/vect/no-vfa-vect-102a.c: Add novector pragma.
> * gcc.dg/vect/no-vfa-vect-37.c: Add novector pragma.
> * gcc.dg/vect/no-vfa-vect-43.c: Add novector pragma.
> * gcc.dg/vect/no-vfa-vect-45.c: Add novector pragma.
> * gcc.dg/vect/no-vfa-vect-49.c: Add novector pragma.
> * gcc.dg/vect/no-vfa-vect-51.c: Add novector pragma.
> * gcc.dg/vect/no-vfa-vect-53.c: Add novector pragma.
> * gcc.dg/vect/no-vfa-vect-57.c: Add novector pragma.
> * gcc.dg/vect/no-vfa-vect-61.c: Add novector pragma.
> * gcc.dg/vect/no-vfa-vect-79.c: Add novector pragma.
> * gcc.dg/vect/no-vfa-vect-depend-1.c: Add novector pragma.
> * gcc.dg/vect/no-vfa-vect-depend-2.c: Add novector pragma.
> * gcc.dg/vect/no-vfa-vect-depend-3.c: Add novector pragma.
> * gcc.dg/vect/no-vfa-vect-dv-2.c: Add novector pragma.
> * gcc.dg/vect/pr101445.c: Add novector pragma.
> * gcc.dg/vect/pr103581.c: Add novector pragma.
> * gcc.dg/vect/pr105219.c: Add novector pragma.
> * gcc.dg/vect/pr108608.c: Add novector pragma.
> * gcc.dg/vect/pr18400.c: Add novector pragma.
> * gcc.dg/vect/pr18536.c: Add novector pragma.
> * gcc.dg/vect/pr20122.c: Add novector pragma.
> * gcc.dg/vect/pr25413.c: Add novector pragma.
> * gcc.dg/vect/pr30784.c: Add novector pragma.
> * gcc.dg/vect/pr37539.c: Add novector pragma.
> * gcc.dg/vect/pr40074.c: Add novector pragma.
> * gcc.dg/vect/pr45752.c: Add novector pragma.
> * gcc.dg/vect/pr45902.c: Add novector pragma.
> * gcc.dg/vect/pr46009.c: Add novector pragma.
> * gcc.dg/vect/pr48172.c: Add novector pragma.
> * gcc.dg/vect/pr51074.c: Add novector pragma.
> * gcc.dg/vect/pr51581-3.c: Add novector pragma.
> * gcc.dg/vect/pr51581-4.c: Add novector pragma.
> * gcc.dg/vect/pr53185-2.c: Add novector pragma.
> * gcc.dg/vect/pr56826.c: Add novector pragma.
> * gcc.dg/vect/pr56918.c: Add novector pragma.
> * gcc.dg/vect/pr56920.c: Add novector pragma.
> * gcc.dg/vect/pr56933.c: Add novector pragma.
> * gcc.dg/vect/pr57705.c: Add novector pragma.
> * gcc.dg/vect/pr57741-2.c: Add novector pragma.
> * gcc.dg/vect/pr57741-3.c: Add novector pragma.
> * gcc.dg/vect/pr59591-1.c: Add novector pragma.
> * gcc.dg/vect/pr59591-2.c: Add novector pragma.
> * gcc.dg/vect/pr59594.c: Add novector pragma.
> * gcc.dg/vect/pr59984.c: Add novector pragma.
> * gcc.dg/vect/pr60276.c: Add novector pragma.
> * gcc.dg/vect/pr61194.c: Add novector pragma.
> * gcc.dg/vect/pr61680.c: Add novector pragma.
> * gcc.dg/vect/pr62021.c: Add novector pragma.
> * gcc.dg/vect/pr63341-2.c: Add novector pragma.
> * gcc.dg/vect/pr64252.c: Add novector pragma.
> * gcc.dg/vect/pr64404.c: Add novector pragma.
> * gcc.dg/vect/pr64421.c: Add novector pragma.
> * gcc.dg/vect/pr64493.c: Add novector pragma.
> * gcc.dg/vect/pr64495.c: Add novector pragma.
> * gcc.dg/vect/pr66251.c: Add novector pragma.
> * gcc.dg/vect/pr66253.c: Add novector pragma.
> * gcc.dg/vect/pr68502-1.c: Add novector pragma.
> * gcc.dg/vect/pr68502-2.c: Add novector pragma.
> * gcc.dg/vect/pr69820.c: Add novector pragma.
> * gcc.dg/vect/pr70021.c: Add novector pragma.
> * gcc.dg/vect/pr70354-1.c: Add novector pragma.
> * gcc.dg/vect/pr70354-2.c: Add novector pragma.
> * gcc.dg/vect/pr71259.c: Add novector pragma.
> * gcc.dg/vect/pr78005.c: Add novector pragma.
> * gcc.dg/vect/pr78558.c: Add novector pragma.
> * gcc.dg/vect/pr80815-2.c: Add novector pragma.
> * gcc.dg/vect/pr80815-3.c: Add novector pragma.
> * gcc.dg/vect/pr80928.c: Add novector pragma.
> * gcc.dg/vect/pr81410.c: Add novector pragma.
> * gcc.dg/vect/pr81633.c: Add novector pragma.
> * gcc.dg/vect/pr81740-1.c: Add novector pragma.
> * gcc.dg/vect/pr81740-2.c: Add novector pragma.
> * gcc.dg/vect/pr85586.c: Add novector pragma.
> * gcc.dg/vect/pr87288-1.c: Add novector pragma.
> * gcc.dg/vect/pr87288-2.c: Add novector pragma.
> * gcc.dg/vect/pr87288-3.c: Add novector pragma.
> * gcc.dg/vect/pr88903-1.c: Add novector pragma.
> * gcc.dg/vect/pr88903-2.c: Add novector pragma.
> * gcc.dg/vect/pr90018.c: Add novector pragma.
> * gcc.dg/vect/pr92420.c: Add novector pragma.
> * gcc.dg/vect/pr94994.c: Add novector pragma.
> * gcc.dg/vect/pr96783-1.c: Add novector pragma.
> * gcc.dg/vect/pr96783-2.c: Add novector pragma.
> * gcc.dg/vect/pr97081-2.c: Add novector pragma.
> * gcc.dg/vect/pr97558-2.c: Add novector pragma.
> * gcc.dg/vect/pr97678.c: Add novector pragma.
> * gcc.dg/vect/section-anchors-pr27770.c: Add novector pragma.
> * gcc.dg/vect/section-anchors-vect-69.c: Add novector pragma.
> * gcc.dg/vect/slp-1.c: Add novector pragma.
> * gcc.dg/vect/slp-10.c: Add novector pragma.
> * gcc.dg/vect/slp-11a.c: Add novector pragma.
> * gcc.dg/vect/slp-11b.c: Add novector pragma.
> * gcc.dg/vect/slp-11c.c: Add novector pragma.
> * gcc.dg/vect/slp-12a.c: Add novector pragma.
> * gcc.dg/vect/slp-12b.c: Add novector pragma.
> * gcc.dg/vect/slp-12c.c: Add novector pragma.
> * gcc.dg/vect/slp-13-big-array.c: Add novector pragma.
> * gcc.dg/vect/slp-13.c: Add novector pragma.
> * gcc.dg/vect/slp-14.c: Add novector pragma.
> * gcc.dg/vect/slp-15.c: Add novector pragma.
> * gcc.dg/vect/slp-16.c: Add novector pragma.
> * gcc.dg/vect/slp-17.c: Add novector pragma.
> * gcc.dg/vect/slp-18.c: Add novector pragma.
> * gcc.dg/vect/slp-19a.c: Add novector pragma.
> * gcc.dg/vect/slp-19b.c: Add novector pragma.
> * gcc.dg/vect/slp-19c.c: Add novector pragma.
> * gcc.dg/vect/slp-2.c: Add novector pragma.
> * gcc.dg/vect/slp-20.c: Add novector pragma.
> * gcc.dg/vect/slp-21.c: Add novector pragma.
> * gcc.dg/vect/slp-22.c: Add novector pragma.
> * gcc.dg/vect/slp-23.c: Add novector pragma.
> * gcc.dg/vect/slp-24-big-array.c: Add novector pragma.
> * gcc.dg/vect/slp-24.c: Add novector pragma.
> * gcc.dg/vect/slp-25.c: Add novector pragma.
> * gcc.dg/vect/slp-26.c: Add novector pragma.
> * gcc.dg/vect/slp-28.c: Add novector pragma.
> * gcc.dg/vect/slp-3-big-array.c: Add novector pragma.
> * gcc.dg/vect/slp-3.c: Add novector pragma.
> * gcc.dg/vect/slp-33.c: Add novector pragma.
> * gcc.dg/vect/slp-34-big-array.c: Add novector pragma.
> * gcc.dg/vect/slp-34.c: Add novector pragma.
> * gcc.dg/vect/slp-35.c: Add novector pragma.
> * gcc.dg/vect/slp-37.c: Add novector pragma.
> * gcc.dg/vect/slp-4-big-array.c: Add novector pragma.
> * gcc.dg/vect/slp-4.c: Add novector pragma.
> * gcc.dg/vect/slp-41.c: Add novector pragma.
> * gcc.dg/vect/slp-43.c: Add novector pragma.
> * gcc.dg/vect/slp-45.c: Add novector pragma.
> * gcc.dg/vect/slp-46.c: Add novector pragma.
> * gcc.dg/vect/slp-47.c: Add novector pragma.
> * gcc.dg/vect/slp-48.c: Add novector pragma.
> * gcc.dg/vect/slp-49.c: Add novector pragma.
> * gcc.dg/vect/slp-5.c: Add novector pragma.
> * gcc.dg/vect/slp-6.c: Add novector pragma.
> * gcc.dg/vect/slp-7.c: Add novector pragma.
> * gcc.dg/vect/slp-8.c: Add novector pragma.
> * gcc.dg/vect/slp-9.c: Add novector pragma.
> * gcc.dg/vect/slp-cond-1.c: Add novector pragma.
> * gcc.dg/vect/slp-cond-2-big-array.c: Add novector pragma.
> * gcc.dg/vect/slp-cond-2.c: Add novector pragma.
> * gcc.dg/vect/slp-cond-3.c: Add novector pragma.
> * gcc.dg/vect/slp-cond-4.c: Add novector pragma.
> * gcc.dg/vect/slp-cond-5.c: Add novector pragma.
> * gcc.dg/vect/slp-multitypes-1.c: Add novector pragma.
> * gcc.dg/vect/slp-multitypes-10.c: Add novector pragma.
> * gcc.dg/vect/slp-multitypes-11-big-array.c: Add novector pragma.
> * gcc.dg/vect/slp-multitypes-11.c: Add novector pragma.
> * gcc.dg/vect/slp-multitypes-12.c: Add novector pragma.
> * gcc.dg/vect/slp-multitypes-2.c: Add novector pragma.
> * gcc.dg/vect/slp-multitypes-3.c: Add novector pragma.
> * gcc.dg/vect/slp-multitypes-4.c: Add novector pragma.
> * gcc.dg/vect/slp-multitypes-5.c: Add novector pragma.
> * gcc.dg/vect/slp-multitypes-6.c: Add novector pragma.
> * gcc.dg/vect/slp-multitypes-7.c: Add novector pragma.
> * gcc.dg/vect/slp-multitypes-8.c: Add novector pragma.
> * gcc.dg/vect/slp-multitypes-9.c: Add novector pragma.
> * gcc.dg/vect/slp-perm-1.c: Add novector pragma.
> * gcc.dg/vect/slp-perm-10.c: Add novector pragma.
> * gcc.dg/vect/slp-perm-11.c: Add novector pragma.
> * gcc.dg/vect/slp-perm-12.c: Add novector pragma.
> * gcc.dg/vect/slp-perm-2.c: Add novector pragma.
> * gcc.dg/vect/slp-perm-3.c: Add novector pragma.
> * gcc.dg/vect/slp-perm-4.c: Add novector pragma.
> * gcc.dg/vect/slp-perm-5.c: Add novector pragma.
> * gcc.dg/vect/slp-perm-6.c: Add novector pragma.
> * gcc.dg/vect/slp-perm-7.c: Add novector pragma.
> * gcc.dg/vect/slp-perm-8.c: Add novector pragma.
> * gcc.dg/vect/slp-perm-9.c: Add novector pragma.
> * gcc.dg/vect/slp-widen-mult-half.c: Add novector pragma.
> * gcc.dg/vect/slp-widen-mult-s16.c: Add novector pragma.
> * gcc.dg/vect/slp-widen-mult-u8.c: Add novector pragma.
> * gcc.dg/vect/vect-100.c: Add novector pragma.
> * gcc.dg/vect/vect-103.c: Add novector pragma.
> * gcc.dg/vect/vect-104.c: Add novector pragma.
> * gcc.dg/vect/vect-105-big-array.c: Add novector pragma.
> * gcc.dg/vect/vect-105.c: Add novector pragma.
> * gcc.dg/vect/vect-106.c: Add novector pragma.
> * gcc.dg/vect/vect-107.c: Add novector pragma.
> * gcc.dg/vect/vect-108.c: Add novector pragma.
> * gcc.dg/vect/vect-109.c: Add novector pragma.
> * gcc.dg/vect/vect-11.c: Add novector pragma.
> * gcc.dg/vect/vect-110.c: Add novector pragma.
> * gcc.dg/vect/vect-113.c: Add novector pragma.
> * gcc.dg/vect/vect-114.c: Add novector pragma.
> * gcc.dg/vect/vect-115.c: Add novector pragma.
> * gcc.dg/vect/vect-116.c: Add novector pragma.
> * gcc.dg/vect/vect-117.c: Add novector pragma.
> * gcc.dg/vect/vect-11a.c: Add novector pragma.
> * gcc.dg/vect/vect-12.c: Add novector pragma.
> * gcc.dg/vect/vect-122.c: Add novector pragma.
> * gcc.dg/vect/vect-124.c: Add novector pragma.
> * gcc.dg/vect/vect-13.c: Add novector pragma.
> * gcc.dg/vect/vect-14.c: Add novector pragma.
> * gcc.dg/vect/vect-15-big-array.c: Add novector pragma.
> * gcc.dg/vect/vect-15.c: Add novector pragma.
> * gcc.dg/vect/vect-17.c: Add novector pragma.
> * gcc.dg/vect/vect-18.c: Add novector pragma.
> * gcc.dg/vect/vect-19.c: Add novector pragma.
> * gcc.dg/vect/vect-2-big-array.c: Add novector pragma.
> * gcc.dg/vect/vect-2.c: Add novector pragma.
> * gcc.dg/vect/vect-20.c: Add novector pragma.
> * gcc.dg/vect/vect-21.c: Add novector pragma.
> * gcc.dg/vect/vect-22.c: Add novector pragma.
> * gcc.dg/vect/vect-23.c: Add novector pragma.
> * gcc.dg/vect/vect-24.c: Add novector pragma.
> * gcc.dg/vect/vect-25.c: Add novector pragma.
> * gcc.dg/vect/vect-26.c: Add novector pragma.
> * gcc.dg/vect/vect-27.c: Add novector pragma.
> * gcc.dg/vect/vect-28.c: Add novector pragma.
> * gcc.dg/vect/vect-29.c: Add novector pragma.
> * gcc.dg/vect/vect-3.c: Add novector pragma.
> * gcc.dg/vect/vect-30.c: Add novector pragma.
> * gcc.dg/vect/vect-31-big-array.c: Add novector pragma.
> * gcc.dg/vect/vect-31.c: Add novector pragma.
> * gcc.dg/vect/vect-32-big-array.c: Add novector pragma.
> * gcc.dg/vect/vect-32.c: Add novector pragma.
> * gcc.dg/vect/vect-33-big-array.c: Add novector pragma.
> * gcc.dg/vect/vect-33.c: Add novector pragma.
> * gcc.dg/vect/vect-34-big-array.c: Add novector pragma.
> * gcc.dg/vect/vect-34.c: Add novector pragma.
> * gcc.dg/vect/vect-35-big-array.c: Add novector pragma.
> * gcc.dg/vect/vect-35.c: Add novector pragma.
> * gcc.dg/vect/vect-36-big-array.c: Add novector pragma.
> * gcc.dg/vect/vect-36.c: Add novector pragma.
> * gcc.dg/vect/vect-38.c: Add novector pragma.
> * gcc.dg/vect/vect-4.c: Add novector pragma.
> * gcc.dg/vect/vect-40.c: Add novector pragma.
> * gcc.dg/vect/vect-42.c: Add novector pragma.
> * gcc.dg/vect/vect-44.c: Add novector pragma.
> * gcc.dg/vect/vect-46.c: Add novector pragma.
> * gcc.dg/vect/vect-48.c: Add novector pragma.
> * gcc.dg/vect/vect-5.c: Add novector pragma.
> * gcc.dg/vect/vect-50.c: Add novector pragma.
> * gcc.dg/vect/vect-52.c: Add novector pragma.
> * gcc.dg/vect/vect-54.c: Add novector pragma.
> * gcc.dg/vect/vect-56.c: Add novector pragma.
> * gcc.dg/vect/vect-58.c: Add novector pragma.
> * gcc.dg/vect/vect-6-big-array.c: Add novector pragma.
> * gcc.dg/vect/vect-6.c: Add novector pragma.
> * gcc.dg/vect/vect-60.c: Add novector pragma.
> * gcc.dg/vect/vect-62.c: Add novector pragma.
> * gcc.dg/vect/vect-63.c: Add novector pragma.
> * gcc.dg/vect/vect-64.c: Add novector pragma.
> * gcc.dg/vect/vect-65.c: Add novector pragma.
> * gcc.dg/vect/vect-66.c: Add novector pragma.
> * gcc.dg/vect/vect-67.c: Add novector pragma.
> * gcc.dg/vect/vect-68.c: Add novector pragma.
> * gcc.dg/vect/vect-7.c: Add novector pragma.
> * gcc.dg/vect/vect-70.c: Add novector pragma.
> * gcc.dg/vect/vect-71.c: Add novector pragma.
> * gcc.dg/vect/vect-72.c: Add novector pragma.
> * gcc.dg/vect/vect-73-big-array.c: Add novector pragma.
> * gcc.dg/vect/vect-73.c: Add novector pragma.
> * gcc.dg/vect/vect-74-big-array.c: Add novector pragma.
> * gcc.dg/vect/vect-74.c: Add novector pragma.
> * gcc.dg/vect/vect-75-big-array.c: Add novector pragma.
> * gcc.dg/vect/vect-75.c: Add novector pragma.
> * gcc.dg/vect/vect-76-big-array.c: Add novector pragma.
> * gcc.dg/vect/vect-76.c: Add novector pragma.
> * gcc.dg/vect/vect-77-alignchecks.c: Add novector pragma.
> * gcc.dg/vect/vect-77-global.c: Add novector pragma.
> * gcc.dg/vect/vect-77.c: Add novector pragma.
> * gcc.dg/vect/vect-78-alignchecks.c: Add novector pragma.
> * gcc.dg/vect/vect-78-global.c: Add novector pragma.
> * gcc.dg/vect/vect-78.c: Add novector pragma.
> * gcc.dg/vect/vect-8.c: Add novector pragma.
> * gcc.dg/vect/vect-80-big-array.c: Add novector pragma.
> * gcc.dg/vect/vect-80.c: Add novector pragma.
> * gcc.dg/vect/vect-82.c: Add novector pragma.
> * gcc.dg/vect/vect-82_64.c: Add novector pragma.
> * gcc.dg/vect/vect-83.c: Add novector pragma.
> * gcc.dg/vect/vect-83_64.c: Add novector pragma.
> * gcc.dg/vect/vect-85-big-array.c: Add novector pragma.
> * gcc.dg/vect/vect-85.c: Add novector pragma.
> * gcc.dg/vect/vect-86.c: Add novector pragma.
> * gcc.dg/vect/vect-87.c: Add novector pragma.
> * gcc.dg/vect/vect-88.c: Add novector pragma.
> * gcc.dg/vect/vect-89-big-array.c: Add novector pragma.
> * gcc.dg/vect/vect-89.c: Add novector pragma.
> * gcc.dg/vect/vect-9.c: Add novector pragma.
> * gcc.dg/vect/vect-92.c: Add novector pragma.
> * gcc.dg/vect/vect-93.c: Add novector pragma.
> * gcc.dg/vect/vect-95.c: Add novector pragma.
> * gcc.dg/vect/vect-96.c: Add novector pragma.
> * gcc.dg/vect/vect-97-big-array.c: Add novector pragma.
> * gcc.dg/vect/vect-97.c: Add novector pragma.
> * gcc.dg/vect/vect-98-big-array.c: Add novector pragma.
> * gcc.dg/vect/vect-98.c: Add novector pragma.
> * gcc.dg/vect/vect-99.c: Add novector pragma.
> * gcc.dg/vect/vect-alias-check-10.c: Add novector pragma.
> * gcc.dg/vect/vect-alias-check-11.c: Add novector pragma.
> * gcc.dg/vect/vect-alias-check-12.c: Add novector pragma.
> * gcc.dg/vect/vect-alias-check-14.c: Add novector pragma.
> * gcc.dg/vect/vect-alias-check-15.c: Add novector pragma.
> * gcc.dg/vect/vect-alias-check-16.c: Add novector pragma.
> * gcc.dg/vect/vect-alias-check-18.c: Add novector pragma.
> * gcc.dg/vect/vect-alias-check-19.c: Add novector pragma.
> * gcc.dg/vect/vect-alias-check-20.c: Add novector pragma.
> * gcc.dg/vect/vect-alias-check-8.c: Add novector pragma.
> * gcc.dg/vect/vect-alias-check-9.c: Add novector pragma.
> * gcc.dg/vect/vect-align-1.c: Add novector pragma.
> * gcc.dg/vect/vect-align-2.c: Add novector pragma.
> * gcc.dg/vect/vect-all-big-array.c: Add novector pragma.
> * gcc.dg/vect/vect-all.c: Add novector pragma.
> * gcc.dg/vect/vect-avg-1.c: Add novector pragma.
> * gcc.dg/vect/vect-avg-11.c: Add novector pragma.
> * gcc.dg/vect/vect-avg-15.c: Add novector pragma.
> * gcc.dg/vect/vect-avg-16.c: Add novector pragma.
> * gcc.dg/vect/vect-avg-5.c: Add novector pragma.
> * gcc.dg/vect/vect-bitfield-write-1.c: Add novector pragma.
> * gcc.dg/vect/vect-bitfield-write-2.c: Add novector pragma.
> * gcc.dg/vect/vect-bitfield-write-3.c: Add novector pragma.
> * gcc.dg/vect/vect-bitfield-write-4.c: Add novector pragma.
> * gcc.dg/vect/vect-bitfield-write-5.c: Add novector pragma.
> * gcc.dg/vect/vect-bool-cmp.c: Add novector pragma.
> * gcc.dg/vect/vect-bswap16.c: Add novector pragma.
> * gcc.dg/vect/vect-bswap32.c: Add novector pragma.
> * gcc.dg/vect/vect-bswap64.c: Add novector pragma.
> * gcc.dg/vect/vect-complex-1.c: Add novector pragma.
> * gcc.dg/vect/vect-complex-2.c: Add novector pragma.
> * gcc.dg/vect/vect-complex-4.c: Add novector pragma.
> * gcc.dg/vect/vect-cond-1.c: Add novector pragma.
> * gcc.dg/vect/vect-cond-10.c: Add novector pragma.
> * gcc.dg/vect/vect-cond-11.c: Add novector pragma.
> * gcc.dg/vect/vect-cond-3.c: Add novector pragma.
> * gcc.dg/vect/vect-cond-4.c: Add novector pragma.
> * gcc.dg/vect/vect-cond-5.c: Add novector pragma.
> * gcc.dg/vect/vect-cond-6.c: Add novector pragma.
> * gcc.dg/vect/vect-cond-7.c: Add novector pragma.
> * gcc.dg/vect/vect-cond-8.c: Add novector pragma.
> * gcc.dg/vect/vect-cond-9.c: Add novector pragma.
> * gcc.dg/vect/vect-cond-arith-1.c: Add novector pragma.
> * gcc.dg/vect/vect-cond-arith-3.c: Add novector pragma.
> * gcc.dg/vect/vect-cond-arith-4.c: Add novector pragma.
> * gcc.dg/vect/vect-cond-arith-5.c: Add novector pragma.
> * gcc.dg/vect/vect-cond-arith-6.c: Add novector pragma.
> * gcc.dg/vect/vect-cond-arith-7.c: Add novector pragma.
> * gcc.dg/vect/vect-cselim-1.c: Add novector pragma.
> * gcc.dg/vect/vect-cselim-2.c: Add novector pragma.
> * gcc.dg/vect/vect-div-bitmask-4.c: Add novector pragma.
> * gcc.dg/vect/vect-div-bitmask-5.c: Add novector pragma.
> * gcc.dg/vect/vect-div-bitmask.h: Add novector pragma.
> * gcc.dg/vect/vect-double-reduc-1.c: Add novector pragma.
> * gcc.dg/vect/vect-double-reduc-2.c: Add novector pragma.
> * gcc.dg/vect/vect-double-reduc-3.c: Add novector pragma.
> * gcc.dg/vect/vect-double-reduc-4.c: Add novector pragma.
> * gcc.dg/vect/vect-double-reduc-5.c: Add novector pragma.
> * gcc.dg/vect/vect-double-reduc-6-big-array.c: Add novector pragma.
> * gcc.dg/vect/vect-double-reduc-6.c: Add novector pragma.
> * gcc.dg/vect/vect-double-reduc-7.c: Add novector pragma.
> * gcc.dg/vect/vect-float-extend-1.c: Add novector pragma.
> * gcc.dg/vect/vect-float-truncate-1.c: Add novector pragma.
> * gcc.dg/vect/vect-floatint-conversion-1.c: Add novector pragma.
> * gcc.dg/vect/vect-floatint-conversion-2.c: Add novector pragma.
> * gcc.dg/vect/vect-fma-1.c: Add novector pragma.
> * gcc.dg/vect/vect-gather-1.c: Add novector pragma.
> * gcc.dg/vect/vect-gather-3.c: Add novector pragma.
> * gcc.dg/vect/vect-ifcvt-11.c: Add novector pragma.
> * gcc.dg/vect/vect-ifcvt-16.c: Add novector pragma.
> * gcc.dg/vect/vect-ifcvt-17.c: Add novector pragma.
> * gcc.dg/vect/vect-ifcvt-2.c: Add novector pragma.
> * gcc.dg/vect/vect-ifcvt-3.c: Add novector pragma.
> * gcc.dg/vect/vect-ifcvt-4.c: Add novector pragma.
> * gcc.dg/vect/vect-ifcvt-5.c: Add novector pragma.
> * gcc.dg/vect/vect-ifcvt-6.c: Add novector pragma.
> * gcc.dg/vect/vect-ifcvt-7.c: Add novector pragma.
> * gcc.dg/vect/vect-ifcvt-9.c: Add novector pragma.
> * gcc.dg/vect/vect-intfloat-conversion-1.c: Add novector pragma.
> * gcc.dg/vect/vect-intfloat-conversion-2.c: Add novector pragma.
> * gcc.dg/vect/vect-intfloat-conversion-3.c: Add novector pragma.
> * gcc.dg/vect/vect-intfloat-conversion-4a.c: Add novector pragma.
> * gcc.dg/vect/vect-intfloat-conversion-4b.c: Add novector pragma.
> * gcc.dg/vect/vect-iv-1.c: Add novector pragma.
> * gcc.dg/vect/vect-iv-10.c: Add novector pragma.
> * gcc.dg/vect/vect-iv-2.c: Add novector pragma.
> * gcc.dg/vect/vect-iv-3.c: Add novector pragma.
> * gcc.dg/vect/vect-iv-4.c: Add novector pragma.
> * gcc.dg/vect/vect-iv-5.c: Add novector pragma.
> * gcc.dg/vect/vect-iv-6.c: Add novector pragma.
> * gcc.dg/vect/vect-iv-7.c: Add novector pragma.
> * gcc.dg/vect/vect-iv-8-big-array.c: Add novector pragma.
> * gcc.dg/vect/vect-iv-8.c: Add novector pragma.
> * gcc.dg/vect/vect-iv-8a-big-array.c: Add novector pragma.
> * gcc.dg/vect/vect-iv-8a.c: Add novector pragma.
> * gcc.dg/vect/vect-live-1.c: Add novector pragma.
> * gcc.dg/vect/vect-live-2.c: Add novector pragma.
> * gcc.dg/vect/vect-live-3.c: Add novector pragma.
> * gcc.dg/vect/vect-live-4.c: Add novector pragma.
> * gcc.dg/vect/vect-live-5.c: Add novector pragma.
> * gcc.dg/vect/vect-live-slp-1.c: Add novector pragma.
> * gcc.dg/vect/vect-live-slp-2.c: Add novector pragma.
> * gcc.dg/vect/vect-live-slp-3.c: Add novector pragma.
> * gcc.dg/vect/vect-mask-load-1.c: Add novector pragma.
> * gcc.dg/vect/vect-mask-loadstore-1.c: Add novector pragma.
> * gcc.dg/vect/vect-mulhrs-1.c: Add novector pragma.
> * gcc.dg/vect/vect-mult-const-pattern-1.c: Add novector pragma.
> * gcc.dg/vect/vect-mult-const-pattern-2.c: Add novector pragma.
> * gcc.dg/vect/vect-multitypes-1.c: Add novector pragma.
> * gcc.dg/vect/vect-multitypes-10.c: Add novector pragma.
> * gcc.dg/vect/vect-multitypes-11.c: Add novector pragma.
> * gcc.dg/vect/vect-multitypes-12.c: Add novector pragma.
> * gcc.dg/vect/vect-multitypes-13.c: Add novector pragma.
> * gcc.dg/vect/vect-multitypes-14.c: Add novector pragma.
> * gcc.dg/vect/vect-multitypes-16.c: Add novector pragma.
> * gcc.dg/vect/vect-multitypes-17.c: Add novector pragma.
> * gcc.dg/vect/vect-multitypes-2.c: Add novector pragma.
> * gcc.dg/vect/vect-multitypes-3.c: Add novector pragma.
> * gcc.dg/vect/vect-multitypes-4.c: Add novector pragma.
> * gcc.dg/vect/vect-multitypes-5.c: Add novector pragma.
> * gcc.dg/vect/vect-multitypes-6.c: Add novector pragma.
> * gcc.dg/vect/vect-multitypes-8.c: Add novector pragma.
> * gcc.dg/vect/vect-multitypes-9.c: Add novector pragma.
> * gcc.dg/vect/vect-nb-iter-ub-1.c: Add novector pragma.
> * gcc.dg/vect/vect-nb-iter-ub-2.c: Add novector pragma.
> * gcc.dg/vect/vect-nb-iter-ub-3.c: Add novector pragma.
> * gcc.dg/vect/vect-neg-store-1.c: Add novector pragma.
> * gcc.dg/vect/vect-neg-store-2.c: Add novector pragma.
> * gcc.dg/vect/vect-nest-cycle-1.c: Add novector pragma.
> * gcc.dg/vect/vect-nest-cycle-2.c: Add novector pragma.
> * gcc.dg/vect/vect-nest-cycle-3.c: Add novector pragma.
> * gcc.dg/vect/vect-outer-2-big-array.c: Add novector pragma.
> * gcc.dg/vect/vect-outer-2.c: Add novector pragma.
> * gcc.dg/vect/vect-outer-2a-big-array.c: Add novector pragma.
> * gcc.dg/vect/vect-outer-2a.c: Add novector pragma.
> * gcc.dg/vect/vect-outer-2b.c: Add novector pragma.
> * gcc.dg/vect/vect-outer-2c-big-array.c: Add novector pragma.
> * gcc.dg/vect/vect-outer-2c.c: Add novector pragma.
> * gcc.dg/vect/vect-outer-2d.c: Add novector pragma.
> * gcc.dg/vect/vect-outer-3-big-array.c: Add novector pragma.
> * gcc.dg/vect/vect-outer-3.c: Add novector pragma.
> * gcc.dg/vect/vect-outer-3a-big-array.c: Add novector pragma.
> * gcc.dg/vect/vect-outer-3a.c: Add novector pragma.
> * gcc.dg/vect/vect-outer-3b.c: Add novector pragma.
> * gcc.dg/vect/vect-outer-3c.c: Add novector pragma.
> * gcc.dg/vect/vect-outer-4.c: Add novector pragma.
> * gcc.dg/vect/vect-outer-4d-big-array.c: Add novector pragma.
> * gcc.dg/vect/vect-outer-4d.c: Add novector pragma.
> * gcc.dg/vect/vect-outer-5.c: Add novector pragma.
> * gcc.dg/vect/vect-outer-6.c: Add novector pragma.
> * gcc.dg/vect/vect-outer-fir-big-array.c: Add novector pragma.
> * gcc.dg/vect/vect-outer-fir-lb-big-array.c: Add novector pragma.
> * gcc.dg/vect/vect-outer-fir-lb.c: Add novector pragma.
> * gcc.dg/vect/vect-outer-fir.c: Add novector pragma.
> * gcc.dg/vect/vect-outer-simd-1.c: Add novector pragma.
> * gcc.dg/vect/vect-outer-simd-2.c: Add novector pragma.
> * gcc.dg/vect/vect-outer-simd-3.c: Add novector pragma.
> * gcc.dg/vect/vect-outer-slp-2.c: Add novector pragma.
> * gcc.dg/vect/vect-outer-slp-3.c: Add novector pragma.
> * gcc.dg/vect/vect-over-widen-1-big-array.c: Add novector pragma.
> * gcc.dg/vect/vect-over-widen-1.c: Add novector pragma.
> * gcc.dg/vect/vect-over-widen-11.c: Add novector pragma.
> * gcc.dg/vect/vect-over-widen-13.c: Add novector pragma.
> * gcc.dg/vect/vect-over-widen-15.c: Add novector pragma.
> * gcc.dg/vect/vect-over-widen-17.c: Add novector pragma.
> * gcc.dg/vect/vect-over-widen-18.c: Add novector pragma.
> * gcc.dg/vect/vect-over-widen-19.c: Add novector pragma.
> * gcc.dg/vect/vect-over-widen-2-big-array.c: Add novector pragma.
> * gcc.dg/vect/vect-over-widen-2.c: Add novector pragma.
> * gcc.dg/vect/vect-over-widen-20.c: Add novector pragma.
> * gcc.dg/vect/vect-over-widen-21.c: Add novector pragma.
> * gcc.dg/vect/vect-over-widen-22.c: Add novector pragma.
> * gcc.dg/vect/vect-over-widen-3-big-array.c: Add novector pragma.
> * gcc.dg/vect/vect-over-widen-3.c: Add novector pragma.
> * gcc.dg/vect/vect-over-widen-4-big-array.c: Add novector pragma.
> * gcc.dg/vect/vect-over-widen-4.c: Add novector pragma.
> * gcc.dg/vect/vect-over-widen-5.c: Add novector pragma.
> * gcc.dg/vect/vect-over-widen-7.c: Add novector pragma.
> * gcc.dg/vect/vect-over-widen-9.c: Add novector pragma.
> * gcc.dg/vect/vect-peel-1-src.c: Add novector pragma.
> * gcc.dg/vect/vect-peel-2-src.c: Add novector pragma.
> * gcc.dg/vect/vect-peel-4-src.c: Add novector pragma.
> * gcc.dg/vect/vect-recurr-1.c: Add novector pragma.
> * gcc.dg/vect/vect-recurr-2.c: Add novector pragma.
> * gcc.dg/vect/vect-recurr-3.c: Add novector pragma.
> * gcc.dg/vect/vect-recurr-4.c: Add novector pragma.
> * gcc.dg/vect/vect-recurr-5.c: Add novector pragma.
> * gcc.dg/vect/vect-recurr-6.c: Add novector pragma.
> * gcc.dg/vect/vect-sdiv-pow2-1.c: Add novector pragma.
> * gcc.dg/vect/vect-sdivmod-1.c: Add novector pragma.
> * gcc.dg/vect/vect-shift-1.c: Add novector pragma.
> * gcc.dg/vect/vect-shift-3.c: Add novector pragma.
> * gcc.dg/vect/vect-shift-4.c: Add novector pragma.
> * gcc.dg/vect/vect-simd-1.c: Add novector pragma.
> * gcc.dg/vect/vect-simd-10.c: Add novector pragma.
> * gcc.dg/vect/vect-simd-11.c: Add novector pragma.
> * gcc.dg/vect/vect-simd-12.c: Add novector pragma.
> * gcc.dg/vect/vect-simd-13.c: Add novector pragma. > * gcc.dg/vect/vect-simd-14.c: Add novector pragma. > * gcc.dg/vect/vect-simd-15.c: Add novector pragma. > * gcc.dg/vect/vect-simd-16.c: Add novector pragma. > * gcc.dg/vect/vect-simd-17.c: Add novector pragma. > * gcc.dg/vect/vect-simd-18.c: Add novector pragma. > * gcc.dg/vect/vect-simd-19.c: Add novector pragma. > * gcc.dg/vect/vect-simd-20.c: Add novector pragma. > * gcc.dg/vect/vect-simd-8.c: Add novector pragma. > * gcc.dg/vect/vect-simd-9.c: Add novector pragma. > * gcc.dg/vect/vect-simd-clone-1.c: Add novector pragma. > * gcc.dg/vect/vect-simd-clone-10.c: Add novector pragma. > * gcc.dg/vect/vect-simd-clone-11.c: Add novector pragma. > * gcc.dg/vect/vect-simd-clone-15.c: Add novector pragma. > * gcc.dg/vect/vect-simd-clone-2.c: Add novector pragma. > * gcc.dg/vect/vect-simd-clone-3.c: Add novector pragma. > * gcc.dg/vect/vect-simd-clone-4.c: Add novector pragma. > * gcc.dg/vect/vect-simd-clone-5.c: Add novector pragma. > * gcc.dg/vect/vect-simd-clone-6.c: Add novector pragma. > * gcc.dg/vect/vect-simd-clone-7.c: Add novector pragma. > * gcc.dg/vect/vect-simd-clone-8.c: Add novector pragma. > * gcc.dg/vect/vect-simd-clone-9.c: Add novector pragma. > * gcc.dg/vect/vect-strided-a-mult.c: Add novector pragma. > * gcc.dg/vect/vect-strided-a-u16-i2.c: Add novector pragma. > * gcc.dg/vect/vect-strided-a-u16-i4.c: Add novector pragma. > * gcc.dg/vect/vect-strided-a-u16-mult.c: Add novector pragma. > * gcc.dg/vect/vect-strided-a-u32-mult.c: Add novector pragma. > * gcc.dg/vect/vect-strided-a-u8-i2-gap.c: Add novector pragma. > * gcc.dg/vect/vect-strided-a-u8-i8-gap2-big-array.c: Add novector > pragma. > * gcc.dg/vect/vect-strided-a-u8-i8-gap2.c: Add novector pragma. > * gcc.dg/vect/vect-strided-a-u8-i8-gap7-big-array.c: Add novector > pragma. > * gcc.dg/vect/vect-strided-a-u8-i8-gap7.c: Add novector pragma. > * gcc.dg/vect/vect-strided-float.c: Add novector pragma. 
> * gcc.dg/vect/vect-strided-mult-char-ls.c: Add novector pragma. > * gcc.dg/vect/vect-strided-mult.c: Add novector pragma. > * gcc.dg/vect/vect-strided-same-dr.c: Add novector pragma. > * gcc.dg/vect/vect-strided-shift-1.c: Add novector pragma. > * gcc.dg/vect/vect-strided-store-a-u8-i2.c: Add novector pragma. > * gcc.dg/vect/vect-strided-store-u16-i4.c: Add novector pragma. > * gcc.dg/vect/vect-strided-store-u32-i2.c: Add novector pragma. > * gcc.dg/vect/vect-strided-store.c: Add novector pragma. > * gcc.dg/vect/vect-strided-u16-i2.c: Add novector pragma. > * gcc.dg/vect/vect-strided-u16-i3.c: Add novector pragma. > * gcc.dg/vect/vect-strided-u16-i4.c: Add novector pragma. > * gcc.dg/vect/vect-strided-u32-i4.c: Add novector pragma. > * gcc.dg/vect/vect-strided-u32-i8.c: Add novector pragma. > * gcc.dg/vect/vect-strided-u32-mult.c: Add novector pragma. > * gcc.dg/vect/vect-strided-u8-i2-gap.c: Add novector pragma. > * gcc.dg/vect/vect-strided-u8-i2.c: Add novector pragma. > * gcc.dg/vect/vect-strided-u8-i8-gap2-big-array.c: Add novector > pragma. > * gcc.dg/vect/vect-strided-u8-i8-gap2.c: Add novector pragma. > * gcc.dg/vect/vect-strided-u8-i8-gap4-big-array.c: Add novector > pragma. > * gcc.dg/vect/vect-strided-u8-i8-gap4-unknown.c: Add novector > pragma. > * gcc.dg/vect/vect-strided-u8-i8-gap4.c: Add novector pragma. > * gcc.dg/vect/vect-strided-u8-i8-gap7-big-array.c: Add novector > pragma. > * gcc.dg/vect/vect-strided-u8-i8-gap7.c: Add novector pragma. > * gcc.dg/vect/vect-strided-u8-i8.c: Add novector pragma. > * gcc.dg/vect/vect-vfa-01.c: Add novector pragma. > * gcc.dg/vect/vect-vfa-02.c: Add novector pragma. > * gcc.dg/vect/vect-vfa-03.c: Add novector pragma. > * gcc.dg/vect/vect-vfa-04.c: Add novector pragma. > * gcc.dg/vect/vect-vfa-slp.c: Add novector pragma. > * gcc.dg/vect/vect-widen-mult-1.c: Add novector pragma. > * gcc.dg/vect/vect-widen-mult-const-s16.c: Add novector pragma. > * gcc.dg/vect/vect-widen-mult-const-u16.c: Add novector pragma. 
> * gcc.dg/vect/vect-widen-mult-half-u8.c: Add novector pragma. > * gcc.dg/vect/vect-widen-mult-half.c: Add novector pragma. > * gcc.dg/vect/vect-widen-mult-s16.c: Add novector pragma. > * gcc.dg/vect/vect-widen-mult-s8.c: Add novector pragma. > * gcc.dg/vect/vect-widen-mult-u16.c: Add novector pragma. > * gcc.dg/vect/vect-widen-mult-u8-s16-s32.c: Add novector pragma. > * gcc.dg/vect/vect-widen-mult-u8-u32.c: Add novector pragma. > * gcc.dg/vect/vect-widen-mult-u8.c: Add novector pragma. > * gcc.dg/vect/vect-widen-shift-s16.c: Add novector pragma. > * gcc.dg/vect/vect-widen-shift-s8.c: Add novector pragma. > * gcc.dg/vect/vect-widen-shift-u16.c: Add novector pragma. > * gcc.dg/vect/vect-widen-shift-u8.c: Add novector pragma. > * gcc.dg/vect/wrapv-vect-7.c: Add novector pragma.
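For readers skimming the list above, the kind of change each test receives can be sketched as follows. This is a hypothetical, self-contained test, not one of the files named in the ChangeLog; the names `a`, `b`, `fill`, `check` and `run_test` are made up for illustration. The loop under test is left to the vectorizer, while the checking loop, which contains `abort ()` and is therefore an early-exit loop, is marked with `#pragma GCC novector` so that it stays scalar and can still validate the vectorized result. Compilers that do not know the pragma are expected to ignore it, possibly with a warning.

```c
#include <stdlib.h>

#define N 64

static unsigned a[N];
static unsigned b[N];

/* Loop under test: the vectorizer is free to vectorize this.  */
__attribute__ ((noinline)) void
fill (unsigned x)
{
  for (int i = 0; i < N; i++)
    b[i] = a[i] + x;
}

/* Checking loop: it has an early exit through abort (), so with early
   break vectorization it could itself be vectorized.  The pragma asks
   the compiler to keep it scalar.  */
__attribute__ ((noinline)) int
check (unsigned x)
{
#pragma GCC novector
  for (int i = 0; i < N; i++)
    if (b[i] != a[i] + x)
      abort ();
  return 0;
}

/* Hypothetical driver in the style of a vect testcase.  */
int
run_test (void)
{
  for (int i = 0; i < N; i++)
    a[i] = i;
  fill (3);
  return check (3);
}
```

The pragma applies to the loop that immediately follows it, which is why it is placed directly in front of the checking `for`.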
On Mon, 6 Nov 2023, Tamar Christina wrote:

> Hi All,
>
> This patch adds initial support for early break vectorization in GCC.
> The support is added for any target that implements a vector cbranch optab,
> this includes both fully masked and non-masked targets.
>
> Depending on the operation, the vectorizer may also require support for boolean
> mask reductions using Inclusive OR.  This is however only checked when the
> comparison would produce multiple statements.
>
> Note: I am currently struggling to get patch 7 correct in all cases and could use
> some feedback there.
>
> Concretely the kind of loops supported are of the forms:
>
>  for (int i = 0; i < N; i++)
>  {
>    <statements1>
>    if (<condition>)
>      {
>        ...
>        <action>;
>      }
>    <statements2>
>  }
>
> where <action> can be:
>  - break
>  - return
>  - goto
>
> Any number of statements can be used before the <action> occurs.
>
> Since this is an initial version for GCC 14 it has the following limitations and
> features:
>
> - Only fixed sized iterations and buffers are supported.  That is to say any
>   vectors loaded or stored must be to statically allocated arrays with known
>   sizes.  N must also be known.  This limitation is because our primary target
>   for this optimization is SVE.  For VLA SVE we can't easily do cross page
>   iteration checks.  The result is likely to also not be beneficial.  For that
>   reason we punt support for variable buffers till we have First-Faulting
>   support in GCC.
> - any stores in <statements1> should not be to the same objects as in
>   <condition>.  Loads are fine as long as they don't have the possibility to
>   alias.  More concretely, we block RAW dependencies when the intermediate value
>   can't be separated from the store, or the store itself can't be moved.
> - Prologue peeling, alignment peeling and loop versioning are supported.
> - Fully masked loops, unmasked loops and partially masked loops are supported
> - Any number of loop early exits are supported.
> - No support for epilogue vectorization.  The only epilogue supported is the
>   scalar final one.  Peeling code supports it but the code motion code cannot
>   find instructions to make the move in the epilog.
> - Early breaks are only supported for inner loop vectorization.
>
> I have pushed a branch to refs/users/tnfchris/heads/gcc-14-early-break
>
> With the help of IPA and LTO this still gets hit quite often.  During bootstrap
> it hit rather frequently.  Additionally TSVC s332, s481 and s482 all pass now
> since these are tests for support for early exit vectorization.
>
> This implementation does not support completely handling the early break inside
> the vector loop itself but instead supports adding checks such that if we know
> that we have to exit in the current iteration then we branch to scalar code to
> actually do the final VF iterations which handles all the code in <action>.
>
> For the scalar loop we know that whatever exit you take you have to perform at
> most VF iterations.  For vector code we only care about the state of fully
> performed iteration and reset the scalar code to the (partially) remaining loop.
>
> That is to say, the first vector loop executes so long as the early exit isn't
> needed.  Once the exit is taken, the scalar code will perform at most VF extra
> iterations.  The exact number depending on peeling and iteration start and which
> exit was taken (natural or early).  For this scalar loop, all early exits are
> treated the same.
>
> When we vectorize we move any statement not related to the early break itself
> and that would be incorrect to execute before the break (i.e. has side effects)
> to after the break.  If this is not possible we decline to vectorize.
>
> This means that we check at the start of iterations whether we are going to exit
> or not.  During the analysis phase we check whether we are allowed to do this
> moving of statements.  Also note that we only move the scalar statements, but
> only do so after peeling but just before we start transforming statements.
>
> Codegen:
>
> for e.g.
>
> #define N 803
> unsigned vect_a[N];
> unsigned vect_b[N];
>
> unsigned test4(unsigned x)
> {
>  unsigned ret = 0;
>  for (int i = 0; i < N; i++)
>  {
>    vect_b[i] = x + i;
>    if (vect_a[i] > x)
>      break;
>    vect_a[i] = x;
>
>  }
>  return ret;
> }
>
> We generate for Adv. SIMD:
>
> test4:
>         adrp    x2, .LC0
>         adrp    x3, .LANCHOR0
>         dup     v2.4s, w0
>         add     x3, x3, :lo12:.LANCHOR0
>         movi    v4.4s, 0x4
>         add     x4, x3, 3216
>         ldr     q1, [x2, #:lo12:.LC0]
>         mov     x1, 0
>         mov     w2, 0
>         .p2align 3,,7
> .L3:
>         ldr     q0, [x3, x1]
>         add     v3.4s, v1.4s, v2.4s
>         add     v1.4s, v1.4s, v4.4s
>         cmhi    v0.4s, v0.4s, v2.4s
>         umaxp   v0.4s, v0.4s, v0.4s
>         fmov    x5, d0
>         cbnz    x5, .L6
>         add     w2, w2, 1
>         str     q3, [x1, x4]
>         str     q2, [x3, x1]
>         add     x1, x1, 16
>         cmp     w2, 200
>         bne     .L3
>         mov     w7, 3
> .L2:
>         lsl     w2, w2, 2
>         add     x5, x3, 3216
>         add     w6, w2, w0
>         sxtw    x4, w2
>         ldr     w1, [x3, x4, lsl 2]
>         str     w6, [x5, x4, lsl 2]
>         cmp     w0, w1
>         bcc     .L4
>         add     w1, w2, 1
>         str     w0, [x3, x4, lsl 2]
>         add     w6, w1, w0
>         sxtw    x1, w1
>         ldr     w4, [x3, x1, lsl 2]
>         str     w6, [x5, x1, lsl 2]
>         cmp     w0, w4
>         bcc     .L4
>         add     w4, w2, 2
>         str     w0, [x3, x1, lsl 2]
>         sxtw    x1, w4
>         add     w6, w1, w0
>         ldr     w4, [x3, x1, lsl 2]
>         str     w6, [x5, x1, lsl 2]
>         cmp     w0, w4
>         bcc     .L4
>         str     w0, [x3, x1, lsl 2]
>         add     w2, w2, 3
>         cmp     w7, 3
>         beq     .L4
>         sxtw    x1, w2
>         add     w2, w2, w0
>         ldr     w4, [x3, x1, lsl 2]
>         str     w2, [x5, x1, lsl 2]
>         cmp     w0, w4
>         bcc     .L4
>         str     w0, [x3, x1, lsl 2]
> .L4:
>         mov     w0, 0
>         ret
>         .p2align 2,,3
> .L6:
>         mov     w7, 4
>         b       .L2
>
> and for SVE:
>
> test4:
>         adrp    x2, .LANCHOR0
>         add     x2, x2, :lo12:.LANCHOR0
>         add     x5, x2, 3216
>         mov     x3, 0
>         mov     w1, 0
>         cntw    x4
>         mov     z1.s, w0
>         index   z0.s, #0, #1
>         ptrue   p1.b, all
>         ptrue   p0.s, all
>         .p2align 3,,7
> .L3:
>         ld1w    z2.s, p1/z, [x2, x3, lsl 2]
>         add     z3.s, z0.s, z1.s
>         cmplo   p2.s, p0/z, z1.s, z2.s
>         b.any   .L2
>         st1w    z3.s, p1, [x5, x3, lsl 2]
>         add     w1, w1, 1
>         st1w    z1.s, p1, [x2, x3, lsl 2]
>         add     x3, x3, x4
>         incw    z0.s
>         cmp     w3, 803
>         bls     .L3
> .L5:
>         mov     w0, 0
>         ret
>         .p2align 2,,3
> .L2:
>         cntw    x5
>         mul     w1, w1, w5
>         cbz     w5, .L5
>         sxtw    x1, w1
>         sub     w5, w5, #1
>         add     x5, x5, x1
>         add     x6, x2, 3216
>         b       .L6
>         .p2align 2,,3
> .L14:
>         str     w0, [x2, x1, lsl 2]
>         cmp     x1, x5
>         beq     .L5
>         mov     x1, x4
> .L6:
>         ldr     w3, [x2, x1, lsl 2]
>         add     w4, w0, w1
>         str     w4, [x6, x1, lsl 2]
>         add     x4, x1, 1
>         cmp     w0, w3
>         bcs     .L14
>         mov     w0, 0
>         ret
>
> On the workloads this work is based on we see between 2-3x performance uplift
> using this patch.
>
> Follow up plan:
>  - Boolean vectorization has several shortcomings.  I've filed PR110223 with the
>    bigger ones that cause vectorization to fail with this patch.
>  - SLP support.  This is planned for GCC 15 as for majority of the cases build
>    SLP itself fails.

It would be nice to get at least single-lane SLP support working.  I think
you need to treat the gcond as SLP root stmt and basically do discovery
on the condition as if it were a mask generating condition.

Code generation would then simply schedule the gcond root instances
first (that would get you the code motion automagically).

So, add a new slp_instance_kind, for example slp_inst_kind_early_break,
and record the gcond as root stmt.  Possibly "pattern" recognizing

 gcond <_1 != _2>

as

 _mask = _1 != _2;
 gcond <_mask != 0>

makes the SLP discovery less fiddly (but in theory you can of course
handle gconds directly).

Is there any part of the series that can be pushed independently?  If so
I'll try to look at those parts first.

Thanks,
Richard.
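To make the scheme in the quoted cover letter easier to follow, here is a hand-written C sketch of the control flow it describes for `test4`. It is an illustration only and not what the vectorizer emits: `VF`, `test4_scalar` and `test4_blocked` are assumed names, the inner `for` loops over lanes stand in for vector instructions, and both functions return the exit index instead of `ret` purely so the behaviour can be observed. Each "vector" iteration first checks whether any of the next VF lanes would take the early exit. Only if none would are the stores for the whole block performed; otherwise control passes to a scalar loop, which redoes at most VF iterations starting at the beginning of that block.

```c
#include <assert.h>

#define N 803
#define VF 4 /* assumed vectorization factor (4 x 32-bit lanes) */

unsigned vect_a[N];
unsigned vect_b[N];

/* Reference: the loop from the cover letter, returning the exit index
   (an addition for illustration; the original returns ret).  */
int
test4_scalar (unsigned x)
{
  int i;
  for (i = 0; i < N; i++)
    {
      vect_b[i] = x + i;
      if (vect_a[i] > x)
        break;
      vect_a[i] = x;
    }
  return i;
}

/* Sketch of the transformed loop.  */
int
test4_blocked (unsigned x)
{
  int i = 0;

  /* "Vector" loop: only completes blocks in which no lane exits.  */
  for (; i + VF <= N; i += VF)
    {
      int any = 0;
      for (int l = 0; l < VF; l++) /* vector compare + OR reduction */
        any |= vect_a[i + l] > x;
      if (any)
        break; /* some lane exits: hand the block to the scalar loop */
      for (int l = 0; l < VF; l++) /* stores, moved after the check */
        {
          vect_b[i + l] = x + i + l;
          vect_a[i + l] = x;
        }
    }

  /* Scalar loop: finishes the block that exits early, or the remaining
     N % VF iterations after a natural exit of the loop above.  */
  for (; i < N; i++)
    {
      vect_b[i] = x + i;
      if (vect_a[i] > x)
        break;
      vect_a[i] = x;
    }
  return i;
}
```

With N = 803 and VF = 4 the first loop runs at most 200 full blocks, which matches the `cmp w2, 200` in the Adv. SIMD listing, and the scalar loop handles the remaining 3 elements when no early exit is taken.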
> -----Original Message----- > From: Richard Biener <rguenther@suse.de> > Sent: Monday, November 6, 2023 2:25 PM > To: Tamar Christina <Tamar.Christina@arm.com> > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com> > Subject: Re: [PATCH v6 0/21]middle-end: Support early break/return auto- > vectorization > > On Mon, 6 Nov 2023, Tamar Christina wrote: > > > Hi All, > > > > This patch adds initial support for early break vectorization in GCC. > > The support is added for any target that implements a vector cbranch > > optab, this includes both fully masked and non-masked targets. > > > > Depending on the operation, the vectorizer may also require support > > for boolean mask reductions using Inclusive OR. This is however only > > checked then the comparison would produce multiple statements. > > > > Note: I am currently struggling to get patch 7 correct in all cases and could > use > > some feedback there. > > > > Concretely the kind of loops supported are of the forms: > > > > for (int i = 0; i < N; i++) > > { > > <statements1> > > if (<condition>) > > { > > ... > > <action>; > > } > > <statements2> > > } > > > > where <action> can be: > > - break > > - return > > - goto > > > > Any number of statements can be used before the <action> occurs. > > > > Since this is an initial version for GCC 14 it has the following > > limitations and > > features: > > > > - Only fixed sized iterations and buffers are supported. That is to say any > > vectors loaded or stored must be to statically allocated arrays with known > > sizes. N must also be known. This limitation is because our primary target > > for this optimization is SVE. For VLA SVE we can't easily do cross page > > iteraion checks. The result is likely to also not be beneficial. For that > > reason we punt support for variable buffers till we have First-Faulting > > support in GCC. > > - any stores in <statements1> should not be to the same objects as in > > <condition>. 
Loads are fine as long as they don't have the possibility to > > alias. More concretely, we block RAW dependencies when the intermediate > value > > can't be separated fromt the store, or the store itself can't be moved. > > - Prologue peeling, alignment peelinig and loop versioning are supported. > > - Fully masked loops, unmasked loops and partially masked loops are > > supported > > - Any number of loop early exits are supported. > > - No support for epilogue vectorization. The only epilogue supported is the > > scalar final one. Peeling code supports it but the code motion code cannot > > find instructions to make the move in the epilog. > > - Early breaks are only supported for inner loop vectorization. > > > > I have pushed a branch to refs/users/tnfchris/heads/gcc-14-early-break > > > > With the help of IPA and LTO this still gets hit quite often. During > > bootstrap it hit rather frequently. Additionally TSVC s332, s481 and > > s482 all pass now since these are tests for support for early exit > vectorization. > > > > This implementation does not support completely handling the early > > break inside the vector loop itself but instead supports adding checks > > such that if we know that we have to exit in the current iteration > > then we branch to scalar code to actually do the final VF iterations which > handles all the code in <action>. > > > > For the scalar loop we know that whatever exit you take you have to > > perform at most VF iterations. For vector code we only case about the > > state of fully performed iteration and reset the scalar code to the (partially) > remaining loop. > > > > That is to say, the first vector loop executes so long as the early > > exit isn't needed. Once the exit is taken, the scalar code will > > perform at most VF extra iterations. The exact number depending on peeling > and iteration start and which > > exit was taken (natural or early). For this scalar loop, all early exits are > > treated the same. 
> > > > When we vectorize we move any statement not related to the early break > > itself and that would be incorrect to execute before the break (i.e. > > has side effects) to after the break. If this is not possible we decline to > vectorize. > > > > This means that we check at the start of iterations whether we are > > going to exit or not. During the analyis phase we check whether we > > are allowed to do this moving of statements. Also note that we only > > move the scalar statements, but only do so after peeling but just before we > start transforming statements. > > > > Codegen: > > > > for e.g. > > > > #define N 803 > > unsigned vect_a[N]; > > unsigned vect_b[N]; > > > > unsigned test4(unsigned x) > > { > > unsigned ret = 0; > > for (int i = 0; i < N; i++) > > { > > vect_b[i] = x + i; > > if (vect_a[i] > x) > > break; > > vect_a[i] = x; > > > > } > > return ret; > > } > > > > We generate for Adv. SIMD: > > > > test4: > > adrp x2, .LC0 > > adrp x3, .LANCHOR0 > > dup v2.4s, w0 > > add x3, x3, :lo12:.LANCHOR0 > > movi v4.4s, 0x4 > > add x4, x3, 3216 > > ldr q1, [x2, #:lo12:.LC0] > > mov x1, 0 > > mov w2, 0 > > .p2align 3,,7 > > .L3: > > ldr q0, [x3, x1] > > add v3.4s, v1.4s, v2.4s > > add v1.4s, v1.4s, v4.4s > > cmhi v0.4s, v0.4s, v2.4s > > umaxp v0.4s, v0.4s, v0.4s > > fmov x5, d0 > > cbnz x5, .L6 > > add w2, w2, 1 > > str q3, [x1, x4] > > str q2, [x3, x1] > > add x1, x1, 16 > > cmp w2, 200 > > bne .L3 > > mov w7, 3 > > .L2: > > lsl w2, w2, 2 > > add x5, x3, 3216 > > add w6, w2, w0 > > sxtw x4, w2 > > ldr w1, [x3, x4, lsl 2] > > str w6, [x5, x4, lsl 2] > > cmp w0, w1 > > bcc .L4 > > add w1, w2, 1 > > str w0, [x3, x4, lsl 2] > > add w6, w1, w0 > > sxtw x1, w1 > > ldr w4, [x3, x1, lsl 2] > > str w6, [x5, x1, lsl 2] > > cmp w0, w4 > > bcc .L4 > > add w4, w2, 2 > > str w0, [x3, x1, lsl 2] > > sxtw x1, w4 > > add w6, w1, w0 > > ldr w4, [x3, x1, lsl 2] > > str w6, [x5, x1, lsl 2] > > cmp w0, w4 > > bcc .L4 > > str w0, [x3, x1, lsl 2] > > add w2, w2, 3 > > cmp w7, 3 
> > beq .L4 > > sxtw x1, w2 > > add w2, w2, w0 > > ldr w4, [x3, x1, lsl 2] > > str w2, [x5, x1, lsl 2] > > cmp w0, w4 > > bcc .L4 > > str w0, [x3, x1, lsl 2] > > .L4: > > mov w0, 0 > > ret > > .p2align 2,,3 > > .L6: > > mov w7, 4 > > b .L2 > > > > and for SVE: > > > > test4: > > adrp x2, .LANCHOR0 > > add x2, x2, :lo12:.LANCHOR0 > > add x5, x2, 3216 > > mov x3, 0 > > mov w1, 0 > > cntw x4 > > mov z1.s, w0 > > index z0.s, #0, #1 > > ptrue p1.b, all > > ptrue p0.s, all > > .p2align 3,,7 > > .L3: > > ld1w z2.s, p1/z, [x2, x3, lsl 2] > > add z3.s, z0.s, z1.s > > cmplo p2.s, p0/z, z1.s, z2.s > > b.any .L2 > > st1w z3.s, p1, [x5, x3, lsl 2] > > add w1, w1, 1 > > st1w z1.s, p1, [x2, x3, lsl 2] > > add x3, x3, x4 > > incw z0.s > > cmp w3, 803 > > bls .L3 > > .L5: > > mov w0, 0 > > ret > > .p2align 2,,3 > > .L2: > > cntw x5 > > mul w1, w1, w5 > > cbz w5, .L5 > > sxtw x1, w1 > > sub w5, w5, #1 > > add x5, x5, x1 > > add x6, x2, 3216 > > b .L6 > > .p2align 2,,3 > > .L14: > > str w0, [x2, x1, lsl 2] > > cmp x1, x5 > > beq .L5 > > mov x1, x4 > > .L6: > > ldr w3, [x2, x1, lsl 2] > > add w4, w0, w1 > > str w4, [x6, x1, lsl 2] > > add x4, x1, 1 > > cmp w0, w3 > > bcs .L14 > > mov w0, 0 > > ret > > > > On the workloads this work is based on we see between 2-3x performance > > uplift using this patch. > > > > Follow up plan: > > - Boolean vectorization has several shortcomings. I've filed PR110223 with > the > > bigger ones that cause vectorization to fail with this patch. > > - SLP support. This is planned for GCC 15 as for majority of the cases build > > SLP itself fails. > > It would be nice to get at least single-lane SLP support working. I think you > need to treat the gcond as SLP root stmt and basically do discovery on the > condition as to as if it were a mask generating condition. Hmm ok, will give it a try. > > Code generation would then simply schedule the gcond root instances first > (that would get you the code motion automagically). 
Right, so you're saying treat the gconds as the seed, and stores as a sink.
And then schedule only the instances without a gcond around such that we
can still vectorize in place to get the branches.  Ok, makes sense.

>
> So, add a new slp_instance_kind, for example slp_inst_kind_early_break, and
> record the gcond as root stmt.  Possibly "pattern" recognizing
>
>  gcond <_1 != _2>
>
> as
>
>  _mask = _1 != _2;
>  gcond <_mask != 0>
>
> makes the SLP discovery less fiddly (but in theory you can of course handle
> gconds directly).
>
> Is there any part of the series that can be pushed independently?  If so I'll try to
> look at those parts first.
>

Aside from:

[PATCH 4/21]middle-end: update loop peeling code to maintain LCSSA form for early breaks
[PATCH 7/21]middle-end: update IV update code to support early breaks and arbitrary exits

The rest lie dormant and don't do anything or disrupt the tree until those two are in.
The rest all just touch up different parts piecewise.

They do rely on the new field introduced in:

[PATCH 3/21]middle-end: Implement code motion and dependency analysis for early breaks

But I can split them out.

I'll start respinning #4 and #7 with your latest changes now.

Thanks,
Tamar

> Thanks,
> Richard.
On Mon, 6 Nov 2023, Tamar Christina wrote: > > -----Original Message----- > > From: Richard Biener <rguenther@suse.de> > > Sent: Monday, November 6, 2023 2:25 PM > > To: Tamar Christina <Tamar.Christina@arm.com> > > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com> > > Subject: Re: [PATCH v6 0/21]middle-end: Support early break/return auto- > > vectorization > > > > On Mon, 6 Nov 2023, Tamar Christina wrote: > > > > > Hi All, > > > > > > This patch adds initial support for early break vectorization in GCC. > > > The support is added for any target that implements a vector cbranch > > > optab, this includes both fully masked and non-masked targets. > > > > > > Depending on the operation, the vectorizer may also require support > > > for boolean mask reductions using Inclusive OR. This is however only > > > checked then the comparison would produce multiple statements. > > > > > > Note: I am currently struggling to get patch 7 correct in all cases and could > > use > > > some feedback there. > > > > > > Concretely the kind of loops supported are of the forms: > > > > > > for (int i = 0; i < N; i++) > > > { > > > <statements1> > > > if (<condition>) > > > { > > > ... > > > <action>; > > > } > > > <statements2> > > > } > > > > > > where <action> can be: > > > - break > > > - return > > > - goto > > > > > > Any number of statements can be used before the <action> occurs. > > > > > > Since this is an initial version for GCC 14 it has the following > > > limitations and > > > features: > > > > > > - Only fixed sized iterations and buffers are supported. That is to say any > > > vectors loaded or stored must be to statically allocated arrays with known > > > sizes. N must also be known. This limitation is because our primary target > > > for this optimization is SVE. For VLA SVE we can't easily do cross page > > > iteraion checks. The result is likely to also not be beneficial. 
For that > > > reason we punt support for variable buffers till we have First-Faulting > > > support in GCC. Btw, for this I wonder if you thought about marking memory accesses required for the early break condition as required to be vector-size aligned, thus peeling or versioning them for alignment? That should ensure they do not fault. OTOH I somehow remember prologue peeling isn't supported for early break vectorization? .. > > > - any stores in <statements1> should not be to the same objects as in > > > <condition>. Loads are fine as long as they don't have the possibility to > > > alias. More concretely, we block RAW dependencies when the intermediate > > value > > > can't be separated fromt the store, or the store itself can't be moved. > > > - Prologue peeling, alignment peelinig and loop versioning are supported. .. but here you say it is. Not sure if peeling for alignment works for VLA vectors though. Just to say x86 doesn't support first-faulting loads. > > > - Fully masked loops, unmasked loops and partially masked loops are > > > supported > > > - Any number of loop early exits are supported. > > > - No support for epilogue vectorization. The only epilogue supported is the > > > scalar final one. Peeling code supports it but the code motion code cannot > > > find instructions to make the move in the epilog. > > > - Early breaks are only supported for inner loop vectorization. > > > > > > I have pushed a branch to refs/users/tnfchris/heads/gcc-14-early-break > > > > > > With the help of IPA and LTO this still gets hit quite often. During > > > bootstrap it hit rather frequently. Additionally TSVC s332, s481 and > > > s482 all pass now since these are tests for support for early exit > > vectorization. 
> > > > > > This implementation does not support completely handling the early > > > break inside the vector loop itself but instead supports adding checks > > > such that if we know that we have to exit in the current iteration > > > then we branch to scalar code to actually do the final VF iterations which > > handles all the code in <action>. > > > > > > For the scalar loop we know that whatever exit you take you have to > > > perform at most VF iterations. For vector code we only case about the > > > state of fully performed iteration and reset the scalar code to the (partially) > > remaining loop. > > > > > > That is to say, the first vector loop executes so long as the early > > > exit isn't needed. Once the exit is taken, the scalar code will > > > perform at most VF extra iterations. The exact number depending on peeling > > and iteration start and which > > > exit was taken (natural or early). For this scalar loop, all early exits are > > > treated the same. > > > > > > When we vectorize we move any statement not related to the early break > > > itself and that would be incorrect to execute before the break (i.e. > > > has side effects) to after the break. If this is not possible we decline to > > vectorize. > > > > > > This means that we check at the start of iterations whether we are > > > going to exit or not. During the analyis phase we check whether we > > > are allowed to do this moving of statements. Also note that we only > > > move the scalar statements, but only do so after peeling but just before we > > start transforming statements. > > > > > > Codegen: > > > > > > for e.g. > > > > > > #define N 803 > > > unsigned vect_a[N]; > > > unsigned vect_b[N]; > > > > > > unsigned test4(unsigned x) > > > { > > > unsigned ret = 0; > > > for (int i = 0; i < N; i++) > > > { > > > vect_b[i] = x + i; > > > if (vect_a[i] > x) > > > break; > > > vect_a[i] = x; > > > > > > } > > > return ret; > > > } > > > > > > We generate for Adv. 
SIMD: > > > > > > test4: > > > adrp x2, .LC0 > > > adrp x3, .LANCHOR0 > > > dup v2.4s, w0 > > > add x3, x3, :lo12:.LANCHOR0 > > > movi v4.4s, 0x4 > > > add x4, x3, 3216 > > > ldr q1, [x2, #:lo12:.LC0] > > > mov x1, 0 > > > mov w2, 0 > > > .p2align 3,,7 > > > .L3: > > > ldr q0, [x3, x1] > > > add v3.4s, v1.4s, v2.4s > > > add v1.4s, v1.4s, v4.4s > > > cmhi v0.4s, v0.4s, v2.4s > > > umaxp v0.4s, v0.4s, v0.4s > > > fmov x5, d0 > > > cbnz x5, .L6 > > > add w2, w2, 1 > > > str q3, [x1, x4] > > > str q2, [x3, x1] > > > add x1, x1, 16 > > > cmp w2, 200 > > > bne .L3 > > > mov w7, 3 > > > .L2: > > > lsl w2, w2, 2 > > > add x5, x3, 3216 > > > add w6, w2, w0 > > > sxtw x4, w2 > > > ldr w1, [x3, x4, lsl 2] > > > str w6, [x5, x4, lsl 2] > > > cmp w0, w1 > > > bcc .L4 > > > add w1, w2, 1 > > > str w0, [x3, x4, lsl 2] > > > add w6, w1, w0 > > > sxtw x1, w1 > > > ldr w4, [x3, x1, lsl 2] > > > str w6, [x5, x1, lsl 2] > > > cmp w0, w4 > > > bcc .L4 > > > add w4, w2, 2 > > > str w0, [x3, x1, lsl 2] > > > sxtw x1, w4 > > > add w6, w1, w0 > > > ldr w4, [x3, x1, lsl 2] > > > str w6, [x5, x1, lsl 2] > > > cmp w0, w4 > > > bcc .L4 > > > str w0, [x3, x1, lsl 2] > > > add w2, w2, 3 > > > cmp w7, 3 > > > beq .L4 > > > sxtw x1, w2 > > > add w2, w2, w0 > > > ldr w4, [x3, x1, lsl 2] > > > str w2, [x5, x1, lsl 2] > > > cmp w0, w4 > > > bcc .L4 > > > str w0, [x3, x1, lsl 2] > > > .L4: > > > mov w0, 0 > > > ret > > > .p2align 2,,3 > > > .L6: > > > mov w7, 4 > > > b .L2 > > > > > > and for SVE: > > > > > > test4: > > > adrp x2, .LANCHOR0 > > > add x2, x2, :lo12:.LANCHOR0 > > > add x5, x2, 3216 > > > mov x3, 0 > > > mov w1, 0 > > > cntw x4 > > > mov z1.s, w0 > > > index z0.s, #0, #1 > > > ptrue p1.b, all > > > ptrue p0.s, all > > > .p2align 3,,7 > > > .L3: > > > ld1w z2.s, p1/z, [x2, x3, lsl 2] > > > add z3.s, z0.s, z1.s > > > cmplo p2.s, p0/z, z1.s, z2.s > > > b.any .L2 > > > st1w z3.s, p1, [x5, x3, lsl 2] > > > add w1, w1, 1 > > > st1w z1.s, p1, [x2, x3, lsl 2] > > > add x3, x3, x4 > > > incw 
z0.s > > > cmp w3, 803 > > > bls .L3 > > > .L5: > > > mov w0, 0 > > > ret > > > .p2align 2,,3 > > > .L2: > > > cntw x5 > > > mul w1, w1, w5 > > > cbz w5, .L5 > > > sxtw x1, w1 > > > sub w5, w5, #1 > > > add x5, x5, x1 > > > add x6, x2, 3216 > > > b .L6 > > > .p2align 2,,3 > > > .L14: > > > str w0, [x2, x1, lsl 2] > > > cmp x1, x5 > > > beq .L5 > > > mov x1, x4 > > > .L6: > > > ldr w3, [x2, x1, lsl 2] > > > add w4, w0, w1 > > > str w4, [x6, x1, lsl 2] > > > add x4, x1, 1 > > > cmp w0, w3 > > > bcs .L14 > > > mov w0, 0 > > > ret > > > > > > On the workloads this work is based on we see between 2-3x performance > > > uplift using this patch. > > > > > > Follow up plan: > > > - Boolean vectorization has several shortcomings. I've filed PR110223 with > > the > > > bigger ones that cause vectorization to fail with this patch. > > > - SLP support. This is planned for GCC 15 as for majority of the cases build > > > SLP itself fails. > > > > It would be nice to get at least single-lane SLP support working. I think you > > need to treat the gcond as SLP root stmt and basically do discovery on the > > condition as to as if it were a mask generating condition. > > Hmm ok, will give it a try. > > > > > Code generation would then simply schedule the gcond root instances first > > (that would get you the code motion automagically). > > Right, so you're saying treat the gcond's as the seed, and stores as a sink. > And then schedule only the instances without a gcond around such that we > can still vectorize in place to get the branches. Ok, makes sense. > > > > > So, add a new slp_instance_kind, for example slp_inst_kind_early_break, and > > record the gcond as root stmt. Possibly "pattern" recognizing > > > > gcond <_1 != _2> > > > > as > > > > _mask = _1 != _2; > > gcond <_mask != 0> > > > > makes the SLP discovery less fiddly (but in theory you can of course handle > > gconds directly). > > > > Is there any part of the series that can be pushed independelty? 
If so I'll try to > > look at those parts first. > > > > Aside from: > > [PATCH 4/21]middle-end: update loop peeling code to maintain LCSSA form for early breaks > [PATCH 7/21]middle-end: update IV update code to support early breaks and arbitrary exits > > The rest lie dormant and don't do anything or disrupt the tree until those two are in. > The rest all just touch up different parts piecewise. > > They do rely on the new field introduced in: > > [PATCH 3/21]middle-end: Implement code motion and dependency analysis for early breaks > > But can split them out. > > I'll start respinning no #4 and #7 with your latest changes now. OK, I'll simply go 1-n then. Richard. > Thanks, > Tamar > > > Thanks, > > Richard.
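The alignment idea raised in this reply can be illustrated with a small arithmetic sketch. It rests on assumptions not stated in the thread: a 16-byte vector (`VEC_BYTES`), a 4096-byte page (`PAGE_BYTES`) that is a multiple of the vector size, and the hypothetical helper names `peel_count` and `crosses_page`. The reasoning, as I understand it, is that a vector-size aligned load never straddles a page boundary, so if the scalar loop would have read the first element of the vector, the rest of the vector sits on the same, already accessible page.

```c
#include <assert.h>
#include <stdint.h>

#define VEC_BYTES 16u    /* assumed vector size in bytes */
#define PAGE_BYTES 4096u /* assumed page size, a multiple of VEC_BYTES */

/* Number of scalar iterations to peel so that the next access to an
   element of ELT_BYTES bytes at ADDR is VEC_BYTES aligned.  Assumes ADDR
   is already a multiple of ELT_BYTES.  */
unsigned
peel_count (uintptr_t addr, unsigned elt_bytes)
{
  unsigned misalign = (unsigned) (addr % VEC_BYTES);
  return misalign ? (VEC_BYTES - misalign) / elt_bytes : 0;
}

/* Nonzero if a VEC_BYTES wide access starting at ADDR touches two pages.  */
int
crosses_page (uintptr_t addr)
{
  return addr / PAGE_BYTES != (addr + VEC_BYTES - 1) / PAGE_BYTES;
}
```

For example, a 4-byte element array starting at address 0x1008 needs two peeled scalar iterations before the accesses are 16-byte aligned, while an unaligned 16-byte access at 0x1ff8 would spill into the next page.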
> -----Original Message----- > From: Richard Biener <rguenther@suse.de> > Sent: Tuesday, November 7, 2023 9:43 AM > To: Tamar Christina <Tamar.Christina@arm.com> > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com> > Subject: RE: [PATCH v6 0/21]middle-end: Support early break/return auto- > vectorization > > On Mon, 6 Nov 2023, Tamar Christina wrote: > > > > -----Original Message----- > > > From: Richard Biener <rguenther@suse.de> > > > Sent: Monday, November 6, 2023 2:25 PM > > > To: Tamar Christina <Tamar.Christina@arm.com> > > > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com> > > > Subject: Re: [PATCH v6 0/21]middle-end: Support early break/return > > > auto- vectorization > > > > > > On Mon, 6 Nov 2023, Tamar Christina wrote: > > > > > > > Hi All, > > > > > > > > This patch adds initial support for early break vectorization in GCC. > > > > The support is added for any target that implements a vector > > > > cbranch optab, this includes both fully masked and non-masked targets. > > > > > > > > Depending on the operation, the vectorizer may also require > > > > support for boolean mask reductions using Inclusive OR. This is > > > > however only checked then the comparison would produce multiple > statements. > > > > > > > > Note: I am currently struggling to get patch 7 correct in all > > > > cases and could > > > use > > > > some feedback there. > > > > > > > > Concretely the kind of loops supported are of the forms: > > > > > > > > for (int i = 0; i < N; i++) > > > > { > > > > <statements1> > > > > if (<condition>) > > > > { > > > > ... > > > > <action>; > > > > } > > > > <statements2> > > > > } > > > > > > > > where <action> can be: > > > > - break > > > > - return > > > > - goto > > > > > > > > Any number of statements can be used before the <action> occurs. > > > > > > > > Since this is an initial version for GCC 14 it has the following > > > > limitations and > > > > features: > > > > > > > > - Only fixed sized iterations and buffers are supported. 
That is to say any > > > > vectors loaded or stored must be to statically allocated arrays with > known > > > > sizes. N must also be known. This limitation is because our primary > target > > > > for this optimization is SVE. For VLA SVE we can't easily do cross page > > > > iteraion checks. The result is likely to also not be beneficial. For that > > > > reason we punt support for variable buffers till we have First-Faulting > > > > support in GCC. > > Btw, for this I wonder if you thought about marking memory accesses required > for the early break condition as required to be vector-size aligned, thus peeling > or versioning them for alignment? That should ensure they do not fault. > > OTOH I somehow remember prologue peeling isn't supported for early break > vectorization? .. > > > > > - any stores in <statements1> should not be to the same objects as in > > > > <condition>. Loads are fine as long as they don't have the possibility to > > > > alias. More concretely, we block RAW dependencies when the > > > > intermediate > > > value > > > > can't be separated fromt the store, or the store itself can't be moved. > > > > - Prologue peeling, alignment peelinig and loop versioning are supported. > > .. but here you say it is. Not sure if peeling for alignment works for VLA vectors > though. Just to say x86 doesn't support first-faulting loads. For VLA we support it through masking. i.e. if you need to peel N iterations, we generate a masked copy of the loop vectorized which masks off the first N bits. This is not typically needed, but we do support it. But the problem with this scheme and early break is obviously that the peeled loop needs to be vectorized so you kinda end up with the same issue again. So Atm it rejects it for VLA. Regards, Tamar > > > > > - Fully masked loops, unmasked loops and partially masked loops > > > > are supported > > > > - Any number of loop early exits are supported. > > > > - No support for epilogue vectorization. 
The only epilogue supported is > the > > > > scalar final one. Peeling code supports it but the code motion code > cannot > > > > find instructions to make the move in the epilog. > > > > - Early breaks are only supported for inner loop vectorization. > > > > > > > > I have pushed a branch to > > > > refs/users/tnfchris/heads/gcc-14-early-break > > > > > > > > With the help of IPA and LTO this still gets hit quite often. > > > > During bootstrap it hit rather frequently. Additionally TSVC > > > > s332, s481 and > > > > s482 all pass now since these are tests for support for early exit > > > vectorization. > > > > > > > > This implementation does not support completely handling the early > > > > break inside the vector loop itself but instead supports adding > > > > checks such that if we know that we have to exit in the current > > > > iteration then we branch to scalar code to actually do the final > > > > VF iterations which > > > handles all the code in <action>. > > > > > > > > For the scalar loop we know that whatever exit you take you have > > > > to perform at most VF iterations. For vector code we only case > > > > about the state of fully performed iteration and reset the scalar > > > > code to the (partially) > > > remaining loop. > > > > > > > > That is to say, the first vector loop executes so long as the > > > > early exit isn't needed. Once the exit is taken, the scalar code > > > > will perform at most VF extra iterations. The exact number > > > > depending on peeling > > > and iteration start and which > > > > exit was taken (natural or early). For this scalar loop, all early exits are > > > > treated the same. > > > > > > > > When we vectorize we move any statement not related to the early > > > > break itself and that would be incorrect to execute before the break (i.e. > > > > has side effects) to after the break. If this is not possible we > > > > decline to > > > vectorize. 
> > > > > > > > This means that we check at the start of iterations whether we are > > > > going to exit or not. During the analyis phase we check whether > > > > we are allowed to do this moving of statements. Also note that we > > > > only move the scalar statements, but only do so after peeling but > > > > just before we > > > start transforming statements. > > > > > > > > Codegen: > > > > > > > > for e.g. > > > > > > > > #define N 803 > > > > unsigned vect_a[N]; > > > > unsigned vect_b[N]; > > > > > > > > unsigned test4(unsigned x) > > > > { > > > > unsigned ret = 0; > > > > for (int i = 0; i < N; i++) > > > > { > > > > vect_b[i] = x + i; > > > > if (vect_a[i] > x) > > > > break; > > > > vect_a[i] = x; > > > > > > > > } > > > > return ret; > > > > } > > > > > > > > We generate for Adv. SIMD: > > > > > > > > test4: > > > > adrp x2, .LC0 > > > > adrp x3, .LANCHOR0 > > > > dup v2.4s, w0 > > > > add x3, x3, :lo12:.LANCHOR0 > > > > movi v4.4s, 0x4 > > > > add x4, x3, 3216 > > > > ldr q1, [x2, #:lo12:.LC0] > > > > mov x1, 0 > > > > mov w2, 0 > > > > .p2align 3,,7 > > > > .L3: > > > > ldr q0, [x3, x1] > > > > add v3.4s, v1.4s, v2.4s > > > > add v1.4s, v1.4s, v4.4s > > > > cmhi v0.4s, v0.4s, v2.4s > > > > umaxp v0.4s, v0.4s, v0.4s > > > > fmov x5, d0 > > > > cbnz x5, .L6 > > > > add w2, w2, 1 > > > > str q3, [x1, x4] > > > > str q2, [x3, x1] > > > > add x1, x1, 16 > > > > cmp w2, 200 > > > > bne .L3 > > > > mov w7, 3 > > > > .L2: > > > > lsl w2, w2, 2 > > > > add x5, x3, 3216 > > > > add w6, w2, w0 > > > > sxtw x4, w2 > > > > ldr w1, [x3, x4, lsl 2] > > > > str w6, [x5, x4, lsl 2] > > > > cmp w0, w1 > > > > bcc .L4 > > > > add w1, w2, 1 > > > > str w0, [x3, x4, lsl 2] > > > > add w6, w1, w0 > > > > sxtw x1, w1 > > > > ldr w4, [x3, x1, lsl 2] > > > > str w6, [x5, x1, lsl 2] > > > > cmp w0, w4 > > > > bcc .L4 > > > > add w4, w2, 2 > > > > str w0, [x3, x1, lsl 2] > > > > sxtw x1, w4 > > > > add w6, w1, w0 > > > > ldr w4, [x3, x1, lsl 2] > > > > str w6, [x5, x1, lsl 2] > > 
> > cmp w0, w4 > > > > bcc .L4 > > > > str w0, [x3, x1, lsl 2] > > > > add w2, w2, 3 > > > > cmp w7, 3 > > > > beq .L4 > > > > sxtw x1, w2 > > > > add w2, w2, w0 > > > > ldr w4, [x3, x1, lsl 2] > > > > str w2, [x5, x1, lsl 2] > > > > cmp w0, w4 > > > > bcc .L4 > > > > str w0, [x3, x1, lsl 2] > > > > .L4: > > > > mov w0, 0 > > > > ret > > > > .p2align 2,,3 > > > > .L6: > > > > mov w7, 4 > > > > b .L2 > > > > > > > > and for SVE: > > > > > > > > test4: > > > > adrp x2, .LANCHOR0 > > > > add x2, x2, :lo12:.LANCHOR0 > > > > add x5, x2, 3216 > > > > mov x3, 0 > > > > mov w1, 0 > > > > cntw x4 > > > > mov z1.s, w0 > > > > index z0.s, #0, #1 > > > > ptrue p1.b, all > > > > ptrue p0.s, all > > > > .p2align 3,,7 > > > > .L3: > > > > ld1w z2.s, p1/z, [x2, x3, lsl 2] > > > > add z3.s, z0.s, z1.s > > > > cmplo p2.s, p0/z, z1.s, z2.s > > > > b.any .L2 > > > > st1w z3.s, p1, [x5, x3, lsl 2] > > > > add w1, w1, 1 > > > > st1w z1.s, p1, [x2, x3, lsl 2] > > > > add x3, x3, x4 > > > > incw z0.s > > > > cmp w3, 803 > > > > bls .L3 > > > > .L5: > > > > mov w0, 0 > > > > ret > > > > .p2align 2,,3 > > > > .L2: > > > > cntw x5 > > > > mul w1, w1, w5 > > > > cbz w5, .L5 > > > > sxtw x1, w1 > > > > sub w5, w5, #1 > > > > add x5, x5, x1 > > > > add x6, x2, 3216 > > > > b .L6 > > > > .p2align 2,,3 > > > > .L14: > > > > str w0, [x2, x1, lsl 2] > > > > cmp x1, x5 > > > > beq .L5 > > > > mov x1, x4 > > > > .L6: > > > > ldr w3, [x2, x1, lsl 2] > > > > add w4, w0, w1 > > > > str w4, [x6, x1, lsl 2] > > > > add x4, x1, 1 > > > > cmp w0, w3 > > > > bcs .L14 > > > > mov w0, 0 > > > > ret > > > > > > > > On the workloads this work is based on we see between 2-3x > > > > performance uplift using this patch. > > > > > > > > Follow up plan: > > > > - Boolean vectorization has several shortcomings. I've filed > > > > PR110223 with > > > the > > > > bigger ones that cause vectorization to fail with this patch. > > > > - SLP support. 
This is planned for GCC 15 as for majority of the cases > build > > > > SLP itself fails. > > > > > > It would be nice to get at least single-lane SLP support working. I > > > think you need to treat the gcond as SLP root stmt and basically do > > > discovery on the condition as to as if it were a mask generating condition. > > > > Hmm ok, will give it a try. > > > > > > > > Code generation would then simply schedule the gcond root instances > > > first (that would get you the code motion automagically). > > > > Right, so you're saying treat the gcond's as the seed, and stores as a sink. > > And then schedule only the instances without a gcond around such that > > we can still vectorize in place to get the branches. Ok, makes sense. > > > > > > > > So, add a new slp_instance_kind, for example > > > slp_inst_kind_early_break, and record the gcond as root stmt. > > > Possibly "pattern" recognizing > > > > > > gcond <_1 != _2> > > > > > > as > > > > > > _mask = _1 != _2; > > > gcond <_mask != 0> > > > > > > makes the SLP discovery less fiddly (but in theory you can of course > > > handle gconds directly). > > > > > > Is there any part of the series that can be pushed independelty? If > > > so I'll try to look at those parts first. > > > > > > > Aside from: > > > > [PATCH 4/21]middle-end: update loop peeling code to maintain LCSSA > > form for early breaks [PATCH 7/21]middle-end: update IV update code to > > support early breaks and arbitrary exits > > > > The rest lie dormant and don't do anything or disrupt the tree until those > two are in. > > The rest all just touch up different parts piecewise. > > > > They do rely on the new field introduced in: > > > > [PATCH 3/21]middle-end: Implement code motion and dependency analysis > > for early breaks > > > > But can split them out. > > > > I'll start respinning no #4 and #7 with your latest changes now. > > OK, I'll simply go 1-n then. > > Richard. > > > Thanks, > > Tamar > > > > > Thanks, > > > Richard.
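Editorial illustration of the scheme described in the cover letter above (not code from the patch series): the vector loop first checks the break condition for a whole block of VF elements, performs the stores only when no lane exits, and otherwise hands the untouched block to scalar code, which then runs at most VF iterations. This is a minimal sketch in plain C, assuming a fixed VF of 4; the names `scalar_loop` and `early_break_blocked` are hypothetical, and the inner lane loops stand in for vector instructions.

```c
enum { N = 803, VF = 4 };  /* VF = 4 is an assumed vectorization factor */

/* Reference scalar loop from the cover letter, restricted to
   [start, limit).  Returns the index at which it stopped.  */
int scalar_loop (unsigned *a, unsigned *b, unsigned x, int start, int limit)
{
  int i;
  for (i = start; i < limit; i++)
    {
      b[i] = x + i;
      if (a[i] > x)
        break;
      a[i] = x;
    }
  return i;
}

/* Emulation of the early-break scheme: check first, store afterwards,
   and fall back to scalar code for the block that contains the exit.  */
int early_break_blocked (unsigned *a, unsigned *b, unsigned x)
{
  int i = 0;
  for (; i + VF <= N; i += VF)
    {
      /* Evaluate the break condition for all VF lanes (loads only).  */
      int any = 0;
      for (int l = 0; l < VF; l++)
        any |= a[i + l] > x;
      if (any)
        break;  /* Leave this block untouched for the scalar code.  */

      /* No lane exits, so the stores moved below the check are safe.  */
      for (int l = 0; l < VF; l++)
        {
          b[i + l] = x + (i + l);
          a[i + l] = x;
        }
    }

  /* Scalar code: at most VF iterations, for either kind of exit.  */
  int limit = i + VF < N ? i + VF : N;
  return scalar_loop (a, b, x, i, limit);
}
```

With N = 803 and VF = 4 the blocked loop covers 200 full blocks and leaves 3 iterations to the scalar code on the natural exit, which matches the `cmp w2, 200` and `mov w7, 3` in the Adv. SIMD listing.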
On Tue, 7 Nov 2023, Tamar Christina wrote: > > -----Original Message----- > > From: Richard Biener <rguenther@suse.de> > > Sent: Tuesday, November 7, 2023 9:43 AM > > To: Tamar Christina <Tamar.Christina@arm.com> > > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com> > > Subject: RE: [PATCH v6 0/21]middle-end: Support early break/return auto- > > vectorization > > > > On Mon, 6 Nov 2023, Tamar Christina wrote: > > > > > > -----Original Message----- > > > > From: Richard Biener <rguenther@suse.de> > > > > Sent: Monday, November 6, 2023 2:25 PM > > > > To: Tamar Christina <Tamar.Christina@arm.com> > > > > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com> > > > > Subject: Re: [PATCH v6 0/21]middle-end: Support early break/return > > > > auto- vectorization > > > > > > > > On Mon, 6 Nov 2023, Tamar Christina wrote: > > > > > > > > > Hi All, > > > > > > > > > > This patch adds initial support for early break vectorization in GCC. > > > > > The support is added for any target that implements a vector > > > > > cbranch optab, this includes both fully masked and non-masked targets. > > > > > > > > > > Depending on the operation, the vectorizer may also require > > > > > support for boolean mask reductions using Inclusive OR. This is > > > > > however only checked then the comparison would produce multiple > > statements. > > > > > > > > > > Note: I am currently struggling to get patch 7 correct in all > > > > > cases and could > > > > use > > > > > some feedback there. > > > > > > > > > > Concretely the kind of loops supported are of the forms: > > > > > > > > > > for (int i = 0; i < N; i++) > > > > > { > > > > > <statements1> > > > > > if (<condition>) > > > > > { > > > > > ... > > > > > <action>; > > > > > } > > > > > <statements2> > > > > > } > > > > > > > > > > where <action> can be: > > > > > - break > > > > > - return > > > > > - goto > > > > > > > > > > Any number of statements can be used before the <action> occurs. 
> > > > > > > > > > Since this is an initial version for GCC 14 it has the following > > > > > limitations and > > > > > features: > > > > > > > > > > - Only fixed sized iterations and buffers are supported. That is to say any > > > > > vectors loaded or stored must be to statically allocated arrays with > > known > > > > > sizes. N must also be known. This limitation is because our primary > > target > > > > > for this optimization is SVE. For VLA SVE we can't easily do cross page > > > > > iteraion checks. The result is likely to also not be beneficial. For that > > > > > reason we punt support for variable buffers till we have First-Faulting > > > > > support in GCC. > > > > Btw, for this I wonder if you thought about marking memory accesses required > > for the early break condition as required to be vector-size aligned, thus peeling > > or versioning them for alignment? That should ensure they do not fault. > > > > OTOH I somehow remember prologue peeling isn't supported for early break > > vectorization? .. > > > > > > > - any stores in <statements1> should not be to the same objects as in > > > > > <condition>. Loads are fine as long as they don't have the possibility to > > > > > alias. More concretely, we block RAW dependencies when the > > > > > intermediate > > > > value > > > > > can't be separated fromt the store, or the store itself can't be moved. > > > > > - Prologue peeling, alignment peelinig and loop versioning are supported. > > > > .. but here you say it is. Not sure if peeling for alignment works for VLA vectors > > though. Just to say x86 doesn't support first-faulting loads. > > For VLA we support it through masking. i.e. if you need to peel N iterations, we > generate a masked copy of the loop vectorized which masks off the first N bits. > > This is not typically needed, but we do support it. 
> But the problem with this
> scheme and early break is obviously that the peeled loop needs to be vectorized
> so you kinda end up with the same issue again. So Atm it rejects it for VLA.

Hmm, I see.  I thought peeling by masking is an optimization.
Anyhow, I think it should still work here - since all accesses are aligned
and we know that there's at least one original scalar iteration in the
first masked and the following "unmasked" vector iterations there
should never be faults for any of the aligned accesses.

I think going via alignment is a way easier method to guarantee this
than handwaving about "declared" arrays and niter.  One can try that
in addition of course - it's not always possible to align all
vector loads we are going to speculate (for VLA one could also
find common runtime (mis-)alignment and restrict the vector length based
on that, for RISC-V it seems to be efficient, not sure whether altering
that for SVE is though).

Richard.

> Regards,
> Tamar
>
> >
> > > > > - Fully masked loops, unmasked loops and partially masked loops
> > > > > are supported
> > > > > - Any number of loop early exits are supported.
> > > > > - No support for epilogue vectorization. The only epilogue supported is the
> > > > > scalar final one. Peeling code supports it but the code motion code cannot
> > > > > find instructions to make the move in the epilog.
> > > > > - Early breaks are only supported for inner loop vectorization.
> > > > >
> > > > > I have pushed a branch to
> > > > > refs/users/tnfchris/heads/gcc-14-early-break
> > > > >
> > > > > With the help of IPA and LTO this still gets hit quite often.
> > > > > During bootstrap it hit rather frequently. Additionally TSVC
> > > > > s332, s481 and
> > > > > s482 all pass now since these are tests for support for early exit
> > > > vectorization.
> > > > > > > > > > This implementation does not support completely handling the early > > > > > break inside the vector loop itself but instead supports adding > > > > > checks such that if we know that we have to exit in the current > > > > > iteration then we branch to scalar code to actually do the final > > > > > VF iterations which > > > > handles all the code in <action>. > > > > > > > > > > For the scalar loop we know that whatever exit you take you have > > > > > to perform at most VF iterations. For vector code we only case > > > > > about the state of fully performed iteration and reset the scalar > > > > > code to the (partially) > > > > remaining loop. > > > > > > > > > > That is to say, the first vector loop executes so long as the > > > > > early exit isn't needed. Once the exit is taken, the scalar code > > > > > will perform at most VF extra iterations. The exact number > > > > > depending on peeling > > > > and iteration start and which > > > > > exit was taken (natural or early). For this scalar loop, all early exits are > > > > > treated the same. > > > > > > > > > > When we vectorize we move any statement not related to the early > > > > > break itself and that would be incorrect to execute before the break (i.e. > > > > > has side effects) to after the break. If this is not possible we > > > > > decline to > > > > vectorize. > > > > > > > > > > This means that we check at the start of iterations whether we are > > > > > going to exit or not. During the analyis phase we check whether > > > > > we are allowed to do this moving of statements. Also note that we > > > > > only move the scalar statements, but only do so after peeling but > > > > > just before we > > > > start transforming statements. > > > > > > > > > > Codegen: > > > > > > > > > > for e.g. 
> > > > > > > > > > #define N 803 > > > > > unsigned vect_a[N]; > > > > > unsigned vect_b[N]; > > > > > > > > > > unsigned test4(unsigned x) > > > > > { > > > > > unsigned ret = 0; > > > > > for (int i = 0; i < N; i++) > > > > > { > > > > > vect_b[i] = x + i; > > > > > if (vect_a[i] > x) > > > > > break; > > > > > vect_a[i] = x; > > > > > > > > > > } > > > > > return ret; > > > > > } > > > > > > > > > > We generate for Adv. SIMD: > > > > > > > > > > test4: > > > > > adrp x2, .LC0 > > > > > adrp x3, .LANCHOR0 > > > > > dup v2.4s, w0 > > > > > add x3, x3, :lo12:.LANCHOR0 > > > > > movi v4.4s, 0x4 > > > > > add x4, x3, 3216 > > > > > ldr q1, [x2, #:lo12:.LC0] > > > > > mov x1, 0 > > > > > mov w2, 0 > > > > > .p2align 3,,7 > > > > > .L3: > > > > > ldr q0, [x3, x1] > > > > > add v3.4s, v1.4s, v2.4s > > > > > add v1.4s, v1.4s, v4.4s > > > > > cmhi v0.4s, v0.4s, v2.4s > > > > > umaxp v0.4s, v0.4s, v0.4s > > > > > fmov x5, d0 > > > > > cbnz x5, .L6 > > > > > add w2, w2, 1 > > > > > str q3, [x1, x4] > > > > > str q2, [x3, x1] > > > > > add x1, x1, 16 > > > > > cmp w2, 200 > > > > > bne .L3 > > > > > mov w7, 3 > > > > > .L2: > > > > > lsl w2, w2, 2 > > > > > add x5, x3, 3216 > > > > > add w6, w2, w0 > > > > > sxtw x4, w2 > > > > > ldr w1, [x3, x4, lsl 2] > > > > > str w6, [x5, x4, lsl 2] > > > > > cmp w0, w1 > > > > > bcc .L4 > > > > > add w1, w2, 1 > > > > > str w0, [x3, x4, lsl 2] > > > > > add w6, w1, w0 > > > > > sxtw x1, w1 > > > > > ldr w4, [x3, x1, lsl 2] > > > > > str w6, [x5, x1, lsl 2] > > > > > cmp w0, w4 > > > > > bcc .L4 > > > > > add w4, w2, 2 > > > > > str w0, [x3, x1, lsl 2] > > > > > sxtw x1, w4 > > > > > add w6, w1, w0 > > > > > ldr w4, [x3, x1, lsl 2] > > > > > str w6, [x5, x1, lsl 2] > > > > > cmp w0, w4 > > > > > bcc .L4 > > > > > str w0, [x3, x1, lsl 2] > > > > > add w2, w2, 3 > > > > > cmp w7, 3 > > > > > beq .L4 > > > > > sxtw x1, w2 > > > > > add w2, w2, w0 > > > > > ldr w4, [x3, x1, lsl 2] > > > > > str w2, [x5, x1, lsl 2] > > > > > cmp w0, w4 > > > 
> > bcc .L4 > > > > > str w0, [x3, x1, lsl 2] > > > > > .L4: > > > > > mov w0, 0 > > > > > ret > > > > > .p2align 2,,3 > > > > > .L6: > > > > > mov w7, 4 > > > > > b .L2 > > > > > > > > > > and for SVE: > > > > > > > > > > test4: > > > > > adrp x2, .LANCHOR0 > > > > > add x2, x2, :lo12:.LANCHOR0 > > > > > add x5, x2, 3216 > > > > > mov x3, 0 > > > > > mov w1, 0 > > > > > cntw x4 > > > > > mov z1.s, w0 > > > > > index z0.s, #0, #1 > > > > > ptrue p1.b, all > > > > > ptrue p0.s, all > > > > > .p2align 3,,7 > > > > > .L3: > > > > > ld1w z2.s, p1/z, [x2, x3, lsl 2] > > > > > add z3.s, z0.s, z1.s > > > > > cmplo p2.s, p0/z, z1.s, z2.s > > > > > b.any .L2 > > > > > st1w z3.s, p1, [x5, x3, lsl 2] > > > > > add w1, w1, 1 > > > > > st1w z1.s, p1, [x2, x3, lsl 2] > > > > > add x3, x3, x4 > > > > > incw z0.s > > > > > cmp w3, 803 > > > > > bls .L3 > > > > > .L5: > > > > > mov w0, 0 > > > > > ret > > > > > .p2align 2,,3 > > > > > .L2: > > > > > cntw x5 > > > > > mul w1, w1, w5 > > > > > cbz w5, .L5 > > > > > sxtw x1, w1 > > > > > sub w5, w5, #1 > > > > > add x5, x5, x1 > > > > > add x6, x2, 3216 > > > > > b .L6 > > > > > .p2align 2,,3 > > > > > .L14: > > > > > str w0, [x2, x1, lsl 2] > > > > > cmp x1, x5 > > > > > beq .L5 > > > > > mov x1, x4 > > > > > .L6: > > > > > ldr w3, [x2, x1, lsl 2] > > > > > add w4, w0, w1 > > > > > str w4, [x6, x1, lsl 2] > > > > > add x4, x1, 1 > > > > > cmp w0, w3 > > > > > bcs .L14 > > > > > mov w0, 0 > > > > > ret > > > > > > > > > > On the workloads this work is based on we see between 2-3x > > > > > performance uplift using this patch. > > > > > > > > > > Follow up plan: > > > > > - Boolean vectorization has several shortcomings. I've filed > > > > > PR110223 with > > > > the > > > > > bigger ones that cause vectorization to fail with this patch. > > > > > - SLP support. This is planned for GCC 15 as for majority of the cases > > build > > > > > SLP itself fails. > > > > > > > > It would be nice to get at least single-lane SLP support working. 
I > > > > think you need to treat the gcond as SLP root stmt and basically do > > > > discovery on the condition as to as if it were a mask generating condition. > > > > > > Hmm ok, will give it a try. > > > > > > > > > > > Code generation would then simply schedule the gcond root instances > > > > first (that would get you the code motion automagically). > > > > > > Right, so you're saying treat the gcond's as the seed, and stores as a sink. > > > And then schedule only the instances without a gcond around such that > > > we can still vectorize in place to get the branches. Ok, makes sense. > > > > > > > > > > > So, add a new slp_instance_kind, for example > > > > slp_inst_kind_early_break, and record the gcond as root stmt. > > > > Possibly "pattern" recognizing > > > > > > > > gcond <_1 != _2> > > > > > > > > as > > > > > > > > _mask = _1 != _2; > > > > gcond <_mask != 0> > > > > > > > > makes the SLP discovery less fiddly (but in theory you can of course > > > > handle gconds directly). > > > > > > > > Is there any part of the series that can be pushed independelty? If > > > > so I'll try to look at those parts first. > > > > > > > > > > Aside from: > > > > > > [PATCH 4/21]middle-end: update loop peeling code to maintain LCSSA > > > form for early breaks [PATCH 7/21]middle-end: update IV update code to > > > support early breaks and arbitrary exits > > > > > > The rest lie dormant and don't do anything or disrupt the tree until those > > two are in. > > > The rest all just touch up different parts piecewise. > > > > > > They do rely on the new field introduced in: > > > > > > [PATCH 3/21]middle-end: Implement code motion and dependency analysis > > > for early breaks > > > > > > But can split them out. > > > > > > I'll start respinning no #4 and #7 with your latest changes now. > > > > OK, I'll simply go 1-n then. > > > > Richard. > > > > > Thanks, > > > Tamar > > > > > > > Thanks, > > > > Richard. >
Catching up on backlog, so this might already be resolved, but: Richard Biener <rguenther@suse.de> writes: > On Tue, 7 Nov 2023, Tamar Christina wrote: > >> > -----Original Message----- >> > From: Richard Biener <rguenther@suse.de> >> > Sent: Tuesday, November 7, 2023 9:43 AM >> > To: Tamar Christina <Tamar.Christina@arm.com> >> > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com> >> > Subject: RE: [PATCH v6 0/21]middle-end: Support early break/return auto- >> > vectorization >> > >> > On Mon, 6 Nov 2023, Tamar Christina wrote: >> > >> > > > -----Original Message----- >> > > > From: Richard Biener <rguenther@suse.de> >> > > > Sent: Monday, November 6, 2023 2:25 PM >> > > > To: Tamar Christina <Tamar.Christina@arm.com> >> > > > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com> >> > > > Subject: Re: [PATCH v6 0/21]middle-end: Support early break/return >> > > > auto- vectorization >> > > > >> > > > On Mon, 6 Nov 2023, Tamar Christina wrote: >> > > > >> > > > > Hi All, >> > > > > >> > > > > This patch adds initial support for early break vectorization in GCC. >> > > > > The support is added for any target that implements a vector >> > > > > cbranch optab, this includes both fully masked and non-masked targets. >> > > > > >> > > > > Depending on the operation, the vectorizer may also require >> > > > > support for boolean mask reductions using Inclusive OR. This is >> > > > > however only checked then the comparison would produce multiple >> > statements. >> > > > > >> > > > > Note: I am currently struggling to get patch 7 correct in all >> > > > > cases and could >> > > > use >> > > > > some feedback there. >> > > > > >> > > > > Concretely the kind of loops supported are of the forms: >> > > > > >> > > > > for (int i = 0; i < N; i++) >> > > > > { >> > > > > <statements1> >> > > > > if (<condition>) >> > > > > { >> > > > > ... 
>> > > > > <action>; >> > > > > } >> > > > > <statements2> >> > > > > } >> > > > > >> > > > > where <action> can be: >> > > > > - break >> > > > > - return >> > > > > - goto >> > > > > >> > > > > Any number of statements can be used before the <action> occurs. >> > > > > >> > > > > Since this is an initial version for GCC 14 it has the following >> > > > > limitations and >> > > > > features: >> > > > > >> > > > > - Only fixed sized iterations and buffers are supported. That is to say any >> > > > > vectors loaded or stored must be to statically allocated arrays with >> > known >> > > > > sizes. N must also be known. This limitation is because our primary >> > target >> > > > > for this optimization is SVE. For VLA SVE we can't easily do cross page >> > > > > iteraion checks. The result is likely to also not be beneficial. For that >> > > > > reason we punt support for variable buffers till we have First-Faulting >> > > > > support in GCC. >> > >> > Btw, for this I wonder if you thought about marking memory accesses required >> > for the early break condition as required to be vector-size aligned, thus peeling >> > or versioning them for alignment? That should ensure they do not fault. >> > >> > OTOH I somehow remember prologue peeling isn't supported for early break >> > vectorization? .. >> > >> > > > > - any stores in <statements1> should not be to the same objects as in >> > > > > <condition>. Loads are fine as long as they don't have the possibility to >> > > > > alias. More concretely, we block RAW dependencies when the >> > > > > intermediate >> > > > value >> > > > > can't be separated fromt the store, or the store itself can't be moved. >> > > > > - Prologue peeling, alignment peelinig and loop versioning are supported. >> > >> > .. but here you say it is. Not sure if peeling for alignment works for VLA vectors >> > though. Just to say x86 doesn't support first-faulting loads. >> >> For VLA we support it through masking. i.e. 
>> if you need to peel N iterations, we
>> generate a masked copy of the loop vectorized which masks off the first N bits.
>>
>> This is not typically needed, but we do support it. But the problem with this
>> scheme and early break is obviously that the peeled loop needs to be vectorized
>> so you kinda end up with the same issue again. So Atm it rejects it for VLA.
>
> Hmm, I see.  I thought peeling by masking is an optimization.

Yeah, it's an opt-in optimisation.  No current Arm cores opt in though.

> Anyhow, I think it should still work here - since all accesses are aligned
> and we know that there's at least one original scalar iteration in the
> first masked and the following "unmasked" vector iterations there
> should never be faults for any of the aligned accesses.

Peeling via masking works by using the main loop for the "peeled"
iteration (so it's a bit of a misnomer).  The vector pointers start
out lower than the original scalar pointers, with some leading
inactive elements.

The awkwardness would be in skipping those leading inactive elements
in the epilogue, if an early break occurs in the first vector iteration.
Definitely doable, but I imagine not trivial.

> I think going via alignment is a way easier method to guarantee this
> than handwaving about "declared" arrays and niter.  One can try that
> in addition of course - it's not always possible to align all
> vector loads we are going to speculate (for VLA one could also
> find common runtime (mis-)alignment and restrict the vector length based
> on that, for RISC-V it seems to be efficient, not sure whether altering
> that for SVE is though).

I think both techniques (alignment and reasoning about accessibility)
are useful.  And they each help with different cases.  Like you say,
if there are two vector loads that need to be aligned, we'd need to
version for alignment on fixed-length architectures, with a scalar
fallback when the alignment requirement isn't met.
In contrast, static reasoning about accessibility allows the vector loop to be
used for all relative misalignments.

So I think the aim should be to support both techniques.  But IMO it's
reasonable to start with either one.  It sounds from Tamar's results
like starting with static reasoning does fire quite often, and it
should have less runtime overhead than the alignment approach.

Plus, when the loop operates on chars, it's hard to predict whether
peeling for alignment pays for itself, or whether the scalar prologue
will end up handling the majority of cases.  If we have the option
of not peeling for alignment, then it's probably worth taking it
for chars.

Capping the VL at runtime is possible on SVE.  It's on the backlog
for handling runtime aliases, where we can vectorise with a lower VF
rather than falling back to scalar code.  But first-faulting loads
are likely to be better than halving or quartering the VL at runtime,
so I don't think capping the VL would be the right SVE technique for
early exits.

Thanks,
Richard
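Editorial illustration of the two address computations discussed in this message (not code from GCC): peeling via masking starts the vector pointer at the aligned-down address, leaving some leading inactive elements, and a vector-size-aligned access cannot straddle a page boundary. The sketch assumes the vector size is a power of two and that the page size is a multiple of it; the helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Round ADDR down to a multiple of VBYTES (a power of two).  This is
   where the vector pointer starts when peeling via masking.  */
uintptr_t align_down (uintptr_t addr, uintptr_t vbytes)
{
  return addr & ~(vbytes - 1);
}

/* Number of leading inactive elements in the first vector iteration,
   i.e. the lanes that lie before the original scalar pointer.  */
uintptr_t leading_inactive_elems (uintptr_t addr, uintptr_t vbytes,
                                  uintptr_t elem_bytes)
{
  return (addr - align_down (addr, vbytes)) / elem_bytes;
}

/* True if the access [ADDR, ADDR + VBYTES) stays within one page.  */
bool within_one_page (uintptr_t addr, uintptr_t vbytes, uintptr_t page_bytes)
{
  return addr / page_bytes == (addr + vbytes - 1) / page_bytes;
}
```

Under these assumptions an aligned vector that contains at least one element the scalar loop would have accessed lies entirely in that element's page, which is the basis of the "should never be faults" argument quoted above. The leading inactive elements are what an epilogue would have to skip if the early break fires in the first vector iteration.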
On Mon, 27 Nov 2023, Richard Sandiford wrote: > Catching up on backlog, so this might already be resolved, but: > > Richard Biener <rguenther@suse.de> writes: > > On Tue, 7 Nov 2023, Tamar Christina wrote: > > > >> > -----Original Message----- > >> > From: Richard Biener <rguenther@suse.de> > >> > Sent: Tuesday, November 7, 2023 9:43 AM > >> > To: Tamar Christina <Tamar.Christina@arm.com> > >> > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com> > >> > Subject: RE: [PATCH v6 0/21]middle-end: Support early break/return auto- > >> > vectorization > >> > > >> > On Mon, 6 Nov 2023, Tamar Christina wrote: > >> > > >> > > > -----Original Message----- > >> > > > From: Richard Biener <rguenther@suse.de> > >> > > > Sent: Monday, November 6, 2023 2:25 PM > >> > > > To: Tamar Christina <Tamar.Christina@arm.com> > >> > > > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com> > >> > > > Subject: Re: [PATCH v6 0/21]middle-end: Support early break/return > >> > > > auto- vectorization > >> > > > > >> > > > On Mon, 6 Nov 2023, Tamar Christina wrote: > >> > > > > >> > > > > Hi All, > >> > > > > > >> > > > > This patch adds initial support for early break vectorization in GCC. > >> > > > > The support is added for any target that implements a vector > >> > > > > cbranch optab, this includes both fully masked and non-masked targets. > >> > > > > > >> > > > > Depending on the operation, the vectorizer may also require > >> > > > > support for boolean mask reductions using Inclusive OR. This is > >> > > > > however only checked then the comparison would produce multiple > >> > statements. > >> > > > > > >> > > > > Note: I am currently struggling to get patch 7 correct in all > >> > > > > cases and could > >> > > > use > >> > > > > some feedback there. > >> > > > > > >> > > > > Concretely the kind of loops supported are of the forms: > >> > > > > > >> > > > > for (int i = 0; i < N; i++) > >> > > > > { > >> > > > > <statements1> > >> > > > > if (<condition>) > >> > > > > { > >> > > > > ... 
> >> > > > > <action>; > >> > > > > } > >> > > > > <statements2> > >> > > > > } > >> > > > > > >> > > > > where <action> can be: > >> > > > > - break > >> > > > > - return > >> > > > > - goto > >> > > > > > >> > > > > Any number of statements can be used before the <action> occurs. > >> > > > > > >> > > > > Since this is an initial version for GCC 14 it has the following > >> > > > > limitations and > >> > > > > features: > >> > > > > > >> > > > > - Only fixed sized iterations and buffers are supported. That is to say any > >> > > > > vectors loaded or stored must be to statically allocated arrays with > >> > known > >> > > > > sizes. N must also be known. This limitation is because our primary > >> > target > >> > > > > for this optimization is SVE. For VLA SVE we can't easily do cross page > >> > > > > iteraion checks. The result is likely to also not be beneficial. For that > >> > > > > reason we punt support for variable buffers till we have First-Faulting > >> > > > > support in GCC. > >> > > >> > Btw, for this I wonder if you thought about marking memory accesses required > >> > for the early break condition as required to be vector-size aligned, thus peeling > >> > or versioning them for alignment? That should ensure they do not fault. > >> > > >> > OTOH I somehow remember prologue peeling isn't supported for early break > >> > vectorization? .. > >> > > >> > > > > - any stores in <statements1> should not be to the same objects as in > >> > > > > <condition>. Loads are fine as long as they don't have the possibility to > >> > > > > alias. More concretely, we block RAW dependencies when the > >> > > > > intermediate > >> > > > value > >> > > > > can't be separated fromt the store, or the store itself can't be moved. > >> > > > > - Prologue peeling, alignment peelinig and loop versioning are supported. > >> > > >> > .. but here you say it is. Not sure if peeling for alignment works for VLA vectors > >> > though. 
> >> > Just to say x86 doesn't support first-faulting loads.
> >>
> >> For VLA we support it through masking.  i.e. if you need to peel N
> >> iterations, we generate a masked copy of the loop vectorized which masks
> >> off the first N bits.
> >>
> >> This is not typically needed, but we do support it.  But the problem with
> >> this scheme and early break is obviously that the peeled loop needs to be
> >> vectorized so you kinda end up with the same issue again.  So atm it
> >> rejects it for VLA.
> >
> > Hmm, I see.  I thought peeling by masking is an optimization.
>
> Yeah, it's an opt-in optimisation.  No current Arm cores opt in though.
>
> > Anyhow, I think it should still work here - since all accesses are aligned
> > and we know that there's at least one original scalar iteration in the
> > first masked and the following "unmasked" vector iterations there
> > should never be faults for any of the aligned accesses.
>
> Peeling via masking works by using the main loop for the "peeled"
> iteration (so it's a bit of a misnomer).  The vector pointers start
> out lower than the original scalar pointers, with some leading
> inactive elements.
>
> The awkwardness would be in skipping those leading inactive elements
> in the epilogue, if an early break occurs in the first vector iteration.
> Definitely doable, but I imagine not trivial.
>
> > I think going via alignment is a way easier method to guarantee this
> > than handwaving about "declared" arrays and niter.  One can try that
> > in addition of course - it's not always possible to align all
> > vector loads we are going to speculate (for VLA one could also
> > find common runtime (mis-)alignment and restrict the vector length based
> > on that, for RISC-V it seems to be efficient, not sure whether altering
> > that for SVE is though).
>
> I think both techniques (alignment and reasoning about accessibility)
> are useful.  And they each help with different cases.
> Like you say,
> if there are two vector loads that need to be aligned, we'd need to
> version for alignment on fixed-length architectures, with a scalar
> fallback when the alignment requirement isn't met.  In contrast,
> static reasoning about accessibility allows the vector loop to be
> used for all relative misalignments.
>
> So I think the aim should be to support both techniques.  But IMO it's
> reasonable to start with either one.  It sounds from Tamar's results
> like starting with static reasoning does fire quite often, and it
> should have less runtime overhead than the alignment approach.

Fair enough, we need to fix the correctness issues then though
(as said, correctness is way easier to assert for alignment).

> Plus, when the loop operates on chars, it's hard to predict whether
> peeling for alignment pays for itself, or whether the scalar prologue
> will end up handling the majority of cases.  If we have the option
> of not peeling for alignment, then it's probably worth taking it
> for chars.

That's true.

> Capping the VL at runtime is possible on SVE.  It's on the backlog
> for handling runtime aliases, where we can vectorise with a lower VF
> rather than falling back to scalar code.  But first-faulting loads
> are likely to be better than halving or quartering the VL at runtime,
> so I don't think capping the VL would be the right SVE technique for
> early exits.

For targets with no first-faulting loads we only have alignment as
additional possibility then.  I can look at this for next stage1.

Richard.

> Thanks,
> Richard
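
For reference, the alignment argument made above can be sketched in a few
lines of C.  This is only an illustration and not vectorizer code: the
constants PAGE_BYTES and VEC_BYTES and the helpers load_crosses_page and
peel_count are hypothetical names, and the sketch assumes a fixed-length
vector whose size divides the page size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical values, for illustration only.  */
enum { PAGE_BYTES = 4096, VEC_BYTES = 16 };

/* True if a VEC_BYTES-wide load starting at ADDR touches two pages.  */
static bool
load_crosses_page (uintptr_t addr)
{
  return addr / PAGE_BYTES != (addr + VEC_BYTES - 1) / PAGE_BYTES;
}

/* Number of scalar iterations to peel so that an access starting at
   ADDR, with elements of ELEM_SIZE bytes, becomes VEC_BYTES aligned.
   Assumes ADDR is already element aligned.  */
static unsigned
peel_count (uintptr_t addr, unsigned elem_size)
{
  assert (elem_size != 0 && VEC_BYTES % elem_size == 0);
  assert (addr % elem_size == 0);
  uintptr_t misalign = addr % VEC_BYTES;
  return misalign ? (unsigned) ((VEC_BYTES - misalign) / elem_size) : 0;
}
```

Once the access is aligned, the whole vector load stays inside one page.
If at least one element of that vector is also read by the scalar loop,
the page is presumably mapped, so speculatively loading the remaining
elements past the break point should not fault.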