From patchwork Sat Jul 13 15:49:49 2024
X-Patchwork-Submitter: Feng Xue OS
X-Patchwork-Id: 1960187
From: Feng Xue OS
To: Richard Biener, "gcc-patches@gcc.gnu.org"
Subject: [PATCH 4/4] vect: Optimize order of lane-reducing statements in loop
 def-use cycles
Date: Sat, 13 Jul 2024 15:49:49 +0000

When transforming multiple lane-reducing operations in a loop reduction
chain, the corresponding vectorized statements are currently generated into
def-use cycles starting from 0. As a result, a def-use cycle with a smaller
index contains more statements, which means a longer instruction dependency
chain.
For example:

   int sum = 1;
   for (i)
     {
       sum += d0[i] * d1[i];      // dot-prod
       sum += w[i];               // widen-sum
       sum += abs(s0[i] - s1[i]); // sad
       sum += n[i];               // normal
     }

Original transformation result:

   for (i / 16)
     {
       sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
       sum_v1 = sum_v1;  // copy
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
       sum_v1 = sum_v1;  // copy
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
       sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       ...
     }

For higher instruction-level parallelism in the final vectorized loop, a
better approach is to distribute the effective vector lane-reducing ops
evenly among all def-use cycles. In the transformation below, DOT_PROD,
WIDEN_SUM and the SADs are generated into separate cycles, so the
instruction dependencies among them are eliminated.

   for (i / 16)
     {
       sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
       sum_v1 = sum_v1;  // copy
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = sum_v0;  // copy
       sum_v1 = WIDEN_SUM (w_v1[i: 0 ~ 15], sum_v1);
       sum_v2 = sum_v2;  // copy
       sum_v3 = sum_v3;  // copy

       sum_v0 = sum_v0;  // copy
       sum_v1 = sum_v1;  // copy
       sum_v2 = SAD (s0_v2[i: 0 ~ 7 ], s1_v2[i: 0 ~ 7 ], sum_v2);
       sum_v3 = SAD (s0_v3[i: 8 ~ 15], s1_v3[i: 8 ~ 15], sum_v3);

       ...
     }

Thanks,
Feng
---
gcc/
	PR tree-optimization/114440
	* tree-vectorizer.h (struct _stmt_vec_info): Add a new field
	reduc_result_pos.
	* tree-vect-loop.cc (vect_transform_reduction): Generate lane-reducing
	statements in an optimized order.
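To make the imbalance concrete, the following standalone C++ sketch counts
how many effective (non-copy) lane-reducing statements end up in each
def-use cycle under both placements. It is only a model of the description
above, not GCC code: `statements_per_cycle`, its parameters and the
`distribute` flag are hypothetical names, and it assumes that each operation
occupies consecutive cycles, wrapping around modulo the number of cycles.

```cpp
#include <cassert>
#include <vector>

// Count the effective (non-copy) lane-reducing statements per def-use cycle.
// ncopies_per_op[i] is the number of vector statements operation i needs,
// total is the number of def-use cycles.  All names here are hypothetical
// and only model the commit message; they are not part of GCC.
std::vector<unsigned>
statements_per_cycle (const std::vector<unsigned> &ncopies_per_op,
                      unsigned total, bool distribute)
{
  std::vector<unsigned> load (total, 0);
  unsigned pos = 0; // Next free cycle; only advanced when distributing.

  for (unsigned n : ncopies_per_op)
    {
      assert (n <= total);

      // Original order: every operation starts at cycle 0.
      // Distributed order: start where the previous operation ended.
      unsigned start = distribute ? pos : 0;

      for (unsigned c = 0; c < n; c++)
        load[(start + c) % total]++;

      if (distribute)
        pos = (pos + n) % total;
    }

  return load;
}
```

For the example loop, the three lane-reducing ops need 1, 1 and 2 vector
statements over 4 cycles. Under the stated assumptions the model gives
per-cycle counts of 3, 1, 0, 0 for the original order and 1, 1, 1, 1 when
the ops are distributed.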
---
 gcc/tree-vect-loop.cc | 64 ++++++++++++++++++++++++++++++++++++++-----
 gcc/tree-vectorizer.h |  6 ++++
 2 files changed, 63 insertions(+), 7 deletions(-)

From f3d2bff96f8e29f775e2cb12ef43ad464b819fcf Mon Sep 17 00:00:00 2001
From: Feng Xue
Date: Wed, 29 May 2024 17:28:14 +0800
Subject: [PATCH 4/4] vect: Optimize order of lane-reducing statements in loop
 def-use cycles

2024-03-22  Feng Xue

gcc/
	PR tree-optimization/114440
	* tree-vectorizer.h (struct _stmt_vec_info): Add a new field
	reduc_result_pos.
	* tree-vect-loop.cc (vect_transform_reduction): Generate lane-reducing
	statements in an optimized order.
---
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index e72d692ffa3..5bc6e526d43 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -8841,6 +8841,7 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
                sum += d0[i] * d1[i];      // dot-prod
                sum += w[i];               // widen-sum
                sum += abs(s0[i] - s1[i]); // sad
+               sum += n[i];               // normal
              }

          The vector size is 128-bit,vectorization factor is 16.  Reduction
@@ -8858,19 +8859,27 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
                sum_v2 = sum_v2;  // copy
                sum_v3 = sum_v3;  // copy

-               sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
-               sum_v1 = sum_v1;  // copy
+               sum_v0 = sum_v0;  // copy
+               sum_v1 = WIDEN_SUM (w_v1[i: 0 ~ 15], sum_v1);
                sum_v2 = sum_v2;  // copy
                sum_v3 = sum_v3;  // copy

-               sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
-               sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
-               sum_v2 = sum_v2;  // copy
+               sum_v0 = sum_v0;  // copy
+               sum_v1 = SAD (s0_v1[i: 0 ~ 7 ], s1_v1[i: 0 ~ 7 ], sum_v1);
+               sum_v2 = SAD (s0_v2[i: 8 ~ 15], s1_v2[i: 8 ~ 15], sum_v2);
                sum_v3 = sum_v3;  // copy
+
+               sum_v0 += n_v0[i: 0  ~ 3 ];
+               sum_v1 += n_v1[i: 4  ~ 7 ];
+               sum_v2 += n_v2[i: 8  ~ 11];
+               sum_v3 += n_v3[i: 12 ~ 15];
              }

-           sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3;   // = sum_v0 + sum_v1
-        */
+         Moreover, for a higher instruction parallelism in final vectorized
+         loop, it is considered to make those effective vector lane-reducing
+         ops be distributed evenly among all def-use cycles.  In the above
+         example, DOT_PROD, WIDEN_SUM and SADs are generated into disparate
+         cycles, instruction dependency among them could be eliminated.  */

       unsigned effec_ncopies = vec_oprnds[0].length ();
       unsigned total_ncopies = vec_oprnds[reduc_index].length ();
@@ -8884,6 +8893,47 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
               vec_oprnds[i].safe_grow_cleared (total_ncopies);
             }
         }
+
+      tree reduc_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (reduc_info);
+      gcc_assert (reduc_vectype_in);
+
+      unsigned effec_reduc_ncopies
+        = vect_get_num_copies (loop_vinfo, slp_node, reduc_vectype_in);
+
+      gcc_assert (effec_ncopies <= effec_reduc_ncopies);
+
+      if (effec_ncopies < effec_reduc_ncopies)
+        {
+          /* Find suitable def-use cycles to generate vectorized statements
+             into, and reorder operands based on the selection.  */
+          unsigned curr_pos = reduc_info->reduc_result_pos;
+          unsigned next_pos = (curr_pos + effec_ncopies) % effec_reduc_ncopies;
+
+          gcc_assert (curr_pos < effec_reduc_ncopies);
+          reduc_info->reduc_result_pos = next_pos;
+
+          if (curr_pos)
+            {
+              unsigned count = effec_reduc_ncopies - effec_ncopies;
+              unsigned start = curr_pos - count;
+
+              if ((int) start < 0)
+                {
+                  count = curr_pos;
+                  start = 0;
+                }
+
+              for (unsigned i = 0; i < op.num_ops - 1; i++)
+                {
+                  for (unsigned j = effec_ncopies; j > start; j--)
+                    {
+                      unsigned k = j - 1;
+                      std::swap (vec_oprnds[i][k], vec_oprnds[i][k + count]);
+                      gcc_assert (!vec_oprnds[i][k]);
+                    }
+                }
+            }
+        }
     }

   bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 62121f63f18..b6fdbc651d6 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1402,6 +1402,12 @@ public:
   /* The vector type for performing the actual reduction.  */
   tree reduc_vectype;

+  /* For loop reduction with multiple vectorized results (ncopies > 1), a
+     lane-reducing operation participating in it may not use all of those
+     results, this field specifies result index starting from which any
+     following lane-reducing operation would be assigned to.  */
+  unsigned int reduc_result_pos;
+
   /* If IS_REDUC_INFO is true and if the vector code is performing
      N scalar reductions in parallel, this variable gives the initial
      scalar values of those N reductions.  */
-- 
2.17.1
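
The operand reordering added to vect_transform_reduction can be tried out
in isolation with the small C++ model below. It is a sketch under stated
assumptions rather than the patch itself: `CycleCursor`, `make_operands` and
`shift_operands` are hypothetical names, operands are plain ints with 0
standing for an empty (cleared) slot, and the vector length is assumed to
equal effec_reduc_ncopies. The wrap-around test is written as
`curr_pos < count`, which is meant to be equivalent to the patch's
`(int) start < 0` check.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical stand-in for the reduc_result_pos field added by the patch.
struct CycleCursor
{
  unsigned result_pos = 0;
};

// Build `total` operand slots whose first `effec` slots hold the ids
// 1..effec; 0 marks an empty slot (a plain copy of the accumulator).
std::vector<int>
make_operands (unsigned effec, unsigned total)
{
  std::vector<int> v (total, 0);
  for (unsigned i = 0; i < effec; i++)
    v[i] = static_cast<int> (i + 1);
  return v;
}

// Move the leading `effec` operands so that they occupy the cycles starting
// at cursor.result_pos (wrapping around), then advance the cursor.
void
shift_operands (std::vector<int> &oprnds, unsigned effec, CycleCursor &cursor)
{
  unsigned total = static_cast<unsigned> (oprnds.size ());
  assert (effec <= total);
  if (effec == total)
    return; // Every cycle is used, nothing to reorder.

  unsigned curr_pos = cursor.result_pos;
  assert (curr_pos < total);
  cursor.result_pos = (curr_pos + effec) % total;

  if (curr_pos == 0)
    return; // Operands already sit in the leading cycles.

  unsigned count = total - effec; // Shift distance.
  unsigned start;
  if (curr_pos < count)
    {
      // No wrap-around: shift every operand by curr_pos.
      count = curr_pos;
      start = 0;
    }
  else
    // Wrap-around: operands below `start` stay in the leading cycles.
    start = curr_pos - count;

  // Walk from the last operand down so each swap targets an empty slot.
  for (unsigned j = effec; j > start; j--)
    {
      unsigned k = j - 1;
      std::swap (oprnds[k], oprnds[k + count]);
      assert (oprnds[k] == 0);
    }
}
```

With 4 cycles and one shared cursor, shifting operations that need 1, 1 and
2 statements in turn leaves their operands in cycle 0, cycle 1 and cycles
2 and 3 respectively, and the cursor returns to 0.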