Message ID | 1587710475-27567-2-git-send-email-ego@linux.vnet.ibm.com |
---|---|
State | Superseded |
Headers | show |
Series | sensors: occ: Couple of fixes | expand |
Context | Check | Description |
---|---|---|
snowpatch_ozlabs/apply_patch | success | Successfully applied on branch master (0f1937ef40fca0c3212a9dff1010b832a24fb063) |
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot | success | Test snowpatch/job/snowpatch-skiboot on branch master |
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot-dco | success | Signed-off-by present |
* Gautham R Shenoy <ego@linux.vnet.ibm.com> [2020-04-24 12:11:14]: > From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com> > > commit bebe096ee242 ("sensors: occ: Skip GPU sensors for non-gpu > systems") assumes that presence of "ibm,power9-npu" compatible node > indicates the presence of GPUs. However this is incorrect, as even > OpenCAPI is supported via NPU. Thus ZZ systems, which have OpenCAPI > connectors but not GPUs will have "ibm,power9-npu" compatible nodes. > This results in OPAL creating device-tree entries for the GPU sensors > on ZZ systems which don't even have GPUs. > > This patch fixes the GPU detection code in occ-sensors, by first > checking for "ibm,ioda2-npu2-phb" compatible node which indicates the > presence of nvlink. Only if such a node exists, do we check with the > OCC for presence of GPUs on systems to confirm the presence of the > GPU. Otherwise, we cut the GPU sensors. > > Thanks to Frederic Barrat <fbarrat@linux.ibm.com> for suggesting > "ibm,ioda2-npu2-phb" for detecting the presence of nvlink GPUs. > > Fixes: commit bebe096ee242 ("sensors: occ: Skip GPU sensors for non-gpu > systems") > Reported-by: Pavaman Subramaniyam <pavsubra@in.ibm.com> > Tested-by: Pavaman Subramaniyam <pavsubra@in.ibm.com> > Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.ibm.com> > --- > hw/occ-sensor.c | 22 ++++++++++++++++++++-- > 1 file changed, 20 insertions(+), 2 deletions(-) > > diff --git a/hw/occ-sensor.c b/hw/occ-sensor.c > index 524d00f..a5d0974 100644 > --- a/hw/occ-sensor.c > +++ b/hw/occ-sensor.c > @@ -521,8 +521,26 @@ bool occ_sensors_init(void) > dt_add_property_cells(sg, "#address-cells", 1); > dt_add_property_cells(sg, "#size-cells", 0); > > - if (dt_find_compatible_node(dt_root, NULL, "ibm,power9-npu")) > - has_gpu = true; > + /* > + * On POWER9, ibm,ioda2-npu2-phb indicates the presence of a > + * GPU NVlink. > + */ > + if (dt_find_compatible_node(dt_root, NULL, "ibm,ioda2-npu2-phb")) { > + > + for_each_chip(chip) { > + int max_gpus_per_chip = 2, i; ^3 Max nr gpus is 3 :) > + > + for(i = 0; i < max_gpus_per_chip; i++) { max_gpus_per_chip-1; Index of GPU are 0,1,2 > + has_gpu = occ_get_gpu_presence(chip, i); > + > + if (has_gpu) > + break; > + } > + > + if (has_gpu) > + break; > + } > + } > > for_each_chip(chip) { > struct occ_sensor_data_header *hb; Thanks for the fix. Minor nits above. --Vaidy
diff --git a/hw/occ-sensor.c b/hw/occ-sensor.c index 524d00f..a5d0974 100644 --- a/hw/occ-sensor.c +++ b/hw/occ-sensor.c @@ -521,8 +521,26 @@ bool occ_sensors_init(void) dt_add_property_cells(sg, "#address-cells", 1); dt_add_property_cells(sg, "#size-cells", 0); - if (dt_find_compatible_node(dt_root, NULL, "ibm,power9-npu")) - has_gpu = true; + /* + * On POWER9, ibm,ioda2-npu2-phb indicates the presence of a + * GPU NVlink. + */ + if (dt_find_compatible_node(dt_root, NULL, "ibm,ioda2-npu2-phb")) { + + for_each_chip(chip) { + int max_gpus_per_chip = 2, i; + + for(i = 0; i < max_gpus_per_chip; i++) { + has_gpu = occ_get_gpu_presence(chip, i); + + if (has_gpu) + break; + } + + if (has_gpu) + break; + } + } for_each_chip(chip) { struct occ_sensor_data_header *hb;