Message ID | 1587711555-28817-2-git-send-email-ego@linux.vnet.ibm.com |
---|---|
State | Superseded |
Headers | show |
Series | sensors: occ: Couple of fixes | expand |
Context | Check | Description |
---|---|---|
snowpatch_ozlabs/apply_patch | success | Successfully applied on branch master (0f1937ef40fca0c3212a9dff1010b832a24fb063) |
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot | success | Test snowpatch/job/snowpatch-skiboot on branch master |
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot-dco | success | Signed-off-by present |
Le 24/04/2020 à 08:59, Gautham R. Shenoy a écrit : > From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com> > > commit bebe096ee242 ("sensors: occ: Skip GPU sensors for non-gpu > systems") assumes that presence of "ibm,power9-npu" compatible node > indicates the presence of GPUs. However this is incorrect, as even > OpenCAPI is supported via NPU. Thus ZZ systems, which have OpenCAPI > connectors but not GPUs will have "ibm,power9-npu" compatible nodes. > This results in OPAL creating device-tree entries for the GPU sensors > on ZZ systems which don't even have GPUs. > > This patch fixes the GPU detection code in occ-sensors, by first > checking for "ibm,ioda2-npu2-phb" compatible node which indicates the > presence of nvlink. Only if such a node exists, do we check with the > OCC for presence of GPUs on systems to confirm the presence of the > GPU. Otherwise, we cut the GPU sensors. > > Thanks to Frederic Barrat <fbarrat@linux.ibm.com> for suggesting > "ibm,ioda2-npu2-phb" for detecting the presence of nvlink GPUs. > > Fixes: commit bebe096ee242 ("sensors: occ: Skip GPU sensors for non-gpu > systems") > Reported-by: Pavaman Subramaniyam <pavsubra@in.ibm.com> > Tested-by: Pavaman Subramaniyam <pavsubra@in.ibm.com> > Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> > --- > hw/occ-sensor.c | 22 ++++++++++++++++++++-- > 1 file changed, 20 insertions(+), 2 deletions(-) > > diff --git a/hw/occ-sensor.c b/hw/occ-sensor.c > index 524d00f..a5d0974 100644 > --- a/hw/occ-sensor.c > +++ b/hw/occ-sensor.c > @@ -521,8 +521,26 @@ bool occ_sensors_init(void) > dt_add_property_cells(sg, "#address-cells", 1); > dt_add_property_cells(sg, "#size-cells", 0); > > - if (dt_find_compatible_node(dt_root, NULL, "ibm,power9-npu")) > - has_gpu = true; > + /* > + * On POWER9, ibm,ioda2-npu2-phb indicates the presence of a > + * GPU NVlink. > + */ > + if (dt_find_compatible_node(dt_root, NULL, "ibm,ioda2-npu2-phb")) { > + > + for_each_chip(chip) { > + int max_gpus_per_chip = 2, i; > + > + for(i = 0; i < max_gpus_per_chip; i++) { > + has_gpu = occ_get_gpu_presence(chip, i); > + > + if (has_gpu) > + break; > + } > + > + if (has_gpu) > + break; > + } > + } So on platform other than witherspoon, we skip the full check. On witherspoon, we go check the presence of each individual GPU slots. So I think that works. Out of curiosity, what's the behavior if we have GPUs but not on all slots (it may not be an official configuration, I don't know)? Would we see errors? Reviewed-by: Frederic Barrat <fbarrat@linux.ibm.com> Fred > > for_each_chip(chip) { > struct occ_sensor_data_header *hb; >
Hello Frederic, On Fri, Apr 24, 2020 at 06:29:57PM +0200, Frederic Barrat wrote: [..snip..] > >--- > > hw/occ-sensor.c | 22 ++++++++++++++++++++-- > > 1 file changed, 20 insertions(+), 2 deletions(-) > > > >diff --git a/hw/occ-sensor.c b/hw/occ-sensor.c > >index 524d00f..a5d0974 100644 > >--- a/hw/occ-sensor.c > >+++ b/hw/occ-sensor.c > >@@ -521,8 +521,26 @@ bool occ_sensors_init(void) > > dt_add_property_cells(sg, "#address-cells", 1); > > dt_add_property_cells(sg, "#size-cells", 0); > >- if (dt_find_compatible_node(dt_root, NULL, "ibm,power9-npu")) > >- has_gpu = true; > >+ /* > >+ * On POWER9, ibm,ioda2-npu2-phb indicates the presence of a > >+ * GPU NVlink. > >+ */ > >+ if (dt_find_compatible_node(dt_root, NULL, "ibm,ioda2-npu2-phb")) { > >+ > >+ for_each_chip(chip) { > >+ int max_gpus_per_chip = 2, i; > >+ > >+ for(i = 0; i < max_gpus_per_chip; i++) { > >+ has_gpu = occ_get_gpu_presence(chip, i); > >+ > >+ if (has_gpu) > >+ break; > >+ } > >+ > >+ if (has_gpu) > >+ break; > >+ } > >+ } > > So on platform other than witherspoon, we skip the full check. > On witherspoon, we go check the presence of each individual GPU slots. > So I think that works. > Out of curiosity, what's the behavior if we have GPUs but not on all slots > (it may not be an official configuration, I don't know)? Would we see > errors? No there won't be errors. Those GPU slots would show the values 0 for temperature and POWER. Example below: On a Witherspoon with only two GPUs ================================================================= Chip 0 GPU 0 : +33.0°C (lowest = +23.0°C, highest = +33.0°C) Chip 0 GPU 1 : +0.0°C (lowest = +0.0°C, highest = +0.0°C) Chip 0 GPU 2 : +0.0°C (lowest = +0.0°C, highest = +0.0°C) Chip 0 GPU 0 MEM: +0.0°C (lowest = +0.0°C, highest = +0.0°C) Chip 0 GPU 1 MEM: +0.0°C (lowest = +0.0°C, highest = +0.0°C) Chip 0 GPU 2 MEM: +0.0°C (lowest = +0.0°C, highest = +0.0°C) Chip 8 GPU 0 : +29.0°C (lowest = +23.0°C, highest = +29.0°C) Chip 8 GPU 1 : +0.0°C (lowest = +0.0°C, highest = +0.0°C) Chip 8 GPU 2 : +0.0°C (lowest = +0.0°C, highest = +0.0°C) Chip 8 GPU 0 MEM: +0.0°C (lowest = +0.0°C, highest = +0.0°C) Chip 8 GPU 1 MEM: +0.0°C (lowest = +0.0°C, highest = +0.0°C) Chip 8 GPU 2 MEM: +0.0°C (lowest = +0.0°C, highest = +0.0°C) ================================================================= > > Reviewed-by: Frederic Barrat <fbarrat@linux.ibm.com> Thanks! > > Fred > > > > > for_each_chip(chip) { > > struct occ_sensor_data_header *hb; > > Thanks and Regards gautham.
diff --git a/hw/occ-sensor.c b/hw/occ-sensor.c index 524d00f..a5d0974 100644 --- a/hw/occ-sensor.c +++ b/hw/occ-sensor.c @@ -521,8 +521,26 @@ bool occ_sensors_init(void) dt_add_property_cells(sg, "#address-cells", 1); dt_add_property_cells(sg, "#size-cells", 0); - if (dt_find_compatible_node(dt_root, NULL, "ibm,power9-npu")) - has_gpu = true; + /* + * On POWER9, ibm,ioda2-npu2-phb indicates the presence of a + * GPU NVlink. + */ + if (dt_find_compatible_node(dt_root, NULL, "ibm,ioda2-npu2-phb")) { + + for_each_chip(chip) { + int max_gpus_per_chip = 2, i; + + for(i = 0; i < max_gpus_per_chip; i++) { + has_gpu = occ_get_gpu_presence(chip, i); + + if (has_gpu) + break; + } + + if (has_gpu) + break; + } + } for_each_chip(chip) { struct occ_sensor_data_header *hb;