mbox series

[V8,00/15] Add data type profiling support for powerpc

Message ID 20240718084358.72242-1-atrajeev@linux.vnet.ibm.com (mailing list archive)
Headers show
Series Add data type profiling support for powerpc | expand

Message

Athira Rajeev July 18, 2024, 8:43 a.m. UTC
The patchset from Namhyung added support for data type profiling
in perf tool. This enabled support to associate PMU samples to data
types they refer using DWARF debug information. With the upstream
perf, currently it possible to run perf report or perf annotate to
view the data type information on x86.

Initial patchset posted here had changes need to enable data type
profiling support for powerpc.

https://lore.kernel.org/all/6e09dc28-4a2e-49d8-a2b5-ffb3396a9952@csgroup.eu/T/

Main change were:
1. powerpc instruction nmemonic table to associate load/store
instructions with move_ops which is use to identify if instruction
is a memory access one.
2. To get register number and access offset from the given
instruction, code uses fields from "struct arch" -> objump.
Added entry for powerpc here.
3. A get_arch_regnum to return register number from the
register name string.

But the apporach used in the initial patchset used parsing of
disassembled code which the current perf tool implementation does.

Example: lwz     r10,0(r9)

This line "lwz r10,0(r9)" is parsed to extract instruction name,
registers names and offset. Also to find whether there is a memory
reference in the operands, "memory_ref_char" field of objdump is used.
For x86, "(" is used as memory_ref_char to tackle instructions of the
form "mov  (%rax), %rcx".

In case of powerpc, not all instructions using "(" are the only memory
instructions. Example, above instruction can also be of extended form (X
form) "lwzx r10,0,r19". Inorder to easy identify the instruction category
and extract the source/target registers, second patchset added support to use
raw instruction. With raw instruction, macros are added to extract opcode
and register fields.
Link to second patchset:
https://lore.kernel.org/all/20240506121906.76639-1-atrajeev@linux.vnet.ibm.com/

Example representation using --show-raw-insn in objdump gives result:

38 01 81 e8     ld      r4,312(r1)

Here "38 01 81 e8" is the raw instruction representation. In powerpc,
this translates to instruction form: "ld RT,DS(RA)" and binary code
as:
  _____________________________________
  | 58 |  RT  |  RA |      DS       | |
  -------------------------------------
0    6     11    16              30 31

Second patchset used "objdump" again to read the raw instruction.
But since there is no need to disassemble and binary code can be read
directly from the DSO, third patchset (ie this patchset) uses below
apporach. The apporach preferred in powerpc to parse sample for data
type profiling in V3 patchset is:
- Read directly from DSO using dso__data_read_offset
- If that fails for any case, fallback to using libcapstone
- If libcapstone is not supported, approach will use objdump

Patchset adds support to pick the opcode and reg fields from this
raw/binary instruction code. This approach came in from review comment
by Segher Boessenkool and Christophe for the initial patchset.

Apart from that, instruction tracking is enabled for powerpc.

These are handled in this patchset.

- Patch 1 is to rearrange register state type structures to header file
so that it can referred from other arch specific files
- Patch 2 is to make instruction tracking as a callback to"struct arch"
so that it can be implemented by other archs easily and defined in arch
specific files
- Patch 3 is to handle state type regs array size for x86 and powerpc
- Patch 4 adds support to capture and parse raw instruction in powerpc
using dso__data_read_offset utility
- Patch 4 also adds logic to support using objdump when doing default "perf
report" or "perf annotate" since it that needs disassembled instruction.
- Patch 5 adds disasm_line__parse to parse raw instruction for powerpc
- Patch 6 update parameters for reg extract functions to use raw
instruction on powerpc
- Patch 7 updates ins__find to carry raw_insn and also adds parse
callback for memory instructions for powerpc
- Patch 8 add support to identify memory instructions of opcode 31 in
powerpc
- Patch 9 adds more instructions to support instruction tracking in powerpc
- Patch 10 and 11 handles instruction tracking for powerpc.
- Patch 12, 13 and 14 add support to use libcapstone in powerpc
- Patch 15 handles insn-stat option for perf annotate

Note:
- There are remaining unknowns (25%) as seen in annotate Instruction stats
below.
- This patchset is not tested on powerpc32. In next step of enhancements
along with handling remaining unknowns, plan to cover powerpc32 changes
based on how testing goes.
- Patch changes to handle global register support will be posted in
follow up after addressing review comment from Namhung here:
https://lore.kernel.org/linux-perf-users/CAM9d7cg5MYvLeOoBuKqp1pw7uvRfqCw1fXpLtgct0npL96JaYg@mail.gmail.com/T/#mb309184ebf5d2c9b539cd86d61ffe2a3696eac71

With the current patchset:

 ./perf record -a -e mem-loads sleep 1
 ./perf report -s type,typeoff --hierarchy --group --stdio
 ./perf annotate --data-type --insn-stat

perf annotate logs:
==================

Annotate Instruction stats
total 609, ok 266 (43.7%), bad 343 (56.3%)

  Name/opcode         :  Good   Bad
-----------------------------------------------------------
  58                  :   172   231
  32                  :    48    44
  34                  :    19    25
  OP_31_XOP_LDX       :     8    20
  40                  :     9    14
  OP_31_XOP_LWARX     :     5     1
  OP_31_XOP_LWZX      :     2     3
  OP_31_XOP_LDARX     :     3     0
  33                  :     0     2
  OP_31_XOP_LBZX      :     0     1
  OP_31_XOP_LWAX      :     0     1
  OP_31_XOP_LHZX      :     0     1
 
perf report logs:
=================

  Total Lost Samples: 0

  Samples: 1K of event 'mem-loads'
  Event count (approx.): 937238

    Overhead  Data Type / Data Type Offset
    ........  .........  ................
     1.73%     struct task_struct
        1.29%     struct task_struct +276 (flags)
        0.10%     struct task_struct +1872 (mm)
        0.08%     struct task_struct +3632 (thread.regs)
        0.08%     struct task_struct +1072 (sched_class)
        0.05%     struct task_struct +1928 ()
        0.03%     struct task_struct +1064 (dl_server)
        0.02%     struct task_struct +0 (thread_info.preempt_count)
        0.02%     struct task_struct +120 (thread_info)
        0.02%     struct task_struct +2216 (worker_private)
        0.02%     struct task_struct +24 (thread_info.slb_preload_nr)
        0.02%     struct task_struct +8 (thread_info.local_flags)
        0.01%     struct task_struct +1416 (nr_cpus_allowed)
     1.68%     struct file
        0.58%     struct file +176 (f_op)
        0.53%     struct file +24 (f_count.counter)
        0.46%     struct file +20 (f_mode)
        0.11%     struct file +200 (private_data)
     1.63%     struct rq
        0.76%     struct rq +2856 (cpu)
        0.21%     struct rq +2592 (curr)
        0.11%     struct rq +2632 (clock_update_flags)
        0.08%     struct rq +2600 (idle)
        0.06%     struct rq +2688 (clock_task)
        0.05%     struct rq +16 (numa_migrate_on)
        0.04%     struct rq +2864 (cfs_tasks.next)

Thanks
Athira Rajeev

Changelog:
From v7 -> v8:
  Split patches 1 to 14 and patch 18 as basic patches.
  Patch 15, 16 and 17 is related to global register handling
  and will be handled separately.
  Added Reviewed-and-tested-by from Kajol Jain
  Added Reviewed-by from Namhyung for patches 1 to 14, patch 18

From v6 -> v7:
- Addressed review comments from Namhyung
  Changed format string space to %-20s while printing
  instruction stats in patch 18.
  Use cmp_null in patch 17 while comparing var_name to
  properly sort with correct order.

From v5 -> v6:
- Addressed review comments from Namhyung
  Conditionally define TYPE_STATE_MAX_REGS based on arch.
  Added macro for defining width of the raw codes and spaces
  in disasm_line__parse_powerpc.
  Call disasm_line__parse from disasm_line__parse_powerpc
  for generic code.
  Renamed symbol__disassemble_dso to symbol__disassemble_raw.
  Fixed find_data_type_global_reg to correclty free var_types
  and change indent level.
  Fixed data_type_cmp and sort__typeoff_sort to include var_name
  in comparing data type entries.
  
From v4 -> v5:
- Addressed review comments from Namhyung
  Handle max number of type state regs as 16 for x86 and 32 for
  powerpc.
  Added generic support for objdump patch first and DSO read
  optimisation next
  combined patch 3 and patch 4 in patchseries V4 to one patch
  Changed reference for "raw_insn" to use "u32"
  Splitted "parse" callback patch changes and "ins__find" patch
  changes into two
  Instead of making weak function, added get_powerpc_regs to
  extract register and offset fields for powerpc
- Addressed complation fail when "dwarf.h" is not present ie
  elfutils devel is not present. Used includes for #ifdef HAVE_DWARF_SUPPORT
  when including functions that use Dwarf references. Also
  conditionally include some of the header files.

From v3->v4:
- Addressed review comments from Ian by using capston_init from
  "util/print_insn.c" instead of "open_capston_handle".
- Addressed review comment from Namhyung by moving "opcode"
  field from "struct ins" to "struct disasm_line"

From v2->v3:
- Addressed review comments from Christophe and Namhyung for V2
- Changed the apporach in powerpc to parse sample for data
  type profiling as:
  Read directly from DSO using dso__data_read_offset
  If that fails for any case, fallback to using libcapstone
  If libcapstone is not supported, approach will use objdump
- Include instructions with opcode as 31 and correctly categorize
  them as memory or arithmetic instructions.
- Include more instructions for instruction tracking in powerpc

From v1->v2:
- Addressed suggestion from Christophe Leroy and Segher Boessenkool
  to use the binary code (raw insn) to fetch opcode, register and
  offset fields.
- Added support for instruction tracking in powerpc
- Find the register defined variables (r13 and r1 which points to
  local_paca and current_stack_pointer in powerpc)

Athira Rajeev (15):
  tools/perf: Move the data structures related to register type to
    header file
  tools/perf: Add "update_insn_state" callback function to handle arch
    specific instruction tracking
  tools/perf: Update TYPE_STATE_MAX_REGS to include max of regs in
    powerpc
  tools/perf: Add disasm_line__parse to parse raw instruction for
    powerpc
  tools/perf: Add support to capture and parse raw instruction in
    powerpc using dso__data_read_offset utility
  tools/perf: Update parameters for reg extract functions to use raw
    instruction on powerpc
  tools/perf: Add parse function for memory instructions in powerpc
  tools/perf: Add support to identify memory instructions of opcode 31
    in powerpc
  tools/perf: Add some of the arithmetic instructions to support
    instruction tracking in powerpc
  tools/perf: Add more instructions for instruction tracking
  tools/perf: Update instruction tracking for powerpc
  tools/perf: Make capstone_init non-static so that it can be used
    during symbol disassemble
  tools/perf: Use capstone_init and remove open_capstone_handle from
    disasm.c
  tools/perf: Add support to use libcapstone in powerpc
  tools/perf: Set instruction name to be used with insn-stat when using
    raw instruction

 tools/include/linux/string.h                  |   2 +
 tools/lib/string.c                            |  13 +
 tools/perf/arch/arm64/annotate/instructions.c |   3 +-
 .../arch/loongarch/annotate/instructions.c    |   6 +-
 .../perf/arch/powerpc/annotate/instructions.c | 254 ++++++++++
 tools/perf/arch/powerpc/util/dwarf-regs.c     |  53 ++
 tools/perf/arch/s390/annotate/instructions.c  |   5 +-
 tools/perf/arch/x86/annotate/instructions.c   | 377 ++++++++++++++
 tools/perf/builtin-annotate.c                 |   4 +-
 tools/perf/util/annotate-data.c               | 453 +----------------
 tools/perf/util/annotate-data.h               |  83 ++++
 tools/perf/util/annotate.c                    |  21 +-
 tools/perf/util/annotate.h                    |   5 +-
 tools/perf/util/disasm.c                      | 468 ++++++++++++++++--
 tools/perf/util/disasm.h                      |  19 +-
 tools/perf/util/include/dwarf-regs.h          |  11 +
 tools/perf/util/print_insn.c                  |  15 +-
 tools/perf/util/print_insn.h                  |   5 +
 18 files changed, 1301 insertions(+), 496 deletions(-)

Comments

Arnaldo Carvalho de Melo July 23, 2024, 8:07 p.m. UTC | #1
On Thu, Jul 18, 2024 at 02:13:43PM +0530, Athira Rajeev wrote:
> The patchset from Namhyung added support for data type profiling
> in perf tool. This enabled support to associate PMU samples to data
> types they refer using DWARF debug information. With the upstream
> perf, currently it possible to run perf report or perf annotate to
> view the data type information on x86.
> 
> Initial patchset posted here had changes need to enable data type
> profiling support for powerpc.

I have this in tmp.perf-tools-next now, will do more tests before moving
to perf-tools-next.

I see that Namhyung reviewed the patch series, and that it was tested,
in general I thought we need to go on improving things go avoid having
arch specific code in the non-arch code, I guess right now its important
to land these to try and encourage people to try data-type profiling
more widely, hopefully finding and improving data structures in the
kernel.

We need to write about such cases so as to provide success stories on
using perf for data-type profiling.

Thanks for working on this!

- Arnaldo
 
> https://lore.kernel.org/all/6e09dc28-4a2e-49d8-a2b5-ffb3396a9952@csgroup.eu/T/
> 
> Main change were:
> 1. powerpc instruction nmemonic table to associate load/store
> instructions with move_ops which is use to identify if instruction
> is a memory access one.
> 2. To get register number and access offset from the given
> instruction, code uses fields from "struct arch" -> objump.
> Added entry for powerpc here.
> 3. A get_arch_regnum to return register number from the
> register name string.
> 
> But the apporach used in the initial patchset used parsing of
> disassembled code which the current perf tool implementation does.
> 
> Example: lwz     r10,0(r9)
> 
> This line "lwz r10,0(r9)" is parsed to extract instruction name,
> registers names and offset. Also to find whether there is a memory
> reference in the operands, "memory_ref_char" field of objdump is used.
> For x86, "(" is used as memory_ref_char to tackle instructions of the
> form "mov  (%rax), %rcx".
> 
> In case of powerpc, not all instructions using "(" are the only memory
> instructions. Example, above instruction can also be of extended form (X
> form) "lwzx r10,0,r19". Inorder to easy identify the instruction category
> and extract the source/target registers, second patchset added support to use
> raw instruction. With raw instruction, macros are added to extract opcode
> and register fields.
> Link to second patchset:
> https://lore.kernel.org/all/20240506121906.76639-1-atrajeev@linux.vnet.ibm.com/
> 
> Example representation using --show-raw-insn in objdump gives result:
> 
> 38 01 81 e8     ld      r4,312(r1)
> 
> Here "38 01 81 e8" is the raw instruction representation. In powerpc,
> this translates to instruction form: "ld RT,DS(RA)" and binary code
> as:
>   _____________________________________
>   | 58 |  RT  |  RA |      DS       | |
>   -------------------------------------
> 0    6     11    16              30 31
> 
> Second patchset used "objdump" again to read the raw instruction.
> But since there is no need to disassemble and binary code can be read
> directly from the DSO, third patchset (ie this patchset) uses below
> apporach. The apporach preferred in powerpc to parse sample for data
> type profiling in V3 patchset is:
> - Read directly from DSO using dso__data_read_offset
> - If that fails for any case, fallback to using libcapstone
> - If libcapstone is not supported, approach will use objdump
> 
> Patchset adds support to pick the opcode and reg fields from this
> raw/binary instruction code. This approach came in from review comment
> by Segher Boessenkool and Christophe for the initial patchset.
> 
> Apart from that, instruction tracking is enabled for powerpc.
> 
> These are handled in this patchset.
> 
> - Patch 1 is to rearrange register state type structures to header file
> so that it can referred from other arch specific files
> - Patch 2 is to make instruction tracking as a callback to"struct arch"
> so that it can be implemented by other archs easily and defined in arch
> specific files
> - Patch 3 is to handle state type regs array size for x86 and powerpc
> - Patch 4 adds support to capture and parse raw instruction in powerpc
> using dso__data_read_offset utility
> - Patch 4 also adds logic to support using objdump when doing default "perf
> report" or "perf annotate" since it that needs disassembled instruction.
> - Patch 5 adds disasm_line__parse to parse raw instruction for powerpc
> - Patch 6 update parameters for reg extract functions to use raw
> instruction on powerpc
> - Patch 7 updates ins__find to carry raw_insn and also adds parse
> callback for memory instructions for powerpc
> - Patch 8 add support to identify memory instructions of opcode 31 in
> powerpc
> - Patch 9 adds more instructions to support instruction tracking in powerpc
> - Patch 10 and 11 handles instruction tracking for powerpc.
> - Patch 12, 13 and 14 add support to use libcapstone in powerpc
> - Patch 15 handles insn-stat option for perf annotate
> 
> Note:
> - There are remaining unknowns (25%) as seen in annotate Instruction stats
> below.
> - This patchset is not tested on powerpc32. In next step of enhancements
> along with handling remaining unknowns, plan to cover powerpc32 changes
> based on how testing goes.
> - Patch changes to handle global register support will be posted in
> follow up after addressing review comment from Namhung here:
> https://lore.kernel.org/linux-perf-users/CAM9d7cg5MYvLeOoBuKqp1pw7uvRfqCw1fXpLtgct0npL96JaYg@mail.gmail.com/T/#mb309184ebf5d2c9b539cd86d61ffe2a3696eac71
> 
> With the current patchset:
> 
>  ./perf record -a -e mem-loads sleep 1
>  ./perf report -s type,typeoff --hierarchy --group --stdio
>  ./perf annotate --data-type --insn-stat
> 
> perf annotate logs:
> ==================
> 
> Annotate Instruction stats
> total 609, ok 266 (43.7%), bad 343 (56.3%)
> 
>   Name/opcode         :  Good   Bad
> -----------------------------------------------------------
>   58                  :   172   231
>   32                  :    48    44
>   34                  :    19    25
>   OP_31_XOP_LDX       :     8    20
>   40                  :     9    14
>   OP_31_XOP_LWARX     :     5     1
>   OP_31_XOP_LWZX      :     2     3
>   OP_31_XOP_LDARX     :     3     0
>   33                  :     0     2
>   OP_31_XOP_LBZX      :     0     1
>   OP_31_XOP_LWAX      :     0     1
>   OP_31_XOP_LHZX      :     0     1
>  
> perf report logs:
> =================
> 
>   Total Lost Samples: 0
> 
>   Samples: 1K of event 'mem-loads'
>   Event count (approx.): 937238
> 
>     Overhead  Data Type / Data Type Offset
>     ........  .........  ................
>      1.73%     struct task_struct
>         1.29%     struct task_struct +276 (flags)
>         0.10%     struct task_struct +1872 (mm)
>         0.08%     struct task_struct +3632 (thread.regs)
>         0.08%     struct task_struct +1072 (sched_class)
>         0.05%     struct task_struct +1928 ()
>         0.03%     struct task_struct +1064 (dl_server)
>         0.02%     struct task_struct +0 (thread_info.preempt_count)
>         0.02%     struct task_struct +120 (thread_info)
>         0.02%     struct task_struct +2216 (worker_private)
>         0.02%     struct task_struct +24 (thread_info.slb_preload_nr)
>         0.02%     struct task_struct +8 (thread_info.local_flags)
>         0.01%     struct task_struct +1416 (nr_cpus_allowed)
>      1.68%     struct file
>         0.58%     struct file +176 (f_op)
>         0.53%     struct file +24 (f_count.counter)
>         0.46%     struct file +20 (f_mode)
>         0.11%     struct file +200 (private_data)
>      1.63%     struct rq
>         0.76%     struct rq +2856 (cpu)
>         0.21%     struct rq +2592 (curr)
>         0.11%     struct rq +2632 (clock_update_flags)
>         0.08%     struct rq +2600 (idle)
>         0.06%     struct rq +2688 (clock_task)
>         0.05%     struct rq +16 (numa_migrate_on)
>         0.04%     struct rq +2864 (cfs_tasks.next)
> 
> Thanks
> Athira Rajeev
> 
> Changelog:
> >From v7 -> v8:
>   Split patches 1 to 14 and patch 18 as basic patches.
>   Patch 15, 16 and 17 is related to global register handling
>   and will be handled separately.
>   Added Reviewed-and-tested-by from Kajol Jain
>   Added Reviewed-by from Namhyung for patches 1 to 14, patch 18
> 
> >From v6 -> v7:
> - Addressed review comments from Namhyung
>   Changed format string space to %-20s while printing
>   instruction stats in patch 18.
>   Use cmp_null in patch 17 while comparing var_name to
>   properly sort with correct order.
> 
> >From v5 -> v6:
> - Addressed review comments from Namhyung
>   Conditionally define TYPE_STATE_MAX_REGS based on arch.
>   Added macro for defining width of the raw codes and spaces
>   in disasm_line__parse_powerpc.
>   Call disasm_line__parse from disasm_line__parse_powerpc
>   for generic code.
>   Renamed symbol__disassemble_dso to symbol__disassemble_raw.
>   Fixed find_data_type_global_reg to correclty free var_types
>   and change indent level.
>   Fixed data_type_cmp and sort__typeoff_sort to include var_name
>   in comparing data type entries.
>   
> >From v4 -> v5:
> - Addressed review comments from Namhyung
>   Handle max number of type state regs as 16 for x86 and 32 for
>   powerpc.
>   Added generic support for objdump patch first and DSO read
>   optimisation next
>   combined patch 3 and patch 4 in patchseries V4 to one patch
>   Changed reference for "raw_insn" to use "u32"
>   Splitted "parse" callback patch changes and "ins__find" patch
>   changes into two
>   Instead of making weak function, added get_powerpc_regs to
>   extract register and offset fields for powerpc
> - Addressed complation fail when "dwarf.h" is not present ie
>   elfutils devel is not present. Used includes for #ifdef HAVE_DWARF_SUPPORT
>   when including functions that use Dwarf references. Also
>   conditionally include some of the header files.
> 
> >From v3->v4:
> - Addressed review comments from Ian by using capston_init from
>   "util/print_insn.c" instead of "open_capston_handle".
> - Addressed review comment from Namhyung by moving "opcode"
>   field from "struct ins" to "struct disasm_line"
> 
> >From v2->v3:
> - Addressed review comments from Christophe and Namhyung for V2
> - Changed the apporach in powerpc to parse sample for data
>   type profiling as:
>   Read directly from DSO using dso__data_read_offset
>   If that fails for any case, fallback to using libcapstone
>   If libcapstone is not supported, approach will use objdump
> - Include instructions with opcode as 31 and correctly categorize
>   them as memory or arithmetic instructions.
> - Include more instructions for instruction tracking in powerpc
> 
> >From v1->v2:
> - Addressed suggestion from Christophe Leroy and Segher Boessenkool
>   to use the binary code (raw insn) to fetch opcode, register and
>   offset fields.
> - Added support for instruction tracking in powerpc
> - Find the register defined variables (r13 and r1 which points to
>   local_paca and current_stack_pointer in powerpc)
> 
> Athira Rajeev (15):
>   tools/perf: Move the data structures related to register type to
>     header file
>   tools/perf: Add "update_insn_state" callback function to handle arch
>     specific instruction tracking
>   tools/perf: Update TYPE_STATE_MAX_REGS to include max of regs in
>     powerpc
>   tools/perf: Add disasm_line__parse to parse raw instruction for
>     powerpc
>   tools/perf: Add support to capture and parse raw instruction in
>     powerpc using dso__data_read_offset utility
>   tools/perf: Update parameters for reg extract functions to use raw
>     instruction on powerpc
>   tools/perf: Add parse function for memory instructions in powerpc
>   tools/perf: Add support to identify memory instructions of opcode 31
>     in powerpc
>   tools/perf: Add some of the arithmetic instructions to support
>     instruction tracking in powerpc
>   tools/perf: Add more instructions for instruction tracking
>   tools/perf: Update instruction tracking for powerpc
>   tools/perf: Make capstone_init non-static so that it can be used
>     during symbol disassemble
>   tools/perf: Use capstone_init and remove open_capstone_handle from
>     disasm.c
>   tools/perf: Add support to use libcapstone in powerpc
>   tools/perf: Set instruction name to be used with insn-stat when using
>     raw instruction
> 
>  tools/include/linux/string.h                  |   2 +
>  tools/lib/string.c                            |  13 +
>  tools/perf/arch/arm64/annotate/instructions.c |   3 +-
>  .../arch/loongarch/annotate/instructions.c    |   6 +-
>  .../perf/arch/powerpc/annotate/instructions.c | 254 ++++++++++
>  tools/perf/arch/powerpc/util/dwarf-regs.c     |  53 ++
>  tools/perf/arch/s390/annotate/instructions.c  |   5 +-
>  tools/perf/arch/x86/annotate/instructions.c   | 377 ++++++++++++++
>  tools/perf/builtin-annotate.c                 |   4 +-
>  tools/perf/util/annotate-data.c               | 453 +----------------
>  tools/perf/util/annotate-data.h               |  83 ++++
>  tools/perf/util/annotate.c                    |  21 +-
>  tools/perf/util/annotate.h                    |   5 +-
>  tools/perf/util/disasm.c                      | 468 ++++++++++++++++--
>  tools/perf/util/disasm.h                      |  19 +-
>  tools/perf/util/include/dwarf-regs.h          |  11 +
>  tools/perf/util/print_insn.c                  |  15 +-
>  tools/perf/util/print_insn.h                  |   5 +
>  18 files changed, 1301 insertions(+), 496 deletions(-)
> 
> -- 
> 2.43.0
Athira Rajeev July 24, 2024, 7:59 a.m. UTC | #2
> On 24 Jul 2024, at 1:37 AM, Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
> 
> On Thu, Jul 18, 2024 at 02:13:43PM +0530, Athira Rajeev wrote:
>> The patchset from Namhyung added support for data type profiling
>> in perf tool. This enabled support to associate PMU samples to data
>> types they refer using DWARF debug information. With the upstream
>> perf, currently it possible to run perf report or perf annotate to
>> view the data type information on x86.
>> 
>> Initial patchset posted here had changes need to enable data type
>> profiling support for powerpc.
> 
> I have this in tmp.perf-tools-next now, will do more tests before moving
> to perf-tools-next.
> 
> I see that Namhyung reviewed the patch series, and that it was tested,
> in general I thought we need to go on improving things go avoid having
> arch specific code in the non-arch code, I guess right now its important
> to land these to try and encourage people to try data-type profiling
> more widely, hopefully finding and improving data structures in the
> kernel.
> 
> We need to write about such cases so as to provide success stories on
> using perf for data-type profiling.
> 
> Thanks for working on this!
Sure, Thanks Arnaldo for taking look at the patchset and pulling to tmp.perf-tools-next

- Athira
> 
> - Arnaldo
> 
>> https://lore.kernel.org/all/6e09dc28-4a2e-49d8-a2b5-ffb3396a9952@csgroup.eu/T/
>> 
>> Main change were:
>> 1. powerpc instruction nmemonic table to associate load/store
>> instructions with move_ops which is use to identify if instruction
>> is a memory access one.
>> 2. To get register number and access offset from the given
>> instruction, code uses fields from "struct arch" -> objump.
>> Added entry for powerpc here.
>> 3. A get_arch_regnum to return register number from the
>> register name string.
>> 
>> But the apporach used in the initial patchset used parsing of
>> disassembled code which the current perf tool implementation does.
>> 
>> Example: lwz     r10,0(r9)
>> 
>> This line "lwz r10,0(r9)" is parsed to extract instruction name,
>> registers names and offset. Also to find whether there is a memory
>> reference in the operands, "memory_ref_char" field of objdump is used.
>> For x86, "(" is used as memory_ref_char to tackle instructions of the
>> form "mov  (%rax), %rcx".
>> 
>> In case of powerpc, not all instructions using "(" are the only memory
>> instructions. Example, above instruction can also be of extended form (X
>> form) "lwzx r10,0,r19". Inorder to easy identify the instruction category
>> and extract the source/target registers, second patchset added support to use
>> raw instruction. With raw instruction, macros are added to extract opcode
>> and register fields.
>> Link to second patchset:
>> https://lore.kernel.org/all/20240506121906.76639-1-atrajeev@linux.vnet.ibm.com/
>> 
>> Example representation using --show-raw-insn in objdump gives result:
>> 
>> 38 01 81 e8     ld      r4,312(r1)
>> 
>> Here "38 01 81 e8" is the raw instruction representation. In powerpc,
>> this translates to instruction form: "ld RT,DS(RA)" and binary code
>> as:
>>  _____________________________________
>>  | 58 |  RT  |  RA |      DS       | |
>>  -------------------------------------
>> 0    6     11    16              30 31
>> 
>> Second patchset used "objdump" again to read the raw instruction.
>> But since there is no need to disassemble and binary code can be read
>> directly from the DSO, third patchset (ie this patchset) uses below
>> apporach. The apporach preferred in powerpc to parse sample for data
>> type profiling in V3 patchset is:
>> - Read directly from DSO using dso__data_read_offset
>> - If that fails for any case, fallback to using libcapstone
>> - If libcapstone is not supported, approach will use objdump
>> 
>> Patchset adds support to pick the opcode and reg fields from this
>> raw/binary instruction code. This approach came in from review comment
>> by Segher Boessenkool and Christophe for the initial patchset.
>> 
>> Apart from that, instruction tracking is enabled for powerpc.
>> 
>> These are handled in this patchset.
>> 
>> - Patch 1 is to rearrange register state type structures to header file
>> so that it can referred from other arch specific files
>> - Patch 2 is to make instruction tracking as a callback to"struct arch"
>> so that it can be implemented by other archs easily and defined in arch
>> specific files
>> - Patch 3 is to handle state type regs array size for x86 and powerpc
>> - Patch 4 adds support to capture and parse raw instruction in powerpc
>> using dso__data_read_offset utility
>> - Patch 4 also adds logic to support using objdump when doing default "perf
>> report" or "perf annotate" since it that needs disassembled instruction.
>> - Patch 5 adds disasm_line__parse to parse raw instruction for powerpc
>> - Patch 6 update parameters for reg extract functions to use raw
>> instruction on powerpc
>> - Patch 7 updates ins__find to carry raw_insn and also adds parse
>> callback for memory instructions for powerpc
>> - Patch 8 add support to identify memory instructions of opcode 31 in
>> powerpc
>> - Patch 9 adds more instructions to support instruction tracking in powerpc
>> - Patch 10 and 11 handles instruction tracking for powerpc.
>> - Patch 12, 13 and 14 add support to use libcapstone in powerpc
>> - Patch 15 handles insn-stat option for perf annotate
>> 
>> Note:
>> - There are remaining unknowns (25%) as seen in annotate Instruction stats
>> below.
>> - This patchset is not tested on powerpc32. In next step of enhancements
>> along with handling remaining unknowns, plan to cover powerpc32 changes
>> based on how testing goes.
>> - Patch changes to handle global register support will be posted in
>> follow up after addressing review comment from Namhung here:
>> https://lore.kernel.org/linux-perf-users/CAM9d7cg5MYvLeOoBuKqp1pw7uvRfqCw1fXpLtgct0npL96JaYg@mail.gmail.com/T/#mb309184ebf5d2c9b539cd86d61ffe2a3696eac71
>> 
>> With the current patchset:
>> 
>> ./perf record -a -e mem-loads sleep 1
>> ./perf report -s type,typeoff --hierarchy --group --stdio
>> ./perf annotate --data-type --insn-stat
>> 
>> perf annotate logs:
>> ==================
>> 
>> Annotate Instruction stats
>> total 609, ok 266 (43.7%), bad 343 (56.3%)
>> 
>>  Name/opcode         :  Good   Bad
>> -----------------------------------------------------------
>>  58                  :   172   231
>>  32                  :    48    44
>>  34                  :    19    25
>>  OP_31_XOP_LDX       :     8    20
>>  40                  :     9    14
>>  OP_31_XOP_LWARX     :     5     1
>>  OP_31_XOP_LWZX      :     2     3
>>  OP_31_XOP_LDARX     :     3     0
>>  33                  :     0     2
>>  OP_31_XOP_LBZX      :     0     1
>>  OP_31_XOP_LWAX      :     0     1
>>  OP_31_XOP_LHZX      :     0     1
>> 
>> perf report logs:
>> =================
>> 
>>  Total Lost Samples: 0
>> 
>>  Samples: 1K of event 'mem-loads'
>>  Event count (approx.): 937238
>> 
>>    Overhead  Data Type / Data Type Offset
>>    ........  .........  ................
>>     1.73%     struct task_struct
>>        1.29%     struct task_struct +276 (flags)
>>        0.10%     struct task_struct +1872 (mm)
>>        0.08%     struct task_struct +3632 (thread.regs)
>>        0.08%     struct task_struct +1072 (sched_class)
>>        0.05%     struct task_struct +1928 ()
>>        0.03%     struct task_struct +1064 (dl_server)
>>        0.02%     struct task_struct +0 (thread_info.preempt_count)
>>        0.02%     struct task_struct +120 (thread_info)
>>        0.02%     struct task_struct +2216 (worker_private)
>>        0.02%     struct task_struct +24 (thread_info.slb_preload_nr)
>>        0.02%     struct task_struct +8 (thread_info.local_flags)
>>        0.01%     struct task_struct +1416 (nr_cpus_allowed)
>>     1.68%     struct file
>>        0.58%     struct file +176 (f_op)
>>        0.53%     struct file +24 (f_count.counter)
>>        0.46%     struct file +20 (f_mode)
>>        0.11%     struct file +200 (private_data)
>>     1.63%     struct rq
>>        0.76%     struct rq +2856 (cpu)
>>        0.21%     struct rq +2592 (curr)
>>        0.11%     struct rq +2632 (clock_update_flags)
>>        0.08%     struct rq +2600 (idle)
>>        0.06%     struct rq +2688 (clock_task)
>>        0.05%     struct rq +16 (numa_migrate_on)
>>        0.04%     struct rq +2864 (cfs_tasks.next)
>> 
>> Thanks
>> Athira Rajeev
>> 
>> Changelog:
>>> From v7 -> v8:
>>  Split patches 1 to 14 and patch 18 as basic patches.
>>  Patch 15, 16 and 17 is related to global register handling
>>  and will be handled separately.
>>  Added Reviewed-and-tested-by from Kajol Jain
>>  Added Reviewed-by from Namhyung for patches 1 to 14, patch 18
>> 
>>> From v6 -> v7:
>> - Addressed review comments from Namhyung
>>  Changed format string space to %-20s while printing
>>  instruction stats in patch 18.
>>  Use cmp_null in patch 17 while comparing var_name to
>>  properly sort with correct order.
>> 
>>> From v5 -> v6:
>> - Addressed review comments from Namhyung
>>  Conditionally define TYPE_STATE_MAX_REGS based on arch.
>>  Added macro for defining width of the raw codes and spaces
>>  in disasm_line__parse_powerpc.
>>  Call disasm_line__parse from disasm_line__parse_powerpc
>>  for generic code.
>>  Renamed symbol__disassemble_dso to symbol__disassemble_raw.
>>  Fixed find_data_type_global_reg to correclty free var_types
>>  and change indent level.
>>  Fixed data_type_cmp and sort__typeoff_sort to include var_name
>>  in comparing data type entries.
>> 
>>> From v4 -> v5:
>> - Addressed review comments from Namhyung
>>  Handle max number of type state regs as 16 for x86 and 32 for
>>  powerpc.
>>  Added generic support for objdump patch first and DSO read
>>  optimisation next
>>  combined patch 3 and patch 4 in patchseries V4 to one patch
>>  Changed reference for "raw_insn" to use "u32"
>>  Splitted "parse" callback patch changes and "ins__find" patch
>>  changes into two
>>  Instead of making weak function, added get_powerpc_regs to
>>  extract register and offset fields for powerpc
>> - Addressed complation fail when "dwarf.h" is not present ie
>>  elfutils devel is not present. Used includes for #ifdef HAVE_DWARF_SUPPORT
>>  when including functions that use Dwarf references. Also
>>  conditionally include some of the header files.
>> 
>>> From v3->v4:
>> - Addressed review comments from Ian by using capston_init from
>>  "util/print_insn.c" instead of "open_capston_handle".
>> - Addressed review comment from Namhyung by moving "opcode"
>>  field from "struct ins" to "struct disasm_line"
>> 
>>> From v2->v3:
>> - Addressed review comments from Christophe and Namhyung for V2
>> - Changed the apporach in powerpc to parse sample for data
>>  type profiling as:
>>  Read directly from DSO using dso__data_read_offset
>>  If that fails for any case, fallback to using libcapstone
>>  If libcapstone is not supported, approach will use objdump
>> - Include instructions with opcode as 31 and correctly categorize
>>  them as memory or arithmetic instructions.
>> - Include more instructions for instruction tracking in powerpc
>> 
>>> From v1->v2:
>> - Addressed suggestion from Christophe Leroy and Segher Boessenkool
>>  to use the binary code (raw insn) to fetch opcode, register and
>>  offset fields.
>> - Added support for instruction tracking in powerpc
>> - Find the register defined variables (r13 and r1 which points to
>>  local_paca and current_stack_pointer in powerpc)
>> 
>> Athira Rajeev (15):
>>  tools/perf: Move the data structures related to register type to
>>    header file
>>  tools/perf: Add "update_insn_state" callback function to handle arch
>>    specific instruction tracking
>>  tools/perf: Update TYPE_STATE_MAX_REGS to include max of regs in
>>    powerpc
>>  tools/perf: Add disasm_line__parse to parse raw instruction for
>>    powerpc
>>  tools/perf: Add support to capture and parse raw instruction in
>>    powerpc using dso__data_read_offset utility
>>  tools/perf: Update parameters for reg extract functions to use raw
>>    instruction on powerpc
>>  tools/perf: Add parse function for memory instructions in powerpc
>>  tools/perf: Add support to identify memory instructions of opcode 31
>>    in powerpc
>>  tools/perf: Add some of the arithmetic instructions to support
>>    instruction tracking in powerpc
>>  tools/perf: Add more instructions for instruction tracking
>>  tools/perf: Update instruction tracking for powerpc
>>  tools/perf: Make capstone_init non-static so that it can be used
>>    during symbol disassemble
>>  tools/perf: Use capstone_init and remove open_capstone_handle from
>>    disasm.c
>>  tools/perf: Add support to use libcapstone in powerpc
>>  tools/perf: Set instruction name to be used with insn-stat when using
>>    raw instruction
>> 
>> tools/include/linux/string.h                  |   2 +
>> tools/lib/string.c                            |  13 +
>> tools/perf/arch/arm64/annotate/instructions.c |   3 +-
>> .../arch/loongarch/annotate/instructions.c    |   6 +-
>> .../perf/arch/powerpc/annotate/instructions.c | 254 ++++++++++
>> tools/perf/arch/powerpc/util/dwarf-regs.c     |  53 ++
>> tools/perf/arch/s390/annotate/instructions.c  |   5 +-
>> tools/perf/arch/x86/annotate/instructions.c   | 377 ++++++++++++++
>> tools/perf/builtin-annotate.c                 |   4 +-
>> tools/perf/util/annotate-data.c               | 453 +----------------
>> tools/perf/util/annotate-data.h               |  83 ++++
>> tools/perf/util/annotate.c                    |  21 +-
>> tools/perf/util/annotate.h                    |   5 +-
>> tools/perf/util/disasm.c                      | 468 ++++++++++++++++--
>> tools/perf/util/disasm.h                      |  19 +-
>> tools/perf/util/include/dwarf-regs.h          |  11 +
>> tools/perf/util/print_insn.c                  |  15 +-
>> tools/perf/util/print_insn.h                  |   5 +
>> 18 files changed, 1301 insertions(+), 496 deletions(-)
>> 
>> -- 
>> 2.43.0