From patchwork Sat Jul 2 11:33:24 2022
X-Patchwork-Submitter: Alberto Faria
X-Patchwork-Id: 1651461
From: Alberto Faria <afaria@redhat.com>
To: qemu-devel@nongnu.org
Lunev" , Emanuele Giuseppe Esposito , Stefan Hajnoczi , Ronnie Sahlberg , Hanna Reitz , Stefano Garzarella , Kevin Wolf , Peter Xu , Alberto Garcia , John Snow , Eric Blake , Fam Zheng , Markus Armbruster , Vladimir Sementsov-Ogievskiy , Peter Lieven , Alberto Faria Subject: [RFC 1/8] Add an extensible static analyzer Date: Sat, 2 Jul 2022 12:33:24 +0100 Message-Id: <20220702113331.2003820-2-afaria@redhat.com> In-Reply-To: <20220702113331.2003820-1-afaria@redhat.com> References: <20220702113331.2003820-1-afaria@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.85 on 10.11.54.9 Received-SPF: pass client-ip=170.10.133.124; envelope-from=afaria@redhat.com; helo=us-smtp-delivery-124.mimecast.com X-Spam_score_int: -21 X-Spam_score: -2.2 X-Spam_bar: -- X-Spam_report: (-2.2 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.082, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_NONE=-0.0001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org Sender: "Qemu-devel" Add a static-analyzer.py script that uses libclang's Python bindings to provide a common framework on which arbitrary static analysis checks can be developed and run against QEMU's code base. As an example, a simple check is included that verifies that the return value of static, non-void functions is used by at least one caller. Signed-off-by: Alberto Faria --- static-analyzer.py | 509 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 509 insertions(+) create mode 100755 static-analyzer.py diff --git a/static-analyzer.py b/static-analyzer.py new file mode 100755 index 0000000000..010cc92212 --- /dev/null +++ b/static-analyzer.py @@ -0,0 +1,509 @@ +#!/usr/bin/env python3 +# ---------------------------------------------------------------------------- # + +from __future__ import annotations + +from dataclasses import dataclass +import json +import os +import os.path +import subprocess +import sys +import re +from argparse import ArgumentParser, Namespace, RawDescriptionHelpFormatter +from multiprocessing import Pool +from pathlib import Path +from typing import ( + Any, + Callable, + Dict, + Iterable, + List, + NoReturn, + Optional, + Mapping, + Sequence, + Tuple, +) + +# ---------------------------------------------------------------------------- # + +from clang.cindex import ( # type: ignore + Cursor, + CursorKind, + Diagnostic, + StorageClass, + TranslationUnit, + TranslationUnitLoadError, + TypeKind, +) + +Cursor.__hash__ = lambda self: self.hash # so `Cursor`s can be dict keys + +# ---------------------------------------------------------------------------- # +# Usage + + +def parse_args() -> Namespace: + + available_checks = "\n".join(f" {name}" for (name, _) in CHECKS) + + parser = ArgumentParser( + formatter_class=RawDescriptionHelpFormatter, + epilog=f""" +available checks: +{available_checks} + +exit codes: + 0 No problems found. + 1 Analyzer failure. + 2 Bad usage. + 3 Problems found in a translation unit. +""", + ) + + parser.add_argument("build_dir", type=Path) + + parser.add_argument( + "translation_units", + type=Path, + nargs="*", + help=( + "Analyze only translation units whose root source file matches or" + " is under one of the given paths." 

diff --git a/static-analyzer.py b/static-analyzer.py
new file mode 100755
index 0000000000..010cc92212
--- /dev/null
+++ b/static-analyzer.py
@@ -0,0 +1,509 @@
+#!/usr/bin/env python3
+# ---------------------------------------------------------------------------- #
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+import json
+import os
+import os.path
+import subprocess
+import sys
+import re
+from argparse import ArgumentParser, Namespace, RawDescriptionHelpFormatter
+from multiprocessing import Pool
+from pathlib import Path
+from typing import (
+    Any,
+    Callable,
+    Dict,
+    Iterable,
+    List,
+    NoReturn,
+    Optional,
+    Mapping,
+    Sequence,
+    Tuple,
+)
+
+# ---------------------------------------------------------------------------- #
+
+from clang.cindex import (  # type: ignore
+    Cursor,
+    CursorKind,
+    Diagnostic,
+    StorageClass,
+    TranslationUnit,
+    TranslationUnitLoadError,
+    TypeKind,
+)
+
+Cursor.__hash__ = lambda self: self.hash  # so `Cursor`s can be dict keys
+
+# ---------------------------------------------------------------------------- #
+# Usage
+
+
+def parse_args() -> Namespace:
+
+    available_checks = "\n".join(f" {name}" for (name, _) in CHECKS)
+
+    parser = ArgumentParser(
+        formatter_class=RawDescriptionHelpFormatter,
+        epilog=f"""
+available checks:
+{available_checks}
+
+exit codes:
+ 0 No problems found.
+ 1 Analyzer failure.
+ 2 Bad usage.
+ 3 Problems found in a translation unit.
+""",
+    )
+
+    parser.add_argument("build_dir", type=Path)
+
+    parser.add_argument(
+        "translation_units",
+        type=Path,
+        nargs="*",
+        help=(
+            "Analyze only translation units whose root source file matches or"
+            " is under one of the given paths."
+        ),
+    )
+
+    parser.add_argument(
+        "-c",
+        "--check",
+        metavar="CHECK",
+        dest="check_names",
+        choices=[name for (name, _) in CHECKS],
+        action="append",
+        help=(
+            "Enable the given check. Can be given multiple times. If not given,"
+            " all checks are enabled."
+        ),
+    )
+
+    parser.add_argument(
+        "-j",
+        "--jobs",
+        dest="threads",
+        type=int,
+        help=(
+            "Number of threads to employ. Defaults to the number of logical"
+            " processors."
+        ),
+    )
+
+    parser.add_argument(
+        "--profile",
+        action="store_true",
+        help="Profile execution. Forces single-threaded execution.",
+    )
+
+    return parser.parse_args()
+
+
+# ---------------------------------------------------------------------------- #
+# Main
+
+
+def main() -> NoReturn:
+
+    args = parse_args()
+
+    compile_commands = load_compilation_database(args)
+    contexts = get_translation_unit_contexts(args, compile_commands)
+
+    analyze_translation_units(args, contexts)
+
+
+def load_compilation_database(args: Namespace) -> Sequence[Mapping[str, str]]:
+
+    # clang.cindex.CompilationDatabase.getCompileCommands() apparently produces
+    # entries for files not listed in compile_commands.json in a best-effort
+    # manner, which we don't want, so we parse the JSON ourselves instead.
+
+    path = args.build_dir / "compile_commands.json"
+
+    try:
+        with path.open("r") as f:
+            return json.load(f)
+    except FileNotFoundError:
+        fatal(f"{path} does not exist")
+
+
+def get_translation_unit_contexts(
+    args: Namespace, compile_commands: Iterable[Mapping[str, str]]
+) -> Sequence[TranslationUnitContext]:
+
+    system_include_paths = get_clang_system_include_paths()
+    check_names = args.check_names or [name for (name, _) in CHECKS]
+
+    contexts = (
+        TranslationUnitContext(
+            absolute_path=str(Path(cmd["directory"], cmd["file"]).resolve()),
+            compilation_working_dir=cmd["directory"],
+            compilation_command=cmd["command"],
+            system_include_paths=system_include_paths,
+            check_names=check_names,
+        )
+        for cmd in compile_commands
+    )
+
+    if args.translation_units:
+
+        allowed_prefixes = [
+            # ensure path exists and is slash-terminated (even if it is a file)
+            os.path.join(path.resolve(strict=True), "")
+            for path in args.translation_units
+        ]
+
+        contexts = (
+            ctx
+            for ctx in contexts
+            if any(
+                (ctx.absolute_path + "/").startswith(prefix)
+                for prefix in allowed_prefixes
+            )
+        )
+
+    context_list = list(contexts)
+
+    if not context_list:
+        fatal("No translation units to analyze")
+
+    return context_list
+
+
+def get_clang_system_include_paths() -> Sequence[str]:
+
+    # libclang does not automatically include clang's standard system include
+    # paths, so we ask clang what they are and include them ourselves.
+
+    # TODO: Is there a less hacky way to do this?
+
+    result = subprocess.run(
+        ["clang", "-E", "-", "-v"],
+        stdin=subprocess.DEVNULL,
+        stdout=subprocess.DEVNULL,
+        stderr=subprocess.PIPE,
+        universal_newlines=True,  # decode stdout/stderr using default encoding
+        check=True,
+    )
+
+    # Module `re` does not support repeated group captures.
+    pattern = (
+        r"#include <...> search starts here:\n"
+        r"((?: \S*\n)+)"
+        r"End of search list."
+    )
+
+    match = re.search(pattern, result.stderr, re.MULTILINE)
+    assert match is not None
+
+    return [line[1:] for line in match.group(1).splitlines()]
+
+
+def fatal(message: str) -> NoReturn:
+    print(f"\033[0;31mERROR: {message}\033[0m")
+    sys.exit(1)
+
+
+# ---------------------------------------------------------------------------- #
+# Analysis
+
+
+@dataclass
+class TranslationUnitContext:
+    absolute_path: str
+    compilation_working_dir: str
+    compilation_command: str
+    system_include_paths: Sequence[str]
+    check_names: Sequence[str]
+
+
+def analyze_translation_units(
+    args: Namespace, contexts: Sequence[TranslationUnitContext]
+) -> NoReturn:
+
+    results: List[bool]
+
+    if not args.profile:
+
+        with Pool(processes=args.threads) as pool:
+            results = pool.map(analyze_translation_unit, contexts)
+
+    else:
+
+        import cProfile
+        import pstats
+
+        profile = cProfile.Profile()
+
+        try:
+            results = profile.runcall(
+                lambda: list(map(analyze_translation_unit, contexts))
+            )
+        finally:
+            stats = pstats.Stats(profile, stream=sys.stderr)
+            stats.strip_dirs()
+            stats.sort_stats("tottime")
+            stats.print_stats()
+
+    print(
+        f"\033[0;34mAnalyzed {len(contexts)}"
+        f" translation unit{'' if len(contexts) == 1 else 's'}.\033[0m"
+    )
+
+    sys.exit(0 if all(results) else 3)
+
+
+def analyze_translation_unit(context: TranslationUnitContext) -> bool:
+
+    # relative to script's original working directory
+    relative_path = os.path.relpath(context.absolute_path)
+
+    # load translation unit
+
+    command = context.compilation_command.split()
+
+    adjusted_command = [
+        # keep the original compilation command name
+        command[0],
+        # ignore unknown GCC warning options
+        "-Wno-unknown-warning-option",
+        # add clang system include paths
+        *(
+            arg
+            for path in context.system_include_paths
+            for arg in ("-isystem", path)
+        ),
+        # keep all other arguments but the last, which is the file name
+        *command[1:-1],
+        # replace relative path to get absolute location information
+        context.absolute_path,
+    ]
+
+    original_cwd = os.getcwd()
+    os.chdir(context.compilation_working_dir)  # for options like -I to work
+
+    try:
+        tu = TranslationUnit.from_source(filename=None, args=adjusted_command)
+    except TranslationUnitLoadError as e:
+        raise RuntimeError(f"Failed to load {relative_path}") from e
+
+    os.chdir(original_cwd)  # to have proper relative paths in messages
+
+    # check for fatal diagnostics
+
+    found_problems = False
+
+    for diag in tu.diagnostics:
+        # consider only Fatal diagnostics, like missing includes
+        if diag.severity >= Diagnostic.Fatal:
+            found_problems = True
+            location = format_location(diag, default=relative_path)
+            print(
+                f"\033[0;33m{location}: {diag.spelling}; this may lead to false"
+                f" positives and negatives\033[0m"
+            )
+
+    # analyze translation unit
+
+    def log(node: Cursor, message: str) -> None:
+        nonlocal found_problems
+        found_problems = True
+        print(f"{format_location(node)}: {message}")
+
+    try:
+        for (name, checker) in CHECKS:
+            if name in context.check_names:
+                checker(tu, context.absolute_path, log)
+    except Exception as e:
+        raise RuntimeError(f"Error analyzing {relative_path}") from e
+
+    return not found_problems
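+
+
+# A problem reported through log() above is printed to stdout roughly as
+# follows (path, position, and function name are illustrative):
+#
+#   util/foo.c:12:13: foo_helper() return value is never used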
+
+
+# obj must have a location field/property that is a `SourceLocation`.
+def format_location(obj: Any, *, default: str = "(none)") -> str:
+
+    location = obj.location
+
+    if location.file is None:
+        return default
+    else:
+        abs_path = Path(location.file.name).resolve()
+        rel_path = os.path.relpath(abs_path)
+        return f"{rel_path}:{location.line}:{location.column}"
+
+
+# ---------------------------------------------------------------------------- #
+# Checks
+
+Checker = Callable[[TranslationUnit, str, Callable[[Cursor, str], None]], None]
+
+CHECKS: List[Tuple[str, Checker]] = []
+
+
+def check(name: str) -> Callable[[Checker], Checker]:
+    def decorator(checker: Checker) -> Checker:
+        CHECKS.append((name, checker))
+        return checker
+
+    return decorator
+
+
+@check("return-value-never-used")
+def check_return_value_never_used(
+    translation_unit: TranslationUnit,
+    translation_unit_path: str,
+    log: Callable[[Cursor, str], None],
+) -> None:
+    """
+    Report static functions with a non-void return value that no caller ever
+    uses.
+
+    This check is best effort, but it should never report false positives
+    (where a positive is a reported problem).
+    """
+
+    def function_occurrence_might_use_return_value(
+        ancestors: Sequence[Cursor], node: Cursor
+    ) -> bool:
+
+        if ancestors[-1].kind.is_statement():
+
+            return False
+
+        elif (
+            ancestors[-1].kind == CursorKind.CALL_EXPR
+            and ancestors[-1].referenced == node.referenced
+        ):
+
+            if not ancestors[-2].kind.is_statement():
+                return True
+
+            if ancestors[-2].kind in [
+                CursorKind.IF_STMT,
+                CursorKind.SWITCH_STMT,
+                CursorKind.WHILE_STMT,
+                CursorKind.DO_STMT,
+                CursorKind.RETURN_STMT,
+            ]:
+                return True
+
+            if ancestors[-2].kind == CursorKind.FOR_STMT:
+                [_init, cond, _adv] = ancestors[-2].get_children()
+                if ancestors[-1] == cond:
+                    return True
+
+            return False
+
+        else:
+
+            # might be doing something with a pointer to the function
+            return True
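+
+    # For example (illustrative): `x = foo();` and `if (foo()) { ... }` count
+    # as uses of foo()'s return value, while a bare `foo();` statement does
+    # not.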
+
+    # Maps canonical function `Cursor`s to whether we found a place that maybe
+    # uses their return value. Only includes static functions that don't return
+    # void and belong to the translation unit's root file (i.e., were not
+    # brought in by an #include).
+    funcs: Dict[Cursor, bool] = {}
+
+    for [*ancestors, node] in all_paths(translation_unit.cursor):
+
+        if (
+            node.kind == CursorKind.FUNCTION_DECL
+            and node.storage_class == StorageClass.STATIC
+            and node.location.file.name == translation_unit_path
+            and node.type.get_result().get_canonical().kind != TypeKind.VOID
+        ):
+            funcs.setdefault(node.canonical, False)
+
+        if (
+            node.kind == CursorKind.DECL_REF_EXPR
+            and node.referenced.kind == CursorKind.FUNCTION_DECL
+            and node.referenced.canonical in funcs
+            and function_occurrence_might_use_return_value(ancestors, node)
+        ):
+            funcs[node.referenced.canonical] = True
+
+    # ---
+
+    for (cursor, return_value_maybe_used) in funcs.items():
+        if not return_value_maybe_used:
+            log(cursor, f"{cursor.spelling}() return value is never used")
+
+
+# ---------------------------------------------------------------------------- #
+# Traversal
+
+# Hides nodes of kind UNEXPOSED_EXPR.
+def all_paths(root: Cursor) -> Iterable[Sequence[Cursor]]:
+
+    path = []
+
+    def aux(node: Cursor) -> Iterable[Sequence[Cursor]]:
+        nonlocal path
+
+        if node.kind != CursorKind.UNEXPOSED_EXPR:
+            path.append(node)
+            yield path
+
+        for child in node.get_children():
+            yield from aux(child)
+
+        if node.kind != CursorKind.UNEXPOSED_EXPR:
+            path.pop()
+
+    return aux(root)
+
+
+# ---------------------------------------------------------------------------- #
+# Utilities handy for development
+
+
+def print_node(node: Cursor) -> None:
+
+    print(f"{format_location(node)}: kind = {node.kind.name}", end="")
+
+    if node.spelling:
+        print(f", name = {node.spelling}", end="")
+
+    if node.type is not None:
+        print(f", type = {node.type.get_canonical().spelling}", end="")
+
+    if node.referenced is not None:
+        print(f", referenced = {node.referenced.spelling}", end="")
+
+    print()
+
+
+def print_tree(
+    node: Cursor, *, max_depth: Optional[int] = None, indentation_level: int = 0
+) -> None:
+
+    if max_depth is None or max_depth >= 0:
+
+        print(" " * indentation_level, end="")
+        print_node(node)
+
+        for child in node.get_children():
+            print_tree(
+                child,
+                max_depth=None if max_depth is None else max_depth - 1,
+                indentation_level=indentation_level + 1,
+            )
+
+
+# ---------------------------------------------------------------------------- #
+
+if __name__ == "__main__":
+    main()
+
+# ---------------------------------------------------------------------------- #