Parse, don't Validate: Aid Static Analysis for C
[essay] 2 min read
Abstract
The “Parse, Don’t Validate” philosophy, typically seen as a way to improve program logic in type-safe languages, can significantly enhance static analysis by providing analyzers with powerful, type-based hints. A compelling application is the systematic classification of function pointers in the C language to refine analysis precision.
본문
The “Parse, Don’t Validate”[^ParseDontValidate] philosophy, a zeitgeist for type-driven design, argues that replacing runtime validation checks with an upfront parsing step at the system’s boundary makes program logic more robust and simple. For example, a validation-based approach for head returns an optional, monadic type, burdening the data pipeline with repetitive, often unnecessary downstream checks. In contrast, the parsing approach restricts the function’s domain to a non-empty list type, thereby handling potential fallacies before any downstream computation.
This philosophy can alleviate the complexity of static analyzers, since the function signature will help to precisely constrain the abstract domains of function variables. Also, the philosophy itself could not replace the static analysis. While languages like Lean use dependent types to formally prove properties within the type system itself, their practical application for coding is limited as being Turing-incomplete.
This advantage is particularly relevant in less type-safe languages like C, where the “trust the programmer” philosophy provides minimal type hints for analyzers. A primary challenge for static analysis in C is the pervasive use of function pointers, which obscures type information. For example, a project like MP4Box loses crucial domain information by using generic function pointers to pass various distinct read functions with different input and output restriction. However, by classifying them based on their semantic input/output contracts, and using separate parsing functions for each type, static analysis could leverage this classification to infer more precise abstract domains.