Introduction

Clearblue: Software X-ray For a New Era

What Is Clearblue and Why Do We Need It?

X-ray imaging is vital for human health, yet there is no equivalent technology for the health of software, which leads to high-impact quality and security incidents. To address this problem, we have developed Clearblue, a tool that automatically reasons about what ultra-large-scale software is really doing.

Targets of Clearblue

Clearblue, often likened to an “X-ray” for software, is engineered with the following design objectives:

  • Scalable: Capture an “image” of the software’s behavior precisely and efficiently, ensuring that the analysis scales to ultra-large-scale software systems.
  • Incremental: Support both spatially and temporally asynchronous analysis, allowing comprehensive and dynamic insights into software behavior over time and across different system components.
  • Direct: Take the final artifact, the binary, as input, giving a direct and unobstructed view of the software’s behavior without requiring source code.
  • Customizable: Provide the flexibility to define specific analysis queries using Clearblue’s APIs and Domain-Specific Languages (DSLs), enabling tailored insights that meet unique investigative needs.

Clearblue’s Advanced Capabilities

Clearblue has been engineered to consolidate our pioneering research breakthroughs, setting a new standard in software analysis:

  • Full Path Sensitivity: Clearblue achieves unprecedented precision, providing full path sensitivity across a depth of six call layers, the most advanced result we are aware of, as recognized in prominent research [PLDI ‘18, PLDI ‘21] (see the illustrative C++ sketch after this list).

  • Asynchronous Analysis: Clearblue innovates by separating search operations from the program imaging process:

    • The target program is efficiently distributed, converted into its behavioral “image”, and persisted on disk (currently under development).
    • The program’s behavior is meticulously indexed and fetched selectively as needed [OOPSLA ‘21].
  • Non-intrusive Lifting: Clearblue utilizes binary input to generate Intermediate Representation (IR), all while leaving the build process untouched:

    • It supports binaries across architectures including x86-64 and ARM, in the PE, Mach-O, and ELF formats, as well as JVM bytecode.
  • Clearblue API: Clearblue introduces an array of SDKs and APIs to facilitate diverse queries:

    • Tailoring security scans (e.g., for NULL Pointer Dereference) and inquiries about program specifics (e.g., identifying callees of a particular function) is streamlined and intuitive.
    • It equips users with querying capabilities analogous to SQL for databases, enhancing accessibility and usability (see the toy query sketch below).
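
To make the precision claim concrete, the following minimal C++ example (our own illustration, not code from the Clearblue project) shows the classic pattern that full path sensitivity handles: the two branches on flag are correlated, so the dereference is safe, yet an analysis that merges program states after the first branch would report a spurious NULL-pointer dereference.

```cpp
// Illustrative only: the two `if (flag)` branches are correlated, so `*p` is
// evaluated only on paths where `p` was assigned. A path-insensitive analysis
// merges states after the first branch, sees `p` as possibly null, and raises
// a false alarm; a path-sensitive analysis tracks the branch condition and
// proves the dereference safe.
int compute(bool flag) {
    int *p = nullptr;
    if (flag)
        p = new int(42);   // `new` throws on failure; it never returns null
    int result = 0;
    if (flag)
        result = *p;       // reached only when the branch above also ran
    delete p;              // deleting a null pointer is a no-op
    return result;
}

int main() { return compute(true); }
```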

These features position Clearblue at the forefront of the “Code as Database” paradigm, shaping the framework for a revolutionary approach to code analysis.
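
To give a rough feel for the “Code as Database” idea, here is a toy, self-contained C++ sketch of the kind of question a Clearblue query answers (identifying the callees of a function). The call graph below is hand-built for illustration; the actual Clearblue APIs operate on the behavior database and are not shown here.

```cpp
// Toy "code as database" style query: who are the callees of a given function?
// In Clearblue the analogous facts would be fetched from the behavior database
// through its APIs; here a hand-built map stands in for that database.
#include <iostream>
#include <map>
#include <set>
#include <string>

using CallGraph = std::map<std::string, std::set<std::string>>;

std::set<std::string> callees_of(const CallGraph &cg, const std::string &fn) {
    auto it = cg.find(fn);
    return it == cg.end() ? std::set<std::string>{} : it->second;
}

int main() {
    CallGraph cg = {
        {"main", {"parse_args", "run"}},
        {"run",  {"open_file", "process"}},
    };
    for (const auto &callee : callees_of(cg, "run"))
        std::cout << callee << "\n";   // prints: open_file, process
}
```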

Clearblue Architecture

[Figure: Clearblue architecture]

Clearblue Components

  • Plankton: Binary lifter - A tool that lifts binary code to a higher-level representation.
  • Jellyfish: JVM lifter - A tool that translates Java Virtual Machine (JVM) programs to the Low-Level Virtual Machine (LLVM) representation.
  • Swordfish: Parallel pointer analysis - A module for performing pointer analysis in parallel, which is essential for building Clearblue IR.
  • Cod: Numerical analysis - A module for tracking value ranges statically.
  • Coral: Call graph analysis - A module for analyzing the function call graph of programs.
  • Manta: Type inference - A module that recovers type information from stripped binaries, making analysis more accurate.
  • Sponge: Software behavior database - A module that loads and stores software behavior and manages data storage efficiently on demand.
  • Sailfish: Path-sensitive vulnerability scanner - A parallel scanner that detects vulnerabilities on the Clearblue IR (SEG).

Clearblue’s Imaging Process: An Overview

Clearblue’s sophisticated imaging process encompasses several key steps, meticulously designed to transform binary data into actionable insights:

  • LLVM IR Lifting: The process begins with the binary being lifted to LLVM Intermediate Representation (IR):

    • Plankton: Utilizes available debugging information to reclaim source code details.
    • Manta: Deduces type information from stripped binaries, enhancing the reconstruction.
  • IR Generation and Persistence: A bottom-up data flow analysis culminates in the creation of the proprietary Clearblue IR, as presented at [PLDI ‘18], which is then systematically stored in a behavior database [ICSE ‘24] (a toy sketch of the bottom-up summary idea follows this list):

    • Scheduling Optimization: Highly parallelized data-flow analysis [ICSE ‘20].
    • Path Condition Embedding: Innovatively condenses the space expression of path conditions [PLDI ‘21].
    • Data Flow Indexing: Makes data flow shortcuts, facilitating rapid retrieval and analysis [OOPSLA ‘22].
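
As a rough intuition for the bottom-up step above (a toy sketch under simplifying assumptions, not Clearblue’s actual IR, SEG, or algorithm), the snippet below computes one reusable summary per function (here, whether it may return NULL), visiting callees before callers so each summary is computed once and reused at every call site.

```cpp
// Toy illustration (not Clearblue's algorithm): a bottom-up analysis that
// summarizes whether each function may return NULL. Callees are summarized
// before their callers, so each summary is computed once and cached.
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

// Maps a function to the callees whose return value it may pass through.
using PassThrough = std::map<std::string, std::vector<std::string>>;

// Assumes the call graph is acyclic, so recursing into callees visits the
// graph bottom-up; `summary` caches each function's result.
bool may_return_null(const std::string &fn,
                     const PassThrough &pass,
                     const std::set<std::string> &returns_null_literal,
                     std::map<std::string, bool> &summary) {
    auto cached = summary.find(fn);
    if (cached != summary.end()) return cached->second;   // reuse summary
    bool result = returns_null_literal.count(fn) > 0;
    auto it = pass.find(fn);
    if (it != pass.end())
        for (const auto &callee : it->second)
            result = result || may_return_null(callee, pass,
                                               returns_null_literal, summary);
    return summary[fn] = result;
}

int main() {
    PassThrough pass = {{"wrapper", {"lookup"}}, {"top", {"wrapper"}}};
    std::set<std::string> returns_null_literal = {"lookup"};
    std::map<std::string, bool> summary;
    std::cout << may_return_null("top", pass, returns_null_literal, summary)
              << "\n";   // prints 1: NULL can flow from lookup() up to top()
}
```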

Versus Other Tools

Here we list the main differences from mainstream analysis tools on the market.

| Feature | SVF | Infer | CSA | Clearblue | Impact |
|---|---|---|---|---|---|
| Precision | Flow- and context-sensitive | Partially path-sensitive (up to 20 states) | Path-sensitive (file-level) | All-sensitive (6 layers of calls) | Clearblue produces fewer false positives |
| IR Generation | Intrusive | Intrusive | Intrusive | Non-intrusive | Clearblue can analyze all software artifacts |
| Scalability | 1 MLoC | 1 MLoC | 10 MLoC (file-level) | 10 MLoC | Clearblue can analyze system and cloud software |
| Asynchronous Analysis | No | No | No | Yes | Clearblue can analyze software with complex supply-chain libraries |
| Customization | SVF APIs on LLVM IR | No | Clang AST | Clearblue APIs on SEG | Clearblue can easily support a wide range of requirements |
