Generate Program Query

This guide provides detailed instructions on how to use SEG for generate program query.

Users can leverage the previously introduced SEG API and Call Graph API to query customized program features. Given that these APIs are built on graph structures, data and relations can be extracted from these graphs. The following are some examples demonstrating the capabilities of these APIs.

Running Example

#include <stdlib.h>
#include <string.h>

void init(char* token, char* password){
  password = token;
}

void f(char *password) {
  char localToken[256];
  init(localToken, password);
  memset(password, ' ', strlen(password));
  memset(localToken, ' ', strlen(localToken));
  free(password);
}

int main(){
  f("password");
}

We are going to use the running example above to demonstrate several queries.

List callee/caller for each function.

Basic idea:

  1. As the callee/caller relationships captured in CallGraphNode, we start by retrieving the corresponding CallGraphNode for a function using the iterator begin() : end() and operator[] APIs which are provided by the CallGraph.

  2. Once we have the CallGraphNode, we list the called functions or caller functions via two iterators: begin() : end() and caller_begin() : caller_end().

void list_callee_and_cs_foreach_function() {
    // Callee
    for (auto cg_b = cg->begin(); cg_b != cg->end(); cg_b++) {
        auto func = cg_b->first;
        if (func && !func->isDeclaration()) {
            outs() << "Function: " << cg_b->first->getName() << "\n";
            for (auto cg_node_b = cg_b->second->begin(); cg_node_b != cg_b->second->end(); cg_node_b++) {
            outs() << "CS in Line " << DIA->getSrcLine(cg_node_b->first) << " calls ";
            if (Function *callee = cg_node_b->second->getFunction())
                outs() << "function '" << callee->getName() << "'\n";
            else
                outs() << "external node\n";
            }
            outs() << '\n';
        }
    }
    // Caller
    for (auto cg_b = cg->begin(); cg_b != cg->end(); cg_b++) {
        auto func = cg_b->first;
        if (func && !func->isDeclaration()) {
            outs() << "Function: " << cg_b->first->getName() << "\n";
            for (auto cg_node_b = cg_b->second->caller_begin(); cg_node_b != cg_b->second->caller_end(); cg_node_b++) {
            if (cg_node_b->first) {
                outs() << "is called in Line " << DIA->getSrcLine(cg_node_b->first) << " by";
                Function *caller = cg_node_b->second->getFunction();
                if (caller) {
                outs() << "function '" << caller->getName() << "'\n";
                }
            }
            }
            outs() << '\n';
        }
    }
}
  1. The outer loop (for (auto cg_b = cg->begin(); cg_b != cg->end(); cg_b++)) use the call graph iterator to iterate through all the functions in the call graph.

  2. The if condition (if (func && !func->isDeclaration())) checks whether the function is defined in the analyzed code (i.e., it is not a declaration or an external function).

  3. For each defined function, it prints the function name.

  4. The inner loop (for (auto cg_node_b = cg_b->second->begin(); cg_node_b != cg_b->second->end(); cg_node_b++)) use the call graph node callee iterator to iterate through all calls made by the function.

  5. For each call, it prints the line where the call is made and the function that is being called. It prints “external node” if an external function is called.

Listing callers (calling functions) for each function

  1. The outer loop, as before, iterates through all functions in the call graph.

  2. Similarly, the if condition checks whether the function is defined in the analyzed code.

  3. For each defined function, it prints the function name.

  4. The inner loop (for (auto cg_node_b = cg_b->second->caller_begin(); cg_node_b != cg_b->second->caller_end(); cg_node_b++)) use the call graph node caller iterator to iterate through all functions that call the function.

  5. For each calling function, it prints the line where the call is made and the function that made the call.

Hence, this piece of code provides a comprehensive list of all functions called by and calling each function in the analyzed program.

Outputs

Function: init
Function: f                                                                                                                                                                                             
CS in Line 10 calls function 'init'                                                                                                                                                                     
CS in Line 11 calls function 'strlen'                                                                                                                                                                   
CS in Line 12 calls function 'strlen'                                                                                                                                                                   
CS in Line 13 calls function 'free'                                                                                                                                                                     
                                                                                                                                                                                                        
Function: main                                                                                                                                                                                          
CS in Line 17 calls function 'f'                                                                                                                                                                        
                                                                                                                                                                                                        
Function: init                                                                                                                                                                                          
is called in Line 10 function 'f'                                                                                                                                                                       
                                                                                                                                                                                                        
Function: f                                                                                                                                                                                             
is called in Line 17 function 'main'                                                                                                                                                                    
                                                                                                                                                                                                        
Function: main

Interpretation:

  • In the function f:

    • init is called at line 10
    • strlen is called at lines 11 and 12
    • free is called at line 13
  • In the main function, f is called at line 17.

  • The function init is called by function f at line 10.

  • The function f is called by the main function at line 17.

From the analysis, it’s clear that the code successfully lists the calling and called functions for each function in the target program.

List transitive callee/caller for each function.

Basic idea:

  1. The CallGraphNode presents directly called functions and caller functions through two iterators: begin() : end() and caller_begin() : caller_end(). To fetch the transitive callee/caller, we have to recursively compute the CallGraphNode for the functions called or caller functions listed from the prior CallGraphNode, until no more CallGraphNode can be collected.

  2. We apply this recursive process to every function in the call graph, enabling the collection of transitive callees/callers.

The following code snippet showcases how we transitively collect callees utilizing a helper function forward_traverse_call_graph_from(). This recursive function uses the callee begin() : end() iterator of each reached CallGraphNode cg_node to gather the callee and store them in a set called visited.


void forward_traverse_call_graph_from(Function* A, std::set<Function*> &visited) {
  CBCallGraphNode *cg_node = cg->operator[](A);
  if (!cg_node) {
    return;
  }
  if (visited.count(A)){
    return;
  }
  visited.insert(A);

  for (auto cg_node_b = cg_node->begin(); cg_node_b != cg_node->end(); cg_node_b++) {
    if (Function *callee = cg_node_b->second->getFunction()) {

      forward_traverse_call_graph_from(callee, visited);
    }
  }
}
void list_all_callee_foreach_function() {
    for (auto cg_b = cg->begin(); cg_b != cg->end(); cg_b++) {

        auto func = cg_b->first;
        if (func && !func->isDeclaration()) {
            std::set<Function*> *visited = new std::set<Function*>();
            forward_traverse_call_graph_from(const_cast<Function *>(func), *visited);
            outs() << "For function: " << func->getName() << " its all callees are: \n";
            for (auto callee : *visited) {
            outs() << "\t" << callee->getName() << "\n";
            }
        }
    }
}

The C++ code is designed to list all transitive callees (functions being called directly or indirectly) for each function in the program being analyzed. Here’s a breakdown of the process:

  1. forward_traverse_call_graph_from(Function* A, std::set<Function*> &visited): This function performs a forward traversal of the call graph starting from function A. It uses recursion to visit each callee (called function) and its subsequent callees. The visited callee functions are stored in the visited set to avoid revisiting and infinite recursion.

  2. list_all_callee_foreach_function(): This function iterates over all functions present in the call graph.

  3. For each function, it creates a new set visited to keep track of the visited callee functions.

  4. It then initiates a forward traversal of the call graph starting from the current function using the forward_traverse_call_graph_from() function.

  5. Finally, it prints the name of the current function and the names of all its transitive callees (both directly and indirectly called functions).

With the help of these functions, you can efficiently analyze a given codebase, understanding the relationship and dependencies between different functions.

Outputs:

For function: init its all callees are:                                                                                                                                                                 
        init                                                                                                                                                                                            
For function: f its all callees are:                                                                                                                                                                    
        init                                                                                                                                                                                            
        f                                                                                                                                                                                               
        strlen                                                                                                                                                                                          
        free                                                                                                                                                                                            
For function: main its all callees are:                                                                                                                                                                 
        init                                                                                                                                                                                            
        f                                                                                                                                                                                               
        strlen                                                                                                                                                                                          
        free                                                                                                                                                                                            
        main

The output presents the set of all transitive callees (directly or indirectly called functions) for each function in the running example.

Here’s what the output means for each function:

  • Function init: Its only callee is itself (init), which means it does not call any other function.

  • Function f: Its callees are init, strlen, free, and itself (f). So, function f calls init, strlen, and free directly.

  • Function main: Its callees are init, f, strlen, free, and itself (main). This means that function main calls f directly, and f then calls init, strlen, and free. So main calls init, strlen, and free indirectly, via f.

Practice

Users could have a try to modifiy the function to collect the transitive callers.

Determine whether function A is called by function B

Basic idea:

Function A might be called by Function B through several intermediate functions. Therefore, to ascertain this relationship, we need to examine the transitive relations. By applying the forward_traverse_call_graph_from() helper function, we can collect all callees of Function B and subsequently check if Function A is included.


void is_function_A_called_by_function_B(Function* A, Function* B) {
  std::set<Function*> *visited = new std::set<Function*>();
  forward_traverse_call_graph_from(const_cast<Function *>(B), *visited);
  if (visited->count(A)) {
    outs() << "Yes\n";
  } else {
    outs() << "No\n";
  }
}

The C++ code determines whether function A is called by function B, either directly or indirectly, using the prior forward_traverse_call_graph_from() function. Here’s how it works:

  1. It first creates a new set, visited, to store the functions visited during the traversal of the call graph.

  2. It then initiates a forward traversal of the call graph starting from function B using the forward_traverse_call_graph_from() function. This function populates the visited set with all of the callees of function B (including both direct and indirect callees).

  3. After the traversal, it checks whether function A is in the visited set using the count() function.

    • If function A is in the visited set, it means function A is called by function B, either directly or indirectly. The program outputs a “Yes” as a result.

    • If function A is not in the visited set, it indicates function A is not called by function B. The program outputs a “No” in this case.

List all pointer type variable

Basic idea:

  1. Since variable information is contained within the SEG, we first need to acquire the SEG for a function through the getSymbolicExprGraph(&f) API offered by the SymbolicExprGraphBuilder class.

  2. To enumerate pointer type variables, we must check the type of all variables in the function. We begin by iterating over the operand node via the operand iterator value_node_begin() : value_node_end(), and then determine the type information for each operand through access to the SEGValue.


void list_all_pointer_type_variable(Module &M) {
  for (Function &f : M) {
    if (!f.isDeclaration()) {
      outs() << "Function: " << f.getName() << "\n";
      auto seg = segBuilder->getSymbolicExprGraph(&f);
      for (auto item = seg->value_node_begin(); item != seg->value_node_end(); item++) {
        if (item->second->getSEGValue()->getValue()->getType()->isPointerTy()) {
          outs() << item->second->getSEGValue()->getValueName() << "\n";
        }
      }
      outs() << "\n";
    }
  }
}

Outputs:


Function: init                                                                                                                                                                                          
token                                                                                                                                                                                                   
password                                                                                                                                                                                                
                                                                                                                                                                                                        
Function: f                                                                                                                                                                                             
password                                                                                                                                                                                                
localToken                                                                                                                                                                                                                                                                                                                                                                                                          

Function: main                                                                                                                                                                                          
.str

This C++ code list all the pointer type variables in each function in a codebase using Symbolic Expression Graph (SEG) APIs. Here’s a breakdown of the process:

  1. It iterates over each function f in module M. It skips the functions that are just declarations and not defined in the code.

  2. For each function defined in the code, it fetches the Symbolic Expression Graph (SEG) associated with the function using segBuilder->getSymbolicExprGraph(&f). The SEG represents all expressions in the function wherein the nodes are the values and the edges are the data dependencies.

  3. It then loops through all nodes (item->second) in the SEG representative of the computed values within the function (symbol or expression).

  4. For each computed value (node) in the SEG, it checks if the corresponding LLVM Value is of a pointer type through getType()->isPointerTy().

  5. If the value is a pointer type, it prints the corresponding name of the variable.

In the given example, the code outputs all pointer variables in each function of the module:

  • In function init, it identified two pointer variables: token and password.
  • In function f, it caught two pointer variables: password and localToken.
  • In function main, it found the pointer string literal .str.

Practice

We can even fulfill more complex requirements, such as determining if a function, which resides in a pointer-type variable, can be called by another function present in a different pointer-type variable.

Users can achieve this by combining the list_all_pointer_type_variable() function from this example with the is_function_A_called_by_function_B() function from the previous example. Firstly, adjust the list_all_pointer_type_variable() function to select two pointer-type variables. Next, use the getParentFunction() API, offered by SEG, to determine their corresponding residing functions. Finally, utilize the is_function_A_called_by_function_B() function to check if there is a call relationship between these two functions.

List value flow

Basic idea:

The variable’s data-dependence/value-flow can be accessed using several iterators provided by the SEG API. In the following code snippet, we use the vflow_begin() : vflow_end() iterator to access a variable’s value-flow. Like the call relation, these value-flow relations accessed by iterators are directed. Transitive relations can be collected by recursively accessing the value flow iterator of a variable.

We have encapsulated several high-level APIs that enable users to query the relation between any two vertices of interest. For more information, please refer to the Vulnerability Detection section.


void list_data_dependence(Module &M) {
  for (Function &f : M) {
    if (!f.isDeclaration()) {
      outs() << "Function: " << f.getName() << "\n";
      auto seg = segBuilder->getSymbolicExprGraph(&f);
      for (auto item = seg->value_node_begin(); item != seg->value_node_end(); item++) {
        if (item->second->getSEGValue()->getValue()->getType()->isPointerTy()) {
          outs() << item->second->getSEGValue()->getValueName() << "\n";
          for (auto parent = item->second->vflow_begin() ; parent != item->second->vflow_end(); parent++) {
            outs() << **parent << "\n";
          }
        }
      }
      outs() << "\n";
    }
  }
}

The C++ code lists all data dependencies related to each pointer type variable within every function in a codebase, leveraging Symbolic Expression Graph (SEG) APIs. Here’s the step-by-step of what this code does:

  1. The code iterates over each function f in the module M, excluding the functions that are just declarations and not defined in the code.

  2. For each defined function, it retrieves the Symbolic Expression Graph (SEG) for the function using segBuilder->getSymbolicExprGraph(&f). The SEG represents all expressions in the function, with nodes representing values and edges symbolizing data dependencies.

  3. It then loops through all nodes (item->second) in SEG, representative of the calculated values within the function (symbol or expression).

  4. For each calculated value (node) in the SEG, it verifies if the corresponding LLVM Value is of a pointer type using getType()->isPointerTy().

  5. If the value is of pointer type, it prints the corresponding name of the variable, and afterward, it iterates over its parents (data dependents) via the vflow iterator (vflow_begin() -> vflow_end()).

  6. For each directly dependent value or parent, it outputs details about this dependency.

Complete Implementation

Provided below is the complete implementation of the above examples.

After setting up the environment, you can execute this on ClearBlue. For more details on the setup, please refer to Preparing Coding Environment.

//
// Created by yongchao on 2/6/24.
//

#include "Analysis/Bitcode/DebugInfoAnalysis.h"
#include "Analysis/CallGraph/CBCallGraph.h"
#include "Language/C-langs/C++/Mangling.h"

#include "IR/SEG/SEGArgumentNode.h"
#include "IR/SEG/SEGCallSite.h"
#include "IR/SEG/SEGReturnNode.h"
#include "IR/SEG/SEGReturnSite.h"
#include "IR/SEG/SymbolicExprGraphBuilder.h"

using namespace llvm;

class querySEG : public ModulePass {
private:
  SymbolicExprGraphBuilder *segBuilder = nullptr;
  CBCallGraph *cg = nullptr;
  DebugInfoAnalysis *DIA = nullptr;

public:
  static char ID;

  querySEG();
  ~querySEG() override;

  void getAnalysisUsage(AnalysisUsage &AU) const override;

  //	void initializeAnalysis(Pass *P);

  bool runOnModule(Module &M) override;

  void list_callee_and_cs_foreach_function();

  void list_all_callee_foreach_function();
  void forward_traverse_call_graph_from(Function *A,
                                        std::set<Function *> &visited);

  void is_function_A_called_by_function_B(Function *A, Function *B);

  void list_all_pointer_type_variable(Module &M);

  void list_data_dependence(Module &M);
};

char querySEG::ID = 0;

void querySEG::getAnalysisUsage(llvm::AnalysisUsage &AU) const {
  AU.setPreservesAll();
  AU.addRequired<SymbolicExprGraphBuilder>();
  AU.addRequired<CBCallGraphWrapperPass>();
  AU.addRequired<DebugInfoAnalysis>();
}

void function_boundary_example(Module &M,
                               SymbolicExprGraphBuilder *segBuilder) {
  for (Function &f : M) {
    if (!f.isDeclaration()) {
      outs() << "Function: " << f.getName() << "\n";
      auto seg = segBuilder->getSymbolicExprGraph(&f);

      int numCommonArg = seg->getNumCommonArgument();
      outs() << "\tNumCommonArg: " << numCommonArg << "\n";
      for (int i = 0; i < numCommonArg; i++) {
        auto commonArg = seg->getCommonArgument(i);
        outs() << "\t\t" << commonArg->getSEGValue()->getValue()->getName()
               << "\n";
        if (commonArg == seg->findNode(commonArg->getSEGValue())) {
          outs() << "\tFind the common arg by SEGValue\n";
        }
      }

      int numPseudoArg = seg->getNumPseudoArgument();
      outs() << "\tNumPseudoArg: " << numPseudoArg << "\n";
      for (int i = 0; i < numPseudoArg; i++) {
        auto pseudoArg = seg->getPseudoArgument(i);
        outs() << "\t\t" << pseudoArg->getSEGValue()->getValue()->getName()
               << "\n";
      }

      if (seg->hasCommonReturnNode()) {
        if (seg->getCommonReturn()->getSEGValue() != nullptr) {
          outs() << "\tCommonRet: "
                 << seg->getCommonReturn()->getSEGValue()->getValue()->getName()
                 << "\n";
        }
      }

      int numPseudoRet = seg->getNumPseudoReturn();
      outs() << "\tNumPseudoRet: " << numPseudoRet << "\n";
      for (int i = 0; i < numPseudoRet; i++) {
        auto pseudoRet = seg->getPseudoReturn(i);
        outs() << "\t\t" << pseudoRet->getSEGValue()->getValue()->getName()
               << "\n";
      }
    }
  }
}

void iterator_example(Module &M, SymbolicExprGraphBuilder *segBuilder) {
  for (Function &f : M) {
    if (!f.isDeclaration()) {
      outs() << "Function: " << f.getName() << "\n";
      auto seg = segBuilder->getSymbolicExprGraph(&f);
      for (auto cs_b = seg->seg_callsite_begin();
           cs_b != seg->seg_callsite_end(); cs_b++) {
        SEGCallSite *cs = *cs_b;
        outs() << "\tCallee: " << cs->getCalledFunction()->getName() << "\n";
      }
    }
  }
}

void querySEG::list_callee_and_cs_foreach_function() {
  //  for (auto cg_b = cg->begin(); cg_b != cg->end(); cg_b++) {
  //    auto func = cg_b->first;
  //    if (func && !func->isDeclaration()) {
  //      outs() << cg_b->first->getName() << "\n";
  //      cg_b->second->print(outs(), *DIA);
  //    }
  //  }
  // Callee
  for (auto cg_b = cg->begin(); cg_b != cg->end(); cg_b++) {
    auto func = cg_b->first;
    if (func && !func->isDeclaration()) {
      outs() << "Function: " << cg_b->first->getName() << "\n";
      for (auto cg_node_b = cg_b->second->begin();
           cg_node_b != cg_b->second->end(); cg_node_b++) {
        outs() << "CS in Line " << DIA->getSrcLine(cg_node_b->first)
               << " calls ";
        if (Function *callee = cg_node_b->second->getFunction())
          outs() << "function '" << callee->getName() << "'\n";
        else
          outs() << "external node\n";
      }
      outs() << '\n';
    }
  }
  // Caller
  for (auto cg_b = cg->begin(); cg_b != cg->end(); cg_b++) {
    auto func = cg_b->first;
    if (func && !func->isDeclaration()) {
      outs() << "Function: " << cg_b->first->getName() << "\n";
      for (auto cg_node_b = cg_b->second->caller_begin();
           cg_node_b != cg_b->second->caller_end(); cg_node_b++) {
        if (cg_node_b->first) {
          outs() << "is called in Line " << DIA->getSrcLine(cg_node_b->first)
                 << " by";
          Function *caller = cg_node_b->second->getFunction();
          if (caller) {
            outs() << "function '" << caller->getName() << "'\n";
          }
        }
      }
      outs() << '\n';
    }
  }
}

void querySEG::forward_traverse_call_graph_from(Function *A,
                                                std::set<Function *> &visited) {
  CBCallGraphNode *cg_node = cg->operator[](A);
  if (!cg_node) {
    return;
  }
  if (visited.count(A)) {
    return;
  }
  visited.insert(A);

  for (auto cg_node_b = cg_node->begin(); cg_node_b != cg_node->end();
       cg_node_b++) {
    if (Function *callee = cg_node_b->second->getFunction()) {

      forward_traverse_call_graph_from(callee, visited);
    }
  }
}

void querySEG::list_all_callee_foreach_function() {
  for (auto cg_b = cg->begin(); cg_b != cg->end(); cg_b++) {

    auto func = cg_b->first;
    if (func && !func->isDeclaration()) {
      std::set<Function *> *visited = new std::set<Function *>();
      forward_traverse_call_graph_from(const_cast<Function *>(func), *visited);
      outs() << "For function: " << func->getName()
             << " its all callees are: \n";
      for (auto callee : *visited) {
        outs() << "\t" << callee->getName() << "\n";
      }
    }
  }
}

void querySEG::is_function_A_called_by_function_B(Function *A, Function *B) {
  std::set<Function *> *visited = new std::set<Function *>();
  forward_traverse_call_graph_from(const_cast<Function *>(B), *visited);
  if (visited->count(A)) {
    outs() << "Yes\n";
  } else {
    outs() << "No\n";
  }
}

void querySEG::list_all_pointer_type_variable(Module &M) {
  for (Function &f : M) {
    if (!f.isDeclaration()) {
      outs() << "Function: " << f.getName() << "\n";
      auto seg = segBuilder->getSymbolicExprGraph(&f);
      for (auto item = seg->value_node_begin(); item != seg->value_node_end();
           item++) {
        if (item->second->getSEGValue()->getValue()->getType()->isPointerTy()) {
          outs() << item->second->getSEGValue()->getValueName() << "\n";
        }
      }
      outs() << "\n";
    }
  }
}

void querySEG::list_data_dependence(llvm::Module &M) {
  for (Function &f : M) {
    if (!f.isDeclaration()) {
      outs() << "Function: " << f.getName() << "\n";
      auto seg = segBuilder->getSymbolicExprGraph(&f);
      for (auto item = seg->value_node_begin(); item != seg->value_node_end();
           item++) {
        if (item->second->getSEGValue()->getValue()->getType()->isPointerTy()) {
          outs() << item->second->getSEGValue()->getValueName() << "\n";
          for (auto parent = item->second->vflow_begin();
               parent != item->second->vflow_end(); parent++) {
            outs() << **parent << "\n";
          }
        }
      }
      outs() << "\n";
    }
  }
}

bool querySEG::runOnModule(llvm::Module &M) {
  outs() << M.getName() << "\n";
  segBuilder = &getAnalysis<SymbolicExprGraphBuilder>();
  cg = &getAnalysis<CBCallGraphWrapperPass>().getCallGraph();
  DIA = &getAnalysis<DebugInfoAnalysis>();
  function_boundary_example(M, segBuilder);
  iterator_example(M, segBuilder);
  list_callee_and_cs_foreach_function();
  list_all_callee_foreach_function();
  list_all_pointer_type_variable(M);
  list_data_dependence(M);
  return false;
}

querySEG::querySEG() : ModulePass(ID) {}

querySEG::~querySEG() {}

static RegisterPass<querySEG> Y("query", "SEG query", false, true);

Feedback

Was this page helpful?


Last modified February 22, 2024: change the order of each charpter (721c1e0)