The SEG API

APIs to access and manipulate the Symbolic Expression Graph

In this section, we outline the APIs to operate over the SEG, which are centered around graph operations. We will illustrate some of the important C++ classes and methods and discuss their usages. Readers should refer to the previous section for understanding the general concepts of program dependencies and SEG, and consult our documentation for the full list of available APIs.

SEGValue

SEGValue denotes a value in the SSA-ed program code, e.g., a variable, a constant, etc. It encapsulates the internal details of the programming language and provides a simplistic view of the program:

class SEGValue {

public:
  enum SEGValueKind {
    value,
    basicblock,
    instruction,
    function,
  };
  ...  

An SEGValue could be a value, a basicblock, an instruction, or a function. The concepts are standard in compilers and mirror those in LLVM IR, and we refer readers to the LLVM Language Reference for details.

Below are some of the APIs on SEGValue:

// Determine whether this value equals "constantInt"
bool isConstantInt(int constantInt) const; 
// Determine whether this value represents a function call site
bool isCallOrInvoke() const;
// Obtaint the name of the called function, return "" if invalid
StringRef getFuncName() const;

In our current implementation, SEGValue wraps LLVM::Value. If users wish to examine the full details of the LLVM IR, they may use SEGValue::getValue() to obtain the underlying LLVM::Value, though this is not recommended.

SEGObject

This is the base class for both the SEG nodes and SEG sites. Recall that an SEG node is associated with an value while an SEG site is associated with an instruction that uses a value, and they are related by the “use site” relation (c.f. the previous section). The complete class inheritance diagram is depicted below:

classSEGOperandNode
SEGObject Inheritance Diagram

SEGNodeBase

This the base class for SEG nodes. An SEGNodeBase is either an SEGOperandNode or an SEGOpcodeNode. The former denotes “operand” used in the program, such as variables and constants (e.g., a, b, c, 10, etc). The latter is associated with an program operation and denotes the result of intermediate computation (e.g., the node labeled with + in the previous section is an SEGOpcodeNode).

Here are some important APIs common to all SEG nodes:

// Graph operations
SEGNodeBase *getChild(unsigned I) const;
unsigned getNumChildren() const;
unsigned getNumParents() const;
key_iterator<std::map<const SEGNodeBase *, float>::const_iterator> parent_begin() const;
key_iterator<std::map<const SEGNodeBase *, float>::const_iterator> parent_end();

ValueFlowIterator vflow_begin() const;
ValueFlowIterator vflow_end() const;

std::vector<SEGSiteBase *>::const_iterator use_site_begin() const;
std::vector<SEGSiteBase *>::const_iterator use_site_end() const;
size_t use_site_size() const;

We highlight some key points:

  1. Parent-Child relation: The value flow edges are represented implicitly by storing at each SEG node a list of its child nodes. The first five APIs above are used to retrieve or iterate over parent or children nodes (the meanings are evident from their names).

Recall the example int c = a+b introduced in the previous section. Its relevant SEG is duplicated here for ease of exposition.

In this example, the arrows point from a node to its parent, e.g., the node labeled with c is a parent of the node labeled with +.

  1. Value flow relation: The value of N can “flow to” the parents of N. This information can be obtained by iteraing over the range defined by the iterator pairs (N->vflow_begin(), N->vflow_end()).

  2. Use site relation: The set of use sites that use the current node can be visited by the last three APIs.

SEGOperandNode

SEGOperandNode is the base class for SEG nodes that represent operand values, i.e., any value in the program excluding the result of intermediate computation. Hence, each SEGOperandNode has an associated SEGValue and can be retrieved by:

SEGValue* getSEGValue()

Here we list the most common kinds of SEGOperandNode:

  1. SEGSimpleOperandNode: most of the operand nodes belong to this class. Examples include variables and constant values. For instance, the nodes associated with a, b, c are all SEGSimpleOperandNode.

  2. SEGArgumentNode (including SEGCommonArgumentNode and SEGPseudoArgumentNode), SEGCallSiteOutputNode (includeing SEGCallSiteCommonOutputNode and SEGCallSitePseudoOutputNode), SEGCallSitePseudoInputNode, SEGReturnNode (including SEGCommonReturnNode and SEGPseudoReturnNode): they are all nodes used by Clearblue to connect data dependencies accross function boundaries. Their meanings are described in details with examples in the previous section.

  3. SEGLoadMemNode and SEGStoreMemNode: each load instruction x= *p causes the SEG to create a SEGLoadMemNode to represent the value loaded from the memory location, while each store instruction *p = y causes the SEG to create a SEGStoreMemNode. They are mainly used to connect indirect data dependencies, which we have described in details with examples in the previous section.

SEGOpcodeNode

SEGOpcodeNode is the base class for SEG nodes that represent the result of performing an arithmetic operation, which usually corresponds to the intermediate result computed before assigning to a variable. We differentiate SEGOpcodeNode based on the opcode of the operation, which lead to SEGBinaryWithIntConstNode (denoting binary operations with one operand being a constant), SEGCastNode (denoting type cast operations), and SEGSimpleOpcodeNode(denoting all other operations).

SEGSiteBase

Now we introduce the other kind of SEGObject, which denotes a use site. Conceptually, an SEGSiteBase should be viewed as an instruction that may use other values represented in SEGNodeBase. We illustrate three types of SEGSiteBase that are the most important: SEGCallSite, SEGCmpSite, and SEGReturnSite.

SEGCallSite represents a function call statement such as foo(1, a1, a2); SEGCmpSite represents a comparison statement that is normally used to form branch conditions such as if (x < 1); SEGReturnSite represents the instruction that returns from the current function such as return res.

SEGCallSite provides APIs to extract relevant information about a function call:

const SEGOperandNode * 	getCommonInput (size_t Index) const;
const SEGOperandNode * 	getPseudoInput (Function *, size_t Index) const;
const SEGCallSiteCommonOutputNode * getCommonOutput () const;
const SEGCallSitePseudoOutputNode * getPseudoOutput (Function *, size_t Index) const;
CalleeIterator 	callee_begin () const;
CalleeIterator 	callee_end () const;
size_t 	callee_size () const;

SymbolicExprGraph

For each function, there will be a SEG that is constructed for it. SymbolicExprGraph is the implementation of the SEG graph data structure. Information of an SEG can be quried by three groups of APIs, which we detail below.

APIs for Accessing Argument And Return Nodes

// Returns the count of common argument nodes in the SEG
size_t getNumCommonArgument() const;

// Returns the count of pseudo argument nodes in the SEG
size_t getNumPseudoArgument() const;

// Returns the count of pseudo return nodes in the SEG
size_t getNumPseudoReturn() const;

// Verifies whether a common return node exists in the SEG
bool hasCommonReturnNode() const;

// Returns a pointer to the common return node in the SEG, if it exists
const SEGCommonReturnNode* getCommonReturn() const;

// Returns a pointer to the pseudo return node at the specified index in the SEG
const SEGPseudoReturnNode* getPseudoReturn(size_t Index) const;

// Returns a pointer to the common argument node at the specified index in the SEG
const SEGCommonArgumentNode* getCommonArgument(size_t Index) const;

// Returns a pointer to the pseudo argument node at the specified index in the SEG
const SEGPseudoArgumentNode* getPseudoArgument(size_t Index) const;

Example Usage

//owasp-example: respec-5798
void init(char* token, char* password){
  password = token;
}

void f(char *password) {
  char localToken[256];
  init(localToken, password);
  memset(password, ' ', strlen(password)); 
  memset(localToken, ' ', strlen(localToken));
  free(password);
}

int main(){
  f("password");
}

For the code above, we illustrate how to access the SymbolicExprGraph for each function and visit their agument and return nodes (SEGArgumentNode and SEGReturnNode):

for (Function &f : M) {
  if (!f.isDeclaration()) {
    outs() << "Function: " << f.getName() << "\n";
    auto seg = segBuilder->getSymbolicExprGraph(&f);

    int numCommonArg = seg->getNumCommonArgument();
    outs() << "\tNumCommonArg: " << numCommonArg << "\n";
    for (int i = 0; i < numCommonArg; i++) {
      auto commonArg = seg->getCommonArgument(i);
      outs() << "\t\t" << commonArg->getSEGValue()->getValue()->getName() << "\n";
    }

    int numPseudoArg = seg->getNumPseudoArgument();
    outs() << "\tNumPseudoArg: " << numPseudoArg << "\n";
    for (int i = 0; i < numPseudoArg; i++) {
      auto pseudoArg = seg->getPseudoArgument(i);
      outs() << "\t\t" << pseudoArg->getSEGValue()->getValue()->getName() << "\n";
    }

    if (seg->hasCommonReturnNode()) {
      if (seg->getCommonReturn()->getSEGValue() != nullptr) {
        outs() << "\tCommonRet: " << seg->getCommonReturn()->getSEGValue()->getValue()->getName() << "\n";
      }
    }

    int numPseudoRet = seg->getNumPseudoReturn();
    outs() << "\tNumPseudoRet: " << numPseudoRet << "\n";
    for (int i = 0; i < numPseudoRet; i++) {
      auto pseudoRet = seg->getPseudoReturn(i);
      outs() << "\t\t" << pseudoRet->getSEGValue()->getValue()->getName() << "\n";
    }
  }
}

The output are dumped as follows;

// Example output:
Function: init                                                                                                                                                                                          
        NumCommonArg: 2                                                                                                                                                                                 
                token                                                                                                                                                                                   
                password                                                                                                                                                                                
        NumPseudoArg: 0                                                                                                                                                                                 
        NumPseudoRet: 0                                                                                                                                                                                 
Function: f                                                                                                                                                                                             
        NumCommonArg: 1                                                                                                                                                                                 
                password                                                                                                                                                                                
        NumPseudoArg: 0                                                                                                                                                                                 
        NumPseudoRet: 1                                                                                                                                                                                 
                pseudo_ret_1                                                                                                                                                                            
Function: main                                                                                                                                                                                          
        NumCommonArg: 0                                                                                                                                                                                 
        NumPseudoArg: 0                                                                                                                                                                                 
        NumPseudoRet: 1                                                                                                                                                                                 
                pseudo_ret_1                             

We print out the counts for common arguments, pseudo arguments, common returns, and pseudo returns are enumerated, along with their respective values, which are retrieved as SEGValue.

APIs for Determining Vertex Containment.

The next group of graph operation APIs determine whether a specific vertex exists within the Symbolic Execution Graph (SEG). As introduced in the conceptual framework of the SEG, there are two types of vertices in the SEG model: Site and Node. Accordingly, there are different APIs, findSite() and findNode(), for determining the existence of each type of vertex within a graph.


// Finds and returns a node from the SEG using the given Instruction pointer
SiteTy* findSite(Instruction *I) const;

// Finds and returns a specific type of site node from the SEG, using the given SEGValue object, with a specific comparison method
template<class SiteTy>
SiteTy* findSite(SEGValue *sValue) const;

// Finds and returns the operand node from the SEG corresponding to the given Value
SEGOperandNode* findNode(Value *) const;

// Finds and returns the operand node from the SEG corresponding to the given SEGValue
SEGOperandNode* findNode(SEGValue *) const;

Example Usage

Let commonArg denote the ith argument of the SEG, i.e., commonArg = seg->getCommonArgument(i). SEG maintains the invariant that commonArg == seg->findNode(commonArg->getSEGValue()):

...
for (int i = 0; i < numCommonArg; i++) {
  auto commonArg = seg->getCommonArgument(i);
  outs() << "\t\t" << commonArg->getSEGValue()->getValue()->getName()
          << "\n";
  if (commonArg == seg->findNode(commonArg->getSEGValue())) {
    outs() << "\tFind the common arg by SEGValue\t";
  } else {
    assert(false);
  }
}
...

Function: init
        NumCommonArg: 2
                token        
        Find the common arg by SEGValue                                                                                                                                                                 
                password                                                                                                                                                                                
        Find the common arg by SEGValue                                                                                                                                                                 
        NumPseudoArg: 0                                                                                                                                                                                 
        NumPseudoRet: 0                                                                                                                                                                                 
Function: f                                                                                                                                                                                             
        NumCommonArg: 1                                                                                                                                                                                 
                password                                                                                                                                                                                
        Find the common arg by SEGValue                                                                                                                                                                 
        NumPseudoArg: 0                                                                                                                                                                                 
        NumPseudoRet: 1                                                                                                                                                                                 
                pseudo_ret_1                                                                                                                                                                            
Function: main                                                                                                                                                                                          
        NumCommonArg: 0                                                                                                                                                                                 
        NumPseudoArg: 0                                                                                                                                                                                 
        NumPseudoRet: 1                                                                                                                                                                                 
                pseudo_ret_1 

APIs for Graph Iterators

We also offer methods to iterate through different groups of vertices within the SEG. A brief introduction of each iterator is provided below:

  • value_node iterator: This iterates through all operand nodes (represented as value nodes).
  • value_node_pair iterator: This iterates through all pairs of operand nodes and their corresponding values.
  • non_value_node iterator: This iterates through all vertices that are not operand nodes.
  • site iterator: This iterates through all the site vertices.
  • return iterator: This iterates through all return nodes.
  • pseudo_return iterator: This iterates through all pseudo return nodes.
  • arg iterator: This iterates through all argument nodes.
  • common_arg iterator: This iterates through all common argument nodes.
  • pseudo_arg iterator: This iterates through all pseudo argument nodes.
  • var_arg iterator: This iterates through all variadic argument nodes.
  • seg_callsite iterator: This iterates through all call site nodes.

Each of these iterators allows for a traversal across a specific group of structures in the graph, providing flexibility in how the SEG is navigated and processed.


// Returns the iterator to the beginning of the Value to SEGOperandNode map
std::map<Value *, SEGOperandNode *>::const_iterator value_node_begin() const;

// Returns the iterator to the end of the Value to SEGOperandNode map
std::map<Value *, SEGOperandNode *>::const_iterator value_node_end() const;

// Returns the iterator to the beginning of the Value to SEGOperandNode pairs vector
std::vector<std::pair<Value *, SEGOperandNode *>>::const_iterator value_node_pair_begin() const;

// Returns the iterator to the end of the Value to SEGOperandNode pairs vector
std::vector<std::pair<Value *, SEGOperandNode *>>::const_iterator value_node_pair_end() const;

// Returns the iterator to the beginning of the non-Value SEGNodeBase set
std::set<SEGNodeBase *>::const_iterator non_value_node_begin() const;

// Returns the iterator to the end of the non-Value SEGNodeBase set
std::set<SEGNodeBase *>::const_iterator non_value_node_end() const;

// Returns the iterator to the beginning of the Instruction to SEGSiteBase map
std::map<Instruction *, SEGSiteBase *>::const_iterator site_begin() const;

// Returns the iterator to the end of the Instruction to SEGSiteBase map
std::map<Instruction *, SEGSiteBase *>::const_iterator site_end() const;

// Returns the iterator to the beginning of the Return sequence
ReturnIterator return_begin() const;

// Returns the iterator to the end of the Return sequence
ReturnIterator return_end() const;

// Returns the iterator to the beginning of the Pseudo Return sequence
ReturnIterator pseudo_return_begin() const;

// Returns the iterator to the end of the Pseudo Return sequence
ReturnIterator pseudo_return_end() const;

// Returns the iterator to the beginning of the Argument sequence
ArgumentIterator arg_begin() const;

// Returns the iterator to the end of the Argument sequence
ArgumentIterator arg_end() const;

// Returns the iterator to the beginning of the Common Argument sequence
ArgumentIterator common_arg_begin() const;

// Returns the iterator to the end of the Common Argument sequence
ArgumentIterator common_arg_end() const;

// Returns the iterator to the beginning of the Pseudo Argument sequence
ArgumentIterator pseudo_arg_begin() const;

// Returns the iterator to the end of the Pseudo Argument sequence
ArgumentIterator pseudo_arg_end() const;

// Returns the iterator to the beginning of the Variable Argument sequence
ArgumentIterator var_arg_begin() const;

// Returns the iterator to the end of the Variable Argument sequence
ArgumentIterator var_arg_end() const;

// Returns the iterator to the beginning of the SEGCallSite sequence
SEGCallSiteIterator seg_callsite_begin() const;

// Returns the iterator to the end of the SEGCallSite sequence
SEGCallSiteIterator seg_callsite_end() const;

Example Usage

Here, we’ll use seg_callsite iterator as an example and list all the called function within each function.

for (Function &f : M) {
  if (!f.isDeclaration()) {
    outs() << "Function: " << f.getName() << "\n";
    auto seg = segBuilder->getSymbolicExprGraph(&f);
    for (auto cs_b = seg->seg_callsite_begin(); cs_b != seg->seg_callsite_end(); cs_b++) {
      SEGCallSite *cs = *cs_b;
      outs() << "\tCallee: " <<  cs->getCalledFunction()->getName() << "\n";
    }
  }
}

The results are:

Function: init                                                                                                                                                                                          
Function: f                                                                                                                                                                                             
        Callee: init                                                                                                                                                                                    
        Callee: strlen                                                                                                                                                                                  
        Callee: strlen                                                                                                                                                                                  
        Callee: free                                                                                                                                                                                    
Function: main                                                                                                                                                                                          
        Callee: f 

Case study: using the SEG APIs to write a static analysis

In this example, we demonstrate how to apply the introduced SEG APIs to write a simple static analysis. The goal of our analysis is to find out the reachable pairs of (function argument, function ret), which we call a “summary” of the function. The result is useful in answering questions like “which value passed in may be returned back to the caller of the function”.

/**
  * depth-first search from an arg node. if it can reach a return node, we record a summary in summaryMap
  */
std::unordered_map<SEGArgumentNode*, std::unordered_set<SEGReturnNode*>> summaryMap;
void dfs(SEGArgumentNode* arg) {
    // Create a stack to perform DFS traversal
    std::stack<SEGNodeBase*> dfsStack;
    dfsStack.push(arg);

    // Create a set to keep track of visited nodes
    std::unordered_set<SEGNodeBase*> visited;

    // Perform DFS traversal
    while (!dfsStack.empty()) {
        // Pop the top node from the stack
        SEGNodeBase* top = dfsStack.top();
        dfsStack.pop();

        // Mark the node as visited
        visited.insert(top);

        // If reaches a return node, add it to the summary map
        // to form a arg-ret summary
        if (auto returnNode = dyn_cast<SEGReturnNode>(top)) {
            if (summaryMap.count(arg)) {
                summaryMap[arg].insert(returnNode);
            } else {
                std::unordered_set<SEGReturnNode*> returnNodes;
                returnNodes.insert(returnNode);
                summaryMap[arg] = returnNodes;
            }
        }

        // *Advanced step*: 
        // Check whether there are call sites that use the current node
        std::unordered_set<SEGNodeBase*> outputNodes;
        for (auto site : top->getUseSites()) {
            if (auto callSite = dyn_cast<SEGCallSite>(site)) {
                // For each callee of the call site, 
                // build a transfer summary
                auto callee = callSite->getCalledFunction();
                transfer(callSite, callee, top, outputNodes);
            }
        }
        // *Advanced step*:
        // Add all return nodes from the callee to the stack for traversal
        for (auto p : outputNodes) {
            if (!visited.count(p)) {
                dfsStack.push(p);
            }
        }

        // Add all parent nodes of the current node to the stack for traversal
        for (auto p : top->getParents()) {
            if (!visited.count(p)) {
                dfsStack.push(p);
            }
        }
    }
}

/**
  * Utilizing the argument and return value information of the callee 
  * to build a transfer summary that can be used directly by the 
  * caller, without requiring any runtime searches.
  */
void transfer(SEGCallSite cs, Function* callee, SEGNodeBase* input, std::set<SEGNodeBase*>& outputNodes){
  ...
}

Starting from nodes representing arguments, we perform DFS search over the SEG. To locate the return value of the function, we need to check the type of current node top. Specifically, the dyn_cast<> operator is used to check whether top is of type SEGReturnNode. Based on the result of the check, the path is either dropped or recorded as an arg-ret summary. Then getParents interface is used to obtain the next node to be traversed.

When top is passed as a parameter through a function invocation to a callee function, we should as well consider the propagation of data dependencies within the callee. Therefore, when a use site of top is an SEGCallSite, we utilize the transfer function to pre-build the data flow of the callee function (implementation not shown), providing the summary for the current search.

In the above figure, there are two paths that belong to the arg-ret summary, one starting from Argument1 and the other from Argument2, both leading to Return. If we do not consider the impact of function calls on data flow propagation, then the path from Argument2 to Return will be ignored during our analysis, resulting in a loss of accuracy in our analysis.

Feedback

Was this page helpful?


Last modified February 11, 2024: Update 3-2_The_SEG_API.md (c206b18)