The SEG API
In this section, we outline the APIs to operate over the SEG, which are centered around graph operations. We will illustrate some of the important C++ classes and methods and discuss their usages. Readers should refer to the previous section for understanding the general concepts of program dependencies and SEG, and consult our documentation for the full list of available APIs.
SEGValue
SEGValue
denotes a value in the SSA-ed program code, e.g., a variable, a constant, etc. It encapsulates the internal details of the programming language and provides a simplistic view of the program:
class SEGValue {
public:
enum SEGValueKind {
value,
basicblock,
instruction,
function,
};
...
An SEGValue
could be a value, a basicblock, an instruction, or a function. The concepts are standard in compilers and mirror those in LLVM IR, and we refer readers to the LLVM Language Reference for details.
Below are some of the APIs on SEGValue
:
// Determine whether this value equals "constantInt"
bool isConstantInt(int constantInt) const;
// Determine whether this value represents a function call site
bool isCallOrInvoke() const;
// Obtaint the name of the called function, return "" if invalid
StringRef getFuncName() const;
In our current implementation, SEGValue
wraps LLVM::Value
. If users wish to examine the full details of the LLVM IR, they may use SEGValue::getValue()
to obtain the underlying LLVM::Value
, though this is not recommended.
SEGObject
This is the base class for both the SEG nodes and SEG sites. Recall that an SEG node is associated with an value while an SEG site is associated with an instruction that uses a value, and they are related by the “use site” relation (c.f. the previous section). The complete class inheritance diagram is depicted below:
SEGNodeBase
This the base class for SEG nodes. An SEGNodeBase
is either an SEGOperandNode
or an SEGOpcodeNode
. The former denotes “operand” used in the program, such as variables and constants (e.g., a, b, c, 10
, etc). The latter is associated with an program operation and denotes the result of intermediate computation (e.g., the node labeled with +
in the previous section is an SEGOpcodeNode
).
Here are some important APIs common to all SEG nodes:
// Graph operations
SEGNodeBase *getChild(unsigned I) const;
unsigned getNumChildren() const;
unsigned getNumParents() const;
key_iterator<std::map<const SEGNodeBase *, float>::const_iterator> parent_begin() const;
key_iterator<std::map<const SEGNodeBase *, float>::const_iterator> parent_end();
ValueFlowIterator vflow_begin() const;
ValueFlowIterator vflow_end() const;
std::vector<SEGSiteBase *>::const_iterator use_site_begin() const;
std::vector<SEGSiteBase *>::const_iterator use_site_end() const;
size_t use_site_size() const;
We highlight some key points:
- Parent-Child relation: The value flow edges are represented implicitly by storing at each SEG node a list of its child nodes. The first five APIs above are used to retrieve or iterate over parent or children nodes (the meanings are evident from their names).
Recall the example int c = a+b
introduced in the previous section. Its relevant SEG is duplicated here for ease of exposition.
In this example, the arrows point from a node to its parent, e.g., the node labeled with c
is a parent of the node labeled with +
.
Value flow relation: The value of
N
can “flow to” the parents ofN
. This information can be obtained by iteraing over the range defined by the iterator pairs(N->vflow_begin(), N->vflow_end())
.Use site relation: The set of use sites that use the current node can be visited by the last three APIs.
SEGOperandNode
SEGOperandNode
is the base class for SEG nodes that represent operand values, i.e., any value in the program excluding the result of intermediate computation. Hence, each SEGOperandNode
has an associated SEGValue
and can be retrieved by:
SEGValue* getSEGValue()
Here we list the most common kinds of SEGOperandNode
:
SEGSimpleOperandNode
: most of the operand nodes belong to this class. Examples include variables and constant values. For instance, the nodes associated witha
,b
,c
are allSEGSimpleOperandNode
.SEGArgumentNode
(includingSEGCommonArgumentNode
andSEGPseudoArgumentNode
),SEGCallSiteOutputNode
(includeingSEGCallSiteCommonOutputNode
andSEGCallSitePseudoOutputNode
),SEGCallSitePseudoInputNode
,SEGReturnNode
(includingSEGCommonReturnNode
andSEGPseudoReturnNode
): they are all nodes used by Clearblue to connect data dependencies accross function boundaries. Their meanings are described in details with examples in the previous section.SEGLoadMemNode
andSEGStoreMemNode
: each load instructionx= *p
causes the SEG to create aSEGLoadMemNode
to represent the value loaded from the memory location, while each store instruction*p = y
causes the SEG to create aSEGStoreMemNode
. They are mainly used to connect indirect data dependencies, which we have described in details with examples in the previous section.
SEGOpcodeNode
SEGOpcodeNode
is the base class for SEG nodes that represent the result of performing an arithmetic operation, which usually corresponds to the intermediate result computed before assigning to a variable.
We differentiate SEGOpcodeNode
based on the opcode of the operation, which lead to SEGBinaryWithIntConstNode
(denoting binary operations with one operand being a constant), SEGCastNode
(denoting type cast operations), and SEGSimpleOpcodeNode
(denoting all other operations).
SEGSiteBase
Now we introduce the other kind of SEGObject
, which denotes a use site. Conceptually, an SEGSiteBase
should be viewed as an instruction that may use other values represented in SEGNodeBase
.
We illustrate three types of SEGSiteBase
that are the most important: SEGCallSite
, SEGCmpSite
, and SEGReturnSite
.
SEGCallSite
represents a function call statement such as foo(1, a1, a2)
; SEGCmpSite
represents a comparison statement that is normally used to form branch conditions such as if (x < 1)
; SEGReturnSite
represents the instruction that returns from the current function such as return res
.
SEGCallSite
provides APIs to extract relevant information about a function call:
const SEGOperandNode * getCommonInput (size_t Index) const;
const SEGOperandNode * getPseudoInput (Function *, size_t Index) const;
const SEGCallSiteCommonOutputNode * getCommonOutput () const;
const SEGCallSitePseudoOutputNode * getPseudoOutput (Function *, size_t Index) const;
CalleeIterator callee_begin () const;
CalleeIterator callee_end () const;
size_t callee_size () const;
SymbolicExprGraph
For each function, there will be a SEG that is constructed for it. SymbolicExprGraph
is the implementation of the SEG graph data structure.
Information of an SEG can be quried by three groups of APIs, which we detail below.
APIs for Accessing Argument And Return Nodes
// Returns the count of common argument nodes in the SEG
size_t getNumCommonArgument() const;
// Returns the count of pseudo argument nodes in the SEG
size_t getNumPseudoArgument() const;
// Returns the count of pseudo return nodes in the SEG
size_t getNumPseudoReturn() const;
// Verifies whether a common return node exists in the SEG
bool hasCommonReturnNode() const;
// Returns a pointer to the common return node in the SEG, if it exists
const SEGCommonReturnNode* getCommonReturn() const;
// Returns a pointer to the pseudo return node at the specified index in the SEG
const SEGPseudoReturnNode* getPseudoReturn(size_t Index) const;
// Returns a pointer to the common argument node at the specified index in the SEG
const SEGCommonArgumentNode* getCommonArgument(size_t Index) const;
// Returns a pointer to the pseudo argument node at the specified index in the SEG
const SEGPseudoArgumentNode* getPseudoArgument(size_t Index) const;
Example Usage
//owasp-example: respec-5798
void init(char* token, char* password){
password = token;
}
void f(char *password) {
char localToken[256];
init(localToken, password);
memset(password, ' ', strlen(password));
memset(localToken, ' ', strlen(localToken));
free(password);
}
int main(){
f("password");
}
For the code above, we
illustrate how to access the SymbolicExprGraph
for each function and visit their agument and return nodes (SEGArgumentNode
and SEGReturnNode
):
for (Function &f : M) {
if (!f.isDeclaration()) {
outs() << "Function: " << f.getName() << "\n";
auto seg = segBuilder->getSymbolicExprGraph(&f);
int numCommonArg = seg->getNumCommonArgument();
outs() << "\tNumCommonArg: " << numCommonArg << "\n";
for (int i = 0; i < numCommonArg; i++) {
auto commonArg = seg->getCommonArgument(i);
outs() << "\t\t" << commonArg->getSEGValue()->getValue()->getName() << "\n";
}
int numPseudoArg = seg->getNumPseudoArgument();
outs() << "\tNumPseudoArg: " << numPseudoArg << "\n";
for (int i = 0; i < numPseudoArg; i++) {
auto pseudoArg = seg->getPseudoArgument(i);
outs() << "\t\t" << pseudoArg->getSEGValue()->getValue()->getName() << "\n";
}
if (seg->hasCommonReturnNode()) {
if (seg->getCommonReturn()->getSEGValue() != nullptr) {
outs() << "\tCommonRet: " << seg->getCommonReturn()->getSEGValue()->getValue()->getName() << "\n";
}
}
int numPseudoRet = seg->getNumPseudoReturn();
outs() << "\tNumPseudoRet: " << numPseudoRet << "\n";
for (int i = 0; i < numPseudoRet; i++) {
auto pseudoRet = seg->getPseudoReturn(i);
outs() << "\t\t" << pseudoRet->getSEGValue()->getValue()->getName() << "\n";
}
}
}
The output are dumped as follows;
// Example output:
Function: init
NumCommonArg: 2
token
password
NumPseudoArg: 0
NumPseudoRet: 0
Function: f
NumCommonArg: 1
password
NumPseudoArg: 0
NumPseudoRet: 1
pseudo_ret_1
Function: main
NumCommonArg: 0
NumPseudoArg: 0
NumPseudoRet: 1
pseudo_ret_1
We print out the counts for common arguments, pseudo arguments, common returns, and pseudo returns are enumerated, along with their respective values, which are retrieved as SEGValue
.
APIs for Determining Vertex Containment.
The next group of graph operation APIs determine whether a specific vertex exists within the Symbolic Execution Graph (SEG).
As introduced in the conceptual framework of the SEG, there are two types of vertices in the SEG model: Site and Node.
Accordingly, there are different APIs, findSite()
and findNode()
, for determining the existence of each type of vertex within a graph.
// Finds and returns a node from the SEG using the given Instruction pointer
SiteTy* findSite(Instruction *I) const;
// Finds and returns a specific type of site node from the SEG, using the given SEGValue object, with a specific comparison method
template<class SiteTy>
SiteTy* findSite(SEGValue *sValue) const;
// Finds and returns the operand node from the SEG corresponding to the given Value
SEGOperandNode* findNode(Value *) const;
// Finds and returns the operand node from the SEG corresponding to the given SEGValue
SEGOperandNode* findNode(SEGValue *) const;
Example Usage
Let commonArg
denote the ith argument of the SEG, i.e., commonArg = seg->getCommonArgument(i)
. SEG maintains the invariant that
commonArg == seg->findNode(commonArg->getSEGValue())
:
...
for (int i = 0; i < numCommonArg; i++) {
auto commonArg = seg->getCommonArgument(i);
outs() << "\t\t" << commonArg->getSEGValue()->getValue()->getName()
<< "\n";
if (commonArg == seg->findNode(commonArg->getSEGValue())) {
outs() << "\tFind the common arg by SEGValue\t";
} else {
assert(false);
}
}
...
Function: init
NumCommonArg: 2
token
Find the common arg by SEGValue
password
Find the common arg by SEGValue
NumPseudoArg: 0
NumPseudoRet: 0
Function: f
NumCommonArg: 1
password
Find the common arg by SEGValue
NumPseudoArg: 0
NumPseudoRet: 1
pseudo_ret_1
Function: main
NumCommonArg: 0
NumPseudoArg: 0
NumPseudoRet: 1
pseudo_ret_1
APIs for Graph Iterators
We also offer methods to iterate through different groups of vertices within the SEG. A brief introduction of each iterator is provided below:
- value_node iterator: This iterates through all operand nodes (represented as value nodes).
- value_node_pair iterator: This iterates through all pairs of operand nodes and their corresponding values.
- non_value_node iterator: This iterates through all vertices that are not operand nodes.
- site iterator: This iterates through all the site vertices.
- return iterator: This iterates through all return nodes.
- pseudo_return iterator: This iterates through all pseudo return nodes.
- arg iterator: This iterates through all argument nodes.
- common_arg iterator: This iterates through all common argument nodes.
- pseudo_arg iterator: This iterates through all pseudo argument nodes.
- var_arg iterator: This iterates through all variadic argument nodes.
- seg_callsite iterator: This iterates through all call site nodes.
Each of these iterators allows for a traversal across a specific group of structures in the graph, providing flexibility in how the SEG is navigated and processed.
// Returns the iterator to the beginning of the Value to SEGOperandNode map
std::map<Value *, SEGOperandNode *>::const_iterator value_node_begin() const;
// Returns the iterator to the end of the Value to SEGOperandNode map
std::map<Value *, SEGOperandNode *>::const_iterator value_node_end() const;
// Returns the iterator to the beginning of the Value to SEGOperandNode pairs vector
std::vector<std::pair<Value *, SEGOperandNode *>>::const_iterator value_node_pair_begin() const;
// Returns the iterator to the end of the Value to SEGOperandNode pairs vector
std::vector<std::pair<Value *, SEGOperandNode *>>::const_iterator value_node_pair_end() const;
// Returns the iterator to the beginning of the non-Value SEGNodeBase set
std::set<SEGNodeBase *>::const_iterator non_value_node_begin() const;
// Returns the iterator to the end of the non-Value SEGNodeBase set
std::set<SEGNodeBase *>::const_iterator non_value_node_end() const;
// Returns the iterator to the beginning of the Instruction to SEGSiteBase map
std::map<Instruction *, SEGSiteBase *>::const_iterator site_begin() const;
// Returns the iterator to the end of the Instruction to SEGSiteBase map
std::map<Instruction *, SEGSiteBase *>::const_iterator site_end() const;
// Returns the iterator to the beginning of the Return sequence
ReturnIterator return_begin() const;
// Returns the iterator to the end of the Return sequence
ReturnIterator return_end() const;
// Returns the iterator to the beginning of the Pseudo Return sequence
ReturnIterator pseudo_return_begin() const;
// Returns the iterator to the end of the Pseudo Return sequence
ReturnIterator pseudo_return_end() const;
// Returns the iterator to the beginning of the Argument sequence
ArgumentIterator arg_begin() const;
// Returns the iterator to the end of the Argument sequence
ArgumentIterator arg_end() const;
// Returns the iterator to the beginning of the Common Argument sequence
ArgumentIterator common_arg_begin() const;
// Returns the iterator to the end of the Common Argument sequence
ArgumentIterator common_arg_end() const;
// Returns the iterator to the beginning of the Pseudo Argument sequence
ArgumentIterator pseudo_arg_begin() const;
// Returns the iterator to the end of the Pseudo Argument sequence
ArgumentIterator pseudo_arg_end() const;
// Returns the iterator to the beginning of the Variable Argument sequence
ArgumentIterator var_arg_begin() const;
// Returns the iterator to the end of the Variable Argument sequence
ArgumentIterator var_arg_end() const;
// Returns the iterator to the beginning of the SEGCallSite sequence
SEGCallSiteIterator seg_callsite_begin() const;
// Returns the iterator to the end of the SEGCallSite sequence
SEGCallSiteIterator seg_callsite_end() const;
Example Usage
Here, we’ll use seg_callsite iterator as an example and list all the called function within each function.
for (Function &f : M) {
if (!f.isDeclaration()) {
outs() << "Function: " << f.getName() << "\n";
auto seg = segBuilder->getSymbolicExprGraph(&f);
for (auto cs_b = seg->seg_callsite_begin(); cs_b != seg->seg_callsite_end(); cs_b++) {
SEGCallSite *cs = *cs_b;
outs() << "\tCallee: " << cs->getCalledFunction()->getName() << "\n";
}
}
}
The results are:
Function: init
Function: f
Callee: init
Callee: strlen
Callee: strlen
Callee: free
Function: main
Callee: f
Case study: using the SEG APIs to write a static analysis
In this example, we demonstrate how to apply the introduced SEG APIs to write a simple static analysis. The goal of our analysis is to find out the reachable pairs of (function argument, function ret)
, which we call a “summary” of the function. The result is useful in answering questions like “which value passed in may be returned back to the caller of the function”.
/**
* depth-first search from an arg node. if it can reach a return node, we record a summary in summaryMap
*/
std::unordered_map<SEGArgumentNode*, std::unordered_set<SEGReturnNode*>> summaryMap;
void dfs(SEGArgumentNode* arg) {
// Create a stack to perform DFS traversal
std::stack<SEGNodeBase*> dfsStack;
dfsStack.push(arg);
// Create a set to keep track of visited nodes
std::unordered_set<SEGNodeBase*> visited;
// Perform DFS traversal
while (!dfsStack.empty()) {
// Pop the top node from the stack
SEGNodeBase* top = dfsStack.top();
dfsStack.pop();
// Mark the node as visited
visited.insert(top);
// If reaches a return node, add it to the summary map
// to form a arg-ret summary
if (auto returnNode = dyn_cast<SEGReturnNode>(top)) {
if (summaryMap.count(arg)) {
summaryMap[arg].insert(returnNode);
} else {
std::unordered_set<SEGReturnNode*> returnNodes;
returnNodes.insert(returnNode);
summaryMap[arg] = returnNodes;
}
}
// *Advanced step*:
// Check whether there are call sites that use the current node
std::unordered_set<SEGNodeBase*> outputNodes;
for (auto site : top->getUseSites()) {
if (auto callSite = dyn_cast<SEGCallSite>(site)) {
// For each callee of the call site,
// build a transfer summary
auto callee = callSite->getCalledFunction();
transfer(callSite, callee, top, outputNodes);
}
}
// *Advanced step*:
// Add all return nodes from the callee to the stack for traversal
for (auto p : outputNodes) {
if (!visited.count(p)) {
dfsStack.push(p);
}
}
// Add all parent nodes of the current node to the stack for traversal
for (auto p : top->getParents()) {
if (!visited.count(p)) {
dfsStack.push(p);
}
}
}
}
/**
* Utilizing the argument and return value information of the callee
* to build a transfer summary that can be used directly by the
* caller, without requiring any runtime searches.
*/
void transfer(SEGCallSite cs, Function* callee, SEGNodeBase* input, std::set<SEGNodeBase*>& outputNodes){
...
}
Starting from nodes representing arguments, we perform DFS search over the SEG. To locate the return value of the function,
we need to check the type of current node top
.
Specifically, the dyn_cast<>
operator is used to check whether top
is of type SEGReturnNode
. Based on the result of the check, the path is either dropped or recorded as an arg-ret summary.
Then getParents
interface is used to obtain the next node to be traversed.
When top
is passed as a parameter through a function invocation to a callee function, we should as well consider
the propagation of data dependencies within the callee. Therefore, when a use site of top
is an SEGCallSite
, we utilize
the transfer
function to pre-build the data flow of the callee function (implementation not shown), providing the summary for the current search.
In the above figure, there are two paths that belong to the arg-ret summary, one starting from Argument1
and the other from Argument2
, both leading to Return
. If we do not consider the impact of function calls on data flow propagation, then the path from Argument2
to Return
will be ignored during our analysis, resulting in a loss of accuracy in our analysis.
Feedback
Was this page helpful?
Glad to hear it! Please tell us how we can improve.
Sorry to hear that. Please tell us how we can improve.