Preface

Willows scatter the gentle breeze; green hills ease my cares.

Setting up a CodeQL environment and getting started.

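Background

Semmle created QL (Semmle QL), a query language for code analysis, and ran it on its own LGTM platform.

LGTM hosted open-source projects. Users could pick a language to analyze and write QL queries to check a program for security problems.

In 2019 GitHub (owned by Microsoft) acquired Semmle. The CodeQL queries and libraries are open source; the analysis engine itself ships as closed-source CodeQL CLI binaries that are free to use on open-source code and for research.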

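The CodeQL Engine

How the CodeQL engine works:

The Extractor module analyzes the source code, extracts information from it, and builds a relational database, the Snapshot Database.

Compiled languages: the Extractor observes the build and captures what the compiler produces: the AST, semantic information (name binding, type information, operations, and so on), control flow and data flow information, plus a copy of the source code.

Interpreted languages: the Extractor analyzes the source code directly.
The Snapshot Database contains the source code and the relational data.
The user then writes a QL query, which the CodeQL toolchain compiles into a Compiled Query and evaluates against the database.
Finally, the query results are displayed.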

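What does the Extractor do?

As the figure shows:

It first makes a copy of the source code under analysis and keeps it for later use.
It converts the source code into relational data, stored as trap files inside the database (for example, each Java file can produce one trap file).
Together, the copy and the database form the snapshot.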

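Environment Setup

The CodeQL CLI serves as the toolchain.
VSCode serves as the query front end.

CodeQL-CLI-binaries

The command-line tool.

Download: https://github.com/github/codeql-cli-binaries
On macOS, download the osx build.

Configuring the environment variable

Point the VSCode extension at $CODEQL: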

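CodeQL Library

The standard CodeQL queries and libraries. Queries import these libraries, so nothing compiles without them.

Download: https://github.com/github/codeql

The codeql-lib folder holds the QL modules (.qll files) for each language. Each module defines classes that match particular constructs of the target language. These classes fall into four groups:

Syntactic
Control flow
Type inference
Taint tracking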
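
As a rough illustration of the syntactic group, the sketch below uses two AST classes from the Java library, IfStmt and BlockStmt, to find if statements whose then-branch is an empty block. It assumes a reasonably recent Java library (older releases, I believe, named the block class Block), and the message text is made up for the example.

```ql
import java

// IfStmt and BlockStmt are syntactic classes: each value is one AST node.
from IfStmt ifstmt, BlockStmt block
where
  ifstmt.getThen() = block and // the then-branch is a block ...
  block.getNumStmt() = 0 // ... that contains no statements
select ifstmt, "This 'if' statement has an empty then-branch."
```

The control flow, type inference, and taint tracking groups are used the same way: import the module, then constrain values of its classes in the where clause.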

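Preparing the Source Code

Following the write-up by 楼兰, I reproduced the steps locally.

Downloading the source code to analyze

I chose WebGoat, the JDK 8 version.

git clone --branch v8.0.0 https://github.com/WebGoat/WebGoat.git

Building the database

At this point codeql is an ordinary command-line tool that you can call from any terminal.

In the WebGoat root directory, create the database:

cd WebGoat
codeql database create webgoat-qldb -l java

The resulting database: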

VSCode

Configuring the database

Load the webgoat-qldb database generated in the previous step into VSCode with "From a folder".

Configuring the QL Pack

QL packs organize the files used in CodeQL analysis and can store queries, library files, query suites, and important metadata. Their root directory must contain a file named qlpack.yml. Your custom queries should be saved in the QL pack root, or its subdirectories.

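Following the official documentation, create a folder named codeql-query to hold the configuration file and the queries. This folder is one QL pack. Then write its configuration file, qlpack.yml:

name: example-query # pack name, must be unique, required
version: 0.0.0 # version number, required
libraryPathDependencies: codeql-java # dependency, required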

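Adding to the workspace

Use "Add folder to workspace" and make sure both codeql-home and the codeql-query pack are in the workspace:

QL Queries

After writing a QL script, right-click and choose "CodeQL: Run Query".

The screenshot shows a query for all methods, with the results on the right. Click a result to open the corresponding source code.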
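
A query along the lines of the one in the screenshot could look like the sketch below. The fromSource() filter and the second result column are additions for readability, not something the original query is known to contain.

```ql
import java

// List every method that is defined in the analyzed source code
// (fromSource() excludes methods that only exist in library jars).
from Method m
where m.fromSource()
select m, m.getDeclaringType().getQualifiedName()
```

Save it as a .ql file inside the codeql-query pack so that the import resolves against the pack's dependency.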

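Details

Extension

With the VSCode extension installed, right-clicking in the editor shows these commands:

How they differ:

CodeQL: Run Query runs the whole query (the most common choice)
CodeQL: Quick Evaluation evaluates only the predicate or fragment selected with the mouse
CodeQL: Run Query on Multiple Databases runs the query against several databases at once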

CodeQL queries

The important query types are:

Alert queries: queries that highlight issues in specific locations in your code.
Path queries: queries that describe the flow of information between a source and a sink in your code.

Ways to run queries:

You can add custom queries to QL packs to analyze your projects with “Code scanning”, use them to analyze a database with the “CodeQL CLI,” or you can contribute to the standard CodeQL queries in our open source repository on GitHub.

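Metadata

Metadata states the purpose of a query, how to interpret its results, and how to display them.

Different ways of running a query require different metadata.

If you are contributing a query to the GitHub repository, read the Query metadata style guide.
If you are adding a custom query to a QL pack that LGTM analyzes, read "Writing custom queries to include in LGTM analysis".
If you run the query with the CodeQL CLI, the metadata must include @kind.
If you run the query in the LGTM query console or the VSCode extension, metadata is optional.

However, if you want the results displayed as alerts or paths, @kind is required. For more, see "Analyzing your projects".

Alert query: must include @kind problem

Path query: must include @kind path-problem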

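Metadata properties

For more detail, see the Query metadata style guide.

Property	Value	Description
`@description`	`<text>`	Describes the purpose of the query; enclose code elements in `'`
`@id`	`<text>`	Identifies and classifies the query; parts are separated by `/` and `-`. LGTM template: `<language>/<brief-description>`
`@kind`	`problem` `path-problem`	Whether the query is an alert (`@kind problem`) or a path (`@kind path-problem`)
`@name`	`<text>`	Name of the query; enclose code elements in `'`
`@tags`	`correctness` `maintainability` `readability` `security`	Category of the query
`@precision`	`low` `medium` `high` `very-high`	Precision of the query (how low its false-positive rate is)
`@problem.severity`	`error` `warning` `recommendation`	Severity of any alert produced by a non-security query
`@security-severity`	`<score>`	Severity score of the vulnerability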

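Notes:

tags
- @tags correctness: for queries that detect incorrect program behavior.
- @tags maintainability: for queries that detect patterns that make a program hard to change.
- @tags readability: for queries that detect patterns that make source code hard to read.
- @tags security: for queries that detect security problems in a program.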
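
To show how these properties fit together, here is a minimal sketch of an alert query with a metadata header. The name, id, description, and the check itself (methods with an empty body) are invented for illustration and are not taken from the official query set.

```ql
/**
 * @name Empty method (example)
 * @description Finds methods whose body contains no statements.
 * @kind problem
 * @id java/example/empty-method
 * @problem.severity recommendation
 * @precision medium
 * @tags maintainability
 */

import java

from Method m
where
  m.fromSource() and
  m.getBody().getNumStmt() = 0 // abstract methods have no body and are skipped
select m, "Method '" + m.getName() + "' has an empty body."
```

Because the header says @kind problem, the select clause has the alert shape: an element first, then a message string.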

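A small example following the official conventions:

Formatting query results

select f, "This file is similar to $@.", other, other.getBaseName()

$@ is a placeholder that consumes the two columns after the message: it is displayed as the text in the last column, rendered as a link to the element in the column before it.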
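
The same pattern can be tried on the WebGoat database with a query such as the sketch below. The check is deliberately trivial and its metadata values are made up; the point is only the shape of the select clause.

```ql
/**
 * @name Method and its declaring type (example)
 * @description Demonstrates the $@ placeholder in an alert message.
 * @kind problem
 * @id java/example/placeholder-demo
 * @problem.severity recommendation
 */

import java

from Method m, RefType t
where
  m.fromSource() and
  t = m.getDeclaringType()
// $@ is shown as t.getName() and links to the location of t.
select m, "This method is declared in $@.", t, t.getName()
```

Each additional $@ in the message needs one more element/text pair at the end of the select clause.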

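About data flow analysis

The data flow graph

Nodes in the AST are statements and expressions: Stmt, Expr

Nodes in the data flow graph are semantic elements that carry values: ExprNode, ParameterNode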

Some AST nodes (such as expressions) have corresponding data flow nodes, but others (such as if statements) do not. This is because expressions are evaluated to a value at runtime, whereas if statements are purely a control-flow construct and do not carry values. There are also data flow nodes that do not correspond to AST nodes at all.

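AST nodes: Stmt, Expr

An Expr has a corresponding data flow node, because an expression evaluates to a value at runtime.

A Stmt such as if has no corresponding data flow node: it is purely a control-flow construct and carries no value.

The data flow graph also contains nodes that never appear in the abstract syntax tree at all.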

Modeling

There are two kinds:

Local data flow: analysis within a single function
Global data flow: analysis across functions
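
Local data flow needs no configuration class, so it is the easier of the two to try first. The sketch below, which assumes the Java data flow library, lists every expression that a method parameter can flow to without leaving its own method.

```ql
import java
import semmle.code.java.dataflow.DataFlow

from Parameter p, Expr e
where
  p.getCallable().fromSource() and
  // localFlow holds when the value of p reaches e in zero or more local steps
  DataFlow::localFlow(DataFlow::parameterNode(p), DataFlow::exprNode(e))
select p, e
```

Global data flow is more expensive, which is why it is set up through a configuration that narrows the sources and sinks.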

Data flow analysis + taint analysis

For example, if you are tracking an insecure object x (which might be some untrusted or potentially malicious data), a step in the program may ‘change’ its value. So, in a simple process such as y = x + 1, a normal data flow analysis will highlight the use of x, but not y. However, since y is derived from x, it is influenced by the untrusted or ‘tainted’ information, and therefore it is also tainted. Analyzing the flow of the taint from x to y is known as taint tracking.
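
The difference can be made visible with the local variants of both analyses. This sketch, assuming the Java libraries, reports expressions that are tainted by a parameter but that plain data flow does not reach, which is the y = x + 1 situation described above.

```ql
import java
import semmle.code.java.dataflow.DataFlow
import semmle.code.java.dataflow.TaintTracking

from Parameter p, Expr e
where
  p.getCallable().fromSource() and
  // e is derived from p (taint steps such as arithmetic or concatenation) ...
  TaintTracking::localTaint(DataFlow::parameterNode(p), DataFlow::exprNode(e)) and
  // ... but e does not hold the unchanged value of p
  not DataFlow::localFlow(DataFlow::parameterNode(p), DataFlow::exprNode(e))
select p, e
```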

About path queries

Official template:

/**
 * ...
 * @kind path-problem
 * ...
 */

import <language>
// For some languages (Java/C++/Python) you need to explicitly import the data flow library, such as
// import semmle.code.java.dataflow.DataFlow
import DataFlow::PathGraph
...

from MyConfiguration config, DataFlow::PathNode source, DataFlow::PathNode sink
where config.hasFlowPath(source, sink)
select sink.getNode(), source, sink, "<message>"

import DataFlow::PathGraph

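PathGraph provides the built-in query predicate edges, which defines the edges of the path graph: edges(a, b) holds when there is a data flow step from a to b.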
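
The template above uses the class-based configuration API. As far as I know, newer CodeQL releases deprecate it in favor of a module-based API, so the following is a sketch of the same path query in that style. The source (RemoteFlowSource) and the sink (the first argument of any call to something named executeQuery) are assumptions chosen for the example, as are the metadata values.

```ql
/**
 * @name Remote input reaching a query call (example)
 * @kind path-problem
 * @id java/example/remote-to-execute-query
 * @problem.severity warning
 */

import java
import semmle.code.java.dataflow.DataFlow
import semmle.code.java.dataflow.TaintTracking
import semmle.code.java.dataflow.FlowSources

module MyConfig implements DataFlow::ConfigSig {
  predicate isSource(DataFlow::Node node) { node instanceof RemoteFlowSource }

  predicate isSink(DataFlow::Node node) {
    exists(Call c |
      c.getCallee().hasName("executeQuery") and
      node.asExpr() = c.getArgument(0)
    )
  }
}

module MyFlow = TaintTracking::Global<MyConfig>;

// This PathGraph supplies the edges predicate for the instantiated flow.
import MyFlow::PathGraph

from MyFlow::PathNode source, MyFlow::PathNode sink
where MyFlow::flowPath(source, sink)
select sink.getNode(), source, sink, "Value reaches this call from $@.", source.getNode(),
  "user input"
```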

Debugging data flow analysis

Typical template:

class MyConfig extends TaintTracking::Configuration {
  MyConfig() { this = "MyConfig" }

  override predicate isSource(DataFlow::Node node) { node instanceof MySource }

  override predicate isSink(DataFlow::Node node) { node instanceof MySink }
}

from MyConfig config, DataFlow::PathNode source, DataFlow::PathNode sink
where config.hasFlowPath(source, sink)
select sink.getNode(), source, sink, "Sink is reached from $@.", source.getNode(), "here"
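
The template leaves MySource and MySink undefined. One possible way to fill them in is sketched below; both definitions are hypothetical choices for WebGoat-style code, not part of the official template. The trailing select only lists the sinks so that the file runs as a query on its own.

```ql
import java
import semmle.code.java.dataflow.DataFlow
import semmle.code.java.dataflow.FlowSources

/** Hypothetical source: anything the standard library models as remote user input. */
class MySource extends DataFlow::Node {
  MySource() { this instanceof RemoteFlowSource }
}

/** Hypothetical sink: the first argument of a call to any callable named `executeQuery`. */
class MySink extends DataFlow::Node {
  MySink() {
    exists(Call c |
      c.getCallee().hasName("executeQuery") and
      this.asExpr() = c.getArgument(0)
    )
  }
}

from MySink sink
select sink, sink.getEnclosingCallable()
```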

Checking the source and sink sets

Select node instanceof MySource, right-click, and run "CodeQL: Quick Evaluation" to see your source set.
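
When the source set is large, a summary is often easier to read than the raw list. Assuming the sources are based on RemoteFlowSource (an assumption; the text does not say what MySource is), this sketch counts them by source type.

```ql
import java
import semmle.code.java.dataflow.FlowSources

// One row per kind of remote input, with the number of source nodes of that kind.
from string kind
where exists(RemoteFlowSource s | s.getSourceType() = kind)
select kind, count(RemoteFlowSource s | s.getSourceType() = kind)
```

An empty result here means the configuration has nothing to start from, so no path can ever be found.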

Setting the `fieldFlowBranchLimit` property

override int fieldFlowBranchLimit() { result = 5000 }

Every data-flow configuration has a fieldFlowBranchLimit property.

Setting it too high wastes performance; setting it too low can cause paths to be missed.

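Partial flow

"Partial data flows"

If you want to see every partial data flow, the naive approach takes two steps:

Use the usual hasFlow
Set isSink to any()

But this does not find all partial flows, because the built-in CodeQL data flow library works hard to prune impossible paths.

The officially recommended predicate: Configuration.hasPartialFlow

final predicate hasPartialFlow(PartialPathNode source, PartialPathNode node, int dist) {}

Usage: Configuration.hasPartialFlow

First set explorationLimit. It is the upper bound of the search: only nodes whose distance dist from the source is at most this value are explored:

override int explorationLimit() { result = 5 }

The recommended way is to wrap it in a predicate: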

predicate adhocPartialFlow(Callable c, DataFlow::PartialPathNode n, DataFlow::Node src, int dist) {
  exists(MyConfig conf, DataFlow::PartialPathNode source |
    conf.hasPartialFlow(source, n, dist) and
    src = source.getNode() and
    c = n.getNode().getEnclosingCallable()
  )
}
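
Putting the pieces of this section together, a complete exploration query might look like the sketch below. It assumes the class-based API used throughout this post, which may no longer be available in recent releases, and it picks RemoteFlowSource as the source purely for illustration. The sink is set to none() because only the forward exploration from the sources matters here.

```ql
import java
import semmle.code.java.dataflow.DataFlow
import semmle.code.java.dataflow.TaintTracking
import semmle.code.java.dataflow.FlowSources

class MyConfig extends TaintTracking::Configuration {
  MyConfig() { this = "MyConfig" }

  override predicate isSource(DataFlow::Node node) { node instanceof RemoteFlowSource }

  // No sinks: we only want to see how far flow gets from each source.
  override predicate isSink(DataFlow::Node node) { none() }

  // Explore at most 5 steps away from a source.
  override int explorationLimit() { result = 5 }
}

from MyConfig conf, DataFlow::PartialPathNode source, DataFlow::PartialPathNode node, int dist
where conf.hasPartialFlow(source, node, dist)
select node.getNode(), source.getNode(), dist
```

If an expected node is missing from the results, raise explorationLimit step by step; the last node that does appear is usually where the flow breaks.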
