Comparing different SAST solutions with one another is no trivial task. Indeed, beyond some straightforward criteria such as a tool’s speed, usability, or integration options, the quintessential question is: How well does it perform in detecting actual vulnerabilities in your code?
This question can be subdivided into two separate questions. First, how many of the actual vulnerabilities in your code are detected and flagged by the tool? This is measured as the true positive rate (TPR): A true positive is a real vulnerability in your code that the tool correctly flags as such, and the true positive rate is the percentage of real vulnerabilities that the tool discovers. Obviously, computing the true positive rate requires knowing how many real vulnerabilities exist in the first place, which is usually unknown for real-world applications. Second, how many pieces of code are flagged as vulnerable even though they are not vulnerable at all? This is measured as the false positive rate (FPR): A false positive is a piece of code that is not actually vulnerable (although it may look as if it were) but that the tool flags nevertheless, and the false positive rate is the percentage of non-vulnerable pieces of code that the tool flags.
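As a minimal sketch of these two definitions, the following Java snippet computes both rates from raw counts. The class name, the method names, and the counts used in main are illustrative assumptions and do not describe any real tool.

```java
/** Minimal sketch: detection rates computed from raw counts (all names are illustrative). */
final class DetectionRates {

    /** Share of real vulnerabilities that the tool reported: TP / (TP + FN). */
    static double truePositiveRate(int truePositives, int falseNegatives) {
        return (double) truePositives / (truePositives + falseNegatives);
    }

    /** Share of non-vulnerable code pieces that the tool flagged anyway: FP / (FP + TN). */
    static double falsePositiveRate(int falsePositives, int trueNegatives) {
        return (double) falsePositives / (falsePositives + trueNegatives);
    }

    public static void main(String[] args) {
        // Hypothetical counts, not measurements of any real tool.
        System.out.println("TPR = " + truePositiveRate(90, 10));
        System.out.println("FPR = " + falsePositiveRate(30, 70));
    }
}
```

Note that both denominators come from the ground truth, not from the tool's report, which is why the rates cannot be computed for code whose real vulnerabilities are unknown.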
Computing these measures for different tools in order to compare them to each other is further complicated by a range of technical problems. For example, tools may report different types of vulnerabilities in different ways. Consider for instance a SQL Injection where a SQL query that incorporates two different unsanitized user inputs is executed. Some tools may report two different vulnerabilities (one for each input), while others may report only one (with two inputs). Some tools may also report quality problems in your code that are valid, but that are of lesser relevance. While those may be, strictly speaking, true positives, saying that this tool has a higher absolute number of true positives than another would distort the comparison with respect to actual, serious vulnerabilities.
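One possible way to make such reports comparable is to normalize them before counting, for example by collapsing all findings that point to the same sink. The sketch below assumes a hypothetical finding format of "sink|inputs"; real tools use their own report formats, and the file name and line number are made up.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Sketch only: the "sink|inputs" finding format is a hypothetical encoding. */
final class FindingNormalizer {

    /** Collapses findings that share a sink, so each vulnerable statement counts once. */
    static Set<String> distinctSinks(List<String> findings) {
        Set<String> sinks = new LinkedHashSet<>();
        for (String finding : findings) {
            int separator = finding.indexOf('|');
            // A finding without a separator is treated as a bare sink location.
            sinks.add(separator < 0 ? finding : finding.substring(0, separator));
        }
        return sinks;
    }

    public static void main(String[] args) {
        // Tool A reports one finding per unsanitized input, tool B one per query.
        List<String> toolA = Arrays.asList("UserDao.java:42|name", "UserDao.java:42|email");
        List<String> toolB = Arrays.asList("UserDao.java:42|name,email");
        System.out.println(distinctSinks(toolA).size() + " vs. " + distinctSinks(toolB).size());
    }
}
```

After normalization, both hypothetical tools are credited with a single SQL Injection, regardless of how many report entries they produced for it.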
This is why, when you want to compare tools with respect to a given application, it makes sense to establish a ground truth of actual vulnerabilities and false vulnerabilities in the code - and then count how many of the actual vulnerabilities are found, and how many of the false vulnerabilities are incorrectly reported by the tools you want to compare. A test application that includes such a ground truth can then be used as a benchmark for comparing tools.
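The counting step can be sketched as follows, assuming the ground truth is available as a map from a test case id to a flag that says whether the case is a real vulnerability. The class, the method names, and the sample data are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Sketch: scores a tool's report against a ground truth (all names are illustrative). */
final class GroundTruthScorer {

    /**
     * Returns {TPR, FPR} for the reported test case ids.
     * If the ground truth lacks one of the two classes, the corresponding rate is NaN.
     */
    static double[] score(Map<Integer, Boolean> groundTruth, Set<Integer> reported) {
        int truePositives = 0, falsePositives = 0, vulnerable = 0, nonVulnerable = 0;
        for (Map.Entry<Integer, Boolean> testCase : groundTruth.entrySet()) {
            boolean flagged = reported.contains(testCase.getKey());
            if (testCase.getValue()) {
                vulnerable++;
                if (flagged) truePositives++;
            } else {
                nonVulnerable++;
                if (flagged) falsePositives++;
            }
        }
        return new double[] {
            (double) truePositives / vulnerable,
            (double) falsePositives / nonVulnerable
        };
    }

    /** Made-up example: cases 1-2 are real vulnerabilities, cases 3-6 are not. */
    static double[] example() {
        Map<Integer, Boolean> groundTruth = new HashMap<>();
        groundTruth.put(1, true);
        groundTruth.put(2, true);
        for (int id = 3; id <= 6; id++) groundTruth.put(id, false);
        Set<Integer> reported = new HashSet<>(Arrays.asList(1, 2, 3));
        return score(groundTruth, reported);
    }

    public static void main(String[] args) {
        double[] rates = example();
        System.out.println("TPR = " + rates[0] + ", FPR = " + rates[1]);
    }
}
```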
Obviously, designing such a benchmark means a substantial amount of work. Fortunately, for Java (as for other common languages), some pre-fabricated benchmarks designed by security experts exist, such as the OWASP Benchmark, the Juliet Test Suite, SecuriBench, WAVSEP, AltoroJ, and others. This is good, yet the results of vulnerability detection tools on these benchmarks should be taken with a grain of salt. Indeed, these benchmarks have been around for a while and are widely known. Hence it is to be expected that developers of vulnerability detection tools have tested their tool extensively on these well-known benchmarks and possibly improved it - both to enhance the overall strength of their tool and to achieve a better score.
Think about it: If you were a teacher and you wanted to test the abilities of your students, you probably would not give them an exam that’s been publicly available for years and that all your students have had ample opportunity to study. In that sense, benchmarks are actually more helpful for developers of vulnerability detection tools than for users of those tools. Developers may notice possible improvements (or bugs) when running their tool on such a benchmark, and by implementing (or fixing) them they will improve the overall quality of their tool. As a user who wishes to compare tools, however, you are better advised to run each tool on a wide variety of applications, including lesser-known or (even better!) self-written code. This will give you a much better idea of how such a tool performs on real-world code. We covered this topic earlier in our blog post about 5 Best Practices for SAST Evaluation.
The OWASP Benchmark
Among all the aforementioned benchmarks, the OWASP Benchmark is arguably the best known. Its most recent version (v1.2) comprises 2,740 test cases distributed across 11 different vulnerability categories. Each test case is either an actual vulnerability (i.e., the tool should report that vulnerability) or a false vulnerability (i.e., the tool should not report that vulnerability). Each test case is assigned a number between 1 and 2,740, with no apparent connection between that number and its vulnerability category or whether it is an actual vulnerability, i.e., the numbers are distributed in a seemingly random fashion. See Table 1 for details.
| Vulnerability Category (CWE) | # of vulnerable test cases | # of non-vulnerable test cases | Total |
|---|---|---|---|
| Path Traversal (22) | 133 | 135 | 268 |
| Command Injection (78) | 126 | 125 | 251 |
| Cross-Site Scripting (79) | 246 | 209 | 455 |
| SQL Injection (89) | 272 | 232 | 504 |
| LDAP Injection (90) | 27 | 32 | 59 |
| Weak Cryptography (327) | 130 | 116 | 246 |
| Weak Hash (328) | 129 | 107 | 236 |
| Weak Randomness (330) | 218 | 275 | 493 |
| Trust Boundary Violation (501) | 83 | 43 | 126 |
| Cookie Misconfiguration (secure flag) (614) | 36 | 31 | 67 |
| XPath Injection (643) | 15 | 20 | 35 |
Table 1: Test cases in OWASP Benchmark v1.2
As can be seen, the OWASP Benchmark establishes a ground truth for true positives and false positives, such that the results of different tools can be easily compared: For each test case, we can determine whether the tool reported the corresponding vulnerability or not. The percentage of reported vulnerabilities among the vulnerable test cases constitutes the true positive rate, while the percentage of reported vulnerabilities among the non-vulnerable test cases constitutes the false positive rate. This way, the results of different tools are fully and directly comparable.
Clearly, the best possible outcome is a TPR of 100% and an FPR of 0%. By contrast, consider a tool that would simply guess uniformly at random in each case whether a vulnerability is present or not: Such a tool would be expected to achieve a TPR of about 50% and an FPR of about 50%. In the same vein, a tool that would always guess that a vulnerability is present would achieve 100% true positives and 100% false positives, while a tool that would always guess that there is no vulnerability would achieve 0% true positives and 0% false positives. Obviously, none of these tools would be of any use (see the OWASP Benchmark Project for further details).
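These baselines can be reproduced with a small simulation. The sketch below uses 1,415 vulnerable and 1,325 non-vulnerable test cases, which are the column sums of Table 1; the class, the method names, and the fixed random seed are assumptions made for illustration.

```java
import java.util.Random;
import java.util.function.BooleanSupplier;

/** Sketch: TPR and FPR of "tools" whose verdict ignores the code under test. */
final class BaselineTools {
    static final int VULNERABLE = 1415;      // column sum of Table 1
    static final int NON_VULNERABLE = 1325;  // column sum of Table 1

    /** Returns {TPR, FPR} when every verdict comes from the given supplier. */
    static double[] rates(BooleanSupplier flagsTestCase) {
        int truePositives = 0;
        for (int i = 0; i < VULNERABLE; i++) {
            if (flagsTestCase.getAsBoolean()) truePositives++;
        }
        int falsePositives = 0;
        for (int i = 0; i < NON_VULNERABLE; i++) {
            if (flagsTestCase.getAsBoolean()) falsePositives++;
        }
        return new double[] {
            (double) truePositives / VULNERABLE,
            (double) falsePositives / NON_VULNERABLE
        };
    }

    /** A tool that flips a fair coin for every test case. */
    static double[] coinFlipRates(long seed) {
        Random random = new Random(seed);
        return rates(random::nextBoolean);
    }

    public static void main(String[] args) {
        double[] alwaysYes = rates(() -> true);
        double[] alwaysNo = rates(() -> false);
        double[] coinFlip = coinFlipRates(42L);
        System.out.println("always yes: TPR=" + alwaysYes[0] + " FPR=" + alwaysYes[1]);
        System.out.println("always no:  TPR=" + alwaysNo[0] + " FPR=" + alwaysNo[1]);
        System.out.println("coin flip:  TPR=" + coinFlip[0] + " FPR=" + coinFlip[1]);
    }
}
```

In all three cases the TPR and the FPR are (nearly) equal, which is what makes these tools useless: a useful tool has to push the two rates apart.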
In order to assess where current tools stand, the OWASP Benchmark developers have published a comparison chart for various open-source and commercial SAST tools, such as Veracode, Checkmarx, Fortify, and Coverity (see Figure 1). As other vendors have done, we simply added our results to this chart for direct comparison.
Figure 1: Comparison of our results to other vendors
RIPS Results on OWASP Benchmark
Notwithstanding everything that’s been said above about benchmark suites, our customers often ask us how we perform on that benchmark, or test it on their own with a free trial. So how well does RIPS perform on the OWASP Benchmark?
RIPS is able to achieve a true positive rate of 100% and a false positive rate of 0%.
As with any other project, running RIPS on the OWASP Benchmark is as simple as hitting the Start Analysis button; no special configuration, as with some other vulnerability detection tools, is required. Each OWASP test case is a self-contained, fully runnable Java Servlet that may or may not be actually exploitable. Many of the test cases are intentionally hard from a static analysis perspective: Among other things, they require a highly accurate interprocedural analysis, a precise type inference mechanism, modeling internal methods of the standard library, or detecting semantically unreachable code. Support for third-party frameworks such as Spring is also required. Fortunately, our language-specific Java engine already supported all of those things. In the following, we would like to share our journey of what still needed improvement.
Achieving 100% True Positives
In our first out-of-the-box run, about 90% of the vulnerable test cases were flagged as vulnerable. Some of the actual vulnerabilities that were not reported could be fixed simply by improving our language-specific analysis model. An actual new feature was needed, however, to correctly process values read from configuration files: Consider the code fragment in Listing 1, an excerpt from a test case that contains an actual vulnerability in the category Weak Hash.
Listing 1: Weak Hash vulnerability in the OWASP Benchmark (test #2677)
In the code snippet in Listing 1, an input string is to be hashed. The
hash algorithm to be used is read from the configuration file
benchmark.properties. If that file cannot be found or does not
contain a configuration value named hashAlg1, the algorithm SHA512 is
used as a fallback. SHA512 can be considered safe, so no
vulnerability should be reported for the fallback value. However, the
OWASP Benchmark does come with a file benchmark.properties, and this
configuration file does specify a value for the key hashAlg1, as
shown in Listing 2.
Listing 2: Excerpt from the file benchmark.properties
As can be seen, MD5 is used as a hash algorithm in this concrete case:
That is, while the Java code does not exhibit a Weak Hash
vulnerability by itself, it is vulnerable in the context it is
executed in due to the presence of the configuration file and its
contents. This file also contains other configuration values (e.g.,
encryption algorithms) and is used in a variety of test cases
throughout the OWASP Benchmark. In order to properly find these
vulnerabilities, our static analysis engine needed the ability to
discover and read configuration files via
Our engine can now successfully discover the corresponding
file, parse it, store the configuration values, use that information
during analysis of the subsequent code, and eventually raise an alarm.
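To illustrate the pattern, here is a minimal sketch of a configuration-dependent hash algorithm. It is not the benchmark's code: only the key hashAlg1, the fallback SHA512, and the configured value MD5 come from the description above, while the class, the helper names, the exact properties line, and the list of algorithm names treated as weak are assumptions.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Properties;
import java.util.Set;

/** Sketch: the hash algorithm depends on a configuration value, not on the Java code alone. */
final class ConfiguredHashSketch {
    // Assumption for this sketch: algorithm names that count as weak.
    private static final Set<String> WEAK = new HashSet<>(Arrays.asList("MD5", "SHA1", "SHA-1"));

    /** Returns the configured algorithm, or the fallback SHA512 if the key is missing. */
    static String resolveAlgorithm(String propertiesContent) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesContent));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("hashAlg1", "SHA512");
    }

    static boolean isWeak(String algorithm) {
        return WEAK.contains(algorithm.toUpperCase(Locale.ROOT));
    }

    static byte[] hash(String algorithm, String input) {
        try {
            return MessageDigest.getInstance(algorithm).digest(input.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("Unknown hash algorithm: " + algorithm, e);
        }
    }

    public static void main(String[] args) {
        String configured = resolveAlgorithm("hashAlg1=MD5\n"); // assumed file content
        String fallback = resolveAlgorithm("");                 // key missing
        System.out.println(configured + " weak? " + isWeak(configured));
        System.out.println(fallback + " weak? " + isWeak(fallback));
        System.out.println("digest length: " + hash(configured, "some input").length);
    }
}
```

An analysis that looks only at the Java source sees the safe fallback; only an analysis that also resolves the properties file can tell that the weak algorithm is the one actually used.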
Having improved our engine as described, the next run showed a true positive rate of 100%. The fact that we were able to achieve a TPR of 100% with relative ease (something that most other SAST products did not manage to do) underlines the completeness and precision of our algorithms.
Achieving 0% False Positives
In our first run out-of-the-box, about 30% of the non-vulnerable test cases were flagged as vulnerable. A closer look at these false positives revealed that each case was due to one of three reasons: Vulnerabilities present in semantically unreachable code, overtainting of collections, and encryption algorithms questionably considered as secure by the benchmark itself.
Of the 30% false positives, almost half were due to vulnerabilities triggered by code that is semantically unreachable: Consider the example in Listing 3.
Listing 3: False Command Injection vulnerability (test case #90)
Here, the variable param contains some
attacker-controllable input. Under certain circumstances, the variable
param is assigned to the variable bar, which is then in turn used
to execute a system command. However, the
circumstances under which this happens never occur, because the
assignment to bar is unreachable: Due to
the arithmetic computation, the then-branch of the
if-statement is always chosen, while the else-branch is never
executed.
The precise code simulation implemented by our engine correctly performs the arithmetic computation in Listing 3. Nevertheless, so far this information was not used to prune unreachable branches with respect to the propagation of tainted values, and, hence, a vulnerability was raised in these cases, even though none exists. We improved our engine by disregarding taints that flow through unreachable code. Thereby, we were able to reduce the false positives from about 30% to about 17%.
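The pattern can be sketched as follows. This is an illustration loosely modelled on the description above, not the benchmark's actual code; the constants, the string literal, and the method name are assumptions.

```java
/** Sketch: a tainted assignment that exists in the code but can never execute. */
final class UnreachableBranchSketch {

    /** Returns the value that would be passed on to the system command. */
    static String selectCommandArgument(String param) {
        String bar;
        int num = 86;
        // 7 * 42 = 294 and 294 - 86 = 208 > 200, so the condition always holds.
        if ((7 * 42) - num > 200) {
            bar = "This_should_always_happen";
        } else {
            bar = param; // tainted, but semantically unreachable
        }
        return bar;
    }

    public static void main(String[] args) {
        // Whatever the attacker sends, the constant is returned.
        System.out.println(selectCommandArgument("; cat /etc/passwd"));
    }
}
```

An engine that evaluates the arithmetic but still propagates taint through both branches reports bar as attacker-controlled; pruning the else-branch removes that false positive.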
The second frequent reason for the false positives that we observed was an imprecise handling of collections: Consider the example in Listing 4.
Listing 4: False SQL Injection vulnerability (test case #200)
As before, the variable
param contains some attacker-controllable
input. In the code excerpt shown in Listing 4, a new empty list is
initialized. Three elements are added to the list, where the second of
these elements is the dangerous variable
param, while the other two
are safe values. Subsequently, the first element of the list is
removed again, so that the variable
param is now the list’s first
element. Finally, the collection’s second element is retrieved, i.e.,
the safe value
"moresafe". Thus, the code in Listing 4 is not
actually exploitable, yet was reported by our engine as a vulnerability before.
Reporting such code is a typical outcome for static analysis tools, which
usually handle collections in one of two ways: Either they do not
consider any of the returned elements as tainted (under-approximation),
or they consider all the returned elements as tainted
(over-approximation). Since our engine already
simulates many methods of the standard library, including the methods
of Collection and its implementing classes, in a highly
accurate manner, improving our analysis engine to consider the exact
returned value when the entire contents of a collection are statically
known was an easy feat and doable without any measurable performance
hits. This improvement further reduced the false positive rate from 17% to under 9%.
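The list operations described above can be sketched in a few lines. The element "moresafe" and the order of operations follow the description; the class name, the method name, and the first element "safe" are assumptions, and this is not the benchmark's actual code.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: a tainted value enters a list, but a different element is read back. */
final class CollectionTaintSketch {

    /** Returns the value that would end up in the SQL query. */
    static String pickValue(String param) {
        List<String> values = new ArrayList<>();
        values.add("safe");
        values.add(param);      // the only tainted element
        values.add("moresafe");
        values.remove(0);       // list is now [param, "moresafe"]
        return values.get(1);   // second element: "moresafe"
    }

    public static void main(String[] args) {
        System.out.println(pickValue("' OR '1'='1"));
    }
}
```

An over-approximating analysis marks the whole list as tainted once param is added, so the value returned by get(1) is reported even though it is a constant.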
All remaining false positives were in the category Weak Cryptography (see Table 1). Namely, all of the 116 non-vulnerable test cases in that category were flagged as vulnerable. In all of these test cases, data is encrypted using some encryption algorithm, as the example in Listing 5 shows.
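As a quick plausibility check of these figures, the following sketch sums the non-vulnerable test cases from Table 1 and computes the share contributed by the Weak Cryptography category. The class and method names are illustrative.

```java
/** Sketch: share of the Weak Cryptography cases among all non-vulnerable test cases. */
final class WeakCryptoShare {
    // Non-vulnerable test cases per category, in the row order of Table 1.
    static final int[] NON_VULNERABLE = {135, 125, 209, 232, 32, 116, 107, 275, 43, 31, 20};
    static final int WEAK_CRYPTO_NON_VULNERABLE = 116;

    static int totalNonVulnerable() {
        int total = 0;
        for (int count : NON_VULNERABLE) {
            total += count;
        }
        return total;
    }

    /** False positive rate if exactly the 116 Weak Cryptography cases are flagged. */
    static double remainingFalsePositiveRate() {
        return (double) WEAK_CRYPTO_NON_VULNERABLE / totalNonVulnerable();
    }

    public static void main(String[] args) {
        System.out.println(WEAK_CRYPTO_NON_VULNERABLE + " of " + totalNonVulnerable()
                + " non-vulnerable test cases: " + remainingFalsePositiveRate());
    }
}
```

With 116 of 1,325 non-vulnerable test cases flagged, the remaining false positive rate is roughly 8.75%, which matches the figure of just under 9%.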
Listing 5: (Alleged) false weak encryption vulnerability (test case #58)
In the example in Listing 5, the used encryption algorithm is
DESede/ECB/PKCS5Padding. That is, data is encrypted using DESede
(also known as Triple DES) as block cipher, ECB as mode of operation,
and using a simple padding scheme described in PKCS#5. In fact,
each of the 116 alleged non-vulnerable Weak Cryptography
test cases in the OWASP Benchmark uses one of the following three
In our opinion, none of these can, in good conscience, be considered as truly secure. The ECB mode of operation is known to preserve the structure of the plaintext and therefore does not hide all information adequately. Decryption with CBC mode of operation may be vulnerable to so-called padding oracle attacks. Similarly, RSA with PKCS#1 v1.5 padding (note that in this case, the “ECB” in that string is only present for consistency reasons, but is actually ignored, since RSA is not a block cipher) is vulnerable to the Bleichenbacher attack. Consequently, an alarm raised by RIPS when data is encrypted using these algorithms is, in our view, sensible; we would like to alert our users to the problems that could exist.
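The structure-preserving property of ECB is easy to observe. The sketch below encrypts a plaintext consisting of two identical 8-byte blocks with DESede/ECB/PKCS5Padding, the transformation named above, and compares the first two ciphertext blocks. The class name, the method name, the sample plaintext, and the use of a freshly generated key are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

/** Sketch: ECB encrypts identical plaintext blocks to identical ciphertext blocks. */
final class EcbStructureDemo {

    /** Returns true if the first two ciphertext blocks are identical. */
    static boolean firstTwoBlocksMatch(String plaintext) {
        try {
            SecretKey key = KeyGenerator.getInstance("DESede").generateKey();
            Cipher cipher = Cipher.getInstance("DESede/ECB/PKCS5Padding");
            cipher.init(Cipher.ENCRYPT_MODE, key);
            byte[] ciphertext = cipher.doFinal(plaintext.getBytes(StandardCharsets.US_ASCII));
            int blockSize = cipher.getBlockSize(); // 8 bytes for DESede
            byte[] first = Arrays.copyOfRange(ciphertext, 0, blockSize);
            byte[] second = Arrays.copyOfRange(ciphertext, blockSize, 2 * blockSize);
            return Arrays.equals(first, second);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Cipher not available", e);
        }
    }

    public static void main(String[] args) {
        // "SECRET!!" is exactly one 8-byte block; the plaintext repeats it twice.
        System.out.println("identical blocks leak: " + firstTwoBlocksMatch("SECRET!!SECRET!!"));
    }
}
```

Because each block is encrypted independently with the same key, an observer who sees only the ciphertext can tell that the first two plaintext blocks are equal, without knowing the key.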
Clearly, it would be trivial to tell our engine that these algorithms are safe; as a consequence, we would immediately achieve a TPR of 100% with an FPR of 0% (and in fact, we internally did so in a test run, just to confirm that we do indeed get 100% true positives at 0% false positives, and we did). Yet that would have been plainly dishonest towards our users: Hiding actual vulnerabilities simply because the OWASP Benchmark doesn’t consider them as vulnerable is obviously a no-go.
Instead, we reported this problem to the developers of the OWASP Benchmark and explained our concerns. To our relief, they fully agreed that these test cases are indeed questionable and that they need to be updated so as to use truly secure encryption algorithms. We would like to take the opportunity to express our gratitude towards the OWASP Benchmark developer team for their quick reaction and their professionalism, and for maintaining such an extensive suite. We are looking forward to seeing an updated version of the OWASP Benchmark.
Update 30 March 2020: The OWASP Benchmark developers have now released an update of the OWASP Benchmark addressing the aforementioned issue. All non-vulnerable test cases for the Weak Encryption issue type now use truly secure encryption algorithms. We tested this new version of the OWASP Benchmark with RIPS and successfully verified that it does achieve 100% true positives at 0% false positives.
In this blog post, we’ve taken a close look at the OWASP Benchmark and reported on our results. To the best of our knowledge, RIPS is the first SAST tool to achieve a true positive rate of 100% and a false positive rate of 0% - a perfect score. It once again bolsters our belief that our language-specific approach to static analysis is both extremely thorough and highly precise.
Such results, however, should not be overrated. As we argued at the beginning and demonstrated by example throughout this blog post, it is possible to optimize tools to perform well on a particular benchmark. This is not to say that this is a bad thing per se. On the contrary, a benchmark can be a valuable help to the developers of a vulnerability detection tool in improving it. As we’ve shown, investigating the OWASP Benchmark prompted several improvements to our tool that will undoubtedly improve our results on other applications as well. Yet the simple fact that a vulnerability detection tool performs well on a benchmark cannot be taken as the sole indicator of how good it is at detecting vulnerabilities in general, for arbitrary applications.
In addition, as we’ve seen, benchmarks may themselves contain test cases that are either of no substantial relevance for real-world applications or that are questionable in the first place. We have seen and reported similar cases in other test suites. To truly compare tools with one another, it is therefore advisable to run the tools on various real applications, possibly including, but certainly not limited to, benchmarks.