Comparing different SAST solutions is no trivial task. Beyond straightforward criteria such as a tool’s speed, usability, or integration options, the central question is: How well does it perform in detecting actual vulnerabilities in your code?

Benchmark Metrics

This question can be subdivided into two separate questions. First, how many of the actual vulnerabilities in your code does the tool detect and flag? This is measured as the true positive rate (TPR): A true positive is a real vulnerability in your code that the tool correctly flags as such, and the true positive rate is the percentage of real vulnerabilities that the tool discovers. Obviously, computing the true positive rate requires knowing how many real vulnerabilities exist in the first place, which is usually unknown for real-world applications. Second, how many pieces of code are flagged as vulnerable even though they are not vulnerable at all? This is measured as the false positive rate (FPR): A false positive is a piece of code that looks like it might be vulnerable, but really is not, and that the tool flags nevertheless.
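
Expressed in raw counts - true positives (TP), false negatives (FN, missed real vulnerabilities), false positives (FP), and true negatives (TN, correctly unflagged code) - the standard definitions of the two rates are:

TPR = TP / (TP + FN)
FPR = FP / (FP + TN)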

Computing these measures for different tools in order to compare them is further complicated by a range of technical problems. For example, tools may report the same vulnerability in different ways. Consider, for instance, a SQL Injection where a query incorporating two different unsanitized user inputs is executed: Some tools may report two vulnerabilities (one for each input), while others may report only one (with two inputs). Some tools may also report quality problems in your code that are valid, but of lesser relevance. While those may be, strictly speaking, true positives, saying that such a tool has a higher absolute number of true positives than another would distort the comparison with respect to actual, serious vulnerabilities.

This is why, when you want to compare tools with respect to a given application, it makes sense to establish a ground truth of actual vulnerabilities and false vulnerabilities in the code - and then count how many of the actual vulnerabilities are found, and how many of the false vulnerabilities are incorrectly reported by the tools you want to compare. A test application that includes such a ground truth can then be used as a benchmark for comparing tools.

Benchmark Suites

Obviously, designing such a benchmark means a substantial amount of work. Fortunately, for Java (as for other common languages), some pre-fabricated benchmarks designed by security experts exist, such as the OWASP Benchmark, the Juliet Test Suite, SecuriBench, WAVSEP, AltoroJ, and others. This is good, yet the results of vulnerability detection tools on these benchmarks should be taken with a grain of salt: These benchmarks have been around for a while and are widely known, so it is to be expected that the developers of vulnerability detection tools have tested their tools extensively on them and improved their tools accordingly - both to enhance the tools' overall strength and to achieve a better score.

Think about it: If you were a teacher and wanted to test the abilities of your students, you probably would not give them an exam that has been publicly available for years and that all your students have had ample opportunity to study. In that sense, benchmarks are actually more helpful for the developers of vulnerability detection tools than for the users of those tools. Developers may notice possible improvements (or bugs) when running their tool on such a benchmark, and by implementing the improvements (or fixing the bugs) they raise the overall quality of their tool. As a user who wishes to compare tools, however, you are well advised to run each tool on a wide variety of applications, including lesser-known or (even better!) self-written code. This will give you a much better idea of how a tool performs on real-world code. We covered this topic earlier in our blog post about 5 Best Practices for SAST Evaluation.

The OWASP Benchmark

Among the aforementioned benchmarks, the OWASP Benchmark is arguably the most well-known. Its most recent version (v1.2) comprises 2,740 test cases distributed across 11 different vulnerability categories. Each test case is either an actual vulnerability (i.e., the tool should report it) or a false vulnerability (i.e., the tool should not report it). Each test case is assigned a number between 1 and 2,740, with no apparent connection between that number and the test case's vulnerability category or whether it is an actual vulnerability; the numbers appear to be distributed randomly. See Table 1 for details.

Vulnerability Category (CWE)                  # vulnerable   # non-vulnerable   Total
Path Traversal (22)                                    133                135     268
Command Injection (78)                                 126                125     251
Cross-Site Scripting (79)                              246                209     455
SQL Injection (89)                                     272                232     504
LDAP Injection (90)                                     27                 32      59
Weak Cryptography (327)                                130                116     246
Weak Hash (328)                                        129                107     236
Weak Randomness (330)                                  218                275     493
Trust Boundary Violation (501)                          83                 43     126
Cookie Misconfiguration (secure flag) (614)             36                 31      67
XPath Injection (643)                                   15                 20      35
Total                                                 1415               1325    2740

Table 1: Test cases in OWASP Benchmark v1.2

As can be seen, the OWASP Benchmark establishes a ground truth for true positives and false positives, such that the results of different tools can be easily compared: For each test case, we can determine whether the tool reported the corresponding vulnerability or not. The percentage of reported vulnerabilities among the vulnerable test cases constitutes the true positive rate, while the percentage of reported vulnerabilities among the non-vulnerable test cases constitutes the false positive rate. This way, the results of different tools are fully and directly comparable.

Clearly, the best possible outcome is a TPR of 100% and an FPR of 0%. By contrast, consider a tool that would simply guess uniformly at random in each case whether a vulnerability is present or not: Such a tool would be expected to achieve a TPR of about 50% and an FPR of about 50%. In the same vein, a tool that would always guess that a vulnerability is present would achieve 100% true positives and 100% false positives, while a tool that would always guess that there is no vulnerability would achieve 0% true positives and 0% false positives. Obviously, none of these tools would be of any use (see the OWASP Benchmark Project for further details).
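
To make the computation concrete, here is a minimal sketch in Java (with hypothetical class and method names, not the official OWASP scorecard tooling) that derives TPR and FPR from the totals in Table 1 and a tool's raw counts of flagged test cases; the counts used here roughly correspond to the ~90% TPR and ~30% FPR of our first out-of-the-box run described later in this post. The last line also prints TPR minus FPR (Youden's index), a simple combined score that rewards distance from the random-guess diagonal.

// Hypothetical helper, illustrative only: compute TPR/FPR from raw counts.
public final class BenchmarkRates {

    static double truePositiveRate(int flaggedVulnerable, int vulnerableTotal) {
        return 100.0 * flaggedVulnerable / vulnerableTotal;
    }

    static double falsePositiveRate(int flaggedNonVulnerable, int nonVulnerableTotal) {
        return 100.0 * flaggedNonVulnerable / nonVulnerableTotal;
    }

    public static void main(String[] args) {
        // Hypothetical counts against the v1.2 totals from Table 1.
        double tpr = truePositiveRate(1274, 1415);   // ~90%
        double fpr = falsePositiveRate(398, 1325);   // ~30%

        // TPR minus FPR is 0 for any pure guessing strategy and 100 for a perfect tool.
        System.out.printf("TPR=%.1f%% FPR=%.1f%% score=%.1f%n", tpr, fpr, tpr - fpr);
    }
}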

In order to assess where current tools stand, the OWASP Benchmark developers have published a comparison chart for various open-source and commercial SAST tools, such as Veracode, Checkmarx, Fortify, and Coverity (see Figure 1). Following common practice, we simply added our own results to this chart for direct comparison.

Figure 1: Comparison of our results to other vendors

RIPS Results on OWASP Benchmark

Notwithstanding everything said above about benchmark suites, our customers often ask us how we perform on this benchmark, or test it themselves with a free trial. So how well does RIPS perform on the OWASP Benchmark?

RIPS is able to achieve a true positive rate of 100% and a false positive rate of 0%.

As with any other project, running RIPS on the OWASP Benchmark is as simple as hitting the Start Analysis button; unlike with some other vulnerability detection tools, no special configuration is required. Each OWASP test case is a self-contained, fully runnable Java Servlet that may or may not be actually exploitable. Many of the test cases are intentionally hard from a static analysis perspective: Among other things, they require highly accurate interprocedural analysis, precise type inference, modeling of internal methods of the standard library, and detection of semantically unreachable code. Support for third-party frameworks such as Spring is also required. Fortunately, our language-specific Java engine already supported all of this. In the following, we would like to share our journey and describe what still needed improvement.
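
To give an idea of what such a test case looks like, here is a deliberately simplified, hypothetical servlet in the spirit of the benchmark's test cases - not an actual test from the suite, and with illustrative names only: a request parameter (the source) may or may not reach a dangerous API call (the sink), depending on what happens in between.

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class IllustrativeTestCase extends HttpServlet {
    @Override
    protected void doPost(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        // Source: attacker-controllable request parameter.
        String param = request.getParameter("vector");

        // Whether this is an actual or a false vulnerability depends on what
        // happens between source and sink (branching, collections, sanitization, ...),
        // which is exactly what the benchmark varies.
        String cmd = "echo " + param;

        // Sink: OS command execution.
        Runtime.getRuntime().exec(cmd);
    }
}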

Achieving 100% True Positives

In our first out-of-the-box run, about 90% of the vulnerable test cases were flagged as vulnerable. Some of the missed vulnerabilities could be caught simply by improving our language-specific analysis model. One genuinely new feature was needed, however: the ability to correctly process values read from configuration files. Consider the code fragment in Listing 1, an excerpt from a test case that contains an actual vulnerability in the category Weak Hash.

java.util.Properties props = new java.util.Properties();
props.load(this.getClass().getClassLoader().getResourceAsStream("benchmark.properties"));
String algo = props.getProperty("hashAlg1", "SHA512");
java.security.MessageDigest md = java.security.MessageDigest.getInstance(algo);

Listing 1: Weak Hash vulnerability in the OWASP Benchmark (test #2677)

In the code snippet in Listing 1, an input string is to be hashed. The hash algorithm to be used is read from the configuration file benchmark.properties. If that file cannot be found or does not contain a configuration value named hashAlg1, the algorithm SHA512 is used as a fallback. SHA512 may be considered safe, so no vulnerability should be raised for the fallback value alone. However, the OWASP Benchmark does come with a file benchmark.properties, and this configuration file does specify a value for the key hashAlg1, as shown in Listing 2.

hashAlg1=MD5

Listing 2: Excerpt from the file benchmark.properties

As can be seen, MD5 is used as the hash algorithm in this concrete case: While the Java code does not exhibit a Weak Hash vulnerability by itself, it is vulnerable in the context it is executed in, due to the presence of the configuration file and its contents. This file also contains other configuration values (e.g., encryption algorithms) and is used in a variety of test cases throughout the OWASP Benchmark. In order to properly find these vulnerabilities, our static analysis engine needed the ability to discover and read configuration files loaded via java.util.Properties. Our engine can now discover the corresponding file, parse it, store the configuration values, use that information during analysis of the subsequent code, and eventually raise an alarm as expected.
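
The following small, self-contained program (hypothetical class name, illustrative only) shows what resolving Listing 1 against the shipped configuration file amounts to: with the benchmark's benchmark.properties on the classpath, the effective algorithm is MD5, and "SHA512" only serves as a fallback.

import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

public final class EffectivePropertyDemo {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        try (InputStream in = EffectivePropertyDemo.class.getClassLoader()
                .getResourceAsStream("benchmark.properties")) {
            if (in != null) {
                props.load(in);
            }
        }
        // With the benchmark's benchmark.properties on the classpath this prints
        // "MD5"; the "SHA512" default only applies if the key is missing.
        System.out.println(props.getProperty("hashAlg1", "SHA512"));
    }
}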

With these improvements in place, the next run of our engine showed a true positive rate of 100%. The fact that we were able to achieve a TPR of 100% with relative ease (something that most other SAST products did not manage to do) underlines the completeness and precision of our algorithms.

Achieving 0% False Positives

In our first out-of-the-box run, about 30% of the non-vulnerable test cases were flagged as vulnerable. A closer look at these false positives revealed that each of them was due to one of three reasons: Vulnerabilities located in semantically unreachable code, overtainting of collections, and encryption algorithms that the benchmark itself questionably considers secure.

Unreachable code

Of the 30% false positives, almost half were due to vulnerabilities triggered by code that is semantically unreachable: Consider the example in Listing 3.

int num = 86;
if ( (7*42) - num > 200 )
  bar = "This_should_always_happen";
else bar = param;

Runtime r = Runtime.getRuntime();
Process p = r.exec(cmd + bar);

Listing 3: False Command Injection vulnerability (test case #90)

Here, the variable param contains attacker-controllable input. Under certain circumstances, param is assigned to the variable bar, which is in turn used to execute a system command. However, those circumstances never occur, because the assignment from param to bar is unreachable: The condition evaluates to (7*42) - 86 = 208, which is greater than 200, so the then-branch of the if-statement is always taken and the else-branch is never evaluated.

The precise code simulation implemented by our engine correctly performs the arithmetic computation in Listing 3. However, this information was previously not used to prune unreachable branches with respect to the propagation of tainted values, and hence a vulnerability was reported in these cases even though none exists. We improved our engine to disregard taints that flow through unreachable code, which reduced the false positives from about 30% to about 17%.
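
The following toy program (purely illustrative, not our engine's actual implementation) captures the idea: when a branch condition folds to a constant at analysis time, only the reachable branch contributes to the taint of bar.

public final class BranchPruningSketch {
    public static void main(String[] args) {
        boolean paramTainted = true; // taint of the attacker-controllable input

        // The condition from Listing 3 folds to a constant: 7*42 - 86 = 208 > 200.
        int num = 86;
        boolean condition = (7 * 42) - num > 200;

        // Taint of 'bar' per branch: the then-branch assigns a constant string,
        // the else-branch assigns the tainted parameter.
        boolean barTaintedThen = false;
        boolean barTaintedElse = paramTainted;

        // Without pruning, both branches are merged and 'bar' is considered tainted;
        // with pruning, only the branch that can actually execute contributes.
        boolean merged = barTaintedThen || barTaintedElse;
        boolean pruned = condition ? barTaintedThen : barTaintedElse;

        System.out.println("merged=" + merged + " pruned=" + pruned); // merged=true pruned=false
    }
}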

Overtainted collections

The second frequent reason for the false positives that we observed was an imprecise handling of collections: Consider the example in Listing 4.

String bar = "alsosafe";
if (param != null) {
  java.util.List<String> valuesList = new java.util.ArrayList<String>( );
  valuesList.add("safe");
  valuesList.add( param );
  valuesList.add( "moresafe" );

  valuesList.remove(0); // remove the 1st safe value
  bar = valuesList.get(1); // get the last 'safe' value
}
String sql = "SELECT * from USERS where NAME='foo' and PASSWD='" + bar + "'";
org.owasp.benchmark.helpers.DatabaseHelper.JDBCtemplate.batchUpdate(sql);

Listing 4: False SQL Injection vulnerability (test case #200)

As before, the variable param contains attacker-controllable input. In the code excerpt shown in Listing 4, a new empty list is initialized and three elements are added to it: The second element is the dangerous variable param, while the other two are safe constant values. Subsequently, the first element of the list is removed, so that param becomes the list's first element. Finally, the list's second element is retrieved, i.e., the safe value "moresafe". Thus, the code in Listing 4 is not actually exploitable, yet our engine previously reported it as a vulnerability.

Static analysis tools typically handle such collections in one of two ways: Either none of the returned elements is considered tainted (under-approximation), or all of them are (over-approximation). Since our engine already simulates many methods of the standard library in a highly accurate manner, including the methods declared by Collection and its implementing classes, improving it to track the exact returned value whenever the entire contents of a collection are statically known was straightforward and came without any measurable performance hit. This improvement further reduced the false positive rate from about 17% to under 9%.
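
The following toy model (again purely illustrative, not our engine's internal representation) shows the element-precise view: one taint flag per list element, updated by add, remove, and get, so that the value returned by get(1) in Listing 4 is correctly recognized as untainted.

import java.util.ArrayList;
import java.util.List;

public final class ElementPreciseList {

    // One taint flag per element, mirroring the list's statically known contents.
    private final List<Boolean> elementTaint = new ArrayList<>();

    void add(boolean tainted)  { elementTaint.add(tainted); }
    void remove(int index)     { elementTaint.remove(index); }
    boolean get(int index)     { return elementTaint.get(index); }

    public static void main(String[] args) {
        ElementPreciseList valuesList = new ElementPreciseList();
        valuesList.add(false); // "safe"
        valuesList.add(true);  // param (attacker-controllable)
        valuesList.add(false); // "moresafe"

        valuesList.remove(0);  // drop "safe"; param is now at index 0

        // get(1) resolves to "moresafe", which is untainted, so no SQL injection
        // needs to be reported for Listing 4.
        System.out.println("tainted=" + valuesList.get(1)); // tainted=false
    }
}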

Questionable ciphers

All remaining false positives were in the category Weak Cryptography (see Table 1). Namely, all of the 116 non-vulnerable test cases in that category were flagged as vulnerable. In all of these test cases, data is encrypted using some encryption algorithm, as the example in Listing 5 shows.

javax.crypto.Cipher c = javax.crypto.Cipher.getInstance("DESEDE/ECB/PKCS5Padding");

// Prepare the cipher to encrypt
javax.crypto.SecretKey key = javax.crypto.KeyGenerator.getInstance("DESEDE").generateKey();
c.init(javax.crypto.Cipher.ENCRYPT_MODE, key);

Listing 5: (Alleged) false weak encryption vulnerability (test case #58)

In the example in Listing 5, the encryption algorithm used is DESede/ECB/PKCS5Padding. That is, data is encrypted using DESede (also known as Triple DES) as the block cipher, ECB as the mode of operation, and the simple padding scheme described in PKCS#5. In fact, each of the 116 allegedly non-vulnerable Weak Cryptography test cases in the OWASP Benchmark uses one of the following three algorithms:

  • DESede/ECB/PKCS5Padding
  • AES/CBC/PKCS5Padding
  • RSA/ECB/PKCS1Padding

In our opinion, none of these can, in good conscience, be considered truly secure. The ECB mode of operation is known to preserve structure in the plaintext and therefore does not hide all information adequately. Decryption in CBC mode may be vulnerable to so-called padding oracle attacks. Similarly, RSA with PKCS#1 v1.5 padding (note that the "ECB" in that string is only present for consistency and is actually ignored, since RSA is not a block cipher) is vulnerable to the Bleichenbacher attack. Consequently, raising an alarm when data is encrypted using these algorithms is, in our view, sensible: We want to alert our users to problems that may well exist.
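
For contrast, the sketch below shows what a far less questionable setup could look like: authenticated encryption with AES in GCM mode. This is an illustration of a safer alternative, not a proposed patch for the benchmark's test cases.

import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

public final class AuthenticatedEncryptionSketch {
    public static void main(String[] args) throws Exception {
        // AES key; 128 bits keeps the example runnable on any standard JRE.
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        SecretKey key = kg.generateKey();

        // 96-bit nonce, freshly generated and never reused with the same key.
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);

        // GCM provides confidentiality and integrity (128-bit authentication tag).
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));

        byte[] ciphertext = c.doFinal("some plaintext".getBytes(StandardCharsets.UTF_8));
        System.out.println(ciphertext.length + " bytes, including the authentication tag");
    }
}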

Clearly, it would be trivial to tell our engine that these algorithms are safe; we would then immediately achieve a TPR of 100% with an FPR of 0% (and in fact, we did exactly that in an internal test run, confirming that we do indeed reach 100% true positives at 0% false positives). Yet shipping such a change would have been plainly dishonest towards our users: Hiding actual vulnerabilities simply because the OWASP Benchmark does not consider them vulnerable is obviously a no-go.

Instead, we reported this problem to the developers of the OWASP Benchmark and explained our concerns. To our relief, they fully agreed that these test cases are indeed questionable and need to be updated to use truly secure encryption algorithms. We would like to take the opportunity to thank the OWASP Benchmark team for their quick reaction and professionalism, and for maintaining such an extensive suite. We are looking forward to an updated version of the OWASP Benchmark.

Conclusion

In this blog post, we have taken a close look at the OWASP Benchmark and reported on our results. To the best of our knowledge, RIPS is the first SAST tool to achieve a true positive rate of 100% and a false positive rate of 0% - a perfect score. This once again bolsters our belief that our language-specific approach to static analysis is both extremely thorough and highly precise.

Such results, however, should not be overrated. As we argued at the beginning and demonstrated by example throughout this blog post, it is possible to optimize tools to perform well on a particular benchmark. This is not a bad thing per se. On the contrary, a benchmark can be a valuable aid for the developers of a vulnerability detection tool when improving it. As we have shown, investigating the OWASP Benchmark led to several improvements to our tool that will undoubtedly improve our results on other applications as well. Yet the simple fact that a vulnerability detection tool performs well on a benchmark cannot be taken as the sole indicator of how good it is at detecting vulnerabilities in general, in arbitrary applications.

In addition, as we’ve seen, benchmarks may themselves contain test cases that are either of no substantial relevance for real-world applications or that are questionable in the first place. We have seen and reported similar cases in other test suites. To truly compare tools with one another, it is therefore advisable to run the tools on various real applications, possibly including, but certainly not limited to benchmarks.