Dan Blackwell

Detecting Information Leaks with a Fuzzer

NOTE: This essay explains many of the basic concepts underpinning LeakFuzzer ([source here] [paper here]) and, to a lesser extent, the underpinnings of NIFuzz ([source here]).

Fuzzers can find a lot of bugs - but on their own they are pretty dumb. Generally, they can only tell whether the program executed successfully, crashed, or exceeded a specified runtime limit (which we could call a ‘hang’ or ‘timeout’¹). The brilliant set of sanitizers² makes it possible for a fuzzer to detect issues such as out-of-bounds memory accesses, memory leaks, some uninitialised memory uses, and integer and floating-point overflows. Behind the scenes, these sanitizers insert additional checks into the program which cause it to crash if any of these issues are detected - though the implementations themselves can be quite complex.

All of the above sanitizers are capable of detecting bugs that can be demonstrated with a single program execution. That's reasonable, but how could we detect bugs that require multiple program executions in order to be demonstrated? Do such bugs even exist?

As it turns out, there are classes of bug that require multiple program executions to confirm their existence. The presence of a null pointer dereference is a property of a single program execution (either it dereferences a null pointer or it doesn't - and you'll certainly know if it does); properties that can only be expressed in sets of executions are known as hyperproperties.

Information Leaks

Much of my PhD work was on using fuzzers to detect information leaks - a class of error that requires multiple program executions to confirm - and these are the topic of this essay.

Firstly, let's define what we mean by an information leak; then we'll look at a couple of examples. To quote the CWE-200 definition: "The product exposes sensitive information to an actor that is not explicitly authorized to have access to that information." There are a number of ways that this can happen, but my work was concerned with detecting programming errors within software that cause the program to produce different outputs depending on its confidential inputs.

Let's start with a very simple example: a single function that takes a secret (confidential) input and returns a public (non-confidential) output. Throughout these examples we will treat the return value as the 'program' output - in practice this is really things like stdout, outgoing network requests, and log files.

int my_leaky_func(int secret_input) {
  if (secret_input > 0) {
    return 1;
  } else {
    return 0;
  }
}

Here we can see that a return value of 1 indicates that secret_input was > 0, and a return value of 0 indicates that it was <= 0. As a user of the system without confidential access, we can still see the return value of the function, since we stated that it is a public (non-confidential) output. The problem is that we can learn some information about the value of secret_input, and we should not be able to.

Hypertesting

Ok, so we know that information is being leaked about the value of secret_input, but how can we demonstrate it? The answer is relatively simple: we need to find two different inputs that result in different outputs (from our point of view as a non-confidential level user). For this we can choose the inputs {"secret_input": 0} and {"secret_input": 1} - which produce outputs 0 and 1 respectively.

As we require two different inputs to demonstrate the issue, the property must be a hyperproperty. In fact, the property we are looking for is the non-interference property. Translated to this context, the property is that the secret (confidential) input to the function does not interfere with the public (non-confidential) output. The pair of tests we found above demonstrate a violation of this property.

This pair of tests is called a hypertest.
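
For concreteness, a minimal sketch of this hypertest in code might look something like the following (purely illustrative, assuming the my_leaky_func definition above):

#include <assert.h>
#include <stdio.h>

int my_leaky_func(int secret_input);  // the function defined above

int main(void) {
  // Two executions that differ only in the secret input...
  int out_a = my_leaky_func(0);
  int out_b = my_leaky_func(1);

  // ...produce different public outputs (0 and 1), so this pair of tests
  // witnesses a violation of non-interference.
  printf("outputs: %d and %d\n", out_a, out_b);
  assert(out_a == out_b);  // fails, demonstrating the leak
  return 0;
}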

Security Policies

You might notice that we seemed to conjure this security policy - that secret_input is confidential, and the function return value is non-confidential - out of thin air. And you'd be right: there is nothing in the code that tells us this. It's not that people haven't tried building these rules into the type systems of programming languages [1,2,3]; it's just that the developer overhead is not worth it in most cases.

You've probably come across security policies organically in computer file systems, when attempting to open a file only to receive a warning that "you do not have permission to access this file". Another place where they are obvious is social media platforms, where users can restrict the visibility of profile information and posts (see here).

Automatically Detecting Information Leaks

So, basically every useful program ever written lacks information flow security in its source code, and most do not have a security policy written down anywhere. Still, it would be nice to be able to check whether programs conform to some security policy (ideally one written down somewhere, but more likely one that exists only in the developer's mind).

People have come up with a number of ways to do this automatically:

It is this last approach that I took in my work. I chose fuzzers as they work at a system level, but alternatively you could use a tool like EvoSuite which generates unit tests.

(ab)Using a Fuzzer to do Hypertesting

In grey-box fuzzers such as AFL and its ilk, the fuzzer produces an input that is a plain old byte-array. The 'libFuzzer' harness is generally seen as the default interface for implementing a fuzzer harness, and most fuzzers will support it. Here's a minimal example:

int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size) {
  myTestFunction(Data, Size); // run the functionality you want to test
  return 0;
}

For a program that takes a string as input (such as a parser), building such a harness is trivial. For a program that takes more complex inputs, you need to construct the required types from the byte-array. We can use this to construct a hypertest like so:

#include <assert.h>
#include <stddef.h>
#include <stdint.h>

int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size) {
  int secret_val, alt_secret_val;
  if (Size != sizeof(secret_val) + sizeof(alt_secret_val)) {
    // let's only use inputs if they're exactly the right size
    return 0;
  }

  // Read the variable values out from the byte-array Data
  secret_val = *(int *)Data;
  alt_secret_val = *(int *)(Data + sizeof(secret_val));

  // Check that the 'my_leaky_func' returns the same value for both
  assert(my_leaky_func(secret_val) == my_leaky_func(alt_secret_val));  

  return 0;
}

Public Inputs

So far we've just been looking at that one trivial example function (my_leaky_func) that takes a single secret input. In reality, we need to deal with systems that can also take public (non-confidential) input - e.g. Heartbleed was triggered by a packet that anyone could send.

Demonstrating the presence of an information leak with a hypertest in this case requires an extra constraint: the value of any public (non-confidential) input parts must be the same for both tests. Why? Because if the public input parts differ, the differing outputs could be due solely to that difference - in which case no information from the secret part is necessarily disclosed.

Modifying our harness from above:

#include <assert.h>
#include <stddef.h>
#include <stdint.h>

int my_other_leaky_func(int public_val, int secret_val) {
  if (public_val == 42) {
    return secret_val > 0;
  } else {
    return 0;
  }
}
          
int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size) {
  int public_val, secret_val, alt_secret_val;
  if (Size != 3 * sizeof(int)) {
    // let's only use inputs if they're exactly the right size
    return 0;
  }

  // Read the variable values out from the byte-array Data
  public_val = *(int *)Data;
  int pos = sizeof(public_val);
  secret_val = *(int *)(Data + pos);
  pos += sizeof(secret_val);
  alt_secret_val = *(int *)(Data + pos);

  // Check that the 'my_other_leaky_func' returns the same value for both
  assert(
    my_other_leaky_func(public_val, secret_val) == 
    my_other_leaky_func(public_val, alt_secret_val)
  );  

  return 0;
}

For the example above, we could trigger a crash through the assertion with {"public_val": 42, "secret_val": 0, "alt_secret_val": 1} which results in return values of 0 and 1 respectively. Notice that {"public_val": 0, "secret_val": ?, "alt_secret_val": ?} does not trigger the assertion as the 'else' branch of my_other_leaky_func is always taken - thus always returning 0.
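
As a rough illustration, a standalone driver along the following lines (a sketch of mine, not part of the harness above) packs those three values into a byte-array and feeds them to the harness, which should then abort on the assertion:

#include <stdint.h>
#include <string.h>

int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size);  // harness above

int main(void) {
  // Pack {public_val: 42, secret_val: 0, alt_secret_val: 1} into a
  // byte-array, in exactly the layout the harness expects.
  int values[3] = {42, 0, 1};
  uint8_t data[3 * sizeof(int)];
  memcpy(data, values, sizeof(data));

  // The differing return values (0 and 1) trip the assertion, aborting
  // the process - which is how the fuzzer 'sees' the leak.
  return LLVMFuzzerTestOneInput(data, sizeof(data));
}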

Caching

The issue with using a setup like the harness above is that you potentially need to test every pair of tests (with matching public part and differing secret parts). This means that if there are n different possible secret inputs then you need to run ≈n² tests to get them all. For the examples, with a 32-bit int as input, this means ≈(2³²)² or 2⁶⁴ executions - and that's just for each public input (and there are 2³² of those).

Really, we only care whether there are any pairs (with matching public input parts) that result in differing outputs. This means that we can get away with storing just a mapping from each distinct public input to its (public) output. A pseudocode (Python) implementation of a fuzzer doing this is as follows:

output_for_input = {}

# main fuzzing loop - runs indefinitely
while True:
  public_input, secret_input = get_input()
  public_output = my_other_leaky_func(public_input, secret_input)
  if public_input not in output_for_input:
    output_for_input[public_input] = public_output    
    continue

  if public_output != output_for_input[public_input]:
    print(f'Found a failing hypertest for {public_input}!')

With this we can find out that there is a failing hypertest (and thus an information leak), but we don't have enough information to reproduce it. For that, we need to also store the secret input that produced each output:

hypertest_map = {}
while True:
  public_input, secret_input = get_input()
  public_output = my_other_leaky_func(public_input, secret_input)
  if public_input not in hypertest_map:
    hypertest_map[public_input] = (secret_input, public_output)
    continue

  alt_secret_input, alt_public_output = hypertest_map[public_input]
  if public_output != alt_public_output:
    failing_hypertest = [
      {"public_input": public_input, "secret_input": secret_input},
      {"public_input": public_input, "secret_input": alt_secret_input}
    ]
    print(f'Found a failing hypertest: {failing_hypertest}')

With the aid of this dictionary, hypertest_map, we only need to execute each secret input once; taking us from O(n²) to O(n)!

Handling 100,000,000 inputs (and outputs)

There is clearly a computational gain to be made by caching the input-output data in this way. But one thing to bear in mind is the sheer number of inputs executed when fuzzing - it is not uncommon to see 10,000 or even over 100,000 executions per second (per CPU core). Then factor in that fuzzing campaigns run for hours to days. Doing the math, at the low end that's 10,000 * 3600 * 24 = 864,000,000 inputs executed over 24 hours. For our example, that potentially means ≈10GB of raw data: 4 bytes for each key (public input), and 8 bytes for the corresponding value (secret input and public_output).

Unfortunately, an awful lot of real-world programs take much larger inputs than that - and can produce even larger outputs. To save space, we can lean even further on the fact that we only care about failing hypertests. We can store only a hash of each input and output - then, if we find a differing output, we can store that test's secret input in full and wait until we see another differing output:

# We will now store a mapping from {public_input_hash:
# (secret_input_hash, public_output_hash, secret_input_full)}
hypertest_map = {}

while True:
  public_input, secret_input = get_input()
  public_output = my_other_leaky_func(public_input, secret_input)
  public_input_hash = hash(public_input)

  if public_input_hash not in hypertest_map:
    # Store only hashes for now; leave 'secret_input_full' as None to save memory
    hypertest_map[public_input_hash] = (hash(secret_input), hash(public_output), None)
    continue

  # We have seen this public input before, so fetch it
  existing = hypertest_map[public_input_hash]
  alt_sec_in_hash, alt_pub_out_hash, alt_sec_in_full = existing

  if hash(public_output) == alt_pub_out_hash:
    # We got the same output as the other one(s)
    continue

  if alt_sec_in_full is None:
    # This is a different output, but we didn't store a full 
    # secret input yet
    hypertest_map[public_input_hash] = (
      hash(secret_input), hash(public_output), secret_input
    )
    continue

  failing_hypertest = [
    {"public_input": public_input, "secret_input": secret_input},
    {"public_input": public_input, "secret_input": alt_sec_in_full}
  ]
  print(f'Found a failing hypertest: {failing_hypertest}')

This means that we are only storing hashes of the input-output data until a failing hypertest is witnessed. Once we know there's a failing hypertest to be found, we store the full secret input. Then, the next time we observe an output that differs from the stored one, we can output the complete pair of tests. As a result, regardless of input or output size, we can cache enough information to detect information leaks across a fuzzing campaign.

Caveats

For hypertesting to work, the program must be deterministic - if it is not, then we may observe differences in output that are due to non-determinism rather than information leakage. In my early work on detecting information leaks with fuzzers, I worked around this by rerunning each test in the failing hypertest 100 times. If the output varied at all, I discarded the pair rather than reporting the information leak; this results in false negatives.
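
A rough sketch of that filter (using the toy function from earlier in place of a whole program, and a hypothetical is_stable_leak helper) might look like this:

#include <stdbool.h>

int my_other_leaky_func(int public_val, int secret_val);  // defined earlier

// Rerun both tests repeatedly; report a leak only if each output is stable
// across all reruns and the two (stable) outputs differ.
bool is_stable_leak(int public_in, int secret_a, int secret_b) {
  int out_a = my_other_leaky_func(public_in, secret_a);
  int out_b = my_other_leaky_func(public_in, secret_b);
  for (int i = 0; i < 100; i++) {
    if (my_other_leaky_func(public_in, secret_a) != out_a) return false;
    if (my_other_leaky_func(public_in, secret_b) != out_b) return false;
  }
  return out_a != out_b;
}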

Summary

Information leaks are a violation of the non-interference property, which is a hyperproperty. We can show the presence of an information leak by finding a pair of tests, with matching public inputs and differing secret inputs, that produces differing outputs. Assuming that the program we are testing produces deterministic output, the difference between the two outputs must be due to information in the secret input.

We can construct a fuzzing harness that generates a single public input and two secret inputs, then runs the functionality twice and asserts that the output is the same. This requires testing each pair of secret inputs, giving us an O(n²) solution. We can get this down to O(n) by holding a mapping from public inputs to outputs, but given the number of executions in a typical fuzzing campaign, this can use too much memory for large inputs and outputs. Instead, we can store a hash of the necessary values, and only bother to store the full values when we know that a failing hypertest exists.

1 This could potentially be used in a denial-of-service attack.
2 A full list can be found here.