Reading Concuerror’s reports (Testing Poolboy, Part 2)

We continue our tutorial on Concuerror, explaining the errors it can detect and the options that you can use to filter out those that are not important for your application.

Index

Symbolic names
Abnormal exits
1. Ignoring “after” timeouts
2. Treating abnormal exit reasons as normal
A report without problems
1. Interleavings as a graph
2. Racing events
“Deadlocked” processes

Symbolic names

Concuerror makes an effort to replace every Erlang PID that appears in its report with a symbolic name. The first process spawned is labelled P and every other process is named after the process that spawned it, with an integer denoting the order in which it was spawned i.e. P’s first “child” is named P.1, P.1’s third child is named P.1.3 and so on.

If you prefer to see the raw PIDs you can use --symbolic_names false. Concuerror is reusing the same processes, so the results should be consistent across different interleavings.

Abnormal exits

Every time a process exits abnormally, Concuerror will mark the interleaving as erroneous. This means that any exception that escapes to the top-level will trigger a warning. In our example, processes P, P.1.1 and P.1.1.1 exited abnormally.

Erroneous interleaving 1:
* At step 52 process P exited abnormally
    Reason:
      {timeout,{gen_server,call,[P.1,stop]}}
    Stacktrace:
      [{gen_server,call,2,[{file,"gen_server.erl"},{line,182}]},
       {poolboy_tests_1,pool_startup,0,
                        [{file,"poolboy_tests_1.erl"},{line,8}]}]
* At step 76 process P.1.1.1 exited abnormally
    Reason:
      shutdown
    Stacktrace:
      []
* At step 85 process P.1.1 exited abnormally
    Reason:
      shutdown
    Stacktrace:
      [{proc_lib,exit_p,2,[{file,"proc_lib.erl"},{line,260}]}]

Ignoring “after” timeouts

If we take a look at the trace we can see that P triggered the standard timeout clause of a gen call:

  50: P: receive timeout expired after 5000 ms
    in gen.erl line 213
  [...]
  52: P: exits abnormally ({timeout,{gen_server,call,[P.1,stop]}})

As explained here, Concuerror by default assumes that any receive statement may trigger the after clause, unless it is impossible for a matching message not to have already arrived.

However, let’s assume that we don’t care about such timeouts. We can use the --after_timeout 1000 to treat any timeout higher than 1000ms as infinity. Notice that the one of the tips we got earlier suggests the same course of action:

Tip: A process crashed with reason '{timeout, ...}'. This may happen when a call
  to a gen_server (or similar) does not receive a reply within some standard
  timeout. Use the '--after_timeout' option to treat after clauses that exceed some
  threshold as 'impossible'.

Treating abnormal exit reasons as normal

The other two processes exited abnormally because they were terminated by events generated by the delivery of a stop message to a supervisor. The supervisor then sent shutdown exit signals to the relevant processes. Again, let’s assume that this is acceptable behaviour in our context, using --treat_as_normal shutdown (also suggested by a tip).

Tip: A process crashed with reason 'shutdown'. This may happen when a supervisor
  is terminating its children. You can use '--treat_as_normal shutdown' if this is
  expected behavior.

A report without problems

We now run Concuerror again, using the new options:

poolboy $ concuerror -f poolboy_tests_1.erl -m poolboy_tests_1 -t pool_startup \
  --pa .eunit --after_timeout 1000 --treat_as_normal shutdown

This time the output finishes in:

[...]
Warning: Some abnormal exit reasons were treated as normal ('--treat_as_normal').
Done! (Exit status: completed)
  Summary: 0 errors, ... interleavings explored

Concuerror finished the exploration, finding all the different states that the program can reach. None of these interleavings had any processes terminating abnormally with a reason different from shutdown, nor any other errors.

Interleavings as a graph

Concuerror can also produce a DOT file, including all the interleavings. This feature can be enabled with --graph <filename>. You can then use dot to convert the DOT file to an image.

Racing events

The report includes a section where each pair of events that are racing is reported, to justify each additional interleaving. Let’s take a look at the output:

[...]
################################################################################
Race Pairs:
--------------------------------------------------------------------------------
You can disable race pair messages with '--show_races false'
* Exit signal (normal) from P.1 reaches P.1.1
   is in race with
  Message ({'DOWN',#Ref<0.0.0.508>,process,P.1.1.1,shutdown}) from P.1.1.1 reaches P.1.1
* Message ({'DOWN',#Ref<0.0.0.508>,process,P.1.1.1,shutdown}) from P.1.1.1 reaches P.1.1
   is in race with
  Timer #Ref<0.0.0.522>: expires, delivering {timeout,#Ref<0.0.0.522>,kill} to P.1.1
* P.1: exits normally
   is in race with
  P: true = erlang:demonitor(#Ref<0.0.0.407>, [flush])
    in gen.erl line 215

We can see e.g. that the delivery of an exit signal from a linked process races with the delivery of an arbitrary message, (since P.1.1 is trapping exits), a similar race appears between a message from a timer and finally a process exiting races with the cancellation of a monitor, as the monitor message will be sent in one case but not in the other.

The same pair of events may be reported multiple times, but this should happen only if we are exploring the reverse interleaving of a different race.

If you see a pair of events that you believe should not be racing, feel free to report the discrepancy! Concuerror’s race detection accuracy is the most sensitive component regarding the total number of interleavings explored!

“Deadlocked” processes

Since we called poolboy:stop/1, all processes were properly stopped. What would happen if we had let a process lingering? We already have such a test, pool_startup_no_stop/0 (we need to use poolboy:start/1 instead of start_link, as the normal exit signal from the initial process also causes a shutdown).

Running this test, yields:

poolboy $ concuerror -f poolboy_tests_1.erl -m poolboy_tests_1 --pa .eunit \
 -t pool_startup_no_stop
Concuerror started at 03 Jun 2014 12:00:00
Writing results in concuerror_report.txt
[...]
Done! (Exit status: warning)
  Summary: 1 errors, 1/1 interleavings explored

The output file now contains the following description:

Erroneous interleaving 1:
* Blocked at a 'receive' ("deadlocked"; other processes have exited):
    P.1 in gen_server.erl line 360
    P.1.1 in gen_server.erl line 360
    P.1.1.1 in gen_server.erl line 360

Concuerror reports any processes that have not exited, when no other processes are available to run. Such processes are blocked at a receive statement and are considered “deadlocked”. These warnings can be turned off by --ignore_error deadlock.

Concuerror will always reset the test to the initial state, so leaving some processes running may be a better alternative to suffering from all the races between exit signals that appear at termination.

Categories