The Whys and Hows of Cimba, Explained

In this section, we will explain the background for Cimba and some of the key design choices that are made in it. We start with a brief history that is necessary background for the project goals.

Project history and goals

Cimba 3.0.0 was released on GitHub as public beta in late 2025 but, as the version number indicates, there is some history before this first public release.

The earliest ideas that eventually became Cimba date to work done at the Norwegian Defence Research Establishment in the late 1980s, towards the end of the Cold War. I built, maintained, and ran discrete event simulation models in languages like Simscript and Simula67. Encountering Simula67’s coroutines and object-orientation was revelatory in its essential rightness. However, Simula67 was still severely limited in many other respects and not really a practical option at that time. Our simulation models were quite large, up to several hundred thousand lines of code. The main language used was Simscript II.5.

Around 1990, I started building discrete event simulation models in C++ as an early adopter of that language. The first C++ models ran on VAXstations where spawning a coroutine is a single assembly instruction. Trying to port that code to a Windows PC was a somewhat painful experience (and a complete failure). I actually complained to Bjarne Stroustrup in person about the inconsistent to non-existent support for Simula-like coroutines in C++ at a conference in Helsingør, Denmark, probably in 1991. He seemed to agree but I silently resolved to build my next simulation model in pure K&R C.

That opportunity arose at MIT around 1994, where I needed a discrete event simulation model for cross-checking analytical models of manufacturing systems. For perhaps obvious reasons, this was a clean sheet design with no code carried forward from the earlier C++ work at NDRE. It had a collection of standard random number generators and distributions, and used a linked list for its main event queue. It did the job but could be improved. In retrospect, I consider this Cimba version 1.0.

For my PhD thesis research at the MIT Operations Research Center, I needed to run many simulations with various parameter combinations and replications. By then, I had realized that parallelizing a discrete event simulation model is trivially simple if one looks at it with a telescope instead of trying to use a microscope. The individual replications are meant to be independent identically distributed trials, implying that there is no interaction between them at runtime. One can just fork off as many replications in parallel as one has computing resources for, and use one computing node to dole out the jobs and collect the statistics.

This was lashed together at the operating system process level using rsh and a Perl script to control the individual simulations on a cluster of Sun Sparcstations across MIT campus, and, just to make a point, on a IBM Pentium PC running Linux and one Sparcstation back at the Norwegian Defence Research Establishment for an architecture independent and trans-Atlantic distributed simulation model. At the same time, the core discrete event simulation engine was rewritten in ANSI C with a binary heap event queue. It needed to be very efficient and have a very small memory footprint to run at low priority in the background on occupied workstations without anyone noticing a performance drop. This was pretty good for 1995, and can safely be considered Cimba version 2.0, but it was never released to the public.

After that, not much happened to it, until I decided to dust it off and publish it as open source many years later. The world had evolved quite a bit in the meantime, so the code required another comprehensive re-write to exploit the computing power in modern CPU’s, this time coded in C17 and assembly with POSIX pthreads concurrency. This is the present Cimba 3.0.

The goals for Cimba 3.0 are quite similar to those for earlier versions:

  • Speed and efficiency, where small size in memory translates to execution speed on modern CPUs with cached memory pipelines, and where multithreading on CPU cores provides the parallelism.

  • Portability, running both on Linux and Windows, initially limited to the AMD64 / x86-64 architecture and GCC-like compilers, with more architectures planned.

  • Expressive power, combining process- and event-oriented simulation worldviews with a comprehensive collection of state-of-the-art pseudo-random number generators and distributions.

  • Robustness, using object-oriented design principles and extensive unit testing to ensure that it works as expected (but do read the Licence, we are not making any warranties here).

I believe that Cimba 3.0 meets these goals. I hope you will agree.

Coroutines revisited

It is well known that the Simula programming language introduced object-oriented programming. For those of us that were lucky enough (or just plain old enough) to actually have programmed in Simula67, the object-orientation with classes and inheritance was only part of the experience, and perhaps not the most important part.

The most powerful concept for simulation work was the coroutine. This generalizes the concept of a subroutine to several parallel threads of execution that co-exist and are non-preemptively scheduled. Combined with object-orientation, it means that one can describe a class of objects as independent threads of execution, often infinite loops, where the object’s code just does its own thing. The objects become active agents in their own world. The complexity in the simulation arises from the interactions between the active processes and various passive objects, while the description of each entity’s actions is very natural.

Coroutines received significant academic interest in the early years, but were then overshadowed by the object-oriented inheritance mechanisms. It seems that current trends are turning away from the more complex inheritance mechanisms, in many cases using composition instead of (multiple) inheritance, and also reviving the interest in coroutines. One fairly recent article is https://www.cs.tufts.edu/~nr/cs257/archive/roberto-ierusalimschy/revisiting-coroutines.pdf

Unfortunately, when C++ finally got “coroutines” as a part of the language in 2020, these turned out to be both less powerful and less efficient than expected. (See https://probablydance.com/2021/10/31/c-coroutines-do-not-spark-joy/ for the details.) For our purposes here, it is sufficient to say that these coroutines are not the coroutines we are looking for.

In Cimba, we have some additional requirements to the coroutines beyond being full-fledged coroutines, i.e., stackful first class objects. Our coroutines need to be thread-safe, since we will combine these with multithreading at the next higher level of concurrency. The Cimba coroutines will interact within each thread, but not across threads.

We also want our coroutines to share information both through pointer arguments to the context-switching functions yield(), resume(), and transfer(), and by the values returned by these functions. Control is passed out of a coroutine by calling one of these functions. Control is also passed back to the coroutine by what appears to be a normal return from this function call. The two ways of communicating between coroutines are then sharing a pointer to some mutable context data structure as an argument, and returning a signal value that can be used to determine if something unexpected happened during the call. We will use both, since these give very different semantics.

Moreover, we want our coroutines to return an exit value if and when they terminate, and we want the flexibility of either just returning this exit value from the coroutine function or by calling a special exit() function with an argument. These should be equivalent, and the exit value should be persistent after the coroutine execution ends.

And, of course, we want our coroutines to be extremely efficient. Calling malloc() and free() or memcpy()’ing large amounts of data in each context switch is a definite no go. We want the minimal context switches needed to guarantee correctness, only saving the necessary registers and swapping out the stack pointer, nothing more.

We are not aware of any open source coroutine implementation that exactly meets these requirements, so Cimba contains its own, built from the ground up in C and assembly. This gives us a powerful set of thread-safe stackful coroutines, fulfilling all requirements to “full” coroutines, and in addition providing general mechanisms for communication between coroutines. The Cimba coroutines can both be used as symmetric or as asymmetric coroutines, or even as a mix of those paradigms by mixing asymmetric yield/resume pairs with symmetric transfers. (Debugging such a program may become rather confusing, though.)

Cimba processes are asymmetric coroutines

Our basic coroutines are a bit too general and powerful for simulation modeling. We use these as internal building blocks for the Cimba processes. These are essentially named asymmetric coroutines, inheriting all properties and methods from the coroutine class, adding a name, a priority for scheduling processes, and pointers to things it may be waiting for, to resources it may be holding, and to other processes that may be waiting for it. As asymmetric coroutines, the Cimba processes always transfer control to a single dispatcher process running on the main program stack, and are always re-activated from the dispatcher process only.

The processes understand the simulation time and may hold() for a certain amount of simulated time. Underneath this call are the asymmetric coroutine primitives of cmi_coroutine_yield() (to the simulation dispatcher) and cmi_coroutine_resume() (when the simulation clock has advanced by this amount). Processes can also acquire() and release() resources, wait for some other process to finish, interrupt or stop other processes, wait for some specific event, or even wait for some arbitrarily complex condition to become true.

Since the asymmetric coroutine process concept is fundamental to how Cimba works, we will now look closely at what happens during a context switch between processes.

Suppose we are running the M/M/1 simulation used to benchmark against SimPy, benchmark/MM1_single.c. We are running on a single CPU core. The queue is currently empty, the arrival process is holding, the service process has just woken up from its hold(), looped around, and is now about to get() an object from the queue in line 79 of the code.

The illustration below shows the stacks at this point:

_images/stack_1.png

The service process to the right (green) has the CPU and is executing user code (red text).

The main system stack is to the left. The dispatcher has executed the wakeup event that resumed the service process. It has stored its registers on the stack and transferred control to the service process. The main stack pointer is at the last register pushed to the stack, but is itself safely stored to memory instead of in the SP register.

The arrival process is holding. That call caused a context switch, so its stack pointer is at the last register that was pushed to its stack. The difference from the dispatcher on the main stack is that it is on its own stack without touching the main stack at all.

Now the service process is in its get() call and finds the queue empty. It has to wait, so it registers itself with the resource guard and yields. At that moment, the stacks look like this:

_images/stack_2.png

The arrival process has saved its registers to the stack and its stack pointer to the appropriate member of our struct cmi_coroutine. Control transfers to the dispatcher on the main stack by loading its stack pointer from memory to the register, and then popping the remaining register values from the main stack.

_images/stack_3.png

The stack rapidly returns up to the dispatcher loop in cmb_event_queue_execute(), which now pulls off and executes the next event from the event queue.

_images/stack_4.png

That happens to be another hold wakeup call. When executed, that event in turn resumes the target process, this time the arrival process. The asymmetric coroutine yield()/resume() pair is implemented by symmetric transfer() calls, which in turn triggers the context switch.

At this point, returned from one event and executing the next, the stacks look the same as in the previous illustration, just with different data values in the registers and stack variables.

_images/stack_5.png

Control then passes to the arrival process. Its stack pointer is loaded from memory into the appropriate register.

_images/stack_6.png

It pops the other saved register values from the stack and returns from the context call, which in turn returns back to the user code immediately after the hold() call in line 59 of benchmark/MM1_single.c. At this point, the context switch from the service to the arrival process by way of the dispatcher is complete, the arrival process is executing user code, and the stacks look like this:

_images/stack_7.png

As should be evident from these examples, Cimba does not care about what level of the function call stack its context switching functions get called from. It can be directly from the process function (like in this example), or the process function can call another function, that calls some other function, which in turn calls a Cimba function like cmb_process_hold(). The intermediate user functions in the call chain do not even need to know that they are part of a discrete event simulation, much less care about what stack they are executing on. They are just C functions that get called and do their thing. There is no need to refactor the whole call chain to yield generators at multiple levels if you change something deep in the call chain. This is why we insisted on proper stackful coroutines.

One thing to be aware of: Recent generations of x86-64 CPUs have strong protection against stack-smashing exploits. This protects against a common cybersec attack where a stack-allocated buffer is overwritten and the attacker can change the return pointer, enabling arbitrary code execution. This protection is called Control-flow Enforcement Technology (CET). Cimba intentionally changes the stack and return pointers, and informs the CPU that this is to be expected. Linking Cimba will turn the Intel CET (shadow stack + IBT) off for the host process on x86-64. For this reason, it would be a potential security risk to combine a Cimba simulation with code that reads unchecked user input into fixed-size stack buffers, since one layer of protection against possible exploits is disabled.

We will soon return to Cimba’s processes and their interactions, but if the reader has been paying attention, there is something else we need to address first: We just said “inheriting all properties and methods from the coroutine class”, but we also just said “C17” and “assembly”.

Object-oriented programming. In C and assembly.

Object-oriented programming is a way of structuring program design, not necessarily a language feature. It uses concepts like encapsulation, inheritance, polymorphism, and abstraction to create a natural description of the objects in the program.

  • Encapsulation bundles the properties and methods of objects into a compact description, usually called a class. The individual instances, objects that belong to the same class, have the same structure but may have different data values.

  • Inheritance is the relationship between classes where a child class is derived from another parent class. Objects belonging to the child class also belong to the parent class and inherit all properties and methods from there. The child class adds its own properties and methods, and may overload (change) the meaning of parent class methods.

  • Polymorphism allows the program to deal with parent classes and have each child class fill in the details of what should be done. The canonical example is a class shape with a draw() method, where the different child classes circle, rectangle, triangle, etc, all contain their own draw() method. The program can then have a list of shapes, ask each one to draw() itself, and have the different subtypes of shapes do the appropriate thing.

  • Abstraction selectively exposes important features to surrounding code and hides implementation details. This clarifies what an object does, while shielding the complexity of how it is done.

This can also be implemented in a language like C, without explicit support from programming language features. The key observation is that the first member of a C struct is guaranteed to have the same address in memory as the struct itself. We can then use structs as classes, encapsulating the properties of our “class” as struct members, and implement inheritance by making the parent struct the first member of a derived “class”. A pointer to a child class object is then also a pointer to a parent class object, exactly as we want.

Polymorphism can be implemented by having pointers to functions as struct members, and then place the appropriate function there when initializing the object. If necessary, this can be extended by having a dedicated vtable object for the class to avoid multiple copies of the function pointers in each class member object, at the cost of one more redirection per function call.

The encapsulation and abstraction is then a matter of disciplined modularity in structuring the code, with the code and header files as a main building block. Careful use of static and extern functions and variables provide the equivalents of private, public, and friend class properties and methods. The header file encapsulate and expose what a module or class does, while the corresponding C code file contains the implementation details.

Even if this is the most natural way of describing the entities in our simulated world, there are other things there that might be less natural to describe as classes and objects. In particular, we do not consider the pseudo-random number generators objects in this sense. They just exist in the simulated world and are called as functions without the complexities of creating an object-oriented framework around them. A clear, modular structure to encapsulate and protect internal workings is still needed.

Cimba functions and variables follow a naming convention of <namespace>_<module>_<function>. There are three namespaces:

  • cimba - functions and objects at the outer level, organizing and executing your multithreaded simulation experiment as a series of trials. This is outside the simulated world. Example: cimba_version()

  • cmb - functions and objects in the simulated world. These are the building blocks of your simulation. You will probably build a single-threaded version only using functions from this namespace first, and parallelize it later when you need the computing power. Example: cmb_random_uniform()

  • cmi - internal functions and objects that for some reason need to be exposed globally, but that your simulation model does not need to interact with. Example: cmi_coroutine_create()

Static functions and variables internal to each module do not use the <namespace>_<module>_ prefix since they do not have global scope. One example is the intermediate function wakeup_event_time visible in the stack images above. It is an internal static function in cmb_process.c. Hence, its name is not prefixed with its namespace cmb and module process since it has file scope only.

There is one notable exception to this naming convention: The function cmb_time(), which returns the current simulation clock value. It is declared and defined in the cmb_event module, and should perhaps be called cmb_event_time() according to our rule, but since it is a global state in the simulated world and not related to any particular event, it is more intuitive to make this one exception for it.

As part of our convention, the object methods will take a first argument that is a pointer to the object itself. This corresponds to the implicit this or self pointer in object-oriented languages. Again, there are a few exceptions: Some functions that are only called by the current process and act on itself do not have this argument. It is enough to say cmb_process_hold(5.0), not cmb_process_hold(me, 5.0). Similarly, calling cmb_process_exit(NULL) is enough, calling cmb_process_exit(me, NULL) would be slightly strange. We believe this exception makes the code more intuitive, even if it is not entirely consistent.

One can claim that this approach to object-oriented programming provides most of the benefits while minimizing the overhead and constraints from a typical object-oriented programming language. However, there are some features we cannot provide directly:

  • Multiple inheritance, where a class is derived from more than one parent class. Our parent classes need to go first in each child class struct, and then there can only be one parent. This is no big loss, since multiple inheritance quickly becomes very confusing. We instead distinguish between is a relationships (single inheritance) and has a relationships (composition). For example, our cmb_resource is a cmi_holdable (an abstract parent class), but it has a cmb_resourceguard to maintain an orderly priority queue of processes waiting for the resource, and the cmb_resourceguard itself is a cmi_hashheap.

  • Automatic initialization and destruction for objects that go in and out of scope. In C++, Resource Allocation Is Initialization (RAII). In C, it is not. (RAINI?) This requires us to distinguish clearly between creating, initializing, terminating, and destroying an object. The create/destroy pair handles raw memory allocation. For objects declared as local variables or implicitly as parent class objects, these are not called. The initialize/terminate pair makes the allocated memory space ready for use as an actual object and handles any necessary clean-up (such as deallocating internal arrays allocated during the object’s lifecycle). In some cases, there is also a reset function, in effect a terminate followed by a new initialize, returning the object to a newly initialized state.

When defining your own classes derived from Cimba classes, such as the visitor class in our third tutorial, your code has the responsibility to follow this pattern. Your allocator function (e.g., visitor_create()) allocates raw memory, while the constructor function (visitor_initialize()) fills it with meaningful values. The constructor does not get called automatically, so your code is also responsible for calling it, both for objects allocated on the heap, objects declared as local variables, and for objects that exist as a parent class to one of your objects. The last case is done by calling the parent class constructor, here cmb_process_initialize(), from within the child class constructor function.

Similarly, your code needs to provide a destructor to free any memory allocated by the object (visitor_terminate()), and a deallocator to free the object itself (visitor_destroy()). Your destructor function should also call the parent class destructor (here cmb_process_terminate()), but your de-allocator should NOT call the parent class de-allocator, since that would be free’ing the same memory twice and probably crash your application.

By looking around in the Cimba code, you will find many examples of how we have used the object-orientation. For example, a cmb_resourceguard does not actually care or know what type of resource it guards, only that it is something derived from the cmi_resourcebase abstract base class. Or, a process may be holding some resource, but may not really care what kind, only that it is some kind of cmi_holdable, itself derived from cmi_resourcebase. If the process needs to drop the resource in a hurry, there is a polymorphic function (really just a pointer to the appropriate function) for how to handle that for a particular kind of resource.

Events and the event queue

The most fundamental property of a discrete event simulation is that state only changes at the event times. The basic algorithm is to maintain a priority queue of scheduled events, retrieve the first one, set the simulation clock to its reactivation time, execute the event, and repeat. The time increments between events will vary.

An event may schedule, cancel, or re-prioritize other events, and in general change the state of the model in arbitrary and application-defined ways. This is why parallelizing a single model run is near impossible: The dispatcher cannot know what event to execute next or what state the next event will encounter before the current event is finished executing.

Cimba maintains a single thread local event queue and simulation clock. These are global to the simulated world, but local to each trial thread. Two simulations running in parallel on separate CPU cores exist in the same shared memory space, but do not interact or influence each other. They are parallel universes.

We define an event as a triple consisting of a pointer to a function that takes two pointers to void as arguments and does not return any value, and the two pointers to void that will become its arguments. The function is an action, the two arguments its subject and object. The event is then executing the one-liner (*action)(subject, object);

Our process interactions are also events. For example, a process calling cmb_process_hold(5.0) actually schedules a wakeup event for itself at the current simulation time + 5.0 before it yields to the dispatcher. At some point, that event has bubbled up to the top of the priority queue and gets executed. Similarly, when a cmb_resourceguard wakes up a waiting process to inform it that “congratulations, you now have the resource”, it schedules an event at the current time that actually resumes the process. This avoids long and complicated call stacks.

This also happens to be the reason why our events need to be (at least) a triple: The event to reactivate some process needs to contain the reactivation function, a pointer to the process, and a pointer to the void* message argument for its resume() call, (*resume)(me, msg).

Events are instantaneous in simulated time and always execute on the main stack directly from the dispatcher. If an event function itself tries to call cmb_process_hold(), it will try to put the event dispatcher itself to sleep. This is not a good idea.

Events do not have a return value. There is nowhere to return the value to. There is nowhere to store a return value for later use either. An event function has signature void foo(void*, void*) while a process function has signature void *bar(void*, void*) since it can and often will return something.

An event is not even an object. It is ephemeral; once it has occurred, it is gone. You cannot take a pointer to an event. You can schedule an event as a triple (action, subject, object) to occur at a certain point in time with a certain priority, and you can cancel a scheduled event, reschedule it, or change its priority by referring to its handle, but it is still not an object. It only exists inside the event queue until it occurs. In computer sciency terms, it could be considered a closure

The event queue also provides wildcard functions to search for, count, or cancel entries that match some combination of (action, subject, object). For this purpose, special values CMB_ANY_ACTION, CMB_ANY_SUBJECT, and CMB_ANY_OBJECT are defined. As an example, suppose we are building a large-scale simulation model of an air war. When some plane in the simulation gets shot down, all its scheduled future events should be cancelled. In Cimba, this can be done by a simple call like cmb_event_pattern_cancel(CMB_ANY_ACTION, my_airplane, CMB_ANY_OBJECT);

The hash-heap - a binary heap meets a hash map

Since the basic discrete event simulation algorithm is all about inserting and retrieving events in a priority queue, the efficiency of this data structure and the algorithms acting on it becomes a key determinant of the overall application efficiency.

Cimba uses a hash-heap data structure for this. It consists of two interconnected parts:

  • The heap is a binary tree represented as a partially sorted array. The next event to be executed is always in array position 1. Retrieving it will trigger a reshuffle of the heap, making insertion and retrieval O(log n) operations. A scheduled event is just a value in this array, which may be moved to another location in the array at any time. However, cancelling a future event is a O(n) operation, since the array needs to be searched from the beginning to find the event.

  • The hash complements the heap with an open addressing hash map. When an event is first scheduled, it is assigned a unique identifier, a handle. The hash map keeps track of where in the heap that event is at any time. Accessing the event using its handle is then an O(1) operation, while canceling it and reshuffling the heap is O(log n). For large models with many scheduled events, this may be a useful speed improvement. We use Fibonacci hashing and open addressing for simplicity and efficiency, see https://probablydance.com/2018/06/16/fibonacci-hashing-the-optimization-that-the-world-forgot-or-a-better-alternative-to-integer-modulo/

Once we have this module tightly packaged, it can be reused for other purposes than just the main event queue. We use the same data structure for all priority queues of processes waiting for some resource, since our cmb_resourceguard is a derived class from cmi_hashheap. It is also used for the cmb_priorityqueue class of arbitrary objects passing from some producer to some consumer process. In our fourth tutorial, the LNG harbor simulation, we even used an instance of it at the modeling level to maintain the set of active ships in the model.

Each entry in the hashheap array provides space for four 64-bit payload items, together with the hash key, a double, and a signed 64-bit integer for use as prioritization keys. The cmi_hashheap struct also has a pointer to an application-provided comparator function that determines the ordering between two entries. For the main event priority queue, this is based on reactivation time, priority, and as a last resort, the key (handle) value as a tie-breaker, while the waiting list for a resource will use the process priority and the handle value as ordering keys. If no comparator function is provided, the hashheap will use a default comparator that only uses the double key and retrieves the smallest value first, intended as a simple FIFO rule.

For efficiency reasons, the hash table needs to be sized as a power of two. It will start small and grow as needed. This way, the entire structure will fit well inside a 2K CPU L1 cache until the application requires it to outgrow the cache. We do not want to penalize the performance of small simulation models for the ability to run very large ones.

Resources, resource guards, demands and conditions

Many simulations involve active processes competing for some scarce resource. Cimba provides five resource classes and one very general condition variable class. Two of the resource classes are holdable with acquire/release semantics, where cmb_resource is a simple binary semaphore that only one process can hold at a time and the cmb_resourcepool is a counting semaphore where several processes can hold some amount of the resource. The other three resource types have put/get semantics, where the cmb_buffer only considers the number of units that goes in and out, while the cmb_objectqueue and cmb_priorityqueue allow individual pointers to objects.

The common theme for all these is that a process requests something and may have to wait in an orderly priority queue for its turn if that something is not immediately available. Our hashheap is a good starting point for building this. For this purpose, we derive a class cmb_resourceguard from the cmi_hashheap, adding a pointer to some resource (the abstract base class) to be guarded, and a list of any observing resource guards.

When a process requests some resource and finds it busy, it enqueues itself in the hashheap priority queue. It also registers its demand function, a predicate function that takes three arguments: A pointer to the guarded resource, a pointer to the process itself, and a void* pointer to some application-defined context. Using some combination of this information, the demand function returns a boolean true or false answer to whether the demand is satisfied. The demand function is pre-packaged for the four resource types. For example, a process requesting a simple cmb_resource demands that it is not already in use, while a process requesting to put some amount into a cmb_buffer demands that there is free space in the buffer. After registering itself, the process yields control to the dispatcher. All of this happens inside the respective acquire(), get(), or put() call, invisible to the calling process.

When some other process is done with the resource and releases it, it will signal the cmb_resourceguard. This signal causes the resource guard to evaluate the demand function for its highest-priority waiting process. If satisfied, that process is removed from the wait list, gets reactivated, and can grab the resource. The resource guard then forwards the signal to any other resource guards that have registered themselves as observers of this one, causing these to do the same on their wait lists.

The condition variable cmb_condition is essentially a named resource guard with a user application defined demand function. The condition demand function can be anything that is computable from the given arguments and other state of the model at that point in simulated time. It can be used for arbitrarily complex “wait for any one of many” or “wait for all of many” scenarios where the cmb_condition will register itself as observer to the underlying resource guards, and, as shown in our second tutorial, it can include continuous-valued state variables.

We believe that the open-ended flexibility of our demand predicate function, pre-packaged for the common resource types and exposed for the cmb_condition, makes Cimba a very powerful and expressive simulation tool.

There may also be a weak pun here somewhere on the C++ promise keyword: Cimba processes do not promise. They demand.

Error handling: The loud crashing noise

Cimba error handling is intentionally draconian. It will not try to “handle” errors gently, but will make a loud crashing noise instead.

To understand why, think about the worst case scenario for a discrete event simulation model: Producing incorrect results. The model can not handle errors in “helpful” ways that risks introducing biases. The consequences of that could range from embarrassing (e.g., during your Ph.D. thesis defense) to catastrophic (e.g., military decision making). Also, the model will run unattended in a multithreaded environment with no direct user interface. Requesting user clarification is not an option. It is far better that the simulation stops at any sign of trouble and requires you to fix the model code or its input before trying again, than to do something that could turn out to be subtly wrong.

As a contrasting opposite, consider a music player application. If some sample is missing, the app should interpolate rather than make an audible dropout. If the network is slow, it should reduce the bit rate and degrade sound quality rather than stopping and restarting as the buffer empties and refills. This is not the kind of business Cimba is in. Like the proverbial samurai, it needs to return victorious or not at all.

Our approach is known as “offensive programming”. This is closely related to the Design by Contract paradigm, where code expresses clear assertions about the expected preconditions, invariants, and postconditions during function execution. If one of those assertions is proven invalid, execution stops right there with a diagnostic message. The assertions then become self-enforcing code documentation, since whatever condition it asserts to be true must be true for execution to proceed past that point.

The second important observation is that tracing the flow of execution in a large discrete event simulation model can become mind-bogglingly complex. We have two levels of concurrency within the same memory space: Multithreading and stackful coroutines. Your debugger will probably be very confused. Finding out what happened if the error has had time to propagate elsewhere in your code before something crashes will be near impossible. We need to catch it as close to the source as possible. Error messages need to give additional useful information, not just about what went wrong and where in the code it went wrong, but also what process, thread, random number seed, and so on, to replicate, locate, and fix the issue.

Cimba provides its own assert() macros. Tripping a Cimba assert will give a crash report like this:

9359.5      Service cmb_process_hold (272):  Fatal: Assert "dur >= 0.0" failed, source file cmb_process.c, seed 0x9bec8a16f0aa802a

Process finished with exit code 134 (interrupted by signal 6:SIGABRT)

It shows the simulation time, process, function, line number, the actual condition that failed, the program code file, and finally the random number seed used to initialize the trial. If running multi-threaded, it will also include the trial number. You now know both where to look and how to reproduce the issue if you want a closer look.

If you are using a debugger, we encourage you to put a permanent breakpoint in cmi_assert_failed(). Assuming that you are compiling in debug mode, you will then be able to page up the stack and identify the circumstances that caused the error, see the example in our first tutorial.

Our asserts come in two flavors: the cmb_assert_debug() and cmb_assert_release(). There is also a cmb_assert() macro, but it is just a shorthand for cmb_assert_debug().

Debug asserts are used at the development stage to ensure that everything is working as expected, even if the code to check it is time-consuming. Inside Cimba, you will find asserts that call dedicated predicate functions to validate whether the coroutine stacks are valid, if the event queue heap condition is satisfied, and so forth. Like the standard C assert() macro, the debug asserts vanish from the code if the preprocessor symbol NDEBUG is defined. Disabling the debug asserts will approximately double the execution speed of your model.

Release asserts enforce preconditions, the things that need to be true for some function to work correctly. These remain in the code even with -DNDEBUG, since they express the contracts towards surrounding code such as valid ranges for input values. These are typically simple and fast statements. If you are absolutely certain that your model is working correctly and that all your inputs are valid, you can squeeze out another slight speed improvement (about 10 %) by defining the preprocessor symbol NASSERT and making these vanish as well.

As an illustration, consider the function cmb_random_uniform():

static inline double cmb_random_uniform(const double min, const double max)
{
    cmb_assert_release(min < max);

    const double r = min + (max - min) * cmb_random();
    cmb_assert_debug((r >= min) && (r <= max));

    return r;
}

The function generates a pseudo-random uniform variate on the interval [min, max). We use those argument names instead of, say, [a, b) to make the expectation clear. We then enforce it with a release assert. If min is not strictly less than max, we stop right there. Alternatively, we could be “helpful” and generate samples for intervals with reversed limits, but it is more likely than not that both a zero-width interval and an interval where min > max indicates an input or model code error. Cimba’s way of being helpful is to make its loud crashing noise to draw your attention to fixing the error.

The debug assert validates that the result is within the advertised range. It tests for internal problems in Cimba and can be turned off after sufficient unit testing. After that, it mainly serves as trustworthy documentation: This statement is true, has been tested millions of times in unit testing, and you can easily verify it for yourself.

It is clear what valid inputs and outputs are for the function above, even without a single comment in the code. We are not about to prove total correctness in the strict C.A.R. Hoare sense, but the function shown above does constitute a logical Hoare triple.

There is also an cmb_assert_always() that remains also if NASSERT is defined. This is for use in test programs that need to demonstrate correctness independent of the compilation options. Internally in Cimba, this is only used in the cmi_memutils.h wrappers for malloc() and his friends to thest for out-of-memory conditions. These function calls are slow anyway, and the consequences of an invalid pointer could be hard to trace down, so they will be stopped immediately with very little performance cost.

For empirical data on the relationship between assertions and code quality, see, e.g., https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-2006-54.pdf or https://www.cs.ucdavis.edu/~filkov/papers/assert-main.pdf

Logging flags and bit masks

As explained in the first tutorial, the key concept for the logger is the logger flags; a bit mask given as an argument to a logger call, and a current bit field. Both are 32-bit unsigned integers, type uint32_t. If a simple bitwise and (&) between the logger’s bit field and the caller’s bit mask gives a non-zero result, that line is printed, otherwise it is not. Initially, all bits in the logger bit field are on, 0xFFFFFFFF. You can turn selected bits on and off with cmb_logger_flags_on() and cmb_logger_flags_off(). The bit field is thread local, so any bit twiddles will only affect the current thread.

The top four bits are reserved for Cimba use, defined as CMB_LOGGER_FATAL, CMB_LOGGER_ERROR, CMB_LOGGER_WARNING, and CMB_LOGGER_INFO, respectively. These correspond to logger functions (actually macro wrappers) cmb_logger_fatal(), cmb_logger_error(), cmb_logger_warning(), and cmb_logger_info(). These are fprintf()-style functions.

The difference between cmb_logger_fatal() and cmb_logger_error() is that _fatal() will call abort() to terminate your entire program, while _error() calls pthread_exit(NULL) to stop the current thread. cmb_logger_fatal() should be used whenever there is a chance of memory corruption that could affect other threads, while cmb_logger_error() can be used if a single trial for some reason is unsuccessful and needs to bail out without providing a result. To make this work, you will need to initialize the trial result fields in your experiment array to some out-of-band value, and ensure that any remaining out-of-band values at the end of the experiment are not included in the result calculation.

The cmb_logger_info() level is used internally in Cimba to give a basic view of what is going on. It can look like this:

[ambonvik@Threadripper cimba]$ build/benchmark/MM1_single | more
    0.0000  dispatcher      cmb_event_queue_execute (294):  Starting simulation run
    0.0000  Arrival cmb_process_hold (278):  Holding for 0.786458 time units
    0.0000  Arrival cmb_process_timer_add (343):  Scheduled timeout event at 0.786458
    0.0000  Service cmb_objectqueue_get (214):  Gets an object from Queue, length now 0
    0.0000  Service cmb_objectqueue_get (246):  Waiting for an object
    0.0000  Service cmb_resourceguard_wait (149):  Waits for Queue
   0.78646  dispatcher      wakeup_event_time (310):  Wakes Arrival signal 0
   0.78646  Arrival cmb_objectqueue_put (271):  Puts object 0x555ae5e1c000 into Queue, length 0
   0.78646  Arrival cmb_objectqueue_put (293):  Success, put 0x555ae5e1c000
   0.78646  Arrival cmb_resourceguard_signal (219):  Scheduling wakeup event for Service
   0.78646  Arrival cmb_process_hold (278):  Holding for 0.237395 time units
   0.78646  Arrival cmb_process_timer_add (343):  Scheduled timeout event at 1.023854
   0.78646  dispatcher      wakeup_event_resource (173):  Wakes Service signal 0
   0.78646  Service cmb_objectqueue_get (251):  Trying again
   0.78646  Service cmb_objectqueue_get (234):  Success, got 0x555ae5e1c000
   0.78646  Service cmb_process_hold (278):  Holding for 1.321970 time units
   0.78646  Service cmb_process_timer_add (343):  Scheduled timeout event at 2.108428
    1.0239  dispatcher      wakeup_event_time (310):  Wakes Arrival signal 0
    1.0239  Arrival cmb_objectqueue_put (271):  Puts object 0x555ae5e1c008 into Queue, length 0
    1.0239  Arrival cmb_objectqueue_put (293):  Success, put 0x555ae5e1c008
    1.0239  Arrival cmb_process_hold (278):  Holding for 3.063936 time units
    1.0239  Arrival cmb_process_timer_add (343):  Scheduled timeout event at 4.087789
    2.1084  dispatcher      wakeup_event_time (310):  Wakes Service signal 0
    2.1084  Service cmb_objectqueue_get (214):  Gets an object from Queue, length now 1
    2.1084  Service cmb_objectqueue_get (234):  Success, got 0x555ae5e1c008
    2.1084  Service cmb_process_hold (278):  Holding for 0.576082 time units
    2.1084  Service cmb_process_timer_add (343):  Scheduled timeout event at 2.684510
    2.6845  dispatcher      wakeup_event_time (310):  Wakes Service signal 0
    2.6845  Service cmb_objectqueue_get (214):  Gets an object from Queue, length now 0
    2.6845  Service cmb_objectqueue_get (246):  Waiting for an object
    2.6845  Service cmb_resourceguard_wait (149):  Waits for Queue
    4.0878  dispatcher      wakeup_event_time (310):  Wakes Arrival signal 0
    4.0878  Arrival cmb_objectqueue_put (271):  Puts object 0x555ae5e1c008 into Queue, length 0
    4.0878  Arrival cmb_objectqueue_put (293):  Success, put 0x555ae5e1c008
    4.0878  Arrival cmb_resourceguard_signal (219):  Scheduling wakeup event for Service
    4.0878  Arrival cmb_process_hold (278):  Holding for 1.313836 time units
    4.0878  Arrival cmb_process_timer_add (343):  Scheduled timeout event at 5.401625
    4.0878  dispatcher      wakeup_event_resource (173):  Wakes Service signal 0
    4.0878  Service cmb_objectqueue_get (251):  Trying again
    4.0878  Service cmb_objectqueue_get (234):  Success, got 0x555ae5e1c008
    4.0878  Service cmb_process_hold (278):  Holding for 2.084093 time units
    4.0878  Service cmb_process_timer_add (343):  Scheduled timeout event at 6.171883
    5.4016  dispatcher      wakeup_event_time (310):  Wakes Arrival signal 0
    5.4016  Arrival cmb_objectqueue_put (271):  Puts object 0x555ae5e1c000 into Queue, length 0
    5.4016  Arrival cmb_objectqueue_put (293):  Success, put 0x555ae5e1c000
    5.4016  Arrival cmb_process_hold (278):  Holding for 1.702620 time units
    5.4016  Arrival cmb_process_timer_add (343):  Scheduled timeout event at 7.104245

If you compare this to the stack illustrations in the preceding section on Cimba processes as coroutines, this gives a pretty good view of what is happening on the stacks. However, it will not be needed very often and can be turned off with

cmb_logger_flags_off(CMB_LOGGER_INFO);

It will still be in the code if turned off, requiring one 32-bit comparison per call, but you can make it vanish completely (like the asserts) by defining the preprocessor symbol NLOGINFO. That will turn the cmb_logger_info() wrapper macro into a no-op statement, eliminating it from the code. For fast production code, already well tested, you may want to compile Cimba with compiler flags -DNDEBUG -DNASSERT -DNLOGINFO. See the top level meson.build and uncomment those options before compiling and (re)installing Cimba.

As shown in the tutorial, the user application can define up to 28 different logger flags for fine-grained control of what logging messages to print. Remember that these are bit masks, so your logging flags should be 0x00000001, 0x00000002, 0x00000004, 0x00000008, 0x00000010, and so on for single bits turned off and on.

Pseudo-random number generators and distributions

Cimba has a few specific requirements to its pseudo-random number generators as well. For any discrete event simulation framework, the PRNG’s need to be fast and have high statistical quality. In addition, we need them to be thread-safe, since it must be possible to reproduce the exact sequence of random numbers in a trial when given the same seed. We cannot have the outcome depend on other trials that may or may not be running in parallel. This is not very difficult to do, but it needs to be considered from the beginning, since the obvious way to code a PRNG is to keep its state in static variables between calls.

The PRNG in Cimba is an implementation of Chris Doty-Humphrey’s sfc64. It provides 64-bit output and maintains a 256-bit state. It is certain to have a cycle period of at least 2^64, and is both faster and higher statistical quality than better-known generators such as the Mersenne Twister. It is in public domain, see https://pracrand.sourceforge.net for the details. In our implementation, the PRNG state is thread local, giving each trial its own stream of random numbers, independent from any other trials.

We initialize the PRNG in a three-stage bootstrapping process:

  • First, a truly random 64-bit seed can be obtained from a suitable hardware source of entropy by calling cmb_random_hwseed(). It will query the CPU for its best source of randomness. On the x86-64 architecture, the preferred source is the RDSEED instruction that is available on Intel CPUs since 2014 and AMD CPUs since 2016. This instruction uses thermal noise from the CPU itself to create a 64-bit random value. If not available, we will negotiate alternatives with the CPU and return the best entropy that is available, if necessary by doing a mashup of the clock value, thread identifier, and CPU cycle counter.

  • Second, the seed is used to initialize the PRNG by calling cmb_random_initialize(). It needs to create 256 bits of state from a 64-bit seed. We use a dedicated 64-bit state PRNG for this. We initialize it with the 64-bit seed, and then draw four samples from it to initialize the state of our main PRNG. This auxiliary PRNG is splitmix64, also public domain, see https://rosettacode.org/wiki/Pseudo-random_numbers/Splitmix64#C

  • Finally, cmb_random_initialize() draws and discards 20 samples from the main PRNG to make sure that any initial transient is gone before returning and allowing cmb_random_sfc64() to provide pseudo-random numbers to the user application.

The result is a pseudo-random number sequence that cannot be distinguished from true randomness by any available statistical methods. In particular, successive values appear to be totally uncorrelated, so it is not necessary or recommended to use multiple streams of pseudo-random numbers in the same trial. It would do more harm than good. For this reason, the PRNG is not implemented as an object in the simulated world, where various entities can carry around their own sources of randomness, but more like a property of the simulated world. It just is. The simulated entities can obtain sample values from it according to whatever distribution is needed.

The basic cmb_random_sfc64() returns an unsigned 64-bit bit pattern from the PRNG. This is a bit spartan for most purposes. The function cmb_random() instead returns the sample as a double between zero and one. The first unit test in test/test_random.c checks the output of cmb_random() against its expected values:

Quality testing basic random number generator cmb_random(), uniform on [0,1)
Drawing 100000000 samples...

Expected: N 100000000  Mean   0.5000  StdDev   0.2887  Variance  0.08333  Skewness    0.000  Kurtosis   -1.200
Actual:   N 100000000  Mean   0.5000  StdDev   0.2887  Variance  0.08333  Skewness -1.537e-05  Kurtosis   -1.200
--------------------------------------------------------------------------------
( -Infinity,  8.569e-09)   |
[ 8.569e-09,    0.05000)   |#################################################=
[   0.05000,     0.1000)   |#################################################=
[    0.1000,     0.1500)   |#################################################=
[    0.1500,     0.2000)   |#################################################=
[    0.2000,     0.2500)   |#################################################=
[    0.2500,     0.3000)   |#################################################=
[    0.3000,     0.3500)   |#################################################=
[    0.3500,     0.4000)   |#################################################=
[    0.4000,     0.4500)   |#################################################=
[    0.4500,     0.5000)   |#################################################=
[    0.5000,     0.5500)   |#################################################=
[    0.5500,     0.6000)   |#################################################=
[    0.6000,     0.6500)   |#################################################=
[    0.6500,     0.7000)   |#################################################=
[    0.7000,     0.7500)   |#################################################=
[    0.7500,     0.8000)   |#################################################=
[    0.8000,     0.8500)   |#################################################=
[    0.8500,     0.9000)   |#################################################=
[    0.9000,     0.9500)   |##################################################
[    0.9500,      1.000)   |#################################################=
[     1.000,  Infinity )   |-
--------------------------------------------------------------------------------

Autocorrelation factors (expected 0.0):
           -1.0                              0.0                              1.0
--------------------------------------------------------------------------------
   1  -0.000                                 -|
   2  -0.000                                 -|
   3   0.000                                  |-
   4   0.000                                  |-
   5   0.000                                  |-
   6   0.000                                  |-
   7   0.000                                  |-
   8  -0.000                                 -|
   9  -0.000                                 -|
  10  -0.000                                 -|
  11  -0.000                                 -|
  12  -0.000                                 -|
  13   0.000                                  |-
  14  -0.000                                 -|
  15  -0.000                                 -|
--------------------------------------------------------------------------------

Partial autocorrelation factors (expected 0.0):
           -1.0                              0.0                              1.0
--------------------------------------------------------------------------------
   1  -0.000                                 -|
   2  -0.000                                 -|
   3   0.000                                  |-
   4   0.000                                  |-
   5   0.000                                  |-
   6   0.000                                  |-
   7   0.000                                  |-
   8  -0.000                                 -|
   9  -0.000                                 -|
  10  -0.000                                 -|
  11  -0.000                                 -|
  12  -0.000                                 -|
  13   0.000                                  |-
  14  -0.000                                 -|
  15  -0.000                                 -|
--------------------------------------------------------------------------------

Raw moment:   Expected:   Actual:   Error:
--------------------------------------------------------------------------------
    1             0.5     0.50002    0.004 %
    2         0.33333     0.33335    0.006 %
    3            0.25     0.25002    0.008 %
    4             0.2     0.20002    0.010 %
    5         0.16667     0.16669    0.012 %
    6         0.14286     0.14288    0.013 %
    7           0.125     0.12502    0.015 %
    8         0.11111     0.11113    0.016 %
    9             0.1     0.10002    0.018 %
   10        0.090909    0.090926    0.019 %
   11        0.083333    0.083349    0.019 %
   12        0.076923    0.076938    0.020 %
   13        0.071429    0.071443    0.020 %
   14        0.066667     0.06668    0.021 %
   15          0.0625    0.062513    0.021 %
--------------------------------------------------------------------------------

Another way to check the quality is to generate a million successive (x, y) pairs from cmb_random() and plot them. The human eye is pretty good at detecting correlations and patterns, sometimes even where no pattern exists, so this is a quite sensitive (but informal) test:

_images/crossplot_random.png

The various pseudo-random number distributions build on this generator, shaping its output to match the required probability density functions. The algorithms used are selected for speed and accuracy. Please run the unit test for verification of the individual distributions versus expected values.

The exponential and normal distributions are implementations of Chris McFarland’s improved Ziggurat algorithms, see https://github.com/cd-mcfarland/fast_prng or https://arxiv.org/pdf/1403.6870

The gamma distribution uses an algorithm due to Marsaglia and Tsang. It uses a similar rejection sampling approach as the Ziggurat algorithm, but with a continuous function instead of the stepped rectangles of the Ziggurat. See https://dl.acm.org/doi/10.1145/358407.358414

Many other distributions are built on top of these, as sums, products, or ratios of samples. For example, the infamous Cauchy distribution is simply the ratio of two normal variates, suitably scaled and shifted.

Cimba also provides a collection of discrete-valued distributions, starting from the simple unbiased coin flip in cmb_random_flip(). It also provides Bernoulli trials (biased coin flips, if such a thing exists), fair and loaded dice, geometric, Poisson and Pascal distributions, and so forth.

In some cases, a model needs to sample based on some empirical, discrete-valued set of probabilities. The probabilities can be given as an array p[n], where p[i] is the probability of outcome i, for 0 <= i < n. A clever algorithm for this is the Vose alias method, see https://www.keithschwarz.com/darts-dice-coins/

The alias method requires an initial step of setting up an alias table, but provides fast O(1) sampling thereafter. This is worthwhile for n larger than seven or so, and about three times faster than the basic O(n) method for n = 30. In Cimba, the alias table is created by calling cmb_random_alias_create(), sampled with cmb_random_alias_sample(), and destroyed with cmb_random_alias_destroy(). (In this case, we have bundled the allocation and initialization steps into a single _create() function, and the termination and deallocation steps into the _destroy() function.)

Data sets and summaries

As we saw in the previous section, Cimba provides a simple set of statistics utilities for debugging and simple reporting. The most basic class is the cmb_dataset, simply a conveniently resizing array of sample values. It provides methods that require the individual values, such as calculating the median, quartiles, autocorrelation factors, and printing a histogram. It does not directly support fundamental statistics like mean and variance, though.

These are provided through a separate class, cmb_datasummary, that computes running tallies for the first four moments in a single-pass algorithm whenever new data points are added to the summary, using the methods described by Pébay (https://www.osti.gov/servlets/purl/1028931) and Meng (https://arxiv.org/pdf/1510.04923).

The reason for this is that we do not always need to keep all individual sample values, so we do not want to take the memory penalty of storing them if a simple summary is enough. In those cases, just adding the successive samples to a cmb_datasummary is more efficient. If we need both, collect the samples in a cmb_dataset and use the function cmb_dataset_summarize() to calculate a data summary object from the complete data set.

The basic dataset is extended to a time series by the cmb_timeseries class. It adds a second double to make each sample an (x, t) pair. In addition, it calculates a third value w (for weight) that represents the time interval between one sample and the next.

Recall that in a discrete event simulation, nothing happens except at the event times, which can have arbitrary time intervals between them. We may need to handle time series, e.g., the length of some queue, that may be zero for a long time and have short bursts of several non-zero values, but perhaps with zero or near zero durations. If we want to calculate sensible statistics from this time series, the sample values need to be weighted by the duration they had. The cmb_timeseries calculates these weights on the fly.

Similarly to the simple data set and data summary, there is a related cmb_wtdsummary class for calculating duration-weighted statistics from a time series, or, in cases where the complete time series data is not needed, directly in one pass by simply adding sample values to the cmb_wtdsummary along the way. The algorithm is described by Pébay and others in a separate publication, see https://link.springer.com/article/10.1007/s00180-015-0637-z

If one for some reason needs to calculate unweighted statistics from a time series, simply use the cmb_dataset_summarize() function instead. As a derived class, the cmb_timeseries is still also a cmb_dataset, and cmb_dataset_summarize() will treat it as such.

The single pass calculations of variance and higher moments are non-trivial. The obvious implementation is numerically unstable. We strongly recommend using the Cimba cmb_summary and cmb_wtdsummary to do this robustly whenever anything more than a simple sum and average is needed.

And, of course, if more statistical power is needed, use the cmb_dataset_print() and cmb_timeseries_print() functions to write the raw data values to file, and use dedicated software such as R or Gnuplot to analyze and present the data.

Experiments consist of multi-threaded trials

We now return to the primary objective for Cimba: To provide multi-threaded discrete event simulation that harnesses the power of modern multi-core CPU’s. As explained in the sections about coroutines and pseudo-random number generators above, Cimba is designed from the ground up to enable multithreaded execution. Once that is in place, the actual implementation is straightforward. We use POSIX pthreads as the concurrency model, see https://en.wikipedia.org/wiki/Pthreads

The basic idea is to structure your experiment as an array of trials. Your simulated world is constant across trials, except for the trial parameters. Cimba creates a small army of worker threads, one per logical CPU core, and lets them help themselves to the experiment array. Each thread will pick the next available trial, execute the simulation trial function with the given parameters, store the results, and repeat. When there are no more trials to run, the worker processes stop and the main thread takes over again for final output and cleanup.

This gives us near perfect load balancing for free, without the main thread having to do anything. The worker threads keep themselves busy. The alternative approach would be to launch each trial in its own thread and close it again before launching the next trial from the main scheduler. This would also work, but with some more overhead.

The downside to our approach is that new trials will be launched into a thread memory space where an earlier trial used the event queue and the random generators. If the user application is not scrupulous in initializing and terminating objects, it may encounter garbage values with somewhat unpredictable results.

We define the trial function as a function that takes a void * as argument and does not return any value, typedef void (cimba_trial_func)(void *trial_struct). Note that the trial_struct is just a variable name for a void *, not a type name. Cimba cannot know in advance the structure of your trial struct, and makes no assumptions on it.

Our tutorial/tut_1_7.c can serve as an example for how the experiment array can be structured:

struct simulation {
    struct cmb_process *arr;
    struct cmb_buffer *que;
    struct cmb_process *srv;
};

struct trial {
    /* Parameters */
    double arr_rate;
    double srv_rate;
    double warmup_time;
    double duration;
    /* Results */
    uint64_t seed_used;
    double avg_queue_length;
};

struct context {
    struct simulation *sim;
    struct trial *trl;
};

This is user application code, so you can call these structs what you want. Our use of struct simulation here is a slight tip o’ the hat toward old Simula’s CLASS SIMULATION. Note also that we chose to let the trial function get its own seed from hardware and store it as a result value rather than feed it a pre-packaged seed for the trial.

The trial function, here run_MM1_trial(), expects to receive a pointer to a struct trial as argument (disguised as a void *), sets up, executes the simulation, and cleans up afterwards.

The user application allocates the experiment array, e.g.:

struct trial *experiment = calloc(n_trials, sizeof(*experiment));

and initializes it with the various trial parameters, with the necessary variations and replications. Once it has loaded parameters into each trial of the experiment array, it calls cimba_run_experiment() and waits for the results.

cimba_run_experiment(experiment, n_trials, sizeof(*experiment), run_MM1_trial);

Once this function returns, each trial in the experiment array will have its results fields populated. The application can then parse the outcome and present the results in any way it wants.

Two less obvious features to be aware of, perhaps also less useful, but still:

  • You can use different trial functions per trial. If the trial function argument to cimba_run_experiment() is NULL, it will instead take the first 64 bits of your trial struct as a pointer to the trial function to be used for this particular trial. You could have different trial functions for every trial if you want. Of course, if the trial function argument to cimba_run_experiment() is NULL and the first 64 bits of your trial structure do not contain a valid function address, your program will promptly crash with a segmentation fault.

  • cimba_run_experiment() does not even care if the trial functions it executes is a simulation or something else. You could feed it some array of parameter values and a pointer to some function that does something with whatever those parameter values indicate. It is just a wrapper to a worker pool of pthreads and will happily multithread whatever task you ask it to. The main design challenge with Cimba was to construct a discrete event simulation engine that can be run in the multithreaded wrapper, not constructing the wrapper as such.

Benchmarking Cimba against SimPy

The most relevant comparison for Cimba is probably the Python package SimPy (https://pypi.org/project/simpy/), since it provides similar functionality to Cimba, only with Python as its base language instead of C. SimPy emphasizes ease of use as a main design objective, following the overall Python philosophy, while Cimba (being a C library) has a natural emphasis on speed. SimPy processes are based on stackless generators, whereas Cimba has its stackful coroutine processes. The current SimPy version is 4.1.1.

For comparable multithreading functionality, SimPy needs to be combined with the Python multiprocessing package. In the following, when referring to SimPy in a multithreading context, this should be understood as SimPy + multiprocessing.

In the benchmark directory, you will find examples of the same simulated scenario implemented in SimPy and Cimba. A complete multithreaded M/M/1 queue simulation could look like this in SimPy, benchmark/MM1_multi.py

import simpy
import random
import multiprocessing
import statistics
import math

NUM_OBJECTS = 1000000
ARRIVAL_RATE = 0.9
SERVICE_RATE = 1.0
NUM_TRIALS = 100

def arrival_process(env, store, n_limit, arrival_rate):
    for _ in range(n_limit):
        t = random.expovariate(arrival_rate)
        yield env.timeout(t)
        yield store.put(env.now)

def service_process(env, store, service_rate, stats):
    while True:
        arr_time = yield store.get()
        t = random.expovariate(service_rate)
        yield env.timeout(t)
        stats['sum_tsys'] += env.now - arr_time
        stats['obj_cnt'] += 1

def run_trial(args):
    n_objects, arr_rate, svc_rate = args
    env = simpy.Environment()
    store = simpy.Store(env)
    stats = {'sum_tsys': 0.0, 'obj_cnt': 0}
    env.process(arrival_process(env, store, n_objects, arr_rate))
    env.process(service_process(env, store, svc_rate, stats))
    env.run()

    return stats['sum_tsys'] / stats['obj_cnt']

def main():
    num_cores = multiprocessing.cpu_count()
    trl_args = [(NUM_OBJECTS, ARRIVAL_RATE, SERVICE_RATE)] * NUM_TRIALS
    with multiprocessing.Pool(processes=num_cores) as pool:
        results = pool.map(run_trial, trl_args)

    valid_results = [r for r in results if r > 0]
    n = len(valid_results)
    if (n > 1):
        mean_tsys = statistics.mean(valid_results)
        sdev_tsys = statistics.stdev(valid_results)
        serr_tsys = sdev_tsys / math.sqrt(n)
        ci_w = 1.96 * serr_tsys
        ci_l = mean_tsys - ci_w
        ci_u = mean_tsys + ci_w

        print(f"Average time in system: {mean_tsys} (n {n}, conf int {ci_l} - "
              f"{ci_u}, expected: "
              f"{1.0 / (SERVICE_RATE - ARRIVAL_RATE)})")

if __name__ == "__main__":
    main()

The same model would look like this in Cimba, benchmark/MM1_multi.c:

#include <inttypes.h>
#include <stdio.h>
#include <stdint.h>

#include <cimba.h>

#define NUM_OBJECTS 1000000u
#define ARRIVAL_RATE 0.9
#define SERVICE_RATE 1.0

CMB_THREAD_LOCAL struct cmi_mempool objectpool = CMI_MEMPOOL_STATIC_INIT(sizeof(void *), 512u);

struct simulation {
    struct cmb_process *arrival;
    struct cmb_process *service;
    struct cmb_objectqueue *queue;
};

struct trial {
    double arr_mean;
    double srv_mean;
    uint64_t obj_cnt;
    double sum_wait;
    double avg_wait;
};

struct context {
    struct simulation *sim;
    struct trial *trl;
};

void *arrivalfunc(struct cmb_process *me, void *vctx)
{
    cmb_unused(me);
    const struct context *ctx = vctx;
    struct cmb_objectqueue *qp = ctx->sim->queue;
    const double mean_hld = ctx->trl->arr_mean;
    for (uint64_t ui = 0; ui < NUM_OBJECTS; ui++) {
        const double t_hld = cmb_random_exponential(mean_hld);
        cmb_process_hold(t_hld);
        void *object = cmi_mempool_alloc(&objectpool);
        double *dblp = object;
        *dblp = cmb_time();
        cmb_objectqueue_put(qp, object);
    }

    return NULL;
}

void *servicefunc(struct cmb_process *me, void *vctx)
{
    cmb_unused(me);
    const struct context *ctx = vctx;
    struct cmb_objectqueue *qp = ctx->sim->queue;
    const double mean_srv = ctx->trl->srv_mean;
    uint64_t *cnt = &(ctx->trl->obj_cnt);
    double *sum = &(ctx->trl->sum_wait);
    while (true) {
        void *object = NULL;
        cmb_objectqueue_get(qp, &object);
        const double *dblp = object;
        const double t_srv = cmb_random_exponential(mean_srv);
        cmb_process_hold(t_srv);
        *sum += cmb_time() - *dblp;
        *cnt += 1u;
        cmi_mempool_free(&objectpool, object);
    }
}

void run_trial(void *vtrl)
{
    struct trial *trl = vtrl;

    cmb_logger_flags_off(CMB_LOGGER_INFO);
    cmb_random_initialize(cmb_random_hwseed());
    cmb_event_queue_initialize(0.0);
    struct context *ctx = malloc(sizeof(*ctx));
    ctx->trl = trl;
    struct simulation *sim = malloc(sizeof(*sim));
    ctx->sim = sim;

    sim->queue = cmb_objectqueue_create();
    cmb_objectqueue_initialize(sim->queue, "Queue", CMB_UNLIMITED);

    sim->arrival = cmb_process_create();
    cmb_process_initialize(sim->arrival, "Arrival", arrivalfunc, ctx, 0);
    cmb_process_start(sim->arrival);
    sim->service = cmb_process_create();
    cmb_process_initialize(sim->service, "Service", servicefunc, ctx, 0);
    cmb_process_start(sim->service);

    cmb_event_queue_execute();

    cmb_process_stop(sim->service, NULL);
    cmb_process_terminate(sim->arrival);
    cmb_process_terminate(sim->service);
    cmb_process_destroy(sim->arrival);
    cmb_process_destroy(sim->service);

    cmb_objectqueue_destroy(sim->queue);
    cmb_event_queue_terminate();
    free(sim);
    free(ctx);
}

int main(void)
{
    struct trial *trl = malloc(sizeof(*trl));
    trl->arr_mean = 1.0 / ARRIVAL_RATE;
    trl->srv_mean = 1.0 / SERVICE_RATE;
    trl->obj_cnt = 0u;
    trl->sum_wait = 0.0;
    run_trial(trl);

    printf("Average system time %f (expected %f)\n",
            trl->sum_wait / (double)trl->obj_cnt,
            1.0 / (SERVICE_RATE - ARRIVAL_RATE));

    return 0;
}

The Cimba code is significantly longer for this simple example, 139 vs 60 lines. The C base language demands more careful declarations of object types, whereas Python will happily try to infer types from context. C requires explicit management of object creation and destruction, since it does not have Python’s automatic garbage collection. This requires some additional code lines.

For a larger model, where the process function could be large, complex, and call various other functions, the SimPy code could become much harder to follow. We leave it as an exercise for the interested reader to build our entertainment park tutorial or LNG harbor tutorial in SimPy for a similar benchmarking.

Moreover, Cimba’s stackful coroutines allow calls to context-switching functions (like cmb_process_hold() or cmb_resource_acquire()) from arbitrary deep within function call hierarchies. Control will leave the call stack where it is and pick up again from the same point when control is passed back into that cmb_process. This makes for a very natural way to express agentic behavior by simulated processes. The stackless Python generator that SimPy processes are based on cannot do this.

We compile the Cimba code for speed, using gcc v 15.2.1 with options -O3 -flto=auto -fuse-linker-plugin -DNDEBUG -DNASSERT -DLOGINFO as appropriate for already well-tested code (i.e., everything uncommented in the top level meson.build). For SimPy, we use Python v 3.14.2. The host system is built around a ASRock TRX40 Taichi motherboard with an AMD Threadripper 3970x CPU and 128 GB RAM. The OS is Arch Linux.

Both programs produce a one-liner output similar to this:

Average system time 10.000026 (n 100, conf.int. 9.964877 - 10.035176, expected 10.000000)

However, the Cimba experiment can run its 100 trials in 0.56 seconds, while the SimPy version takes 25.5 seconds to do the same thing with all available cores in use. Cimba runs this scenario about 45 times faster than SimPy. Equivalently, the Cimba running time is 97.8 % less than SimPy’s for this simple model.

Cimba even processes twice as many simulated events per second on a single core (approx 32 million events / second) than what SimPy can do if it has all 64 logical cores to itself (approx 16 million events / second).

_images/Speed_test_AMD_3970x.png

Our (admittedly biased) view is that SimPy is good for simple one-off simulations where learning curve and development time are the critical constraints, while Cimba is better for larger, more complex, and more long-lived models where software engineering, maintainability, and efficiency become important. This aligns with our project goals.

How about the name ‘Cimba’?

Very simple. This is a simulation library in C, the author’s initials are ‘AMB’, ‘simba’ means ‘lion’ in Swahili, and real lions are among the very few predators that eat python snakes for breakfast. Of course it had to be named Cimba.

If in doubt, read the source code

Please read the source code if something seems unclear. It is written to be readable for humans, not just for the compiler. It is well commented and contains plentiful assert() statements that mercilessly enforce whatever condition they state to be true at that point.

The code is strongly influenced by the Design by Contract paradigm, where asserts are used to document and enforce preconditions, invariants, and postconditions for every function. It should provide significant clarity on exactly what can be expected from any Cimba function.

If something still seems mysterious, you may have uncovered a bug, and we also consider unclear or incorrect documentation to constitute a bug. Please open an issue for it in the GitHub issue tracker at https://github.com/ambonvik/cimba/issues