Timing pieces of code

Execution time measurements

Python's timeit and repeat

I sometimes want to time my code to see which version of a function I wrote runs faster. I know that you should always profile before you optimize and that micro-benchmarking has lots of caveats (some of which I have myself fallen victim to), but still, sometimes, curiosity gets the best of me and I just have to know. When I am writing in Python, I use the timeit module to measure code execution. This handy module has two main utility functions: timeit and repeat.

timeit runs a piece of code n times and returns the total time, in seconds, taken by those n executions; dividing by n gives the average time per execution. Of course, when the code runs for real, other things will want CPU time alongside your program and the OS will have to balance it all, but usually, that should not be taken into account when doing micro-benchmarks.

repeat again runs a piece of code n times, but repeats that experiment r times, returning a list with the total execution time of each repetition of the experiment. The lowest value in that list is arguably the time needed to run the code when nothing else on the system is interfering, so it is usually the number to look at. The following snippet shows how to use both of them in code.

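from timeit import timeit, repeat

n = 1000
r = 2
setup = "def my_fn( x ): return x**2"  # runs once, outside the timed statement
stmt = "my_fn( 4 )"                    # the statement that is actually timed

# repeat returns a list of r total times, each covering n executions
times_repeat = repeat( stmt, setup = setup, repeat = r, number = n )

# the same experiment built from r separate calls to timeit
times_timeit = []
for idx in range( r ):
    times_timeit.append( timeit( stmt, setup = setup, number = n ) )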

There are two other ways to use the timeit module: from the command line with the executable module syntax (python -m timeit), and from an interactive Python shell like IPython, which has the magic function %timeit. I won't cover those subjects here.

Trying it out in C++: time_it

The thing is, I don't always work in Python, and I wanted a function similar to timeit that would let me measure the execution time of a function or piece of code in C++. It turns out that with C++14 (I don't know if C++11 is sufficient, but I think C++17/C++1z will make it easier), it is possible to get close to it. When I got the idea, I started to think about std::invoke, which has been accepted into C++17. This function template takes a callable and its arguments and invokes it, which is pretty close to what I need. Coupling the facilities of std::chrono with that function gets me most of the way there. Because some compilers do not yet have an implementation of std::invoke, I first had to find one or write one. I decided to use the implementation found on cppreference.com. But let's pretend that std::invoke is available. Then, one can write the function template time_it [1] using std::invoke and std::chrono. This is the resulting function for those who do not want to read the steps I went through:

template<
   unsigned int repeat_invoke = 1000,
   unsigned int repeat_meas = 3,
   typename F, typename... ArgTypes
        >
std::array< double, repeat_meas >
time_it( F&& f, ArgTypes&&... args )
{
   using std::chrono::high_resolution_clock;
   using std::chrono::duration_cast;
   using std::numeric_limits;
   using std::chrono::nanoseconds;

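   // adjacent string literals are concatenated; a backslash continuation
   // would pull the next line's indentation into the message
   static_assert( repeat_invoke >= 1,
            "time_it: The number of times the code is run has to "
            "be at least 1." );
   static_assert( repeat_meas >= 1,
            "time_it: The number of times the code experiment is "
            "repeated has to be at least 1." );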

   double max_dbl = numeric_limits<double>::max();
   double cur_min = max_dbl;
   std::array< double, repeat_meas > shortest_times;
   for( unsigned int meas_idx( 0 ); meas_idx != repeat_meas; ++meas_idx )
   {
      for( unsigned int invoke_idx( 0 ); invoke_idx != repeat_invoke; ++invoke_idx )
      {
         auto start = high_resolution_clock::now();

         std::invoke( std::forward<F>(f), std::forward<ArgTypes>(args)... );

         auto elapsed = duration_cast< nanoseconds >(
                        high_resolution_clock::now() - start ).count();

         if( elapsed < cur_min ) { cur_min = elapsed; }
      }
      shortest_times[meas_idx] = cur_min;
      cur_min = max_dbl;
   }

   return shortest_times;
}

I skipped the includes in this code snippet. I am not pretending this code is exception safe, thread safe or any other kind of safe there is. I am not suggesting anybody else should use it. It's simply something I find useful.

How I got there

Basically, aside from the boilerplate, the function takes another function and its parameters as input and outputs the minimum time taken to execute the code. The initial declaration could be written as follows:

template< typename F, typename... ArgTypes >
double time_it( F&& f, ArgTypes&&... args );

This declaration uses variadic templates. You can find information on variadic templates on various sites. The problem I had with that first iteration is that I could not specify the number of invocations or the number of times I wanted to run the timing experiment (the counterparts of the number and repeat parameters of Python's timeit, n and r in the snippet above). I had to find a way to do that which would allow for default values as well, since I did not want to specify those arguments every time I call the function. To be able to use default values in C++, the defaulted parameters have to be last (no keyword arguments in C++ land). That presented a problem because the function parameter pack also has to come last in order to be deduced. So to work around that, I introduced two defaulted template arguments like this:

template<
   unsigned int repeat_invoke = 1000,
   unsigned int repeat_meas = 3,
   typename F,
   typename... ArgTypes
        >
std::array< double, repeat_meas >
time_it( F&& f, ArgTypes&&... args );

Sorry for the reformatting of the template declaration. I must admit I struggle to have a consistent formatting for templates. I find nothing is completely satisfying. I should probably just use Clang Format. Anyhow, back to my function. The addition of the two template arguments allowed me to change the return type to an array and I now have the functionality I want: I can specify, albeit as template parameters, the number of repetitions and the number of invocations at the call site. All that was left to do was write the timing code, since the function invocation itself is simply:

std::invoke( std::forward<F>(f), std::forward<ArgTypes>(args)... );

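The use of std::forward here is perfect forwarding: each argument reaches the timed function with the value category (lvalue or rvalue) it had at the call site, which avoids needless copies. One caveat is that the call sits inside a loop, so a timed function that moves from an rvalue argument would see a moved-from object on every invocation after the first. The timing code is based on std::chrono::high_resolution_clock. I chose to simply record the time before the invocation with std::chrono::high_resolution_clock::now() and do the same after. The difference is the elapsed time. The choice of the clock could be discussed. The most important thing is for the clock to be steady, which the standard does not guarantee for high_resolution_clock; std::chrono::steady_clock is the one that is always steady, and the is_steady member of a clock tells you which case applies.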
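To make the clock discussion concrete, here is a minimal sketch of how the clock could be picked at compile time. The alias timing_clock and the helper elapsed_ns are names I made up for this illustration; they are not part of time_it as shown above.

```cpp
#include <chrono>
#include <type_traits>

// Hypothetical alias (not in the original time_it): keep
// high_resolution_clock when it is steady, otherwise use steady_clock.
using timing_clock = std::conditional_t<
   std::chrono::high_resolution_clock::is_steady,
   std::chrono::high_resolution_clock,
   std::chrono::steady_clock >;

// Fails to compile if the selected clock could jump backwards.
static_assert( timing_clock::is_steady,
               "timing_clock has to be a steady clock" );

// Hypothetical helper: nanoseconds elapsed since `start`, as a double.
inline double elapsed_ns( timing_clock::time_point start )
{
   using std::chrono::duration_cast;
   using std::chrono::nanoseconds;
   return static_cast<double>(
      duration_cast< nanoseconds >( timing_clock::now() - start ).count() );
}
```

Inside time_it, one would then call timing_clock::now() wherever high_resolution_clock::now() appears. Whether that changes anything depends on the standard library in use, since high_resolution_clock is often just an alias for one of the other clocks.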

Example / test

I have written a function which will make the current thread sleep for ten milliseconds. Then, I have used my new time_it function to time the execution of the defined function. The code is reproduced here and actually serves as a usage example. There are also two calls with a lambda function, one taking arguments.

void my_fn()
{
   using namespace std::chrono_literals;
   std::this_thread::sleep_for( 10ms );
}

int main( int argc, char* argv[] )
{
   // get times
   auto test_val_0 = time_it( my_fn );
   auto test_val_1 = time_it< 100, 6 >( my_fn );
   auto test_val_2 = time_it< 500 >( my_fn );
   auto test_val_3 = time_it< 1000, 2 >( my_fn );
   auto test_val_4 = time_it< 1000, 8 >( [](){
            using namespace std::chrono_literals;
            std::this_thread::sleep_for( 5ms );
         } );
   auto test_val_5 = time_it< 1000, 3 >( []( int lt, int rt ){
            return lt - rt;
         }, 5, 7 );

   // output results
   std::cout << "exp 0: [ " << test_val_0[0] / 1000000.0
             << ", " <<  test_val_0[1] / 1000000.0
             << ", " <<  test_val_0[2] / 1000000.0 << " ]\n";
   std::cout << "exp 1: [ " << test_val_1[0] / 1000000.0
             << ", " <<  test_val_1[1] / 1000000.0
             << ", " <<  test_val_1[2] / 1000000.0
             << ", " <<  test_val_1[3] / 1000000.0
             << ", " <<  test_val_1[4] / 1000000.0
             << ", " <<  test_val_1[5] / 1000000.0 << " ]\n";
   std::cout << "exp 2: [ " << test_val_2[0] / 1000000.0
             << ", " <<  test_val_2[1] / 1000000.0
             << ", " <<  test_val_2[2] / 1000000.0 << " ]\n";
   std::cout << "exp 3: [ " << test_val_3[0] / 1000000.0
             << ", " <<  test_val_3[1] / 1000000.0 << " ]\n";
   std::cout << "exp 4: [ " << test_val_4[0] / 1000000.0
             << ", " <<  test_val_4[1] / 1000000.0
             << ", " <<  test_val_4[2] / 1000000.0
             << ", " <<  test_val_4[3] / 1000000.0
             << ", " <<  test_val_4[4] / 1000000.0
             << ", " <<  test_val_4[5] / 1000000.0
             << ", " <<  test_val_4[6] / 1000000.0
             << ", " <<  test_val_4[7] / 1000000.0 << " ]\n";
   std::cout << "exp 5: [ " << test_val_5[0]
             << ", " <<  test_val_5[1]
             << ", " <<  test_val_5[2] << " ]\n";
   return 0;
}

and the output on my machine is:

exp 0: [ 10.0203, 10.0184, 10.0266 ]
exp 1: [ 10.0465, 10.0373, 10.0241, 10.0198, 10.0214, 10.0243 ]
exp 2: [ 10.0152, 10.0176, 10.0305 ]
exp 3: [ 10.02, 10.0174 ]
exp 4: [ 5.01679, 5.01788, 5.01327, 5.01883, 5.02777, 5.02198, 5.01831, 5.01463 ]
exp 5: [ 55, 55, 55 ]

The first five experiments are printed in milliseconds and the last one in nanoseconds. The first four cases give slightly more than 10 ms, which is what was expected. The expected value for the experiment labeled 4 is slightly higher than 5 ms, and for the last lambda it should be in the nanosecond range, which it is. Of course, the constant value is suspicious and probably means the actual value is lower, but the time_it overhead and the clock precision are such that nothing below 55 ns can be measured on my machine. I have not determined the error bars on the time measurement. This is a complex subject and much more work would be needed to test for that. One other thing I have not tried in this example is the timing of a member function, but it should work.
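The reason a member function should work is that std::invoke accepts a pointer to member function followed by the object (or a pointer to it) and then the remaining arguments. The sketch below assumes C++17 and uses a made-up Accumulator type and a stripped-down time_once helper, both hypothetical, to show the call shape without repeating all of time_it.

```cpp
#include <chrono>
#include <functional>
#include <utility>

// Hypothetical type used only for this illustration.
struct Accumulator
{
   int total = 0;
   int add( int value ) { total += value; return total; }
};

// Hypothetical helper: times a single invocation and returns nanoseconds.
// It forwards to std::invoke exactly as time_it does.
template< typename F, typename... ArgTypes >
double time_once( F&& f, ArgTypes&&... args )
{
   using clock = std::chrono::steady_clock;
   using std::chrono::duration_cast;
   using std::chrono::nanoseconds;

   auto start = clock::now();
   std::invoke( std::forward<F>(f), std::forward<ArgTypes>(args)... );
   auto elapsed = duration_cast< nanoseconds >( clock::now() - start ).count();
   return static_cast<double>( elapsed );
}
```

With an object Accumulator acc, the call is time_once( &Accumulator::add, acc, 5 ), or time_once( &Accumulator::add, &acc, 5 ) with a pointer. The same argument list should be accepted by time_it, keeping in mind that the member function then runs repeat_invoke times repeat_meas times, so any side effect on the object accumulates.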

What is impossible

There are things I have found to be impossible with the function as it stands. One of them is passing time_it a function template as an argument without specifying the template arguments of the timed function. I might be wrong, but I have not found a way to do that directly. Spelling out the template arguments is tedious, so time_it might not be very useful in those cases.
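One possible workaround, sketched here and not something the original time_it does for you, is to wrap the call in a generic lambda. The lambda is an ordinary object, so it can be passed around, and template argument deduction for the wrapped function happens inside its body. The names square, square_any and call_like_time_it are hypothetical and only serve the illustration; C++17 is assumed for std::invoke.

```cpp
#include <functional>
#include <utility>

// Hypothetical function template standing in for the code to be timed.
template< typename T >
T square( T value ) { return value * value; }

// Hypothetical stand-in for the way time_it calls its argument:
// it forwards everything to std::invoke and returns the result.
template< typename F, typename... ArgTypes >
decltype(auto) call_like_time_it( F&& f, ArgTypes&&... args )
{
   return std::invoke( std::forward<F>(f), std::forward<ArgTypes>(args)... );
}

// Generic lambda wrapper: T is deduced when the lambda body calls square,
// so the caller never has to write square<int> or square<double>.
const auto square_any = []( auto&&... xs ) -> decltype(auto)
{
   return square( std::forward<decltype(xs)>(xs)... );
};
```

Both call_like_time_it( square<int>, 4 ) and call_like_time_it( square_any, 4 ) compile, while passing the bare name square does not, because a function template on its own has no single type to deduce F from. The wrapper adds one extra call layer, which an optimizing compiler will typically inline, but that is worth checking before trusting nanosecond-level numbers.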

One other capability Python's timeit has that this one does not is the ability to accept setup code which is executed before the experiment is run. I am not sure how useful that would be in the context of C++. Of course, I can't imagine what others would use time_it for, so it might be useful even though I can't see it, but I am not writing the STL! It's just that in my view, since C++ is compiled and not interpreted, time_it will still require a main function and an executable to run, unlike Python's timeit which can be run on the fly from the command line to test something out. To my mind, the setup code can then be written in the main function at no real cost compared to having it done in the time_it function.

That being said, I believe it would be possible to accept a setup function by making the input of time_it two tuples of function and arguments, the first being the function to test and the second being the setup function. I am just not 100% certain that it would be useful, and I cannot say for sure that it can be done because I have not tried implementing it. In my opinion, it would make calling the function without setup code much uglier. Maybe an overload taking two tuples and the current version could live together and that would allow the functionality at "no cost".
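For what it is worth, here is a rough sketch of what the two-tuple idea could look like, assuming C++17 for std::apply and std::invoke. The name time_it_with_setup, its signature and the choice to return a single minimum instead of an array are all assumptions made to keep the example short; this is not a tested extension of time_it.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <limits>
#include <tuple>

// Hypothetical variant of time_it. Each tuple holds a callable followed
// by the arguments to call it with. The setup tuple is invoked once,
// untimed; the timed tuple is invoked repeat_invoke times.
// Returns the shortest single invocation in nanoseconds.
template< unsigned int repeat_invoke = 1000,
          typename TimedTuple, typename SetupTuple >
double time_it_with_setup( TimedTuple&& timed, SetupTuple&& setup )
{
   using clock = std::chrono::steady_clock;
   using std::chrono::duration_cast;
   using std::chrono::nanoseconds;

   // Calls element 0 of the tuple with the remaining elements as arguments.
   // Arguments are passed as lvalues so they survive repeated calls.
   auto call = []( auto& tuple )
   {
      std::apply( []( auto& f, auto&... args ) { std::invoke( f, args... ); },
                  tuple );
   };

   call( setup );

   double shortest = std::numeric_limits<double>::max();
   for( unsigned int idx = 0; idx != repeat_invoke; ++idx )
   {
      auto start = clock::now();
      call( timed );
      double elapsed = static_cast<double>(
         duration_cast< nanoseconds >( clock::now() - start ).count() );
      shortest = std::min( shortest, elapsed );
   }
   return shortest;
}
```

A call would look like time_it_with_setup< 100 >( std::make_tuple( timed_fn, arg ), std::make_tuple( setup_fn, size ) ). Since the two callables share no state by themselves, the setup function has to communicate through something both can reach, such as a reference captured by two lambdas, which is part of why writing the setup directly in main may be just as convenient.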

Anyhow, those are my thoughts on the matter for now.

Notes

[1] I chose the name with an underscore to differentiate the function from its Python counterpart, since my version does not do everything the Python version can!