Mixed-type arithmetic in C++

Arithmetic on mixed fundamental types in C++

For a weekend project of mine, I have had to think about mixed type arithmetic on fundamental types in C++. In the process, I made sense of a few fundamental things (no pun intended ;-) ) and I have decided to write them down. Hopefully, writing about it will allow me to both clarify my thoughts and remember the information!

Arithmetic conversions

Applying binary operators to different types might seem trivial in C++, because it mostly just works. If you write the following code:

float flt{15.f};
long lng_a{30L};
long lng_b = lng_a + flt;
assert( lng_b == 45 );

and then run it, the value of lng_b will be 45. No surprises... Except when you stop to think about what happened in the background and how many rules were involved in the computation.1

Naively (as seems to often be the case for me...), because of the performance reputation of C++, I assumed that the addition expression above mapped to an assembly language instruction2 adding two registers. Then I started thinking more seriously about the problem, and even though I am anything but an expert in assembly, it brought me to this question: is there an opcode to add an int to a float? Are there mixed type instructions on CPUs? With modern hardware it is not as simple as it used to be, but as far as I could find out, on most hardware there is not. This means that at the hardware level, both operands have to have the same representation for the operation to be possible, which is not completely unreasonable. Thus, even for the simple expression in the code above, conversions are needed to select a common type on which to apply the operation.

The C++ language standard explicitly states which conversions will take place (rules inherited from C), allowing one to take control and override the behavior manually with a cast if preferred. This could be needed if, for instance, the default conversion introduces a loss of precision on a given platform or if a specific wrapping behavior is required.

One should note that the type selected for the operation by the conversion rules will be the type of both operands and of the return value. This means that a supplementary conversion might happen if the type in which the result of the operation is stored is not the one the usual conversions would have selected (as is the case in the example above). Something to keep in mind.
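A quick way to see this is to ask the compiler for the type of the expression itself with decltype (a minimal check, assuming C++17 for std::is_same_v):

#include <type_traits>

float flt{15.f};
long lng_a{30L};
static_assert( std::is_same_v< float, decltype( lng_a + flt ) > ); // the common type is float
long lng_b = lng_a + flt; // so a second conversion, float to long, happens on assignment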

Usual arithmetic conversions

The conversion rules applied before binary operations on fundamental types are called the usual arithmetic conversions and can be found in section 8 Expressions of the C++ standard document3. For those like me who do not easily read "standardese", information on the subject with some explanations can be found in other places. That said, I have had to read some of the standard's sections relating to the topic and I have found them not too hard to read. Might be a sign that I am slowly getting assimilated...

In the discussion that follows, I will consider an operation op on two operands t1 and t2 respectively of types T1 and T2. This can be conceptually represented as:

T1 t1;
T2 t2;
t1 op t2;

In the discussion, I will consider the following cases:

  1. T1 and T2 are the same type (yes, conversions can happen...)
  2. T1 is floating point and T2 is integral (or vice versa)
  3. T1 and T2 are both floating point, but different types
  4. T1 and T2 are both integral, but different types

These are almost all the situations covered in paragraph 11 of section 8 of the standard (but the last point is actually split into several sub-clauses). The only case I am not considering is when one of the types (or both) is a scoped enumeration (i.e. an enum class), because that had nothing to do with my project and I simply did not think about it as much.

Same type

Even if the types are not actually mixed, I had to consider the case where both operands are of the same type, i.e., T1 == T2. Intuitively, nothing should happen in this case, but that turns out to be a false assumption. Because arithmetic operators in C++ do not accept any type smaller than int, integral promotion will take place before the operation. This is described in section 7.6 Integral promotions of the standard and can be roughly summarized as: any type smaller than int will be converted to int or unsigned int. For instance, the following relation holds:

short a{0};
short b{1};
static_assert( std::is_same_v< int, decltype( a + b ) > );

Other than that, nothing else happens in terms of conversions. As the name suggests, this promotion applies only to integral types. I would assume that is because the smallest floating point type is at least as large as an int, but I don't think that is guaranteed.
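As a quick sanity check of both points (a minimal sketch, again assuming C++17 for std::is_same_v):

#include <type_traits>

char c1{'a'};
char c2{'b'};
static_assert( std::is_same_v< int, decltype( c1 + c2 ) > );   // integral promotion to int
float f1{1.f};
float f2{2.f};
static_assert( std::is_same_v< float, decltype( f1 + f2 ) > ); // no promotion for floating point operands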

Mixed integral and floating point types

Now, to look at mixed type arithmetic, the simplest case to start with is that of integral and floating point mixed operations, i.e. either T1 or T2 is a floating point type and the other is integral. In this case, the standard simply mandates that the integer value be converted to the floating point type:

int + float => (float)int + float
unsigned long long - float => (float)unsigned long long - float
long double + unsigned => long double + (long double)unsigned
...

The casts illustrated here describe what conceptually happens, and as far as I can tell it is also what actually happens. The type selected in this situation is not too surprising when you think about it. At least for IEEE floating point, the range of the smallest floating point type (float: about 3.4×10^38) is much larger than that of the largest integer type (unsigned long long: about 1.84×10^19). Thus, neglecting the fact that the value cannot be represented exactly when the mantissa of the floating point type cannot hold all the digits of the integer, the floating point type will accommodate the integer type. On top of that, the fractional part of the floating point value would necessarily be lost (by rounding, truncation or some other choice) if the conversion went in the other direction.

So again, because of those two points, the standard here makes sense (at least to me!).
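For completeness, the same decltype trick confirms the type selected for the examples above (a minimal check, assuming C++17):

#include <type_traits>

static_assert( std::is_same_v< float, decltype( 1 + 1.f ) > );
static_assert( std::is_same_v< float, decltype( 1ULL - 1.f ) > );
static_assert( std::is_same_v< long double, decltype( 1.0L + 1u ) > );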

Mixed floating points

Next on the scale of simplicity is the case where both arguments are of a (different) floating point type. In this case, the rule is simple: the smaller type is cast to the larger type before the operation.

double / float => double / (double)float
long double + double => long double + (long double)double
...

This makes sense. The value in the smaller type will fit in the larger one, so there is no change in value.
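And the corresponding checks for these cases (same assumptions as before):

#include <type_traits>

static_assert( std::is_same_v< double, decltype( 1.0 / 1.f ) > );
static_assert( std::is_same_v< long double, decltype( 1.0L + 1.0 ) > );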

Mixed integrals

The final case is that of both operands being of integral types. Here, there are a few more things to consider, since for a given type size there can be both signed and unsigned types (for instance, int and unsigned int must be the same size, e.g. 4 bytes). This complicates matters a little and, before we continue, we first need the concept of integer conversion rank (section 7.15 Integer conversion rank of the standard document), which is used to decide the conversions to apply for mixed integer type arithmetic. Once these ranks are defined, the standard mandates the conversion corresponding to the first of the following four scenarios that applies:

  1. both have the same signedness, independent of ranks;
  2. rank( unsigned ) >= rank( signed );
  3. rank( signed ) > rank( unsigned ), the signed type can represent all values of the unsigned type;
  4. rank( signed ) > rank( unsigned ), the signed type cannot represent all values of the unsigned type;

Note that the rank ordering I have written in situations 3 and 4 is not mentioned explicitly in the standard, but the fact that situations 1 and 2 do not apply implies that the rank of the signed integer is strictly greater than that of the unsigned integer, so I wrote it explicitly.

Integer conversion rank

From what I understand from reading the standard, the integer types in C++ are not given explicit rank values; only the relative ordering of the ranks is specified. This can be loosely interpreted as: the ranks follow the sizes of the integer types, with larger integral types having higher ranks. In particular, the standard says (section 7.15, par. 1.3):

The rank of long long int shall be greater than the rank of long int, which shall be greater than the rank of int, which shall be greater than the rank of short int, which shall be greater than the rank of signed char.

In order to remove any ambiguity, the standard adds quite a few details (there are 10 clauses to the section), but I believe that the following order of ranks, from smallest rank to highest rank, is mandated by the standard:

  1. bool
  2. char, signed char, unsigned char
  3. short, unsigned short
  4. int, unsigned int
  5. long, unsigned long
  6. long long, unsigned long long

where for a given type size, signed and unsigned types share their rank. I said the rule of thumb as presented above loosely interprets the standard because the standard does not explicitly mandate the sizes of short, int, long, and the others. This freedom exists to allow implementers to support the various hardware architectures out there. I think this is mostly an artifact of history, since a lot of modern hardware is 32 or 64 bits, but it is still how the standard is written. That said, on some machines two types could share the same size, e.g. on a particular architecture, sizeof(long) could be the same as sizeof(int). Even in such a case, the standard still stipulates that those types' ranks are different. Specifically, in the example given, long would still have a higher rank than int.

Same signedness

So, getting back to the mixed operations and the usual conversions: in the case of two integral types with the same signedness, i.e. both T1 and T2 are signed or both of them are unsigned, the standard mandates that the integer with the smaller rank be converted (after promotion) to the type with the higher rank.

long + int => long + (long) int
unsigned short * unsigned int => (unsigned int)unsigned short * unsigned int
...

The higher ranked integer will accommodate the values of the smaller ranked one without problem, and there are no considerations of sign, so no possible loss of value or overflow in the conversion (there is possible overflow in the operation, but not in the conversion). This case is an easy one.
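The examples above can be checked the same way (a sketch, assuming C++17; the cast is only there because there is no literal suffix for unsigned short):

#include <type_traits>

static_assert( std::is_same_v< long, decltype( 1L + 1 ) > );
static_assert( std::is_same_v< unsigned int,
               decltype( static_cast<unsigned short>( 1 ) * 1u ) > ); // the unsigned short goes through promotion first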

Differing signedness, unsigned with larger or equal rank

In this case, the standard says that the signed integer will be converted to the unsigned type.

int + unsigned int => (unsigned int)int + unsigned int
short - unsigned int => (unsigned int)short - unsigned int
...

The fact that the operation then yields the correct answer is mandated by the standard. In section 7.8 Integral conversions, the standard says:

If the destination type is unsigned, the resulting value is the least unsigned integer congruent to the source integer (modulo 2^n where n is the number of bits used to represent the unsigned type). [ Note: In a two’s complement representation, this conversion is conceptual and there is no change in the bit pattern (if there is no truncation). — end note]

Because of the modulo 2^n arithmetic, this will give the correct unsigned answer... most of the time. See the discussion in the last section for an example where this rule yields a surprising result.
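The conversion rule itself is easy to verify (a minimal check, using <limits> for the maximum value):

#include <limits>

static_assert( static_cast<unsigned int>( -1 ) == std::numeric_limits<unsigned int>::max() ); // least unsigned value congruent to -1 modulo 2^n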

This being the case, if you are putting the result of the operation into a variable, it is worth thinking about that variable's type, because if that type is not the type of the unsigned operand (or a larger unsigned integral type), you will incur yet another conversion. That is, while the operation itself is well defined by the standard, putting the result back into anything but a large enough unsigned integral type might not yield what you expect. Into a smaller unsigned integral type, there is at least another modulo conversion happening. If the destination type is signed and cannot represent the value, the result is implementation defined, as stipulated by the standard, again in section 7.8:

If the destination type is signed, the value is unchanged if it can be represented in the destination type; otherwise, the value is implementation-defined.

The standard does not specify what happens in this case and instead gives latitude to the compiler vendor by saying the result is implementation defined. This means that if you rely on this conversion, the behavior might not be portable (not undefined as in the case of signed overflow, just not portable and tied to the compiler you use). On two's complement machines, this will in practice give you wrapping behavior, but relying on it is not portable (even if, from what I understand, most hardware uses two's complement these days). On other architectures the behavior will be different, so portable code should not rely on this conversion without some kind of checks.
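Here is a small sketch of those caveats, assuming a platform with 32-bit int (the exact values are platform dependent):

#include <cassert>

int main()
{
    int neg{-3};
    unsigned int pos{2u};
    auto result = neg + pos;                // neg is converted to unsigned int, the addition is modulo 2^32 here
    assert( result == 4294967295u );        // 2^32 - 1, not the -1 one might naively expect
    int back = static_cast<int>( result );  // implementation defined before C++20; -1 on typical two's complement machines
    (void)back;
}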

Differing signedness, signed with larger rank, unsigned in range

Here, the standard says that the unsigned integral type is converted to the signed integral type.

long long int + unsigned long (all unsigned long values fit in long long int)
                           => long long int + (long long int)unsigned long

Given that the signed type can represent every value of the unsigned type, the conversion will work as stipulated in section 7.8 of the standard that I quoted in the previous part of this post (at least, that is my understanding). So it should always give the correct answer, since the unsigned value is guaranteed to be in the range of the signed type.
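A quick check of this case (assuming the common situation where unsigned int is 32 bits and long long int is 64 bits, so every unsigned int value fits in a long long int):

#include <type_traits>

static_assert( std::is_same_v< long long, decltype( 1LL + 1u ) > ); // the unsigned int operand is converted to long long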

Differing signedness, signed with larger rank, unsigned not in range

Here, the standard says that both operands are converted to the unsigned type of the same rank as the signed integer in the operation. The unsigned operand is necessarily in range of that larger unsigned type (it has the same rank as the signed type, which is higher than the rank of the unsigned operand), so its value is preserved. The signed operand is converted modulo 2^n. Thus the result should be right given the modulo arithmetic, but with the usual caveats about what you do with the result.
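And a sketch of this last case, assuming an LP64 platform (e.g. Linux or macOS on x86-64) where long and long long are both 64 bits, so long long int cannot represent every unsigned long value:

#include <type_traits>

// Both operands end up as unsigned long long here; on an LLP64 platform (e.g. 64-bit Windows),
// long is 32 bits, the previous case applies instead, and the result type is long long.
static_assert( std::is_same_v< unsigned long long, decltype( 1LL + 1UL ) > );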

Back to the first example

So coming back to the first example, let's see if I can apply the rules to it.

float flt{15.f};
long lng_a{30L};
long lng_b = lng_a + flt;

According to the conversion rules, I would say that the long value will first be converted to float to allow the addition, and that the resulting float will be truncated1, which is what the standard mandates in section 7.10 Floating-integral conversions:

A prvalue of a floating-point type can be converted to a prvalue of an integer type. The conversion truncates; that is, the fractional part is discarded. The behavior is undefined if the truncated value cannot be represented in the destination type.

The numbers above are small enough that it just works as expected! This is probably true for a lot of use cases, which is why I think I can stand by my initial affirmation that "applying binary operators to different types might seem trivial in C++, because it mostly just works".
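To make the truncation itself visible, here is a small variation on the example using a value with a fractional part:

#include <cassert>

int main()
{
    float flt{15.5f};
    long lng_a{30L};
    long lng_b = lng_a + flt; // lng_a converted to float, the sum is 45.5f, then truncated to 45 on assignment
    assert( lng_b == 45 );
}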

Keep informed

As mentioned in the previous post, there is a (controversial?) proposal that has been brought to the C++ standards committee by JF Bastien which would make two's complement the only allowed representation for signed integers. This could change some of the details of this article, namely the parts where the conversion from unsigned to signed is implementation defined. So in C++20 or C++23, the information here could (already) be out of date.

Also, because of conversions, the following assert will actually fire as the operation will yield false even if the mathematics would suggest otherwise:

assert( -1 < 0u );

That is because this is a case where both integers have the same rank (the -1 is an int and the 0u literal is an unsigned int), but differing signedness. Here, according to the rules above, the signed integer is converted to the unsigned type, which means -1 becomes the largest unsigned integer, which is certainly not smaller than 0. This kind of surprising behavior is currently being discussed in the context of a proposal by Herb Sutter: Richard Smith is proposing to bring consistency between the new three-way comparison operator (a.k.a. the spaceship operator, <=>) and the usual C comparison operators. This might have no impact on what I discussed here or might change it completely. I will admit that I am aware of the proposal, but I have not had time to read it through.

In any case, the two proposals above, if they are adopted, will change some of what I discussed here, so keep informed if this matters to you!


Notes

I would like to thank Patrice Roy for reading my post and giving me some advice on it. His time is greatly appreciated.

[1] Here is a link to the code of the first example in compiler explorer (put in a main function so it compiles). You can see the cvtsi2ss, addss and cvttss2si instructions, which respectively convert the long to a float, add the resulting float to the flt variable, and convert the result back to a long.

[2] I believe assembly instructions, assembly code, machine code, and opcodes are roughly the same (according to Wikipedia, some assembly instructions do not map directly to opcodes, but most do). In the context of this post, I don't think it makes much of a difference. Thus, I use the terms interchangeably, but I might be assuming a bit. I am out of my depth in this domain.

[3] The official published document must be purchased from the ISO organization, but the draft papers are freely available and can be found on the web. For instance, a C++17 draft paper (the latest draft before publication, I believe, but I might be wrong) can be found here.