Here's a reliable/portable solution:

    #include <math.h>      /* trunc, ldexp */
    #include <stdbool.h>
    #include <stdint.h>

    bool double_to_uint64 (double x, uint64_t *out)
    {
        double y = trunc(x);                     /* round toward zero; still a double */
        if (y >= 0.0 && y < ldexp(1.0, 64)) {    /* compare against exact 0 and 2^64 */
            *out = (uint64_t)y;                  /* safe: y is an in-range integer */
            return true;
        } else {
            return false;                        /* out of range, or NaN */
        }
    }
If you need different rounding behavior, just change trunc() to round(), floor() or ceil(). Note that it is important that the result is produced by converting the rounded double (y) to an integer type, not the original value (x).

Explanation:

- we first round the value to an integer (but still a floating point value),

- we then check that this integer is in the valid range of the target integer type by comparing it with exact integer values (0 and 2^N),

- if the check passes, then converting this integer to the target integer type is safe, and if the check fails, then conversion is not possible.
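
A quick usage example of the function above (a minimal driver):

    #include <stdio.h>

    int main(void)
    {
        uint64_t v;
        if (double_to_uint64(1e18, &v))
            printf("ok: %llu\n", (unsigned long long)v);  /* 1e18 is exactly representable */
        if (!double_to_uint64(-1.5, &v))
            printf("out of range\n");                     /* truncates to -1, below 0 */
        return 0;
    }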

Of course if you literally need to convert to "long" you have a problem because the width of "long" is not known, but that is a rather different concern. I argue types with unknown width like "long" should almost never be used anyway.

(based on my answer here: https://stackoverflow.com/questions/8905246/how-to-check-if-...)




Programming languages aren't able to express arbitrary real numbers, so to "determine whether a given real number is representable as an int or not" is mostly meaningless in a programming language as opposed to a computer algebra system.

What you can do is:

- determine if a float is an integer (trunc(x) == x),

- convert a float to a certain integer type with some kind of rounding, or get an error if it's out of range (see my comment with double_to_uint64),

- convert a float to a certain integer type exactly, or get an error if it's not representable (e.g. by doing both of the above).
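
A minimal sketch of that last option, combining the integer check with the range check (the name double_to_uint64_exact is mine):

    #include <math.h>      /* trunc, ldexp */
    #include <stdbool.h>
    #include <stdint.h>

    /* Exact conversion: fails unless x is an integer in [0, 2^64). */
    bool double_to_uint64_exact(double x, uint64_t *out)
    {
        if (trunc(x) != x)                       /* not an integer (or NaN) */
            return false;
        if (!(x >= 0.0 && x < ldexp(1.0, 64)))   /* out of range (or infinite) */
            return false;
        *out = (uint64_t)x;
        return true;
    }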

The basic reason that so many people fail to use floats correctly is that they act like operations on floats are equivalent to operations on the real numbers they represent, when in fact they are usually defined as the operation on real numbers rounded to a representable value.
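
The classic demonstration of that rounding:

    #include <stdio.h>

    int main(void)
    {
        /* 0.1, 0.2 and 0.3 are each rounded to the nearest double, and
           the addition is rounded once more, so the identity fails. */
        printf("%d\n", 0.1 + 0.2 == 0.3);  /* prints 0 */
        printf("%.17g\n", 0.1 + 0.2);      /* prints 0.30000000000000004 */
        return 0;
    }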


https://wiki.c2.com/?IeeeSevenFiftyFour:

“IEEE 754 […] Has the interesting and useful property that two's complement comparisons of the underlying bit pattern of any two IEEE 754 numbers will have the same result as comparing the numbers that are represented”

That means that, if you interpret the bits of a float/double as an int32/int64, increase that integer by one, and then interpret the bits of the result as a float/double, you get the smallest float/double that’s larger than what you started with. Exceptions: NaN, infinities, +/- zero, negative values (where incrementing the bits moves you toward minus infinity instead), and possibly some categories I forget.

That can be useful if you want to iterate over all floats between 0.0 and 1.0, for example, but it may not be that efficient on some modern hardware, where moving data between the “float side” and the “integer side” of the CPU is expensive.
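
Something like this sketch (positive finite doubles only; memcpy keeps the type punning well-defined in C):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Next representable double above x, assuming x is finite and >= 0. */
    static double next_up(double x)
    {
        uint64_t bits;
        memcpy(&bits, &x, sizeof bits);
        bits += 1;
        memcpy(&x, &bits, sizeof x);
        return x;
    }

    int main(void)
    {
        /* Count the doubles in [1.0, 1.0 + 2^-50); iterating all of
           [0.0, 1.0] this way would take on the order of 2^62 steps. */
        uint64_t n = 0;
        for (double x = 1.0; x < 1.0 + 0x1p-50; x = next_up(x))
            n++;
        printf("%llu\n", (unsigned long long)n);  /* prints 4 */
        return 0;
    }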


float == 32 bits, double == 64 bits.

That is not a problem; it simply converts the float value to uint32. What the post mentions is `uint32_t my_int = (uint32_t)&my_float;`, which casts the pointer, not the value.

Double precision floats can't represent every 64-bit integer. If you want to math it, what kind of number will you accept?

I mention at the start that casting to double is not allowed. The motivation for that restriction is to allow the solution to generalize to double/int64 or even quad/int128.

In a more practical sense, casting to a double is a good solution. Both int32s and floats can be represented faithfully as a double, and that's what I used to test my implementation against.
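
In C that approach is tiny, and both conversions are exact (a sketch):

    #include <stdbool.h>
    #include <stdint.h>

    /* Every float and every int32_t is exactly representable as a
       double, so this comparison involves no rounding at all. */
    bool float_equals_int32(float f, int32_t i)
    {
        return (double)f == (double)i;
    }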

However, the unreliable extra precision is still a problem. Casting to double may or may not remove it. You're still exposed to false positives and false negatives, depending on where and when precision is available. You should still run the float through a volatile float field, if you want consistent behavior.

(Note: the volatile field doesn't have to be static. You can use a struct on the stack. That just took more code, and seemed like a superfluous detail. In C, instead of C#, you can just do a cast, instead of storing it in a field.)
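
A C version of the volatile trick might look like this (a sketch; the store/load forces the value down to true float width, discarding any extra precision held in a wider FPU register):

    #include <stdio.h>

    static float force_float(float x)
    {
        volatile float v = x;  /* spill to memory at float precision */
        return v;
    }

    int main(void)
    {
        printf("%.9g\n", force_float(1.0f / 3.0f));
        return 0;
    }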


No, not 32, 64, 128 bit floats/doubles (though I'm glad it has those as well). I meant plain ints. Do I have to do binary math or can I make use of large numbers natively?

Point 5 seems like dangerous advice: shortening types (especially floating-point types) without analysis isn't a good idea, and keeping the wider type is hardly a deadly sin.

Changing doubles to floats can give a significant performance boost (mainly on ARM with Neon) but it brings significant limitations on range and accuracy that can lead to subtle bugs if you don't do the analysis (especially if you mostly test it with double precision.)


Why bother? Almost every float literal is inexact; floats are inexact by design. That's how you can fit magnitudes up to about 2^1024 into 64 bits.

And compiler can't help you on the application UI layer.
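
For example:

    #include <stdio.h>

    int main(void)
    {
        /* The literal 0.1 has no exact binary representation; the
           nearest double is slightly above one tenth. */
        printf("%.17g\n", 0.1);  /* prints 0.10000000000000001 */
        return 0;
    }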


Floats are still represented in 32 bits, or 64 for doubles

Can't think of a common programming language which doesn't support saving bit-accurate copies of floats in a contiguous buffer. Even in JavaScript you can put your doubles in a Float64Array and convert that to or from a buffer.
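
In C the equivalent is just an fwrite of the array (a minimal sketch; the bytes are copied verbatim, so the dump is bit-accurate, modulo endianness across machines):

    #include <stdio.h>

    int main(void)
    {
        double data[3] = {0.1, 2.5, -1e300};
        FILE *f = fopen("doubles.bin", "wb");
        if (!f)
            return 1;
        fwrite(data, sizeof data[0], 3, f);  /* 24 bytes, bit-exact */
        fclose(f);
        return 0;
    }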

If you want headers, you can define those as text and just use extents to embed the binary data. Tar is another alternative, pretty easy to implement.


- automatic integer-to-float comparison to accommodate bigger integers. A horrible hack to squeeze a little extra performance in naive benchmarks in computers with no native 64 bit integer support. This really makes no sense whatsoever now and may have had some partial justification in the early 90s, prior to PHP4 even.

No, this is not unique to php. Many popular, comparable languages perform an int -> float conversion. For example, Perl:

    $ perl -wle'print "20938410923849012834092834" + 0 if "20938410923849012834092834" == "20938410923849012834092835"'
    2.0938410923849e+25

- This is not a philosophical debate about typing styles or the existence of perfect type conversions. PHP's problems in this regard are relics from a dubious past.

Conversion from string -> number, and loose numeric types which auto-convert to float are near universal in loosely typed languages, out of necessity -- if such a scheme doesn't work consistently it can't be used at all. This brings me back to my point. You said "IMO this conversion should fail if the number represented is not valid, or fall back to arbitrary precision math". My response is that you cannot provide such a rule on the basis of "is it valid" because there is no such thing as a "valid" type conversion -- ALL have precision loss. It is inherent in the datatype. When I said "you may be underestimating the difficulty in predicting whether a particular decimal number can be accurately represented as a floating point type" you should perhaps read that as "you cannot do this, it is not possible".
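
The same collapse is easy to reproduce with C's strtod, reusing the numbers from the Perl example (a sketch):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* Both strings are near 2^84, where adjacent doubles are about
           2^32 apart, so the two parses land on the same double. */
        double a = strtod("20938410923849012834092834", NULL);
        double b = strtod("20938410923849012834092835", NULL);
        printf("%.17g, equal: %d\n", a, a == b);  /* equal: 1 */
        return 0;
    }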

Instead you might suggest that no loose conversion, no loose typing be permitted in a language design -- and I would agree wholeheartedly. But your suggestion that this be handled on a case-by-case basis depending on the numeric value is fundamentally unworkable. Big integers are not the only area this type of problem presents.


I found a tool[0] that helps me debug potential floating point issues when they arise. This one has modes for half-, single- and double-precision IEEE-754 floats, so I can deal with the various issues when converting between 64-bit and 32-bit floats.

[0] https://news.ycombinator.com/item?id=29370883


How does double_of_uint work? Float 5.0 and (u)int 5 have different representations, so you can't just do a static type change and get a correct result; you need a runtime conversion. I suppose it might be special-cased in the compiler, but in that case it's a bit of a poor example.

nice. so, go does automatic conversion/check for float/int..

If you want to reinterpret a float as an integer or vice versa, you can do that easily enough with Rust's unsafe functions:

    // Quake-style fast inverse square root, done by reinterpreting bits.
    fn approx_invsqrt(r : f32) -> f32
    {
        let y : f32 = unsafe {
            // Reinterpret the float's bits as a 32-bit integer.
            let i : i32 = std::mem::transmute(r);
            // Magic-constant estimate, reinterpreted back into a float.
            std::mem::transmute(0x5f375a86 - (i>>1))
        };
        // One Newton-Raphson step to refine the estimate.
        return y*(1.5-(0.5*r*y*y));
    }

    fn main()
    {
        println!("approx_invsqrt(2.0) = {}", approx_invsqrt(2.0));
    }
Result:

    approx_invsqrt(2.0) = 0.70693

Aren't most float operations binary? You'd need to iterate over a 64-bit space to test those.

Double precision floats can losslessly manipulate integers with up to 53 bits of precision, so you get the best of both worlds with no extra effort.
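
The cutoff is easy to see at 2^53, where adjacent integers start to collapse:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* 2^53 + 1 has no double representation; the conversion
           rounds it back down to 2^53. */
        uint64_t n = (uint64_t)1 << 53;
        printf("%d\n", (double)n == (double)(n + 1));  /* prints 1 */
        return 0;
    }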

In my experience there are few things slower than float-to-string and string-to-float conversion. And it seems so unnecessary.

I always implemented round-to-a-specific-digit based on the built-in roundss/roundsd functions, which map to native x86-64 instructions (see https://www.felixcloutier.com/x86/roundsd).

I do not understand why this would not be preferable to the string method.

    float round( float x, int digits, int base )
    {
        float factor = pow( base, digits );
        return roundss( x * factor ) / factor;
    }

I guess this has the effect of not working for numbers near the edge of its range.

One could check this and fall back to the string method. Or alternatively use higher precision doubles internally:

    float round( float x, int digits, int base )
    {
        double factor = pow( base, digits );
        return (float)( roundsd( x * factor ) / factor );
    }
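
For comparison, a portable sketch of the same idea, substituting standard C99 nearbyint() for the roundsd wrapper (an assumption on my part; under the default rounding mode both round to nearest):

    #include <math.h>
    #include <stdio.h>

    /* Round x to the given number of digits in the given base,
       doing the intermediate math in double precision. */
    static float round_digits(float x, int digits, int base)
    {
        double factor = pow(base, digits);
        return (float)(nearbyint((double)x * factor) / factor);
    }

    int main(void)
    {
        printf("%.6f\n", round_digits(3.14159f, 2, 10));  /* 3.140000 */
        return 0;
    }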

But then what do you do if you have a double to round and want to maintain all its precision? I think there is likely some way to do that by unpacking the double into a manual mantissa and exponent, each of which is a double, and doing this manually - or maybe by using some type of float128 library (https://www.boost.org/doc/libs/1_63_0/libs/multiprecision/do...)...

But changing this implementation now could cause slight differences, and if someone was rounding and then hashing, this type of change could be horrible if not behind some kind of opt-in.
