Use a `union` to directly store the result and `memcpy()` it to the
lower 64 bits of the `uint256`.

[...] read from it. The `memcpy` is only added for strict standard
compliance, and should be optimized out by the compiler.
Note that this works because the `uint256` `WIDTH` member is a
`constexpr`.

Using `arith_uint256` would be more straightforward, but it requires a
conversion that has a measurable impact on the benchmark performance.
This does not change the behavior.
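
A minimal sketch of the pattern described above, not the actual patch. `Blob256` and `StoreLow64` are hypothetical stand-ins (the real `uint256` class is not reproduced here), and the sketch assumes the blob keeps its least significant bytes first on a little-endian host. The array bound inside the `union` illustrates why `WIDTH` has to be a constant expression.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a 256-bit blob type; only the parts needed
// for the illustration are modelled.
struct Blob256 {
    static constexpr int WIDTH = 32; // size in bytes, usable as an array bound
    unsigned char data[WIDTH] = {};  // zero-initialized storage
};

// Store a 64-bit result in the first 8 bytes of a Blob256.
// Assumption: on a little-endian host these are the lower 64 bits.
Blob256 StoreLow64(uint64_t result)
{
    static_assert(Blob256::WIDTH >= static_cast<int>(sizeof(uint64_t)),
                  "blob must be able to hold a 64-bit value");

    union {
        uint64_t u64;                        // member that is written
        unsigned char bytes[Blob256::WIDTH]; // bound requires a constexpr WIDTH
    } tmp;
    tmp.u64 = result;

    Blob256 out;
    // Copy out of the active member with memcpy rather than reading the
    // other union member; compilers typically reduce this to a plain store.
    std::memcpy(out.data, &tmp.u64, sizeof(tmp.u64));
    return out;
}
```

The remaining 24 bytes of the returned blob stay zero, so the value round-trips when the first 8 bytes are copied back into a `uint64_t`.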