# Portable Encoding of Floating-Point Values

by Jens Moeller, comp.lang.c, 24 May 1994

In de.comp.lang.c there was recently a thread about storing doubles in a machine independent way; the usual solution (converting to a string via atof or a similar function) results in a loss in accuracy, much need of storage, and, furthermore, takes much time. As it happend, I was currently working on this problem when I found the initial request of Steffani at Karlsruhe Univerity in the article <2r0lmc\$sl5@nz12.rz.uni-karlsruhe.de>. I think this issue is interesting not only in Germany, and so I decided to post this message to comp.lang.c.

At first we have to ask why the usual convertion via atof results in losses of accurracy. The reason is that the reals are converted from a (usually used) binary representation to a decimal one. But if you, for example, consider the value 0.1, you perceive that this number cannot be exactly represented as a binary value. As a consequence, dividing a real value by 10 will often cause the result to be rounded, but something like that is always done by atof when converting a real to its decimal string representation.

A much better solution in my eyes is to convert the given number into its representation to a basis of a power of 2; my choice is 256, which allows to store one digit in exactly one byte. The convertion can be performed without any losses in accurracy regarded that the underlying representation of floats is a binary one. (Of course, if floating points are represented by BCD digits, its exactly the wrong thing we do.)

The function Norm256 first splits the real into its sign, the mantissa and the exponent; note that the exponent is to the basis 256. The mantissa is normalized, i. e. lies within the range 1.0 and 256.0 exclusively. The value 0.0 is the only exception; this value yields a mantissa of 0.0.

In a second step, the function EncodeReal encodes the sign, the exponent and the mantissa separately. The first byte of the code contains the sign in the most siginificant bit and the total length of the coded number. As a consequence, the mantissa may use up to 124 * 8 bits for its representation. The exponent is coded in one or two bytes; the most significant bit of the second byte of code informs about which format is used. The one byte format uses a bias of 64; the two byte format a bias of 16383. The mantissa is coded a single byte per loop; at most sizeof (double) bytes may be taken to code the mantissa since further bytes will never contain any reasonable information. However, the loop will in most cases stop because the mantissa becomes 0.0. The function returns the number of bytes needed for the coded value. The range of reals converted in this way is approximately -2^130904 to +2^130904.

The function DecodeReal is used to decode a real that was perhaps coded on a different machine. The result is the decoded number. If the value does not fit in the real representation of the current machine, errno is set to ERANGE and the result is HUGE_VAL or -HUGE_VAL in the case of an overflow or 0.0 in the case of an underflow.

The function Init must be called before any conversation is performed; it mainly computes the numbers 256^(2^i) and 1/256^(2^i), which are used within all other routines.

Experiments have shown that reals are normally coded and decoded without any losses of accurracy. The code is relatively compact; at most three additional bytes are generated in comparison to the underlying representation of the C compiler. Very often, the code can be guaranteed to require only two additional bytes: the IEEE standard format for single precision can always be coded in six bytes, the IEEE standard format for extended precision in only twelve.

Maybe the code above was able to help someone. I know that some statements could be rewritten using bit manipulations, so don't complain about that. The code does not use these or other special language facilities of C to allow a convinient translation into other languages.

P.S. Does anyone know the difference between the constants HUGE_VAL defined within math.h and DBL_MAX declared in float.h? The ANSI standard seems to require both to be defined (as does the code above).

```#include <stdio.h> #include <math.h> #include <float.h> #include <errno.h> #define MAX_EXP256 16383 #define LD_MAX_EXP256 14 typedef unsigned char byte; static double Pow256 [LD_MAX_EXP256], /* Pow256 [i] == pow (256, pow (2, i + 1)) */ OneOverPow256 [LD_MAX_EXP256]; /* OneOverPow256 [i] == 1.0 / Pow256 [i] */ static unsigned MaxPow256Index; /* index up to which the preceeding arrays */ /* are filled; the following holds: */ /* SQR (Pow256 [MaxPow256Index]) > DBL_MAX */ static double MaxMantissa; static int MaxExp; /* Mantissa and exponent of DBL_MAX */ #define SQR(x) ((x) * (x)) #define ODD(x) ((unsigned) (x) % 2 != 0) #define ABS(x) (((x) >= 0) ? (x) : -(x)) #define HIGH(x) ((byte) ((unsigned) (x) / 256)) #define LOW(x) ((byte) (x)) /* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */ static void Norm256 (Real, SignOut, MantissaOut, ExpOut) double Real; short *SignOut; double *MantissaOut; int *ExpOut; { register short Sign; register double Mantissa; register int Exp; if (Real == 0.0) { Sign = 0; Mantissa = 0.0; Exp = 0; } else { Sign = (Real >= 0.0) ? 1 : -1; Mantissa = ABS (Real); if (Mantissa >= 1.0 && Mantissa < 256.0) { Exp = 0; } else { register unsigned Pow2, i; if (Mantissa >= 256.0) { for (i = 0, Pow2 = 1; i < MaxPow256Index && Mantissa >= Pow256 [i + 1]; i++, Pow2 *= 2); Mantissa /= Pow256 [i]; Exp = (int) Pow2; } else { for (i = 0, Pow2 = 1; i <= MaxPow256Index && Mantissa <= OneOverPow256 [i]; i++, Pow2 *= 2); if (i <= MaxPow256Index) { Mantissa *= Pow256 [i]; } else { Mantissa *= Pow256 [i - 1]; Mantissa *= Pow256 [i - 1]; } Exp = -(int) Pow2; } while (i > 0 && Mantissa >= 256.0) { do { i--, Pow2 /= 2; } while (i > 0 && Mantissa < Pow256 [i]); Mantissa /= Pow256 [i]; Exp += (int) Pow2; } } } *SignOut = Sign; *MantissaOut = Mantissa; *ExpOut = Exp; } /* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */ unsigned EncodeReal (Real, Buf) double Real; byte Buf []; { short Sign; double Mantissa; int Exp; register unsigned CodedExp; register unsigned Len, MantissaLen; Norm256 (Real, &Sign, &Mantissa, &Exp); if (Exp >= -64 && Exp <= 63) { CodedExp = (unsigned) (Exp + 64); Buf  = (byte) CodedExp; Len = 2; } else { CodedExp = (unsigned) (Exp + MAX_EXP256); Buf  = HIGH (CodedExp) + 0200; Buf  = LOW (CodedExp); Len = 3; } for (MantissaLen = 0; MantissaLen < sizeof (double) && Mantissa != 0.0; MantissaLen++) { Buf [Len] = (byte) Mantissa; Mantissa = (Mantissa - (double) Buf [Len]) * 256.0; Len++; } Buf  = (byte) ((Sign >= 0) ? Len : Len + 0200); return Len; } /* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */ double DecodeReal (Buf) byte Buf []; { double Real; short Sign; double Mantissa; register int Exp; register unsigned CodedExp; register unsigned Len, MantissaLen; register unsigned i; if (Buf  < 0200) { Sign = 1; Len = (unsigned) Buf ; } else { Sign = -1; Len = (unsigned) Buf  - 0200; } if (Buf  < 0200) { CodedExp = (unsigned) Buf ; Exp = (int) CodedExp - 64; MantissaLen = Len - 2; } else { CodedExp = ((unsigned) Buf  - 0200) * 256 + (unsigned) Buf ; Exp = (int) CodedExp - MAX_EXP256; MantissaLen = Len - 3; } Mantissa = 0.0; for (i = 0; i < MantissaLen; i++) { Mantissa = (Mantissa * 0.00390625) + (double) Buf [Len - i - 1]; } Real = (Sign >= 0) ? Mantissa : -Mantissa; if (Exp >= 0) { if (Exp > MaxExp || (Exp == MaxExp && Mantissa > MaxMantissa)) { /* the value causes an overflow on this machine */ Real = (Sign >= 0) ? HUGE_VAL : -HUGE_VAL; errno = ERANGE; } else { for (i = 0; Exp != 0; i++) { if (ODD (Exp)) { Real *= Pow256 [i]; } Exp /= 2; } } } else { Exp = -Exp; for (i = 0; i <= MaxPow256Index && Exp != 0; i++) { if (ODD (Exp)) { Real *= OneOverPow256 [i]; } Exp /= 2; } if (Exp != 0 || Real == 0.0) { /* the value causes an underflow on this machine */ Real = 0.0; errno = ERANGE; } } return Real; } /* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */ static void Init () { short DummySign; register unsigned i; Pow256  = 256.0; OneOverPow256  = 0.00390625; for (i = 1; i < LD_MAX_EXP256 && Pow256 [i - 1] <= DBL_MAX / Pow256 [i - 1]; i++) { Pow256 [i] = SQR (Pow256 [i - 1]); OneOverPow256 [i] = SQR (OneOverPow256 [i - 1]); } MaxPow256Index = i - 1; Norm256 (DBL_MAX, &DummySign, &MaxMantissa, &MaxExp); } ```
HTMLified by Steve Hollasch, 8 June 1995