|
Floating Point Notation - Problems
| The following example tries to break a given Euro amount down into units ( bills and coins ). |
| Module Example1
    Sub Main()
        ' Available units ( bills and coins ), from largest to smallest.
        Dim units As Single() = _
            {500, 200, 100, 50, 20, 10, 5, 2, 1, 0.5, 0.2, 0.1, 0.05, 0.02, 0.01}
        Dim amount As Single = 0.06
        Console.Write(amount & " : ")
        Dim index As Integer
        Do While amount > 0
            ' Subtract the current unit as often as it fits.
            Do While amount - units(index) >= 0
                Console.Write(units(index) & " ")
                amount -= units(index)
            Loop
            index += 1
        Loop
        Console.WriteLine()
        Console.ReadLine()
    End Sub
End Module |
| An IndexOutOfRangeException ( a runtime error ) occurs at index 15. Looking at the above example, you would expect amount to reach 0 at index 14, so that index 15, which is indeed out of range, would never be reached. However, when 'index' reaches 12 and 'units(index)' evaluates to 0.05, the subtraction 0.06 - 0.05 does not yield 0.01, but 0.009999998. This is less than the last unit ( 0.01 ), so amount never reaches 0 and the outer loop runs past the end of the array.
Floating point operations can produce results that look wrong if you do not know how floating point values are stored.
These results are usually caused by the internal representation of the values. Some decimal ( base 10 ) values can never be represented exactly in a floating point datatype. Such values have to be approximated, so round-off errors can appear when operations are performed on them.
For instance, the value 1/3 can never be represented exactly in decimal ( base 10 ) notation : 0.333... Every 3 you add makes it more precise, but it will never be completely accurate. In the same way, 1/10 can never be represented exactly in binary ( base 2 ) notation : 0.00011001100110011... ( the 0011 part repeats infinitely ).
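The repeating binary expansion of 1/10 can be checked with a geometric series. The derivation below is a supplementary sketch, not part of the original text. In 0.00011001100110011... the 1 bits sit at positions 4 and 5, 8 and 9, 12 and 13, ... behind the binary point :

```latex
\frac{1}{10}
  = \sum_{k=1}^{\infty} \left( 2^{-4k} + 2^{-(4k+1)} \right)
  = \frac{3}{2} \sum_{k=1}^{\infty} 16^{-k}
  = \frac{3}{2} \cdot \frac{1}{15}
  = \frac{1}{10}
```

Cutting the series off after any finite number of bits leaves a value that differs slightly from 1/10, and that truncated ( rounded ) value is what a binary floating point datatype stores.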
Whatever base you use, there will always be values that cannot be represented exactly and completely. Irrational numbers ( numbers which cannot be expressed as a fraction ) are particularly hard to represent, for instance some square roots, the constant pi, the constant e, ... .
All rational numbers could be represented exactly if both the dividend and the divisor were stored, but irrational numbers cannot be stored this way.
No scheme with finite capacity can ever represent all decimal values exactly. It is impossible to represent an infinite number of values in a finite number of bits.
Most environments ( including .NET ) use floating point notation to represent decimal values. This is not a perfect system, but by implementing the IEEE 754 standard for floating point notation, .NET at least guarantees that standardised techniques are used to approximate values, and to perform operations on approximated values.
In the following example other strange results are produced. |
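| Module Example2
    Sub Main()
        ' Double : 0.2 is stored slightly too large, so the remainder is not 0.
        Console.WriteLine(2.0 Mod 0.2 = 0)
        Console.WriteLine(2.0 Mod 0.2)
        ' Single : 4.99 is only approximated, so the product is not exactly 84.83.
        Dim someSingle As Single = 4.99
        Console.WriteLine(someSingle * 17 = 84.83)
        Console.WriteLine(someSingle * 17)
        ' Single : 1 / 107 is rounded, so multiplying by 107 does not return 1.
        someSingle = 1 / 107.0
        Console.WriteLine(someSingle * 107 = 1)
        Console.WriteLine(someSingle * 107)
        Console.ReadLine()
    End Sub
End Module |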
| Output : False
0,2
False
84,82999
False
0,9999999 |
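One possible way to avoid these surprises for monetary amounts is the Decimal datatype, which stores base 10 digits, so values such as 0.06 and 0.05 are exact. The following is only a sketch under that assumption; the module name Example1Decimal and the helper Breakdown are hypothetical names chosen for this illustration.

```vbnet
Imports System
Imports System.Globalization

Module Example1Decimal
    ' Hypothetical helper : lists the units ( bills and coins ) that make up
    ' the amount. Decimal stores base 10 digits, so 0.06 - 0.05 is exactly 0.01.
    Function Breakdown(ByVal amount As Decimal) As String
        Dim units As Decimal() = _
            {500D, 200D, 100D, 50D, 20D, 10D, 5D, 2D, 1D, _
             0.5D, 0.2D, 0.1D, 0.05D, 0.02D, 0.01D}
        Dim result As String = ""
        ' For Each cannot run past the end of the array.
        For Each unit As Decimal In units
            ' Subtract the current unit as often as it fits.
            Do While amount >= unit
                result &= unit.ToString(CultureInfo.InvariantCulture) & " "
                amount -= unit
            Loop
        Next
        Return result.Trim()
    End Function

    Sub Main()
        Console.WriteLine("0.06 : " & Breakdown(0.06D))
        ' Two comparisons from Example2, repeated with Decimal operands.
        Console.WriteLine(2D Mod 0.2D = 0D)
        Console.WriteLine(4.99D * 17D = 84.83D)
    End Sub
End Module
```

With Decimal operands the breakdown of 0.06 should end on the 0.01 unit, and both comparisons should evaluate to True. Decimal is not a cure for everything : a value such as 1/107 has no finite base 10 expansion either, so it remains an approximation.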
Floating Point Notation - Representation
| IEEE 754 Single Precision ( like Single in .NET ) :
1 bit for the sign (s) + 8 bits for the exponent (e) + 23 bits for the mantissa (m) = 32 bits
Some remarks about the following notations :
- binary values are between square brackets ( for instance [0101] )
- symbol ~ is used for approximation
Binary format : |
| seee eeee emmm mmmm mmmm mmmm mmmm mmmm |
| Different representations are used within floating point :
- normalised
- zero ( negative and positive zero )
- subnormal ( denormalised )
- infinity ( positive and negative infinity )
- not-a-number ( NaN )
Normalised Representation :
This representation is used for most values.
General formula : |
| (-1)^[s] * [1.mmmm mmmm mmmm mmmm mmmm mmm] * 2^([eeee eeee] - 127) |
| [0] (-1)^0 = 1
or
[1] (-1)^1 = -1 |
| Exponent :
The exponent is stored as an unsigned byte value. To be able to represent small values ( with a negative exponent ), an offset ( also called bias ) of 127 is subtracted from the stored value.
Some possible representations : |
| [0000 0000] = 0 -> reserved for other representations
[0000 0001] = 1 - 127 = -126 -> minimum exponent
...
[0111 1110] = 126 - 127 = -1
[0111 1111] = 127 - 127 = 0
[1000 0000] = 128 - 127 = 1
...
[1111 1110] = 254 - 127 = 127 -> maximum exponent
[1111 1111] = 255 -> reserved for other representations |
| Exponents 0 and 255 are reserved for other representations; more about these reserved values later.
Mantissa :
The number 0,5 could be represented as 1 * 2^-1 or as 0,5 * 2^0 or as 0,25 * 2^1 or as 0,125 * 2^2 or as ... . Dividing the mantissa by 2 and adding 1 to the exponent gives the same value. In other words, one value could have several representations, and room ( read : bit patterns ) for other values would be lost. To avoid this, and to maximize the range of values that can be represented, the normalized representation maximizes the significand and minimizes the exponent. This process is called normalization.
The significand therefore always starts with [1.]. This leading 1 is not stored, so all 23 mantissa bits can be used for the digits behind it, which gives 24 significant bits in total.
Minimum value for the mantissa is : |
| [1.0000 0000 0000 0000 0000 000] or 1 |
| Maximum value for the mantissa is : |
| [1.1111 1111 1111 1111 1111 111] or 1,99999988079071044921875 or 2^1 - 2^-23 |
| The minimum normalized value uses mantissa 1 and exponent -126 : |
| 1 * 2^-126 or ~ 1,1754E-38 |
| The maximum normalized value uses mantissa 1,99999988079071044921875 and exponent 127 : |
| 1,99999988079071044921875 * 2^127 or ~ 3,4028E+38 |
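As a supplementary check of these two limits ( a worked sketch, assuming 23 stored mantissa bits and the exponent range -126 to 127 described above ), the largest mantissa is one step of 2^-23 below 2 :

```latex
\begin{aligned}
\text{minimum normalized value} &= 1 \cdot 2^{-126} \approx 1.1754 \times 10^{-38} \\
\text{maximum normalized value} &= \left( 2 - 2^{-23} \right) \cdot 2^{127}
  = 2^{128} - 2^{104} \approx 3.4028 \times 10^{38}
\end{aligned}
```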
| Some possible representations of normalized values : |
| [0000 0000 1000 0000 0000 0000 0000 0000] or ~ 1,1754E-38
-> minimum positive value
...
[0011 1111 0000 0000 0000 0000 0000 0000] or 0,5
...
[0011 1111 0001 1001 1001 1001 1001 1010] or 0,6
...
[0111 1111 0111 1111 1111 1111 1111 1111] or ~ 3,4028E+38
-> maximum positive value -> 'Single.MaxValue'
...
[1000 0000 1000 0000 0000 0000 0000 0000] or ~ -1,1754E-38
-> minimum negative value
...
[1111 1111 0111 1111 1111 1111 1111 1111] or ~ -3,4028E+38
-> maximum negative value -> 'Single.MinValue' |
| 0,5 is represented with sign +1 ( or [0] ), mantissa 1 ( or [1.0000 0000 0000 0000 0000 000] ) and exponent -1 ( or [0111 1110] ), or 1 * 1 * 2^-1.
0,6 is represented with sign +1 ( or [0] ), mantissa ~ 1,2000000477 ( or [1.0011 0011 0011 0011 0011 010] ) and exponent -1 ( or [0111 1110] ), or 1 * ~ 1,2000000477 * 2^-1, which is ~ 0,6.
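The decomposition of 0,5 and 0,6 above can be reproduced in code. The following is a minimal sketch, not part of the original examples : the module DecodeSingle and the helpers RawBits, SignBit, StoredExponent, Fraction and Rebuild are hypothetical names. It reads the 32 bits through BitConverter and applies the general formula for normalized values.

```vbnet
Imports System

Module DecodeSingle
    ' Hypothetical helper : the 32 bits of a Single as an Integer.
    Function RawBits(ByVal value As Single) As Integer
        Return BitConverter.ToInt32(BitConverter.GetBytes(value), 0)
    End Function

    ' 1 bit for the sign.
    Function SignBit(ByVal value As Single) As Integer
        Return (RawBits(value) >> 31) And 1
    End Function

    ' 8 bits for the exponent, as stored ( the bias of 127 is not yet subtracted ).
    Function StoredExponent(ByVal value As Single) As Integer
        Return (RawBits(value) >> 23) And &HFF
    End Function

    ' 23 bits for the mantissa ( the digits behind the implicit [1.] ).
    Function Fraction(ByVal value As Single) As Integer
        Return RawBits(value) And &H7FFFFF
    End Function

    ' Applies (-1)^s * 1.m * 2^(e - 127); only valid for normalized values.
    Function Rebuild(ByVal value As Single) As Double
        Dim significand As Double = 1.0 + Fraction(value) / 2.0 ^ 23
        Dim magnitude As Double = significand * 2.0 ^ (StoredExponent(value) - 127)
        If SignBit(value) = 1 Then Return -magnitude
        Return magnitude
    End Function

    Sub Main()
        Console.WriteLine(SignBit(0.6F))                 ' sign bit [0]
        Console.WriteLine(StoredExponent(0.6F) - 127)    ' exponent -1
        Console.WriteLine(Fraction(0.6F))                ' mantissa bits as integer
        Console.WriteLine(Rebuild(0.6F) = CDbl(0.6F))    ' formula gives the stored value
    End Sub
End Module
```

Rebuild works in Double, which can hold every Single value exactly, so the comparison in the last line is a comparison of exact values and not of approximations.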
Representation of Zero :
How is zero represented? In the normalized representation the mantissa is at least 1 and a power of 2 is never zero, so the result ( the product of both ) can never be zero.
For zero two representations are reserved, one for positive zero and one for negative zero.
In the representation of zero, all bits of both the mantissa and the exponent are 0. |
| [0000 0000 0000 0000 0000 0000 0000 0000] -> +0
[1000 0000 0000 0000 0000 0000 0000 0000] -> -0 |
| Subnormal ( Denormalized ) Representation :
General formula : |
| (-1)^[s] * [0.mmmm mmmm mmmm mmmm mmmm mmm] * 2^-126 |
| These representations are used to represent very small values.
Exponent :
The exponent is always [0000 0000] or 0, this 0 has no meaning ( except for being part of this representation ). The value used for the exponent in this representation is always -126 ( equal to the minimum exponent in the normalized representation ).
Mantissa :
The mantissa is not normalized. A prefix [0.] is always presumed.
The minimum mantissa is : |
| [0.0000 0000 0000 0000 0000 001] or 0,00000011920928955078125 or 2^-23 |
| The maximum mantissa is : |
| [0.1111 1111 1111 1111 1111 111] or 0,99999988079071044921875 or 2^0 - 2^-23 |
| The minimum denormalized value is : |
| 0,00000011920928955078125 * 2^-126 or ~ 1,4012E-45 |
| The maximum denormalized value is : |
| 0,99999988079071044921875 * 2^-126 or ~ 1,1754E-38 |
| [0000 0000 0000 0000 0000 0000 0000 0001] or ~ 1,4012E-45
-> minimum positive value -> 'Single.Epsilon'
...
[0000 0000 0111 1111 1111 1111 1111 1111] or ~ 1,1754E-38
-> maximum positive value
...
[1000 0000 0000 0000 0000 0000 0000 0001] or ~ -1,4012E-45
-> minimum negative value
...
[1000 0000 0111 1111 1111 1111 1111 1111] or ~ -1,1754E-38
-> maximum negative value |
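The two subnormal limits can be worked out as follows ( a supplementary sketch, assuming 23 mantissa bits and the fixed exponent -126 described above ) :

```latex
\begin{aligned}
\text{minimum subnormal value} &= 2^{-23} \cdot 2^{-126} = 2^{-149} \approx 1.4012 \times 10^{-45} \\
\text{maximum subnormal value} &= \left( 1 - 2^{-23} \right) \cdot 2^{-126}
  = 2^{-126} - 2^{-149} \approx 1.1754 \times 10^{-38}
\end{aligned}
```

The maximum subnormal value therefore lies exactly one step of 2^-149 below the minimum normalized value 2^-126, so the subnormal values fill the gap between zero and the normalized range in equal steps.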
| Representation of Infinities :
Exponent :
The exponent is always [1111 1111] or 255, this 255 has no meaning ( except for being part of this representation ).
Mantissa :
The mantissa is always [000 0000 0000 0000 0000 0000] or 0, this 0 has no meaning ( except for being part of this representation ).
Sign :
The sign bit indicates positive or negative infinity. |
| [0111 1111 1000 0000 0000 0000 0000 0000] -> 'Single.PositiveInfinity'
[1111 1111 1000 0000 0000 0000 0000 0000] -> 'Single.NegativeInfinity' |
| Module Example3
    Public Sub Main()
        Console.WriteLine("seeeeeeeemmmmmmmmmmmmmmmmmmmmmmm")
        Console.WriteLine(GetBinary(1.17549435E-38F) & " : " & _
                          1.17549435E-38F.ToString())
        Console.WriteLine(GetBinary(0.5F) & " : " & 0.5F.ToString())
        Console.WriteLine(GetBinary(0.6F) & " : " & 0.6F.ToString())
        Console.WriteLine(GetBinary(Single.MaxValue) & " : " & _
                          Single.MaxValue.ToString())
        Console.WriteLine(GetBinary(-1.17549435E-38F) & " : " & _
                          -1.17549435E-38F.ToString())
        Console.WriteLine(GetBinary(Single.MinValue) & " : " & _
                          Single.MinValue.ToString())
        Console.WriteLine(GetBinary(0.0F) & " : " & 0.0F.ToString())
        Console.WriteLine(GetBinary(-0.0F) & " : " & -0.0F.ToString())
        Console.WriteLine(GetBinary(Single.Epsilon) & " : " & _
                          Single.Epsilon.ToString())
        Console.WriteLine(GetBinary(-1.401298E-45F) & " : " & _
                          -1.401298E-45F.ToString())
        Console.WriteLine(GetBinary(Single.PositiveInfinity) & " : " & _
                          Single.PositiveInfinity.ToString())
        Console.WriteLine(GetBinary(Single.NegativeInfinity) & " : " & _
                          Single.NegativeInfinity.ToString())
        Console.ReadLine()
    End Sub
    ' Returns the 8 bits of a byte, most significant bit first.
    Public Function GetBinary(ByVal value As Byte) As String
        For counter As Integer = 1 To 8
            GetBinary = (value Mod 2).ToString() & GetBinary
            value >>= 1
        Next
    End Function
    ' Returns the 32 bits of a Single, most significant bit first.
    Public Function GetBinary(ByVal value As Single) As String
        If BitConverter.IsLittleEndian Then
            ' The least significant byte comes first, so each byte is prepended.
            For Each byteElement As Byte In BitConverter.GetBytes(value)
                GetBinary = GetBinary(byteElement) & GetBinary
            Next
        Else
            Throw New ApplicationException("Only Little Endian supported.")
        End If
    End Function
End Module |
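| Output ( produced on a Dutch-language system : the decimal separator is a comma and 'oneindig' means infinity ) :
seeeeeeeemmmmmmmmmmmmmmmmmmmmmmm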
00000000100000000000000000000000 : 1,175494E-38
00111111000000000000000000000000 : 0,5
00111111000110011001100110011010 : 0,6
01111111011111111111111111111111 : 3,402823E+38
10000000100000000000000000000000 : -1,175494E-38
11111111011111111111111111111111 : -3,402823E+38
00000000000000000000000000000000 : 0
10000000000000000000000000000000 : 0
00000000000000000000000000000001 : 1,401298E-45
10000000000000000000000000000001 : -1,401298E-45
01111111100000000000000000000000 : oneindig
11111111100000000000000000000000 : -oneindig |
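The bit patterns printed above can also be used to tell the different representations apart. The sketch below is an illustration only : the module ClassifySingle and the function Classify are hypothetical names. It relies on the reserved exponents 0 and 255, and on the IEEE 754 rule, not spelled out above, that exponent 255 combined with a non-zero mantissa denotes NaN.

```vbnet
Imports System

Module ClassifySingle
    ' Hypothetical helper : names the representation used for a Single,
    ' based only on the stored exponent and mantissa bits.
    Function Classify(ByVal value As Single) As String
        Dim bits As Integer = BitConverter.ToInt32(BitConverter.GetBytes(value), 0)
        Dim storedExponent As Integer = (bits >> 23) And &HFF
        Dim fraction As Integer = bits And &H7FFFFF
        If storedExponent = 0 Then
            ' Exponent bits all 0 : zero or subnormal.
            If fraction = 0 Then Return "zero"
            Return "subnormal"
        ElseIf storedExponent = 255 Then
            ' Exponent bits all 1 : infinity or NaN.
            If fraction = 0 Then Return "infinity"
            Return "NaN"
        End If
        Return "normalised"
    End Function

    Sub Main()
        Dim samples As Single() = {0.5F, 0.0F, -0.0F, Single.Epsilon, _
                                   Single.NegativeInfinity, Single.NaN}
        For Each sample As Single In samples
            Console.WriteLine(Classify(sample))
        Next
    End Sub
End Module
```

The sign bit is ignored on purpose, so positive and negative zero ( and positive and negative infinity ) fall into the same class.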
| Operations on Zero, NaN and Infinity :
According to the IEEE 754 standard, operations on special values ( zero, NaN and infinity ) have specific results.
Every arithmetic operation with a NaN operand results in NaN.
Other operations are illustrated by the following example : |
| Module Example4
    Sub Main()
        Dim singleOperands As Single() = {Single.PositiveInfinity, _
                                          Single.NegativeInfinity, _
                                          123.0F, -123.0F, 0.0F, -0.0F}
        Dim operatorSymbols As String() = {"*", "/", "+", "-"}
        ' Print every combination of operator and operands.
        For Each operatorSymbol As String In operatorSymbols
            Console.WriteLine("OPERATOR " & operatorSymbol.ToString())
            Console.WriteLine()
            For Each singleOperand1 As Single In singleOperands
                For Each singleOperand2 As Single In singleOperands
                    PrintCalculation(singleOperand1, operatorSymbol, _
                                     singleOperand2)
                Next
                Console.WriteLine()
            Next
            Console.WriteLine()
        Next
        Console.ReadLine()
    End Sub
    Sub PrintCalculation(ByVal operand1 As Single, _
                         ByVal operatorSymbol As String, _
                         ByVal operand2 As Single)
        Console.Write(GetString(operand1) & " " & operatorSymbol & " " & _
                      GetString(operand2) & " = ")
        Select Case operatorSymbol
            Case "*"
                Console.WriteLine(GetString(operand1 * operand2))
            Case "/"
                Console.WriteLine(GetString(operand1 / operand2))
            Case "+"
                Console.WriteLine(GetString(operand1 + operand2))
            Case "-"
                Console.WriteLine(GetString(operand1 - operand2))
        End Select
    End Sub
    ' Formats a value so that signed zeros, infinities and NaN are recognisable.
    Function GetString(ByVal value As Single) As String
        If IsPositiveZero(value) Then
            GetString = "+0"
        ElseIf IsNegativeZero(value) Then
            GetString = "-0"
        ElseIf Single.IsNegativeInfinity(value) Then
            GetString = "-Infinity"
        ElseIf Single.IsPositiveInfinity(value) Then
            GetString = "+Infinity"
        ElseIf Single.IsNaN(value) Then
            GetString = "NaN"
        Else
            GetString = value.ToString()
        End If
    End Function
    ' +0 : all 32 bits are 0.
    Public Function IsPositiveZero(ByVal value As Single) As Boolean
        If BitConverter.GetBytes(value)(0) = 0 AndAlso _
           BitConverter.GetBytes(value)(1) = 0 AndAlso _
           BitConverter.GetBytes(value)(2) = 0 AndAlso _
           BitConverter.GetBytes(value)(3) = 0 Then _
            IsPositiveZero = True
    End Function
    ' -0 : only the sign bit is 1 ( byte 3 holds it on a little-endian system ).
    Public Function IsNegativeZero(ByVal value As Single) As Boolean
        If BitConverter.GetBytes(value)(0) = 0 AndAlso _
           BitConverter.GetBytes(value)(1) = 0 AndAlso _
           BitConverter.GetBytes(value)(2) = 0 AndAlso _
           BitConverter.GetBytes(value)(3) = 128 Then _
            IsNegativeZero = True
    End Function
End Module |
| Output : OPERATOR *
+Infinity * +Infinity = +Infinity
+Infinity * -Infinity = -Infinity
+Infinity * 123 = +Infinity
+Infinity * -123 = -Infinity
+Infinity * +0 = NaN
+Infinity * -0 = NaN
-Infinity * +Infinity = -Infinity
-Infinity * -Infinity = +Infinity
-Infinity * 123 = -Infinity
-Infinity * -123 = +Infinity
-Infinity * +0 = NaN
-Infinity * -0 = NaN
123 * +Infinity = +Infinity
123 * -Infinity = -Infinity
123 * 123 = 15129
123 * -123 = -15129
123 * +0 = +0
123 * -0 = -0
-123 * +Infinity = -Infinity
-123 * -Infinity = +Infinity
-123 * 123 = -15129
-123 * -123 = 15129
-123 * +0 = -0
-123 * -0 = +0
+0 * +Infinity = NaN
+0 * -Infinity = NaN
+0 * 123 = +0
+0 * -123 = -0
+0 * +0 = +0
+0 * -0 = -0
-0 * +Infinity = NaN
-0 * -Infinity = NaN
-0 * 123 = -0
-0 * -123 = +0
-0 * +0 = -0
-0 * -0 = +0
OPERATOR /
+Infinity / +Infinity = NaN
+Infinity / -Infinity = NaN
+Infinity / 123 = +Infinity
+Infinity / -123 = -Infinity
+Infinity / +0 = +Infinity
+Infinity / -0 = -Infinity
-Infinity / +Infinity = NaN
-Infinity / -Infinity = NaN
-Infinity / 123 = -Infinity
-Infinity / -123 = +Infinity
-Infinity / +0 = -Infinity
-Infinity / -0 = +Infinity
123 / +Infinity = +0
123 / -Infinity = -0
123 / 123 = 1
123 / -123 = -1
123 / +0 = +Infinity
123 / -0 = -Infinity
-123 / +Infinity = -0
-123 / -Infinity = +0
-123 / 123 = -1
-123 / -123 = 1
-123 / +0 = -Infinity
-123 / -0 = +Infinity
+0 / +Infinity = +0
+0 / -Infinity = -0
+0 / 123 = +0
+0 / -123 = -0
+0 / +0 = NaN
+0 / -0 = NaN
-0 / +Infinity = -0
-0 / -Infinity = +0
-0 / 123 = -0
-0 / -123 = +0
-0 / +0 = NaN
-0 / -0 = NaN
OPERATOR +
+Infinity + +Infinity = +Infinity
+Infinity + -Infinity = NaN
+Infinity + 123 = +Infinity
+Infinity + -123 = +Infinity
+Infinity + +0 = +Infinity
+Infinity + -0 = +Infinity
-Infinity + +Infinity = NaN
-Infinity + -Infinity = -Infinity
-Infinity + 123 = -Infinity
-Infinity + -123 = -Infinity
-Infinity + +0 = -Infinity
-Infinity + -0 = -Infinity
123 + +Infinity = +Infinity
123 + -Infinity = -Infinity
123 + 123 = 246
123 + -123 = +0
123 + +0 = 123
123 + -0 = 123
-123 + +Infinity = +Infinity
-123 + -Infinity = -Infinity
-123 + 123 = +0
-123 + -123 = -246
-123 + +0 = -123
-123 + -0 = -123
+0 + +Infinity = +Infinity
+0 + -Infinity = -Infinity
+0 + 123 = 123
+0 + -123 = -123
+0 + +0 = +0
+0 + -0 = +0
-0 + +Infinity = +Infinity
-0 + -Infinity = -Infinity
-0 + 123 = 123
-0 + -123 = -123
-0 + +0 = +0
-0 + -0 = -0
OPERATOR -
+Infinity - +Infinity = NaN
+Infinity - -Infinity = +Infinity
+Infinity - 123 = +Infinity
+Infinity - -123 = +Infinity
+Infinity - +0 = +Infinity
+Infinity - -0 = +Infinity
-Infinity - +Infinity = -Infinity
-Infinity - -Infinity = NaN
-Infinity - 123 = -Infinity
-Infinity - -123 = -Infinity
-Infinity - +0 = -Infinity
-Infinity - -0 = -Infinity
123 - +Infinity = -Infinity
123 - -Infinity = +Infinity
123 - 123 = +0
123 - -123 = 246
123 - +0 = 123
123 - -0 = 123
-123 - +Infinity = -Infinity
-123 - -Infinity = +Infinity
-123 - 123 = -246
-123 - -123 = +0
-123 - +0 = -123
-123 - -0 = -123
+0 - +Infinity = -Infinity
+0 - -Infinity = +Infinity
+0 - 123 = -123
+0 - -123 = 123
+0 - +0 = +0
+0 - -0 = +0
-0 - +Infinity = -Infinity
-0 - -Infinity = +Infinity
-0 - 123 = -123
-0 - -123 = 123
-0 - +0 = -0
-0 - -0 = +0 |
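The operands in the output above do not include NaN itself. The rule that every arithmetic operation with a NaN operand results in NaN can be checked with a small sketch like the following; the module NaNPropagation and the function AlwaysNaN are hypothetical names used only for this illustration.

```vbnet
Imports System

Module NaNPropagation
    ' Hypothetical helper : True when *, /, + and - all return NaN
    ' as soon as one of the two operands is NaN.
    Function AlwaysNaN(ByVal operand As Single) As Boolean
        Dim nan As Single = Single.NaN
        Return Single.IsNaN(nan * operand) AndAlso _
               Single.IsNaN(nan / operand) AndAlso _
               Single.IsNaN(operand / nan) AndAlso _
               Single.IsNaN(nan + operand) AndAlso _
               Single.IsNaN(nan - operand) AndAlso _
               Single.IsNaN(operand - nan)
    End Function

    Sub Main()
        ' Same operands as in Example4, plus NaN itself.
        Dim operands As Single() = {Single.PositiveInfinity, _
                                    Single.NegativeInfinity, _
                                    123.0F, -123.0F, 0.0F, -0.0F, Single.NaN}
        For Each operand As Single In operands
            Console.WriteLine(AlwaysNaN(operand))
        Next
    End Sub
End Module
```

Single.IsNaN is used for the check because a NaN value does not compare equal to anything, not even to itself.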
|
|
|