Floating point representation of real numbers.

Real number representation:

As we saw earlier, integers can be represented as combinations of binary digits. The same is true of real numbers, but the representation is a bit different. There are several techniques for representing them, and the most efficient and useful one is floating point representation. The reasons are described below:

Why:

To understand why floating point representation is better, we first need to know what the alternatives are. Here they are:

Treat a real number as a combination of two integers.
Fixed-point representation with a sign bit.

But these have some disadvantages:

First: Storing two integers is not a natural fit. We need extra memory to keep track of the separate parts, and every operation has to handle both parts, which adds overhead.
Second: A fixed decimal point limits the range of numbers that can be represented, and therefore the calculations that can be done. We have to compromise on accuracy. Floating point also has limited accuracy, but it does much better than fixed point.
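
To make the second point concrete, here is a minimal Python sketch of a fixed-point scheme. The scale of two decimal places, the 16-bit signed storage, and the names SCALE, to_fixed and from_fixed are assumptions chosen for illustration, not something the text above prescribes.

```python
SCALE = 100                            # assumed: two digits after the decimal point
INT16_MIN, INT16_MAX = -32768, 32767   # assumed: 16-bit signed storage


def to_fixed(x):
    """Store x as an integer count of hundredths."""
    n = round(x * SCALE)
    if not INT16_MIN <= n <= INT16_MAX:
        raise OverflowError(f"{x} does not fit in this fixed-point format")
    return n


def from_fixed(n):
    """Recover the real value from the stored integer."""
    return n / SCALE


print(from_fixed(to_fixed(123.45)))   # inside the range, the value survives
print(from_fixed(to_fixed(0.004)))    # smaller than one hundredth, stored as zero
```

In this sketch anything smaller than 0.01 is lost and anything above 327.67 overflows, because the position of the point never changes. A floating point format moves the point to suit each number instead.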

How:

Normalize the number.
The number can now be separated into two parts: the fractional part, called the mantissa, and the exponent. The number can then be represented as:

sign | exponent | mantissa

The above is a simple example of an 8-bit representation; there are also 32-bit and 64-bit representations. Different standards define how the representation is implemented, i.e. the number of bits for the mantissa, the number of bits for the exponent, and so on. We will see the algorithm for our example.

IEEE 754 defines, among others, two formats for real numbers:

Single precision (32-bit) floating point representation:
Sign: bit 1 (1 bit)
Exponent: bits 2-9 (8 bits)
Mantissa: bits 10-32 (23 bits)

Double precision (64-bit) floating point representation:
Sign: bit 1 (1 bit)
Exponent: bits 2-12 (11 bits)
Mantissa: bits 13-64 (52 bits)
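
The field layout above can be checked with a short Python sketch that uses the standard struct module to look at the raw bits. The function names float32_fields and float64_fields are made up for this example. The exponent values it returns are the stored (biased) ones, with a bias of 127 for single precision and 1023 for double precision.

```python
import struct


def float32_fields(x):
    """Split x, rounded to single precision, into (sign, exponent, mantissa)."""
    # Pack as a big-endian 32-bit float, then read the same 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                      # bit 1
    exponent = (bits >> 23) & 0xFF         # bits 2-9, stored with bias 127
    mantissa = bits & ((1 << 23) - 1)      # bits 10-32
    return sign, exponent, mantissa


def float64_fields(x):
    """Split a double precision x into (sign, exponent, mantissa)."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                      # bit 1
    exponent = (bits >> 52) & 0x7FF        # bits 2-12, stored with bias 1023
    mantissa = bits & ((1 << 52) - 1)      # bits 13-64
    return sign, exponent, mantissa


print(float32_fields(1.0))    # (0, 127, 0)
print(float32_fields(-2.0))   # (1, 128, 0)
print(float64_fields(1.0))    # (0, 1023, 0)
```

The mantissa comes out as 0 for 1.0 and -2.0 because IEEE 754 does not store the leading 1 of a normalized binary number; only the digits after the point are kept.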

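Let us represent 1100.011:
Read n -> 1100.011
Normalize -> 1.100011 E 00110101

In the above representation, excess-50 notation is used: the decimal point is moved 3 places to the left, and the exponent 3 is stored as 3 + 50 = 53, which is 00110101 in binary. To understand how to perform normalization, you can follow the link in the references.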

Normalization:

A normalized number has two parts, the mantissa and the exponent. For our example, we will use a 16-bit representation. For the mantissa, shift the decimal point left or right until the digit to its right is non-zero and only a zero is on its left, then fill the remaining bits with zeros.
The number of shifts gives the exponent: a shift of n places to the left is written as E+n, and a shift of n places to the right is written as E-n.
Example:
100.011
Mantissa: 0.1000110
Exponent: E+3, or E+00000011 (binary)
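
The shift counting in this example can be sketched in Python with the standard decimal module. This is an illustration under some assumptions: the digits are treated as base-10 digits, the mantissa is cut or zero-padded to 7 digits to match the example, the sign is left to be handled separately, and the name normalize and its width parameter are invented here.

```python
from decimal import Decimal


def normalize(text, width=7):
    """Return (mantissa, shifts) so that the value is mantissa * 10**shifts.

    The mantissa is a string of the form 0.d1d2... with d1 non-zero.
    Only finite inputs are handled, and the sign is ignored.
    """
    _sign, digits, exp = Decimal(text).as_tuple()
    if not any(digits):                       # zero has no non-zero digit to align on
        return "0." + "0" * width, 0
    # Placing the point in front of all significant digits takes this many left shifts
    # (a negative count means the point moved to the right).
    shifts = len(digits) + exp
    mantissa = "".join(str(d) for d in digits)
    mantissa = mantissa[:width].ljust(width, "0")   # truncate or fill with zeros
    return "0." + mantissa, shifts


print(normalize("100.011"))   # ('0.1000110', 3)
print(normalize("0.0025"))    # ('0.2500000', -2)
```

A positive shift count corresponds to E+n and a negative one to E-n in the notation above.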

Why normalization:

As we have seen, the computer represents real numbers differently from the way we write them: it stores a combination of bits. Written our way, every real number has a different shape, because the decimal point can be anywhere in the number. The computer has no notion of what the decimal point means, so without some convention it cannot produce the required results when operating on these numbers. To perform operations on them, the numbers need a definite structure. By normalizing a number, we give the computer a fixed structure in which to represent it in a standard way.

Algorithm:

Read m, e
if MSB = 0
    sign = '+'
else
    sign = '-'
m = |m|
while m >= 1
    m = m/10
    e = e + 1
while m < 0.1 and m != 0
    m = m*10
    e = e - 1
write sign, e, m
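
The two loops can be tried out with the Python sketch below. It assumes the goal is a mantissa in the range 0.1 <= m < 1, as in the normalization example, and it takes the sign from the value itself rather than from a stored sign bit. Decimal is used so that dividing by 10 is exact. The function name normalize_loop is made up for this example.

```python
from decimal import Decimal


def normalize_loop(m, e=0):
    """Return (sign, e, m) with 0.1 <= m < 1, or m == 0 for a zero input."""
    m = Decimal(str(m))
    sign = "-" if m < 0 else "+"
    m = abs(m)                      # the loops below work on the magnitude only
    while m >= 1:                   # too large: move the point one place left
        m /= 10
        e += 1
    while m != 0 and m < Decimal("0.1"):   # too small: move the point one place right
        m *= 10
        e -= 1
    return sign, e, m


print(normalize_loop("100.011"))    # sign '+', exponent 3, mantissa 0.100011
print(normalize_loop("-0.0025"))    # sign '-', exponent -2, mantissa 0.25
```

The check for zero matters: without it the second loop would never end, because multiplying zero by 10 never reaches 0.1.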

Excess-50:

A major problem with the above notation is accuracy. The exponent needs a sign, but that sign takes up a bit, which reduces accuracy. To avoid this, instead of storing the sign of the exponent, we add 50 to the exponent when we store the number and subtract 50 when we retrieve it. In the example above, the exponent 3 is stored as 53. In this way we can represent the exponent without a separate sign. The stored value must itself be non-negative, so the range of exponents is limited, but that is the price of accuracy. The 32-bit and 64-bit representations use different biases: the 32-bit format uses bias-127 notation, and the 64-bit format uses a bias of 1023.
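
The add-and-subtract step can be written as a small Python sketch. It assumes the stored exponent occupies two decimal digits (00 to 99), which gives real exponents from -50 to 49; that width, and the names encode_exponent and decode_exponent, are assumptions made for this illustration.

```python
BIAS = 50   # excess-50


def encode_exponent(e, bias=BIAS, digits=2):
    """Add the bias so that the stored exponent is never negative."""
    stored = e + bias
    if not 0 <= stored < 10 ** digits:
        raise ValueError(f"exponent {e} is out of range for bias {bias}")
    return stored


def decode_exponent(stored, bias=BIAS):
    """Subtract the bias to get the real exponent back."""
    return stored - bias


stored = encode_exponent(3)
print(stored, format(stored, "08b"))   # 53 00110101
print(decode_exponent(stored))         # 3
print(encode_exponent(-2))             # 48, a negative exponent stored without a sign
```

The same idea with bias=127 and an 8-bit field in place of two decimal digits gives the exponent encoding of the 32-bit format.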

References:

Normalized_number
My Video-lecture