
A Tutorial on Data Representation

Integers, Floating-Point Numbers, and Characters

Number Systems

Human beings use decimal (base 10) and duodecimal (base 12) number systems for counting and measurements (probably because we have 10 fingers and 2 big toes). Computers use the binary (base 2) number system, as they are made from binary digital components (known as transistors) operating in two states - on and off. In computing, we also use hexadecimal (base 16) or octal (base 8) number systems, as a compact form for representing binary numbers.

Decimal (Base 10) Number System

Decimal number system has ten symbols: 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9, called digits. It uses positional notation. That is, the least-significant digit (right-most digit) is of the order of 10^0 (units or ones), the second right-most digit is of the order of 10^1 (tens), the third right-most digit is of the order of 10^2 (hundreds), and so on, where ^ denotes exponent. For example,

735 = 700 + 30 + 5 = 7×10^2 + 3×10^1 + 5×10^0

We shall denote a decimal number with an optional suffix D if ambiguity arises.

Binary (Base 2) Number System

Binary number system has two symbols: 0 and 1, called bits. It is also a positional notation, for example,

10110B = 10000B + 0000B + 100B + 10B + 0B = 1×2^4 + 0×2^3 + 1×2^2 + 1×2^1 + 0×2^0

We shall denote a binary number with a suffix B. Some programming languages denote binary numbers with prefix 0b or 0B (e.g., 0b1001000), or prefix b with the bits quoted (e.g., b'10001111').

A binary digit is called a bit. Eight bits is called a byte (why 8-bit unit? Probably because 8=2^3).

Hexadecimal (Base 16) Number System

Hexadecimal number system uses 16 symbols: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, and F, called hex digits. It is a positional notation, for example,

A3EH = A00H + 30H + EH = 10×16^2 + 3×16^1 + 14×16^0

We shall denote a hexadecimal number (in short, hex) with a suffix H. Some programming languages denote hex numbers with prefix 0x or 0X (e.g., 0x1A3C5F), or prefix x with hex digits quoted (e.g., x'C3A4D98B').

Each hexadecimal digit is also called a hex digit. Most programming languages accept lowercase 'a' to 'f' as well as uppercase 'A' to 'F'.

Computers use the binary system in their internal operations, as they are built from binary digital electronic components with two states - on and off. However, writing or reading a long sequence of binary bits is cumbersome and error-prone (try to read this binary string: 1011 0011 0100 0011 0001 1101 0001 1000B, which is the same as hexadecimal B343 1D18H). The hexadecimal system is used as a compact form or shorthand for binary bits. Each hex digit is equivalent to 4 binary bits, i.e., shorthand for 4 bits, as follows:

Hexadecimal Binary Decimal
0 0000 0
1 0001 1
2 0010 2
3 0011 3
4 0100 4
5 0101 5
6 0110 6
7 0111 7
8 1000 8
9 1001 9
A 1010 10
B 1011 11
C 1100 12
D 1101 13
E 1110 14
F 1111 15
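These conversions can be cross-checked with Java's built-in radix support (Integer.parseInt with a radix, Integer.toBinaryString, Integer.toHexString). A minimal sketch; the class name is illustrative, and note that Java prints hex in lowercase and drops leading zeros:

```java
// Cross-checking hex/binary shorthand with Java's built-in radix conversions.
public class HexBinaryCheck {
    public static void main(String[] args) {
        int value = Integer.parseInt("A3C5", 16);               // parse hex digits
        System.out.println(Integer.toBinaryString(value));      // 1010001111000101
        int fromBits = Integer.parseInt("1010001111000101", 2); // parse binary digits
        System.out.println(Integer.toHexString(fromBits));      // a3c5 (lowercase, no suffix)
    }
}
```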

Conversion from Hexadecimal to Binary

Replace each hex digit by the four equivalent bits (as listed in the above table), for examples,

A3C5H = 1010 0011 1100 0101B
102AH = 0001 0000 0010 1010B

Conversion from Binary to Hexadecimal

Starting from the right-most bit (least-significant bit), replace each group of four bits by the equivalent hex digit (pad the left-most bits with zeros if necessary), for examples,

1001001010B = 0010 0100 1010B = 24AH
10001011001011B = 0010 0010 1100 1011B = 22CBH

It is important to note that a hexadecimal number provides a compact form or shorthand for representing binary bits.

Conversion from Base r to Decimal (Base 10)

Given an n-digit base r number: d(n-1) d(n-2) d(n-3) ... d2 d1 d0 (base r), the decimal equivalent is given by:

d(n-1)×r^(n-1) + d(n-2)×r^(n-2) + ... + d1×r^1 + d0×r^0

For examples,

A1C2H = 10×16^3 + 1×16^2 + 12×16^1 + 2 = 41410 (base 10)
10110B = 1×2^4 + 1×2^2 + 1×2^1 = 22 (base 10)
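The positional formula can be sketched as a loop in Java (class and method names are illustrative; Character.digit maps '0'-'9' and 'A'-'F'/'a'-'f' to their values):

```java
// Convert a base-r digit string to decimal using the positional formula
// d(n-1)×r^(n-1) + ... + d1×r^1 + d0×r^0.
public class BaseToDecimal {
    public static int toDecimal(String digits, int radix) {
        int value = 0;
        for (char c : digits.toCharArray()) {
            // Multiplying the running value by the radix shifts every digit
            // collected so far one position left; then the new digit is added.
            value = value * radix + Character.digit(c, radix);
        }
        return value;
    }
    public static void main(String[] args) {
        System.out.println(toDecimal("A1C2", 16)); // 41410
        System.out.println(toDecimal("10110", 2)); // 22
    }
}
```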

Conversion from Decimal (Base 10) to Base r

Use repeated division/remainder. For example,

To convert 261(base 10) to hexadecimal:
  261/16 => quotient=16 remainder=5
  16/16  => quotient=1  remainder=0
  1/16   => quotient=0  remainder=1 (quotient=0 stop)
  Hence, 261D = 105H (Collect the hex digits from the remainders in reverse order)

The above procedure is actually applicable to conversion between any two base systems. For example,

To convert 1023(base 4) to base 3:
  1023(base 4)/3 => quotient=25D remainder=0
  25D/3          => quotient=8D  remainder=1
  8D/3           => quotient=2D  remainder=2
  2D/3           => quotient=0   remainder=2 (quotient=0 stop)
  Hence, 1023(base 4) = 2210(base 3)
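The repeated division/remainder procedure can be sketched as follows (names are illustrative); the remainders are appended as they are produced and reversed at the end:

```java
// Convert a non-negative decimal value to base r by repeated division,
// collecting the remainders in reverse order.
public class DecimalToBase {
    public static String toBaseR(int value, int radix) {
        if (value == 0) return "0";
        StringBuilder digits = new StringBuilder();
        while (value > 0) {
            digits.append(Character.forDigit(value % radix, radix)); // collect remainder
            value /= radix;                                          // continue with quotient
        }
        return digits.reverse().toString();
    }
    public static void main(String[] args) {
        System.out.println(toBaseR(261, 16)); // 105 (i.e., 105H)
        System.out.println(toBaseR(75, 3));   // 2210 (75D is 1023 in base 4)
    }
}
```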

Conversion between Two Number Systems with Fractional Part

  1. Separate the integral and the fractional parts.
  2. For the integral part, divide by the target radix repeatedly, and collect the remainders in reverse order.
  3. For the fractional part, multiply the fractional part by the target radix repeatedly, and collect the integral parts in the same order.

Example 1: Decimal to Binary

Convert 18.6875D to binary
Integral Part = 18D
  18/2 => quotient=9 remainder=0
  9/2  => quotient=4 remainder=1
  4/2  => quotient=2 remainder=0
  2/2  => quotient=1 remainder=0
  1/2  => quotient=0 remainder=1 (quotient=0 stop)
  Hence, 18D = 10010B
Fractional Part = .6875D
  .6875*2=1.375 => whole number is 1
  .375*2=0.75   => whole number is 0
  .75*2=1.5     => whole number is 1
  .5*2=1.0      => whole number is 1
  Hence .6875D = .1011B
Combine, 18.6875D = 10010.1011B

Example 2: Decimal to Hexadecimal

Convert 18.6875D to hexadecimal
Integral Part = 18D
  18/16 => quotient=1 remainder=2
  1/16  => quotient=0 remainder=1 (quotient=0 stop)
  Hence, 18D = 12H
Fractional Part = .6875D
  .6875*16=11.0 => whole number is 11D (BH)
  Hence .6875D = .BH
Combine, 18.6875D = 12.BH
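The fractional-part rule (multiply by the target radix repeatedly, collect the integral parts in order) can be sketched like this; names are illustrative, and a digit limit guards against fractions that never terminate in the target base:

```java
// Convert a fraction (0 <= fraction < 1) to its digits in base r by
// repeated multiplication, collecting the whole-number parts in order.
public class FractionToBase {
    public static String fractionDigits(double fraction, int radix, int maxDigits) {
        StringBuilder digits = new StringBuilder();
        for (int i = 0; i < maxDigits && fraction > 0; i++) {
            fraction *= radix;
            int whole = (int) fraction;                      // integral part is the next digit
            digits.append(Character.forDigit(whole, radix));
            fraction -= whole;                               // keep only the fractional part
        }
        return digits.toString();
    }
    public static void main(String[] args) {
        System.out.println(fractionDigits(0.6875, 2, 8));  // 1011, i.e., .6875D = .1011B
        System.out.println(fractionDigits(0.6875, 16, 8)); // b, i.e., .6875D = .BH
    }
}
```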

Exercises (Number Systems Conversion)

  1. Convert the following decimal numbers into binary and hexadecimal numbers:
    1. 108
    2. 4848
    3. 9000
  2. Convert the following binary numbers into hexadecimal and decimal numbers:
    1. 1000011000
    2. 10000000
    3. 101010101010
  3. Convert the following hexadecimal numbers into binary and decimal numbers:
    1. ABCDE
    2. 1234
    3. 80F
  4. Convert the following decimal numbers into binary equivalent:
    1. 19.25D
    2. 123.456D

Answers: You could use the Windows' Calculator (calc.exe) to carry out number system conversion, by setting it to the Programmer or Scientific mode. (Run "calc" ⇒ Select "Settings" menu ⇒ Choose "Programmer" or "Scientific" mode.)

  1. 1101100B, 1001011110000B, 10001100101000B, 6CH, 12F0H, 2328H.
  2. 218H, 80H, AAAH, 536D, 128D, 2730D.
  3. 10101011110011011110B, 1001000110100B, 100000001111B, 703710D, 4660D, 2063D.
  4. ?? (You work it out!)

Computer Memory & Data Representation

Computers use a fixed number of bits to represent a piece of data, which could be a number, a character, or others. An n-bit storage location can represent up to 2^n distinct entities. For example, a 3-bit memory location can hold one of these eight binary patterns: 000, 001, 010, 011, 100, 101, 110, or 111. Hence, it can represent at most 8 distinct entities. You could use them to represent numbers 0 to 7, numbers 8881 to 8888, characters 'A' to 'H', or up to 8 kinds of fruits like apple, orange, banana; or up to 8 kinds of animals like lion, tiger, etc.

Integers, for example, can be represented in 8-bit, 16-bit, 32-bit or 64-bit. You, as the programmer, choose an appropriate bit-length for your integers. Your choice will impose constraint on the range of integers that can be represented. Besides the bit-length, an integer can be represented in various representation schemes, e.g., unsigned vs. signed integers. An 8-bit unsigned integer has a range of 0 to 255, while an 8-bit signed integer has a range of -128 to 127 - both representing 256 distinct numbers.

It is important to note that a computer memory location merely stores a binary pattern. It is entirely up to you, as the programmer, to decide on how these patterns are to be interpreted. For example, the 8-bit binary pattern "0100 0001B" can be interpreted as an unsigned integer 65, or an ASCII character 'A', or some secret information known only to you. In other words, you have to first decide how to represent a piece of data in a binary pattern before the binary patterns make sense. The interpretation of binary pattern is called data representation or encoding. Furthermore, it is important that the data representation schemes are agreed upon by all the parties, i.e., industrial standards need to be formulated and strictly followed.

Once you decided on the data representation scheme, certain constraints, in particular, the precision and range, will be imposed. Hence, it is important to understand data representation to write correct and high-performance programs.
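The point that one binary pattern has many interpretations can be demonstrated directly; a minimal sketch:

```java
// The same 8-bit pattern 0100 0001B read as a number and as a character.
public class PatternInterpretation {
    public static void main(String[] args) {
        int pattern = 0b0100_0001;
        System.out.println(pattern);        // interpreted as an unsigned integer: 65
        System.out.println((char) pattern); // interpreted as an ASCII character: A
    }
}
```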

Rosetta Stone and the Decipherment of Egyptian Hieroglyphs

RosettaStone hieroglyphs

Egyptian hieroglyphs (next-to-left) were used by the ancient Egyptians since 4000BC. Unfortunately, since 500AD, no one could read the ancient Egyptian hieroglyphs any longer, until the re-discovery of the Rosetta Stone in 1799 by Napoleon's troops (during Napoleon's Egyptian invasion) near the town of Rashid (Rosetta) in the Nile Delta.

The Rosetta Stone (left) is inscribed with a decree in 196BC on behalf of King Ptolemy V. The decree appears in three scripts: the upper text is Ancient Egyptian hieroglyphs, the middle portion Demotic script, and the lowest Ancient Greek. Because it presents essentially the same text in all three scripts, and Ancient Greek could still be understood, it provided the key to the decipherment of the Egyptian hieroglyphs.

The moral of the story is: unless you know the encoding scheme, there is no way that you can decode the data.

Reference and images: Wikipedia.

Integer Representation

Integers are whole numbers or fixed-point numbers with the radix point fixed after the least-significant bit. They are in contrast to real numbers or floating-point numbers, where the position of the radix point varies. It is important to take note that integers and floating-point numbers are treated differently in computers. They have different representation and are processed differently (e.g., floating-point numbers are processed in a so-called floating-point processor). Floating-point numbers will be discussed later.

Computers use a fixed number of bits to represent an integer. The commonly-used bit-lengths for integers are 8-bit, 16-bit, 32-bit or 64-bit. Besides bit-lengths, there are two representation schemes for integers:

  1. Unsigned Integers: can represent zero and positive integers.
  2. Signed Integers: can represent zero, positive and negative integers. Three representation schemes had been proposed for signed integers:
    1. Sign-Magnitude representation
    2. 1's Complement representation
    3. 2's Complement representation

You, as the programmer, need to decide on the bit-length and representation scheme for your integers, depending on your application's requirements. Suppose that you need a counter for counting a small quantity from 0 up to 200, you might choose the 8-bit unsigned integer scheme as there are no negative numbers involved.

n-bit Unsigned Integers

Unsigned integers can represent zero and positive integers, but not negative integers. The value of an unsigned integer is interpreted as "the magnitude of its underlying binary pattern".

Example 1: Suppose that n=8 and the binary pattern is 0100 0001B, the value of this unsigned integer is 1×2^0 + 1×2^6 = 65D.

Example 2: Suppose that n=16 and the binary pattern is 0001 0000 0000 1000B, the value of this unsigned integer is 1×2^3 + 1×2^12 = 4104D.

Example 3: Suppose that n=16 and the binary pattern is 0000 0000 0000 0000B, the value of this unsigned integer is 0.

An n-bit pattern can represent 2^n distinct integers. An n-bit unsigned integer can represent integers from 0 to (2^n)-1, as tabulated below:

n Minimum Maximum
8 0 (2^8)-1  (=255)
16 0 (2^16)-1 (=65,535)
32 0 (2^32)-1 (=4,294,967,295) (9+ digits)
64 0 (2^64)-1 (=18,446,744,073,709,551,615) (19+ digits)
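Java has no unsigned primitive types, but since Java 8 the wrapper classes can reinterpret a signed bit pattern as unsigned; a small sketch:

```java
// Reading bit patterns as unsigned integers via the Java 8 helper methods.
public class UnsignedDemo {
    public static void main(String[] args) {
        byte allOnes = (byte) 0b1111_1111;                // 8-bit pattern of all ones
        System.out.println(allOnes);                      // -1 as a signed byte
        System.out.println(Byte.toUnsignedInt(allOnes));  // 255 as an unsigned 8-bit value
        System.out.println(Integer.toUnsignedString(0xFFFFFFFF)); // 4294967295
    }
}
```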

Signed Integers

Signed integers can represent zero, positive integers, as well as negative integers. Three representation schemes are available for signed integers:

  1. Sign-Magnitude representation
  2. 1's Complement representation
  3. 2's Complement representation

In all the above three schemes, the most-significant bit (msb) is called the sign bit. The sign bit is used to represent the sign of the integer - with 0 for positive integers and 1 for negative integers. The magnitude of the integer, however, is interpreted differently in different schemes.

n-bit Sign Integers in Sign-Magnitude Representation

In sign-magnitude representation:

  • The most-significant bit (msb) is the sign bit, with value of 0 representing positive integer and 1 representing negative integer.
  • The remaining n-1 bits represents the magnitude (absolute value) of the integer. The absolute value of the integer is interpreted as "the magnitude of the (n-1)-bit binary pattern".

Example 1: Suppose that n=8 and the binary representation is 0 100 0001B.
Sign bit is 0 ⇒ positive
Absolute value is 100 0001B = 65D
Hence, the integer is +65D

Example 2: Suppose that n=8 and the binary representation is 1 000 0001B.
Sign bit is 1 ⇒ negative
Absolute value is 000 0001B = 1D
Hence, the integer is -1D

Example 3: Suppose that n=8 and the binary representation is 0 000 0000B.
Sign bit is 0 ⇒ positive
Absolute value is 000 0000B = 0D
Hence, the integer is +0D

Example 4: Suppose that n=8 and the binary representation is 1 000 0000B.
Sign bit is 1 ⇒ negative
Absolute value is 000 0000B = 0D
Hence, the integer is -0D

sign-magnitude representation

The drawbacks of sign-magnitude representation are:

  1. There are two representations (0000 0000B and 1000 0000B) for the number zero, which could lead to inefficiency and confusion.
  2. Positive and negative integers need to be processed separately.

n-bit Sign Integers in 1's Complement Representation

In 1's complement representation:

  • Again, the most significant bit (msb) is the sign bit, with value of 0 representing positive integers and 1 representing negative integers.
  • The remaining n-1 bits represents the magnitude of the integer, as follows:
    • for positive integers, the absolute value of the integer is equal to "the magnitude of the (n-1)-bit binary pattern".
    • for negative integers, the absolute value of the integer is equal to "the magnitude of the complement (inverse) of the (n-1)-bit binary pattern" (hence called 1's complement).

Example 1: Suppose that n=8 and the binary representation 0 100 0001B.
Sign bit is 0 ⇒ positive
Absolute value is 100 0001B = 65D
Hence, the integer is +65D

Example 2: Suppose that n=8 and the binary representation 1 000 0001B.
Sign bit is 1 ⇒ negative
Absolute value is the complement of 000 0001B, i.e., 111 1110B = 126D
Hence, the integer is -126D

Example 3: Suppose that n=8 and the binary representation 0 000 0000B.
Sign bit is 0 ⇒ positive
Absolute value is 000 0000B = 0D
Hence, the integer is +0D

Example 4: Suppose that n=8 and the binary representation 1 111 1111B.
Sign bit is 1 ⇒ negative
Absolute value is the complement of 111 1111B, i.e., 000 0000B = 0D
Hence, the integer is -0D

1's complement

Again, the drawbacks are:

  1. There are two representations (0000 0000B and 1111 1111B) for zero.
  2. The positive integers and negative integers need to be processed separately.

n-bit Sign Integers in 2's Complement Representation

In 2's complement representation:

  • Again, the most significant bit (msb) is the sign bit, with value of 0 representing positive integers and 1 representing negative integers.
  • The remaining n-1 bits represents the magnitude of the integer, as follows:
    • for positive integers, the absolute value of the integer is equal to "the magnitude of the (n-1)-bit binary pattern".
    • for negative integers, the absolute value of the integer is equal to "the magnitude of the complement of the (n-1)-bit binary pattern plus one" (hence called 2's complement).

Example 1: Suppose that n=8 and the binary representation 0 100 0001B.
Sign bit is 0 ⇒ positive
Absolute value is 100 0001B = 65D
Hence, the integer is +65D

Example 2: Suppose that n=8 and the binary representation 1 000 0001B.
Sign bit is 1 ⇒ negative
Absolute value is the complement of 000 0001B plus 1, i.e., 111 1110B + 1B = 127D
Hence, the integer is -127D

Example 3: Suppose that n=8 and the binary representation 0 000 0000B.
Sign bit is 0 ⇒ positive
Absolute value is 000 0000B = 0D
Hence, the integer is +0D

Example 4: Suppose that n=8 and the binary representation 1 111 1111B.
Sign bit is 1 ⇒ negative
Absolute value is the complement of 111 1111B plus 1, i.e., 000 0000B + 1B = 1D
Hence, the integer is -1D
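Java's byte is an 8-bit 2's complement integer, so casting a pattern to byte reproduces the decodings above; a minimal sketch:

```java
// Decoding 8-bit patterns as 2's complement signed integers via a byte cast.
public class TwosComplementDemo {
    public static void main(String[] args) {
        System.out.println((byte) 0b0100_0001); // 65   (sign bit 0, positive)
        System.out.println((byte) 0b1000_0001); // -127 (sign bit 1, negative)
        System.out.println((byte) 0b1111_1111); // -1
    }
}
```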

2's complement

Computers use 2's Complement Representation for Signed Integers

We have discussed three representations for signed integers: signed-magnitude, 1's complement and 2's complement. Computers use 2's complement in representing signed integers. This is because:

  1. There is only one representation for the number zero in 2's complement, instead of two representations in sign-magnitude and 1's complement.
  2. Positive and negative integers can be treated together in addition and subtraction. Subtraction can be carried out using the "addition logic".

Example 1: Addition of Two Positive Integers: Suppose that n=8, 65D + 5D = 70D

65D →    0100 0001B
 5D →    0000 0101B(+
         0100 0110B    → 70D (OK)

Example 2: Subtraction is treated as Addition of a Positive and a Negative Integers: Suppose that n=8, 65D - 5D = 65D + (-5D) = 60D

65D →    0100 0001B
-5D →    1111 1011B(+
         0011 1100B    → 60D (discard carry - OK)

Example 3: Addition of Two Negative Integers: Suppose that n=8, -65D - 5D = (-65D) + (-5D) = -70D

-65D →    1011 1111B
 -5D →    1111 1011B(+
          1011 1010B    → -70D (discard carry - OK)

Because of the fixed precision (i.e., fixed number of bits), an n-bit 2's complement signed integer has a certain range. For example, for n=8, the range of 2's complement signed integers is -128 to +127. During addition (and subtraction), it is important to check whether the result exceeds this range, in other words, whether overflow or underflow has occurred.

Example 4: Overflow: Suppose that n=8, 127D + 2D = 129D (overflow - beyond the range)

127D →    0111 1111B
  2D →    0000 0010B(+
          1000 0001B    → -127D (wrong)

Example 5: Underflow: Suppose that n=8, -125D - 5D = -130D (underflow - below the range)

-125D →    1000 0011B
  -5D →    1111 1011B(+
           0111 1110B    → +126D (wrong)
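The overflow and underflow examples can be reproduced in Java, where casting the sum back to byte keeps only the low 8 bits (a sketch):

```java
// 8-bit overflow/underflow: results outside -128..+127 wrap around.
public class OverflowDemo {
    public static void main(String[] args) {
        System.out.println((byte) (127 + 2));  // -127, not 129 (overflow)
        System.out.println((byte) (-125 - 5)); // 126, not -130 (underflow)
    }
}
```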

The following diagram explains how the 2's complement works. By re-arranging the number line, values from -128 to +127 are represented contiguously by ignoring the carry bit.

signed integer

Range of n-bit 2's Complement Signed Integers

An n-bit 2's complement signed integer can represent integers from -2^(n-1) to +2^(n-1)-1, as tabulated. Take note that the scheme can represent all the integers within the range, without any gap. In other words, there is no missing integers within the supported range.

n minimum maximum
8 -(2^7)  (=-128) +(2^7)-1  (=+127)
16 -(2^15) (=-32,768) +(2^15)-1 (=+32,767)
32 -(2^31) (=-2,147,483,648) +(2^31)-1 (=+2,147,483,647)(9+ digits)
64 -(2^63) (=-9,223,372,036,854,775,808) +(2^63)-1 (=+9,223,372,036,854,775,807)(18+ digits)

Decoding 2's Complement Numbers

  1. Check the sign bit (denoted as S).
  2. If S=0, the number is positive and its absolute value is the binary value of the remaining n-1 bits.
  3. If S=1, the number is negative. You could "invert the n-1 bits and plus 1" to get the absolute value of the negative number.
    Alternatively, you could scan the remaining n-1 bits from the right (least-significant bit). Look for the first occurrence of 1. Flip all the bits to the left of that first occurrence of 1. The flipped pattern gives the absolute value. For example,
    n = 8, bit pattern = 1 100 0100B
    S = 1 → negative
    Scanning from the right and flip all the bits to the left of the first occurrence of 1 ⇒ 011 1100B = 60D
    Hence, the value is -60D

Big Endian vs. Little Endian

Modern computers store one byte of data in each memory address or location, i.e., byte addressable memory. A 32-bit integer is, therefore, stored in 4 memory addresses.

The term "Endian" refers to the order of storing bytes in computer memory. In "Big Endian" scheme, the most significant byte is stored first in the lowest memory address (or big end first), while "Little Endian" stores the least significant bytes in the lowest memory address.

For example, the 32-bit integer 12345678H (305419896 in decimal) is stored as 12H 34H 56H 78H in big endian; and 78H 56H 34H 12H in little endian. A 16-bit integer 00H 01H is interpreted as 0001H in big endian, and 0100H in little endian.
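Byte order can be observed with java.nio.ByteBuffer, whose order() method selects big or little endian (a sketch):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Writing the 32-bit integer 12345678H into a buffer in both byte orders.
public class EndianDemo {
    public static void main(String[] args) {
        ByteBuffer big = ByteBuffer.allocate(4).order(ByteOrder.BIG_ENDIAN);
        big.putInt(0x12345678);
        ByteBuffer little = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
        little.putInt(0x12345678);
        for (byte b : big.array())    System.out.printf("%02X ", b); // 12 34 56 78
        System.out.println();
        for (byte b : little.array()) System.out.printf("%02X ", b); // 78 56 34 12
        System.out.println();
    }
}
```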

Exercise (Integer Representation)

  1. What are the ranges of 8-bit, 16-bit, 32-bit and 64-bit integer, in "unsigned" and "signed" representation?
  2. Give the value of 88, 0, 1, 127, and 255 in 8-bit unsigned representation.
  3. Give the value of +88, -88, -1, 0, +1, -128, and +127 in 8-bit 2's complement signed representation.
  4. Give the value of +88, -88, -1, 0, +1, -127, and +127 in 8-bit sign-magnitude representation.
  5. Give the value of +88, -88, -1, 0, +1, -127 and +127 in 8-bit 1's complement representation.
  6. [TODO] more.
Answers
  1. The range of unsigned n-bit integers is [0, 2^n - 1]. The range of n-bit 2's complement signed integer is [-2^(n-1), +2^(n-1)-1];
  2. 88 (0101 1000), 0 (0000 0000), 1 (0000 0001), 127 (0111 1111), 255 (1111 1111).
  3. +88 (0101 1000), -88 (1010 1000), -1 (1111 1111), 0 (0000 0000), +1 (0000 0001), -128 (1000 0000), +127 (0111 1111).
  4. +88 (0101 1000), -88 (1101 1000), -1 (1000 0001), 0 (0000 0000 or 1000 0000), +1 (0000 0001), -127 (1111 1111), +127 (0111 1111).
  5. +88 (0101 1000), -88 (1010 0111), -1 (1111 1110), 0 (0000 0000 or 1111 1111), +1 (0000 0001), -127 (1000 0000), +127 (0111 1111).

Floating-Point Number Representation

A floating-point number (or real number) can represent a very large (1.23×10^88) or a very small (1.23×10^-88) value. It could also represent a very large negative number (-1.23×10^88) and a very small negative number (-1.23×10^-88), as well as zero, as illustrated:

Representation_FloatingPointNumbers

A floating-point number is typically expressed in the scientific notation, with a fraction (F), and an exponent (E) of a certain radix (r), in the form of F×r^E. Decimal numbers use radix of 10 (F×10^E); while binary numbers use radix of 2 (F×2^E).

Representation of floating point number is not unique. For example, the number 55.66 can be represented as 5.566×10^1, 0.5566×10^2, 0.05566×10^3, and so on. The fractional part can be normalized. In the normalized form, there is only a single non-zero digit before the radix point. For example, decimal number 123.4567 can be normalized as 1.234567×10^2; binary number 1010.1011B can be normalized as 1.0101011B×2^3.

It is important to note that floating-point numbers suffer from loss of precision when represented with a fixed number of bits (e.g., 32-bit or 64-bit). This is because there are infinite number of real numbers (even within a small range of says 0.0 to 0.1). On the other hand, a n-bit binary pattern can represent a finite 2^n distinct numbers. Hence, not all the real numbers can be represented. The nearest approximation will be used instead, resulting in loss of accuracy.

It is also important to note that floating-point arithmetic is very much less efficient than integer arithmetic. It could be speeded up with a so-called dedicated floating-point co-processor. Hence, use integers if your application does not require floating-point numbers.

In computers, floating-point numbers are represented in scientific notation of fraction (F) and exponent (E) with a radix of 2, in the form of F×2^E. Both E and F can be positive as well as negative. Modern computers adopt IEEE 754 standard for representing floating-point numbers. There are two representation schemes: 32-bit single-precision and 64-bit double-precision.

IEEE-754 32-bit Single-Precision Floating-Point Numbers

In 32-bit single-precision floating-point representation:

  • The most significant bit is the sign bit (S), with 0 for positive numbers and 1 for negative numbers.
  • The following 8 bits represent exponent (E).
  • The remaining 23 bits represents fraction (F).

float

Normalized Form

Let's illustrate with an example, suppose that the 32-bit pattern is 1 1000 0001 011 0000 0000 0000 0000 0000, with:

  • S = 1
  • E = 1000 0001
  • F = 011 0000 0000 0000 0000 0000

In the normalized form, the actual fraction is normalized with an implicit leading 1 in the form of 1.F. In this example, the actual fraction is 1.011 0000 0000 0000 0000 0000 = 1 + 1×2^-2 + 1×2^-3 = 1.375D.

The sign bit represents the sign of the number, with S=0 for positive and S=1 for negative number. In this example with S=1, this is a negative number, i.e., -1.375D.

In normalized form, the actual exponent is E-127 (so-called excess-127 or bias-127). This is because we need to represent both positive and negative exponent. With an 8-bit E, ranging from 0 to 255, the excess-127 scheme could provide actual exponent of -127 to 128. In this example, E-127=129-127=2D.

Hence, the number represented is -1.375×2^2=-5.5D.
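The worked example (-5.5D) can be checked with Float.intBitsToFloat, which reinterprets a 32-bit pattern as a float (a sketch; the binary literal is grouped as sign, exponent, fraction):

```java
// The pattern 1 10000001 011 0000... decodes to -1.375 × 2^(129-127) = -5.5.
public class FloatDecodeDemo {
    public static void main(String[] args) {
        int pattern = 0b1_10000001_01100000000000000000000; // S=1, E=129, F=011...
        System.out.println(Float.intBitsToFloat(pattern));  // -5.5
        // The reverse direction recovers the same 32-bit pattern (hex C0B00000).
        System.out.println(Integer.toHexString(Float.floatToIntBits(-5.5f))); // c0b00000
    }
}
```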

De-Normalized Form

Normalized form has a serious problem - with an implicit leading 1 for the fraction, it cannot represent the number zero! Convince yourself on this!

De-normalized form was devised to represent zero and other numbers.

For E=0, the numbers are in the de-normalized form. An implicit leading 0 (instead of 1) is used for the fraction; and the actual exponent is always -126. Hence, the number zero can be represented with E=0 and F=0 (because 0.0×2^-126=0).

We can also represent very small positive and negative numbers in de-normalized form with E=0. For example, if S=1, E=0, and F=011 0000 0000 0000 0000 0000. The actual fraction is 0.011=1×2^-2+1×2^-3=0.375D. Since S=1, it is a negative number. With E=0, the actual exponent is -126. Hence the number is -0.375×2^-126 = -4.4×10^-39, which is an extremely small negative number (close to zero).

Summary

In summary, the value (N) is calculated as follows:

  • For 1 ≤ E ≤ 254, N = (-1)^S × 1.F × 2^(E-127). These numbers are in the so-called normalized form. The sign-bit represents the sign of the number. Fractional part (1.F) are normalized with an implicit leading 1. The exponent is bias (or in excess) of 127, so as to represent both positive and negative exponent. The range of exponent is -126 to +127.
  • For E = 0, N = (-1)^S × 0.F × 2^(-126). These numbers are in the so-called denormalized form. The exponent of 2^-126 evaluates to a very small number. Denormalized form is needed to represent zero (with F=0 and E=0). It can also represents very small positive and negative number close to zero.
  • For E = 255, it represents special values, such as ±INF (positive and negative infinity) and NaN (not a number). This is beyond the scope of this article.

Example 1: Suppose that IEEE-754 32-bit floating-point representation pattern is 0 10000000 110 0000 0000 0000 0000 0000.

Sign bit S = 0 ⇒ positive number
E = 1000 0000B = 128D (in normalized form)
Fraction is 1.11B (with an implicit leading 1) = 1 + 1×2^-1 + 1×2^-2 = 1.75D
The number is +1.75 × 2^(128-127) = +3.5D

Example 2: Suppose that IEEE-754 32-bit floating-point representation pattern is 1 01111110 100 0000 0000 0000 0000 0000.

Sign bit S = 1 ⇒ negative number
E = 0111 1110B = 126D (in normalized form)
Fraction is 1.1B (with an implicit leading 1) = 1 + 2^-1 = 1.5D
The number is -1.5 × 2^(126-127) = -0.75D

Example 3: Suppose that IEEE-754 32-bit floating-point representation pattern is 1 01111110 000 0000 0000 0000 0000 0001.

Sign bit S = 1 ⇒ negative number
E = 0111 1110B = 126D (in normalized form)
Fraction is 1.000 0000 0000 0000 0000 0001B (with an implicit leading 1) = 1 + 2^-23
The number is -(1 + 2^-23) × 2^(126-127) = -0.500000059604644775390625 (may not be exact in decimal!)

Example 4 (De-Normalized Form): Suppose that IEEE-754 32-bit floating-point representation pattern is 1 00000000 000 0000 0000 0000 0000 0001.

Sign bit S = 1 ⇒ negative number
E = 0 (in de-normalized form)
Fraction is 0.000 0000 0000 0000 0000 0001B (with an implicit leading 0) = 1×2^-23
The number is -2^-23 × 2^(-126) = -2^(-149) ≈ -1.4×10^-45

Exercises (Floating-point Numbers)

  1. Compute the largest and smallest positive numbers that can be represented in the 32-bit normalized form.
  2. Compute the largest and smallest negative numbers that can be represented in the 32-bit normalized form.
  3. Repeat (1) for the 32-bit denormalized form.
  4. Repeat (2) for the 32-bit denormalized form.
Hints:
  1. Largest positive number: S=0, E=1111 1110 (254), F=111 1111 1111 1111 1111 1111.
    Smallest positive number: S=0, E=0000 0001 (1), F=000 0000 0000 0000 0000 0000.
  2. Same as above, but S=1.
  3. Largest positive number: S=0, E=0, F=111 1111 1111 1111 1111 1111.
    Smallest positive number: S=0, E=0, F=000 0000 0000 0000 0001.
  4. Same as above, but S=1.
Notes For Java Users

You can use JDK methods Float.intBitsToFloat(int bits) or Double.longBitsToDouble(long bits) to create a single-precision 32-bit float or double-precision 64-bit double with the specific bit patterns, and print their values. For examples,

System.out.println(Float.intBitsToFloat(0x7fffff));
System.out.println(Double.longBitsToDouble(0x1fffffffffffffL));

IEEE-754 64-bit Double-Precision Floating-Point Numbers

The representation scheme for 64-bit double-precision is similar to the 32-bit single-precision:

  • The most significant bit is the sign bit (S), with 0 for positive numbers and 1 for negative numbers.
  • The following 11 bits represent exponent (E).
  • The remaining 52 bits represents fraction (F).

double

The value (N) is calculated as follows:

  • Normalized form: For 1 ≤ E ≤ 2046, N = (-1)^S × 1.F × 2^(E-1023).
  • Denormalized form: For E = 0, N = (-1)^S × 0.F × 2^(-1022). These are in the denormalized form.
  • For E = 2047, N represents special values, such as ±INF (infinity), NaN (not a number).

More on Floating-Point Representation

There are three parts in the floating-point representation:

  • The sign bit (S) is self-explanatory (0 for positive numbers and 1 for negative numbers).
  • For the exponent (E), a so-called bias (or excess) is applied so as to represent both positive and negative exponents. The bias is set at half of the range. For single precision with an 8-bit exponent, the bias is 127 (or excess-127). For double precision with an 11-bit exponent, the bias is 1023 (or excess-1023).
  • The fraction (F) (also called the mantissa or significand) is composed of an implicit leading bit (before the radix point) and the fractional bits (after the radix point). The leading bit for normalized numbers is 1, while the leading bit for denormalized numbers is 0.
Normalized Floating-Point Numbers

In normalized form, the radix point is placed after the first non-zero digit, e.g., 9.8765D×10^-23D, 1.001011B×2^11B. For binary numbers, the leading bit is always 1, and need not be represented explicitly - this saves one bit of storage.

In IEEE 754's normalized form:

  • For single-precision, 1 ≤ E ≤ 254 with an excess of 127. Hence, the actual exponent ranges from -126 to +127. Negative exponents are used to represent small numbers (< 1.0), while positive exponents are used to represent large numbers (> 1.0).
    N = (-1)^S × 1.F × 2^(E-127)
  • For double-precision, 1 ≤ E ≤ 2046 with an excess of 1023. The actual exponent ranges from -1022 to +1023, and
    N = (-1)^S × 1.F × 2^(E-1023)

Take note that an n-bit pattern has a finite number of combinations (= 2^n), and can therefore represent only a finite number of distinct values. It is not possible to represent the infinitely many numbers on the real axis (even a small range, say 0.0 to 1.0, contains infinitely many numbers). That is, not all floating-point numbers can be represented exactly. Instead, the closest approximation is used, which leads to loss of accuracy.
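A minimal sketch of this loss of accuracy, using ordinary double literals (the class name is illustrative):

```java
public class FloatApprox {
    public static void main(String[] args) {
        // 0.1, 0.2 and 0.3 cannot be represented exactly in binary,
        // so each is stored as its closest representable approximation.
        double sum = 0.1 + 0.2;
        System.out.println(sum);          // 0.30000000000000004
        System.out.println(sum == 0.3);   // false
    }
}
```

This is why floating-point values should be compared against a tolerance rather than with ==.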

The minimum and maximum normalized floating-point numbers are:

Single precision:
  N(min) = 0080 0000H = 0 00000001 00000000000000000000000B
           (E = 1, F = 0): 1.0B × 2^-126 ≈ 1.17549435 × 10^-38
  N(max) = 7F7F FFFFH = 0 11111110 11111111111111111111111B
           (E = 254, F = all 1's): 1.1...1B × 2^127 = (2 - 2^-23) × 2^127 ≈ 3.4028235 × 10^38
Double precision:
  N(min) = 0010 0000 0000 0000H
           (E = 1, F = 0): 1.0B × 2^-1022 ≈ 2.2250738585072014 × 10^-308
  N(max) = 7FEF FFFF FFFF FFFFH
           (E = 2046, F = all 1's): 1.1...1B × 2^1023 = (2 - 2^-52) × 2^1023 ≈ 1.7976931348623157 × 10^308
[Figure: ranges of representable normalized and denormalized floating-point values on the real number line]

Denormalized Floating-Point Numbers

If E = 0, but the fraction is non-zero, then the value is in denormalized form, and a leading bit of 0 is assumed, as follows:

  • For single-precision, E = 0,
    N = (-1)^S × 0.F × 2^(-126)
  • For double-precision, E = 0,
    N = (-1)^S × 0.F × 2^(-1022)

The denormalized form can represent very small numbers close to zero, and zero itself, which cannot be represented in normalized form, as shown in the above figure.

The minimum and maximum denormalized floating-point numbers are:

Single precision:
  D(min) = 0000 0001H = 0 00000000 00000000000000000000001B
           (E = 0, F = 00000000000000000000001B): 0.0...1B × 2^-126 = 2^-23 × 2^-126 = 2^-149 ≈ 1.4 × 10^-45
  D(max) = 007F FFFFH = 0 00000000 11111111111111111111111B
           (E = 0, F = 11111111111111111111111B): 0.1...1B × 2^-126 = (1 - 2^-23) × 2^-126 ≈ 1.1754942 × 10^-38
Double precision:
  D(min) = 0000 0000 0000 0001H
           (E = 0, F = 0...01B): 0.0...1B × 2^-1022 = 2^-52 × 2^-1022 = 2^-1074 ≈ 4.9 × 10^-324
  D(max) = 000F FFFF FFFF FFFFH
           (E = 0, F = all 1's): 0.1...1B × 2^-1022 = (1 - 2^-52) × 2^-1022 ≈ 2.2250738585072009 × 10^-308
Special Values

Zero: Zero cannot be represented in the normalized form, and must be represented in denormalized form with E=0 and F=0. There are two representations for zero: +0 with S=0 and -0 with S=1.

Infinity: The values +infinity (e.g., 1/0) and -infinity (e.g., -1/0) are represented with an exponent of all 1's (E = 255 for single-precision and E = 2047 for double-precision), F=0, and S=0 (for +INF) or S=1 (for -INF).

Not a Number (NaN): NaN denotes a value that cannot be represented as a real number (e.g., 0/0). NaN is represented with an exponent of all 1's (E = 255 for single-precision and E = 2047 for double-precision) and any non-zero fraction.
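These special values can be produced directly in Java floating-point arithmetic (a minimal sketch; the class name is illustrative):

```java
public class SpecialValues {
    public static void main(String[] args) {
        System.out.println(1.0 / 0.0);    // Infinity
        System.out.println(-1.0 / 0.0);   // -Infinity
        System.out.println(0.0 / 0.0);    // NaN

        // +0 and -0 have different bit patterns but compare equal
        System.out.println(0.0 == -0.0);  // true
        System.out.println(Long.toHexString(Double.doubleToRawLongBits(-0.0))); // 8000000000000000

        // E = all 1's, F = 0 is infinity (single precision: 7F80 0000H)
        System.out.println(Float.intBitsToFloat(0x7F800000)); // Infinity
    }
}
```

Note that integer division by zero throws an ArithmeticException; only floating-point division yields these special values.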

Character Encoding

In computer memory, characters are "encoded" (or "represented") using a chosen "character encoding scheme" (aka "character set", "charset", "character map", or "code page").

For example, in ASCII (as well as Latin-1, Unicode, and many other character sets):

  • code numbers 65D (41H) to 90D (5AH) represent 'A' to 'Z', respectively.
  • code numbers 97D (61H) to 122D (7AH) represent 'a' to 'z', respectively.
  • code numbers 48D (30H) to 57D (39H) represent '0' to '9', respectively.

It is important to note that the representation scheme must be known before a binary pattern can be interpreted. For example, the 8-bit pattern "0100 0010B" could represent anything under the sun, known only to the person who encoded it.

The most commonly-used character encoding schemes are: 7-bit ASCII (ISO/IEC 646) and 8-bit Latin-x (ISO/IEC 8859-x) for western European characters, and Unicode (ISO/IEC 10646) for internationalization (i18n).

A 7-bit encoding scheme (such as ASCII) can represent 128 characters and symbols. An 8-bit character encoding scheme (such as Latin-x) can represent 256 characters and symbols; whereas a 16-bit encoding scheme (such as Unicode UCS-2) can represent 65,536 characters and symbols.
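In Java, the code number behind a character can be inspected by widening the char to an int (a small sketch; the class name is illustrative):

```java
public class CharCodes {
    public static void main(String[] args) {
        // Widening a char to int reveals its code number.
        System.out.println((int) 'A');  // 65
        System.out.println((int) 'a');  // 97
        System.out.println((int) '0');  // 48
        // The reverse: a code number can be cast back to a char.
        System.out.println((char) 66);  // B
    }
}
```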

7-bit ASCII Code (aka US-ASCII, ISO/IEC 646, ITU-T T.50)

  • ASCII (American Standard Code for Information Interchange) is one of the earlier character coding schemes.
  • ASCII is originally a 7-bit code. It has been extended to 8-bit to better use the 8-bit computer memory organization. (The 8th bit was originally used for parity check in the early computers.)
  • Code numbers 32D (20H) to 126D (7EH) are printable (displayable) characters, as tabulated (arranged in hexadecimal and decimal) as follows:
    Hex 0 1 2 3 4 5 6 7 8 9 A B C D E F
    2 SP ! " # $ % & ' ( ) * + , - . /
    3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
    4 @ A B C D E F G H I J K L M N O
    5 P Q R S T U V W X Y Z [ \ ] ^ _
    6 ` a b c d e f g h i j k l m n o
    7 p q r s t u v w x y z { | } ~

    Dec 0 1 2 3 4 5 6 7 8 9
    3       SP ! " # $ % & '
    4  ( ) * + , - . / 0 1
    5  2 3 4 5 6 7 8 9 : ;
    6  < = > ? @ A B C D E
    7  F G H I J K L M N O
    8  P Q R S T U V W X Y
    9  Z [ \ ] ^ _ ` a b c
    10 d e f g h i j k l m
    11 n o p q r s t u v w
    12 x y z { | } ~
    • Code number 32D (20H) is the blank or space character.
    • '0' to '9': 30H-39H (0011 0000B to 0011 1001B), or (0011 xxxxB where xxxx is the equivalent integer value).
    • 'A' to 'Z': 41H-5AH (0100 0001B to 0101 1010B), or (010x xxxxB). 'A' to 'Z' are continuous without gap.
    • 'a' to 'z': 61H-7AH (0110 0001B to 0111 1010B), or (011x xxxxB). 'a' to 'z' are also continuous without gap. However, there is a gap between uppercase and lowercase letters. To convert between upper and lowercase, flip the value of bit-5.
  • Code numbers 0D (00H) to 31D (1FH), and 127D (7FH), are special control characters, which are non-printable (non-displayable), as tabulated below. Many of these characters were used in the early days for transmission control (e.g., STX, ETX) and printer control (e.g., Form-Feed), and are now obsolete. The remaining meaningful codes today are:
    • 09H for Tab ('\t').
    • 0AH for Line-Feed or newline (LF or '\n') and 0DH for Carriage-Return (CR or '\r'), which are used as line delimiters (aka line separators, end-of-line) for text files. There is unfortunately no standard for the line delimiter: Unixes and Mac use 0AH (LF or "\n"), Windows uses 0D0AH (CR+LF or "\r\n"). Programming languages such as C/C++/Java (which were created on Unix) use 0AH (LF or "\n").
    • In programming languages such as C/C++/Java, line-feed (0AH) is denoted as '\n', carriage-return (0DH) as '\r', and tab (09H) as '\t'.
DEC HEX Meaning DEC HEX Meaning
0 00 NUL Null 17 11 DC1 Device Control 1
1 01 SOH Start of Heading 18 12 DC2 Device Control 2
2 02 STX Start of Text 19 13 DC3 Device Control 3
3 03 ETX End of Text 20 14 DC4 Device Control 4
4 04 EOT End of Transmission 21 15 NAK Negative Ack.
5 05 ENQ Enquiry 22 16 SYN Sync. Idle
6 06 ACK Acknowledgment 23 17 ETB End of Transmission Block
7 07 BEL Bell 24 18 CAN Cancel
8 08 BS Back Space '\b' 25 19 EM End of Medium
9 09 HT Horizontal Tab '\t' 26 1A SUB Substitute
10 0A LF Line Feed '\n' 27 1B ESC Escape
11 0B VT Vertical Tab 28 1C IS4 File Separator
12 0C FF Form Feed '\f' 29 1D IS3 Group Separator
13 0D CR Carriage Return '\r' 30 1E IS2 Record Separator
14 0E SO Shift Out 31 1F IS1 Unit Separator
15 0F SI Shift In
16 10 DLE Datalink Escape 127 7F DEL Delete
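The bit-5 trick for case conversion mentioned above can be sketched in Java (it is only valid for the letters 'A'-'Z' and 'a'-'z'; the class name is illustrative):

```java
public class CaseFlip {
    public static void main(String[] args) {
        // Flipping bit-5 (value 20H) toggles between upper and lower case,
        // because 'a' - 'A' = 61H - 41H = 20H.
        System.out.println((char) ('a' ^ 0x20)); // A
        System.out.println((char) ('A' ^ 0x20)); // a
        System.out.println((char) ('z' ^ 0x20)); // Z
    }
}
```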

8-bit Latin-1 (aka ISO/IEC 8859-1)

ISO/IEC 8859 is a collection of 8-bit character encoding standards for the western languages.

ISO/IEC 8859-1, aka Latin alphabet No. 1, or Latin-1 in short, is the most commonly-used encoding scheme for western European languages. It has 191 printable characters from the Latin script, which covers languages like English, German, Italian, Portuguese and Spanish. Latin-1 is backward compatible with the 7-bit US-ASCII code. That is, the first 128 characters in Latin-1 (code numbers 0 to 127 (7FH)) are the same as US-ASCII. Code numbers 128 (80H) to 159 (9FH) are not assigned. Code numbers 160 (A0H) to 255 (FFH) are given as follows:

Hex 0 1 2 3 4 5 6 7 8 9 A B C D E F
A NBSP ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ SHY ® ¯
B ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
C À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
D Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
E à á â ã ä å æ ç è é ê ë ì í î ï
F ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ

ISO/IEC 8859 has 16 parts. Besides the most commonly-used Part 1, Part 2 is meant for Central European (Polish, Czech, Hungarian, etc.), Part 3 for South European (Turkish, etc.), Part 4 for North European (Estonian, Latvian, etc.), Part 5 for Cyrillic, Part 6 for Arabic, Part 7 for Greek, Part 8 for Hebrew, Part 9 for Turkish, Part 10 for Nordic, Part 11 for Thai, Part 12 was abandoned, Part 13 for Baltic Rim, Part 14 for Celtic, Part 15 for French, Finnish, etc., and Part 16 for South-Eastern European.

Other 8-bit Extensions of US-ASCII (ASCII Extensions)

Besides the standardized ISO-8859-x, there are many 8-bit ASCII extensions, which are not compatible with each other.

ANSI (American National Standards Institute) (aka Windows-1252, or Windows Codepage 1252): for Latin alphabets used in the legacy DOS/Windows systems. It is a superset of ISO-8859-1 with code numbers 128 (80H) to 159 (9FH) assigned to displayable characters, such as "smart" single-quotes and double-quotes. A common problem in web browsers is that all the quotes and apostrophes (produced by "smart quotes" in some Microsoft software) are replaced with question marks or strange symbols. This is because the document is labeled as ISO-8859-1 (instead of Windows-1252), where these code numbers are undefined. Most modern browsers and e-mail clients treat charset ISO-8859-1 as Windows-1252 in order to accommodate such mislabeling.

Hex 0 1 2 3 4 5 6 7 8 9 A B C D E F
8 € ‚ ƒ „ … † ‡ ˆ ‰ Š ‹ Œ Ž
9 ' ' " " • – — ˜ ™ š › œ ž Ÿ

EBCDIC (Extended Binary Coded Decimal Interchange Code): Used in the early IBM computers.

Unicode (aka ISO/IEC 10646 Universal Character Set)

Before Unicode, no single character encoding scheme could represent characters in all languages. For example, western European languages use several encoding schemes (in the ISO-8859-x family). Even a single language like Chinese has a few encoding schemes (GB2312/GBK, BIG5). Many encoding schemes conflict with each other, i.e., the same code number is assigned to different characters.

Unicode aims to provide a standard character encoding scheme which is universal, efficient, uniform and unambiguous. The Unicode standard is maintained by a non-profit organization called the Unicode Consortium (@ www.unicode.org). Unicode is an ISO/IEC standard (10646).

Unicode is backward compatible with the 7-bit US-ASCII and the 8-bit Latin-1 (ISO-8859-1). That is, the first 128 characters are the same as US-ASCII, and the first 256 characters are the same as Latin-1.

Unicode originally used 16 bits (called UCS-2 or Unicode Character Set - 2 byte), which can represent up to 65,536 characters. It has since been expanded beyond 16 bits, and currently stands at 21 bits. The range of legal codes in ISO/IEC 10646 is now from U+0000H to U+10FFFFH (over one million code points), covering all current and ancient historical scripts. The original 16-bit range of U+0000H to U+FFFFH (65,536 characters) is known as the Basic Multilingual Plane (BMP), covering all the major languages in use today. The characters outside the BMP are called supplementary characters, and are not frequently used.

Unicode has two encoding schemes:

  • UCS-2 (Universal Character Set - 2 Byte): Uses 2 bytes (16 bits), covering the 65,536 characters in the BMP. The BMP is sufficient for most applications. UCS-2 is now obsolete.
  • UCS-4 (Universal Character Set - 4 Byte): Uses 4 bytes (32 bits), covering the BMP and the supplementary characters.

[Figure: the Unicode character set and its encoding schemes]

UTF-8 (Unicode Transformation Format - 8-bit)

The 16/32-bit Unicode (UCS-2/4) is grossly inefficient if the document contains mainly ASCII characters, because each character occupies two (or four) bytes of storage. Variable-length encoding schemes, such as UTF-8, which uses 1-4 bytes to represent a character, were devised to improve the efficiency. In UTF-8, the 128 commonly-used US-ASCII characters use only one byte, but some less-commonly used characters may require up to four bytes. Overall, the efficiency is improved for documents containing mainly US-ASCII text.

The transformation between Unicode and UTF-8 is as follows:

Bits Unicode UTF-8 Code Bytes
7 00000000 0xxxxxxx 0xxxxxxx 1 (ASCII)
11 00000yyy yyxxxxxx 110yyyyy 10xxxxxx 2
16 zzzzyyyy yyxxxxxx 1110zzzz 10yyyyyy 10xxxxxx 3
21 000uuuuu zzzzyyyy yyxxxxxx 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx 4

In UTF-8, Unicode numbers corresponding to the 7-bit ASCII characters are padded with a leading zero, and thus have the same values as ASCII. Hence, UTF-8 can be used with all software using ASCII. Unicode numbers of 128 and above, which are less frequently used, are encoded using more bytes (2-4 bytes). UTF-8 generally requires less storage and is compatible with ASCII. The drawback of UTF-8 is that more processing power is needed to unpack the code, due to its variable length. UTF-8 is the most popular format for Unicode.

Notes:

  • UTF-8 uses 1-3 bytes for the characters in the BMP (16 bits), and 4 bytes for supplementary characters outside the BMP (21 bits).
  • The 128 ASCII characters (basic Latin letters, digits, and punctuation signs) use one byte. Most European and Middle Eastern characters use a 2-byte sequence, which includes extended Latin letters (with tilde, macron, acute, grave and other accents), Greek, Armenian, Hebrew, Arabic, and others. Chinese, Japanese and Korean (CJK) characters use 3-byte sequences.
  • All the bytes, except those of the 128 ASCII characters, have a leading '1' bit. In other words, the ASCII bytes, with a leading '0' bit, can be identified and decoded easily.

Example: 您好 (Unicode: 60A8H 597DH)

Unicode (UCS-2) 60A8H = 0110 000010 101000B ⇒ UTF-8 is 11100110 10000010 10101000B = E6 82 A8H
Unicode (UCS-2) 597DH = 0101 100101 111101B ⇒ UTF-8 is 11100101 10100101 10111101B = E5 A5 BDH
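The worked example above can be checked with Java's built-in UTF-8 encoder (a small sketch; the class name is illustrative):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Demo {
    public static void main(String[] args) {
        // Encode the two BMP characters above into UTF-8 (3 bytes each)
        byte[] utf8 = "您好".getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : utf8) {
            sb.append(String.format("%02X ", b));
        }
        System.out.println(sb.toString().trim()); // E6 82 A8 E5 A5 BD
    }
}
```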

UTF-16 (Unicode Transformation Format - 16-bit)

UTF-16 is a variable-length Unicode character encoding scheme, which uses 2 or 4 bytes. UTF-16 is not commonly used. The transformation table is as follows:

Unicode UTF-16 Code Bytes
xxxxxxxx xxxxxxxx Same as UCS-2 - no encoding 2
000uuuuu zzzzyyyy yyxxxxxx
(uuuuu ≠ 0)
110110ww wwzzzzyy 110111yy yyxxxxxx
(wwww = uuuuu - 1)
4

Take note that for the 65,536 characters in the BMP, UTF-16 is the same as UCS-2 (2 bytes). However, 4 bytes are used for the supplementary characters outside the BMP.

For BMP characters, UTF-16 is the same as UCS-2. For supplementary characters, each character requires a pair of 16-bit values: the first from the high-surrogates range (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
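The surrogate-pair formula above can be sketched in Java, using U+1F600 (an emoji outside the BMP) as an assumed example character, and checked against the JDK's Character.toChars:

```java
public class SurrogatePair {
    public static void main(String[] args) {
        int codePoint = 0x1F600;              // a supplementary character (outside BMP)
        int v = codePoint - 0x10000;          // 20-bit value: (uuuuu-1) zzzz yyyy yy xxxxxx
        char high = (char) (0xD800 | (v >> 10));    // 110110ww wwzzzzyy
        char low  = (char) (0xDC00 | (v & 0x3FF));  // 110111yy yyxxxxxx
        System.out.printf("%04X %04X%n", (int) high, (int) low); // D83D DE00

        // The JDK computes the same pair:
        char[] pair = Character.toChars(codePoint);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // D83D DE00
    }
}
```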

UTF-32 (Unicode Transformation Format - 32-bit)

Same as UCS-4: it uses 4 bytes for each character - unencoded.

Formats of Multi-Byte (e.g., Unicode) Text Files

Endianness (or byte-order): For a multi-byte character, you need to take care of the order of the bytes in storage. In big endian, the most significant byte is stored at the memory location with the lowest address (big byte first). In little endian, the most significant byte is stored at the memory location with the highest address (little byte first). For example, 您 (with Unicode number 60A8H) is stored as 60 A8 in big endian, and as A8 60 in little endian. Big endian, which produces a more readable hex dump, is more commonly used, and is often the default.

BOM (Byte Order Mark): The BOM is a special Unicode character with code number FEFFH, which is used to differentiate big-endian and little-endian files. In big-endian storage, the BOM appears as FE FFH; in little-endian storage, as FF FEH. Unicode reserves these two code numbers to prevent them from clashing with other characters.

Unicode text files can take on these formats:

  • Big endian: UCS-2BE, UTF-16BE, UTF-32BE.
  • Little endian: UCS-2LE, UTF-16LE, UTF-32LE.
  • UTF-16 with BOM: The first character of the file is a BOM character, which specifies the endianness. For big-endian, the BOM appears as FE FFH in the storage. For little-endian, the BOM appears as FF FEH.

A UTF-8 file has no endianness issue, as its code units are single bytes, so a BOM plays no part in determining byte order. However, in some systems (in particular Windows), a BOM is added as the first character of a UTF-8 file as a signature to identify the file as UTF-8 encoded. The BOM character (FEFFH) is encoded in UTF-8 as EF BB BF. Adding a BOM as the first character of the file is not recommended, as it may be incorrectly interpreted on other systems. You can have a UTF-8 file without a BOM.
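The BOM byte sequences can be observed by encoding the BOM character in different schemes (a minimal sketch; the class name is illustrative):

```java
import java.nio.charset.StandardCharsets;

public class BomDemo {
    public static void main(String[] args) {
        String bom = "\uFEFF";  // the BOM character
        print(bom.getBytes(StandardCharsets.UTF_16BE)); // FE FF
        print(bom.getBytes(StandardCharsets.UTF_16LE)); // FF FE
        print(bom.getBytes(StandardCharsets.UTF_8));    // EF BB BF
    }

    static void print(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02X ", b));
        System.out.println(sb.toString().trim());
    }
}
```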

Formats of Text Files

Line Delimiter or End-Of-Line (EOL): Sometimes, when you use Windows NotePad to open a text file created on Unix or Mac, all the lines are joined together. This is because different operating platforms use different characters as the so-called line delimiter (or end-of-line, EOL). Two non-printable control characters are involved: 0AH (Line-Feed or LF) and 0DH (Carriage-Return or CR).

  • Windows/DOS uses 0D0AH (CR+LF or "\r\n") as EOL.
  • Unix and Mac use 0AH (LF or "\n") only.
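When reading text of unknown origin, one common approach is to split on any line terminator. A small sketch (the "\R" regex requires Java 8 or later; the class name is illustrative):

```java
public class EolDemo {
    public static void main(String[] args) {
        String windowsText = "line1\r\nline2\r\n";
        String unixText = "line1\nline2\n";
        // "\\R" matches any line terminator: \r\n, \n, or \r
        String[] w = windowsText.split("\\R");
        String[] u = unixText.split("\\R");
        System.out.println(w.length + " " + u.length); // 2 2
        System.out.println(w[1]); // line2
    }
}
```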

End-of-File (EOF): [TODO]

Windows' CMD Codepage

The character encoding scheme (charset) in Windows is called a codepage. In a CMD shell, you can issue the command "chcp" to display the current codepage, or "chcp codepage-number" to change the codepage.

Take note that:

  • The default codepage 437 (used in the original DOS) is an 8-bit character set called Extended ASCII, which is different from Latin-1 for code numbers above 127.
  • Codepage 1252 (Windows-1252) is not exactly the same as Latin-1. It assigns code numbers 80H to 9FH to letters and punctuation, such as smart single-quotes and double-quotes. A common problem in browsers that display quotes and apostrophes as question marks or boxes is that the page is actually Windows-1252, but mislabelled as ISO-8859-1.
  • For internationalization and Chinese character sets: codepage 65001 for UTF-8, codepage 1201 for UCS-2BE, codepage 1200 for UCS-2LE, codepage 936 for Chinese characters in GB2312, codepage 950 for Chinese characters in Big5.

Chinese Character Sets

Unicode supports all languages, including Asian languages like Chinese (both simplified and traditional characters), Japanese and Korean (collectively called CJK). There are more than 20,000 CJK characters in Unicode. Unicode characters are often encoded in the UTF-8 scheme, which, unfortunately, requires 3 bytes for each CJK character, instead of 2 bytes in the unencoded UCS-2 (UTF-16).

Worse still, there are also various Chinese character sets, which are not compatible with Unicode:

  • GB2312/GBK: for simplified Chinese characters. GB2312 uses 2 bytes for each Chinese character. The most significant bit (MSB) of both bytes is set to 1 to co-exist with 7-bit ASCII (whose MSB is 0). There are about 6,700 characters. GBK is an extension of GB2312, which includes more characters as well as traditional Chinese characters.
  • BIG5: for traditional Chinese characters. BIG5 also uses 2 bytes for each Chinese character. The most significant bit of both bytes is also set to 1. BIG5 is not compatible with GBK, i.e., the same code number is assigned to different characters.

For example, the world is made more interesting with these many standards:

Standard     Charset Characters Codes
Simplified   GB2312  和谐       BACD D0B3
             UCS-2   和谐       548C 8C10
             UTF-8   和谐       E5928C E8B090
Traditional  BIG5    和諧       A94D BFD3
             UCS-2   和諧       548C 8AE7
             UTF-8   和諧       E5928C E8ABA7

Notes for Windows' CMD Users: To display Chinese characters correctly in a CMD shell, you need to choose the correct codepage, e.g., 65001 for UTF-8, 936 for GB2312/GBK, 950 for Big5, 1201 for UCS-2BE, 1200 for UCS-2LE, 437 for the original DOS. You can use the command "chcp" to display the current codepage and "chcp codepage_number" to change the codepage. You also have to choose a font that can display the characters (e.g., Courier New, Consolas or Lucida Console, NOT Raster font).

Collating Sequences (for Ranking Characters)

A string consists of a sequence of characters in upper or lower case, e.g., "apple", "Boy", "Cat". In sorting or comparing strings, if we order the characters according to the underlying code numbers (e.g., US-ASCII) character-by-character, the order for the example would be "Boy", "Cat", "apple", because uppercase letters have smaller code numbers than lowercase letters. This does not agree with the so-called dictionary order, where the same uppercase and lowercase letters have the same rank. Another common problem in ordering strings is that "10" (ten) is at times ordered in front of "1" to "9".

Hence, in sorting or comparing strings, a so-called collating sequence (or collation) is often defined, which specifies the ranks for letters (uppercase, lowercase), numbers, and special symbols. There are many collating sequences available. It is entirely up to you to choose a collating sequence that meets your application's specific requirements. Some case-insensitive dictionary-order collating sequences give the same rank to the uppercase and lowercase forms of a letter, i.e., 'A','a' ⇒ 'B','b' ⇒ ... ⇒ 'Z','z'. Some case-sensitive dictionary-order collating sequences put the uppercase letter before its lowercase counterpart, i.e., 'A' ⇒ 'a' ⇒ 'B' ⇒ 'b' ⇒ ... ⇒ 'Z' ⇒ 'z'. Typically, space is ranked before the digits '0' to '9', followed by the alphabets.

A collating sequence is often language dependent, as different languages use different sets of characters (e.g., á, é, a, α) with their own orders.
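In Java, a locale-aware collating sequence is provided by java.text.Collator, in contrast to the raw code-number order of String.compareTo. A small sketch (the class name is illustrative):

```java
import java.text.Collator;
import java.util.Locale;

public class CollatorDemo {
    public static void main(String[] args) {
        // Raw code-number order: uppercase ranks before lowercase
        System.out.println("Boy".compareTo("apple") < 0);  // true ("Boy" before "apple")

        // A locale-aware collating sequence in dictionary order
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(Collator.PRIMARY); // ignore case (and accent) differences
        System.out.println(collator.compare("apple", "Boy") < 0); // true ("apple" before "Boy")
    }
}
```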

For Java Programmers - java.nio.Charset

JDK 1.4 introduced the java.nio.charset package to support encoding/decoding of characters from the UCS-2 used internally in Java programs to any supported charset used by external devices.

Example: The following program encodes some Unicode text in various encoding schemes, and displays the hex codes of the encoded byte sequences.

import java.nio.ByteBuffer;
import java.nio.charset.Charset;

public class TestCharsetEncodeDecode {
   public static void main(String[] args) {
      String[] charsetNames = {"US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16",
                               "UTF-16BE", "UTF-16LE", "GBK", "BIG5"};
      String message = "Hi,您好!";

      System.out.printf("%10s: ", "UCS-2");
      for (int i = 0; i < message.length(); i++) {
         System.out.printf("%04X ", (int)message.charAt(i));
      }
      System.out.println();

      for (String charsetName : charsetNames) {
         Charset charset = Charset.forName(charsetName);
         System.out.printf("%10s: ", charset.name());
         ByteBuffer bb = charset.encode(message);
         while (bb.hasRemaining()) {
            System.out.printf("%02X ", bb.get());
         }
         System.out.println();
      }
   }
}
     UCS-2: 0048 0069 002C 60A8 597D 0021
  US-ASCII: 48 69 2C 3F 3F 21
ISO-8859-1: 48 69 2C 3F 3F 21
     UTF-8: 48 69 2C E6 82 A8 E5 A5 BD 21
    UTF-16: FE FF 00 48 00 69 00 2C 60 A8 59 7D 00 21
  UTF-16BE: 00 48 00 69 00 2C 60 A8 59 7D 00 21
  UTF-16LE: 48 00 69 00 2C 00 A8 60 7D 59 21 00
       GBK: 48 69 2C C4 FA BA C3 21
      Big5: 48 69 2C B1 7A A6 6E 21

For Java Programmers - char and String

The char data type is based on the original 16-bit Unicode standard called UCS-2. Unicode has since evolved to 21 bits, with a code range of U+0000 to U+10FFFF. The set of characters from U+0000 to U+FFFF is known as the Basic Multilingual Plane (BMP). Characters above U+FFFF are called supplementary characters. A 16-bit Java char cannot hold a supplementary character.

Recall that in the UTF-16 encoding scheme, a BMP character uses 2 bytes; this is the same as UCS-2. A supplementary character uses 4 bytes, and requires a pair of 16-bit values: the first from the high-surrogates range (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).

In Java, a String is a sequence of Unicode characters. Java, in fact, uses UTF-16 for String and StringBuffer. For BMP characters, they are the same as UCS-2. For supplementary characters, each character requires a pair of char values.

Java methods that accept a 16-bit char value do not support supplementary characters. Methods that accept a 32-bit int value support all Unicode characters (in the lower 21 bits), including supplementary characters.

This is meant to be an academic discussion. I have yet to encounter the use of supplementary characters!
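The char-pair behavior above can be observed with the code-point methods on String (U+1F600, an emoji, is an assumed example; the class name is illustrative):

```java
public class SupplementaryDemo {
    public static void main(String[] args) {
        // U+1F600 lies outside the BMP, so it needs a surrogate pair
        String s = new String(Character.toChars(0x1F600));
        System.out.println(s.length());                      // 2 (two char values)
        System.out.println(s.codePointCount(0, s.length())); // 1 (one character)
        System.out.printf("%X%n", s.codePointAt(0));         // 1F600
    }
}
```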

Displaying Hex Values & Hex Editors

At times, you may need to display the hex values of a file, especially when dealing with Unicode characters. A Hex Editor is a handy tool that a proficient programmer should possess in his/her toolbox. There are many freeware/shareware Hex Editors available. Try googling "Hex Editor".

I used the following:

  • NotePad++ with Hex Editor Plug-in: Open-source and free. You can toggle between Hex view and Normal view by pushing the "H" button.
  • PSPad: Freeware. You can toggle to Hex view by choosing the "View" menu and selecting "Hex Edit Mode".
  • TextPad: Shareware without expiration period. To view the Hex value, you need to "open" the file by choosing the file format of "binary" (??).
  • UltraEdit: Shareware, not free, 30-day trial only.

Let me know if you have a better choice, which is fast to launch, easy to use, can toggle between Hex and normal view, free, ....

The following Java program can be used to display the hex codes of Java primitives (integer, character and floating-point):

public class PrintHexCode {
   public static void main(String[] args) {
      int i = 12345;
      System.out.println("Decimal is " + i);
      System.out.println("Hex is " + Integer.toHexString(i));
      System.out.println("Binary is " + Integer.toBinaryString(i));
      System.out.println("Octal is " + Integer.toOctalString(i));
      System.out.printf("Hex is %x\n", i);
      System.out.printf("Octal is %o\n", i);

      char c = 'a';
      System.out.println("Character is " + c);
      System.out.printf("Character is %c\n", c);
      System.out.printf("Hex is %x\n", (short)c);
      System.out.printf("Decimal is %d\n", (short)c);

      float f = 3.5f;
      System.out.println("Decimal is " + f);
      System.out.println(Float.toHexString(f));

      f = -0.75f;
      System.out.println("Decimal is " + f);
      System.out.println(Float.toHexString(f));

      double d = 11.22;
      System.out.println("Decimal is " + d);
      System.out.println(Double.toHexString(d));
   }
}

In Eclipse, you can view the hex codes of primitive Java variables in debug mode as follows: In the debug perspective, "Variables" panel ⇒ select the "menu" (inverted triangle) ⇒ Java ⇒ Java Preferences... ⇒ Primitive Display Options ⇒ check "Display hexadecimal values (byte, short, char, int, long)".

Summary - Why Bother about Data Representation?

The integer number 1, the floating-point number 1.0, the character symbol '1', and the string "1" are totally different inside the computer memory. You need to know the difference to write good and high-performance programs.

  • In 8-bit signed integer, integer number 1 is represented as 00000001B.
  • In 8-bit unsigned integer, integer number 1 is represented as 00000001B.
  • In 16-bit signed integer, integer number 1 is represented as 00000000 00000001B.
  • In 32-bit signed integer, integer number 1 is represented as 00000000 00000000 00000000 00000001B.
  • In 32-bit floating-point representation, number 1.0 is represented as 0 01111111 0000000 00000000 00000000B, i.e., S=0, E=127, F=0.
  • In 64-bit floating-point representation, number 1.0 is represented as 0 01111111111 0000 00000000 00000000 00000000 00000000 00000000 00000000B, i.e., S=0, E=1023, F=0.
  • In 8-bit Latin-1, the character symbol '1' is represented as 00110001B (or 31H).
  • In 16-bit UCS-2, the character symbol '1' is represented as 00000000 00110001B.
  • In UTF-8, the character symbol '1' is represented as 00110001B.

If you "add" a 16-bit signed integer 1 and a Latin-1 character '1' or a string "1", you could get a surprise.
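The "surprise" can be demonstrated in Java (a minimal sketch; the class name is illustrative):

```java
public class AddSurprise {
    public static void main(String[] args) {
        // '1' is the character with code number 49 (31H), not the integer 1,
        // so char-to-int promotion gives integer addition: 1 + 49
        System.out.println(1 + '1');   // 50
        // "1" is a string, so + means string concatenation
        System.out.println(1 + "1");   // 11
    }
}
```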

Exercises (Data Representation)

For the following 16-bit codes:

0000 0000 0010 1010; 1000 0000 0010 1010;

Give their values, if they are representing:

  1. a 16-bit unsigned integer;
  2. a 16-bit signed integer;
  3. two 8-bit unsigned integers;
  4. two 8-bit signed integers;
  5. a 16-bit Unicode character;
  6. two 8-bit ISO-8859-1 characters.

Ans: (1) 42, 32810; (2) 42, -32726; (3) 0, 42; 128, 42; (4) 0, 42; -128, 42; (5) '*', '耪'; (6) NUL, '*'; PAD, '*'.

REFERENCES & RESOURCES

  1. (Floating-Point Number Specification) IEEE 754 (1985), "IEEE Standard for Binary Floating-Point Arithmetic".
  2. (ASCII Specification) ISO/IEC 646 (1991) (or ITU-T T.50-1992), "Information technology - 7-bit coded character set for information interchange".
  3. (Latin-1 Specification) ISO/IEC 8859-1, "Information technology - 8-bit single-byte coded graphic character sets - Part 1: Latin alphabet No. 1".
  4. (Unicode Specification) ISO/IEC 10646, "Information technology - Universal Multiple-Octet Coded Character Set (UCS)".
  5. Unicode Consortium @ http://www.unicode.org.


Source: https://www3.ntu.edu.sg/home/ehchua/programming/java/datarepresentation.html
