Data Representation in Computer: Number Systems, Characters, Audio, Image and Video

What is data representation in a computer?

A computer uses a fixed number of bits to represent a piece of data which could be a number, a character, image, sound, video, etc. Data representation is the method used internally to represent data in a computer. Let us see how various types of data can be represented in computer memory.

Number Systems

A number system is a technique for representing numbers in computer system architecture; every value that you save to or retrieve from computer memory has a defined number system.

The number 289 is pronounced as two hundred and eighty-nine and it consists of the symbols 2, 8, and 9. Similarly, there are other number systems. Each has its own symbols and method for constructing a number.

Let us discuss some of these number systems. Computer architecture supports the following number systems:

Binary Number System

Octal Number System

Decimal Number System

The decimal number system has ten (10) digits, from 0 to 9. Every number (value) is represented with the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9 in this number system. The base of the decimal number system is 10, because it has 10 digits.

Hexadecimal Number System

Data Representation of Characters

There are different methods to represent characters. Some of them are discussed below:

The most common method is ASCII (American Standard Code for Information Interchange), which uses a 7-bit code. Since there are exactly 128 unique combinations of 7 bits, this 7-bit code can represent only 128 characters. Another version is ASCII-8, also called extended ASCII, which uses 8 bits for each character and can represent 256 different characters.

If ASCII-coded data is to be used in a computer that uses EBCDIC representation, it is necessary to transform ASCII code to EBCDIC code. Similarly, if EBCDIC coded data is to be used in an ASCII computer, EBCDIC code has to be transformed to ASCII.

Using 8-bit ASCII we can represent only 256 characters. This cannot cover all the characters of the world's written languages and other symbols. Unicode was developed to resolve this problem; it aims to provide a character encoding scheme that is universal and efficient.

Data Representation of Audio, Image and Video

Besides numbers and characters, we often have to represent and process other kinds of data, such as audio, images, and video. Like numbers and characters, audio, image, and video data also carry information.

For example, an image is most popularly stored in the Joint Photographic Experts Group (JPEG) file format. An image file consists of two parts – header information and image data. Information such as the name of the file, its size, modified date, file format, etc. is stored in the header part.

Numerous such techniques are used to achieve compression. Depending on the application, images are stored in various file formats such as Bitmap (BMP), Tagged Image File Format (TIFF), Graphics Interchange Format (GIF), and Portable Network Graphics (PNG).

Similarly, video is stored in various file formats such as AVI (Audio Video Interleave) – a format designed to store both audio and video data in a standard package that allows synchronous audio-with-video playback – as well as MPEG-2, WMV, etc.


Character Sets


Candidates should be able to:

  • explain the use of binary codes to represent characters
  • explain the term character set
  • describe with examples (for example ASCII and Unicode) the relationship between the number of bits per character in a character set and the number of characters which can be represented.

How are binary codes used to represent characters?

Each character (such as an uppercase or lowercase letter, a number, or a symbol) must be stored as a unique number called a character code if a computer system is to be able to store and process it. For example, the number code for the character 'a' could be decimal 97, and a 'space' character could be 32. When a character is stored on a computer system, it is therefore the number code that is actually stored (as a binary number).
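To make this concrete, here is a minimal sketch (assuming any Python 3 interpreter) of how these character codes can be inspected; Python's built-in ord() and chr() expose exactly this character-to-number mapping.

    # Inspect the number codes behind characters (Python 3).
    print(ord('a'))        # 97  - the number code stored for 'a'
    print(ord(' '))        # 32  - even a space has a number code
    print(chr(65))         # 'A' - mapping a code back to its character
    print(bin(ord('a')))   # '0b1100001' - the binary form actually stored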

What is a character set?

A character set is a complete set of the characters and their number codes that can be recognised by a computer system.

How is the number of bits per character related to the number of different possible characters?

The ASCII character set – 7-8 bits per character

The ASCII (American Standard Code for Information Interchange) character set uses 1 byte of memory per character. Original versions of ASCII used only 7 of the 8 available bits, allowing 128 different characters to be represented using the binary codes 0000000 to 1111111. Extended ASCII character sets use all 8 bits, allowing 256 characters in total. This is still a very limited number of characters, and means that different extended ASCII character sets are needed for the symbols and accented characters used in different countries. The table below shows example characters, their decimal codes and the binary codes actually stored by the computer:

Character   Decimal   Binary
A           65        1000001
B           66        1000010
a           97        1100001
b           98        1100010
SPACE       32        0100000
  • ASCII code 13 is a carriage return, moving the cursor to a new line;
  • ASCII code 9 inserts a tab into a line of text.

The Unicode Character Set – 16 bits per character

The Unicode character set potentially uses up to 16 bits (2 bytes) of memory per character, which allows 65,536 different characters to be represented. Using 16 bits means that Unicode can represent the characters and symbols from all the alphabets that exist around the globe, rather than having to use different character sets for different countries. The first 128 characters in ASCII have the same numeric codes as those in Unicode, making Unicode backward compatible with ASCII.

Further Readings:

  • Character encoding
  • Character representation

A text character is a letter, digit, punctuation mark, or other symbol that you might type in on your keyboard.

Computers store characters as numbers – each letter or other symbol has a unique number (for example, A might be represented by the number 65, B by 66 etc). A character string such as "Hello world!" is represented by a list of numbers.

A character set is a list of all the characters which are recognised by a computer's hardware and software. It also defines which number is used to represent each character. Here is a simple character set which just contains the capital letters, and gives each letter a different number to identify it:

[Image: a simple character set, giving each capital letter a number]

In order to transfer text data between different computer systems, we need to use a standard character set that all computers understand. There are two standard character sets which are very widely used – ASCII and Unicode.


Data Representation: Text


There are several different ways in which computers use bits to store text. In this section, we will look at some common ones and then look at the pros and cons of each representation.

We saw earlier that 64 unique patterns can be made using 6 dots in Braille. A dot corresponds to a bit, because both dots and bits have 2 different possible values.

Count how many different characters that you could type into a text editor using your keyboard. (Don’t forget to count both of the symbols that share the number keys, and the symbols to the side that are for punctuation!)

The collective name for upper case letters, lower case letters, numbers, and symbols is characters, e.g. a, D, 1, h, 6, *, ], and ~ are all characters. Importantly, a space is also a character.

If you counted correctly, you should find that there were more than 64 characters, and you might have found up to around 95. Because 6 bits can only represent 64 characters, we will need more than 6 bits; it turns out that we need at least 7 bits to represent all of these characters as this gives 128 possible patterns. This is exactly what the ASCII representation for text does.

In the previous section, we explained what happens when the number of dots was increased by 1 (remember that a dot in Braille is effectively a bit). Can you explain how we knew that if 6 bits is enough to represent 64 characters, then 7 bits must be enough to represent 128 characters?

Each pattern in ASCII is usually stored in 8 bits, with one wasted bit, rather than 7 bits. The left-most bit in each 8-bit pattern is a 0, meaning there are still only 128 possible patterns. Where possible, we prefer to deal with full bytes (8 bits) on a computer; this is why ASCII has an extra wasted bit.

Here is a table that shows the patterns of bits that ASCII uses for each of the characters.

Binary  Char    Binary  Char    Binary  Char
0100000 Space   1000000 @       1100000 `
0100001 ! 1000001 A 1100001 a
0100010 " 1000010 B 1100010 b
0100011 # 1000011 C 1100011 c
0100100 $ 1000100 D 1100100 d
0100101 % 1000101 E 1100101 e
0100110 & 1000110 F 1100110 f
0100111 ' 1000111 G 1100111 g
0101000 ( 1001000 H 1101000 h
0101001 ) 1001001 I 1101001 i
0101010 * 1001010 J 1101010 j
0101011 + 1001011 K 1101011 k
0101100 , 1001100 L 1101100 l
0101101 - 1001101 M 1101101 m
0101110 . 1001110 N 1101110 n
0101111 / 1001111 O 1101111 o
0110000 0 1010000 P 1110000 p
0110001 1 1010001 Q 1110001 q
0110010 2 1010010 R 1110010 r
0110011 3 1010011 S 1110011 s
0110100 4 1010100 T 1110100 t
0110101 5 1010101 U 1110101 u
0110110 6 1010110 V 1110110 v
0110111 7 1010111 W 1110111 w
0111000 8 1011000 X 1111000 x
0111001 9 1011001 Y 1111001 y
0111010 : 1011010 Z 1111010 z
0111011 ; 1011011 [ 1111011 {
0111100 < 1011100 \ 1111100 |
0111101 = 1011101 ] 1111101 }
0111110 > 1011110 ^ 1111110 ~
0111111 ? 1011111 _ 1111111 Delete

For example, the letter "c" (lower case) in the table has the pattern "01100011" (the 0 at the front is just extra padding to make it up to 8 bits). The letter "o" has the pattern "01101111". You could write a word out using this code, and if you give it to someone else, they should be able to decode it exactly.

Computers can represent pieces of text with sequences of these patterns, much like Braille does. For example, the word "computers" (all lower case) would be 01100011 01101111 01101101 01110000 01110101 01110100 01100101 01110010 01110011. This is because "c" is "01100011", "o" is "01101111", and so on. Have a look at the ASCII table above to check if we are right!
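If you want to check the example above mechanically rather than by hand, the following sketch (Python 3 assumed) looks up each character's ASCII code and prints it as an 8-bit pattern:

    # Encode "computers" as ASCII and show each 8-bit pattern.
    word = "computers"
    patterns = [format(code, '08b') for code in word.encode('ascii')]
    print(' '.join(patterns))
    # 01100011 01101111 01101101 01110000 01110101
    # 01110100 01100101 01110010 01110011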

The name "ASCII" stands for "American Standard Code for Information Interchange", which was a particular way of assigning bit patterns to the characters on a keyboard. The ASCII system even includes "characters" for ringing a bell (useful for getting attention on old telegraph systems), deleting the previous character (kind of an early "undo"), and "end of transmission" (to let the receiver know that the message was finished). These days those characters are rarely used, but the codes for them still exist (they are the missing patterns in the table above). Nowadays ASCII has been supplanted by a code called "UTF-8", which happens to be the same as ASCII if the extra left-hand bit is a 0, but opens up a huge range of characters if the left-hand bit is a 1.

Have a go at the following ASCII exercises:

  • How would you represent "science" in ASCII? (ignore the " marks)
  • How would you represent "Wellington" in ASCII? (note that it starts with an upper case "W")
  • How would you represent "358" in ASCII? (it is three characters, even though it looks like a number)
  • How would you represent "Hello, how are you?" in ASCII? (look for the comma, question mark, and space characters in the ASCII table)

Be sure to have a go at all of them before checking the answer!

Here are the answers:

  • "science" = 01110011 01100011 01101001 01100101 01101110 01100011 01100101
  • "Wellington" = 01010111 01100101 01101100 01101100 01101001 01101110 01100111 01110100 01101111 01101110
  • "358" = 00110011 00110101 00111000
  • "Hello, how are you?" = 1001000 1100101 1101100 1101100 1101111 0101100 0100000 1101000 1101111 1110111 0100000 1100001 1110010 1100101 0100000 1111001 1101111 1110101 0111111

Note that the text "358" is treated as 3 characters in ASCII, which may be confusing, as the text "358" is different from the number 358! You may have encountered this distinction in a spreadsheet, e.g. if a cell starts with an inverted comma in Excel, it is treated as text rather than a number. One place this comes up is with phone numbers: if you type 027555555 into a spreadsheet as a number, it will come up as 27555555, but as text the leading 0 can be displayed. In fact, phone numbers aren't really just numbers – a leading zero can be important, and they can contain other characters, for example, +64 3 555 1234 extn. 1234.

ASCII usage in practice

ASCII was first used commercially in 1963, and despite the big changes in computers since then, it is still the basis of how English text is stored on computers. ASCII assigned a different pattern of bits to each of the characters, along with a few other "control" characters, such as delete or backspace.

English text can easily be represented using ASCII, but what about languages such as Chinese, where there are thousands of different characters? Unsurprisingly, 128 patterns aren't nearly enough to represent such languages, so ASCII on its own is not sufficient in practice and is no longer widely used by itself. In the next sections, we will look at Unicode and its representations, which solve the problem of being unable to represent non-English characters.

There are several other codes that were popular before ASCII, including the Baudot code and EBCDIC. A widely used variant of the Baudot code was the "Murray code", named after New Zealand-born inventor Donald Murray. One of Murray's significant improvements was to introduce the idea of "control characters", such as the carriage return (new line). The "control" key still exists on modern keyboards.

In practice, we need to be able to represent more than just English characters. To solve this problem, we use a standard called Unicode. Unicode is a character set with around 120,000 different characters, in many different languages, current and historic. Each character has a unique number assigned to it, making it easy to identify.

Unicode itself is not a representation – it is a character set. In order to represent Unicode characters as bits, a Unicode encoding scheme is used. The Unicode encoding scheme tells us how each number (which corresponds to a Unicode character) should be represented with a pattern of bits.

The following interactive will allow you to explore the Unicode character set. Enter a number in the box on the left to see what Unicode character corresponds to it, or enter a character on the right to see what its Unicode number is (you could paste one in from a foreign language web page to see what happens with non-English characters).


The most widely used Unicode encoding schemes are called UTF-8, UTF-16, and UTF-32; you may have seen these names in email headers or describing a text file. Some of the Unicode encoding schemes are fixed length, and some are variable length. Fixed length means that each character is represented using the same number of bits. Variable length means that some characters are represented with fewer bits than others. It's better to be variable length, as this will ensure that the most commonly used characters are represented with fewer bits than the uncommonly used characters. Of course, what might be the most commonly used character in English is not necessarily the most commonly used character in Japanese. You may be wondering why we need so many encoding schemes for Unicode. It turns out that some are better for English language text, and some are better for Asian language text.

The remainder of the text representation section will look at some of these Unicode encoding schemes so that you understand how to use them, and why some of them are better than others in certain situations.

UTF-32 is a fixed length Unicode encoding scheme. The representation for each character is simply its number converted to a 32 bit binary number. Leading zeroes are used if there are not enough bits (just like how you can represent 254 as a 4 digit decimal number – 0254). 32 bits is a nice round number on a computer, often referred to as a word (which is a bit confusing, since we can use UTF-32 characters to represent English words!)

For example, the character H (Unicode number 72) in UTF-32 would be:

00000000 00000000 00000000 01001000

The character $ (Unicode number 36) in UTF-32 would be:

00000000 00000000 00000000 00100100

And the character 犬 (dog in Chinese, Unicode number 29,356) in UTF-32 would be:

00000000 00000000 01110010 10101100
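Since UTF-32 is simply the code point written as a 32-bit number, these examples are easy to reproduce; here is a small sketch (Python 3 assumed) using the big-endian 'utf-32-be' codec so the bytes read left to right:

    # UTF-32 (big-endian): each character is its code point as 32 bits.
    for ch in ['H', '$', '犬']:
        raw = ch.encode('utf-32-be')                      # always 4 bytes
        print(ch, ' '.join(format(b, '08b') for b in raw))
    # H 00000000 00000000 00000000 01001000
    # $ 00000000 00000000 00000000 00100100
    # 犬 00000000 00000000 01110010 10101100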

The following interactive will allow you to convert a Unicode character to its UTF-32 representation. The Unicode character's number is also displayed. The bits are simply the binary number form of the character number.


  • Represent each character in your name using UTF-32.
  • Check how many bits your representation required, and explain why it had this many (remember that each character should have required 32 bits)
  • Explain how you knew how to represent each character. Even if you used the interactive, you should still be able to explain it in terms of binary numbers.

ASCII actually took the same approach. Each ASCII character has a number between 0 and 127, and the representation for the character is that number converted to an 8-bit binary number (the extra left-most bit is always 0). ASCII is also a fixed length encoding scheme – every character in ASCII is represented using 8 bits.

In practice, UTF-32 is rarely used – you can see that it's pretty wasteful of space. UTF-8 and UTF-16 are both variable length encoding schemes, and very widely used. We will look at them next.

What is the largest number that can be represented with 32 bits? (In both decimal and binary).

The largest number in Unicode that has a character assigned to it is not actually the largest possible 32 bit number – it is 00000000 00010000 11111111 11111111. What is this number in decimal?

Most numbers that can be made using 32 bits do not have a Unicode character attached to them – there is a lot of wasted space. There are good reasons for this, but if you had a shorter number that could represent any character, what is the minimum number of bits you would need, given that there are currently around 120,000 Unicode characters?

The largest number that can be represented using 32 bits is 4,294,967,295 (around 4.3 billion). You might have seen this number before – it is the largest unsigned integer that a 32 bit computer can easily represent in programming languages such as C.

The decimal number for the largest character is 1,114,111.

You can represent all current characters with 17 bits. 16 bits give you only 65,536 different values, which is not enough. Going up to 17 bits gives 131,072 values, which is more than 120,000. Therefore, we need 17 bits.

UTF-8 is a variable length encoding scheme for Unicode. Characters with a lower Unicode number require fewer bits for their representation than those with a higher Unicode number. UTF-8 representations contain either 8, 16, 24, or 32 bits. Remembering that a byte is 8 bits, these are 1, 2, 3, and 4 bytes.

For example, the character H (Unicode number 72) in UTF-8 would be:

01001000

The character ǿ (Unicode number 511) in UTF-8 would be:

11000111 10111111

And the character 犬 (dog in Chinese, Unicode number 29,356) in UTF-8 would be:

11100111 10001010 10101100

The following interactive will allow you to convert a Unicode character to its UTF-8 representation. The Unicode character's number is also displayed.

How does UTF-8 work?

So how does UTF-8 actually work? Use the following process to do what the interactive is doing and convert characters to UTF-8 yourself.

Step 1. Look up the Unicode number of your character.

Step 2. Convert the Unicode number to a binary number, using as few bits as necessary. Look back to the section on binary numbers if you cannot remember how to convert a number to binary.

Step 3. Count how many bits are in the binary number, and choose the correct pattern to use based on how many bits there were: up to 7 bits uses 0xxxxxxx; up to 11 bits uses 110xxxxx 10xxxxxx; up to 16 bits uses 1110xxxx 10xxxxxx 10xxxxxx; and up to 21 bits uses 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. Step 4 will explain how to use the pattern.

Step 4. Replace the x's in the pattern with the bits of the binary number you converted in Step 2. If there are more x's than bits, replace extra left-most x's with 0's.

For example, if you wanted to find out the representation for 貓 (cat in Chinese), the steps you would take would be as follows.

Step 1. Determine that the Unicode number for 貓 is 35987 .

Step 2. Convert 35987 to binary – giving 10001100 10010011 .

Step 3. Count that there are 16 bits, and therefore the third pattern 1110xxxx 10xxxxxx 10xxxxxx should be used.

Step 4. Substitute the bits into the pattern to replace the x's – 11101000 10110010 10010011 .

Therefore, the representation for 貓 is 11101000 10110010 10010011 using UTF-8.
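The four steps above can be written directly as code. The sketch below (Python 3; the function name utf8_bytes is just for illustration) handles the 1-, 2- and 3-byte patterns, which cover all code points up to U+FFFF, and checks the result against Python's built-in encoder:

    # A sketch of the UTF-8 steps for code points up to U+FFFF.
    def utf8_bytes(ch):
        cp = ord(ch)              # Step 1: look up the Unicode number
        if cp < 0x80:             # fits in 7 bits -> 0xxxxxxx
            return [cp]
        elif cp < 0x800:          # fits in 11 bits -> 110xxxxx 10xxxxxx
            return [0b11000000 | (cp >> 6),
                    0b10000000 | (cp & 0b111111)]
        else:                     # fits in 16 bits -> 1110xxxx 10xxxxxx 10xxxxxx
            return [0b11100000 | (cp >> 12),
                    0b10000000 | ((cp >> 6) & 0b111111),
                    0b10000000 | (cp & 0b111111)]

    print(' '.join(format(b, '08b') for b in utf8_bytes('貓')))
    # 11101000 10110010 10010011 - matches Step 4 above
    assert bytes(utf8_bytes('貓')) == '貓'.encode('utf-8')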

Just like UTF-8, UTF-16 is a variable length encoding scheme for Unicode. Because it is far more complex than UTF-8, we won't explain how it works here.

However, the following interactive will allow you to represent text with UTF-16. Try putting some text that is in English and some text that is in Japanese into it. Compare the representations to what you get with UTF-8.

We have looked at ASCII, UTF-32, UTF-8, and UTF-16.

The following table summarises what we have said so far about each representation.

Representation   Variable or Fixed   Bits per Character       Real world Usage
ASCII            Fixed Length        8 bits                   No longer widely used
UTF-8            Variable Length     8, 16, 24, or 32 bits    Very widely used
UTF-16           Variable Length     16 or 32 bits            Widely used
UTF-32           Fixed Length        32 bits                  Rarely used

In order to compare and evaluate them, we need to decide what it means for a representation to be "good". Two useful criteria are:

  • Can represent all characters, regardless of language.
  • Represents a piece of text using as few bits as possible.

We know that UTF-8, UTF-16, and UTF-32 can represent all characters, but ASCII can only represent English. Therefore, ASCII fails the first criterion. For the second criterion, it isn't so simple.

To compare them on the second criterion, find some samples of English text and Asian text (forums or a translation site are a good place to look), and see how long your various samples are when encoded with each of the three representations, as in the sketch below.

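The original page has an interactive length calculator at this point; as a stand-in, this sketch (Python 3 assumed; big-endian codecs so no byte-order mark is counted) prints the encoded size of any text in bits:

    # Compare encoded sizes in bits for a piece of text.
    def encoding_lengths(text):
        for enc in ('utf-8', 'utf-16-be', 'utf-32-be'):
            print(enc, len(text.encode(enc)) * 8, 'bits')

    encoding_lengths('Hello, world!')  # English text: UTF-8 is smallest
    encoding_lengths('こんにちは世界')    # Japanese text: UTF-16 beats UTF-8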

As a general rule, UTF-8 is better for English text, and UTF-16 is better for Asian text. UTF-32 always requires 32 bits for each character, so is unpopular in practice.

Those cute little characters that you might use in your social media statuses, texts, and so on are called "emojis", and each one of them has its own Unicode value. Japanese mobile operators were the first to use emojis, but their recent popularity has resulted in many becoming part of the Unicode Standard, and today there are well over 1000 different emojis included, with a current list maintained on the Unicode website. What is interesting to notice is that a single emoji will look very different across different platforms, i.e. U+1F606 ("smiling face with open mouth and tightly-closed eyes") in my tweet will look very different to what it does on your iPhone. This is because the Unicode Consortium only provides the character codes for each emoji and the end vendors determine what that emoji will look like, e.g. for Apple devices the "Apple Color Emoji" typeface is used (there are rules around this to make sure there is consistency across each system).

There are messages hidden in this video using a 5-bit representation. See if you can find them! Start by reading the explanation below to ensure you understand what we mean by a 5-bit representation.

If you only wanted to represent the 26 letters of the alphabet, and weren’t worried about upper case or lower case, you could get away with using just 5 bits, which allows for up to 32 different patterns.

You might have exchanged notes which used 1 for "a", 2 for "b", 3 for "c", all the way up to 26 for "z". We can convert those numbers into 5 digit binary numbers. In fact, you will also get the same 5 bits for each letter by looking at the last 5 bits for it in the ASCII table (and it doesn't matter whether you look at the upper case or the lower case letter).

Represent the word "water" with bits using this system. Check your answer against the sketch below once you think you have it.
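A quick way to generate (or check) the answer, assuming Python 3 and the a=1 ... z=26 numbering described above:

    # Encode "water" with the 5-bit a=1 ... z=26 scheme.
    for ch in "water":
        number = ord(ch) - ord('a') + 1
        print(ch, format(number, '05b'))
    # w 10111, a 00001, t 10100, e 00101, r 10010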

Now, have a go at decoding the music video!


Data Representation: Characters

Basics of Characters in Data Representation:

  • Characters are the smallest readable unit in text, including alphabets, numbers, spaces, symbols etc.
  • In computer systems, each character is represented by a unique binary code.
  • The system by which characters are converted to binary code is known as a character set.

ASCII and Unicode:

  • ASCII (American Standard Code for Information Interchange) and Unicode are two popular character sets.
  • ASCII uses 7 bits to represent each character, leading to a total of 128 possible characters (including some non-printable control characters).
  • As a more modern and comprehensive system, Unicode can represent over a million characters, covering virtually all writing systems in use today.
  • Unicode is backward compatible with ASCII, meaning ASCII values represent the same characters in Unicode.

Importance of Character Sets:

  • Having a standard system for representing characters is important for interoperability, ensuring different systems can read and display the same characters in the same way.
  • This is especially important in programming and when transmitting data over networks.

Understanding Binary Representations:

  • Each character in a character set is represented by a unique binary number. E.g., in ASCII, the capital letter “A” is represented by the binary number 1000001.
  • Different types of data (e.g., characters, integers, floating-point values) are stored in different ways, but ultimately all data in a computer is stored as binary .

Characters in Programming:

  • In most programming languages, single characters are represented within single quotes, e.g., ‘a’, ‘1’, ‘$’.
  • A series of characters, also known as a string , is represented within double quotes, e.g., “Hello, world!”.
  • String manipulation is a key part of many programming tasks, and understanding how characters are represented is essential for manipulating strings effectively (see the sketch below).
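A minimal sketch of these ideas, assuming Python 3 (where a character is simply a string of length one):

    # Characters, strings and their underlying codes in Python.
    c = 'a'                                # a single character
    s = "Hello, world!"                    # a string of characters
    print([ord(ch) for ch in 'a1$'])       # [97, 49, 36]
    print(len(s), s.upper())               # basic string manipulation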

Not Just Text:

  • It’s important to understand that computers interpret everything — not just letters and numbers, but also images, sounds, and more — as binary data.
  • Understanding the binary representation of characters is a foundational part of understanding how data is stored and manipulated in a computer system.

Character encodings: Essential concepts

This article introduces a number of basic concepts needed to understand other articles that deal with characters and character encodings.

Unicode is a universal character set, i.e. a standard that defines, in one place, all the characters needed for writing the majority of living languages in use on computers. It aims to be, and to a large extent already is, a superset of all other character sets that have been encoded.

Text in a computer or on the Web is composed of characters. Characters represent letters of the alphabet, punctuation, or other symbols.

In the past, different organizations have assembled different sets of characters and created encodings for them – one set may cover just Latin-based Western European languages (excluding EU countries such as Bulgaria or Greece), another may cover a particular Far Eastern language (such as Japanese), others may be one of many sets devised in a rather ad hoc way for representing another language somewhere in the world.

Unfortunately, you can’t guarantee that your application will support all encodings, nor that a given encoding will support all your needs for representing a given language. In addition, it is usually impossible to combine different encodings on the same Web page or in a database, so it is usually very difficult to support multilingual pages using ‘legacy’ approaches to encoding.

The Unicode Consortium provides a large, single character set that aims to include all the characters needed for any writing system in the world, including ancient scripts (such as Cuneiform, Gothic and Egyptian Hieroglyphs). It is now fundamental to the architecture of the Web and operating systems, and is supported by all major web browsers and applications. The Unicode Standard also describes properties and algorithms for working with characters.

This approach makes it much easier to deal with multilingual pages or systems, and provides much better coverage of your needs than most traditional encoding systems.

[Figure: Unicode script blocks as of Unicode version 5.2.]

The first 65,536 code point positions in the Unicode character set are said to constitute the Basic Multilingual Plane (BMP). The BMP includes most of the more commonly used characters.

The number 65,536 is 2 to the power of 16 – in other words, the maximum number of bit permutations you can get in two bytes.

The Unicode character set also contains space for around a million additional code point positions. Characters in this latter range are referred to as supplementary characters.

[Illustration: the 17 planes of the Unicode code set.]

For more information about Unicode, see the Unicode Home Page, or read the tutorial An Introduction to Writing Systems & Unicode.

Character sets, coded character sets, and encodings

It is important to clearly distinguish between the concepts of a character set versus a character encoding.

A character set or repertoire comprises the set of characters one might use for a particular purpose – be it those required to support Western European languages in computers, or those a Chinese child will learn at school in the third grade (nothing to do with computers).

A coded character set is a set of characters for which a unique number has been assigned to each character. Units of a coded character set are known as code points. A code point value represents the position of a character in the coded character set. For example, the code point for the letter á in the Unicode coded character set is 225 in decimal, or 0xE1 in hexadecimal notation. (Note that hexadecimal notation is commonly used for referring to code points, and will be used here.) A Unicode code point can have a value between 0x0000 and 0x10FFFF.

Coded character sets are sometimes called code pages.

The character encoding reflects the way the coded character set is mapped to bytes for manipulation in a computer. The picture below shows how characters and code points in the Tifinagh (Berber) script are mapped to sequences of bytes in memory using the UTF-8 encoding (which we describe in this section). The code point values for each character are listed immediately below the glyph (ie. the visual representation) for that character at the top of the diagram. The arrows show how those are mapped to sequences of bytes, where each byte is represented by a two-digit hexadecimal number. Note how the Tifinagh code points map to three bytes, but the exclamation mark maps to a single byte.

This explanation glosses over some of the detailed nomenclature related to encoding. More detail can be found in Unicode Technical Report #17.

One character set, multiple encodings. Many character encoding standards, such as those in the ISO 8859 series, use a single byte for a given character and the encoding is a straightforward mapping to the scalar position of the characters in the coded character set. For example, the letter A in the ISO 8859-1 coded character set is in the 65th character position (starting from zero), and is encoded for representation in the computer using a byte with the value of 65. For ISO 8859-1 this never changes.

For Unicode, however, things are not so straightforward. Although the code point for the letter á in the Unicode coded character set is always 225 (in decimal), in UTF-8 it is represented in the computer by two bytes. In other words there isn't a trivial, one-to-one mapping between the coded character set value and the encoded value for this character.

In addition, in Unicode there are a number of ways of encoding the same character. For example, the letter á can be represented by two bytes in one encoding and four bytes in another. The encoding forms that can be used with Unicode are called UTF-8, UTF-16, and UTF-32.

UTF-8 uses 1 byte to represent characters in the ASCII set, two bytes for characters in several more alphabetic blocks, and three bytes for the rest of the BMP. Supplementary characters use 4 bytes.

UTF-16 uses 2 bytes for any character in the BMP, and 4 bytes for supplementary characters.

UTF-32 uses 4 bytes for all characters.

In the following chart, the first line of numbers represents the position of a character in the Unicode coded character set. The other lines show the byte values used to represent that character in a particular character encoding.

Code point U+0041 U+05D0 U+597D U+233B4
UTF-8 41 D7 90 E5 A5 BD F0 A3 8E B4
UTF-16 00 41 05 D0 59 7D D8 4C DF B4
UTF-32 00 00 00 41 00 00 05 D0 00 00 59 7D 00 02 33 B4
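You can reproduce this chart with any Unicode-aware language; the sketch below (Python 3.8+ for bytes.hex with a separator, big-endian codecs so the byte order matches the chart) prints the byte values for the same four code points:

    # Print the byte sequences for four code points in each encoding form.
    for cp in (0x0041, 0x05D0, 0x597D, 0x233B4):
        ch = chr(cp)
        print(f"U+{cp:04X}",
              ch.encode('utf-8').hex(' '),
              ch.encode('utf-16-be').hex(' '),
              ch.encode('utf-32-be').hex(' '))
    # U+0041  41           00 41        00 00 00 41
    # U+05D0  d7 90        05 d0        00 00 05 d0
    # U+597D  e5 a5 bd     59 7d        00 00 59 7d
    # U+233B4 f0 a3 8e b4  d8 4c df b4  00 02 33 b4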

For more information about characters and encodings see Introducing Character Sets and Encodings, or read the tutorial Handling character encodings in HTML and CSS and the article Choosing & applying a character encoding.

The Document Character Set

For XML and HTML (from version 4.0 onwards) the document character set is defined to be the Universal Character Set (UCS) as defined by both ISO/IEC 10646 and Unicode standards. (For simplicity and in line with common practice, we will refer to the UCS here simply as Unicode .)

What this means is that the logical model describing how XML and HTML are processed is described in terms of the set of characters defined by Unicode. (In practical terms, this means that browsers usually convert all text to Unicode internally.)

Note that this does not mean that all HTML and XML documents have to use a Unicode encoding! It does mean, however, that documents can only contain characters defined by Unicode. Any encoding can be used for your document as long as it is properly declared and represents a subset of the Unicode repertoire.

For more information about the document character set see the article Document character set .

Characters & clusters

Although we have used it without much qualification so far in this article, the term 'character' is used here in an abstract and somewhat vague way to refer to the smallest component of written language that has semantic value. However, the term 'character' is often used to mean different things in different contexts: it can variously refer to the visual, logical, or byte-level representation of a given piece of text. This makes the term too imprecise to use when specifying algorithms, protocols, or document formats, unless you explicitly define what you mean by it. If the term 'character' is used in those contexts in a technical sense, the recommendation is to use it as a synonym for code point (described above).

It is particularly important to remember that bytes only rarely equate to characters in Unicode, as shown in the earlier examples.

However, particularly in complex scripts, what a user perceives as the smallest component of their alphabet (and so what we will call a user-perceived character) may actually be a sequence of code points. For example, the Vietnamese letter ề will be perceived as a single letter even if the underlying code point sequence is U+0065 LATIN SMALL LETTER E + U+0302 COMBINING CIRCUMFLEX ACCENT + U+0300 COMBINING GRAVE ACCENT. Similarly, a Bangla speaker may view ksha (ক্ষ), which is composed of the sequence U+0995 BENGALI LETTER KA + U+09CD BENGALI SIGN VIRAMA + U+09B7 BENGALI LETTER SSA, as a single letter.

It is often important to take into account these user-perceived characters. For example, it is common to treat certain combinations of code points as a single unit for various editing operations, such as line-breaking, cursor movement, selection, deletion, etc. It would usually be problematic if a user selection accidentally omitted part of the letters just mentioned, or if a line-break separated a base character from its following combining characters.

In order to approximate user-perceived character units for such operations, Unicode uses a set of generalised rules to define grapheme clusters – sequences of adjacent code points that can be treated as a unit by applications. A single alphabetic character like e is a grapheme cluster, but so also is any combination of base character and following combining character(s), such as ề mentioned above.
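The difference between code points and user-perceived characters is easy to observe in code. This sketch (Python 3; unicodedata is in the standard library) shows the letter ề as three code points, and NFC normalisation composing them into one:

    # One user-perceived character, three code points.
    import unicodedata
    decomposed = 'e\u0302\u0300'   # e + combining circumflex + combining grave
    print(decomposed, len(decomposed))             # displays as ề, length 3
    composed = unicodedata.normalize('NFC', decomposed)
    print(composed, len(composed), hex(ord(composed)))  # ề, 1, 0x1ec1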

Unicode Standard Annex #29: Text Segmentation actually defines two types of grapheme cluster: extended grapheme clusters, and legacy grapheme clusters. Here when we say 'grapheme cluster' we mean the former. It is not recommended to use the latter.


Currently there are, however, some limitations to the grapheme cluster rules: for example, the rules split the Bangla user-perceived character kshī ( ক্ষী ) into two adjacent grapheme clusters, rather than enveloping the whole orthographic syllable. Applications that need to work with user-perceived characters in Bangla therefore need to apply some script-specific tailoring of the grapheme cluster rules.


The appropriate units for editing operations sometimes vary according to what you want to do. For example, if you backspace over the Hindi word हूँ ( U+0939 DEVANAGARI LETTER HA + U+0942 DEVANAGARI VOWEL SIGN UU​ + U+0901 DEVANAGARI SIGN CANDRABINDU​ ) the application will typically first delete each of the two combining characters, and then the base. However, if you 'forward-delete' while the cursor is at the left of the word most applications will delete the whole grapheme cluster in one go.

CSS, in order to refer to an indivisible text unit in a given context, uses the term typographic character unit . The definition of what constitutes a typographic character unit depends on the operation that is being applied. So when working with the example of ề above, when deleting forwards there would be a single typographic character unit, but three when backspacing. Also, typographic character units cover the cases such as Bengali ksha , which grapheme clusters currently don't. The determination of what constitutes a typographic character unit in a given language and editing context is deferred to the application, rather than spelled out in rules.

Characters & glyphs

A font is a collection of glyphs. In a simple scenario, a glyph is the visual representation of a code point. The glyph used to represent a code point will vary with the font used, and with whether the font is bold, italic, etc. In the case of emoji, the glyphs used will vary by platform.

In fact, more than one glyph may be used to represent a single code point, and multiple code points may be represented by a single glyph.

Emoji provide another example of the complex relationship between code points and glyphs.

U+1F46A FAMILY
U+1F468 U+200D U+1F469 U+200D U+1F466
U+1F468 U+200D U+1F469 U+200D U+1F467 U+200D U+1F466

The emoji character for "family" has a code point in Unicode: 👪 [U+1F46A FAMILY]. It can also be formed by using a sequence of code points: 👨‍👩‍👦 [U+1F468 U+200D U+1F469 U+200D U+1F466]. Altering or adding other emoji characters can alter the composition of the family. For example the sequence 👨‍👩‍👧‍👦 [U+1F468 U+200D U+1F469 U+200D U+1F467 U+200D U+1F466] results in a composed emoji glyph for a "family: man, woman, girl, boy" on systems that support this kind of composition. Many common emoji can only be formed using sequences of code points, but should be treated as a single user-perceived character when displaying or processing the text.
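This is easy to verify programmatically; the following sketch (Python 3 assumed) shows that the ZWJ family sequence is five code points even though it renders as a single glyph on supporting systems:

    # The family emoji as a ZWJ sequence vs. a single code point.
    sequence = '\U0001F468\u200D\U0001F469\u200D\U0001F466'  # man ZWJ woman ZWJ boy
    print(len(sequence))                    # 5 code points, one composed glyph
    print([hex(ord(c)) for c in sequence])
    single = '\U0001F46A'                   # U+1F46A FAMILY
    print(len(single))                      # 1 code point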

Character escapes

A character escape is a way of representing a character without actually using the character itself.

For example, there is no way of directly representing the Hebrew character א in your document if you are using an ISO 8859-1 encoding (which covers Western European languages). One way to indicate that you want to include that character in HTML is to use the escape &#x05D0;. Because the document character set is Unicode, the user agent should recognize that this represents a Hebrew aleph character.
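A user agent (or a script) can resolve that escape back to the character; a one-line sketch using Python 3's standard html module:

    # Resolve an HTML numeric character reference to its character.
    import html
    print(html.unescape('&#x05D0;'))   # א (Hebrew aleph, U+05D0)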

Examples of escapes in HTML / XHTML and CSS, and advice on when and how to use them, can be found in the article Using character escapes in markup and CSS .

The HTTP header

When you retrieve a document from a server, the server normally sends some additional information with the document. This is called the HTTP header. Here is an example of the kind of information about the document that is passed by HTTP header with a document as it travels from the server to the client.
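The original article shows a full header dump at this point; a minimal illustrative response header (not the original's exact output, values made up for the example) might look like this:

    HTTP/1.1 200 OK
    Date: Fri, 21 Jun 2024 10:46:04 GMT
    Server: Apache
    Content-Length: 10734
    Content-Type: text/html; charset=UTF-8
    Content-Language: en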

The second line from the bottom in this example carries information about the character encoding for the document.

If your document is dynamically created using scripting, you may be able to explicitly add this information to the HTTP header. If you are serving static files, the server may associate this information with the files. The method of setting up a server to pass character encoding information in this way will vary from server to server. You should check with the server administrator.

As an example, Apache servers typically provide a default encoding, which can usually be overridden by directory-specific settings. For example, a webmaster might add the following line to a .htaccess file to serve all files with a .html extension as UTF-8 in this and all child directories:

AddType 'text/html; charset=UTF-8' html

For more information on changing the encoding in the HTTP header, see Setting the HTTP charset parameter

Further reading

Getting started? Introducing Character Sets and Encodings

Tutorial, Handling character encodings in HTML and CSS

Serving HTML & XHTML for information about MIME types , standards vs quirks modes , and DOCTYPEs .

Authoring HTML & CSS

Setting up a server



Understanding Character Encoding

Ever imagined how a computer is able to understand and display what you have written? Ever wondered what UTF-8 or UTF-16 meant when you were going through some configurations? Just think about how "HeLLo WorlD" should be interpreted by a computer. We all know that a computer stores data in bits and bytes. So, to display a character on screen, or to map the character to a byte in memory, the computer needs to have a standard. Read the following:

48 65 4c 4c 6f 20 57 6f 72 6c 44

This is the sort of thing memory would show you. How do you know what character each memory byte specifies?

Here comes character encoding into the picture:

If you have not already guessed it – it's "HeLLo WorlD" in UTF-8 for you. And yes, we will go ahead and read about UTF-8. But let's start with ASCII. Most of you who have done programming or worked with strings will know what ASCII is. If you haven't, then let's define what ASCII is.
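To check, you can decode those bytes yourself; a short sketch assuming Python 3:

    # Decode the byte dump above back into text.
    data = bytes.fromhex('48 65 4c 4c 6f 20 57 6f 72 6c 44')
    print(data.decode('utf-8'))   # HeLLo WorlD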

ASCII stands for American Standard Code for Information Interchange. Computers can only understand numbers, so an ASCII code is the numerical representation of a character such as ‘a’ or ‘@’ or an action of some sort. ASCII was developed a long time ago and now the non-printing characters are rarely used for their original purpose.

Just look at the following –

Hexadecimal Decimal Character
\x48 72 H
\x65 101 e
\x4c 76 L

And so on. You can look at the full ASCII table and mapping at http://www.asciitable.com/. If you have not already looked at the table, I recommend that you do it now! You will observe that it covers a simple set of English letters, digits, and punctuation.

Now suppose I want to write the following characters: an 'A', then a new line, a space, and then 'B?@'.

This will be interpreted by my decoder as

0x41 0x0a 0x20 0x42 0x3f 0x40

in hexadecimal, or

065 010 032 066 063 064

in decimal, where even a space (0x20) and a new line (0x0a) have a byte value and take up memory.

Different countries, different languages – but a need that brought them together

Today the internet has made the world come closer together, and people all over the world do not speak just English, right? There came a need to expand this space. Suppose you have created an application and you see that people in France want to use it, as you see high potential there. Wouldn't it be nice to just change the language while keeping the same functionality?

Why not create a Universal Code – Unicode, for short – for everyone?

So, here came Unicode with a really good idea. It assigned every character, across different languages, a unique number called a code point. One advantage of Unicode over other possible sets is that its first 128 code points are identical to ASCII, so for a piece of software or a browser it is easy to encode and decode characters of the majority of living languages in use on computers. Unicode aims to be, and to a large extent already is, a superset of all other character sets that have been encoded. Note that Unicode is a character set, not an encoding. It uses the same characters as the ASCII standard, but it extends the list with additional characters, giving each character a code point. It has the ambition to contain all characters (and popular icons) used in the entire world.

Before knowing these let us get a few terminologies straight :

  • A character is a minimal unit of text that has semantic value.
  • A character set is a collection of characters that might be used by multiple languages. Example: The Latin character set is used by English and most European languages, though the Greek character set is used only by the Greek language.
  • A coded character set is a character set in which each character corresponds to a unique number.
  • A code point of a coded character set is any legal value in the character set.
  • A code unit is a bit sequence used to encode each character of a repertoire within a given encoding form.

Ever wondered what UTF-8 or UTF-16 actually are?

UTF-8: 

UTF-8 has truly been the dominant character encoding for the World Wide Web since 2009, and as of June 2017 accounts for 89.4% of all Web pages. UTF-8 encodes each of the 1,112,064 valid code points in Unicode using one to four 8-bit bytes. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. The first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single octet with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well. So how many bytes give access to what characters in these encodings?
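Before answering that, note that the ASCII compatibility just described is easy to demonstrate (Python 3 assumed):

    # Valid ASCII text is already valid UTF-8: identical bytes.
    s = 'Hello'
    print(s.encode('ascii') == s.encode('utf-8'))   # True
    print('é'.encode('utf-8'))   # b'\xc3\xa9' - a non-ASCII char takes two bytes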

UTF-8:
  • 1 byte: standard ASCII
  • 2 bytes: Arabic, Hebrew, most European scripts (most notably excluding Georgian)
  • 3 bytes: the rest of the Basic Multilingual Plane (BMP)
  • 4 bytes: all remaining Unicode characters

UTF-16:
  • 2 bytes: the BMP
  • 4 bytes: all remaining Unicode characters

So I did make mention of the BMP. What is it exactly?

The Basic Multilingual Plane (BMP) contains characters for almost all modern languages, and a large number of symbols. A primary objective for the BMP is to support the unification of prior character sets as well as characters for writing.

UTF-8, UTF-16 and UTF-32 are encodings that all apply the Unicode character table, but each has a slightly different way of encoding it. UTF-8 uses only 1 byte when encoding an ASCII character, giving the same output as any other ASCII encoding; for other characters, it uses the leading bits of the first byte to indicate how many continuation bytes follow. UTF-16 uses 16-bit code units by default, but that only gives you about 65k possible characters, which is nowhere near enough for the full Unicode set, so some characters use pairs of 16-bit values. UTF-32 is the opposite: it uses the most memory (each character is a fixed 4 bytes wide), which makes it quite bloated, but every character then has the same precise length, so string manipulation becomes far simpler. You can compute the number of characters in a string simply from the length in bytes of the string. You can't do that with UTF-8.

This is how Unicode encodings accommodate the entire character set for different languages, letting people spread their applications and information to the world while coding and writing in their own language, with the rest taken care of by the decoder. This being just the beginning of the world of character encoding, I hope this helps you understand character encoding at a higher level.
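As a final illustration of that UTF-32 property (Python 3 assumed; big-endian codecs to avoid counting a byte-order mark), the UTF-32 byte count divided by four always equals the number of code points, while the UTF-8 byte count does not:

    # UTF-32: characters = bytes / 4. UTF-8: no such shortcut.
    s = 'a犬👪'
    print(len(s.encode('utf-32-be')) // 4)   # 3 - one 4-byte unit per code point
    print(len(s.encode('utf-8')))            # 8 - 1 + 3 + 4 bytes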


Representation of Numbers and Characters in Computer


This chapter covers the computer representation of numbers and characters. Computers use the binary number system; everything in a computer is represented by binary numbers. Information is expressed using symbols that include characters, digits, and other symbols, each of which can be represented by a 7-bit ASCII code. The ASCII representation of positive numbers is the same as their binary representation; however, negative numbers are represented in 2's complement form in most electronic devices, including computers.

