ASCII Code vs Unicode
Welcome friends!
This blog is about ASCII Code and Unicode. I will explain everything in easy words with examples, so you can understand without any confusion.
1. ASCII Code
FULL FORM: American Standard Code for Information Interchange.
Size: 7-bit (0-127)
Character set: A-Z, a-z, 0-9, special symbols
Example:
A = 65 = binary 01000001
a = 97 = binary 01100001
0 = 48 = binary 00110000
Limitation = works only for English. It cannot show Hindi, Chinese, emoji, etc.
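If you want to check these numbers yourself, here is a tiny Python sketch (Python is just my choice for the demo, any language works):

# Check ASCII values with Python's built-in ord() and format()
for ch in ["A", "a", "0"]:
    code = ord(ch)                        # character -> number
    print(ch, "=", code, "= binary", format(code, "08b"))

# Output:
# A = 65 = binary 01000001
# a = 97 = binary 01100001
# 0 = 48 = binary 00110000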
2. Extended ASCII
8-bit = 256 characters (0-255)
Added some extra symbols like ñ, ç
Still not enough for all world languages.
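A quick Python sketch of this limit (Latin-1 is one common "extended ASCII" table, used here only as an illustration):

# Latin-1 style "extended ASCII": one byte per character, only 256 slots
print("ñ".encode("latin-1"))      # b'\xf1' -> one byte, value 241
print("ç".encode("latin-1"))      # b'\xe7' -> one byte, value 231
# "अ".encode("latin-1")           # -> UnicodeEncodeError, no slot for Hindi or emoji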
3. Unicode
FULL FORM: Universal Character Encoding Standard
A universal standard for all characters of all languages + emojis
Range: U+0000 to U+10FFFF (~1.1 million possible characters)
Every character has a unique code point.
Example:
A = U+0041 (Decimal 65)
अ = U+0905 (Decimal 2309)
一 = U+4E00 (Chinese "one")
🙂 = U+1F642 (Emoji)
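You can see these code points yourself with a small Python sketch (again, Python is only the demo language):

# ord() gives the Unicode code point of any character
for ch in ["A", "अ", "一", "🙂"]:
    cp = ord(ch)
    print(ch, "= U+%04X" % cp, "(Decimal %d)" % cp)

# Output:
# A = U+0041 (Decimal 65)
# अ = U+0905 (Decimal 2309)
# 一 = U+4E00 (Decimal 19968)
# 🙂 = U+1F642 (Decimal 128578)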
4. Unicode Encodings (How code points are stored in memory)
UTF-8 (Most common, internet standard)
Variable length: 1-4 bytes
Backward compatible with ASCII
Small for English, bigger for complex scripts
Example:
A = 41 = 01000001 (1 byte)
अ = E0 A4 85 = 11100000 10100100 10000101 (3 bytes)
🙂 = F0 9F 99 82 = 11110000 10011111 10011001 10000010 (4 bytes)
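A small Python sketch to see these UTF-8 bytes (just a demo; bytes.hex with a separator needs Python 3.8 or newer):

# UTF-8 is variable length: 1 byte for ASCII, up to 4 bytes for emoji
for ch in ["A", "अ", "🙂"]:
    data = ch.encode("utf-8")
    print(ch, "=", data.hex(" ").upper(), "->", len(data), "byte(s)")

# Output:
# A = 41 -> 1 byte(s)
# अ = E0 A4 85 -> 3 byte(s)
# 🙂 = F0 9F 99 82 -> 4 byte(s)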
UTF-16 (Windows and Java's favorite)
Variable length: 2 or 4 bytes
Many characters use 2 bytes, emojis need 4 bytes
Example:
A = 0041 = 00000000 01000001 (2 bytes)
अ = 0905 = 00001001 00000101 (2 bytes)
🙂 = D83D DE42 = 4 bytes (surrogate pair)
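Same idea for UTF-16 (a Python sketch; I use the BOM-less big-endian variant so the bytes are easy to read):

# UTF-16: one 2-byte unit for most characters, a 4-byte surrogate pair for emoji
for ch in ["A", "अ", "🙂"]:
    data = ch.encode("utf-16-be")
    print(ch, "=", data.hex(" ").upper(), "->", len(data), "bytes")

# Output:
# A = 00 41 -> 2 bytes
# अ = 09 05 -> 2 bytes
# 🙂 = D8 3D DE 42 -> 4 bytes
# The last line is the surrogate pair D83D + DE42 from the example above.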
UTF-32 (Simple but heavy)
Fixed length: 4 bytes for every character
Easy to calculate positions, but wastes memory
Example:
A = 00000041 = 4 bytes
अ = 00000905 = 4 bytes
🙂 = 0001F642 = 4 bytes
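And the same Python sketch for UTF-32:

# UTF-32: always exactly 4 bytes per character
for ch in ["A", "अ", "🙂"]:
    data = ch.encode("utf-32-be")        # big-endian, no BOM
    print(ch, "=", data.hex(" ").upper(), "->", len(data), "bytes")

# Output:
# A = 00 00 00 41 -> 4 bytes
# अ = 00 00 09 05 -> 4 bytes
# 🙂 = 00 01 F6 42 -> 4 bytes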
How a Computer Understands Characters (Flow)
1. You type on the keyboard = 🙂
2. OS finds Unicode code point = U+1F642
3. Encoding converts it into bytes = UTF-8: F0 9F 99 82
4. Binary stored in memory = 11110000 10011111 10011001 10000010
5. Font file (TTF/OTF) says: "This is the shape of 🙂"
(I will explain to you in simple language what TTF and OTF are.)
6. Display system (GPU) draws pixels = Emoji appears on screen
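Steps 2 to 4 of this flow are easy to replay in Python (the font and GPU steps happen outside our code, so this sketch stops at the bytes):

# From character to code point to UTF-8 bytes to binary
ch = "🙂"
code_point = ord(ch)                         # step 2: find the code point
utf8_bytes = ch.encode("utf-8")              # step 3: encode to bytes
binary = " ".join(format(b, "08b") for b in utf8_bytes)   # step 4: raw bits

print("U+%04X" % code_point)                 # U+1F642
print(utf8_bytes.hex(" ").upper())           # F0 9F 99 82
print(binary)                                # 11110000 10011111 10011001 10000010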
Simple understanding of ASCII Code and Unicode
ASCII = small house (only for the English language)
Unicode = a big shopping mall (it supports multiple languages + emojis + characters)
UTF-8 = INTERNET KING, because it supports all languages, emojis, characters, and symbols worldwide
UTF-16 = Windows and Java's favorite
Why Windows loves UTF-16
In the 1990s, people thought: “We only need 65,536 characters (16 bits). That’s enough for all languages!”
So Windows NT (1993) made UTF-16 its native encoding.
Later, Unicode grew bigger (emoji, rare scripts). Now UTF-16 sometimes needs 2 code units (4 bytes) for one character.
But by then, all Windows APIs were built on UTF-16 = they can’t change without breaking old software.
Why Java loves UTF-16
Java started in 1995. Same thinking: 16 bits is enough.
They made the char type = 16-bit (UTF-16 code unit).
Unicode grew, so now sometimes one char = half of a real character (needs a surrogate pair).
But the whole Java ecosystem already depends on UTF-16 = too late to switch.
Why not UTF-8 back then?
In the 90s, UTF-8 wasn’t popular.
People thought UTF-16 was more efficient for Asian scripts (Hindi, Chinese, Japanese), because each character fit directly in 2 bytes.
So Windows + Java went with UTF-16.
Today’s reality
Most of the world (Linux, web, Python, Rust, Go) = UTF-8
Windows + Java = still UTF-16 (because of old code and backward compatibility).
Basically: UTF-16 is their old love, and they can't leave it now.
So:
UTF-32 = direct but wasteful
UTF-16 = Windows & Java’s old choice
UTF-8 = modern king of the world
UTF-32 = Direct but wasteful
Advantages of UTF-32
1. Super simple: one code point = one 32-bit number. No tricky rules.
2. Easy for programmers: random access is fast (indexing characters).
Example: string[5] = just go to (5 × 4 bytes).
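Here is a small Python sketch of that "just multiply by 4" idea (only an illustration of the principle, not how Python strings work internally):

# In UTF-32, character number i always starts at byte offset i * 4
text = "Hello🙂"
data = text.encode("utf-32-be")              # fixed 4 bytes per character

i = 5                                        # we want the 6th character
chunk = data[i * 4:(i + 1) * 4]              # jump straight to its 4 bytes
print(chunk.hex().upper())                   # 0001F642
print(chunk.decode("utf-32-be"))             # 🙂
# In UTF-8 you cannot jump like this: you must walk from the start,
# because each character can be 1, 2, 3 or 4 bytes long.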
Disadvantages of UTF-32
1. Huge memory waste
English text: "Hello" in UTF-32 = 20 bytes
In UTF-8 = only 5 bytes
2. Most real-world text is English/ASCII-heavy, so in UTF-32 three out of every four bytes are just zeros (4× the size of UTF-8).
3. More storage = more RAM + slower for network transfer.
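You can check the "Hello" numbers with one line of Python each (BOM-less encoding so nothing extra is added):

# Same word, very different sizes
word = "Hello"
print(len(word.encode("utf-8")))       # 5 bytes
print(len(word.encode("utf-32-be")))   # 20 bytes (4x bigger for plain English)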
Where is UTF-32 used?
Rare in normal files or web.
Used internally in some programming languages or libraries where simplicity matters more than memory.
Example: some C libraries that store text as 32-bit wide characters (wchar_t) for easy indexing.
Comparison Example:
UTF-32 is always 4 bytes per character, no matter what.
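Here is a quick byte-count comparison in Python for the three characters we have been using (BOM-less variants, just for the demo):

# Bytes needed for the same character in each encoding
for ch in ["A", "अ", "🙂"]:
    print(ch,
          "UTF-8:", len(ch.encode("utf-8")),
          "UTF-16:", len(ch.encode("utf-16-be")),
          "UTF-32:", len(ch.encode("utf-32-be")))

# Output:
# A UTF-8: 1 UTF-16: 2 UTF-32: 4
# अ UTF-8: 3 UTF-16: 2 UTF-32: 4
# 🙂 UTF-8: 4 UTF-16: 4 UTF-32: 4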
Summary:
UTF-8 = compact, flexible = best for storage/web.
UTF-16 = middle ground = Windows & Java legacy.
UTF-32 = super simple but memory-hungry = only special cases.
Thank you for reading!
I hope now you understand ASCII Code and Unicode better. Keep visiting for more simple tech blogs and keep learning!
— Writer Kishan