You’ve probably heard of the ASCII/ANSI character sets. They map the numeric values 0-127 to various Western characters and control codes (newline, tab, etc.). Note that values 0-127 fit in the lower 7 bits of an 8-bit byte. ASCII does not explicitly define what values 128-255 map to.
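To make the 0-127 range concrete, here is a small Java sketch. The class and method names (`AsciiRange`, `isAscii`, `toAscii`) are made up for illustration; the only library call that matters is `String.getBytes(StandardCharsets.US_ASCII)`, which substitutes `?` for anything ASCII cannot represent.

```java
import java.nio.charset.StandardCharsets;

class AsciiRange {
    // True when every character of the text is in the ASCII range 0-127.
    static boolean isAscii(String text) {
        return text.chars().allMatch(c -> c <= 127);
    }

    // Encodes text as US-ASCII; anything outside 0-127 is replaced by '?'.
    static byte[] toAscii(String text) {
        return text.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println((int) 'A');            // 65
        System.out.println((int) '\n');           // 10
        System.out.println(isAscii("Hello\t!"));  // true
        System.out.println(isAscii("\u05E9"));    // false (Hebrew letter shin)
        System.out.println(toAscii("\u05E9")[0]); // 63, the code for '?'
    }
}
```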
Now, ASCII encoding works great for English text (using Western characters), but the world is a big place. What about Arabic, Chinese and Hebrew?
To solve this, computer makers defined “code pages” that used the undefined space from 128-255 in ASCII, mapping it to various characters they needed. Unfortunately, 128 additional characters aren’t enough for the entire world: code pages varied by country (Russian code page, Hebrew code page, etc.).
If people with the same code page exchanged data, all was good. Character #200 on my machine was the same as Character #200 on yours. But if code pages mixed (Russian sender, Hebrew receiver), things got strange.
The character mapped to #200 was different in Russian and Hebrew, and you can imagine the confusion that caused for things like email and birthday invitations. It’s a big IF whether or not someone will read your message using the same code page you authored your text in. If you visit an international website, for example, your browser could try to guess the code page if it was not specified (“Hrm… this text has a lot of character #213 and #218… probably Hebrew”). But clearly this method was error-prone: we needed to be rescued from code pages.
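The "Character #200" problem can be reproduced in a few lines of Java. This is only a sketch: I am assuming the JDK you run it on ships the `windows-1251` (Cyrillic) and `windows-1255` (Hebrew) charsets, which full JDK builds normally do; `Charset.forName` throws `UnsupportedCharsetException` if one is missing. The class and method names are made up for the example.

```java
import java.nio.charset.Charset;

class CodePageMismatch {
    // Decodes a single byte value (0-255) using the named code page.
    static String decode(int value, String codePage) {
        return new String(new byte[] {(byte) value}, Charset.forName(codePage));
    }

    public static void main(String[] args) {
        String russian = decode(200, "windows-1251"); // Cyrillic code page
        String hebrew = decode(200, "windows-1255");  // Hebrew code page
        System.out.println("windows-1251: " + russian);
        System.out.println("windows-1255: " + hebrew);
        System.out.println("same character? " + russian.equals(hebrew));

        // Values below 128 are plain ASCII, so both code pages agree on them.
        System.out.println(decode(65, "windows-1251").equals(decode(65, "windows-1255")));
    }
}
```

The same byte comes out as two different characters, while anything in the shared 0-127 range decodes identically in both.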
Unicode to the Rescue
The world had a conundrum: nobody could agree on which numbers mapped to which letters beyond the ASCII range. The Unicode group went back to the basics: letters are abstract concepts. Unicode labeled each abstract character with a “code point”. For example, “A” mapped to code point U+0041 (this code point is in hex; code point 65 in decimal).
The Unicode group did the hard work of mapping each character in every language to some code point (not without fierce debate, I am sure). When all was done, the Unicode standard left room for over 1 million code points, enough for all known languages with room to spare for undiscovered civilizations.
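Both numbers are easy to check from Java. In this sketch the class name `CodePoints` and the helper `label` are mine; `String.codePointAt` and `Character.MAX_CODE_POINT` are standard JDK members.

```java
class CodePoints {
    // Formats a code point in the conventional U+XXXX notation.
    static String label(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        int a = "A".codePointAt(0);
        System.out.println(a);          // 65
        System.out.println(label(a));   // U+0041

        // Code points run from U+0000 to U+10FFFF inclusive.
        int total = Character.MAX_CODE_POINT + 1;
        System.out.println(total);      // 1114112
    }
}
```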
Good right… read up more here
Using Unicode:
No doubt, when displaying special characters, the first thing that comes to mind is getting the Unicode value of the special character and pasting it into the code, or possibly encoding or decoding it with whatever resource you can use to simplify the task. For instance, if I wanted to write out the Bitcoin symbol, this is obviously not one of the characters on my keyboard. So what developers eventually resort to involves:
- look up the unicode value of the character
- copy and paste to code.
For Bitcoin, it will be ‘\u20BF’.
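In Java source that value is written as a backslash escape inside a string or char literal. A minimal sketch (the class and helper names are made up; whether the glyph actually shows up when printed depends on your console encoding and font):

```java
class BitcoinSign {
    // The Bitcoin sign, written with a Unicode escape instead of a pasted glyph.
    static final String BITCOIN = "\u20BF";

    // Builds the same string from the numeric code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(BITCOIN + " 0.5");
        System.out.println(BITCOIN.equals(fromCodePoint(0x20BF))); // true
    }
}
```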
Well, what you will notice is that you do this every time you want to display a special character. The downside to this is that:
- you might not know the actual name of the special character
- waste time searching for the characters
- might not even know if the character has a unicode value representation (weirdly enough). This will result in you using an image for a 5-star when you could have used the unicode representation
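The 5-star case is a good example of what you gain once you know the character exists. Here is a rough sketch that draws a rating with U+2605 (BLACK STAR) and U+2606 (WHITE STAR) instead of images; the `StarRating` class and its `render` method are hypothetical names for illustration.

```java
class StarRating {
    static final char FILLED = '\u2605'; // BLACK STAR
    static final char EMPTY = '\u2606';  // WHITE STAR

    // Renders a rating such as 3 out of 5 as filled and empty stars.
    static String render(int rating, int outOf) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < outOf; i++) {
            sb.append(i < rating ? FILLED : EMPTY);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(render(3, 5));
        System.out.println(render(5, 5));
    }
}
```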
There is documentation available at “unicode chart”, which groups Unicode characters into categories. I started an open-source project based on that. The goal is to:
- Make referencing Unicode far easier than it is now
- Enable local look-up of Unicode characters by name and by Unicode value
- Make this available in more programming languages other than
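For a sense of what look-up by name and by value means, the JDK itself exposes a raw form of it through `Character.getName` and, on Java 9 or newer, `Character.codePointOf`. The sketch below only wraps those two calls (the wrapper class `UnicodeLookup` is my own naming). Note that they work with the official Unicode names such as "NAIRA SIGN", so you still have to know the exact name up front.

```java
class UnicodeLookup {
    // Looks up the official Unicode name for a code point, or null if unassigned.
    static String nameOf(int codePoint) {
        return Character.getName(codePoint);
    }

    // Looks up a code point by its official Unicode name (Java 9 or newer).
    // Throws IllegalArgumentException when the name is unknown.
    static int codePointOf(String name) {
        return Character.codePointOf(name);
    }

    public static void main(String[] args) {
        System.out.println(nameOf(0x20A6));                 // NAIRA SIGN
        int naira = codePointOf("NAIRA SIGN");
        System.out.println(String.format("U+%04X", naira)); // U+20A6
    }
}
```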
How I am doing this - Handling the Categorisation
I figured that bundling everything as one dependency would take time to roll out, and it would be very heavy.
So, what I have decided to do is have an artifact for
sub category and
Looking at the “unicode chart” carefully, you will notice this structure:
unicode names
│
└───othersymbols
    │   currency
    │   game symbols
To make it easier and less heavy, the artifacts are split by category so developers can easily add only the dependency they need (not the entire unicode chart library). So, there is:
currency.jar - add this to your project to use currency only
game-symbols.jar - add this to your project to access game unicode value
othersymbols.jar - add this to your project to access all classes under othersymbols.
Using the library:
Really simple!
This returns the unicode character of the Naira. Other available methods include:
// get the unicode value
getUnicodeValue();
// get the unicode character
getUnicodeSymbol();
// get the friendly name, e.g. dollar, naira, bitcoin
getUnicodeFriendlyName();
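To show how those three getters could fit together, here is a rough sketch of a currency type. It is not the library's actual code: the enum name `CurrencySymbol`, its constants, and the choice to return the value in "U+XXXX" form are all assumptions made for illustration. Only the three method names come from the list above.

```java
// Hypothetical sketch: the enum name, constants and return formats are
// assumptions; only the three getter names come from the article.
enum CurrencySymbol {
    DOLLAR("dollar", 0x0024),
    NAIRA("naira", 0x20A6),
    BITCOIN("bitcoin", 0x20BF);

    private final String friendlyName;
    private final int codePoint;

    CurrencySymbol(String friendlyName, int codePoint) {
        this.friendlyName = friendlyName;
        this.codePoint = codePoint;
    }

    // The unicode value in U+XXXX notation (assumed format).
    String getUnicodeValue() {
        return String.format("U+%04X", codePoint);
    }

    // The unicode character itself, ready to print.
    String getUnicodeSymbol() {
        return new String(Character.toChars(codePoint));
    }

    // The friendly name, e.g. dollar, naira, bitcoin.
    String getUnicodeFriendlyName() {
        return friendlyName;
    }

    public static void main(String[] args) {
        for (CurrencySymbol c : values()) {
            System.out.println(c.getUnicodeFriendlyName() + " "
                    + c.getUnicodeValue() + " " + c.getUnicodeSymbol());
        }
    }
}
```

An enum keeps the friendly name and the code point side by side, so a caller never has to paste an escape sequence by hand.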
At the moment, only currency is available (as a jar artifact). Others will be added as time goes on. To add this to your project, use this maven dependency compile link.
You can also download the artifact.
Interest and Contributions.
Send me a mail or just drop a comment.
Contributions are welcome and are generally managed through issues and pull requests.
For issues and suggestions to be followed up, kindly open them on GitHub Issues.
I am looking forward to seeing a great turnout from contributors across different stacks.
I would love to know if this article helped anyone, and I honestly want to see the wonderful stuff it helped you churn out.