Ffix “less” in the Word “fearless” Mean 3 Coins Lost Try Again 3 Coins Lost Try Again

23 min read
Guides, Coding, Typography

Quick summary ↬ This article relies heavily on numbers and aims to provide an understanding of character sets, Unicode, UTF-8 and the various problems that can arise.

This is a story that dates back to the earliest days of computers. The story has a plot, well, sort of. It has competition and intrigue, as well as traversing oodles of countries and languages. There is conflict and resolution, and a happyish ending. But the main focus is the characters: 110,116 of them. By the end of the story, they will all find their own unique place in this world.

This article will follow a few of those characters more closely, as they journey from Web server to browser, and back again. Along the way, you'll find out more about the history of characters, character sets, Unicode and UTF-8, and why question marks and odd accented characters sometimes show up in databases and text files.

Warning: This article contains lots of numbers, including a bit of binary, so it is best approached after your morning cup of coffee.

ASCII

Computers only deal in numbers and not letters, so it's important that all computers agree on which numbers represent which letters.

Let's say my computer used the number 1 for A, 2 for B, 3 for C, etc and yours used 0 for A, 1 for B, etc. If I sent you the message HELLO, then the numbers 8, 5, 12, 12, 15 would whiz across the wires. But for you 8 means I, so you would receive and decode it as IFMMP. To communicate effectively, we would need to agree on a standard way of encoding the characters.

To this end, in the 1960s the American Standards Association created a 7-bit encoding called the American Standard Code for Information Interchange (ASCII). In this encoding HELLO is 72, 69, 76, 76, 79 and would be transmitted digitally as 1001000 1000101 1001100 1001100 1001111. Using 7 bits gives 128 possible values from 0000000 to 1111111, so ASCII has enough room for all lower case and upper case Latin letters, along with each numerical digit, common punctuation marks, spaces, tabs and other control characters. In 1968, US President Lyndon Johnson made it official: all computers must use and understand ASCII.
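The HELLO example can be reproduced in a few lines of Javascript. This is only a small sketch: the helper name toAscii is made up for illustration, and it simply refuses anything above 127.

```javascript
// Convert a string of ASCII characters to decimal codes and 7-bit binary.
// toAscii is a hypothetical helper, not a built-in function.
function toAscii(text) {
  return Array.from(text, (ch) => {
    const code = ch.charCodeAt(0);
    if (code > 127) throw new RangeError('Not an ASCII character: ' + ch);
    // Pad to 7 digits so every character uses the same number of bits.
    return { char: ch, code: code, bits: code.toString(2).padStart(7, '0') };
  });
}

const hello = toAscii('HELLO');
console.log(hello.map((c) => c.code).join(', ')); // 72, 69, 76, 76, 79
console.log(hello.map((c) => c.bits).join(' '));  // 1001000 1000101 1001100 1001100 1001111
```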

Trying It Yourself

There are plenty of ASCII tables available, displaying or describing the 128 characters. Or you can make one of your own with a little bit of CSS, HTML and Javascript, most of which is to get it to display nicely:

          <html>
          <body>
          <style type="text/css">p {float: left; padding: 0 15px; margin: 0; font-size: 80%;}</style>
          <script type="text/javascript">
          // Start a new column (<p>) every 32 characters.
          for (var i=0; i<128; i++)
            document.writeln ((i%32?'':'<p>') + i + ': ' + String.fromCharCode (i) + '<br>');
          </script>
          </body>
          </html>

This will display a table like this:

Do-It-Yourself Javascript ASCII table viewed in Firefox

The most important bit of this is the Javascript String.fromCharCode function. It takes a number and turns it into a character. In fact, the following four lines of HTML and Javascript all produce the same result. They all get the browser to display character numbers 72, 69, 76, 76 and 79:

          HELLO
          &#72;&#69;&#76;&#76;&#79;
          <script>document.write ("HELLO");</script>
          <script>document.write (String.fromCharCode (72,69,76,76,79));</script>

Also notice how Firefox displays the unprintable characters (like backspace and escape) in the first column. Some browsers show blanks or question marks. Firefox squeezes four hexadecimal digits into a small box.


The Eighth Bit

Teleprinters and stock tickers were quite happy sending 7 bits of information to each other. But the newfangled microprocessors of the 1970s preferred to work with powers of 2. They could process 8 bits at a time and so used 8 bits (aka a byte or octet) to store each character, giving 256 possible values.

An 8 bit character can store a number up to 255, but ASCII only assigns up to 127. The other values from 128 to 255 are spare. Initially, IBM PCs used the spare slots to represent accented letters, various symbols and shapes and a handful of Greek letters. For instance, number 200 was the lower left corner of a box: ╚, and 224 was the Greek letter alpha in lower case: α. This way of encoding the letters was later given the name code page 437.

However, unlike ASCII, characters 128-255 were never standardized, and various countries started using the spare slots for their own alphabets. Not everybody agreed that 224 should display α, not even the Greeks. This led to the creation of a handful of new code pages. For instance, in Russian IBM computers using code page 855, 224 represents the Cyrillic letter Я. And in Greek code page 737, it is lower case omega: ω.

Even then there was disagreement. From the 1980s Microsoft Windows introduced its own code pages. In the Cyrillic code page Windows-1251, 224 represents the Cyrillic letter а, and Я is at 223.

Starting in the late 1980s, an attempt at standardization was made. Fifteen different 8 bit character sets were created to cover many different alphabets such as Cyrillic, Arabic, Hebrew, Turkish, and Thai. They are called ISO-8859-1 up to ISO-8859-16 (number 12 was abandoned). In the Cyrillic ISO-8859-5, 224 represents the letter р, and Я is at 207.

So if a Russian friend sends you a document, you really need to know what code page it uses. The document by itself is only a sequence of numbers. Character 224 could be Я, а or р. Viewed using the wrong code page, it will look like a bunch of scrambled letters and symbols.
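To see how one number can mean several letters, here is a minimal Javascript sketch that decodes the single byte 224 under three character sets. It assumes a runtime whose TextDecoder supports these legacy labels (modern browsers do, and so does Node.js when built with full ICU data); decodeAs is a helper name chosen for this example.

```javascript
// The same byte, 224, decoded under three different single-byte character sets.
const byte224 = new Uint8Array([224]);

// decodeAs is a hypothetical helper wrapping the standard TextDecoder API.
function decodeAs(label, bytes) {
  return new TextDecoder(label).decode(bytes);
}

for (const label of ['windows-1251', 'iso-8859-5', 'iso-8859-1']) {
  console.log(label + ' -> ' + decodeAs(label, byte224));
}
// windows-1251 gives Cyrillic а, iso-8859-5 gives Cyrillic р, iso-8859-1 gives à
```

The old IBM code pages mentioned above (437, 737, 855) are not among the labels TextDecoder understands, which is why they are left out here.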

(The situation isn't quite as bad when viewing Web pages, as Web browsers can usually detect a page's character set based on frequency analysis and other such techniques. But this is a false sense of security: they can and do get it wrong.)

Trying It Yourself

Code pages are also known as character sets. You can explore these character sets yourself, but you have to use PHP or a similar server-side language this time (roughly because the character needs to be in the page before it gets to the browser). Save these lines in a PHP file and upload it to your server:

          <html>
          <head>
          <meta charset="ISO-8859-5">
          </head>
          <body>
          <style type="text/css">p {float: left; padding: 0 15px; margin: 0; font-size: 80%;}</style>
          <?php
          // chr() emits the raw byte; the browser decides what it means.
          for ($i=0; $i<256; $i++)
            echo ($i%32?'':'<p>') . $i . ': ' . chr ($i) . '<br>';
          ?>
          </body>
          </html>

This will display a table like this:

Cyrillic character set ISO-8859-5 viewed in Firefox

The PHP function chr does a similar thing to Javascript's String.fromCharCode. For example chr(224) embeds the number 224 into the Web page before sending it to the browser. As we've seen above, 224 can mean many different things. So, the browser needs to know which character set to use to display the 224. That's what the meta line near the top is for. It tells the browser to use the Cyrillic character set ISO-8859-5:

          <meta charset="ISO-8859-5">

If you exclude the charset line, then it will display using the browser's default. In countries with Latin-based alphabets (like the UK and US), this is probably ISO-8859-1, in which case 224 is an a with a grave accent: à. Try changing this line to ISO-8859-7 or Windows-1251 and refresh the page. You can also override the character set in the browser. In Firefox go to View > Character Encoding. Swap between a few to see what effect it has. If you try to display more than 256 characters, the sequence will repeat.

Summary Circa 1990

This is the situation in about 1990. Documents can be written, saved and exchanged in many languages, but you need to know which character set they use. There is also no easy way to use two or more non-English alphabets in the same document, and alphabets with more than 256 characters like Chinese and Japanese have to use entirely different systems.

Finally, the Internet is coming! Internationalization and globalization are about to make this a much bigger issue. A new standard is required.

Unicode To The Rescue

Starting in the late 1980s, a new standard was proposed, one that would assign a unique number (officially known as a code point) to every letter in every language, one that would have way more than 256 slots. It was called Unicode. At the time of writing, it is in version 6.1 and consists of over 110,000 code points. If you have a few hours to spare you can watch them all whiz by.

The first 128 Unicode code points are the same as ASCII. The range 128-255 contains currency symbols and other common signs and accented characters (aka characters with diacritical marks), and much of it is borrowed from ISO-8859-1. After 256 there are many more accented characters. After 880 it gets into Greek letters, and then Cyrillic, Hebrew, Arabic, Indic scripts, and Thai. Chinese, Japanese and Korean start from 11904 with many others in between.

This is great: no more ambiguity, as each letter is represented by its own unique number. Cyrillic Я is always 1071 and Greek α is always 945. 224 is always à, and H is still 72. Note that these Unicode code points are officially written in hexadecimal preceded by U+. So the Unicode code point H is usually written as U+0048 rather than 72 (to convert from hexadecimal to decimal: 4*16+8=72).
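The conversion between the decimal numbers used in this article and the official U+ notation can be sketched in Javascript as follows. The function name toCodePointLabel is invented for this example; codePointAt, toString(16) and parseInt are standard.

```javascript
// Format a character's Unicode code point in the official U+XXXX notation.
// toCodePointLabel is a hypothetical helper name.
function toCodePointLabel(ch) {
  const codePoint = ch.codePointAt(0);
  // Hexadecimal, upper case, padded to at least four digits.
  return 'U+' + codePoint.toString(16).toUpperCase().padStart(4, '0');
}

for (const ch of ['H', 'à', 'α', 'Я']) {
  console.log(ch, ch.codePointAt(0), toCodePointLabel(ch));
}

// And back again: hexadecimal 0048 is 4*16+8 = 72.
console.log(parseInt('0048', 16)); // 72
```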

The major problem is that there are more than 256 of them. The characters will no longer fit into 8 bits. However Unicode is not a character set or code page. So officially that is not the Unicode Consortium's problem. They just came up with the idea and left someone else to sort out the implementation. That will be discussed in the next two sections.

Unicode Within The Browser

Unicode does not fit into 8 bits, not even into 16. Although only 110,116 code points are in use, it has the capability to define up to 1,114,112 of them, which would require 21 bits.

However, computers have advanced since the 1970s. An 8 bit microprocessor is a bit out of date. New computers now have 64 bit processors, so why can't we move beyond an 8 bit character and into a 32 bit or 64 bit character?

The first answer is: we can!

A lot of software is written in C or C++, which supports a "wide character". This is typically a 32 bit character called wchar_t (its exact size depends on the platform). It is an extension of C's 8 bit char type. Internally, modern Web browsers use these wide characters (or something similar) and can theoretically quite happily deal with over 4 billion distinct characters. This is plenty for Unicode. So, internally, modern Web browsers use Unicode.

Trying It Yourself

The Javascript code below is similar to the ASCII code above, except it goes up to a much higher number. For each number, it tells the browser to display the corresponding Unicode code point:

          <html>
          <body>
          <style type="text/css">p {float: left; padding: 0 15px; margin: 0; font-size: 80%;}</style>
          <script type="text/javascript">
          // Start a new column (<p>) every 256 code points.
          for (var i=0; i<2096; i++)
            document.writeln ((i%256?'':'<p>') + i + ': ' + String.fromCharCode (i) + '<br>');
          </script>
          </body>
          </html>

It will output a table like this:

A selection of Unicode code points viewed in Firefox

The screenshot above only shows a subset of the first few thousand code points output by the Javascript. The selection includes some Cyrillic and Arabic characters, displayed right-to-left.

The important point here is that Javascript runs completely in the Web browser where 32 bit characters are perfectly acceptable. The Javascript function String.fromCharCode(1071) outputs the Unicode code point 1071 which is the letter Я.

Similarly if you put the HTML entity &#1071; into an HTML page, a modern Web browser would display Я. Numerical HTML entities also refer to Unicode.

On the other hand, the PHP function chr(1071) would output a forward slash / because the chr function only deals with 8 bit numbers up to 255 and repeats itself after that, and 1071%256=47 which has been a / since the 1960s.
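The wrap-around is easy to check with a little arithmetic. The sketch below imitates the 8 bit behaviour described for chr in Javascript; chrLike is a made-up name, and it is only an imitation, not PHP's actual implementation.

```javascript
// Imitate an 8 bit character function: only the remainder after
// dividing by 256 survives. chrLike is a hypothetical helper.
function chrLike(n) {
  return String.fromCharCode(((n % 256) + 256) % 256);
}

console.log(String.fromCharCode(1071)); // Я, because Javascript works with Unicode values
console.log(1071 % 256);                // 47
console.log(chrLike(1071));             // /
```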

UTF-8 To The Rescue

So if browsers can deal with Unicode in 32 bit characters, where is the problem? The problem is in the sending and receiving, and reading and writing of characters.

The problem remains because:

A lot of existing software and protocols send/receive and read/write 8 bit characters
Using 32 bits to send/store English text would quadruple the amount of bandwidth/space required

Although browsers can deal with Unicode internally, you still have to get the data from the Web server to the Web browser and back again, and you need to save it in a file or database somewhere. So you still need a way to make 110,000 Unicode code points fit into just 8 bits.

There have been several attempts to solve this problem such as UCS2 and UTF-16. But the winner in recent years is UTF-8, which stands for Universal Character Set Transformation Format 8 bit.

UTF-8 is clever. It works a bit like the Shift key on your keyboard. Normally when you press the H on your keyboard a lower case "h" appears on the screen. But if you press Shift first, a capital H will appear.

UTF-8 treats numbers 0-127 as ASCII, 192-247 as Shift keys, and 128-191 as the key to be shifted. For example, characters 208 and 209 shift you into the Cyrillic range. 208 followed by 175 is character 1071, the Cyrillic Я. The exact calculation is (208%32)*64 + (175%64) = 1071. Characters 224-239 are like a double shift. 226 followed by 190 and then 128 is character 12160: ⾀. 240 and over is a triple shift.
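The shift arithmetic can be written out as a small decoder. This is a simplified sketch for illustration only: decodeUtf8 is a made-up name, and the function skips checks a real decoder needs (for example overlong sequences and code points beyond the Unicode range).

```javascript
// Decode an array of UTF-8 byte values into Unicode code points using
// the "Shift key" arithmetic. Simplified: not a complete validator.
function decodeUtf8(bytes) {
  const codePoints = [];
  let i = 0;
  while (i < bytes.length) {
    const b = bytes[i];
    let codePoint;
    let extra; // how many "shifted" bytes follow
    if (b < 128) { codePoint = b; extra = 0; }                             // plain ASCII
    else if (b >= 192 && b < 224) { codePoint = b % 32; extra = 1; }       // single shift
    else if (b >= 224 && b < 240) { codePoint = b % 16; extra = 2; }       // double shift
    else if (b >= 240 && b < 248) { codePoint = b % 8; extra = 3; }        // triple shift
    else throw new Error('Unexpected byte ' + b + ' at position ' + i);
    for (let k = 1; k <= extra; k++) {
      const next = bytes[i + k];
      if (next === undefined || next < 128 || next > 191) {
        throw new Error('Missing continuation byte after position ' + i);
      }
      codePoint = codePoint * 64 + (next % 64);
    }
    codePoints.push(codePoint);
    i += extra + 1;
  }
  return codePoints;
}

console.log(decodeUtf8([72, 208, 175, 226, 190, 128])); // [ 72, 1071, 12160 ]
```

For the two-byte case this reduces to exactly the formula above: (208%32)*64 + (175%64).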

UTF-8 is therefore a multi-byte variable-width encoding. Multi-byte because a single character like Я takes more than one byte to specify it. Variable-width because some characters like H take only 1 byte and some up to 4.

Best of all it is backward compatible with ASCII. Unlike some of the other proposed solutions, any document written only in ASCII, using only characters 0-127, is perfectly valid UTF-8 as well, which saves bandwidth and hassle.
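Both properties, variable width and ASCII compatibility, can be observed with the standard TextEncoder, which always produces UTF-8. In this short sketch utf8Bytes is just a convenience name, and the emoji is an arbitrary example of a 4 byte character.

```javascript
// TextEncoder always encodes to UTF-8.
const encoder = new TextEncoder();

// utf8Bytes is a hypothetical helper returning plain byte values.
function utf8Bytes(text) {
  return Array.from(encoder.encode(text));
}

console.log(utf8Bytes('H').length);  // 1
console.log(utf8Bytes('Я').length);  // 2
console.log(utf8Bytes('⾀').length); // 3
console.log(utf8Bytes('😀').length); // 4

// Pure ASCII text comes out byte-for-byte identical to its ASCII codes.
console.log(utf8Bytes('HELLO').join(', ')); // 72, 69, 76, 76, 79
```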

Trying It Yourself

This is a different experiment. PHP embeds the 6 numbers mentioned above into an HTML page: 72, 208, 175, 226, 190, 128. The browser interprets those numbers as UTF-8, and internally converts them into Unicode code points. Then Javascript outputs the Unicode values. Try changing the character set from UTF-8 to ISO-8859-1 and see what happens:

          <html>
          <head>
          <meta charset="UTF-8">
          </head>
          <body>
          <p>Characters embedded in the page:<br>
          <span id="chars"><?php echo chr(72).chr(208).chr(175).chr(226).chr(190).chr(128); ?></span>
          <p>Character values according to Javascript:<br>
          <script type="text/javascript">
          // List each character of s with its Unicode value.
          function ShowCharacters (s) {
            var r='';
            for (var i=0; i<s.length; i++)
              r += s.charCodeAt (i) + ': ' + s.substr (i, 1) + '<br>';
            return r;
          }
          document.writeln (ShowCharacters (document.getElementById('chars').innerHTML));
          </script>
          </body>
          </html>

If you are in a hurry, this is what it will look like:

A sequence of numbers shown using the UTF-8 character set

Same sequence of numbers shown using the ISO-8859-1 character set

If you display the page using the UTF-8 character set, you will see only three characters: HЯ⾀. If you display it using the character set ISO-8859-1, you will see six separate characters: HÐ¯â¾€. This is what is happening:

On your Web server, PHP is embedding the numbers 72, 208, 175, 226, 190 and 128 into a Web page
The Web page whizzes across the Internet from the Web server to your Web browser
The browser receives those numbers and interprets them according to the character set
The browser internally represents the characters using their Unicode values
Javascript outputs the corresponding Unicode values

Notice that when viewed as ISO-8859-1 the first 5 numbers are the same (72, 208, 175, 226, 190) as their Unicode code points. This is because Unicode borrowed heavily from ISO-8859-1 in that range. The last number however, the euro symbol €, is different. It is at position 128 in ISO-8859-1 (strictly, in Windows-1252, which browsers use when a page asks for ISO-8859-1) and has the Unicode value 8364.
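The browser's side of this experiment can be approximated without a server. The sketch below feeds the same six numbers to the standard TextDecoder under both character sets and lists the resulting Unicode values; pageBytes and codePointsOf are names chosen for this example.

```javascript
// The six numbers PHP embedded in the page.
const pageBytes = new Uint8Array([72, 208, 175, 226, 190, 128]);

// List the Unicode value of every character in a string.
function codePointsOf(text) {
  return Array.from(text, (ch) => ch.codePointAt(0));
}

const asUtf8 = new TextDecoder('utf-8').decode(pageBytes);
const asLatin1 = new TextDecoder('iso-8859-1').decode(pageBytes);

console.log(codePointsOf(asUtf8));   // [ 72, 1071, 12160 ]
// Six values: the first five are 72, 208, 175, 226, 190. The sixth depends
// on how the runtime maps byte 128 (8364 where it is treated as Windows-1252).
console.log(codePointsOf(asLatin1));
```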

Summary Circa 2003

UTF-8 is becoming the most popular international character set on the Internet, superseding the older single-byte character sets like ISO-8859-5. When you view or send a non-English document, you still need to know what character set it uses. For widest interoperability, website administrators need to make sure all their web pages use the UTF-8 character set.

Perhaps the Ð looks familiar: it will sometimes show up if you try to view Russian UTF-8 documents. The next section describes how character sets get confused and end up storing things wrongly in a database.

Lots Of Problems

As long as everybody is speaking UTF-8, this should all work swimmingly. If they aren't, then characters can get mangled. To explain why, imagine a typical interaction with a website, such as a user making a comment on a blog post:

A Web page displays a comment form
The user types a comment and submits.
The comment is sent back to the server and saved in a database.
The comment is later retrieved from the database and displayed on a Web page

This simple process can go wrong in lots of ways and produce the following types of problems:

HTML Entities

Pretend for a moment that you don't know anything about character sets: erase the last 30 minutes from your memory. The form on your blog will probably display itself using the character set ISO-8859-1. This character set doesn't know any Russian or Thai or Chinese, and only a little bit of Greek. If you attempt to copy and paste any into the form and press Submit, a modern browser will try to convert it into HTML numerical entities like &#1071; for Я.

That's what will get saved in your database, and that's what will be output when the comment is displayed, which means it will display fine on a Web page, but cause problems when you try to output it to a PDF or email, or run text searches for it in a database.
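To make the conversion concrete, here is a rough Javascript imitation of it. This is an assumption-laden sketch rather than the browser's real algorithm: toNumericEntities is a made-up name, and it treats every code point above 255 as "not available in ISO-8859-1".

```javascript
// Replace characters that do not fit in ISO-8859-1 with numerical
// HTML entities. Hypothetical helper; a rough imitation only.
function toNumericEntities(text) {
  return Array.from(text, (ch) => {
    const codePoint = ch.codePointAt(0);
    return codePoint > 255 ? '&#' + codePoint + ';' : ch;
  }).join('');
}

console.log(toNumericEntities('Я'));       // &#1071;
console.log(toNumericEntities('café'));    // café (é fits in ISO-8859-1)
console.log(toNumericEntities('Привет'));  // six entities, one per letter
```

Searching the saved text for Привет would now fail, because the database holds the entity strings rather than the letters.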

Confused Characters

How about if you operate a Russian website, and you have not specified a character set in your Web page? Imagine a Russian user whose default character set is ISO-8859-5. To say "hi", they might type Привет. When the user presses Submit, the characters are encoded according to the character set of the sending page. In this case, Привет is encoded as the numbers 191, 224, 216, 210, 213 and 226. Those numbers will get sent across the Internet to the server, and saved like that into a database.

If somebody later views that comment using ISO-8859-5, they will see the correct text. But if they view using a different Russian character set like Windows-1251, they will see їаШТХв. It's still Cyrillic, but makes no sense.

Accented Characters with Lots of Vowels

If someone views the same comment using ISO-8859-1, they will see ¿àØÒÕâ instead of Привет. A longer phrase like Я тоже рада Вас видеть ("nice to see you" in a formal way to a female), submitted as ISO-8859-5, will show up in ISO-8859-1 as Ï âÞÖÕ àÐÔÐ. It looks like that because the 128-255 range of ISO-8859-1 contains lots of vowels with accents.

So if you see this sort of pattern, it's probably because text has been entered in a single byte character set (one of the ISO-8859s or Windows ones) and is being displayed as ISO-8859-1. To fix the text, you'll need to figure out which character set it was entered as, and resubmit it as UTF-8 instead.
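Once the original character set is known, the repair itself is mechanical: decode the stored numbers with that character set, then encode the resulting text as UTF-8. A minimal Javascript sketch, assuming a TextDecoder that supports the legacy label (recodeToUtf8 is a hypothetical helper name):

```javascript
// Re-encode bytes from a known single byte character set into UTF-8.
function recodeToUtf8(bytes, sourceCharset) {
  const text = new TextDecoder(sourceCharset).decode(Uint8Array.from(bytes));
  return Array.from(new TextEncoder().encode(text)); // TextEncoder is always UTF-8
}

// "Привет" as it was saved from an ISO-8859-5 form.
const stored = [191, 224, 216, 210, 213, 226];

console.log(new TextDecoder('iso-8859-5').decode(Uint8Array.from(stored))); // Привет
console.log(recodeToUtf8(stored, 'iso-8859-5'));
// [ 208, 159, 209, 128, 208, 184, 208, 178, 208, 181, 209, 130 ]
```

Guessing the wrong source character set here would not raise an error; it would silently produce valid UTF-8 for the wrong letters, so the guess is the part that needs care.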

Alternating Accented Characters

What if the user submitted the comment in UTF-8? In that case the Cyrillic characters which make up the word Привет would each get sent as 2 numbers: 208/159, 209/128, 208/184, 208/178, 208/181 and 209/130. If you viewed that in ISO-8859-1 it would look like: ÐŸÑ€Ð¸Ð²ÐµÑ‚.

Notice that every other character is a Ð or Ñ. Those characters are numbers 208 and 209, and they tell UTF-8 to switch to the Cyrillic range. So if you see a lot of Ð and Ñ, you can assume that you are looking at Russian text entered in UTF-8, viewed as ISO-8859-1. Similarly, Greek will have lots of Î and Ï, 206 and 207. And Hebrew has alternating ×, number 215.
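That rule of thumb can be turned into a crude detector. The following is a heuristic sketch only, with invented names (LEAD_CHAR_HINTS, viewUtf8AsLatin1, guessScript); it counts the tell-tale lead characters and reports the most frequent script.

```javascript
// Tell-tale characters: the UTF-8 "Shift" bytes as they appear in ISO-8859-1.
const LEAD_CHAR_HINTS = {
  'Ð': 'Cyrillic', 'Ñ': 'Cyrillic', // 208, 209
  'Î': 'Greek', 'Ï': 'Greek',       // 206, 207
  '×': 'Hebrew',                    // 215
};

// Simulate the mistake: encode as UTF-8, then view as ISO-8859-1.
function viewUtf8AsLatin1(text) {
  return new TextDecoder('iso-8859-1').decode(new TextEncoder().encode(text));
}

// Guess which script a mangled string originally contained, or null.
function guessScript(mojibake) {
  const counts = {};
  for (const ch of mojibake) {
    const script = LEAD_CHAR_HINTS[ch];
    if (script) counts[script] = (counts[script] || 0) + 1;
  }
  const ranked = Object.entries(counts).sort((a, b) => b[1] - a[1]);
  return ranked.length ? ranked[0][0] : null;
}

console.log(guessScript(viewUtf8AsLatin1('Привет'))); // Cyrillic
console.log(guessScript(viewUtf8AsLatin1('αβγ')));    // Greek
console.log(guessScript('hello'));                    // null
```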

Vowels Before a Pound and Copyright Sign

A very common issue in the UK is the currency symbol £ getting converted into Â£. This is exactly the same issue as above with a coincidence thrown in to add confusion. The £ symbol has the Unicode and ISO-8859-1 value of 163. Recall that in UTF-8 any character over 127 is represented by a sequence of two or more numbers. In this case, the UTF-8 sequence is 194/163. Mathematically, this is because (194%32)*64 + (163%64) = 163.

Visually it means that if you view the UTF-8 sequence using ISO-8859-1, it appears to gain a Â which is character 194 in ISO-8859-1. The same thing happens for all Unicode code points 161-191, which includes © and ® and ¥.
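The coincidence can be checked for the whole range in one loop. This sketch uses only the standard TextEncoder and TextDecoder; the variable names are arbitrary.

```javascript
const enc = new TextEncoder();

// The pound sign as UTF-8, and how that looks when read as ISO-8859-1.
const sequence = Array.from(enc.encode('£'));
const misread = new TextDecoder('iso-8859-1').decode(Uint8Array.from(sequence));
console.log(sequence);                      // [ 194, 163 ]
console.log((194 % 32) * 64 + (163 % 64));  // 163
console.log(misread);                       // Â£

// Every code point from 161 to 191 is encoded as 194 followed by itself.
const gainsCircumflexA = [];
for (let codePoint = 161; codePoint <= 191; codePoint++) {
  const [lead, tail] = enc.encode(String.fromCodePoint(codePoint));
  if (lead === 194 && tail === codePoint) gainsCircumflexA.push(codePoint);
}
console.log(gainsCircumflexA.length); // 31
```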

Black Diamond Question Marks

How about the other way around? If you enter Привет as ISO-8859-5, it will get saved as the numbers shown above: 191, 224, etc. If you then try to view this as UTF-8, you may well see lots of question marks inside black diamonds: �. The browser displays these when it can't make sense of the numbers it is reading.

UTF-8 is self-synchronizing. Unlike other multi-byte character encodings, you always know where you are with UTF-8. If you see a number 192-247, you know you are at the beginning of a multi-byte sequence. If you see 128-191 you know you are in the middle of one. There's no danger of missing the first number and garbling the rest of the text.
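Self-synchronization follows directly from those number ranges, as this small sketch shows. classifyByte and nextCharacterStart are illustrative names, not library functions.

```javascript
// Say what role a byte value plays in a UTF-8 stream.
function classifyByte(b) {
  if (b >= 0 && b <= 127) return 'ascii';
  if (b >= 128 && b <= 191) return 'continuation'; // middle of a sequence
  if (b >= 192 && b <= 247) return 'lead';         // start of a sequence
  return 'invalid';
}

// Starting anywhere, skip forward to the next place a character can begin.
function nextCharacterStart(bytes, from) {
  let i = from;
  while (i < bytes.length && classifyByte(bytes[i]) === 'continuation') i++;
  return i;
}

const stream = [72, 208, 175, 226, 190, 128];
console.log(stream.map(classifyByte));
// [ 'ascii', 'lead', 'continuation', 'lead', 'continuation', 'continuation' ]
console.log(nextCharacterStart(stream, 2)); // 3: dropped into the middle of Я, resume at ⾀
```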

This means that in UTF-8, the sequence 191 followed by 224 will never occur naturally, so the browser doesn't know what to do with it and displays �� instead.

This can also cause £ and © related problems. £50 in ISO-8859-1 is the numbers 163, 53 and 48. The 53 and 48 cause no issues, but in UTF-8, 163 can never occur by itself, so this will show up as �50. Similarly if you see �2012, it is probably because ©2012 was input as ISO-8859-1 but is being displayed as UTF-8.
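The same substitution can be reproduced with the standard TextDecoder, which by default replaces anything it cannot make sense of with the replacement character (code point 65533). A short sketch:

```javascript
// A non-fatal UTF-8 decoder turns unreadable bytes into U+FFFD (�).
const utf8 = new TextDecoder('utf-8');

const price = utf8.decode(new Uint8Array([163, 53, 48])); // £50 saved as ISO-8859-1
const greeting = utf8.decode(new Uint8Array([191, 224])); // start of Привет saved as ISO-8859-5

console.log(price);                // �50
console.log(greeting);             // ��
console.log(price.codePointAt(0)); // 65533
```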

Blanks, Question Marks and Boxes

Even if it is fully up to speed with UTF-8 and Unicode, a browser still may not know how to display a character. The first few ASCII characters 0-31 are mostly control sequences for teleprinters (things like Acknowledge and Stop). If you try to display them, a browser might show a ? or a blank or a box with tiny numbers inside it.

Also, Unicode defines over 110,000 characters. Your browser may not have the right font to display all of them. Some of the more obscure characters may also get shown as ? or blank or a small box. In older browsers, even fairly common non-English characters may show as boxes.

Older browsers may also behave differently for some of the issues above, showing ? and blank boxes more often.

Databases

The discussion above has avoided the middle step in the process: saving data to a database. Databases like MySQL can also specify a character set for a database, table or column. But it is less important than the Web page's character set.

When saving and retrieving data, MySQL deals just with numbers. If you tell it to save number 163, it will. If you give it 208/159 it will save those 2 numbers. And when you retrieve the data, you'll get the same two numbers back.

The character set becomes more important when you use database functions to compare, convert and measure the data. For example, the LENGTH of a field may depend on its character set, as do string comparisons using LIKE and =. The method used to compare strings is called a collation.
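MySQL collations are configured in SQL and are not shown here. Purely as an analogy, the idea that the same two strings can be equal under one comparison method and different under another can be seen with Javascript's standard Intl.Collator; the variable names below are arbitrary.

```javascript
// Two ways of comparing the same strings, in the spirit of two collations.
const strict = new Intl.Collator('en', { sensitivity: 'variant' });
const accentInsensitive = new Intl.Collator('en', { sensitivity: 'base' });

console.log(strict.compare('a', 'à') === 0);            // false: the accent matters
console.log(accentInsensitive.compare('a', 'à') === 0); // true: the accent is ignored
console.log(accentInsensitive.compare('a', 'A') === 0); // true: so is the case
console.log(accentInsensitive.compare('a', 'b') < 0);   // true: a still sorts before b
```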

Character sets and collations in MySQL are an in-depth subject. It's not simply a case of changing the character set of a table to UTF-8. There are further SQL commands to take into account to make sure the data goes in and out in the right format as well.

Trying It Yourself

The following PHP and Javascript code allows you to experiment with all these issues. You can specify which character set is used to input and output text, and you can see what the browser thinks about it as well.

          <?php
          // Experiment only: input is echoed back unescaped, so do not deploy this publicly.
          $charset = $_POST['charset'] ?? '';
          if (!$charset) $charset = 'ISO-8859-1';
          $string = $_POST['string'] ?? '';
          if ($string) {
            echo '<p>This is what PHP thinks you entered:<br>';
            // strlen and substr work byte by byte, so each byte is listed separately.
            for ($i=0; $i<strlen($string); $i++) {$c=substr ($string,$i,1); echo ord ($c).': '.$c.' <br/>';}
          }
          ?>
          <html>
          <head>
          <meta charset="<?=$charset?>">
          </head>
          <body>
          <form method="post">
          <input name="lastcharset" type="hidden" value="<?php echo $charset?>"/>
          Form was submitted as: <?php echo $_POST['lastcharset'] ?? ''?><br/>
          Text is displayed as: <?php echo $charset?><br/>
          Text will be submitted as: <?php echo $charset?><br/>
          Copy and paste or type here: <input name="string" type="text" size="20" value="<?php echo $string?>"/><br/>
          Next page will display as: <select name="charset"><option>ISO-8859-1<option>ISO-8859-5<option>Windows-1251<option>ISO-8859-7<option>UTF-8</select><br/>
          <input type="submit" value="Submit" onclick="ShowCharacters (this.form.string.value); return 1;"/>
          </form>
          <script type="text/javascript">
          // Show the Unicode value of each character as the browser sees it.
          function ShowCharacters (s) {
            var r='You entered:';
            for (var i=0; i<s.length; i++) r += '\n' + s.charCodeAt (i) + ': ' + s.substr (i, 1);
            alert (r);
          }
          </script>
          </body>
          </html>

This is an example of the code in action. The numbers at the top are the numerical values of each of the characters and their representation (when viewed individually) in the current character set:

Example of inputting and outputting in different character sets. This shows a £ sign turning into a � in Google Chrome.

The page above shows the previous, current and future character sets. You can use this code to quickly see how text can get really mangled. For example, if you pressed Submit again above, the � has Unicode code point 65533 which is 239/191/189 in UTF-8 and will be displayed as ï¿½50 in ISO-8859-1. So if you ever get £ symbols turning into ï¿½, that is probably the route they took.
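The second half of that route can be retraced in a few lines of Javascript, using only the standard TextEncoder and TextDecoder. The variable names are arbitrary.

```javascript
// Step 1: the replacement character (65533) followed by "50" is saved as UTF-8.
const replacement = String.fromCodePoint(65533);
const savedBytes = Array.from(new TextEncoder().encode(replacement + '50'));
console.log(savedBytes); // [ 239, 191, 189, 53, 48 ]

// Step 2: those bytes are later displayed as ISO-8859-1.
const displayed = new TextDecoder('iso-8859-1').decode(Uint8Array.from(savedBytes));
console.log(displayed); // ï¿½50
```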

Note that the select box at the bottom will change back to ISO-8859-1 each time.

One Solution

All the encoding problems above are caused by text being submitted in one character set and viewed in another. The solution is to make sure that every page on your website uses UTF-8. You can do this with one of these lines immediately after the <head> tag:

          <meta charset="UTF-8">
          <meta http-equiv="Content-type" content="text/html; charset=UTF-8">

It has to be one of the first things in your Web page, as it will cause the browser to look again at the page in a whole new light. For speed and efficiency, it should do this as soon as possible.

You can also specify UTF-8 in your MySQL tables, though to fully use this feature, you'll need to delve deeper.

Note that users can still override the character set in their browsers. This is rare, but does mean that this solution is not guaranteed to work. For extra safety, you could implement a back-end check to ensure data is arriving in the right format.
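What such a check might look like depends on your back end. As one possible sketch, assuming a Javascript back end such as Node.js rather than the PHP used elsewhere in this article, the standard TextDecoder can be asked to fail instead of substituting �. The name isValidUtf8 is invented for this example.

```javascript
// Return true only if the bytes form well-formed UTF-8.
// With fatal: true the decoder throws instead of inserting U+FFFD.
function isValidUtf8(bytes) {
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(Uint8Array.from(bytes));
    return true;
  } catch (err) {
    return false;
  }
}

console.log(isValidUtf8([208, 159, 209, 128])); // true:  "Пр" sent as UTF-8
console.log(isValidUtf8([191, 224]));           // false: "Пр" sent as ISO-8859-5
console.log(isValidUtf8([163, 53, 48]));        // false: "£50" sent as ISO-8859-1
```

A check like this only proves the bytes are well-formed UTF-8. Plain ASCII passes under any of the character sets discussed here, and it cannot tell you which character set a rejected submission actually used.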

Existing Websites

If your website has already been collecting text in a variety of languages, then you will also need to convert your existing data into UTF-8. If there is not much of it, you can use a PHP page like the one above to figure out the original character set, and use the browser to convert the data into UTF-8.

If you have lots of data in various character sets, you'll need to first detect the character set and then convert it. In PHP you can use mb_detect_encoding to detect and iconv to convert. Reading the comments for mb_detect_encoding, it looks like quite a fussy function, so be sure to experiment to make sure you are using it properly and getting the right results.

A potentially misleading function is utf8_decode. It turns UTF-8 into ISO-8859-1. Any characters not available in ISO-8859-1 (like Cyrillic, Greek, Thai, etc) are turned into question marks. It's misleading because you might have expected more from it, but it does the best it can.

Summary

This article has relied heavily on numbers and has tried to leave no stone unturned. Hopefully it has provided an exhaustive understanding of character sets, Unicode, UTF-8 and the various problems that can arise. The morals of the story are:

You need to know the character set in order to make sense of non-Latin text,
Internally, browsers use Unicode to represent characters,
Make sure all your Web pages specify the UTF-8 character set.

For a slightly different approach to this subject, this 2003 character set article is excellent.

Thank you for sticking with this epic journey!
