Chinese characters, unicode and other encoding types

This is an archive of the phpBB 2.0.x convertors forum. Support for phpBB2 has now ended.
Fozza
Registered User
Posts: 25
Joined: Mon Jan 20, 2003 2:14 pm
Location: Plymouth UK

Post by Fozza »

Hi!

First off, thanks for a great piece of free software - The more I play with the board, the more I see how flexible and well designed it is... Best board I've played with so far certainly!

But...

No matter how hard I try, I can't get the Chinese language pack to work the way I want it to. I can get my board to use either English UK encoding, or Chinese (simplified) encoding as defined in the user profile.

I'm having a problem because we are not native Chinese speakers, and want to use the English template - but be able to use Chinese characters within the posts.

What's happening, is that I'm posting in Chinese successfully, but users with the English profile see gibberish - even when using Chinese enabled browsers :oops:

From the other posts on this forum, I can see that this is because the HTML standard only allows a page to use one encoding method at a time. Now, what I need to know is:

- If I set the templates to use ONLY the Chinese encoding ... what will happen to non-Chinese enabled browsers... will they be able to view the English text as normal?

- If I set the encoding to unicode, are there any unpleasant side effects... again, to non-enabled browsers? ... Why not use unicode from the outset?

Thanks in advance for any help - I'm new to this encoding malarky :D
Fozza
Registered User
Posts: 25
Joined: Mon Jan 20, 2003 2:14 pm
Location: Plymouth UK

Post by Fozza »

Sorry, forgot to mention (In case it helps)

Using PHPBB 2.0.2 with no MODs and standard subsilver templates.

Board is at http://www.fozza.com/messages/
Fozza
Registered User
Posts: 25
Joined: Mon Jan 20, 2003 2:14 pm
Location: Plymouth UK

Post by Fozza »

Ok, think I cracked it by myself

Setting the encoding to "utf-8" in the meta tags in the templates seems to work, setting the browser to use unicode by default.

Now Chinese characters go in and come out without any setting needing changing on the browser. One of the questions still stands though, are there any downsides to using unicode as opposed to the default? ... Why isn't this the norm for the PHPBB distribution?

IMHO, if there are no side-effects, unicode by default would solve a lot of the language problems posted in this forum... or have I missed something? :)
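For anyone wondering why the same post reads fine for one user and as gibberish for another, here is a minimal Python sketch. It is not phpBB code, and the sample text and the choice of gb2312 and iso-8859-1 are assumptions for illustration. It decodes the same bytes under two different declared charsets, and shows that the plain ASCII part survives either way.

```python
# Illustration only: the sample text and charsets are assumed, not taken from phpBB.
text = "中文 and English"

# Bytes as a page written in a Chinese codepage would contain them.
gb_bytes = text.encode("gb2312")

# A browser told the page is iso-8859-1 decodes those same bytes differently.
seen_as_latin1 = gb_bytes.decode("iso-8859-1")
print(ascii(seen_as_latin1))  # the Chinese part has become accented Latin letters

# The ASCII part is untouched, because both encodings share the ASCII range.
ascii_survives = seen_as_latin1.endswith(" and English")

# When UTF-8 is both declared and used, the mixed text round-trips intact.
round_tripped = text.encode("utf-8").decode("utf-8")
print(ascii_survives, round_tripped == text)
```

This is probably also why English text stays readable on a page served in a Chinese codepage: only the bytes outside the ASCII range depend on the declared charset.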
psoTFX
Former Team Member
Posts: 7425
Joined: Tue Jul 03, 2001 8:50 pm

Post by psoTFX »

You've missed something :) Unicode is great and would save me a heck of a lot of time and effort ... however it's not quite "universal" at this time. Not all browsers are Unicode compliant, PHP is still primarily iso-8859-1 based and not all databases are happy to handle multibyte character sets by default. Thus we're still somewhat limited to existing codepages for "basic" implementation.
dirk-san
Registered User
Posts: 18
Joined: Tue Jan 21, 2003 7:55 am
Location: Tokyo, Japan

Post by dirk-san »

I am facing the same problem, but I am creating an English-German-Japanese board. But like the original poster, first of all I would like to say THANKS for the slick piece of software that we have the privilege of using for free even!!

I am stuck with the following issue after changing the charset to UTF-8 in templates/subSilver/overall_header.tpl and simple_header. I understand that from now on all user-contributed content is presented in UTF-8. That works fine, I can mix everything and German clients will see the Japanese text written by other users.

However all of the stuff from the installed language kits is now broken, i.e. German special characters and all Japanese in the rest of the user interface are no longer displayed properly. Do the language kits themselves have to be converted into UTF-8 somehow as well? :roll:

Regards

Dirk
psoTFX
Former Team Member
Posts: 7425
Joined: Tue Jul 03, 2001 8:50 pm

Post by psoTFX »

Yes, they must be converted ... as noted within them they all have specific encodings. I have covered this somewhere before, try a search for "unicode recode language" or something like that.
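As a rough idea of what such a conversion involves, here is a minimal Python sketch (not the PHP recode module, and not an official phpBB tool). The function name recode_to_utf8 and the file names are hypothetical; the source encoding has to be whatever the language pack itself declares.

```python
import tempfile
from pathlib import Path


def recode_to_utf8(src: Path, dst: Path, source_encoding: str) -> None:
    """Read src as source_encoding and write the same text to dst as UTF-8."""
    # Decoding raises UnicodeDecodeError if the bytes are invalid in
    # source_encoding; a plausible but wrong guess can still succeed
    # silently, so the result should be checked by eye.
    text = src.read_bytes().decode(source_encoding)
    dst.write_bytes(text.encode("utf-8"))


# Small self-contained demo on a throwaway file (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    legacy = Path(tmp) / "lang_legacy.php"
    recoded = Path(tmp) / "lang_utf8.php"
    legacy.write_bytes("Grüße".encode("iso-8859-1"))
    recode_to_utf8(legacy, recoded, "iso-8859-1")
    converted = recoded.read_bytes()

print(converted == "Grüße".encode("utf-8"))
```

For a Japanese pack the same call would take a Japanese codec name instead (Python ships "euc_jp" and "shift_jis", for example); which one applies is stated in the pack. Any encoding declaration inside the converted file would presumably need to be changed to match as well.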
davidh44
Registered User
Posts: 386
Joined: Sat Mar 09, 2002 5:56 am

Post by davidh44 »

After updating to 2.0.4, Chinese and Japanese now display correctly in posts. It was gibberish in previous 2.0.x versions, and I made no changes to the charset when I made the switch to 2.0.4

Works here too: すごいです
When I check the encoding type in my IE browser, it's set to "Western European (ISO)", not "Unicode (UTF8)". So I'm not quite sure why it's working now.
dirk-san
Registered User
Posts: 18
Joined: Tue Jan 21, 2003 7:55 am
Location: Tokyo, Japan

Post by dirk-san »

Hello again, and thanks for replying.

I have done a dozen or so searches (even just user psoTFX and unicode), but nothing useful on conversion turned up. Yes, the question about encoding pops up many times and I understand why you get irritated by it.

However, I am coming from a different angle. Most people, including the previous poster, don't seem to understand the difference between the language packs and posts input by the user. My user input is working fine because I switched everything to utf-8. But that means that the language packs themselves are now in the wrong encoding and therefore gibberish, thus my question.

If it was only German I would do a search & replace on 3-4 special characters to &#1234 but the Japanese has to work too...

Thanks for your attention

Dirk
Fozza
Registered User
Posts: 25
Joined: Mon Jan 20, 2003 2:14 pm
Location: Plymouth UK

Post by Fozza »

psoTFX wrote: You've missed something :) Unicode is great and would save me a heck of a lot of time and effort ... however it's not quite "universal" at this time. Not all browsers are Unicode compliant, PHP is still primarily iso-8859-1 based and not all databases are happy to handle multibyte character sets by default. Thus we're still somewhat limited to existing codepages for "basic" implementation.


Ah I see! 8)

... Maybe, given the level of confusion over this matter, it could be an idea to mention this in one of the FAQs or even give an option in the admin panel to change the encoding type (Default/Unicode for example), then people can choose which to use - without having to dabble in the templates. I understand you want the board to work on all possible platforms (naturally), but this must be a common stumbling block for multi-national language users.... :?:

Like Dirk-San here, I also spent a long time searching through this forum to check for answers before I posted... all I could find were several questions similar to mine! :)

Thanks for the help :D
psoTFX
Former Team Member
Posts: 7425
Joined: Tue Jul 03, 2001 8:50 pm

Post by psoTFX »

dirk-san wrote: However, I am coming from a different angle. Most people, including the previous poster, don't seem to understand the difference between the language packs and posts input by the user. My user input is working fine because I switched everything to utf-8. But that means that the language packs themselves are now in the wrong encoding and therefore gibberish, thus my question.

I understand what you're saying, and I have gone over this before ... perhaps on the development forums rather than here, I just don't remember. You need to look at the PHP function "recode", http://www.php.net/manual/en/ref.recode.php Note that I've never used this module personally and so cannot comment on its suitability.
davidh44 wrote: After updating to 2.0.4, Chinese and Japanese now display correctly in posts. It was gibberish in previous 2.0.x versions, and I made no changes to the charset when I made the switch to 2.0.4

Works here too: すごいです
When I check the encoding type in my IE browser, it's set to "Western European (ISO)", not "Unicode (UTF8)". So I'm not quite sure why it's working now.

What you're seeing are Unicode NCRs rather than "native" support for the given character sets. In previous releases we converted all HTML entities (& &#xxxx; etc.) into a "visible form" because that's what we were asked to originally do.

However in 2.0.4 this was relaxed a little so &#xxxx; is ignored and remains "as is". There are problems with this approach though, e.g. fulltext search fails, long sentences will be cut as the database fields are of fixed (obviously!) lengths, etc.

Thus it's still heavily recommended that users utilise the appropriate language pack/encoding rather than rely on this.
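To make the NCR drawbacks concrete, here is a small Python sketch (again not phpBB code). It turns the Japanese sample from the quoted post into numeric character references and shows the size growth, the failed literal search and a truncation. The column width of 20 is an assumed number picked only to show the effect.

```python
import html

text = "すごいです"

# Replace every non-ASCII character with a decimal numeric character reference.
ncr = text.encode("ascii", "xmlcharrefreplace").decode("ascii")

# Five characters turn into forty characters of markup.
lengths = (len(text), len(ncr))

# A browser turns the references back into characters when rendering.
rendered = html.unescape(ncr)

# A search for the original word finds nothing in the stored form.
found_in_stored = text in ncr

# A fixed-length column (20 is an assumed width) cuts a reference in half.
truncated = ncr[:20]
print(lengths, rendered == text, found_in_stored, truncated)
```

The page still displays correctly because the browser expands the references, which would explain why the browser's encoding menu can stay on "Western European (ISO)".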
dirk-san
Registered User
Posts: 18
Joined: Tue Jan 21, 2003 7:55 am
Location: Tokyo, Japan

Post by dirk-san »

Thanks for the reply.

To be honest I have no clue about PHP, so this module may do the right thing but... I was hoping this could be sort of an open file, save as... type of job :wink:

Maybe I will contact the person who did the localisation, maybe he can supply those in Unicode to me.

From your other comment the way I understand it is that &#1234 won't work as substitutions, so I could not even do a find/replace on the file...

Regards

Dirk
dirk-san
Registered User
Posts: 18
Joined: Tue Jan 21, 2003 7:55 am
Location: Tokyo, Japan

Post by dirk-san »

OK, I solved the whole thing:

First I solved the German problem by doing find/replace searches on the characters in my word processor and then pasting the content through an ssh/vi session into the .php files (line count doubles because of the carriage returns). Opening the files with vi shows the characters as gibberish - but who cares when it works, I will never touch them again. :)

It also seems possible to edit the files directly, although I had some funny behaviour in vi and a core dump. :?

Then I turned to Japanese. I opened the .php file in Netscape and set encoding to the right Japanese set so that the text would display properly. Then select all and copy to clipboard, then paste the whole thing in ssh/vi session into the corresponding .php file on the server. Finished! All of this is on Mac OS X by the way, using Netscape 7, TextEdit and the Terminal.app.

Now my board runs 100% Unicode!

Thanks for all the input guys.

Dirk
Fozza
Registered User
Posts: 25
Joined: Mon Jan 20, 2003 2:14 pm
Location: Plymouth UK

Post by Fozza »

Nice one Dirk-san :D

I too have everything working nicely with Unicode (But I'm just using the English templates for now - until I can read Chinese properly ;))

... but I've now discovered the search tool can't cope with non-latin words due to PHP being a little out-of-date with multi-byte character sets :cry:

Oh well, 9 out of 10 features on my internationalisation wish list anyway 8)
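A rough Python illustration of the kind of thing that goes wrong (this is an assumed stand-in, not the board's actual search code): a tokenizer that works on bytes and only recognises ASCII letters and digits drops every non-Latin word, while matching on decoded characters keeps them.

```python
import re

post = "検索 search テスト"

# Byte-oriented matching that only knows ASCII letters and digits
# (an assumed stand-in for a single-byte-minded tokenizer).
byte_words = re.findall(rb"[A-Za-z0-9]+", post.encode("utf-8"))

# Character-oriented matching on decoded text keeps the non-Latin words.
char_words = re.findall(r"\w+", post)

print(byte_words, len(char_words))
```

Even the character-aware version relies on spaces between words, so running Japanese or Chinese text would still need proper word segmentation before it could be indexed usefully.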
dirk-san
Registered User
Posts: 18
Joined: Tue Jan 21, 2003 7:55 am
Location: Tokyo, Japan

Post by dirk-san »

Ah, I see... yes, that's one for the to-do list. But I can live without it, I am happy to get the basics done, like the user interface itself. If I ran an all Japanese board that would be different, as users would expect searches to work.

With all due respect, I think my users won't discover the search function until 6 months later - if at all ... :wink:

Regards

Dirk
Puddin
Registered User
Posts: 4
Joined: Mon Jan 27, 2003 4:50 pm

Post by Puddin »

dirk-san wrote: Then I turned to Japanese. I opened the .php file in Netscape and set encoding to the right Japanese set so that the text would display properly. Then select all and copy to clipboard, then paste the whole thing in ssh/vi session into the corresponding .php file on the server. Finished! All of this is on Mac OS X by the way, using Netscape 7, TextEdit and the Terminal.app.

Now my board runs 100% Unicode!

Thanks for all the input guys.

Dirk

Hi, I have the same problem as you. I am using version 2.0.2 but cannot display either Chinese or Japanese. Could you tell me in more detail how you converted the encoding? I am using a Mac with OS X 10.2.3. Thanks a lot.