Unicode "unknown character" squares randomly appear

Posted: Mon Jun 30, 2008 2:00 pm
by Nicholas the Italian
Hi,
I'm having some trouble with my phpBB3 board.

Sometimes, when posting messages with special characters (like accents), the mysterious "squares with a question mark for unknown Unicode characters" appear.
The funny thing is that there seems to be no common pattern in such events: dates, times, users, tags used, message contents, server load, moon phases, nothing that I could think of.
Sometimes it happens to myself, sometimes to users; it is generally sufficient to edit the message, correct it and resubmit (identical to the original) to fix it; rarely you need to edit it twice.

The only common pattern is the position of the squares, when the problem happens:
- 1st accented letter: no issue;
- 2nd accented letter: square immediately after it;
- 3rd accented letter: square 2 characters after it;
- ...
- Nth accented letter: square N-1 characters after it.
Each square replaces two (sometimes three? not sure) other characters. The special characters themselves show correctly.
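
For what it's worth, the N-1 drift in the list above matches the gap between byte offsets and character offsets in UTF-8 when each accented letter takes two bytes. Here is a minimal Python sketch of that arithmetic. The function name `offset_drift` and the sample sentence are made up for illustration, and this only shows the numbers; it does not prove where the corruption is introduced.

```python
def offset_drift(text):
    """For each multibyte character, report how far its UTF-8 byte offset
    has drifted from its character offset."""
    rows = []
    byte_pos = 0
    n = 0  # running count of multibyte characters seen so far
    for char_pos, ch in enumerate(text):
        width = len(ch.encode("utf-8"))
        if width > 1:
            n += 1
            rows.append((n, ch, char_pos, byte_pos, byte_pos - char_pos))
        byte_pos += width
    return rows


# Sample text "perché è così", written with escapes so the accented
# letters are guaranteed to be single precomposed code points.
sample = "perch\u00e9 \u00e8 cos\u00ec"
for n, ch, char_pos, byte_pos, drift in offset_drift(sample):
    print(n, ch, char_pos, byte_pos, drift)
```

With two-byte accented letters, the drift for the Nth one is N-1, which is the same progression as the square positions. If anything in the chain mixes up byte counts and character counts, this is the kind of shift one might expect to see.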

I found a couple of probably related topics: http://www.phpbb.com/community/viewtopi ... 6&t=579825 and http://www.phpbb.com/community/viewtopi ... 6&t=559587, but neither has a conclusive answer.

Additional info:
- PHP 5.2.5;
- MySQL 5.0.51a (no MySQLi extension);
- tables are utf8_bin, MyISAM engine;
- posts table is about 25MB (around 10K posts);
- board was upgraded from phpBB2;
- issue has been present ever since I started using phpBB3 (earlier 3.0.RC1, now 3.0.0), never ceased and never got worse;
- if you need other info, just ask.

Thanks for your attention.

Re: Unicode "unknown character" squares randomly appear

Posted: Sun Jul 06, 2008 10:09 pm
by Dundurs
I also have this annoying problem. Sometimes posting from IE7 gives no errors while FF gives this error many times a day, though not always. After a month things changed: IE7 gives errors and FF does not. The problem is old; nothing has changed since the first RC. I have heard of the same problem for French, German and now Italian. For me it's Latvian. The only solution was to change PHP to a version prior to 5.2, but I have no control over that. The host tried it for me for one month, but then changed back to 5.2.5 and my problems came back. There also hasn't been any staff answer on this.

Re: Unicode "unknown character" squares randomly appear

Posted: Mon Jul 07, 2008 12:48 pm
by Nicholas the Italian
Thanks for bringing the issue back up.
Dundurs wrote:The only solution was to change PHP to a version prior to 5.2, but I have no control over that. The host tried it for me for one month, but then changed back to 5.2.5 and my problems came back.
Hmm... how sure are we that PHP 5.2 is the issue?

Re: Unicode "unknown character" squares randomly appear

Posted: Mon Jul 07, 2008 1:13 pm
by Eelke
We need to clear up where this problem originates. You state that it only happens when posting with a certain browser and that you can correct the problem by posting again. This would suggest that the problem originates in the way the browser sends the data. How do the problematic posts look from another browser? What is the encoding set to in the browser when this problem occurs (e.g. "View > Character encoding")?

Re: Unicode "unknown character" squares randomly appear

Posted: Mon Jul 07, 2008 6:00 pm
by Nicholas the Italian
Eelke wrote:You state that it only happens when posting with a certain browser and that you can correct the problem by posting again. This would suggest that the problem originates in the way the browser sends the data. How do the problematic posts look from another browser? What is the encoding set to in the browser when this problem occurs (e.g. "View > Character encoding")?
I didn't state that. I need to ask my users who experienced the problem, but I'm quite sure some of them use IE (while I don't).
I didn't check the encoding. I would have to check every time (and the problem happens rarely), but what would trigger a character encoding change?
Honestly it looks more like a server-side issue, I might be wrong of course.
I'm gonna ask for browser and O/S too, stay tuned.

Re: Unicode "unknown character" squares randomly appear

Posted: Wed Jul 09, 2008 10:10 pm
by Nicholas the Italian
Ok, one user for example uses IE7 on Vista. I use Firefox on XP. It definitely looks like a server-side issue, either PHP or MySQL, I don't think phpBB is to blame but the issue should be investigated.
What other info would be useful?

Re: Unicode "unknown character" squares randomly appear

Posted: Thu Jul 10, 2008 12:20 am
by SamG
I apologize if you already covered this and I missed it or I'm just misunderstanding: The post is initially entered and stored this way? And then edited to correct the problem? If so, this begins client side, doesn't it? Within the context of the form?

If so, then the question would seem to be, what can randomly influence the page encoding (which is part of what I think Eelke is driving at)? But to know if the encoding as delivered is being altered, you'd want to see what the browser reports the page encoding to be when the problem occurs, I think.

Re: Unicode "unknown character" squares randomly appear

Posted: Thu Jul 10, 2008 12:37 am
by Nicholas the Italian
SamG wrote:The post is initially entered and stored this way?
It is entered correctly and stored in the db the wrong way.
SamG wrote:If so, this begins client side, doesn't it?
Well, not necessarily, I think.
SamG wrote:But to know if the encoding as delivered is being altered, you'd want to see what the browser reports the page encoding to be when the problem occurs, I think.
Hu-hum. So, if I post something, and the problem occurs, I should go back and look into Page info > Encoding? I can do that.

Re: Unicode "unknown character" squares randomly appear

Posted: Thu Jul 10, 2008 12:47 am
by SamG
OK, I get it now. Sort of. So it's entered correctly, and then these unknown characters are inserted into the post after submission. In the database the post is wrong, but it's entered correctly. I guess it might still be a page encoding issue...

But supposing it's not, we'd like to know what a server could be doing occasionally that would alter the post -- basically inserting characters, though. That's the part that's most confusing to me, regardless of whether it's server or client side. I wonder if it's always the same character that gets inserted...

Re: Unicode "unknown character" squares randomly appear

Posted: Thu Jul 10, 2008 12:31 pm
by Nicholas the Italian
I'm confused too. :)

It is not always the same character -- or better, any "special" character (multibyte character in UTF-8) can trigger the issue. The point is that it doesn't seem to be strictly content-related, since editing the messed-up message back to the one originally entered generally solves the issue: the same identical content is sent by the browser (apart from possible encoding changes, which themselves would have to be triggered by something else -- remember, this behavior is browser-independent), with different results in the db.
It looks like it's something time-related: at some moments it just doesn't work, a little later it does.

I can report a few examples of messages which showed this problem, if this can help, but I can't find any common pattern, except the one I mentioned in post #1.
The fact that special characters trigger the issue but are not affected (i.e. they show correctly) is what surprises me the most. It looks like there's some byte scrambling or mixing happening somewhere.
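
To illustrate why a plain charset mix-up does not seem to fit, here is a small Python sketch (the function name `wrong_charset_roundtrip` is made up, and this is only a model of a mismatch, not of what phpBB or the browser actually does). In both directions the accented letter itself gets damaged, which is the opposite of what shows up in the affected posts.

```python
def wrong_charset_roundtrip(text):
    """Model the two classic mismatches between Latin-1 and UTF-8."""
    # Case 1: text sent as Latin-1 but read as UTF-8.
    # The accented byte is not valid UTF-8, so it becomes U+FFFD.
    sent_latin1_read_utf8 = text.encode("latin-1").decode("utf-8", errors="replace")
    # Case 2: text sent as UTF-8 but read as Latin-1.
    # Each byte of the accented letter turns into its own character.
    sent_utf8_read_latin1 = text.encode("utf-8").decode("latin-1")
    return sent_latin1_read_utf8, sent_utf8_read_latin1


# Sample word "però", written with an escape for the accented letter.
a, b = wrong_charset_roundtrip("per\u00f2")
print(repr(a))  # the accented letter is replaced by U+FFFD
print(repr(b))  # the accented letter becomes two Latin-1 characters
```

So if the page encoding were simply wrong at submit time, I would expect the accented letters themselves to break, not the characters some distance after them.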

So, hum, what PHP functions does phpBB use to prepare posts for insertion? Are they multibyte-safe?
(Still, how could this not be content-dependent?)

I'm looking into PHP bug tracker to see if I can spot anything.

Re: Unicode "unknown character" squares randomly appear

Posted: Thu Jul 10, 2008 1:11 pm
by Nicholas the Italian
Not sure whether any of these may be related:
http://bugs.php.net/bug.php?id=45311
http://bugs.php.net/bug.php?id=44868
http://bugs.php.net/bug.php?id=37661
http://bugs.php.net/bug.php?id=38926 (that's the only one that looks like it might be time-dependent)
http://bugs.php.net/bug.php?id=39279
http://bugs.php.net/bug.php?id=43840
http://bugs.php.net/bug.php?id=43841
http://bugs.php.net/bug.php?id=44617

I use PHP 5.2.5.
The forum search engine is fulltext MySQL.

From phpinfo():
- Multibyte string engine: libmbfl
- Multibyte regex (oniguruma) version: 4.4.4
- mbstring.detect_order: no value
- mbstring.encoding_translation: Off
- mbstring.func_overload: 0
- mbstring.http_input: pass
- mbstring.http_output: pass
- mbstring.internal_encoding: no value
- mbstring.language: neutral
- mbstring.script_encoding: no value
- mbstring.strict_detection: Off
- mbstring.substitute_character: no value
- among configure commands there are '--enable-mbstring' and '--enable-zend-multibyte'

Re: Unicode "unknown character" squares randomly appear

Posted: Mon Jul 14, 2008 11:25 pm
by Dundurs
I correct messages all day long. For some, if I can understand the meaning of the remaining text, it's enough to click "edit", correct the text and send it again. For others I have to do this more than 5 times before the text is finally stored correctly. But here I can't do anything, since the meaning is completely lost: http://www.uscars.lv/nt_Diskusijas/view ... =13&t=1836 (Latvian)


In the next message the meaning can still be read, and it says: When will this boring shit come to an end?!


I feel so tired... :(

Ask me for any information to help staff solve this as fast as possible...

Re: Unicode "unknown character" squares randomly appear

Posted: Mon Jul 14, 2008 11:32 pm
by Dundurs
Just for test here:

īdzīgi glāžšķūņu rūķīši

and there it works ...

Re: Unicode "unknown character" squares randomly appear

Posted: Tue Jul 15, 2008 6:12 am
by Eelke
Someone should check the contents of their database. Preferably, also the HTTP traffic with an HTTP tracer such as YATT. It needs to be determined where the error is being introduced.
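
As a starting point for the database check, here is a small Python sketch that inspects the raw bytes of a stored post. How you get those bytes out of MySQL is left out, and the function names are hypothetical. It separates two cases: bytes that are not valid UTF-8 at all, and replacement characters (U+FFFD) that were already stored as perfectly valid UTF-8.

```python
def find_invalid_utf8(data: bytes):
    """Return (offset, hex) pairs for every byte run that is not valid UTF-8."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the data decodes cleanly
        except UnicodeDecodeError as err:
            start = pos + err.start
            end = pos + err.end
            problems.append((start, data[start:end].hex()))
            pos = end  # resume scanning after the bad bytes
    return problems


def count_replacement_chars(data: bytes) -> int:
    """Count U+FFFD characters that are stored literally (bytes EF BF BD)."""
    return data.count(b"\xef\xbf\xbd")


# Example: the word "caffè" followed by a stray lead byte and "ok".
raw = "caff\u00e8".encode("utf-8") + b"\xc3" + b"ok"
print(find_invalid_utf8(raw))
print(count_replacement_chars(raw))
```

If the stored bytes are invalid UTF-8, something truncated or shifted them on the way in. If the squares are literally stored as U+FFFD, some step already replaced the bad bytes before the post reached the table.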

Re: Unicode "unknown character" squares randomly appear

Posted: Tue Jul 15, 2008 11:51 pm
by mkruer
Ever since I updated the site from 3.0.1 to 3.0.2 I have been getting the exact same issue. It was working fine before. I even went through the hassle of replacing all the files from the full branch, and still no luck getting it to work.

“ = “
““ = ““�
““� = ““���
““��� = ““������
etc...

After the second ““, as soon as I hit preview, this process starts. Something is being added. Looking at the update, I am wondering if something in the confusable is wrong?

BTW, I tried it in FF3 and IE7: same thing.