Identify a character set?

CarolC1
Registered User
Posts: 565
Joined: Sat Dec 02, 2006 4:26 pm

Identify a character set?

Post by CarolC1 » Sat Mar 09, 2019 4:56 pm

Some old posts, all belonging to one user, have characters in place of punctuation, in the subject lines only, not the body of the post. Not all of the user's posts, just a few spread out over time.
characters.PNG
I found on Stack Overflow that the Æ is a single quote/apostrophe.
The ô ö combo looks like quotation marks because the same words with quotes are in the post.
I am guessing à is an ellipsis [...] from context. The û is a dash in the body of the post.
dash.PNG
There are only a few of these substitutions and I'd like to fix them, but not at the risk of putting a wrong character in due to guessing. It has to be right or I'll leave them alone.

Any idea where to find a list to decode these?

Thanks!

EDIT TO ADD: I checked an archive.org copy of the forum which shows the posts on the old software before conversion, and there is a diamond. First thought is an ftp issue, but why just one user?
diamond.PNG

EA117
Registered User
Posts: 530
Joined: Wed Aug 15, 2018 3:23 am

Re: Identify a character set?

Post by EA117 » Sat Mar 09, 2019 8:08 pm

I'm not sure we're so much trying to "identify a character set", and suspect that "the correct code point for representing the character the author intended" was probably already lost one or two events ago. (Either immediately when the post was originally entered into the old system, and/or when the system was later "converted" to the current board, whatever that conversion was.)

Unless you're continuing to see this issue on new posts being made too, I would suspect that "manually finding and fixing the characters which seem to have been intended" -- exactly as you're doing -- is likely as accurate a fix as you can make.

Given the extent to which phpBB (at least in 3.2 and later) has gone to correctly store even 32-bit Unicode character representations, I for one am not expecting that there would be a continuing issue like this -- unless it's a case of a particular web browser not delivering correctly-encoded data to phpBB in the first place.

One thing that comes to mind when seeing this -- and hearing you say it's "only seen with one specific user" -- is that the user might have been composing in something else like Microsoft Word or something else prone to using "fancy formatting", and then cut-n-pasting that work into their browser to post. When someone's quotes, apostrophes and em-dashes all come through as something other than the plain versions of those characters, I assume they constructed their thoughts in some other program first. (Maybe for the spelling or grammar help, etc.)

But there is nothing inherently problematic about using “ ” – ’ instead of " " - ', so long as the system stores them in a consistent encoding that can represent those characters. That seems to be what wasn't true, at least back when the original post was made, and at least for the way the post subject line was stored as opposed to the actual post body.

And now you're trying to fix up the results of something that was never correctly stored in the first place. To have "a table of what's expected" in such a case would mean that we know exactly what the original problem was for why the character didn't become stored correctly. And I don't think we actually know that.


Re: Identify a character set?

Post by CarolC1 » Sat Mar 09, 2019 9:18 pm

The Microsoft Word or other editor theory might be a good explanation, I hadn't even thought of that. It would help explain why it was sporadic, if sometimes she composed in Word or another editor and other times she didn't.

At first I was wondering if it had something to do with the keyboard setting on her computer. I vaguely remember that you can change the encoding (whatever you call it) on some keyboards, but I don't know if you could in 2002, and I don't know if it would cause this, and it seems like it would affect all the posts uniformly, so the editor theory is better.

I was thinking back to a really interesting topic here where the problem was ftp, and this is similar, but I don't see how ftp can explain this...but they are talking about using an editor, which sounds like what you're saying.

At this point the only character I'm not sure of is the à. I guess if I have to leave a couple of posts with that character unfixed and just fix the ones I am sure of, it will still be better than it was.

Thanks for the insight, at least it might be a good explanation. Too bad there isn't a one-to-one translation of the characters, where if X problem is occurring with the editor, then this character will display as that character every time. I was hoping there'd be a whole list! :lol:

Thanks!

mrgoldy
Jr. Extension Validator
Posts: 997
Joined: Tue Oct 06, 2009 7:34 pm
Location: The Netherlands
Name: Gijs

Re: Identify a character set?

Post by mrgoldy » Sat Mar 09, 2019 9:38 pm

My best guess for the à would be an ellipsis (…).

Code page 437, the à character is 80.
Looking at U+0080 is an ellipsis.

If you look at the code page 437 table, you can see that Æ is 92 (row 9_, column _2), which, read as a Windows-1252 byte, is ’ (apostrophe).

û = 96 = – (en dash)
ô = 93 = “ (open quote)
ö = 94 = ” (close quote)
etc..
All seem to match.
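
The lookups above can be reproduced mechanically. Below is a minimal Python sketch, assuming the stray letters are what you get when bytes written as Windows-1252 punctuation are displayed as code page 437. That assumption is not confirmed anywhere in this thread, and the names `SUSPECTS`, `build_table` and `repair` are made up for illustration. It relies only on Python's built-in `cp437` and `cp1252` codecs.

```python
# Characters seen in the damaged subject lines.
SUSPECTS = "àÆôöû"

def build_table(chars=SUSPECTS):
    """Map each stray character to (byte value, Windows-1252 reading)."""
    table = {}
    for ch in chars:
        raw = ch.encode("cp437")          # the single byte behind the glyph
        table[ch] = (raw[0], raw.decode("cp1252"))
    return table

def repair(text, table):
    """Replace only the characters listed in the table; leave the rest alone."""
    return "".join(table[ch][1] if ch in table else ch for ch in text)

if __name__ == "__main__":
    table = build_table()
    for ch, (byte, fixed) in table.items():
        print(f"{ch} = 0x{byte:02X} -> {fixed!r} (U+{ord(fixed):04X})")
```

On this reading the à comes out as byte 0x85, not 80. Replacing only the listed characters, rather than re-encoding whole subject lines, keeps the fix limited to the cases that were checked by hand against the post bodies.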


Re: Identify a character set?

Post by CarolC1 » Sat Mar 09, 2019 9:49 pm

Yes!!! Thankyouthankyouthankyou!

Problem solved, and I will look at the rest of that so I'll have a better idea in the future. I thought there might be something like that but didn't know where to find it, and it looks like there is some cross-referencing involved.

I was just experimenting with my text editor trying to change the encoding to see if I could make the characters happen myself.

Thank You So Very Much

Thanks both of you!

EDIT: Sorry, it's 85, right?


Re: Identify a character set?

Post by EA117 » Sun Mar 10, 2019 3:31 pm

CarolC1 wrote (Sat Mar 09, 2019 9:49 pm): "EDIT: Sorry, it's 85, right?"

That is correct.

MrGoldy's great correlations made me scratch my head for a bit though, knowing how unlikely it is for code page 437 to have actually been involved. (In Windows anyway. That's an "OEM" or think "MSDOS-compatible" code page.) But no arguing the apparent correlation.

It just raises the curiosity level about how the encoded character value got there in the first place, notwithstanding that the intended characters likely did get cut-n-pasted from elsewhere. But I don't think it changes anything about the manual one-time clean-up to eliminate it now, nor the presumption that it shouldn't keep happening, even from the same user, in current phpBB versions and current browsers.
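
One hedged way to connect the diamond in the archive.org copy with the later accented letters: if the subject line really held single Windows-1252 bytes (an assumption, not something established above), a page read as UTF-8 would show each of them as the replacement character U+FFFD, which browsers commonly draw as a diamond with a question mark, while a conversion that treated the same bytes as code page 437 would produce the letters. A small Python sketch, with a made-up subject line and a hypothetical helper named `readings`:

```python
def readings(raw: bytes) -> dict:
    """Decode the same bytes three ways to compare what a reader would see."""
    return {
        "cp1252": raw.decode("cp1252"),                  # what the author presumably typed
        "utf-8": raw.decode("utf-8", errors="replace"),  # invalid bytes become U+FFFD
        "cp437": raw.decode("cp437"),                    # accented-letter mojibake
    }

# Hypothetical subject line containing one Windows-1252 right single quote (0x92).
sample = b"It\x92s here"

if __name__ == "__main__":
    for name, text in readings(sample).items():
        print(f"{name:7} {text!r}")
```

This only shows that the three symptoms are consistent with one underlying byte value; it does not establish which step of the old software or the conversion actually did the misreading.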
