Ticket #199 (assigned defect)

Opened 7 months ago

Last modified 7 months ago

Tablewriter escaping breaks UTF-8 encoded strings

Reported by: cuboci@…
Owned by: jtv
Priority: normal
Milestone:
Component: other
Version:
Severity: normal
Keywords:
Cc:

Description

When I use a tablewriter to fill a table, the values end up double-encoded in that table. Using the same string variables in an INSERT statement works fine. At first glance, the culprit seems to be the function

string pqxx::internal::Escape(const string &s, const string &null)

which does something like this for every char, so each byte of a multi-byte UTF-8 sequence is treated as a separate character:

else if (unprintable(c))
{
  R += "\\\\";
  unsigned char u=c;
  for (int n=2; n>=0; --n) R += tooctdigit(u,n);
}

Since the bytes that make up an umlaut in UTF-8 fall into the unprintable() category (c < ' ' || c > '~'), they get escaped. I don't know if PostgreSQL chokes on that, but it seems likely. Usually, a COPY statement copes with unescaped UTF-8 characters just fine.
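The effect can be reproduced outside libpqxx. The following is a minimal sketch, not the library's code: unprintable() and tooctdigit() are reimplemented here from the description above (their real definitions may differ), and escape_bytewise() is a hypothetical stand-in for the per-byte loop in the excerpt.

```cpp
#include <string>

// Assumed reimplementation of the check described in the ticket:
// anything outside the printable ASCII range counts as unprintable.
bool unprintable(char c)
{
  return c < ' ' || c > '~';
}

// Assumed helper: the n-th octal digit of u (n = 0 is least significant).
char tooctdigit(unsigned char u, int n)
{
  return static_cast<char>('0' + ((u >> (3 * n)) & 0x07));
}

// Hypothetical stand-in for the per-byte escaping loop quoted in the ticket.
std::string escape_bytewise(const std::string &s)
{
  std::string R;
  for (char c : s)
  {
    if (unprintable(c))
    {
      // Two literal backslashes followed by three octal digits.
      R += "\\\\";
      unsigned char u = static_cast<unsigned char>(c);
      for (int n = 2; n >= 0; --n) R += tooctdigit(u, n);
    }
    else
    {
      R += c;
    }
  }
  return R;
}
```

Under these assumptions, the two bytes 0xC3 0xBC that encode ü in UTF-8 are each escaped on their own, giving the octal digit groups 303 and 274 instead of one intact character.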

The client encoding is set to UNICODE (via set_client_encoding()), otherwise it wouldn't work with normal INSERTs.

Perhaps I'm doing something I shouldn't, but right now I'm at a loss as to what that could be.

I've now tested this in psql with a COPY statement. Using UTF-8 characters directly works; encoding them as octal escapes such as \303\274 (a German ü, for example) doesn't. So this might even be a bug in PostgreSQL itself. I don't know what the standard behaviour is supposed to be, though.
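Given that COPY accepted the raw UTF-8 bytes in the psql test, one possible direction is to escape only the characters that are special in COPY text format and pass all other bytes through untouched. This is a rough sketch under stated assumptions, not a fix adopted by libpqxx: escape_copy_field() is a hypothetical name, client_encoding is assumed to be UTF-8, and the null-string handling of the real Escape() is left out.

```cpp
#include <string>

// Hypothetical escaper for a single field in COPY text format.
// Assumes the client encoding is UTF-8, so bytes >= 0x80 are passed through
// unchanged and multi-byte sequences stay intact.
std::string escape_copy_field(const std::string &s)
{
  std::string out;
  out.reserve(s.size());
  for (char c : s)
  {
    switch (c)
    {
    case '\\': out += "\\\\"; break;  // backslash becomes two backslashes
    case '\t': out += "\\t"; break;   // tab is the default column delimiter
    case '\n': out += "\\n"; break;   // newline terminates a row
    case '\r': out += "\\r"; break;   // carriage return
    default:   out += c; break;       // everything else, including UTF-8 bytes
    }
  }
  return out;
}
```

With this approach the ü from the example above reaches the server as the same two bytes the application supplied. Whether that is appropriate for client encodings other than UTF-8 is not addressed here.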

Change History

Changed 7 months ago by jtv

  • status changed from new to assigned

This is a specific instance of the problems vaguely anticipated by #196.

It really shouldn't be up to libpqxx to do this escaping itself. We'll need to take a more careful look to see if the usual libpq escaping functions are appropriate here after all.

Changed 7 months ago by cuboci@…

I worked around the problem in my application by converting the data to server_encoding and setting client_encoding to the same value for the duration of the copy operation. Not an elegant solution, but it works for now.
