Ticket #199 (assigned defect)
Tablewriter escaping breaks UTF-8 encoded strings
| Reported by: | cuboci@… | Owned by: | jtv |
|---|---|---|---|
| Priority: | normal | Milestone: | |
| Component: | other | Version: | |
| Severity: | normal | Keywords: | |
| Cc: | | | |
Description
When using a tablewriter to fill some table I get double-encoded values in that table. Using the same string variables in an INSERT statement, everything works fine. The culprit seems to be (at first glance) the method
string pqxx::internal::Escape(const string &s, const string &null)
which does something like this for every char in the string, so each byte of a multi-byte UTF-8 sequence is treated as a separate character:
    else if (unprintable(c))
    {
      R += "\\\\";
      unsigned char u=c;
      for (int n=2; n>=0; --n) R += tooctdigit(u,n);
    }
Since the bytes of a UTF-8 encoded umlaut fall in the unprintable() category (< ' ' || > '~'), they get escaped one by one. I don't know if PostgreSQL chokes on that, but it seems likely. Usually, a COPY statement copes with unescaped UTF-8 characters just fine.
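To illustrate the effect, here is a minimal, self-contained sketch of byte-wise octal escaping. The helpers `unprintable` and `tooctdigit` are reconstructed from the snippet and the `(< ' ' || > '~')` description; they are assumptions, not the actual libpqxx implementations, and `escape_bytes` is a hypothetical name. The sketch emits a single backslash per escaped byte so that the result is easy to read.

```cpp
#include <cassert>
#include <string>

// Assumed reconstruction: anything outside printable ASCII is "unprintable".
// This holds for UTF-8 lead and continuation bytes whether char is signed
// (they compare below ' ') or unsigned (they compare above '~').
bool unprintable(char c)
{
  return c < ' ' || c > '~';
}

// Assumed reconstruction: octal digit n of u, where n == 0 is the least
// significant digit and n == 2 the most significant.
char tooctdigit(unsigned char u, int n)
{
  assert(n >= 0 && n <= 2);
  return static_cast<char>('0' + ((u >> (3 * n)) & 0x07));
}

// Hypothetical helper: escape the string one byte at a time, the way the
// quoted loop does.
std::string escape_bytes(const std::string &s)
{
  std::string R;
  for (char c : s)
  {
    if (unprintable(c))
    {
      R += '\\';
      const unsigned char u = static_cast<unsigned char>(c);
      for (int n = 2; n >= 0; --n) R += tooctdigit(u, n);
    }
    else
    {
      R += c;
    }
  }
  return R;
}
```

For example, `escape_bytes("f\xc3\xbc" "r")` (the UTF-8 bytes of "für") yields `f\303\274r`: the two bytes of the single character ü are escaped independently.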
The client encoding is set to UNICODE (via set_client_encoding()), otherwise it wouldn't work with normal INSERTs.
Perhaps I'm doing something I shouldn't, but right now I'm at a loss as to what that could be.
I've now tested this in psql with a COPY statement. Using UTF-8 characters directly works, encoding them as \303\274 (that's a German ü, for example) doesn't. So this might even be a bug in PostgreSQL itself. I don't know what the standard behaviour is supposed to be, though.
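Given that observation, one conceivable workaround is to pass bytes with the high bit set through unchanged and escape only the characters that COPY's text format treats specially. The following is a sketch under that assumption, not the libpqxx fix: `escape_keep_high_bytes` is a hypothetical name, the code assumes the client encoding matches the encoding of the input string, and it does not deal with null values or custom delimiters.

```cpp
#include <string>

// Hypothetical alternative escaper: leaves bytes >= 0x80 (such as UTF-8
// multi-byte sequences) untouched and escapes only backslash, tab, newline,
// carriage return, and the remaining ASCII control characters.
std::string escape_keep_high_bytes(const std::string &s)
{
  std::string R;
  for (char c : s)
  {
    const unsigned char u = static_cast<unsigned char>(c);
    switch (c)
    {
    case '\\': R += "\\\\"; break;
    case '\t': R += "\\t"; break;
    case '\n': R += "\\n"; break;
    case '\r': R += "\\r"; break;
    default:
      if (u < 0x20 || u == 0x7f)
      {
        // Octal escape for the remaining ASCII control characters.
        R += '\\';
        R += static_cast<char>('0' + ((u >> 6) & 0x07));
        R += static_cast<char>('0' + ((u >> 3) & 0x07));
        R += static_cast<char>('0' + (u & 0x07));
      }
      else
      {
        // Printable ASCII and all high-bit bytes pass through as-is.
        R += c;
      }
    }
  }
  return R;
}
```

With this variant, the UTF-8 bytes of ü (`\xc3\xbc`) reach the server unmodified, which matches the case that worked in psql.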
