sprak

3.0 API release affecting insertion of accented/UTF8 characters?

I have been using the PHP SalesForce SOAP client to insert leads into SalesForce; recently, anytime a lead insert call is made and the data contains an accented character (e.g., ö), the call fails and returns an error message in this format:
java.io.UTFDataFormatException: Invalid byte X of Y-byte UTF-8 sequence.

This problem began on 15 April, shortly after the various posts regarding the 3.0 API release appeared on the forum. I checked in SalesForce and confirmed that numerous leads were generated via the same SOAP API call process prior to 13 April, when the announcement post about 3.0 issues/bugs was made.
All other calls made without accented characters process fine.

Is anyone else experiencing this problem?

Gritty details:
PHP 4.2.2
PEAR 1.3
PEAR SOAP 0.8RC3 (beta)
URL for connecting to SalesForce = https://www.salesforce.com/services/Soap/u/2.5

--
Luis A. Cruz
lcruz[at]astaro]dot[com
adamg
Can you post all the details of the exception/fault you are getting? Sounds like there is a problem, as I'm not sure why sforce is returning a Java exception reference.
sprak
The only message I receive back from SalesForce is this:
"java.io.UTFDataFormatException: Invalid byte 2 of 2-byte UTF-8 sequence."

It has been happening consistently since roughly the 15th; if any field (street, name, city, etc.) has an accented character in it, I get the above message returned. The only difference would be the byte portion, e.g., byte 2 of 3, byte 1 of 2, etc.
Rick Banister


sprak wrote:
The only message I receive back from SalesForce is this:
"java.io.UTFDataFormatException: Invalid byte 2 of 2-byte UTF-8 sequence."

It has been happening consistently since roughly the 15th; if any field (street, name, city, etc.) has an accented character in it, I get the above message returned. The only difference would be the byte portion, e.g., byte 2 of 3, byte 1 of 2, etc.

You need to observe a couple of important rules:
1. If you're reading from or writing to a database, the database must use the same character set as the data.
2. If you're writing to a file system, use OutputStreamWriter instead of FileWriter (and equivalent techniques, such as InputStreamReader, for reading multi-byte files).
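
As a rough illustration of rule 2, here is a minimal sketch of writing and reading a file with an explicitly named character set. The class and method names (ExplicitCharsetFileIo, writeText, readText) are made up for this example; only the java.io classes are real.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;

// Hypothetical helper class, for illustration only.
class ExplicitCharsetFileIo {
    // Write text using the named charset instead of the platform default.
    static void writeText(File file, String text, String charsetName) throws IOException {
        Writer out = new OutputStreamWriter(new FileOutputStream(file, false), charsetName);
        try {
            out.write(text);
        } finally {
            out.close();
        }
    }

    // Read the whole file back, decoding with the named charset.
    static String readText(File file, String charsetName) throws IOException {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), charsetName));
        try {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = in.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } finally {
            in.close();
        }
    }
}
```

The point is simply that the same charset name is passed on both the write and the read side, so nothing depends on the platform default.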

To quote from the API Doc:

"The salesforce.com server supports either full Unicode characters or ISO-8859-1 characters, depending on the instance. You can determine the encoding ahead of time using the describeGlobal call. The encoding specified by the describeGlobal call is the character set that is supported by that sforce instance.

"The response from the server will be in UTF-8 or ISO-8859-1 encoding, depending on the character set supported by the instance. This is usually handled for you by the SOAP client. All servers accept either encoding, but the ISO-8859-1 server cannot support characters outside of the ISO-8859-1 range. Data sent to that server outside of the valid ISO-8859-1 range may either be truncated or cause an error."
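
If you suspect you are talking to an ISO-8859-1 instance, you could check your data before sending it. The following is only a sketch: the class and method names are hypothetical, and it does not call describeGlobal itself. It just asks the JDK whether a string fits in the ISO-8859-1 range.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper class, for illustration only.
class Latin1Range {
    // True if every character of the text can be represented in ISO-8859-1.
    static boolean fitsLatin1(String text) {
        CharsetEncoder encoder = Charset.forName("ISO-8859-1").newEncoder();
        return encoder.canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println(fitsLatin1("M\u00fcller"));  // u-umlaut is in ISO-8859-1
        System.out.println(fitsLatin1("\u20ac100"));    // the euro sign is not
    }
}
```

Accented Western European letters such as ö pass this check; anything it rejects is the kind of data the API doc says may be truncated or cause an error on an ISO-8859-1 server.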

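UTF-8 and ISO-8859-1 bytes are not interchangeable: characters above the 7-bit ASCII range are encoded differently in each, and text outside the ISO-8859-1 range cannot be stored in ISO-8859-1 at all. To find out which character set an Oracle database uses, run the following query: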

select * from NLS_DATABASE_PARAMETERS
where parameter in ('NLS_CHARACTERSET','NLS_LANGUAGE','NLS_TERRITORY');

Also, if you write to the file system, wrap a FileOutputStream in an OutputStreamWriter instead of using FileWriter. The FileWriter constructor shown here always writes in the platform default encoding, so you cannot ask it for UTF-8 or ISO-8859-1.

// PLATFORM DEFAULT ENCODING; NOT GUARANTEED TO BE UTF-8 OR ISO-8859-1
FileWriter fw = new FileWriter(updateFilename, false);
fw.write(buf.toString() + "\n");
fw.close();

// EXPLICIT UTF-8
FileOutputStream os = new FileOutputStream(updateFilename, false);
OutputStreamWriter osw = new OutputStreamWriter(os, "UTF-8");
osw.write(buf.toString() + "\n");
osw.close();

Once again, the two encodings are NOT interchangeable. My current client is going to have to reload their entire reporting database because it was created as ISO-8859-1, while their Salesforce instance is UTF-8.
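
For what it's worth, the "Invalid byte X of Y-byte UTF-8 sequence" message is the kind of error a strict UTF-8 parser gives when it is handed ISO-8859-1 bytes. I can't say for certain that this is what is happening in the PHP client, so treat the following as a sketch under that assumption. The class and method names are hypothetical; the decoder calls are standard java.nio.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper class, for illustration only.
class StrictUtf8 {
    static final Charset UTF8 = Charset.forName("UTF-8");
    static final Charset LATIN1 = Charset.forName("ISO-8859-1");

    // True if the bytes are well-formed UTF-8; malformed input is reported, not replaced.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            UTF8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = "Schr\u00f6der";
        System.out.println(isValidUtf8(name.getBytes(UTF8)));   // true
        System.out.println(isValidUtf8(name.getBytes(LATIN1))); // false
    }
}
```

In ISO-8859-1 the ö is the single byte 0xF6, which is not a valid UTF-8 sequence on its own, so a payload declared as UTF-8 but actually encoded as ISO-8859-1 fails as soon as it contains an accented character, while plain ASCII data goes through.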