+ Start a Discussion
MaxPMageeMaxPMagee 

Streaming API outage last night?

We have a cron job that runs every half hour to verify the status of our Streaming API client and to make sure it's connected to SalesForce.  Last night between 2:30 AM and 4:30 AM (central time), I received several notifications that the Streaming Client was not happy.

 

I'm investigating this morning and I see a lot of errors involving empty responses and failed connections to the Streaming API via the Bayeux Client.  Error events occurred during an oauth attempt's Soap call around 2:18 AM.

 

After several errors and restarts, and after the final restart at 4:18 AM, the StreamingAPI, our listener, the Bayeux Client, and every other piece has been happy and 100% functional.  It's been running perfectly well for about 5.5 hours now so there was something transient that took place that I'd just like to understand better.

 

The oauth that started this whole error snowball was a part of a Rest API call, but when that failed, we shut down the entire process.  On restart, the Streaming API connection failed. (Up until that point, there had been no errors and everything had been up and working 100% for around 11 hours according to our logs.)

2012-04-03 02:17:51,743 - ERROR - Stopping because of error- Traceback (most recent call last):
  File "/usr/local/svn/SalesForceLicensing/RestingDaemon/src/resting_daemon.py", line 560, in <module>
    oauth=authenticate(verbose=False)
  File "/usr/local/svn/SalesForceLicensing/RestingDaemon/src/resting_daemon.py", line 50, in authenticate
    data = urllib2.urlopen(token_url, params).read()
  File "/opt/python-2.7.2/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/opt/python-2.7.2/lib/python2.7/urllib2.py", line 394, in open
    response = self._open(req, data)
  File "/opt/python-2.7.2/lib/python2.7/urllib2.py", line 412, in _open
    '_open', req)
  File "/opt/python-2.7.2/lib/python2.7/urllib2.py", line 372, in _call_chain
    result = func(*args)
  File "/opt/python-2.7.2/lib/python2.7/urllib2.py", line 1207, in https_open
    return self.do_open(httplib.HTTPSConnection, req)
  File "/opt/python-2.7.2/lib/python2.7/urllib2.py", line 1174, in do_open
    raise URLError(err)
URLError: <urlopen error [Errno -3] Temporary failure in name resolution>
2012-04-03 02:17:51,744 - INFO - License Signer terminating subprocesses
2012-04-03 02:17:52,760 - DEBUG - Verifying subprocess pid 6104 is dead
2012-04-03 02:17:52,760 - INFO - Shutdown complete.

 

That killed it good, so it wasn't until 2:30 AM when the next Cron job came along to restart the process that we saw:

2012-04-03 02:30:01,656 - INFO - SalesForce License Signer Started
2012-04-03 02:30:01,657 - INFO - Version Info: [$Id: resting_daemon.py 35 2012-04-02 19:53:19Z mmagee $]
2012-04-03 02:30:01,671 - DEBUG - Starting Streaming API listener
2012-04-03 02:30:56,043 - DEBUG - Received stream data: 2012-04-03 02:30:55.713:WARN::CONNECTION FAILED ContentExchange@6182315=POST//test.salesforce.com:443//services/Soap/u/22.0/#9
2012-04-03 02:30:56,044 - ERROR - Found CONNECTION FAILED message: 2012-04-03 02:30:55.713:WARN::CONNECTION FAILED ContentExchange@6182315=POST//test.salesforce.com:443//services/Soap/u/22.0/#9
To be continued on the next post...

MaxPMageeMaxPMagee

Again, it was failed down until the next cron job at 3:00 AM, which restarted and everything was fine until 3:51 AM:

 

2012-04-03 03:00:08,609 - DEBUG - Received stream data: Streaming API Connected
2012-04-03 03:00:09,115 - DEBUG - Received stream data: Endpoint: https://cs14-api.salesforce.com
2012-04-03 03:00:09,620 - DEBUG - Received stream data: Sessionid=00Dc00000000ST2!ARMAQBR5u_mxDejLUoib9bwy8Xsu1AVlBRFKQptSWIIX8YRalhrvMOHd304xoMOSXDSlh3j4oAtts.6OWlKmibwDIkDH.Jv1
2012-04-03 03:00:10,123 - DEBUG - Received stream data: Waiting for handshake
2012-04-03 03:00:10,629 - DEBUG - Received stream data: [CHANNEL:META_HANDSHAKE]: {"id":"1","minimumVersion":"1.0","supportedConnectionTypes":["long-polling"],"successful":true,"channel":"/meta/handshake","clientId":"c1tzxie9ligay4zzzwg182quic","version":"1.0"}
2012-04-03 03:00:13,678 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"2","successful":true,"advice":{"interval":0,"reconnect":"retry","timeout":110000},"channel":"/meta/connect"}
2012-04-03 03:00:14,182 - DEBUG - Received stream data: Subscribing for channel: /topic/UnsignedXML
2012-04-03 03:00:14,686 - DEBUG - Received stream data: Ready to receive streaming data- waiting...
2012-04-03 03:00:15,193 - DEBUG - Received stream data: [CHANNEL:META_SUBSCRIBE]: {"id":"4","subscription":"/topic/UnsignedXML","successful":true,"channel":"/meta/subscribe"}
2012-04-03 03:00:18,743 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"3","successful":true,"channel":"/meta/connect"}
2012-04-03 03:02:08,905 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"5","successful":true,"channel":"/meta/connect"}
2012-04-03 03:03:59,019 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"6","successful":true,"channel":"/meta/connect"}
2012-04-03 03:05:49,479 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"7","successful":true,"channel":"/meta/connect"}
2012-04-03 03:07:39,960 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"8","successful":true,"channel":"/meta/connect"}
2012-04-03 03:09:29,779 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"9","successful":true,"channel":"/meta/connect"}
2012-04-03 03:11:19,760 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"10","successful":true,"channel":"/meta/connect"}
2012-04-03 03:13:10,130 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"11","successful":true,"channel":"/meta/connect"}
2012-04-03 03:14:59,939 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"12","successful":true,"channel":"/meta/connect"}
2012-04-03 03:16:49,952 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"13","successful":true,"channel":"/meta/connect"}
2012-04-03 03:18:40,764 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"14","successful":true,"channel":"/meta/connect"}
2012-04-03 03:20:30,587 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"15","successful":true,"channel":"/meta/connect"}
2012-04-03 03:22:20,873 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"16","successful":true,"channel":"/meta/connect"}
2012-04-03 03:24:10,709 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"17","successful":true,"channel":"/meta/connect"}
2012-04-03 03:26:00,814 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"18","successful":true,"channel":"/meta/connect"}
2012-04-03 03:27:50,966 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"19","successful":true,"channel":"/meta/connect"}
2012-04-03 03:29:41,040 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"20","successful":true,"channel":"/meta/connect"}
2012-04-03 03:30:31,147 - DEBUG - Performing cleanup query to verify no licenses are waiting/were missed by StreamingAPI
2012-04-03 03:30:31,656 - DEBUG - Check to make sure no notifications are coming in from StreamingAPI at this instant
2012-04-03 03:30:32,937 - DEBUG - No stragglers were found.  Returning to normal wait cycle.
2012-04-03 03:31:31,224 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"21","successful":true,"channel":"/meta/connect"}
2012-04-03 03:33:21,277 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"22","successful":true,"channel":"/meta/connect"}
2012-04-03 03:35:11,390 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"23","successful":true,"channel":"/meta/connect"}
2012-04-03 03:37:00,910 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"24","successful":true,"channel":"/meta/connect"}
2012-04-03 03:38:51,259 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"25","successful":true,"channel":"/meta/connect"}
2012-04-03 03:40:41,102 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"26","successful":true,"channel":"/meta/connect"}
2012-04-03 03:42:31,376 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"27","successful":true,"channel":"/meta/connect"}
2012-04-03 03:44:21,186 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"28","successful":true,"channel":"/meta/connect"}
2012-04-03 03:46:11,572 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"29","successful":true,"channel":"/meta/connect"}
2012-04-03 03:48:01,487 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"30","successful":true,"channel":"/meta/connect"}
2012-04-03 03:49:51,710 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"31","successful":true,"channel":"/meta/connect"}
2012-04-03 03:51:41,529 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"32","successful":true,"channel":"/meta/connect"}
2012-04-03 03:51:42,036 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"33","error":"402::Unknown client","successful":false,"advice":{"interval":500,"reconnect":"handshake"},"channel":"/meta/connect"}
2012-04-03 03:51:42,545 - DEBUG - Received stream data: Error during CONNECT: 402::Unknown client
2012-04-03 03:51:43,053 - DEBUG - Received stream data: Exiting...
2012-04-03 03:51:53,227 - WARNING - Status poll of StreamingAPI indicates status '1'- not alive.

 

This particular error didn't kill our process, it just broke the stream, which automatically restarted then worked until 4:18 AM at which point there was a Bayeux Client error:

2012-04-03 03:51:53,228 - INFO - Restarting StreamingAPI Listener
2012-04-03 03:51:53,230 - DEBUG - Starting Streaming API listener
2012-04-03 03:51:55,277 - DEBUG - Received stream data: Streaming API Connected
2012-04-03 03:51:55,785 - DEBUG - Received stream data: Endpoint: https://cs14-api.salesforce.com
2012-04-03 03:51:56,293 - DEBUG - Received stream data: Sessionid=00Dc00000000ST2!ARMAQBR5u_mxDejLUoib9bwy8Xsu1AVlBRFKQptSWIIX8YRalhrvMOHd304xoMOSXDSlh3j4oAtts.6OWlKmibwDIkDH.Jv1
2012-04-03 03:51:56,799 - DEBUG - Received stream data: Waiting for handshake
2012-04-03 03:51:57,309 - DEBUG - Received stream data: [CHANNEL:META_HANDSHAKE]: {"id":"1","minimumVersion":"1.0","supportedConnectionTypes":["long-polling"],"successful":true,"channel":"/meta/handshake","clientId":"71qubesdz12bvrxbekeditspea","version":"1.0"}
2012-04-03 03:52:00,366 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"2","successful":true,"advice":{"interval":0,"reconnect":"retry","timeout":110000},"channel":"/meta/connect"}
2012-04-03 03:52:00,873 - DEBUG - Received stream data: Subscribing for channel: /topic/UnsignedXML
2012-04-03 03:52:01,380 - DEBUG - Received stream data: Ready to receive streaming data- waiting...
2012-04-03 03:52:01,889 - DEBUG - Received stream data: [CHANNEL:META_SUBSCRIBE]: {"id":"3","subscription":"/topic/UnsignedXML","successful":true,"channel":"/meta/subscribe"}
2012-04-03 03:53:50,692 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"4","successful":true,"channel":"/meta/connect"}
2012-04-03 03:55:40,952 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"5","successful":true,"channel":"/meta/connect"}
2012-04-03 03:57:30,695 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"6","successful":true,"channel":"/meta/connect"}
2012-04-03 03:59:20,742 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"7","successful":true,"channel":"/meta/connect"}
2012-04-03 04:01:01,650 - DEBUG - Performing cleanup query to verify no licenses are waiting/were missed by StreamingAPI
2012-04-03 04:01:02,162 - DEBUG - Check to make sure no notifications are coming in from StreamingAPI at this instant
2012-04-03 04:01:03,471 - DEBUG - No stragglers were found.  Returning to normal wait cycle.
2012-04-03 04:01:12,100 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"8","successful":true,"channel":"/meta/connect"}
2012-04-03 04:03:01,048 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"9","successful":true,"channel":"/meta/connect"}
2012-04-03 04:04:51,090 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"10","successful":true,"channel":"/meta/connect"}
2012-04-03 04:06:41,236 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"11","successful":true,"channel":"/meta/connect"}
2012-04-03 04:08:31,104 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"12","successful":true,"channel":"/meta/connect"}
2012-04-03 04:10:20,945 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"13","successful":true,"channel":"/meta/connect"}
2012-04-03 04:12:11,329 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"14","successful":true,"channel":"/meta/connect"}
2012-04-03 04:14:01,141 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"15","successful":true,"channel":"/meta/connect"}
2012-04-03 04:15:51,434 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"16","successful":true,"channel":"/meta/connect"}
2012-04-03 04:17:41,313 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"17","successful":true,"channel":"/meta/connect"}
2012-04-03 04:18:16,423 - DEBUG - Received stream data: 2012-04-03 04:18:16.022:INFO:org.cometd.client.BayeuxClient@11416420:Messages failed [{"id":"18","connectionType":"long-polling","channel":"/meta/connect","clientId":"71qubesdz12bvrxbekeditspea"}]

The stack trace continued while we shut down the client and attempted to restart:

 

2012-04-03 04:18:16,931 - DEBUG - Received stream data: java.net.ProtocolException: Empty response: TransportExchange@19673895=POST//cs14-api.salesforce.com:443/cometd/23.0/connect#7
2012-04-03 04:18:17,444 - DEBUG - Received stream data: at org.cometd.client.BayeuxClient$PublishTransportListener.onProtocolError(BayeuxClient.java:1106)
2012-04-03 04:18:17,953 - DEBUG - Received stream data: at org.cometd.client.transport.LongPollingTransport$TransportExchange.onResponseComplete(LongPollingTransport.java:300)
2012-04-03 04:18:18,462 - DEBUG - Received stream data: at org.eclipse.jetty.client.HttpExchange$Listener.onResponseComplete(HttpExchange.java:968)
2012-04-03 04:18:18,971 - DEBUG - Received stream data: at org.eclipse.jetty.client.HttpExchange.setStatus(HttpExchange.java:279)
2012-04-03 04:18:19,482 - DEBUG - Received stream data: at org.eclipse.jetty.client.HttpConnection$Handler.messageComplete(HttpConnection.java:619)
2012-04-03 04:18:19,992 - DEBUG - Received stream data: at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:862)
2012-04-03 04:18:19,993 - WARNING - Status poll of StreamingAPI indicates status '1'- not alive.
2012-04-03 04:18:19,995 - INFO - Restarting StreamingAPI Listener
2012-04-03 04:18:19,995 - DEBUG - Starting Streaming API listener
2012-04-03 04:18:20,516 - DEBUG - Received stream data: at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:220)
2012-04-03 04:18:21,020 - DEBUG - Received stream data: at org.eclipse.jetty.client.HttpConnection.handle(HttpConnection.java:275)
2012-04-03 04:18:21,525 - DEBUG - Received stream data: at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:535)
2012-04-03 04:18:22,031 - DEBUG - Received stream data: at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:40)
2012-04-03 04:18:22,553 - DEBUG - Received stream data: at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:529)
2012-04-03 04:18:23,065 - DEBUG - Received stream data: at java.lang.Thread.run(Thread.java:636)
2012-04-03 04:18:23,575 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"message":[{"id":"18","connectionType":"long-polling","channel":"/meta/connect","clientId":"71qubesdz12bvrxbekeditspea"}],"id":"18","exception":"java.net.ProtocolException: Empty response: TransportExchange@19673895=POST//cs14-api.salesforce.com:443/cometd/23.0/connect#7","successful":false,"channel":"/meta/connect"}
2012-04-03 04:18:24,093 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"19","error":"402::Unknown client","successful":false,"advice":{"interval":500,"reconnect":"handshake"},"channel":"/meta/connect"}
2012-04-03 04:18:24,600 - DEBUG - Received stream data: Error during CONNECT: 402::Unknown client
2012-04-03 04:18:25,108 - DEBUG - Received stream data: Exiting...

After all of that, and after the restart at 4:18 AM, the StreamingAPI, our listener, the Bayeux Client, and every other piece has been happy and 100% functional. (the rest of this is just happy connection messages)

2012-04-03 04:18:25,619 - DEBUG - Received stream data: Streaming API Connected
2012-04-03 04:18:26,131 - DEBUG - Received stream data: Endpoint: https://cs14-api.salesforce.com
2012-04-03 04:18:26,641 - DEBUG - Received stream data: Sessionid=00Dc00000000ST2!ARMAQBR5u_mxDejLUoib9bwy8Xsu1AVlBRFKQptSWIIX8YRalhrvMOHd304xoMOSXDSlh3j4oAtts.6OWlKmibwDIkDH.Jv1
2012-04-03 04:18:27,150 - DEBUG - Received stream data: Waiting for handshake
2012-04-03 04:18:27,659 - DEBUG - Received stream data: [CHANNEL:META_HANDSHAKE]: {"id":"1","minimumVersion":"1.0","supportedConnectionTypes":["long-polling"],"successful":true,"channel":"/meta/handshake","clientId":"1cwzjuxncgueu1nufo6r299xkl","version":"1.0"}
2012-04-03 04:18:28,173 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"2","successful":true,"advice":{"interval":0,"reconnect":"retry","timeout":110000},"channel":"/meta/connect"}
2012-04-03 04:18:28,681 - DEBUG - Received stream data: Subscribing for channel: /topic/UnsignedXML
2012-04-03 04:18:29,191 - DEBUG - Received stream data: Ready to receive streaming data- waiting...
2012-04-03 04:18:29,701 - DEBUG - Received stream data: [CHANNEL:META_SUBSCRIBE]: {"id":"4","subscription":"/topic/UnsignedXML","successful":true,"channel":"/meta/subscribe"}
2012-04-03 04:18:32,751 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"3","successful":true,"channel":"/meta/connect"}
2012-04-03 04:20:22,641 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"5","successful":true,"channel":"/meta/connect"}
2012-04-03 04:21:53,669 - DEBUG - Removing 1333443113.226292 from restart_times (more than a half hour old)
2012-04-03 04:22:12,976 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"6","successful":true,"channel":"/meta/connect"}
2012-04-03 04:24:02,812 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"7","successful":true,"channel":"/meta/connect"}
2012-04-03 04:25:53,124 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"8","successful":true,"channel":"/meta/connect"}
2012-04-03 04:27:43,381 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"9","successful":true,"channel":"/meta/connect"}
2012-04-03 04:29:33,149 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"10","successful":true,"channel":"/meta/connect"}
2012-04-03 04:31:23,126 - DEBUG - Received stream data: [CHANNEL:META_CONNECT]: {"id":"11","successful":true,"channel":"/meta/connect"}

...

MaxPMageeMaxPMagee

New information that may shed some light on the issue-  at some point last night, our DNS server was undergoing maintenance.  This probably explains the initial error, and possibly most of the errors, but what about the last couple of issues?

 

At 3:51 AM, after successfully connecting and listening to the Streaming API channel for almost an hour, (but not 120 minutes), we received a 402 Unknown Client error; followed closely by a Bayeux notification of

 

java.net.ProtocolException: Empty response: TransportExchange@19673895=POST//cs14-api.salesforce.com:443/cometd/23.0/connect#7

 

That still strikes me as strange.

VinodMVinodM

Current implementation of Streaming API depends on stickiness with specific appservers where the cometd session is created (after handshake). BTW, there is massive rearchitecture work in progress to get rid of dependency on stickiness. But that will take a few months before it is available.

 

Anyway, with the current implementation stickiness is important. So if for some reasons the loadbalancer cannot stick to the requested appserver, the clients will receive "402-unknown clientid" error. Ideally this should happen rarely. But we have identified some bugs in the system which were causing the stickiness to fail once in a while. We have identified thos issues in internal system testing and those issues are being worked out and will be deployed in production soon. So you should see better streaming experience.

 

We appreciate the diligence and efforts to monitor health of streaming API.

 

Thanks!

Vinod.