
getUpdated and very large numbers of changes

Is there a limit to the number of records that the getUpdated method can return?

If this limit is 2000, say, and a change occurs, such as changing a picklist value of a field, which affects more than this number of records in a very short time span, how should this be handled?

There doesn't appear to be a similar method to queryMore, so looping to retrieve all changes or setting a smaller batch size is not an option here.

My thanks for any advice.



I'm interested in the answer to this as well.

It's well documented that query will return a maximum of 2000 objects and queryMore is used to get batches of additional objects.

But the documentation is silent on whether getUpdated returns 2000 IDs maximum, or if it is possible to get an array of > 2000 objects (which you would then pass to retrieve in batches of 2000).  Has anyone done an experiment?
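If getUpdated does hand back more IDs than retrieve accepts in one call, the splitting step could look like the sketch below. The batch size of 2000 is taken from the post above and is not verified here; `batches` is a hypothetical helper name, not part of the API.

```python
from typing import Iterator, List


def batches(ids: List[str], size: int = 2000) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` IDs.

    `size` defaults to 2000 only because that is the figure quoted in
    the thread; treat it as an assumption and adjust as needed.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(ids), size):
        yield ids[start:start + size]
```

Each yielded list would then be passed to one retrieve call.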

If getUpdated is limited to returning 2000 IDs, what's the recommended method for getting all changes?  Brian's picklist-change example is a good one...I might also expect a lot of changes when importing a batch of new leads.

Does getDeleted behave in the same way as getUpdated?


I have received the following response:

The getUpdated call contains certain limitations.  I'm in the middle of running several tests on this but here are my thoughts on the matter so far...

In theory, the getUpdated call will return an unlimited number of updated IDs for each object type submitted. In practice, I have found that problems can appear at approximately 20,000 records.
You could insert or update more records than can be pulled through getUpdated, which would leave the two databases out of sync. In this case (whenever it fails), you would have to clear your local cache (database) and regenerate all your tables via the normal methods of query and queryMore so that both sides hold the same data. Your other option would be to not use getUpdated and just use query and queryMore with a where clause on LastModifiedDate covering the last hour, or whatever time interval you poll at. LastModifiedDate is an indexed field and therefore would give you respectable performance.
We are currently looking for ways to improve the getUpdated and getDeleted calls in future versions of our API.

Brian's response is accurate, but I would make one modification to the recommended field to use for the date. Rather than use LastModifiedDate, use SystemModstamp. LastModifiedDate is updated when a user makes a change directly to the record. Records can also be changed indirectly by workflow or another process in salesforce.com, and SystemModstamp will reflect both direct and indirect changes.
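For anyone wanting to try the query-based approach, here is a minimal sketch of building the time-window query string. The function names are made up for illustration, the choice of a half-open window (greater than start, up to and including end) is my assumption, and the string still has to be passed to query and queryMore by whatever client you use.

```python
from datetime import datetime, timedelta, timezone


def soql_datetime(dt: datetime) -> str:
    """Format an aware datetime as an unquoted UTC SOQL datetime literal."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def changed_between_query(object_name: str, start: datetime, end: datetime,
                          field: str = "SystemModstamp") -> str:
    """Build a query for IDs whose `field` falls in the window (start, end].

    Hypothetical helper: only the string is built, nothing is sent.
    """
    return (f"SELECT Id FROM {object_name} "
            f"WHERE {field} > {soql_datetime(start)} "
            f"AND {field} <= {soql_datetime(end)}")


end = datetime(2005, 6, 1, 12, 0, 0, tzinfo=timezone.utc)
start = end - timedelta(hours=1)
example = changed_between_query("Account", start, end)
```

Using the previous window's end as the next window's start keeps consecutive windows from overlapping or leaving a gap.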

Cheers and thanks Brian.

One problem I have discovered with using SystemModstamp as suggested is that this field does not exist on all objects. This causes a problem if you have a query like:

SELECT FROM WHERE SystemModstamp > AND SystemModstamp

If you are iterating through a set of sForce objects and running this query on each object, the query will cause an exception for the objects without SystemModstamp. Therefore, you would have to determine which objects don't have the SystemModstamp field and execute a different query for them. It is an unfortunate complexity.
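One way to contain that complexity is to pick the timestamp field per object before building the query. The sketch below assumes you already have each object's field names (for example from a describe call); the fallback order and the helper name are my own choices, not anything the API prescribes.

```python
from typing import Iterable, Optional, Sequence


def pick_timestamp_field(
    field_names: Iterable[str],
    preferred: Sequence[str] = ("SystemModstamp", "LastModifiedDate", "CreatedDate"),
) -> Optional[str]:
    """Return the first preferred timestamp field the object actually has.

    Returns None when none of them exist, so the caller can skip the
    object or fall back to a full reload.
    """
    available = set(field_names)
    for name in preferred:
        if name in available:
            return name
    return None
```

Objects for which this returns None would need separate handling, since no date filter can be applied to them.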

- Eli
Which is why getUpdated exists.
But Brian and DevAngel stated (two posts up in the history) that getUpdated has limitations which limit its applicability to large result sets. I guess the question becomes: Is that answer obsolete? Has getUpdated been modified to eliminate any such limitations?
Neither of them says what the problem is when you approach 20,000 records. I own the code for getUpdated and am not aware of any issues with large numbers of changes.
Unfortunately, in the forums here I found these claims that there was a problem but I saw nothing to refute that until your post. Thanks for setting the record straight.

- Eli

Hi SeattleEli,

Scanning back in my memory, I believe the problems referenced in the previous posts really reflect the difficulty of applying an appropriate model to handling very large change sets. Certainly a number as large as 20000 changes will take a fair amount of time to retrieve, making the actual timestamps used as start and end insignificant when processing the results. By the time you get to the last one, an hour may have elapsed. In fact, the change that caused the record to be included in getUpdated occurred between the start and end times, and the record you are processing may actually have been changed more than once by the time you finally retrieve it, making the retrieval time the actual "time of current state", to coin a term.

So where do you stand with respect to the next pass for getUpdated? As an example, suppose we take all the records that changed in the last hour (start - 1 hour). Suppose this is a large enough set that by the time we are done processing them 30 minutes have elapsed (start + 30 min). Also assume we are polling once an hour. This would make our next time span now + 1 hour. If the last record we processed was changed at start + 20, we will have already handled the original change and the start + 20 min change, since we ended up retrieving it 30 minutes after its initial change was reported. The goal is to not have to process that record twice (remember, the second time it shows up in getUpdated, all changes from start - 1 hour to start + 30 min have already been accounted for).

So, the problems are in how to efficiently manage these large change sets, not that the system has any problems reporting them.
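One possible way to avoid the double processing described above is to remember when each record was last retrieved and skip it on the next pass unless it changed after that moment. This is only a sketch: it assumes you fetch the record's SystemModstamp along with it, and `needs_processing` and the `last_retrieved` map are hypothetical names for local bookkeeping.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict


def needs_processing(record_id: str, modstamp: datetime,
                     last_retrieved: Dict[str, datetime]) -> bool:
    """True if the record changed after we last retrieved its current state.

    `last_retrieved` maps record ID to the time we actually pulled the
    record, which is its "time of current state" on our side.
    """
    seen = last_retrieved.get(record_id)
    return seen is None or modstamp > seen


start = datetime(2005, 6, 1, 12, 0, tzinfo=timezone.utc)
# The record was pulled 30 minutes into the pass, after its start + 20 change.
last_retrieved = {"001A": start + timedelta(minutes=30)}
skip_me = not needs_processing("001A", start + timedelta(minutes=20), last_retrieved)
```

In the example from the post, the record changed at start + 20 but was retrieved at start + 30, so the next pass can skip it.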


Now I'm better understanding what you meant in the original discussion about getUpdated! So considering the issue you described, if I want to avoid the time issues associated with getUpdated, I can use query and queryMore to do so, correct?

I am *assuming* that the result set created by query() (which I would iterate through with repeated queryMore calls when the result set is large) is a snapshot in time of the records that met the where clause criteria when query() was executed. Is that assumption correct?

To elaborate. I am presuming that the query approach works as follows:
Assume I call query() at time T and specify the where clause to retrieve records that changed between T-1 hour and T. If I call queryMore at T+10 minutes, and a record in that QueryResult changed at both T-30 minutes and T+5 minutes, the record I read from the QueryResult will be the record that changed at T-30 minutes, not the version of that record that changed at T+5 minutes because the QueryResult will draw its contents from the result set selected by the original query() call.

I haven't seen that documented, but I'm guessing that's the way it works. Is that correct?

Thanks for your time.
- Eli
No, that's not how it currently works (and this is subject to change). When you call query, it records all the Ids that meet your criteria, and when you call queryMore, you get the *current* state of the records with those Ids. So if a record changes between your call to query and your call to queryMore, you'll get the modified version of the record.
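To make the difference from a snapshot concrete, here is a toy in-memory model of the behaviour just described. It is not the real API: `ToyCursor`, `query`, and `query_more` are invented for illustration, and skipping records deleted in the meantime is purely an assumption of the toy.

```python
from typing import Callable, Dict, List


class ToyCursor:
    """Holds the IDs matched at query time and reads records lazily."""

    def __init__(self, store: Dict[str, dict], ids: List[str], batch_size: int):
        self.store = store
        self.ids = ids
        self.batch_size = batch_size
        self.pos = 0

    def query_more(self) -> List[dict]:
        """Return the next batch, reading the *current* state of each ID."""
        batch = self.ids[self.pos:self.pos + self.batch_size]
        self.pos += self.batch_size
        # Assumption of this toy only: IDs deleted since the query are skipped.
        return [dict(self.store[i]) for i in batch if i in self.store]


def query(store: Dict[str, dict], predicate: Callable[[dict], bool],
          batch_size: int) -> ToyCursor:
    """Record the matching IDs now; record contents are fetched later."""
    ids = [i for i, rec in store.items() if predicate(rec)]
    return ToyCursor(store, ids, batch_size)


store = {"a": {"Name": "one"}, "b": {"Name": "two"}}
cursor = query(store, lambda rec: True, batch_size=1)
first = cursor.query_more()
store["b"]["Name"] = "changed"   # modified after query, before queryMore
second = cursor.query_more()     # sees the modified record, not a snapshot
```

Here `second` contains the modified record, which is the point being made: the ID list is fixed at query time, but the field values are not.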
I'm glad I asked. As long as I know what's going on, I can design accordingly. You should change the documentation for query: http://www.sforce.com/us/docs/sforce60/wwhelp/wwhimpl/common/html/wwhelp.htm?context=sforceAPI_WWHelp&file=sforce_API_calls_query.html#wp1580700

It states "Upon invocation, the Sforce Web service executes the query against the specified object, caches the results of the query on the Sforce Web service, and returns a query response object to the client application." I read "caches the results" to mean that the entire result set is cached - implying a snapshot in time. Knowing the truth, I can see that it doesn't *exactly* say that, but it sure is easy to read it that way.

Thanks again,