Salesforce Selective Queries

A basic understanding of the selective query concept is fundamental to scalability on the Salesforce platform. Non-selective queries have a significant performance impact on List Views, Reports and SOQL and can often result in Apex Trigger runtime exceptions as below – as the data volume increases over time.

System.QueryException Cause: null Message: Non-selective query against large object type (more than 200000 rows)

SOQL queries executed from Apex Triggers have to be selective where the record count in the queried object is 200,000 or more. The determination of a selective query is a function of volume plus field-level selectivity. It is therefore the case that the selectivity state of a given query is volatile meaning in practical terms that the initiating Apex Trigger may work one day and not the next.

Selective Query Definition

Selectivity is determined by the index state of the query filter conditions and the number of records the filter returns (selectivity) versus the object total records. The thresholds below show the difference between the selectivity calculation for a standard index versus a custom index.

Selectivity Thresholds:
Standard Index – 30% (first 1M) then 15%
Custom Index – 10% (first 1M) then 5%

Unary filter:

e.g. select Name from Account where IndexedField__c=’ABC’

With a custom index on IndexedField__c the filter must return <10% of the total records in the object to be considered selective – up to the first 1 million records from that point the threshold drops to 5%.

Multiple filters AND (exclusively):

e.g. select Name from Account where IndexedField__c=’ABC’ and SecondIndexedField__c=’123′

The Query Optimiser will set the leading operation on the basis of lowest cost. If no filters are selective a table scan is performed.

If all filters have an index then a Composite Index Join optimisation can be applied.

In this case each filter must be less than 2x (two-times) the selectivity threshold.
All filters combined must be less than selectivity threshold.

If all filter fields are standard then use the standard index selectivity threshold – otherwise use custom index selectivity threshold.

Multiple filters OR (at least one):

e.g. select Name from Account where IndexedField__c=’ABC’ or SecondIndexedField__c=’123′

Selective AND filter indexes could be set as the Leading Operation – if none exist, then a table scan occurs unless all filters have an index then a Composite Index Union optimisation becomes possible.

In this case each filter must be less than selectivity threshold.
All filters combined must be less than selectivity threshold.

If all fields are standard then use the standard index selectivity threshold – otherwise use custom index selectivity threshold.

Parent Field Filter:

e.g. select Name from Contact where IndexedField__c=’ABC’ and Account.IndexedField__c=’ABC’

Where parent object fields are referenced in a filter, each filter index is individually and the lowest cost option selected as the leading operation.

Note, the parent field is not indexed on the queried object, so Account.Id can incur a table scan on Opportunity whereas AccountId may allow the standard index to become the leading operation.

The notes above provide a basic outline of the concepts but should be sufficient to convey the key concepts.

 

Implementation Approach

As data volumes grow query behaviour can change dramatically, to mitigate this database queries originating in Apex code, list view and report definitions must consider the future peak data volume and field-level data characteristics (primarily selectivity). This considered approach can help identify an appropriate indexing strategy that maintains query performance by ensuring query selectivity. So, forward planning is absolutely key; queries should be designed to be selective up to the projected peak data volumes. Thinking ahead to this extent is very seldom applied in my experience, particularly where the Salesforce implementation evolves project-by-project and the longer term picture is not a priority.

In order to evaluate the selectivity of a given query – the following 2 approaches can be applied.

REST API Query Resource Feedback Parameter

The Force.com REST API exposes a Query resource that accepts an explain parameter which can set with a SOQL query, List View Id or Report Id. The results show the options considered by the Query Optimiser and the lowest cost option (leading operation) taken. A relative cost value of less than 1 indicates a selective query, anything higher indicates non-selective. The example below shows the output for a report where a report filter hits an indexed field.

selective-queries-api

Developer Console Query Plan

From Summer ’14 on the Developer Console can be enabled [Help>Preferences>Enable Query Plan] to display a Query Plan option on the Query Editor tab. The construct of the output is the same as the API approach. Note, this appproach is limited to SOQL queries.

selective-queries-developer-console

The references section below provides link to the technical detail of the 2 approaches introduced above.

 

References

Query & Search Optimisation Cheat Sheet

Developing Selective Force.com Queries through the Query Resource Feedback Parameter Beta

Developer Console Query Editor

Improve performance with Custom indexes using Selective SOQL Queries

Custom Index Request Checklist

Salesforce Query Optimisation

Understanding how the Salesforce Query Optimiser (SQO) works is fundamental to delivering performance at the data layer. Good practice in this regard, as part of an optimised physical data model implementation, is key to delivering efficient data access at the application layer.

In short, certain fields are indexed by default, custom indexes can be added via External Id and unique field attributes or by Salesforce support. For composite indexes the latter is required. The SQO utilises an index statistics table which records the distribution of data in each index. When processing a query, a pre-query is made to the statistics table to determine whether a candidate index is useful or not, in terms of selectivity thresholds etc.

The Query & Search Optimization Cheat Sheet provides a very useful reference in terms of the logic applied by the SQO, plus the fields indexed by default. Bookmark as a guide next time you’re writing a complex SOQL query or designing a data model. In the latter case performance of data retrieval is an often overlooked design consideration.

Salesforce Data Storage

In reading this interesting title (via MEAP) on the trendy topic of Big Data, I was reminded of the many times I’ve had to explain the general principles of the Salesforce data storage model and specifically how many records can be stored per user.

In short each user license gets an allowance, the edition of Salesforce dictates how big the allowance is. A minimum threshold of 1GB applies regardless of users and edition. For illustration, Enterprise Edition (EE) customer with 100 standard user licenses gets 100*20MB=2GB (20MB per-user for EE, UE is 120MB). If the same customer had only 10 users, they would have 1GB, as 10*20MB=200MB is less than the minimum value.

So – how does this equate to physical record storage? The answer to this is straightforward, in general terms each record held consumes 2KB regardless of the number of custom fields, therefore simply divide the allocation by 2KB as below.

2GB = 1,073,741,824*2 = 2,147,483,648 / 2,048 = 1,048,576 = approximately 1 million records

Note – exceptions to the 2KB per-record rule are Article (4KB) Campaign (8KB), CampaignMember (1KB), PersonAccount (4KB) and EmailMessage (variable). Article Types and Tags all consume 2KB for each record. Certain records such as Products, PriceLists, Chatter Posts and Comments etc. do not count. *Update* – Contact Roles and Partner Roles also do not count toward data storage.

This post is not exhaustive, but the maths and noted exceptions should be sufficient to gain an approximation of required versus actual storage. In a world where cloud storage is perceived to be almost a free commodity, the Salesforce model looks restrictive when considered simply in GB terms, particularly in comparison to other cloud services such as iCloud, where 5GB is free.

It always makes sense to implement a fit-for-purpose archiving strategy (off-platform perhaps), balancing business need for data versus storage cost – a further consideration being the performance implications of larger data sets in areas such as reporting and search.

Salesforce Large Data Volumes

My own simplistic definition of an LDV scenario in the Salesforce context is in excess of 2 million records in a single object. This is a rule-of-thumb rather than prescriptive.

The following Salesforce functional areas are most affected by LDV, in terms of performance, timeouts etc. :

Reports
Search
Listviews
SOQL

Working with LDV :
In LDV cases, first and foremost you should consider options for Data Reduction such as archiving, UI mashups etc.. Where LDV is unavoidable the concept of Selectivity should be applied to the functional areas impacted most.

Data Reduction Considerations:
Archiving – consider off-platform archiving solutions
Data Warehouse – consider a data warehouse for analytics
Mashups – leave data in-situ in external systems and integrate at the UI level

Selectivity Considerations:
Selective – reduce the number of objects and fields used in a query.
Narrow Range Reporting – apply selective report filtering.
Filtering – apply restrictive filtering to Listviews and SOQL where clauses.
Custom Indexes – can be effective where query values in the indexed field represent less than 10% (or <300K) of the overall record count.
Skinny Tables – can be effective as a report accelerator at 5M records plus.
SOQL – avoid NULL values (can’t be indexed)
Horizontal partitioning of data – split objects by geography for example.
Denormalisation – to remove expensive joins use Apex Triggers to resolve FK
Divisions – acts like a DB partition (by geography, ownership etc.)
Avoid over parenting – 10K limit for child records, per parent record. For example avoid one parent account having more than 10,000 contacts etc.
Data Skew – look for even distribution.

Loading Data :
Use the Bulk API to load large data volumes, via the Apex Data Loader perhaps (requires setting value to be explicitly set). The Bulk API, in simple terms, uploads the data into temporary tables then executes processing of the data (actual load into target objects) using parallel asynchronous processes. This offers potential performance improvements over the serial and synchronous combined upload+process model employed by all other loading techniques.

Data Loading Considerations;
Defer complex sharing rules
Disable Triggers and Workflow (post-process via Batch Apex)
Speed of operation; Insert then Update then Upsert (involves implicit query)
Group and sequence data to avoid parent record locking
Remember database statistics calculate overnight, wait to do performance testing
Tune the batch size (HTTP keepalives, GZIP compression)