 

Prediction of crystallization probability based on pI

 


The purpose of this small program is to show how prior information can be used to optimize the efficiency of initial crystallization screening in High Throughput Protein Crystallography (HTPX). Effective initial crystallization screening aims to identify, with the highest overall efficiency (least material, supplies, and resources, and thus cost), the proteins that are most likely to yield successful crystals and structures. The purpose of efficient initial screening is not to find conditions for each and every protein, but to focus resources (scale-up, Se-Met, etc.) on those proteins which have the highest probability of yielding structures with the least effort (a.k.a. 'the first cut', 'cherry picking', etc.).

The isoelectric point (pI) is the pH at which the charges of the ionizable amino acid residues (C, D, E, H, K, R, Y) and of the amino- and carboxy-terminus of the peptide chain compensate to a zero net charge, resulting in minimum solubility of the protein in aqueous solution. The relevance of decreased solubility for crystallization success is still debated, and there exists no simple correlation between the pI and the pH of crystallization. However, screening according to the observed distribution of crystallization pH (or of the difference pH-pI) for a given pI increases the likelihood of crystallization, and thus the pI can be employed as a predictor for crystallization success. The data have been extracted from the 9000+ sequence records of the PDB and the corresponding reported pH of crystallization.
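The definition above can be sketched in a few lines of Python: sum the Henderson-Hasselbalch charge of every ionizable group and bisect for the pH at which the net charge is zero. This is only a minimal illustration and not the calculator used to derive the statistics on this page. The pKa values below are assumed, illustrative textbook-style numbers, and the function names are hypothetical. Different pKa sets give different pIs, so use the calculator provided on this page for consistency with the distributions.

```python
# Illustrative pKa values (ASSUMED; not necessarily those used by this page's calculator).
PKA_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
PKA_N_TERM = 8.6
PKA_C_TERM = 3.6


def net_charge(sequence: str, ph: float) -> float:
    """Net charge of the chain at a given pH (Henderson-Hasselbalch)."""
    seq = sequence.upper()
    # Basic groups carry +1 when protonated.
    charge = 1.0 / (1.0 + 10.0 ** (ph - PKA_N_TERM))
    for residue, pka in PKA_POSITIVE.items():
        charge += seq.count(residue) / (1.0 + 10.0 ** (ph - pka))
    # Acidic groups carry -1 when deprotonated.
    charge -= 1.0 / (1.0 + 10.0 ** (PKA_C_TERM - ph))
    for residue, pka in PKA_NEGATIVE.items():
        charge -= seq.count(residue) / (1.0 + 10.0 ** (pka - ph))
    return charge


def isoelectric_point(sequence: str, tol: float = 1e-4) -> float:
    """Bisect on pH in [0, 14]; the net charge decreases monotonically with pH."""
    low, high = 0.0, 14.0
    while high - low > tol:
        mid = 0.5 * (low + high)
        if net_charge(sequence, mid) > 0.0:
            low = mid   # still positively charged: pI lies at higher pH
        else:
            high = mid  # negatively charged (or neutral): pI lies at lower pH
    return 0.5 * (low + high)
```

Because every group is treated with a fixed pKa, the sketch ignores the local environment of each residue, which is one reason calculated pIs are approximate.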

The following caveats apply:

a) The pI calculation is not exact. Only once the structure is known can the local environment, which determines the actual pKa values of the residues, be accounted for accurately.
b) The distributions are not further discriminated by protein properties, and represent probabilities for the average, 'garden variety' protein reported in the PDB. They may not be valid for special cases, such as membrane proteins, complexes, etc. They provide, however, the most efficient overall strategy for initial screening.
c) It is mandatory to use the sequence of the construct that is actually crystallized when calculating the pI. All affinity tags, fusions, linkers, cleavage site residuals, etc. must be included.
d) The delta-distributions used are coarsely binned (9000 data points for a 2-d set of distributions is not much) but show clearly how the shape and mean/mode of the distributions differ. The binning width of 1 pH unit is a realistic estimate of the error in the calculated pIs (and perhaps in the reported pH).
e) The distribution of pH values in the PDB is biased by usage (no negative results are reported), and the distributions - judging from random experiments - are perhaps broader than those extracted from the PDB.

Please cite the published reference when you use this program :

The bin data extracted from the latest non-redundant PDB data set can be downloaded from here.

Output:

Above the distribution graphs, you will see a set of tables. The following is returned for pI of 8.0 for example:

Table for cutoff excluding bins with expected success rates below 1.0%

pH-pI bin  :  -8.0  -7.0  -6.0  -5.0  -4.0  -3.0  -2.0  -1.0   0.0   1.0   2.0   3.0   4.0   5.0
Expected % :   0.0   0.0   0.0   0.0   4.8  12.5  21.8  24.4  26.7   7.3   1.9   0.0   0.0   0.0

Population of 288 experiments in 7 bins :
equal pop. :     0     0     0     0    41    41    41    41    41    41    41     0     0     0   287
suggested  :     0     0     0     0    13    36    63    70    77    21     5     0     0     0   287

Expected relative hit rates
equal pop. :   0.0   0.0   0.0   0.0   2.0   5.1   9.0  10.0  10.9   3.0   0.8   0.0   0.0   0.0  40.8
suggested  :   0.0   0.0   0.0   0.0   0.7   4.5  13.8  17.3  20.6   1.5   0.1   0.0   0.0   0.0  58.5

pH         :   ---   ---   ---   ---   4.0   5.0   6.0   7.0   8.0   9.0  10.0   ---   ---   ---
Experiments:   ---   ---   ---   ---    13    36    63    70    77    21     5   ---   ---   ---

Expected efficiency increase compared to pH screening with equally populated bins: 43%

The first set of blue lines indicates:

  • the pH-pI bin of the distribution (shown in the graph) as column headers

  • the prior (expected) distribution of successes based on the analysis from the PDB crystallization data

  • the population of the screen with experiments, first with equal frequency, then with the frequency suggested by evidence

  • the relative expected hit rates (scale is arbitrary) for equal frequency and for suggested frequency.

The final red lines give:

  • the suggested pH range for screening

  • the number of experiments to set up

  • and finally, the estimated increase in efficiency based on the expected hit rates.

The table repeats for different screen widths (i.e., neglecting bins with populations below a certain cutoff as listed). Note how this affects the efficiency increase - the gain is largest for wide (improbable) screen ranges.
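The arithmetic behind such a table can be sketched as follows. This is a hedged reconstruction and not the program's actual code: it assumes that the suggested population is proportional to the expected success rate of each retained bin, that the relative hit rate of a bin is the number of experiments times its expected success rate, and that the efficiency increase is the ratio of the two hit-rate totals minus one. The function name `plan_screen` and its parameters are hypothetical. The expected percentages are those of the pI 8.0 example above.

```python
def plan_screen(pi, deltas, expected_pct, n_experiments, cutoff_pct=1.0):
    """Distribute experiments over pH bins in proportion to expected success."""
    # Keep only bins whose expected success rate reaches the cutoff.
    kept = [(d, p) for d, p in zip(deltas, expected_pct) if p >= cutoff_pct]
    total_pct = sum(p for _, p in kept)

    # Equal population: the same whole number of experiments in every kept bin.
    n_equal = n_experiments // len(kept)
    hits_equal = sum(n_equal * p / 100.0 for _, p in kept)

    # Suggested population: proportional to the expected success rate
    # (fractional counts here; a real screen would use whole experiments).
    rows = []
    hits_suggested = 0.0
    for d, p in kept:
        n = n_experiments * p / total_pct
        hits_suggested += n * p / 100.0
        rows.append((pi + d, n))  # screening pH = pI + (pH-pI) bin

    gain = hits_suggested / hits_equal - 1.0
    return rows, hits_equal, hits_suggested, gain


# Values from the pI 8.0 example table.
deltas = [float(d) for d in range(-8, 6)]
expected = [0.0, 0.0, 0.0, 0.0, 4.8, 12.5, 21.8, 24.4,
            26.7, 7.3, 1.9, 0.0, 0.0, 0.0]

rows, hits_equal, hits_suggested, gain = plan_screen(8.0, deltas, expected, 288)
narrow_gain = plan_screen(8.0, deltas, expected, 288, cutoff_pct=5.0)[3]
```

Under these assumptions the totals round to the 40.8 and 58.5 listed in the example, and `gain` comes out at about 0.44, close to the quoted 43%. Raising the cutoff to 5% narrows the screen to five bins and lowers the gain, which illustrates why the gain is largest for wide screen ranges.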

In the above example, one sees that there is not much point in screening far above the pI, but even up to 3 pH units below pI there is a good statistical chance that the pH is conducive to crystallization. For pI 6.0, this distribution would have similar centroid values, but a different shape.  For more extreme values, both the centroids and the distribution shape change substantially. Using the suggested values and frequencies maximizes the chance for success with a minimal number of experiments. 

In an initial screening, for example, you might consider a more limited range of pHs - at the risk of losing a few percentage points of chance for success. Comprehensiveness and material demands need to be balanced for maximum efficiency - a decision you need to make based on your situation.

NOTE: For consistency with the pI calculation used to derive the statistics, use the calculator provided below. Depending on which pI calculator you use, deviations of +/- 0.5 pH units or more are not uncommon - see caveat a).

Enter either pI, or the sequence of your protein to be crystallized (see above):

pI : (if zero, a sequence must be entered; if a value is given, leave the default sequence in place, it will be ignored)
Number of experiments :   (optional)

Sequence format  :

Enter sequence below (up to 5000 residues; avoid trailing blanks past the last character in each line):

 

 
 

Telephone: 925-209-7429 • The entire site 2005-2013 by Bernhard Rupp. All rights reserved.