William R. Pearson


  • BS, University of Illinois, Urbana-Champaign
  • PhD, California Institute of Technology
  • Postdoc, Johns Hopkins School of Medicine

Primary Appointment

  • Professor, Biochemistry and Molecular Genetics


Research Interest(s)

Protein Evolution; Computational Biology

Research Description

We have a long-standing interest in exploiting protein sequence information, both to better understand how new protein sequences arise and to understand the relationship between protein sequence and protein structure. Since describing the FASTP program in 1985, our group has been developing more effective methods for identifying distantly related protein sequences. Over the past 10 years, state-of-the-art methods have improved to the point where proteins that diverged from a common ancestor within the past billion years are likely to be detected by sequence similarity searching. We hope to push that threshold back beyond 2 billion years (near the time when prokaryotes and eukaryotes diverged), but it is already possible to identify novel proteins that are likely to have emerged in the last 500 to 800 million years. If we can identify proteins that emerged in the last 100 to 250 million years, it may be possible to identify the mechanisms by which new proteins are formed.

We are also exploring alignment-based strategies for integrating variation, domain, and functional annotations into protein and DNA sequence alignments. Traditionally, alignment programs display only the aligned protein or DNA sequences; to learn more about a homologous sequence, an investigator must follow links and read web pages for functional information. The latest version of the FASTA program integrates functional and variation information directly into its alignment displays.

Selected Publications

  • Pearson W, Mackey A. Using SQL Databases for Sequence Similarity Searching and Analysis. Current protocols in bioinformatics. 2017;59 9.4.1-9.4.22. PMID: 28902397
  • Pearson W. Finding Protein and Nucleotide Similarities with FASTA. Current protocols in bioinformatics. 2016;53 3.9.1-25. PMID: 27010337 | PMCID: PMC5072362
  • Pearson W, Li W, Lopez R. Query-seeded iterative sequence similarity searching improves selectivity 5-20-fold. Nucleic acids research. 2016;45(7): e46. PMID: 27923999
  • Welch L, Brooksbank C, Schwartz R, Morgan S, Gaeta B, Kilpatrick A, Mietchen D, Moore B, Mulder N, Pauley M, Pearson W, Radivojac P, Rosenberg N, Rosenwald A, Rustici G, Warnow T. Applying, Evaluating and Refining Bioinformatics Core Competencies (An Update from the Curriculum Task Force of ISCB's Education Committee). PLoS computational biology. 2016;12(5): e1004943. PMID: 27175996 | PMCID: PMC4866758
  • Pearson W. Protein Function Prediction: Problems and Pitfalls. Current protocols in bioinformatics. 2015;51 4.12.1-8. PMID: 26334923
  • Triant D, Pearson W. Most partial domains in proteins are alignment and annotation artifacts. Genome biology. 2015;16 99. PMID: 25976240 | PMCID: PMC4443539
  • Pearson W. Selecting the Right Similarity-Scoring Matrix. Current protocols in bioinformatics. 2014;43 3.5.1-3.5.9. PMID: 24509512 | PMCID: PMC3848038
  • Pearson W. An introduction to sequence similarity ("homology") searching. Current protocols in bioinformatics. 2013; Unit3.1. PMID: 23749753 | PMCID: PMC3820096
  • Pearson W. BLAST and FASTA Similarity Searching for Multiple Sequence Alignment. Methods in molecular biology (Clifton, N.J.). 2013;1079 75-101. PMID: 24170396
  • Furnham N, Holliday G, de Beer T, Jacobsen J, Pearson W, Thornton J. The Catalytic Site Atlas 2.0: cataloging catalytic sites and residues identified in enzymes. Nucleic acids research. 2013;42 D485-9. PMID: 24319146 | PMCID: PMC3964973
  • Mills L, Pearson W. Adjusting scoring matrices to correct overextended alignments. Bioinformatics (Oxford, England). 2013;29(23): 3007-13. PMID: 23995390 | PMCID: PMC3834790
  • Li W, McWilliam H, Goujon M, Cowley A, Lopez R, Pearson W. PSI-Search: iterative HOE-reduced profile SSEARCH searching. Bioinformatics (Oxford, England). 2012;28(12): 1650-1. PMID: 22539666 | PMCID: PMC3371869
  • Holliday G, Andreini C, Fischer J, Rahman S, Almonacid D, Williams S, Pearson W. MACiE: exploring the diversity of biochemical reactions. Nucleic acids research. 2011;40 D783-9. PMID: 22058127 | PMCID: PMC3244993
  • Gonzalez M, Pearson W. Homologous over-extension: a challenge for iterative similarity searches. Nucleic acids research. 2010;38(7): 2177-89. PMID: 20064877 | PMCID: PMC2853128
  • Gonzalez M, Pearson W. RefProtDom: a protein database with improved domain boundaries and homology relationships. Bioinformatics (Oxford, England). 2010;26(18): 2361-2. PMID: 20693322 | PMCID: PMC2935417
  • Sierk M, Smoot M, Bass E, Pearson W. Improving pairwise sequence alignment accuracy using near-optimal protein sequence alignments. BMC bioinformatics. 2010;11 146. PMID: 20307279 | PMCID: PMC2850363
  • Lavelle D, Pearson W. Globally, unrelated protein sequences appear random. Bioinformatics (Oxford, England). 2009;26(3): 310-8. PMID: 19948773 | PMCID: PMC2852211
  • Pearson W, Sierk M. The limits of protein sequence comparison? Current opinion in structural biology. 2005;15(3): 254-60. PMID: 15919194 | PMCID: PMC2845305