DNA MOTIFS

While one can search a DNA sequence against established lists of motifs, one can also discover motifs directly. To do this, one has to derive a consensus sequence or probability matrix. For bacterial proteins whose binding sites have been determined, good places to start are the E. coli DNA-Binding Site Matrices (A.M. McGuire, Harvard University, U.S.A.) and DBTBS: a database of transcriptional regulation in Bacillus subtilis (University of Tokyo, Japan). These sites provide a training set that can be used to derive a Gibbs screening matrix.
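Deriving a consensus sequence and a probability matrix from a set of aligned binding sites can be sketched as follows. This is a minimal illustration, not the procedure used by either database: the function names, the pseudocount of 1, the uniform background of 0.25, and the toy sites are all assumptions made for the example, and the input is assumed to contain only the letters A, C, G and T.

```python
from collections import Counter
import math

BASES = "ACGT"

def position_probability_matrix(sites, pseudocount=1.0):
    """Column-wise base probabilities from equal-length aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    matrix = []
    for i in range(length):
        counts = Counter(s[i].upper() for s in sites)
        # The pseudocount keeps unseen bases from getting probability zero.
        matrix.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return matrix

def consensus(matrix):
    """Most probable base in each column."""
    return "".join(max(BASES, key=lambda b: col[b]) for col in matrix)

def best_hit(matrix, sequence, background=0.25):
    """Return (start, log-odds score) of the best-scoring window."""
    width = len(matrix)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width].upper()
        score = sum(math.log2(col[b] / background)
                    for col, b in zip(matrix, window))
        if best is None or score > best[1]:
            best = (start, score)
    return best

# Hypothetical training set of aligned sites (toy data).
sites = ["TTGACA", "TTGACT", "TTTACA", "TAGACA", "TTGCCA"]
matrix = position_probability_matrix(sites)
print(consensus(matrix))
print(best_hit(matrix, "GGCCTTGACAGGCC"))
```

The matrix keeps the variation that a consensus string discards, which is why it is usually the better choice for scanning a new sequence.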

See the additional pages on Promoters, Terminators, and Transcriptional Factors. For a recent review of different sequence motif-finding algorithms, see Hashim FA et al. Avicenna J Med Biotechnol. 2019; 11(2):130-148.

RSAT (Regulatory Sequence Analysis Tools) - a suite of modular tools for the detection and analysis of cis-regulatory elements in genome sequences. Its main applications are (i) motif discovery, including from genome-wide datasets like ChIP-seq/ATAC-seq, (ii) motif scanning, (iii) motif analysis (quality assessment, comparisons and clustering), (iv) analysis of regulatory variations, and (v) comparative genomics. (Reference: Nguyen NTT et al. Nucleic Acids Res. 2018. 46(Web Server issue):W209-W214). RSAT provides links to the following specialized sites:

RSAT Fungi
RSAT Prokaryotes
RSAT Metazoa
RSAT Protists
RSAT Plants

You may also want to consider the MEME Suite.

Motif Sampler - tries to find over-represented motifs (cis-acting regulatory elements) in the upstream regions of a set of co-regulated genes. This motif-finding algorithm uses Gibbs sampling to find the position probability matrix that represents the motif. Be sure to uncheck the appropriate box if you don't want the complementary strand included in the analysis. (Reference: Thijs G et al. 2002. J. Comput. Biol. 9: 447-464)
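The idea of using Gibbs sampling to find a position probability matrix can be illustrated with a basic site sampler. This is a generic sketch and not the Motif Sampler implementation: it assumes exactly one motif occurrence per sequence, a uniform background of 0.25, a fixed motif width, upper-case A/C/G/T input, and it scans the given strand only. All function names and parameters are hypothetical.

```python
import random

BASES = "ACGT"

def build_profile(sites, pseudocount=1.0):
    """Position probability matrix from equal-length sites."""
    width = len(sites[0])
    total = len(sites) + pseudocount * len(BASES)
    return [
        {b: (sum(s[i] == b for s in sites) + pseudocount) / total
         for b in BASES}
        for i in range(width)
    ]

def window_weight(profile, window, background=0.25):
    """Likelihood ratio of a window under the profile vs. the background."""
    weight = 1.0
    for col, base in zip(profile, window):
        weight *= col[base] / background
    return weight

def gibbs_motif_search(sequences, width, iterations=500, seed=0):
    """Return (start positions, profile) after a fixed number of updates."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - width + 1) for s in sequences]
    for _ in range(iterations):
        i = rng.randrange(len(sequences))
        # Build the profile from every sequence except the one resampled.
        others = [s[p:p + width]
                  for j, (s, p) in enumerate(zip(sequences, starts))
                  if j != i]
        profile = build_profile(others)
        seq = sequences[i]
        weights = [window_weight(profile, seq[p:p + width])
                   for p in range(len(seq) - width + 1)]
        # Sample a new start position in proportion to its weight.
        starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    sites = [s[p:p + width] for s, p in zip(sequences, starts)]
    return starts, build_profile(sites)

# Toy upstream regions (hypothetical data).
upstream = ["ACGTTGACAGT", "GGTTGACATCC", "CATTGACAGGA", "TTGACACCGTA"]
starts, profile = gibbs_motif_search(upstream, width=6, iterations=200, seed=1)
print(starts)
```

Because the procedure is stochastic, it is common to repeat the run with several seeds and keep the best-scoring result; the sketch leaves that step out.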

P2RP (Predicted Prokaryotic Regulatory Proteins) - identifies regulatory proteins, including transcription factors (TFs) and two-component systems (TCSs), based upon analysis of DNA or protein sequences. (Reference: Barakat M et al. 2013. BMC Genomics 14: 269)

DMINDA2 (Regulatory DNA motif identification and analyses) - This server contains: (i) five motif prediction and analyses algorithms, including a phylogenetic footprinting framework; (ii) 2125 species with complete genomes to support the above five functions, covering animals, plants and bacteria and (iii) bacterial regulon prediction and visualization. (Reference: Yang J et al. Bioinformatics. 2017; 33(16):2586-2588.)

k-mer analysis - A k-mer is a substring of length k, and counting the occurrences of all such substrings is a central step in many analyses of DNA sequences.
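Counting k-mers amounts to sliding a window of length k along the sequence, as in this minimal sketch. The function name count_kmers is an assumption for illustration; overlapping occurrences are counted and only the given strand is considered.

```python
from collections import Counter

def count_kmers(sequence, k):
    """Count every overlapping substring of length k in the sequence."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

counts = count_kmers("ATATAT", 2)
print(counts["AT"], counts["TA"])  # prints: 3 2
```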

KmerFinder 3.2 – predicts the species of bacteria from pre-assembled, complete or partial genomes, and short sequence reads. The prediction is based on the number of co-occurring k-mers (substrings of k nucleotides in DNA sequence data, in this case 16-mers) between the genomes of reference bacteria in a database and the genome provided by the user. (Reference: Hasman H et al. 2013. J Clin Microbiol. 52:139-146)
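The principle of scoring by co-occurring k-mers can be sketched as below. This is only an illustration of the idea, not KmerFinder's implementation or scoring scheme: the function names, the reference labels, and the toy sequences are hypothetical, and k is reduced to 4 in the example so that the short strings share any k-mers at all.

```python
def kmer_set(sequence, k=16):
    """Set of distinct k-mers found in the sequence."""
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def rank_references(query, references, k=16):
    """Rank reference genomes by the number of k-mers shared with the query."""
    query_kmers = kmer_set(query, k)
    scores = {name: len(query_kmers & kmer_set(seq, k))
              for name, seq in references.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical reference "genomes" (toy data).
references = {"ref_a": "ACGTACGTGGCCA", "ref_b": "TTTTTTTTTTTTT"}
print(rank_references("CGTACGTGG", references, k=4))
# prints: [('ref_a', 6), ('ref_b', 0)]
```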

kpLogo - Motifs of only 1–4 letters can play important roles when present at key locations within macromolecules. Because existing motif-discovery tools typically miss these position-specific short motifs, kpLogo was developed as a probability-based logo tool for integrated detection and visualization of position-specific ultra-short motifs from a set of aligned sequences. (Reference: X. Wu, & D.P. Bartel (2017) Nucleic Acids Res 45 (Issue W1): W534–W538)
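Detecting a position-specific short motif in aligned sequences can be illustrated with a simple enrichment test. The sketch below is a simplified stand-in and makes no claim about kpLogo's actual statistics: it assumes equal-length aligned sequences, takes the background probability of a k-mer to be its overall frequency across all positions, and uses a one-sided binomial tail. The function names and toy data are hypothetical.

```python
from collections import Counter
from math import comb

def binomial_tail(n, x, p):
    """P(X >= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(x, n + 1))

def positional_kmer_enrichment(aligned, k=1):
    """Return (position, k-mer, count, p-value) tuples, smallest p-value first."""
    n = len(aligned)
    length = len(aligned[0])
    positions = range(length - k + 1)
    per_position = [Counter(s[p:p + k] for s in aligned) for p in positions]
    overall = sum(per_position, Counter())
    total = sum(overall.values())
    results = []
    for p, counts in zip(positions, per_position):
        for kmer, observed in counts.items():
            # Background: how often this k-mer occurs at any position.
            background = overall[kmer] / total
            results.append(
                (p, kmer, observed, binomial_tail(n, observed, background)))
    return sorted(results, key=lambda r: r[3])

# Toy alignment in which every sequence starts with A (hypothetical data).
aligned = ["ACGT", "ATGA", "AGCC", "ACTG", "AGAT"]
position, kmer, count, p_value = positional_kmer_enrichment(aligned, k=1)[0]
print(position, kmer, count)  # prints: 0 A 5
```

With real data, the many position and k-mer combinations tested would call for a multiple-testing correction, which this sketch omits.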