Dotter: A dot-matrix program with interactive greyscale rendering for genomic DNA and Protein sequence analysis


Download Dotter via FTP
There is also an electronic article on Dotter published in Gene-Combis

Introduction

Dotter is a graphical dotplot program for detailed comparison of two sequences. Here, every residue in one sequence is compared to every residue in the other sequence. The first sequence runs along the x-axis and the second sequence along the y-axis. In regions where the two sequences are similar to each other, a row of high scores will run diagonally across the dot matrix. If you're comparing a sequence against itself to find internal repeats, you'll notice that the main diagonal scores maximally, since it's the 100% perfect self-match.

To make the score matrix more intelligible, the pairwise scores are averaged over a sliding window which runs diagonally. The averaged score matrix forms a three-dimensional landscape, with the two sequences in two dimensions and the height of the peaks in the third. This landscape is projected onto two dimensions with the aid of greyscales - the darker the grey of a peak, the higher it is.
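The averaging step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it scores identities as 1 and mismatches as 0 instead of using a real score matrix such as Blosum62, it centres an odd-sized window on each cell, and it simply skips window positions that fall outside the matrix. How Dotter itself treats the edges is not described here, and the function names are made up for this example.

```python
def pairwise_scores(seq_x, seq_y, match=1, mismatch=0):
    """Score every residue of seq_y (rows) against every residue of seq_x (columns)."""
    return [[match if a == b else mismatch for a in seq_x] for b in seq_y]


def window_average(scores, window):
    """Average each cell with its neighbours along the same diagonal.

    The window is assumed to be odd and is centred on the cell. Positions
    outside the matrix are skipped, so edge cells average over fewer scores.
    """
    rows, cols = len(scores), len(scores[0])
    half = window // 2
    averaged = [[0.0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            total, count = 0, 0
            for k in range(-half, half + 1):
                yy, xx = y + k, x + k  # step along the diagonal
                if 0 <= yy < rows and 0 <= xx < cols:
                    total += scores[yy][xx]
                    count += 1
            averaged[y][x] = total / count
    return averaged


# Self-comparison of a short sequence with an internal repeat of period 4.
seq = "ACGTACGT"
landscape = window_average(pairwise_scores(seq, seq), window=3)
```

In this toy self-comparison the main diagonal averages to the maximum score, and the repeat shows up as a second diagonal offset by four residues.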

Dotter provides a tool to explore the visual appearance of this landscape, as well as a tool to examine the sequence alignment it represents. These tools are explained below.


Running Dotter

Command syntax:

dotter [options] query_seq subject_seq [X options]

The sequences may be either protein or DNA, but when Dottering DNA vs. protein the query_seq must be DNA. The sequences should be in Fasta format or just raw sequence. Fasta format looks like this:
>Name  annotation of any sort
MYWTTTAFLYFWQKSTGA
LMKQYWNCYLLPSLYTAV
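
For readers who want to check their input before running Dotter, the following is a minimal Python sketch of a reader for the format shown above. It is not Dotter's own parser; the function name `read_fasta` and the choice to return an empty name for raw, header-less sequence are assumptions made for this example.

```python
def read_fasta(text):
    """Return a list of (name, annotation, sequence) tuples from Fasta text."""
    records = []
    name, annotation, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, annotation, "".join(chunks)))
            header = line[1:].split(None, 1)  # first word is the name
            name = header[0] if header else ""
            annotation = header[1] if len(header) > 1 else ""
            chunks = []
        else:
            if name is None:  # raw sequence with no header line
                name = ""
            chunks.append(line)
    if name is not None:
        records.append((name, annotation, "".join(chunks)))
    return records


example = """>Name  annotation of any sort
MYWTTTAFLYFWQKSTGA
LMKQYWNCYLLPSLYTAV
"""
records = read_fasta(example)
```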

Options:

 -b <file>      Batch mode, write dotplot to <file>
 -l <file>      Load dotplot from <file>
 -m <float>     Memory usage limit in Mb (default 0.5)
 -z <int>       Set zoom (compression) factor
 -p <int>       Set pixel factor manually (ratio pixelvalue/score)
 -W <int>       Set sliding window size. (K => Karlin/Altschul estimate)
 -M <file>      Read in score matrix from <file> (Blast format; Default: Blosum62).
 -f <file>      Read feature segments from <file>
 -i             Do NOT use installed private colormap, but share with other apps
 -r             Reverse and complement horizontal_sequence (DNA vs Protein)
 -D             Don't display mirror image in self comparisons
 -w             For DNA: horizontal_sequence top strand only (Watson)
 -c             For DNA: horizontal_sequence bottom strand only (Crick)
 -q <int>       Horizontal_sequence offset
 -s <int>       Vertical_sequence offset

The most important X options:
 -acefont <font> Main font.
 -font    <font> Menu font.
 (Any standard X option can also be used, such as -bg green -fg red.)

Dottering large DNA sequences, such as cosmid vs. cosmid, will take at least 15 minutes even on the fastest workstation. In such cases, use the -b (batch mode) option and run Dotter niced in the background. Once it has finished, read in the precalculated Dotter file with the -l option. The only drawback of Dottering large sequences is that the width of the sliding window over which the averaging is done cannot be changed quickly, since no pre-averaged matrix is stored. However, extensive testing has shown that changing the sliding window size from the default of 25 residues has no or only marginal positive effects.
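
The two-step batch workflow can be scripted. The sketch below only builds the argument lists, following the command syntax given above; the helper names and all file names are hypothetical, and passing the sequence files again when loading with -l is an assumption based on that syntax.

```python
def batch_command(query, subject, dotplot_file):
    """Argument list that computes the dotplot in batch mode at low priority."""
    return ["nice", "dotter", "-b", dotplot_file, query, subject]


def load_command(query, subject, dotplot_file):
    """Argument list that displays a previously computed dotplot."""
    return ["dotter", "-l", dotplot_file, query, subject]


# Hypothetical file names, for illustration only.
compute = batch_command("cosmidA.fa", "cosmidB.fa", "AvsB.dotter")
display = load_command("cosmidA.fa", "cosmidB.fa", "AvsB.dotter")
```

Either list can be handed to `subprocess.run` from the standard library, for example `subprocess.run(compute, check=True)`, once Dotter is installed and on the PATH.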

Dotter runs in linear space, so there is no practical limit on the length of the sequences - the calculation just takes time proportional to n^2. If the matrix becomes too big, Dotter automatically zooms out to fit it inside a 707x707 pixel window. The user can choose to use more memory with the -m option - if the dotplot then doesn't fit the screen, Dotter will provide scrollbars.
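
The 707x707 figure is consistent with the default memory limit: at one byte per pixel, 707 x 707 = 499849 bytes, just under 0.5 Mb. The sketch below picks the smallest zoom factor whose compressed matrix fits a given budget. It is an assumption about how such a factor could be chosen, not Dotter's actual rule; it also assumes 1 Mb = 1,000,000 bytes, and the function name is made up.

```python
def min_zoom(len_x, len_y, memory_mb=0.5):
    """Smallest integer zoom whose compressed matrix fits the byte budget.

    Assumes one byte per pixel and that a zoom factor z packs z residues
    per pixel along each axis (rounded up).
    """
    budget = int(memory_mb * 1_000_000)
    zoom = 1
    while True:
        width = (len_x + zoom - 1) // zoom   # ceiling division
        height = (len_y + zoom - 1) // zoom
        if width * height <= budget:
            return zoom
        zoom += 1
```

Under these assumptions two 40000-residue cosmids need a zoom factor of 57 at the default limit, while raising the limit lowers the factor at the cost of a larger window.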

Normally, the identical mirror image of self-comparisons is not displayed. Use the -D option to force it on. For DNA, both top strands and the reverse complement of query_seq vs the top strand of subject_seq will be calculated. Use the -w and -c options if you want to only see one of these.


The Greyramp tool

To improve visualization, little peaks (noise) can be nullified by a min cutoff. Similarly, significant peaks above a certain score can be saturated by a max cutoff. Peaks between min and max use the greyscales to show their strength. Since the cutoffs for the min and max scores depend on the nature of the sequences at hand, it is impossible to know a priori what they should be. The main novelty of Dotter is that the user can 'play' with the min and max cutoffs until he/she achieves the optimal separation between noise and signal. This is not cheating, but a necessary visual aid.
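
The effect of the two cutoffs can be written as a small mapping from score to grey level. This is a sketch only: the linear ramp between the cutoffs, the 256 grey levels and the name `grey_level` are assumptions for illustration, not a description of Dotter's internals.

```python
def grey_level(score, min_cutoff, max_cutoff, levels=256):
    """Map a score to a grey intensity: 0 is white (noise), levels-1 is darkest."""
    if max_cutoff <= min_cutoff:
        raise ValueError("max_cutoff must be greater than min_cutoff")
    if score <= min_cutoff:
        return 0                # nullified: below the min cutoff
    if score >= max_cutoff:
        return levels - 1       # saturated: above the max cutoff
    fraction = (score - min_cutoff) / (max_cutoff - min_cutoff)
    return int(fraction * (levels - 1))  # truncate to an integer grey level
```

Moving `min_cutoff` up wipes out more background, while moving `max_cutoff` down makes weaker peaks appear fully dark.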


The Alignment tool

To see the match that causes a given peak in the dotplot, move the crosshair with the left mouse button to the peak and pop up the alignment tool. Once in the proximity, use the cursor keys to move the crosshair one residue at a time. See HELP for key movements.
Note that dragging the crosshair with the alignment tool active is very slow - it's best to quit it if you want to drag a lot.


Zooming in

Zoom in to a region by dragging with the middle mouse button. Dotter will then start up a new independent Dotter job for that region.


Set width of the sliding window

The default width of 25 residues over which the pairwise scores are averaged has proven very robust. There's normally no need to change this and I don't expect any other window size to improve the result much. Remember that the whole matrix has to be recalculated, so if it took a long time to calculate it the first time, stay away from this menu item!


Displaying multiple dotplots simultaneously

When looking for overlaps between many sequences, for instance when assembling contigs, it can be useful to make a dotplot of all sequences vs. each other. This way any overlap will show up as a diagonal in the corner of a subsequence dotplot. Dotter has a built-in mechanism for this. To run Dotter on many sequences at once, concatenate the sequence files, which must be in Fasta format (see above). Then run dotter on the concatenated sequence file against itself, and green partitioning lines will appear between the sequences. At each partitioning line, the name of the following sequence is printed. These lines can be turned on and off with the button "Draw lines a segment ends" in the "Feature series selection tool", which is launched from the main menu.
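
Concatenation is simple, but a file that lacks a final newline would glue the next header line onto the end of its sequence. The sketch below guards against that; the function name and the contig names are made up for this example.

```python
def join_fasta(texts):
    """Join Fasta texts, making sure each one ends with a newline."""
    return "".join(t if t.endswith("\n") else t + "\n" for t in texts)


# The first text has no trailing newline, the second one does.
combined = join_fasta([">contig1\nACGTACGT", ">contig2\nTTGACCA\n"])
```

Writing `combined` to a file, say all.fa (a hypothetical name), and running `dotter all.fa all.fa` then gives the all-against-all dotplot described above.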


Marking sequence feature segments with coloured boxes

"Sequence Feature Segments" (SFS) are user-defined custom segments, marking a particular sequence region that has a particular property. This is for instance used for marking protein domains, protein secondary structure elements, low complexity segments, or exons/introns. Feature segments can either be loaded on the command line with the -f option, or with the menu option "Load features from file". These segments are then displayed on the screen as coloured boxes, or lines.

Format:

Your file with feature segments should contain first a format specifying line with exactly this wording: # SFS type=SEG. After that, any number of lines can follow, with one segment per line, in this format:

score sequence series start end colour annotation...

where the fields are:

score = Int [1..100] - The score of the segment, reflected by its width. Note that segments with score=0 are always shown.

sequence = String [one word] - Name of sequence this feature segment refers to. Instead of the name, it is also valid to use "@1" for the horizontal sequence and "@2" for the vertical.

series = String [one word] - An identifier to separate different series on different lines. This is only used in the Feature Series Selection Tool and is not shown on the screen. Use the "annotation" field for text you want to see on the screen (see below).

start, end = Int - The coordinates of segment. Note: If start equals end, a line will be drawn across the dotplot instead of a box.

colour = String - Colour of box. Valid colours: "WHITE", "BLACK", "LIGHTGRAY", "DARKGRAY", "RED", "GREEN", "BLUE", "YELLOW", "CYAN", "MAGENTA", "LIGHTRED", "LIGHTGREEN", "LIGHTBLUE", "DARKRED", "DARKGREEN", "DARKBLUE", "PALERED", "PALEGREEN", "PALEBLUE", "PALEYELLOW", "PALECYAN", "PALEMAGENTA", "BROWN", "ORANGE", "PALEORANGE", "PURPLE", "VIOLET", "PALEVIOLET", "GRAY", "PALEGRAY", "CERISE", "MIDBLUE"

annotation... = String [any number of words] - This text is shown on the screen below the coloured boxes.

Segment display control

You may not always want to see all series at the same time, but instead focus on the series with relevant information. To this end, Dotter and Blixem include a "Feature segment selection tool" (on the main menu), which allows the user to hide or unhide any of the series interactively.

Example

(for a sequence of length 400):

# SFS type=SEG
100 @1 zero   35  75 BLUE series zero, first seg
50  @1 zero  335 375 BLUE
75  @1 one   205 205 magenta Ligne
25  @1 THREE 115 125 green un petit vert

To test this, put the example above in a file foo, and run Dotter on a Fasta-formatted sequence of length 400 in file "seq":
  dotter -f foo seq seq &
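
A feature file can be checked before loading with a short parser. The sketch below follows the field order given under "Format" above; it is not Dotter's own reader, and the function name, the dictionary keys and the derived `is_line` flag are assumptions made for this example.

```python
def parse_sfs(text):
    """Parse SFS text into a list of dicts, one per segment line."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0].strip() != "# SFS type=SEG":
        raise ValueError("first line must be '# SFS type=SEG'")
    segments = []
    for line in lines[1:]:
        # Six fixed fields, then free-text annotation of any length.
        fields = line.split(None, 6)
        if len(fields) < 6:
            raise ValueError(f"too few fields: {line!r}")
        score, sequence, series, start, end, colour = fields[:6]
        segments.append({
            "score": int(score),
            "sequence": sequence,
            "series": series,
            "start": int(start),
            "end": int(end),
            "colour": colour,
            "annotation": fields[6].strip() if len(fields) > 6 else "",
            "is_line": int(start) == int(end),  # start == end draws a line
        })
    return segments


example = """# SFS type=SEG
100 @1 zero   35  75 BLUE series zero, first seg
50  @1 zero  335 375 BLUE
75  @1 one   205 205 magenta Ligne
25  @1 THREE 115 125 green un petit vert
"""
segments = parse_sfs(example)
```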

Dot matrix file format

Since Dotter allows saving and loading of dot-matrices, it can also be used for displaying dot-matrices generated by other programs. The dot-matrix is simply stored as a stream of bytes, one byte per pixel. All rows of pixels (bytes) are concatenated to each other in a wrap-around manner. To specify the size and other aspects of the dot-matrix, a header precedes the pixel values. There are presently two header formats supported by Dotter: a simple (old) one, and a more complex one in which Dotter saves its own dot-plots. If you want to use Dotter to display some arbitrary dot-matrix, you may not care about things such as score matrices or window length. In that case you should specify format 1 and omit the format 2 features (everything after vertical_len).

The header consists of the following fields:


VARIABLE                 TYPE (bytes)       RANGE           USED_BY_FORMAT
--------                 ------------       -----           --------------
fileformat               unsigned char (1)  1-2             1, 2
zoomfactor               int (4)                            1, 2
horizontal_len           int (4)                            1, 2
vertical_len             int (4)                            1, 2
pixel_factor             int (4)                            2
window_length            int (4)                            2
score_matrix_name_length int (4)                            2
score_matrix_name        char (score_matrix_name_length)    2
score_matrix[24][24]     int (4)*576                        2

Fileformat is simply a version number for backwards compatibility, and is currently 2. The zoomfactor (compression factor) equals 1, 2, 3... for the number of dots (residue pairs) compressed into one pixel along each axis (zoomfactor 2 => 4 dots/pixel).

The most important thing to keep in mind is that for technical reasons, horizontal_len and vertical_len have to be the smallest multiple of 4 greater than or equal to the actual sequence length. So if the horizontal sequence is e.g. 197 long, horizontal_len must be set to 200, and the pixel map must contain this number of pixels. So if your matrix was made from two sequences of length 197 and 199, the pixelmap must contain 200x200 pixels.
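
The rounding rule and the resulting pixel map can be sketched as follows. The sketch assumes zoomfactor 1 (one pixel per residue pair) and pads with zero-valued pixels; which value the extra pixels should carry is not stated here, so the fill value is an assumption, as are the function names.

```python
def padded_length(seq_len):
    """Smallest multiple of 4 that is greater than or equal to seq_len."""
    return (seq_len + 3) // 4 * 4


def pad_pixel_rows(rows, horizontal_seq_len, vertical_seq_len, fill=0):
    """Pad rows of pixel bytes to the padded width and height, then join them.

    Each row runs along the horizontal sequence; there is one row per
    residue of the vertical sequence.
    """
    width = padded_length(horizontal_seq_len)
    height = padded_length(vertical_seq_len)
    padded = [bytes(row) + bytes([fill]) * (width - len(row)) for row in rows]
    padded += [bytes([fill]) * width] * (height - len(rows))
    return b"".join(padded)


# 199 rows of 197 blank pixels, as for sequences of length 197 and 199.
pixels = pad_pixel_rows([bytes(197)] * 199, 197, 199)
```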

The pixel_factor is the scaling factor between the real score of a dot and the pixel value, which was used to generate the dot-matrix. The value doesn't affect the display of the dot-matrix, only its meaning in absolute values.

The window_length is the length of the sliding window used to generate the dot-matrix.

The score_matrix fields define the pairwise residue score matrix that was used to generate the dot-matrix. The order of residues is: ARNDCQEGHILKMFPSTWYVBZX*

Note that all integers are stored with the most significant byte first! This is the default for fwrite on Irix and Sun, but the reverse of DEC Alpha and Linux.
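
A format 1 header can be produced portably by stating the byte order explicitly instead of relying on the platform. The sketch below uses Python's struct module with the ">" prefix, which means big-endian (most significant byte first) with no alignment padding. It assumes the header fields are written back to back with nothing between them, which the table above suggests but does not state; the function name is made up.

```python
import struct


def format1_header(zoomfactor, horizontal_len, vertical_len):
    """Pack a format 1 header: one unsigned char, then three 4-byte ints.

    ">" selects big-endian byte order and disables alignment padding,
    so the result is 1 + 3 * 4 = 13 bytes on every platform.
    """
    if horizontal_len % 4 or vertical_len % 4:
        raise ValueError("lengths must already be multiples of 4")
    return struct.pack(">Biii", 1, zoomfactor, horizontal_len, vertical_len)


# Header for an uncompressed 200x200 pixel map.
header = format1_header(1, 200, 200)
```

The pixel bytes then follow the header directly in the file, row after row.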


Limitations

There is no Mac version, only Unix and Windows.

Note that the old problems related to colormaps and the inability of Dotter to work on 16 and 24-bit displays have been resolved since version 3.0, thanks to Simon Kelley's introduction of the GTK graphics library. Note however that Dotter runs slower on 16 and 24-bit displays than on an 8-bit display.


Reference

If you use this program for your work, please reference:

"A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis"
Erik L.L. Sonnhammer and Richard Durbin
Gene 167:GC1-10 (1995)


