DAWG VERSION 2-CURRENT
Copyright (c) (2004-2010) Reed A. Cartwright - All rights reserved.
DESCRIPTION
Dawg is an application that will simulate sequence evolution with gaps.
ABSTRACT
DNA Assembly with Gaps (Dawg) is an application designed to simulate the
evolution of recombinant DNA sequences in continuous time based on the robust
general time reversible model with gamma and invariant rate heterogeneity and a
novel length-dependent model of gap formation. The application accepts
phylogenies in Newick format and can return the sequence of any node, allowing
for the exact evolutionary history to be recorded at the discretion of users.
Dawg records the gap history of every lineage to produce the true alignment in
the output. Many options are available to allow users to customize their
simulations and results.
Many tools and procedures exist for reconstructing alignments and phylogenies
and estimating evolutionary parameters from extant data. True phylogenies and
alignments are known in very rare instances. In the absence of known data with
true phylogenies, we are left with using simulations to test the accuracy of
such procedures. Proper simulation of sequence evolution should involve both
nucleotide substitution and indel formation. However, existing tools for
simulating sequence evolution either do not include indels, like Seq-gen or
evolver, or include a rather inexact model of indel formation, like Rose. I
developed Dawg to fill in these gaps.
CONTACT
racartwright@uh.edu or reed@scit.us
REFERENCE
Cartwright, R.A. (2005) DNA Assembly With Gaps (Dawg): Simulating Sequence
Evolution. Bioinformatics 21 (Suppl. 3): iii31-iii38
LICENSE
See copying.txt for license information.
DOWNLOAD
Dawg can be downloaded from <http://scit.us/projects/dawg/>.
PREREQUISITES
If installing from source, you need to ensure that the development libraries of
Boost <http://www.boost.org/> and GSL <http://www.gnu.org/software/gsl/> are
installed and findable on your machine. Dawg can compile with older versions of
Boost, but requires a recent version of Boost's Spirit library. If the version
of Boost on your machine doesn't come with a recent version of Spirit, just
download a recent version of the Boost library and copy the header files for
Spirit into [dawg source]/src/include/boost/spirit. That way they will be found
by the compiler before older versions of Spirit.
If you have to install Boost locally on a Unix machine, the following works for
me:
cd boost_1_44_0
./bootstrap.sh --prefix=$HOME
./bjam install
cd ../dawg-current
cmake -DBOOST_ROOT=$HOME .
make
Change as needed.
INSTALLATION
See Dawg's website for binary packages for Windows, Mac OSX, and other systems.
Alternatively, you can compile Dawg from the source. Dawg requires CMake 2.6
(http://www.cmake.org/) to build it from sources. Many Unix-like operating
systems can install CMake through their package systems. Extract the Dawg
source code and issue the following commands in the extracted directory:
cmake .
make
make install
The '-G' option to cmake is used to specify different build systems, e.g. Unix
Makefiles versus KDevelop3 project. The '-D' option to cmake can be used to set
different cmake variables from the command line:
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr .
make
make install
This will build an optimized version of Dawg and install it to '/usr/bin'. To
specify your own build flags you need to set the environment variables CFLAGS
and LDFLAGS as necessary. Then specify
cmake -DCMAKE_BUILD_TYPE= .
See CMake's manual for additional information.
If you would prefer to run the command-line version of CMake on Windows, open a
command console through the Visual Studio tools shortcut (or a similar
shortcut). This will add the required compiler programs to your command console
environment. After changing to the source code directory, issue the following
commands:
cmake -G "NMake Makefiles" .
nmake
If successful, you should find dawg.exe in the "src" directory.
If you are trying to compile Dawg on a UNIX machine that does not have CMake
installed, and you can't install it from a package, then you may need to install
it locally. After downloading and extracting CMake in your home directory,
change to its directory and issue the following commands.
./configure --prefix=$HOME
make
make install
If "make" fails, try using "gmake" instead.
COMMAND LINE USAGE
dawg [options] trick.dawg
Use "dawg --help" for help information. Dawg will read stdin if filename is
"-".
Use "dawg --help-trick" for help regarding input files.
DESCRIPTION OF SIMULATION
Dawg splits the simulation of a single replicate into jobs based on the tree and
sequence section. Different areas of the tree and different areas of the
sequence can have different evolutionary models. By varying the phylogeny in
different parts of the sequence you can produce recombinant sequences.
At the first level, the root sequence is split into a series of "segments",
which evolve independently, except for deletions, which may span many segments.
Next, the phylogeny of each segment is split into multiple sections, allowing
different evolutionary models for different parts of the tree. All of this can
be controlled from the input file.
INPUT FILE FORMAT
Dawg is controlled by a series of input files, referred to as tricks. Tricks
contain a series of Sections which define different models for different
sequence or tree regions:
[[SectionA]]
Parameter.A = valueA
Parameter.B = valueB, valueC
[[SectionB]]
Parameter.C = valueZ
Parameter.B = valueD, valueDD
Double square brackets define a new section. At the start of a trick file there
is an implied [[_initial_]] section header, allowing you to skip specifying one
if it is unneeded.
By default, each section inherits the values of the section above it. Section
_initial_ inherits from an implied _default_ section. In the example above,
SectionA inherits the values from _initial_, and SectionB from SectionA. This
makes it easy to specify a new section without having to write out everything.
For SectionB, Parameter.A has valueA, just like in SectionA.
You can also change the default behavior of inheriting from the previous
section:
[[SectionC = SectionA]]
Here SectionC will inherit the values from SectionA, not SectionB. If you use a
blank header, [[]], the name of the section will be generated automatically.
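The inheritance rules above can be sketched in Python. This is only an
illustration of the behaviour described in this section, not Dawg's parser: the
function resolve_sections and the tuple layout are hypothetical, and the
_default_ section is assumed to be empty.

```python
def resolve_sections(sections):
    """Resolve inheritance for an ordered list of (name, parent, params).

    `parent` is None when the section inherits from the section above it.
    Hypothetical helper for illustration; not part of Dawg.
    """
    resolved = {"_default_": {}}  # assumed empty in this sketch
    previous = "_default_"
    for name, parent, params in sections:
        base = resolved[parent if parent is not None else previous]
        merged = dict(base)    # start from the inherited values
        merged.update(params)  # the section's own values override them
        resolved[name] = merged
        previous = name
    return resolved


sections = [
    ("_initial_", None, {}),
    ("SectionA", None, {"Parameter.A": ["valueA"],
                        "Parameter.B": ["valueB", "valueC"]}),
    ("SectionB", None, {"Parameter.C": ["valueZ"],
                        "Parameter.B": ["valueD", "valueDD"]}),
    ("SectionC", "SectionA", {}),  # [[SectionC = SectionA]]
]
resolved = resolve_sections(sections)
```

With this input, resolved["SectionB"]["Parameter.A"] is ["valueA"], while
SectionC keeps SectionA's Parameter.B and has no Parameter.C.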
Another shortcut is to use parameter headers:
[[SectionA]]
Foo.Bar = A
Foo.Par = B
[[SectionB]]
[Foo]
Bar = A
Par = B
Both sections produce identical results. Parameters in Dawg are named such that
the trick files can be simplified using headers. Some parameter headers are
also special:
[] # clears the current header
[.part] # add a new part to the current header
[..part] # replace last part of the header
[....part] # replace last two parts
If 'part' is blank it simply deletes the last part of the current header.
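A minimal Python sketch of how a header prefix could be tracked while reading a
trick file. The helpers apply_header and full_id are hypothetical, and the
sketch covers only the [], [name], [.part], and [..part] forms. It assumes that
a plain [name] header replaces the whole current header.

```python
def apply_header(current, header):
    """Return the new header (a list of parts) after a [header] line.

    `current` is the list of parts currently in effect and `header` is the
    text between the single square brackets. Hypothetical helper.
    """
    if header == "":
        return []                            # [] clears the current header
    dots = len(header) - len(header.lstrip("."))
    part = header[dots:]
    if dots == 0:
        return header.split(".")             # assumed: replaces the header
    if dots == 1 and part:
        return current + part.split(".")     # [.part] adds a new part
    if dots == 2:
        base = current[:-1]                  # drop the last part
        return base + part.split(".") if part else base
    raise ValueError("header form not covered by this sketch")


def full_id(current, param):
    """Join the header parts and a parameter id with periods."""
    return ".".join(current + [param])


header = apply_header([], "Foo")
name = full_id(header, "Bar")  # "Foo.Bar"
```

Under these assumptions, "Bar = A" below a [Foo] header is equivalent to writing
"Foo.Bar = A" without a header.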
Each parameter line in a trick file contains a parameter id, an equals sign, and
a list of strings separated by commas. Ids can contain one or more numbers,
letters, dashes, underscores, and periods, i.e. [A-Za-z0-9._-]+. There are four
different ways to specify strings.
Bare Strings contain a series of non-space characters, excluding ,#"[]=()
Tree Strings lack spaces and start with '(' and end with ';'.
Quoted Strings are a series of printable characters between double quotation
marks. Newlines are not acceptable.
Triple Quoted Strings can contain any character between two sets of three double
quotes.
An example of all four string types:
AList = Bare_String, (Tree:0.1,String:0.1);,
"Quoted String",
"""Double
Quoted
String"""
Comments start with '#' and go to the end of the line.
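The four string types can be approximated with regular expressions. The Python
sketch below is an illustration only: the function classify and the patterns
are hypothetical, they test one already-separated value at a time, and
"printable characters" is approximated as anything other than a double quote
or a newline.

```python
import re

# Patterns approximating the four string types described above.
BARE = re.compile(r'[^\s,#"\[\]=()]+')      # no spaces and none of ,#"[]=()
TREE = re.compile(r'\(\S*;')                # starts with '(' and ends with ';'
QUOTED = re.compile(r'"[^"\n]*"')           # one line between double quotes
TRIPLE = re.compile(r'""".*"""', re.DOTALL) # anything between triple quotes


def classify(text):
    """Return the string type of a single value, or None if nothing fits."""
    if TRIPLE.fullmatch(text):
        return "triple-quoted"
    if QUOTED.fullmatch(text):
        return "quoted"
    if TREE.fullmatch(text):
        return "tree"
    if BARE.fullmatch(text):
        return "bare"
    return None


kinds = [classify(s) for s in (
    "Bare_String",
    "(Tree:0.1,String:0.1);",
    '"Quoted String"',
    '"""Double\nQuoted\nString"""',
)]
```

For the four values from the example above, kinds is
["bare", "tree", "quoted", "triple-quoted"].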
OUTPUT FILE
Dawg can automatically detect the format of the output file based on its
extension. Supported formats and their extensions are:
Clustal: aln
Fasta: fasta, fas, fsa
Nexus: nexus, nex
Phylip: phylip, phy
Poo: poo
Dawg also supports the filename format of "ext:file" to output to "file" with
the format specified by extension "ext". That way one can use "nex:-" to output
to stdout in Nexus format. Partial matches of extensions are allowed.
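The extension rules above can be sketched in Python. This is not Dawg's
implementation: the function detect_format is hypothetical, a "partial match"
is assumed to mean that the given text is a prefix of a known extension, and an
ambiguous prefix simply takes the first format in the table.

```python
# Formats and extensions as listed above.
FORMATS = {
    "Clustal": ["aln"],
    "Fasta": ["fasta", "fas", "fsa"],
    "Nexus": ["nexus", "nex"],
    "Phylip": ["phylip", "phy"],
    "Poo": ["poo"],
}


def detect_format(spec):
    """Return (format, filename) for a spec such as "nex:-" or "out.fasta".

    The format is None when no extension matches. Hypothetical helper.
    """
    if ":" in spec:
        ext, filename = spec.split(":", 1)   # explicit "ext:file" form
    else:
        filename = spec
        ext = spec.rsplit(".", 1)[-1] if "." in spec else ""
    for fmt, exts in FORMATS.items():
        # Assumed meaning of a partial match: prefix of a known extension.
        if ext and any(known.startswith(ext) for known in exts):
            return fmt, filename
    return None, filename


result = detect_format("nex:-")
```

Here result is ("Nexus", "-"), which corresponds to writing Nexus output to
stdout.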
If the --split option is on, then each replicate will be saved to its own file,
based on the filename in the output option.
NOTES
The meaning of the "Params" vector is different for each substitution model.
GTR: Substitution rates A-C, A-G, A-T, C-G, C-T, G-T
JC: Ignored
K2P: Transition rate, Transversion rate
K3P: Alpha (Transitions), Beta (A-T & G-C), Gamma (A-C & G-T)
HKY: Transition rate, Transversion rate
F81: Ignored
F84: Kappa
TN: Alpha1 (A-G), Alpha2 (C-T), Beta (Transversions)
Parameter "Freqs" is ignored by the models "JC", "K2P", and "K3P".
If "Lambda" is a single value, then it specifies the total rate of indel
formation, split equally between insertions and deletions, e.g. "Lambda = 0.1"
is the same as "Lambda = {0.05, 0.05}". When two values are given, the first
is the insertion rate and the second is the deletion rate.
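As a small illustration of the "Lambda" rule, the following Python sketch turns
a list of one or two values into separate insertion and deletion rates. The
function split_lambda is hypothetical and is not part of Dawg.

```python
def split_lambda(values):
    """Return (insertion_rate, deletion_rate) from a list of Lambda values."""
    if len(values) == 1:
        half = values[0] / 2.0   # a single value is shared equally
        return (half, half)
    if len(values) == 2:
        return (values[0], values[1])
    raise ValueError("Lambda takes one or two values")


rates = split_lambda([0.1])
```

With a single value of 0.1, rates is (0.05, 0.05), matching the example above.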
The first parameter of "GapModel" specifies the distribution model of insertion
sizes. The second parameter specifies the distribution model of deletion sizes.
If only one parameter is given it is the model for both insertions and
deletions.
The first parameter of "GapParams" is a vector specifying the parameters for the
gap model of insertions. Likewise the second parameter is a vector specifying
the parameters for the gap model of deletions. If "GapParams" is not a vector
of vectors, then it specifies the vector of parameters for both insertions and
deletions.
The meaning of the "GapParams" vector is different for each gap model.
US: The distribution of gap sizes
NB: The number of failures (r), the probability of success (q)
PL: The rate parameter (a), the maximum gap size
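To make the PL parameters concrete, here is a Python sketch of a truncated
power-law distribution over gap sizes. It assumes that the probability of a gap
of size k is proportional to k^(-a) for k from 1 to the maximum gap size. That
form is an assumption made for illustration, so check Dawg's own documentation
for the exact model. The function power_law_pmf is hypothetical.

```python
def power_law_pmf(a, max_size):
    """Return [P(1), ..., P(max_size)] with P(k) proportional to k ** -a.

    Assumed form of the PL gap model; not taken from Dawg's source.
    """
    weights = [k ** -a for k in range(1, max_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


pmf = power_law_pmf(1.5, 100)
```

Under this assumption, short gaps are the most likely and the probabilities sum
to one over the allowed sizes.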
To create a recombinant tree, you may need to specifically describe and label
the inner nodes at which the recombination events occur.
Gamma takes precedence over Alpha.
Sequence takes precedence over Length.
If Out.Block.* is the name of a file, the code is read from that file.
The following vector parameters have a size of "Width": "Scale", "Alpha",
"Gamma", and "Iota". If their size is less than "Width", then the first value in
the vector will be used to fill in the rest of the values, e.g. "Scale = 1.0" is
the same as "Scale = {1.0,1.0,1.0}" when "Width = 3".
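The fill rule can be sketched in Python. The function fill_to_width is
hypothetical. It reads the rule literally, so missing positions are filled
with the first value, and it leaves vectors that are already long enough
unchanged, a case this document does not describe.

```python
def fill_to_width(values, width):
    """Pad `values` to `width` entries using the first value."""
    if not values:
        raise ValueError("at least one value is required")
    if len(values) >= width:
        return list(values)  # assumed: nothing to fill
    return list(values) + [values[0]] * (width - len(values))


scale = fill_to_width([1.0], 3)
```

Here scale is [1.0, 1.0, 1.0], as in the "Scale = 1.0" example with
"Width = 3".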