Appendix
========

.. only:: core or extra

  .. include:: appendix_alignment.inc.rst

RTG reference file format
--------------------------

Many RTG commands can make use of additional information about the
structure of a reference genome, such as expected ploidy, sex
chromosomes, location of PAR regions, etc.  When appropriate, this
information may be stored inside a reference genome's SDF directory in a
file called ``reference.txt``.

The ``format`` command will automatically identify several common
reference genomes during formatting and will create a ``reference.txt``
in the resulting SDF.  However, for non-human reference genomes, or less
common human reference genomes, a pre-built reference configuration file
may not be available, and will need to be manually provided in order to
make use of RTG sex-aware pipeline features.

Several example ``reference.txt`` files for different human reference
versions are included as part of the RTG distribution in the ``scripts``
subdirectory, so for common reference versions it will suffice to copy
the appropriate example file into the formatted reference SDF with the
name ``reference.txt``, or use one of these example files as the basis
for your specific reference genome.

To see how a reference text file will be interpreted by the chromosomes
in an SDF for a given sex you can use the ``sdfstats`` command with the
``--sex`` flag. For example:

.. code-block:: text

  $ rtg sdfstats --sex male /data/human/ref/hg19

  Location            : /data/human/ref/hg19
  Parameters          : format -o /data/human/ref/hg19 -I chromosomes.txt
  SDF Version         : 11
  Type                : DNA
  Source              : UNKNOWN
  Paired arm          : UNKNOWN
  SDF-ID              : b6318de1-8107-4b11-bdd9-fb8b6b34c5d0
  Number of sequences : 25
  Maximum length      : 249250621
  Minimum length      : 16571
  Sequence names      : yes
  N                   : 234350281
  A                   : 844868045
  C                   : 585017944
  G                   : 585360436
  T                   : 846097277
  Total residues      : 3095693983
  Residue qualities   : no

  Sequences for sex=MALE:
  chrM POLYPLOID circular 16571
  chr1 DIPLOID linear 249250621
  chr2 DIPLOID linear 243199373
  chr3 DIPLOID linear 198022430
  chr4 DIPLOID linear 191154276
  chr5 DIPLOID linear 180915260
  chr6 DIPLOID linear 171115067
  chr7 DIPLOID linear 159138663
  chr8 DIPLOID linear 146364022
  chr9 DIPLOID linear 141213431
  chr10 DIPLOID linear 135534747
  chr11 DIPLOID linear 135006516
  chr12 DIPLOID linear 133851895
  chr13 DIPLOID linear 115169878
  chr14 DIPLOID linear 107349540
  chr15 DIPLOID linear 102531392
  chr16 DIPLOID linear 90354753
  chr17 DIPLOID linear 81195210
  chr18 DIPLOID linear 78077248
  chr19 DIPLOID linear 59128983
  chr20 DIPLOID linear 63025520
  chr21 DIPLOID linear 48129895
  chr22 DIPLOID linear 51304566
  chrX HAPLOID linear 155270560 ~=chrY
      chrX:60001-2699520 chrY:10001-2649520
      chrX:154931044-155260560 chrY:59034050-59363566
  chrY HAPLOID linear 59373566 ~=chrX
      chrX:60001-2699520 chrY:10001-2649520
      chrX:154931044-155260560 chrY:59034050-59363566

The reference file is primarily intended for XY sex determination but
should be able to handle ZW and X0 sex determination also.

The following describes the reference file text format in more detail.
The file contains lines with TAB separated fields describing the
properties of the chromosomes. Comments within the ``reference.txt`` file
are preceded by the character ``#``. The first line of the file that is
not a comment or blank must be the version line.

.. code-block:: text

  version1

The remaining lines have the following common structure:

.. code-block:: text

  <sex>	<line-type>	<line-setting>...

The sex field is one of ``male``, ``female`` or ``either``. The
line-type field is one of ``def`` for default sequence settings, ``seq``
for specific chromosomal sequence settings and ``dup`` for defining
pseudo-autosomal regions. The *line-setting* fields are a variable
number of fields based on the line type given.

The default sequence settings line can only be specified with ``either``
for the sex field, can only be specified once and must be specified if
there are not individual chromosome settings for all chromosomes and
other contigs. It is specified with the following structure:

.. code-block:: text

  either	def	<ploidy>	<shape>

The *ploidy* field is one of ``diploid``, ``haploid``, ``polyploid`` or
``none``. The *shape* field is one of ``circular`` or ``linear``.

The specific chromosome settings lines are similar to the default
chromosome settings lines. All the sex field options can be used,
however for any one chromosome you can only specify a single line for
``either`` or two lines for ``male`` and ``female``. They are specified
with the following structure:

.. code-block:: text

  <sex>	seq	<chromosome-name>	<ploidy>	<shape>	[allosome]

The *ploidy* and *shape* fields are the same as for the default
chromosome settings line. The *chromosome-name* field is the name of the
chromosome to which the line applies. The *allosome* field is optional
and is used to specify the allosome pair of a haploid chromosome.

The pseudo-autosomal region settings line can be set with any of the
*sex* field options and any number of the lines can be defined as
necessary. It has the following format:

.. code-block:: text

  <sex>	dup	<region>	<region>

The regions must be taken from two haploid chromosomes for a given sex,
have the same length and not go past the end of the chromosome. The
regions are given in the format ``<chromosome-name>:<start>-<end>`` where
start and end are positions counting from one and the end is
non-inclusive.

An example for the HG19 human reference:

.. code-block:: text

  # Reference specification for hg19, see
  # http://genome.ucsc.edu/cgi-bin/hgTracks?hgsid=184117983&chromInfoPage=
  version 1
  # Unless otherwise specified, assume diploid linear. Well-formed
  # chromosomes should be explicitly listed separately so this
  # applies primarily to unplaced contigs and decoy sequences
  either	def	diploid	linear
  # List the autosomal chromosomes explicitly. These are used to help
  # determine "normal" coverage levels during mapping and variant calling
  either	seq	chr1	diploid	linear
  either	seq	chr2	diploid	linear
  either	seq	chr3	diploid	linear
  either	seq	chr4	diploid	linear
  either	seq	chr5	diploid	linear
  either	seq	chr6	diploid	linear
  either	seq	chr7	diploid	linear
  either	seq	chr8	diploid	linear
  either	seq	chr9	diploid	linear
  either	seq	chr10	diploid	linear
  either	seq	chr11	diploid	linear
  either	seq	chr12	diploid	linear
  either	seq	chr13	diploid	linear
  either	seq	chr14	diploid	linear
  either	seq	chr15	diploid	linear
  either	seq	chr16	diploid	linear
  either	seq	chr17	diploid	linear
  either	seq	chr18	diploid	linear
  either	seq	chr19	diploid	linear
  either	seq	chr20	diploid	linear
  either	seq	chr21	diploid	linear
  either	seq	chr22	diploid	linear
  # Define how the male and female get the X and Y chromosomes
  male	seq	chrX	haploid	linear	chrY
  male	seq	chrY	haploid	linear	chrX
  female	seq	chrX	diploid	linear
  female	seq	chrY	none	linear
  #PAR1 pseudoautosomal region
  male	dup	chrX:60001-2699520	chrY:10001-2649520
  #PAR2 pseudoautosomal region
  male	dup	chrX:154931044-155260560	chrY:59034050-59363566
  # And the mitochondria
  either	seq	chrM	polyploid	circular

As of the current version of the RTG software the following are the
effects of various settings in the ``reference.txt`` file when processing
a sample with the matching sex.

A ploidy setting of ``none`` will prevent reads from mapping to that
chromosome and any variant calling from being done in that chromosome.

A ploidy setting of ``diploid``, ``haploid`` or ``polyploid`` does not
currently affect the output of mapping.

A ploidy setting of ``diploid`` will treat the chromosome as having two
distinct copies during variant calling, meaning that both homozygous and
heterozygous diploid genotypes may be called for the chromosome.

A ploidy setting of ``haploid`` will treat the chromosome as having one
copy during variant calling, meaning that only haploid genotypes will be
called for the chromosome.

A ploidy setting of ``polyploid`` will treat the chromosome as having one
copy during variant calling, meaning that only haploid genotypes will be
called for the chromosome. For variant calling with a pedigree, maternal
inheritance is assumed for polyploid sequences.

The shape of the chromosome does not currently affect the output of
mapping or variant calling.

The allosome pairs do not currently affect the output of mapping or
variant calling (but are used by simulated data generation commands).

The pseudo-autosomal regions will cause the second half of the region
pair to be skipped during mapping. During variant calling the first half
of the region pair will be called as diploid and the second half will
not have calls made for it. For the example ``reference.txt`` provided
earlier this means that when mapping a male the X chromosome sections of
the pseudo-autosomal regions will be mapped to exclusively and for
variant calling the X chromosome sections will be called as diploid
while the Y chromosome sections will be skipped. There may be some edge
effects up to a read length either side of a pseudo-autosomal region
boundary.

.. only:: core or extra

  .. include:: appendix_taxonomy.inc.rst

Pedigree PED input file format
-------------------------------

The PED file format is a white space (tab or space) delimited ASCII
file. Lines starting with ``#`` are ignored. It has exactly six required
columns in the following order.

+-------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Column            | Definition                                                                                                                                                        |
+===================+===================================================================================================================================================================+
| *Family ID*       | Alphanumeric ID of a family group. This field is ignored by RTG commands.                                                                                         |
+-------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| *Individual ID*   | Alphanumeric ID of an individual. This corresponds to the Sample ID specified in the read group of the individual (``SM`` field).                                 |
+-------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| *Paternal ID*     | Alphanumeric ID of the paternal parent for the individual. This corresponds to the Sample ID specified in the read group of the paternal parent (``SM`` field).   |
+-------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| *Maternal ID*     | Alphanumeric ID of the maternal parent for the individual. This corresponds to the Sample ID specified in the read group of the maternal parent (``SM`` field).   |
+-------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| *Sex*             | The sex of the individual specified as using 1 for male, 2 for female and any other number as unknown.                                                            |
+-------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| *Phenotype*       | The phenotype of the individual specified using -9 or 0 for unknown, 1 for unaffected and 2 for affected.                                                         |
+-------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+

.. note:: The PED format is based on the PED format defined by
    the PLINK project:
    http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped

The value '0' can be used as a missing value for Family ID, Paternal ID
and Maternal ID.

The following is an example of what a PED file may look like.

.. code-block:: text

  # PED format pedigree
  # fam-id ind-id pat-id mat-id sex phen
  FAM01 NA19238 0 0 2 0
  FAM01 NA19239 0 0 1 0
  FAM01 NA19240 NA19239 NA19238 2 0
  0 NA12878 0 0 2 0

When specifying a pedigree for the ``lineage`` command, use either the
pat-id or mat-id as appropriate to the gender of the sample cell
lineage. The following is an example of what a cell lineage PED file may
look like.

.. code-block:: text

  # PED format pedigree
  # fam-id ind-id pat-id mat-id sex phen
  LIN BASE 0 0 2 0
  LIN GENA 0 BASE 2 0
  LIN GENB 0 BASE 2 0
  LIN GENA-A 0 GENA 2 0

RTG includes commands such as ``pedfilter`` and ``pedstats`` for simple
viewing, filtering and conversion of pedigree files.

RTG commands using indexed input files
--------------------------------------

Several RTG commands require coordinate indexed input files to operate
and several more require them when the ``--region`` or ``--bed-regions``
parameter is used. The index files used are standard tabix or BAM index
files.

The RTG commands which produce the inputs used by these commands will by
default produce them with appropriate index files. To produce indexes
for files from third party sources or RTG command output where the
``--no-index`` or ``--no-gzip`` parameters were set, use the RTG
``bgzip`` and ``index`` commands.

.. only:: core or extra

  .. include:: appendix_output_formats.inc.rst

RTG JavaScript filtering API
----------------------------

The ``vcffilter`` command permits filtering VCF records via user-supplied
JavaScript expressions or scripts containing JavaScript functions that
operate on VCF records. The JavaScript environment has an API provided
that enables convenient access to components of a VCF record in order to
satisfy common use cases.

VCF record field access
~~~~~~~~~~~~~~~~~~~~~~~

This section describes the supported methods to access components of an
individual VCF record. In the following descriptions, assume the input
VCF contains the following excerpt (the full header has been omitted):

.. code-block:: text

  #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12877 NA12878
  1 11259340 . G C,T . PASS DP=795;DPR=0.581;ABC=4.5 GT:DP 1/2:65 1/0:15

CHROM, POS, ID, REF, QUAL
`````````````````````````

Within the context of a ``--keep-expr`` or ``record`` function these
variables will provide access to the String representation of the VCF
column of the same name.

.. code-block:: text

  CHROM; // "1"
  POS; // "11259340"
  REF; // "G"

ALT, FILTER
```````````

Will retrieve an array of the values in the column.

.. code-block:: text

  ALT; // ["C", "T"]
  FILTER; // ["PASS"]

INFO.{INFO\_FIELD}
``````````````````

The values in the ``INFO`` field are accessible through properties on the
``INFO`` object indexed by ``INFO ID``. These properties will be the string
representation of info values with multiple values delimited with "``,``".
Missing fields will be represented by "``.``". Assigning to these
properties will update the VCF record. This will be undefined for fields not
declared in the header.

.. code-block:: text

  INFO.DP; // "795"
  INFO.ABC; // "4,5"

  INFO.DPR = "0.01"; // Will change the value of the DPR info field


{SAMPLE\_NAME}.{FORMAT\_FIELD}
``````````````````````````````

The JavaScript String prototype has been extended to allow access to the
format fields for each sample. The string representation of values in
the sample column are accessible as properties on the string matching
the sample name named after the ``FORMAT`` field ``ID`` These properties can
be assigned in order to make modifications. This will be undefined for fields
not declared in the header.

.. code-block:: text

  'NA12877'.GT; // "1/2"
  'NA12878'.GT; // "1/0"
  'NA12877'.DP = "10"; // Will change the DP field of the NA12877 sample

VCF header modification
~~~~~~~~~~~~~~~~~~~~~~~

Functions are provided that allow the addition of new ``INFO`` or ``FORMAT``
fields to the header and records. It is recommended that the following
functions only be used within the run-once portion of ``--javascript``.
They may be called on every record, but this will be slow.

ensureFormatHeader(FORMAT\_HEADER\_STRING)
``````````````````````````````````````````

Add a new ``FORMAT`` field to the VCF if it is not already present. This
will add a ``FORMAT`` declaration line to the header and define the
corresponding accessor methods for use in record processing.

.. code-block:: text

  ensureFormatHeader('##FORMAT=<ID=GL,Number=G,Type=Float,' +
    'Description="Log_10 scaled genotype likelihoods.">');

ensureInfoHeader(INFO\_HEADER\_STRING)
``````````````````````````````````````

Add a new ``INFO`` field to the VCF if it is not already present. This
will add an ``INFO`` declaration line to the header and define the
corresponding accessor methods for use in record processing.

.. code-block:: text

  ensureInfoHeader('##INFO=<ID=CT,Number=1,Type=Integer,' +
    'Description="Coverage threshold that was applied">');

Additional information and functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

SAMPLES
```````

This variable contains an array of the sample names in the VCF header.

.. code-block:: text

  SAMPLES; // ['NA12877', 'NA12878']

print({STRING})
```````````````

Writes the provided string to standard output.

.. code-block:: text

  print('The samples are: ' + SAMPLES);

.. seealso::
  For javascript filtering usage and examples see :ref:`vcffilter`


.. only:: core or extra

  .. include:: appendix_parallel_processing.inc.rst


Distribution Contents
---------------------

The contents of the RTG distribution zip file should include:

-  The RTG executable JAR file.
-  RTG executable wrapper script.
-  Example scripts and files.
-  This operations manual.
-  A release notes file and a readme file.

Some distributions also include an appropriate java runtime environment
(JRE) for your operating system.

README.txt
----------

For reference purposes, a copy of the distribution ``README.txt`` file
follows:

.. literalinclude:: include-README.txt
  :language: text

Notice
------

Real Time Genomics does not assume any liability arising out of the
application or use of any software described herein. Further, Real Time
Genomics does not convey any license under its patent, trademark,
copyright, or common-law rights nor the similar rights of others.

Real Time Genomics reserves the right to make any changes in any
processes, products, or parts thereof, described herein, without
notice. While every effort has been made to make this guide as complete
and accurate as possible as of the publication date, no warranty of
fitness is implied.

© 2017 Real Time Genomics All rights reserved.

Illumina, Solexa, Complete Genomics, Ion Torrent, Roche, ABI, Life
Technologies, and PacBio are registered trademarks and all other brands
referenced in this document are the property of their respective owners.