Appendix
========

.. only:: core or extra

   .. include:: appendix_alignment.inc.rst

RTG reference file format
--------------------------

Many RTG commands can make use of additional information about the structure of a reference genome, such as expected ploidy, sex chromosomes, location of PAR regions, etc. When appropriate, this information may be stored inside a reference genome's SDF directory in a file called ``reference.txt``.

The ``format`` command will automatically identify several common reference genomes during formatting and will create a ``reference.txt`` in the resulting SDF. However, for non-human reference genomes, or less common human reference genomes, a pre-built reference configuration file may not be available, and will need to be manually provided in order to make use of RTG sex-aware pipeline features.

Several example ``reference.txt`` files for different human reference versions are included as part of the RTG distribution in the ``scripts`` subdirectory, so for common reference versions it will suffice to copy the appropriate example file into the formatted reference SDF with the name ``reference.txt``, or use one of these example files as the basis for your specific reference genome.

To see how a reference text file will be interpreted for the chromosomes in an SDF for a given sex, use the ``sdfstats`` command with the ``--sex`` flag. For example:

.. code-block:: text

   $ rtg sdfstats --sex male /data/human/ref/hg19

   Location : /data/human/ref/hg19
   Parameters : format -o /data/human/ref/hg19 -I chromosomes.txt
   SDF Version : 11
   Type : DNA
   Source : UNKNOWN
   Paired arm : UNKNOWN
   SDF-ID : b6318de1-8107-4b11-bdd9-fb8b6b34c5d0
   Number of sequences : 25
   Maximum length : 249250621
   Minimum length : 16571
   Sequence names : yes
   N : 234350281
   A : 844868045
   C : 585017944
   G : 585360436
   T : 846097277
   Total residues : 3095693983
   Residue qualities : no

   Sequences for sex=MALE:
   chrM POLYPLOID circular 16571
   chr1 DIPLOID linear 249250621
   chr2 DIPLOID linear 243199373
   chr3 DIPLOID linear 198022430
   chr4 DIPLOID linear 191154276
   chr5 DIPLOID linear 180915260
   chr6 DIPLOID linear 171115067
   chr7 DIPLOID linear 159138663
   chr8 DIPLOID linear 146364022
   chr9 DIPLOID linear 141213431
   chr10 DIPLOID linear 135534747
   chr11 DIPLOID linear 135006516
   chr12 DIPLOID linear 133851895
   chr13 DIPLOID linear 115169878
   chr14 DIPLOID linear 107349540
   chr15 DIPLOID linear 102531392
   chr16 DIPLOID linear 90354753
   chr17 DIPLOID linear 81195210
   chr18 DIPLOID linear 78077248
   chr19 DIPLOID linear 59128983
   chr20 DIPLOID linear 63025520
   chr21 DIPLOID linear 48129895
   chr22 DIPLOID linear 51304566
   chrX HAPLOID linear 155270560 ~=chrY
       chrX:60001-2699520 chrY:10001-2649520
       chrX:154931044-155260560 chrY:59034050-59363566
   chrY HAPLOID linear 59373566 ~=chrX
       chrX:60001-2699520 chrY:10001-2649520
       chrX:154931044-155260560 chrY:59034050-59363566

The reference file is primarily intended for XY sex determination but should be able to handle ZW and X0 sex determination also.

The following describes the reference file text format in more detail. The file contains lines with TAB separated fields describing the properties of the chromosomes. Comments within the ``reference.txt`` file are preceded by the character ``#``. The first line of the file that is not a comment or blank must be the version line.

.. code-block:: text

   version 1

The remaining lines have the following common structure:

.. code-block:: text

   <sex> <line-type> <line-setting>...

The *sex* field is one of ``male``, ``female`` or ``either``. The *line-type* field is one of ``def`` for default sequence settings, ``seq`` for specific chromosomal sequence settings and ``dup`` for defining pseudo-autosomal regions. The *line-setting* fields are a variable number of fields based on the line type given.

The default sequence settings line can only be specified with ``either`` for the sex field, can only be specified once and must be specified if there are not individual chromosome settings for all chromosomes and other contigs. It is specified with the following structure:

.. code-block:: text

   either def <ploidy> <shape>

The *ploidy* field is one of ``diploid``, ``haploid``, ``polyploid`` or ``none``. The *shape* field is one of ``circular`` or ``linear``.

The specific chromosome settings lines are similar to the default chromosome settings lines. All the sex field options can be used; however, for any one chromosome you can specify only a single line for ``either`` or two lines, one each for ``male`` and ``female``. They are specified with the following structure:

.. code-block:: text

   <sex> seq <chromosome-name> <ploidy> <shape> [allosome]

The *ploidy* and *shape* fields are the same as for the default chromosome settings line. The *chromosome-name* field is the name of the chromosome to which the line applies. The *allosome* field is optional and is used to specify the allosome pair of a haploid chromosome.

The pseudo-autosomal region settings line can be set with any of the *sex* field options and any number of the lines can be defined as necessary. It has the following format:

.. code-block:: text

   <sex> dup <region> <region>

The regions must be taken from two haploid chromosomes for a given sex, have the same length and not go past the end of the chromosome. The regions are given in the format ``<chromosome-name>:<start>-<end>`` where start and end are positions counting from one and the end is non-inclusive.

An example for the HG19 human reference:

.. code-block:: text

   # Reference specification for hg19, see
   # http://genome.ucsc.edu/cgi-bin/hgTracks?hgsid=184117983&chromInfoPage=
   version 1
   # Unless otherwise specified, assume diploid linear. Well-formed
   # chromosomes should be explicitly listed separately so this
   # applies primarily to unplaced contigs and decoy sequences
   either def diploid linear
   # List the autosomal chromosomes explicitly. These are used to help
   # determine "normal" coverage levels during mapping and variant calling
   either seq chr1 diploid linear
   either seq chr2 diploid linear
   either seq chr3 diploid linear
   either seq chr4 diploid linear
   either seq chr5 diploid linear
   either seq chr6 diploid linear
   either seq chr7 diploid linear
   either seq chr8 diploid linear
   either seq chr9 diploid linear
   either seq chr10 diploid linear
   either seq chr11 diploid linear
   either seq chr12 diploid linear
   either seq chr13 diploid linear
   either seq chr14 diploid linear
   either seq chr15 diploid linear
   either seq chr16 diploid linear
   either seq chr17 diploid linear
   either seq chr18 diploid linear
   either seq chr19 diploid linear
   either seq chr20 diploid linear
   either seq chr21 diploid linear
   either seq chr22 diploid linear
   # Define how the male and female get the X and Y chromosomes
   male seq chrX haploid linear chrY
   male seq chrY haploid linear chrX
   female seq chrX diploid linear
   female seq chrY none linear
   #PAR1 pseudoautosomal region
   male dup chrX:60001-2699520 chrY:10001-2649520
   #PAR2 pseudoautosomal region
   male dup chrX:154931044-155260560 chrY:59034050-59363566
   # And the mitochondria
   either seq chrM polyploid circular

As of the current version of the RTG software the following are the effects of various settings in the ``reference.txt`` file when processing a sample with the matching sex.

A ploidy setting of ``none`` will prevent reads from mapping to that chromosome and any variant calling from being done in that chromosome.
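As a rough illustration of how the per-sex lines in ``reference.txt`` resolve to a ploidy for each sequence (including the ``none`` case just described), the following Python sketch parses the format and looks up the effective setting. It is not part of the RTG distribution: the names ``Reference``, ``parse_reference`` and ``check`` are hypothetical, and it assumes that a ``#`` anywhere on a line starts a comment and that the version line may be written with or without whitespace before the ``1``.

```python
from typing import Dict, List, Optional, Tuple

SEXES = {"male", "female", "either"}
PLOIDIES = {"diploid", "haploid", "polyploid", "none"}
SHAPES = {"circular", "linear"}

SeqEntry = Tuple[str, str, Optional[str]]  # (ploidy, shape, allosome)


class Reference:
    """Parsed reference.txt settings (hypothetical helper, not an RTG API)."""

    def __init__(self) -> None:
        self.default: Optional[Tuple[str, str]] = None       # from the def line
        self.sequences: Dict[str, Dict[str, SeqEntry]] = {}  # name -> sex -> entry
        self.duplicates: List[Tuple[str, str, str]] = []     # (sex, region, region)

    def ploidy(self, name: str, sex: str) -> str:
        """Return the ploidy for a sequence, falling back to the def line."""
        by_sex = self.sequences.get(name, {})
        entry = by_sex.get(sex) or by_sex.get("either")
        if entry is not None:
            return entry[0]
        if self.default is None:
            raise KeyError("no settings for " + name)
        return self.default[0]


def check(value: str, allowed: set, what: str) -> str:
    """Validate a field against its permitted values."""
    if value not in allowed:
        raise ValueError("bad %s: %r" % (what, value))
    return value


def parse_reference(text: str) -> Reference:
    """Parse the TAB separated reference.txt format described above."""
    ref = Reference()
    seen_version = False
    for raw in text.splitlines():
        # Assumption: everything from '#' onward is a comment.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        if not seen_version:
            # Assumption: accept "version 1" with or without the space.
            if "".join(line.split()) != "version1":
                raise ValueError("expected version line, got %r" % line)
            seen_version = True
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError("too few fields: %r" % line)
        sex = check(fields[0], SEXES, "sex")
        line_type, settings = fields[1], fields[2:]
        if line_type == "def":
            if sex != "either" or ref.default is not None or len(settings) != 2:
                raise ValueError("invalid def line: %r" % line)
            ref.default = (check(settings[0], PLOIDIES, "ploidy"),
                           check(settings[1], SHAPES, "shape"))
        elif line_type == "seq":
            if len(settings) not in (3, 4):
                raise ValueError("invalid seq line: %r" % line)
            allosome = settings[3] if len(settings) == 4 else None
            ref.sequences.setdefault(settings[0], {})[sex] = (
                check(settings[1], PLOIDIES, "ploidy"),
                check(settings[2], SHAPES, "shape"),
                allosome,
            )
        elif line_type == "dup":
            if len(settings) != 2:
                raise ValueError("invalid dup line: %r" % line)
            ref.duplicates.append((sex, settings[0], settings[1]))
        else:
            raise ValueError("unknown line type: %r" % line_type)
    return ref


EXAMPLE = "\n".join([
    "# cut-down example based on the hg19 file above",
    "version 1",
    "either\tdef\tdiploid\tlinear",
    "male\tseq\tchrX\thaploid\tlinear\tchrY",
    "male\tseq\tchrY\thaploid\tlinear\tchrX",
    "female\tseq\tchrX\tdiploid\tlinear",
    "female\tseq\tchrY\tnone\tlinear",
    "male\tdup\tchrX:60001-2699520\tchrY:10001-2649520",
    "either\tseq\tchrM\tpolyploid\tcircular",
])

reference = parse_reference(EXAMPLE)
print(reference.ploidy("chrY", "female"))  # none
print(reference.ploidy("chrX", "male"))    # haploid
print(reference.ploidy("chr1", "male"))    # diploid, taken from the def line
```

The lookup prefers a sex-specific ``seq`` line, then an ``either`` line, then the ``def`` line, which mirrors the rule that the default line is only required when some sequences have no settings of their own. To see how RTG itself interprets a file, use ``sdfstats --sex``.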
A ploidy setting of ``diploid``, ``haploid`` or ``polyploid`` does not currently affect the output of mapping.

A ploidy setting of ``diploid`` will treat the chromosome as having two distinct copies during variant calling, meaning that both homozygous and heterozygous diploid genotypes may be called for the chromosome.

A ploidy setting of ``haploid`` will treat the chromosome as having one copy during variant calling, meaning that only haploid genotypes will be called for the chromosome.

A ploidy setting of ``polyploid`` will treat the chromosome as having one copy during variant calling, meaning that only haploid genotypes will be called for the chromosome. For variant calling with a pedigree, maternal inheritance is assumed for polyploid sequences.

The shape of the chromosome does not currently affect the output of mapping or variant calling.

The allosome pairs do not currently affect the output of mapping or variant calling (but are used by simulated data generation commands).

The pseudo-autosomal regions will cause the second half of the region pair to be skipped during mapping. During variant calling the first half of the region pair will be called as diploid and the second half will not have calls made for it. For the example ``reference.txt`` provided earlier this means that when mapping a male the X chromosome sections of the pseudo-autosomal regions will be mapped to exclusively and for variant calling the X chromosome sections will be called as diploid while the Y chromosome sections will be skipped. There may be some edge effects up to a read length either side of a pseudo-autosomal region boundary.

.. only:: core or extra

   .. include:: appendix_taxonomy.inc.rst

Pedigree PED input file format
-------------------------------

The PED file format is a white space (tab or space) delimited ASCII file. Lines starting with ``#`` are ignored. It has exactly six required columns in the following order.

+-------------------+------------------------------------------------------------------------+
| Column            | Definition                                                             |
+===================+========================================================================+
| *Family ID*       | Alphanumeric ID of a family group. This field is ignored by RTG        |
|                   | commands.                                                              |
+-------------------+------------------------------------------------------------------------+
| *Individual ID*   | Alphanumeric ID of an individual. This corresponds to the Sample ID    |
|                   | specified in the read group of the individual (``SM`` field).          |
+-------------------+------------------------------------------------------------------------+
| *Paternal ID*     | Alphanumeric ID of the paternal parent for the individual. This        |
|                   | corresponds to the Sample ID specified in the read group of the        |
|                   | paternal parent (``SM`` field).                                        |
+-------------------+------------------------------------------------------------------------+
| *Maternal ID*     | Alphanumeric ID of the maternal parent for the individual. This        |
|                   | corresponds to the Sample ID specified in the read group of the        |
|                   | maternal parent (``SM`` field).                                        |
+-------------------+------------------------------------------------------------------------+
| *Sex*             | The sex of the individual, specified using 1 for male, 2 for female    |
|                   | and any other number for unknown.                                      |
+-------------------+------------------------------------------------------------------------+
| *Phenotype*       | The phenotype of the individual, specified using -9 or 0 for unknown,  |
|                   | 1 for unaffected and 2 for affected.                                   |
+-------------------+------------------------------------------------------------------------+

.. note:: The PED format is based on the PED format defined by the PLINK project:
   http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped

The value '0' can be used as a missing value for Family ID, Paternal ID and Maternal ID.

The following is an example of what a PED file may look like.

.. code-block:: text

   # PED format pedigree
   # fam-id ind-id pat-id mat-id sex phen
   FAM01 NA19238 0 0 2 0
   FAM01 NA19239 0 0 1 0
   FAM01 NA19240 NA19239 NA19238 2 0
   0 NA12878 0 0 2 0

When specifying a pedigree for the ``lineage`` command, use either the pat-id or mat-id as appropriate to the gender of the sample cell lineage. The following is an example of what a cell lineage PED file may look like.

.. code-block:: text

   # PED format pedigree
   # fam-id ind-id pat-id mat-id sex phen
   LIN BASE 0 0 2 0
   LIN GENA 0 BASE 2 0
   LIN GENB 0 BASE 2 0
   LIN GENA-A 0 GENA 2 0

RTG includes commands such as ``pedfilter`` and ``pedstats`` for simple viewing, filtering and conversion of pedigree files.

RTG commands using indexed input files
--------------------------------------

Several RTG commands require coordinate indexed input files to operate and several more require them when the ``--region`` or ``--bed-regions`` parameter is used. The index files used are standard tabix or BAM index files.

The RTG commands which produce the inputs used by these commands will by default produce them with appropriate index files.
To produce indexes for files from third party sources or RTG command output where the ``--no-index`` or ``--no-gzip`` parameters were set, use the RTG ``bgzip`` and ``index`` commands.

.. only:: core or extra

   .. include:: appendix_output_formats.inc.rst

RTG JavaScript filtering API
----------------------------

The ``vcffilter`` command permits filtering VCF records via user-supplied JavaScript expressions or scripts containing JavaScript functions that operate on VCF records. The JavaScript environment has an API provided that enables convenient access to components of a VCF record in order to satisfy common use cases.

VCF record field access
~~~~~~~~~~~~~~~~~~~~~~~

This section describes the supported methods to access components of an individual VCF record. In the following descriptions, assume the input VCF contains the following excerpt (the full header has been omitted):

.. code-block:: text

   #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12877 NA12878
   1 11259340 . G C,T . PASS DP=795;DPR=0.581;ABC=4,5 GT:DP 1/2:65 1/0:15

CHROM, POS, ID, REF, QUAL
`````````````````````````

Within the context of a ``--keep-expr`` or ``record`` function these variables will provide access to the String representation of the VCF column of the same name.

.. code-block:: text

   CHROM; // "1"
   POS; // "11259340"
   REF; // "G"

ALT, FILTER
```````````

These variables retrieve an array of the values in the column.

.. code-block:: text

   ALT; // ["C", "T"]
   FILTER; // ["PASS"]

INFO.{INFO\_FIELD}
``````````````````

The values in the ``INFO`` field are accessible through properties on the ``INFO`` object indexed by ``INFO ID``. These properties will be the string representation of info values with multiple values delimited with "``,``". Missing fields will be represented by "``.``". Assigning to these properties will update the VCF record. A property will be undefined for fields not declared in the header.

.. code-block:: text

   INFO.DP; // "795"
   INFO.ABC; // "4,5"
   INFO.DPR = "0.01"; // Will change the value of the DPR info field

{SAMPLE\_NAME}.{FORMAT\_FIELD}
``````````````````````````````

The JavaScript String prototype has been extended to allow access to the format fields for each sample. The string representations of values in the sample column are accessible as properties, named after the ``FORMAT`` field ``ID``, on the string matching the sample name. These properties can be assigned in order to make modifications. A property will be undefined for fields not declared in the header.

.. code-block:: text

   'NA12877'.GT; // "1/2"
   'NA12878'.GT; // "1/0"
   'NA12877'.DP = "10"; // Will change the DP field of the NA12877 sample

VCF header modification
~~~~~~~~~~~~~~~~~~~~~~~

Functions are provided that allow the addition of new ``INFO`` or ``FORMAT`` fields to the header and records. It is recommended that the following functions only be used within the run-once portion of ``--javascript``. They may be called on every record, but this will be slow.

ensureFormatHeader(FORMAT\_HEADER\_STRING)
``````````````````````````````````````````

Add a new ``FORMAT`` field to the VCF if it is not already present. This will add a ``FORMAT`` declaration line to the header and define the corresponding accessor methods for use in record processing.

.. code-block:: text

   ensureFormatHeader('##FORMAT=');

ensureInfoHeader(INFO\_HEADER\_STRING)
``````````````````````````````````````

Add a new ``INFO`` field to the VCF if it is not already present. This will add an ``INFO`` declaration line to the header and define the corresponding accessor methods for use in record processing.

.. code-block:: text

   ensureInfoHeader('##INFO=');

Additional information and functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

SAMPLES
```````

This variable contains an array of the sample names in the VCF header.

.. code-block:: text

   SAMPLES; // ['NA12877', 'NA12878']

print({STRING})
```````````````

Writes the provided string to standard output.

.. code-block:: text

   print('The samples are: ' + SAMPLES);

.. seealso::
   For JavaScript filtering usage and examples see :ref:`vcffilter`

.. only:: core or extra

   .. include:: appendix_parallel_processing.inc.rst

Distribution Contents
---------------------

The contents of the RTG distribution zip file should include:

- The RTG executable JAR file.
- RTG executable wrapper script.
- Example scripts and files.
- This operations manual.
- A release notes file and a readme file.

Some distributions also include an appropriate Java Runtime Environment (JRE) for your operating system.

README.txt
----------

For reference purposes, a copy of the distribution ``README.txt`` file follows:

.. literalinclude:: include-README.txt
   :language: text

Notice
------

Real Time Genomics does not assume any liability arising out of the application or use of any software described herein. Further, Real Time Genomics does not convey any license under its patent, trademark, copyright, or common-law rights nor the similar rights of others.

Real Time Genomics reserves the right to make any changes in any processes, products, or parts thereof, described herein, without notice. While every effort has been made to make this guide as complete and accurate as possible as of the publication date, no warranty of fitness is implied.

© 2017 Real Time Genomics All rights reserved.

Illumina, Solexa, Complete Genomics, Ion Torrent, Roche, ABI, Life Technologies, and PacBio are registered trademarks and all other brands referenced in this document are the property of their respective owners.