Introduction

This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline.

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

Pipeline overview

The pipeline is built using Nextflow and processes data using the following steps:

General Tools

FastQC

Output files
  • fastqc/
    • *_fastqc.html: FastQC report containing quality metrics.
    • *_fastqc.zip: Zip archive containing the FastQC report, tab-delimited data file and plot images.

FastQC gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the FastQC help pages.

NB: The FastQC plots displayed in the MultiQC report show untrimmed reads. They may contain adapter sequence and regions of low quality.
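
As a quick way to check the per-module results without opening the HTML report, the zip archive can be read directly. The sketch below assumes the archive contains a summary.txt file whose lines are tab-separated as STATUS, MODULE, FILENAME; the function name read_fastqc_summary is illustrative and not part of the pipeline.

```python
import io
import zipfile


def read_fastqc_summary(zip_source):
    """Return {module name: status} from the summary.txt inside a FastQC zip.

    zip_source may be a path or a binary file-like object. Each line of
    summary.txt is assumed to be: STATUS<TAB>MODULE<TAB>FILENAME.
    """
    statuses = {}
    with zipfile.ZipFile(zip_source) as archive:
        # The archive holds one top-level folder, so locate the file by suffix.
        name = next(n for n in archive.namelist() if n.endswith("summary.txt"))
        with archive.open(name) as handle:
            for raw in io.TextIOWrapper(handle, encoding="utf-8"):
                fields = raw.rstrip("\n").split("\t")
                if len(fields) >= 2:
                    statuses[fields[1]] = fields[0]
    return statuses
```

Calling read_fastqc_summary on a path matching fastqc/*_fastqc.zip returns a dictionary that can be filtered for modules whose status is not PASS.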

TrimGalore

Output files
  • trimgalore/
    • *_trimming_report.txt: TrimGalore trimming report.
    • fastqc/*_fastqc.zip: Zip archive containing the FastQC report, tab-delimited data file and plot images.
    • fastqc/*_fastqc.html: FastQC report containing quality metrics.

TrimGalore combines the trimming tool Cutadapt, which removes adapter sequences, primers and other unwanted sequences, with the quality control tool FastQC.

BWA

BWA is a software package for mapping low-divergent sequences against a large reference genome.

The resulting alignment files are intermediate and are not kept in the final files delivered to users.

Output files

Output directory: results/Reports/[SAMPLE]/SamToolsStats

  • [SAMPLE].bam
    • Alignment file containing information about the read alignment to the reference genome

Samtools

samtools stats

samtools stats collects statistics from BAM files and outputs in a text format.

Output files

Output directory: results/Reports/[SAMPLE]/SamToolsStats

  • [SAMPLE].bam.samtools.stats.out
    • Raw statistics used by MultiQC

Plots will show:

  • Alignment metrics.

For further reading and documentation see the samtools manual.
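
The raw statistics file can also be inspected without MultiQC. The following is a minimal sketch that collects the summary rows, assuming they are the lines starting with "SN" followed by a tab-separated key (ending in a colon) and value; parse_samtools_summary is an illustrative name.

```python
def parse_samtools_summary(lines):
    """Collect the 'SN' (summary numbers) rows of samtools stats output.

    Rows are assumed to look like: SN<TAB>key:<TAB>value[<TAB># comment].
    """
    summary = {}
    for line in lines:
        if not line.startswith("SN\t"):
            continue  # skip comments and the other sections of the file
        fields = line.rstrip("\n").split("\t")
        key = fields[1].rstrip(":")
        value = fields[2]
        # Prefer int, fall back to float, and keep the raw string otherwise.
        for cast in (int, float):
            try:
                value = cast(fields[2])
                break
            except ValueError:
                continue
        summary[key] = value
    return summary
```

For example, passing an open handle on [SAMPLE].bam.samtools.stats.out gives a dictionary from which individual alignment metrics can be looked up by name.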

Mark Duplicates

GATK MarkDuplicates

By default, circdna will use GATK MarkDuplicates, which locates and tags duplicate reads in a BAM or SAM file, where duplicate reads are defined as originating from a single fragment of DNA.

Output files

Output directory: results/markduplicates/bam

  • [SAMPLE].md.bam and [SAMPLE].md.bai
    • BAM file and index

For further reading and documentation see the data pre-processing for variant discovery from the GATK best practices.

Samtools view - Duplicates Filtering

By default, circdna removes all duplicates marked by GATK MarkDuplicates using samtools view.

Output files

Output directory: results/markduplicates/duplicates_removed

  • [SAMPLE].md.filtered.sorted.bam and [SAMPLE].md.filtered.sorted.bai
    • BAM file and index
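
To illustrate what this filtering means at the record level: the SAM specification marks a duplicate read by setting bit 0x400 (decimal 1024) in the FLAG field. The sketch below applies that test to SAM-formatted text lines. It only shows the flag logic; the exact samtools view options used by the pipeline are not stated here, and the function names are illustrative.

```python
DUPLICATE_FLAG = 0x400  # SAM FLAG bit for "PCR or optical duplicate"


def is_duplicate(flag):
    """Return True if the SAM FLAG value has the duplicate bit set."""
    return bool(flag & DUPLICATE_FLAG)


def drop_duplicates(sam_lines):
    """Yield header lines and all alignment lines not marked as duplicates."""
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # header lines are passed through unchanged
            continue
        # FLAG is the second tab-separated column of an alignment line.
        flag = int(line.split("\t", 2)[1])
        if not is_duplicate(flag):
            yield line
```

A read with FLAG 1123 (1024 + 99) would be dropped, while the same read with FLAG 99 would be kept.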

MultiQC

Output files
  • multiqc/
    • multiqc_report.html: a standalone HTML file that can be viewed in your web browser.
    • multiqc_data/: directory containing parsed statistics from the different tools used in the pipeline.
    • multiqc_plots/: directory containing static images from the report in various formats.

MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.

Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.

circdna branches

Branch: circle_finder

Circle_finder

Output files

Output directory: results/circlefinder/

  • [SAMPLE].microDNA-JT.txt
    • BED file containing information about putative circular DNA regions

Circle_finder identifies putative circular DNA junctions from paired-end sequencing data. Circle_finder uses split and discordant read information to identify junctions that could be generated through the formation of ecDNAs. For more information please see Circle_finder.

Branch: circexplorer2

CIRCexplorer2

CIRCexplorer2 identifies putative circular DNA junctions from paired-end sequencing data. CIRCexplorer2 was developed to identify circular RNAs from RNA-seq data. However, it can also be used to call putative circular DNAs from genomic data. For more information see the CIRCexplorer2 docs.

Output files

Output directory: results/circexplorer2/

  • [SAMPLE].circexplorer_circdna.bed
    • BED file containing information about putative circular DNA regions
  • [SAMPLE].CIRCexplorer2_parse.log
    • log file

Branch: circle_map_realign

circle_map_realign uses the functionality of Circle-Map to call putative circular DNAs from mappable regions. To identify circular DNAs, it uses information from split and discordant reads, together with realignment steps that determine the exact breakpoint of each circular DNA. For more information, please see the original paper or the GitHub page.

Circle-Map Readextractor

Circle-Map Readextractor extracts read candidates for circular DNA identification.

Output files

Output directory: results/circlemap/readextractor

  • [SAMPLE].qname.sorted.circular_read_candidates.bam
    • BAM file containing candidate reads

Circle-Map Realign

Circle-Map Realign detects putative circular DNA junctions from read candidates extracted by Circle-Map Readextractor.

Output files

Output directory: results/circlemap/realign

  • [SAMPLE]_circularDNA_coordinates.bed
    • BED file containing information about putative circular DNA regions
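
The BED files produced by the different branches can be summarised with a few lines of code. This is a minimal sketch that relies only on the general BED convention (the first three tab-separated columns are chromosome, 0-based start and end); any further columns are tool-specific and are ignored. The function names are illustrative.

```python
def read_bed_intervals(lines):
    """Return (chrom, start, end) tuples from BED-formatted lines."""
    intervals = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and the optional comment/track/browser lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split("\t")[:3]
        intervals.append((chrom, int(start), int(end)))
    return intervals


def interval_lengths(intervals):
    """Length of each interval under BED's half-open coordinate convention."""
    return [end - start for _, start, end in intervals]
```

Applied to a file such as [SAMPLE]_circularDNA_coordinates.bed, this gives the size distribution of the putative circular DNA regions.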

Branch: circle_map_repeats

Circle-Map Readextractor

Circle-Map Readextractor extracts read candidates for circular DNA identification.

Output files

Output directory: results/circlemap/readextractor

  • [SAMPLE].qname.sorted.circular_read_candidates.bam
    • BAM file containing candidate reads

Circle-Map Repeats

Circle-Map Repeats identifies the chromosomal coordinates of repetitive circular DNAs.

Output files

Output directory: results/circlemap/repeats

  • [SAMPLE]_circularDNA_repeats_coordinates.bed
    • BED file containing information about repetitive circular DNAs

Branch: unicycler

This branch uses Unicycler to de novo assemble circular DNAs, in combination with the long-read mapping capabilities of Minimap2, to identify the origin of the circular DNAs.

Unicycler

Unicycler was originally built as an assembly pipeline for bacterial genomes. In nf-core/circdna it is used to de novo assemble circular DNAs.

Output files

Output directory: results/unicycler/

  • [SAMPLE].assembly.gfa.gz
    • GFA file containing the de novo assembled sequences
  • [SAMPLE].assembly.scaffolds.fa.gz
    • FASTA file containing the de novo assembled sequences, with information on whether each assembled sequence forms a circular contig, i.e. originated from a circular DNA.

Minimap2

Minimap2 takes the circular DNA sequences identified by Unicycler and maps them to the given reference genome.

Output files

Output directory: results/unicycler/minimap2

  • [SAMPLE].paf
    • paf file containing mapping information of circular DNA sequences
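
The mapping information can be read programmatically. The sketch below assumes the standard PAF layout of twelve mandatory tab-separated columns (query name, length, start, end, strand, target name, length, start, end, number of matches, alignment block length, mapping quality); optional tag columns after these are ignored. PafRecord, parse_paf_line and query_coverage are illustrative names.

```python
from typing import NamedTuple


class PafRecord(NamedTuple):
    query_name: str
    query_length: int
    query_start: int
    query_end: int
    strand: str
    target_name: str
    target_length: int
    target_start: int
    target_end: int
    matches: int
    block_length: int
    mapq: int


def parse_paf_line(line):
    """Parse the twelve mandatory PAF columns of one line."""
    f = line.rstrip("\n").split("\t")
    return PafRecord(
        f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
        f[5], int(f[6]), int(f[7]), int(f[8]),
        int(f[9]), int(f[10]), int(f[11]),
    )


def query_coverage(record):
    """Fraction of the assembled sequence covered by this alignment."""
    return (record.query_end - record.query_start) / record.query_length
```

The target name and coordinates of each record indicate where on the reference genome an assembled circular sequence maps.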

Branch: ampliconarchitect

The ampliconarchitect branch is only usable with WGS data. It uses PrepareAA to collect amplified seeds from copy number calls, which are then fed to AmpliconArchitect to characterise the amplicons in each sample.

CNVkit

CNVkit uses alignment information to make copy number calls. These copy number calls are used by AmpliconArchitect to identify circular and other types of amplicons. The copy number calls are then connected to seeds and filtered based on the copy number threshold using scripts provided by PrepareAA.

Output files

Output directory: results/ampliconarchitect/cnvkit

  • [SAMPLE]_CNV_GAIN.bed
    • bed file containing filtered Copy Number calls
  • [SAMPLE]_AA_CNV_SEEDS.bed
    • bed file containing filtered and connected amplified regions (seeds). This is used as input for AmpliconArchitect
  • [SAMPLE].cnvkit.segment.cns
    • cns file containing copy number calls of CNVkit segment.
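
To give a feel for copy number filtering, the sketch below converts the log2 ratios of a .cns file into estimated copy numbers and keeps segments above a cut-off. It is not the PrepareAA code used by the pipeline. It assumes a tab-separated file with a header containing the columns chromosome, start, end and log2, and a diploid reference; min_copy_number is a hypothetical parameter, since the threshold used by the pipeline is not stated here.

```python
import csv


def segments_above_copy_number(lines, min_copy_number, ploidy=2):
    """Yield (chromosome, start, end, copy_number) for amplified segments.

    Assumes header columns named 'chromosome', 'start', 'end' and 'log2'.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    for row in reader:
        # A log2 ratio r against a reference of the given ploidy corresponds
        # to an estimated copy number of ploidy * 2**r.
        copy_number = ploidy * 2 ** float(row["log2"])
        if copy_number >= min_copy_number:
            yield row["chromosome"], int(row["start"]), int(row["end"]), copy_number
```

With a diploid reference, a segment with a log2 ratio of 1.0 corresponds to an estimated copy number of 4, and a ratio of 0.0 to a copy number of 2.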

AmpliconArchitect

AmpliconArchitect uses amplicon seeds provided by CNVkit and PrepareAA to identify different types of amplicons in each sample.

Output files

Output directory: results/ampliconarchitect/ampliconarchitect

  • amplicons/[SAMPLE]_[AMPLICONID]_cycles.txt
    • txt file describing the amplicon segments
  • amplicons/[SAMPLE]_[AMPLICONID]_graph.txt
    • txt file describing the amplicon graph
  • cnseg/[SAMPLE]_[SEGMENT]_graph.txt
    • txt file describing the copy number segmentation file
  • summary/[SAMPLE]_summary.txt
    • txt file describing each amplicon with regard to breakpoints, composition, oncogene content and copy number
  • sv_view/[SAMPLE]_[AMPLICONID].{png,pdf}
    • png or pdf file displaying the amplicon rearrangement signature

AmpliconClassifier

AmpliconClassifier classifies each amplicon by using the cycles and the graph files generated by AmpliconArchitect.

Output files

Output directory: results/ampliconarchitect/ampliconclassifier

  • input/[SAMPLE].AmpliconClassifier.input
    • txt file containing the input used for AmpliconClassifier and AmpliconSimilarity.
  • classification/[SAMPLE]_amplicon_classification_profiles.tsv
    • tsv file describing the amplicon class of each amplicon in the sample.
  • ecDNA_counts/[SAMPLE]_ecDNA_counts.tsv
    • tsv file describing if an amplicon is circular [1 = circular, 0 = non-circular].
  • gene_list/[SAMPLE]_gene_list.tsv
    • tsv file detailing the genes on each amplicon.
  • log/[SAMPLE].classifier_stdout.log
    • log file
  • similarity/[SAMPLE]_similarity_scores.tsv
    • tsv file containing amplicon similarity scores calculated by AmpliconSimilarity.
  • bed/[SAMPLE]_amplicon[AMPLICONID]_[CLASSIFICATION]_[ID]_intervals.bed
    • bed files containing information about the intervals on each amplicon. Intervals labelled unknown were not identified as being located on the respective amplicon.

AmpliconArchitect Summary

The Summary script merges the output of AmpliconArchitect and AmpliconClassifier to give full information about each amplicon in a sample. Please refer to AmpliconClassifier for more accurate ecDNA interval calling. Some intervals classified in the AmpliconArchitect and Summary output are not located on ecDNAs.

Output files

Output directory: results/ampliconarchitect/summary

  • [SAMPLE].aa_results_summary.tsv
    • tsv file containing the merged results.

Pipeline information

Output files
  • pipeline_info/
    • Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
    • Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
    • Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.

Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
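
When troubleshooting, a quick tally of task outcomes from the trace file is often a useful first step. The sketch below assumes execution_trace.txt is tab-separated with a header row that includes a status column (with values such as COMPLETED or FAILED); the function name count_task_statuses is illustrative.

```python
import csv
from collections import Counter


def count_task_statuses(lines):
    """Count tasks per value of the 'status' column in a Nextflow trace file."""
    reader = csv.DictReader(lines, delimiter="\t")
    return Counter(row["status"] for row in reader)
```

Opening pipeline_info/execution_trace.txt and passing the handle to count_task_statuses shows at a glance whether any tasks did not complete.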