Table 7 — Summary of metrics used in the analysis.
Metric name | Units | Description
N50 | — | A weighted median of the lengths of items, equal to the length of the longest item i such that the sum of the lengths of items greater than or equal in length to i is greater than or equal to half the total length of all items. With regard to assemblies, the items are typically contigs or scaffolds.
NG50 | — | Whereas N50 sets the median in relation to the total length of all items in the set, NG50 is normalised by the average length of the α1 and α2 haplotypes. It is thus more reliable than N50 for comparisons between assemblies.
CPNG50 | bp | Contig path NG50. The weighted median of the lengths of contig paths. Contig paths represent maximal subsequences of contigs that are entirely consistent with α1,2.
SPNG50 | bp | Scaffold path NG50. The weighted median of the lengths of scaffold paths. Scaffold paths represent maximal concatenations of contig paths and scaffold breaks that maintain correct order and orientation.
Structural errors | Counts | An error within a contig or scaffold. Errors include intra- and inter-chromosomal joins, insertions, deletions, simultaneous insertions and deletions, and insertions at the ends of assembled sequences.
CC50 | bp | Correct contiguity 50. The empirically sampled distance between two points in an assembly, where the two points are as likely to be correctly aligned as not.
Substitution errors | Counts per correct bit | Number of substitution errors per correct bit. Substitution errors are columns in the alignment where the α1 and α2 haplotypes contain either the same base (homozygous) or different bases (heterozygous) and the alignment contains a base (or IUPAC symbol) different from either α1 or α2. The metric uses a bit score to allow for IUPAC symbols.
Copy number errors | Proportions | For a given haplotype column in the MSA, the copy number of the simulated genome can be described as an interval [min, max]. Assemblies with a copy number outside this interval are classified either as an excess, for being above the interval, or as a deficiency, for being below the interval.
Coverage | Percent | The coverage is the percentage of columns in the MSA of the target (the whole genome, regions of a specific annotation type, etc.) that contain positions of the assembly.
Genic correctness | Percent | The genic correctness is the percentage of base pairs in spliced transcripts from the haplotype sequences that align to the assembly with 95% coverage using WU-BLAST.
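
The N50 and NG50 definitions in Table 7 can be illustrated with a minimal Python sketch. It is not the code used in the analysis: the function names, the example contig lengths, and the haplotype lengths are all hypothetical, and returning 0 when the assembly is too short to reach half of the reference length is a convention assumed here.

```python
from typing import Iterable


def weighted_median_length(lengths: Iterable[int], target_total: float) -> int:
    """Length of the longest item such that all items at least that long
    sum to at least half of target_total. Returns 0 if the items never
    reach that threshold (an assumed convention, not from the table)."""
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running with the total to avoid fractional halves.
        if 2 * running >= target_total:
            return length
    return 0


def n50(lengths: Iterable[int]) -> int:
    """N50: the threshold is half the total length of the items themselves."""
    items = list(lengths)
    return weighted_median_length(items, sum(items))


def ng50(lengths: Iterable[int], haplotype_lengths: Iterable[int]) -> int:
    """NG50: the threshold is half the average haplotype length
    (the alpha-1 and alpha-2 haplotypes in Table 7)."""
    haplotypes = list(haplotype_lengths)
    reference = sum(haplotypes) / len(haplotypes)
    return weighted_median_length(list(lengths), reference)


# Hypothetical contig lengths (bp) and haplotype lengths (bp).
contigs = [80, 70, 50, 40, 30, 20]
haplotypes = (400, 420)

print(n50(contigs))               # threshold is 290 / 2 = 145
print(ng50(contigs, haplotypes))  # threshold is 410 / 2 = 205
```

Because the NG50 threshold depends only on the haplotype lengths, two assemblies of different total size are measured against the same target. This is why the table describes NG50 as more reliable than N50 for comparisons between assemblies.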