Alphaherpesvirus Genomics: Past, Present and Future

Alphaherpesviruses, as large double-stranded DNA viruses, were long considered to be genetically stable and to exist in a homogeneous state. Recently, the proliferation of high-throughput sequencing (HTS) and bioinformatics analysis has expanded our understanding of herpesvirus genomes and the variations found therein. Recent data indicate that herpesviruses exist as diverse populations, both in culture and in vivo, in a manner reminiscent of RNA viruses. In this chapter, we discuss the past, present, and potential future of alphaherpesvirus genomics, including the technical challenges that face the field. We also review how recent data have enabled genome-wide comparisons of sequence diversity, recombination, allele frequency, and selective pressures, including those introduced by cell culture. While we focus on the human alphaherpesviruses, we draw key insights from related veterinary species and from the beta- and gamma-subfamilies of herpesviruses. Promising technologies and potential future directions for herpesvirus genomics are highlighted as well, including the potential to link viral genetic differences to phenotypic and disease outcomes.


Introduction
Herpesviruses are ubiquitous worldwide, with nine species that infect humans and dozens more infecting other hosts (Davison, 2010). The lifelong nature of herpesvirus infection creates a burden for individual health as well as for public health policy (Gantt et al., 2016;Jansen et al., 2016;Looker et al., 2015a, 2015b). They also have a global impact through their effects on veterinary species involved in food production and those kept as companion animals (Davison, 2010;Loncoman et al., 2017). Within the Herpesviridae family, the alphaherpesvirinae subfamily is characterized by common genomic features, and epithelial and mucosal sites of active replication (Pellett and Roizman, 2013;Roizman et al., 2013). Most alphaherpesviruses establish lifelong latency in neurons, with a few non-neuronal exceptions such as the Mardivirus genus (e.g. Marek's disease virus (MDV) or gallid alphaherpesvirus 2). The human alphaherpesviruses consist of herpes simplex virus 1 and 2 (HSV-1,2 or human herpesvirus 1,2, HHV-1,2) and varicella zoster virus (VZV or HHV-3) (Davison, 2010). These viruses, as with all herpesviruses, have large dsDNA genomes that have presented challenges for genomic characterization, particularly in comparison to other viral families (Koonin et al., 2015).
Herpesviruses have co-existed with humanity for as long as we have written records. Ancient Egyptian hieroglyphics record the presence of herpetic lesions, implying that these viruses have been a notable health issue for millennia (Pellett and Roizman, 2013). However, detailed scientific studies of these viruses, especially those investigating the entirety of the viral genome, have occurred in quite recent times (Davison, 2010;Depledge et al., 2018a;Loncoman et al., 2017;Renner and Szpara, 2018). The progression of herpesvirus genomics studies has been greatly expanded by the advancement of DNA sequencing technologies. Sanger sequencing enabled the study of individual viral genes, or with painstaking effort, a composite full-length viral genome (Klupp et al., 2004;McGeoch et al., 1985, 1988). With the advent of high throughput sequencing (HTS), entire viral genomes can now be assembled from one sample and its sequencing "run", and in a matter of days rather than years (Greninger et al., 2018;Parsons et al., 2015). HTS approaches also yield a depth of sequence information that provides insight into the genetic diversity within a viral sample. This relative wealth of data has revealed a substantial amount of genetic diversity between strains of any given alphaherpesvirus, as well as genetic diversity indicative of a population of viruses within a single sample. The effects that this level of sequence diversity may have on virus biology or clinical outcomes remain an area of active research (Houldcroft et al., 2017;Loncoman et al., 2017;Renner and Szpara, 2018).
Despite the recent expansion of alphaherpesvirus genomics, substantial challenges remain in the field. Alphaherpesvirus genomes have multiple features that are difficult to analyze with current short-read HTS technology, including highly repetitive areas, regions of high G+C content, and even the presence of genomic isomers. Handling these challenges is a key focus of the future of herpesvirus genomics. In this chapter, we will describe the progression of alphaherpesvirus genomics, including historical context, current areas of research, and potential future directions for the field.

The inherent (in)stability of herpesvirus genomes
A prevailing opinion among virologists is that DNA viruses are inherently stable and RNA viruses are inherently variable (Sanjuan et al., 2010). Indeed, RNA viruses are often described as existing as a quasispecies swarm of varying genomes, rather than as a defined genetic species (Andino and Domingo, 2015). Mechanistically, this view stems from the lack of error correction in most RNA-dependent RNA polymerases (Sanjuán and Domingo-Calap, 2016). By contrast, most DNA viruses have high fidelity polymerases with error correction, and have lower reported mutational rates (Hall and Almy, 1982;Drake and Hwang, 2005). Early studies of the mutation rate of HSV-1 examined single genes, and detected mutation rates on the order of 1 × 10⁻⁷ or 1 × 10⁻⁸ mutations per base per infectious cycle (Hall and Almy, 1982;Drake and Hwang, 2005).
The rates identified in such studies are those that are then subsequently quoted in comparisons between RNA and DNA viruses (Drake, 1991;Sanjuan et al., 2010;Sanjuán and Domingo-Calap, 2016). While those mutation rates fit well with older restriction-fragment length polymorphism (RFLP) comparisons of herpesvirus genomes, these mutation rates fail to explain the ease with which herpesvirus variants are selected or revealed under strong selective pressures.
For example, analysis of multiple HSV-1 and HSV-2 strains revealed the presence of drug-resistance mutations in 1 out of 10³ to 10⁴ PFU in the overall virus population (Sarisky et al., 2000). This result suggests that alphaherpesviruses, even in culture, exist as populations of diverse genomes, a phenomenon otherwise known as standing variation (Firth et al., 2010;Renzette et al., 2014). An alternative explanation is that the rate of evolution observed under these laboratory conditions is substantially different than natural conditions, although the frequency with which drug-resistant viruses are detected in vivo suggests otherwise (Burrel et al., 2010;Sauerbrei et al., 2010).
Multiple lines of evidence support the idea that alphaherpesviruses exist in populations with standing variation. Genome-wide HTS comparisons of HSV-1 have revealed approximately 3-4% nucleotide variation genome-wide, with differences of 1-2% observed between plaque-picked subclones of a given strain (Parsons et al., 2015;Szpara et al., 2014;Bowen et al., 2016).
Additionally, a study that inferred mutation rates from phylogenetic history found a higher than expected frequency of observed mutations in herpesvirus populations (Firth et al., 2010). There is also experimental evidence to support a hypothesis that genetic diversity is inherently beneficial to herpesviruses. In an investigation of Muller's ratchet (the hypothesis that small asexual populations will accumulate deleterious mutations), Jaramillo et al. subjected ten individual sub-clones of HSV-1 to sequential plaque-to-plaque transfers (Jaramillo et al., 2013). As a result of the extreme genetic bottlenecks induced by this experimental approach, two clonal lineages were completely lost, with the remaining clones exhibiting an attenuation of host mortality after intracerebral inoculation into mice (Jaramillo et al., 2013). Genetic analysis of these clones revealed a mutation frequency of 3.6 × 10⁻⁴ substitutions per base per plaque transfer, a far higher rate than those described in single gene studies (Hall and Almy, 1982;Drake and Hwang, 2005;Drake, 1991;Jaramillo et al., 2013).
Additionally, while not the focus of this chapter, studies of the beta-herpesvirus human cytomegalovirus (HCMV) have described the existence and expansion of standing variation within immunocompromised hosts (Gorzer et al., 2010;Sijmons et al., 2015;Houldcroft et al., 2016;Hage et al., 2017). On the whole, while studies of the mutation rate of the herpesvirus polymerases have been useful, the breadth of genetic diversity within herpesviruses is substantial and has relevance to both laboratory and clinical settings (Houldcroft et al., 2017;Loncoman et al., 2017;Renner and Szpara, 2018). With the advent of advanced sequencing technologies, researchers are beginning to appreciate the effects that genetic diversity can have on disease outcomes, drug resistance, and other viral phenotypes of interest.

Comparative genomics reveals the true diversity of herpesviruses
Two seminal examples of VZV genome analysis by Peters et al. 2006 and Tyler et al. 2007 predate the era of HTS, but nonetheless set the tone for many later analyses (Peters et al., 2006;Tyler et al., 2007). We applied many of the same comparative genomics analyses laid out in these papers in our first analyses of HSV-1 and pseudorabies virus (PRV) genomes (Szpara et al., 2010, 2011). These included phylogenetic and recombination analyses, as well as comparisons of coding diversity, tandem repeats, and specific genetic loci (Peters et al., 2006;Szpara et al., 2014;Tyler et al., 2007). Many subsequent studies have followed the same approach, albeit for different alphaherpesvirus species or for interesting subsets of viruses within a species (see for example (Depledge et al., 2014a;Newman et al., 2015;Bryant et al., 2018;Lewin et al., 2018)). Studies using these methods have illuminated the evolutionary history of several alphaherpesvirus species (Zell et al., 2012;Norberg et al., 2015;Browning et al., 2016;Vaz et al., 2016a;Burrel et al., 2017;Koelle et al., 2017;Trimpert et al., 2017), demonstrated that recombination can occur between and within a single species (Loncoman et al., 2017;Lewin et al., 2018;Zell et al., 2012;Norberg et al., 2015;Kolb et al., 2015;Burrel et al., 2017;Koelle et al., 2017;Vaz et al., 2016b;Kolb et al., 2017) (as detailed below in "Mechanisms that drive genetic diversity"), and shown that vaccine strains can regenerate virulent viruses if given the opportunity (Lee et al., 2012;Ye et al., 2016). What remains under-represented in many HTS studies is a connection of the comparative genomics data to the measurement of biological phenotypes.
These phenotypes could include classic virological measures, such as plaque morphology or replication fitness (Parsons et al., 2015), or in vivo measures such as neuroinvasion and mortality rate (Bryant et al., 2018;Pandey et al., 2017). The alphaherpesvirus literature reflects decades of research on the in vitro and in vivo phenotypic effects of individual gene deletions, mutations, or over-expression (Pellett and Roizman, 2013;Roizman et al., 2013). The next frontier for comparative genomics is to integrate these prior data with the insights from comparative genomics, as detailed below (see "Future Frontiers").

Mechanisms that drive genetic diversity
As we gain insight into the level of genetic diversity within the alphaherpesvirinae subfamily, one outstanding question is how such diversity is generated. One straightforward source of genetic diversity is DNA replication error by the viral polymerase (Hall and Almy, 1982;Drake and Hwang, 2005).
However, it is unclear how much of a role DNA polymerase error plays in the generation of alphaherpesvirus genetic diversity. Studies of the error rate of herpesvirus polymerases have indicated a very low mutation rate, on the order of 1 × 10⁻⁷ or 1 × 10⁻⁸ mutations per base per infectious cycle, which would imply that many rounds of genomic replication are necessary if polymerase error were the sole or primary source of genetic diversity (Hall and Almy, 1982;Drake and Hwang, 2005). However, those studies were also done in the context of a single coding region, which may well underestimate polymerase error rates (Brown, 2004). Polymerases are known to have higher error rates in contexts of high G+C content or repetitive regions, and these sequence contexts are overrepresented in alphaherpesvirus genomes with high inter-strain diversity, such as HSV-1 (Figure 1). In contrast, the smaller VZV genome has an overall lower rate of inter-strain diversity. The VZV genome has reduced intergenic regions and ~50% G+C content as compared to ~68% G+C content for HSV-1 (peaking at ~80% average G+C in highly repetitive regions) (Figure 1) (Peters et al., 2006;Tyler et al., 2007;Szpara et al., 2011;Sijmons et al., 2015).
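The arithmetic behind "many rounds of genomic replication" can be sketched as follows. This is a back-of-the-envelope illustration, not a model from the cited studies: the genome length of roughly 152 kb for HSV-1 is an assumed round figure, the function names are hypothetical, and the calculation treats the single-gene rate as uniform across the genome, which the text above cautions may not hold.

```python
# Back-of-the-envelope estimate; every input here is an illustrative assumption.
GENOME_LENGTH_BP = 152_000  # assumed approximate HSV-1 genome length


def expected_mutations_per_genome(rate_per_base: float,
                                  genome_length: int = GENOME_LENGTH_BP) -> float:
    """Expected new substitutions per genome copy per infectious cycle,
    assuming the per-base rate applies uniformly across the genome."""
    return rate_per_base * genome_length


def cycles_per_expected_mutation(rate_per_base: float,
                                 genome_length: int = GENOME_LENGTH_BP) -> float:
    """Infectious cycles needed for one lineage to accumulate, on average,
    a single substitution anywhere in the genome."""
    return 1.0 / expected_mutations_per_genome(rate_per_base, genome_length)


if __name__ == "__main__":
    for rate in (1e-7, 1e-8):
        per_genome = expected_mutations_per_genome(rate)
        cycles = cycles_per_expected_mutation(rate)
        print(f"rate {rate:.0e}: {per_genome:.5f} substitutions per genome per cycle, "
              f"about {cycles:.0f} cycles per substitution")
```

Under these assumptions, a single lineage would need tens to hundreds of infectious cycles to gain one substitution, which is the sense in which polymerase error alone seems hard to reconcile with the diversity described above.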
Figure 1. Comparison of the known genetic diversity within the human alphaherpesviruses. We generated SplitsTree diagrams (Huson, 1998) with recently published analyses of known genetic diversity in the three human alphaherpesviruses (Akhtar et al., 2019;Shipley et al., 2019;Zell et al., 2012). Each virus' tree is presented at the same scale, so that the size of the tree is proportional to the observed genetic diversity for each virus. The HSV-1 diagram includes 51 strains (Shipley et al., 2019), HSV-2 includes 68 strains (Akhtar et al., 2019), and VZV includes 42 strains (Zell et al., 2012).
Beyond polymerase error, it is now clear that mutation and evolution in herpesviruses result not only from base substitutions, but also from recombination between strains of a given species, and to a less frequent extent, recombination between species (Loncoman et al., 2017;Norberg, 2010;Renner and Szpara, 2018). As with segmented RNA virus reassortment, recombination in herpesviruses can provide a strong force for evolutionary shifts. Studies of lab-generated recombinants of HSV-1 have revealed a bias towards recombination breakpoints occurring in repetitive regions, areas of locally high G+C content, and intergenic regions (Lee et al., 2015). However, most studies of recombination within the alphaherpesviruses have focused on naturally circulating variants, and have inferred recombination and phylogenetic relationships for historical sites of recombination from a comparison of extant strains (Burrel et al., 2017;Koelle et al., 2017;Loncoman et al., 2017;Norberg et al., 2015). For example, as more strains of VZV have been sequenced, the number of phylogenetic clades has expanded and there is now evidence for ancient, inter-clade recombination in addition to modern recombination between strains (Norberg et al., 2015;Zell et al., 2012). Within circulating populations of HSV-1 and HSV-2, there is evidence of rampant recombination among strains within each species, although it is unknown if the observed recombination is historical or recent (Szpara et al., 2014;Kolb et al., 2015;Norberg, 2010;Norberg et al., 2007). Recently, two separate groups have found evidence for inter-species recombination between HSV-1 and HSV-2 (Burrel et al., 2015, 2017;Casto et al., 2019;Koelle et al., 2017). These findings are based on the presence of several regions in the HSV-2 genome that have a high degree of sequence similarity to extant HSV-1 genomes, whereas the remainder of the HSV-2 genome is more closely related to a chimpanzee herpesvirus (ChHV) than to HSV-1 (Severini et al., 2013;Wertheim et al., 2014). For the non-human
Since recombination requires the co-occurrence of two distinct viral lineages in one infected cell in the host, this may be rare and difficult to detect in clinical or field settings.

Other contributions to functional diversity
While single-nucleotide changes, recombination, and even horizontal gene transfer are generally accepted mechanisms for genetic variation within herpesviruses, data indicate that other mechanisms may also contribute to genetic diversity. These mechanisms include ribosome slippage and novel coding and non-coding RNAs revealed by transcriptome analysis or ribosome profiling. These genomic outputs cannot be predicted solely by analysis of genomic DNA, and instead require approaches that integrate transcriptional and/or translational data in combination with a matched DNA genome comparison.
Several studies have applied sensitive RNA-sequencing-based approaches to demonstrate an expanded range of transcriptional impacts of HSV-1 infection on host cells. A recent series of papers used directional and nascent RNA-sequencing, in combination with ribosome profiling, to reveal previously undetected effects on the host transcriptome during HSV-1 replication (Rutkowski et al., 2015;Wyler et al., 2017;Hennig et al., 2018). These included the disruption of transcription termination (DoTT) or transcriptional read-through, which leads to the production of many run-on RNA transcripts (Rutkowski et al., 2015;Hennig et al., 2018). It also included a large number of antisense transcripts from the host genome (Wyler et al., 2017). More recently, these analyses have been applied to several alphaherpesvirus genomes, revealing many non-canonical transcripts and novel open reading frames (ORFs) expressed by these viruses, along with previously undetected transcript extensions, truncations, and splicing events (Tombácz et al., 2017a;Prazsák et al., 2018;Tombácz et al., 2019;Depledge et al., 2019;Whisnant et al., 2020).
Future studies will need to explore the expression and potential function(s) of these transcripts in different cellular and in vivo models of infection, and their conservation across other herpesviruses (Arias et al., 2014;Tirosh et al., 2015).
Ribosome frameshifting is a necessary aspect of translation for retroviruses like HIV, where production of the nucleocapsid and polymerase proteins is directed by the same RNA transcript and ribosome frameshifting is required to obtain both products. While it occurs at a lower frequency, ribosome frameshifting has been demonstrated on thymidine kinase (TK) transcripts in HSV-1 (Griffiths, 2011;Griffiths et al., 2003;Pan and Coen, 2012). This can be detected in genomes with a homopolymer-based frameshift in the TK gene, which often arise in response to treatment with the antiviral drug acyclovir (Burrel et al., 2010;Griffiths, 2011;Sauerbrei et al., 2010). While a direct translation of the RNA encoded by such altered genes would lead to non-functional proteins, ribosomal frameshifting enables a low level of functional protein production in these drug-resistant viruses (Griffiths, 2011;Griffiths et al., 2003;Pan and Coen, 2012).
RNA editing is a mechanism by which viral transcripts may be recoded at the single-nucleotide level to differ from the existing viral genome. While host-defensive or therapeutic CRISPR-based RNA editing is a destructive process that may be used against herpesviruses (Nakaya et al., 2016;Oh et al., 2019;Suspene et al., 2011), there is also the potential for other outcomes of RNA

Clear nomenclature to document viral strains and genome sources
As increasing numbers of alphaherpesvirus isolates of a given species are studied at the genomic level, it has become increasingly important to document the different sources of material used to obtain viral genomes. While descriptive terms such as "clinical isolate" and "laboratory strain" are often used to denote either low or high numbers of cell culture passages for a given viral isolate, these terms are interpreted variably by different research groups (Kuhn et al., 2013;Wilkinson et al., 2015). There are no standards for how many times a viral sample can be passaged in culture before it is no longer considered a "clinical isolate", nor are there standards as to whether a viral population should be plaque purified before calling it a strain. The introduction of sensitive DNA isolation and amplification techniques has also enabled whole viral genomes to be collected and sequenced directly from hosts, avoiding cell culture entirely (Greninger et al., 2018;Depledge et al., 2011;Watson et al., 2013;Johnston et al., 2017a;Shipley et al., 2018, 2020). It is important that researchers generating new genomic data, and those using sequences from the GenBank databases for purely computational studies, appreciate these distinctions. The VZV community has recommended a viral naming system that is similar to those used for RNA viruses or bacteria to help record and clarify these details for each new genome (Figure 2A) (Breuer et al., 2010). This standard emphasizes defining not only a name for each strain, but also its geographic origin, collection date, disease type (e.g. varicella, zoster or latent), and its status as a cultured isolate or an uncultured sequence (Breuer et al., 2010).
Whenever possible the strain clade, based on phylogenetic clustering or single-nucleotide polymorphism (SNP) profile, is included as well. While not yet widely or consistently adopted (Figure 2B), this nomenclature approach would clearly be beneficial if applied to other alphaherpesviruses as well. This would improve data-accessibility across the primary literature and sequence databases, as well as enabling easier comparison of data across research groups.
Figure 2. (A) The recommended nomenclature includes the following data for each strain (moving from left to right in the diagram): viral status as a cultured isolate or an uncultured sequence, the geographic city and country of origin, collection date, whether or not any related sequences were collected, the disease presentation (e.g. varicella, zoster or latent), and the strain clade based on phylogenetic clustering or single-nucleotide polymorphism (SNP) profile (Breuer et al., 2010). The nomenclature components are color-coded consistently throughout the diagram. In a review article recommending this approach, we suggested the addition of strain name to this nomenclature, in keeping with recommended practices for RNA virus nomenclature (Kuhn et al., 2013). (B) However, the current usage of this nomenclature is limited, with variable applications whose content is not always straightforward to interpret. A selection of recent variations in nomenclature usage are shown from the following GenBank records and publications: KJ847330 (Bondre et al., 2016), KR135321 (Newman et al., 2015), and MG764307 (Depledge et al., 2018b). A more consistent application of the recommended viral strain nomenclature would provide clear benefits for sample data retrieval and the accurate linking of sequence data with published results. Figure adapted from Breuer et al. 2008 (Breuer et al., 2010).

Genomes: consensus vs. population
Most prior studies that catalog herpesvirus diversity have produced a single, consensus genome for each new sample. The consensus genome is representative of the most common allele or nucleotide at each position in the genome (Figure 3). In the simplest scenario, the consensus genome from a sample is built from the most common nucleotide at each position throughout the genome (Figure 3A). However, this scenario is complicated by evidence that herpesvirus isolates generally exist as populations that contain a degree of sequence diversity. Because more than one nucleotide may exist at a given location across the viral population in a sample, the creation of a single consensus genome necessarily removes the details of underlying components of the viral genetic population. The consensus genome for a given sample may also be an amalgamation of frequent but non-contiguous alleles (Figure 3B).
Figure 3. A sample of virus contains a population of viral genomes, which includes both common and rare or low-frequency (minor) genetic variants. A viral consensus genome represents the most common allele or nucleotide at every position in the genome. In this diagram, lines represent individual genomes, with alleles or loci that vary between genomes denoted as stars. The thicker line represents the consensus genome that results from deep sequencing and assembly of the most common (or most frequently observed) alleles. (A) In example 1, the consensus genome is the most common or frequent genotype present in the viral sample. (B) In example 2, the consensus genome contains all of the most common alleles, but this amalgamation of alleles does not exist at high frequency in the viral sample. (C-D) In either case, selection pressures such as the application of an antiviral drug can lead to a genome that carries a rare drug-resistance variant (denoted by the red star) being positively selected. The drug-resistant genotype may then become the most commonly observed allele. Depending on the genomic context of the drug-resistance allele, this allele may appear in the same genetic context as the previously dominant genotype (C), or it may appear alongside other variants that were previously rare (D).
Genetic variants or alleles that exist at a sub-consensus level are often referred to as minor variants. Depending on the intended purpose of a study, an analysis of the presence and identity of minor variants may be crucial to a proper interpretation of the data, or that information may be viewed as a luxury. For example, studies of transmission or evolution of HSV-1 (Greninger et al., 2018;Pandey et al., 2017;Shipley et al., 2018;Lassalle et al., 2020), HSV-2 (Akhtar [...] Figure 3CD) (Burrel et al., 2010;Houldcroft et al., 2017;Sauerbrei et al., 2010).
Phylogenetic studies, by contrast, would have little use for analyzing the minor variants in their samples (Zell et al., 2012;Kolb et al., 2017;Johnston et al., 2017a;Pfaff et al., 2016). So, when designing a genetic study of herpesviruses, careful consideration must be given to the analysis of minor variants and the depth of sequence coverage needed to do so.
Another consideration that must be made when analyzing consensus sequences from deep sequencing data relates to the large size of alphaherpesvirus genomes. Since the sequence read length of commonly used HTS or deep sequencing approaches is substantially shorter than alphaherpesvirus genomes (e.g. ~300 bases vs over one hundred kilobases), it is difficult to determine which genetic variants coexist within one intact genome. If each variant is the most common at its location, it will be included in the consensus genome, which may then be an amalgamation of several genotypes (see Figure 3).
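The amalgamation problem can be made concrete with a toy example. The sketch below is illustrative only: the four-base "genomes", their copy numbers, and the function name `consensus` are all hypothetical, and real consensus calling works from aligned reads rather than from known whole genomes.

```python
from collections import Counter


def consensus(population: dict[str, int]) -> str:
    """Build a consensus from the most common nucleotide at each position.

    `population` maps each full-length genome sequence to its copy number
    in the sample; all sequences are assumed to have the same length.
    """
    length = len(next(iter(population)))
    calls = []
    for pos in range(length):
        counts = Counter()
        for sequence, copies in population.items():
            counts[sequence[pos]] += copies
        calls.append(counts.most_common(1)[0][0])
    return "".join(calls)


# Example 1 (hypothetical): the consensus is also the dominant genotype.
simple_population = {"TGGG": 60, "TAAA": 40}

# Example 2 (hypothetical): each G allele is the majority at its own site,
# yet no single genome in the sample carries all three together.
mixed_population = {"TGGA": 35, "TAGG": 35, "TGAG": 30}

if __name__ == "__main__":
    for name, pop in (("example 1", simple_population),
                      ("example 2", mixed_population)):
        seq = consensus(pop)
        print(name, seq, "present in sample:", seq in pop)
```

In the second population the per-site majority alleles come from different genomes, so the assembled consensus corresponds to no genome actually present, which is the situation short reads alone cannot rule out.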
For smaller viruses, researchers have the ability to clone and sequence individual genomes, which has enabled the measurement of viral genotypes in a given population and the development of software to infer likely haplotypes from the HTS data (Lou et al., 2013;Töpfer et al., 2014;Jayasundara et al., 2015).
Barcoded methods offer the potential to improve the linking of variant alleles into haplotypes (Lauring and Andino, 2011;Keys et al., 2015), but these have not yet been widely applied to herpesviruses. Trimpert et al. recently applied a combination of HTS-based assessment, with bacterial artificial chromosome (BAC)-based genome capture, to validate MDV population diversity in a series of polymerase mutants (Trimpert et al., 2019). This combination enabled them to confirm the co-location of minor variants observed by HTS, in single BAC-cloned viral genomes (Trimpert et al., 2019). Other technologies such as nanopore-based (MinION©, Oxford Nanopore) or single-molecule real-time sequencing (SMRT©, Pacific Biosciences (PacBio)) offer the promise of longer read lengths and the resulting ability to link minor alleles, although they have not yet been widely applied to analyze minor variants in herpesvirus populations (Tombácz et al., 2017a;Prazsák et al., 2018;Tombácz et al., 2019;Depledge et al., 2019;Whisnant et al., 2020). However, both long-read technologies currently exhibit a degree of sequencing error that is sufficient to limit their useful application at this point in their development. Thus short-read HTS platforms such as Illumina© (e.g. MiSeq or HiSeq) remain the most common choice for herpesvirus genomics, where the technology is sufficient to identify the location and prevalence of minor variants, but not to evaluate their co-occurrence on single genomes (Hage et al., 2017;Shipley et al., 2018;Depledge et al., 2016b;Pandey et al., 2016). Newer hybrid sequencing approaches may help to bridge this gap, by employing both the sensitivity of Illumina short-read HTS, and the longer scaffold accuracy of long-read Nanopore or PacBio HTS.
This "oligo-enrichment" approach uses synthetic RNA or DNA probes, also known as baits, that are designed to hybridize with sparse amounts of the targeted viral genome(s) using sequence complementarity. These are mixed into solution with the sample of interest. Once they have hybridized with viral genetic material, the baits can be isolated from the mixed sample by virtue of an attached tag, such as biotin. The sample that results from this isolation is greatly enriched for the targeted DNA and can be amplified and sequenced using standard HTS approaches. Oligo-enrichment has enabled the sequencing of alphaherpesviruses directly from samples such as skin swabs, saliva, blood, vesicle fluid, and tissue samples, among others (Greninger et al., 2018;Depledge et al., 2011;Johnston et al., 2017a;Shipley et al., 2018). The sensitivity of viral genome capture with this method has been used to gain new insights into viral population diversity in infected humans. This has included tracking changes at the consensus genome level in multiple reactivation episodes (Shipley et al., 2018;Greninger et al., 2018;Casto et al., 2019;Depledge et al., 2014b), or during transmission between human hosts (Depledge et al., 2016b;Shipley et al., 2019), and examining how specific allelic variants (i.e. minor variants) change in frequency over time within individual hosts (Minaya et al., 2017b;Shipley et al., 2018).

The path to obtaining a viral genome
Once viral genomic DNA is prepared for sequencing, the next step for a researcher is to decide which platform they will choose to generate sequence data. The most common approach is to use short-read HTS platforms, such as Illumina©. A detailed examination of how HTS data is processed into full-length consensus genomes and/or minor variants has been described in depth elsewhere (Houldcroft et al., 2017;Posada-Cespedes et al., 2016). The basic approach is for the short sequencing reads to be stitched, via bits of overlapping sequence, into longer and longer contiguous pieces that eventually form a new consensus genome. In a second quality-control step, individual sequencing reads are then aligned (or "mapped") back to their matching locations on the newly constructed consensus genome, usually with a great number of sequencing reads present per nucleotide (i.e. the sequencing or coverage depth). When sequencing coverage is deep (e.g. >300-fold coverage), genomes can be created with greater confidence than with lesser coverage. Deep sequence coverage also allows low-frequency minor variants to be identified with greater confidence. Publications generally only report minor variants above a certain minimum frequency threshold, such as 2% (Greninger et al., 2018;Hage et al., 2017;Shipley et al., 2018;Depledge et al., 2014b). It is important to appreciate that with 100X coverage, a 2% threshold for minor variant detection would mean that only 2 sequencing reads were sufficient to identify (or "call") the minor variant. For this reason, we recommend a sequencing depth far exceeding 100-fold for the confident and accurate detection of minor variants.
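The relationship between coverage depth and the number of reads behind a minor-variant call can be sketched as follows. This is a minimal illustration of the arithmetic only: the function name is hypothetical, and real variant callers typically apply further filters (for example on base quality or strand bias) that are not modeled here.

```python
import math


def min_supporting_reads(depth: int, min_frequency: float) -> int:
    """Fewest reads that must carry a variant for it to reach
    `min_frequency` at a site covered by `depth` reads."""
    # The small epsilon guards against floating-point products such as
    # 2.0000000000000004 being rounded up to one read too many.
    return max(1, math.ceil(depth * min_frequency - 1e-9))


if __name__ == "__main__":
    for depth in (100, 300, 1000, 5000):
        reads = min_supporting_reads(depth, 0.02)
        print(f"{depth}-fold coverage: a 2% variant needs at least {reads} reads")
```

At 100-fold coverage a 2% call rests on as few as two reads, whereas at 1000-fold coverage the same threshold requires twenty, which is why deeper coverage makes a spurious call less likely.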
Greater coverage depth means that more sequence reads must support the minor variant for it to be detected or called, thus minimizing the chance of an incorrectly identified minor variant. The cost of high coverage depth is that fewer viral strain genomes can be assembled from a given sequencing "run" (e.g. one iteration of operating an Illumina MiSeq). Again, this is where the priority of the researcher factors into decisions made in the sequencing process. If the priority is to sequence as many genomes as possible in one sequencing run, the limitations of existing technology mean that such genomes will have lower coverage than a sequencing run that involves only a limited number of unique samples. However, research that is focused solely on species identification or serotype determination may benefit from increased efficiency and cost-savings by forgoing coverage depth and the detection of minor variants (Greninger et al., 2018;Zell et al., 2012;Vaz et al., 2016a;Pfaff et al., 2016).
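The trade-off between the number of samples in a run and the depth each one receives can be estimated with simple division. The sketch below is a rough planning aid under stated assumptions: the run yield of 5 billion bases, the 50% on-target fraction, and the 152 kb genome length are hypothetical inputs chosen for illustration, not figures for any particular instrument or sample type, and the function names are likewise hypothetical.

```python
def mean_depth(run_yield_bp: float, n_samples: int, genome_length_bp: int,
               on_target_fraction: float = 1.0) -> float:
    """Average per-base coverage for each sample if the usable (viral)
    bases of a run are shared equally among the samples."""
    usable_bases = run_yield_bp * on_target_fraction
    return usable_bases / (n_samples * genome_length_bp)


def max_samples(run_yield_bp: float, target_depth: float, genome_length_bp: int,
                on_target_fraction: float = 1.0) -> int:
    """Largest number of samples that can share a run while each still
    reaches `target_depth` on average."""
    usable_bases = run_yield_bp * on_target_fraction
    return int(usable_bases // (target_depth * genome_length_bp))


if __name__ == "__main__":
    RUN_YIELD_BP = 5e9        # hypothetical yield of one run
    GENOME_LENGTH_BP = 152_000  # assumed approximate HSV-1 genome length
    ON_TARGET = 0.5           # hypothetical fraction of reads that are viral

    for n in (12, 24, 48, 96):
        depth = mean_depth(RUN_YIELD_BP, n, GENOME_LENGTH_BP, ON_TARGET)
        print(f"{n} samples: about {depth:.0f}-fold mean coverage each")
    print("samples at 1000-fold:",
          max_samples(RUN_YIELD_BP, 1000, GENOME_LENGTH_BP, ON_TARGET))
```

Because mean depth ignores the uneven coverage of G+C-rich and repetitive regions, a planned average should be treated as an upper bound on what the hardest parts of the genome will receive.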
Since the analysis of sequenced genomes often depends on how accurately individual sequencing reads can be overlapped to create the scaffold of a full-length genome, repetitive regions present a distinct problem (Figure 4) (Treangen and Salzberg, 2011). If a given sequencing read consists entirely of a perfectly repeating sequence, it may be unclear how the read overlaps with its neighbors, i.e. how long a given repeating array of sequences truly is. Many alphaherpesvirus genomes have multiple extremely long repetitive arrays, such as the classic "reiterations" recognized in early studies of HSV-1 and VZV (Davison and Wilkie, 1981; Moss et al., 1981). Length variations in these tandem repeats have been recognized since the era of restriction fragment length polymorphism (RFLP) analyses, and these fluctuations further confound the true length determination of these repeats (Moss et al., 1981; Simon et al., 1989; Sakaoka et al., 1994). In HTS genome assemblies, the issue of uncertain overlaps in repetitive sequence arrays can cause the termination of de novo assembled fragments, creating gaps in the resulting consensus genome. It may also confound the ability to map sequence reads back to repeated regions in alignment-based methods of genome analysis (Parsons et al., 2015; Treangen and Salzberg, 2011). In addition to having large structural repeats at internal and terminal regions, and the classically-defined large reiterations, many alphaherpesviruses are rich in short tandem repeated sequences (also known as short sequence repeats or SSRs, and variable number tandem repeats, or VNTRs) (Szpara et al., 2011). The G+C content tends to be even higher within these repeats (Szpara et al., 2011, 2014). This leads to secondary structures such as stem-loops and G-quadruplexes that decrease the overall yield of sequence data in these areas and further complicate the assembly of these high G+C repetitive regions (Figure 4) (Szpara et al., 2011; Parsons et al., 2015; Artusi et al., 2015; Biswas et al., 2016).
Figure 4. (A) Challenges in the physical sequencing of G+C-rich regions and/or areas containing DNA secondary structures, such as the stem-loop at the origin of replication (Ori), can lead to a lower quantity of raw sequence-read data being produced during high-throughput sequencing (HTS). (B) During de novo assembly or alignment of viral genome sequence data, challenges arise from ambiguities in the overlap and/or mapping of sequence reads that contain tandem repeats, homopolymers, or adjacent mixtures of these motifs. (C) Alphaherpesvirus genomes generally have two unique regions (termed unique long and unique short; shown as gradients of gray), each of which is flanked by repeats (known as repeat long (RL) or repeat short (RS); shown as white and orange boxes). RL and RS each exist in two copies, one terminal and one internal and inverted, on each genome. Isomers of these genomes, which appear to have either the left or right unique and repeated regions in the reverse orientation, exist in viral populations (Mahiet et al., 2012; Roizman, 1979). Sequence read mapping issues can result from ambiguities in mapping or placement of paired-end sequences that span the isomerization boundary of the long and short regions of alphaherpesvirus genomes (gradients of gray denote orientation of the unique regions). Decreased sequence quantity (A) and ambiguities in sequence read mapping (B) combine to create regions of low coverage and gaps in the final consensus genome produced by HTS (C). This phenomenon is frequently observed in the terminal and internal repeats of HSV-1 and HSV-2 consensus genomes (Greninger et al., 2018; Parsons et al., 2015; Szpara et al., 2011; Newman et al., 2015; Norberg et al., 2011).
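The repeat-length ambiguity described above can be illustrated with a toy example in Python. The flanking sequences, the six-base repeat unit, and the copy numbers are invented for illustration and do not correspond to any real alphaherpesvirus locus. The sketch enumerates every possible error-free read of a given length from two hypothetical genomes that differ only in tandem repeat copy number, and deliberately ignores read counts and sequencing errors.

```python
def read_set(genome, read_len):
    """All distinct error-free reads of length `read_len` from `genome`."""
    return {genome[i:i + read_len] for i in range(len(genome) - read_len + 1)}


# Hypothetical unique flanks (no G) around a G+C-rich tandem repeat unit.
LEFT_FLANK = "ATCTATCTTACATTCTAACT"
RIGHT_FLANK = "TACCATTACTTCAATCCTAA"
REPEAT_UNIT = "GGGAGC"


def toy_genome(copies):
    """Flank + tandem array of `copies` repeat units + flank."""
    return LEFT_FLANK + REPEAT_UNIT * copies + RIGHT_FLANK


five_copies = toy_genome(5)   # 30-base repeat array
eight_copies = toy_genome(8)  # 48-base repeat array

# Reads much shorter than the array: both genomes yield the same reads,
# so the copy number cannot be recovered from read content alone.
print(read_set(five_copies, 12) == read_set(eight_copies, 12))  # True

# Reads long enough to span the shorter array plus both flanks:
# the two genomes are now distinguishable.
print(read_set(five_copies, 40) == read_set(eight_copies, 40))  # False
```

In this simplified setting, only reads that span the whole array and reach unique sequence on both sides can fix its length, which is one reason long tandem arrays tend to end up as gaps or reference-biased lengths in short-read assemblies.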
Alphaherpesviruses often have latency- or neurovirulence-associated genes encoded in the large structural repeated regions of the genome (e.g. LAT and ICP34.5 in HSV-1), in addition to the key transcriptional regulator ICP4 (also known as IE180 or IE62, depending on the virus). Therefore, researchers must balance the desire to study the biological functions associated with these regions of the genome against the technical difficulties of sequencing them. Based on the priorities of each study, some groups have decided that omitting the repeats was the most useful approach (Greninger et al., 2018; Newman et al., 2015; Norberg et al., 2011). Other studies include an analysis of the repetitive regions, with certain caveats attached to the analysis of the data (Szpara et al., 2011, 2014; Kolb et al., 2015; Pandey et al., 2016). Still other groups address the technical difficulties of using HTS approaches in repetitive areas by using Sanger sequencing to resolve troublesome regions of the genome (Minaya et al., 2017a). The repetitive regions of these genomes, and how to deal with them, remain one of the most challenging aspects of alphaherpesvirus genomics.

High quality genomes result from dedicated bioinformatics and quality control
While the decisions that go into choosing a sequencing method are not trivial, they mean little without proper bioinformatic efforts to assemble viral genomes and analyze them. The ability to rapidly sequence and assemble large DNA virus genomes like the alphaherpesviruses has been facilitated by advances in software and computational workflows, although the diversity of approaches to the task has led to wide variation in the quality of finished genomes. One of the first choices in the bioinformatic analysis of viral HTS data is whether to construct a genome by alignment of the sequencing reads to a prior reference genome, or by de novo assembly (Posada-Cespedes et al., 2016). Each approach has its own strengths and weaknesses that should be considered.
Alignment-based approaches use a previously determined reference genome to achieve faster results, but the end result is biased by the reference genome. As a result, these approaches are prone to mis-calling of minority and structural variants, and may have repeat lengths that are missing (gapped) or biased by the reference genome chosen for alignment. De novo approaches are unbiased by prior data, but they are significantly more computationally intensive, and can require more curation following the initial assembly to obtain optimal results. However, the effort required by de novo assembly is often rewarded by a greater ability to detect structural changes in genomes, such as insertions or deletions, or minor variants in the viral population. We have found that a combination of iterative de novo assembly and reference-guided ordering of the resulting sequence blocks, as first demonstrated for herpesviruses by Cunningham et al. (2010), has been the most useful and robust approach thus far, whether for cultured viral genomes, sparse field samples, or newer oligo-enrichment derived genomes (Parsons et al., 2015; Szpara et al., 2014; Shipley et al., 2018; Pandey et al., 2016).
The tools for performing the steps of genome assembly and analysis are often open-source and include web-based platforms as well as others using command-line (Unix) interfaces. The vast majority of viral de novo assembly programs have been developed and tested only for RNA viruses. We have developed both web-based (VirAmp; Wan et al., 2015) and command-line based (VirGA; Parsons et al., 2015) workflows for de novo assembly of herpesvirus genomes. VirAmp is unique in using the Galaxy framework of web-based bioinformatics tools (Blankenberg et al., 2010; Giardine et al., 2005).
Most options for viral genome assembly, alignment, annotation and comparison utilize a command-line interface, which has a steeper learning curve. These Unix-based software options, such as our de novo genome assembly suite VirGA (Parsons et al., 2015), the alignment packages Bowtie and BWA (Langmead and Salzberg, 2012; Posada-Cespedes et al., 2016), and similar script-based workflows (Greninger et al., 2018; Mangul et al., 2014), are usually open source and are generally available through repositories like BitBucket or GitHub. If these programs prove too difficult to use, researchers can also choose from commercial packages (e.g. Geneious© Biomatters and CLC Bio© Qiagen) for alignment or de novo assembly of genomes.
The proliferation of genomics software has been a boon for the production and publication of genomics data. However, this increased accessibility comes with a potential loss of quality, if these "easier-to-use" programs are used improperly.
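For readers new to the concept, the de novo principle described in this section (joining reads by their overlaps, without a reference) can be sketched as a toy greedy overlap merger in Python. This is a minimal illustration under strong assumptions: error-free reads, a single strand, no repeats, and an invented 20-base "genome". It is not the algorithm used by VirGA, VirAmp, or any other tool named above, and the function names are hypothetical.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the largest overlap.

    Stops when no remaining pair overlaps by at least `min_overlap` bases
    and returns the resulting contigs.
    """
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_overlap)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


# Invented toy genome, sampled as 8-base reads that overlap by 4 bases.
toy_genome = "ACGTTGCAAGCTTAGGCCAT"
toy_reads = [toy_genome[i:i + 8] for i in range(0, 13, 4)]
contigs = greedy_assemble(toy_reads)
print(contigs == [toy_genome])  # True for this repeat-free toy input
```

With repeat-free input the reads collapse into a single contig. Tandem repeats, sequencing errors, and uneven coverage break the unique-overlap assumption this sketch relies on, which is why real assemblies of herpesvirus genomes need additional curation.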
The rationale for the generation of high-quality genomes has been established by the Human Genome Project and by many microbial genome projects (Fraser et al., 2002). For those who are new to genomics, the question of what constitutes an optimal, "high-quality" genome may be a very difficult one to answer. Without delving into a point-by-point definition of a high-quality genome assembly, there are a few guidelines that are helpful to follow, particularly for alphaherpesvirus genomics. Genomes should be fully assembled, ideally without gaps or unfinished regions in the final sequence. This includes the large structural repeat regions, and the genes contained within these. These genomes should also be annotated as fully as current research allows and deposited into an international repository such as GenBank or the European Nucleotide Archive (ENA). The genome annotation should include genes and protein coding regions, along with details on the assembly and annotation methods, in addition to strain source and handling (see Figure 2). These details enable future comparisons with other genomes and gene-sequence data. Raw sequence read data should be deposited in an international Sequence Read Archive (SRA), for others to validate and/or re-analyze in the future. As discussed above ("Clear nomenclature…"), high-quality genomes with detailed annotation and nomenclature (Figure 2) enable future scientists to accurately trace the origins of the data and utilize these data in further studies.
Herpesviruses can occasionally present unique challenges to these guidelines, as completion of the sequencing of complex tandem repeats or G+C-rich sequences can be difficult. However, the absence of such sequences hinders future comparative studies, because partial genomes represent an incomplete source of data. This is particularly worrisome when these regions are a source of genetic diversity and rapid evolution, as with the herpesviruses. In a recent comparison of neonatal HSV-2 genomes, for example, we found almost sixty adult-derived HSV-2 genomes in the GenBank sequence database (Akhtar et al., 2018). However, fewer than ten of these GenBank-deposited HSV-2 genomes included the key genes found in the internal and terminal repeat regions: the key transcriptional regulator ICP4 (encoded by the RS1 gene), the neurovirulence protein ICP34.5 (RL1), and the immediate-early protein ICP0 (RL2) (Johnston et al., 2017a, 2017b; Koelle et al., 2017; Kolb et al., 2015; Newman et al., 2015). These missing data prevent a full genetic comparison of coding diversity in these genes, all of which have been demonstrated to be crucial for in vivo pathogenesis and spread (Roizman and Campadelli-Fiume, 2007). Since a sequenced viral genome forms the basis for all hybrid sequencing applications like RNAseq, chromatin-immunoprecipitation (ChIP) sequencing, and chromatin conformational capture (CCC or 3C) sequencing, deficiencies in the starting reference genome can cause many unintentional downstream problems. The tremendous insights that hybrid genomic approaches can provide therefore rely upon a strong foundation in the initial deciphering of viral genome populations.
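Two of the guidelines above, the absence of gaps and the presence of repeat-region genes, lend themselves to a simple automated check. The Python sketch below assumes that gaps in a consensus sequence are written as runs of the ambiguity code N, a common convention but not one every pipeline follows, and that the annotated gene names are already available as a set. The function name and the example sequence are hypothetical; the gene names RS1, RL1, and RL2 are those discussed in the text.

```python
import re

# Repeat-region genes highlighted in the text as frequently missing.
REPEAT_REGION_GENES = {"RS1", "RL1", "RL2"}


def completeness_report(sequence, annotated_genes,
                        required_genes=REPEAT_REGION_GENES):
    """Summarize gaps (runs of N) and missing required genes.

    `annotated_genes` is a set of gene names taken from the genome's
    annotation; how it is extracted is outside the scope of this sketch.
    """
    gaps = [(m.start(), m.end()) for m in re.finditer(r"N+", sequence.upper())]
    return {
        "length": len(sequence),
        "gap_count": len(gaps),
        "gap_bases": sum(end - start for start, end in gaps),
        "missing_genes": sorted(required_genes - set(annotated_genes)),
    }


# Invented example: a short sequence with two gaps, annotated with only RL2.
report = completeness_report("ACGTNNNNACGTnnAC", {"RL2", "UL30"})
print(report)
```

A report with a non-zero gap count or a non-empty list of missing genes flags a genome as partial for the purposes of the comparisons discussed above. It says nothing about base-level accuracy, which requires the read-mapping checks described earlier.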

Future Frontiers
Throughout this chapter we have discussed the progression of alphaherpesvirus genomics, from the first assembled genomes to the current state of the art, and some of the insights that these studies have revealed. The technological advancements that have occurred in recent years have driven a substantially greater understanding of the genetic diversity and evolution of herpesviruses.
We now have an appreciation of the variations that can be seen within herpesvirus genomes, and that these viruses often exist as populations which contain heterogeneous genomes. This heterogeneity has been documented via the in vivo detection of minor variants during genital infection by HSV-1 (Shipley et al., 2018) or HSV-2 (Minaya et al., 2017a), and the selection of drug-resistant variants during antiviral therapy (Burrel et al., 2010; Depledge et al., 2016b; Houldcroft et al., 2017; Sauerbrei et al., 2010). Studies using newer oligo-enrichment approaches have begun to reveal the genetic diversity present in vivo, in natural human herpesvirus infections (Depledge et al., 2016b; Greninger et al., 2018; Johnston et al., 2017a; Shipley et al., 2018), without any potential bias from expanding the virus population in cell culture. The plasticity of alphaherpesvirus populations has also been observed in cultured virus stocks (Greninger et al., 2018; Parsons et al., 2015), and examined over time, as seen in our recent application of in vitro evolution to HSV-1 (Kuny et al., 2020). With a renewed appreciation of the connection between herpesvirus genetics and biology, it is only natural to look to the future for the next revelations.
First, the potential for improvement within third-generation (long-read) sequencing technology is tantalizing, and its application to herpesvirus genomics could be revolutionary. Early applications of MinION and SMRT long-read sequencing in herpesvirology have already been useful, providing many new insights into novel transcriptional networks (Depledge et al., 2018a, 2018c; Tombácz et al., 2017a, 2017b). Combining this approach with those discussed earlier (see "Other contributions to functional diversity"), such as ribosome profiling or the analysis of novel transcripts, will further illuminate the functional genetic outputs of these viral genomes (Depledge et al., 2018a). Oldfield et al. recently demonstrated a completely synthetic approach to the generation of an alphaherpesvirus genome, which provides an opportunity for vast improvements in the efficiency of generating multiple precise mutations in a single viral genome, albeit at a currently higher cost (Oldfield et al., 2017). Maturation and improvement in the accuracy of MinION and SMRT technologies will allow for improved detection of recombinant genomes and structural variants, as well as the ability to define genetic haplotypes in mixed populations. These technologies may also allow for sequencing through the entirety of the large internal and terminal repeat regions of alphaherpesviruses, allowing improved assembly and a higher-throughput analysis of those areas of the genome. These methods may also be able to detect methylated bases or secondary structures in the viral DNA, or even read substrates besides DNA and RNA (Depledge et al., 2018a).
Another intriguing future area of alphaherpesvirus genomics research will be to link viral genetics to observable phenotypes, particularly virulence (Dutilh et al., 2013; Power et al., 2016). As current-generation HTS technologies have matured, genome-wide association studies (GWAS) and quantitative trait locus (QTL) studies have been applied to this question for herpesviruses. Brandt and colleagues recently used QTL analyses to examine the contribution of specific viral genotypes to ocular phenotypes of HSV-1 infection in mice (Lee et al., 2016). This study used the recombinant progeny of two strains of HSV-1, where the individual recombinants were completely sequenced with HTS and analyzed with comparative genomics (Lee et al., 2015). Examining the phenotypes of randomly generated or naturally occurring variants is also known as forward genetics, and these approaches can complement previous decades of reverse genetic approaches (i.e. disrupting specific genes and studying the results). Reverse genetics was crucial to discover the function(s) of many herpesvirus genes, but the types of mutations that are introduced in most reverse genetic approaches (e.g. entire gene deletion or complete disruption of enzymatic function) are rarely seen in field or clinical samples (Bryant et al., 2018; Depledge et al., 2018b; Pandey et al., 2016; Johnston et al., 2017b). The more subtle changes observed in natural settings may have substantial impacts on human or animal clinical outcomes (Loncoman et al., 2017; Bryant et al., 2018; Mechelli et al., 2015; Arav-Boger, 2015). Extending GWAS analyses to naturally isolated viral variants, particularly in comparison to matched human samples, will shed light on the intersection of human and viral genetics that leads to a spectrum of observed disease outcomes (Thompson et al., 2014; Ramchandani et al., 2019).