
Genetic Diversity of Bundibugyo Ebolavirus from Uganda and the Democratic Republic of Congo

Isaac Emmanuel Omara1,2, Sylvia Kiwuwa-Muyingo3,7, Stephen Balinandi1, Luke Nyakarahuka1,5, 

Jocelyn Kiconco1, John Timothy Kayiwa1, Gerald Mboowa2,4, Daudi Jjingo4,6, Julius J. Lutwama1 

Affiliations

1. Department of Arbovirology, Emerging and Re-emerging Infectious Diseases, Uganda Virus Research Institute, Entebbe, Wakiso, Uganda
2. Department of Immunology and Molecular Biology, School of Biomedical Sciences, College of Health Sciences, Makerere University, Kampala, Uganda
3. Department of Data Measurement and Evaluation, African Population and Health Research Center, Nairobi, Kenya
4. The African Center of Excellence in Bioinformatics and Data-Intensive Sciences, The Infectious Disease Institute, Makerere University, Kampala, Uganda
5. School of Bio-security, Bio-technical and Laboratory Sciences, College of Veterinary Medicine, Animal Resources and Bio-security, Makerere University, Kampala, Uganda
6. Department of Computer Science, College of Computing and Information Sciences, Makerere University, Kampala, Uganda
7. MRC/UVRI and LSHTM Uganda Research Unit, Entebbe, Uganda

Corresponding author

Email: omara.isaac.88@gmail.com (OIE)

Author Contributions 

Isaac E. Omara: Conceptualization, data retrieval and curation, formal analysis, interpretation, drafting of the original manuscript, coordination of manuscript writing and editing

Sylvia Kiwuwa-Muyingo: Made comments, guided and provided mentorship throughout manuscript writing up to submission

Stephen Balinandi: Involved in conceptualization, made comments and reviewed the manuscript

Luke Nyakarahuka: Involved in conceptualization, made comments and reviewed the manuscript

John T. Kayiwa: Made comments and reviewed the manuscript

Jocelyn Kiconco: Made comments and reviewed the manuscript

Gerald Mboowa: Guidance during conceptualization, overall oversight during project execution

Daudi Jjingo: Guidance during conceptualization, overall oversight during project execution

Julius J. Lutwama: Guidance during conceptualization, overall oversight during project execution

bioRxiv preprint doi: https://doi.org/10.1101/2021.10.18.464898; this version posted October 18, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license (http://creativecommons.org/licenses/by/4.0/).



Abstract

Background

Ebolavirus is one of the deadliest viral pathogens. It was first discovered in 1976 during two consecutive outbreaks in the Democratic Republic of Congo and Sudan. Six strains have been documented. Bundibugyo Ebolavirus first emerged in 2007 in Uganda, in an outbreak that comprised 116 human cases and 39 laboratory-confirmed deaths. Five years later it re-emerged and caused an epidemic for the first time in the Democratic Republic of Congo in 2012, as reported by the WHO; 36 human cases and 13 laboratory-confirmed deaths were registered. Despite several previous studies, knowledge of the genetic diversity of Bundibugyo Ebolavirus remains scarce. We undertook this study to provide insights into the unique variants of Bundibugyo Ebolavirus that circulated in the two epidemics in Uganda and the Democratic Republic of Congo.

Materials and Methods

The bioinformatics approaches used were quality control, reference mapping, variant calling, annotation, multiple sequence alignment and phylogenetic analysis, in order to identify genomic variants and to determine the genetic relatedness between the two epidemics. Overall, we used 41 viral sequences retrieved from the publicly available National Center for Biotechnology Information (NCBI) GenBank database.

Results

Our analysis identified 14,362 unique genomic variants from the two epidemics. The Uganda isolates had 5,740 unique variants, 75 of which had a high impact on the genome: 51 frameshift, 15 stop gained, 5 stop lost, 2 missense, 1 synonymous and 1 stop lost and splice region. Their effects occurred mainly within the L-gene region, at reference positions 17705, 11952, 11930 and 11027. For the DRC genomes, 8,622 variant sites were identified. These variants had a modifier effect on the genome; examples are C213T, A266G and C439T, at reference positions 213, 266 and 439. Phylogenetic reconstruction identified two separate and unique clusters from the two epidemics.

Conclusion

Our analysis provided further insights into the genetic diversity of Bundibugyo Ebolavirus from the two epidemics. The Bundibugyo Ebolavirus strain was genetically diverse, with multiple variants. Phylogenetic reconstruction identified two unique variants. This suggests that an independent spillover event from a natural reservoir, rather than a continuation of the ancestral outbreak, initiated the resurgence in the DRC in 2012. Therefore, the two epidemics were not genetically related.

Keywords: Bundibugyo, Ebolavirus, RT-PCR, DRC, RNA, Viral Hemorrhagic Fever


Introduction

Ebolavirus is one of the deadliest viral pathogens. It was first discovered in 1976 during two consecutive outbreaks in the Democratic Republic of Congo (DRC) and Sudan (1). Since then, over 30 outbreaks have been reported in Sub-Saharan Africa, with an estimated 14,000 deaths and case fatality rates of up to 90% (2)(1). These viruses belong to the family Filoviridae and the genus Ebolavirus (2). There are six known strains in the genus Ebolavirus, all of which have a negative-sense, single-stranded RNA genome of approximately 18-19 kilobases (3). They are Zaire Ebolavirus, Sudan Ebolavirus, Bundibugyo Ebolavirus, Reston Ebolavirus, Tai Forest Ebolavirus (4) and Bombali Ebolavirus (5). The first three strains have been documented to cause severe illness and death in both humans and non-human primates, with case fatality rates ranging from 40% to 90% (6)(7). The Reston and Tai Forest Ebolaviruses have not yet been found to cause human deaths (1). Since the first discovery in 1976, there have been recurrent outbreaks of Ebolaviruses in Sub-Saharan African countries (8)(9)(1)(10), with new cases reported almost every five years in East and Central Africa. For example, Uganda has reported seven Ebolavirus outbreaks since 2000, and the DRC recorded its 12th outbreak in February 2021 (1)(11)(12).

Bundibugyo Ebolavirus has a genome of 18,940 base pairs. Its RNA genome encodes seven structural proteins: the nucleoprotein (NP), two virion proteins (VP35 and VP40), a surface glycoprotein (GP), two additional virion proteins (VP30 and VP24) and an RNA-dependent RNA polymerase (L). It also encodes a non-structural soluble glycoprotein (sGP). The L gene, which codes for the RNA polymerase, is the most conserved region, whereas VP40 is the most polymorphic gene in the Ebolavirus.

Bundibugyo Ebolavirus made its first appearance on 1st August 2007, when cases of a viral hemorrhagic fever were reported in Bundibugyo and Kikyo townships in Bundibugyo district, western Uganda (11). The outbreak resulted in 116 human cases and 39 laboratory-confirmed deaths (13). The suspected index case was a 26-year-old woman from Kabango village in Bundibugyo district, who presented with general weakness, fever and diarrhea and was then hospitalized (13). Blood samples from her and from other suspected cases were collected and sent to the Uganda Virus Research Institute (UVRI) and the US Centers for Disease Control and Prevention. On 29th November 2007, laboratory investigations confirmed a novel strain of Ebolavirus, which was named Bundibugyo (13)(14). The epidemiological investigation found hunting spears near the index case's home, but hunting as a practice was denied.


To strengthen preparedness for and response to Viral Hemorrhagic Fever (VHF) in Uganda, the UVRI re-initiated the national VHF surveillance programme in 2010 [34], through an agreement between the Uganda Ministry of Health, the Uganda Virus Research Institute and the US Centers for Disease Control and Prevention [32]. To date, the UVRI serves as the national and regional reference laboratory for the detection of, and response to, VHF outbreaks of public health relevance in the region [35]. Diagnostics have since improved and now provide real-time reporting of detected VHF cases. The laboratory diagnostic assays that have been implemented include IgM and IgG ELISA for antigen detection, RT-PCR and sequencing [36].

Since then, several studies have been conducted. First, Towner et al. (2008) highlighted the high level of genetic diversity at the amino acid level in the encoded virus proteins, amounting to over 27% and 35% for Bundibugyo and Zaire Ebolavirus respectively (11). Second, other studies have reported that variation in the Ebolavirus genome might affect the efficacy of sequence-based virus detection and the design of candidate therapeutics (15). Third, a study conducted after the two simultaneous occurrences of Ebolavirus in the DRC and South Sudan in 1976 (16) found a correlation between Ebolavirus disease and animal disease outbreaks (17). This is because Ebolavirus is transmitted by direct contact with the blood or other secretions of infected animals or persons (18)(19). In addition, more recent studies using Polymerase Chain Reaction (PCR) and antibody tests have identified cave-dwelling fruit bats as possible natural reservoirs of most Ebolavirus strains (20)(21). Spillover events therefore occur when the animal-human interface is bridged through human activities such as hunting wildlife for bush meat (22). This sparks off epidemics, which are most often followed by sustained human-to-human transmission (23)(24).

Five years after the ancestral outbreak was declared over in Uganda, Bundibugyo Ebolavirus re-emerged and caused an epidemic for the first time in the DRC (25). This was reported by the WHO on 17th August 2012 in Isiro Province (25)(26). The putative index case for this epidemic remains unidentified [14]. However, early laboratory investigations using RT-PCR assays confirmed infection in a clinic nurse in Isiro Province whose symptoms began on 28th June 2012 (25). She reported multiple potential exposures, including contact with other sick people, exposure to bats and attendance at a funeral service (25). The outbreak resulted in 36 human cases and 13 laboratory-confirmed deaths (26). Despite all these studies, there is still limited scientific information available to explain the genetic diversity of this strain.


Furthermore, in light of new evidence from the February 2021 outbreaks of Ebolavirus in the Republic of Guinea and the DRC, a new paradigm for how these outbreaks start has been identified (27)(17). These findings suggest that the putative index cases behind the February 2021 resurgences are linked to contact with survivors of past Ebolavirus outbreaks (28). Surprisingly, the previous outbreak of Ebolavirus in Guinea occurred 5-7 years earlier, at the time of the West African outbreak (29)(30), whereas the resurgence in the DRC occurred a year after the 2020 outbreak was declared over (17). These cases have already raised important new research questions, such as “How do we need to change our response to escape from the cycle of outbreak-response-re-introduction-outbreak?” and “Can new therapeutics be used to clear viruses from survivors?”. The immediate question is what these new findings mean for Ebolavirus survivors, who already face many challenges (31). This has created a need to reconsider local and scientific accounts of past Ebolavirus outbreaks (27), for example the two epidemics of Bundibugyo Ebolavirus in Uganda in 2007 (11) and in the DRC in 2012 (25).

We therefore undertook this study to gain a better understanding of these two epidemics. Our main aim was to determine the genetic diversity of Bundibugyo Ebolavirus from Uganda and the Democratic Republic of Congo. The specific goals were i) to identify the unique variants in isolates of Bundibugyo Ebolavirus from the epidemics that occurred in Uganda and the Democratic Republic of Congo, and ii) to determine the genetic relatedness between the Bundibugyo Ebolavirus outbreaks in Uganda (2007) and the Democratic Republic of Congo (2012). Ultimately, we aimed to determine whether the resurgence of Bundibugyo Ebolavirus in the DRC was an independent spillover event from nature or a continuation of the ancestral outbreak, possibly through contact with past survivors.

Fig 1: The working hypotheses for the resurgence in DRC in the year 2012


Materials and Methods

Study Design

This was a retrospective descriptive study. The source organism in our analysis was the Bundibugyo Ebolavirus strain, which has a genome size of 18,940 base pairs (32). We used publicly available sequence data retrieved from the National Center for Biotechnology Information (NCBI) GenBank database (33). The NCBI hosts the Sequence Read Archive (SRA) repository and the Nucleotide sequence database (34). The SRA is the largest publicly available repository of raw sequence data from high-throughput sequencers (35). The 31 raw sequence datasets representing isolates collected from the epidemic in Uganda were retrieved from this repository. The Nucleotide sequence database holds assembled genomes deposited from different experiments (36). The 4 nucleotide sequences representing isolates from the DRC in our analysis were retrieved from this database.

Sample Size Determination

The isolates from the Uganda outbreak in 2007 were represented by 31 fastq sequences and 6 fasta sequences. The isolates from the DRC outbreak were represented by 4 fasta sequences retrieved from the Nucleotide sequence database (36). Table 1 below shows the characteristics of the patients and samples from which the DRC sequences were generated (25).

Table 1: Patient and sample characteristics from the DRC outbreak in the year 2012

Case ID   GenBank accession   Age/Sex   Occupation   Sampling location   Clinical status
112       KC545393            44/F      homemaker    Isiro Province      deceased
120       KC545394            77/M      unknown      Vungba              deceased
122       KC545395            unknown   unknown      unknown             survived
37        KC545396            18/F      student      Isiro Province      deceased

F, female; M, male

Bioinformatics Analysis

The fastq-dump tool (37) was used to download the sequence data from the NCBI database. The sequences and their metadata were stored on a High-Performance Computing (HPC) server at the African Center of Excellence in Bioinformatics and Data Intensive Sciences (38). Quality assessment was not performed on the DRC sequences because they were assembled genomes. For the 31 raw sequence datasets (fastq format) from the Uganda outbreak in 2007, however, a comprehensive quality control check was performed to ensure that they were of good quality before downstream analysis (39). This assessment was performed using FastQC (v0.11.9) (40) and MultiQC (v1.9) (41). Low-quality regions, including adapter sequences, were then trimmed with the Phred quality threshold set at 20 (42).
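As a rough illustration of what a Phred threshold of 20 means, the sketch below converts FASTQ quality characters to Phred scores and trims low-quality bases from the 3' end of a read. It assumes Phred+33 encoding; the function names and the simple trailing-trim rule are illustrative assumptions and do not reproduce the trimming tool cited in (42).

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string to Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)


def trim_3prime(sequence, quality_string, threshold=20):
    """Drop trailing bases whose Phred score is below the threshold."""
    scores = phred_scores(quality_string)
    end = len(scores)
    # Walk back from the 3' end until a base meets the threshold.
    while end > 0 and scores[end - 1] < threshold:
        end -= 1
    return sequence[:end], quality_string[:end]


# Hypothetical read: 'I' encodes Q40, '#' encodes Q2.
trimmed_seq, trimmed_qual = trim_3prime("ACGTAC", "IIII##")
print(trimmed_seq, trimmed_qual)
```

A Phred score of 20 corresponds to an expected error rate of 1 in 100 base calls, so bases below the threshold are more likely than that to be wrong.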

Variant Analysis 

For the Uganda genomes, the quality-filtered raw sequence data were mapped against a reference genome using the Burrows-Wheeler Aligner (BWA, 0.7.17-r1188) (43). The reference genome was an isolate from the 2007 outbreak in Uganda (GenBank accession number FJ217161). This isolate was chosen because it has a complete genome of 18,940 base pairs and is an original isolate from the ancestral outbreak. Variants were then called using freebayes (v1.3.1-dirty) (44)(45). This tool was chosen because it is a haplotype-based Bayesian genetic variant detector that outputs a Variant Call Format (VCF) file of small polymorphisms, specifically SNPs (single-nucleotide polymorphisms), indels (insertions and deletions) and MNPs (multi-nucleotide polymorphisms) (45). We then used SnpEff to perform variant annotation (46). This tool predicts the functional effects of variants, such as amino acid changes in proteins (47).
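To make the variant classes in a VCF file concrete, the sketch below classifies a REF/ALT allele pair as a SNP, MNP, insertion or deletion by comparing allele lengths. This is a simplified rule written for illustration, not the logic used inside freebayes; the function names are hypothetical, and the chromosome name and placeholder columns in the example lines are assumptions. The alleles themselves are taken from variants reported in this study.

```python
def classify_variant(ref, alt):
    """Classify a REF/ALT pair by allele length (simplified rule)."""
    if len(ref) == len(alt):
        return "SNP" if len(ref) == 1 else "MNP"
    return "insertion" if len(alt) > len(ref) else "deletion"


def parse_vcf_line(line):
    """Return (position, ref, alt, type) for each ALT allele on a VCF data line."""
    fields = line.rstrip("\n").split("\t")
    # VCF columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, ...
    pos, ref, alts = int(fields[1]), fields[3], fields[4].split(",")
    return [(pos, ref, alt, classify_variant(ref, alt)) for alt in alts]


example_lines = [
    "FJ217161\t11930\t.\tG\tT\t.\t.\t.",
    "FJ217161\t17705\t.\tTAT\tTT\t.\t.\t.",
    "FJ217161\t11952\t.\tGAAAAAATTTTG\tGAAAAAAATTTTG\t.\t.\t.",
]
for line in example_lines:
    print(parse_vcf_line(line))
```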

To annotate variants, a database must first be built from the reference genome. This was done using the “SnpEff build” command. To create the SnpEff database, we downloaded from NCBI the sequence data for the reference genome of Bundibugyo Ebolavirus (accession number FJ217161): the FASTA file containing the entire genome and the corresponding General Feature Format (GFF) file containing the annotations (48). SnpEff was then used to annotate the variants. The annotation was output as an annotated Variant Call Format (VCF) file and an HTML report containing summary statistics for the different variants (46). In addition, Python v3.6.3 (49) was used to construct a bar plot showing the frequency of unique variants with high impact on the genome.
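A tally of this kind can be sketched in a few lines of Python. The example below counts effect terms by impact category from the `ANN` field that SnpEff writes into the VCF INFO column, where each annotation is a `|`-separated record whose second field is the effect and whose third is the impact. The function name, the two example lines and their annotation values are hypothetical and are not taken from the study data; this is not the authors' script.

```python
from collections import Counter


def count_effects(vcf_lines, impact="HIGH"):
    """Count SnpEff effect terms that carry the given impact category."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip VCF header lines
        info = line.rstrip("\n").split("\t")[7]
        for entry in info.split(";"):
            if not entry.startswith("ANN="):
                continue
            # One variant can carry several comma-separated annotations.
            for annotation in entry[4:].split(","):
                parts = annotation.split("|")
                # parts[1] is the effect term, parts[2] the impact category.
                if len(parts) > 2 and parts[2] == impact:
                    counts[parts[1]] += 1
    return counts


# Hypothetical annotated lines, for illustration only.
example_lines = [
    "ref\t100\t.\tG\tT\t.\t.\tANN=T|stop_gained|HIGH|L|L|transcript|L.1|protein_coding",
    "ref\t200\t.\tTAT\tTT\t.\t.\tDP=10;ANN=TT|frameshift_variant|HIGH|L|L|transcript|L.1|protein_coding,"
    "TT|downstream_gene_variant|MODIFIER|L|L|transcript|L.1|protein_coding",
]
print(count_effects(example_lines))
```

Combined terms such as `stop_lost&splice_region_variant` are counted under their own key, which matches how a "stop lost and splice region" category would appear in a summary.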

The DRC sequences, together with an isolate of Bundibugyo Ebolavirus from the 2007 outbreak as the reference sequence (GenBank accession number FJ217161), were concatenated into a single multi-FASTA file. This reference sequence was used to determine how the variants from the 2012 outbreak were phylogenetically distinct from the 2007 outbreak in Uganda. Multiple sequence alignment was performed on the FASTA sequences using MAFFT v7.310 (50). Variants were then called from the alignment FASTA file using the SNP extraction tool SNP-sites v2.3.3 (51), which restructures the aligned data as a Variant Call Format (VCF) file. This VCF file provides a clear mapping of SNPs from the aligned sequences, which allowed easy identification of the SNP location and the genotype of each sample at a given locus (52). In the output VCF file, each row corresponds to a unique variant and the sample columns give the genotype at that site (53). A summary of the SNPs relative to the reference sequence was then visualized using the snipit tool (https://github.com/aineniamh/snipit), and SnpEff was used to annotate the variants. Using bash scripts, a report showing the effect of the variants on the different sequences was extracted (54).
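The idea of extracting SNPs from an alignment can be sketched as follows: walk along two aligned sequences and report each substitution in the REF-position-ALT notation used in this paper (for example C213T). This is a minimal illustration rather than the SNP-sites implementation; the function name is hypothetical, and positions are alignment columns, which equal reference coordinates only when the aligned reference contains no gaps upstream of the site.

```python
def snp_sites(reference, sample):
    """List substitutions between two aligned sequences, e.g. 'C213T' (1-based)."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    sites = []
    for i, (r, s) in enumerate(zip(reference.upper(), sample.upper()), start=1):
        # Gaps and ambiguous bases are skipped; only true substitutions count.
        if r != s and r in valid and s in valid:
            sites.append(f"{r}{i}{s}")
    return sites


# Toy aligned sequences, not real Ebolavirus data.
print(snp_sites("ACGTAC", "ATGTAA"))
```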

Phylogenetic Analysis to determine the genetic relatedness between the two Epidemics

All the quality-filtered raw sequence data from Uganda were assembled using both SPAdes v3.13.1 (55) and ABySS 1.9.0 (56)(57) in order to obtain a consensus sequence. These two genome assemblers are well suited to the assembly of short paired-end reads (58). A draft scaffold was then obtained using SSPACE (59), a standalone tool for scaffolding with paired-end reads. It orients sequences into connected scaffolds using the mean and standard deviation of the insert size specified for each read library (59). GapFiller was then used to find and fill gaps in the contiguous sequence (60). All the FASTA sequences obtained from Uganda and the DRC, together with the reference sequence, were concatenated into a single FASTA file. Multiple sequence alignment was then performed using MAFFT v7.310 (50) and manually checked in AliView v1.27 (61). The 5’ and 3’ untranslated regions were then trimmed to remove any remaining gaps. The maximum-likelihood phylogenetic tree was constructed using IQ-TREE (62) and PhyML (63); the best-suited substitution model was determined and the analysis was run for 1000 replicates. The resulting Newick file was uploaded to the Interactive Tree of Life, iTOL v4.0 (64), an online tool for phylogenetic tree visualization. The tree was rooted at the midpoint to separate the variants from Uganda and the Democratic Republic of Congo.
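Before building a tree, a quick sanity check on an alignment is to compute pairwise p-distances, the proportion of compared sites at which two sequences differ. The sketch below is only that check, under the assumption that the sequences are already aligned; it is not the maximum-likelihood method used in this study, and the function names and toy sequences are hypothetical.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences (gaps, Ns ignored)."""
    valid = set("ACGT")
    compared = differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in valid and b in valid:
            compared += 1
            differing += a != b
    return differing / compared if compared else float("nan")


def distance_table(alignment):
    """Map each pair of sequence names to its p-distance; alignment is {name: sequence}."""
    return {
        (x, y): p_distance(alignment[x], alignment[y])
        for x, y in combinations(sorted(alignment), 2)
    }


# Toy alignment, not real data.
toy = {"UG_1": "ACGTACGT", "UG_2": "ACGTACGA", "DRC_1": "ATGTTCGA"}
for pair, d in distance_table(toy).items():
    print(pair, round(d, 3))
```

Sequences from the same outbreak would be expected to show smaller pairwise distances to each other than to sequences from the other outbreak if the two clusters are genuinely distinct.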


Results

Variant Analysis 

In the Uganda genomes, 37 sequences of Bundibugyo Ebolavirus were analyzed. We identified 5,740 distinct genomic variants, which are recorded in Table 2 below. The variants were distributed across different regions: downstream, upstream, exon, intergenic, and the 3 prime and 5 prime Untranslated Regions (UTR). However, the majority of the variants were found in the downstream and upstream regions of the genome (non-coding regions). The viral sequences were highly diverse, and the variants at reference positions 17705, 11952, 11930 and 11027 appeared in most of the isolates collected from this outbreak.

Table 2: The frequency of each unique variant type in isolates of the Bundibugyo Ebolavirus collected from the 2007 outbreak in Uganda (number of effects by type).

Annotation                                        Count
downstream_gene_variant                           2,375
upstream_gene_variant                             1,945
missense_variant                                    543
synonymous_variant                                  284
3_prime_UTR_variant                                 268
intergenic_region                                   103
5_prime_UTR_variant                                  79
frameshift_variant                                   68
stop_gained                                          57
splice_region_variant                                 7
stop_lost                                             5
5_prime_UTR_premature_start_codon_gain_variant        3
conservative_in-frame_deletion                        2
stop_retained_variant                                 1
Total number of unique genomic variants           5,740

In addition, we identified 75 unique variants with high impact on the genome: 51 frameshift, 15 stop gained, 5 stop lost, 2 missense, 1 synonymous and 1 stop lost and splice region. The most common high-impact effects were frameshifts and stop gained. Examples include TAT17705TT, GAAAAAATTTTG11952GAAAAAAATTTTG, G11930T and CAAAAAACCCG11027CAAAAAAACCCG. Their effects occurred mostly in the L-gene region of the Bundibugyo Ebolavirus. Refer to S2 in the Appendix for a supporting table showing the frequency of unique variants with high impact on the genome.


Fig 2: Bar plot showing the frequency of unique variants with high impact on the genomes of Bundibugyo Ebolavirus isolated from the 2007 outbreak in Uganda

On the other hand, we identified 8,622 nucleotide variant sites in the isolates of Bundibugyo Ebolavirus collected from the 2012 outbreak. All of these variants had a modifier effect on the genome, an effect determined by the variant type in each isolate. Examples include C213T, A266G and C439T. Fig 3 below shows the nucleotide variant sites in the DRC sequences relative to the reference sequence (GenBank accession number FJ217161), a complete genome isolated from the 2007 outbreak. This reference was used to find out how genetically distinct the sequences from the 2012 outbreak in the DRC were from the ancestral outbreak of 2007 in Uganda.


Fig 3: Nucleotide alignment showing variant sites in the four sequences relative to the reference genome. The 4 sequences represent isolates collected from the DRC outbreak in 2012 (NCBI accession numbers: KC545393, KC545394, KC545395, KC545396). The reference sequence is an isolate from the ancestral outbreak (GenBank accession number: FJ217161).

Phylogenetic Analysis to determine the genetic relatedness between the two Epidemics 

When the tree in Fig 4 below was rooted at the midpoint, two separate and unique clusters were identified from the two epidemics. Phylogenetic reconstruction shows that the 4 sequences isolated from the outbreak in the DRC cluster separately and distantly from those of the 2007 outbreak in Uganda. This signifies a separate variant. In our analysis, we identified approximately 8,622 mutations in the 4 DRC sequences, about 1.5 times the number identified in the ancestral outbreak in Uganda (5,740).


Fig 4: Maximum-likelihood phylogenetic tree showing the two unique clusters identified from the two epidemics that occurred in Uganda (2007) and the Democratic Republic of Congo (2012). The reference sequence used was an isolate from the ancestral outbreak (GenBank accession number FJ217161).


Discussion

We aimed to determine the genetic diversity of Bundibugyo Ebolavirus from the epidemics that occurred in Uganda in 2007 and in the DRC in 2012. Our specific goal was to identify the unique variants in isolates collected from these two epidemics, and ultimately to determine the genetic relatedness between the two epidemics.

Our analysis identified a total of 14,362 unique genomic variants from the two epidemics. The
high-impact variants in the Uganda genomes were mainly frameshift, stop-gained and missense
mutations. A frameshift is a genetic variant that changes the way codons are read when an amino
acid sequence is produced (65). It is caused by the insertion or deletion of a number of
nucleotides that is not a multiple of three (66). This matters because cells read a gene in groups
of three bases, and each group corresponds to one of 20 different amino acids used to build a
protein (67). Therefore, if a mutation disrupts this reading frame, the sequence that follows the
mutation is read incorrectly (68). In a stop-gained variant, a change in at least one base of a
codon creates a premature stop codon (69). Translation of the messenger RNA into protein then
terminates early, yielding a non-functional or unstable protein (52). A missense mutation is a
single-base substitution that alters a codon so that a different amino acid is incorporated (70).
Most of these variant effects occurred within the L-gene region of the Bundibugyo Ebolavirus.
Since the L-gene is the most conserved region and is a target for primer design (71), the findings
from our analysis are in line with a previous study conducted after the 2007 outbreak (11). In that
study, sequence analysis of a PCR fragment from the virus L-gene revealed the reason for the
initial failure of real-time RT-PCR assays: the viral sequences were divergent from the four
already known strains of Ebolavirus (11). Therefore, these alterations of the genetic code or
disruptions of the reading frame could have resulted in the formation of a new strain of
Ebolavirus, and our findings support this past study. The high frequency of frameshift variants is
suggestive of a new strain (52), the Bundibugyo Ebolavirus, which first emerged in Uganda in
2007 (11).
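The three high-impact categories described above can be illustrated with a minimal Python sketch. It assumes codons written in the DNA alphabet (as in a VCF file) and read on the coding strand, and it ignores gene boundaries and strand, which a full annotation tool would handle. The function names `indel_effect` and `codon_effect` are illustrative and are not part of the analysis pipeline. The two indels are taken from S2 Appendix; the codon substitutions are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def indel_effect(ref: str, alt: str) -> str:
    """Classify a variant by how its length change affects the reading frame."""
    length_change = abs(len(alt) - len(ref))
    if length_change == 0:
        return "substitution"
    # A length change that is not a multiple of three shifts every downstream codon.
    return "frameshift" if length_change % 3 else "inframe_indel"


def codon_effect(ref_codon: str, alt_codon: str) -> str:
    """Classify a substitution within one codon by its effect on the amino acid."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop_gained"
    if ref_aa == "*":
        return "stop_lost"
    return "missense"


# Indels reported in S2 Appendix (reference positions 11952 and 12344).
print(indel_effect("GAAAAAATTTTG", "GAAAAAAATTTTG"))
print(indel_effect("GAAGAT", "GAGAT"))
# Hypothetical codon changes, for illustration only.
print(codon_effect("TAC", "TAA"))
print(codon_effect("GAA", "GTA"))
```

Both indels change the sequence length by one base, so the sketch labels them as frameshifts, in agreement with the variant types listed in S2 Appendix.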

In contrast, 8,622 nucleotide variant sites were identified in the DRC genomes. These variants had
a modifier effect on the genome, an effect determined by the type of variant identified in these
isolates (52). Modifier variants are genetic variants that alter phenotypic outcomes (72), which
could include dominance, expression and penetrance (73). Naturally, viruses accumulate mutations
over time


which may arise from adaptations in response to environmental changes or immune responses of the
host reservoirs (74). Sometimes a virus is transmitted and persists after fixing beneficial
mutations that


would allow it to better exploit its host or new hosts (75). This scenario could explain the
re-emergence of the Bundibugyo Ebolavirus. That is, after the ancestral outbreak was declared over
in 2007, the virus might have evolved over a period of five years in a natural reservoir,
generating variation in its genome. This resulted in a separate and unique variant that was
responsible for the 2012 outbreak in the DRC (76). The high frequency of modifier effects on the
genome is an indicator of, and possibly explains, the divergence or formation of a new variant
distinct from the ancestral type (25)(26).
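To make the impact categories concrete, the sketch below tallies the impact class of each annotated variant. It assumes VCF records whose INFO column carries a SnpEff-style `ANN` field, in which the third pipe-separated subfield is the impact (HIGH, MODERATE, LOW or MODIFIER; in that scheme MODIFIER is typically assigned to variants outside coding regions), and it counts only the first annotation of each record. The function name `impact_counts` and the three example records are hypothetical, with truncated `ANN` entries, and are not taken from our data.

```python
from collections import Counter


def impact_counts(vcf_lines):
    """Count the impact class of the first ANN annotation in each VCF record."""
    counts = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 8:
            continue  # not a complete VCF record
        for entry in columns[7].split(";"):  # column 8 is INFO
            if entry.startswith("ANN="):
                first_annotation = entry[len("ANN="):].split(",")[0]
                subfields = first_annotation.split("|")
                if len(subfields) > 2:
                    counts[subfields[2]] += 1  # Allele|Annotation|Impact|...
                break
    return counts


# Hypothetical records with truncated ANN fields, for illustration only.
EXAMPLE_VCF = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "ref\t100\t.\tG\tT\t60\tPASS\tDP=40;ANN=T|stop_gained|HIGH|L",
    "ref\t200\t.\tC\tT\t55\tPASS\tDP=35;ANN=T|intergenic_region|MODIFIER|NP-VP35",
    "ref\t300\t.\tA\tG\t48\tPASS\tDP=22;ANN=G|upstream_gene_variant|MODIFIER|VP40",
]

print(dict(impact_counts(EXAMPLE_VCF)))
```

Run per outbreak, a tally of this kind would give the frequency of each impact class that the comparison above relies on.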

Phylogenetic trees aid our understanding of the evolutionary relationships between groups (77).
Here, we used one to determine the genetic relatedness between the epidemics that occurred in
Uganda in 2007 and in the DRC in 2012. The phylogenetic reconstruction in Fig 4 shows that the 4
sequences from the 2012 outbreak in the DRC cluster together and are similar to one another but
only distantly related to those of the ancestral outbreak (78). This signifies a new variant. In
our analysis, we identified 8,622 mutations in the 4 DRC sequences, about 1.5 times the number
identified from the ancestral outbreak in Uganda (5,740) (79). This is indicative of viral
evolution over the period of five years (80). In other words, mutations (SNPs) accumulate far less
frequently in a genome persisting in a human survivor than in one circulating in a host reservoir
(81), because the virus undergoes a period of latency in a human survivor (82). Therefore, these
two separate variants indicate that the 2012 outbreak in the DRC was a new introduction, that is,
an independent spillover event from an animal reservoir, rather than human-to-human transmission
through contact with a past survivor.
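The counts quoted in this section can be cross-checked with a few lines of Python. The variable names are illustrative; the numbers are the variant counts reported above for the Uganda (5,740) and DRC (8,622) genomes.

```python
# Variant counts reported in the text for each outbreak.
uganda_variants = 5_740
drc_variants = 8_622

total_variants = uganda_variants + drc_variants
fold_difference = drc_variants / uganda_variants

print(f"Total unique variants: {total_variants:,}")
print(f"DRC / Uganda ratio: {fold_difference:.2f}")
```

The sum reproduces the 14,362 unique variants reported for the two epidemics combined, and the ratio comes to roughly 1.5.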

Our study had limitations. Limited sampling meant that less sequence data were generated from the
2012 outbreak, and some patient demographics were unknown, which hindered our interpretation of
the epidemiological and molecular findings. This led to uncertainty in drawing conclusions on the
genetic diversity of Bundibugyo Ebolavirus from the 2012 outbreak in the DRC, for example in the
variant analysis and phylogenetic estimation. The availability of more, or more complete, genomes
from the 2012 DRC outbreak would improve the study of transmission dynamics between these two
epidemics, as well as the identification of key SNPs that can advance the study of Bundibugyo
Ebolavirus pathogenesis.


In conclusion, our analysis provided further insights into the genetic diversity of Bundibugyo
Ebolavirus from the two epidemics. Variant characterization can be used in the fight against
Bundibugyo Ebolavirus and in the development of effective treatments or vaccines, because key SNPs
have been identified that can be used for further research on the pathogenesis of Bundibugyo
Ebolavirus. The findings from our study have also provided knowledge on the likely origin of the
2012 outbreak in the DRC. Phylogenetic reconstruction identified two unique variants. This
signified that an independent spillover event from a natural reservoir, rather than a continuation
of the ancestral outbreak, initiated the resurgence in the DRC in 2012. Therefore, the two
epidemics are not genetically related.

Abbreviations

BDBV: Bundibugyo Ebolavirus, RT-PCR: Reverse Transcription Polymerase Chain Reaction, 

DRC: Democratic Republic of Congo, SNP: Single Nucleotide Polymorphism, IgM: 

Immunoglobulin M, IgG: Immunoglobulin G, UVRI: Uganda Virus Research Institute

Acknowledgments

I would like to extend my sincere gratitude to the Department of Immunology and Molecular Biology,
College of Health Sciences, Makerere University, for the training leading to the award of a Master
of Science in Bioinformatics. Special thanks go to the Department of Arbovirology, Emerging and
Re-emerging Infectious Diseases at the Uganda Virus Research Institute, which offered financial
support to facilitate my studies. Finally, I would like to extend my appreciation to the MRC/UVRI
and LSHTM Uganda Research Unit for the Manuscript Mentorship Programme that facilitated the
submission of this master's research project for publication.

Supporting Information

S1 Appendix: List of the GenBank identifiers for the sequences that were used in our analysis

S2 Appendix: The frequency of unique variants which had high impact on the genomes

Ethical Clearance

This research project was approved by the School of Biomedical Sciences Research and Ethics
Committee (SBSREC), an institutional review board within the College of Health Sciences at
Makerere University. The protocol number was SBS-2021-64.


Data Availability

There was no funding for this project. We used publicly available sequence data retrieved from the
National Center for Biotechnology Information (NCBI). The links are given below.

Raw sequence data:  https://www.ncbi.nlm.nih.gov/sra/?term=Bundibugyo+Ebolavirus+in+Uganda

Assembled genomes: https://www.ncbi.nlm.nih.gov/nuccore/?term=Bundibugyo+Ebolavirus


References

1. Rugarabamu S, Mboera L, Rweyemamu M, Mwanyika G, Lutwama J, Paweska J, et al. 
Forty-two years of responding to Ebola virus outbreaks in Sub-Saharan Africa: A review. 
BMJ Glob Heal. 2020;5(3):1–10. 

2. Majid MU, Tahir MS, Ali Q, Rao AQ, Rashid B, Ali A, et al. Nature and history of Ebola 
virus: an overview. Arch Neurosci. 2016;3(3):e35027. 

3. Dietzel E, Schudt G, Krähling V, Matrosovich M, Becker S. Functional Characterization of 
Adaptive Mutations during the West African Ebola Virus Outbreak. J Virol. 2017;91(2). 

4. Carroll SA, Towner JS, Sealy TK, McMullan LK, Khristova ML, Burt FJ, et al. Molecular 
evolution of viruses of the family Filoviridae based on 97 whole-genome sequences. J Virol. 
2013;87(5):2608–16. 

5. Goldstein T, Anthony SJ, Gbakima A, Bird BH, Bangura J, Tremeau-Bravard A, et al. The 
discovery of Bombali virus adds further support for bats as hosts of ebolaviruses. Nat 
Microbiol. 2018;3(10):1084–9. 

6. Chippaux JP. Outbreaks of Ebola virus disease in Africa: The beginnings of a tragic saga. J 
Venom Anim Toxins Incl Trop Dis. 2014;20(1):1–14. 

7. Chippaux J-P. Outbreaks of Ebola virus disease in Africa: the beginnings of a tragic saga. J 
Venom Anim Toxins Incl Trop Dis. 2014;20(1):44. 

8. Albariño CG, Shoemaker T, Khristova ML, Wamala JF, Muyembe JJ, Balinandi S, et al. 
Genomic analysis of filoviruses associated with four viral hemorrhagic fever outbreaks in 
Uganda and the Democratic Republic of the Congo in 2012. Virology [Internet]. 
2013;442(2):97–100. Available from: http://dx.doi.org/10.1016/j.virol.2013.04.014

9. Li X, Zai J, Liu H, Feng Y, Li F, Wei J, et al. The 2014 Ebola virus outbreak in West Africa 
highlights no evidence of rapid evolution or adaptation to humans. Sci Rep [Internet]. 
2016;6(October):1–9. Available from: http://dx.doi.org/10.1038/srep35822

10. Shoemaker T, MacNeil A, Balinandi S, Campbell S, Wamala JF, McMullan LK, et al. 
Reemerging Sudan ebola virus disease in Uganda, 2011. Emerg Infect Dis. 2012;18(9):1480. 

11. Towner JS, Sealy TK, Khristova ML, Albariño CG, Conlan S, Reeder SA, et al. Newly 
discovered Ebola virus associated with hemorrhagic fever outbreak in Uganda. PLoS Pathog. 
2008;4(11):3–8. 

12. Nsio J, Kapetshi J, Makiala S, Raymond F, Tshapenda G, Boucher N, et al. 2017 Outbreak of 
Ebola Virus Disease in Northern Democratic Republic of Congo. J Infect Dis. 
2020;221(5):701–6. 

13. Wamala JF, Lukwago L, Malimbo M, Nguku P, Yoti Z, Musenero M, et al. Ebola 
hemorrhagic fever associated with novel virus strain, Uganda, 2007-2008. Emerg Infect Dis. 
2010;16(7):1087–92. 


14. Towner JS, Sealy TK, Khristova ML, Albariño CG, Conlan S, Reeder SA, et al. Newly 
discovered ebola virus associated with hemorrhagic fever outbreak in Uganda. PLoS Pathog. 
2008;4(11):e1000212. 

15. Carneiro J, Pereira F. EbolaID: an online database of informative genomic regions for Ebola 
identification and treatment. PLoS Negl Trop Dis. 2016;10(7):e0004757. 

16. Bowen ETW, Platt GS, Lloyd G, Raymond RT, Simpson DIH. A comparative study of 
strains of Ebola virus isolated from southern Sudan and northern Zaire in 1976. J Med Virol. 
1980;6(2):129–38. 

17. Vivalya BM, Piripiri AL, Mbeva JBK. The resurgence of Ebola disease outbreak in North-
Kivu: viewpoint of the health system in the aftermath of the outbreak in the Democratic 
Republic of Congo. PAMJ-One Heal. 2021;5(5). 

18. Judson S, Prescott J, Munster V. Understanding ebola virus transmission. Viruses. 
2015;7(2):511–21. 

19. Rewar S, Mirdha D. Transmission of Ebola virus disease: an overview. Ann Glob Heal. 
2014;80(6):444–51. 

20. Ogawa H, Miyamoto H, Nakayama E, Yoshida R, Nakamura I, Sawa H, et al. 
Seroepidemiological prevalence of multiple species of filoviruses in fruit bats (Eidolon 
helvum) migrating in Africa. J Infect Dis. 2015;212(suppl_2):S101–8. 

21. Changula K, Kajihara M, Mori-Kajihara A, Eto Y, Miyamoto H, Yoshida R, et al. 
Seroprevalence of filovirus infection of Rousettus aegyptiacus bats in Zambia. J Infect Dis. 
2018;218(suppl_5):S312–7. 

22. Johnson CK, Hitchens PL, Evans TS, Goldstein T, Thomas K, Clements A, et al. Spillover 
and pandemic properties of zoonotic viruses with high host plasticity. Sci Rep. 2015;5(1):1–
8. 

23. Wood JLN, Leach M, Waldman L, MacGregor H, Fooks AR, Jones KE, et al. A framework 
for the study of zoonotic disease emergence and its drivers: spillover of bat pathogens as a 
case study. Philos Trans R Soc B Biol Sci. 2012;367(1604):2881–92. 

24. Kock RA, Begovoeva M, Ansumana R, Suluku R. Searching for the source of Ebola: the 
elusive factors driving its spillover into humans during the West African outbreak of 2013–
2016. OIE Sci Tech Rev. 2019;38(1):113–7. 

25. Hulseberg CE, Kumar R, Di Paola N, Larson P, Nagle ER, Richardson J, et al. Molecular 
analysis of the 2012 Bundibugyo virus disease outbreak. Cell Reports Med. 
2021;2(8):100351. 

26. Kratz T, Roddy P, Tshomba Oloma A, Jeffs B, Pou Ciruelo D, de la Rosa O, et al. Ebola 
virus disease outbreak in Isiro, Democratic Republic of the Congo, 2012: signs and 
symptoms, management and outcomes. PLoS One. 2015;10(6):e0129333. 


27. Fairhead J, Leach M, Millimouno D. Spillover or endemic? Reconsidering the origins of 
Ebola virus disease outbreaks by revisiting local accounts in light of new evidence from 
Guinea. BMJ Glob Heal. 2021;6(4):e005783. 

28. Keita AK, Düx A, Diallo H, Calvignac-Spencer S, Sow MS, Keita MB, et al. Resurgence of 
Ebola virus in guinea after 5 years calls for careful attention to survivors without creating 
further stigmatization. Virological. 2021; 

29. Marí Saéz A, Weiss S, Nowak K, Lapeyre V, Zimmermann F, Düx A, et al. Investigating the 
zoonotic origin of the West African Ebola epidemic. EMBO Mol Med. 2015;7(1):17–23. 

30. Spengler JR, Ervin ED, Towner JS, Rollin PE, Nichol ST. Perspectives on West Africa Ebola 
virus disease outbreak, 2013–2016. Emerg Infect Dis. 2016;22(6):956. 

31. Kupferschmidt K. New Ebola outbreak likely sparked by a person infected 5 years ago. 
Science. 2021; 

32. Oluwagbemi O, Awe O. A comparative computational genomics of Ebola Virus Disease 
strains: In-silico Insight for Ebola control. Informatics Med Unlocked. 2018;12:106–19. 

33. Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Ostell J, Pruitt KD, et al. GenBank. 
Nucleic Acids Res. 2018;46(D1):D41–7. 

34. Leinonen R, Sugawara H, Shumway M, International Nucleotide Sequence Database Collaboration. 
The sequence read archive. Nucleic Acids Res. 2010;39(suppl_1):D19–21. 

35. Kodama Y, Shumway M, Leinonen R. The Sequence Read Archive: explosive growth of 
sequencing data. Nucleic Acids Res. 2012;40(D1):D54–6. 

36. Mizrachi I. Genbank: the nucleotide sequence database. NCBI Handb [Internet], Updat. 
2007;22. 

37. Schmid MW. Rcount: User Guide. 2014; 

38. Mboowa G, Sserwadda I, Aruhomukama D. Genomics and bioinformatics capacity in Africa: 
no continent is left behind. Genome. 2021;64(5):503–13. 

39. Tong Y-G, Shi W-F, Liu D, Qian J, Liang L, Bo X-C, et al. Genetic diversity and 
evolutionary dynamics of Ebola virus in Sierra Leone. Nature. 2015;524(7563):93–6. 

40. Andrews S. FastQC: a quality control tool for high throughput sequence data. Babraham 
Bioinformatics; 2010. Available from: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/

41. Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for 
multiple tools and samples in a single report. Bioinformatics. 2016;32(19):3047–8. 

42. Krueger F, Andrews SR. Quality control, trimming and alignment of Bisulfite-Seq data (Prot 
57). Dep Med Hematol Oncol Domagkstr. 2012;3(48149):1–13. 

43. Hansen NF. Variant calling from next generation sequence data. In: Statistical Genomics. 
Springer; 2016. p. 209–24. 


44. Garrison E, Marth G. FreeBayes. Marth Lab. 2010; 

45. Mohammed KS, Kibinge N, Prins P, Agoti CN, Cotten M, Nokes DJ, et al. Evaluating the 
performance of tools used to call minority variants from whole genome short-read data. 
Wellcome open Res. 2018;3. 

46. Cingolani P, Platts A, Wang LL, Coon M, Nguyen T, Wang L, et al. A program for 
annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in 
the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin). 
2012;6(2):80–92. 

47. Choi Y, Sims GE, Murphy S, Miller JR, Chan AP. Predicting the functional effect of amino 
acid substitutions and indels. 2012; 

48. Pertea G, Pertea M. GFF utilities: GffRead and GffCompare. F1000Research. 2020;9. 

49. Anopheles gambiae 1000 Genomes Consortium. Genetic diversity of the African malaria vector 
Anopheles gambiae. Nature. 2017;552(7683):96. 

50. Van Borm S, Vanneste K, Fu Q, Maes D, Schoos A, Vallaey E, et al. Increased viral read 
counts and metagenomic full genome characterization of porcine astrovirus 4 and Posavirus 1 
in sows in a swine farm with unexplained neonatal piglet diarrhea. Virus Genes. 
2020;56(6):696–704. 

51. Pakistan HI V. Public health round-up. Bull World Heal Organ. 2019;97:517–8. 

52. Bindayna KM, Crinion S. Variant analysis of SARS-CoV-2 genomes in the Middle East. 
Microb Pathog. 2021;153:104741. 

53. Khailany RA, Safdar M, Ozaslan M. Genomic characterization of a novel SARS-CoV-2. 
Gene reports. 2020;19:100682. 

54. Mishra D, Khandelwal G. Command-Line Tools in Linux for Handling Large Data Files. In: 
Bioinformatics: Sequences, Structures, Phylogeny. Springer; 2018. p. 375–92. 

55. Prjibelski A, Antipov D, Meleshko D, Lapidus A, Korobeynikov A. Using SPAdes de novo 
assembler. Curr Protoc Bioinforma. 2020;70(1):e102. 

56. Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJM, Birol I. ABySS: a parallel 
assembler for short read sequence data. Genome Res. 2009;19(6):1117–23. 

57. Liu Y, Schmidt B, Maskell DL. Parallelized short read assembly of large genomes using de 
Bruijn graphs. BMC Bioinformatics. 2011;12(1):1–10. 

58. Paszkiewicz K, Studholme DJ. De novo assembly of short sequence reads. Brief Bioinform. 
2010;11(5):457–72. 

59. Boetzer M, Henkel C V, Jansen HJ, Butler D, Pirovano W. Scaffolding pre-assembled 
contigs using SSPACE. Bioinformatics. 2011;27(4):578–9. 

60. Nadalin F, Vezzi F, Policriti A. GapFiller: a de novo assembly approach to fill the gap within 
paired reads. BMC Bioinformatics. 2012;13(14):1–16. 


61. Larsson A. AliView: a fast and lightweight alignment viewer and editor for large datasets. 
Bioinformatics. 2014;30(22):3276–8. 

62. Nguyen L-T, Schmidt HA, Von Haeseler A, Minh BQ. IQ-TREE: a fast and effective 
stochastic algorithm for estimating maximum-likelihood phylogenies. Mol Biol Evol. 
2015;32(1):268–74. 

63. Guindon S, Delsuc F, Dufayard J-F, Gascuel O. Estimating maximum likelihood phylogenies 
with PhyML. In: Bioinformatics for DNA sequence analysis. Springer; 2009. p. 113–37. 

64. Letunic I, Bork P. Interactive Tree Of Life (iTOL) v4: recent updates and new developments. 
Nucleic Acids Res. 2019;47(W1):W256–9. 

65. Chen J, Wu J-S, Mize T, Moreno M, Hamid M, Servin F, et al. A frameshift variant in the 
CHST9 gene identified by family-based whole genome sequencing is associated with 
schizophrenia in Chinese population. Sci Rep. 2019;9(1):1–9. 

66. Rausell A, Mohammadi P, McLaren PJ, Bartha I, Xenarios I, Fellay J, et al. Analysis of stop-
gain and frameshift variants in human innate immunity genes. PLoS Comput Biol. 
2014;10(7):e1003757. 

67. Berg JM. Amino Acids Are Encoded by Groups of Three Bases Starting from a Fixed Point. 
1970. 

68. Yourno J, Heath S. Nature of the hisD3018 frameshift mutation in Salmonella typhimurium. 
J Bacteriol. 1969;100(1):460–8. 

69. Cirulli ET, Heinzen EL, Dietrich FS, Shianna K V, Singh A, Maia JM, et al. A whole-
genome analysis of premature termination codons. Genomics. 2011;98(5):337–42. 

70. Gorlov IP, Pikielny CW, Frost HR, Her SC, Cole MD, Strohbehn SD, et al. Gene 
characteristics predicting missense, nonsense and frameshift mutations in tumor samples. 
BMC Bioinformatics. 2018;19(1):1–14. 

71. Hammou RA, Kasmi Y, Khataby K, Laasri FE, Boughribil S, Ennaji MM. Roles of VP35, 
VP40 and VP24 Proteins of Ebola Virus in Pathogenic and Replication Mechanisms. 
CRTOMIR P, Ebola, Croácia Intechopen. 2016;101–17. 

72. Davidson BA, Hassan S, Garcia EJ, Tayebi N, Sidransky E. Exploring genetic modifiers of 
Gaucher disease: The next horizon. Hum Mutat. 2018;39(12):1739–51. 

73. Nadeau JH. Modifier genes in mice and humans. Nat Rev Genet. 2001;2(3):165–74. 

74. Carroll SA, Towner JS, Sealy TK, McMullan LK, Khristova ML, Burt FJ, et al. Molecular 
Evolution of Viruses of the Family Filoviridae Based on 97 Whole-Genome Sequences. J 
Virol. 2013;87(5):2608–16. 

75. Agudelo-Romero P, Carbonell P, Perez-Amador MA, Elena SF. Virus adaptation by 
manipulation of host’s gene expression. PLoS One. 2008;3(6):e2397. 

76. Munson MA, Banerjee A, Watson TF, Wade WG. Molecular analysis of the microflora 
associated with dental caries. J Clin Microbiol. 2004;42(7):3023–9. 


77. Ladner JT, Wiley MR, Mate S, Dudas G, Prieto K, Lovett S, et al. Evolution and Spread of 
Ebola Virus in Liberia, 2014-2015. Cell Host Microbe [Internet]. 2015;18(6):659–69. 
Available from: http://dx.doi.org/10.1016/j.chom.2015.11.008

78. Pereira‐Gomez M, Lopez‐Tort F, Fajardo A, Cristina J. An evolutionary insight into 
emerging Ebolavirus strains isolated in Africa. J Med Virol. 2020;92(8):988–95. 

79. Bosworth A, Rickett NY, Dong X, Ng LFP, García-Dorival I, Matthews DA, et al. Analysis 
of an Ebola virus disease survivor whose host and viral markers were predictive of death 
indicates the effectiveness of medical countermeasures and supportive care. Genome Med. 
2021;13(1):1–18. 

80. Emanuel J, Marzi A, Feldmann H. Filoviruses: ecology, molecular biology, and evolution. 
Adv Virus Res. 2018;100:189–221. 

81. Kritsky AA, Keita S, Magassouba N, Krasnov YM, Safronov VA, Naidenova E V, et al. 
Ebola virus disease outbreak in the Republic of Guinea 2021: hypotheses of origin. bioRxiv. 
2021; 

82. Boseley S. Pauline Cafferkey: dedicated nurse and reluctant Ebola hero. 
Lancet. 2016;388(10043):455. 

 

Supporting Information

S1 Appendix: List of the GenBank identifiers for the sequences that were used in our analysis

GenBank IDs for the Uganda outbreak in 2007     GenBank IDs for the DRC outbreak in 2012
SRR7621005  SRR7621077  SRR7621144              KC545393
SRR7621007  SRR7621078  SRR7167631              KC545394
SRR7621014  SRR7621080  SRR7620993              KC545395
SRR7621024  SRR7621081  SRR7620996              KC545396
SRR7621029  SRR7621082  NC_014373
SRR7621032  SRR7621087  MK028856
SRR7621040  SRR7621090  MK028834
SRR7621047  SRR7621140  KU182911
SRR7621059  SRR7621142  KR063673
SRR7621074  SRR7621143  MK028835

S2 Appendix: The frequency of unique variants which had high impact on the genomes

Identifier  Ref. position  Reference bases           Variants                     Variant type                 Genomic region
SRR7167619  6899           TAAAAAAACTT               TAAAAAAAACTT                 frameshift                   GP-gene
            11930          G                         T                            stop_gained                  L-gene
            11952          GAAAAAATTTTG              GAAAAAAATTTTG, AAAAAAATTTTG  frameshift                   L-gene
            16385          AAC                       ACAC                         frameshift                   L-gene
SRR7167620  7389           T                         A,C (T7389A,C)               stop_lost                    GP-gene
            7390           G                         A,C (G7389A,C)               stop_lost                    GP-gene
            7533           T                         G (T7533G)                   stop_lost                    GP-gene
SRR7167621  11930          G                         T (G11930T)                  stop_gained                  L-gene
SRR7621024  706            T                         A                            stop_gained                  NP-gene
            1566           ATC                       AC                           frameshift                   VP35
            6899           TAAAAAAACTT               TAAAAAAAACTT                 frameshift                   GP-gene
            7786           CATC                      CC                           frameshift                   GP-gene
            10764          CAACT                     CACT                         frameshift                   VP24
            13046          TGGGAT                    TGGAT                        frameshift                   L-gene
            14530          GCGTAG                    ACGTCAG                      frameshift and synonymous    L-gene
            17734          T                         A                            stop_gained                  L-gene
SRR7167630  7389           T                         A,C,G                        stop_lost                    GP-gene
            7390           G                         C                            stop_lost                    GP-gene


SRR7167631  6533           T                         G                            stop_gained                  GP-gene
SRR7620981  3246           G                         T                            stop_gained                  VP35
            8731           TTTTGTGTGA                TCTTTGTGTGA                  frameshift                   VP30
            10587          TCTACT                    ACTAGCT                      frameshift and missense      VP24
            11952          GAAAAAATTTTG              GAAAAAAATTTTG                frameshift                   L-gene
SRR7620993  11952          GAAAAAATTTTG              GAAAAAAATTTTG, AAAAAAATTTTG  frameshift                   L-gene
            17705          TAT                       TT                           frameshift                   L-gene
SRR7620996  784            GAAAAAGGAAGGT             GAAAAGGAAGGT                 frameshift                   NP-gene
            1964           GAAAAAAATGAT              GAAAAAAAATGAT                frameshift                   VP35
SRR7621082  1856           AATA                      AA                           frameshift                   NP-gene
            1947           CGGCT                     CCT                          frameshift                   NP-gene
SRR7621087  11952          GAAAAAATTTTG              GAAAAAAATTTTG, AAAAAAATTTTG  frameshift                   L-gene
            17705          TAT                       TT                           frameshift                   L-gene
SRR7621140  1322           C                         T                            stop_gained                  NP-gene
            5293           CAAAAAATG                 CAAAAAAATG                   frameshift                   VP40
            11027          CAAAAAACCCG               CAAAAAAAACCCG                frameshift                   VP24
            11952          GAAAAAATTTTG              GAAAAAAAATTTTG               frameshift                   L-gene
SRR7621143  1584           TAAAAGAC                  TAAAGAC, AAAAAGAC            frameshift                   NP-gene
            1658           A                         T                            stop_gained                  NP-gene
            1964           GAAAAAAATGAT              GAAAAAAAATGAT                frameshift                   NP-gene
            6278           G                         A                            stop_gained                  GP-gene
            7107           C                         T                            stop_gained                  GP-gene
            10513          TAAAAACT                  TAAAAAACT                    frameshift                   VP24
            11027          CAAAAAACCCG               CAAAAAAACCCG                 frameshift                   VP24
            12895          TAAGAGG                   TAGAGG                       frameshift                   L-gene
            14820          G                         A                            stop_gained                  L-gene
            15478          TGGGGGGCA                 TGGGGGCA                     frameshift                   L-gene
            16900          TAAAAAGT                  TAAAAAAGT                    frameshift                   L-gene
            17145          CAGAGCATAGCATCGAGGCAGAAA  CTCGA                        frameshift and missense      L-gene
            17882          C                         T                            stop_gained                  L-gene
SRR7621144  12344          GAAGAT                    GAGAT                        frameshift                   L-gene
SRR7621047  1070           C                         T                            stop_gained                  NP-gene
            6548           TTCA                      TA                           frameshift                   GP-gene
            6899           TAAAAAAACTT               TAAAAAAAACTT                 frameshift                   GP-gene
            6930           A                         G                            stop_lost and splice_region  GP-gene
            7718           CAAGC                     CAGC                         frameshift                   GP-gene
            11027          CAAAAAACCCG               CAAAAAAACCCG                 frameshift                   VP24
            13324          TAAAGC                    TAAGC                        frameshift                   L-gene
            14533          TAGA                      TA                           frameshift                   L-gene
            15296          GTA                       GA                           frameshift                   L-gene
            17268          TTAC                      TC                           frameshift                   L-gene
SRR7621005  1856           AATA                      AA                           frameshift                   NP-gene
            1947           CGGCT                     CCT                          frameshift                   NP-gene


            9106           C                         A                            stop_gained                  VP30
SRR7621007  1964           GAAAAAAATGAT              GAAAAAAAATGAT                frameshift                   NP-gene
            4731           C                         T                            stop_gained                  VP40
            11951          GG                        GAA                          frameshift and missense      L-gene
            13409          AGTG                      AG                           frameshift                   L-gene
            14577          T                         A                            stop_gained                  L-gene
            15008          G                         T                            stop_gained                  L-gene
            16035          ATTTTATG                  ATTTATG                      frameshift                   L-gene
SRR7621074  11027          CAAAAAACCCG               CAAAAAAACCCG                 frameshift                   VP24
            13672          TGGTC                     TGTC                         frameshift                   L-gene
            15485          C                         T                            stop_gained                  L-gene
            17705          TAT                       TT                           frameshift                   L-gene
SRR7621080  11952          GAAAAAATTTTG              GAAAAAAATTTTG, AAAAAAATTTTG  frameshift                   L-gene
            17705          TAT                       TT                           frameshift                   L-gene
