Computational Study of Genome Organization and Variation (以計算方式研究基因體結構與變異)

Date

2012

Abstract

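DNA sequencing plays an increasingly important role in biological research. By assembling an organism's DNA sequencing data to reconstruct its genome, much information about that organism can be obtained. In addition, comparisons of genomes between species or individuals have shown that structural variation in the genome has an important influence on gene expression, disease, and evolution. In recent years, advances in sequencing technology have produced many new short-read sequencing platforms that generate large amounts of sequencing data at very low cost. However, the sheer volume of data and the short read length pose many challenges for computational genome reconstruction. Moreover, genomes often contain complex structural variations such as inversion and transposition; traditional sequence alignment algorithms cannot fully align sequences that contain structural variation, nor can they locate where the variation occurred. We therefore propose new computational methods that assemble a genome from short-read sequencing data and align two sequences that contain structural variation.
For genome assembly, we propose a new algorithm that uses "jumping extension" to assemble short-read sequencing data into a genome efficiently. The main idea of jumping extension is to use two levels of overlap relationships among reads to filter, from the massive read set, the reads that are more likely to belong to the same region, and then to assemble the genomic content of that region. Experimental results show that, compared with other commonly used methods, the proposed algorithm not only produces better assemblies but also uses less memory.
For sequence alignment, we propose a new algorithm for aligning two sequences that contain inversions or transpositions. By aligning the sequences that contain the breakpoint regions of inversion or transposition events, we can estimate the breakpoint positions, restore the sequence content as it was before the event, and then obtain a complete alignment with a traditional sequence alignment method. Using the proposed algorithm, we analyzed human and chimpanzee genome data from the UCSC website and obtained 130 inversion breakpoints and 846 transposition breakpoints, together with complete alignments of those regions. We also used simulation to test how well the method aligns sequences from species of differing relatedness. The results show that the proposed method is suitable for aligning sequences of species that are more closely related than rat and mouse.
In summary, the computational methods proposed in this dissertation can effectively assemble a genome from short-read sequencing data and fully align sequences that contain inversions and transpositions. The breakpoint positions of inversion and transposition events can also be detected.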
DNA sequencing is an important technique in biological studies. By assembling sequencing data into a genome, much useful information about an organism's genome, such as its size, DNA composition, and contents, can be obtained. Comparisons of genomes between species or individuals have shown that structural variation (SV) plays an important role in gene expression, disease, and genome evolution. With the rapid progress of sequencing technologies, several short-read sequencing platforms have emerged in the past few years. These platforms generate huge amounts of data at a much lower cost than traditional Sanger sequencing, although the reads are much shorter. The vast amount of sequencing data and the short read length pose many computational challenges in genome assembly. Further, when the genomes being compared contain complex rearrangement SVs, such as inversions and transpositions, traditional alignment algorithms cannot align the genomic sequences well enough to identify the breakpoints and recover the rearrangements directly. We therefore propose two new computational methods: one to reconstruct a sequenced genome from short-read sequencing data, and one to detect the breakpoints of inversion or transposition events in sequences for genome comparison.
In the first work, we propose a genome assembly algorithm that adopts a new extension approach called "jumping extension" to assemble reads ≥100 bp efficiently. Jumping extension, the kernel of the proposed method, groups reads that are more likely to have been sequenced from the same region and extends by more than one hundred bases at a time. During read extension, low-quality nucleotides are dynamically trimmed from the 3'-end of each read, which improves the connectivity of the reads. Empirical and simulation studies show that the proposed algorithm achieves both better contig quality and lower memory usage than many popular methods.
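As a rough illustration of why trimming low-quality bases from the 3'-end can improve read connectivity, the following minimal sketch trims a read and then checks its suffix-prefix overlap with another read. It is not the dissertation's algorithm: the function names, the fixed Phred threshold of 20, and the minimum overlap length are all assumptions made for this example, and the actual trimming is described as dynamic.

```python
def trim_3prime(seq, quals, min_q=20):
    """Drop trailing bases whose Phred quality is below min_q (assumed threshold)."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def suffix_prefix_overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b, or 0 if shorter than min_len."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


# Toy data (hypothetical): the last two bases of `read` have low quality.
read = "ACGTACGTTG"
quals = [38, 37, 36, 36, 35, 34, 33, 30, 8, 5]
next_read = "ACGTTCCA"

trimmed, trimmed_quals = trim_3prime(read, quals)
print(suffix_prefix_overlap(read, next_read))     # 0: the unreliable 3' bases block the overlap
print(suffix_prefix_overlap(trimmed, next_read))  # 4: after trimming, "ACGT" overlaps
```

In this toy case the untrimmed read cannot be connected to the next read, while the trimmed read can.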
In the second work, we propose a pairwise alignment algorithm that can align sequences containing rearrangement (inversion or transposition) events. The breakpoints of the rearrangement events are estimated by aligning the breakpoint regions. One sequence is then un-shuffled according to the corresponding breakpoints, which yields an alignment of the sequences as they were before the possible rearrangement events occurred. Using data from the UCSC website, we identified 130 simple inversion breakpoints and 846 simple transposition breakpoints between the human and chimpanzee genomes. We also evaluated the method on several pairs of species; the results show that it is suitable for species that are as conserved as mouse and rat.
In this dissertation, we develop a series of computational methods to study genome organization and variation. The proposed methods can efficiently reconstruct an organism's genome from short-read sequencing data and detect the breakpoints of inversions and transpositions at the nucleotide level. In the future, the methods can be applied to sequencing data obtained from different platforms. In addition, building on this study, new methods can be developed to cope with more complex genome variations that are made up of combinations of SVs.
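The un-shuffling step in the second work can be pictured with a small sketch, assuming the breakpoints are already known. Estimating them by aligning the breakpoint regions, which is the dissertation's actual contribution, is not shown here. The function names and the half-open coordinate convention are assumptions made for this example.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def undo_inversion(seq, start, end):
    """Restore an inverted segment seq[start:end] (half-open) to its original orientation."""
    return seq[:start] + reverse_complement(seq[start:end]) + seq[end:]


def undo_transposition(seq, start, end, origin):
    """Cut seq[start:end] and reinsert it at index `origin` of the remaining sequence."""
    segment = seq[start:end]
    rest = seq[:start] + seq[end:]
    return rest[:origin] + segment + rest[origin:]


# Toy sequences (hypothetical): one inversion and one transposition.
original_inv = "ATGCCGTAAGCT"
inverted = "ATGTTACGGGCT"        # positions 3..9 of original_inv, reverse-complemented
restored_inv = undo_inversion(inverted, 3, 9)

original_tr = "AAACCCGGGTTT"
transposed = "AAAGGGCCCTTT"      # "CCC" moved downstream of "GGG"
restored_tr = undo_transposition(transposed, 6, 9, 3)

print(restored_inv == original_inv, restored_tr == original_tr)  # True True
```

Once a sequence has been restored in this way, a conventional pairwise aligner can be applied to the restored pair.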

Keywords

genome structural variation, short read sequencing, next-generation sequencing, genome assembly, genomic inversion, genomic transposition
