数学建模社区-数学中国

标题: 基因组测序模拟 [打印本页]

作者: 杨利霞    时间: 2019-4-21 14:56
标题: 基因组测序模拟
基因组测序模拟( z; f; `: H- u3 k2 t* x3 ~) d
基因组测序模拟2 N6 Q7 ], O% [, N. M
( e- q' N, w& e4 Z7 Y/ \
一、摘要
& v; X! `6 `" H! a7 l
) f3 `% r4 g1 v+ @- u3 u/ z0 b通过熟悉已有的基因组测序模拟和评估程序,加深全基因组鸟枪法测序原理的理解,并且能够编写程序模拟全基因组鸟枪法测序,理解覆盖度、测序深度、拷贝数等概念,设置测序相关参数,生成单端/双端测序结果文件: o/ A9 ~* B2 l! n/ X- k2 j( D

! D2 P7 B3 T( r5 u0 k" }二、材料和方法
/ k( V; h6 t9 p6 e
3 U9 V5 }& H+ h9 w/ @2 k1、硬件平台5 m4 g( x- L( D* `1 A# M. z

& G2 m/ H, }) Z, U, Q1 p+ f处理器:Intel(R) Core(TM)i7-4710MQ CPU @ 2.50GHz 5 N% `1 V' x9 d. _. V2 I9 S- J8 y1 H; O
安装内存(RAM):16.0GB
5 ]% ~; U5 V, O1 T. W/ K5 }1 E
& ]: I8 H2 N1 i% }0 Z: ?1 R7 Y2、系统平台4 C% t) S+ Q) s% j- v$ {( m# P! W
Windows 8.1,Ubuntu
, i+ L5 G6 k) p5 e6 L/ u( _! H
! Z  j8 p( G" r' z3、软件平台
& F: e( _  p, x1 v, R- x4 O4 O9 c2 K- ]! A+ H0 _8 z% ~6 l  O8 g
art_454
; n  F! i, X: j, MGenomeABC http://crdd.osdd.net/raghava/genomeabc/
7 c- h- r4 _8 L. y4 D( xPython3.5
. u5 K- s5 m# D  A) ABiopython
; I$ u+ t/ F! G. F" X3 w1 O4 P4、数据库资源
  X. D" w" e0 L4 ]' H" g$ O8 J: j
NCBI数据库:https://www.ncbi.nlm.nih.gov/4 A, L  }  Z+ D- k9 J

$ K& n$ G: i6 W& w# a. t! j5、研究对象) N5 G" z+ b6 b$ N- W# ^# X" v0 l

: D! Z9 j5 ~4 R9 e; ~/ R0 I( _酵母基因组Saccharomyces cerevisiae S288c (assembly R64) " s8 E4 }# D4 B2 H* t7 s' V4 T+ S+ f
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz  C* z7 J) x$ T/ u" p2 x' H
6 b9 Z. s+ g9 `, }/ Q
6、方法
* p: _. k/ |* `( h: U1 ~8 y$ g- a! {0 n' }
art_454的使用 : Q3 B& b, Y0 c/ p$ g1 C& b
首先至art系列软件的官网,下载软件,在ubuntu系统安装,然后阅读相关参数设置的帮助文档,运行程序。2 z9 J) S$ B: _1 z* J
GenomeABC
, R# b6 A: S$ ?3 U: ?) r  @, m7 T进入GenomeABC(http://crdd.osdd.net/raghava/genomeabc/),输入参数,获得模拟测序结果。
4 Y! d- H4 e8 T. M编程模拟测序 ) R/ Q  F, C. L: j4 v( \
下载安装python,并且安装biopython扩展模块,编写程序,模拟单端/双端测序。" k5 Z$ @1 R( t9 _
三、结果
0 ^2 \" G2 y7 g; I
' Y3 Q4 j# P! J; u1、art_454的运行结果- t# @" _0 E8 u* ~! S% ?& K( g9 g0 p

" r1 V8 k0 \7 f无参数art_454运行,阅读帮助文档 5 I, N1 ]5 a. o& b$ C' s

' C$ \! f6 L+ T/ I图表 1无参数art_454运行 7 M8 X# h" u' t2 w+ ^
对酵母基因组进行基因组单端测序模拟,FOLD_COVERAGE设为20,即覆盖度为20.
" f2 B2 _  w* ~' s/ |* r下图为模拟单端测序,程序运行过程及结果 : Y3 P4 {: o; Z- ]4 N+ m9 F1 L
# E, y0 X7 T" g- e0 j5 n
图表 2 art454单端测序
0 a+ K8 G1 D6 ^0 T( A$ o* y
! h. K) I$ E5 }4 _$ B/ i+ O  R图表 3 art454单端模拟结果   o( H" F6 ~4 U2 x/ _/ L
双端测序模拟,FOLD_COVERAGE设为20,即覆盖度为20;MEAN_FRAG_LEN设为1500,即平均片段长度为1500;STD_DEV设为20,即长度的标准差为20 7 n! f7 n: D8 b# f- v
下图为模拟双端测序,程序运行过程及结果
7 a. _, U% d3 X% K* c# ?; [; W. w$ ^* h" Q' C5 a
图表 4 art454双端测序 - ~6 s+ _& ?/ X1 b
6 a3 U2 {+ S! K9 F1 ?. C3 ^
图表 5 art454双端模拟结果 / b. ]3 g! }% z* w
2、GenomeABC
- R+ S" P- i+ z2 U! k% W/ u下图为设置参数页面 % [7 J) F6 e6 X& C) y; p& W

: W' c0 W: z" N下图为结果下载页面   ]4 y5 O+ U7 K6 x
3 \9 Y! A( ?/ S# s. C9 Y
图表 6 结果下载页面
/ O: B+ _" S/ n. B- ?3、编程模拟测序结果
/ w; I4 ^4 @& B% |; C5 k拷贝数是这里的N值;覆盖度是m,测序深度是宏观的量,在这里与覆盖度意思相同,就是测序仪10X,20X。
9 v( @3 q. b( V, c+ I' F& }) a单端测序
1 p2 m4 ~0 E: }& o( B, e
4 [8 W& Q4 c5 F6 Q  @图表 7 程序模拟单端测序
" |" K6 \* Q# V0 e双端测序
5 d2 P  Q8 t0 u1 @' y( L) v' S* T8 d9 {
图表 8 程序模拟双端测序 2 N7 T# K7 a: V
测序结果 7 a6 c2 _: W% L
; N5 g/ ]6 h, y$ p. b1 K
图表 9 结果文件
. B: U8 q1 D* N- Q
9 v6 \% t! j: z' s3 _% L因为期望片段长度是600bp,在片段长度区间200-1000bp内,所以大部分的片段都没有删除。
# X+ G  U& \3 Z测序结果统计表
  H! E9 d9 u7 K7 O, a% J  L& K9 V+ I, q# S/ M- W2 x
测序方式        基因组大小(bp)        片段长度区间 (bp)        N值        期望片段长度        克隆保留率        片段数量        Reads长度范围(bp)        Reads总数量        Reads总长度        覆盖度(m值)        理论丢失率(e-m)        覆盖率(1-e-m)
2 Q/ g& S5 g1 _0 `单端        12157kb        200-1000        10        600        0.95        107378        50-100        101968        7645.541kb        0.62889        0.53318        0.46682
! ?; P: {* B8 j; _: e单端        12157kb        200-1000        20        600        0.95        213722        50-100        202996        15227.882kb        1.25259        0.28576        0.71424
2 D! q' o$ @% W双端        12157kb        200-1000        10        600        0.95        106704        50-100        202770        15212.662kb        1.25134        0.28612        0.71388
" \8 C4 e3 Q& Z$ F双端        12157kb        200-1000        20        600        0.95        214212        50-100        407186        30534.265kb        2.51164        0.08114        0.918863 P; P& z& X8 d: k, f8 u, t3 N+ R0 j
四、讨论和结论
2 ~- v* C' x/ R1 ^; Z: _
: f2 }5 b* h, N$ m7 f, n程序运行方法  L8 I% z; M$ S! G8 \
8 k2 s& O2 W9 e
在类的构造方法init()中,调整参数。   W6 N7 K8 I8 ]: k1 j
Averagefragmentlength为片段平均的长度;
& u. E: Q8 B6 Q6 j( K9 Y' @! N* |/ R9 Wminfragmentlength和maxfragmentlength是保留片段的范围; 3 g3 w5 D) q! F; j9 \" H! b7 ~  R+ ^: R! P
cloneRetainprobability是克隆的保留率;
5 _5 _) l' y# ]" Yminreadslength和maxreadslength是测序reads的长度范围
& Y: @) n# _# {& u  Z0 H
6 M- X! [3 s  p) d( O6 i  \模拟测序的诸多方法都封装成了Sequencing类,只需要创建类,并调用singlereadsequencing()和pairreadsequencing()方法,传入文件名的参数即可。
: @2 o/ W  {5 e; \7 `6 v& k
% Q( Z  k7 S: E5 `附录
, D5 X1 L6 Y+ A: s) U
" r( R8 k8 b9 h& a1 Mfrom Bio import SeqIO
" w& o. I% ]7 v( C9 P* Pfrom math import exp9 Y! M. c' L: }9 d: I) I1 k% W
import random+ ^. J, \0 k: ^* F" |; r$ o
% n" e1 M0 h% C) E( |0 K) q/ N8 n
class Sequencing:
/ L! y& X- q# [" W3 J' D    # N代表拷贝份数' ^% [9 n# j4 L+ d& f  a6 S2 }. F
    def __init__(self)+ W% ^3 T* M' H. t1 F3 f
        self.fragmentList = []9 s3 v+ _) @- v& f% [  W7 v3 G
        self.readsID = 1
  p1 K3 x. ^& Z7 x( T        self.readsList = []2 h1 J0 Q2 E& H% P
        self.averagefragmentlength = 650
7 k0 B+ \( d8 Y1 O9 T8 Z& ~; z2 N        self.minfragmentlength = 500
9 Y. y. o$ k/ _        self.maxfragmentlength = 800# `) |& C! J# h1 _0 n
        self.cloneRetainprobability = 1- ?/ d4 I* b1 e, c+ a# N
        self.minreadslength = 50  F0 |4 z6 T4 {8 ]) I$ B# t- D
        self.maxreadslength = 150$ p4 m' s- s" @9 V" ^
        self.N = 10
. B  q( I) }) c4 u0 D        self.genomeLength = 0: H# V1 c8 X2 _; Y3 V: l9 o8 p. v! M; Z
        self.allreadslength = 0
0 {- {, m+ n. N, e. w8 c9 k
* }0 T6 D* P! M% Q% q% C# j; K! B    # 生成断裂点5 I: d0 i9 W* A/ H/ A  E; n$ g
    def generatebreakpoint(self, seqlen, averageLength):1 ]6 t, V6 `9 ]
        # 假设平均每500bp 产生一个断裂点(averageLength = 500),通过随机函数生成seqlen/500个随机断裂点(1到seqlen之间的随机整数)
. t5 C" C+ F2 |8 j- G        breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))]
7 m( w8 t/ \8 E8 e# Q3 y( o( S& ^+ K        breakpoint.append(seqlen)
8 O; a3 ^1 f6 {  |4 ^; k9 F+ e  d        breakpoint.append(0)
3 ^' P1 F7 y8 Q% k- A& ~7 w        # 把随机断裂点从小到大排序7 Z# _' ?0 V; L2 e, y/ |7 I. y1 J! J- k
        breakpoint.sort()
$ _5 P, M( `: y8 H2 g0 l* V        return breakpoint3 z; t1 p/ d7 C& \# i3 q& l
- f- W7 ?6 w  ^5 G; s% ?
    # 沿断裂点打断基因组,并删除不符合长度要求的序列片段,定义片段范围:200-1000bp
8 J) `/ ~; H5 f! Z% E    def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength):
5 g# f$ K& I1 w5 l' p# ?2 Y        for i in range(len(breakpoint) - 1):  M7 }% u! [. T+ o
            fragment = seq[breakpoint:breakpoint[i + 1]]# m( M' p; K/ {5 e
            if maxfragmentlength > len(fragment) > minfragmentlength:, ]7 J8 h4 D5 V1 C: M2 J
                self.fragmentList.append(fragment), p, Z7 m+ a5 h; c! |& T
        return self.fragmentList
- w( S2 E& W& X) {; Q7 {$ L8 [7 p& C4 V2 H4 g" a
    # 模拟克隆时的随机丢失情况,random.random()生成0-1的保留概率" L1 i3 O2 F; m& [
    def clonefragment(self, fragmentList, cloneRetainprobability):) g6 O9 E8 ]! S2 {* a
        clonedfragmentList = [], K0 h2 K6 t. A) u+ R; D
        Lossprobability = [random.random() for _ in range(len(fragmentList))]
. g7 I$ B4 a1 W' p        for i in range(len(fragmentList)):# I# i9 e" V5 ^1 |% L) b3 i) s
            if Lossprobability <= cloneRetainprobability:0 j8 s3 u8 ]& F0 x. Y9 Q
                clonedfragmentList.append(fragmentList)
% _6 l1 i- p9 ^: [; O# Y9 t& ]        return clonedfragmentList" d# h* l( y" \- r8 n
2 D6 D  N7 u3 l+ f' p; C
    # 模拟单端测序,并修改reads的ID号% k9 n; d/ v+ [& @# `  P" H+ o! E
    def singleread(self, clonedfragmentList):$ V  R1 {) r- h' j2 j9 z0 W. g
        for fragment in clonedfragmentList:
) A$ F8 _1 q' Q  d& O            fragment.id = ""
2 G6 p( N  Y1 y! N/ l            fragment.name = ""- _: L6 _1 J9 o
            fragment.description = fragment.description[12:].split(",")[0]
7 H! w, l- E5 g2 Z            fragment.description = str(self.readsID) + "." + fragment.description; r; }% W  u6 w# O( A! f# M
            self.readsID += 1
. I/ L1 c5 l7 @. S3 _) y4 V            readslength = random.randint(self.minreadslength, self.maxreadslength)
/ U5 t. g4 \* \( x/ B            self.allreadslength += readslength; l" ~* K7 x( t/ l2 B( e
            self.readsList.append(fragment[:readslength])
$ l0 I5 M; Z5 [: f# b0 C% j' N  B3 \2 d
    def singlereadsequencing(self, genomedata, sequencingResult):
/ |" G9 {1 Q2 X$ _1 p( H        for seq_record in SeqIO.parse(genomedata, "fasta"):
) l2 B" X6 s7 Q$ W' _            seqlen = len(seq_record), ^' e4 ^; w1 u8 i8 \* ?
            self.genomeLength += seqlen
$ w' y& ?2 c1 o. B. z" O            for i in range(self.N):
$ ~! D% Z3 R$ X, M' k& ?- n                # 生成断裂点) W& a- Z+ B4 L7 V$ n
                breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)0 B3 A8 C! N7 D$ D( t" t) A
                # 沿断裂点打断基因组: f4 G& U. F2 s8 a( A3 V
                self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength), g3 R) K4 `' S; G0 G4 a
        # 模拟克隆时的随机丢失情况
6 P: Z, Y# o( [, I4 X        clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
! s  P7 [8 K1 L( `+ E$ O0 |        # 模拟单端测序
+ E5 p9 x$ \. e$ Q, Z        self.singleread(clonedfragmentList)* m% ^3 L0 S: V% V* v
        SeqIO.write(self.readsList, sequencingResult, "fasta"); J! S/ f' l5 T% i

' N! o% x7 q; o. c! \4 q# V: D    def pairread(self, clonedfragmentList):0 |& S: x& I$ i: ^% @( S% k
        for fragment in clonedfragmentList:2 a! }; _3 [8 ^& X( S& a& F
            fragment.id = ""0 \" S$ n' U% P% f) R
            fragment.name = ""# S0 S, Z7 `6 W# h
            description = fragment.description[12:].split(",")[0]3 I& _+ n, B7 _+ }' `/ j# h+ S
            fragment.description = str(self.readsID) + "." + description/ R2 P8 c% W+ u" K$ n$ m" @
            readslength = random.randint(self.minreadslength, self.maxreadslength)
5 f9 D4 W3 p- v/ p0 h) ]% [% J# Q            self.allreadslength += readslength% u. B" w6 I9 B( I' E
            self.readsList.append(fragment[:readslength])% j; ]. X6 q# W% l
0 z2 }% C9 \1 t3 @
            readslength = random.randint(self.minreadslength, self.maxreadslength)
$ n( A7 g3 M! f" |2 L            self.allreadslength += readslength1 @& X8 f+ ?3 \+ B9 Z& ^% ^
, y) K4 X% e% @8 p' D* W
            fragmentcomplement = fragment.reverse_complement()
9 X" A( }9 P) z            fragmentcomplement.id = ""# {- E4 I6 |, Y' R
            fragmentcomplement.name = ""
& @7 V2 \+ U7 {6 a+ V' W- a* @) G, d            fragmentcomplement.description = str(self.readsID) + "." + description
; i3 t! ~1 i. s" A5 M! Z0 r            self.readsList.append(fragmentcomplement[:readslength])' c! P# v, ]) ?5 y1 u7 I

8 J! J/ E4 r1 G% H9 n5 Z3 B            self.readsID += 17 n9 M/ ?2 X4 Q2 V, Z0 H6 M
. T; a& j" x* P# K8 u3 U4 p
    def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2):
' M; W6 \) R9 L' p8 d3 I        for seq_record in SeqIO.parse(genomedata, "fasta"):. G( y- F- G# K6 n1 b+ K6 m
            seqlen = len(seq_record); i6 _- W+ h9 Q6 s) k
            self.genomeLength += seqlen8 l  z; B  X. {, H6 ?( C& a4 Y3 j
            for i in range(self.N):* J2 `3 Z/ R6 N6 R% W' p
                # 生成断裂点
0 y8 l' ~; _' z+ O& G* U' u: u# _( i+ Q                breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)7 [( [7 ?, H2 B
                # 沿断裂点打断基因组, h7 b, g5 G# ]0 p( I4 W
                self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)6 \$ h" N0 @0 g" k/ R: b! `8 _9 s
        # 模拟克隆时的随机丢失情况
7 @6 h# h& p' t3 H% K$ a6 Y3 M        clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)$ R) B) [" P( b% W" W
        # 模拟双端测序6 M' H# b$ X, Y, I$ x5 E+ J
        self.pairread(clonedfragmentList)1 f; f  _4 x! L$ h; w+ {4 e/ V
        readsList_1 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 0]) _) r% w& m% i) z3 W+ m
        readsList_2 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 1]3 [( A7 ?" }% A4 W
        SeqIO.write(readsList_1, sequencingResult_1, "fasta")3 }1 }+ I9 |8 @5 k1 s, e) E1 L
        SeqIO.write(readsList_2, sequencingResult_2, "fasta")
1 m# {2 Q) d0 ?3 H& }* w. _) S
  h: C- ~4 B0 ^$ b. g1 j$ p    def resultsummary(self):9 p9 q  D1 Q  f0 H/ m
        print("基因组长度:" + str(self.genomeLength / 1000) + "kb")
% Q* z% n5 k% }9 @* P' l        print("片段长度区间:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength))* X# N) X& Y3 n% d9 D, ~3 l
        print("N值:" + str(self.N))  V% B, P; v7 c. t  J7 b7 y) B
        print("期望片段长度:" + str(self.averagefragmentlength)). f' `6 j5 s, `' w
        print("克隆保留率:" + str(self.cloneRetainprobability)); j( f0 V7 a3 d- y! _  D! J# e
        print("片段数量:" + str(len(self.fragmentList)))6 f# m. ~' Q4 M, w% a: ^  U) C
        print("reads长度:" + str(self.minreadslength) + "-" + str(self.maxreadslength))
5 n' r8 A. F# n8 f        print("reads总数量:" + str(len(self.readsList)))
7 O% B+ ?1 {9 t' @: F5 j        print("reads总长度:" + str(self.allreadslength / 1000) + "kb")
/ t( ^9 U) P) k5 o        m = self.allreadslength / self.genomeLength% E+ |7 G/ a  I5 ?$ b: v% C
        print("覆盖度(m值):" + str(round(m, 5)))$ S) i9 S6 U" g1 ^! t/ ~* a* i. Q) ~
        print("理论丢失率(e^-m):" + str(round(exp(-m), 5)))( J1 R# C" p: F" n7 @0 d, }
        print("覆盖率(1-e^-m):" + str(round(1 - exp(-m), 5)))
$ c8 u/ `) T  v9 h, b# -------------------------------------------主程序-------------------------------------------
0 I/ J2 f& X* O- p7 u# 模拟单端测序; u7 Y* l" y) J( y$ g
sequencingObj = Sequencing()5 K! r  z. a6 D. V$ y+ d. C
sequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")% T- H, H: J9 e% M9 d
sequencingObj.resultsummary()" W/ T/ m, H- W# b

& R3 E& b  m) s2 O5 e* d( n# 模拟双端测序7 T' v# |% z. J! D; p
sequencingObj = Sequencing()1 @$ ?( C; w3 @3 n
sequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa")
5 V7 P, q4 z5 I6 NsequencingObj.resultsummary()5 \' ^) @  C6 e: B6 S" t
from Bio import SeqIO2 |( e3 D( x+ w
from math import exp. B# ^! l/ c$ W" ?3 k8 @
import random! X+ k# y0 A4 u- I3 E2 j

# d% l* @# b5 S  y* M9 Iclass Sequencing:
1 l1 \& Y$ O/ r) o$ n6 C% \* ?; ?    # N代表拷贝份数
% w3 O% X  N# I3 |/ X/ z& f1 K- n" @    def __init__(self):
/ L5 K' ^/ A3 P! R! f6 p: `% p) h        self.fragmentList = []& [" y" y; a+ B6 m  t
        self.readsID = 1: u/ l. f. w7 o$ \8 z
        self.readsList = []) F0 C; F; J) `+ Z2 a
        self.averagefragmentlength = 650
' k7 V) N6 `3 ]! h9 H6 z! U        self.minfragmentlength = 500
& ]1 v# r/ I' b+ f& Z/ p6 N        self.maxfragmentlength = 800! `: m; p8 o  y. {
        self.cloneRetainprobability = 1
! ^6 ]. W7 j3 p8 A* {        self.minreadslength = 50, w) S5 l7 t$ g
        self.maxreadslength = 150
; ]* s3 H/ i- u9 p        self.N = 10
& ?4 V: @/ c- ]1 z        self.genomeLength = 0
2 i5 d/ e) {; E* F' q" W# t        self.allreadslength = 0$ W8 s4 h3 {. @* ^# }

: M9 V# r" ]5 }    # 生成断裂点
- S. A0 e) Q1 r9 |3 k9 L    def generatebreakpoint(self, seqlen, averageLength):
6 w/ K$ @: ~% a        # 假设平均每500bp 产生一个断裂点(averageLength = 500),通过随机函数生成seqlen/500个随机断裂点(1到seqlen之间的随机整数)- ^- s8 {, w. S& }' D) {
        breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))], y& ^5 E5 c, N
        breakpoint.append(seqlen)
3 H3 a  C4 v1 X" E  o; K        breakpoint.append(0)
9 h1 m: |* M" `9 n% w$ F        # 把随机断裂点从小到大排序
! r+ y+ ^+ D( S5 ^9 d% q* `" [        breakpoint.sort()
  i/ j+ E0 \  `  Z1 ^  ~( p        return breakpoint
2 e* `6 y5 l2 s0 |4 |( `& _) I* [2 N" `9 H' M  c8 Q
    # 沿断裂点打断基因组,并删除不符合长度要求的序列片段,定义片段范围:200-1000bp! Z" k6 U8 \9 ^
    def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength):
4 h6 f* L1 @: S        for i in range(len(breakpoint) - 1):
  M% n% ^1 M  }+ g0 s            fragment = seq[breakpoint:breakpoint[i + 1]]
* E3 W8 o4 t- w            if maxfragmentlength > len(fragment) > minfragmentlength:
! p% U6 Y$ v) b8 O                self.fragmentList.append(fragment), k3 a/ i1 m4 c; ^: G
        return self.fragmentList9 R/ d% w) |* t+ l% z
) y1 [, O3 @- c" R0 V  ~% ^
    # 模拟克隆时的随机丢失情况,random.random()生成0-1的保留概率
" d7 f# `. v4 ?9 h% t0 G2 f    def clonefragment(self, fragmentList, cloneRetainprobability):
1 S2 T3 A! o% l: l' M        clonedfragmentList = []
) O, X9 L7 h3 S  y        Lossprobability = [random.random() for _ in range(len(fragmentList))]
. o5 F4 T1 g5 d" `0 I' r! p* F        for i in range(len(fragmentList)):) C4 ~6 V2 D: }$ O( F
            if Lossprobability <= cloneRetainprobability:
6 H; I) u: g; A. Q  W! ?* _                clonedfragmentList.append(fragmentList)0 a4 F8 H* ]4 \
        return clonedfragmentList
% A) t# X4 |/ Z) z. Z% T, r+ `& B1 y- g; M6 m0 b
    # 模拟单端测序,并修改reads的ID号3 i$ e; j2 ?. x
    def singleread(self, clonedfragmentList):
. x9 i! c9 P0 a  G7 T! c9 m  w$ r1 o        for fragment in clonedfragmentList:
2 s4 |, W4 u# W6 `8 l            fragment.id = ""3 v1 |4 u2 i8 Z7 P# {, e+ m
            fragment.name = ""
: s7 f' ~4 V7 K; u            fragment.description = fragment.description[12:].split(",")[0]
1 G. k7 n  ?9 \+ Q0 O2 A# W8 t            fragment.description = str(self.readsID) + "." + fragment.description. o4 u! b+ A( N5 V% t, v1 _
            self.readsID += 1& B- |# o% e" y
            readslength = random.randint(self.minreadslength, self.maxreadslength)' |2 q7 W' s! n8 y9 t( a6 E- j- g* h
            self.allreadslength += readslength( f: I: x/ W9 a7 V
            self.readsList.append(fragment[:readslength])2 Q: b7 h- G9 L* o4 c1 [

6 k4 x/ n* h/ G3 e( _    def singlereadsequencing(self, genomedata, sequencingResult):
9 K- Y8 ^& e, u+ g# L# G$ V        for seq_record in SeqIO.parse(genomedata, "fasta"):
# c. w" q" ]! o# V- ~- f7 k            seqlen = len(seq_record)3 }. J# K& d) d: h8 {
            self.genomeLength += seqlen9 p: A- `" d% Y" ?% P+ [: ^, p; c
            for i in range(self.N):6 K1 \+ r; S2 D
                # 生成断裂点" I* q7 u7 z- [4 N
                breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)+ q/ e; H8 b' r& A' _
                # 沿断裂点打断基因组
. t9 }% D0 p1 l# {. ?7 ], w                self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)$ q- d: e5 J9 ^6 F" q8 l9 ?
        # 模拟克隆时的随机丢失情况, r- d! @1 u* Q/ F, ^
        clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
% P# V% }) @/ z0 `4 Z/ q        # 模拟单端测序5 w+ H. P" R4 ]2 N$ B' U
        self.singleread(clonedfragmentList)
' \# {3 N& v: C: M4 `) N        SeqIO.write(self.readsList, sequencingResult, "fasta")
. _+ ?" k" F# ]7 o2 d* T& B; U$ V8 S  n0 A7 ~# c3 n
    def pairread(self, clonedfragmentList):
' ?& A- J6 e2 y$ P# X$ E0 n        for fragment in clonedfragmentList:
7 C% A! k/ j/ X& ^, o0 i            fragment.id = ""! P1 y+ w  z& Z
            fragment.name = ""7 c( \# w6 v$ s9 @
            description = fragment.description[12:].split(",")[0]
$ x1 C! ?0 Y; d2 F2 V* d. W            fragment.description = str(self.readsID) + "." + description% n* P+ ~  ^4 g; a9 F- P
            readslength = random.randint(self.minreadslength, self.maxreadslength)2 u1 d! ~0 ^9 q
            self.allreadslength += readslength
" S; S6 E, p8 B5 L            self.readsList.append(fragment[:readslength])
6 _7 e/ @8 l, M0 `" {* @% g- B8 @$ G: _4 K: q8 Q7 w) m: Q
            readslength = random.randint(self.minreadslength, self.maxreadslength)
6 s' P& q9 U& d7 A/ f            self.allreadslength += readslength
1 O1 _! }& P& ^- l- }3 ^( X* U# p" E% Z+ f8 s) b3 o
            fragmentcomplement = fragment.reverse_complement()  D- p' @% s  Q8 w6 {
            fragmentcomplement.id = ""
0 d! P5 x% X" `* [  w            fragmentcomplement.name = ""; k: U+ W  _3 K
            fragmentcomplement.description = str(self.readsID) + "." + description, ~- w# N+ w" z' R% e. k! n  z
            self.readsList.append(fragmentcomplement[:readslength])
# i) L( }9 Q9 a$ ^7 [3 W& \+ [9 x1 _7 z* e& a* ?7 q8 q
            self.readsID += 1
% ]. h' r* ^5 ]# @- G* }- s6 U6 R; [+ x0 F
    def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2):
4 V6 h4 j3 C4 V% V        for seq_record in SeqIO.parse(genomedata, "fasta"):
4 }6 ~" r, z$ g! B7 y2 ^            seqlen = len(seq_record)
) I  \# Y) Z+ Y5 Q            self.genomeLength += seqlen
6 N( ~7 K/ ~5 \: k            for i in range(self.N):% ]- H, |! V% W' b# O5 K1 K
                # 生成断裂点
" O) [% ^+ Y% d1 }& y0 y( a7 z) B- o' F                breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
7 u0 g8 H/ u( v                # 沿断裂点打断基因组4 n: ~( s8 |9 D
                self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)& P( {( G) F! A) a! r
        # 模拟克隆时的随机丢失情况
4 g! u3 V( ^( J5 W        clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)8 U" j7 ^* z) ?  C( [
        # 模拟双端测序8 D. J0 i& T; ?* {4 T  v
        self.pairread(clonedfragmentList)5 R% r+ u& b( M3 j9 d/ T2 [
        readsList_1 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 0]2 k; C, C3 X* \6 N  M2 d& a5 d; z
        readsList_2 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 1]! G& I( K+ G- v8 {7 ~1 P
        SeqIO.write(readsList_1, sequencingResult_1, "fasta"); ~6 g2 d1 R0 q
        SeqIO.write(readsList_2, sequencingResult_2, "fasta")+ x  _5 G3 X9 a% c; n. A* I
# L' p7 \8 k) \& n& a
    def resultsummary(self):& C; r- V1 q. _# |4 }7 \, n
        print("基因组长度:" + str(self.genomeLength / 1000) + "kb")$ i  ]* F$ ^' p0 i4 r! k
        print("片段长度区间:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength))
# M8 R9 i3 {7 \" k1 m        print("N值:" + str(self.N))
- j( L1 O) |5 I, h. ?. W( V        print("期望片段长度:" + str(self.averagefragmentlength)): G6 s: X9 {* i  M  H2 [5 z" q/ \; M
        print("克隆保留率:" + str(self.cloneRetainprobability))- r5 g' ~; Q$ L. E& n
        print("片段数量:" + str(len(self.fragmentList)))6 a* l* U. J; G% n: B& l
        print("reads长度:" + str(self.minreadslength) + "-" + str(self.maxreadslength))* q' n* m% K. B) Q
        print("reads总数量:" + str(len(self.readsList)))0 o' A( P2 o1 C' p
        print("reads总长度:" + str(self.allreadslength / 1000) + "kb")
# [8 W- F( {" b4 s        m = self.allreadslength / self.genomeLength/ R% U0 O' z8 l" E$ ^1 V
        print("覆盖度(m值):" + str(round(m, 5)))
# e: |, {$ @  S" B& Z& \        print("理论丢失率(e^-m):" + str(round(exp(-m), 5)))
& l4 k/ [2 h: f7 [        print("覆盖率(1-e^-m):" + str(round(1 - exp(-m), 5)))% `7 x; q" B  W( L
# -------------------------------------------主程序-------------------------------------------* h3 t" S- O" z1 m
# 模拟单端测序
  l3 p( F9 X/ o: QsequencingObj = Sequencing()" `' G1 W4 A1 J- A
sequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")
. G+ i+ x& I: |+ |% B* m( ^. asequencingObj.resultsummary(), G; E1 ]. g+ X8 I* `
9 C/ k' F3 f9 |) E' {8 f, o7 T
# 模拟双端测序3 z( w+ ^1 D5 B7 A) N2 q- B) Q* s
sequencingObj = Sequencing()2 c6 g* [4 J# m& f& q( ?( `8 ]
sequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa")8 P6 R' `( g% ^0 T& m# n' H
sequencingObj.resultsummary()
3 J: P& K) ?- ~4 {8 G) V6 M, o
1 \3 D2 z4 z& E3 c+ T5 Z
! ~  u% O7 V  Y2 i3 s1 f8 v1 \2 X8 R1 Y, a

0 R9 u3 Z; M: n3 q' _' i# k8 b6 h

数学建模解题思路与方法.pptx

117.69 KB, 下载次数: 4, 下载积分: 体力 -2 点


作者: 3477959497    时间: 2019-4-22 10:49
不错。。。。。。。。。。。。。。。。。$ G, w, J% d, n: P% U& p  N8 _





欢迎光临 数学建模社区-数学中国 (http://www.madio.net/) Powered by Discuz! X2.5