数学建模社区-数学中国

标题: 基因组测序模拟 [打印本页]

作者: 杨利霞    时间: 2019-4-21 14:56
标题: 基因组测序模拟
基因组测序模拟5 o% D) C0 N' K( O2 l
基因组测序模拟
2 ?0 z4 W1 H1 s6 X- _
0 b1 z% {: N2 R2 g/ K" [+ L7 \一、摘要, {2 p7 r/ ]$ [9 _2 S3 s
' E& n0 f8 m* m0 d* r0 e- ?
通过熟悉已有的基因组测序模拟和评估程序,加深全基因组鸟枪法测序原理的理解,并且能够编写程序模拟全基因组鸟枪法测序,理解覆盖度、测序深度、拷贝数等概念,设置测序相关参数,生成单端/双端测序结果文件7 q* v9 v+ [) G

  j# s/ @  I7 U二、材料和方法
; g, J" Y% Y( ~! V+ v* M( D+ e5 T; m
1、硬件平台
% s1 K9 ]; D+ H) w' c# c& E( b& \' U" A2 U1 q! G
处理器:Intel(R) Core(TM)i7-4710MQ CPU @ 2.50GHz
* C: W3 Q# A3 ~; @安装内存(RAM):16.0GB& v- d# {1 Y5 U2 X
" I! f3 W8 _0 u4 G" m" |( O7 y
2、系统平台
- C+ @7 W9 m: S- z2 Z- L  R) LWindows 8.1,Ubuntu) x  w8 R5 ^9 {
; U: S; W  v! N: M8 Y
3、软件平台3 y' v' l# [/ v- @4 o$ h3 x
% u: T% \& Q* N3 b7 v7 o
art_454# w/ d4 t) f9 r. M. N( e0 B
GenomeABC http://crdd.osdd.net/raghava/genomeabc/
' U* h1 ?' b, b$ qPython3.5* f; t9 Z/ Z0 u- n5 k+ T
Biopython3 E1 `/ k7 }: ^" r
4、数据库资源; V! r6 O+ Q/ U' Y% _' ?5 m
' x; g1 S5 V6 M( S" g; b" R5 C/ o2 N
NCBI数据库:https://www.ncbi.nlm.nih.gov/
/ k# A* j7 U& Z! M6 \8 {/ v" u: ?) U" m$ {3 |
5、研究对象/ r1 E) [& O5 `5 p" m

6 k6 F* B3 K5 F9 h- t9 M酵母基因组Saccharomyces cerevisiae S288c (assembly R64) , e% R: R: c. ^* M% x
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz* Z* m6 d  S+ K" e: `  W

8 C) U5 I) Z3 J% V6、方法) s# C: T9 A5 F$ f

' e) Y1 D% A1 T# d  O2 W& @art_454的使用
8 h- B8 }% \* z: a首先至art系列软件的官网,下载软件,在ubuntu系统安装,然后阅读相关参数设置的帮助文档,运行程序。% \- ^4 \6 [3 v- |
GenomeABC 2 b6 a4 B$ p0 t
进入GenomeABC(http://crdd.osdd.net/raghava/genomeabc/),输入参数,获得模拟测序结果。0 U9 v2 B* @8 o' }% R: ~
编程模拟测序 : M- z6 Z) C  k$ S/ ?! R
下载安装python,并且安装biopython扩展模块,编写程序,模拟单端/双端测序。
5 K- k* g" E% j. [1 O三、结果
2 ]  m/ P6 k: t8 P$ l. Z5 w
+ w5 i! D" I$ R  M* T) W8 @1、art_454的运行结果; J! M; ]; x2 A! R$ u7 B/ R

0 p' b( [7 R* G5 g无参数art_454运行,阅读帮助文档
0 U" A2 a7 M. U1 u" O9 v% X6 \& ~0 ^! e0 C# G3 C6 W* s* f/ @, B% v
图表 1无参数art_454运行
, H. T5 a  j% P- C0 t; g* ]8 x对酵母基因组进行基因组单端测序模拟,FOLD_COVERAGE设为20,即覆盖度为20.
4 D. x) N; F. D/ f+ s4 y下图为模拟单端测序,程序运行过程及结果
8 p* {8 a6 p) e, I1 t' l! z9 A/ J# q! Z+ x" j) \- U
图表 2 art454单端测序 & |, ^4 H) Q: J

; P( ?1 ~! e5 d& |% K- |7 W图表 3 art454单端模拟结果 ' z- e! v6 c' g7 J
双端测序模拟,FOLD_COVERAGE设为20,即覆盖度为20;MEAN_FRAG_LEN设为1500,即平均片段长度为1500;STD_DEV设为20,即长度的标准差为20   |: y$ ?) m0 c% T" q+ b# T; s0 g/ v
下图为模拟双端测序,程序运行过程及结果 3 n3 x7 w) ?- Z7 @
. a0 ]) r- {& W( @: Q' k, }
图表 4 art454双端测序 * b  J9 @9 H) @0 Z2 r
6 c5 V- B3 `* o5 [& _5 W
图表 5 art454双端模拟结果 5 r4 i6 d* R& f
2、GenomeABC & y" W, ~) O) H* [
下图为设置参数页面
/ ~5 U3 V; N0 x, n% L' _, z% ~) _
下图为结果下载页面 . j4 }! v' f# s4 d' V' B& O

- B6 n5 ?: c5 e! d3 H2 p图表 6 结果下载页面 $ K4 Z8 \  |. k2 t5 v% C, X; z
3、编程模拟测序结果
3 O" x4 W; N' D9 s: Z9 q3 q/ d  m( h拷贝数是这里的N值;覆盖度是m,测序深度是宏观的量,在这里与覆盖度意思相同,就是测序仪10X,20X。 ) d* F8 a3 ^# I7 a
单端测序
% v6 A4 h( s( O! f3 S
# {( C$ b2 q4 D图表 7 程序模拟单端测序
2 h  e% K- Y( n3 ~8 T/ j双端测序
& T6 x( ~8 N! C7 d
4 T# d: `, g5 y4 u2 ~9 T图表 8 程序模拟双端测序
9 Z: r! Z  |1 J$ L1 E; V. O# G测序结果
/ z2 I3 [0 p$ {; c% z) P+ L) r
) z( e$ K5 X  V( A% e1 ]2 i图表 9 结果文件% c# m, Z6 M, ?6 B
, G; |& q; V" W% H, u" Q$ [" k
因为期望片段长度是600bp,在片段长度区间200-1000bp内,所以大部分的片段都没有删除。 , N' |( H" u5 A9 ?% r( m4 O4 W
测序结果统计表
$ V+ b# X9 R/ o& @  o$ Q: m. t5 o0 p) ~- ^+ r( u! n
测序方式        基因组大小(bp)        片段长度区间 (bp)        N值        期望片段长度        克隆保留率        片段数量        Reads长度范围(bp)        Reads总数量        Reads总长度        覆盖度(m值)        理论丢失率(e-m)        覆盖率(1-e-m)
4 c; D- b1 j' k! p单端        12157kb        200-1000        10        600        0.95        107378        50-100        101968        7645.541kb        0.62889        0.53318        0.46682
/ `9 ~1 I/ ]" \+ Q单端        12157kb        200-1000        20        600        0.95        213722        50-100        202996        15227.882kb        1.25259        0.28576        0.71424% E! ~0 X. s5 O6 J% {
双端        12157kb        200-1000        10        600        0.95        106704        50-100        202770        15212.662kb        1.25134        0.28612        0.71388( T9 e5 O; G* P& D5 z4 o0 a, i
双端        12157kb        200-1000        20        600        0.95        214212        50-100        407186        30534.265kb        2.51164        0.08114        0.91886) d% c8 Z2 Z7 X) o; w
四、讨论和结论; O# G7 o# D& S4 \9 {) K5 ~& Z

9 c! i+ J! U. F/ \- _# n( Y程序运行方法, E5 p9 p. {+ m9 P) y" t

+ d( ?* j4 e3 {4 D' y& ^5 T在类的构造方法init()中,调整参数。 / h+ C: l! V7 d# Q: r1 V/ J) L
Averagefragmentlength为片段平均的长度; % a. L3 r0 {7 d4 }3 ^
minfragmentlength和maxfragmentlength是保留片段的范围; + }4 X& E, r% ]; s$ T! ]2 {4 T6 V
cloneRetainprobability是克隆的保留率;
  I% Y/ d5 S  Q$ t' pminreadslength和maxreadslength是测序reads的长度范围1 b  |- x4 B1 P& I) G
4 o) `" [  B; y; b& z- c6 E* c
模拟测序的诸多方法都封装成了Sequencing类,只需要创建类,并调用singlereadsequencing()和pairreadsequencing()方法,传入文件名的参数即可。
0 ~) |+ r8 t  W2 X! d  q  `
, t# {: N! C6 s6 Z: U附录9 V& L3 U5 O5 O; i2 N

* |+ Z9 F2 u2 }7 ffrom Bio import SeqIO9 A9 W: w8 q6 {
from math import exp0 @3 b& {  A9 V5 Q2 z4 s5 R
import random
9 f* d3 L1 [( {7 E0 [; g8 r7 |4 i& R% `; S) r
class Sequencing:' B9 R% `* w$ i% v
    # N代表拷贝份数5 L. @, |+ p; s+ u' V) g* m2 |- I& [
    def __init__(self)  X1 v; G( L6 i0 q% i
        self.fragmentList = []5 o% ?  R' P4 V
        self.readsID = 1: S- ]4 z" C" {6 }6 Y" S+ s
        self.readsList = []6 Z1 C8 r  c) {" _  J6 V
        self.averagefragmentlength = 6502 ^' r- Z* j9 D/ v2 S
        self.minfragmentlength = 5003 [' y  I( l" C; P
        self.maxfragmentlength = 800
& ^) n; l  Z% h4 u3 t        self.cloneRetainprobability = 1
& j% d1 y0 r2 q$ H/ D" Y  r        self.minreadslength = 50
0 I. @, N/ `" S  s        self.maxreadslength = 150! Z7 {/ j1 N9 Z
        self.N = 107 H) B# v* p5 t; R
        self.genomeLength = 0) q' C- h) f" C
        self.allreadslength = 0
" T1 J% O! j3 p# }8 f( _9 h& M9 g/ M$ q3 [) G! Y8 p) ~; S
    # 生成断裂点2 S$ _. Z% q$ u4 x' C5 ]: X
    def generatebreakpoint(self, seqlen, averageLength):
7 H1 W& j* H/ L& N- {, i6 j4 ^        # 假设平均每500bp 产生一个断裂点(averageLength = 500),通过随机函数生成seqlen/500个随机断裂点(1到seqlen之间的随机整数)4 u% H8 v3 d0 T: [# x
        breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))]& I: ^# G( L& G5 [
        breakpoint.append(seqlen)% x/ z+ s  v. E
        breakpoint.append(0)
3 {3 t6 k, G$ v' b. N8 P        # 把随机断裂点从小到大排序
3 G# C. _" ~% q, c& N: I        breakpoint.sort()
. |6 k( i% D& F! w" d- F        return breakpoint7 N9 H  Z  Z+ u% @# ?

+ m$ o& k/ x# ]6 S. |    # 沿断裂点打断基因组,并删除不符合长度要求的序列片段,定义片段范围:200-1000bp- N- _1 {! ?" `1 I5 M& u6 Z9 e% p
    def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength):
8 _( b7 Q$ D: d5 D) @        for i in range(len(breakpoint) - 1):' p0 f7 A( L  n5 a, d; N7 O
            fragment = seq[breakpoint:breakpoint[i + 1]]+ f) o6 U+ a. a
            if maxfragmentlength > len(fragment) > minfragmentlength:* D+ u) ~% ?" L- y8 Z& `4 Z
                self.fragmentList.append(fragment)2 Q  P3 Y4 o& l* ^& @2 [6 i# X) T% l; b
        return self.fragmentList- j; @7 w) R, I
# a% C2 f% E0 y( ?0 m
    # 模拟克隆时的随机丢失情况,random.random()生成0-1的保留概率! M; o8 H9 [" c* N+ O7 u! z! X
    def clonefragment(self, fragmentList, cloneRetainprobability):7 h# ~* d- ^9 }# }# s1 l
        clonedfragmentList = []
' ?) F" N  V$ y7 s' C* `* ^  k        Lossprobability = [random.random() for _ in range(len(fragmentList))]
" _3 g/ [6 {5 j. f        for i in range(len(fragmentList)):
8 z* \! {9 X9 ~6 s/ o0 i; w9 T            if Lossprobability <= cloneRetainprobability:
8 u8 h1 \+ u9 K. m8 {  p6 K* }+ f                clonedfragmentList.append(fragmentList)
4 K* K3 g" m. x  s        return clonedfragmentList' k6 Q. x7 D/ j' V6 f0 c' j3 K

# I1 n+ ~7 m( P" P0 n' J    # 模拟单端测序,并修改reads的ID号0 Y6 X, m0 b, ]" `' O
    def singleread(self, clonedfragmentList):
# @$ X6 N( f9 z% `. z- g        for fragment in clonedfragmentList:. P/ L6 ^9 C, r% P6 n
            fragment.id = ""
! B9 K" N1 L; ^' F8 S            fragment.name = ""
- R8 K! I! z& ?* _' @; |            fragment.description = fragment.description[12:].split(",")[0]
0 H6 \% z! K& ^  h! o: \6 X3 W            fragment.description = str(self.readsID) + "." + fragment.description4 R4 Y! x& N8 q8 @+ l9 }; ?
            self.readsID += 1) Q: l0 R8 f4 u& _7 _
            readslength = random.randint(self.minreadslength, self.maxreadslength)
4 A( [5 E6 [/ J; A% R            self.allreadslength += readslength
5 A* i( N1 s) t! n            self.readsList.append(fragment[:readslength])
8 I5 w9 T' n8 M$ S* {$ `9 d: ~4 L* a. G& a! H0 k" r
    def singlereadsequencing(self, genomedata, sequencingResult):; [1 j4 X- a* C# U5 C) Y
        for seq_record in SeqIO.parse(genomedata, "fasta"):
8 @9 e3 p. B( U7 H8 f4 c            seqlen = len(seq_record)
! u  R  y- |6 i- k% _8 X# C            self.genomeLength += seqlen
/ H. I, P2 `8 a* f            for i in range(self.N):
: S( V. y8 L+ n# z: F4 A2 C6 u5 K                # 生成断裂点3 M; c! r& i4 V9 a: T0 \  B
                breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)! @& J) a# v* {
                # 沿断裂点打断基因组! y# p: g: i0 s; p; D* Y
                self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)- ~9 N( d5 n! y: {0 ]
        # 模拟克隆时的随机丢失情况; \( @/ z- G) @' y  j1 J, H& y
        clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
' W: C1 p4 T7 A3 C0 [" N        # 模拟单端测序! H1 j( W0 Y/ E, E9 g
        self.singleread(clonedfragmentList)* D* B' f* d8 `; W2 u5 b4 W
        SeqIO.write(self.readsList, sequencingResult, "fasta")9 X$ c& h8 H8 k2 @7 [

' q( s- @/ k4 G4 R! [    def pairread(self, clonedfragmentList):
+ J6 h$ z, [7 q( B- ?        for fragment in clonedfragmentList:
0 g. ?) U& n% x6 f, L1 W% b' {/ l; g            fragment.id = "": w  D3 m) k. i0 a
            fragment.name = ""
1 |- Z7 T: |/ ~7 d& @  X1 `3 H# H2 N            description = fragment.description[12:].split(",")[0]- N: O/ @7 c8 @( S. \
            fragment.description = str(self.readsID) + "." + description; z( H8 d8 ]  R+ d- m+ @1 z
            readslength = random.randint(self.minreadslength, self.maxreadslength)% k& f. A8 C6 a, b# h  ]% H
            self.allreadslength += readslength, ], z: W; W) v3 m' W8 S8 N3 h8 Y$ \
            self.readsList.append(fragment[:readslength])
, `: {# O, G6 `4 r: s# A: h5 t* m& J: x% x& L
            readslength = random.randint(self.minreadslength, self.maxreadslength)
$ ~/ ?/ y* j, O            self.allreadslength += readslength
7 a2 P1 }" N% y
3 o! }0 T6 @1 G: h! M1 V1 O            fragmentcomplement = fragment.reverse_complement()" e- |6 n4 T6 G  }- }% B1 |2 d' ?
            fragmentcomplement.id = ""; \  o" I& U, B  R1 y: r1 B
            fragmentcomplement.name = ""
9 M9 ?; y" V* |* r            fragmentcomplement.description = str(self.readsID) + "." + description
- D8 c( Z$ h' X5 z$ M            self.readsList.append(fragmentcomplement[:readslength])
' m/ I5 f. b( D1 v3 s1 c7 L  F1 \" Z; G2 G
            self.readsID += 1
" @! z: k$ w- O/ ]9 B' @+ X  S4 M, O. _/ Z: p+ b9 h* R* d$ z
    def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2):4 s0 s# G& _* `
        for seq_record in SeqIO.parse(genomedata, "fasta"):0 |# z6 {) u8 M
            seqlen = len(seq_record)- m0 N& Z, _! p) N  x! K
            self.genomeLength += seqlen6 j0 j# `. l7 J
            for i in range(self.N):5 d8 g9 n6 p. I( i9 G7 \9 C6 u) R* s! Q
                # 生成断裂点
' x; m" ?  i* I) j; x7 n; u                breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
. L! l- \  s# E) o                # 沿断裂点打断基因组& b1 N; x, G" p8 }) v9 Z! i3 o7 _
                self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)" S+ p7 m7 c2 h# ~2 u' m) L+ N
        # 模拟克隆时的随机丢失情况
6 t: {8 [9 q, @* h! Y" b3 J        clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
+ u) }/ a- R" T3 U  r& I5 R        # 模拟双端测序8 M. d; x# C9 r6 q1 W
        self.pairread(clonedfragmentList)
1 J7 }2 M8 e' V3 d& A& Z8 D0 g/ m        readsList_1 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 0]
1 A- n8 s6 q" X( Z% z        readsList_2 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 1]
7 M7 h) A3 b% u& m        SeqIO.write(readsList_1, sequencingResult_1, "fasta")
$ H2 G. r& f* h6 c4 _6 {) m        SeqIO.write(readsList_2, sequencingResult_2, "fasta")
3 P2 F0 F6 F2 v3 g/ z& N2 Z/ V  j5 U1 a
% ?& F* t$ L5 |0 v4 ]2 c" u    def resultsummary(self):
, e3 L+ y! V! r* i        print("基因组长度:" + str(self.genomeLength / 1000) + "kb")
. n% _7 d$ T$ M/ L        print("片段长度区间:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength))
0 p$ D5 ^/ a( t' b        print("N值:" + str(self.N))8 [- m. p  K* T% b% f) Q4 A
        print("期望片段长度:" + str(self.averagefragmentlength))
- E% w# Y* q% X7 d        print("克隆保留率:" + str(self.cloneRetainprobability))6 q5 u1 N* h! R& L4 g4 q" T, z  R' j: U
        print("片段数量:" + str(len(self.fragmentList)))
% S& d! I( u9 Y+ _* v* J* w        print("reads长度:" + str(self.minreadslength) + "-" + str(self.maxreadslength))
! [5 `7 ?9 R' s        print("reads总数量:" + str(len(self.readsList)))
8 p$ u; y6 `+ y  e3 a        print("reads总长度:" + str(self.allreadslength / 1000) + "kb")
  e' l4 m; b: p; i- M/ `: o        m = self.allreadslength / self.genomeLength
5 c, Z( Q+ ~* ~3 ^" g( T& h: M7 Y        print("覆盖度(m值):" + str(round(m, 5)))7 T0 A; A" H4 G# K& D
        print("理论丢失率(e^-m):" + str(round(exp(-m), 5)))
2 }. a' W  q% J0 H, G5 }) u& C4 N8 Y        print("覆盖率(1-e^-m):" + str(round(1 - exp(-m), 5)))7 r8 h/ `- |, O. l5 X( S
# -------------------------------------------主程序-------------------------------------------
$ v% C( n- M+ {# 模拟单端测序
+ P- c0 S4 v8 O, j6 S: h, O# |sequencingObj = Sequencing()0 C! T6 E0 B( b+ }! z
sequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")6 Z. C" w, X9 b+ y4 v: m
sequencingObj.resultsummary()0 G; p8 B$ Y* l; ^
2 }- G! y8 \2 _, ?, ?) m3 l; u
# 模拟双端测序5 k6 Y3 y. W" D$ ^
sequencingObj = Sequencing()2 b0 g3 i; @3 j
sequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa")$ t% D1 Z' H1 S6 F) G$ l3 o; W! _% m* V9 i
sequencingObj.resultsummary()
/ G* k2 j8 z/ y( Afrom Bio import SeqIO
' o5 Q9 d! }2 i6 dfrom math import exp1 C  L; F( h9 N. c
import random" E, H9 }& W6 D( o7 Z

. K2 O- D6 r: ^% hclass Sequencing:
5 N3 b  {0 ^# V, l! @8 P    # N代表拷贝份数) ^' `* r7 H( u. H
    def __init__(self):- M& h' V! d% G% _  `  e* {4 B6 f
        self.fragmentList = []4 C6 ^# [" T( l
        self.readsID = 1
* K3 l8 I0 q1 n5 k, D, p        self.readsList = []
" b) P3 P+ J* P  l/ \        self.averagefragmentlength = 650
# ~4 H. D9 j& y8 @# G        self.minfragmentlength = 5002 t- q4 `" [- R  ~( G, t
        self.maxfragmentlength = 800
# R2 W0 F- k" S8 @* n        self.cloneRetainprobability = 1
- o- J: s' i' W1 ^  J        self.minreadslength = 50
& d* J  U0 Q9 y5 U        self.maxreadslength = 150; h: L" p; f2 f+ _1 R
        self.N = 10
$ X8 W, V6 E) z, E        self.genomeLength = 0
4 {5 i" z* F* Q% \+ I        self.allreadslength = 0* U8 \: P! @) I
% @0 [: N* |" I+ B6 N( W% m: W
    # 生成断裂点! n1 S0 z  Q- W- G# Y& [8 M, R8 L
    def generatebreakpoint(self, seqlen, averageLength):
: b8 A) H9 F% O        # 假设平均每500bp 产生一个断裂点(averageLength = 500),通过随机函数生成seqlen/500个随机断裂点(1到seqlen之间的随机整数)
. i( H6 f8 l' J$ x9 _. D. E+ P        breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))]
) O$ @7 u7 R( X; s: |        breakpoint.append(seqlen)
2 o: o, A. y- r2 [  M        breakpoint.append(0)
4 o6 j) F! q0 W2 z/ r, i. a        # 把随机断裂点从小到大排序# J: V% p. E5 R+ k) m
        breakpoint.sort()9 L+ C+ p' b) C% N
        return breakpoint' d$ u, R8 O8 S# n

& R/ N% W8 m+ t7 s9 X3 e4 G    # 沿断裂点打断基因组,并删除不符合长度要求的序列片段,定义片段范围:200-1000bp
- \, F! ?* D1 V    def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength):) C( X# P6 t8 W5 h; N; B
        for i in range(len(breakpoint) - 1):6 b  [) r  v* O& @  i
            fragment = seq[breakpoint:breakpoint[i + 1]]. P# ?: [  j4 r
            if maxfragmentlength > len(fragment) > minfragmentlength:
2 ~6 C) j, ^8 Q* P                self.fragmentList.append(fragment)0 G& S: N) V  A
        return self.fragmentList
1 w& j' c% }4 [: i9 m' ^2 `8 l, b* {. y; f6 ?6 M. Y
    # 模拟克隆时的随机丢失情况,random.random()生成0-1的保留概率
: G: ^' r/ A1 @3 f2 P( Q    def clonefragment(self, fragmentList, cloneRetainprobability):
! m$ \) Q% H  n/ D* l4 `! k        clonedfragmentList = []
+ `  @- F) K) F' g0 I6 h: C        Lossprobability = [random.random() for _ in range(len(fragmentList))]
1 P+ H5 ^# d1 q# V4 \" O+ s        for i in range(len(fragmentList)):( W/ x7 d2 U! u( \7 d
            if Lossprobability <= cloneRetainprobability:
* t1 m; S3 e$ s* B( Q                clonedfragmentList.append(fragmentList)
5 C7 p1 c( P0 e% N# j4 e# d        return clonedfragmentList
# u, z9 K% s* r- b
. c) [% I) l0 x! X: q& w! F    # 模拟单端测序,并修改reads的ID号
# W& t8 \9 q+ _! ^    def singleread(self, clonedfragmentList):
! P7 ?4 {. k7 q6 Q        for fragment in clonedfragmentList:, f9 G4 x* |: x" T5 q
            fragment.id = ""+ g6 b4 M- a: t+ ]" a& Q+ u
            fragment.name = ""& T5 O" y* s/ h9 B" ~- @( ?
            fragment.description = fragment.description[12:].split(",")[0]; m" b' Z: ?2 w0 |+ ?  |  c
            fragment.description = str(self.readsID) + "." + fragment.description
* G/ O0 Y6 ~3 B* p% ?! C            self.readsID += 1! J7 Z2 F' P1 K1 h' f$ f
            readslength = random.randint(self.minreadslength, self.maxreadslength)# C4 N4 Y: K4 v7 t: Y
            self.allreadslength += readslength% T  k8 u4 }, k( J4 \0 o
            self.readsList.append(fragment[:readslength])3 C0 |+ j& |+ N4 H% ]2 b4 c

- y( f/ h0 Q4 I. z$ L7 ?    def singlereadsequencing(self, genomedata, sequencingResult):
+ l- A6 @5 G1 ~) i9 K* g        for seq_record in SeqIO.parse(genomedata, "fasta"):
2 |$ B/ x, G1 ^+ t            seqlen = len(seq_record)) r1 X4 U8 s# D! l5 k0 L4 ^
            self.genomeLength += seqlen& W" Z% S6 @, y- j+ S
            for i in range(self.N):
" S% v8 ^$ y& F$ N/ H/ ~! G6 M                # 生成断裂点
' f/ e: W# r; g. V- I! |                breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)( U5 ?/ p" E+ e7 b
                # 沿断裂点打断基因组# [" ]/ a1 O1 V. {% s6 T
                self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)8 h# _+ r8 ^: }7 A
        # 模拟克隆时的随机丢失情况/ k. C# d, u0 Q. L
        clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)7 I. z. K+ Y; g" [! C# Y0 E
        # 模拟单端测序
- q8 _6 j  b2 p8 y) A  O# M        self.singleread(clonedfragmentList)
5 |! ~' J8 E9 k% W        SeqIO.write(self.readsList, sequencingResult, "fasta")3 X' p2 \2 z9 H3 {' v
! S( ^8 ]$ Y3 U% g& i4 v8 h
    def pairread(self, clonedfragmentList):
; G: U. z+ a' R  }- {" Y7 T        for fragment in clonedfragmentList:
& Q2 O$ J4 t  D- {- X6 O& |            fragment.id = ""9 g# `9 T0 a* j* i* Z, s
            fragment.name = ""  f! u6 U3 B7 A! d: i( n$ V
            description = fragment.description[12:].split(",")[0]/ d2 q5 E6 N( |  n7 j3 @. r- s
            fragment.description = str(self.readsID) + "." + description
. y1 n; E% d3 d            readslength = random.randint(self.minreadslength, self.maxreadslength)
0 m" s- e+ D1 O' E# T  ]  @            self.allreadslength += readslength. }8 X+ |8 A0 k- X: y" p+ L" M4 W
            self.readsList.append(fragment[:readslength])' u0 S  ]" Y% J7 s

1 l: K2 |! t) y0 |/ r7 T4 F            readslength = random.randint(self.minreadslength, self.maxreadslength)
# Q' y# C4 {6 X6 R7 E4 d0 _            self.allreadslength += readslength
4 Q' i8 U3 O3 b1 i! S+ s! ?/ _4 h, w7 }9 r
            fragmentcomplement = fragment.reverse_complement()3 X  B8 X) ]: Y/ t; f1 {4 S
            fragmentcomplement.id = ""4 R* r- d. ]# B. _( S9 a  C
            fragmentcomplement.name = ""
6 A3 Q0 T( q, t# |4 @            fragmentcomplement.description = str(self.readsID) + "." + description: h4 H- a7 g4 K1 |; I) }0 `  w
            self.readsList.append(fragmentcomplement[:readslength])% S( y; F+ M" W; a
* Y) U5 d0 C0 I# E8 R. v
            self.readsID += 10 R9 g7 j7 v7 j* K( \) R

7 x$ K2 D( h! T6 q, p    def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2):
  p7 ]9 V1 H6 I        for seq_record in SeqIO.parse(genomedata, "fasta"):
* h0 h( }$ L! p- T3 ?9 p! t            seqlen = len(seq_record)' V8 _& m& ^; ]4 |% O0 \, L8 Q7 X0 n
            self.genomeLength += seqlen
: E% Q& q- u, Y2 q7 F9 S( H9 e            for i in range(self.N):
. {: ~# {8 F* R) U; U# j- R                # 生成断裂点
( @, \1 L' q9 w                breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)5 T' Y+ N" y6 W( r% g& B# f, Z
                # 沿断裂点打断基因组! n0 T$ S- h" A: V+ x# t
                self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
4 ]& r; k: P, N* z$ [        # 模拟克隆时的随机丢失情况$ D5 V8 \. W  R8 s* u
        clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
" G  G+ m3 ~9 `; E        # 模拟双端测序
6 L. n& d' ]$ L/ e        self.pairread(clonedfragmentList)9 C8 S+ D$ w: X& R  ]6 u/ G8 N$ e% i
        readsList_1 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 0]% |& T5 G2 A; T' n+ q% M2 g, u6 C
        readsList_2 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 1]
. t" p$ `3 K  r9 P8 `: ^% z3 S        SeqIO.write(readsList_1, sequencingResult_1, "fasta")- ?( t' A. K, W4 n# P
        SeqIO.write(readsList_2, sequencingResult_2, "fasta")) z0 F6 a$ K' ]

. n, H2 T$ z5 e. C, F    def resultsummary(self):
& a2 Y! ?. L2 f8 R/ |        print("基因组长度:" + str(self.genomeLength / 1000) + "kb")
! o% x$ f) T) q4 K        print("片段长度区间:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength))
3 e0 Z* q/ t$ X: d. l        print("N值:" + str(self.N))2 o" d8 |+ F' c3 Q/ p0 E4 L
        print("期望片段长度:" + str(self.averagefragmentlength))
+ @  N. `2 K( j2 _, O) F        print("克隆保留率:" + str(self.cloneRetainprobability))
3 Y8 Q8 O* X: r2 ~. v8 g; B3 r        print("片段数量:" + str(len(self.fragmentList)))  W. \" E1 h# g5 t
        print("reads长度:" + str(self.minreadslength) + "-" + str(self.maxreadslength))' }/ p! c% t: |5 s5 K" ~3 Z
        print("reads总数量:" + str(len(self.readsList)))
0 F! g" B$ @9 e" N2 `" d. {        print("reads总长度:" + str(self.allreadslength / 1000) + "kb")0 t6 |! m# \& K3 z( h& ]' m
        m = self.allreadslength / self.genomeLength
8 R$ g- z. D' J: F. k6 C6 [5 |        print("覆盖度(m值):" + str(round(m, 5)))
: N$ O: d  l# i! c$ |$ Z9 `        print("理论丢失率(e^-m):" + str(round(exp(-m), 5)))- @& M) Z; G( S6 F$ Y( l* k4 o# A# K
        print("覆盖率(1-e^-m):" + str(round(1 - exp(-m), 5)))
' i$ {; C: `/ G* r: L$ j# -------------------------------------------主程序-------------------------------------------
: k' M9 G# z6 a  f  x' |# 模拟单端测序/ s8 `: R* l( p& l
sequencingObj = Sequencing(): b* j& v0 `" d$ @# v- [
sequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")1 `6 R6 g  e. F: q
sequencingObj.resultsummary()
! S, B  J, x5 j/ _# X2 {) F' a/ V
' m4 g0 w* R( w9 q# 模拟双端测序8 y9 m( u+ K+ O% v7 ^6 t
sequencingObj = Sequencing()
* }: L& ^- U/ i, r( x- ~7 J0 E+ i" QsequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa")0 Z* J5 K9 _6 O, r  R
sequencingObj.resultsummary()
- j# W2 U6 T' w' m' m, A- r+ C. M7 O  M8 t( f
1 [+ M6 j2 h$ M* }: }
( c' g) N( U6 w$ f  U
* X/ |3 b" O( ?5 k

数学建模解题思路与方法.pptx

117.69 KB, 下载次数: 4, 下载积分: 体力 -2 点


作者: 3477959497    时间: 2019-4-22 10:49
不错。。。。。。。。。。。。。。。。。5 \+ R: k' z5 Z7 b5 ^





欢迎光临 数学建模社区-数学中国 (http://www.madio.net/) Powered by Discuz! X2.5