数学建模社区-数学中国
标题:
基因组测序模拟
[打印本页]
作者:
杨利霞
时间:
2019-4-21 14:56
标题:
基因组测序模拟
基因组测序模拟
( z; f; `: H- u3 k2 t* x3 ~) d
基因组测序模拟
2 N6 Q7 ], O% [, N. M
( e- q' N, w& e4 Z7 Y/ \
一、摘要
& v; X! `6 `" H! a7 l
) f3 `% r4 g1 v+ @- u3 u/ z0 b
通过熟悉已有的基因组测序模拟和评估程序,加深全基因组鸟枪法测序原理的理解,并且能够编写程序模拟全基因组鸟枪法测序,理解覆盖度、测序深度、拷贝数等概念,设置测序相关参数,生成单端/双端测序结果文件
: o/ A9 ~* B2 l! n/ X- k2 j( D
! D2 P7 B3 T( r5 u0 k" }
二、材料和方法
/ k( V; h6 t9 p6 e
3 U9 V5 }& H+ h9 w/ @2 k
1、硬件平台
5 m4 g( x- L( D* `1 A# M. z
& G2 m/ H, }) Z, U, Q1 p+ f
处理器:Intel(R) Core(TM)i7-4710MQ CPU
@
2.50GHz
5 N% `1 V' x9 d. _. V2 I9 S- J8 y1 H; O
安装内存(RAM):16.0GB
5 ]% ~; U5 V, O1 T. W/ K5 }1 E
& ]: I8 H2 N1 i% }0 Z: ?1 R7 Y
2、系统平台
4 C% t) S+ Q) s% j- v$ {( m# P! W
Windows 8.1,Ubuntu
, i+ L5 G6 k) p5 e6 L/ u( _! H
! Z j8 p( G" r' z
3、软件平台
& F: e( _ p, x1 v, R- x4 O4 O9 c2 K
- ]! A+ H0 _8 z% ~6 l O8 g
art_454
; n F! i, X: j, M
GenomeABC http://crdd.osdd.net/raghava/genomeabc/
7 c- h- r4 _8 L. y4 D( x
Python3.5
. u5 K- s5 m# D A) A
Biopython
; I$ u+ t/ F! G. F" X3 w1 O4 P
4、数据库资源
X. D" w" e0 L
4 ]' H" g$ O8 J: j
NCBI数据库:https://www.ncbi.nlm.nih.gov/
4 A, L } Z+ D- k9 J
$ K& n$ G: i6 W& w# a. t! j
5、研究对象
) N5 G" z+ b6 b$ N- W# ^# X" v0 l
: D! Z9 j5 ~4 R9 e; ~/ R0 I( _
酵母基因组Saccharomyces cerevisiae S288c (assembly R64)
" s8 E4 }# D4 B2 H* t7 s' V4 T+ S+ f
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz
C* z7 J) x$ T/ u" p2 x' H
6 b9 Z. s+ g9 `, }/ Q
6、方法
* p: _. k/ |* `( h
: U1 ~8 y$ g- a! {0 n' }
art_454的使用
: Q3 B& b, Y0 c/ p$ g1 C& b
首先至art系列软件的官网,下载软件,在ubuntu系统安装,然后阅读相关参数设置的帮助文档,运行程序。
2 z9 J) S$ B: _1 z* J
GenomeABC
, R# b6 A: S$ ?3 U: ?) r @, m7 T
进入GenomeABC(http://crdd.osdd.net/raghava/genomeabc/),输入参数,获得模拟测序结果。
4 Y! d- H4 e8 T. M
编程模拟测序
) R/ Q F, C. L: j4 v( \
下载安装python,并且安装biopython扩展模块,编写程序,模拟单端/双端测序。
" k5 Z$ @1 R( t9 _
三、结果
0 ^2 \" G2 y7 g; I
' Y3 Q4 j# P! J; u
1、art_454的运行结果
- t# @" _0 E8 u* ~! S% ?& K( g9 g0 p
" r1 V8 k0 \7 f
无参数art_454运行,阅读帮助文档
5 I, N1 ]5 a. o& b$ C' s
' C$ \! f6 L+ T/ I
图表 1无参数art_454运行
7 M8 X# h" u' t2 w+ ^
对酵母基因组进行基因组单端测序模拟,FOLD_COVERAGE设为20,即覆盖度为20.
" f2 B2 _ w* ~' s/ |* r
下图为模拟单端测序,程序运行过程及结果
: Y3 P4 {: o; Z- ]4 N+ m9 F1 L
# E, y0 X7 T" g- e0 j5 n
图表 2 art454单端测序
0 a+ K8 G1 D6 ^0 T( A$ o* y
! h. K) I$ E5 }4 _$ B/ i+ O R
图表 3 art454单端模拟结果
o( H" F6 ~4 U2 x/ _/ L
双端测序模拟,FOLD_COVERAGE设为20,即覆盖度为20;MEAN_FRAG_LEN设为1500,即平均片段长度为1500;STD_DEV设为20,即长度的标准差为20
7 n! f7 n: D8 b# f- v
下图为模拟双端测序,程序运行过程及结果
7 a. _, U% d3 X
% K* c# ?; [; W. w$ ^* h" Q' C5 a
图表 4 art454双端测序
- ~6 s+ _& ?/ X1 b
6 a3 U2 {+ S! K9 F1 ?. C3 ^
图表 5 art454双端模拟结果
/ b. ]3 g! }% z* w
2、GenomeABC
- R+ S" P- i+ z2 U! k% W/ u
下图为设置参数页面
% [7 J) F6 e6 X& C) y; p& W
: W' c0 W: z" N
下图为结果下载页面
]4 y5 O+ U7 K6 x
3 \9 Y! A( ?/ S# s. C9 Y
图表 6 结果下载页面
/ O: B+ _" S/ n. B- ?
3、编程模拟测序结果
/ w; I4 ^4 @& B% |; C5 k
拷贝数是这里的N值;覆盖度是m,测序深度是宏观的量,在这里与覆盖度意思相同,就是测序仪10X,20X。
9 v( @3 q. b( V, c+ I' F& }) a
单端测序
1 p2 m4 ~0 E: }& o( B, e
4 [8 W& Q4 c5 F6 Q @
图表 7 程序模拟单端测序
" |" K6 \* Q# V0 e
双端测序
5 d2 P Q8 t0 u1 @' y
( L) v' S* T8 d9 {
图表 8 程序模拟双端测序
2 N7 T# K7 a: V
测序结果
7 a6 c2 _: W% L
; N5 g/ ]6 h, y$ p. b1 K
图表 9 结果文件
. B: U8 q1 D* N- Q
9 v6 \% t! j: z' s3 _% L
因为期望片段长度是600bp,在片段长度区间200-1000bp内,所以大部分的片段都没有删除。
# X+ G U& \3 Z
测序结果统计表
H! E9 d9 u7 K7 O, a% J L
& K9 V+ I, q# S/ M- W2 x
测序方式 基因组大小(bp) 片段长度区间 (bp) N值 期望片段长度 克隆保留率 片段数量 Reads长度范围(bp) Reads总数量 Reads总长度 覆盖度(m值) 理论丢失率(e-m) 覆盖率(1-e-m)
2 Q/ g& S5 g1 _0 `
单端 12157kb 200-1000 10 600 0.95 107378 50-100 101968 7645.541kb 0.62889 0.53318 0.46682
! ?; P: {* B8 j; _: e
单端 12157kb 200-1000 20 600 0.95 213722 50-100 202996 15227.882kb 1.25259 0.28576 0.71424
2 D! q' o$ @% W
双端 12157kb 200-1000 10 600 0.95 106704 50-100 202770 15212.662kb 1.25134 0.28612 0.71388
" \8 C4 e3 Q& Z$ F
双端 12157kb 200-1000 20 600 0.95 214212 50-100 407186 30534.265kb 2.51164 0.08114 0.91886
3 P; P& z& X8 d: k, f8 u, t3 N+ R0 j
四、讨论和结论
2 ~- v* C' x/ R1 ^; Z: _
: f2 }5 b* h, N$ m7 f, n
程序运行方法
L8 I% z; M$ S! G8 \
8 k2 s& O2 W9 e
在类的构造方法init()中,调整参数。
W6 N7 K8 I8 ]: k1 j
Averagefragmentlength为片段平均的长度;
& u. E: Q8 B6 Q6 j( K9 Y' @! N* |/ R9 W
minfragmentlength和maxfragmentlength是保留片段的范围;
3 g3 w5 D) q! F; j9 \" H! b7 ~ R+ ^: R! P
cloneRetainprobability是克隆的保留率;
5 _5 _) l' y# ]" Y
minreadslength和maxreadslength是测序reads的长度范围
& Y: @) n# _# {& u Z0 H
6 M- X! [3 s p) d( O6 i \
模拟测序的诸多方法都封装成了Sequencing类,只需要创建类,并调用singlereadsequencing()和pairreadsequencing()方法,传入文件名的参数即可。
: @2 o/ W {5 e; \7 `6 v& k
% Q( Z k7 S: E5 `
附录
, D5 X1 L6 Y+ A: s) U
" r( R8 k8 b9 h& a1 M
from Bio import SeqIO
" w& o. I% ]7 v( C9 P* P
from math import exp
9 Y! M. c' L: }9 d: I) I1 k% W
import random
+ ^. J, \0 k: ^* F" |; r$ o
% n" e1 M0 h% C) E( |0 K) q/ N8 n
class Sequencing:
/ L! y& X- q# [" W3 J' D
# N代表拷贝份数
' ^% [9 n# j4 L+ d& f a6 S2 }. F
def __init__(self)
+ W% ^3 T* M' H. t1 F3 f
self.fragmentList = []
9 s3 v+ _) @- v& f% [ W7 v3 G
self.readsID = 1
p1 K3 x. ^& Z7 x( T
self.readsList = []
2 h1 J0 Q2 E& H% P
self.averagefragmentlength = 650
7 k0 B+ \( d8 Y1 O9 T8 Z& ~; z2 N
self.minfragmentlength = 500
9 Y. y. o$ k/ _
self.maxfragmentlength = 800
# `) |& C! J# h1 _0 n
self.cloneRetainprobability = 1
- ?/ d4 I* b1 e, c+ a# N
self.minreadslength = 50
F0 |4 z6 T4 {8 ]) I$ B# t- D
self.maxreadslength = 150
$ p4 m' s- s" @9 V" ^
self.N = 10
. B q( I) }) c4 u0 D
self.genomeLength = 0
: H# V1 c8 X2 _; Y3 V: l9 o8 p. v! M; Z
self.allreadslength = 0
0 {- {, m+ n. N, e. w8 c9 k
* }0 T6 D* P! M% Q% q% C# j; K! B
# 生成断裂点
5 I: d0 i9 W* A/ H/ A E; n$ g
def generatebreakpoint(self, seqlen, averageLength):
1 ]6 t, V6 `9 ]
# 假设平均每500bp 产生一个断裂点(averageLength = 500),通过随机函数生成seqlen/500个随机断裂点(1到seqlen之间的随机整数)
. t5 C" C+ F2 |8 j- G
breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))]
7 m( w8 t/ \8 E8 e# Q3 y( o( S& ^+ K
breakpoint.append(seqlen)
8 O; a3 ^1 f6 { |4 ^; k9 F+ e d
breakpoint.append(0)
3 ^' P1 F7 y8 Q% k- A& ~7 w
# 把随机断裂点从小到大排序
7 Z# _' ?0 V; L2 e, y/ |7 I. y1 J! J- k
breakpoint.sort()
$ _5 P, M( `: y8 H2 g0 l* V
return breakpoint
3 z; t1 p/ d7 C& \# i3 q& l
- f- W7 ?6 w ^5 G; s% ?
# 沿断裂点打断基因组,并删除不符合长度要求的序列片段,定义片段范围:200-1000bp
8 J) `/ ~; H5 f! Z% E
def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength):
5 g# f$ K& I1 w5 l' p# ?2 Y
for i in range(len(breakpoint) - 1):
M7 }% u! [. T+ o
fragment = seq[breakpoint
:breakpoint[i + 1]]
# m( M' p; K/ {5 e
if maxfragmentlength > len(fragment) > minfragmentlength:
, ]7 J8 h4 D5 V1 C: M2 J
self.fragmentList.append(fragment)
, p, Z7 m+ a5 h; c! |& T
return self.fragmentList
- w( S2 E& W& X) {; Q7 {$ L
8 [7 p& C4 V2 H4 g" a
# 模拟克隆时的随机丢失情况,random.random()生成0-1的保留概率
" L1 i3 O2 F; m& [
def clonefragment(self, fragmentList, cloneRetainprobability):
) g6 O9 E8 ]! S2 {* a
clonedfragmentList = []
, K0 h2 K6 t. A) u+ R; D
Lossprobability = [random.random() for _ in range(len(fragmentList))]
. g7 I$ B4 a1 W' p
for i in range(len(fragmentList)):
# I# i9 e" V5 ^1 |% L) b3 i) s
if Lossprobability
<= cloneRetainprobability:
0 j8 s3 u8 ]& F0 x. Y9 Q
clonedfragmentList.append(fragmentList
)
% _6 l1 i- p9 ^: [; O# Y9 t& ]
return clonedfragmentList
" d# h* l( y" \- r8 n
2 D6 D N7 u3 l+ f' p; C
# 模拟单端测序,并修改reads的ID号
% k9 n; d/ v+ [& @# ` P" H+ o! E
def singleread(self, clonedfragmentList):
$ V R1 {) r- h' j2 j9 z0 W. g
for fragment in clonedfragmentList:
) A$ F8 _1 q' Q d& O
fragment.id = ""
2 G6 p( N Y1 y! N/ l
fragment.name = ""
- _: L6 _1 J9 o
fragment.description = fragment.description[12:].split(",")[0]
7 H! w, l- E5 g2 Z
fragment.description = str(self.readsID) + "." + fragment.description
; r; }% W u6 w# O( A! f# M
self.readsID += 1
. I/ L1 c5 l7 @. S3 _) y4 V
readslength = random.randint(self.minreadslength, self.maxreadslength)
/ U5 t. g4 \* \( x/ B
self.allreadslength += readslength
; l" ~* K7 x( t/ l2 B( e
self.readsList.append(fragment[:readslength])
$ l0 I5 M; Z5 [: f# b
0 C% j' N B3 \2 d
def singlereadsequencing(self, genomedata, sequencingResult):
/ |" G9 {1 Q2 X$ _1 p( H
for seq_record in SeqIO.parse(genomedata, "fasta"):
) l2 B" X6 s7 Q$ W' _
seqlen = len(seq_record)
, ^' e4 ^; w1 u8 i8 \* ?
self.genomeLength += seqlen
$ w' y& ?2 c1 o. B. z" O
for i in range(self.N):
$ ~! D% Z3 R$ X, M' k& ?- n
# 生成断裂点
) W& a- Z+ B4 L7 V$ n
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
0 B3 A8 C! N7 D$ D( t" t) A
# 沿断裂点打断基因组
: f4 G& U. F2 s8 a( A3 V
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
, g3 R) K4 `' S; G0 G4 a
# 模拟克隆时的随机丢失情况
6 P: Z, Y# o( [, I4 X
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
! s P7 [8 K1 L( `+ E$ O0 |
# 模拟单端测序
+ E5 p9 x$ \. e$ Q, Z
self.singleread(clonedfragmentList)
* m% ^3 L0 S: V% V* v
SeqIO.write(self.readsList, sequencingResult, "fasta")
; J! S/ f' l5 T% i
' N! o% x7 q; o. c! \4 q# V: D
def pairread(self, clonedfragmentList):
0 |& S: x& I$ i: ^% @( S% k
for fragment in clonedfragmentList:
2 a! }; _3 [8 ^& X( S& a& F
fragment.id = ""
0 \" S$ n' U% P% f) R
fragment.name = ""
# S0 S, Z7 `6 W# h
description = fragment.description[12:].split(",")[0]
3 I& _+ n, B7 _+ }' `/ j# h+ S
fragment.description = str(self.readsID) + "." + description
/ R2 P8 c% W+ u" K$ n$ m" @
readslength = random.randint(self.minreadslength, self.maxreadslength)
5 f9 D4 W3 p- v/ p0 h) ]% [% J# Q
self.allreadslength += readslength
% u. B" w6 I9 B( I' E
self.readsList.append(fragment[:readslength])
% j; ]. X6 q# W% l
0 z2 }% C9 \1 t3 @
readslength = random.randint(self.minreadslength, self.maxreadslength)
$ n( A7 g3 M! f" |2 L
self.allreadslength += readslength
1 @& X8 f+ ?3 \+ B9 Z& ^% ^
, y) K4 X% e% @8 p' D* W
fragmentcomplement = fragment.reverse_complement()
9 X" A( }9 P) z
fragmentcomplement.id = ""
# {- E4 I6 |, Y' R
fragmentcomplement.name = ""
& @7 V2 \+ U7 {6 a+ V' W- a* @) G, d
fragmentcomplement.description = str(self.readsID) + "." + description
; i3 t! ~1 i. s" A5 M! Z0 r
self.readsList.append(fragmentcomplement[:readslength])
' c! P# v, ]) ?5 y1 u7 I
8 J! J/ E4 r1 G% H9 n5 Z3 B
self.readsID += 1
7 n9 M/ ?2 X4 Q2 V, Z0 H6 M
. T; a& j" x* P# K8 u3 U4 p
def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2):
' M; W6 \) R9 L' p8 d3 I
for seq_record in SeqIO.parse(genomedata, "fasta"):
. G( y- F- G# K6 n1 b+ K6 m
seqlen = len(seq_record)
; i6 _- W+ h9 Q6 s) k
self.genomeLength += seqlen
8 l z; B X. {, H6 ?( C& a4 Y3 j
for i in range(self.N):
* J2 `3 Z/ R6 N6 R% W' p
# 生成断裂点
0 y8 l' ~; _' z+ O& G* U' u: u# _( i+ Q
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
7 [( [7 ?, H2 B
# 沿断裂点打断基因组
, h7 b, g5 G# ]0 p( I4 W
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
6 \$ h" N0 @0 g" k/ R: b! `8 _9 s
# 模拟克隆时的随机丢失情况
7 @6 h# h& p' t3 H% K$ a6 Y3 M
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
$ R) B) [" P( b% W" W
# 模拟双端测序
6 M' H# b$ X, Y, I$ x5 E+ J
self.pairread(clonedfragmentList)
1 f; f _4 x! L$ h; w+ {4 e/ V
readsList_1 = [self.readsList
for i in range(len(self.readsList)) if i % 2 == 0]
) _) r% w& m% i) z3 W+ m
readsList_2 = [self.readsList
for i in range(len(self.readsList)) if i % 2 == 1]
3 [( A7 ?" }% A4 W
SeqIO.write(readsList_1, sequencingResult_1, "fasta")
3 }1 }+ I9 |8 @5 k1 s, e) E1 L
SeqIO.write(readsList_2, sequencingResult_2, "fasta")
1 m# {2 Q) d0 ?3 H& }* w. _) S
h: C- ~4 B0 ^$ b. g1 j$ p
def resultsummary(self):
9 p9 q D1 Q f0 H/ m
print("基因组长度:" + str(self.genomeLength / 1000) + "kb")
% Q* z% n5 k% }9 @* P' l
print("片段长度区间:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength))
* X# N) X& Y3 n% d9 D, ~3 l
print("N值:" + str(self.N))
V% B, P; v7 c. t J7 b7 y) B
print("期望片段长度:" + str(self.averagefragmentlength))
. f' `6 j5 s, `' w
print("克隆保留率:" + str(self.cloneRetainprobability))
; j( f0 V7 a3 d- y! _ D! J# e
print("片段数量:" + str(len(self.fragmentList)))
6 f# m. ~' Q4 M, w% a: ^ U) C
print("reads长度:" + str(self.minreadslength) + "-" + str(self.maxreadslength))
5 n' r8 A. F# n8 f
print("reads总数量:" + str(len(self.readsList)))
7 O% B+ ?1 {9 t' @: F5 j
print("reads总长度:" + str(self.allreadslength / 1000) + "kb")
/ t( ^9 U) P) k5 o
m = self.allreadslength / self.genomeLength
% E+ |7 G/ a I5 ?$ b: v% C
print("覆盖度(m值):" + str(round(m, 5)))
$ S) i9 S6 U" g1 ^! t/ ~* a* i. Q) ~
print("理论丢失率(e^-m):" + str(round(exp(-m), 5)))
( J1 R# C" p: F" n7 @0 d, }
print("覆盖率(1-e^-m):" + str(round(1 - exp(-m), 5)))
$ c8 u/ `) T v9 h, b
# -------------------------------------------主程序-------------------------------------------
0 I/ J2 f& X* O- p7 u
# 模拟单端测序
; u7 Y* l" y) J( y$ g
sequencingObj = Sequencing()
5 K! r z. a6 D. V$ y+ d. C
sequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")
% T- H, H: J9 e% M9 d
sequencingObj.resultsummary()
" W/ T/ m, H- W# b
& R3 E& b m) s2 O5 e* d( n
# 模拟双端测序
7 T' v# |% z. J! D; p
sequencingObj = Sequencing()
1 @$ ?( C; w3 @3 n
sequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa")
5 V7 P, q4 z5 I6 N
sequencingObj.resultsummary()
5 \' ^) @ C6 e: B6 S" t
from Bio import SeqIO
2 |( e3 D( x+ w
from math import exp
. B# ^! l/ c$ W" ?3 k8 @
import random
! X+ k# y0 A4 u- I3 E2 j
# d% l* @# b5 S y* M9 I
class Sequencing:
1 l1 \& Y$ O/ r) o$ n6 C% \* ?; ?
# N代表拷贝份数
% w3 O% X N# I3 |/ X/ z& f1 K- n" @
def __init__(self):
/ L5 K' ^/ A3 P! R! f6 p: `% p) h
self.fragmentList = []
& [" y" y; a+ B6 m t
self.readsID = 1
: u/ l. f. w7 o$ \8 z
self.readsList = []
) F0 C; F; J) `+ Z2 a
self.averagefragmentlength = 650
' k7 V) N6 `3 ]! h9 H6 z! U
self.minfragmentlength = 500
& ]1 v# r/ I' b+ f& Z/ p6 N
self.maxfragmentlength = 800
! `: m; p8 o y. {
self.cloneRetainprobability = 1
! ^6 ]. W7 j3 p8 A* {
self.minreadslength = 50
, w) S5 l7 t$ g
self.maxreadslength = 150
; ]* s3 H/ i- u9 p
self.N = 10
& ?4 V: @/ c- ]1 z
self.genomeLength = 0
2 i5 d/ e) {; E* F' q" W# t
self.allreadslength = 0
$ W8 s4 h3 {. @* ^# }
: M9 V# r" ]5 }
# 生成断裂点
- S. A0 e) Q1 r9 |3 k9 L
def generatebreakpoint(self, seqlen, averageLength):
6 w/ K$ @: ~% a
# 假设平均每500bp 产生一个断裂点(averageLength = 500),通过随机函数生成seqlen/500个随机断裂点(1到seqlen之间的随机整数)
- ^- s8 {, w. S& }' D) {
breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))]
, y& ^5 E5 c, N
breakpoint.append(seqlen)
3 H3 a C4 v1 X" E o; K
breakpoint.append(0)
9 h1 m: |* M" `9 n% w$ F
# 把随机断裂点从小到大排序
! r+ y+ ^+ D( S5 ^9 d% q* `" [
breakpoint.sort()
i/ j+ E0 \ ` Z1 ^ ~( p
return breakpoint
2 e* `6 y5 l2 s0 |4 |( `
& _) I* [2 N" `9 H' M c8 Q
# 沿断裂点打断基因组,并删除不符合长度要求的序列片段,定义片段范围:200-1000bp
! Z" k6 U8 \9 ^
def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength):
4 h6 f* L1 @: S
for i in range(len(breakpoint) - 1):
M% n% ^1 M }+ g0 s
fragment = seq[breakpoint
:breakpoint[i + 1]]
* E3 W8 o4 t- w
if maxfragmentlength > len(fragment) > minfragmentlength:
! p% U6 Y$ v) b8 O
self.fragmentList.append(fragment)
, k3 a/ i1 m4 c; ^: G
return self.fragmentList
9 R/ d% w) |* t+ l% z
) y1 [, O3 @- c" R0 V ~% ^
# 模拟克隆时的随机丢失情况,random.random()生成0-1的保留概率
" d7 f# `. v4 ?9 h% t0 G2 f
def clonefragment(self, fragmentList, cloneRetainprobability):
1 S2 T3 A! o% l: l' M
clonedfragmentList = []
) O, X9 L7 h3 S y
Lossprobability = [random.random() for _ in range(len(fragmentList))]
. o5 F4 T1 g5 d" `0 I' r! p* F
for i in range(len(fragmentList)):
) C4 ~6 V2 D: }$ O( F
if Lossprobability
<= cloneRetainprobability:
6 H; I) u: g; A. Q W! ?* _
clonedfragmentList.append(fragmentList
)
0 a4 F8 H* ]4 \
return clonedfragmentList
% A) t# X4 |/ Z) z. Z
% T, r+ `& B1 y- g; M6 m0 b
# 模拟单端测序,并修改reads的ID号
3 i$ e; j2 ?. x
def singleread(self, clonedfragmentList):
. x9 i! c9 P0 a G7 T! c9 m w$ r1 o
for fragment in clonedfragmentList:
2 s4 |, W4 u# W6 `8 l
fragment.id = ""
3 v1 |4 u2 i8 Z7 P# {, e+ m
fragment.name = ""
: s7 f' ~4 V7 K; u
fragment.description = fragment.description[12:].split(",")[0]
1 G. k7 n ?9 \+ Q0 O2 A# W8 t
fragment.description = str(self.readsID) + "." + fragment.description
. o4 u! b+ A( N5 V% t, v1 _
self.readsID += 1
& B- |# o% e" y
readslength = random.randint(self.minreadslength, self.maxreadslength)
' |2 q7 W' s! n8 y9 t( a6 E- j- g* h
self.allreadslength += readslength
( f: I: x/ W9 a7 V
self.readsList.append(fragment[:readslength])
2 Q: b7 h- G9 L* o4 c1 [
6 k4 x/ n* h/ G3 e( _
def singlereadsequencing(self, genomedata, sequencingResult):
9 K- Y8 ^& e, u+ g# L# G$ V
for seq_record in SeqIO.parse(genomedata, "fasta"):
# c. w" q" ]! o# V- ~- f7 k
seqlen = len(seq_record)
3 }. J# K& d) d: h8 {
self.genomeLength += seqlen
9 p: A- `" d% Y" ?% P+ [: ^, p; c
for i in range(self.N):
6 K1 \+ r; S2 D
# 生成断裂点
" I* q7 u7 z- [4 N
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
+ q/ e; H8 b' r& A' _
# 沿断裂点打断基因组
. t9 }% D0 p1 l# {. ?7 ], w
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
$ q- d: e5 J9 ^6 F" q8 l9 ?
# 模拟克隆时的随机丢失情况
, r- d! @1 u* Q/ F, ^
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
% P# V% }) @/ z0 `4 Z/ q
# 模拟单端测序
5 w+ H. P" R4 ]2 N$ B' U
self.singleread(clonedfragmentList)
' \# {3 N& v: C: M4 `) N
SeqIO.write(self.readsList, sequencingResult, "fasta")
. _+ ?" k" F# ]7 o2 d
* T& B; U$ V8 S n0 A7 ~# c3 n
def pairread(self, clonedfragmentList):
' ?& A- J6 e2 y$ P# X$ E0 n
for fragment in clonedfragmentList:
7 C% A! k/ j/ X& ^, o0 i
fragment.id = ""
! P1 y+ w z& Z
fragment.name = ""
7 c( \# w6 v$ s9 @
description = fragment.description[12:].split(",")[0]
$ x1 C! ?0 Y; d2 F2 V* d. W
fragment.description = str(self.readsID) + "." + description
% n* P+ ~ ^4 g; a9 F- P
readslength = random.randint(self.minreadslength, self.maxreadslength)
2 u1 d! ~0 ^9 q
self.allreadslength += readslength
" S; S6 E, p8 B5 L
self.readsList.append(fragment[:readslength])
6 _7 e/ @8 l, M0 `" {* @% g
- B8 @$ G: _4 K: q8 Q7 w) m: Q
readslength = random.randint(self.minreadslength, self.maxreadslength)
6 s' P& q9 U& d7 A/ f
self.allreadslength += readslength
1 O1 _! }& P& ^- l- }3 ^
( X* U# p" E% Z+ f8 s) b3 o
fragmentcomplement = fragment.reverse_complement()
D- p' @% s Q8 w6 {
fragmentcomplement.id = ""
0 d! P5 x% X" `* [ w
fragmentcomplement.name = ""
; k: U+ W _3 K
fragmentcomplement.description = str(self.readsID) + "." + description
, ~- w# N+ w" z' R% e. k! n z
self.readsList.append(fragmentcomplement[:readslength])
# i) L( }9 Q9 a$ ^7 [3 W& \+ [9 x
1 _7 z* e& a* ?7 q8 q
self.readsID += 1
% ]. h' r* ^5 ]# @- G* }
- s6 U6 R; [+ x0 F
def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2):
4 V6 h4 j3 C4 V% V
for seq_record in SeqIO.parse(genomedata, "fasta"):
4 }6 ~" r, z$ g! B7 y2 ^
seqlen = len(seq_record)
) I \# Y) Z+ Y5 Q
self.genomeLength += seqlen
6 N( ~7 K/ ~5 \: k
for i in range(self.N):
% ]- H, |! V% W' b# O5 K1 K
# 生成断裂点
" O) [% ^+ Y% d1 }& y0 y( a7 z) B- o' F
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
7 u0 g8 H/ u( v
# 沿断裂点打断基因组
4 n: ~( s8 |9 D
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
& P( {( G) F! A) a! r
# 模拟克隆时的随机丢失情况
4 g! u3 V( ^( J5 W
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
8 U" j7 ^* z) ? C( [
# 模拟双端测序
8 D. J0 i& T; ?* {4 T v
self.pairread(clonedfragmentList)
5 R% r+ u& b( M3 j9 d/ T2 [
readsList_1 = [self.readsList
for i in range(len(self.readsList)) if i % 2 == 0]
2 k; C, C3 X* \6 N M2 d& a5 d; z
readsList_2 = [self.readsList
for i in range(len(self.readsList)) if i % 2 == 1]
! G& I( K+ G- v8 {7 ~1 P
SeqIO.write(readsList_1, sequencingResult_1, "fasta")
; ~6 g2 d1 R0 q
SeqIO.write(readsList_2, sequencingResult_2, "fasta")
+ x _5 G3 X9 a% c; n. A* I
# L' p7 \8 k) \& n& a
def resultsummary(self):
& C; r- V1 q. _# |4 }7 \, n
print("基因组长度:" + str(self.genomeLength / 1000) + "kb")
$ i ]* F$ ^' p0 i4 r! k
print("片段长度区间:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength))
# M8 R9 i3 {7 \" k1 m
print("N值:" + str(self.N))
- j( L1 O) |5 I, h. ?. W( V
print("期望片段长度:" + str(self.averagefragmentlength))
: G6 s: X9 {* i M H2 [5 z" q/ \; M
print("克隆保留率:" + str(self.cloneRetainprobability))
- r5 g' ~; Q$ L. E& n
print("片段数量:" + str(len(self.fragmentList)))
6 a* l* U. J; G% n: B& l
print("reads长度:" + str(self.minreadslength) + "-" + str(self.maxreadslength))
* q' n* m% K. B) Q
print("reads总数量:" + str(len(self.readsList)))
0 o' A( P2 o1 C' p
print("reads总长度:" + str(self.allreadslength / 1000) + "kb")
# [8 W- F( {" b4 s
m = self.allreadslength / self.genomeLength
/ R% U0 O' z8 l" E$ ^1 V
print("覆盖度(m值):" + str(round(m, 5)))
# e: |, {$ @ S" B& Z& \
print("理论丢失率(e^-m):" + str(round(exp(-m), 5)))
& l4 k/ [2 h: f7 [
print("覆盖率(1-e^-m):" + str(round(1 - exp(-m), 5)))
% `7 x; q" B W( L
# -------------------------------------------主程序-------------------------------------------
* h3 t" S- O" z1 m
# 模拟单端测序
l3 p( F9 X/ o: Q
sequencingObj = Sequencing()
" `' G1 W4 A1 J- A
sequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")
. G+ i+ x& I: |+ |% B* m( ^. a
sequencingObj.resultsummary()
, G; E1 ]. g+ X8 I* `
9 C/ k' F3 f9 |) E' {8 f, o7 T
# 模拟双端测序
3 z( w+ ^1 D5 B7 A) N2 q- B) Q* s
sequencingObj = Sequencing()
2 c6 g* [4 J# m& f& q( ?( `8 ]
sequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa")
8 P6 R' `( g% ^0 T& m# n' H
sequencingObj.resultsummary()
3 J: P& K) ?- ~4 {8 G) V6 M, o
1 \3 D2 z4 z& E3 c+ T5 Z
! ~ u% O7 V Y2 i3 s
1 f8 v1 \2 X8 R1 Y, a
0 R9 u3 Z; M: n3 q' _' i# k8 b6 h
数学建模解题思路与方法.pptx
2019-4-21 14:56 上传
点击文件名下载附件
下载积分: 体力 -2 点
117.69 KB, 下载次数: 4, 下载积分: 体力 -2 点
作者:
3477959497
时间:
2019-4-22 10:49
不错。。。。。。。。。。。。。。。。。
$ G, w, J% d, n: P% U& p N8 _
欢迎光临 数学建模社区-数学中国 (http://www.madio.net/)
Powered by Discuz! X2.5