数学建模社区-数学中国
标题:
基因组测序模拟
[打印本页]
作者:
杨利霞
时间:
2019-4-21 14:56
标题:
基因组测序模拟
基因组测序模拟
6 O+ n P1 P: B5 }8 L& r% ]( [
基因组测序模拟
/ ?/ ?+ z7 L3 A* G# L1 ^1 o0 m
. ^8 s B/ D8 v6 ^
一、摘要
# G) w {0 y, ?4 ~
- H t5 v7 U2 L
通过熟悉已有的基因组测序模拟和评估程序,加深全基因组鸟枪法测序原理的理解,并且能够编写程序模拟全基因组鸟枪法测序,理解覆盖度、测序深度、拷贝数等概念,设置测序相关参数,生成单端/双端测序结果文件
- F2 B X" q/ |) l% g7 K, z8 l
. G' I2 b. ?) n" H( X- L9 K4 i0 Z& P1 S
二、材料和方法
. G2 }( n4 P! w0 B
+ Y A8 {( I( x6 G
1、硬件平台
3 h$ m# B7 Q: a6 a5 U* E2 _, J Z
1 L( X# L+ V. D, M+ r2 n
处理器:Intel(R) Core(TM)i7-4710MQ CPU
@
2.50GHz
_8 z- Y0 v6 ?" [$ D) Q8 u4 r
安装内存(RAM):16.0GB
" i: n- N. q/ z5 V3 k% U* A y o
! V4 v0 }2 l' i0 d5 c- h
2、系统平台
4 c; f" }; i* G: ]- i: W$ c) d
Windows 8.1,Ubuntu
b9 w: ~, H- d% O5 U7 B1 g& u
6 T; m- Y. R$ g3 x: Z
3、软件平台
9 i! W+ ^6 h! x% T( Z0 w: ?: c, _6 G
2 k( w1 p; k7 T; j
art_454
* s; \+ X) b! i- m" u3 p
GenomeABC http://crdd.osdd.net/raghava/genomeabc/
$ @8 [5 }6 }* Y$ Y" [
Python3.5
7 [8 m r% f8 x: j* v" i& v& f/ a
Biopython
, W1 o, q3 m2 D9 \ H
4、数据库资源
8 o' s/ N2 K3 M. E. |
: @ g" {5 ^$ \ p
NCBI数据库:https://www.ncbi.nlm.nih.gov/
* ?7 o5 W# G- ~- U
$ b+ i# {7 K. B; o: v3 O8 x5 [
5、研究对象
5 g9 A, y F' ]' y' w# E
% i2 a! S4 s! m$ F( }
酵母基因组Saccharomyces cerevisiae S288c (assembly R64)
x& j% u% u) `) C! j
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz
$ r* Y" e$ u8 Y8 F
5 D; Z# q) E# }5 a2 d
6、方法
* F# V6 |0 E) c) {
& R, t* s9 c+ B* T* \. Z
art_454的使用
' H# G6 n2 f8 x- X y3 E' h
首先至art系列软件的官网,下载软件,在ubuntu系统安装,然后阅读相关参数设置的帮助文档,运行程序。
# h& q3 C+ ?0 x4 f" ~
GenomeABC
; ~5 D* q1 G g% M
进入GenomeABC(http://crdd.osdd.net/raghava/genomeabc/),输入参数,获得模拟测序结果。
8 {/ k. K& ` j6 T, O+ _
编程模拟测序
/ [+ u, C) j, R
下载安装python,并且安装biopython扩展模块,编写程序,模拟单端/双端测序。
" e1 @4 a9 M# Q4 r$ P$ l
三、结果
/ L' T% r* I5 D$ n; B. q
0 v+ ]3 l3 ~, e2 {6 j# y
1、art_454的运行结果
8 [: B }% ?5 v
0 X; P; b* c+ l7 [. S' M* b X$ S
无参数art_454运行,阅读帮助文档
2 X$ q& q# Q% a. }# t7 H* k- V3 Z7 O9 l. p
7 E/ Q( w L b# H D3 I" I
图表 1无参数art_454运行
8 e' ~' I& Y& ]% c/ b
对酵母基因组进行基因组单端测序模拟,FOLD_COVERAGE设为20,即覆盖度为20.
8 d9 i6 A- q5 h n$ |
下图为模拟单端测序,程序运行过程及结果
& Y' D9 A P2 C& g, b' ]
5 _3 ~6 }4 k9 n! y" y( T
图表 2 art454单端测序
0 z% C: h8 H6 }* S q9 p T
- u5 L! C1 I3 O0 A! k. t
图表 3 art454单端模拟结果
9 k) i+ G* ?3 b; r1 }: h
双端测序模拟,FOLD_COVERAGE设为20,即覆盖度为20;MEAN_FRAG_LEN设为1500,即平均片段长度为1500;STD_DEV设为20,即长度的标准差为20
5 V o# q! q9 ^- M/ L! B- f8 H7 g4 s
下图为模拟双端测序,程序运行过程及结果
7 _ V% x5 r: S% Q# {4 B+ a5 k: R! v
! }: o0 ~: P0 k% o/ q1 X1 g
图表 4 art454双端测序
6 u+ U4 K8 f2 S# {$ }8 Q
( G! d7 x2 T" V) M y
图表 5 art454双端模拟结果
3 ~/ E+ @ ?. h1 l6 G$ A3 x" r; t% O
2、GenomeABC
2 K6 G n4 b! d) ~- a5 k
下图为设置参数页面
; f ~: X3 g9 d5 g4 K) F( y# |
1 I# M$ F/ M( _; ~8 @1 z
下图为结果下载页面
) f- \/ e) G. V0 x0 j
' j; C' ^. m8 c- V
图表 6 结果下载页面
Z6 {8 r; e& \- w2 d1 z* K
3、编程模拟测序结果
. U& h0 p, h# a4 u( I0 ]
拷贝数是这里的N值;覆盖度是m,测序深度是宏观的量,在这里与覆盖度意思相同,就是测序仪10X,20X。
& v# ~4 p2 a7 R
单端测序
- x/ H! U9 q- z7 g* `! h6 R
+ {- Q* e1 m- c2 Z
图表 7 程序模拟单端测序
Z; X1 h `0 V+ Y8 V- D2 E
双端测序
( K. V2 }9 q% j6 |7 v. g' U
$ U& P" V: f% D8 G w0 S
图表 8 程序模拟双端测序
( h o/ G' n Y, W* Q; H- }# M
测序结果
2 H1 m- F; T& J S1 @7 |$ d
. s* Y/ B& c6 h3 D
图表 9 结果文件
3 H w9 o, F$ [) P
# B! H7 o8 L; y
因为期望片段长度是600bp,在片段长度区间200-1000bp内,所以大部分的片段都没有删除。
* b* U0 F F2 U) S3 [, F% e$ `
测序结果统计表
% t* n- z: ?4 H* k+ F# B
$ I' f. }, y7 g& ^( ~$ e
测序方式 基因组大小(bp) 片段长度区间 (bp) N值 期望片段长度 克隆保留率 片段数量 Reads长度范围(bp) Reads总数量 Reads总长度 覆盖度(m值) 理论丢失率(e-m) 覆盖率(1-e-m)
- o# R+ p) t/ b
单端 12157kb 200-1000 10 600 0.95 107378 50-100 101968 7645.541kb 0.62889 0.53318 0.46682
2 ]# B. V: h* u) F1 w4 @1 S" Y
单端 12157kb 200-1000 20 600 0.95 213722 50-100 202996 15227.882kb 1.25259 0.28576 0.71424
5 T/ [; H/ D0 [3 ~
双端 12157kb 200-1000 10 600 0.95 106704 50-100 202770 15212.662kb 1.25134 0.28612 0.71388
6 k( P4 B% h! I% m I) d
双端 12157kb 200-1000 20 600 0.95 214212 50-100 407186 30534.265kb 2.51164 0.08114 0.91886
4 g T( t& j8 a. }( o3 i3 L
四、讨论和结论
; n$ n! t- G O2 s+ e
; N9 p' z8 `. r2 [
程序运行方法
+ H; T7 P5 g; k, F0 t. M/ w
8 f. H7 z* A* b, @
在类的构造方法init()中,调整参数。
/ H+ p: u, y2 J6 H6 q+ k N
Averagefragmentlength为片段平均的长度;
6 w. K0 u6 f* Z2 w& {+ j2 y
minfragmentlength和maxfragmentlength是保留片段的范围;
) k" O! Z$ T- l! D
cloneRetainprobability是克隆的保留率;
/ ~! r. T% ]" i N7 n* R
minreadslength和maxreadslength是测序reads的长度范围
: a3 L) r: Y# W3 O# R; n
- Y7 Z2 C% S C9 n' a
模拟测序的诸多方法都封装成了Sequencing类,只需要创建类,并调用singlereadsequencing()和pairreadsequencing()方法,传入文件名的参数即可。
L/ a6 d9 o0 h1 ~. w) Q0 s1 R6 k
8 R8 r. S) N1 y$ R% i% Q! p2 U% P' R
附录
0 w5 F7 l' W" c0 u- f+ `
! w0 [1 H. T2 ^# A" @" m( T* q
from Bio import SeqIO
7 |. t/ ^" M. R9 `0 R+ Y+ g
from math import exp
1 `" |" i `) K1 h5 P5 D
import random
' S: \+ Q- \" P$ C8 ] Q1 A; X4 B1 s
; Y" X7 ~1 D" p1 {7 |& ^" [1 J! c
class Sequencing:
7 Q% E5 w$ t" J7 O) c+ D* j
# N代表拷贝份数
# w' b5 p) X+ N5 T, X
def __init__(self)
( ?$ i. g1 k/ u5 X4 K t
self.fragmentList = []
; S( V! V. j" p4 |0 t
self.readsID = 1
* h# s b% u" i
self.readsList = []
( t) w( ?' N; G6 q
self.averagefragmentlength = 650
3 z) h) j3 d- k \; |2 F; {
self.minfragmentlength = 500
4 g2 _- W$ N/ t0 L+ g% y. G# C
self.maxfragmentlength = 800
9 u) A1 v4 c7 l, V# l Z7 W y( X
self.cloneRetainprobability = 1
6 K/ c: V. S+ j* Z$ K; O" q' N
self.minreadslength = 50
7 c8 ^5 l' Q8 C7 @$ ~" t3 o8 O1 d
self.maxreadslength = 150
9 j( { J9 l, \) y- ~: ~
self.N = 10
0 Z) K( Q" ?1 {" S# b% J" t
self.genomeLength = 0
! Q6 Z. v1 f% ^& T% U4 P, m
self.allreadslength = 0
% m t5 G7 r$ l2 \2 ^
' v# j) c& {. w, S
# 生成断裂点
; s4 x' [- w, C% x/ `* [; U
def generatebreakpoint(self, seqlen, averageLength):
8 P h( E& S6 ^3 F2 U5 Y: b
# 假设平均每500bp 产生一个断裂点(averageLength = 500),通过随机函数生成seqlen/500个随机断裂点(1到seqlen之间的随机整数)
) ^; L* I% _; C" f
breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))]
' ?5 g: |. L" H5 a
breakpoint.append(seqlen)
1 r9 V5 K i5 t/ n' i
breakpoint.append(0)
0 T* X: [* l- J" M8 L! ]2 l
# 把随机断裂点从小到大排序
/ E- V9 W( L3 n1 o/ ]" ]
breakpoint.sort()
% u" A" o. e! [& }8 a! p
return breakpoint
$ O! H% v; E* b) W4 t2 v! O3 n! b
* |& y) i j; _. k) Y3 k
# 沿断裂点打断基因组,并删除不符合长度要求的序列片段,定义片段范围:200-1000bp
' Z8 s: H: g: J, u/ V5 J8 ~
def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength):
7 C, }% l( o; ?( c3 `
for i in range(len(breakpoint) - 1):
3 t0 M" s) d# q/ y
fragment = seq[breakpoint
:breakpoint[i + 1]]
( K6 i: c7 }6 z: Z/ H: O
if maxfragmentlength > len(fragment) > minfragmentlength:
" J$ t! O6 }/ ~. I
self.fragmentList.append(fragment)
9 t% |5 d% j4 d& E3 h2 G
return self.fragmentList
' L4 l, j" L! |, C
' ]$ s `1 S- y8 X0 G
# 模拟克隆时的随机丢失情况,random.random()生成0-1的保留概率
4 H% W0 x/ f5 Z! `
def clonefragment(self, fragmentList, cloneRetainprobability):
U! {# g/ B& K: Q N: l' u
clonedfragmentList = []
+ v- f$ S4 n2 @' ], f
Lossprobability = [random.random() for _ in range(len(fragmentList))]
7 y8 E3 b2 r0 U6 n5 s' i
for i in range(len(fragmentList)):
; {! @4 H( W5 O, Y, }! z( t& `
if Lossprobability
<= cloneRetainprobability:
- H% O: k! s2 b! A; V% O
clonedfragmentList.append(fragmentList
)
$ Q) b% _5 f, k
return clonedfragmentList
2 Q. W4 I( {, E. T0 |# u% V
# w6 F. N7 p1 v) ?4 W) u$ M5 o
# 模拟单端测序,并修改reads的ID号
3 x/ r6 [9 W0 p% b
def singleread(self, clonedfragmentList):
1 q, y, K( _( _; E2 k/ [& U% b
for fragment in clonedfragmentList:
6 z, {+ \) W) h- ]* s, u3 a( a- n
fragment.id = ""
, r9 P2 F; E2 u9 _
fragment.name = ""
) w/ u6 U1 S7 x
fragment.description = fragment.description[12:].split(",")[0]
4 z& y# o* f7 M. T" p
fragment.description = str(self.readsID) + "." + fragment.description
# P G- z. f, z& l
self.readsID += 1
$ }* J q3 @. ^) j/ K ^, g4 X
readslength = random.randint(self.minreadslength, self.maxreadslength)
! k2 G& ^( t6 H) @
self.allreadslength += readslength
+ x# R/ v/ ]8 e- A8 s
self.readsList.append(fragment[:readslength])
1 q/ J: J8 g7 k ^( d
r+ G' k, U! N3 W6 B, w
def singlereadsequencing(self, genomedata, sequencingResult):
1 m) n0 ?. ]1 c
for seq_record in SeqIO.parse(genomedata, "fasta"):
( h5 Y9 x! G% J8 }
seqlen = len(seq_record)
# {2 \7 U9 W' J$ i% T& I. x& ?
self.genomeLength += seqlen
/ m% u( I) U4 H& t, C7 w: B1 f( p
for i in range(self.N):
9 e6 p& p' I( @9 O# ~" ^8 S6 C; q
# 生成断裂点
+ C* r0 k6 p0 j$ G
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
* Y: l# Y' o2 X# {4 ~
# 沿断裂点打断基因组
- E. S' R- O/ u( m! G4 c6 n
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
- y) F; a% S8 }; L
# 模拟克隆时的随机丢失情况
; j* l- [3 V! @& A. {8 ~5 {
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
- I% \4 T; x! G2 s: G
# 模拟单端测序
. r" k( i) g: Q% C! m
self.singleread(clonedfragmentList)
: R. d2 [# J3 T$ k
SeqIO.write(self.readsList, sequencingResult, "fasta")
0 Q" z7 n. Z8 r- i/ P3 F2 j- `; Y: L
0 L2 Z9 l' `' q9 b
def pairread(self, clonedfragmentList):
/ Z. u( p) H; Y: K! ^) ^
for fragment in clonedfragmentList:
; O. u, n5 D0 o+ N% g
fragment.id = ""
8 e) r C* z5 C; J; V2 R
fragment.name = ""
$ {) K+ Y L: f; B
description = fragment.description[12:].split(",")[0]
( L2 E) R' @% B1 x- f! J* _
fragment.description = str(self.readsID) + "." + description
- u3 Z; {- p: n9 U6 o' y1 w
readslength = random.randint(self.minreadslength, self.maxreadslength)
6 |2 A; W. r3 B3 C
self.allreadslength += readslength
0 a% @ |3 _, d- b6 g5 G g
self.readsList.append(fragment[:readslength])
+ d! y2 N$ Z* T
, y1 O: n" U. ?: R1 w3 W
readslength = random.randint(self.minreadslength, self.maxreadslength)
/ }5 p% j$ m) K+ c e8 k; n2 [
self.allreadslength += readslength
4 B5 ~/ M3 V0 n$ \
& P U. O3 }- E v) r
fragmentcomplement = fragment.reverse_complement()
. e1 n! @1 G9 P: S' F0 A) D( \
fragmentcomplement.id = ""
: y9 M) l: w4 F8 X0 c
fragmentcomplement.name = ""
* m7 k! U6 |( ]8 H% W
fragmentcomplement.description = str(self.readsID) + "." + description
0 l& E* H' O5 w+ |: N1 { h" a& r
self.readsList.append(fragmentcomplement[:readslength])
* W7 u, u$ J) Z' x' I3 p6 }
/ I, ]: D. A5 ~8 G
self.readsID += 1
$ j+ P! u! }. s7 l% ^/ {' n( Z+ I
& @) t& Q! r, ]/ \' Q3 j
def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2):
$ s% u" R9 `( ~& I
for seq_record in SeqIO.parse(genomedata, "fasta"):
a. B! }2 O" i
seqlen = len(seq_record)
1 |5 R" f0 t5 Y' O {5 o `, _+ E, [
self.genomeLength += seqlen
( ]4 \' K, |+ f5 f" f6 r S
for i in range(self.N):
% s b0 s; P" c* S
# 生成断裂点
/ h* c4 a" Z5 h* p' h8 y
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
. B. y/ `8 a* i, p9 _' y9 |
# 沿断裂点打断基因组
* B6 n* S l* c9 y/ y
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
6 B+ J- \0 X+ D; p3 }2 l
# 模拟克隆时的随机丢失情况
, t* ? Y) E8 i- Q) o/ S4 J- A5 O" [3 k
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
, H7 G6 F: T5 U; I" o
# 模拟双端测序
# k8 I2 R, y; S' E( c
self.pairread(clonedfragmentList)
) G: N8 S5 u, E5 j f' w( O
readsList_1 = [self.readsList
for i in range(len(self.readsList)) if i % 2 == 0]
8 ~2 w# G% _* m+ {9 b' e
readsList_2 = [self.readsList
for i in range(len(self.readsList)) if i % 2 == 1]
; M% N k6 r9 t9 h
SeqIO.write(readsList_1, sequencingResult_1, "fasta")
( _4 P) E: x% k
SeqIO.write(readsList_2, sequencingResult_2, "fasta")
( b4 Z4 G8 N/ D$ w" Y
1 }1 q( V+ X8 N. o) o6 L
def resultsummary(self):
* F8 j1 r3 t( R4 k
print("基因组长度:" + str(self.genomeLength / 1000) + "kb")
7 i9 G' w' {- T- Y- w4 g: i
print("片段长度区间:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength))
( ~9 L! c& U% |
print("N值:" + str(self.N))
0 Q3 z6 `+ O/ K; ~
print("期望片段长度:" + str(self.averagefragmentlength))
( j2 V4 U% Y6 s- a7 h& o1 F. @' d
print("克隆保留率:" + str(self.cloneRetainprobability))
4 A. h) Z; n1 n( V
print("片段数量:" + str(len(self.fragmentList)))
. s" G, C4 \' D. I7 `( z& l4 v r
print("reads长度:" + str(self.minreadslength) + "-" + str(self.maxreadslength))
! B0 _& x# u. Q p' t
print("reads总数量:" + str(len(self.readsList)))
7 _; q, K. V/ m, k
print("reads总长度:" + str(self.allreadslength / 1000) + "kb")
/ n5 L( g' G% p( g4 p
m = self.allreadslength / self.genomeLength
+ O1 ^1 O9 o2 j
print("覆盖度(m值):" + str(round(m, 5)))
2 k+ C) c! H; T1 O
print("理论丢失率(e^-m):" + str(round(exp(-m), 5)))
3 D3 `1 T7 f) |
print("覆盖率(1-e^-m):" + str(round(1 - exp(-m), 5)))
6 s; s0 V# w4 A: t; r3 t0 c
# -------------------------------------------主程序-------------------------------------------
R9 \" }5 _' H q, k2 v& ]# a2 l
# 模拟单端测序
( S& [( r4 k$ Z) ^
sequencingObj = Sequencing()
7 I- ]# J* J3 c8 ~2 X3 }
sequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")
4 L! b1 {7 ?" h( v8 H# Q* R) Q
sequencingObj.resultsummary()
* S& _3 }+ ^* @5 j
; H1 x/ s# t8 j3 h2 y
# 模拟双端测序
# w- a8 |1 Z5 i U6 w, j" B T
sequencingObj = Sequencing()
: ]0 O/ F e. R9 w! _* c8 k' i4 @
sequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa")
* E( E/ y# w) x* U7 N6 B
sequencingObj.resultsummary()
" v2 E1 b) d. \7 ^4 p
from Bio import SeqIO
7 T% E* T; y: I* D* g
from math import exp
) S& J, b3 k! |
import random
: P: ~! }' G1 A4 `. l4 \5 b' U
' N2 c4 T% Z) M$ p6 Z
class Sequencing:
. l0 E! U- A3 K) \
# N代表拷贝份数
+ v; P5 g. E! | Z, E. t* F
def __init__(self):
! ~3 P( j0 Z7 K) P0 |. q* k9 m; F0 O
self.fragmentList = []
9 P2 s2 I$ h6 e- o4 z* t- }# l
self.readsID = 1
; V1 {3 r, T+ K
self.readsList = []
8 y# R1 b( [ |7 p+ q! u7 S
self.averagefragmentlength = 650
3 L+ _" s$ K# q' T# l% o
self.minfragmentlength = 500
# L+ l* p- ^; M. t- K4 v F/ y
self.maxfragmentlength = 800
/ @# L" K2 `# m( i- z
self.cloneRetainprobability = 1
0 F H5 p% k& p
self.minreadslength = 50
9 `: A8 A+ @. q2 r* {7 ^
self.maxreadslength = 150
9 C: a4 T1 R# h# ^" [3 U
self.N = 10
& w2 N( E1 x9 J* \! j3 Y5 y+ Y! w
self.genomeLength = 0
. @# c# H) |) w
self.allreadslength = 0
. B: f* t) N8 Q+ z3 ?
: T1 {/ ^- J6 Q4 J
# 生成断裂点
& U6 J: P% ?3 L, B3 r3 ?7 l& s
def generatebreakpoint(self, seqlen, averageLength):
# y- W, n4 f% N' X6 }- _
# 假设平均每500bp 产生一个断裂点(averageLength = 500),通过随机函数生成seqlen/500个随机断裂点(1到seqlen之间的随机整数)
/ ~' I" u* L1 W5 N. v& B" h* F
breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))]
$ x# ?2 L& \3 [: P5 L) A8 f1 q1 K; V( G
breakpoint.append(seqlen)
8 Q9 L& g) ` L0 Q7 X; s
breakpoint.append(0)
2 `0 ~8 @- v6 j7 E6 ]
# 把随机断裂点从小到大排序
4 P8 {7 x- j* M7 `" Z
breakpoint.sort()
3 h" w w4 W) t5 U9 l# r! [4 d
return breakpoint
! R( k5 c+ k: s+ H' g
5 j1 V! }: x' k* p
# 沿断裂点打断基因组,并删除不符合长度要求的序列片段,定义片段范围:200-1000bp
9 ~7 Y6 u; i: O) d5 j2 J% I1 L
def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength):
5 H/ S: E) R* w# t
for i in range(len(breakpoint) - 1):
; H n2 x) _8 {2 a& @) a
fragment = seq[breakpoint
:breakpoint[i + 1]]
, I M5 m& e+ P5 `7 F% [4 k" X
if maxfragmentlength > len(fragment) > minfragmentlength:
5 i9 g3 J7 u# e
self.fragmentList.append(fragment)
& |5 p+ h& T7 N
return self.fragmentList
/ H( z# [; M" K2 k* t
! |7 q. S1 `8 R- b8 Q6 H
# 模拟克隆时的随机丢失情况,random.random()生成0-1的保留概率
3 P8 ?6 a- {7 {9 B9 O. N. v
def clonefragment(self, fragmentList, cloneRetainprobability):
- [0 w4 e# @! N2 N* _
clonedfragmentList = []
4 h3 A2 I4 B: D* z/ B
Lossprobability = [random.random() for _ in range(len(fragmentList))]
$ X. M! x; |3 U1 ]- H# [8 t) z
for i in range(len(fragmentList)):
; V0 c" F8 ]8 i L
if Lossprobability
<= cloneRetainprobability:
7 \* l: A5 r: d+ X9 i+ ~8 j
clonedfragmentList.append(fragmentList
)
" H/ Z4 |8 R( R& E7 l
return clonedfragmentList
4 U, l Q! Q+ r" y. c7 g" I
- o+ `8 R+ M5 U/ a
# 模拟单端测序,并修改reads的ID号
) G. K3 _+ }8 N7 C5 i( \* a% n
def singleread(self, clonedfragmentList):
( ~. E3 o0 s2 k
for fragment in clonedfragmentList:
6 r4 ^5 n$ f: Z
fragment.id = ""
0 ~: v! j! | o/ `
fragment.name = ""
- g# {4 U0 q2 \! [: J
fragment.description = fragment.description[12:].split(",")[0]
3 P3 m3 v* e" a3 T+ }& i/ u# ~0 H
fragment.description = str(self.readsID) + "." + fragment.description
+ ]+ q% \2 S1 ]- R1 w3 e
self.readsID += 1
, B' O5 q4 C; K# B; \
readslength = random.randint(self.minreadslength, self.maxreadslength)
1 X; x4 t. O5 F. J1 ]: I
self.allreadslength += readslength
: V! r* ]& Y! ]
self.readsList.append(fragment[:readslength])
, P5 B& o# W; }9 P; n& i3 s( w
1 C; g+ c; f4 Q! O. }
def singlereadsequencing(self, genomedata, sequencingResult):
9 v: m/ x' t8 o9 Y4 U# j, k* O" n
for seq_record in SeqIO.parse(genomedata, "fasta"):
% {7 b" o1 @/ }+ a- N
seqlen = len(seq_record)
% } ]: Q6 D, i' O7 H$ i9 I
self.genomeLength += seqlen
9 @, j# W2 b# }* K
for i in range(self.N):
- V9 z" }6 j/ T( S! x- p- a0 a1 r
# 生成断裂点
6 P9 j4 y! H1 X+ J3 C% M6 X9 R
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
6 Q: t8 p* }# }% v
# 沿断裂点打断基因组
% q. ^# e, G4 g( b, ^
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
5 L, Y3 U$ U; h! f% }
# 模拟克隆时的随机丢失情况
$ O4 `$ w7 a8 W/ B8 @" \0 [
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
; C! ^. k( K3 d; m; Z% R4 U. I
# 模拟单端测序
# Z1 H% G; x) b s5 T( B% N
self.singleread(clonedfragmentList)
! S" @& H1 `- t9 g% N% k
SeqIO.write(self.readsList, sequencingResult, "fasta")
$ n# J! a, l2 L, K8 x
% ]2 h' e* A* s8 A
def pairread(self, clonedfragmentList):
& z3 y- Y m" |: C
for fragment in clonedfragmentList:
1 Y) @! C& G$ v' o6 O& J
fragment.id = ""
; l/ ]( M9 d( L* b& M% T
fragment.name = ""
7 ]1 w4 B5 h4 D
description = fragment.description[12:].split(",")[0]
9 O4 T) |" u$ @' s# z7 `
fragment.description = str(self.readsID) + "." + description
3 V% f6 J! m2 ~- x
readslength = random.randint(self.minreadslength, self.maxreadslength)
: q. Y5 o9 Z6 O; d# e w2 j
self.allreadslength += readslength
7 R# v% E. L B5 N2 v
self.readsList.append(fragment[:readslength])
i3 i7 ]5 m2 _$ h4 }4 ]7 }
7 K) p" g# P7 X4 `2 W' j
readslength = random.randint(self.minreadslength, self.maxreadslength)
0 P8 ~5 Z! I6 W5 n; I) K; x' C
self.allreadslength += readslength
5 Z8 M& ?( r' x0 D" J" e) q# }
$ B ^6 B# F+ I& D
fragmentcomplement = fragment.reverse_complement()
9 H3 W( K9 h" O+ r5 [1 ?# _
fragmentcomplement.id = ""
" ?! L: o9 v! |( d% f( f; o
fragmentcomplement.name = ""
: o A! L* A# b y
fragmentcomplement.description = str(self.readsID) + "." + description
! C" M! O' Z0 @9 P3 O. O2 V1 [
self.readsList.append(fragmentcomplement[:readslength])
, {. n7 s: F: \# V* V s! F; j! z
2 ?' M! I5 i: x! E+ P$ r
self.readsID += 1
% s! G2 c; M/ i9 B" B* L
2 ?& W- k4 j$ H7 L+ l2 Y1 o3 U! T
def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2):
4 J5 W: p/ o( I" V- y% h
for seq_record in SeqIO.parse(genomedata, "fasta"):
9 S4 b& B/ p- M3 Z8 W1 I, t
seqlen = len(seq_record)
. V( b3 f5 I5 O4 h
self.genomeLength += seqlen
5 p4 t3 p6 _0 F
for i in range(self.N):
* C5 K$ B' E% M2 z0 b8 M& v# k/ t3 f
# 生成断裂点
" |' Q8 t& }$ s
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
7 f+ Y) K" c+ f X
# 沿断裂点打断基因组
9 Y6 \1 \( j3 n, ]7 U9 z1 H1 Y$ f
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
4 R3 ~, g1 h- L& I
# 模拟克隆时的随机丢失情况
( u. C! c6 w3 k+ L( B+ t8 x9 }
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
; \( @- J" J: [* s& w6 f" }) `
# 模拟双端测序
) }* I, |8 K6 B9 }
self.pairread(clonedfragmentList)
9 b; V) t+ h3 E. m
readsList_1 = [self.readsList
for i in range(len(self.readsList)) if i % 2 == 0]
2 Q* z- G4 ]' a( |
readsList_2 = [self.readsList
for i in range(len(self.readsList)) if i % 2 == 1]
1 v) S6 T& {% Z$ j, ?# O9 P
SeqIO.write(readsList_1, sequencingResult_1, "fasta")
) w4 P, ~! q( _
SeqIO.write(readsList_2, sequencingResult_2, "fasta")
" g) ?' z- j O% o
/ g( q8 f9 x( H
def resultsummary(self):
& W1 `4 m) j4 {4 M X) H" H: S: O5 a
print("基因组长度:" + str(self.genomeLength / 1000) + "kb")
/ Y; @5 {/ r1 j" h$ L
print("片段长度区间:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength))
5 L0 R5 |0 O2 N
print("N值:" + str(self.N))
2 y7 A. D4 ]" p a4 P F
print("期望片段长度:" + str(self.averagefragmentlength))
; f X8 C9 C9 q# \8 _3 S+ c6 B
print("克隆保留率:" + str(self.cloneRetainprobability))
- n c& E8 s7 @! u' t1 `7 m' ?
print("片段数量:" + str(len(self.fragmentList)))
! j7 N) `/ g5 J8 V7 N0 `7 b0 F, D
print("reads长度:" + str(self.minreadslength) + "-" + str(self.maxreadslength))
I$ \; ] a% ]$ F
print("reads总数量:" + str(len(self.readsList)))
6 n2 K5 q2 {9 k: v
print("reads总长度:" + str(self.allreadslength / 1000) + "kb")
- L2 ], Z) R4 v" i
m = self.allreadslength / self.genomeLength
- m$ i& [4 L1 P. I, O8 S0 O
print("覆盖度(m值):" + str(round(m, 5)))
" h. Y9 p# a4 L. G
print("理论丢失率(e^-m):" + str(round(exp(-m), 5)))
, _. v0 t% G7 c' ]* a
print("覆盖率(1-e^-m):" + str(round(1 - exp(-m), 5)))
" h! y% ~9 ^& y5 @+ W7 t8 n
# -------------------------------------------主程序-------------------------------------------
$ Q m9 b' c! p/ q
# 模拟单端测序
2 b) d6 E7 U5 u4 X/ B
sequencingObj = Sequencing()
2 U, q* [9 e# E6 U7 B
sequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")
/ \, V s1 N/ P
sequencingObj.resultsummary()
6 y, |. v. O% N7 |& x6 L
% ?' l9 K/ r. x+ l
# 模拟双端测序
& v# d. t* O9 n' c
sequencingObj = Sequencing()
4 W+ D3 M3 f0 ~* ^1 c) F
sequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa")
' B, \) N" z. U6 n& K$ S$ k% A
sequencingObj.resultsummary()
( A: r% v' @1 N3 T# o
1 {0 V9 C) |5 X0 f, d: B3 B
O' J/ ]' m" R" I
% Q: [5 Z) Q4 w( r7 S
1 Y" g \' s: D5 s7 b% p
数学建模解题思路与方法.pptx
2019-4-21 14:56 上传
点击文件名下载附件
下载积分: 体力 -2 点
117.69 KB, 下载次数: 4, 下载积分: 体力 -2 点
作者:
3477959497
时间:
2019-4-22 10:49
不错。。。。。。。。。。。。。。。。。
$ w6 H! E5 \" G
欢迎光临 数学建模社区-数学中国 (http://www.madio.net/)
Powered by Discuz! X2.5