数学建模社区-数学中国
标题:
基因组测序模拟
[打印本页]
作者:
杨利霞
时间:
2019-4-21 14:56
标题:
基因组测序模拟
基因组测序模拟
5 o% D) C0 N' K( O2 l
基因组测序模拟
2 ?0 z4 W1 H1 s6 X- _
0 b1 z% {: N2 R2 g/ K" [+ L7 \
一、摘要
, {2 p7 r/ ]$ [9 _2 S3 s
' E& n0 f8 m* m0 d* r0 e- ?
通过熟悉已有的基因组测序模拟和评估程序,加深全基因组鸟枪法测序原理的理解,并且能够编写程序模拟全基因组鸟枪法测序,理解覆盖度、测序深度、拷贝数等概念,设置测序相关参数,生成单端/双端测序结果文件
7 q* v9 v+ [) G
j# s/ @ I7 U
二、材料和方法
; g, J" Y% Y( ~
! V+ v* M( D+ e5 T; m
1、硬件平台
% s1 K9 ]; D+ H) w' c# c
& E( b& \' U" A2 U1 q! G
处理器:Intel(R) Core(TM)i7-4710MQ CPU
@
2.50GHz
* C: W3 Q# A3 ~; @
安装内存(RAM):16.0GB
& v- d# {1 Y5 U2 X
" I! f3 W8 _0 u4 G" m" |( O7 y
2、系统平台
- C+ @7 W9 m: S- z2 Z- L R) L
Windows 8.1,Ubuntu
) x w8 R5 ^9 {
; U: S; W v! N: M8 Y
3、软件平台
3 y' v' l# [/ v- @4 o$ h3 x
% u: T% \& Q* N3 b7 v7 o
art_454
# w/ d4 t) f9 r. M. N( e0 B
GenomeABC http://crdd.osdd.net/raghava/genomeabc/
' U* h1 ?' b, b$ q
Python3.5
* f; t9 Z/ Z0 u- n5 k+ T
Biopython
3 E1 `/ k7 }: ^" r
4、数据库资源
; V! r6 O+ Q/ U' Y% _' ?5 m
' x; g1 S5 V6 M( S" g; b" R5 C/ o2 N
NCBI数据库:https://www.ncbi.nlm.nih.gov/
/ k# A* j7 U& Z! M6 \8 {/ v
" u: ?) U" m$ {3 |
5、研究对象
/ r1 E) [& O5 `5 p" m
6 k6 F* B3 K5 F9 h- t9 M
酵母基因组Saccharomyces cerevisiae S288c (assembly R64)
, e% R: R: c. ^* M% x
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz
* Z* m6 d S+ K" e: ` W
8 C) U5 I) Z3 J% V
6、方法
) s# C: T9 A5 F$ f
' e) Y1 D% A1 T# d O2 W& @
art_454的使用
8 h- B8 }% \* z: a
首先至art系列软件的官网,下载软件,在ubuntu系统安装,然后阅读相关参数设置的帮助文档,运行程序。
% \- ^4 \6 [3 v- |
GenomeABC
2 b6 a4 B$ p0 t
进入GenomeABC(http://crdd.osdd.net/raghava/genomeabc/),输入参数,获得模拟测序结果。
0 U9 v2 B* @8 o' }% R: ~
编程模拟测序
: M- z6 Z) C k$ S/ ?! R
下载安装python,并且安装biopython扩展模块,编写程序,模拟单端/双端测序。
5 K- k* g" E% j. [1 O
三、结果
2 ] m/ P6 k: t8 P$ l. Z5 w
+ w5 i! D" I$ R M* T) W8 @
1、art_454的运行结果
; J! M; ]; x2 A! R$ u7 B/ R
0 p' b( [7 R* G5 g
无参数art_454运行,阅读帮助文档
0 U" A2 a7 M. U1 u" O9 v% X6 \& ~0 ^! e
0 C# G3 C6 W* s* f/ @, B% v
图表 1无参数art_454运行
, H. T5 a j% P- C0 t; g* ]8 x
对酵母基因组进行基因组单端测序模拟,FOLD_COVERAGE设为20,即覆盖度为20.
4 D. x) N; F. D/ f+ s4 y
下图为模拟单端测序,程序运行过程及结果
8 p* {8 a6 p) e, I1 t' l
! z9 A/ J# q! Z+ x" j) \- U
图表 2 art454单端测序
& |, ^4 H) Q: J
; P( ?1 ~! e5 d& |% K- |7 W
图表 3 art454单端模拟结果
' z- e! v6 c' g7 J
双端测序模拟,FOLD_COVERAGE设为20,即覆盖度为20;MEAN_FRAG_LEN设为1500,即平均片段长度为1500;STD_DEV设为20,即长度的标准差为20
|: y$ ?) m0 c% T" q+ b# T; s0 g/ v
下图为模拟双端测序,程序运行过程及结果
3 n3 x7 w) ?- Z7 @
. a0 ]) r- {& W( @: Q' k, }
图表 4 art454双端测序
* b J9 @9 H) @0 Z2 r
6 c5 V- B3 `* o5 [& _5 W
图表 5 art454双端模拟结果
5 r4 i6 d* R& f
2、GenomeABC
& y" W, ~) O) H* [
下图为设置参数页面
/ ~5 U3 V; N0 x
, n% L' _, z% ~) _
下图为结果下载页面
. j4 }! v' f# s4 d' V' B& O
- B6 n5 ?: c5 e! d3 H2 p
图表 6 结果下载页面
$ K4 Z8 \ |. k2 t5 v% C, X; z
3、编程模拟测序结果
3 O" x4 W; N' D9 s: Z9 q3 q/ d m( h
拷贝数是这里的N值;覆盖度是m,测序深度是宏观的量,在这里与覆盖度意思相同,就是测序仪10X,20X。
) d* F8 a3 ^# I7 a
单端测序
% v6 A4 h( s( O! f3 S
# {( C$ b2 q4 D
图表 7 程序模拟单端测序
2 h e% K- Y( n3 ~8 T/ j
双端测序
& T6 x( ~8 N! C7 d
4 T# d: `, g5 y4 u2 ~9 T
图表 8 程序模拟双端测序
9 Z: r! Z |1 J$ L1 E; V. O# G
测序结果
/ z2 I3 [0 p$ {; c% z) P+ L) r
) z( e$ K5 X V( A% e1 ]2 i
图表 9 结果文件
% c# m, Z6 M, ?6 B
, G; |& q; V" W% H, u" Q$ [" k
因为期望片段长度是600bp,在片段长度区间200-1000bp内,所以大部分的片段都没有删除。
, N' |( H" u5 A9 ?% r( m4 O4 W
测序结果统计表
$ V+ b# X9 R/ o& @ o$ Q: m
. t5 o0 p) ~- ^+ r( u! n
测序方式 基因组大小(bp) 片段长度区间 (bp) N值 期望片段长度 克隆保留率 片段数量 Reads长度范围(bp) Reads总数量 Reads总长度 覆盖度(m值) 理论丢失率(e-m) 覆盖率(1-e-m)
4 c; D- b1 j' k! p
单端 12157kb 200-1000 10 600 0.95 107378 50-100 101968 7645.541kb 0.62889 0.53318 0.46682
/ `9 ~1 I/ ]" \+ Q
单端 12157kb 200-1000 20 600 0.95 213722 50-100 202996 15227.882kb 1.25259 0.28576 0.71424
% E! ~0 X. s5 O6 J% {
双端 12157kb 200-1000 10 600 0.95 106704 50-100 202770 15212.662kb 1.25134 0.28612 0.71388
( T9 e5 O; G* P& D5 z4 o0 a, i
双端 12157kb 200-1000 20 600 0.95 214212 50-100 407186 30534.265kb 2.51164 0.08114 0.91886
) d% c8 Z2 Z7 X) o; w
四、讨论和结论
; O# G7 o# D& S4 \9 {) K5 ~& Z
9 c! i+ J! U. F/ \- _# n( Y
程序运行方法
, E5 p9 p. {+ m9 P) y" t
+ d( ?* j4 e3 {4 D' y& ^5 T
在类的构造方法init()中,调整参数。
/ h+ C: l! V7 d# Q: r1 V/ J) L
Averagefragmentlength为片段平均的长度;
% a. L3 r0 {7 d4 }3 ^
minfragmentlength和maxfragmentlength是保留片段的范围;
+ }4 X& E, r% ]; s$ T! ]2 {4 T6 V
cloneRetainprobability是克隆的保留率;
I% Y/ d5 S Q$ t' p
minreadslength和maxreadslength是测序reads的长度范围
1 b |- x4 B1 P& I) G
4 o) `" [ B; y; b& z- c6 E* c
模拟测序的诸多方法都封装成了Sequencing类,只需要创建类,并调用singlereadsequencing()和pairreadsequencing()方法,传入文件名的参数即可。
0 ~) |+ r8 t W2 X! d q `
, t# {: N! C6 s6 Z: U
附录
9 V& L3 U5 O5 O; i2 N
* |+ Z9 F2 u2 }7 f
from Bio import SeqIO
9 A9 W: w8 q6 {
from math import exp
0 @3 b& { A9 V5 Q2 z4 s5 R
import random
9 f* d3 L1 [( {7 E0 [
; g8 r7 |4 i& R% `; S) r
class Sequencing:
' B9 R% `* w$ i% v
# N代表拷贝份数
5 L. @, |+ p; s+ u' V) g* m2 |- I& [
def __init__(self)
X1 v; G( L6 i0 q% i
self.fragmentList = []
5 o% ? R' P4 V
self.readsID = 1
: S- ]4 z" C" {6 }6 Y" S+ s
self.readsList = []
6 Z1 C8 r c) {" _ J6 V
self.averagefragmentlength = 650
2 ^' r- Z* j9 D/ v2 S
self.minfragmentlength = 500
3 [' y I( l" C; P
self.maxfragmentlength = 800
& ^) n; l Z% h4 u3 t
self.cloneRetainprobability = 1
& j% d1 y0 r2 q$ H/ D" Y r
self.minreadslength = 50
0 I. @, N/ `" S s
self.maxreadslength = 150
! Z7 {/ j1 N9 Z
self.N = 10
7 H) B# v* p5 t; R
self.genomeLength = 0
) q' C- h) f" C
self.allreadslength = 0
" T1 J% O! j3 p# }8 f( _9 h& M
9 g/ M$ q3 [) G! Y8 p) ~; S
# 生成断裂点
2 S$ _. Z% q$ u4 x' C5 ]: X
def generatebreakpoint(self, seqlen, averageLength):
7 H1 W& j* H/ L& N- {, i6 j4 ^
# 假设平均每500bp 产生一个断裂点(averageLength = 500),通过随机函数生成seqlen/500个随机断裂点(1到seqlen之间的随机整数)
4 u% H8 v3 d0 T: [# x
breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))]
& I: ^# G( L& G5 [
breakpoint.append(seqlen)
% x/ z+ s v. E
breakpoint.append(0)
3 {3 t6 k, G$ v' b. N8 P
# 把随机断裂点从小到大排序
3 G# C. _" ~% q, c& N: I
breakpoint.sort()
. |6 k( i% D& F! w" d- F
return breakpoint
7 N9 H Z Z+ u% @# ?
+ m$ o& k/ x# ]6 S. |
# 沿断裂点打断基因组,并删除不符合长度要求的序列片段,定义片段范围:200-1000bp
- N- _1 {! ?" `1 I5 M& u6 Z9 e% p
def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength):
8 _( b7 Q$ D: d5 D) @
for i in range(len(breakpoint) - 1):
' p0 f7 A( L n5 a, d; N7 O
fragment = seq[breakpoint
:breakpoint[i + 1]]
+ f) o6 U+ a. a
if maxfragmentlength > len(fragment) > minfragmentlength:
* D+ u) ~% ?" L- y8 Z& `4 Z
self.fragmentList.append(fragment)
2 Q P3 Y4 o& l* ^& @2 [6 i# X) T% l; b
return self.fragmentList
- j; @7 w) R, I
# a% C2 f% E0 y( ?0 m
# 模拟克隆时的随机丢失情况,random.random()生成0-1的保留概率
! M; o8 H9 [" c* N+ O7 u! z! X
def clonefragment(self, fragmentList, cloneRetainprobability):
7 h# ~* d- ^9 }# }# s1 l
clonedfragmentList = []
' ?) F" N V$ y7 s' C* `* ^ k
Lossprobability = [random.random() for _ in range(len(fragmentList))]
" _3 g/ [6 {5 j. f
for i in range(len(fragmentList)):
8 z* \! {9 X9 ~6 s/ o0 i; w9 T
if Lossprobability
<= cloneRetainprobability:
8 u8 h1 \+ u9 K. m8 { p6 K* }+ f
clonedfragmentList.append(fragmentList
)
4 K* K3 g" m. x s
return clonedfragmentList
' k6 Q. x7 D/ j' V6 f0 c' j3 K
# I1 n+ ~7 m( P" P0 n' J
# 模拟单端测序,并修改reads的ID号
0 Y6 X, m0 b, ]" `' O
def singleread(self, clonedfragmentList):
# @$ X6 N( f9 z% `. z- g
for fragment in clonedfragmentList:
. P/ L6 ^9 C, r% P6 n
fragment.id = ""
! B9 K" N1 L; ^' F8 S
fragment.name = ""
- R8 K! I! z& ?* _' @; |
fragment.description = fragment.description[12:].split(",")[0]
0 H6 \% z! K& ^ h! o: \6 X3 W
fragment.description = str(self.readsID) + "." + fragment.description
4 R4 Y! x& N8 q8 @+ l9 }; ?
self.readsID += 1
) Q: l0 R8 f4 u& _7 _
readslength = random.randint(self.minreadslength, self.maxreadslength)
4 A( [5 E6 [/ J; A% R
self.allreadslength += readslength
5 A* i( N1 s) t! n
self.readsList.append(fragment[:readslength])
8 I5 w9 T' n8 M$ S* {$ `9 d: ~
4 L* a. G& a! H0 k" r
def singlereadsequencing(self, genomedata, sequencingResult):
; [1 j4 X- a* C# U5 C) Y
for seq_record in SeqIO.parse(genomedata, "fasta"):
8 @9 e3 p. B( U7 H8 f4 c
seqlen = len(seq_record)
! u R y- |6 i- k% _8 X# C
self.genomeLength += seqlen
/ H. I, P2 `8 a* f
for i in range(self.N):
: S( V. y8 L+ n# z: F4 A2 C6 u5 K
# 生成断裂点
3 M; c! r& i4 V9 a: T0 \ B
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
! @& J) a# v* {
# 沿断裂点打断基因组
! y# p: g: i0 s; p; D* Y
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
- ~9 N( d5 n! y: {0 ]
# 模拟克隆时的随机丢失情况
; \( @/ z- G) @' y j1 J, H& y
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
' W: C1 p4 T7 A3 C0 [" N
# 模拟单端测序
! H1 j( W0 Y/ E, E9 g
self.singleread(clonedfragmentList)
* D* B' f* d8 `; W2 u5 b4 W
SeqIO.write(self.readsList, sequencingResult, "fasta")
9 X$ c& h8 H8 k2 @7 [
' q( s- @/ k4 G4 R! [
def pairread(self, clonedfragmentList):
+ J6 h$ z, [7 q( B- ?
for fragment in clonedfragmentList:
0 g. ?) U& n% x6 f, L1 W% b' {/ l; g
fragment.id = ""
: w D3 m) k. i0 a
fragment.name = ""
1 |- Z7 T: |/ ~7 d& @ X1 `3 H# H2 N
description = fragment.description[12:].split(",")[0]
- N: O/ @7 c8 @( S. \
fragment.description = str(self.readsID) + "." + description
; z( H8 d8 ] R+ d- m+ @1 z
readslength = random.randint(self.minreadslength, self.maxreadslength)
% k& f. A8 C6 a, b# h ]% H
self.allreadslength += readslength
, ], z: W; W) v3 m' W8 S8 N3 h8 Y$ \
self.readsList.append(fragment[:readslength])
, `: {# O, G6 `4 r: s
# A: h5 t* m& J: x% x& L
readslength = random.randint(self.minreadslength, self.maxreadslength)
$ ~/ ?/ y* j, O
self.allreadslength += readslength
7 a2 P1 }" N% y
3 o! }0 T6 @1 G: h! M1 V1 O
fragmentcomplement = fragment.reverse_complement()
" e- |6 n4 T6 G }- }% B1 |2 d' ?
fragmentcomplement.id = ""
; \ o" I& U, B R1 y: r1 B
fragmentcomplement.name = ""
9 M9 ?; y" V* |* r
fragmentcomplement.description = str(self.readsID) + "." + description
- D8 c( Z$ h' X5 z$ M
self.readsList.append(fragmentcomplement[:readslength])
' m/ I5 f. b( D1 v
3 s1 c7 L F1 \" Z; G2 G
self.readsID += 1
" @! z: k$ w- O/ ]9 B' @+ X S4 M, O
. _/ Z: p+ b9 h* R* d$ z
def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2):
4 s0 s# G& _* `
for seq_record in SeqIO.parse(genomedata, "fasta"):
0 |# z6 {) u8 M
seqlen = len(seq_record)
- m0 N& Z, _! p) N x! K
self.genomeLength += seqlen
6 j0 j# `. l7 J
for i in range(self.N):
5 d8 g9 n6 p. I( i9 G7 \9 C6 u) R* s! Q
# 生成断裂点
' x; m" ? i* I) j; x7 n; u
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
. L! l- \ s# E) o
# 沿断裂点打断基因组
& b1 N; x, G" p8 }) v9 Z! i3 o7 _
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
" S+ p7 m7 c2 h# ~2 u' m) L+ N
# 模拟克隆时的随机丢失情况
6 t: {8 [9 q, @* h! Y" b3 J
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
+ u) }/ a- R" T3 U r& I5 R
# 模拟双端测序
8 M. d; x# C9 r6 q1 W
self.pairread(clonedfragmentList)
1 J7 }2 M8 e' V3 d& A& Z8 D0 g/ m
readsList_1 = [self.readsList
for i in range(len(self.readsList)) if i % 2 == 0]
1 A- n8 s6 q" X( Z% z
readsList_2 = [self.readsList
for i in range(len(self.readsList)) if i % 2 == 1]
7 M7 h) A3 b% u& m
SeqIO.write(readsList_1, sequencingResult_1, "fasta")
$ H2 G. r& f* h6 c4 _6 {) m
SeqIO.write(readsList_2, sequencingResult_2, "fasta")
3 P2 F0 F6 F2 v3 g/ z& N2 Z/ V j5 U1 a
% ?& F* t$ L5 |0 v4 ]2 c" u
def resultsummary(self):
, e3 L+ y! V! r* i
print("基因组长度:" + str(self.genomeLength / 1000) + "kb")
. n% _7 d$ T$ M/ L
print("片段长度区间:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength))
0 p$ D5 ^/ a( t' b
print("N值:" + str(self.N))
8 [- m. p K* T% b% f) Q4 A
print("期望片段长度:" + str(self.averagefragmentlength))
- E% w# Y* q% X7 d
print("克隆保留率:" + str(self.cloneRetainprobability))
6 q5 u1 N* h! R& L4 g4 q" T, z R' j: U
print("片段数量:" + str(len(self.fragmentList)))
% S& d! I( u9 Y+ _* v* J* w
print("reads长度:" + str(self.minreadslength) + "-" + str(self.maxreadslength))
! [5 `7 ?9 R' s
print("reads总数量:" + str(len(self.readsList)))
8 p$ u; y6 `+ y e3 a
print("reads总长度:" + str(self.allreadslength / 1000) + "kb")
e' l4 m; b: p; i- M/ `: o
m = self.allreadslength / self.genomeLength
5 c, Z( Q+ ~* ~3 ^" g( T& h: M7 Y
print("覆盖度(m值):" + str(round(m, 5)))
7 T0 A; A" H4 G# K& D
print("理论丢失率(e^-m):" + str(round(exp(-m), 5)))
2 }. a' W q% J0 H, G5 }) u& C4 N8 Y
print("覆盖率(1-e^-m):" + str(round(1 - exp(-m), 5)))
7 r8 h/ `- |, O. l5 X( S
# -------------------------------------------主程序-------------------------------------------
$ v% C( n- M+ {
# 模拟单端测序
+ P- c0 S4 v8 O, j6 S: h, O# |
sequencingObj = Sequencing()
0 C! T6 E0 B( b+ }! z
sequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")
6 Z. C" w, X9 b+ y4 v: m
sequencingObj.resultsummary()
0 G; p8 B$ Y* l; ^
2 }- G! y8 \2 _, ?, ?) m3 l; u
# 模拟双端测序
5 k6 Y3 y. W" D$ ^
sequencingObj = Sequencing()
2 b0 g3 i; @3 j
sequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa")
$ t% D1 Z' H1 S6 F) G$ l3 o; W! _% m* V9 i
sequencingObj.resultsummary()
/ G* k2 j8 z/ y( A
from Bio import SeqIO
' o5 Q9 d! }2 i6 d
from math import exp
1 C L; F( h9 N. c
import random
" E, H9 }& W6 D( o7 Z
. K2 O- D6 r: ^% h
class Sequencing:
5 N3 b {0 ^# V, l! @8 P
# N代表拷贝份数
) ^' `* r7 H( u. H
def __init__(self):
- M& h' V! d% G% _ ` e* {4 B6 f
self.fragmentList = []
4 C6 ^# [" T( l
self.readsID = 1
* K3 l8 I0 q1 n5 k, D, p
self.readsList = []
" b) P3 P+ J* P l/ \
self.averagefragmentlength = 650
# ~4 H. D9 j& y8 @# G
self.minfragmentlength = 500
2 t- q4 `" [- R ~( G, t
self.maxfragmentlength = 800
# R2 W0 F- k" S8 @* n
self.cloneRetainprobability = 1
- o- J: s' i' W1 ^ J
self.minreadslength = 50
& d* J U0 Q9 y5 U
self.maxreadslength = 150
; h: L" p; f2 f+ _1 R
self.N = 10
$ X8 W, V6 E) z, E
self.genomeLength = 0
4 {5 i" z* F* Q% \+ I
self.allreadslength = 0
* U8 \: P! @) I
% @0 [: N* |" I+ B6 N( W% m: W
# 生成断裂点
! n1 S0 z Q- W- G# Y& [8 M, R8 L
def generatebreakpoint(self, seqlen, averageLength):
: b8 A) H9 F% O
# 假设平均每500bp 产生一个断裂点(averageLength = 500),通过随机函数生成seqlen/500个随机断裂点(1到seqlen之间的随机整数)
. i( H6 f8 l' J$ x9 _. D. E+ P
breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))]
) O$ @7 u7 R( X; s: |
breakpoint.append(seqlen)
2 o: o, A. y- r2 [ M
breakpoint.append(0)
4 o6 j) F! q0 W2 z/ r, i. a
# 把随机断裂点从小到大排序
# J: V% p. E5 R+ k) m
breakpoint.sort()
9 L+ C+ p' b) C% N
return breakpoint
' d$ u, R8 O8 S# n
& R/ N% W8 m+ t7 s9 X3 e4 G
# 沿断裂点打断基因组,并删除不符合长度要求的序列片段,定义片段范围:200-1000bp
- \, F! ?* D1 V
def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength):
) C( X# P6 t8 W5 h; N; B
for i in range(len(breakpoint) - 1):
6 b [) r v* O& @ i
fragment = seq[breakpoint
:breakpoint[i + 1]]
. P# ?: [ j4 r
if maxfragmentlength > len(fragment) > minfragmentlength:
2 ~6 C) j, ^8 Q* P
self.fragmentList.append(fragment)
0 G& S: N) V A
return self.fragmentList
1 w& j' c% }4 [: i9 m' ^2 `8 l
, b* {. y; f6 ?6 M. Y
# 模拟克隆时的随机丢失情况,random.random()生成0-1的保留概率
: G: ^' r/ A1 @3 f2 P( Q
def clonefragment(self, fragmentList, cloneRetainprobability):
! m$ \) Q% H n/ D* l4 `! k
clonedfragmentList = []
+ ` @- F) K) F' g0 I6 h: C
Lossprobability = [random.random() for _ in range(len(fragmentList))]
1 P+ H5 ^# d1 q# V4 \" O+ s
for i in range(len(fragmentList)):
( W/ x7 d2 U! u( \7 d
if Lossprobability
<= cloneRetainprobability:
* t1 m; S3 e$ s* B( Q
clonedfragmentList.append(fragmentList
)
5 C7 p1 c( P0 e% N# j4 e# d
return clonedfragmentList
# u, z9 K% s* r- b
. c) [% I) l0 x! X: q& w! F
# 模拟单端测序,并修改reads的ID号
# W& t8 \9 q+ _! ^
def singleread(self, clonedfragmentList):
! P7 ?4 {. k7 q6 Q
for fragment in clonedfragmentList:
, f9 G4 x* |: x" T5 q
fragment.id = ""
+ g6 b4 M- a: t+ ]" a& Q+ u
fragment.name = ""
& T5 O" y* s/ h9 B" ~- @( ?
fragment.description = fragment.description[12:].split(",")[0]
; m" b' Z: ?2 w0 |+ ? | c
fragment.description = str(self.readsID) + "." + fragment.description
* G/ O0 Y6 ~3 B* p% ?! C
self.readsID += 1
! J7 Z2 F' P1 K1 h' f$ f
readslength = random.randint(self.minreadslength, self.maxreadslength)
# C4 N4 Y: K4 v7 t: Y
self.allreadslength += readslength
% T k8 u4 }, k( J4 \0 o
self.readsList.append(fragment[:readslength])
3 C0 |+ j& |+ N4 H% ]2 b4 c
- y( f/ h0 Q4 I. z$ L7 ?
def singlereadsequencing(self, genomedata, sequencingResult):
+ l- A6 @5 G1 ~) i9 K* g
for seq_record in SeqIO.parse(genomedata, "fasta"):
2 |$ B/ x, G1 ^+ t
seqlen = len(seq_record)
) r1 X4 U8 s# D! l5 k0 L4 ^
self.genomeLength += seqlen
& W" Z% S6 @, y- j+ S
for i in range(self.N):
" S% v8 ^$ y& F$ N/ H/ ~! G6 M
# 生成断裂点
' f/ e: W# r; g. V- I! |
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
( U5 ?/ p" E+ e7 b
# 沿断裂点打断基因组
# [" ]/ a1 O1 V. {% s6 T
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
8 h# _+ r8 ^: }7 A
# 模拟克隆时的随机丢失情况
/ k. C# d, u0 Q. L
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
7 I. z. K+ Y; g" [! C# Y0 E
# 模拟单端测序
- q8 _6 j b2 p8 y) A O# M
self.singleread(clonedfragmentList)
5 |! ~' J8 E9 k% W
SeqIO.write(self.readsList, sequencingResult, "fasta")
3 X' p2 \2 z9 H3 {' v
! S( ^8 ]$ Y3 U% g& i4 v8 h
def pairread(self, clonedfragmentList):
; G: U. z+ a' R }- {" Y7 T
for fragment in clonedfragmentList:
& Q2 O$ J4 t D- {- X6 O& |
fragment.id = ""
9 g# `9 T0 a* j* i* Z, s
fragment.name = ""
f! u6 U3 B7 A! d: i( n$ V
description = fragment.description[12:].split(",")[0]
/ d2 q5 E6 N( | n7 j3 @. r- s
fragment.description = str(self.readsID) + "." + description
. y1 n; E% d3 d
readslength = random.randint(self.minreadslength, self.maxreadslength)
0 m" s- e+ D1 O' E# T ] @
self.allreadslength += readslength
. }8 X+ |8 A0 k- X: y" p+ L" M4 W
self.readsList.append(fragment[:readslength])
' u0 S ]" Y% J7 s
1 l: K2 |! t) y0 |/ r7 T4 F
readslength = random.randint(self.minreadslength, self.maxreadslength)
# Q' y# C4 {6 X6 R7 E4 d0 _
self.allreadslength += readslength
4 Q' i8 U3 O3 b1 i
! S+ s! ?/ _4 h, w7 }9 r
fragmentcomplement = fragment.reverse_complement()
3 X B8 X) ]: Y/ t; f1 {4 S
fragmentcomplement.id = ""
4 R* r- d. ]# B. _( S9 a C
fragmentcomplement.name = ""
6 A3 Q0 T( q, t# |4 @
fragmentcomplement.description = str(self.readsID) + "." + description
: h4 H- a7 g4 K1 |; I) }0 ` w
self.readsList.append(fragmentcomplement[:readslength])
% S( y; F+ M" W; a
* Y) U5 d0 C0 I# E8 R. v
self.readsID += 1
0 R9 g7 j7 v7 j* K( \) R
7 x$ K2 D( h! T6 q, p
def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2):
p7 ]9 V1 H6 I
for seq_record in SeqIO.parse(genomedata, "fasta"):
* h0 h( }$ L! p- T3 ?9 p! t
seqlen = len(seq_record)
' V8 _& m& ^; ]4 |% O0 \, L8 Q7 X0 n
self.genomeLength += seqlen
: E% Q& q- u, Y2 q7 F9 S( H9 e
for i in range(self.N):
. {: ~# {8 F* R) U; U# j- R
# 生成断裂点
( @, \1 L' q9 w
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
5 T' Y+ N" y6 W( r% g& B# f, Z
# 沿断裂点打断基因组
! n0 T$ S- h" A: V+ x# t
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
4 ]& r; k: P, N* z$ [
# 模拟克隆时的随机丢失情况
$ D5 V8 \. W R8 s* u
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
" G G+ m3 ~9 `; E
# 模拟双端测序
6 L. n& d' ]$ L/ e
self.pairread(clonedfragmentList)
9 C8 S+ D$ w: X& R ]6 u/ G8 N$ e% i
readsList_1 = [self.readsList
for i in range(len(self.readsList)) if i % 2 == 0]
% |& T5 G2 A; T' n+ q% M2 g, u6 C
readsList_2 = [self.readsList
for i in range(len(self.readsList)) if i % 2 == 1]
. t" p$ `3 K r9 P8 `: ^% z3 S
SeqIO.write(readsList_1, sequencingResult_1, "fasta")
- ?( t' A. K, W4 n# P
SeqIO.write(readsList_2, sequencingResult_2, "fasta")
) z0 F6 a$ K' ]
. n, H2 T$ z5 e. C, F
def resultsummary(self):
& a2 Y! ?. L2 f8 R/ |
print("基因组长度:" + str(self.genomeLength / 1000) + "kb")
! o% x$ f) T) q4 K
print("片段长度区间:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength))
3 e0 Z* q/ t$ X: d. l
print("N值:" + str(self.N))
2 o" d8 |+ F' c3 Q/ p0 E4 L
print("期望片段长度:" + str(self.averagefragmentlength))
+ @ N. `2 K( j2 _, O) F
print("克隆保留率:" + str(self.cloneRetainprobability))
3 Y8 Q8 O* X: r2 ~. v8 g; B3 r
print("片段数量:" + str(len(self.fragmentList)))
W. \" E1 h# g5 t
print("reads长度:" + str(self.minreadslength) + "-" + str(self.maxreadslength))
' }/ p! c% t: |5 s5 K" ~3 Z
print("reads总数量:" + str(len(self.readsList)))
0 F! g" B$ @9 e" N2 `" d. {
print("reads总长度:" + str(self.allreadslength / 1000) + "kb")
0 t6 |! m# \& K3 z( h& ]' m
m = self.allreadslength / self.genomeLength
8 R$ g- z. D' J: F. k6 C6 [5 |
print("覆盖度(m值):" + str(round(m, 5)))
: N$ O: d l# i! c$ |$ Z9 `
print("理论丢失率(e^-m):" + str(round(exp(-m), 5)))
- @& M) Z; G( S6 F$ Y( l* k4 o# A# K
print("覆盖率(1-e^-m):" + str(round(1 - exp(-m), 5)))
' i$ {; C: `/ G* r: L$ j
# -------------------------------------------主程序-------------------------------------------
: k' M9 G# z6 a f x' |
# 模拟单端测序
/ s8 `: R* l( p& l
sequencingObj = Sequencing()
: b* j& v0 `" d$ @# v- [
sequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")
1 `6 R6 g e. F: q
sequencingObj.resultsummary()
! S, B J, x5 j/ _# X2 {) F' a/ V
' m4 g0 w* R( w9 q
# 模拟双端测序
8 y9 m( u+ K+ O% v7 ^6 t
sequencingObj = Sequencing()
* }: L& ^- U/ i, r( x- ~7 J0 E+ i" Q
sequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa")
0 Z* J5 K9 _6 O, r R
sequencingObj.resultsummary()
- j# W2 U6 T' w' m' m
, A- r+ C. M7 O M8 t( f
1 [+ M6 j2 h$ M* }: }
( c' g) N( U6 w$ f U
* X/ |3 b" O( ?5 k
数学建模解题思路与方法.pptx
2019-4-21 14:56 上传
点击文件名下载附件
下载积分: 体力 -2 点
117.69 KB, 下载次数: 4, 下载积分: 体力 -2 点
作者:
3477959497
时间:
2019-4-22 10:49
不错。。。。。。。。。。。。。。。。。
5 \+ R: k' z5 Z7 b5 ^
欢迎光临 数学建模社区-数学中国 (http://www.madio.net/)
Powered by Discuz! X2.5