- 在线时间
- 1630 小时
- 最后登录
- 2024-1-29
- 注册时间
- 2017-5-16
- 听众数
- 82
- 收听数
- 1
- 能力
- 120 分
- 体力
- 563400 点
- 威望
- 12 点
- 阅读权限
- 255
- 积分
- 174243
- 相册
- 1
- 日志
- 0
- 记录
- 0
- 帖子
- 5313
- 主题
- 5273
- 精华
- 3
- 分享
- 0
- 好友
- 163
TA的每日心情 | 开心 2021-8-11 17:59 |
|---|
签到天数: 17 天 [LV.4]偶尔看看III 网络挑战赛参赛者 网络挑战赛参赛者 - 自我介绍
- 本人女,毕业于内蒙古科技大学,担任文职专业,毕业专业英语。
 群组: 2018美赛大象算法课程 群组: 2018美赛护航培训课程 群组: 2019年 数学中国站长建 群组: 2019年数据分析师课程 群组: 2018年大象老师国赛优 |
基因组测序模拟& Z* Y6 V0 q- K$ ^; g" | t3 ^
基因组测序模拟
) s$ R- S/ ~5 w O% y' Q9 V# T4 d
一、摘要; V8 M, ~' d1 ]) ~# j/ Q
+ e, C5 {; r9 O; M% C通过熟悉已有的基因组测序模拟和评估程序,加深全基因组鸟枪法测序原理的理解,并且能够编写程序模拟全基因组鸟枪法测序,理解覆盖度、测序深度、拷贝数等概念,设置测序相关参数,生成单端/双端测序结果文件
9 C- d2 G7 K4 N/ a) U
* f; c1 ^: Y! Z* \; s3 O, ~二、材料和方法# m) Y/ @% J+ s
/ d: A$ Y4 F1 t6 b* z1、硬件平台: S: |8 g9 W. ]0 c$ x4 S# z+ r
/ W; l$ Q% j' E; Y% @
处理器:Intel(R) Core(TM)i7-4710MQ CPU @ 2.50GHz 3 Z$ l# W. c4 m/ c8 E
安装内存(RAM):16.0GB9 t" y" H, Q1 ~5 A* [4 h: n9 B$ Z
4 n3 c y E( |' l1 t
2、系统平台$ J2 F8 ?$ b3 i3 l O, _- `
Windows 8.1,Ubuntu
- t7 n% d; M0 D2 w! m
3 ~' I- ^0 I* f$ g3、软件平台
- m# z E4 F: M! v) Y+ `3 H! {8 f7 w1 d" N8 D1 a/ l" s, a
art_454
Z$ {- c2 h4 X8 I/ c& P# aGenomeABC http://crdd.osdd.net/raghava/genomeabc/
: [, C; x0 n7 P9 H, ^Python3.5; }4 y7 Z/ P$ `" ]5 g" V
Biopython
) d, p1 w7 }6 B5 D4、数据库资源" w( x3 u4 X9 q3 z" T( F9 g
: X2 W" }8 K" W6 N( w
NCBI数据库:https://www.ncbi.nlm.nih.gov/8 H1 v: d8 D J7 D
+ K6 ]& V2 a$ U2 U4 m
5、研究对象
' u' @# p6 ]( z; v% I' S4 B4 Y+ z
; S6 Z# v8 E# B! x( |1 y Y酵母基因组Saccharomyces cerevisiae S288c (assembly R64) ) l3 i# ?( {* a4 y+ S# d) g
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz' l% ^8 T: U+ Q3 i* W
) K2 M# e9 \# O$ Z6、方法
9 Z' K% {# {( o' r8 ]
5 o# k4 M" Y9 D5 y1 gart_454的使用 6 m1 _ e" K& ^) `. Q
首先至art系列软件的官网,下载软件,在ubuntu系统安装,然后阅读相关参数设置的帮助文档,运行程序。5 ]6 v9 ?/ y9 V3 W+ M
GenomeABC 5 V5 c1 p: y& V
进入GenomeABC(http://crdd.osdd.net/raghava/genomeabc/),输入参数,获得模拟测序结果。
! m; R' M! x/ y! t L* D编程模拟测序
# x" f4 D [: q; ~4 Q" V下载安装python,并且安装biopython扩展模块,编写程序,模拟单端/双端测序。
. h7 d- ]9 x; ?. [( k" p+ {0 ~" @三、结果3 P: ?6 n; i- O
) s. k& H( a' ~4 i$ k' T" G1、art_454的运行结果, q( B+ U5 G% ~9 k$ f
2 ~$ r4 T) A5 S, s6 `$ Q4 B; M, \5 q6 g无参数art_454运行,阅读帮助文档 . y" E) m+ P" _' f. ]
& F6 J3 b3 ]% h- O) d/ B图表 1无参数art_454运行 3 Q Q: a7 c9 j! }- n9 U* j: b
对酵母基因组进行基因组单端测序模拟,FOLD_COVERAGE设为20,即覆盖度为20. 8 }) o i' n$ H6 E# U. T7 D
下图为模拟单端测序,程序运行过程及结果
4 k: A3 m( \ {! m: j, I: p: Q o9 ~1 a2 ^4 G- R$ e8 m
图表 2 art454单端测序 7 T) P5 `7 T \% [, B5 r
9 E* o* C6 n6 y* w! M0 M
图表 3 art454单端模拟结果 ; {) C" _+ F/ b# n, T
双端测序模拟,FOLD_COVERAGE设为20,即覆盖度为20;MEAN_FRAG_LEN设为1500,即平均片段长度为1500;STD_DEV设为20,即长度的标准差为20
! p2 o9 Y% O$ d3 C! k: x' W2 e下图为模拟双端测序,程序运行过程及结果 : u1 u( Q9 P; j* ~, S4 u4 l
, e" j) i7 i) X6 O3 j7 v图表 4 art454双端测序 ' s. Q4 ^' Z* F/ d) r% F
: ^% F! r+ w8 u2 p
图表 5 art454双端模拟结果 l+ t' s$ R1 X1 X# J* m* A- u
2、GenomeABC
, K& X' C. U- G$ n* z) H* z下图为设置参数页面 ! {+ [( [2 u% c$ _2 F
1 Y- }1 J- p: o; [1 V1 g
下图为结果下载页面 : p" R3 X, O" I/ r0 }6 ]
- l4 v+ V4 g+ W: W6 o图表 6 结果下载页面
* t. u( s2 T' o0 X4 n9 F" e3、编程模拟测序结果 2 t: J, {6 o! g
拷贝数是这里的N值;覆盖度是m,测序深度是宏观的量,在这里与覆盖度意思相同,就是测序仪10X,20X。
; Y2 S& v3 c0 b3 y" d1 x5 [* f单端测序
& Q" `- n' o3 y; V4 ?8 ?+ A. \! R& N. s1 L2 A2 w$ X
图表 7 程序模拟单端测序
* I) ^: o+ Y/ a' H |8 n6 U双端测序 0 M H3 L0 d1 S/ Y9 Z; e
' x' H7 G: I# K J, W' y, J
图表 8 程序模拟双端测序
/ {8 U0 S- p& L6 u: A测序结果 6 V$ b( v/ {/ W6 o
5 d$ k) M! `; `4 B& P5 X: `图表 9 结果文件# S* @( m: k5 z& M8 v/ D
7 x8 ^ C: D3 T& ?; ]" G4 f4 R6 p
因为期望片段长度是600bp,在片段长度区间200-1000bp内,所以大部分的片段都没有删除。
1 j# l# J, d5 c. Y测序结果统计表8 g4 B0 G9 a3 y1 q& W e
4 P. o6 {6 I7 ^$ r( t测序方式 基因组大小(bp) 片段长度区间 (bp) N值 期望片段长度 克隆保留率 片段数量 Reads长度范围(bp) Reads总数量 Reads总长度 覆盖度(m值) 理论丢失率(e-m) 覆盖率(1-e-m)# a' t0 f" D7 a7 C* a0 f8 P
单端 12157kb 200-1000 10 600 0.95 107378 50-100 101968 7645.541kb 0.62889 0.53318 0.46682 O! S( m/ A! }9 A- s/ }, U. ?2 @
单端 12157kb 200-1000 20 600 0.95 213722 50-100 202996 15227.882kb 1.25259 0.28576 0.71424) t/ I& L; [2 P8 u) e$ j' t
双端 12157kb 200-1000 10 600 0.95 106704 50-100 202770 15212.662kb 1.25134 0.28612 0.713885 q. y, c5 _8 B+ f' { {: E$ M3 C' Q
双端 12157kb 200-1000 20 600 0.95 214212 50-100 407186 30534.265kb 2.51164 0.08114 0.91886
! e5 a: H: U5 i; E7 l' n四、讨论和结论
$ g. w. k' q0 ?) v5 z! R% p8 L. v& A6 E5 |3 L) r$ j" c
程序运行方法
; I) S5 Y A+ x- y8 b" z) W4 S8 }; ?4 i1 q' h
在类的构造方法init()中,调整参数。
! G$ P9 V: p3 ?2 L, X. cAveragefragmentlength为片段平均的长度;
$ b5 ~, |7 K2 ], u) k3 u8 Nminfragmentlength和maxfragmentlength是保留片段的范围;
3 b/ z+ S& Z% a2 ncloneRetainprobability是克隆的保留率; ' T4 m/ U1 q J& H
minreadslength和maxreadslength是测序reads的长度范围
9 I+ B3 T0 N/ u+ l; u
& w' \# S$ |8 \模拟测序的诸多方法都封装成了Sequencing类,只需要创建类,并调用singlereadsequencing()和pairreadsequencing()方法,传入文件名的参数即可。4 h! m# S7 ]: K9 s8 x! `/ T
$ x1 `) A$ a6 `7 z/ r* R( x附录
% q6 m# I& p# e) m4 _4 S, |7 f% ]4 r; ~% l
from Bio import SeqIO( z3 k5 P/ h5 P: c& K
from math import exp
7 k8 D% r& v2 q0 E, C" }$ pimport random8 B+ Z( D7 O- l- r3 h
u$ f. L o7 O) Z' K: Z( F
class Sequencing:
( o( d0 P+ ~/ t5 _) P # N代表拷贝份数: y" D( s% y8 [$ u6 S
def __init__(self)! e, T' P/ Z' P
self.fragmentList = []
: i8 Y9 p7 J4 D9 \ self.readsID = 1# [! ]2 y$ V* U
self.readsList = []
- G; T6 o; I# d; Y- Z, O. V/ R self.averagefragmentlength = 6502 `/ [! r$ C1 i5 W4 I9 i- R
self.minfragmentlength = 500
2 @; x+ ~% K9 g7 c, J self.maxfragmentlength = 800
) O2 y, z3 e1 e& L9 S self.cloneRetainprobability = 1
0 e4 w S0 r9 C! A. [! n5 Y. q self.minreadslength = 50# u& s, u% X# o
self.maxreadslength = 1500 p$ E) v) m L2 ~/ |
self.N = 10
" c2 q4 K7 X* r T, R' f self.genomeLength = 0
% W4 Q$ w6 c$ p9 c: s) V K self.allreadslength = 0
6 S: v0 s& |( Q: V: o% t/ V: P. v: }3 a/ ?
# 生成断裂点7 w. o9 k0 Y. L7 X9 H" u
def generatebreakpoint(self, seqlen, averageLength):
+ P, P1 S+ C7 l3 C1 ?- Q1 J7 C7 [2 n3 Z # 假设平均每500bp 产生一个断裂点(averageLength = 500),通过随机函数生成seqlen/500个随机断裂点(1到seqlen之间的随机整数)1 c$ {! P6 `. v! w R* e$ J
breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))]
( h, ^% \( o8 g+ @. A breakpoint.append(seqlen)
5 W5 W/ D/ P6 e breakpoint.append(0)
. C8 A+ l ^ q! ? # 把随机断裂点从小到大排序8 x6 D( w! l) [8 E% z5 [$ D S
breakpoint.sort()4 `/ G" j1 a( Q& y( e( I! v8 b9 z
return breakpoint
& B& q8 Y, a- T2 [0 [1 o7 p% w% J; U
# 沿断裂点打断基因组,并删除不符合长度要求的序列片段,定义片段范围:200-1000bp, @; u: ~. E4 d# I' b3 |
def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength):8 e! M5 Y/ L. \ R
for i in range(len(breakpoint) - 1):
8 N4 a& F T" P fragment = seq[breakpoint:breakpoint[i + 1]]
0 w' {( V1 v. D, K' h/ `; u: [" P if maxfragmentlength > len(fragment) > minfragmentlength:
6 }2 A: M2 C0 t; ~" A3 F* X5 K self.fragmentList.append(fragment)
/ c6 `4 J+ `) ], B/ I( i5 i- l return self.fragmentList
h) e) S# ]" i: u, S8 F4 l9 W! e& A, J' z) a0 Z9 l' e
# 模拟克隆时的随机丢失情况,random.random()生成0-1的保留概率
; o* H* ]; w$ v+ w def clonefragment(self, fragmentList, cloneRetainprobability):
) K7 K/ e) o$ s) O, ^0 n+ i clonedfragmentList = []
/ d: V" e1 p* S& o* d Lossprobability = [random.random() for _ in range(len(fragmentList))]
7 o5 F/ G6 i1 t. _0 M1 Y for i in range(len(fragmentList)):
" b) J! C R+ l n if Lossprobability <= cloneRetainprobability:
: H0 n1 a; e: t1 t4 Q) b clonedfragmentList.append(fragmentList)/ O/ [$ c# p2 X' I6 e; I# n n
return clonedfragmentList
% Z& Z9 Z) g4 [, _* G( y( |5 w( V9 f2 e( s/ r, w; |; ]6 Z
# 模拟单端测序,并修改reads的ID号; s) O( Y {0 D! L# p
def singleread(self, clonedfragmentList):" d ^$ f+ C/ k0 s
for fragment in clonedfragmentList:( {. y' ~. h& c6 V2 V
fragment.id = ""+ t/ a, F4 L J
fragment.name = ""+ S$ E1 a, |4 y- Z0 `7 t1 w' J
fragment.description = fragment.description[12:].split(",")[0]2 E0 ^% F( O$ M9 H
fragment.description = str(self.readsID) + "." + fragment.description
?& p7 i. d$ ]. {1 Q+ | V self.readsID += 13 @5 }$ m) e$ R4 H, R
readslength = random.randint(self.minreadslength, self.maxreadslength) H4 x. ]/ J- Y, n# M! p
self.allreadslength += readslength' Q N# r3 l6 m
self.readsList.append(fragment[:readslength])/ h" E0 E5 M% \. i, q# Q; b) M
* w9 X, } s6 G* n1 Q def singlereadsequencing(self, genomedata, sequencingResult):
. b( z2 N+ M& T4 e9 v, l! r for seq_record in SeqIO.parse(genomedata, "fasta"):
& v! M* |' i& B seqlen = len(seq_record)- d! ~' a" L' l6 R" Q
self.genomeLength += seqlen0 S. {% J7 |% B/ y g; s8 i
for i in range(self.N):+ e3 @4 N# X2 x% a/ G3 S1 C9 n
# 生成断裂点 \- Q5 b! F3 Z, n [4 K7 O* V
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)7 z' y, V$ d" p" m3 B$ y( {3 v
# 沿断裂点打断基因组# [& d( c! B6 w( m2 V
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
% k7 c. o: \* D# j # 模拟克隆时的随机丢失情况; |* }' Q8 z5 F6 g& J% e8 B
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)$ t* v: x4 |/ L( ^
# 模拟单端测序
2 j& E1 [$ r/ b: f5 m: f self.singleread(clonedfragmentList)
4 Z4 E0 E, w- Y( { SeqIO.write(self.readsList, sequencingResult, "fasta")9 T$ x# T" a/ u/ d. s) a' `
2 |- i% C( n9 g$ D/ u def pairread(self, clonedfragmentList):, r) c( p+ P$ z0 w, i9 L" @
for fragment in clonedfragmentList:7 l9 x6 Q3 v- }9 O8 [
fragment.id = ""+ l/ f' f4 ^/ Z5 G% H" _* E* j( {
fragment.name = ""
" w% B- l3 N; Y6 Q1 h% v description = fragment.description[12:].split(",")[0]
/ m" Z3 Y1 w1 G; L' o fragment.description = str(self.readsID) + "." + description; I3 {5 H; P! D' A0 h0 l- X
readslength = random.randint(self.minreadslength, self.maxreadslength)
6 [0 G# b9 J! u self.allreadslength += readslength2 \3 q) ~7 z" ]* X0 }
self.readsList.append(fragment[:readslength])
6 T: g7 t5 w7 s7 M- O9 K1 `# ~) c( u8 k% U2 `! _. A
readslength = random.randint(self.minreadslength, self.maxreadslength)6 m& F0 ?+ H5 R1 C
self.allreadslength += readslength
. D X3 k c4 u9 M3 x- t, j( l) N; v; S4 {3 `
fragmentcomplement = fragment.reverse_complement()
( l$ |; s7 V# E+ P6 h m4 O& L fragmentcomplement.id = ""
! W* }0 D$ j+ U* @8 y5 A8 G5 V fragmentcomplement.name = ""
+ F5 S+ z- K- _8 v" f fragmentcomplement.description = str(self.readsID) + "." + description) k3 b0 v8 Q5 s! i; ?
self.readsList.append(fragmentcomplement[:readslength])
9 k1 H; w6 L! D. W
$ I% E/ N1 S- v8 ? self.readsID += 1- l0 \! a5 ~: V1 f. o. o
& x1 Z/ Q3 J1 P) O0 f def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2):
( j# K8 V5 V, n2 A, X, h0 G for seq_record in SeqIO.parse(genomedata, "fasta"):
. q4 e8 Y7 Y7 U% i9 }) c seqlen = len(seq_record)
/ G: p% A9 @* l/ K$ b! @ self.genomeLength += seqlen
4 q+ l5 ?8 m1 x" q' s3 E0 ? for i in range(self.N):/ y+ a% g" V* ] A5 D. ^' }5 ^. K
# 生成断裂点
9 p7 o) m1 H& k8 A& D! `% A8 W$ K breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)) D& o. ^ L# `
# 沿断裂点打断基因组
) ~" f$ `+ f, g. i& V1 i u" n self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
3 ?! M! F" X1 }' `% u- _ # 模拟克隆时的随机丢失情况. ]3 D3 @3 y0 ?; \! G& O
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability): b8 c b4 }/ c. n
# 模拟双端测序9 z3 f& K, N( c7 p$ S9 [6 N2 \
self.pairread(clonedfragmentList)
3 v0 G: i" A$ \, v. q: j, O& F readsList_1 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 0]8 D. m2 ~2 e% u( a D
readsList_2 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 1]
- E- C# W, ]3 P% ^0 F" | SeqIO.write(readsList_1, sequencingResult_1, "fasta")
! j& V" Q- v. G9 B% { SeqIO.write(readsList_2, sequencingResult_2, "fasta")
+ R$ l4 @# E! ~/ I6 h$ A* X; ^, N5 B4 A, _) i c0 {/ G6 ]
def resultsummary(self):( a! r1 }2 ^/ Y7 z @
print("基因组长度:" + str(self.genomeLength / 1000) + "kb")1 q' o. ?2 I# U/ J0 Z( @
print("片段长度区间:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength))3 c$ o% x+ h; [. Z
print("N值:" + str(self.N))2 Q& M+ n( z4 p
print("期望片段长度:" + str(self.averagefragmentlength))
: l5 n, T7 c: G- g print("克隆保留率:" + str(self.cloneRetainprobability))0 w' h d# |* S- \* v! f' v
print("片段数量:" + str(len(self.fragmentList)))
: f4 _6 P; w0 a2 H5 @ print("reads长度:" + str(self.minreadslength) + "-" + str(self.maxreadslength))0 o1 E9 ?4 ^ V- ] D- C$ A' n6 J% `
print("reads总数量:" + str(len(self.readsList)))2 ]1 s6 m2 j" d" [6 A
print("reads总长度:" + str(self.allreadslength / 1000) + "kb")
9 a5 h5 B, p8 R. N: \ m = self.allreadslength / self.genomeLength
& |5 a! P" E6 L' Q% t* |; M print("覆盖度(m值):" + str(round(m, 5)))% m7 w# D5 I3 K
print("理论丢失率(e^-m):" + str(round(exp(-m), 5)))
- Y6 e3 n* [/ _5 l4 F7 b" h4 \ print("覆盖率(1-e^-m):" + str(round(1 - exp(-m), 5)))
/ V3 `+ ~; X( t1 K4 `- H# -------------------------------------------主程序-------------------------------------------
) R1 O% _+ g9 c. {# 模拟单端测序8 R8 V C! v* t0 j1 u
sequencingObj = Sequencing()
7 V H- }$ K2 i3 F* ysequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")
2 D; F) m0 d' x; ~, S6 YsequencingObj.resultsummary()$ Q- d Z1 v! l3 ~5 Q0 ^
, d7 [1 O5 y }% v% H$ C' m3 w
# 模拟双端测序
) a S6 Z1 w6 Z* q' y0 TsequencingObj = Sequencing()" G, u, ?/ T7 o) {
sequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa"). W4 @; h2 `. K# F- M
sequencingObj.resultsummary()7 [0 I+ d! r) r9 |2 ^* c% b
from Bio import SeqIO
# x$ E" D8 [& ~' A3 b$ Z, P7 kfrom math import exp" j! L" N. w* }( Z+ g
import random
) V1 ~! I. n4 K# Q. i; i) M4 Q: ]3 D0 j: f
class Sequencing:
6 s' O5 i; s' y6 l$ v # N代表拷贝份数
' t8 N0 e9 \+ l. k6 W8 u& ^& r def __init__(self):
7 Y2 _" N" P7 L. { self.fragmentList = []8 [( ^( c# {8 w
self.readsID = 1
0 P4 J2 M8 Q, t! Z+ F2 L self.readsList = []8 s9 O& Q6 e6 Y% D: g2 J! r
self.averagefragmentlength = 650; K/ E5 i+ K7 h2 ]/ p+ s. o4 n+ {
self.minfragmentlength = 500# h5 |! t; K/ n0 Z {" B' I& W# i& \
self.maxfragmentlength = 8003 i: \' W( M, c; _6 Y ?) u
self.cloneRetainprobability = 1* b8 |8 Y4 A4 ~, K" n8 ]6 G0 D- M
self.minreadslength = 50/ U* Q9 P( W9 S" x+ Q0 s
self.maxreadslength = 150' }! q. d# `. x; J" Z3 e% Q; }" b
self.N = 10- n7 G" G; F y, ]
self.genomeLength = 0
/ @, H i. b& F6 P1 k3 k! a: j self.allreadslength = 0
4 [2 W7 N- [" @/ A
8 p3 J3 L6 V6 v+ } # 生成断裂点+ Q6 L) b, n" w; H$ y
def generatebreakpoint(self, seqlen, averageLength):3 l& {& ]0 L4 k# d
# 假设平均每500bp 产生一个断裂点(averageLength = 500),通过随机函数生成seqlen/500个随机断裂点(1到seqlen之间的随机整数)
+ y: y% u) ?" r3 C8 x) v" W breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))]1 ^9 d" @* _9 H0 y9 \9 s3 O
breakpoint.append(seqlen)2 W' E O( q8 `( ^( C
breakpoint.append(0)
) O7 V) j/ o2 N, v4 G # 把随机断裂点从小到大排序" f1 s( C9 p& o
breakpoint.sort()$ w( [8 g% _ g: A3 v3 x- [
return breakpoint
0 p5 ~! e# U, R4 D+ V1 g" A' o
# 沿断裂点打断基因组,并删除不符合长度要求的序列片段,定义片段范围:200-1000bp
- o- \' I- Z" q; r. W def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength):+ r/ Z( C+ B; ~' F! w' j/ D3 Q
for i in range(len(breakpoint) - 1):3 K9 P4 E& v) O' |# b5 u8 \
fragment = seq[breakpoint:breakpoint[i + 1]]
+ B) }6 w* [0 I! f if maxfragmentlength > len(fragment) > minfragmentlength:
' I* z7 r1 ?7 q1 I9 I. t6 p3 t z self.fragmentList.append(fragment), @. u! Y, b. B! w
return self.fragmentList/ A9 u) M$ F; x/ [7 p7 V
0 j; @9 @" `* N4 u7 Q
# 模拟克隆时的随机丢失情况,random.random()生成0-1的保留概率$ R7 K- n! _( c q
def clonefragment(self, fragmentList, cloneRetainprobability):5 |3 f( @$ s5 l
clonedfragmentList = [] z; v: e0 D" f; T' N
Lossprobability = [random.random() for _ in range(len(fragmentList))]
4 w+ N) F! x6 W& q for i in range(len(fragmentList)):
2 g4 ]) ?4 O. S& d# x) \ z if Lossprobability <= cloneRetainprobability:5 n {* E V$ B2 }/ P
clonedfragmentList.append(fragmentList)
+ c$ Y9 k$ |+ k2 Y! g return clonedfragmentList6 g' S. t6 G' p8 s, Z: e' b4 V
% G5 H6 ]1 T' t- j8 B) y
# 模拟单端测序,并修改reads的ID号; }; K1 D( K, y0 |3 p
def singleread(self, clonedfragmentList):% X f% f O1 W- X n
for fragment in clonedfragmentList:# Q3 t0 I/ Y2 ^" o$ L1 g& g/ @3 h
fragment.id = ""1 e1 w: Z* [. ]$ R2 \! H
fragment.name = ""
, z: u- D9 B- U# V6 X fragment.description = fragment.description[12:].split(",")[0]% P0 |. f; m& G$ T0 b% s- t
fragment.description = str(self.readsID) + "." + fragment.description8 u, a K$ g% Y, E
self.readsID += 1
+ S+ ~3 h4 P6 `1 } readslength = random.randint(self.minreadslength, self.maxreadslength)
; [; }* f7 Y! y$ t+ e8 y4 Z self.allreadslength += readslength
; h& ~% M/ H4 q! u7 q8 x. m2 W self.readsList.append(fragment[:readslength])
7 R" B# T P$ N, C0 |% Y
$ ]6 e7 M* Y- ?. r; f' [# p def singlereadsequencing(self, genomedata, sequencingResult):5 e6 X f4 ^" `) ~% p' @4 k6 P
for seq_record in SeqIO.parse(genomedata, "fasta"):9 e% D) ~2 t2 c# f, C- q7 { E
seqlen = len(seq_record)/ ?6 y' x! L; I) g: S" b2 Z) G
self.genomeLength += seqlen6 L5 l: V/ Q v
for i in range(self.N):
' a9 o. g) E3 r) [, _& O # 生成断裂点 u; i7 h# A0 ~6 O4 Y5 f4 _
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)& H6 `: A% k; W( f7 S
# 沿断裂点打断基因组
/ h" |& d9 q" r1 x) h9 K: n self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
5 G6 c/ G4 q$ T7 O. E9 @ ]) a # 模拟克隆时的随机丢失情况
4 W' ]; I! C# o' @1 @1 ~& h% X clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)7 u9 `- l$ y. ~- Y
# 模拟单端测序
& t, Y+ K) h# l% p7 I2 K; N7 q9 d self.singleread(clonedfragmentList)
. F% ~0 T; P" a. ^ SeqIO.write(self.readsList, sequencingResult, "fasta"), Q$ ~4 g; d2 Y
: t9 g* l# T( {: Z$ y' {/ C0 | def pairread(self, clonedfragmentList):
5 d7 T5 H \! N3 w' N for fragment in clonedfragmentList:
; J4 [# A7 A* e+ p) y fragment.id = ""# O) P* N% F: \
fragment.name = ""- _# F! `% g1 `
description = fragment.description[12:].split(",")[0]
[$ X- `" g4 z( t l e fragment.description = str(self.readsID) + "." + description
; \4 {5 K" t; r6 u* P2 z" E readslength = random.randint(self.minreadslength, self.maxreadslength)5 x# I+ O3 x& E# x
self.allreadslength += readslength" m2 c5 Q& v) o8 n/ q @% H: T; o h
self.readsList.append(fragment[:readslength]); k6 e& T" G0 }8 r. T. ~8 B
) C! W, i% D7 }+ N5 H/ U1 j+ y readslength = random.randint(self.minreadslength, self.maxreadslength)
2 L+ g4 V% y# |( I self.allreadslength += readslength- `/ @" k* n$ u j0 ~! n' }
4 w* C" m" v. O; Z' l/ C @ fragmentcomplement = fragment.reverse_complement()* M0 A5 ?$ Z/ b9 d( Z: s3 P
fragmentcomplement.id = ""0 R2 T6 h6 q2 Q7 z/ j- Z
fragmentcomplement.name = ""* J. o/ V9 g* S3 I; `* a
fragmentcomplement.description = str(self.readsID) + "." + description- r9 p# @/ l5 ^
self.readsList.append(fragmentcomplement[:readslength])
3 r& b: `$ V$ X# B& y# o
; @( z' D4 L' C7 }# e self.readsID += 1( j- c, ~- @# u/ {- K% V
/ z3 {# s- Y; I( H r7 Z. u$ ` def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2):4 D, ~" g( Z0 ? f K7 p
for seq_record in SeqIO.parse(genomedata, "fasta"):
H7 g5 G7 ~, B3 u# L2 k seqlen = len(seq_record)0 X0 I: C$ I- F
self.genomeLength += seqlen
2 J+ x2 R c G) ]/ v) ]2 v for i in range(self.N):# Y: h3 y3 J5 Y* `8 u
# 生成断裂点- W, y& | @$ o/ b- D4 N: m
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)2 T1 @6 q' V( T/ J5 t+ @
# 沿断裂点打断基因组
- ^4 h; r, U% c* X self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)3 ?8 w# |$ o7 b8 ~, V, a
# 模拟克隆时的随机丢失情况4 L& b# B! x: }9 |7 i
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
6 A. t7 ]% x0 j" M- ~1 F8 } # 模拟双端测序# z( W5 P2 s% D" p. I
self.pairread(clonedfragmentList)7 N) n& e I+ C; K d* E
readsList_1 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 0]
/ @+ {$ f8 }2 ^- j! H: V' a7 ~ readsList_2 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 1]+ `2 I2 k( n- t
SeqIO.write(readsList_1, sequencingResult_1, "fasta")! X G0 O( M S
SeqIO.write(readsList_2, sequencingResult_2, "fasta")
- Y, T- m6 H% N3 |) x t' O4 d- o. u8 a+ n! f: T: N& F) }
def resultsummary(self):
# q5 n& J( M$ }# Z3 t print("基因组长度:" + str(self.genomeLength / 1000) + "kb")" f: v: t; C; n
print("片段长度区间:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength))" R& B# F3 W: ^
print("N值:" + str(self.N))& E# M' w! o5 g/ e
print("期望片段长度:" + str(self.averagefragmentlength))# V* y& C- h T5 D2 u/ G' P
print("克隆保留率:" + str(self.cloneRetainprobability)); T0 `1 E( b$ M
print("片段数量:" + str(len(self.fragmentList)))) G. |! V6 w8 B- @* |
print("reads长度:" + str(self.minreadslength) + "-" + str(self.maxreadslength))# |9 V/ |( ^( c" s8 r9 Q5 F4 Y: K
print("reads总数量:" + str(len(self.readsList)))
6 V: p9 w1 }: b% C$ ^5 a& {. G print("reads总长度:" + str(self.allreadslength / 1000) + "kb")
& U \! ^# |# Y0 F m = self.allreadslength / self.genomeLength
( m* x$ L# ]' z% |- R& v2 U6 g print("覆盖度(m值):" + str(round(m, 5)))
5 v4 P. ]# @! l9 b4 k5 s1 N* e2 N print("理论丢失率(e^-m):" + str(round(exp(-m), 5))), P: E/ {4 G2 @( `" ^7 F2 I: G N
print("覆盖率(1-e^-m):" + str(round(1 - exp(-m), 5)))
( c+ [. h, O2 \0 [- p# -------------------------------------------主程序-------------------------------------------
( e# R5 M5 w" x5 P9 z# 模拟单端测序
* I# _2 m' x w3 N' s' ^sequencingObj = Sequencing()' g0 H8 d( ^1 j1 V- ?! h
sequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")
5 v; X, z) M! S4 z1 j6 W! D1 g) N- wsequencingObj.resultsummary(): a( @6 x9 `* q6 Z+ B; q3 K
9 I2 s5 O' T! P6 ]6 ^5 o0 }" n# 模拟双端测序
- U) w2 M i3 w2 L$ ` a: vsequencingObj = Sequencing()- g- w5 S0 G K4 B# E
sequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa"). U* B' q5 j- F$ S1 t' j$ O0 ~
sequencingObj.resultsummary()
3 H: o- O \3 g
( m, j$ Q7 p" u0 @0 f2 W
( X' L: j6 d8 W- j+ I( R2 Q# a
~- x8 _, I2 V : d# y+ r9 P% n0 a
|
zan
|