- 在线时间
- 1630 小时
- 最后登录
- 2024-1-29
- 注册时间
- 2017-5-16
- 听众数
- 82
- 收听数
- 1
- 能力
- 120 分
- 体力
- 562739 点
- 威望
- 12 点
- 阅读权限
- 255
- 积分
- 174195
- 相册
- 1
- 日志
- 0
- 记录
- 0
- 帖子
- 5313
- 主题
- 5273
- 精华
- 18
- 分享
- 0
- 好友
- 163
TA的每日心情 | 开心 2021-8-11 17:59 |
|---|
签到天数: 17 天 [LV.4]偶尔看看III 网络挑战赛参赛者 网络挑战赛参赛者 - 自我介绍
- 本人女,毕业于内蒙古科技大学,担任文职专业,毕业专业英语。
 群组: 2018美赛大象算法课程 群组: 2018美赛护航培训课程 群组: 2019年 数学中国站长建 群组: 2019年数据分析师课程 群组: 2018年大象老师国赛优 |
基因组测序模拟
: n* h7 q- }. \1 s- W. w+ A基因组测序模拟2 I9 r C* y; M
' L5 `& x2 B0 e+ p* ^/ o
一、摘要
2 j5 a; `4 Y ] p% L( M
; r4 o0 i9 d$ S通过熟悉已有的基因组测序模拟和评估程序,加深全基因组鸟枪法测序原理的理解,并且能够编写程序模拟全基因组鸟枪法测序,理解覆盖度、测序深度、拷贝数等概念,设置测序相关参数,生成单端/双端测序结果文件
8 v4 ~! z6 L9 m T1 X7 y/ D
; q. f3 ^$ u7 s9 G+ O2 U+ I1 Y/ k二、材料和方法
! r. Y3 `& j& B, l+ }1 I' R9 [( S, v) ^: P; a
1、硬件平台
S: i' f% a) p' r9 g4 c( h0 t) p, G4 i& Y' t
处理器:Intel(R) Core(TM)i7-4710MQ CPU @ 2.50GHz
( t7 n1 s, w6 y6 a: G安装内存(RAM):16.0GB6 L# ~: R2 Q6 f+ H
, u, t' V7 O+ V7 ^ c( d2、系统平台
+ H* \. y+ j- T! X' JWindows 8.1,Ubuntu; g. ]& f/ x" \6 K8 |4 x: O
" F% e( ]6 Y6 B9 L0 {
3、软件平台! z4 F7 g$ c/ S& R2 j1 g5 ~" I& i' m6 u
8 Y) b$ @. i; E* V4 g+ `
art_454& ~8 N' V6 t2 f% S" v+ ~2 V' o3 \
GenomeABC http://crdd.osdd.net/raghava/genomeabc/
2 c$ f8 m: C+ M" OPython3.5
8 I `7 }8 N% ]6 t0 j% v DBiopython# N3 y2 e- `4 r/ I9 b( t
4、数据库资源3 W6 S5 y: T* ~
# Z/ Z3 w, G2 E; K4 X. ZNCBI数据库:https://www.ncbi.nlm.nih.gov/9 ~7 E" C" [6 ^5 W( _
# n8 o9 U3 y" e5 d5、研究对象5 U. ?3 z; q- P. I# g
- n7 N$ B* {" x( y. F酵母基因组Saccharomyces cerevisiae S288c (assembly R64) ! q! M3 \/ u' G: i# I
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz* t9 E% g d- V6 T8 K4 ~
* E# P7 c! c2 s: _. z* r, n+ ]8 J4 E6、方法
9 q" F3 |. S2 n% M9 j1 \. B/ ] o3 F2 b& L* j- p$ y, n, g1 C
art_454的使用 . Q, F2 u" a g; T
首先至art系列软件的官网,下载软件,在ubuntu系统安装,然后阅读相关参数设置的帮助文档,运行程序。
3 C+ M) S! P& ~2 y' XGenomeABC
3 z6 N* ]9 q4 F3 D d6 ?5 |进入GenomeABC(http://crdd.osdd.net/raghava/genomeabc/),输入参数,获得模拟测序结果。9 o) I% \* S: E7 ^+ f3 Y9 ]
编程模拟测序
! x0 {0 k2 l* _; y& I下载安装python,并且安装biopython扩展模块,编写程序,模拟单端/双端测序。
/ ~( X3 m( x* D: F, L三、结果4 J: S3 m' ~9 D! n
% Z8 w5 x0 }/ t+ B- c7 ` ^: I3 W
1、art_454的运行结果4 R7 e `4 Q# u% Y! ?
% G1 j: ?# i3 \; |) e1 {( o
无参数art_454运行,阅读帮助文档 7 {; Q! i. V* q
1 I5 q% p2 n8 A/ D/ R7 Q% ?% u图表 1无参数art_454运行 , f0 ~9 |, t' P
对酵母基因组进行基因组单端测序模拟,FOLD_COVERAGE设为20,即覆盖度为20.
1 p7 F$ T5 E0 n下图为模拟单端测序,程序运行过程及结果
/ E5 Y- w/ e/ F' K% l) G! c" o( j
8 f1 [- R9 {, G图表 2 art454单端测序 + j3 [- x7 o T
: z2 C, R7 w( s7 T& B2 U图表 3 art454单端模拟结果
: H6 x& }9 U- A8 m: w双端测序模拟,FOLD_COVERAGE设为20,即覆盖度为20;MEAN_FRAG_LEN设为1500,即平均片段长度为1500;STD_DEV设为20,即长度的标准差为20
" ?6 V: r* s& C, I4 k下图为模拟双端测序,程序运行过程及结果 9 o) ?: }. W% | P/ {3 X/ D. D
7 k5 _' z4 ^6 |1 e9 E4 f
图表 4 art454双端测序
& t. G2 V0 Y2 e& T) O& s& _- _- T, e/ ?' E/ }, }3 y
图表 5 art454双端模拟结果 ! K( l- A8 ]. V
2、GenomeABC 0 C* S/ A8 L9 K" p& `
下图为设置参数页面 - i/ {- c6 ~1 U+ p3 {
2 I4 F5 B1 ` I下图为结果下载页面 - B5 X# k: C4 w( d v( x# R
" U- y0 }6 f( Q, P, R* I$ A/ v1 M- G. [图表 6 结果下载页面
( s9 s) R1 q3 X5 O# G7 c/ H0 q3、编程模拟测序结果
$ }8 ?2 @3 w+ F! `- ~ o拷贝数是这里的N值;覆盖度是m,测序深度是宏观的量,在这里与覆盖度意思相同,就是测序仪10X,20X。 ( L1 s4 T' ^1 C; N, t8 d0 ^+ T
单端测序 : O2 `/ f9 V* Y# L) N' D! t( _
1 i5 x; E1 Q. J: n* t: X6 i
图表 7 程序模拟单端测序
, K! U+ n! E: X5 \双端测序 4 s3 y' f G5 \
# p9 K. G( K+ z2 j/ `/ R图表 8 程序模拟双端测序
# y/ g. @: N4 }4 N0 \测序结果 2 h& u. X6 o' g* Y' G8 x- u
7 t' D# X+ `! C; q# Y
图表 9 结果文件3 d& i% m* I% Y( e [! v% m- t
" e* i* x$ Y; e8 O, t8 o% I. X
因为期望片段长度是600bp,在片段长度区间200-1000bp内,所以大部分的片段都没有删除。 $ ^! @" @0 e8 N2 |0 m
测序结果统计表
1 `$ o; j6 l! u" d- G( k0 A/ E6 i1 I3 ]. {. b1 ]5 J
测序方式 基因组大小(bp) 片段长度区间 (bp) N值 期望片段长度 克隆保留率 片段数量 Reads长度范围(bp) Reads总数量 Reads总长度 覆盖度(m值) 理论丢失率(e-m) 覆盖率(1-e-m)# ^; n0 ?* t" a. R ^! v. W6 C
单端 12157kb 200-1000 10 600 0.95 107378 50-100 101968 7645.541kb 0.62889 0.53318 0.46682+ p* c' b9 n- Y' ^% k4 D" F
单端 12157kb 200-1000 20 600 0.95 213722 50-100 202996 15227.882kb 1.25259 0.28576 0.71424( q! p: l/ C: ]
双端 12157kb 200-1000 10 600 0.95 106704 50-100 202770 15212.662kb 1.25134 0.28612 0.713888 R6 l1 O7 r+ [! o& N; [* V! u/ L
双端 12157kb 200-1000 20 600 0.95 214212 50-100 407186 30534.265kb 2.51164 0.08114 0.918868 u/ X, ]! [& W' \
四、讨论和结论( O' B# A0 C) t7 V+ I1 F
: i9 r/ b/ p+ j# _) E. t; T; a0 U
程序运行方法
; d4 [1 ^; L6 Q3 D
- U4 w+ V* y9 ^ H# r: K, H在类的构造方法init()中,调整参数。 1 K$ j$ F+ [* O c# b, Y
Averagefragmentlength为片段平均的长度; ! f5 [9 O4 x7 @/ ?8 _) Z% \
minfragmentlength和maxfragmentlength是保留片段的范围; ; ]! U% w2 u, |2 l0 f: |: V4 s
cloneRetainprobability是克隆的保留率;
- @& b! I. @4 L, W" O# p8 Aminreadslength和maxreadslength是测序reads的长度范围/ c/ r; q! o1 {
5 b# k1 l9 t2 s模拟测序的诸多方法都封装成了Sequencing类,只需要创建类,并调用singlereadsequencing()和pairreadsequencing()方法,传入文件名的参数即可。7 v1 u, X" }9 u% k9 \: k
9 f3 g4 I( s: C9 U2 r0 _, z% k J
附录
6 o& |2 M |% S$ f$ t7 J
3 l0 v, V6 [4 e: m/ wfrom Bio import SeqIO
% v+ n. N7 E. @ M) ofrom math import exp
2 R; r$ a5 Y5 Himport random
) |* |" O' ]4 r% x7 f# a; k
! R2 b) P9 l* _) w/ uclass Sequencing:
) d8 X% c% F# U5 R3 J # N代表拷贝份数
6 u7 {) D! _1 C8 X* x7 G. N9 q$ u def __init__(self)% Y+ m, b5 } e4 j
self.fragmentList = []
4 Y% c% y4 O7 V self.readsID = 1' p* H# S H+ ?, Q
self.readsList = []
9 `5 X# {" D/ r& R1 c self.averagefragmentlength = 650
( p: ~6 u1 o4 \" }( i3 b/ Q/ E self.minfragmentlength = 5009 E- k$ n5 V! ^# V2 g2 q, |- V
self.maxfragmentlength = 8002 t- ^$ o( @2 [6 I3 q; P
self.cloneRetainprobability = 1
& a4 i8 D' e' m& J0 N4 Z self.minreadslength = 50
1 z/ R+ [: H/ l- V) ?, k1 J3 B self.maxreadslength = 150# N c" N1 d% t
self.N = 10
4 [& \+ ^; K; t3 `- H8 N5 n* S self.genomeLength = 0
) W% F' @% G, d+ h6 | self.allreadslength = 0
5 {" K9 @4 v% U+ s8 ?( G' ^1 `" o1 T3 B' i
# 生成断裂点
" V/ J' g1 Q9 Y3 P" j7 n6 _. P def generatebreakpoint(self, seqlen, averageLength): D {% _6 v9 |" D* j: M
# 假设平均每500bp 产生一个断裂点(averageLength = 500),通过随机函数生成seqlen/500个随机断裂点(1到seqlen之间的随机整数)
_5 \+ ?, U" P6 A1 K" x breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))]
/ w# u- }( `- ^3 b# p, P, i1 g breakpoint.append(seqlen)
+ \, a7 T: i6 L, a5 l$ z# X. G$ b breakpoint.append(0)/ i' N$ D, j5 K% D' p) U' e9 [6 E
# 把随机断裂点从小到大排序& v0 k1 u1 N: h6 b$ x$ [3 p
breakpoint.sort()
* I" w* r ^$ E; w- a return breakpoint6 [! G$ U! x6 Z, @" s$ ^* p; Q8 f
. ]/ y( X6 m2 s) y1 g" C
# 沿断裂点打断基因组,并删除不符合长度要求的序列片段,定义片段范围:200-1000bp* ~' F& B7 x6 J+ z2 S' V) Y( v
def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength):9 @! n5 x7 t5 J- p
for i in range(len(breakpoint) - 1):
2 X9 T7 \2 |$ b1 e- d1 k fragment = seq[breakpoint:breakpoint[i + 1]]
* F1 @+ b5 |1 A* f6 F0 R: L if maxfragmentlength > len(fragment) > minfragmentlength:1 I& m" x" M& f0 `
self.fragmentList.append(fragment)
) e5 Y$ B* _4 f6 | return self.fragmentList4 }; J( O( |( Y
2 s# e2 ~4 z2 | u+ v3 r$ k5 y! @ # 模拟克隆时的随机丢失情况,random.random()生成0-1的保留概率: ?$ r, |3 c) t
def clonefragment(self, fragmentList, cloneRetainprobability):
5 y8 F# N8 g3 A* N# n; M clonedfragmentList = []7 R. E4 w9 M' p: [
Lossprobability = [random.random() for _ in range(len(fragmentList))]' p1 P& z% d) q
for i in range(len(fragmentList)):" }4 q6 N2 c/ Y4 q
if Lossprobability <= cloneRetainprobability:
* e& M* X$ m( Z' M5 b$ t clonedfragmentList.append(fragmentList). m7 H, B4 p4 `- V c
return clonedfragmentList
/ T* x" F8 C! e/ h' Z$ j
: @. W+ J1 z) s# n! t # 模拟单端测序,并修改reads的ID号5 J6 n( S* {' E
def singleread(self, clonedfragmentList):) Q D* n# b0 r! V& x" h6 f
for fragment in clonedfragmentList:
2 O$ g1 \/ \2 o" b6 x fragment.id = ""
: q1 r: A) u- A# `$ Z6 Y L fragment.name = ""
C/ j c: q8 T3 U' Y fragment.description = fragment.description[12:].split(",")[0]
9 R/ d) F; o. C; p fragment.description = str(self.readsID) + "." + fragment.description
6 K) s$ p" M) V Q self.readsID += 1
# c* h3 @( c L readslength = random.randint(self.minreadslength, self.maxreadslength)
. M$ e9 L1 _- }* X# A self.allreadslength += readslength# S- ~3 f0 Z3 w# |2 s
self.readsList.append(fragment[:readslength])0 P/ K+ {! n, R) _
4 [6 W4 d# }! R1 T: x
def singlereadsequencing(self, genomedata, sequencingResult):2 s, ?" i8 x9 `2 L- ^: K
for seq_record in SeqIO.parse(genomedata, "fasta"):
) G, G$ H/ ]# f# T: | seqlen = len(seq_record)0 t, x. E6 J7 {# ?7 _1 `
self.genomeLength += seqlen* Z" d3 R) P4 ~
for i in range(self.N):
" T1 U0 h5 t2 h J0 _+ V7 l" j # 生成断裂点& ^; B3 i, Q! {/ ?5 G" S( P
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
+ v* T3 P+ P+ T' l # 沿断裂点打断基因组
* Y2 {6 c# a: A3 \( s# w3 | self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)- l2 a0 n& p$ Q! |$ O" N
# 模拟克隆时的随机丢失情况
! J8 ^6 {' h* V% y i clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
: V) T* a/ n9 p2 Y # 模拟单端测序 m+ A2 R8 N$ ?3 d( ^. L: n+ I9 n
self.singleread(clonedfragmentList)* v2 `$ E! \" B f
SeqIO.write(self.readsList, sequencingResult, "fasta")7 u& j6 p1 [$ Y+ X/ D1 E; ?
, @5 v, F( E0 |% E6 a1 ?& m
def pairread(self, clonedfragmentList):
( W" c# S. L. T1 u# ? for fragment in clonedfragmentList:) ~- h( Y& K( Z. z! N s
fragment.id = "" W, x0 N6 r/ O8 M
fragment.name = ""9 [ I0 {" } p. A
description = fragment.description[12:].split(",")[0]
H% H$ V, t! {6 ^+ T8 A& r fragment.description = str(self.readsID) + "." + description
$ R2 r, M. k& M3 W readslength = random.randint(self.minreadslength, self.maxreadslength)
, F# L6 v, r: d) ]6 x3 w8 M self.allreadslength += readslength1 _! [ e9 d" B% S3 B- d7 S
self.readsList.append(fragment[:readslength])
6 v7 P" |# [' l& j, V7 }% _# V
' a2 |. V/ }% H I readslength = random.randint(self.minreadslength, self.maxreadslength)) t' K0 y }9 Q; @
self.allreadslength += readslength
5 ?5 }0 l% v6 N4 m! k3 |* V% F* ~5 P$ Z) T6 n' K, d; N, {3 m
fragmentcomplement = fragment.reverse_complement()
, P S O$ g! Y9 R1 H fragmentcomplement.id = ""
0 ?; N. ~1 f3 j7 C2 \8 Q fragmentcomplement.name = ""7 g. r8 J, `* Z9 r5 j4 F. F
fragmentcomplement.description = str(self.readsID) + "." + description
- c. m. h7 U+ \# D& x- I4 l self.readsList.append(fragmentcomplement[:readslength])
9 l/ F- a7 S7 D# }9 J9 H/ ^$ J& r# M" v; S& L/ K5 F
self.readsID += 1
. U- z# r4 Z8 K; c# w
( W( Q6 ~2 p0 L3 Q2 y3 g- B def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2):" {$ l0 _3 \$ X/ m2 I. N& ^- P! H
for seq_record in SeqIO.parse(genomedata, "fasta"):. d$ E, Q3 D: J: @6 M0 ]% S% m2 R
seqlen = len(seq_record)
% N1 ]1 e1 e! Z$ V self.genomeLength += seqlen& b: v# c- x/ r3 h8 X* \3 y6 s4 ]+ v
for i in range(self.N):
6 l8 V) d- q$ X1 b7 f9 F1 Q # 生成断裂点. a' J8 l4 E, k* a# a
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)7 O- O0 U! k/ m4 u: q& V- W. E% b
# 沿断裂点打断基因组6 O/ y% d, R* y
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
' h6 [- o7 W: s0 T/ a* ^ # 模拟克隆时的随机丢失情况4 }5 n1 O& I+ h: A9 n
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)/ l% ~* C0 {7 M. m
# 模拟双端测序$ B3 R: n' i$ j2 j. w0 o
self.pairread(clonedfragmentList)
1 G$ ]2 ]8 g' ]0 D readsList_1 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 0]( i/ |& W4 X8 L" Y
readsList_2 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 1]1 K7 r& U# P& L- v k
SeqIO.write(readsList_1, sequencingResult_1, "fasta")
7 R9 M/ v, `& J5 E SeqIO.write(readsList_2, sequencingResult_2, "fasta")7 C6 [1 } b" }$ ~( x
6 ]/ U5 E v" Q2 E) c% y" ^/ o/ Z5 Z
def resultsummary(self): O: ^; ]+ _1 x- P4 T; @+ O
print("基因组长度:" + str(self.genomeLength / 1000) + "kb")( T$ S5 H) w1 k R3 G
print("片段长度区间:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength))8 {3 U4 F, ^8 m& E6 M5 [
print("N值:" + str(self.N))
/ _) g* o* c6 w! x/ a* y' Z. @ print("期望片段长度:" + str(self.averagefragmentlength))2 B5 w. z( m' s6 c) G
print("克隆保留率:" + str(self.cloneRetainprobability))4 W3 U& a# ^( p( S* Q
print("片段数量:" + str(len(self.fragmentList)))3 }, d# B2 i' L% b7 H# L9 n1 P
print("reads长度:" + str(self.minreadslength) + "-" + str(self.maxreadslength))
8 q: z2 @8 R9 C. M print("reads总数量:" + str(len(self.readsList))): D+ L, Z% Q2 [
print("reads总长度:" + str(self.allreadslength / 1000) + "kb")
+ W& X1 r# o$ ?3 Q7 R8 d m = self.allreadslength / self.genomeLength& [; E3 V7 Y3 G3 g; b
print("覆盖度(m值):" + str(round(m, 5)))+ O' Z F1 d8 Y% t5 s$ W
print("理论丢失率(e^-m):" + str(round(exp(-m), 5)))1 z. D+ c0 b9 Z" s
print("覆盖率(1-e^-m):" + str(round(1 - exp(-m), 5)))
: {0 v& A- {# D/ u) U# -------------------------------------------主程序-------------------------------------------
+ k: j( ]' _) {0 }# 模拟单端测序
s. |3 z/ }! ]5 x/ usequencingObj = Sequencing()
3 H5 e, E( q) _+ d4 XsequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")- m& ]: ~6 v' D
sequencingObj.resultsummary()0 R( j. [! _- e: c
; R6 C9 j3 P/ |
# 模拟双端测序 p* Y0 B6 x# [8 b: y
sequencingObj = Sequencing()* h2 T& @0 ^7 C3 L" A
sequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa")5 u- p% E7 U T U9 I. @: _
sequencingObj.resultsummary()
( i8 {* a: k4 U" |from Bio import SeqIO4 D% e0 p& {3 g c/ d
from math import exp
, d$ ~* X P+ I0 s8 timport random, H- H8 M* q' h& R& s
8 R# H0 ?% d- ^1 Yclass Sequencing:% ~) r ]5 y2 {7 n/ Q2 J
# N代表拷贝份数( x$ g Y7 M+ ~" Q3 o) s
def __init__(self):
. T- @3 w# ~+ {' G( J+ A self.fragmentList = []
) g C6 J9 c5 _2 u) k$ B7 t& h8 S self.readsID = 1: q6 {& G8 D7 Y
self.readsList = []1 c, r& I: \* R
self.averagefragmentlength = 650% d6 o9 s; Z: I! B' W7 o# w: M4 z
self.minfragmentlength = 500
3 j! }5 y* }. Y e self.maxfragmentlength = 800
7 }9 _6 ^5 v* g. s- p1 L self.cloneRetainprobability = 1
] w1 b4 F) o7 H self.minreadslength = 504 {1 F' a8 ^2 ]! z
self.maxreadslength = 150 f# R/ i& H0 I2 G% L
self.N = 10
3 I& e; R% y% E- R7 K self.genomeLength = 02 H) o8 O2 \& @& n2 N
self.allreadslength = 0& ?+ u5 l9 x8 u t8 }# L
6 q. y& U3 {. }- Z
# 生成断裂点
* f4 H, s# ?$ j- u5 c def generatebreakpoint(self, seqlen, averageLength):
0 Y. b4 R1 r# O( z- I6 P9 h, Z # 假设平均每500bp 产生一个断裂点(averageLength = 500),通过随机函数生成seqlen/500个随机断裂点(1到seqlen之间的随机整数)
+ k# t7 z! @4 @6 L breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))]6 c) n m* |# e$ d2 c- y
breakpoint.append(seqlen)
; }1 T) v9 }* I. g; ]0 r. m( a; g breakpoint.append(0)$ X& F( F9 B+ S! F5 j
# 把随机断裂点从小到大排序
3 q! Z4 d% Y- ~$ b4 m9 @ breakpoint.sort()8 d2 w( H/ D7 |. I8 b* n* H3 _
return breakpoint
( Y$ n# J4 l2 ^8 M' e; C
' N6 J% i+ t# u # 沿断裂点打断基因组,并删除不符合长度要求的序列片段,定义片段范围:200-1000bp
" t" Y" ~! S7 ]6 F0 u def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength):
0 v9 n( S8 C3 t2 ]% ^' J6 l; p for i in range(len(breakpoint) - 1):
0 D1 Y$ ?$ s. @$ |1 t fragment = seq[breakpoint:breakpoint[i + 1]]
! d' K6 S8 J/ d( `& A4 r9 d% \! C9 }/ c if maxfragmentlength > len(fragment) > minfragmentlength:
. r) n* S+ z: t2 E' |* t2 h$ h, a self.fragmentList.append(fragment). n7 [! u- u3 k; `" X
return self.fragmentList
3 l9 M8 C8 V6 Z9 h/ Z0 p9 h. Y: \3 _. I% e- L$ E
# 模拟克隆时的随机丢失情况,random.random()生成0-1的保留概率1 e) M! L5 o* @# {7 H
def clonefragment(self, fragmentList, cloneRetainprobability):7 j/ ~0 C+ j# N3 k
clonedfragmentList = []
6 n8 U1 o: S8 Z( p2 o; }6 W2 o Lossprobability = [random.random() for _ in range(len(fragmentList))]+ K7 `& O6 [/ p5 B& P# }
for i in range(len(fragmentList)):
$ W4 @. [" S* H+ w% i7 l if Lossprobability <= cloneRetainprobability:
+ n' [& \% \3 f0 }8 Y X# v$ w2 z clonedfragmentList.append(fragmentList). @7 d. q { E
return clonedfragmentList
7 @9 t/ g( O5 E: _: c7 `* q2 r/ n5 h
; B$ G0 ]7 L0 ]4 L1 m# o # 模拟单端测序,并修改reads的ID号
* {, o/ t4 n+ p8 o def singleread(self, clonedfragmentList):9 |. ^! T5 f( a/ `3 K9 F
for fragment in clonedfragmentList:
( t0 W9 o9 L# q. S3 P. B fragment.id = ""
, D, b7 \' G1 T) _* Q fragment.name = ""* R' M, a* d1 \
fragment.description = fragment.description[12:].split(",")[0]( H# t1 d* e8 |; `1 c3 l
fragment.description = str(self.readsID) + "." + fragment.description* F3 l e1 f6 p A: R( A& c
self.readsID += 1
3 M1 ^! T1 p+ Y9 W1 Y readslength = random.randint(self.minreadslength, self.maxreadslength)* u2 T* w9 ~8 @/ r: Z
self.allreadslength += readslength' s3 x4 a, D. j& |- I
self.readsList.append(fragment[:readslength])
2 \7 M/ p4 a7 f5 O6 r
6 B! B8 e' r# _0 o def singlereadsequencing(self, genomedata, sequencingResult):! s: K. E4 l2 @2 E( P
for seq_record in SeqIO.parse(genomedata, "fasta"):
) }& c6 D2 P8 i2 @" ^ seqlen = len(seq_record)
& \* g, L4 s" }% s* L5 B. Y' m self.genomeLength += seqlen
8 u) P% K0 k1 |. s/ _6 ` for i in range(self.N):" d( R) i0 V: ]3 f( w- e# f
# 生成断裂点
0 @& n2 c6 r- W$ B: w9 A7 U( M% l; \4 W breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)/ _7 N3 L7 ?# S f2 n! v8 e( ~5 Y. ^
# 沿断裂点打断基因组
' h" H7 E/ w0 | self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)1 V$ R8 {' F* A9 P8 z4 L5 S p
# 模拟克隆时的随机丢失情况* B! d6 V( E" Z h+ P9 g
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)) K0 J- K/ s. u8 a6 `
# 模拟单端测序
% P/ c; U3 n: P8 T8 w self.singleread(clonedfragmentList)
4 m" e) F8 Q" t! w9 k% O SeqIO.write(self.readsList, sequencingResult, "fasta")
1 z* W' A, N! J) ^ |
2 c6 U1 w" @0 O def pairread(self, clonedfragmentList):
& }( I1 z! @6 K for fragment in clonedfragmentList:1 W4 M1 J, j j$ d: k
fragment.id = "": D7 E8 F0 y" p7 S: q
fragment.name = ""
0 e; k; h3 r* Z5 J/ x, [ M, } description = fragment.description[12:].split(",")[0]+ z$ B* N Q: ?: }; e) j: r
fragment.description = str(self.readsID) + "." + description
9 V" Q s! Q) S: R readslength = random.randint(self.minreadslength, self.maxreadslength)) {* a5 M2 [/ H$ Q9 O
self.allreadslength += readslength5 ^3 b8 `3 f2 j
self.readsList.append(fragment[:readslength])
+ a* @& _5 i$ S7 Q. e* H. o
2 E; k4 O M+ I k% J$ ] readslength = random.randint(self.minreadslength, self.maxreadslength)# Y5 E1 k' N* G
self.allreadslength += readslength/ {6 D% m. |1 a( g3 |
" H4 v% b0 S2 R0 t9 C2 Z fragmentcomplement = fragment.reverse_complement()1 \5 c0 \+ A/ z' t) p) D( y3 i+ D
fragmentcomplement.id = ""% o- O3 G8 |; V2 p4 S- F; S
fragmentcomplement.name = ""
9 i- C+ I" _( V: K, ]. a+ J u" z fragmentcomplement.description = str(self.readsID) + "." + description j- l/ s9 d. i/ h4 u1 w- d
self.readsList.append(fragmentcomplement[:readslength])
- X" t. X7 N' N8 p/ ?+ G5 w% z$ a4 B& N. ?: F. _
self.readsID += 1
/ c/ P) Z- P& ?/ D- f- d& Z" O8 N# h1 ^
def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2):- o# N/ Z2 v& H) N* t$ Q
for seq_record in SeqIO.parse(genomedata, "fasta"):
3 @) H+ m7 X5 M/ D seqlen = len(seq_record)# X2 v: m' U3 b# a
self.genomeLength += seqlen2 J. j0 Z( U5 n+ a
for i in range(self.N):7 ^$ w# E6 r6 j3 I0 ?1 L
# 生成断裂点
2 x+ v0 f2 E1 W& J& v$ ? breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
J8 x" _/ @/ g5 S& Y3 w2 x # 沿断裂点打断基因组
0 {6 Y" M! Y4 a b$ q @ self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
, R. Z. Y( \; e1 ~1 i& e5 V$ ~; F+ { # 模拟克隆时的随机丢失情况
$ E" A, f. T% X: Y clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
) D- e, B9 s! T6 { # 模拟双端测序8 y$ V; Q! R' |7 d- V
self.pairread(clonedfragmentList)
7 N( v' A2 }, M3 h6 s) m% P readsList_1 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 0]
2 \: c% j8 j2 y, C5 V: {4 j! P readsList_2 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 1]5 P5 J6 ^9 v; x$ r5 E* N9 I3 N; s" b
SeqIO.write(readsList_1, sequencingResult_1, "fasta")
9 _4 v4 j3 }4 v7 B+ ?( B SeqIO.write(readsList_2, sequencingResult_2, "fasta")2 Z1 A& D, u5 H+ e
8 h! N0 X6 u2 @, m
def resultsummary(self):
0 R+ Z7 T2 ~3 P3 r' o print("基因组长度:" + str(self.genomeLength / 1000) + "kb")
7 {4 {/ ?6 {- ?/ g: s" u: {6 L print("片段长度区间:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength))
) D$ `: `8 C! m' }6 m% R print("N值:" + str(self.N))6 G8 A: S) S/ u' d7 I. B0 A; T
print("期望片段长度:" + str(self.averagefragmentlength))
0 ~) s1 f0 W4 F. Y L! O print("克隆保留率:" + str(self.cloneRetainprobability))
& J3 R" j0 N7 P$ A% U7 a$ _! a print("片段数量:" + str(len(self.fragmentList)))/ V% Z% l3 j- T* ^1 z
print("reads长度:" + str(self.minreadslength) + "-" + str(self.maxreadslength))
* b/ t& A, m( T) d print("reads总数量:" + str(len(self.readsList)))
6 s% C; D0 f$ |& w! x. i% V& ] print("reads总长度:" + str(self.allreadslength / 1000) + "kb")1 A3 ^( {1 L) X) T8 o Q# y9 K
m = self.allreadslength / self.genomeLength
9 C* w4 g0 }( e5 C) o: X print("覆盖度(m值):" + str(round(m, 5)))
2 K& |& c. E# _! J print("理论丢失率(e^-m):" + str(round(exp(-m), 5)))
, N! R) C6 _. F- i9 K# @ print("覆盖率(1-e^-m):" + str(round(1 - exp(-m), 5)))4 Y# H m% K+ }% K4 V, C( X( g
# -------------------------------------------主程序-------------------------------------------
1 h4 [; ?, U. W( {# 模拟单端测序/ r% V% p0 y" R" \
sequencingObj = Sequencing()
$ i. e& l2 k% OsequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")
$ }' I/ z: [9 _1 h- N4 |sequencingObj.resultsummary()
& f% c- p8 H5 J. T1 {2 G( ?9 O3 A E
# 模拟双端测序
* Q3 U1 Y% j4 o! WsequencingObj = Sequencing()
- r4 G2 O% s" V8 IsequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa")
, `" J7 q# p& f" H- }* XsequencingObj.resultsummary(); f+ O2 i* c; G
, z8 _1 z* F7 t
6 P# q8 C7 c" Y1 S! P7 ~3 x/ L0 v
$ @( t& U, @& d) s* R
% y3 @. N' U+ i" m+ s |
zan
|