- 在线时间
- 1630 小时
- 最后登录
- 2024-1-29
- 注册时间
- 2017-5-16
- 听众数
- 82
- 收听数
- 1
- 能力
- 120 分
- 体力
- 564628 点
- 威望
- 12 点
- 阅读权限
- 255
- 积分
- 174611
- 相册
- 1
- 日志
- 0
- 记录
- 0
- 帖子
- 5313
- 主题
- 5273
- 精华
- 3
- 分享
- 0
- 好友
- 163
TA的每日心情 | 开心 2021-8-11 17:59 |
|---|
签到天数: 17 天 [LV.4]偶尔看看III 网络挑战赛参赛者 网络挑战赛参赛者 - 自我介绍
- 本人女,毕业于内蒙古科技大学,担任文职专业,毕业专业英语。
 群组: 2018美赛大象算法课程 群组: 2018美赛护航培训课程 群组: 2019年 数学中国站长建 群组: 2019年数据分析师课程 群组: 2018年大象老师国赛优 |
基因组测序模拟5 L# L( c' T0 @" i, e1 t( R, N) ^
基因组测序模拟
; K n5 d! G, V4 Q! g6 g7 ]! W* s$ Z; `' k
一、摘要8 }2 T1 w P# u. G! ~
# \+ Y4 q2 q0 c" v) i
通过熟悉已有的基因组测序模拟和评估程序,加深全基因组鸟枪法测序原理的理解,并且能够编写程序模拟全基因组鸟枪法测序,理解覆盖度、测序深度、拷贝数等概念,设置测序相关参数,生成单端/双端测序结果文件& L: K( O! {& ^8 t, \
1 h# \( z/ c" i: S3 G1 S二、材料和方法$ R1 F1 Z: p( \
# K0 V4 u0 h" |, n7 e" p1、硬件平台5 Y6 ~2 C" ^9 |+ f) q8 ?. V" n
& s7 ~' f" z/ m+ A( }; R& g% o) j
处理器:Intel(R) Core(TM)i7-4710MQ CPU @ 2.50GHz 7 _1 I: X; E# q( E5 ^) W
安装内存(RAM):16.0GB
+ T, ]) w( S; K& \: c; o- ~, A$ p4 L* C
2、系统平台6 w* G( q: C9 a: B& k
Windows 8.1,Ubuntu
/ U. \( T' b, J# J+ I, @' @: R# [" ^0 ~4 O
3、软件平台2 I5 L9 k5 l, ~) t, b, B
1 [' p4 ]) O3 A3 [0 ]- B: v$ ^1 iart_454+ ~' o! T' v) a, q
GenomeABC http://crdd.osdd.net/raghava/genomeabc/
' M" S+ V3 h& { b6 I; zPython3.5: W" v! N3 _6 B; I7 T, e
Biopython
( E" Y* d; \7 v4 z k8 g+ P& s, c& ~4 Y4、数据库资源
a7 m, ~* t8 {, J8 D) X" @! _$ F% s! Q, P
NCBI数据库:https://www.ncbi.nlm.nih.gov/. l2 C- H5 z7 R
; i- R! f5 k% Y5、研究对象0 U, U: x% ~* f0 _2 @, Q
% G* i: ?9 w t y, w7 X
酵母基因组Saccharomyces cerevisiae S288c (assembly R64)
) y. O# q' c9 Z9 t! G' ?0 @ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz
) F$ f1 ?2 q! f0 |, `
! w4 Q t7 a o0 Y4 Y7 ?" O6、方法
% H3 d# J, @1 e1 I( k4 ^1 J3 N
( I0 a7 h8 L, I' sart_454的使用
, [' Y5 P9 u2 i, g! U6 q首先至art系列软件的官网,下载软件,在ubuntu系统安装,然后阅读相关参数设置的帮助文档,运行程序。* c0 T! v: s/ {/ J3 }
GenomeABC
/ _8 W5 J. \3 R' D4 M& p进入GenomeABC(http://crdd.osdd.net/raghava/genomeabc/),输入参数,获得模拟测序结果。
X: N% y+ |! v$ h编程模拟测序 % \# m8 i {0 J
下载安装python,并且安装biopython扩展模块,编写程序,模拟单端/双端测序。7 B0 m5 j; O: K! O
三、结果
0 Q- Y0 M# H& _7 K/ Z$ h9 T7 r1 R& ^7 G6 j
1、art_454的运行结果; B9 W& n4 |) b8 H" }+ h$ U2 X3 K$ Y
( i. {. U L0 C F8 o* m6 `无参数art_454运行,阅读帮助文档
/ B' D; T. T4 ]+ [1 U5 A8 f. @9 B9 u! m2 x; e2 X% V" p
图表 1无参数art_454运行 . X9 D' Z$ w' m `: L9 B
对酵母基因组进行基因组单端测序模拟,FOLD_COVERAGE设为20,即覆盖度为20.
% k( p3 V. L# F$ }9 r9 n下图为模拟单端测序,程序运行过程及结果 & @4 o( V4 G9 g6 `* f! E1 B
# f/ b* _ V+ F
图表 2 art454单端测序 * V7 [/ T0 j9 U$ K9 i# L
2 V! @7 v/ `9 J4 N$ ?/ N
图表 3 art454单端模拟结果
6 E. y/ i" d& C3 M: C8 ~. ~双端测序模拟,FOLD_COVERAGE设为20,即覆盖度为20;MEAN_FRAG_LEN设为1500,即平均片段长度为1500;STD_DEV设为20,即长度的标准差为20
3 o! W) K8 Z) T; x下图为模拟双端测序,程序运行过程及结果
- h0 f. Z1 u) P1 f6 W( U$ c: M* D2 v
图表 4 art454双端测序 & f/ K& G; p! l. f, `
! [& H' z) N3 a& m W4 `, ]# B
图表 5 art454双端模拟结果
/ w& I7 K. [8 e% s; U: |. Q2、GenomeABC
. h0 b2 x% D9 S) `8 ]; T# V9 u下图为设置参数页面
+ H' i' h1 @, f7 h4 t
6 f" a; D1 h: v6 w# k; `2 q下图为结果下载页面
7 E, g3 i: i; b( Y, M- y7 [6 B) O ~! x
图表 6 结果下载页面
$ e3 a! ]4 p4 a% V& `3、编程模拟测序结果
5 B, p2 _! |3 q% n拷贝数是这里的N值;覆盖度是m,测序深度是宏观的量,在这里与覆盖度意思相同,就是测序仪10X,20X。
1 I' `/ a7 n5 ^0 l* Q J单端测序
5 L' e# B) {1 x. ~" @ q7 X3 h' \7 C
图表 7 程序模拟单端测序
: a' Q; B! Y& k0 R7 y双端测序 i O- ^2 \) H$ o
, _$ @/ k4 I$ @, f g
图表 8 程序模拟双端测序 ) q" X, ]2 ^7 c$ ?* [2 D
测序结果
. f6 \* n( Q+ H/ Y- t
7 R) ^8 ?8 R' y6 B" M( v图表 9 结果文件+ l1 X. A, t6 v8 a" A) o( O
' `; J$ v0 Q; E& s R因为期望片段长度是600bp,在片段长度区间200-1000bp内,所以大部分的片段都没有删除。 ' ~6 a" V/ V; G1 D; O
测序结果统计表; J* ^' j1 V" b& X0 h
$ B3 _. B! P6 X1 T; l测序方式 基因组大小(bp) 片段长度区间 (bp) N值 期望片段长度 克隆保留率 片段数量 Reads长度范围(bp) Reads总数量 Reads总长度 覆盖度(m值) 理论丢失率(e-m) 覆盖率(1-e-m)
$ E4 U3 R7 [* S# a$ u" J+ Z2 C单端 12157kb 200-1000 10 600 0.95 107378 50-100 101968 7645.541kb 0.62889 0.53318 0.46682
/ a5 ~$ B6 j, d6 ^8 W" X& S单端 12157kb 200-1000 20 600 0.95 213722 50-100 202996 15227.882kb 1.25259 0.28576 0.71424
( h' Z5 l5 N+ {0 h' B双端 12157kb 200-1000 10 600 0.95 106704 50-100 202770 15212.662kb 1.25134 0.28612 0.713882 L) W8 i# Q' A5 d& O
双端 12157kb 200-1000 20 600 0.95 214212 50-100 407186 30534.265kb 2.51164 0.08114 0.91886! V# @4 \! n! c4 R
四、讨论和结论
9 j# O; m( } H% n& k% I/ `( v1 u6 {2 x) Q" \9 ~% U
程序运行方法
& H6 D+ K( N1 f7 C3 m# w; Q8 a5 S |/ Q. T* Z
在类的构造方法init()中,调整参数。
3 s! i. R+ I* o5 DAveragefragmentlength为片段平均的长度;
8 H$ g y! L$ z x1 Q% sminfragmentlength和maxfragmentlength是保留片段的范围; " M3 R9 `: q5 Y% |* v
cloneRetainprobability是克隆的保留率;
: @: v) I0 i/ b% L0 D. w2 V) u+ nminreadslength和maxreadslength是测序reads的长度范围: k% G( v- K* s Z6 |8 O- n
" A. z$ `) a* I1 I: H; F; \0 T模拟测序的诸多方法都封装成了Sequencing类,只需要创建类,并调用singlereadsequencing()和pairreadsequencing()方法,传入文件名的参数即可。
; f9 H% J: F6 B- B
, t8 k) Z/ w' a3 y9 O5 y" g* O附录
# c" d2 {3 i4 n# X
0 _6 S% D1 j- ] zfrom Bio import SeqIO( M2 @! {( L+ A' X
from math import exp7 B! d, L* @+ R
import random4 a% a# X4 T& ~, D
& V( k) o, z3 lclass Sequencing:
' {" t" a% a4 I- S; L- W* W- X # N代表拷贝份数
( i4 M% [# i* _8 s) i0 a2 y, z def __init__(self)! |: ~% s. s( c# a; K
self.fragmentList = []
$ ~6 E) j( s2 R3 [' z$ Q self.readsID = 12 |3 K; [8 @! t2 M
self.readsList = []* ^3 g% V+ E; _8 _& J8 i) G
self.averagefragmentlength = 650
' P: I3 @6 d0 ~* w2 b self.minfragmentlength = 500! T1 h K( R8 H; C9 F
self.maxfragmentlength = 800$ a! ]( Q H H
self.cloneRetainprobability = 1
* w9 P( V% g' @" }! I0 b self.minreadslength = 50
+ W& \1 U# Q* \7 Z self.maxreadslength = 150
% {* b N, r" d# k2 B self.N = 10+ r3 ^/ @7 ~% |4 m! _) D' ?
self.genomeLength = 0
, A9 k& I& o7 k% k) T& {/ A% G self.allreadslength = 09 I; t' K0 d& \/ c9 _; M
. [6 o" h, I, R5 `" M # 生成断裂点
b9 ~8 @. h6 m. @ def generatebreakpoint(self, seqlen, averageLength):
5 z& m- l9 R+ } ]/ ? # 假设平均每500bp 产生一个断裂点(averageLength = 500),通过随机函数生成seqlen/500个随机断裂点(1到seqlen之间的随机整数)4 m' R5 S5 T$ i: ]7 S D4 R- r I
breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))]
/ C( m3 h% ]" b" |" G' k breakpoint.append(seqlen)
5 Y2 l# O3 r: T3 F' a8 Q. I: C6 J breakpoint.append(0)
( q3 p8 _% [1 z, G # 把随机断裂点从小到大排序 `, P& G6 n6 g2 l K
breakpoint.sort() `5 @7 T, z" q/ |* L9 B
return breakpoint5 ?6 j4 O& r+ j% X5 v
) j2 E- ^2 i. Q4 n; X; b; f
# 沿断裂点打断基因组,并删除不符合长度要求的序列片段,定义片段范围:200-1000bp
& g i9 N! S9 z: h def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength):
) L" l5 C; f, c4 |7 S for i in range(len(breakpoint) - 1):3 w5 j( T3 k0 B) P i
fragment = seq[breakpoint:breakpoint[i + 1]]
& M* J3 T, b* m/ ?3 H1 i& ?, Y- O if maxfragmentlength > len(fragment) > minfragmentlength:3 t) B7 N( M3 W
self.fragmentList.append(fragment)
9 f2 W* y4 A( V8 x, P0 I3 t' ?3 b return self.fragmentList# f2 z) i/ _6 o9 [
6 o' `. X8 R6 @6 J
# 模拟克隆时的随机丢失情况,random.random()生成0-1的保留概率; y# U9 |7 ?6 f
def clonefragment(self, fragmentList, cloneRetainprobability):. @& E. T9 ^% |0 u
clonedfragmentList = []3 O+ M) R% r1 L" @2 D, W1 J
Lossprobability = [random.random() for _ in range(len(fragmentList))]
) x* T' U5 h9 a9 [ for i in range(len(fragmentList)):1 H# h/ Y! A7 J/ y" X7 E* q, F+ \
if Lossprobability <= cloneRetainprobability:
7 r* ^- h0 `# `' {8 W clonedfragmentList.append(fragmentList)
1 H6 Z" x& H- \3 Z, Z/ Y return clonedfragmentList7 O- u$ i7 l' H+ S
# w( J! C: O* s& k # 模拟单端测序,并修改reads的ID号8 Y6 ]2 N" j x: d" A! Q( P3 v
def singleread(self, clonedfragmentList):
0 X, z1 O# e6 s8 c: q. K for fragment in clonedfragmentList:9 ^9 Z; J9 {% p& q9 G$ N5 }0 A
fragment.id = ""1 f% ?1 E! O$ j5 O
fragment.name = ""
( {, U: ~0 |" b4 m1 o* N* i6 j" c fragment.description = fragment.description[12:].split(",")[0], `" I+ A+ j, Q( i8 l- X
fragment.description = str(self.readsID) + "." + fragment.description" C& ~* l( v$ H9 C0 H
self.readsID += 1% f5 Y( {8 s9 \9 Q4 q( \* f3 K& P
readslength = random.randint(self.minreadslength, self.maxreadslength)$ E. f6 D l* [
self.allreadslength += readslength
# X* w9 P& [8 B2 M( P self.readsList.append(fragment[:readslength])' G. s0 u; z. R
. E' ^% X& J" p! B4 M
def singlereadsequencing(self, genomedata, sequencingResult):
3 H" t& b0 Y2 m1 I+ W for seq_record in SeqIO.parse(genomedata, "fasta"):" v2 Y; k7 `4 O/ M" c8 Z- x) z
seqlen = len(seq_record)
) c9 g- A7 |3 X- ~3 E* m: [ self.genomeLength += seqlen
8 z' C( d) `) c- s( ?0 d0 I8 M for i in range(self.N):/ Z/ U( U* N, L# e9 S5 P4 |* u
# 生成断裂点
- T( @# r% s5 Q3 w# F: C$ V% v breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
: d7 z3 g v+ F3 f2 D8 i! ^ # 沿断裂点打断基因组$ M O. s+ ^0 i; e+ a, N
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)$ R# X/ f% l- v# I; B q( u
# 模拟克隆时的随机丢失情况& M" D" H8 M. R
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)' i: Z2 H9 n% b/ v* H0 _
# 模拟单端测序
V& L, @ L- {7 \* x v* T& K3 S! Z self.singleread(clonedfragmentList)
& X* ^7 v2 E+ E3 i SeqIO.write(self.readsList, sequencingResult, "fasta")) A$ j7 J; W9 q4 s, v- B4 `/ P( P
6 F: F8 G G6 t; g. t def pairread(self, clonedfragmentList):
. `7 p% S& k; L o: o1 h! d for fragment in clonedfragmentList:' S6 ?8 }0 j* ?2 T& w) a
fragment.id = ""2 ` P8 h- _3 r7 k
fragment.name = ""
! ^! _ x9 L! t) x! S description = fragment.description[12:].split(",")[0]
1 K. m: g' h. w fragment.description = str(self.readsID) + "." + description
. ]( g: p- D# k) Y" ~6 A: @$ A- t readslength = random.randint(self.minreadslength, self.maxreadslength)
4 A/ d+ w7 s7 u& P( N+ s self.allreadslength += readslength6 {- e) T+ }! ]5 Q% c5 H" F( |
self.readsList.append(fragment[:readslength])
% u& q) ^: ~# u5 t: I6 d% f
2 e6 L5 ~7 e9 }4 ^1 u& ~ readslength = random.randint(self.minreadslength, self.maxreadslength)
* d. _0 {9 q, d* M- z self.allreadslength += readslength+ ~4 B/ a+ l Q5 \6 V
2 ?0 u2 j6 ~# x' n$ G- b fragmentcomplement = fragment.reverse_complement()
/ _$ ?* {2 W3 U fragmentcomplement.id = ""
/ t- j% h$ P9 Z- m0 |' [ fragmentcomplement.name = "". l% g. f8 C# r8 R; \& C y0 l
fragmentcomplement.description = str(self.readsID) + "." + description e6 j3 ]3 E* J0 g: G$ w2 H. x
self.readsList.append(fragmentcomplement[:readslength])
/ ?$ X; M* `, ^, Z( x
0 E! O, F* M5 r. g: g self.readsID += 1% w: T$ |9 v ]4 o# @! `- v
* \# ?- ?5 u e8 | l0 z( O def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2):
0 e8 d2 Q' e5 y, y/ C for seq_record in SeqIO.parse(genomedata, "fasta"):
. K& O4 {8 T# k/ }& R4 e9 |& k seqlen = len(seq_record)
+ G5 j. N: r3 g1 I' h self.genomeLength += seqlen
* S k2 c# p8 a. ^; r; E for i in range(self.N):9 j& h w5 h6 @+ {' J! }
# 生成断裂点
& q5 a" t5 t7 o R, K- J9 @ breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
9 M: O* H! d7 f # 沿断裂点打断基因组6 y+ U0 t" q- T2 R" }% k0 v
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
6 H1 K7 d/ X8 X # 模拟克隆时的随机丢失情况" p/ I0 d( X! P1 `
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)1 O/ q5 L/ U3 K+ y% e2 i
# 模拟双端测序
% r0 X" z4 m9 J7 F self.pairread(clonedfragmentList)
/ x7 |) F! Z! m! Q readsList_1 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 0]8 a0 Y3 O, p% R; c6 ^/ a6 f
readsList_2 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 1]4 _+ @5 U P' T
SeqIO.write(readsList_1, sequencingResult_1, "fasta")0 L/ _0 I8 C" }7 f" \
SeqIO.write(readsList_2, sequencingResult_2, "fasta"), n1 N1 X0 g' B2 J- C0 c
4 r, l6 V% |" h9 a. a
def resultsummary(self):
6 P7 B* u- [! h- @ ^# n/ |+ Y print("基因组长度:" + str(self.genomeLength / 1000) + "kb")
" I0 A+ M* v/ ]1 D* n print("片段长度区间:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength))6 d m# k G, f$ O: _6 \
print("N值:" + str(self.N))9 |: f6 w- h9 \
print("期望片段长度:" + str(self.averagefragmentlength))
) j: n+ X s. U2 B" m% D# u. J print("克隆保留率:" + str(self.cloneRetainprobability))4 N- o7 G+ e# c6 ]$ ^; h7 ?
print("片段数量:" + str(len(self.fragmentList)))
; ] I: ]% Y5 e3 w9 m5 ^- Q9 d* {% o print("reads长度:" + str(self.minreadslength) + "-" + str(self.maxreadslength))2 O8 ]# i1 I5 P& D S7 a
print("reads总数量:" + str(len(self.readsList)))
* x/ U2 L5 b9 j9 Z print("reads总长度:" + str(self.allreadslength / 1000) + "kb")
$ `- H1 V1 i( u. t9 m' L m = self.allreadslength / self.genomeLength
* Z. B+ y4 \2 q7 Y' o5 A print("覆盖度(m值):" + str(round(m, 5)))
6 w( r; t& K7 i- C% N7 r" r print("理论丢失率(e^-m):" + str(round(exp(-m), 5)))3 y3 X4 F& T0 g) F, ?0 l+ `
print("覆盖率(1-e^-m):" + str(round(1 - exp(-m), 5)))
3 Q/ J. R) h& T/ }" V# -------------------------------------------主程序-------------------------------------------7 c9 @ j) [# |% f
# 模拟单端测序0 i( v) N( ]9 O- W* p# N
sequencingObj = Sequencing()8 X+ r# g Y0 P) G# [$ z4 d9 p
sequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")! J6 q" `6 ~. R7 [% L, D3 V- q! r
sequencingObj.resultsummary(), Y) T6 L- u; Y) B& s$ K
" P: x9 P x) ? E$ @1 H c; E; a
# 模拟双端测序
# N: D3 P; X8 w6 B8 JsequencingObj = Sequencing()
3 ~" G8 F3 D1 n ]sequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa")
9 p2 {+ V) e: D. [sequencingObj.resultsummary()3 K! K4 @2 U$ P& ?$ H% h9 a, ~7 |
from Bio import SeqIO
p' K& i6 ^9 j1 y% p6 j( l3 S8 `from math import exp
2 x' f* Y% j. |" {. Y, H! l" J8 j2 }import random O: j) t" a/ W+ ]8 O5 m8 v4 {2 E
+ l7 h' L! D y4 Gclass Sequencing:
# p; R' V3 y$ H: c" Q8 e& V # N代表拷贝份数" ^5 x2 A1 [" X
def __init__(self):2 v* s+ j# p8 [$ a" L
self.fragmentList = []
% @: W g5 k0 D3 A! c' F- ?7 P self.readsID = 1) @1 x. K! T* t
self.readsList = []3 r) @' w: [4 a; b7 i+ q
self.averagefragmentlength = 650
: `% r, I1 E, q4 \ self.minfragmentlength = 500
( D; @& Z" z6 @+ b9 j3 U9 b( `8 d self.maxfragmentlength = 800
9 f& \2 O9 c% \3 Q+ ~ self.cloneRetainprobability = 1, L C" o3 V) N( {3 _
self.minreadslength = 50
0 L5 O& P( U# ^' R: i: I self.maxreadslength = 150
; G2 {* D6 S$ c; y. x* R4 A self.N = 10
( H# q$ D: u# H9 y self.genomeLength = 0
- P( W$ O6 v/ S$ \6 f# T9 R self.allreadslength = 07 z4 m3 D6 n0 d3 u1 ?4 X' J, l
u& O* p8 o6 E, a% B# x
# 生成断裂点
' c+ r, h3 f2 a8 W- r def generatebreakpoint(self, seqlen, averageLength):
! j) s! H( ?' U$ D p! @7 p* g # 假设平均每500bp 产生一个断裂点(averageLength = 500),通过随机函数生成seqlen/500个随机断裂点(1到seqlen之间的随机整数)7 d( s/ a! S, P0 F
breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))]
* B8 V2 ]) X% M breakpoint.append(seqlen)) c5 `0 l* N8 h; S! t! d! }" J
breakpoint.append(0)
4 ]1 ^% h+ D( C8 G # 把随机断裂点从小到大排序0 v; J9 `* f4 I) X9 w7 w9 _
breakpoint.sort()! z' ]) Y: l; c, s$ l) N4 j% e
return breakpoint
$ s: @. Y N) ?+ ~; `( t* d# m) ]" b9 R, k( ?" ?
# 沿断裂点打断基因组,并删除不符合长度要求的序列片段,定义片段范围:200-1000bp% s8 ]5 R I, x2 p4 N( ^; Z4 j
def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength): w: Q6 i% }6 P8 x
for i in range(len(breakpoint) - 1):' _; o0 M" h, j: n& E: P% Z
fragment = seq[breakpoint:breakpoint[i + 1]]+ c# g$ k5 }& d5 d
if maxfragmentlength > len(fragment) > minfragmentlength: B. ~1 A9 T1 _& W; f. O$ z
self.fragmentList.append(fragment)
- r4 B; T6 ~4 L! N# E0 z return self.fragmentList1 p" t7 |+ Z+ X h
$ k! y4 P' R, d9 l) h: X1 V- p # 模拟克隆时的随机丢失情况,random.random()生成0-1的保留概率$ o- m* R" H& _4 A
def clonefragment(self, fragmentList, cloneRetainprobability):
- M' e2 [* o# D* L clonedfragmentList = []$ C; y* s( \/ F+ k. m( N4 C
Lossprobability = [random.random() for _ in range(len(fragmentList))]: G }$ n- _, |4 j
for i in range(len(fragmentList)):
& a" z+ l* P4 b if Lossprobability <= cloneRetainprobability:
1 Y1 t9 K7 l r' L" s; m2 N clonedfragmentList.append(fragmentList)* O% e+ ~% f) _: U
return clonedfragmentList0 A4 ?! g+ P; a0 c: x
% P& x/ Z, S: @% ]" T: k( o
# 模拟单端测序,并修改reads的ID号$ M o d% Y# A+ f4 X
def singleread(self, clonedfragmentList):
, {6 C7 I7 \& r4 ^$ \! _) h4 X for fragment in clonedfragmentList:6 }& h. {3 r- ~
fragment.id = ""
# ]; _& L% K; P9 E4 u0 M: u fragment.name = ""
4 m/ F! Y" y- c% e! X7 t fragment.description = fragment.description[12:].split(",")[0], G }' m0 `( K; b* u4 g
fragment.description = str(self.readsID) + "." + fragment.description7 g* B) R& x, S% _( z' x
self.readsID += 1
F% H9 @1 g$ E+ V6 u! G readslength = random.randint(self.minreadslength, self.maxreadslength)
) B5 M' x' S6 F8 h) {7 f4 |1 y9 P self.allreadslength += readslength
9 g X* h J; r5 A. h, r7 F2 F self.readsList.append(fragment[:readslength])' h& ?9 H3 M s# u: E
$ n8 s8 Q6 |2 `2 N. R/ p% ?1 t def singlereadsequencing(self, genomedata, sequencingResult):
: b1 h* x' C2 o) T- U' c( q! L for seq_record in SeqIO.parse(genomedata, "fasta"):
3 c2 [% i: t5 m$ w seqlen = len(seq_record)
4 w6 c7 p3 [) i" c* ~ self.genomeLength += seqlen
5 }# j, M8 q# w4 G w& Y for i in range(self.N):
' K+ o! }! W1 s/ o6 C # 生成断裂点% E+ {7 C7 I+ O
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)! L( {8 Y, V3 o" X# L# k
# 沿断裂点打断基因组0 A/ V0 |+ Z: ~: S
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)# [8 O0 a+ D. A& B* W
# 模拟克隆时的随机丢失情况
8 [) Y# o1 ^. h+ H6 M8 M clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)) F. ^' Q) k, P9 ~( Y; k
# 模拟单端测序* r; i" K% e" I+ C( k6 t
self.singleread(clonedfragmentList)
& |4 \6 l% C& L SeqIO.write(self.readsList, sequencingResult, "fasta"), O% @4 e. U( D
- T- r9 l, @1 L; j; A% K5 }( N def pairread(self, clonedfragmentList):8 Q3 x7 ^, j, e3 Y" g5 f
for fragment in clonedfragmentList:
8 V" C* r; Q$ w |! q* a! n fragment.id = ""
: }1 y5 x5 q2 D+ i# \4 l- R fragment.name = ""
- t" v6 G; G y [ o description = fragment.description[12:].split(",")[0]
) M7 t g5 S' o% y8 Z4 c2 u& ~ fragment.description = str(self.readsID) + "." + description
6 {+ c! W$ A$ s, I readslength = random.randint(self.minreadslength, self.maxreadslength)
: B2 r! b# N# z9 E9 m" W self.allreadslength += readslength
8 M/ s2 O* A# r3 J3 @7 \/ O8 _3 X self.readsList.append(fragment[:readslength])
/ k6 K) R' v/ f9 g
+ ]8 s5 F$ r h" h readslength = random.randint(self.minreadslength, self.maxreadslength)
. q+ P5 q2 C" C4 A self.allreadslength += readslength
+ Y- O( v8 @/ @( O- c( D+ d+ w: e; f
fragmentcomplement = fragment.reverse_complement()/ @4 w E0 B, A& E0 T8 ^
fragmentcomplement.id = ""+ p) E$ `& O: f; X1 H( { c
fragmentcomplement.name = ""( w: T* \2 m5 |. S X! s/ P4 Y
fragmentcomplement.description = str(self.readsID) + "." + description# `7 t9 I4 F" J6 _. V
self.readsList.append(fragmentcomplement[:readslength])/ y3 ] V9 E8 f$ b X
* a1 h0 ?' ?8 c z7 V" X# s
self.readsID += 1
' Y& c3 T8 y5 i0 y; C! ~! t3 ]" W) K7 k2 @4 D# o1 \$ w: N/ ?# f; [
def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2):
' j7 ?$ \# [+ A$ C; ] for seq_record in SeqIO.parse(genomedata, "fasta"):
3 S- p$ x4 D$ u4 }3 F3 G. }& E# j seqlen = len(seq_record)# a2 g" _, s- `. \
self.genomeLength += seqlen
7 e" q% k4 D6 q6 ^/ ^3 d/ a& n for i in range(self.N):
+ j2 B3 c1 Y' Z) T # 生成断裂点. A, G" v2 ?+ V( V$ s2 o
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)0 p) p! b' `5 H8 ?2 \8 g6 }2 R
# 沿断裂点打断基因组/ \: x2 T2 @, g
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
0 G8 {: \, a, m: U4 ]! S7 n # 模拟克隆时的随机丢失情况' W6 |& G4 y! H- [9 V& Y$ s7 B7 N N: I
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
4 o g- t( S! {: o1 z # 模拟双端测序
+ ~( s+ q1 H1 B( i/ q: O8 a9 ^# p4 y: T self.pairread(clonedfragmentList)% L- h* i$ x; i% Y5 T m
readsList_1 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 0]& p: t% X: ^7 `9 w* M- m) t& W5 q
readsList_2 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 1]& t- Y2 R! U; H/ C G5 _
SeqIO.write(readsList_1, sequencingResult_1, "fasta") p6 i* x; V9 t+ ~- L' t W4 k1 S( S7 e
SeqIO.write(readsList_2, sequencingResult_2, "fasta")1 ]* \1 P; ~; t& S
" ]+ M) a; G7 l7 W def resultsummary(self):* ?( W$ M& I0 v1 ?; k2 M5 ^5 Q* [
print("基因组长度:" + str(self.genomeLength / 1000) + "kb")
+ S9 }+ E0 [ V; r print("片段长度区间:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength))/ ~% p2 o3 {) f; [, e
print("N值:" + str(self.N)). g6 Q* W* m8 V4 Q7 y0 |7 ?" Q
print("期望片段长度:" + str(self.averagefragmentlength))
; y, L8 \* R) z2 P) M print("克隆保留率:" + str(self.cloneRetainprobability)), U2 l! Z2 f( [5 R3 W- H
print("片段数量:" + str(len(self.fragmentList)))
. A1 P0 J. O6 i9 | print("reads长度:" + str(self.minreadslength) + "-" + str(self.maxreadslength))& D; s. Q7 K5 W: ]
print("reads总数量:" + str(len(self.readsList)))1 E0 l; }& L1 B" x( _* F6 J7 r
print("reads总长度:" + str(self.allreadslength / 1000) + "kb")
' Q2 x& m5 [" y: {9 N- l& o m = self.allreadslength / self.genomeLength
, j9 H2 N% |' r5 a print("覆盖度(m值):" + str(round(m, 5)))9 r2 z# M4 t& Y1 Y6 p& i! S
print("理论丢失率(e^-m):" + str(round(exp(-m), 5)))
8 g: ?7 q) e- A: Z: Z- [+ i print("覆盖率(1-e^-m):" + str(round(1 - exp(-m), 5)))" w1 |3 q1 Z6 n; B) N" z9 i) m
# -------------------------------------------主程序-------------------------------------------
) D: F/ s/ J6 N$ Q4 ?1 U& u# 模拟单端测序
$ w `# J% Q, e8 q% tsequencingObj = Sequencing(): J! Z) x$ i) i
sequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")- \; w) l0 W2 ~: c
sequencingObj.resultsummary()' C2 |& z* j1 t+ z/ w
$ ^. y* }- b; M* {, R% i5 Z( ]# 模拟双端测序4 b/ k" W% I7 w& d$ X3 N
sequencingObj = Sequencing()
$ W) _& X7 `9 X2 f* GsequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa")' e6 u$ }- ?+ t" K* r6 y' P$ v
sequencingObj.resultsummary(). h3 y% T# ~5 [& `! ]% v
2 f) W( x6 L$ g4 [! d: i8 x$ F( S0 N; Z9 X- U# F& s; {1 |' l
' [% |, z3 Q0 v6 D
6 ~0 d1 Q" B7 R* ^ |
zan
|