- 在线时间
- 1630 小时
- 最后登录
- 2024-1-29
- 注册时间
- 2017-5-16
- 听众数
- 82
- 收听数
- 1
- 能力
- 120 分
- 体力
- 563347 点
- 威望
- 12 点
- 阅读权限
- 255
- 积分
- 174227
- 相册
- 1
- 日志
- 0
- 记录
- 0
- 帖子
- 5313
- 主题
- 5273
- 精华
- 3
- 分享
- 0
- 好友
- 163
TA的每日心情 | 开心 2021-8-11 17:59 |
|---|
签到天数: 17 天 [LV.4]偶尔看看III 网络挑战赛参赛者 网络挑战赛参赛者 - 自我介绍
- 本人女,毕业于内蒙古科技大学,担任文职专业,毕业专业英语。
 群组: 2018美赛大象算法课程 群组: 2018美赛护航培训课程 群组: 2019年 数学中国站长建 群组: 2019年数据分析师课程 群组: 2018年大象老师国赛优 |
基因组测序模拟
5 y, w6 `" }* b; H' `基因组测序模拟/ J3 q# u( P+ ]. m9 B! K/ o
' w2 }' C" O- s, R9 G
一、摘要) D6 [) F* T: m& k9 V8 c( M
. c5 w6 m6 d. Q' h0 ?2 Z$ T
通过熟悉已有的基因组测序模拟和评估程序,加深全基因组鸟枪法测序原理的理解,并且能够编写程序模拟全基因组鸟枪法测序,理解覆盖度、测序深度、拷贝数等概念,设置测序相关参数,生成单端/双端测序结果文件) p3 y- p5 q. U d3 \9 f, V
4 a; J3 w; ?; Q5 I, |1 W二、材料和方法9 |$ E* I) l; [: }, |2 W& \
4 s W+ t0 S1 w( K8 ~, D$ d1、硬件平台
( g0 y- i' F: ?5 g1 u) @: w/ [
+ O) l6 j8 v! Q9 [: n9 l处理器:Intel(R) Core(TM)i7-4710MQ CPU @ 2.50GHz ' E3 N+ Q& F- O
安装内存(RAM):16.0GB
4 [8 P) w; W* P% z
0 m+ s1 a8 ~/ p d, `; L# W$ ~2 @2、系统平台
! E2 f) s/ K# A' O7 m" NWindows 8.1,Ubuntu
F& y' b% M; G" Y/ y2 `0 k+ Q" o" W' Z' g. f# E5 _' M6 P. n& d
3、软件平台- h1 n9 e" l& ^5 |: j9 H4 U) c
3 e! T$ a3 l& i6 D+ D; `/ Z2 N
art_454
8 ^# N. g% _( u$ dGenomeABC http://crdd.osdd.net/raghava/genomeabc// Z$ ~- h5 [6 h% D8 K( M* z& U
Python3.5
$ ^2 U% j7 F5 lBiopython
* L' N( \( z: _& d( n4、数据库资源9 e( L3 ]3 l' {) r( R! Z
3 d3 i1 J' {, j0 j: V; v/ j4 ]& ~NCBI数据库:https://www.ncbi.nlm.nih.gov/ y; V( k3 @# {1 s2 k
( e& y; n* h# U! P/ M% E7 H
5、研究对象. N6 E0 I# i, V5 P6 l) X3 e
4 g2 s, s: g( r$ }
酵母基因组Saccharomyces cerevisiae S288c (assembly R64)
$ U4 k2 s- q0 M# L' xftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz0 k3 e1 ~% {/ S) U! O9 R+ j0 z+ F% Q
5 k% K$ S) V) Z, Q+ V# a5 a$ h6 Z3 @
6、方法& [( x2 m1 P/ i; E' c
E5 v" D+ z9 N9 a2 O. |art_454的使用 : k: A9 ^1 X5 p; U$ [
首先至art系列软件的官网,下载软件,在ubuntu系统安装,然后阅读相关参数设置的帮助文档,运行程序。3 [% J7 h f- s( f3 j
GenomeABC
6 j6 T, ]9 Y! U2 H进入GenomeABC(http://crdd.osdd.net/raghava/genomeabc/),输入参数,获得模拟测序结果。
& N, r1 [! ^3 O0 Z% a8 Y编程模拟测序
_! a% f2 e: a1 z% c下载安装python,并且安装biopython扩展模块,编写程序,模拟单端/双端测序。. A( G% J8 f# D) _+ H$ `
三、结果
0 R0 \/ _3 m& K: o0 w9 @% H! F: a* \* D% H" Z" [3 G' l' }
1、art_454的运行结果
1 f2 J4 p% V/ e$ @
1 U* F3 o. {9 Z1 g. n无参数art_454运行,阅读帮助文档 + }, } I# j, _9 q1 G+ Z
% m1 a0 g7 p1 F; D4 A& v. }' q( O图表 1无参数art_454运行 * J9 J. O4 P% Q6 Y' I/ V8 n* \
对酵母基因组进行基因组单端测序模拟,FOLD_COVERAGE设为20,即覆盖度为20. 0 L1 h8 b' `0 E0 Z J; y# {8 r, o
下图为模拟单端测序,程序运行过程及结果 3 [4 a1 N0 p9 X& U# ]
Y! ]( {; r5 R图表 2 art454单端测序
2 {4 {6 W0 P& S3 c$ {
3 d6 O4 n& {9 { T- i9 g图表 3 art454单端模拟结果
# q Q: l6 w1 U4 k/ E: n双端测序模拟,FOLD_COVERAGE设为20,即覆盖度为20;MEAN_FRAG_LEN设为1500,即平均片段长度为1500;STD_DEV设为20,即长度的标准差为20 - D) `& D- l" ~: {+ Q4 F
下图为模拟双端测序,程序运行过程及结果
' L8 N; K; H* z2 u4 s* q: {( q9 F& J3 I
图表 4 art454双端测序 ; i- ?# G$ L" P) z( U8 H
1 ^# a2 o s6 S9 x/ N6 k
图表 5 art454双端模拟结果 ! Z* S7 M; R) u5 p
2、GenomeABC
+ M& r, z- O+ h5 }2 d) n0 ?下图为设置参数页面
0 X9 A: M: Z* L/ ^ B. S4 D$ F/ Y+ { c' W9 y% |, N
下图为结果下载页面
' k% Z+ ]. L) D0 Z, k* ]5 N; ]& D5 z+ ~4 V% Q
图表 6 结果下载页面 & @$ `: e9 B& y1 |$ _0 `
3、编程模拟测序结果 1 m" t4 p* l! A* D, c9 s
拷贝数是这里的N值;覆盖度是m,测序深度是宏观的量,在这里与覆盖度意思相同,就是测序仪10X,20X。
; q: X S8 `2 n( I8 W- I- B) ]单端测序 & b" s6 X$ A c( _
% y$ z. w5 D6 S/ ~9 j
图表 7 程序模拟单端测序
/ d$ m: n: q( j3 P4 l' K7 J- e. |0 S双端测序
* ?* ?7 h( @2 n# O* |% h
' {8 o$ Q9 D0 q6 b5 ]图表 8 程序模拟双端测序 0 V3 w K: s0 _0 c$ f3 k# v% K
测序结果 ) s" q2 P# w# d, Y0 I4 @
& n! J) m/ F& I& r* ]7 Y
图表 9 结果文件
0 U/ d- _0 I3 u- J) \% w9 G# s9 C+ y2 A
因为期望片段长度是600bp,在片段长度区间200-1000bp内,所以大部分的片段都没有删除。
# v7 H& o0 D% O2 Q6 F( D测序结果统计表
$ U6 v. M6 |: \ C1 \; e# v f: k( j3 p, Q* U
测序方式 基因组大小(bp) 片段长度区间 (bp) N值 期望片段长度 克隆保留率 片段数量 Reads长度范围(bp) Reads总数量 Reads总长度 覆盖度(m值) 理论丢失率(e-m) 覆盖率(1-e-m)4 m+ d* c* o; W9 Y1 S3 q8 s
单端 12157kb 200-1000 10 600 0.95 107378 50-100 101968 7645.541kb 0.62889 0.53318 0.46682
' S! ~6 O" k6 v$ c6 X9 t) o) N单端 12157kb 200-1000 20 600 0.95 213722 50-100 202996 15227.882kb 1.25259 0.28576 0.71424
' d+ S$ g1 G( ~ W) G( `1 ]6 D" C双端 12157kb 200-1000 10 600 0.95 106704 50-100 202770 15212.662kb 1.25134 0.28612 0.71388
, L* r; S' D2 S7 Z+ L双端 12157kb 200-1000 20 600 0.95 214212 50-100 407186 30534.265kb 2.51164 0.08114 0.91886 q, T9 n( H0 T' j
四、讨论和结论
" ]. ]. ]# g) n! f' S, s
8 S/ v" M9 b- }8 e, k1 G程序运行方法( R6 i# T Y% J* l* K
! p- x* z6 |2 W5 ~在类的构造方法init()中,调整参数。
( D& A, a& u: VAveragefragmentlength为片段平均的长度; . |; n. {: C" y( K. ^# h) [4 ?) T! s a
minfragmentlength和maxfragmentlength是保留片段的范围; / Q3 ~9 r- Z! e: B8 j9 j
cloneRetainprobability是克隆的保留率;
: T' X" [8 n0 ~8 Y/ Z8 fminreadslength和maxreadslength是测序reads的长度范围
4 r/ Y6 h- a. L$ o
5 h' v3 P- f3 s& J' Q) E n. c模拟测序的诸多方法都封装成了Sequencing类,只需要创建类,并调用singlereadsequencing()和pairreadsequencing()方法,传入文件名的参数即可。9 s' p; q7 G8 `! R. D# a* k! K
8 g5 C2 V" ~( C
附录
& W9 G; Y& I K" C9 L
1 E2 E( ]1 P2 T8 Qfrom Bio import SeqIO
7 J9 l4 y. M/ b; ~from math import exp
/ L2 ]3 c( m$ zimport random& w" {- q# J* M/ g
+ q1 x& t6 ]+ Q5 a% a U3 J# Jclass Sequencing:
5 R$ z. F4 U! Y( p ]* a # N代表拷贝份数2 U0 ~$ x4 j! p4 R n9 e8 x
def __init__(self); f! \ D' `1 O+ T' Z/ {7 X9 x
self.fragmentList = []& o$ b3 R$ Q( ~' G( Z1 {
self.readsID = 1
: z0 `; B5 J! j, F self.readsList = []
$ G _* C7 W. R& \/ Q$ `$ M$ R, c self.averagefragmentlength = 650
. _7 Q5 }9 a+ V; n+ Z# h! _+ X+ B self.minfragmentlength = 500
4 ?. x: R% a" ?8 K( R+ ~7 K self.maxfragmentlength = 800
4 h7 C6 R$ g6 n V3 Z& B self.cloneRetainprobability = 1* A5 s4 L2 m! N) r* e5 F" D$ T
self.minreadslength = 50+ ]9 G6 M) Y- Q, u
self.maxreadslength = 150
( l0 p9 p5 h" G! O+ z self.N = 10
1 U( B+ W3 u( m3 l* [ self.genomeLength = 0 m r+ `2 c) [) [* r2 v4 \7 o' Q
self.allreadslength = 03 `* l5 Z. R2 A) Z) B+ a B! k
- C& G& t) S7 }- C5 v8 ~$ U # 生成断裂点3 d7 C5 y0 ]! v' C( }8 n
def generatebreakpoint(self, seqlen, averageLength):
4 x7 S9 N, B ]0 [& L5 X, ], x+ z # 假设平均每500bp 产生一个断裂点(averageLength = 500),通过随机函数生成seqlen/500个随机断裂点(1到seqlen之间的随机整数)
* l. @6 q5 f+ [7 W& N' e breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))]
9 @" ^% D: i+ @/ R$ p8 i' _4 q breakpoint.append(seqlen)8 G. @9 ^; T, x6 y$ G
breakpoint.append(0)
# C4 ?0 q4 ~4 \4 h0 O- n o9 B # 把随机断裂点从小到大排序
0 N" E2 j* ]5 Z9 X! f. p" { breakpoint.sort()
2 l9 s3 }# t$ q1 n K" ~ return breakpoint
7 I5 m( b4 {6 f9 L8 h! `% a" X: Q
; {. }9 s; q, w4 W9 m3 n # 沿断裂点打断基因组,并删除不符合长度要求的序列片段,定义片段范围:200-1000bp
7 J) K; T" _! A8 C6 i def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength):0 e: x' Z2 ?+ \- V+ ~
for i in range(len(breakpoint) - 1):0 ?8 o8 X A* c1 N/ u6 t
fragment = seq[breakpoint:breakpoint[i + 1]]
& T0 T! \# S/ q! y& K% o3 J if maxfragmentlength > len(fragment) > minfragmentlength:
) t8 J$ ]0 B# l4 Y0 O% C+ ` self.fragmentList.append(fragment)
& {; V: F* p; j `' d2 h3 z0 G* J3 { return self.fragmentList
- b8 S [ J1 q# p
9 n6 F; R2 v4 x6 n9 D # 模拟克隆时的随机丢失情况,random.random()生成0-1的保留概率' H2 \* j+ K0 Z6 f$ e
def clonefragment(self, fragmentList, cloneRetainprobability):
O/ M8 b0 a, K6 o/ [( h. ?& H! ^ clonedfragmentList = []' p: M1 `1 l$ I6 [
Lossprobability = [random.random() for _ in range(len(fragmentList))]+ W7 E$ w- H( W% d
for i in range(len(fragmentList)):; u- g; R \; V. P4 _4 p
if Lossprobability <= cloneRetainprobability:' L. |! _- ~" Y) G# C0 \
clonedfragmentList.append(fragmentList)
, o( n* K: ?4 D( q return clonedfragmentList/ |8 A2 R; [+ p! W5 k1 q
/ h) w% J3 ~1 I, a! Y' d # 模拟单端测序,并修改reads的ID号
3 k9 |" _8 K% |" @ Z' k* t def singleread(self, clonedfragmentList):$ U2 `; S8 a) A7 E
for fragment in clonedfragmentList:. A$ G; Q5 V, \1 m+ y$ c
fragment.id = ""
; z. [& ?: E& x+ p, e. [ fragment.name = "" s7 N) X0 C" h, A+ L3 G& N- b
fragment.description = fragment.description[12:].split(",")[0]
) ^0 p: {2 u3 G7 _. V7 S+ C fragment.description = str(self.readsID) + "." + fragment.description
1 F* M. I3 M+ ~, S) q self.readsID += 1/ P+ ^. z: g3 o& }$ l7 j. R
readslength = random.randint(self.minreadslength, self.maxreadslength)
8 \5 e4 s. o y/ ?. I2 W2 e# @ self.allreadslength += readslength& P2 P* U' b- Q8 r* p- }
self.readsList.append(fragment[:readslength])# i0 X2 F0 r; {6 m; [
% @+ \( F- o- J% P# d; D def singlereadsequencing(self, genomedata, sequencingResult):* A! l7 I" q5 a8 M+ k
for seq_record in SeqIO.parse(genomedata, "fasta"):( Z) B9 W# |" U2 g
seqlen = len(seq_record)
* P& }4 f- o; Y self.genomeLength += seqlen
; g9 p* K; S. j+ r for i in range(self.N):, [$ b" S& J; \3 G3 ?
# 生成断裂点
& q7 b/ p% z0 {1 u8 Y) A# H0 S9 e breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength): e# x* S2 ^1 [
# 沿断裂点打断基因组9 m D/ t5 X) }( y/ O0 f2 M& S
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
3 I$ _! {0 L* W+ H- I, ` # 模拟克隆时的随机丢失情况" l5 F- b& m" t" U8 i) A# O
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
1 U4 E' g: j7 D i; k$ Q% G: a$ g5 O # 模拟单端测序
, c0 ]9 U+ H$ e self.singleread(clonedfragmentList); r( E2 o0 P: v8 s3 `# k
SeqIO.write(self.readsList, sequencingResult, "fasta")" f% k# y1 T5 x7 z- {$ D2 r/ x1 V0 E
9 M5 T+ ?( w0 ^' x3 A: l h& C
def pairread(self, clonedfragmentList):, F* B5 c0 k" R- R6 W& V9 C5 w
for fragment in clonedfragmentList:
. w) H9 q' [9 S fragment.id = ""6 U9 J/ X9 @8 r
fragment.name = ""
% z2 G$ I/ k6 f) ?% t# x description = fragment.description[12:].split(",")[0]8 O+ Q: H3 o/ t6 j/ G) s
fragment.description = str(self.readsID) + "." + description7 c, }/ Z, A7 j+ `# x( P
readslength = random.randint(self.minreadslength, self.maxreadslength)
' L/ p# E+ g+ X& x5 _0 t* U% c self.allreadslength += readslength4 l& x, q2 M7 z+ B* H
self.readsList.append(fragment[:readslength])/ \8 Z. m2 e1 `/ `
; I# F M9 W" e( C8 ]
readslength = random.randint(self.minreadslength, self.maxreadslength)$ O s7 q$ E& ? ]
self.allreadslength += readslength( M7 _, v% ^: b* W: L1 }
& {( ~3 @: p" C* d
fragmentcomplement = fragment.reverse_complement()7 n" T( n: z! S* H
fragmentcomplement.id = ""
. p0 E- J: }, l. w0 F9 W fragmentcomplement.name = ""
, [' `& R- E$ a, D6 H fragmentcomplement.description = str(self.readsID) + "." + description$ W( V, j5 ~! h5 a6 F0 |
self.readsList.append(fragmentcomplement[:readslength])
! B: W2 E2 p/ b0 [* t! p) a. Z; a0 z: f5 w
self.readsID += 1
- h, m; [7 w$ |6 a, h; ?1 O G
2 N' K' R$ c8 a, R def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2):
?6 W. h& H. M0 z1 V7 H for seq_record in SeqIO.parse(genomedata, "fasta"):
" W1 T$ }+ p% r seqlen = len(seq_record)
+ ^+ p h% e: y self.genomeLength += seqlen
; ]3 w1 [- s0 o for i in range(self.N):
; {0 H* N* `" H) U3 v: V% \ # 生成断裂点! A4 u5 l6 M- k/ ]$ T) {
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)4 Y+ E5 a) I' p' B* t/ M8 L
# 沿断裂点打断基因组
5 }% W' z5 v! W1 o! R self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)( O* h1 X2 ?; n
# 模拟克隆时的随机丢失情况7 O+ I8 ]4 ~$ h* i
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability): X: j& U7 [% z/ ^# c$ z) z
# 模拟双端测序% R5 z, s2 F4 q) U% E, [% J& Q) P2 V" H
self.pairread(clonedfragmentList)' v8 r; {+ m: z) l# d
readsList_1 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 0]
- [- H1 A% T' F% J% b$ ^$ X. \2 O readsList_2 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 1]
' s* l. P+ f& G SeqIO.write(readsList_1, sequencingResult_1, "fasta"). {( X* V' Y3 s3 C
SeqIO.write(readsList_2, sequencingResult_2, "fasta")3 W- {2 E7 h8 Y5 o \. _- ~, ~- u
. v1 P+ H8 I! b1 ]
def resultsummary(self):3 l: X. F/ e. E6 M- G+ r* o
print("基因组长度:" + str(self.genomeLength / 1000) + "kb")( [8 z/ ]2 ?8 ~; Q# @% h" e/ T
print("片段长度区间:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength))/ G) u2 y8 n3 G/ o
print("N值:" + str(self.N))
1 G. l$ }. H' g |% N print("期望片段长度:" + str(self.averagefragmentlength))
2 M2 h4 A+ E0 {0 i print("克隆保留率:" + str(self.cloneRetainprobability))
4 q' v6 ]. f7 a) \ T0 t5 B( L print("片段数量:" + str(len(self.fragmentList)))
, `& q1 f5 [+ C0 M$ ~: t- r. G* ]- G print("reads长度:" + str(self.minreadslength) + "-" + str(self.maxreadslength))) ^3 a5 @; f3 r! X
print("reads总数量:" + str(len(self.readsList)))' H( f; a5 z, {: ^* p% P
print("reads总长度:" + str(self.allreadslength / 1000) + "kb")
! u4 H3 x2 I2 m- F1 ^4 A m = self.allreadslength / self.genomeLength/ o! d4 ^. z5 K$ t- x1 c7 {
print("覆盖度(m值):" + str(round(m, 5)))+ P" { `+ X- k" \7 I& M! R
print("理论丢失率(e^-m):" + str(round(exp(-m), 5)))
8 Z2 E% N: [4 l0 R' b# W print("覆盖率(1-e^-m):" + str(round(1 - exp(-m), 5)))% I7 b4 U2 Z9 _ X, j) X! k
# -------------------------------------------主程序-------------------------------------------
1 ^, Q3 R+ G/ ^; Z/ ?# 模拟单端测序
. E7 Z7 O. g! B9 y0 ^sequencingObj = Sequencing()
+ z8 e, d- v2 W* X2 f$ k2 c: isequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")+ N E8 L# |# N Y+ b9 M: {
sequencingObj.resultsummary()
. k% h& D% J6 Z( F7 X: o% J( V: P/ j+ V7 N! ]6 m
# 模拟双端测序
$ @8 h2 z2 Q# R S0 _3 X. b0 ^2 ]4 c, U( ^sequencingObj = Sequencing()( O5 ~1 V7 U! j8 @
sequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa")5 y+ V& K% ^2 S8 B+ O
sequencingObj.resultsummary()
0 p- b' l/ c: A% V J0 ~8 @! Kfrom Bio import SeqIO
8 R" i2 }) ]2 ~3 k" j, j4 l* J Ofrom math import exp4 ?# ]6 o+ N/ g- W8 ?
import random
& m/ Y% D, T- I2 ?" a. v8 `. H/ N# X, i
class Sequencing:8 A! h0 B& d; @' O8 r6 \4 {
# N代表拷贝份数
. X8 `7 [; o5 u/ |: A def __init__(self):
) l" f+ q: t, @3 p self.fragmentList = []
" [. X# Y7 n; u/ p self.readsID = 1
1 G7 b+ m# B( y4 o self.readsList = []
6 ?4 n; ~3 M) k" ^9 Y- O! { self.averagefragmentlength = 650* s+ L7 q; z- X5 Q5 o! w% C- a
self.minfragmentlength = 500' |9 r/ Y; m2 {5 K V
self.maxfragmentlength = 8006 b8 X$ Q1 D0 l" q
self.cloneRetainprobability = 1, Z3 Y, Z# J( V: M4 S
self.minreadslength = 50
$ O# h) c- k/ ?/ D self.maxreadslength = 150
% _* Z- t$ w( }' @ self.N = 10! F9 h2 a. P- r# u8 K5 C
self.genomeLength = 0) Y% t _0 e. } D8 ^
self.allreadslength = 0" v! A4 c- e8 I" a4 n0 Q4 w
) O) Q; K1 n8 p # 生成断裂点
/ Z' U( X- e# s# O E. c% g def generatebreakpoint(self, seqlen, averageLength):
) c; s. K. f/ O3 v/ o+ O # 假设平均每500bp 产生一个断裂点(averageLength = 500),通过随机函数生成seqlen/500个随机断裂点(1到seqlen之间的随机整数)
* h) R, d7 `, [( @8 i' b( O breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))]
6 f+ K5 q0 F3 D breakpoint.append(seqlen)( Z+ P" g+ T' g# Q/ L
breakpoint.append(0)
- K6 v& ?, p/ ^8 W # 把随机断裂点从小到大排序
7 C: g/ p) z' m4 W breakpoint.sort()
% k% ] O8 ]3 S# k3 J4 t return breakpoint
5 `* k h1 C, S7 Z
/ r& ^3 }/ d' n/ Y! ` # 沿断裂点打断基因组,并删除不符合长度要求的序列片段,定义片段范围:200-1000bp
6 D- l* S- B' d+ V def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength):, R+ W- Q/ z0 c# D$ b) ^0 M5 j
for i in range(len(breakpoint) - 1):/ I o& T0 O# W# G
fragment = seq[breakpoint:breakpoint[i + 1]]% d S% k2 A+ O/ j
if maxfragmentlength > len(fragment) > minfragmentlength:
' C. ~8 a5 c0 t" X5 K9 X1 J self.fragmentList.append(fragment)! b6 ~4 i0 n* t' G
return self.fragmentList- l, t9 B2 K! O; y7 x* |( ~
! f3 B& U$ d/ X # 模拟克隆时的随机丢失情况,random.random()生成0-1的保留概率
* x: r. m! y& ?3 _- u# V def clonefragment(self, fragmentList, cloneRetainprobability):
5 |" S! q K2 ~% j, ` clonedfragmentList = []
: K( ^ ~$ b0 l5 N& k; n Lossprobability = [random.random() for _ in range(len(fragmentList))]6 r( V E' X. ^) B8 X
for i in range(len(fragmentList)):$ n/ G/ h P1 Y8 {6 B; s) x O
if Lossprobability <= cloneRetainprobability:
1 }& E1 H/ p. x! K: e3 f: C0 _ clonedfragmentList.append(fragmentList)' U' z) r2 u$ O! Q9 l% F
return clonedfragmentList4 X& c: b6 G8 K+ p2 V
/ Q; {+ N" Z0 P* I% U: t! J. K; ] # 模拟单端测序,并修改reads的ID号' N/ `# s, S9 b% _6 i) m( v
def singleread(self, clonedfragmentList): U1 W& _. q/ T. U8 [
for fragment in clonedfragmentList:( i1 u8 P& w+ N; a: P
fragment.id = ""# H3 |5 Z& u: s) k
fragment.name = ""
- z5 W; s. X7 S" M! T1 ^- I) F. w fragment.description = fragment.description[12:].split(",")[0]0 V' x$ A, N- M/ n
fragment.description = str(self.readsID) + "." + fragment.description
3 d$ b* x9 A4 U1 ` self.readsID += 19 s# g Q* F1 s3 _ f
readslength = random.randint(self.minreadslength, self.maxreadslength)
# D& v9 C2 ~. z6 O- k5 K( Y: I self.allreadslength += readslength: |& S9 _' h7 H) S' `0 x
self.readsList.append(fragment[:readslength])
2 F+ p" L! Z+ \* v) {4 R' m" m
def singlereadsequencing(self, genomedata, sequencingResult):6 m+ {) g; o) w+ c8 Z( q7 O$ e
for seq_record in SeqIO.parse(genomedata, "fasta"):2 u8 u. M2 h# |5 J1 y7 W# ^
seqlen = len(seq_record)
R& M4 B/ g' ` self.genomeLength += seqlen
! S2 W, k2 p. u for i in range(self.N):# g% Y: A- A; T/ x
# 生成断裂点
- z8 G7 e- ^. i4 n% I! K breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
. V5 G5 [$ z; d; @4 ?# P, H # 沿断裂点打断基因组# j* U+ ~5 Y& t( m
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)% ~/ t6 S9 X% l5 N' M0 u8 n' @1 O
# 模拟克隆时的随机丢失情况
; s7 Z( c, x: M9 E P9 a0 R2 v8 A clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
' r3 ^0 j" _+ X- N # 模拟单端测序0 h0 Z" ~" j9 F: r6 |
self.singleread(clonedfragmentList)
- U3 r, u, z3 {0 O5 c# y SeqIO.write(self.readsList, sequencingResult, "fasta")4 @# h' K$ {, B; O. J B$ p4 j
5 v0 o5 x1 M0 ?+ T! v g1 I
def pairread(self, clonedfragmentList):
- [( G. z3 ~* Q for fragment in clonedfragmentList:
, m1 [0 ]8 a0 N+ H: l fragment.id = ""
% d( k" ~: z: [. T fragment.name = ""
; C4 o b9 z2 Y- B" [2 e* i0 V; F/ n description = fragment.description[12:].split(",")[0]/ V' }; N6 C$ N$ M- i
fragment.description = str(self.readsID) + "." + description) K7 @6 M0 w. G
readslength = random.randint(self.minreadslength, self.maxreadslength)% l: x4 z$ W2 F |% V5 _& d o
self.allreadslength += readslength0 @# J( R$ [9 b& f( j
self.readsList.append(fragment[:readslength])
9 R, U" W, B( ?
/ k' b+ S) r! X0 Z readslength = random.randint(self.minreadslength, self.maxreadslength), b& t( n/ G; N3 N! {' m
self.allreadslength += readslength* Y* h. k6 D1 g% {
9 j* F) x) f3 Z) q fragmentcomplement = fragment.reverse_complement()3 e) w. }& p1 n d [) K/ r
fragmentcomplement.id = ""8 M, J6 ?. v& r$ E% i+ W% `1 m
fragmentcomplement.name = ""
) r. B& ?) ~7 G9 R- J5 T; `: l fragmentcomplement.description = str(self.readsID) + "." + description
" Z2 f) C# |* u2 r: r* y/ v! B% J self.readsList.append(fragmentcomplement[:readslength])
% K0 ^/ K6 G; f7 A+ E
" B: ^. ]3 e, _ self.readsID += 1
0 g- q- Q7 E2 ]
' d# d3 X- W( E$ G def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2):& o! B8 x& V9 i8 U+ s P# B
for seq_record in SeqIO.parse(genomedata, "fasta"):
# \, `3 s( O$ H' K seqlen = len(seq_record)
9 \: T" [% F6 u& ^3 S5 H8 m/ [. q self.genomeLength += seqlen M0 K5 l. u. _- q
for i in range(self.N):1 {/ [3 H( [3 R
# 生成断裂点8 n; y; v! R- @( ^# k) s! \! h
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength) Q3 S9 T% [0 V2 T; R9 a( U# L
# 沿断裂点打断基因组+ F2 F; ^' [9 |6 v1 ?
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
. e4 d0 B8 Q9 V9 ~ # 模拟克隆时的随机丢失情况
9 C; J7 d) E# g" J, _" ?) p clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)7 F7 ~( ~. d2 r p, T
# 模拟双端测序' x9 T5 Y1 |. ]1 i& C E
self.pairread(clonedfragmentList)
$ }( F5 _' ~: ^4 R7 V* V- _ readsList_1 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 0]
; K( w5 Y: G. `3 \ readsList_2 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 1]
4 ^3 D; S7 n5 I1 }$ n SeqIO.write(readsList_1, sequencingResult_1, "fasta")
5 a W, Y- D% ~2 {4 E- N( F SeqIO.write(readsList_2, sequencingResult_2, "fasta")
' ~2 C+ G( g- o( K& p3 w
+ R' Y' Q8 b% B( [( ]1 f5 A3 V def resultsummary(self):( ?- S2 v$ ~9 W7 Q5 m
print("基因组长度:" + str(self.genomeLength / 1000) + "kb")
, f. z8 p0 \$ h; z print("片段长度区间:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength))
, T+ a& m4 Q# m5 x0 _0 _ print("N值:" + str(self.N))6 |/ Y) V* ^9 G1 ?3 }; E! s
print("期望片段长度:" + str(self.averagefragmentlength))' u5 Q& X+ M2 @% O* W% B$ v/ N, T
print("克隆保留率:" + str(self.cloneRetainprobability))9 y; y0 \ K% S3 a. a9 ~1 y
print("片段数量:" + str(len(self.fragmentList)))
1 R2 i+ k, N+ ?1 s W; [0 m print("reads长度:" + str(self.minreadslength) + "-" + str(self.maxreadslength))
9 {) K6 ]" H* ~& S print("reads总数量:" + str(len(self.readsList)))
" ]# ^& B. {0 d/ o! d) S7 F print("reads总长度:" + str(self.allreadslength / 1000) + "kb"). s( O: ]) }2 V! F3 F7 v: F
m = self.allreadslength / self.genomeLength
5 Z+ e. t) V9 {8 E print("覆盖度(m值):" + str(round(m, 5)))
3 T5 _) M. U4 x: J) F3 T print("理论丢失率(e^-m):" + str(round(exp(-m), 5))): S& K: P; y8 m" d. m) X$ V$ `
print("覆盖率(1-e^-m):" + str(round(1 - exp(-m), 5)))
* A9 u0 Q& ~8 n% R# E/ Z# -------------------------------------------主程序-------------------------------------------8 U5 A! O1 c2 s J3 c; W; \+ b
# 模拟单端测序8 ^ }) a3 z# W# C# y1 G
sequencingObj = Sequencing()
8 f3 X0 n# p- j: U! jsequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")
. o6 ?' \& ?9 a$ C' H0 H, N/ B* |sequencingObj.resultsummary()
: v4 |+ @- T/ K8 C1 s5 f% c% X8 _% _
' A* P2 z5 X( b# s* g# 模拟双端测序 M& i/ A" e( Z5 Y# g9 a
sequencingObj = Sequencing()1 z1 P2 m K3 G0 Q
sequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa")/ u- s& |+ n0 b. \8 T+ J5 u
sequencingObj.resultsummary()% a( j" d+ ^" m' V! S- ?5 g; b: ?4 k
& q) f! a+ p, X) I# M1 L
. d8 d* J; A H D! H* G8 K+ \
$ F0 Z+ e( q* p' E7 V) d
7 P8 @$ @8 i' Q |
zan
|