- 在线时间
- 1630 小时
- 最后登录
- 2024-1-29
- 注册时间
- 2017-5-16
- 听众数
- 82
- 收听数
- 1
- 能力
- 120 分
- 体力
- 563350 点
- 威望
- 12 点
- 阅读权限
- 255
- 积分
- 174228
- 相册
- 1
- 日志
- 0
- 记录
- 0
- 帖子
- 5313
- 主题
- 5273
- 精华
- 3
- 分享
- 0
- 好友
- 163
TA的每日心情 | 开心 2021-8-11 17:59 |
|---|
签到天数: 17 天 [LV.4]偶尔看看III 网络挑战赛参赛者 网络挑战赛参赛者 - 自我介绍
- 本人女,毕业于内蒙古科技大学,担任文职专业,毕业专业英语。
 群组: 2018美赛大象算法课程 群组: 2018美赛护航培训课程 群组: 2019年 数学中国站长建 群组: 2019年数据分析师课程 群组: 2018年大象老师国赛优 |
基因组测序模拟; }& Y6 V) ~6 l$ H/ _3 q
基因组测序模拟* U0 B6 c8 H2 K1 i
, }5 H% j, E2 j* T* r
一、摘要% I; y- x( w# Y; L6 G$ ?
: d9 i. p: z) K
通过熟悉已有的基因组测序模拟和评估程序,加深全基因组鸟枪法测序原理的理解,并且能够编写程序模拟全基因组鸟枪法测序,理解覆盖度、测序深度、拷贝数等概念,设置测序相关参数,生成单端/双端测序结果文件
7 W! d' N/ T0 f$ @# {3 P: G1 H
. y: Z& N8 x) X, U, a! h二、材料和方法
( Z2 o W1 N. r5 I( G k1 ~+ J6 L
1、硬件平台
# [( U7 u H: i( @3 A3 p
: L% V: {% }7 o3 z6 N8 \: \处理器:Intel(R) Core(TM)i7-4710MQ CPU @ 2.50GHz , H) ]" I( h* H% ^
安装内存(RAM):16.0GB% e$ r) i* l! F8 S( G
0 ~& w& o5 ]9 ^3 y1 g2、系统平台
: M* t/ k0 Q# E6 g) A0 P2 R# u7 N2 rWindows 8.1,Ubuntu
{0 G2 |8 p2 c1 x ~$ ^8 J! k- R5 {0 `" ]% j/ G$ D6 ?7 a
3、软件平台7 w' v! G' p; i6 @3 k2 ^
! ^# U% I# K# m
art_4542 g: V0 ^1 t/ c; S9 C7 M
GenomeABC http://crdd.osdd.net/raghava/genomeabc/8 p2 u% W8 Q2 w9 U" u
Python3.5& \' e7 E, |% l, g9 q1 R0 J6 E& `
Biopython& i4 K; x% Y2 |/ K3 f* k3 X
4、数据库资源( v4 {) ]% e* ]
- K: B: S; w4 C1 c% E; z' F p! P
NCBI数据库:https://www.ncbi.nlm.nih.gov/$ _6 n4 f- x y8 K' S
5 O6 N9 L7 Z2 Y. I4 c- v5、研究对象
) R2 d" r0 P6 W, g
}4 T/ f9 j, Z0 ^5 o2 c% d. y酵母基因组Saccharomyces cerevisiae S288c (assembly R64)
4 Q$ f+ m4 U" c6 k0 j2 V0 `ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz- O. b: G0 ^0 @& @3 f [
- }6 n- x1 D2 C R/ Z# l
6、方法. h5 ~$ W5 P' \- f9 ^
8 H# s$ \& s4 t& [- K% |art_454的使用 $ n0 N- j, k! o
首先至art系列软件的官网,下载软件,在ubuntu系统安装,然后阅读相关参数设置的帮助文档,运行程序。" v& p- M( m# U( |7 m
GenomeABC
( k5 H$ e) `0 ~; e0 \4 G进入GenomeABC(http://crdd.osdd.net/raghava/genomeabc/),输入参数,获得模拟测序结果。% [0 c% Z+ p- J* u
编程模拟测序
! v8 ~& u& g+ J下载安装python,并且安装biopython扩展模块,编写程序,模拟单端/双端测序。
! I/ E1 O% }) D三、结果) g, q" E( e: r2 X# W
+ |# l' Z/ ~& Z! B5 ^
1、art_454的运行结果- V, r+ R+ B f2 X3 x7 s
# t9 i+ z% F, |6 |. z& `
无参数art_454运行,阅读帮助文档
. j4 q: E5 z) l5 S$ z$ x# _: N# }# a6 W ]' A# [. Y
图表 1无参数art_454运行 4 p6 ~" [( ^! t7 w/ p* N. D0 s( E
对酵母基因组进行基因组单端测序模拟,FOLD_COVERAGE设为20,即覆盖度为20.
$ e. g2 M4 w/ b: ?1 Y. L+ N) n/ U( N0 F下图为模拟单端测序,程序运行过程及结果 + ~' I5 b, K; U& b ~$ S9 H: l
. [' ]" p* ~( H z
图表 2 art454单端测序 , |( z6 Q c" k. }
+ d" C. H+ R; B$ K/ v5 m# o图表 3 art454单端模拟结果
q# x/ G9 j" c% ^- E2 O: x双端测序模拟,FOLD_COVERAGE设为20,即覆盖度为20;MEAN_FRAG_LEN设为1500,即平均片段长度为1500;STD_DEV设为20,即长度的标准差为20
1 [: t# Z. E& u% z+ v下图为模拟双端测序,程序运行过程及结果
( A" z! o0 V1 F1 e2 V* z. P' S$ E( [ u9 i
图表 4 art454双端测序 ' \2 D* W# @8 Y% }0 X" G
5 H, p9 A! K$ {1 [9 s; v- `图表 5 art454双端模拟结果 , _& D7 r% y( z; [) c% O0 X
2、GenomeABC
& g9 N" G$ R; M& K9 E' [" p下图为设置参数页面 & K: ^2 ~. [4 R. `. N
' @8 g6 A% u4 k% I0 F8 e- G$ z# }下图为结果下载页面 # f& h0 j. K n2 Q, J* h- x
0 z, _5 ]" @, T5 B. c
图表 6 结果下载页面 . y- c" _8 K# x) C8 S% o
3、编程模拟测序结果
9 j' u# ?/ F3 Z拷贝数是这里的N值;覆盖度是m,测序深度是宏观的量,在这里与覆盖度意思相同,就是测序仪10X,20X。 . M4 n6 _! n; `$ b1 l% [/ @
单端测序 + Y* [0 k" }# _8 R) ]" h% z% G- }5 n
9 e* E5 R) X5 Y4 w; A4 Q
图表 7 程序模拟单端测序 ; m0 |0 C. \/ f7 P
双端测序 $ K6 x- ~3 _2 A; n; t) ]) M
# D* V& I8 |( V6 h/ x. x
图表 8 程序模拟双端测序 8 n) o6 g# |) ^% ~# d4 A6 q5 U
测序结果 0 s2 m% N5 y. r1 Z" x/ O; J
7 o5 ^) O1 |0 R6 |9 L7 L
图表 9 结果文件
8 |9 \. q1 ^5 J9 s/ e, [7 p# o9 }1 ?5 ^: z2 b$ L% {6 f
因为期望片段长度是600bp,在片段长度区间200-1000bp内,所以大部分的片段都没有删除。 . Q V, H. L6 r; @* @: O* A
测序结果统计表' H: y6 c. p+ z$ t, Q; a% _6 S) p
% |* F- g" L4 j测序方式 基因组大小(bp) 片段长度区间 (bp) N值 期望片段长度 克隆保留率 片段数量 Reads长度范围(bp) Reads总数量 Reads总长度 覆盖度(m值) 理论丢失率(e-m) 覆盖率(1-e-m)/ @3 D" |+ N7 R: K) T! F
单端 12157kb 200-1000 10 600 0.95 107378 50-100 101968 7645.541kb 0.62889 0.53318 0.466822 y8 o8 r7 f+ t _! |% E4 X$ A
单端 12157kb 200-1000 20 600 0.95 213722 50-100 202996 15227.882kb 1.25259 0.28576 0.71424, ~3 ~2 Z7 s* @4 Y- o
双端 12157kb 200-1000 10 600 0.95 106704 50-100 202770 15212.662kb 1.25134 0.28612 0.713887 t0 R/ R& j; i/ }# M5 B
双端 12157kb 200-1000 20 600 0.95 214212 50-100 407186 30534.265kb 2.51164 0.08114 0.91886
6 @* i* k% u. }四、讨论和结论
9 p* L; n+ U6 d3 j3 V$ \/ p( w* z2 ]3 z+ n+ U2 O( ]
程序运行方法
, W. y% x- N" [8 [6 _
- ?1 s8 l2 P* `- s0 |" [% N+ D在类的构造方法init()中,调整参数。
" x$ Q5 p6 `: J/ pAveragefragmentlength为片段平均的长度;
: K( ^9 t. `, ~0 g7 Z q5 Vminfragmentlength和maxfragmentlength是保留片段的范围;
9 w. {7 |; Q; d$ {cloneRetainprobability是克隆的保留率; 0 `- P' f- X' I0 c0 \
minreadslength和maxreadslength是测序reads的长度范围
5 J: u% ` N x6 O f1 t {' R/ ?+ r6 |/ U8 `) F9 b0 C2 l
模拟测序的诸多方法都封装成了Sequencing类,只需要创建类,并调用singlereadsequencing()和pairreadsequencing()方法,传入文件名的参数即可。1 u8 U+ K1 S0 j/ G( O7 {
! B; d! `1 f3 l- x. x
附录$ _& r( U% s$ q
; h5 _7 M T! v1 ~: y' }. g
from Bio import SeqIO
, ?0 W% V3 O Y, Ffrom math import exp0 `' c6 Q% G3 @* h0 v$ y, M
import random
, x5 M1 _8 _& D' h1 S- j0 a3 z8 X, v& [9 \% Y" D) [8 f8 n
class Sequencing:
6 n2 ?3 i( N5 [) G* C4 ~ # N代表拷贝份数& m& x7 K o, T, ^! a
def __init__(self)7 x) L3 a$ N" N Q; D
self.fragmentList = []
: ? V9 o+ n4 q, x& y self.readsID = 18 x5 S; _% L- k5 p2 r
self.readsList = []. W. f: e5 _4 q6 p/ L. R; }
self.averagefragmentlength = 650( Q# F1 @% {$ D% C8 w; L$ `5 B
self.minfragmentlength = 500: q& R% h% e. y- s6 v' n) p
self.maxfragmentlength = 8009 @3 E$ Q$ j! [' z i' U
self.cloneRetainprobability = 1
6 x3 U& w- M8 `5 @ self.minreadslength = 50! t5 q: Q9 j/ v& y$ X
self.maxreadslength = 1501 U8 K7 G8 L4 P
self.N = 100 p+ D8 @( r. `
self.genomeLength = 06 r' C, J3 n! L
self.allreadslength = 0
1 V- U. r+ Y* v; {0 d
. l& m# }9 O$ C% k+ D' ^ # 生成断裂点4 E. j' _' F. g$ w" q5 q, ]
def generatebreakpoint(self, seqlen, averageLength):
7 X2 K5 G6 ?/ A B! N% G$ x # 假设平均每500bp 产生一个断裂点(averageLength = 500),通过随机函数生成seqlen/500个随机断裂点(1到seqlen之间的随机整数)
- T* B- c( a! u2 X6 k, G$ I breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))]
. o& R# h; C0 C# A9 Q0 p' v9 c# \ breakpoint.append(seqlen)- e% q; u" d& C: a
breakpoint.append(0)) d' C9 a* c1 L4 p! L
# 把随机断裂点从小到大排序% W5 h" C0 W" v. C3 z. }3 Y
breakpoint.sort()
# U1 U' K9 Q% M: r9 Y return breakpoint
: t9 k; \! U9 H W) s4 J- z1 }- v' Y' y- t) p- L# L& I+ @
# 沿断裂点打断基因组,并删除不符合长度要求的序列片段,定义片段范围:200-1000bp
8 B$ C4 H1 t3 X: | def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength):5 m" B4 g; ]9 z, v: p( m
for i in range(len(breakpoint) - 1):" b+ ]0 _# s. N' s- y- O
fragment = seq[breakpoint:breakpoint[i + 1]]; ?9 K: f8 c& O8 {5 t& q+ u, T
if maxfragmentlength > len(fragment) > minfragmentlength:4 Z7 P6 a5 I1 j& Z- `& G
self.fragmentList.append(fragment)7 `" [3 R4 t! q# {' Z- m) @; p1 g
return self.fragmentList
2 w3 k. j$ b* _, m* A
, W- L7 L* g! R9 H! x. C$ w+ ~ # 模拟克隆时的随机丢失情况,random.random()生成0-1的保留概率: `3 N# b1 r. O, |% k
def clonefragment(self, fragmentList, cloneRetainprobability):
, K+ k7 L: c7 S( s! e8 [4 ` clonedfragmentList = []5 T: Y4 w/ i1 [2 v/ C' k5 z. ?! s7 e
Lossprobability = [random.random() for _ in range(len(fragmentList))], N) P# b4 D0 E/ f4 L. z" ?4 j) o, [
for i in range(len(fragmentList)):
5 G' j* r7 h$ r3 p if Lossprobability <= cloneRetainprobability:, ]- v! h- O0 ?7 Q, n$ S" v; c1 e
clonedfragmentList.append(fragmentList)' B# r- y1 j" h' i( \2 H7 I
return clonedfragmentList
$ F7 W9 M8 k# V4 W3 U
, X$ y- @' b( N4 Z3 U1 p, n* N: i$ O # 模拟单端测序,并修改reads的ID号
8 P* Y6 c$ d) } def singleread(self, clonedfragmentList):
0 p, A1 N9 A7 M2 }7 ` U) q for fragment in clonedfragmentList:
! x( ^9 K, i" Z7 j9 t fragment.id = ""
: ~9 p, A# c/ N3 w fragment.name = ""
$ m8 b9 p R5 b3 b fragment.description = fragment.description[12:].split(",")[0]
7 k Y; ~) V4 a fragment.description = str(self.readsID) + "." + fragment.description# R+ q8 n) m8 O4 J& }
self.readsID += 1
2 ^1 B) ^. g7 o, [ readslength = random.randint(self.minreadslength, self.maxreadslength)
x8 V& j% T: [( \: B) @ self.allreadslength += readslength
7 w: ]4 h, b8 d2 n0 w& @ self.readsList.append(fragment[:readslength])
" o& T4 @ ~; T5 m
9 [" n' j# p) G. N( S- F def singlereadsequencing(self, genomedata, sequencingResult):
@) B" Y' _3 \( q. c4 v+ o6 [; h for seq_record in SeqIO.parse(genomedata, "fasta"):$ N* J6 M! w9 ~. G" v4 C
seqlen = len(seq_record)
2 Q. M! m: N/ c. T6 Z* p _6 j self.genomeLength += seqlen
- k" H2 k8 b8 b9 K. q3 Y3 u$ p/ H. ` for i in range(self.N):0 C' @8 v. g4 k/ C
# 生成断裂点
7 L, K9 B7 L/ a breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)2 X9 |. e+ @- h# Z+ J
# 沿断裂点打断基因组
5 ~! A. s9 p0 U6 E! l self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
+ A; I* y* E( X: A4 d& U # 模拟克隆时的随机丢失情况
- n, j" Y! M5 {8 N1 Z5 B$ l) r( D clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)6 _8 V8 _0 y8 j' e3 z# a7 d' `/ L
# 模拟单端测序
6 j) l9 S( v/ j& k3 W) J self.singleread(clonedfragmentList)) B1 o: x, Z& ?% Q& O
SeqIO.write(self.readsList, sequencingResult, "fasta")
0 T% L. u D( @5 r; B& \2 W! z ~# i' A, I7 U+ `9 R m
def pairread(self, clonedfragmentList):
+ J; e8 Q3 ^' L# d1 I$ M for fragment in clonedfragmentList:
0 e. v5 ]4 a8 n# b: a2 b- g fragment.id = ""
# O! S/ S* F7 n# R$ f fragment.name = ""
) K: T# a8 K- |+ v* ~& f' v/ s description = fragment.description[12:].split(",")[0], K/ {4 y$ |) ~" Q
fragment.description = str(self.readsID) + "." + description
0 j" P) S! H) M. t1 s$ N9 f. z readslength = random.randint(self.minreadslength, self.maxreadslength)0 W$ o$ {; P g$ Y( A( T
self.allreadslength += readslength5 G7 }( X: X3 Y- L6 X4 @2 u1 u
self.readsList.append(fragment[:readslength])* J6 ]1 }! K" w$ Z! V
9 z+ M! f6 ~7 a; x8 u* v readslength = random.randint(self.minreadslength, self.maxreadslength)
2 j4 H9 K. x2 S u8 W$ e! v) b self.allreadslength += readslength# c8 W( B1 b B2 t. T! a
+ v( a v3 ?% [7 U, U d. V; ^
fragmentcomplement = fragment.reverse_complement()
( T: P' D9 d. E& L% w" ~# O1 `. L fragmentcomplement.id = ""
% {2 ^! c' S0 v" R8 H k, A' | fragmentcomplement.name = ""1 ]3 _7 m5 s. m; [. _5 Y: @
fragmentcomplement.description = str(self.readsID) + "." + description' x. B2 A: B5 h( I3 n
self.readsList.append(fragmentcomplement[:readslength])
) G$ v# Q' G9 D1 n, |3 t4 {
6 [9 `+ Z- r; _, m* e self.readsID += 1
# P: X* y7 N, w; U) e- a+ \1 L
; q6 _" T3 o8 [% x+ @+ m def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2):$ j7 z5 S& E! E% s+ K& O
for seq_record in SeqIO.parse(genomedata, "fasta"):
/ x7 d% u6 `4 \- h5 c seqlen = len(seq_record)
7 s7 A# I9 m9 }) u3 p self.genomeLength += seqlen5 d, s0 ]# A- h L. K: F, ~
for i in range(self.N):! ^- d+ X& [% [5 N7 L
# 生成断裂点
* L! \7 @. h1 c9 L% Y. z* R8 c* V breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)1 [% }" N1 l# e* b4 E& ?$ v- t1 t! d
# 沿断裂点打断基因组0 }% E6 I, [2 S
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)+ d- W6 @5 c, x# g3 A8 K' k$ g
# 模拟克隆时的随机丢失情况
& B) h m2 L9 s$ ~6 r clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)8 Z6 s3 U$ R+ H% R1 V
# 模拟双端测序( u% ]1 S! X$ T
self.pairread(clonedfragmentList)4 ~) r3 p$ y8 t) Q2 d8 a, l
readsList_1 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 0]- J# ^$ {1 B# G) i- N
readsList_2 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 1]* J. a* u) J- f0 u
SeqIO.write(readsList_1, sequencingResult_1, "fasta")3 B: V e$ g1 L5 a( W2 H p
SeqIO.write(readsList_2, sequencingResult_2, "fasta")
* t( [5 S$ F% @6 g& ~) e/ V+ b" y9 A" V3 c7 g! P1 i
def resultsummary(self):
3 s5 V/ N! B: V- ` q print("基因组长度:" + str(self.genomeLength / 1000) + "kb")
8 B/ o, ]5 ~+ ]- X# D print("片段长度区间:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength))
2 B* m9 v9 h& Z* D5 N1 q print("N值:" + str(self.N))
0 u7 T5 w" p. h8 }' Y$ W8 [ print("期望片段长度:" + str(self.averagefragmentlength))
6 b) [4 a3 Y3 `& C9 b0 s7 J/ o print("克隆保留率:" + str(self.cloneRetainprobability))4 ^" e# V3 M- ?8 L1 P+ P! _% `
print("片段数量:" + str(len(self.fragmentList)))0 ~- Q1 k/ O/ U" e( Y! ~
print("reads长度:" + str(self.minreadslength) + "-" + str(self.maxreadslength))
* r. k4 b0 B) q! o3 x- I9 |; q print("reads总数量:" + str(len(self.readsList)))0 i, ~: U' |* k6 K
print("reads总长度:" + str(self.allreadslength / 1000) + "kb")
: N6 V+ u5 k0 d8 z0 ^ m = self.allreadslength / self.genomeLength
4 Z! Z i; {0 p) M* L print("覆盖度(m值):" + str(round(m, 5)))" r1 @9 d5 c' c7 l0 o- q. \: d4 [- ~
print("理论丢失率(e^-m):" + str(round(exp(-m), 5)))) }8 p: K2 a: T$ b$ j# r/ ^1 M- ~. s
print("覆盖率(1-e^-m):" + str(round(1 - exp(-m), 5)))
( x. O0 t+ X3 y+ C; ]4 u( A) B: E# -------------------------------------------主程序-------------------------------------------
1 k) V: U4 c- t5 h8 {3 R' s# 模拟单端测序7 M3 t# V k: b( W1 W' ~
sequencingObj = Sequencing()
; W# y ^9 S$ l7 UsequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")
& t5 {+ Z' X" o2 P+ \sequencingObj.resultsummary()# y9 L. _( J: f( u" e
* s% o; Z5 C! b+ F' n
# 模拟双端测序
+ `8 e: H# Z* `6 ?$ ?( L- V3 OsequencingObj = Sequencing() Z/ N1 T8 l: e
sequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa")
+ y2 S: ~: Z OsequencingObj.resultsummary(), V. b6 J0 s* t# h* v
from Bio import SeqIO' t5 i& N" u D/ D2 e" t, _
from math import exp, C; Y, v8 |2 A" o8 Q8 _& |, ]
import random
" {. t( C) B/ S4 x" {& b2 e
* X! a- A2 D& e( a" F- c; d7 ?7 |class Sequencing:/ J( b! I% u9 ^1 D
# N代表拷贝份数
% w8 z# W( L9 @; L2 R" }7 G def __init__(self):
: [; G# D. W3 O* s$ [4 a3 y self.fragmentList = []5 O! i3 `' ?; Q0 [' W% `, C
self.readsID = 12 C0 p* u& G& Q/ y
self.readsList = []( s' H$ K( \% h1 N+ W
self.averagefragmentlength = 6501 r" s+ F" q! H5 I: N" E, ]$ s
self.minfragmentlength = 500$ v8 o: }, H4 |$ X
self.maxfragmentlength = 800
7 W' q' R9 ]: v" ]7 v- q self.cloneRetainprobability = 1" c1 \3 `& \; q( M) L' j: b2 l# s
self.minreadslength = 50
% W) D8 p. B; F1 K( B& Q& ? self.maxreadslength = 150. G1 V" c) l j! e$ `
self.N = 10( J' {9 I# N1 ]* H) `! t
self.genomeLength = 0' N& V. K7 p' X+ j$ _+ P9 R
self.allreadslength = 0
2 `( V/ h3 _9 [4 m6 t
; s! r1 u8 w1 k4 f( x4 t # 生成断裂点
: a: J: e/ j& @* w2 S) ^4 R% S; E def generatebreakpoint(self, seqlen, averageLength):2 W0 O0 u; X: u0 l Z; z2 L8 i
# 假设平均每500bp 产生一个断裂点(averageLength = 500),通过随机函数生成seqlen/500个随机断裂点(1到seqlen之间的随机整数)
8 A: A8 S, U5 |/ x+ I/ [" z* V breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))]& D! d; G9 J4 K" `
breakpoint.append(seqlen)+ ~) F, j' h( K4 K' \4 z
breakpoint.append(0)! G! s. Y3 T2 l* f! @4 P% @
# 把随机断裂点从小到大排序
4 ]: n7 V, R j breakpoint.sort()
% v2 n3 V+ r) n" }# J' f return breakpoint5 X; M" [) n% E
( C8 M a+ x) w
# 沿断裂点打断基因组,并删除不符合长度要求的序列片段,定义片段范围:200-1000bp# Y. y1 ]/ A0 s ~8 `
def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength):
* J% y# B1 d( l! P6 h) [ for i in range(len(breakpoint) - 1):; w% x8 z2 ?: f# I& `7 [/ U8 E, Z
fragment = seq[breakpoint:breakpoint[i + 1]]8 V. {8 \0 c4 v3 D/ y' N
if maxfragmentlength > len(fragment) > minfragmentlength:
/ u# S2 }; D3 _% U1 N; G$ @# B self.fragmentList.append(fragment)
' [$ ?$ B, T7 `& R2 Q& v% X! W1 L return self.fragmentList% y0 f$ ?: ^ V7 g. D( k9 R [7 m
% M3 l; |6 j& Y; G
# 模拟克隆时的随机丢失情况,random.random()生成0-1的保留概率* x# _$ u! L) B- X; \+ b: b
def clonefragment(self, fragmentList, cloneRetainprobability):
$ w; p6 T) T$ k, G1 d2 w0 @, o clonedfragmentList = []- [" X, t7 x6 [' J* |3 ?/ `/ x
Lossprobability = [random.random() for _ in range(len(fragmentList))]
' T. i6 { m( _7 F9 a for i in range(len(fragmentList)): \ i' M1 o9 _% H" A! {
if Lossprobability <= cloneRetainprobability:
6 `& F9 _9 r, ~# m) F$ _+ G. b clonedfragmentList.append(fragmentList); K. X3 t5 K! O4 z
return clonedfragmentList
9 V {! E* q9 k4 `+ M; [9 Q* }) p0 F) j- ~% x `& l4 q
# 模拟单端测序,并修改reads的ID号: q4 N+ h0 O) s7 _+ e
def singleread(self, clonedfragmentList):
* y' A. U a5 B for fragment in clonedfragmentList:
( r9 Y4 m: o0 H fragment.id = "": H ~6 V$ Q% `. i w3 d
fragment.name = ""1 B( q* E0 |. K4 _) f H
fragment.description = fragment.description[12:].split(",")[0]. ~! O- O% F: c* {/ `1 D
fragment.description = str(self.readsID) + "." + fragment.description( E# L [* a( i7 A7 G+ h! z
self.readsID += 1
+ o- R. w0 s1 w" E; U Y9 } readslength = random.randint(self.minreadslength, self.maxreadslength)) _$ S$ D# k1 P% `+ j) R8 C- b7 x
self.allreadslength += readslength: r# U1 W6 D, \3 @+ \/ F! n$ |
self.readsList.append(fragment[:readslength])
5 K6 d) ?5 t) [* x8 @# O" o: S" {/ X
+ R4 q3 l+ C e' Z5 m+ B: ] def singlereadsequencing(self, genomedata, sequencingResult):
* n* m& B& q# d3 s$ @$ D1 n% z for seq_record in SeqIO.parse(genomedata, "fasta"):- n' a2 S4 I, Y! i$ y O( B. d2 C
seqlen = len(seq_record)
8 j& d' x" T6 g2 J self.genomeLength += seqlen. m; M" O# c0 p) n/ r$ O# Q4 J
for i in range(self.N):1 Z! E. h% H$ {
# 生成断裂点7 ^8 t( q/ m& t: Z7 _$ c$ u
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)& U5 W- Z$ x: a' F P' s8 i$ c& w' h
# 沿断裂点打断基因组" k" V7 X/ g; f
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
+ @, \. X* Y# o" M' S1 s" g9 x # 模拟克隆时的随机丢失情况& ~6 O5 q* t3 e8 c1 a: [* r# H
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
% e- M4 k" \% n2 l( t # 模拟单端测序/ N+ @; X* d! C# m
self.singleread(clonedfragmentList)7 ?4 G4 q- _" y& t+ f
SeqIO.write(self.readsList, sequencingResult, "fasta")+ g% D& _# v& j; h/ r
! }# `* \ m0 Z Z |/ n
def pairread(self, clonedfragmentList):
b4 b0 Q2 t0 f7 J for fragment in clonedfragmentList:
* `# h1 n6 H* G1 Y5 L fragment.id = ""
9 y3 ^- l. ^' G2 |( |$ W fragment.name = ""5 R6 t- E6 v- T! L! f0 X
description = fragment.description[12:].split(",")[0]# K0 I' c K: B; x% V+ u! x
fragment.description = str(self.readsID) + "." + description% ^9 D6 |5 { a1 [, g9 W4 x4 {
readslength = random.randint(self.minreadslength, self.maxreadslength)
o6 e, t D# j" v3 }1 ~* D, y self.allreadslength += readslength! g# E/ e; O4 H7 S4 L* z9 \ ?! y
self.readsList.append(fragment[:readslength])
: F. o+ ]. e3 A
: [5 `& T2 u: ~0 @, C" k/ B; O readslength = random.randint(self.minreadslength, self.maxreadslength)" H% y9 F5 E$ Z$ A% @
self.allreadslength += readslength
4 P8 ?7 N& U; Q% N; } e, b! O/ c# S
fragmentcomplement = fragment.reverse_complement()
" O3 W5 _% ?% W2 _ fragmentcomplement.id = ""
! J7 c$ a. M' i5 n. P fragmentcomplement.name = ""5 Q8 ? q3 |; M! q! |5 m8 p
fragmentcomplement.description = str(self.readsID) + "." + description/ E! w' t# Z+ H8 M, M2 T& h. A
self.readsList.append(fragmentcomplement[:readslength])! W0 b& H6 A. l
* `, \1 [ X; t3 Z5 J2 y/ Q+ d self.readsID += 14 U% t( H8 O+ f& u5 J
9 {5 @7 L8 R' O3 X% i def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2):1 T. D% `! }4 s
for seq_record in SeqIO.parse(genomedata, "fasta"):
( e" v+ l9 s4 g4 y. o seqlen = len(seq_record)
# \- O9 U' z3 G9 _; _' d3 i$ q self.genomeLength += seqlen7 o3 ~3 B: E6 j c: |
for i in range(self.N):
2 z2 ?6 Q0 d2 c4 s2 t # 生成断裂点+ A0 z. t& S1 K* Y( r
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
/ L$ B/ q, U3 @- |' c7 x3 i/ o # 沿断裂点打断基因组
- G- f6 \) t- W' {+ A; i6 U c self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)) j4 j G% f5 y, i# e1 B
# 模拟克隆时的随机丢失情况
y9 t- }/ y0 q9 D$ f3 i9 l clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)! q5 \! m" X+ W- ?$ G/ |0 v" V
# 模拟双端测序
* T- D7 i( |6 t( a I( m2 \ self.pairread(clonedfragmentList)1 h, B2 F: {0 N1 Z
readsList_1 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 0]% O8 \( n, S0 K) J# s
readsList_2 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 1]$ j! [: h7 A" o4 B! P. ]0 j# L1 _+ H
SeqIO.write(readsList_1, sequencingResult_1, "fasta")& t" E- d$ [% A4 c+ k& p! z
SeqIO.write(readsList_2, sequencingResult_2, "fasta")
& F1 P+ w. n$ Y7 a' o) T- K( b; K% Q/ L
def resultsummary(self):+ p9 f1 c2 Y, @- Z4 r7 s- J
print("基因组长度:" + str(self.genomeLength / 1000) + "kb")
^! _! i# X( Q+ f+ Z* ?% F+ h6 q print("片段长度区间:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength)) e+ }+ q- t9 K4 R9 r6 H( u9 f! R
print("N值:" + str(self.N))
, a1 N3 k2 U* U3 v print("期望片段长度:" + str(self.averagefragmentlength))' S7 A" S% d! v) o
print("克隆保留率:" + str(self.cloneRetainprobability))
: }$ q' b8 H5 \( H$ z/ G" S: M print("片段数量:" + str(len(self.fragmentList)))
1 i/ Z7 c/ c6 G( _+ B. f0 [( I. z5 B( z print("reads长度:" + str(self.minreadslength) + "-" + str(self.maxreadslength))
7 g- G J7 x8 u( H% P" k print("reads总数量:" + str(len(self.readsList))) i. H) K9 ]6 n; x
print("reads总长度:" + str(self.allreadslength / 1000) + "kb")9 j$ }) ]1 @8 N, j8 ]4 }+ b) l1 [
m = self.allreadslength / self.genomeLength2 v+ ]5 f9 H- w1 A9 \) F) w
print("覆盖度(m值):" + str(round(m, 5)))
9 G6 N8 T, C) [/ g, e7 l print("理论丢失率(e^-m):" + str(round(exp(-m), 5)))$ g/ @' R. J* h$ V8 K
print("覆盖率(1-e^-m):" + str(round(1 - exp(-m), 5)))5 ~. J3 A" e! [. u9 W
# -------------------------------------------主程序-------------------------------------------2 E0 N5 L* c4 b, G( Z
# 模拟单端测序+ @* h! k4 z! Z1 r2 v* g
sequencingObj = Sequencing()7 X8 T! Q# x0 T9 G5 ?% U& a, w2 F
sequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")
* c4 i* Y; B' w* d. k# Q8 a: zsequencingObj.resultsummary()3 e% u5 U5 c$ x! I
& r& r" z& ~! R0 B( f# 模拟双端测序& Z1 y: m: ]* L3 d6 S+ X7 x
sequencingObj = Sequencing()
9 ?$ t* w A8 O" J$ }$ H. dsequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa")7 Y' w$ d8 V6 W1 ] r
sequencingObj.resultsummary()% ` _8 L& c0 f1 p! l+ ]. v$ A+ E J
7 P- { a# T# B8 \5 d" G4 ~
1 }2 |- L& e. R1 L4 G ~7 e- x
* t: w! |1 e9 E& m1 p * ^9 q# _# q! i% e+ q) X7 y) V
|
zan
|