- 在线时间
- 1630 小时
- 最后登录
- 2024-1-29
- 注册时间
- 2017-5-16
- 听众数
- 82
- 收听数
- 1
- 能力
- 120 分
- 体力
- 561419 点
- 威望
- 12 点
- 阅读权限
- 255
- 积分
- 173799
- 相册
- 1
- 日志
- 0
- 记录
- 0
- 帖子
- 5313
- 主题
- 5273
- 精华
- 18
- 分享
- 0
- 好友
- 163
TA的每日心情 | 开心 2021-8-11 17:59 |
|---|
签到天数: 17 天 [LV.4]偶尔看看III 网络挑战赛参赛者 网络挑战赛参赛者 - 自我介绍
- 本人女,毕业于内蒙古科技大学,担任文职专业,毕业专业英语。
 群组: 2018美赛大象算法课程 群组: 2018美赛护航培训课程 群组: 2019年 数学中国站长建 群组: 2019年数据分析师课程 群组: 2018年大象老师国赛优 |
基因组测序模拟
/ ]7 g/ C. n2 ?# d n基因组测序模拟4 U: ~, y' L3 L5 m2 _2 ^ O
( T( |* `" t) e) W# L7 Z
一、摘要
2 g5 X% a7 V* x- D& p/ m( e- G* I8 s c( g
通过熟悉已有的基因组测序模拟和评估程序,加深全基因组鸟枪法测序原理的理解,并且能够编写程序模拟全基因组鸟枪法测序,理解覆盖度、测序深度、拷贝数等概念,设置测序相关参数,生成单端/双端测序结果文件! e" P& h0 D2 b: z- s5 [& N# W$ \
" U$ o7 J9 [' T二、材料和方法
% U8 T2 v. m& [6 r) R6 F) m
( ^& n% a9 N+ T* v1、硬件平台, v1 {) M" C n* B3 {6 G, Z& }- [
+ O4 H5 a$ A. @3 D. ^- M
处理器:Intel(R) Core(TM)i7-4710MQ CPU @ 2.50GHz + m( v0 i6 q* [7 c+ u
安装内存(RAM):16.0GB
9 o& j& i" b* m, b! D3 ]/ ]
- v8 w2 ~& \" Q) W$ h2、系统平台
# `" ]. _# A2 T1 ?! [7 nWindows 8.1,Ubuntu
& K- Y2 N9 X5 F/ B! y/ ]& T* X6 f( a9 O% t4 y$ R
3、软件平台. l( U$ y$ A" ?: B% B
; q ^3 h3 {( d, f) O& I- u
art_454
- P5 z* t5 ~1 T; P+ ?GenomeABC http://crdd.osdd.net/raghava/genomeabc/
# }, a3 K2 Y0 K. T9 LPython3.5
8 M/ S0 } M8 |# q4 SBiopython
# ~+ }% z/ I! K( ^$ _) i4、数据库资源- ^) i- ^: x0 a8 X1 `
; S! r+ G+ v. I$ M8 aNCBI数据库:https://www.ncbi.nlm.nih.gov/5 L& J$ }0 C4 N+ g5 ~% L1 y7 k8 }& Y
6 `9 \) U4 J. N) R8 ]
5、研究对象
8 Q3 d S# ], z! c) W: b' Z% a9 G
酵母基因组Saccharomyces cerevisiae S288c (assembly R64)
6 l3 W* R' W! Wftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz
# _" n1 j/ u+ R) S
2 W- ~- |" V3 u, D7 l: j6、方法
7 V7 w' b) B9 I; g! |1 S) O% w7 h4 C: }2 R* J8 e! I6 j
art_454的使用
5 i5 q$ q4 u# s- J3 O9 A首先至art系列软件的官网,下载软件,在ubuntu系统安装,然后阅读相关参数设置的帮助文档,运行程序。( E. c# b, N# G. T; j8 Z
GenomeABC
1 C2 F y. ]$ a D2 D+ g进入GenomeABC(http://crdd.osdd.net/raghava/genomeabc/),输入参数,获得模拟测序结果。1 r8 F3 E! @1 L3 c
编程模拟测序
3 d: Y4 a0 {8 [3 e# W: J7 E; ^下载安装python,并且安装biopython扩展模块,编写程序,模拟单端/双端测序。
( z) C% w; {0 I" v$ G三、结果1 y6 g, O, P& c7 o/ w, L% e
: {- k( X" q( \& l+ ?$ ^
1、art_454的运行结果5 x: E$ l% L! _" x
3 Y: R5 h G9 I* [无参数art_454运行,阅读帮助文档 ( x8 l U- J( M; U" `* H" H) c
8 ]4 i! d% r0 z2 z# G8 m5 m5 h图表 1无参数art_454运行
L4 H$ V( L' H7 d8 E对酵母基因组进行基因组单端测序模拟,FOLD_COVERAGE设为20,即覆盖度为20. ' K9 O" I a5 a* ?) }+ q
下图为模拟单端测序,程序运行过程及结果
5 W, B, r6 }: }, s$ o% V( p; ?* t9 |9 o/ E4 C; s+ L
图表 2 art454单端测序 ' m# c1 \0 I* k9 L0 o3 Y- u2 B
) g0 u( E8 ^. L- w
图表 3 art454单端模拟结果 ) n; K$ z, u* _2 C
双端测序模拟,FOLD_COVERAGE设为20,即覆盖度为20;MEAN_FRAG_LEN设为1500,即平均片段长度为1500;STD_DEV设为20,即长度的标准差为20 ) l8 I$ @7 c$ g2 `. g
下图为模拟双端测序,程序运行过程及结果
5 t* P! q. k' Y( n2 E: h2 b$ e: D% k) v1 v7 G
图表 4 art454双端测序
1 `9 u; C7 l6 P( y% F- s. M
6 @0 c8 p+ Y+ _5 O3 K/ _' S& a图表 5 art454双端模拟结果
7 |3 W9 d5 t l* Y# e/ ~& B5 t2、GenomeABC
! r _% s/ W) t2 ?- u) t下图为设置参数页面 8 Z- W' J6 X. g) @
( x2 I- h9 ~6 ?# ]5 |
下图为结果下载页面 7 Y/ v V/ @3 ?9 z* y5 r
+ Y) C; E" N9 s" S ~) _5 ]0 d$ u2 D图表 6 结果下载页面 . ?% s# h+ \. d0 x$ h' ^$ ~& z
3、编程模拟测序结果
" B+ j0 R+ b, b拷贝数是这里的N值;覆盖度是m,测序深度是宏观的量,在这里与覆盖度意思相同,就是测序仪10X,20X。 . S; D: s5 M) `% o7 J
单端测序 : {- P4 L9 S% [2 P
3 d+ U/ S ?1 D# R5 w! [) `
图表 7 程序模拟单端测序 : ]1 M f# P9 O8 w' R, c# Z
双端测序
% e4 Y0 j$ W6 |7 A3 |0 S$ i, F' R1 j. C' l4 m/ K& x* ^$ I
图表 8 程序模拟双端测序
4 U: S7 d4 \% r* h测序结果 # _" A+ k9 s8 y! }& o' {/ t
9 W7 m: Y/ ?/ ]6 B' Q- f3 T
图表 9 结果文件" D0 B# R7 n! f* q" x
# ^( k# J- j2 u因为期望片段长度是600bp,在片段长度区间200-1000bp内,所以大部分的片段都没有删除。 6 U7 m8 k# _* ?; ^
测序结果统计表
- S. R1 K4 U# l9 L6 G+ U3 a0 z/ a0 t& g" y, w& ?
测序方式 基因组大小(bp) 片段长度区间 (bp) N值 期望片段长度 克隆保留率 片段数量 Reads长度范围(bp) Reads总数量 Reads总长度 覆盖度(m值) 理论丢失率(e-m) 覆盖率(1-e-m)
& ]# [+ c" z2 k- W/ D3 }& b单端 12157kb 200-1000 10 600 0.95 107378 50-100 101968 7645.541kb 0.62889 0.53318 0.46682* b S6 \( F6 r
单端 12157kb 200-1000 20 600 0.95 213722 50-100 202996 15227.882kb 1.25259 0.28576 0.71424) @" Z, N/ `6 H
双端 12157kb 200-1000 10 600 0.95 106704 50-100 202770 15212.662kb 1.25134 0.28612 0.71388
3 w7 @/ A; [5 u# z双端 12157kb 200-1000 20 600 0.95 214212 50-100 407186 30534.265kb 2.51164 0.08114 0.91886
! z2 [: }' g9 x四、讨论和结论) X5 G2 \8 v; v0 W& l7 c
3 r: y- g5 y3 M程序运行方法; O7 E" |0 i8 S" }0 E7 I, G% E1 C
# X8 H6 i% f, P0 M在类的构造方法init()中,调整参数。 ; G1 O' v$ A+ x1 R
Averagefragmentlength为片段平均的长度; 2 Q/ R* |' e' W2 W
minfragmentlength和maxfragmentlength是保留片段的范围; $ b: ?" S$ S9 \2 E% w
cloneRetainprobability是克隆的保留率;
" p* D+ U$ \+ Y. Pminreadslength和maxreadslength是测序reads的长度范围
) E" [ z/ [. S* }5 h
$ u$ K4 D$ {1 i/ |模拟测序的诸多方法都封装成了Sequencing类,只需要创建类,并调用singlereadsequencing()和pairreadsequencing()方法,传入文件名的参数即可。! ~; y6 N) H! H. r2 s/ O/ \
; U4 C& Y0 v# c0 }% G- I" G( H
附录. D: }6 A1 P5 Z9 K; U
& @# l* C4 l- N( @from Bio import SeqIO$ l% C. z) \! S0 Z9 g6 Q$ `; ~$ O
from math import exp
% K( S1 B: [5 D" Y# u. bimport random% N; d+ A. W7 m0 O
- V5 t2 D' u c/ gclass Sequencing:
5 ~ w$ i: `$ j' m # N代表拷贝份数! q! e, l* U- q, M( c
def __init__(self): V' f/ D% d' p' ~. \4 H4 W* O' U
self.fragmentList = []
6 X4 `( ?4 Q5 X0 n2 n" r! Q6 t( \ self.readsID = 1
8 J' j$ A0 \( J( w self.readsList = []
( A& E6 ]' P( z8 m4 c4 Z self.averagefragmentlength = 650
' c3 W3 G2 U( l2 `# ^ self.minfragmentlength = 500
" i. O& O- y+ | self.maxfragmentlength = 800% I! n4 d& d, K4 } e
self.cloneRetainprobability = 1
" Q9 N, k& F' e1 y1 v self.minreadslength = 50 o' t. N9 h1 r0 y) I5 s# s
self.maxreadslength = 1502 }0 m9 b7 W4 I/ ~; I7 H1 T5 z+ R
self.N = 10% R+ @- X5 f+ }- `( f, q9 i, C
self.genomeLength = 0
# ]6 b6 }" y- O self.allreadslength = 0
7 A% L y+ ] ]; t7 }
+ }3 R: k, f0 o2 K # 生成断裂点
" A& Z( r; |: `: Y* D' y8 J1 Q def generatebreakpoint(self, seqlen, averageLength):
( i) }5 \% g1 S2 p1 B; X7 U0 z # 假设平均每500bp 产生一个断裂点(averageLength = 500),通过随机函数生成seqlen/500个随机断裂点(1到seqlen之间的随机整数) M. c0 H, r' t; R2 k
breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))]( y! o. _; }1 T; B2 C$ u
breakpoint.append(seqlen)( n4 Y- J- E5 O$ H- |* A
breakpoint.append(0)
& M6 r0 t( e1 u; c N! _* X. { # 把随机断裂点从小到大排序( ~2 u8 ?# U/ L5 B# `$ q; w8 Q
breakpoint.sort()5 P7 D' d1 d# _- {+ N' ~- j9 K
return breakpoint
8 b$ H& H6 @' r% h' b
" X: H: ]0 P% l, w* F' y # 沿断裂点打断基因组,并删除不符合长度要求的序列片段,定义片段范围:200-1000bp1 r( G5 X+ U3 n4 M4 B1 a0 ~
def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength): y7 c/ d# r; k2 O9 ^
for i in range(len(breakpoint) - 1):
1 w* \) ?9 X8 a& z fragment = seq[breakpoint:breakpoint[i + 1]]
& q* }# W6 V# i, m3 z if maxfragmentlength > len(fragment) > minfragmentlength:
" P/ D0 [, [& t7 A self.fragmentList.append(fragment)
5 J7 i6 P! c8 L3 p& ` return self.fragmentList. W; s2 i# m# j/ V+ Y
3 j8 U! T) J% ^! ~2 q/ y. z
# 模拟克隆时的随机丢失情况,random.random()生成0-1的保留概率' s: [, S9 L8 V. o7 i' ^
def clonefragment(self, fragmentList, cloneRetainprobability):) x4 t; P4 L, J0 i& [ O2 G `
clonedfragmentList = []
, z# P9 a, j8 \1 y5 X2 u Lossprobability = [random.random() for _ in range(len(fragmentList))]
9 N/ v1 x# ~6 o3 M9 z; z/ B- E* B for i in range(len(fragmentList)):1 H3 C/ E9 C3 w/ }. f+ C' k3 Q
if Lossprobability <= cloneRetainprobability:( W7 W% W; _3 j" }7 h6 O& n. n; i- L
clonedfragmentList.append(fragmentList)+ A& e! m1 }% E% a4 v
return clonedfragmentList
/ J0 k3 `4 x% t+ D
! d$ m" G/ b9 g' Z6 s/ [ # 模拟单端测序,并修改reads的ID号
, h7 j1 G5 [1 d, S def singleread(self, clonedfragmentList):
7 f& {, S+ z0 }* s7 [ for fragment in clonedfragmentList:
& F+ K8 `/ ?) m7 o( P3 i: d" E) q fragment.id = ""
3 }6 S; r3 C1 G6 F; A# w% H: G, z* R fragment.name = ""% _& R; t# o- i$ i0 {
fragment.description = fragment.description[12:].split(",")[0]: N& b1 v9 s c2 x2 G
fragment.description = str(self.readsID) + "." + fragment.description5 S; {6 a. _8 A) V1 }1 ]9 }2 B
self.readsID += 18 C% a$ q7 N- I( G
readslength = random.randint(self.minreadslength, self.maxreadslength)1 w) X; c* T6 q9 F5 p
self.allreadslength += readslength
! \5 i$ E2 \& R; x& ~" C" b self.readsList.append(fragment[:readslength])
5 u R/ B- M7 C; d7 }% g; R+ N' n8 o& F/ E: D( ~3 i2 d4 q
def singlereadsequencing(self, genomedata, sequencingResult):
, v0 G( e) V8 W for seq_record in SeqIO.parse(genomedata, "fasta"):
+ [: I( U- {! F a. f seqlen = len(seq_record)& P$ h- f1 x& K4 [
self.genomeLength += seqlen5 [, K1 w/ s9 Q. Y8 H
for i in range(self.N):
. R- O/ J S4 N1 s # 生成断裂点
9 W4 g7 ?" w8 c9 L& E# z. Z breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength): r5 X9 q$ ^, h) ~1 d; ]- A" a4 x
# 沿断裂点打断基因组
/ P. ~$ l; g( ?+ k self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
0 ^3 p; a3 J' j+ U # 模拟克隆时的随机丢失情况
" O: ?, W3 Z/ E. J- S clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)9 i( [! b5 x$ w4 k- A, @) I6 x
# 模拟单端测序/ ]7 v5 b1 O- r+ k- s. W r4 }
self.singleread(clonedfragmentList) V& u* s3 f/ q* ~) S8 r! k
SeqIO.write(self.readsList, sequencingResult, "fasta")4 Y: P$ t' k' ~$ ]0 d/ H
/ X/ {1 R! k) @8 {& X
def pairread(self, clonedfragmentList):; ~$ }- w: ?! e* p
for fragment in clonedfragmentList:
- p) d+ l# |( a fragment.id = ""
! v$ D% b& q, V |3 ? fragment.name = ""
K) V8 z4 o* P$ @( t% M description = fragment.description[12:].split(",")[0]
8 V3 }$ l, @) B: R8 S fragment.description = str(self.readsID) + "." + description- B, O9 a1 Y& W3 f
readslength = random.randint(self.minreadslength, self.maxreadslength)
# G! x- M2 q1 n, y& q& Y& k self.allreadslength += readslength
6 F+ ]8 A" E' U4 p self.readsList.append(fragment[:readslength])! z _& _& F, ^7 K/ D' s+ j
+ |2 s9 [3 d) k
readslength = random.randint(self.minreadslength, self.maxreadslength)0 v1 w' m1 e+ }" c# t0 k' o2 G8 Q
self.allreadslength += readslength
. r3 A9 w5 `9 X' b; [; J, Y% e( [' e* a- r- B4 F' J& L& ?
fragmentcomplement = fragment.reverse_complement()
( y. ^2 |5 N+ Z9 W& x/ V fragmentcomplement.id = ""
1 ?2 u* X" a0 v1 J& i0 q% n) I fragmentcomplement.name = ""0 B+ a. W0 e, q. K1 Q r; H# t
fragmentcomplement.description = str(self.readsID) + "." + description( G, Q) y; `1 ^) [1 E& V0 { Q
self.readsList.append(fragmentcomplement[:readslength])
, n# t# B4 f0 N3 k) q
( b0 v/ n( z* D5 z1 `8 ]5 u$ c self.readsID += 1$ h! `+ v- @1 k3 N
& r' d# b ~& }/ c6 e; a
def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2):
% w# U, K" p9 A8 G9 q for seq_record in SeqIO.parse(genomedata, "fasta"):" S3 F' q# R$ a# N X2 }, ]' U
seqlen = len(seq_record)6 a& C4 L. [( F0 N8 \9 T
self.genomeLength += seqlen2 a( j5 s& Y( j0 W0 c
for i in range(self.N):1 s2 ? H: B9 g( N8 E d" {
# 生成断裂点
+ Z0 s, n* y) c* K# H' l$ }7 X breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
$ @; L" e3 ?8 n. h% J5 ^7 G # 沿断裂点打断基因组: E! s) J0 G5 w- j, X
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
6 D: H e3 z& @; d # 模拟克隆时的随机丢失情况/ O2 n# ^; E0 O5 ^
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
3 R% `' x1 X+ Q9 U # 模拟双端测序
, \% \# D) a* [# c" d9 G, ~( r self.pairread(clonedfragmentList)
C, S4 p6 c2 A readsList_1 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 0]- d# J4 Q1 t0 k. F
readsList_2 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 1]
7 \; A+ Y" A. J9 }. W* i7 E( z SeqIO.write(readsList_1, sequencingResult_1, "fasta")8 H/ ?+ x; Y. d9 B& p0 q
SeqIO.write(readsList_2, sequencingResult_2, "fasta"); b$ s- D+ u. h9 b; z: f# T5 s- \
) @# j$ y# z4 y% J. |% W$ i
def resultsummary(self):
5 q" U7 R2 A* Z3 [ print("基因组长度:" + str(self.genomeLength / 1000) + "kb")
3 a. N6 T3 \& w9 |3 p print("片段长度区间:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength))
. p' X3 ]! k E" L$ m1 ~ R print("N值:" + str(self.N))
4 r' W" L3 c2 s- a print("期望片段长度:" + str(self.averagefragmentlength))
p% k) W0 l: R7 Y2 a" {! s8 G print("克隆保留率:" + str(self.cloneRetainprobability))
, ^# [& `8 U: {, Z5 V- D8 o print("片段数量:" + str(len(self.fragmentList)))
: X% [: n \1 l5 @$ Q! i print("reads长度:" + str(self.minreadslength) + "-" + str(self.maxreadslength))4 W- i" m1 n$ g7 ]
print("reads总数量:" + str(len(self.readsList)))
2 A" }0 h+ s8 s* a' v9 V! l print("reads总长度:" + str(self.allreadslength / 1000) + "kb")+ H3 c2 ^7 O. N1 N
m = self.allreadslength / self.genomeLength
# s D2 [8 u2 S ^ print("覆盖度(m值):" + str(round(m, 5)))( Q! c1 I, x; l4 R& C
print("理论丢失率(e^-m):" + str(round(exp(-m), 5)))
/ c& N, E3 J; b* v print("覆盖率(1-e^-m):" + str(round(1 - exp(-m), 5)))
+ ]: @9 b5 K8 m& e W# -------------------------------------------主程序-------------------------------------------3 X5 }8 O4 ~" h7 o1 E
# 模拟单端测序
$ _4 ]3 L0 x/ c a; I! XsequencingObj = Sequencing()6 n" @" I- V$ @* z
sequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")
B0 z4 j8 M) I' C5 W; Z% ZsequencingObj.resultsummary()! @: u4 h" V {
; w$ Y/ n$ s1 N6 F& ]6 I( |# 模拟双端测序
0 D1 ]% g! V( G9 p/ }7 T5 F. o, ^sequencingObj = Sequencing()' F, U& U( @, M2 w7 C! C; K$ B
sequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa")/ q1 @4 @ V' K, i5 v$ s
sequencingObj.resultsummary()
6 t$ A. Z3 m1 `0 A X$ o$ |. x" ifrom Bio import SeqIO
7 g% K7 B8 g8 G l2 Y# n) J3 `from math import exp0 ?# u' M) U. B3 G' A( W
import random' [5 i" H$ W. h8 x2 }
0 @' y" G+ h$ `/ ^* B7 t
class Sequencing:
6 Z3 ~/ A: G0 p \$ m! C # N代表拷贝份数
1 Q$ v- J! M$ o# U5 T def __init__(self):
1 a4 z" N6 [) @# y self.fragmentList = []7 }. H7 f+ V( A k% |! S, X: ^
self.readsID = 1
0 h8 @; ?2 R4 c" _, Y* K9 v self.readsList = []
! d3 s4 K2 @! [* }" W self.averagefragmentlength = 650- R! g) h5 P6 _ I5 j# ^
self.minfragmentlength = 5005 G1 q- F, q# l3 c: ^: C7 Z
self.maxfragmentlength = 800- f$ i& h$ k: [
self.cloneRetainprobability = 1
8 ~0 }) N$ ?; Y* g% L. @9 V2 K self.minreadslength = 500 e3 o9 }% s" L2 C0 J! X' A0 T
self.maxreadslength = 150
! @; ]% P# P3 V/ b3 V+ w self.N = 101 j; t3 p9 q$ K1 \
self.genomeLength = 01 I; O5 L7 w4 ^7 p( r
self.allreadslength = 0
1 q: u% Z! G7 m
; w' l" p9 @; O! z9 F( f3 D # 生成断裂点0 j* y: x) i0 d
def generatebreakpoint(self, seqlen, averageLength):2 y2 L2 [3 ]/ F7 T2 @4 |
# 假设平均每500bp 产生一个断裂点(averageLength = 500),通过随机函数生成seqlen/500个随机断裂点(1到seqlen之间的随机整数)
. j7 u( b1 w U) N7 u; ^. z& s breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))]5 x! k6 x8 }5 b# z$ U+ }# q
breakpoint.append(seqlen)- {: B8 D5 m& S
breakpoint.append(0); M+ @2 s7 `6 c3 b0 F5 B1 m
# 把随机断裂点从小到大排序4 V. |0 R8 _+ z+ L- L1 ?7 l$ q0 d0 x8 R
breakpoint.sort()3 R" l9 F7 f, o
return breakpoint2 Y; t5 ^, e# d
1 ?4 Q3 a; `/ H7 r # 沿断裂点打断基因组,并删除不符合长度要求的序列片段,定义片段范围:200-1000bp4 J1 W# m; C, E# T6 s; v
def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength):$ ~4 L1 V0 ? e4 V0 c+ t* t
for i in range(len(breakpoint) - 1):8 ?. S C( b: U! ]5 I3 f6 ^5 ~
fragment = seq[breakpoint:breakpoint[i + 1]]
; E# y1 D, d" T& y if maxfragmentlength > len(fragment) > minfragmentlength:( H/ t6 t4 {- e% v1 w
self.fragmentList.append(fragment)
' `. y; X4 P2 W& |- K return self.fragmentList$ @( K+ N$ ?/ P/ j: t( W
+ o/ f5 B; T" F) g
# 模拟克隆时的随机丢失情况,random.random()生成0-1的保留概率
# {. Z* g* b& q# Y def clonefragment(self, fragmentList, cloneRetainprobability):5 F( I6 d! K; c: o, Y
clonedfragmentList = []* U9 W* B; z+ [3 ~' I- r- p
Lossprobability = [random.random() for _ in range(len(fragmentList))]- x2 c6 J) _4 T8 i) i, [5 q
for i in range(len(fragmentList)):
% Q, `, g1 b& Z5 P- G" g if Lossprobability <= cloneRetainprobability:- j! J: L) e: S# D! p
clonedfragmentList.append(fragmentList)
4 y" b8 d. J4 G, v0 D# m return clonedfragmentList2 h' S% q7 h Q1 m& G% ?
2 U) p; n/ }& T
# 模拟单端测序,并修改reads的ID号
$ ?: T# \3 [, f8 Z) g def singleread(self, clonedfragmentList):
. e& Z+ _( E# \; l, q( | for fragment in clonedfragmentList:. o0 d; i& U% ?
fragment.id = ""
2 c% r. I+ x; Z fragment.name = ""( C" m+ V: ~5 y4 [2 p
fragment.description = fragment.description[12:].split(",")[0]
% J# l1 H. r* O: e, f+ N fragment.description = str(self.readsID) + "." + fragment.description
* [ ~ t/ o3 C) `0 A) Q" p self.readsID += 13 C k# N$ H) _! M6 e+ M7 G
readslength = random.randint(self.minreadslength, self.maxreadslength)
$ w* N; j1 {3 }, P self.allreadslength += readslength
' K& {3 ?1 H3 r8 b/ _: t0 [ self.readsList.append(fragment[:readslength])
5 L+ [) f; |" v+ Q9 N' f6 ]; j8 I
\0 ^7 R3 \8 O& X def singlereadsequencing(self, genomedata, sequencingResult):
, E3 w' I$ j" G! @ P for seq_record in SeqIO.parse(genomedata, "fasta"):/ G4 d2 K; Y/ A& W6 R
seqlen = len(seq_record)
' i5 R- @0 z0 n self.genomeLength += seqlen5 E/ I/ T4 @, q0 i' t# R
for i in range(self.N):" g$ ?8 c# o' E3 R7 d2 z
# 生成断裂点5 l: P- n0 H# b- W1 k" s' ?
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)2 W" [4 Y$ \5 C' t2 E/ c3 W2 o
# 沿断裂点打断基因组
1 p, Y& Y# {' e/ m5 `$ Z+ u self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)4 K4 [: f: |6 F& A' {) l* W
# 模拟克隆时的随机丢失情况
0 N7 }4 l0 w) ]1 r1 m u2 p clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)8 f4 b: S0 e* O! P3 o C' K8 G
# 模拟单端测序$ K, z" ]# _$ i: l3 d2 F
self.singleread(clonedfragmentList); w g& u$ T' R
SeqIO.write(self.readsList, sequencingResult, "fasta")
" x: `: e! f+ q9 v
7 a! O: m" C: g def pairread(self, clonedfragmentList):
6 h0 X/ X7 g9 j for fragment in clonedfragmentList:! x( G4 O) V' Y) e1 b
fragment.id = ""( K$ ~+ j) @* U1 ~ @
fragment.name = ""
8 V+ p x; V ^$ f$ m( f7 O description = fragment.description[12:].split(",")[0]
4 R( }0 G5 m% A fragment.description = str(self.readsID) + "." + description
" J' b; j* |" k# o9 |4 R readslength = random.randint(self.minreadslength, self.maxreadslength)
; }, D$ w; r" y2 d self.allreadslength += readslength
. B% ?# i6 f4 u& V: z# T2 \8 w1 T self.readsList.append(fragment[:readslength]) o+ [) j0 _# Z8 ^
7 L/ F# ~0 J2 [3 G6 f, E
readslength = random.randint(self.minreadslength, self.maxreadslength)
& k5 m1 y5 N: a. b self.allreadslength += readslength
2 L: V* ]. U& O; ]5 D& l$ `$ p( S4 C5 `- v1 H
fragmentcomplement = fragment.reverse_complement()
( a9 t/ T% v6 Y9 `/ z8 S5 d2 n fragmentcomplement.id = ""
L3 ~! [& g6 k fragmentcomplement.name = ""
/ @( B0 y) a- u- H1 L$ C+ x, { fragmentcomplement.description = str(self.readsID) + "." + description' T4 I2 T; p# o
self.readsList.append(fragmentcomplement[:readslength])
! V8 J5 O8 v/ P2 e: L5 [, l# Y$ N( A n3 a+ c' z" i( `% T
self.readsID += 1' O ]& V [0 i# [1 u
3 x' `, o2 K% a$ n def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2):9 v1 y. |$ b' C8 K
for seq_record in SeqIO.parse(genomedata, "fasta"):+ d" Q [$ R2 V# q& U6 t: L' h. |
seqlen = len(seq_record)$ b# W& _8 T5 h
self.genomeLength += seqlen
" A' {) o5 K, o6 t4 m' C- @ for i in range(self.N):
: u- B2 e2 I1 V; k) Z # 生成断裂点$ @0 }: i2 c8 ~" {
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
" F7 x5 U& `# R6 y; N4 g$ I # 沿断裂点打断基因组
3 F& s$ Q5 Z; e* X- H, [: H self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
3 I! A% V+ R- J6 J/ U # 模拟克隆时的随机丢失情况
, s5 h' @# _+ m8 `+ R- Q4 J% F clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
# U+ q& ]6 ?3 h% g2 ~2 [) Q # 模拟双端测序
1 _* {" C/ b5 e, x3 U9 O( C self.pairread(clonedfragmentList)" Z6 y: ?0 @/ H0 ?- G$ G
readsList_1 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 0]7 W3 F. ~% i5 q
readsList_2 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 1]# A( @9 Y2 u' W/ }0 K
SeqIO.write(readsList_1, sequencingResult_1, "fasta")/ [ [& H$ o( W% M' e
SeqIO.write(readsList_2, sequencingResult_2, "fasta")
# C: N8 a7 C/ t( G
. N8 Y9 B! a/ W8 i def resultsummary(self):
9 I. ?* r) o! ?- ~ print("基因组长度:" + str(self.genomeLength / 1000) + "kb"), t5 P q- a1 J; G. @+ i5 J
print("片段长度区间:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength))
) m# b3 y9 _, ?% _ print("N值:" + str(self.N)). N. I: S. ~4 K, O) g
print("期望片段长度:" + str(self.averagefragmentlength))! e. s3 L: D) X$ K3 S) }+ r
print("克隆保留率:" + str(self.cloneRetainprobability))
4 L3 M& s- Z# N) [* w print("片段数量:" + str(len(self.fragmentList)))
/ S- T1 Q/ S; ` print("reads长度:" + str(self.minreadslength) + "-" + str(self.maxreadslength))' o* i" O: q6 i1 V; w* W
print("reads总数量:" + str(len(self.readsList)))- Y4 x9 k M% v1 ~
print("reads总长度:" + str(self.allreadslength / 1000) + "kb")
$ f* x& g+ h& K# b. d6 y m = self.allreadslength / self.genomeLength& z/ v+ f/ ], d' \- B
print("覆盖度(m值):" + str(round(m, 5)))
, [( d3 h, C5 Q0 W( e+ Y print("理论丢失率(e^-m):" + str(round(exp(-m), 5)))% e, W( U0 @$ }4 `" [+ A& V
print("覆盖率(1-e^-m):" + str(round(1 - exp(-m), 5)))
9 Z8 N/ [1 ^' w0 x: h Q s: Q# -------------------------------------------主程序-------------------------------------------4 p0 t" F7 {5 j/ O4 {: @. Y5 m
# 模拟单端测序* ^0 C" e4 l- z
sequencingObj = Sequencing()9 ?% F3 [9 D) N, U5 L
sequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")9 J2 p& G$ |) z- n2 \
sequencingObj.resultsummary()
5 f' B9 M, v6 m% J0 h, P8 t7 Q" A
# 模拟双端测序8 n m( f: z" ~7 W. p9 z0 L4 X
sequencingObj = Sequencing()
: s2 @2 p! h- UsequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa")
/ O" h& w+ Q: U. W. ^0 w0 ksequencingObj.resultsummary()+ U- W- H1 |: S. F/ o
. B/ l( R; \% ^: G
2 Q' H0 f3 |2 p3 j: g c- X2 v
( ~; N/ k; g, J2 H" g+ Z/ k [" s . Q D1 `2 `: P+ W1 k8 o: j& B
|
zan
|