- 在线时间
- 1630 小时
- 最后登录
- 2024-1-29
- 注册时间
- 2017-5-16
- 听众数
- 81
- 收听数
- 1
- 能力
- 120 分
- 体力
- 544105 点
- 威望
- 12 点
- 阅读权限
- 255
- 积分
- 168610
- 相册
- 1
- 日志
- 0
- 记录
- 0
- 帖子
- 5324
- 主题
- 5250
- 精华
- 18
- 分享
- 0
- 好友
- 163
TA的每日心情![](source/plugin/dsu_paulsign/img/emot/kx.gif) | 开心 2021-8-11 17:59 |
---|
签到天数: 17 天 [LV.4]偶尔看看III 网络挑战赛参赛者 网络挑战赛参赛者 - 自我介绍
- 本人女,毕业于内蒙古科技大学,担任文职专业,毕业专业英语。
![](plugin.php?id=eis_qrcode2:make_qrcode&tid=441184) 群组: 2018美赛大象算法课程 群组: 2018美赛护航培训课程 群组: 2019年 数学中国站长建 群组: 2019年数据分析师课程 群组: 2018年大象老师国赛优 |
基因组测序模拟 G& g$ {: C) w8 b# L! H) I
基因组测序模拟
* ^: B3 f' `0 M( k4 m% c5 F+ ?7 Z# ]" b* t ]( V" O
一、摘要
+ M2 _: O( x% w4 N+ K! `' F% `
. V0 b; q, J9 M- `! l7 n* F通过熟悉已有的基因组测序模拟和评估程序,加深全基因组鸟枪法测序原理的理解,并且能够编写程序模拟全基因组鸟枪法测序,理解覆盖度、测序深度、拷贝数等概念,设置测序相关参数,生成单端/双端测序结果文件
# g' u" R$ H5 K% m" p$ n! i0 R4 ~9 [
二、材料和方法
: X" @5 e9 _- o4 w/ r6 R) Y$ u
! Z4 \, u( h0 Z$ M/ j& P2 B# Q1、硬件平台
8 a! A+ Q: @) c4 \% C( U4 T4 w. R& a( s! D! j
处理器:Intel(R) Core(TM)i7-4710MQ CPU @ 2.50GHz
( Y* f5 x% B2 H/ b/ @$ Y0 Y2 a安装内存(RAM):16.0GB) m" V0 }5 ~; G! G
& ^+ d1 G* k4 N7 `) v, r2、系统平台
. i& e9 D5 L& `9 J" \3 sWindows 8.1,Ubuntu
- I- B4 ^+ H7 h. g' K! Y
) Y5 |' x; e3 }- D% Z6 X1 w" t3、软件平台
5 D% V; t1 d0 }2 A# s
7 Z8 ?0 P. h0 f9 T$ V- eart_454
4 ^& _% ~* E/ r* P' ]# jGenomeABC http://crdd.osdd.net/raghava/genomeabc/% C- O/ u4 e h0 s. {/ N% b
Python3.5
6 Z1 o3 i; q% ~9 s# k1 j5 vBiopython& s, }- l6 \" O- H/ l
4、数据库资源
3 j8 i* U& A: @' `; W* o- j% a2 [1 P9 |
NCBI数据库:https://www.ncbi.nlm.nih.gov/# _9 a; s7 g4 K( t3 v
0 `7 }6 z6 [& g+ M; o2 j0 T
5、研究对象# M3 I8 ~0 h& r* @0 v2 `$ [
- \4 h: u+ s" F1 @* Q+ t: J
酵母基因组Saccharomyces cerevisiae S288c (assembly R64) n. @% w- d9 w2 C# w
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz" k( L( E/ L( t. L
& P! f9 O, [# D+ ]. i* l3 ]
6、方法4 k0 f0 |8 k# V4 z- \% A# J: S& G5 U
3 g3 m) W+ o: V/ [4 \. U ~3 Xart_454的使用 . ?, u& i( G$ q) c$ w, p) l2 ?
首先至art系列软件的官网,下载软件,在ubuntu系统安装,然后阅读相关参数设置的帮助文档,运行程序。0 e: F. V$ Q6 z
GenomeABC
" l' s+ v" K5 o! i' h7 ~& B5 g6 k进入GenomeABC(http://crdd.osdd.net/raghava/genomeabc/),输入参数,获得模拟测序结果。0 T$ ?) T3 a* @2 d9 u: R: q6 `
编程模拟测序 * B+ Z* ~+ L, N
下载安装python,并且安装biopython扩展模块,编写程序,模拟单端/双端测序。4 C! K, [6 u. N! h
三、结果5 t/ R' _5 a( L# S
/ P2 B2 Z8 f2 w; t( ~, s- Y
1、art_454的运行结果
. ]2 h' l. [5 m
) ^2 z! e; p' g7 ~无参数art_454运行,阅读帮助文档
- o0 z5 x( [7 A; I
, }) @# b' Q7 S+ ^ [图表 1无参数art_454运行 # x# Q4 H4 @5 @* a; p# f
对酵母基因组进行基因组单端测序模拟,FOLD_COVERAGE设为20,即覆盖度为20. ; p& I' J$ ]0 P: P% C
下图为模拟单端测序,程序运行过程及结果
8 A* A0 q/ k1 X/ P% m4 I8 V: L9 R2 i- L5 S) s3 r
图表 2 art454单端测序
8 u% F: }$ w5 b- |% \
8 `( ~8 g4 b2 y3 [: q6 `) B图表 3 art454单端模拟结果
$ w6 W" {0 z7 V+ w+ N# a双端测序模拟,FOLD_COVERAGE设为20,即覆盖度为20;MEAN_FRAG_LEN设为1500,即平均片段长度为1500;STD_DEV设为20,即长度的标准差为20
7 M1 R+ r8 R' P- D! q# }& N下图为模拟双端测序,程序运行过程及结果
5 O! C/ b) H; s5 Z7 _& Q) ]+ f3 [/ h8 n$ C
图表 4 art454双端测序 . k% Q- S8 h+ g! n+ n9 H
% C# r0 B% i) v0 c4 o, _, x' }图表 5 art454双端模拟结果
2 U- }5 p" P3 r8 }) |$ J' `4 t ]2、GenomeABC 3 ~, {( Y3 a- N3 C9 o& Q
下图为设置参数页面 $ x' @9 J7 `, @" s: i
3 L: F% y) n! b9 G! d) t* f下图为结果下载页面
8 d B! N D6 X9 U- D) ~5 j+ Y; C9 m# D" g9 k0 k( M
图表 6 结果下载页面
) r: L, u' E; H+ }3、编程模拟测序结果
8 U7 r4 D6 ]$ l$ z3 N拷贝数是这里的N值;覆盖度是m,测序深度是宏观的量,在这里与覆盖度意思相同,就是测序仪10X,20X。
- S/ r: K( \. l" @! s单端测序
) N0 q% U) R0 n, h( F- H, V' D$ W$ M: N0 j* q$ b) w* D' O" O, c. R1 K
图表 7 程序模拟单端测序 . R/ k+ _- F8 q2 G# U9 ?
双端测序
7 p- g* l( g3 B2 _: k" o
+ u& ]5 B7 V9 t) F9 R1 j图表 8 程序模拟双端测序
4 p0 [+ c0 [* ~" p测序结果
8 m8 e9 \; Y1 [% ]2 |9 N$ {& k% E0 \8 Y6 I- [; `
图表 9 结果文件, a& @3 Q- s0 ~& Z: f% w, K
2 d$ s) ~& \0 l$ K因为期望片段长度是600bp,在片段长度区间200-1000bp内,所以大部分的片段都没有删除。
) M y- V( N% a @6 l* k测序结果统计表
1 |# W- n4 h+ L& k% b$ [
; |& v0 B5 F6 b9 i测序方式 基因组大小(bp) 片段长度区间 (bp) N值 期望片段长度 克隆保留率 片段数量 Reads长度范围(bp) Reads总数量 Reads总长度 覆盖度(m值) 理论丢失率(e-m) 覆盖率(1-e-m)
1 T$ \9 f: k, V' v单端 12157kb 200-1000 10 600 0.95 107378 50-100 101968 7645.541kb 0.62889 0.53318 0.46682
5 g$ l& g# |3 H单端 12157kb 200-1000 20 600 0.95 213722 50-100 202996 15227.882kb 1.25259 0.28576 0.714246 H. z+ p) I- }( I& Y
双端 12157kb 200-1000 10 600 0.95 106704 50-100 202770 15212.662kb 1.25134 0.28612 0.71388
5 r0 q8 B- V" f! S: a) I: D$ m双端 12157kb 200-1000 20 600 0.95 214212 50-100 407186 30534.265kb 2.51164 0.08114 0.918862 u9 ^- {% v- ?( O
四、讨论和结论6 l& ^9 P+ `- A9 n# L
/ T8 l% m8 i- K6 q' A
程序运行方法
, S! D0 D0 B: f& Z3 K( V3 T( f5 l4 w. y# f( d& H0 [
在类的构造方法init()中,调整参数。
# T- O' A9 D0 @0 w3 U6 y$ Y fAveragefragmentlength为片段平均的长度; 8 L/ }$ G9 I& n& N/ {' \& r
minfragmentlength和maxfragmentlength是保留片段的范围;
) B8 H9 x4 `! O0 @+ z2 B2 J* \( FcloneRetainprobability是克隆的保留率;
+ C- S0 n* f: x" }4 uminreadslength和maxreadslength是测序reads的长度范围
- {( W% A$ D9 u& I/ u E1 d3 x# ]) H, a7 v' d: f
模拟测序的诸多方法都封装成了Sequencing类,只需要创建类,并调用singlereadsequencing()和pairreadsequencing()方法,传入文件名的参数即可。
" m( n$ u; f. }) k. O; y% O. p
d/ }- i+ D# X; b$ J附录
( o; O( ]7 M: P S( R) ?+ F' B3 \0 ~; w+ S, f
from Bio import SeqIO& {. i+ j3 S( m+ K ^$ g# X
from math import exp$ B- U7 Z, _# o0 P/ \
import random
1 u2 _! i6 @: r% z3 w# \8 @$ _) J) _, V4 _5 E( m
class Sequencing:
K; ~; I$ I8 R4 @* l; q/ K V # N代表拷贝份数
: L5 U9 G: }! \* c def __init__(self)
( q0 X" A& X8 R: d2 `( C5 I! O+ @ self.fragmentList = []
5 t, M3 h6 _: c9 f) G+ S self.readsID = 1 m0 ~4 J( ?& X, l4 ^# U
self.readsList = []
( \% u) e l; O& m self.averagefragmentlength = 650! I. X: D9 N Q$ z7 }
self.minfragmentlength = 500
- X1 ?$ C0 i. F( G# n5 K self.maxfragmentlength = 800, j+ B5 h; D# D9 E
self.cloneRetainprobability = 1
/ c( B0 O1 ]6 |: ?0 W. V self.minreadslength = 500 ?# L6 w9 v/ b0 B, x
self.maxreadslength = 150
9 {. q( j! U3 `/ a& l: ` self.N = 10 {6 C, O; c5 n4 j' f8 [) r
self.genomeLength = 0- t6 Z8 _ w0 H3 D) a9 c
self.allreadslength = 08 j: @& r* b- s, ~1 y( Y( Y- P
( E) q G. d% S' X
# 生成断裂点
7 {! B0 n6 ?0 a3 P3 T def generatebreakpoint(self, seqlen, averageLength):
G6 d* F6 k) W# ~$ X # 假设平均每500bp 产生一个断裂点(averageLength = 500),通过随机函数生成seqlen/500个随机断裂点(1到seqlen之间的随机整数)
( n5 q$ s8 T, v3 r8 W breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))]
9 H7 @0 V/ J' k, J; ` breakpoint.append(seqlen)! K1 e# a' z* g9 U! j- y
breakpoint.append(0)
0 |. o! \' U: S! T" b+ H& s9 i" P # 把随机断裂点从小到大排序" p7 K: v$ s9 b) i& ^7 x- p* d
breakpoint.sort()7 N9 B" w1 ]. P2 b9 ~; K4 T# q, b6 u
return breakpoint
v1 d3 o" \3 k/ X6 Y) V5 v* _: F; {8 W( v+ |" C N2 z
# 沿断裂点打断基因组,并删除不符合长度要求的序列片段,定义片段范围:200-1000bp- q9 ?; B! V. e5 \1 S' N5 o
def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength):
! ~4 R( d! I+ x3 D, | x for i in range(len(breakpoint) - 1):
% v/ E5 t, ]% t6 O6 X. q- M% u fragment = seq[breakpoint:breakpoint[i + 1]]! Y- R0 P4 T3 R5 Z6 S( a* n
if maxfragmentlength > len(fragment) > minfragmentlength:" U: B V( Q, N% k
self.fragmentList.append(fragment)! ~% E6 r p2 G& W
return self.fragmentList
) d: n+ R0 W/ G, M, I7 M2 B: B: B; l
# 模拟克隆时的随机丢失情况,random.random()生成0-1的保留概率" R: B9 l1 {! ~
def clonefragment(self, fragmentList, cloneRetainprobability):, x* ]' j. E5 e) t2 T
clonedfragmentList = []
6 c: C3 G# r! q# g& s1 L8 I" c Lossprobability = [random.random() for _ in range(len(fragmentList))]" U- | O* n3 q, W7 w
for i in range(len(fragmentList)):) z; x. ]- e: H9 w: {: V" }; E7 m { q
if Lossprobability <= cloneRetainprobability:# D6 z2 s" w. m2 g' a$ s
clonedfragmentList.append(fragmentList)2 r! U" G' j3 r# P+ c, V* e2 F
return clonedfragmentList
1 b+ v0 e o5 S) Q. S9 ]4 T+ r
[% G6 {( e) M9 Y # 模拟单端测序,并修改reads的ID号6 A! k0 K; W. |! a6 {" E
def singleread(self, clonedfragmentList):
, e( Z& u' [; v" r8 @- y* M6 ^6 V for fragment in clonedfragmentList:5 \2 m6 B+ W2 ~1 H" C% `3 n
fragment.id = ""
/ Z2 Q9 u2 U/ k5 U2 {; R% { fragment.name = ""7 \" \+ M! s. {: b+ i' y) X5 T+ ]
fragment.description = fragment.description[12:].split(",")[0]
5 A/ Y5 A& u1 s `. p, k fragment.description = str(self.readsID) + "." + fragment.description g% ?1 N- K& `+ s- W- J! D
self.readsID += 1
/ n0 g+ S1 S4 y; S. o' Y readslength = random.randint(self.minreadslength, self.maxreadslength)
& f. H7 g; P. J0 v self.allreadslength += readslength( r2 ]/ k& i1 }; Q$ S# t% }/ w
self.readsList.append(fragment[:readslength])
+ N t/ d/ i1 x. B4 x1 e* P8 x
& E1 _: Z6 n4 p7 x# e0 }6 a; g def singlereadsequencing(self, genomedata, sequencingResult):* _/ E8 p% o& H
for seq_record in SeqIO.parse(genomedata, "fasta"):, l- W9 @, I- P, n
seqlen = len(seq_record)6 l" P' s' t' P$ O# K4 ^% f) e
self.genomeLength += seqlen: d$ X" m( w7 c4 Z% I* Y" Z
for i in range(self.N):$ J) ^" k1 ?* x, W' N
# 生成断裂点
! i& w. |! \- k breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)# e! ^2 E' b4 C" B% _5 c
# 沿断裂点打断基因组
+ O$ {: H' F. \* A, c: M self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)" b' \3 J* F$ Y+ Q4 w" h
# 模拟克隆时的随机丢失情况: V. h" h" R9 n! s8 `8 i8 d
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability) Y7 b* y0 X* `" _* d
# 模拟单端测序
5 p. J0 y' D3 W9 ?- I1 D self.singleread(clonedfragmentList)1 G) C# E6 X3 k1 v2 g7 ]5 e$ J; a! i
SeqIO.write(self.readsList, sequencingResult, "fasta")5 I' a4 |+ W. e; ^6 X) A1 q, _1 w) i/ G* a
; i. B& y% d. l def pairread(self, clonedfragmentList):1 |3 Q) w& Z) u9 [; @4 ^$ o% Q9 B
for fragment in clonedfragmentList:! Y! V1 v+ H3 R4 Y. h. j. C
fragment.id = ""5 g; G$ d( G1 o. ~
fragment.name = ""9 l2 K9 Z T# Q$ F& V& |0 m4 i. y0 [
description = fragment.description[12:].split(",")[0]- E ~) N2 E/ x5 n
fragment.description = str(self.readsID) + "." + description3 i" x/ U7 e9 d" [% O
readslength = random.randint(self.minreadslength, self.maxreadslength)
7 P5 G% y& d% c6 i6 j1 T3 o. T4 s self.allreadslength += readslength+ Q, D. ]$ g5 A
self.readsList.append(fragment[:readslength]), l a7 {. `' j9 C6 i& g! o" H( d
& P4 \; U3 i, I# {5 v4 M+ Q# ]/ _' M6 u' i
readslength = random.randint(self.minreadslength, self.maxreadslength)$ I. k3 n2 o3 a, A
self.allreadslength += readslength
/ N" d+ R% H M( u6 O6 @0 q1 ]
+ C9 v. ]7 n1 @1 C* P fragmentcomplement = fragment.reverse_complement()
% n6 x* d3 a, }1 S( d5 [ fragmentcomplement.id = """ J0 u1 e2 n6 A% {1 s
fragmentcomplement.name = ""
3 u2 z! h5 D t/ e, w1 I7 O fragmentcomplement.description = str(self.readsID) + "." + description
# M! O( m& t9 R1 N+ z self.readsList.append(fragmentcomplement[:readslength])3 L% d5 e- O5 r3 a/ h3 d
# V# v# w# | N+ `0 f* a5 k, a: d* a self.readsID += 1
8 T' e2 }- q9 W+ M& f" ?1 ]+ m; B
def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2):
0 m) a. l7 S. a& M3 f- J) P for seq_record in SeqIO.parse(genomedata, "fasta"):
& ~# B* \( n) T# T2 [* d seqlen = len(seq_record)
# B' `' Q2 l' z" j5 }, G self.genomeLength += seqlen4 c& g, Q$ W/ R3 t
for i in range(self.N):
* Y6 T+ U5 r, v# u1 F" J # 生成断裂点# c3 D3 v: N3 Q+ Z9 P
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
" C( y' J. P7 t# a% l b. w+ p # 沿断裂点打断基因组- A2 J% A/ t( D2 T" q; V! ]/ ^4 s
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
' F. x8 U% K8 @# p' h2 V # 模拟克隆时的随机丢失情况
7 B$ N; r9 o: V" j clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
/ H# W/ w5 ~* a; R # 模拟双端测序
+ f( U) k% }/ f9 f9 J' N3 W self.pairread(clonedfragmentList)
5 `; C1 o2 _9 N readsList_1 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 0]
7 v5 @; h7 V6 y/ g; c2 l( b readsList_2 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 1]; k3 K& A! k( e6 I
SeqIO.write(readsList_1, sequencingResult_1, "fasta")
8 V3 b/ u9 T9 t SeqIO.write(readsList_2, sequencingResult_2, "fasta")
( _5 ^- n! [9 T. S
, c2 p; B, g. }/ L/ x; ` def resultsummary(self):" n2 Q! Q$ s8 l6 E' Y% f
print("基因组长度:" + str(self.genomeLength / 1000) + "kb")$ H/ H [2 T& S% c( u
print("片段长度区间:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength))5 _% c7 j0 g0 Z0 e) T8 \8 b
print("N值:" + str(self.N))3 I5 P) j! {( m5 X; j2 k0 f6 N; O
print("期望片段长度:" + str(self.averagefragmentlength))0 `, H* B0 ?$ L
print("克隆保留率:" + str(self.cloneRetainprobability))
# [3 g( j5 Y! u7 h, {2 w2 v print("片段数量:" + str(len(self.fragmentList))): \* m9 r& x/ w: O+ L3 l% ^! s0 j% n
print("reads长度:" + str(self.minreadslength) + "-" + str(self.maxreadslength))7 _: h. H! ^' N2 r3 z
print("reads总数量:" + str(len(self.readsList)))
" i: Z0 U: K" A1 Q) D; U5 x print("reads总长度:" + str(self.allreadslength / 1000) + "kb")
+ R, F5 w: ^; X$ T1 o. ] m = self.allreadslength / self.genomeLength
: G9 C4 X: I' k; a& p, z; Q print("覆盖度(m值):" + str(round(m, 5))). Q Y6 O! L" e7 b2 C5 F1 ]; }& Y
print("理论丢失率(e^-m):" + str(round(exp(-m), 5)))) C/ j: w; d* \; p' D0 g5 g5 a8 ^
print("覆盖率(1-e^-m):" + str(round(1 - exp(-m), 5)))' N* r0 K8 X/ f4 d& u/ ]
# -------------------------------------------主程序-------------------------------------------1 n. @9 Y8 Q% I+ j3 i; b1 {
# 模拟单端测序- x3 s. p7 M+ k. n% N( c! n+ Z+ p4 i
sequencingObj = Sequencing(); q, l. G9 k3 I9 l# q7 w3 C1 p
sequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")
, |" J$ Y4 ^) g0 x F4 dsequencingObj.resultsummary()
9 c# i8 X8 g' I. C; b3 L5 [0 B8 `2 O* T9 d8 U3 [" }
# 模拟双端测序
2 P# `% y+ r0 W- MsequencingObj = Sequencing()
; l) \1 K2 E, D nsequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa")
: [. m' U" B" ~* {# isequencingObj.resultsummary()4 M4 D! y j+ ^4 L
from Bio import SeqIO
- ~2 I# r* c# }from math import exp, s( G' G0 r: c9 f# {
import random
% g$ A8 P! x7 q8 ^8 L, c5 o9 T* F& C: V( O X+ y8 V) b
class Sequencing:1 L9 b8 b9 h/ A T/ s! _$ H) _
# N代表拷贝份数
2 l9 P" q, F) p( x( H def __init__(self):
1 [1 a; o" V! u5 u& }% F" D) l" d+ v self.fragmentList = [] f0 U1 J3 w+ l$ V% F2 H0 x- C
self.readsID = 1
F/ F% T& a9 w9 j( }" D: { self.readsList = []
7 g# ]# G2 X( }3 ]7 F self.averagefragmentlength = 650! B: t7 E4 O' C
self.minfragmentlength = 500; w: \) u5 o1 R1 {9 S6 L$ b$ W
self.maxfragmentlength = 800
$ g$ d3 I0 {, } I4 k+ @3 w self.cloneRetainprobability = 1
, W A: S. D$ {- g) ] self.minreadslength = 50
' L S( X$ Z/ [+ n1 k self.maxreadslength = 150- A G2 \& B9 n
self.N = 10( j' V5 D9 ?; r! I, X
self.genomeLength = 0
, y! V3 Q& O, b1 Q& \, ] self.allreadslength = 06 D' M5 Z' C9 F$ @ w2 [& N+ u
; }# Q c5 e9 a* r4 L& a7 B7 t" o # 生成断裂点- u' ~% [! |8 J* d2 s1 \' o
def generatebreakpoint(self, seqlen, averageLength):
, ]. t4 s# ~6 G/ y; v7 F4 {, L5 r* x # 假设平均每500bp 产生一个断裂点(averageLength = 500),通过随机函数生成seqlen/500个随机断裂点(1到seqlen之间的随机整数)% G& ]$ o3 U' p; g
breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))]
4 U$ g9 V' K9 Y7 R breakpoint.append(seqlen). M# @0 L! o+ }
breakpoint.append(0)" S4 W8 @, x% h% k) L/ ]- w
# 把随机断裂点从小到大排序
3 R7 t( Q3 w0 P$ B& y& {. H breakpoint.sort()
- L+ O# t! g! {+ O/ w# m return breakpoint, F" i" s5 `6 ?( ]
9 x F. E" U! m- u
# 沿断裂点打断基因组,并删除不符合长度要求的序列片段,定义片段范围:200-1000bp9 `0 m3 @6 y2 A: c! m; h1 V
def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength):6 M% j* Z) t- ^: p
for i in range(len(breakpoint) - 1):/ k: B# n" ?! |& a4 u& _4 j
fragment = seq[breakpoint:breakpoint[i + 1]]
* y; g& Y; S0 ~; z! z8 g A7 G if maxfragmentlength > len(fragment) > minfragmentlength:
5 f0 S/ X1 S Y! a self.fragmentList.append(fragment)" Y0 w+ e& v5 q, s" k( ]6 v9 T
return self.fragmentList' Z8 l$ `! o, f% a
& L: g+ d t2 S2 } # 模拟克隆时的随机丢失情况,random.random()生成0-1的保留概率. P$ J; T' ~$ Y/ o3 U; l- h
def clonefragment(self, fragmentList, cloneRetainprobability):
8 H$ x; ~3 F. i, g0 G clonedfragmentList = []
1 R P4 m9 `& Y/ f Lossprobability = [random.random() for _ in range(len(fragmentList))]
: m$ J3 B. P0 P( r for i in range(len(fragmentList)):
, p3 @% ]$ h+ U3 o2 {4 b$ h- V9 W if Lossprobability <= cloneRetainprobability:7 V& a J P2 b$ F C! H1 p- j
clonedfragmentList.append(fragmentList)+ [" c7 t1 w0 y8 o
return clonedfragmentList
/ i# V x; l; V$ c. v- V: S
$ K! q' `, ^2 c9 O # 模拟单端测序,并修改reads的ID号; b+ Q& X. D/ }7 s" t
def singleread(self, clonedfragmentList):
* k' s. _& z# k for fragment in clonedfragmentList:
4 ^) y7 g/ S* g9 n2 T# _ fragment.id = ""9 c) T" \0 f* v
fragment.name = ""$ g% v9 p% V { ?
fragment.description = fragment.description[12:].split(",")[0]
& t1 q3 v: v/ F, ^ fragment.description = str(self.readsID) + "." + fragment.description* P: }+ j. J' z. j/ m( I- }
self.readsID += 1
' q( Y' j% S, K9 D2 K" R2 _ readslength = random.randint(self.minreadslength, self.maxreadslength)* g: g+ A; N6 s$ r! J5 x
self.allreadslength += readslength
, ^4 u0 ^$ j2 O self.readsList.append(fragment[:readslength])
I$ E. a4 \+ n( T2 p/ s7 j0 q% ~
; ^3 m$ M7 D4 S1 S% o/ Z- \+ {. z4 H def singlereadsequencing(self, genomedata, sequencingResult):
% g8 ~2 b: m7 Z; a9 Y for seq_record in SeqIO.parse(genomedata, "fasta"): k3 ^2 H' o o9 a! B8 D: `, P# X! b' M
seqlen = len(seq_record)
R& m0 B/ }: B$ U- \ self.genomeLength += seqlen
4 [' u! w: h F2 A0 h for i in range(self.N):
- n8 t l* g5 I1 l$ k, [0 f # 生成断裂点
6 J( M( G" O- ?& J" M# O breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)+ H# X+ n+ [5 E( `. o7 s$ `
# 沿断裂点打断基因组3 w: R3 B3 F }) }* i# J
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
( x- |+ W% P+ ^" b* B/ o8 U$ v # 模拟克隆时的随机丢失情况
* t+ l" T! ?5 q8 I/ V/ i7 j clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)) U( V4 g! A- a/ w6 P4 x2 b
# 模拟单端测序
3 ^# Q, i* }" \# Q! O self.singleread(clonedfragmentList)
8 ^' i6 A5 ~/ c. ^/ o k ^ SeqIO.write(self.readsList, sequencingResult, "fasta")! x5 o6 C+ C3 W$ n; I9 |
" W! y/ T) Y; }! K$ s4 }- X def pairread(self, clonedfragmentList):) A- Z# C+ F5 ^3 u6 @& \
for fragment in clonedfragmentList:
) t l L6 [/ j2 K& B: A1 H& D fragment.id = ""; L% `! Q k, h9 Q
fragment.name = ""2 U" F/ }% \. E% V% a3 K; Z
description = fragment.description[12:].split(",")[0]
, O5 P# H/ y; r8 D% ^ fragment.description = str(self.readsID) + "." + description
3 M3 L& p( r7 U/ Y. m9 A readslength = random.randint(self.minreadslength, self.maxreadslength)0 A9 @' S A M, ^) b8 D( v' Y8 ~
self.allreadslength += readslength
6 r( f9 ?9 ~- y% e. V self.readsList.append(fragment[:readslength])
7 D+ Z9 E( `4 d! V$ M% G" C4 w1 G2 K: s+ {
readslength = random.randint(self.minreadslength, self.maxreadslength)2 D* g3 x3 q- ^
self.allreadslength += readslength
( i) b- z5 T" I6 f$ Q0 a9 J8 ~; f4 J, f/ }5 Y
fragmentcomplement = fragment.reverse_complement()2 r* I& }; G: S
fragmentcomplement.id = "") B; @3 ~% X2 V8 |& Z9 q( z' E
fragmentcomplement.name = ""
9 ~' I4 e2 c7 y fragmentcomplement.description = str(self.readsID) + "." + description% g8 h# l2 {* g% j
self.readsList.append(fragmentcomplement[:readslength])/ a6 X2 X+ R8 E$ }+ C: `& o+ q! f; B7 j
% U4 [3 a# ?# B: ^8 x1 u/ r self.readsID += 12 M2 y/ w* N+ j" s6 ^+ K
( f& u- E, S7 I" z1 k+ d9 j def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2):
- K9 _7 C4 N& [* \" D% c% F; C6 I for seq_record in SeqIO.parse(genomedata, "fasta"):
& o& c, G: K) V7 l6 B seqlen = len(seq_record)) s* v4 I8 h" n
self.genomeLength += seqlen
0 H0 d1 q3 p0 { for i in range(self.N):$ ?# i- y8 V o+ c; q7 f
# 生成断裂点
7 Y9 y; d! X! O breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
7 R6 S- a6 X! D7 a: t4 S7 i # 沿断裂点打断基因组5 u) D8 b+ [0 w5 a0 D/ i
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
1 {# n" W5 f! o7 u # 模拟克隆时的随机丢失情况9 i% ^9 y o' T# N' k& \1 o
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
3 m% m& Q* H* \8 a2 A9 N% o: I # 模拟双端测序9 G) K* w1 G' ^
self.pairread(clonedfragmentList)
- E+ T0 |/ K( S readsList_1 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 0]/ A4 x5 p* Z) }5 }5 x
readsList_2 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 1]) ?# _3 }# g3 w; D8 Y+ ]( }: l7 F
SeqIO.write(readsList_1, sequencingResult_1, "fasta")
. r& K; c1 z* K6 G) D" Z SeqIO.write(readsList_2, sequencingResult_2, "fasta")
! l: f* W; [0 d) o) F' a# Q# r; m, E" K: k
def resultsummary(self):, g; L: R, I9 {) g, [1 P
print("基因组长度:" + str(self.genomeLength / 1000) + "kb")
) M6 Z& U0 R0 E' ~7 J# ~ print("片段长度区间:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength))3 ~6 L: N! t5 t" A
print("N值:" + str(self.N))8 w. N" p# Z- |# @/ ]
print("期望片段长度:" + str(self.averagefragmentlength))9 b- f+ ^% C1 J9 k6 l
print("克隆保留率:" + str(self.cloneRetainprobability))
- q1 e) x; V4 m8 z& v2 l( j6 C8 U) s print("片段数量:" + str(len(self.fragmentList)))
' p% h- e7 l* X5 A" [# R print("reads长度:" + str(self.minreadslength) + "-" + str(self.maxreadslength))3 F" J3 O, b+ F0 { K8 Z8 g5 b
print("reads总数量:" + str(len(self.readsList)))
, N+ u& P0 s* ] print("reads总长度:" + str(self.allreadslength / 1000) + "kb")1 n. a2 A( u) `; r/ [9 O! m& w
m = self.allreadslength / self.genomeLength `6 E1 I! G" y8 m4 O d: O
print("覆盖度(m值):" + str(round(m, 5)))
- G& ~0 p9 ~6 B* w4 s' x, ?, j. N% ` print("理论丢失率(e^-m):" + str(round(exp(-m), 5)))9 U# B2 H$ [" |
print("覆盖率(1-e^-m):" + str(round(1 - exp(-m), 5)))% g- U6 R$ Q# f0 |
# -------------------------------------------主程序-------------------------------------------: V- B4 A. H2 B- `3 A% H! o
# 模拟单端测序
; }6 _7 u. T7 O$ i$ N6 _# O5 p+ UsequencingObj = Sequencing()
( l3 H! E1 m0 l- M# |sequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")& y* @* K! Y3 p" E% e; W
sequencingObj.resultsummary()
1 [, _- B6 ^: M* Z0 ^) O
8 P. |8 U. C) s8 ^3 ?7 |* l# 模拟双端测序
E: J) y3 G; r$ {5 tsequencingObj = Sequencing()
! k* j, c! r1 h5 V8 I2 CsequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa")4 E0 |/ z: q. f! n1 j3 r
sequencingObj.resultsummary()
, T! w; _+ k! a
9 r, A4 F' d4 M
) k, p1 R9 b. C! N0 p1 f1 h) r w; }6 s1 O# E# W% _# E
8 l/ v( u f5 q% n: [$ R
|
zan
|