- 在线时间
- 1630 小时
- 最后登录
- 2024-1-29
- 注册时间
- 2017-5-16
- 听众数
- 82
- 收听数
- 1
- 能力
- 120 分
- 体力
- 563344 点
- 威望
- 12 点
- 阅读权限
- 255
- 积分
- 174226
- 相册
- 1
- 日志
- 0
- 记录
- 0
- 帖子
- 5313
- 主题
- 5273
- 精华
- 3
- 分享
- 0
- 好友
- 163
TA的每日心情 | 开心 2021-8-11 17:59 |
|---|
签到天数: 17 天 [LV.4]偶尔看看III 网络挑战赛参赛者 网络挑战赛参赛者 - 自我介绍
- 本人女,毕业于内蒙古科技大学,担任文职专业,毕业专业英语。
 群组: 2018美赛大象算法课程 群组: 2018美赛护航培训课程 群组: 2019年 数学中国站长建 群组: 2019年数据分析师课程 群组: 2018年大象老师国赛优 |
基因组测序模拟1 L% q- D, L+ Q: p# l U
基因组测序模拟) _3 k% f x! Q
5 u1 t, {: G; l; U7 s! q; l Q一、摘要. t) l$ u& z# Z, K: o
; {- K( W' G* Z' [9 [* E通过熟悉已有的基因组测序模拟和评估程序,加深全基因组鸟枪法测序原理的理解,并且能够编写程序模拟全基因组鸟枪法测序,理解覆盖度、测序深度、拷贝数等概念,设置测序相关参数,生成单端/双端测序结果文件
8 Z6 e) r S/ L [/ a3 g% O3 ^/ a! }2 u, v$ C3 C$ G
二、材料和方法
. A6 C5 V/ G4 F% Z& F& {& h9 L- l2 j; l2 l v, S
1、硬件平台
) H% I4 F z" D1 s! I8 L& O* F4 _: Y& f- Q* P3 G. z: v: a
处理器:Intel(R) Core(TM)i7-4710MQ CPU @ 2.50GHz
9 T4 g. ?0 @$ r) W" r* r安装内存(RAM):16.0GB
# j5 |) B" I4 k( d; E- L4 W
% f8 f, H% A: j0 H1 T' `& }2、系统平台
4 [1 S, c) a, R% L+ Y. W3 S2 BWindows 8.1,Ubuntu: [4 {- r$ U7 X. Z |3 s. b. r
4 x ^8 D- a# ?: p0 O: e* h/ D
3、软件平台
" C" u' s$ v& Y
* W; ]( U/ B, ]( N% f ^art_454; e/ I i( y5 o h0 |
GenomeABC http://crdd.osdd.net/raghava/genomeabc/
! H6 ?1 s9 K8 Z3 X, U: A8 jPython3.5
`' d6 P1 }7 F1 g" ^- L1 q: WBiopython
7 S% g$ k; C; l7 a- a7 H4、数据库资源* {: N S3 d2 y5 R0 E5 D5 p
5 U( {7 r# D- d/ X0 ZNCBI数据库:https://www.ncbi.nlm.nih.gov/
8 q" ]9 w' f. H/ d6 v2 p4 c- U" E$ D
5、研究对象2 D. k1 N5 X/ u# I
$ V/ |" i+ k0 ^5 f1 Q0 O
酵母基因组Saccharomyces cerevisiae S288c (assembly R64)
" _9 M1 d. o6 c; G: o: Oftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz
! q) O: [3 L6 n5 B) d
* [6 {' G5 N+ i2 \" A8 V6、方法
) j) C; F% x: ^, r( M5 U
( A( E$ g. ^" ~) mart_454的使用 # j8 p" y T' B$ v. N
首先至art系列软件的官网,下载软件,在ubuntu系统安装,然后阅读相关参数设置的帮助文档,运行程序。& B% K1 X& h! x. `( S p! A
GenomeABC
* o6 [9 ~0 H- b" L6 T( C* z1 X" N进入GenomeABC(http://crdd.osdd.net/raghava/genomeabc/),输入参数,获得模拟测序结果。& U: s L" M+ v
编程模拟测序 # ~4 y- A) [- P( ~: x( c( t1 w3 t! p
下载安装python,并且安装biopython扩展模块,编写程序,模拟单端/双端测序。( @$ y- [6 |! Z# f1 M- N
三、结果
8 O U9 h( X" p4 B0 u e9 q: V8 e, T* E9 | \! z- U( T6 M
1、art_454的运行结果# d$ t# G# a7 p- X% G7 R
1 k8 @; a2 A: u, D; |) Q5 a
无参数art_454运行,阅读帮助文档 4 R9 l% U" }* C2 n' f7 n. Z% G
/ n, |4 n) v4 R1 |/ a' [5 z
图表 1无参数art_454运行 1 z* P P' _1 G3 W5 p
对酵母基因组进行基因组单端测序模拟,FOLD_COVERAGE设为20,即覆盖度为20. 3 u6 j, c C5 T/ ~/ F* \
下图为模拟单端测序,程序运行过程及结果
* \1 P& `. S% w, ~: }( m6 S3 y+ s1 W
图表 2 art454单端测序 # g3 t. Q" F0 g2 }$ X6 d" C
) J0 X: O |/ J" B* Y图表 3 art454单端模拟结果
3 X b5 l0 S, k% ?) I2 P双端测序模拟,FOLD_COVERAGE设为20,即覆盖度为20;MEAN_FRAG_LEN设为1500,即平均片段长度为1500;STD_DEV设为20,即长度的标准差为20 7 }; y9 S% u$ n% e3 u" e
下图为模拟双端测序,程序运行过程及结果
- I/ }5 ?+ S) j9 y. q7 `/ i7 J( W' @" h/ i
图表 4 art454双端测序
' `" M& J& X2 a! k* R5 s
1 m+ {5 n% o& y; W) a图表 5 art454双端模拟结果 . B, U+ p$ W) w6 r: o, v5 j
2、GenomeABC ! l; r3 {. M" P! v& D; B$ ]
下图为设置参数页面 9 C/ K; O9 r0 }# q) d1 r
. X: U# {: x' k# Y5 \8 A+ U w0 J
下图为结果下载页面 9 C4 k8 t6 C( q) I3 T$ ~6 }
: `& Q$ p/ F8 A图表 6 结果下载页面
! M' k8 a, t5 R# v: Z3、编程模拟测序结果 ! Z; l( U8 T3 }" P! l
拷贝数是这里的N值;覆盖度是m,测序深度是宏观的量,在这里与覆盖度意思相同,就是测序仪10X,20X。
$ h( f" h S" F单端测序 6 a/ b1 T8 R" g4 Q3 a0 j5 _
& ]- R4 `% `5 L, F8 B ~
图表 7 程序模拟单端测序
) e. \- t- ]' c2 e7 S双端测序
! V+ D% s7 z0 s& s1 H
7 V: g0 l& J* D' W: K7 i图表 8 程序模拟双端测序 - V8 d3 t1 j7 [5 ]
测序结果 * `7 x- \& V9 {& a9 F+ T+ C" a
g' N$ U# S5 R0 J4 M. u7 ^ J图表 9 结果文件
3 Z( t3 ~! W* t9 Y o
G: H' ?6 D* \! u4 v0 w: l因为期望片段长度是600bp,在片段长度区间200-1000bp内,所以大部分的片段都没有删除。
% R& g/ [+ K, s测序结果统计表
3 H s' N9 T' o1 L( @- Z) ~# V1 {) f8 w) D) W
测序方式 基因组大小(bp) 片段长度区间 (bp) N值 期望片段长度 克隆保留率 片段数量 Reads长度范围(bp) Reads总数量 Reads总长度 覆盖度(m值) 理论丢失率(e-m) 覆盖率(1-e-m)
: ]0 d' ]% W4 c8 Y: ?) s4 l单端 12157kb 200-1000 10 600 0.95 107378 50-100 101968 7645.541kb 0.62889 0.53318 0.466822 Z/ b1 E, O# H! W. Z; l- p
单端 12157kb 200-1000 20 600 0.95 213722 50-100 202996 15227.882kb 1.25259 0.28576 0.71424! y) W( O& `4 s; W( O/ d
双端 12157kb 200-1000 10 600 0.95 106704 50-100 202770 15212.662kb 1.25134 0.28612 0.71388
% g( o% P# a5 a. E# ]$ {. `双端 12157kb 200-1000 20 600 0.95 214212 50-100 407186 30534.265kb 2.51164 0.08114 0.91886. A% j3 ~4 E$ y: V* Q- `
四、讨论和结论- [, I5 T/ ]6 z( T$ W3 t: W& K8 e( t
+ A# N- {0 b; y. \程序运行方法; Z0 ~3 p, ^8 a, K4 |. d+ l% [
! u4 n- [& O! p" k' x在类的构造方法init()中,调整参数。
, j; }' t! y+ O! O0 WAveragefragmentlength为片段平均的长度;
) A) f& G, W& m) R. a: H" {; e# ^minfragmentlength和maxfragmentlength是保留片段的范围;
3 N3 p9 ?, u4 c1 G3 UcloneRetainprobability是克隆的保留率; 2 X& h7 `2 e* `4 \2 C
minreadslength和maxreadslength是测序reads的长度范围
& t4 j( s& T! _/ ~1 t5 `
% D9 h. f+ |4 c模拟测序的诸多方法都封装成了Sequencing类,只需要创建类,并调用singlereadsequencing()和pairreadsequencing()方法,传入文件名的参数即可。7 |. h; G$ Q* w
/ ?+ F% G$ K3 d; d) i2 e e
附录1 }/ b/ o) @% ^) P7 Z: t
7 U, O7 K& q; x& v( @: z+ \/ y
from Bio import SeqIO
( V9 C' F" O, G; C; mfrom math import exp: R' ?0 f; u3 T3 S. @
import random" v# }- \& p, B: u* G4 o# z: B
. i! C- g- l+ U1 c% v. G, I! Bclass Sequencing:$ G) h* t4 p3 Y: N* O: g) {6 s+ P
# N代表拷贝份数6 ?" x+ }* o& k, F& @
def __init__(self) H, Z" G1 V5 `% ]5 g: o+ P6 R* @
self.fragmentList = []
8 {: m, `0 }: \, d4 S self.readsID = 1
6 a, C+ A) Q6 f self.readsList = []
- o2 { s6 E; L0 ~8 b self.averagefragmentlength = 6503 w/ A3 d7 \, c; q: L
self.minfragmentlength = 500' V' d- j" ^6 w" c( D' ~
self.maxfragmentlength = 800) J% c5 o1 H( k1 l/ D+ H5 v
self.cloneRetainprobability = 10 w& Y0 ^, |* y9 B7 x" k
self.minreadslength = 50. n4 R* ]6 Q8 ?, b1 m, {
self.maxreadslength = 150$ h! K! ?$ P1 F3 ]3 R8 _
self.N = 10
- Y5 ~- u$ n2 e self.genomeLength = 00 e) L% F3 d$ T4 n! }% Q' L
self.allreadslength = 0( j/ F7 Z9 D( t, y
( x$ C8 [+ y% G t* t) n( I
# 生成断裂点) M+ X! N& k5 J+ h; T
def generatebreakpoint(self, seqlen, averageLength):
% P3 V& D2 t; ~% p$ i* l3 E6 C) l4 p # 假设平均每500bp 产生一个断裂点(averageLength = 500),通过随机函数生成seqlen/500个随机断裂点(1到seqlen之间的随机整数)
0 O7 K% F/ K! m, W4 R breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))]4 @# N! g4 ]" k+ W
breakpoint.append(seqlen)- \$ W, F) J( ?# ]
breakpoint.append(0)
; O/ ~; o8 a8 s # 把随机断裂点从小到大排序
) @, v1 ]- b1 i$ \8 G1 w/ n* X D breakpoint.sort()
: r) j7 _8 P2 ]% V2 H return breakpoint. R, r a' E6 i( g: C: u+ B
9 }+ G# ?. j1 Z/ W4 c # 沿断裂点打断基因组,并删除不符合长度要求的序列片段,定义片段范围:200-1000bp/ z K. l- K0 P# G! X
def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength):- N' R: q8 B. Q% N* A4 C9 Q4 Q
for i in range(len(breakpoint) - 1):
+ x& ]1 o1 `- _5 |' u fragment = seq[breakpoint:breakpoint[i + 1]]
( | T5 ?, P: W$ b if maxfragmentlength > len(fragment) > minfragmentlength:# v0 {1 y( w9 K; E9 B) M3 R
self.fragmentList.append(fragment)9 [. j" `& U. j+ d
return self.fragmentList
: c7 Z u. j( I4 v
* Z# E" V$ q. Y$ F: [ # 模拟克隆时的随机丢失情况,random.random()生成0-1的保留概率
9 N( R1 I8 `5 G% m) W def clonefragment(self, fragmentList, cloneRetainprobability):
/ U$ N; m6 L1 @/ u3 m clonedfragmentList = []
% f0 C4 y: d/ K Lossprobability = [random.random() for _ in range(len(fragmentList))]' b* l- J4 r; |& q) M
for i in range(len(fragmentList)):1 g$ P) d8 {; o& Y- f) H/ k- i
if Lossprobability <= cloneRetainprobability:
% O* h v9 a! e1 G7 L clonedfragmentList.append(fragmentList)! i8 y1 I2 R* T6 {- m
return clonedfragmentList& t% h3 S- e8 Q
4 x5 j1 d% L( s9 H # 模拟单端测序,并修改reads的ID号
- a) u( K0 ^# Q' D2 N def singleread(self, clonedfragmentList):
8 b& Z% X5 L: x* N3 p( n% i for fragment in clonedfragmentList:
1 d7 I1 A6 I) p* h" D8 m$ D& n fragment.id = ""
& V4 V% H/ o' E fragment.name = """ _+ G. \: q5 ?) H6 p! L
fragment.description = fragment.description[12:].split(",")[0]
1 b; u t8 H& u8 S fragment.description = str(self.readsID) + "." + fragment.description6 P2 T7 g+ |9 _
self.readsID += 1
) Q, |- h- X4 F5 I5 { readslength = random.randint(self.minreadslength, self.maxreadslength)5 k, F$ g& T/ ?- A! r
self.allreadslength += readslength
6 J- c8 a8 a* L' J: C' O2 } self.readsList.append(fragment[:readslength])
) Z* B2 B0 w8 z- k' V8 y: i& Z9 W* o0 G! W8 ^
def singlereadsequencing(self, genomedata, sequencingResult):
) o* L. c* b" p) x z& c, U% E/ X for seq_record in SeqIO.parse(genomedata, "fasta"):6 j; W& t1 B9 e2 w8 {
seqlen = len(seq_record): |% B3 ?! o$ z3 v$ ?9 f* k
self.genomeLength += seqlen. E( Z. c! Q2 F$ \+ j+ H
for i in range(self.N):
S8 Y% f! y$ A" o* \+ O # 生成断裂点
) \! A* |/ {4 R+ P, {" f" a breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength) i# S+ l) Q; N. n
# 沿断裂点打断基因组* [' A0 r( A1 ]# c" n* k i2 [
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength); S4 k9 u/ R& r0 x! O9 i
# 模拟克隆时的随机丢失情况
' u) `( l p4 s9 }8 K# A$ c1 ?& W) y9 O clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)- W1 t. |. [0 }5 G
# 模拟单端测序* ] O/ H; {( |
self.singleread(clonedfragmentList)8 x1 Q3 W* d5 h9 {- `/ p
SeqIO.write(self.readsList, sequencingResult, "fasta")- W, D' Q% O5 h% n
9 x0 ~7 B A5 f- e, F def pairread(self, clonedfragmentList):
8 R+ u, N6 l) g! \, R for fragment in clonedfragmentList:6 M: W& {( A* |0 h. S
fragment.id = ""# @$ M& e6 F- D% U- u! `
fragment.name = ""0 z5 K9 g2 g& ?5 s
description = fragment.description[12:].split(",")[0]
# M" v0 e2 u+ i3 O( Y fragment.description = str(self.readsID) + "." + description
) o$ o# q; i& a; w readslength = random.randint(self.minreadslength, self.maxreadslength)& Y( A6 j' \1 J% _
self.allreadslength += readslength
; @) k0 d3 _" Q6 U1 c8 F) q self.readsList.append(fragment[:readslength]). p W7 u7 J6 j* E6 k8 K
5 F" T- ]- E( q! f: Y# k* W readslength = random.randint(self.minreadslength, self.maxreadslength)
( W; b, k% `( d- J! ]+ D self.allreadslength += readslength/ ^+ A! Q, p* h
" D$ E2 h& H, u' C9 {; ] C t
fragmentcomplement = fragment.reverse_complement()
: w+ a+ [, T/ q8 p ]/ A3 w fragmentcomplement.id = ""( ]! u0 O, A7 }' c3 `4 D0 G- d1 b( w; g
fragmentcomplement.name = ""8 d% A/ S. V4 V
fragmentcomplement.description = str(self.readsID) + "." + description+ a9 P2 Z) D k
self.readsList.append(fragmentcomplement[:readslength])
$ W: Z& s. t& E! y: k: H! S6 D
2 j) |& i, z2 M, A self.readsID += 1* L; Z3 F: H. D8 k& o. V. {' |
6 \! w" p5 e& {" U: p0 M2 O def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2):
. I8 l* Q) X8 Q1 n. s, x3 k5 l for seq_record in SeqIO.parse(genomedata, "fasta"):1 I3 Q0 m: V8 v3 N& o
seqlen = len(seq_record)
; e/ q7 c( T* g9 p# H2 M4 O g0 I self.genomeLength += seqlen3 J! @5 E( C' e+ l9 D/ ^
for i in range(self.N):5 F9 X2 R$ p. I2 \' q+ W
# 生成断裂点
2 @8 ]* R1 h/ ~0 I( C9 ~ breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
' l6 t4 }% ]7 U. j% G9 `4 w& D # 沿断裂点打断基因组
; S f, N2 A# o. \ t1 c4 Y self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)" e& v6 R% l7 G
# 模拟克隆时的随机丢失情况, b1 c" e# N& @3 A8 [
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
: ]7 A* `. T n! @ # 模拟双端测序9 ?" s( X E9 r
self.pairread(clonedfragmentList)
3 W l0 S8 w+ B% g& i' a& K readsList_1 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 0]
9 E# ?5 Z; G) Q0 o/ C( ^ Z readsList_2 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 1]
1 Y6 I& J1 k* l SeqIO.write(readsList_1, sequencingResult_1, "fasta")( z! b0 z% |' w" {3 W9 t
SeqIO.write(readsList_2, sequencingResult_2, "fasta")
# m, z' Q! i9 O1 ?
0 |: ~! n) O7 N8 y! t! r: w def resultsummary(self):
8 q: [1 H: L I% A! l print("基因组长度:" + str(self.genomeLength / 1000) + "kb") Y- G- g/ f9 n; z
print("片段长度区间:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength))& Q6 q' Z% a8 a, k6 Y
print("N值:" + str(self.N))- l, Q' q0 A' A8 }3 R# r
print("期望片段长度:" + str(self.averagefragmentlength))
! Z( f; w4 Z! I& X3 t: |* m. J print("克隆保留率:" + str(self.cloneRetainprobability)) J, E# F: B; U; a/ ?: B
print("片段数量:" + str(len(self.fragmentList)))+ }8 S, z! x& T
print("reads长度:" + str(self.minreadslength) + "-" + str(self.maxreadslength)). G8 @- J% W- [8 _5 m* K0 p
print("reads总数量:" + str(len(self.readsList)))
: g* Q% \) h5 @3 I/ y print("reads总长度:" + str(self.allreadslength / 1000) + "kb")8 e- {9 E6 \5 e, I7 C; B. e
m = self.allreadslength / self.genomeLength' ^9 Y% }: v4 \# b
print("覆盖度(m值):" + str(round(m, 5)))
) j% b% B9 W0 _, _ print("理论丢失率(e^-m):" + str(round(exp(-m), 5)))6 C: s5 X) ^* Q4 S. Z, A3 H
print("覆盖率(1-e^-m):" + str(round(1 - exp(-m), 5)))
9 V1 ]' z9 a: m; ~4 w8 U# -------------------------------------------主程序-------------------------------------------- L0 h& y9 B$ h
# 模拟单端测序
8 g6 d1 w2 _( z5 EsequencingObj = Sequencing()3 p' V+ L, R7 P+ B: t8 w# f; d q) R
sequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")$ G5 |- [" N0 ~2 s
sequencingObj.resultsummary()
& S5 k6 w' y: R3 T7 R- S) [% V$ G% S/ B4 ^
# 模拟双端测序
1 z% Z! s! T! z3 B( msequencingObj = Sequencing()4 T* a2 O& v6 G) k
sequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa")" `7 ~4 ]" t3 C6 @, V9 b1 D/ d' X8 G
sequencingObj.resultsummary()
- E1 ^5 |0 F* r# l8 `3 Ffrom Bio import SeqIO( u& u5 Z0 V9 h4 ~
from math import exp
: t9 Z" N5 y, F) `3 v( Q# zimport random5 C. j8 h" j( r! j1 P z* o4 Y
. N* B9 t! [* |- K7 [& a
class Sequencing:5 t+ ^5 F# m( l* J; m. I* ~6 b3 e
# N代表拷贝份数
- q" g: z. E- W9 }+ ` def __init__(self):9 z- r+ E+ N$ A0 t% @. o8 @. s$ Q
self.fragmentList = []. ?3 a0 u2 O& F) v% |: Z+ x4 f
self.readsID = 1
~1 g6 b! Q! M! a$ p self.readsList = []
, U7 ?+ l+ G) K5 Z' P" k- S0 } self.averagefragmentlength = 650: q8 i& ^7 B7 S1 a: V: o2 I
self.minfragmentlength = 500- w3 s, Q& Y6 }% `" G$ E
self.maxfragmentlength = 800$ e! r8 v7 [% f2 x, @! W5 X; @
self.cloneRetainprobability = 10 q9 i) P- Q" ^/ R+ `# W
self.minreadslength = 50
* U; w; }+ E0 \, [ self.maxreadslength = 150$ p6 g6 _/ c6 G p
self.N = 106 E G/ w7 _9 Q& p/ F. Q
self.genomeLength = 0+ m8 S$ T1 s, d* d
self.allreadslength = 01 t( l8 q7 G1 s4 v) P
2 r! K4 T+ C( |* T8 M # 生成断裂点
! I% f& O1 y2 D, C. Q9 ^- m4 E4 b4 U def generatebreakpoint(self, seqlen, averageLength):4 s$ ~# V8 D" _- N8 K! k J
# 假设平均每500bp 产生一个断裂点(averageLength = 500),通过随机函数生成seqlen/500个随机断裂点(1到seqlen之间的随机整数)
1 n9 W, G9 R% k; g% a breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))]
% X4 H9 T* v! `- O! z breakpoint.append(seqlen). L4 I5 g5 Q7 K x
breakpoint.append(0)/ S' P4 f& u4 o" I- k
# 把随机断裂点从小到大排序8 X% g2 P& `. j% A! C7 e. C6 S6 _
breakpoint.sort()
: C; |+ A: M/ o4 ]; ` return breakpoint
% L; c+ A4 a$ J( w7 h1 {8 C2 _! m; l; v3 \
# 沿断裂点打断基因组,并删除不符合长度要求的序列片段,定义片段范围:200-1000bp7 D& ^ p& `. p6 s1 [
def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength):
8 X, }* k; Q+ A1 r4 R* a. [ for i in range(len(breakpoint) - 1):0 l5 b1 _% b9 d0 \, m4 |1 w
fragment = seq[breakpoint:breakpoint[i + 1]]# G6 Q& x$ w5 E5 U( d! X1 B2 q! @
if maxfragmentlength > len(fragment) > minfragmentlength:1 \8 Y+ P6 U( M( B: R% W8 T- K
self.fragmentList.append(fragment)0 F9 a; A+ A5 Y1 j' f. s
return self.fragmentList
; ]% M3 `, U2 ]4 J
$ l* C& u; t6 I7 z) W # 模拟克隆时的随机丢失情况,random.random()生成0-1的保留概率
( U& X& J; A/ s0 R. ], ~0 L# J0 d def clonefragment(self, fragmentList, cloneRetainprobability):, s0 R( a2 q2 O V6 G. }! j
clonedfragmentList = []
) h4 G3 O7 a. G$ |+ b' M* I9 W Lossprobability = [random.random() for _ in range(len(fragmentList))]
6 Z- B) A1 j1 }+ ]5 X* b for i in range(len(fragmentList)):
5 r. \% Z8 S& t: s! G3 D# Q if Lossprobability <= cloneRetainprobability:
9 ?$ N, _& z: j8 r! i8 x clonedfragmentList.append(fragmentList)) h2 X! u0 \! Z* j3 r
return clonedfragmentList
0 y8 R, b+ s3 B) |7 t
# ^ R$ v: S9 J$ A* }7 C # 模拟单端测序,并修改reads的ID号5 w4 n2 a+ n6 }: K
def singleread(self, clonedfragmentList):
4 l. m5 S! ?' E# T8 M. j- k$ C6 B for fragment in clonedfragmentList:1 p8 z. m. _8 _9 [
fragment.id = ""
( A, L: o- q/ p: q- o" y! Z2 ^ fragment.name = ""
6 u3 t: a" b# m G3 Z( X f# |- A fragment.description = fragment.description[12:].split(",")[0]6 B+ Z7 m6 k3 T( c; [: x
fragment.description = str(self.readsID) + "." + fragment.description
S2 B; |6 s. @) q' P' @: b self.readsID += 1
4 s2 J- M+ H( Y1 ^ readslength = random.randint(self.minreadslength, self.maxreadslength)
7 G+ e7 K J' w B self.allreadslength += readslength) m# u: I+ W! v/ C) R
self.readsList.append(fragment[:readslength])
- @0 i: j6 S7 b' q6 ]3 n# }9 d/ F$ r
def singlereadsequencing(self, genomedata, sequencingResult):
9 p+ r3 g u* Z' h1 ~! H' N for seq_record in SeqIO.parse(genomedata, "fasta"):
" G0 }$ b9 a& H+ F8 H seqlen = len(seq_record)6 O$ K# t+ a; c% E$ S- t$ k
self.genomeLength += seqlen
0 |, S, h$ d7 b. t. w: @ for i in range(self.N):
) b0 H( E7 c5 d0 @+ v8 V # 生成断裂点
6 V; v4 C0 g9 B" N breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)- L' [7 G; Y5 F) A) Q; k1 ~
# 沿断裂点打断基因组 R. b& A0 q8 d0 b& X
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength), @4 r# ]: y6 \7 F2 Y
# 模拟克隆时的随机丢失情况7 Q# a) F% t/ g" d! E
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
T- Z5 V" O% I1 _! u4 ` # 模拟单端测序
9 G u. L$ z7 D: i self.singleread(clonedfragmentList)! ] @: L4 |* H) g9 i7 I' o- |1 [
SeqIO.write(self.readsList, sequencingResult, "fasta")
' F! j$ {5 T! s" K3 Y* ?! {$ P d" Q# j
def pairread(self, clonedfragmentList):& y+ U' t" _. h6 r
for fragment in clonedfragmentList:/ Q9 f" M+ B2 T. d f
fragment.id = ""% C6 }& f1 G; I7 A2 x
fragment.name = "", j9 e; Y* x0 T. g" o. S5 h0 h6 k( X
description = fragment.description[12:].split(",")[0]3 z/ Y- n. ]# }0 a: O' V
fragment.description = str(self.readsID) + "." + description
& \6 w4 H% Z/ u, k# r6 O readslength = random.randint(self.minreadslength, self.maxreadslength)
0 \5 B" W' o8 c3 E7 j0 _ self.allreadslength += readslength
1 y; b. S1 i$ @ self.readsList.append(fragment[:readslength])
. F, n! }6 s l* D8 A
4 j' T* `5 x6 N: W, _ readslength = random.randint(self.minreadslength, self.maxreadslength)
l3 [/ y- G7 d# ~' v self.allreadslength += readslength8 M4 V* t4 Y; j3 @4 b) T5 Z$ S' K
) H) M/ j6 v. Y% i1 m fragmentcomplement = fragment.reverse_complement()* v9 v% L( R1 [) d
fragmentcomplement.id = ""
: L$ u- V9 `+ I0 O/ o8 p fragmentcomplement.name = "": E( J {: D6 T' C. z: ]
fragmentcomplement.description = str(self.readsID) + "." + description; e( w. [+ j# d
self.readsList.append(fragmentcomplement[:readslength])% i- }6 G! n" m( l n. T
7 o! S' `7 `0 y; r# h K self.readsID += 1 `! \) B8 w: [8 n# J( g
1 d0 ~+ z" ? m2 J u
def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2):
3 ]* z! V6 t3 Q: v0 B for seq_record in SeqIO.parse(genomedata, "fasta"):
0 o# w) w' A/ | seqlen = len(seq_record)- ^ @( x0 f7 w% N: G& m
self.genomeLength += seqlen
+ t4 p9 D2 o& s8 J; L) |- w for i in range(self.N):& X7 S9 _1 o9 G Z, V, r1 Q2 n5 q
# 生成断裂点
7 H8 J6 m8 g* b! Y; P$ d breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)) o4 Z% l+ H5 L1 i, h5 i5 ^
# 沿断裂点打断基因组* P- h8 H4 L: B4 I
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)0 M& o4 n2 g+ p9 I/ j, `% K
# 模拟克隆时的随机丢失情况
2 m' `; @6 n: k+ q* c+ t0 H! A clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
3 R+ C3 c& F# D # 模拟双端测序& \) ?4 f* N1 t
self.pairread(clonedfragmentList)
8 @8 f" L% P$ ?2 g, B, q7 z readsList_1 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 0]
# V/ g/ U6 t0 {4 k. t readsList_2 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 1]4 ?+ B- Y8 a# M& o L( m
SeqIO.write(readsList_1, sequencingResult_1, "fasta")
6 P3 Z5 _1 A; ]: P SeqIO.write(readsList_2, sequencingResult_2, "fasta")9 a7 C) \4 p& l3 r2 Z& L6 `7 D
2 C, c1 ~, u8 M/ ]1 O7 G! U: O+ A0 Q def resultsummary(self):$ ^) E9 h/ O& \6 Q. r/ ]1 T- Y
print("基因组长度:" + str(self.genomeLength / 1000) + "kb")* v5 A2 }9 d2 ~ t
print("片段长度区间:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength))9 w* H# F r+ K; D
print("N值:" + str(self.N)); }2 v7 W0 m" p, t! U; L" b T
print("期望片段长度:" + str(self.averagefragmentlength))+ Z# _9 D9 D( e0 j a0 h; {
print("克隆保留率:" + str(self.cloneRetainprobability))
" H3 H7 A8 j% H; | print("片段数量:" + str(len(self.fragmentList)))
: o1 H4 y% C E1 [6 ^ print("reads长度:" + str(self.minreadslength) + "-" + str(self.maxreadslength))0 ?% B* {$ `2 e# |
print("reads总数量:" + str(len(self.readsList)))
' _" h' [, {) h2 C6 J print("reads总长度:" + str(self.allreadslength / 1000) + "kb")3 Y0 K# t, d) k; A' c. O
m = self.allreadslength / self.genomeLength1 \) j$ T' D/ {, ~
print("覆盖度(m值):" + str(round(m, 5)))4 o f; I7 u5 D3 n6 o2 N, w( R
print("理论丢失率(e^-m):" + str(round(exp(-m), 5)))
/ v: X X c# x- t print("覆盖率(1-e^-m):" + str(round(1 - exp(-m), 5)))
. p2 K' ]2 {6 p/ O, W: i# -------------------------------------------主程序-------------------------------------------4 a. L& T7 q" f# c; J" Q9 z8 I- s7 o) ^
# 模拟单端测序
+ P; H4 S/ S* o# gsequencingObj = Sequencing()
+ j$ S/ `" ~; |/ \+ p4 ]9 Z' ssequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")
- g# a3 @+ y- }& R; osequencingObj.resultsummary()
4 m' L: @& S a& Q; `# U
# d$ e p6 {5 O$ g( O$ h0 d. N# 模拟双端测序
8 ` B( m' w2 ?) n: j$ R* M- MsequencingObj = Sequencing()
2 p( @2 Y& J0 @5 tsequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa")
, |7 |& t# C! L$ J7 BsequencingObj.resultsummary()
- f3 b. F9 J) j9 F. z( E4 {: a% E0 `
+ s: q" h5 k" v) ]
' o3 q9 U5 K- u) L- n( l+ w
5 V. ^- b) i2 d9 F$ R$ p8 |4 M5 | |
zan
|