- 在线时间
- 1630 小时
- 最后登录
- 2024-1-29
- 注册时间
- 2017-5-16
- 听众数
- 82
- 收听数
- 1
- 能力
- 120 分
- 体力
- 564627 点
- 威望
- 12 点
- 阅读权限
- 255
- 积分
- 174611
- 相册
- 1
- 日志
- 0
- 记录
- 0
- 帖子
- 5313
- 主题
- 5273
- 精华
- 3
- 分享
- 0
- 好友
- 163
TA的每日心情 | 开心 2021-8-11 17:59 |
|---|
签到天数: 17 天 [LV.4]偶尔看看III 网络挑战赛参赛者 网络挑战赛参赛者 - 自我介绍
- 本人女,毕业于内蒙古科技大学,担任文职专业,毕业专业英语。
 群组: 2018美赛大象算法课程 群组: 2018美赛护航培训课程 群组: 2019年 数学中国站长建 群组: 2019年数据分析师课程 群组: 2018年大象老师国赛优 |
基因组测序模拟
. o9 X) R$ U7 w+ u9 }1 R% t基因组测序模拟
( z6 N7 d* d5 m$ \; g$ r5 J" z+ D0 ?% {; N# x% v# m4 @6 p6 i/ d0 @
一、摘要
( I8 s9 u3 q7 e4 p5 s' L
7 f Z9 s6 W4 f. _$ \通过熟悉已有的基因组测序模拟和评估程序,加深全基因组鸟枪法测序原理的理解,并且能够编写程序模拟全基因组鸟枪法测序,理解覆盖度、测序深度、拷贝数等概念,设置测序相关参数,生成单端/双端测序结果文件6 W( z1 k+ Z" m& A; A; T
& v0 g7 c! w4 x. W
二、材料和方法
4 r" X: ~# }8 Z5 ^6 Z
5 }( z' w' J& Q! H. f1、硬件平台
+ l9 }4 R0 C$ x1 X0 v+ P& A$ p" A8 [+ t( [6 w
处理器:Intel(R) Core(TM)i7-4710MQ CPU @ 2.50GHz , x/ H# U- J ~, H# n8 `
安装内存(RAM):16.0GB% l, n* C. a+ ?
; j6 v& f- x3 f5 P- _) N2、系统平台: } b7 T# g9 u- Z
Windows 8.1,Ubuntu3 n3 n# w0 B% w1 n( u w# T
* @4 M" n/ Z) x/ c3、软件平台
! Q" i0 R: d7 i2 W
6 E! _5 C, a/ m9 M0 I( u2 m* Bart_454' V3 @5 H+ Y5 J" v, @$ J
GenomeABC http://crdd.osdd.net/raghava/genomeabc/' d$ \+ K) y) r" H8 a* H
Python3.5
8 M) V$ P3 i4 O8 Q( a8 X9 x8 V. wBiopython
/ ]9 R3 I) q) \4、数据库资源 r1 S; ?7 D8 F- x
R6 O( V$ R- h9 u
NCBI数据库:https://www.ncbi.nlm.nih.gov/' _( {7 m5 n1 W5 r2 Q: C3 [
5 h5 b; G+ }: f
5、研究对象
" L) q0 o. i; }9 v0 ]! J5 v( _' d2 R
酵母基因组Saccharomyces cerevisiae S288c (assembly R64) 7 Y& ^9 t6 r* B9 ?/ X
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz
+ P6 n& N4 }* ~. ]: J. I% F- O1 Y: E, s, ~
6、方法' R. d q! A( Q. O0 G$ f
# R/ j7 ]# X7 |9 C. uart_454的使用 % ^) ]- ^) ?- `% S# N1 W
首先至art系列软件的官网,下载软件,在ubuntu系统安装,然后阅读相关参数设置的帮助文档,运行程序。
: W. B; D% w5 j1 E& p3 bGenomeABC
4 ]8 H1 C: U$ B! n8 V, W" r进入GenomeABC(http://crdd.osdd.net/raghava/genomeabc/),输入参数,获得模拟测序结果。+ I3 x' F! ~: Z$ M0 }5 T: _
编程模拟测序
7 c7 a: i& y/ v下载安装python,并且安装biopython扩展模块,编写程序,模拟单端/双端测序。' r: T, _6 B c
三、结果" m+ C3 i- v5 c4 I
* E6 ^1 t4 }/ o( d0 _* h M
1、art_454的运行结果' E v9 B2 c+ g: b9 d8 } R
2 y2 _+ @7 b/ X" @3 d3 n U无参数art_454运行,阅读帮助文档 5 x1 T0 l+ b _) s
0 X K# T, f: w9 y) J3 {. c图表 1无参数art_454运行
g2 I$ p9 }- _# [; S$ T! H对酵母基因组进行基因组单端测序模拟,FOLD_COVERAGE设为20,即覆盖度为20. {& c' H+ v6 \
下图为模拟单端测序,程序运行过程及结果 7 T2 X3 O9 d. o7 t1 Z( S' _* A4 U, c
* T' l8 }/ w0 w: q5 `+ q图表 2 art454单端测序
2 L( F* [; L! d2 r! _
X* Z& C8 J( W: S图表 3 art454单端模拟结果
3 \; Q! X: O) @; e双端测序模拟,FOLD_COVERAGE设为20,即覆盖度为20;MEAN_FRAG_LEN设为1500,即平均片段长度为1500;STD_DEV设为20,即长度的标准差为20
: |. ?1 Y" N: g4 U. g8 J下图为模拟双端测序,程序运行过程及结果 5 L7 @ H9 n; J! I
7 W ]" s) f; }9 k. f% |图表 4 art454双端测序 ; E* p7 C/ Z% [. P# p
7 Z3 k2 ^' `8 N7 `& `
图表 5 art454双端模拟结果
- [) h2 [6 a0 m0 L2、GenomeABC
( E2 q6 ~$ b/ |( V% y8 O+ y下图为设置参数页面
% p1 _) O. J; ~" l8 y/ @- _; \
- p, j3 I" B/ R! O下图为结果下载页面
' N- G- j$ `- a, [; L7 y. `: c* e1 r& S3 ]
图表 6 结果下载页面
( C3 Z. c) o+ A- b) G/ J3、编程模拟测序结果
( `8 z4 {, z* k6 F6 t4 N6 q拷贝数是这里的N值;覆盖度是m,测序深度是宏观的量,在这里与覆盖度意思相同,就是测序仪10X,20X。
; Z. R- w* ]7 ]单端测序 ' E' M q+ h7 ? I1 `
+ W7 F+ d* A4 A R1 g* A
图表 7 程序模拟单端测序 5 j/ ?: e( T9 p8 @
双端测序
, c* }3 W6 L, h0 p8 q
& _" Q6 W. m6 c# u5 x k) P图表 8 程序模拟双端测序 + ?0 @" g) W5 e' y* W
测序结果
8 ?2 U5 Y: F: \* R1 P' ]( ~0 z
; @& ?, k6 d" A: j) }图表 9 结果文件 p" K; e; O! Y. R* y) k
+ f9 Q0 D; x& W+ f( R5 N
因为期望片段长度是600bp,在片段长度区间200-1000bp内,所以大部分的片段都没有删除。 8 a P# i' O' l" F, C) U# x
测序结果统计表% c. _% T1 C, h% M4 W
* |4 U% W8 r( r- l2 \测序方式 基因组大小(bp) 片段长度区间 (bp) N值 期望片段长度 克隆保留率 片段数量 Reads长度范围(bp) Reads总数量 Reads总长度 覆盖度(m值) 理论丢失率(e-m) 覆盖率(1-e-m)
( Y1 ^8 ?3 ?6 k" b单端 12157kb 200-1000 10 600 0.95 107378 50-100 101968 7645.541kb 0.62889 0.53318 0.46682
# Q' J- A5 e. x单端 12157kb 200-1000 20 600 0.95 213722 50-100 202996 15227.882kb 1.25259 0.28576 0.71424
- o: `% B9 @- p8 t. B/ h双端 12157kb 200-1000 10 600 0.95 106704 50-100 202770 15212.662kb 1.25134 0.28612 0.71388
2 G- \& W! R- j7 e2 E( c' f& v# s3 u双端 12157kb 200-1000 20 600 0.95 214212 50-100 407186 30534.265kb 2.51164 0.08114 0.91886- G: l6 t M8 Z' g* @
四、讨论和结论
9 F' X8 K" ^3 U; \5 f) Y* M
$ X$ j0 t' w8 H) h% ?' B) Q1 I" a程序运行方法7 \( |/ I+ y' h, X* H9 X' Q
- W/ S9 ^, C% Y! ]. d1 w& W" x$ v( M在类的构造方法init()中,调整参数。 ; q* @/ S5 S# T; @1 M" Z' t
Averagefragmentlength为片段平均的长度;
3 ?. l% v5 A9 B+ I- Zminfragmentlength和maxfragmentlength是保留片段的范围; 4 ~0 A' @9 P6 @# `* g# O
cloneRetainprobability是克隆的保留率; 0 W7 v) Y2 } u* a7 @5 C: ^
minreadslength和maxreadslength是测序reads的长度范围
; ^) X5 M$ x- `/ C% ?% ?0 l, g. ?: P8 \
模拟测序的诸多方法都封装成了Sequencing类,只需要创建类,并调用singlereadsequencing()和pairreadsequencing()方法,传入文件名的参数即可。
/ r9 v% L2 U5 Z! B0 P8 { Z P. S8 t
附录
i ?* E( ~ c; O7 b; L
' s6 \" a+ F0 g( N3 Z: h; {4 Ffrom Bio import SeqIO# Y' Y' b- n/ ^ _ T8 N
from math import exp* k: x" k+ h" Q/ d! }+ ]
import random
; s9 ]8 I5 E- G4 k5 Q6 s) o
, D9 S* p, c) h: {* @* hclass Sequencing:1 k. k6 Z* i0 G& K! t0 j0 E8 P
# N代表拷贝份数
7 }4 s4 m& I0 B# D& A def __init__(self)
3 P' ?4 x8 d* s, O1 M9 r self.fragmentList = []1 D+ D/ y) C$ q0 V! u, _
self.readsID = 16 q: H' @' R0 J
self.readsList = []
7 P2 `% C& c8 }$ \/ Q$ O3 d7 l self.averagefragmentlength = 650
5 j' x% `, D6 q# F self.minfragmentlength = 500
; ] I8 A. m, s0 @* ~: ]- G self.maxfragmentlength = 800" e7 |9 N9 }) o
self.cloneRetainprobability = 1
0 {: L6 x( r0 N8 X. x+ a self.minreadslength = 50' r K+ a1 ~; w9 |9 D' q
self.maxreadslength = 150
8 N3 q' X) G# V7 B( F' Q2 k self.N = 10
t1 F6 H* w6 K3 z self.genomeLength = 04 q C7 K4 Z- P+ O$ u
self.allreadslength = 0
1 F) a/ N0 @' ]6 G( E5 V8 H) @' L* O/ _# X
# 生成断裂点
, {' n/ z( ~5 u+ j0 @( s: s def generatebreakpoint(self, seqlen, averageLength):
K( Z6 ~3 l; @( x, z; S- d. @ # 假设平均每500bp 产生一个断裂点(averageLength = 500),通过随机函数生成seqlen/500个随机断裂点(1到seqlen之间的随机整数)
" f/ I' T2 L6 x( p4 L& R4 O) n& F breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))]" P4 n8 s7 f* [% G
breakpoint.append(seqlen)& y5 s9 F2 ]2 x! y% N& c) U7 c
breakpoint.append(0)
0 }% }9 s( n- l3 O # 把随机断裂点从小到大排序) O" }. d* G, f) `
breakpoint.sort()
) M6 `9 X; Z5 N6 ` return breakpoint
6 z! B8 l R9 X6 z5 a* P# q% k- I' ^) n
# 沿断裂点打断基因组,并删除不符合长度要求的序列片段,定义片段范围:200-1000bp% H5 W. h3 z. Q+ e n: X( L, g& t
def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength):% v o8 C. N( Q8 M" `5 O
for i in range(len(breakpoint) - 1):0 M5 G( h! b# ]+ Z2 H
fragment = seq[breakpoint:breakpoint[i + 1]]5 o5 q7 }% d7 f# |
if maxfragmentlength > len(fragment) > minfragmentlength:
+ p) r3 |/ B/ j3 x" w self.fragmentList.append(fragment)
7 b# I' d$ \1 K& W: V% Y return self.fragmentList
- F5 m' I, g$ r- s: B, u) e3 Y* ]/ f
# 模拟克隆时的随机丢失情况,random.random()生成0-1的保留概率! q( D! }% O/ C) `% _
def clonefragment(self, fragmentList, cloneRetainprobability):% W+ x9 b) c) r& Q. F1 r; H& n
clonedfragmentList = []5 S1 b$ Q+ t- P: z Q8 c0 R
Lossprobability = [random.random() for _ in range(len(fragmentList))]
7 |: y* y+ K; r8 U9 q, z1 H! E: \ for i in range(len(fragmentList)):
2 O2 ~& i' [/ o# O# q* x6 _/ z if Lossprobability <= cloneRetainprobability:
2 H6 r/ ]% h0 R3 s; W( Q clonedfragmentList.append(fragmentList) Y! f0 P1 M- V. P7 Q k Z, `! k$ A
return clonedfragmentList
0 U1 _9 X. F- r0 j: f' w) E" N4 {7 W1 W0 N i! w% e
# 模拟单端测序,并修改reads的ID号+ s1 m2 W/ q( p4 A* c
def singleread(self, clonedfragmentList):
* ~; n6 @, u; t" k for fragment in clonedfragmentList:
) y6 o6 j9 o6 G fragment.id = ""! v& ~2 v% ^% G3 n4 }
fragment.name = ""9 d) }, _/ p2 j0 ?5 @, l: N, f3 m
fragment.description = fragment.description[12:].split(",")[0]
j1 v+ [6 @8 M& r3 c fragment.description = str(self.readsID) + "." + fragment.description
- U1 Q% x- W4 W. @7 E8 G$ i self.readsID += 1
# N2 B! c( Y. f readslength = random.randint(self.minreadslength, self.maxreadslength)
6 L, x k3 r7 V+ r, @+ r self.allreadslength += readslength+ ]8 `6 v6 ?, c; X
self.readsList.append(fragment[:readslength])% }/ Z5 J8 x# i, v. @" Y
6 F9 {0 d+ b2 J. J! j: }5 n o
def singlereadsequencing(self, genomedata, sequencingResult):, Y r m% t+ K0 d( N
for seq_record in SeqIO.parse(genomedata, "fasta"):5 {1 [$ _2 ?$ ~1 I" P; P/ q
seqlen = len(seq_record)
- N9 @& P% {: j7 M/ z- q" a$ o self.genomeLength += seqlen
8 m9 o* ]& X( J! r2 C for i in range(self.N):' e( @- Y- N: J) q
# 生成断裂点3 X) e, N U2 u3 g5 T9 c
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
r- t( ?" O. h9 p3 }) b # 沿断裂点打断基因组& _: K- p* f3 Z
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
" P" P# |# r4 ? # 模拟克隆时的随机丢失情况- h( z- p3 x- p6 d7 I0 V8 x
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)/ m r, G. o \: _: \* \
# 模拟单端测序
) w" @+ M* w" [* v, S3 j' B self.singleread(clonedfragmentList)
. Y# p' O |' d) F/ D- J1 ?3 ?: n SeqIO.write(self.readsList, sequencingResult, "fasta")# O% |+ Z5 P: v
; k% b5 b9 n( M; \7 E def pairread(self, clonedfragmentList):
. r5 l. D9 r) H8 {1 T for fragment in clonedfragmentList:/ F. ^) t; A2 R8 n0 c0 k3 T
fragment.id = ""# z9 j" _1 z! i9 }
fragment.name = ""4 }* d1 q) Z6 V! C% B% [* `* y/ u
description = fragment.description[12:].split(",")[0]
* p: K8 J& H# r: e! U% W: v ^ fragment.description = str(self.readsID) + "." + description4 c( Z \$ t* ~5 {
readslength = random.randint(self.minreadslength, self.maxreadslength)' O$ v7 u7 e8 \3 G2 Z! i
self.allreadslength += readslength
3 Z- c1 A9 z3 X+ W; u8 E+ `& S' R x self.readsList.append(fragment[:readslength])" {2 P3 P! v/ {$ i+ C* ~# K! `. k, Y
% M, }. z1 g+ t' W1 v$ [$ s9 B readslength = random.randint(self.minreadslength, self.maxreadslength)+ A; A, m G e+ @; Y$ L* I
self.allreadslength += readslength9 t3 p- n/ k1 V' w6 ~* E
0 I% u' X; b# k
fragmentcomplement = fragment.reverse_complement()2 O2 h5 W! L0 n$ y
fragmentcomplement.id = ""
' N( l: U/ L# I. B7 q9 c fragmentcomplement.name = ""8 c2 u) b1 ?& s+ B! h3 M8 j
fragmentcomplement.description = str(self.readsID) + "." + description
9 z; G! q; |+ n3 x5 E& L8 P; f! b) P self.readsList.append(fragmentcomplement[:readslength])1 D9 |0 a: ~# G/ g8 I: @
8 F* O9 g4 o& @" z8 A3 l( \+ n
self.readsID += 19 n9 C3 H& n/ e K8 i6 F
. T# X# _( J1 K def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2):$ F' Z8 k: f8 o$ Z
for seq_record in SeqIO.parse(genomedata, "fasta"):
1 k% w; c* L% }- I: W seqlen = len(seq_record)
: C1 p# G* ?' d self.genomeLength += seqlen
7 @$ d; l% l& n) I for i in range(self.N):8 q7 d, q1 U. Q7 ~6 i
# 生成断裂点% P: C! s3 b- G( ~9 z* L x
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)2 w; q% P+ N" ?! E' T& L
# 沿断裂点打断基因组* x% y; ~/ i$ ~) T: T$ L
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)! w$ u" F9 S6 m& d1 u
# 模拟克隆时的随机丢失情况% \& @- ~- S/ L; x# _7 F
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)9 W% J7 R( K8 ~( L C5 m5 w& ?
# 模拟双端测序' _' V7 r) Y- o1 s
self.pairread(clonedfragmentList)) G s: i) F! F) w
readsList_1 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 0]
7 O& t' ~7 J( d0 a8 N% x* @5 { readsList_2 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 1]
% \/ q+ w; O0 k( W0 |3 B% E SeqIO.write(readsList_1, sequencingResult_1, "fasta")% c9 p. i7 `8 w; ]) o% r. H
SeqIO.write(readsList_2, sequencingResult_2, "fasta")
, @% X" V3 C4 g" |( b; N9 t: d9 j& m. y' o
def resultsummary(self):
- s# o z% E% c2 {( u+ F& H print("基因组长度:" + str(self.genomeLength / 1000) + "kb")
: X$ b, v# m" } print("片段长度区间:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength))
, n8 b5 t& k9 x3 X print("N值:" + str(self.N))
2 ]$ c! w2 I* C/ p. x print("期望片段长度:" + str(self.averagefragmentlength))
9 K; D4 D" x5 y' u print("克隆保留率:" + str(self.cloneRetainprobability))9 Y8 K7 e- G- v0 F7 D# y; v
print("片段数量:" + str(len(self.fragmentList)))( V5 N8 T6 p4 J, P
print("reads长度:" + str(self.minreadslength) + "-" + str(self.maxreadslength))8 z* k5 l. x0 _/ K; F2 a/ m- o2 n
print("reads总数量:" + str(len(self.readsList)))3 {+ @) v! g7 i i' \( u5 h# ?
print("reads总长度:" + str(self.allreadslength / 1000) + "kb")% M z- ^. n8 _8 D8 m8 ]$ y+ H Z
m = self.allreadslength / self.genomeLength& c5 U1 R' x$ X4 @: h
print("覆盖度(m值):" + str(round(m, 5)))
r S- w3 E8 n/ S print("理论丢失率(e^-m):" + str(round(exp(-m), 5))). ^9 R) O4 b5 z( o2 F
print("覆盖率(1-e^-m):" + str(round(1 - exp(-m), 5)))
" ], |+ Q; e2 s1 H' U+ @( g# -------------------------------------------主程序-------------------------------------------
! b c' J5 x% m0 e# 模拟单端测序
2 E/ r3 {% B( S) s* r$ xsequencingObj = Sequencing()
0 u* f. g# \* YsequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")
7 I; k$ T/ m8 ` k9 N% C) ~- p7 asequencingObj.resultsummary()
0 X8 N! ?4 |1 M! C2 F+ ~# A" I% S G, q" B
# 模拟双端测序
! N$ ^- [- C/ h8 ]sequencingObj = Sequencing()
! ]3 t. X( v \5 P2 x5 {# ~sequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa")
' \$ N, U e7 z1 c0 g* }+ AsequencingObj.resultsummary()) j: \/ b' Y8 i2 |
from Bio import SeqIO
5 Z- ^7 V/ j C: bfrom math import exp
+ }9 ?" T) ?4 t- ]& F! s9 g7 @import random
. Q2 {4 V: y5 H
2 B) X5 a$ j @' m1 ]class Sequencing:, m' S) T J& s# v
# N代表拷贝份数) c6 H* p$ N% `. T
def __init__(self):
5 C4 s0 X; w3 n3 S7 m1 A7 R R1 L self.fragmentList = []
& B6 d) x$ h- q* U( v8 R. C0 u self.readsID = 1
2 W) v; u# d* v2 J9 @ self.readsList = []
) h! c) K, m. Q& a self.averagefragmentlength = 650' r( n. f y* A, ]( f
self.minfragmentlength = 500. E/ E# F- T J/ e7 u% E* a) o9 _) s
self.maxfragmentlength = 800
2 G( X) c1 z4 [( b self.cloneRetainprobability = 1
2 q$ i- l- Z& ~- h self.minreadslength = 502 |/ a2 v" `& C2 h
self.maxreadslength = 150& U% R: I3 D0 o1 J# z! Y8 U
self.N = 106 ~4 s" r* G" G% w) @$ m
self.genomeLength = 0
6 l3 C) ]1 q7 J& X$ Y( v% J self.allreadslength = 0
. E) K* J6 S: b" s% ]4 z
( ?( G3 ]' P/ s- M' Q. e0 O3 C # 生成断裂点
& ^: X9 r& f# I def generatebreakpoint(self, seqlen, averageLength):2 `* s+ g c/ T7 g2 s( f, h1 }
# 假设平均每500bp 产生一个断裂点(averageLength = 500),通过随机函数生成seqlen/500个随机断裂点(1到seqlen之间的随机整数)
6 f, l. J+ h- P& q9 l breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))]
/ \, [% L6 X! U c2 ] breakpoint.append(seqlen) U0 y7 D/ D N
breakpoint.append(0)
: F# ?' p+ S1 O7 l0 K # 把随机断裂点从小到大排序
6 G5 @9 _, V ]" K( _! M breakpoint.sort()7 y0 V* B) v6 t8 P
return breakpoint
' K* c. O1 I; t9 ~5 S
* g; o' y- q9 z+ a( [+ }% a # 沿断裂点打断基因组,并删除不符合长度要求的序列片段,定义片段范围:200-1000bp
$ Z h( N) |4 n1 e _. }& }4 ~ def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength):3 B3 ^. u [' M; `3 V9 g
for i in range(len(breakpoint) - 1):. u7 x0 e9 s1 _& I
fragment = seq[breakpoint:breakpoint[i + 1]]
8 n ~+ i S! V2 m. u* t if maxfragmentlength > len(fragment) > minfragmentlength:7 P9 J7 d( k' b1 i- _+ b4 m
self.fragmentList.append(fragment)
, R7 H N3 F# D. n3 L) _# ~ return self.fragmentList
9 D% O' z0 n) b' s9 z) x0 H' | l! g H5 l# {; l
# 模拟克隆时的随机丢失情况,random.random()生成0-1的保留概率. v& j# M% h% y1 q6 _6 U
def clonefragment(self, fragmentList, cloneRetainprobability):
, w; P) i6 y6 {) G/ G clonedfragmentList = [], _2 ? n* ~# L# ^/ O. l/ l
Lossprobability = [random.random() for _ in range(len(fragmentList))]' R1 \* f* V0 U! p( ^4 c
for i in range(len(fragmentList)):: U( U. J) ]! z8 r, t* L' [8 a
if Lossprobability <= cloneRetainprobability:
5 N. s, @) ?4 d2 j( D5 Q9 m clonedfragmentList.append(fragmentList)4 T9 o/ L: ]: Z7 Q
return clonedfragmentList
2 A* g8 h$ l# {3 S, V8 ]$ f
2 | X; ~; ]) ~5 U; t # 模拟单端测序,并修改reads的ID号* Z0 v# ~$ x3 e
def singleread(self, clonedfragmentList):9 I5 R8 m7 ^0 s5 h
for fragment in clonedfragmentList:
# P- u u' y. u) J1 z fragment.id = ""/ h) d- i& P- ^# k
fragment.name = "": G% i. Y: n" f6 {6 s8 Z
fragment.description = fragment.description[12:].split(",")[0]
, l5 [: F" ?% Q fragment.description = str(self.readsID) + "." + fragment.description" p9 \5 p9 ]) I$ u3 }
self.readsID += 18 }( A) L! {! e$ ?
readslength = random.randint(self.minreadslength, self.maxreadslength): F/ E, k! h4 e$ z8 }# Y
self.allreadslength += readslength
* |# d a$ O& }/ Y( g self.readsList.append(fragment[:readslength]) }# M) f0 C. r' n5 x
; o% X2 L! K9 l% G def singlereadsequencing(self, genomedata, sequencingResult):
2 Q) c& W' b; T" [2 V( n! e for seq_record in SeqIO.parse(genomedata, "fasta"):) n8 @( Z& K8 s- f, G8 Z
seqlen = len(seq_record)8 o/ ?; D) v) Y( h
self.genomeLength += seqlen
/ W! j$ \8 p" C9 R0 S1 V for i in range(self.N):
) I" }. H- n. l0 P/ ` # 生成断裂点
/ a. E4 L6 L" B% p# Y2 {0 X( v breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)% y% D2 q" W! @3 Q. [9 g7 I6 P
# 沿断裂点打断基因组
* k) L0 K) y; e self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)* U$ m4 F' S6 J% z5 P' @
# 模拟克隆时的随机丢失情况
! [! v5 x8 h" Q4 k" K clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
& f/ U& a0 Q' ^5 a. K$ u& I # 模拟单端测序2 h6 u# A( ~6 a3 q! i
self.singleread(clonedfragmentList)
% O3 n8 S2 y$ }; ^4 S SeqIO.write(self.readsList, sequencingResult, "fasta")
3 Z/ |; x: z$ d+ ]2 ~, \0 D) S9 ?
: Q! A U7 t) i/ k8 I n3 H def pairread(self, clonedfragmentList):* X9 H; G/ s' c) Z$ p
for fragment in clonedfragmentList:5 T( \% ]9 i$ A0 U
fragment.id = ""7 I" Z' b2 h# J9 o" W3 u
fragment.name = ""( h) \- ^( ]" ~8 Q) ?8 ?
description = fragment.description[12:].split(",")[0]
' |0 _4 v& a3 M fragment.description = str(self.readsID) + "." + description
4 g% I5 ]$ O4 {4 X0 r readslength = random.randint(self.minreadslength, self.maxreadslength); l* I% _' F: O! y) {
self.allreadslength += readslength
9 i* w k5 j6 B, W; e9 w* ~' a self.readsList.append(fragment[:readslength])
. O1 {% I( U% W- ], B. O
8 J4 ], S6 l5 J4 H. y3 s" Q readslength = random.randint(self.minreadslength, self.maxreadslength)
% U0 b& g( \8 V3 F" z self.allreadslength += readslength
' Z0 E3 W1 f+ S* B* I: j; g/ V$ r& R' Z6 `! B
fragmentcomplement = fragment.reverse_complement()
/ k& _( u: p9 F- r1 @7 J fragmentcomplement.id = ""& }& k: I. _' P
fragmentcomplement.name = ""
7 z1 M6 z, q1 X3 n1 k& i fragmentcomplement.description = str(self.readsID) + "." + description2 q* C# g* b1 q! g; Q
self.readsList.append(fragmentcomplement[:readslength])
* h* r% i9 x5 |5 ~" ~" X" V V9 A% S7 c; ]! b+ S2 c
self.readsID += 1
5 y) O7 g: ~2 i3 x" Z0 E) A" I% J1 M, K# A2 j. T9 L4 E1 t0 g
def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2): m$ [! p8 X3 _% n6 N
for seq_record in SeqIO.parse(genomedata, "fasta"):6 r2 ~7 g. G U9 X1 {+ T( y
seqlen = len(seq_record)
+ Y$ I5 T/ J$ Z( j3 g self.genomeLength += seqlen
& V; q* u% o M- p8 u for i in range(self.N):9 r, ?+ L0 h. `0 x4 `* G" V
# 生成断裂点
8 F# ~% y6 m o5 x' J breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
5 z1 y6 k& l/ |$ [8 w! S! p# \ # 沿断裂点打断基因组
) x& B4 O/ x2 c1 @/ u self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
5 `; t. e/ a# ^0 Y2 i3 h # 模拟克隆时的随机丢失情况, |: `5 g( U5 w3 G
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
' Y4 e# h8 N" ~/ e2 U( {9 s # 模拟双端测序
: U1 R" i" ?' c) X* N self.pairread(clonedfragmentList)+ P! @ c2 H* O+ Q* c
readsList_1 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 0]# q5 X& J0 _* p
readsList_2 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 1]; I3 I9 P$ r& {: i; l; y
SeqIO.write(readsList_1, sequencingResult_1, "fasta")+ P I4 D$ w% B, U5 p
SeqIO.write(readsList_2, sequencingResult_2, "fasta"). V5 K$ r3 D8 l- Q
+ Z' v3 L5 P6 [. d1 G/ D' M
def resultsummary(self):
2 Z& t9 Z( Q+ B8 s, g& D# g$ t print("基因组长度:" + str(self.genomeLength / 1000) + "kb")/ P* z; _' ]& [
print("片段长度区间:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength))- d* g5 ?+ A- m- N& K
print("N值:" + str(self.N))
" R. X. ?# z8 v' X6 a print("期望片段长度:" + str(self.averagefragmentlength))+ g6 S7 p \4 U4 y( j( T
print("克隆保留率:" + str(self.cloneRetainprobability))$ U+ A2 |. R( K- | B2 M
print("片段数量:" + str(len(self.fragmentList)))
& `7 V9 f7 T; z print("reads长度:" + str(self.minreadslength) + "-" + str(self.maxreadslength))! C$ ]8 W) Y. d, i& ~( @
print("reads总数量:" + str(len(self.readsList)))3 i9 G' T0 j+ \( C
print("reads总长度:" + str(self.allreadslength / 1000) + "kb"); |# `+ H8 w6 ]' V
m = self.allreadslength / self.genomeLength
# g+ z# Q1 `( l8 P+ Y v print("覆盖度(m值):" + str(round(m, 5)))- a: ]5 Z3 q$ }6 ^7 L: `& ]8 J
print("理论丢失率(e^-m):" + str(round(exp(-m), 5)))
l/ T- o* x! |" R* W* p print("覆盖率(1-e^-m):" + str(round(1 - exp(-m), 5)))4 ~9 J/ b; c+ y8 g+ c# I6 P
# -------------------------------------------主程序-------------------------------------------
- `+ L9 K0 S5 L5 H4 v! _" N# 模拟单端测序! c( W) u& y( x3 b: D5 l# F
sequencingObj = Sequencing() g" v% O$ w/ x* S0 J3 x( l
sequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")
5 M! n$ A% S% n% WsequencingObj.resultsummary()7 a8 u5 j) @2 H& e
5 f/ n- j1 S& W% m3 Y W* K# 模拟双端测序 |2 K& @1 V+ s% D2 ^/ _' n! \
sequencingObj = Sequencing()3 _, h! p! h8 h( p1 n( y0 n
sequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa")! R( Y' Z+ H6 M5 l
sequencingObj.resultsummary()/ ?( m+ I. e/ w
1 L$ @) J! p. E8 B/ s7 u: u
* U% ]6 b# M e" e
, H9 E+ H' V0 E $ C3 [3 L- N$ }% H
|
zan
|