- 在线时间
- 1630 小时
- 最后登录
- 2024-1-29
- 注册时间
- 2017-5-16
- 听众数
- 82
- 收听数
- 1
- 能力
- 120 分
- 体力
- 561420 点
- 威望
- 12 点
- 阅读权限
- 255
- 积分
- 173799
- 相册
- 1
- 日志
- 0
- 记录
- 0
- 帖子
- 5313
- 主题
- 5273
- 精华
- 18
- 分享
- 0
- 好友
- 163
TA的每日心情 | 开心 2021-8-11 17:59 |
|---|
签到天数: 17 天 [LV.4]偶尔看看III 网络挑战赛参赛者 网络挑战赛参赛者 - 自我介绍
- 本人女,毕业于内蒙古科技大学,担任文职专业,毕业专业英语。
 群组: 2018美赛大象算法课程 群组: 2018美赛护航培训课程 群组: 2019年 数学中国站长建 群组: 2019年数据分析师课程 群组: 2018年大象老师国赛优 |
基因组测序模拟
& k( o; `0 I. G: Z0 ?% G& q! D" p基因组测序模拟# H: R9 m e8 l
/ d, S# t% k) m' D5 N0 `一、摘要
% w: F5 d; K4 o2 ^, m
4 W* k& e1 G: U( e通过熟悉已有的基因组测序模拟和评估程序,加深全基因组鸟枪法测序原理的理解,并且能够编写程序模拟全基因组鸟枪法测序,理解覆盖度、测序深度、拷贝数等概念,设置测序相关参数,生成单端/双端测序结果文件% B" Z% B5 |- Q: L$ I
( t( p+ y, g _4 g二、材料和方法
8 g0 V( I9 f+ I% \% \8 E2 J) e; m* M8 c. T
1、硬件平台5 @ P7 m# L) r6 t0 n( _ E9 T4 l
7 x0 U% j1 e3 V% d( l3 E9 ?
处理器:Intel(R) Core(TM)i7-4710MQ CPU @ 2.50GHz $ \8 Z6 S, T; e E# `) G4 W
安装内存(RAM):16.0GB
# q# f! f V9 Q% @$ E, i+ ]; g4 q ^ [
2、系统平台$ T0 K# g( @& w3 u
Windows 8.1,Ubuntu1 p) t4 f9 K+ ?' W( v( Z
$ O$ I! Y7 C, |- J4 L+ V
3、软件平台 K1 m- ~7 e- X
2 W* H- p2 V8 x* O0 l
art_4540 t I) t0 p1 r+ c4 t& k" x* I
GenomeABC http://crdd.osdd.net/raghava/genomeabc/; a+ I2 R( t/ T, o F& M
Python3.5
5 C1 K/ h: k7 ~! G0 P+ ?* x. lBiopython
. ?9 R& j: h: r: t" O" y y4、数据库资源
5 v0 Q- e( ^: A) Q* S
: \/ f/ i7 p% S& N# f) x) ]NCBI数据库:https://www.ncbi.nlm.nih.gov/( ?! Y- [6 g& F% U5 P
6 X: ^, \2 O8 l' @7 p& v5、研究对象
' a) X5 r9 t$ _0 t/ P1 M
. z1 F% J9 Z! u/ E* ?* v酵母基因组Saccharomyces cerevisiae S288c (assembly R64)
& s" V* Z& }- a3 ^4 [ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz+ r# ~ W. u( b: ?2 Z4 `! v+ K% k
0 H$ V- g* F3 S0 E9 p4 V: ]6、方法9 e: f- P( W" s# \
; ~; }- u' w, g( W9 t% a
art_454的使用
0 B% J! k- x$ X6 A' p首先至art系列软件的官网,下载软件,在ubuntu系统安装,然后阅读相关参数设置的帮助文档,运行程序。
( U5 D3 q* X# J( LGenomeABC
3 ]/ B0 g* a6 R5 h4 B% q进入GenomeABC(http://crdd.osdd.net/raghava/genomeabc/),输入参数,获得模拟测序结果。
i3 V7 M6 k2 G编程模拟测序 ; j5 o5 L* O7 W( S2 k
下载安装python,并且安装biopython扩展模块,编写程序,模拟单端/双端测序。
; z7 N$ Y" I+ a! P) t三、结果
8 D. |: V" G: B5 y0 m. h6 e* p
+ K7 Q; |' @: g& U+ L a# d9 z1、art_454的运行结果
& J, u% w0 k3 O2 C# Y4 W
) g2 Z. o1 y8 e2 e% ?+ C7 {) |无参数art_454运行,阅读帮助文档
& O' F6 M# k8 g: h' ~" t* V5 s/ ~- I. I% M1 }9 }9 l$ i& G
图表 1无参数art_454运行
* Y) m8 u& U" H5 F4 b对酵母基因组进行基因组单端测序模拟,FOLD_COVERAGE设为20,即覆盖度为20.
& O( g/ @8 _ I" P6 V下图为模拟单端测序,程序运行过程及结果
$ d }8 y: n( o( l0 T$ u- z
$ a `* U' e5 z图表 2 art454单端测序
) }# ?$ ?3 q9 g8 Z) e, F
! Z+ A' l$ j) h1 z. u图表 3 art454单端模拟结果 - v; S" Q# R1 A4 L8 g! s W
双端测序模拟,FOLD_COVERAGE设为20,即覆盖度为20;MEAN_FRAG_LEN设为1500,即平均片段长度为1500;STD_DEV设为20,即长度的标准差为20 , I4 p X2 }) N& i
下图为模拟双端测序,程序运行过程及结果
5 P; l# q! i `& U1 }2 z, Q( o* i
6 d2 Q }! ?. z. K1 i' `& k& R- Y0 A图表 4 art454双端测序
1 T+ i5 e& I& ?
! e5 D; {/ u w) l2 J. y图表 5 art454双端模拟结果 5 K- |$ z ]" J1 V8 l# M$ I+ M
2、GenomeABC
- q) ?- i% A; L下图为设置参数页面 6 x @3 m8 }) U, d, i
" i% |$ z5 |+ q7 [1 a
下图为结果下载页面
- r* T" f5 l6 X- S- x; t2 e% v/ z) t6 ^* _
图表 6 结果下载页面
N5 I& c; w# j2 y+ l/ S3、编程模拟测序结果
) ]# i2 K& \. l, E- _4 H# @拷贝数是这里的N值;覆盖度是m,测序深度是宏观的量,在这里与覆盖度意思相同,就是测序仪10X,20X。
! x& S: }2 S+ Y# l单端测序
" w& x3 p0 E, `4 d2 C
& a, K- l6 N2 Q) n+ Q3 r* k- l图表 7 程序模拟单端测序 6 f9 O7 u+ k4 w% Y. O
双端测序
* i/ x F# P/ g+ n
1 r; L3 ]! H! y) ]$ \4 b8 F图表 8 程序模拟双端测序
0 F+ U2 V& M8 ^1 z( ~* _测序结果
7 T* P/ M2 P, S8 E r" r* ~! z, B$ v8 X' j6 `7 A# v0 `
图表 9 结果文件
' d% p1 b! L q4 l. @
: O" m& A8 n/ G) \& J6 r因为期望片段长度是600bp,在片段长度区间200-1000bp内,所以大部分的片段都没有删除。
4 t: e3 J. M9 q. u u测序结果统计表; D7 D3 q) _& G: M3 M- S2 y) C+ ^
P9 `( z% z' j, k% ^6 l" T
测序方式 基因组大小(bp) 片段长度区间 (bp) N值 期望片段长度 克隆保留率 片段数量 Reads长度范围(bp) Reads总数量 Reads总长度 覆盖度(m值) 理论丢失率(e-m) 覆盖率(1-e-m)9 Q8 d) \# r. e. n8 `- s8 @. U
单端 12157kb 200-1000 10 600 0.95 107378 50-100 101968 7645.541kb 0.62889 0.53318 0.46682" N3 g; [: U& q, P8 N% H
单端 12157kb 200-1000 20 600 0.95 213722 50-100 202996 15227.882kb 1.25259 0.28576 0.71424
( I5 Z/ f9 G6 a( K双端 12157kb 200-1000 10 600 0.95 106704 50-100 202770 15212.662kb 1.25134 0.28612 0.71388
: K/ N- e" Y5 G$ q9 c. K% B双端 12157kb 200-1000 20 600 0.95 214212 50-100 407186 30534.265kb 2.51164 0.08114 0.91886
$ ]5 d$ D* d9 @2 Y1 P四、讨论和结论
% {' B6 r, z# `+ y3 D+ Q- m E
* k# {/ ]2 v, ^3 B _* z程序运行方法* c: e& k$ Q: j' p4 ^- f9 `; x
1 D! X5 G. v% f7 _5 X& K1 n8 w
在类的构造方法init()中,调整参数。
6 O, {) p! V1 c: y0 t8 P' i3 EAveragefragmentlength为片段平均的长度;
1 u1 ^6 H( ^% X4 h- tminfragmentlength和maxfragmentlength是保留片段的范围; 1 c, {0 K+ ~1 K" V4 S9 m8 E( A! U6 S
cloneRetainprobability是克隆的保留率; 1 }$ C( h4 i# p( E+ h Q, m( H( k
minreadslength和maxreadslength是测序reads的长度范围+ f* P, G4 u+ b
% d$ @2 O @5 W0 Y" i
模拟测序的诸多方法都封装成了Sequencing类,只需要创建类,并调用singlereadsequencing()和pairreadsequencing()方法,传入文件名的参数即可。3 ?# j/ m3 t! P& n1 m
0 ]% s+ ?* e5 X$ U* v
附录: I* t6 Q3 \! R; a+ o8 R/ ~
P( {6 o! [: t# W4 D2 F
from Bio import SeqIO2 u7 B7 @8 c8 W0 ]: c" n0 h# u
from math import exp
% E: [5 X W6 w) e3 d; }import random
. i0 W) T b, {" I9 C: J; J% x3 @8 m# J! b$ t
class Sequencing:
# a6 U g6 _' u # N代表拷贝份数
* v) i6 a d" s: X% V& k# S5 F$ U def __init__(self)
" [. c* h* Z G self.fragmentList = []1 x& `# K2 L9 ~+ b
self.readsID = 19 G5 ^2 s; c l* w7 X/ F, _
self.readsList = []" a, x6 M, ~ _* O7 Q3 X. l8 h
self.averagefragmentlength = 6508 e. ]1 y B' [; S; ~1 w, g
self.minfragmentlength = 500, }1 h+ _# }. B! U' Q5 a
self.maxfragmentlength = 800
: R( g# n/ U+ l; L, H! v9 Y self.cloneRetainprobability = 1& Y9 H/ Z: f5 I( `
self.minreadslength = 50/ D/ i$ z0 l* h: u, ^
self.maxreadslength = 150
- I n P8 i# s k7 @% A self.N = 10" G7 {7 [8 x: J* p7 \
self.genomeLength = 0" ]% `/ z! C& J% ~
self.allreadslength = 0, v% \8 g( m q8 l
& {8 X! b' \( R) {6 _7 u # 生成断裂点
" L' `% }) r: {" x" \5 s8 {# h def generatebreakpoint(self, seqlen, averageLength):
: A; T, Z5 G( q! r4 y8 r # 假设平均每500bp 产生一个断裂点(averageLength = 500),通过随机函数生成seqlen/500个随机断裂点(1到seqlen之间的随机整数)
2 X, c& {& b$ X: F breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))]7 K# o+ @& o2 ?5 D$ H
breakpoint.append(seqlen)
; }( Z$ E0 P! m1 u3 W. j& o breakpoint.append(0)! g1 F2 ]+ s3 s1 c9 d
# 把随机断裂点从小到大排序
, u e3 c& Z; ]# n2 y! ` breakpoint.sort()
0 E* B% |# @) G* ?% s: m; N$ F return breakpoint
: z. O% x* v4 t/ r' s/ \ x) S$ w
) G$ Q+ j; t. Q M% l6 `' g # 沿断裂点打断基因组,并删除不符合长度要求的序列片段,定义片段范围:200-1000bp
9 J7 K6 v5 M6 ?) b+ f def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength):; U) W2 M1 l& A m
for i in range(len(breakpoint) - 1):. d0 N @1 D9 n0 w! [! t
fragment = seq[breakpoint:breakpoint[i + 1]]
8 v8 r# d' g. m c, V! Y if maxfragmentlength > len(fragment) > minfragmentlength:
8 t( ]& z/ m; X9 ~" H self.fragmentList.append(fragment)9 w/ P# `4 ^- d b1 p
return self.fragmentList1 ~5 B9 R% N* T' J, m: h& V
2 J: `0 n# {; r7 ^& J# F' q( X6 d # 模拟克隆时的随机丢失情况,random.random()生成0-1的保留概率
i. ] V! r0 q. T def clonefragment(self, fragmentList, cloneRetainprobability):; o4 a9 r( Q; f" y7 y: z3 K, z
clonedfragmentList = []
0 Q, y) \- O9 e( w- n; y Lossprobability = [random.random() for _ in range(len(fragmentList))]
5 y; N6 S/ t0 [0 _ h7 o for i in range(len(fragmentList)):
: F B0 [: \6 o) _ if Lossprobability <= cloneRetainprobability:
$ T# j. v/ I8 e6 v' }9 o clonedfragmentList.append(fragmentList)
' x$ Y" j5 p+ [1 Q: y return clonedfragmentList
' b% T- @0 o4 {4 N: t8 Q O+ |. l: m) H. \ H6 @+ l3 s/ G( S* p
# 模拟单端测序,并修改reads的ID号% v6 O, @- ]% c |9 k+ r, ?! |
def singleread(self, clonedfragmentList):
3 P0 n( S O- b. R# A9 G4 T for fragment in clonedfragmentList:
0 b. K8 Y7 B. g1 T4 ]. [ fragment.id = "", j6 B) Y; b7 f) v
fragment.name = ""
$ H9 `" a2 n0 {9 M: o4 m fragment.description = fragment.description[12:].split(",")[0]& x2 _& n. V$ O6 s. x9 r
fragment.description = str(self.readsID) + "." + fragment.description+ s8 ]) `2 v" d( [4 U B4 p1 l- ]
self.readsID += 1' o1 ?( p- ~% x* U. u) q( R0 i
readslength = random.randint(self.minreadslength, self.maxreadslength)
, N/ D7 z1 l4 m5 J self.allreadslength += readslength
6 D, q* G6 q0 g0 w self.readsList.append(fragment[:readslength])8 a0 K9 N" Y" X4 z
* R; u* e6 A9 W) n: w/ M def singlereadsequencing(self, genomedata, sequencingResult):: I' I& X3 e7 n+ w& R4 Y" S
for seq_record in SeqIO.parse(genomedata, "fasta"):
3 H7 B- R" S' ]4 Q. n- x9 n seqlen = len(seq_record)
! [ w: V# c: R) m+ z* F self.genomeLength += seqlen
' B. g% e6 p" }1 X& ^; s for i in range(self.N):' f( ?4 E8 q9 ~ i8 F
# 生成断裂点" u m0 K/ `0 m, q4 l5 E
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
$ c: S- W, m! f3 t& O # 沿断裂点打断基因组
8 y# N, i! Q# }$ F' d self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength). a# C' ^# b' y# ^. ^9 ~ s4 o
# 模拟克隆时的随机丢失情况) U' o% v$ X A# a
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)! t( y* R' [4 A( M% f
# 模拟单端测序
& T: t, U c- y+ Y' w! @ self.singleread(clonedfragmentList)& x& Y) o; l' U
SeqIO.write(self.readsList, sequencingResult, "fasta")7 I+ F0 T2 ~( h9 f) {& E
" l$ d% M5 c% w9 m+ g# ^ def pairread(self, clonedfragmentList):$ |, v/ z1 a) ]3 r
for fragment in clonedfragmentList:( y3 V+ W2 z! _/ O5 m% X, R
fragment.id = ""/ `0 U0 r2 g q2 S' u
fragment.name = ""- m- n$ ?+ n8 e$ f4 t
description = fragment.description[12:].split(",")[0]
, e/ [% U L+ p( Y2 h9 L+ O fragment.description = str(self.readsID) + "." + description
+ F! Q9 n/ ?4 c readslength = random.randint(self.minreadslength, self.maxreadslength)5 l; y1 d& Q9 E; B
self.allreadslength += readslength& b' F$ j8 |# \
self.readsList.append(fragment[:readslength])% j3 l" A* u6 A
/ x' k7 S! ~, F) H A: u
readslength = random.randint(self.minreadslength, self.maxreadslength)
' x9 k0 [# ?1 ^$ T) {: i3 {- ?. s; ]6 }/ Y self.allreadslength += readslength
* B$ v" H! V$ X- n2 K$ ^% e
1 [' S2 r" Y# }: ^ fragmentcomplement = fragment.reverse_complement()
6 o3 \' d' w: L9 x' E$ s2 Z fragmentcomplement.id = ""
0 m+ R: _$ f! {" b# R) t, i; O; a fragmentcomplement.name = ""$ Z f7 b' M6 v' C
fragmentcomplement.description = str(self.readsID) + "." + description
' C3 \- N$ o. S! p4 k7 l self.readsList.append(fragmentcomplement[:readslength])
& S0 j' _4 i4 @ u& l6 A( J1 K4 r
0 ]& @) ?# E' q6 J$ B self.readsID += 1
( }( k9 `. v, W) \8 D7 v& }# D5 b
& m5 p# t+ V2 h. i def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2):/ R, s' K8 v, v4 Z k6 Z& S
for seq_record in SeqIO.parse(genomedata, "fasta"):
. K4 H4 z- H2 X; i h% E: O seqlen = len(seq_record)
1 y$ Q \5 Z5 K+ z- L self.genomeLength += seqlen
1 L. |2 D# v# Y) ~4 g$ G7 [9 U" c for i in range(self.N):+ [* V; \& A6 l& H
# 生成断裂点
7 ^" ?+ O) g4 c9 E' S, E breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)" n* v7 a! c; _0 @ f0 K, ~
# 沿断裂点打断基因组: G* s. D& ~2 k* x: `- Q$ ]* c
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
, R/ a# D% W1 f( y # 模拟克隆时的随机丢失情况
! G5 L2 W2 ?* u1 I n clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
8 \; Y: m; w2 x # 模拟双端测序# s" G2 r Y; K1 g; @2 Y
self.pairread(clonedfragmentList)
2 p! B7 o0 h Z3 Z readsList_1 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 0]; N) H% u8 @: K4 @0 ?. N" e
readsList_2 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 1]
% y1 p: O4 n( l# d$ q SeqIO.write(readsList_1, sequencingResult_1, "fasta")# n2 Z) D( V5 Q& ?+ w
SeqIO.write(readsList_2, sequencingResult_2, "fasta")
" ]/ N8 s8 v g7 n8 J
6 x6 u5 J9 x2 p/ C" q" u( Q" | def resultsummary(self):
) X. K) B6 U7 ~6 o1 {5 W) n print("基因组长度:" + str(self.genomeLength / 1000) + "kb"): S4 M: D3 z2 \& Y- J
print("片段长度区间:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength))1 Y" x" B& m0 N) a2 Q4 [* P
print("N值:" + str(self.N))
/ }- E' v9 Z# J9 x8 p2 b print("期望片段长度:" + str(self.averagefragmentlength))
8 k$ U3 g( Z( Z5 G print("克隆保留率:" + str(self.cloneRetainprobability))0 o" H. c! y. t9 ~7 Z
print("片段数量:" + str(len(self.fragmentList))), Y2 z2 K; P' @$ `" y" s
print("reads长度:" + str(self.minreadslength) + "-" + str(self.maxreadslength)), J; Q. P* _5 Z- [9 @0 E
print("reads总数量:" + str(len(self.readsList)))+ o g. }/ u: w# f9 y
print("reads总长度:" + str(self.allreadslength / 1000) + "kb"); l& a, d3 j* a } u
m = self.allreadslength / self.genomeLength
+ d7 P/ b" L6 e$ \9 p0 ` print("覆盖度(m值):" + str(round(m, 5)))4 F* j7 E3 h% O0 N9 ?2 {; o0 K
print("理论丢失率(e^-m):" + str(round(exp(-m), 5)))) a1 b O+ H1 h
print("覆盖率(1-e^-m):" + str(round(1 - exp(-m), 5)))( _# |! `2 x; o' F* }
# -------------------------------------------主程序-------------------------------------------+ O0 ?1 r: Y; c( ?! g+ Z
# 模拟单端测序
. R2 {1 G. |6 `sequencingObj = Sequencing()
/ S) I$ E- h7 Y7 O; h: v; n1 asequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")7 s0 _ R2 S( X7 z: |4 h8 i
sequencingObj.resultsummary()2 o6 F* k" ^/ l$ ^7 F
! R& m. j; \* p1 I4 S# ?
# 模拟双端测序+ j8 A) V2 P8 ]$ x
sequencingObj = Sequencing()! x7 i. v# ?9 r( j' M( X! ?
sequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa")- h: c q: J0 Q: C1 W' E
sequencingObj.resultsummary()
q" j1 t4 m9 c- e- P5 R, Bfrom Bio import SeqIO7 b p1 a' I7 j8 Z7 q: S
from math import exp
+ b( J' c6 B) v" S$ |% `7 Y; Z6 Mimport random, T" h0 {$ K ?5 h
, `! e2 B) h+ {9 x& v5 e/ K
class Sequencing:; l u/ P7 R2 A, w V) J
# N代表拷贝份数
! j$ X& V! X! h& d2 i* y( f* m. i7 Q def __init__(self):, ?% z* }3 ~& Z6 {( _$ Y# D
self.fragmentList = []
- D* y' O# @2 @$ X3 n self.readsID = 1
) C# C3 H' A# F' Z self.readsList = []) R5 [2 W: u, [+ m% j
self.averagefragmentlength = 650
8 Q8 H) L- A! z4 g7 h self.minfragmentlength = 500+ `: @2 v# \& Y; [# J
self.maxfragmentlength = 800
* k5 a; l" y4 n+ {# @ self.cloneRetainprobability = 1
; ?& I) b) g" Y1 d4 u% j; v' N6 r self.minreadslength = 50! f' p3 p9 g! F. u' K/ g
self.maxreadslength = 1509 d; ?2 D. h1 u c {; q& h
self.N = 10
3 P+ ?& Z! j' x% t N7 n/ B3 ^) o self.genomeLength = 0- R4 P( u3 P9 a' ~% M+ g
self.allreadslength = 0
* U) ]% F x) e K( F- y g( @& j7 ]. |" F
# 生成断裂点
9 N/ n6 u! S/ |3 |" @2 e def generatebreakpoint(self, seqlen, averageLength):
' J5 Z# F( e! [. r- y; w4 @5 V # 假设平均每500bp 产生一个断裂点(averageLength = 500),通过随机函数生成seqlen/500个随机断裂点(1到seqlen之间的随机整数)( R5 M/ S2 b7 V: b4 j( m
breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))]7 h3 W* `0 m3 K0 } B9 y9 J
breakpoint.append(seqlen)
- v: L3 y1 P( C. O, {; e$ _ breakpoint.append(0)4 y/ @# }# ^. ]/ x1 G
# 把随机断裂点从小到大排序
1 C3 s( m9 ^% [. a8 i6 J breakpoint.sort()9 `& ~% d3 v" o8 ~; @8 j% a
return breakpoint
) P; }6 }1 X* \2 t% p9 p4 P5 u) L& B2 E' `
# 沿断裂点打断基因组,并删除不符合长度要求的序列片段,定义片段范围:200-1000bp
% T' p4 k" k" _1 w6 F+ v) F$ B def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength):7 l- j- o, o4 M" w* K
for i in range(len(breakpoint) - 1):: I8 D" }9 a0 L7 E" x! ]% K( I9 ^" K
fragment = seq[breakpoint:breakpoint[i + 1]]
+ V( ^& P n5 c3 E! i if maxfragmentlength > len(fragment) > minfragmentlength:- R p5 s. v& _
self.fragmentList.append(fragment)
7 Y0 s9 \: Z3 ]9 z% h/ r. Y# M return self.fragmentList
/ {9 e: ^1 C' {% J* K- e
6 _1 \! ~" |" w # 模拟克隆时的随机丢失情况,random.random()生成0-1的保留概率$ c# a. |2 f k$ D
def clonefragment(self, fragmentList, cloneRetainprobability):
; a2 B& M' P6 }2 \6 n clonedfragmentList = []
; p% e8 \+ |5 D3 W3 R- B Lossprobability = [random.random() for _ in range(len(fragmentList))], @, d3 v3 F l- b6 B- ]
for i in range(len(fragmentList)):
+ S; h( V2 V# {! W if Lossprobability <= cloneRetainprobability:
4 W# ~$ e8 E& V6 K! X3 l clonedfragmentList.append(fragmentList)8 x b# I( O5 @& R9 O
return clonedfragmentList- T# L: A* ^, y6 N# Y3 \: `
' n4 s3 ]3 _) t& F# |! S1 p0 C # 模拟单端测序,并修改reads的ID号
8 c5 s* ?1 K! [ W/ ^6 P1 } def singleread(self, clonedfragmentList):
2 o- W+ C2 h1 H8 Z# k, u for fragment in clonedfragmentList:
0 w7 L; m6 G2 `3 v7 [* _% } fragment.id = ""& ~5 x6 P. b% M& H% Q5 I) F- l
fragment.name = """ P! U \; S8 D, g5 x
fragment.description = fragment.description[12:].split(",")[0]
4 e$ a+ E- {, q( q! w3 ]' j8 R fragment.description = str(self.readsID) + "." + fragment.description% Q$ D' {; \6 L8 P f8 b
self.readsID += 1
$ k% A( y) ^* l* A readslength = random.randint(self.minreadslength, self.maxreadslength)5 U) ?! y1 U; l8 p H" X
self.allreadslength += readslength4 |8 b, ^* B! w8 u3 J* i v
self.readsList.append(fragment[:readslength])
: J% Q/ }. b5 ~8 h& K v. D; G6 H7 P1 l( a0 ]0 {* ~ T- V
def singlereadsequencing(self, genomedata, sequencingResult):
. G w6 P- [4 O+ p6 z/ r+ e1 v- o; F for seq_record in SeqIO.parse(genomedata, "fasta"):
0 t ?4 O: R9 |7 K( h+ r4 _ seqlen = len(seq_record)
% L& N: t- R" l+ V3 v self.genomeLength += seqlen! }2 b7 C0 k) @3 H% ^4 ^5 K/ v
for i in range(self.N):
2 ~$ m9 E8 B* x8 I* |* ~5 z # 生成断裂点- q* J6 g3 k7 F N- @
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)+ e/ g1 I3 b9 ^1 `. A# o2 h
# 沿断裂点打断基因组+ o9 G0 a* t, T* s1 F
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
7 [5 Y6 d! q+ |4 W0 X # 模拟克隆时的随机丢失情况' y0 X; ?7 U" O4 T1 \% c; o9 C' o
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)* }$ K( R6 |8 o! ^
# 模拟单端测序
+ }4 U" k- D, W self.singleread(clonedfragmentList)
5 N- D8 m% ^+ E SeqIO.write(self.readsList, sequencingResult, "fasta")# |' o- y( y9 {- U" T4 f4 z. o
* m0 b1 C, m. K+ p5 Z- b6 h3 S def pairread(self, clonedfragmentList):6 l! v2 b. n- O- R3 D
for fragment in clonedfragmentList:0 J. d; e0 ^& G; x
fragment.id = ""
* b# u9 g9 S) {7 \. @; P fragment.name = ""' J; F. y0 K* s- P- @1 P
description = fragment.description[12:].split(",")[0]
/ w( t- l0 K( @4 m fragment.description = str(self.readsID) + "." + description" Q7 Y' |3 m9 e3 _7 p/ D" h
readslength = random.randint(self.minreadslength, self.maxreadslength)! j/ O$ Y5 r- _4 l. `5 R
self.allreadslength += readslength/ V( ` @3 w" c7 n
self.readsList.append(fragment[:readslength])
# N, | H/ ]# n9 E9 N) W+ i
: [- X6 B* ~0 f5 R i& m readslength = random.randint(self.minreadslength, self.maxreadslength)
$ i; R- i& J/ q/ ]3 G$ j self.allreadslength += readslength
0 Z, \5 o1 I1 l {
/ [) d- v" B2 Y) j fragmentcomplement = fragment.reverse_complement()2 w3 D( Z, ?0 s& y" o2 c" L; B
fragmentcomplement.id = ""
^" K9 w$ ~* H/ m2 `7 \4 C! S: A fragmentcomplement.name = ""
8 J( M) f) H& M7 P; I! b) O: [5 U fragmentcomplement.description = str(self.readsID) + "." + description
7 \! A5 W* K2 Z" g self.readsList.append(fragmentcomplement[:readslength])! I' m4 P9 P- w9 {
& s! R0 j: _. w2 B+ i. y self.readsID += 1' g/ J4 J: `& [' W$ C$ V. F% ~
2 T# L" ?5 v: }- m" P
def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2):
0 \) E7 s8 }2 y+ ? for seq_record in SeqIO.parse(genomedata, "fasta"):
% v% W) |. w4 f* a# v1 z seqlen = len(seq_record), E" H, K2 O1 J. v3 |: X5 Y
self.genomeLength += seqlen
) k% n& ?2 x, ]% [. ]7 ~4 { for i in range(self.N):
* j5 o+ Q/ t9 n) [2 |* ?7 G8 d1 x+ L # 生成断裂点
4 U% G b0 C. T8 V$ Q2 n6 w breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)# _5 A1 d4 [6 n8 N7 s
# 沿断裂点打断基因组) q! y l5 O& d' d+ F0 i, E
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
3 s6 s6 l8 u" n. N" E4 p # 模拟克隆时的随机丢失情况
5 w$ s2 v1 x1 r! u$ T! g7 { clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)6 {0 n+ x* j; @$ I) j7 c S
# 模拟双端测序* j6 x" p/ i! v6 H0 J% f+ Q
self.pairread(clonedfragmentList)
1 p. k; o1 ~7 Z* k! F% q' Q2 ] readsList_1 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 0]
+ j1 L1 U; m& Y8 h# i: x! i readsList_2 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 1]3 G- ~$ c6 L) E9 l6 K
SeqIO.write(readsList_1, sequencingResult_1, "fasta"). w9 @" g* `/ [3 }% ?: }/ ~
SeqIO.write(readsList_2, sequencingResult_2, "fasta")
3 }% s7 V6 p2 C0 I E6 }. z, C; J( _! u. N/ M; {1 h7 {3 ^& w2 A0 }/ |
def resultsummary(self):
& T. _ o8 X6 M9 c# U4 r. u print("基因组长度:" + str(self.genomeLength / 1000) + "kb")
$ Z7 Y) F) m; V6 M+ ] print("片段长度区间:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength))
7 }& M" x$ Z# K print("N值:" + str(self.N))
# [* l3 T- z" T print("期望片段长度:" + str(self.averagefragmentlength)). |- h2 b" J1 b( U
print("克隆保留率:" + str(self.cloneRetainprobability)). R- I: ` p/ X1 T w
print("片段数量:" + str(len(self.fragmentList)))4 ], _4 |) g$ ^& _9 a; f# {" o9 {
print("reads长度:" + str(self.minreadslength) + "-" + str(self.maxreadslength))/ g& H. v! s) Q& o# W/ J
print("reads总数量:" + str(len(self.readsList)))
) `% D3 D6 ^2 n0 Z0 ?+ a. t print("reads总长度:" + str(self.allreadslength / 1000) + "kb")0 a" w C5 S7 J0 x% u9 Z9 z- j
m = self.allreadslength / self.genomeLength
( @5 s" v$ x! v! J4 J print("覆盖度(m值):" + str(round(m, 5)))2 w* U/ M# W- \% G2 J
print("理论丢失率(e^-m):" + str(round(exp(-m), 5)))6 v5 c9 Y2 y% |8 ]! V6 D
print("覆盖率(1-e^-m):" + str(round(1 - exp(-m), 5)))
# Q1 G. X4 W, Y2 e* c# -------------------------------------------主程序-------------------------------------------* C7 _; m# C# L9 W( C: [
# 模拟单端测序- G ~2 I' j. f I7 E5 G; x* W
sequencingObj = Sequencing()$ @ [- M+ o, r0 C
sequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")9 B: E& j7 k$ A Z" [
sequencingObj.resultsummary()
& W4 o6 f, z5 Q/ o/ P& ?2 B% T, X" Z( A: Y+ X3 z4 f- X
# 模拟双端测序
- F$ f5 g3 {5 hsequencingObj = Sequencing()5 @ u) g3 D% @/ _
sequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa")
0 I) D% e4 o# H- @! ysequencingObj.resultsummary()
0 c9 u2 m$ ^5 `$ i+ G
& P2 c1 W% X6 a' Q9 W
/ @+ n, f; i# P+ [7 i: q+ ]* W( b$ n8 p% i; x
% t3 N& U3 u, k( O* x) }8 [6 K6 x# q |
zan
|