- 在线时间
- 1630 小时
- 最后登录
- 2024-1-29
- 注册时间
- 2017-5-16
- 听众数
- 82
- 收听数
- 1
- 能力
- 120 分
- 体力
- 564624 点
- 威望
- 12 点
- 阅读权限
- 255
- 积分
- 174610
- 相册
- 1
- 日志
- 0
- 记录
- 0
- 帖子
- 5313
- 主题
- 5273
- 精华
- 3
- 分享
- 0
- 好友
- 163
TA的每日心情 | 开心 2021-8-11 17:59 |
|---|
签到天数: 17 天 [LV.4]偶尔看看III 网络挑战赛参赛者 网络挑战赛参赛者 - 自我介绍
- 本人女,毕业于内蒙古科技大学,担任文职专业,毕业专业英语。
 群组: 2018美赛大象算法课程 群组: 2018美赛护航培训课程 群组: 2019年 数学中国站长建 群组: 2019年数据分析师课程 群组: 2018年大象老师国赛优 |
基因组测序模拟 R5 N. w8 ~" W, f1 B
基因组测序模拟$ S4 u4 \* Y$ ^2 C
; {6 v9 U/ a A, J
一、摘要
' c6 m& t1 U& l6 d' `. I7 G
* J2 }: k- v, ~6 ^通过熟悉已有的基因组测序模拟和评估程序,加深全基因组鸟枪法测序原理的理解,并且能够编写程序模拟全基因组鸟枪法测序,理解覆盖度、测序深度、拷贝数等概念,设置测序相关参数,生成单端/双端测序结果文件
! |& L( X1 x: g& M- z5 |, A4 m3 V# V. c& l
二、材料和方法, D) f+ z f1 \' F7 T
% I! z7 J; U& K6 O8 B8 Y3 F
1、硬件平台
. F, d5 `$ {6 X8 i3 ?3 q. ~3 q- F U S( X. X$ j# |8 H
处理器:Intel(R) Core(TM)i7-4710MQ CPU @ 2.50GHz - ?9 Z) g- D; i9 u
安装内存(RAM):16.0GB/ i6 x' p" K" Y
5 f9 e, @9 O& l* E: a6 i3 e
2、系统平台
" F3 O: o k q( n2 R: O: H. pWindows 8.1,Ubuntu1 z/ P$ t# `+ x- r8 ~6 c9 `5 D
" Q* ]4 e# ?0 D! |5 v. p. {7 \
3、软件平台
; \+ G( s7 w1 ]' K" N
/ r" E2 I7 {, u1 L+ `9 v$ [4 uart_454
2 l2 I, u' Q/ ]. Q: S7 KGenomeABC http://crdd.osdd.net/raghava/genomeabc/
( g! K. D* o( hPython3.5
' h( {) d; |- x' S- o7 b4 nBiopython: `5 b$ V) w) ^$ ?; q. {$ l8 v* x
4、数据库资源7 U+ L+ [! Z" j5 `
& i, _% N& f9 ~+ mNCBI数据库:https://www.ncbi.nlm.nih.gov/- X v/ p3 R: d% p: r
3 ]1 [0 V a! \5、研究对象1 r- @+ Y$ y% l/ d% S M3 y
1 Z) V0 C0 B) F, v9 x/ |
酵母基因组Saccharomyces cerevisiae S288c (assembly R64)
$ ^( y: t _/ i6 `& m3 Nftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz% g# U5 X: C0 D. _/ k6 ?
/ }7 A# w: O1 u
6、方法) o' c- _$ _, Q2 T) j
' P% p% r5 r3 ?
art_454的使用
) x- q2 \/ F5 F首先至art系列软件的官网,下载软件,在ubuntu系统安装,然后阅读相关参数设置的帮助文档,运行程序。/ `! M; A4 D# ?6 g( j0 r$ |1 U
GenomeABC
1 I& ^, O2 ]$ y( Z- R进入GenomeABC(http://crdd.osdd.net/raghava/genomeabc/),输入参数,获得模拟测序结果。
" w! p, h1 x4 R! @$ |4 i/ U2 u编程模拟测序 & H: }1 ^1 k" Y/ Q. Q* i3 Y2 Q
下载安装python,并且安装biopython扩展模块,编写程序,模拟单端/双端测序。
. j& e" Q8 r, p三、结果
% {: d) N, O" z% N3 i' [$ E7 [$ X8 k2 C! v4 ~. ]( L
1、art_454的运行结果
) p" K" B, ~' _* i$ v
$ _4 i: J2 s' [ s( D n0 a# D无参数art_454运行,阅读帮助文档
: {8 |# [; j w% ~8 Q. q- W: {) e1 Q" Q* R( m, ?" ]5 T
图表 1无参数art_454运行 2 h, w! |( V+ q5 ]9 R* q/ e
对酵母基因组进行基因组单端测序模拟,FOLD_COVERAGE设为20,即覆盖度为20. : F9 R% S- ~; r9 u0 `- T: Y& y' x
下图为模拟单端测序,程序运行过程及结果
/ I" X9 A# K7 H
4 D5 H+ K, u% B图表 2 art454单端测序 " {2 G( S9 c3 d, n3 F% |; R' N, u. s3 o
% U7 f: ]& }* s
图表 3 art454单端模拟结果
$ b6 H! _& n, Q3 ~7 J, ]双端测序模拟,FOLD_COVERAGE设为20,即覆盖度为20;MEAN_FRAG_LEN设为1500,即平均片段长度为1500;STD_DEV设为20,即长度的标准差为20
: a7 Q' H6 T8 N% s$ }8 Z$ _) d下图为模拟双端测序,程序运行过程及结果
: X( I3 v' H9 I+ a! }) H* |% R3 j) g* M" K3 Y$ z7 T
图表 4 art454双端测序
) X8 r6 e% K" u6 V1 C
5 E. H1 U; J5 P7 L8 W x; Q2 X图表 5 art454双端模拟结果 4 `, g) Y4 M2 c7 ?) G6 y
2、GenomeABC * x0 o2 C9 m- l+ n5 V
下图为设置参数页面 / g6 D, ^: w8 m# \3 @3 X7 t8 ~
& {) }( @8 { g8 k0 R6 \& ^7 q下图为结果下载页面
6 h; V$ y, g( u- c6 e* z1 F2 t
$ C' `% X, j6 |! S+ L: C7 h图表 6 结果下载页面 * H- r1 G+ D- T1 N( F
3、编程模拟测序结果
% f* }. ~' R) j. U+ p! j M$ ?6 I拷贝数是这里的N值;覆盖度是m,测序深度是宏观的量,在这里与覆盖度意思相同,就是测序仪10X,20X。
3 p8 q/ h! |6 R; q( Q7 ?单端测序
8 N* C: s+ W/ s8 z; C
8 m* ?6 V: D `- R5 @图表 7 程序模拟单端测序
& s( ~; K9 l6 E9 a双端测序
d' l6 L, l8 P. @, o: \' U: t. _! D" y
图表 8 程序模拟双端测序
6 a# }) b4 n1 B% h: k测序结果 ' _0 R0 ^' [, Y' [% U3 t/ Y7 W
% E) ^! g& E3 i8 f; i
图表 9 结果文件- b! B+ Z; G4 }; [0 A/ p# h
( h' a# `, k0 E- \0 N
因为期望片段长度是600bp,在片段长度区间200-1000bp内,所以大部分的片段都没有删除。 5 k% @( ?) A0 q6 k7 ] m- _
测序结果统计表
" E, X3 \5 \( Y4 V9 v7 s" l2 Z3 v9 f- f# E' D0 ^
测序方式 基因组大小(bp) 片段长度区间 (bp) N值 期望片段长度 克隆保留率 片段数量 Reads长度范围(bp) Reads总数量 Reads总长度 覆盖度(m值) 理论丢失率(e-m) 覆盖率(1-e-m)0 J) \2 d: j/ T4 x& E v7 q
单端 12157kb 200-1000 10 600 0.95 107378 50-100 101968 7645.541kb 0.62889 0.53318 0.46682* C# a$ }- `2 L$ @/ m
单端 12157kb 200-1000 20 600 0.95 213722 50-100 202996 15227.882kb 1.25259 0.28576 0.71424. s5 k# B3 i- S1 q+ J4 U
双端 12157kb 200-1000 10 600 0.95 106704 50-100 202770 15212.662kb 1.25134 0.28612 0.71388
: p2 j6 c" ?* I r双端 12157kb 200-1000 20 600 0.95 214212 50-100 407186 30534.265kb 2.51164 0.08114 0.91886
8 l) a, [% ^5 Z四、讨论和结论3 D7 D) Y/ w1 m0 l7 {3 ]
6 T( [$ y6 W# l" o& e
程序运行方法$ I" z: N5 ?! k8 b3 ^" {) O
+ g" ?2 w( @. A0 i4 r
在类的构造方法init()中,调整参数。
& z: @ V- r: ]0 j q1 LAveragefragmentlength为片段平均的长度; 9 s, J/ T9 ~& N9 W- Y9 o. \
minfragmentlength和maxfragmentlength是保留片段的范围; 7 M8 m7 p- w2 W! U
cloneRetainprobability是克隆的保留率; * ]% F4 @' ?$ |6 s& c9 ~ p
minreadslength和maxreadslength是测序reads的长度范围
0 r2 J5 X$ E. {
, L, W3 r v, r2 t1 O模拟测序的诸多方法都封装成了Sequencing类,只需要创建类,并调用singlereadsequencing()和pairreadsequencing()方法,传入文件名的参数即可。' e) t0 w0 [/ |/ t4 F
" S/ a' `8 D) D附录/ O: B* e. ], `3 t7 C- N
7 ?4 N" k0 L% jfrom Bio import SeqIO
D4 X `& [% \4 z& w1 o2 pfrom math import exp
* w% |+ C$ a3 E! i) X0 L& E/ Oimport random# l7 g3 I7 _; ]8 N7 V
4 \" r$ \% g5 p }
class Sequencing:$ y# m: g9 E! p/ `/ f$ t
# N代表拷贝份数
5 e' r$ k6 I4 j def __init__(self)
+ M, K0 r S1 E& H self.fragmentList = []
* ?4 x% s. n( j1 c6 B3 ^) \6 h self.readsID = 1+ C. K# O0 w/ W
self.readsList = []
: O( e, S- O; Y& C. w7 B5 | self.averagefragmentlength = 650- W9 P: `( i! h2 r5 [
self.minfragmentlength = 500* B1 {4 c' j9 [% S, d- ^! F
self.maxfragmentlength = 8008 J, q$ f) B6 u
self.cloneRetainprobability = 1
7 k4 Q/ s; D3 U7 H* F self.minreadslength = 50
- [5 \$ N) ?. P3 K+ _1 {3 `% B self.maxreadslength = 1507 n+ {9 g# Z, U
self.N = 100 R, Q; u- J" R) g5 v. Z
self.genomeLength = 0/ S: F+ j" @/ B3 E+ `6 y! r
self.allreadslength = 0
4 D4 w. Q* t) T4 H9 r$ K- e1 D" o$ z+ q5 A; e' F5 a' D: u
# 生成断裂点
# n. G `! p- u/ ~ def generatebreakpoint(self, seqlen, averageLength):
' M5 c2 P2 y+ d O5 I0 M0 G8 o # 假设平均每500bp 产生一个断裂点(averageLength = 500),通过随机函数生成seqlen/500个随机断裂点(1到seqlen之间的随机整数)
% O: T: p, H+ h; d( _ breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))]
" l6 \' x! X. P7 N$ k% V, k" V! v breakpoint.append(seqlen)8 v m. V! n c/ O/ C5 A
breakpoint.append(0)
# Z0 H& V# S& L9 x* C # 把随机断裂点从小到大排序0 v7 t; R5 H2 R4 i5 o. s( a" o
breakpoint.sort()
( [0 ~* O5 c! ?$ Q+ T6 f1 {/ ~ return breakpoint4 P" q' F/ J6 e8 w8 }! I- T
8 x. Z0 T6 i% O
# 沿断裂点打断基因组,并删除不符合长度要求的序列片段,定义片段范围:200-1000bp7 i0 w7 w4 {0 P I! f
def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength):7 \; b1 K% V* H2 [ _: h3 b
for i in range(len(breakpoint) - 1):2 Y0 }, V2 N6 U+ ], l( ?6 Y
fragment = seq[breakpoint:breakpoint[i + 1]]# z, N) n y9 o/ A5 I' _
if maxfragmentlength > len(fragment) > minfragmentlength:8 P6 @% ?5 B* X" P
self.fragmentList.append(fragment)
2 R+ S3 B+ r! m8 u return self.fragmentList
. k4 d: F( H% R9 P' f' M, w0 I0 Q; O8 B @
# 模拟克隆时的随机丢失情况,random.random()生成0-1的保留概率, T' y$ g6 s. {/ e& u$ j; B
def clonefragment(self, fragmentList, cloneRetainprobability):: Y H6 d% `2 F& i% ?6 [; c2 k
clonedfragmentList = []
* D* M( }, j1 A. V9 E Lossprobability = [random.random() for _ in range(len(fragmentList))]0 @- [: { F8 Z- z
for i in range(len(fragmentList)):) i' y/ l, B0 D
if Lossprobability <= cloneRetainprobability: q$ L" H0 y: _* ?& _
clonedfragmentList.append(fragmentList)+ f) D/ L, K- f# y$ C$ f0 i
return clonedfragmentList
0 F% z5 h7 B6 f. T3 G) b$ d) j# H; E
# 模拟单端测序,并修改reads的ID号
, |' Z. T# B0 o' J) a def singleread(self, clonedfragmentList):
& k% e" ^) L4 l+ t for fragment in clonedfragmentList:5 j, h6 U; q X) X3 K
fragment.id = ""
8 ?8 W" C$ v' m5 z% Q5 f fragment.name = ""
' D) `0 S% Z: Q fragment.description = fragment.description[12:].split(",")[0]+ k* c# m5 m3 j5 k6 m/ C
fragment.description = str(self.readsID) + "." + fragment.description3 N5 Z' j" T% I
self.readsID += 1
$ h& M! d$ I& z7 S readslength = random.randint(self.minreadslength, self.maxreadslength)
! R9 q- ]+ ]( k% \ self.allreadslength += readslength
' F/ Y7 B* D# s$ }* H G self.readsList.append(fragment[:readslength])
2 C A. M) c t" l8 D* D' ?# E) ], M6 f7 _8 w9 a- B# k2 W/ [3 Y' T; _
def singlereadsequencing(self, genomedata, sequencingResult):. K: Y/ [0 W+ D' l* C* A
for seq_record in SeqIO.parse(genomedata, "fasta"):
3 N0 H2 g" d$ b seqlen = len(seq_record)
" D& L7 C3 C6 X9 G, H, I2 R+ ?. J \ self.genomeLength += seqlen
' k9 i6 U+ q! X1 L* k0 L for i in range(self.N):
3 X2 i; d# T# i& T& Y1 Z5 f; }' k # 生成断裂点1 T& f5 j6 g, Z) p
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)% u, ]- G9 `9 K- J
# 沿断裂点打断基因组$ ?: l) h! _8 X, ]) z/ F4 {/ x7 m! x
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)! i( M9 j* Y' F! I" n1 i, R6 Q# p
# 模拟克隆时的随机丢失情况
L2 p4 @3 \: C% `7 X' B/ b$ a clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)3 u! Z2 _( s5 y6 Z9 Z" a; m
# 模拟单端测序' N2 j$ S. ?3 u$ h0 {
self.singleread(clonedfragmentList)
- G4 i9 p/ C' l+ r1 a* l SeqIO.write(self.readsList, sequencingResult, "fasta")
0 d2 w" N ]1 Y6 k$ ]7 [ r" N4 e, h0 `- j
def pairread(self, clonedfragmentList):
+ F7 I. v/ d6 H) @* z for fragment in clonedfragmentList:
; B5 q: N6 }1 ~4 c* D/ v! ~ fragment.id = ""# n/ H! S4 R. i, R9 ^* D
fragment.name = ""& Q1 u/ F s: o6 V# w7 [8 o5 L0 ]4 ]
description = fragment.description[12:].split(",")[0]9 s J8 j! D$ T T y$ a* \7 T
fragment.description = str(self.readsID) + "." + description* } B/ W& ]2 _ D# ^. p! A3 j
readslength = random.randint(self.minreadslength, self.maxreadslength)5 P$ O9 U8 y: ]1 R
self.allreadslength += readslength
E% S# |1 @2 r* O, y self.readsList.append(fragment[:readslength])+ t2 B8 r9 C; X& ?. u) D
3 i: `* |' H5 v9 f1 N readslength = random.randint(self.minreadslength, self.maxreadslength)
7 N/ \& Y# \& K! z self.allreadslength += readslength V# e# I6 G2 d2 T# b0 s+ K
+ ]% j% H8 x& S" J% }: ^ fragmentcomplement = fragment.reverse_complement()/ O- u0 b5 v& W* M
fragmentcomplement.id = ""7 Q6 n) Y; l* T4 P1 B1 g, y' t3 w
fragmentcomplement.name = ""+ W' a$ L# H: W3 @4 B& J; D
fragmentcomplement.description = str(self.readsID) + "." + description
! d% y+ k. ~+ d6 W self.readsList.append(fragmentcomplement[:readslength])
7 `8 D; }& _ d: X6 p6 T: Q, V* X5 ^ o; R- u' z
self.readsID += 14 @- L3 K _4 }4 T+ D/ g/ d5 b
" \$ V4 P' V1 P def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2):+ M3 v& i$ L0 R
for seq_record in SeqIO.parse(genomedata, "fasta"):# }9 o4 h1 I* }8 j n) z6 l
seqlen = len(seq_record), C& K) T, c3 `% K* K
self.genomeLength += seqlen- N% q$ V+ C* M& b. t2 @/ U
for i in range(self.N):
+ b5 f' ^- H) i* C # 生成断裂点
. d0 `5 u; t7 h' t- }/ c i breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
F7 o# e0 D( @. I# b6 A: t # 沿断裂点打断基因组4 }- w Y+ K0 c
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
) Q& b2 L4 f7 d( I% Z/ F, S # 模拟克隆时的随机丢失情况
* D1 ^! _8 f5 G$ J; X) m- v8 J clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)6 W: X$ ]' B- t- ]8 G
# 模拟双端测序, G: W1 M9 O E2 p9 i
self.pairread(clonedfragmentList)
1 O% p" U- E4 ~) ]7 A- f1 G- d readsList_1 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 0]" s5 S4 o* c3 Q/ V7 o
readsList_2 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 1]
: N7 a5 U/ ?- s9 c SeqIO.write(readsList_1, sequencingResult_1, "fasta")& {1 K6 a( h+ |/ |
SeqIO.write(readsList_2, sequencingResult_2, "fasta")) O4 a; ^. x4 f2 l" c- ^7 v# @2 A
/ ~8 g, N2 Y9 p$ X' q- G6 ]. V
def resultsummary(self):: F5 f% W- j. k7 a) k9 T
print("基因组长度:" + str(self.genomeLength / 1000) + "kb")/ t- X3 v2 C- ~6 [ h
print("片段长度区间:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength))
: m4 r3 \8 a2 s6 @$ c print("N值:" + str(self.N))) V% ^" ?9 k' W# h" F
print("期望片段长度:" + str(self.averagefragmentlength))
+ ?, |, z" ?; l2 ~. f+ f, V print("克隆保留率:" + str(self.cloneRetainprobability))
- n! o0 U8 y; ?/ [) R. p4 B$ h print("片段数量:" + str(len(self.fragmentList)))8 `5 z- i# v' D. @ F
print("reads长度:" + str(self.minreadslength) + "-" + str(self.maxreadslength))6 s4 e& o) a0 h6 K; W
print("reads总数量:" + str(len(self.readsList)))6 A/ C% ~% x/ G0 H: m( x5 i- e" [
print("reads总长度:" + str(self.allreadslength / 1000) + "kb")+ @6 I/ d6 {$ x I2 c9 g
m = self.allreadslength / self.genomeLength
/ }8 F @7 w$ v8 }/ J% @* V) N; T print("覆盖度(m值):" + str(round(m, 5)))
/ q6 Q$ T# E, b1 B+ e B' s print("理论丢失率(e^-m):" + str(round(exp(-m), 5)))( p; P1 @. `6 B' W$ S
print("覆盖率(1-e^-m):" + str(round(1 - exp(-m), 5)))$ [7 A4 o/ b J' U3 ? Y
# -------------------------------------------主程序-------------------------------------------
# I8 {. c: @5 D' ~: j$ T) Z& z# 模拟单端测序
* X7 j% x) D! @, HsequencingObj = Sequencing()
0 k3 x3 R% d# esequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")
) G5 f r# v0 {( MsequencingObj.resultsummary()3 k' C6 w/ e+ Z$ G5 N0 p8 q
# F4 d$ O: L5 V; ~' D# 模拟双端测序, Q1 i4 [/ `; I7 ^" b
sequencingObj = Sequencing()
{, z3 u1 d9 j {) V1 K+ jsequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa") F! s: b" I8 Q6 J, e
sequencingObj.resultsummary()
' z, w6 M L0 v( \7 }* \3 L2 ifrom Bio import SeqIO' ]7 x0 [) f8 j9 E1 U' ~9 P
from math import exp+ m, C9 j0 O( W0 |# a
import random
# Q) Y- f/ K$ M7 ?5 g; K
% q4 o/ W9 e; O, o. v$ Yclass Sequencing:
+ q6 e( I+ b) K8 a! I # N代表拷贝份数 C! K/ J* P9 A e5 y
def __init__(self):
/ \% `% X" m' v/ |4 Z1 A% J5 W self.fragmentList = []7 s, |( j; Y- J- l
self.readsID = 1
0 E! v! _0 i) ^ self.readsList = []
8 B* l3 h8 n4 g$ X# l self.averagefragmentlength = 650
9 w; y5 Y3 `2 b& ^8 o self.minfragmentlength = 500
0 X) f) S; I4 d5 n; t/ @ self.maxfragmentlength = 800
) Y6 Z" {5 A" ]5 W: g self.cloneRetainprobability = 1- G4 l, z6 |% I d9 @
self.minreadslength = 50
( }. j* W5 {% T% s& C# z self.maxreadslength = 150
4 A4 P+ x. V% X- i( R, E9 S! P self.N = 10
5 L2 Z4 k) E1 I' B self.genomeLength = 0/ u6 t% J) ^+ l8 h
self.allreadslength = 0
; v, l I. P% p5 Y& r2 z2 M3 B$ Y2 V2 M
# 生成断裂点
: K( q) R' S9 z3 |) B def generatebreakpoint(self, seqlen, averageLength):
- N" K& S+ z: p! r& s9 Y5 U # 假设平均每500bp 产生一个断裂点(averageLength = 500),通过随机函数生成seqlen/500个随机断裂点(1到seqlen之间的随机整数)$ ~, K L( m- p, K: n
breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))]4 ~' l) _, Z- z' a3 L0 C1 U2 U9 g: q; `
breakpoint.append(seqlen)6 n1 s" y9 x# }
breakpoint.append(0)% {- a2 d W) B) l( ?; a
# 把随机断裂点从小到大排序
4 W5 b" I" g' M. c* H3 v breakpoint.sort()
: h5 o1 N8 e7 b* V1 o+ v4 ^ return breakpoint# B5 n1 E% K( p, s; s0 `- W
! N( F: k0 D; H( T- W( g; W W/ Y7 `
# 沿断裂点打断基因组,并删除不符合长度要求的序列片段,定义片段范围:200-1000bp
! p$ m9 I. l. i" N def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength):
) E/ R6 m! M: w) Z+ E( m for i in range(len(breakpoint) - 1):
" P4 j6 w1 @. h3 M8 s, y' C$ v0 ? fragment = seq[breakpoint:breakpoint[i + 1]]% D+ n r3 i% x
if maxfragmentlength > len(fragment) > minfragmentlength:
/ z3 P$ n. d3 U) b) F2 n$ a self.fragmentList.append(fragment)* n$ n3 H n+ q, k9 m) t% Q
return self.fragmentList! K" Y9 ?+ u$ P* [
7 B" c' ~9 ~3 d! B7 t) ?& q0 X # 模拟克隆时的随机丢失情况,random.random()生成0-1的保留概率( ?+ n5 F6 h& P; r1 x" a( J; G
def clonefragment(self, fragmentList, cloneRetainprobability):9 J1 b; Z5 S2 x' P
clonedfragmentList = []
" ?5 I* [2 G+ O' U Lossprobability = [random.random() for _ in range(len(fragmentList))]
8 S! q/ }3 Y' v) L) x' A% j for i in range(len(fragmentList)):- b' @6 w6 a4 l. h' J% V
if Lossprobability <= cloneRetainprobability:
) V) ~1 _& t6 g% {& y, `9 u clonedfragmentList.append(fragmentList)
' c1 [) R1 z$ P2 R$ a' i1 U( ` return clonedfragmentList% h" b. e+ |- Z
- o* e$ r' x6 [9 {6 d
# 模拟单端测序,并修改reads的ID号" d1 m& v% u K: a
def singleread(self, clonedfragmentList):
" V+ h0 F2 G" {* R for fragment in clonedfragmentList:' b7 R$ Z2 k7 R/ Z9 b
fragment.id = ""
" ^( i x( ]0 m" ? fragment.name = ""
$ k/ [6 O4 s6 c6 [% h9 R. Q fragment.description = fragment.description[12:].split(",")[0]
& b+ V/ U, _! O' c0 g( }9 B4 u fragment.description = str(self.readsID) + "." + fragment.description d( G& H8 o1 t: j. r; Q
self.readsID += 1+ z3 |: [" Y+ O0 n' E
readslength = random.randint(self.minreadslength, self.maxreadslength)
8 _2 |! } c% l! Q/ h; @ self.allreadslength += readslength6 a2 R! o# b$ b8 @7 V h
self.readsList.append(fragment[:readslength])1 o5 c; D) ]6 s0 ^* g/ |" T" j
! F4 m& ^. A7 j. ^6 ` def singlereadsequencing(self, genomedata, sequencingResult):3 u3 r3 X3 I6 `" Q" A/ O% K5 d% D% s
for seq_record in SeqIO.parse(genomedata, "fasta"):
1 C2 @: w) J0 \/ k$ l) ~6 } seqlen = len(seq_record)( x. l+ |6 M) @; m C! c, |4 z7 y
self.genomeLength += seqlen
+ I F& T+ q% N" Y for i in range(self.N):4 b W5 H* A% C5 U- [' N
# 生成断裂点
8 u+ p8 S% l$ e6 I5 c+ n breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
, n1 f# j/ B0 W- N% X, k7 b: z* R$ a # 沿断裂点打断基因组
" P* L( j/ M) A( Q" e1 ^" S% ] self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
% ]+ T% w/ E+ j! Q! L # 模拟克隆时的随机丢失情况
. m0 K: P# D6 f) {& E' x+ W clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
7 i s1 ~ {% |+ i7 F # 模拟单端测序% _( }, Y+ f$ ?* b1 I- e5 L" _
self.singleread(clonedfragmentList)0 I1 l' U8 J3 z+ {# r% Q5 \1 F
SeqIO.write(self.readsList, sequencingResult, "fasta")
. w+ W; S& Z# G% a) G, {
5 ?5 T o, T" ^5 `: r def pairread(self, clonedfragmentList):
* [$ n7 H: L7 i* j for fragment in clonedfragmentList:
& |1 p: C+ K. o" ^/ B1 E& w: p fragment.id = ""7 o( [$ w. h, t2 x$ R: r1 m( ?. d! p
fragment.name = ""
# _3 {* E" y0 c0 G description = fragment.description[12:].split(",")[0]
. G5 T3 b$ m$ A$ t. r( M8 A fragment.description = str(self.readsID) + "." + description5 m3 _4 e3 S0 m/ O. ^
readslength = random.randint(self.minreadslength, self.maxreadslength)
$ P6 _6 R$ c$ f1 w self.allreadslength += readslength' @: @4 y- Q6 k# G7 N V, u* w. o
self.readsList.append(fragment[:readslength])+ ~9 ?0 j9 d1 T+ G9 F0 Q
$ a, g# w( ^" P1 e, N! ^ readslength = random.randint(self.minreadslength, self.maxreadslength)- {1 X% L( ^) \. F: h9 z
self.allreadslength += readslength# q w3 c9 U b! M, N4 w
3 s% @0 w6 O4 Y* g fragmentcomplement = fragment.reverse_complement()
6 e% g$ p e, c8 z2 h/ \ fragmentcomplement.id = "") A9 G1 B3 a7 o9 `: W, W; N& D/ A
fragmentcomplement.name = ""
7 ]' j) f% N3 j" p: t. C fragmentcomplement.description = str(self.readsID) + "." + description9 T; |* L4 @8 O$ R4 \6 I
self.readsList.append(fragmentcomplement[:readslength]). W$ f5 ]; `% f' J9 l5 Y) x
4 W5 g) I. i! b8 X2 B, V
self.readsID += 14 J; p1 u" c9 g
/ ~5 E8 i b8 C) p; ^ def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2):1 w/ X1 l7 c0 y) k8 v# m0 f
for seq_record in SeqIO.parse(genomedata, "fasta"):
( P7 g- n# v4 F% ~ seqlen = len(seq_record)
8 [% P8 v% i. N5 Q6 A6 V* | self.genomeLength += seqlen
6 Z6 K% X0 b( t9 B8 o for i in range(self.N):& O/ @# P" ^( r* @& O" X
# 生成断裂点* F. s1 u Y2 H; z: d
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
$ X/ d# ~% J1 M( p( l # 沿断裂点打断基因组, D/ I. U0 r' J" [: F- U
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength): {1 \4 h# a% s; K0 F) j5 u
# 模拟克隆时的随机丢失情况
* e0 V* z, w2 I2 V1 V, a clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)( r7 {4 X9 V" a) h
# 模拟双端测序, d Y$ Y* i2 a" J
self.pairread(clonedfragmentList)
6 s$ ?0 a8 L( r9 v4 V- i# _ readsList_1 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 0]
* L! k5 x/ J# e% M readsList_2 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 1]
8 l4 ~; D& U: n! A; f SeqIO.write(readsList_1, sequencingResult_1, "fasta")
" H K4 H4 ~, x, n+ R SeqIO.write(readsList_2, sequencingResult_2, "fasta")% W- E/ J: ~8 X6 S1 }( P
. Z6 u% c) Z2 o
def resultsummary(self):/ X, g E; l/ p& K
print("基因组长度:" + str(self.genomeLength / 1000) + "kb")# t. M! J: `# `
print("片段长度区间:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength))
+ N- U+ K L/ ?! ` print("N值:" + str(self.N)): K) }/ A5 e, p& h
print("期望片段长度:" + str(self.averagefragmentlength))
7 S* W9 ]/ O8 P T4 E5 h print("克隆保留率:" + str(self.cloneRetainprobability))
- m+ w. N Q; h% `6 {7 V1 @ print("片段数量:" + str(len(self.fragmentList))): L" ~ L; e' n* p2 U$ R% o
print("reads长度:" + str(self.minreadslength) + "-" + str(self.maxreadslength))! J* w* r j" D
print("reads总数量:" + str(len(self.readsList)))9 t7 z" K. q- M3 \
print("reads总长度:" + str(self.allreadslength / 1000) + "kb")
0 r) t# E. m& B m = self.allreadslength / self.genomeLength
! e4 ~) |2 S* E# b. u" R print("覆盖度(m值):" + str(round(m, 5)))
|9 x V; |4 r print("理论丢失率(e^-m):" + str(round(exp(-m), 5)))
( s- n: u2 c$ s print("覆盖率(1-e^-m):" + str(round(1 - exp(-m), 5))), m# V1 s, a2 O0 q8 E
# -------------------------------------------主程序-------------------------------------------
% x9 ^3 g2 X3 K8 J9 v Q6 Q# 模拟单端测序2 A5 X* j. ^4 M! ~
sequencingObj = Sequencing()
, A6 T& j; d% N+ J- `: nsequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")) U( r$ P! g$ f8 S/ t. g1 {- _( \
sequencingObj.resultsummary()! \% y+ D" ~) ~
1 i3 p2 }" A* g+ u6 Y0 P( y
# 模拟双端测序
5 O" K2 {$ [) {4 o5 P4 |5 j' lsequencingObj = Sequencing()
' k& `' N% S7 R% LsequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa")
, Z1 y9 s, r: dsequencingObj.resultsummary()0 R/ S- ?- c4 l, y' \& d1 K( f
7 E: I$ `/ p0 S, H& i( T! o
8 v5 O6 N3 L' Y! W7 o. |. @5 @+ K$ b" q
! k, X8 E; e5 _6 L |
zan
|