- 在线时间
- 1630 小时
- 最后登录
- 2024-1-29
- 注册时间
- 2017-5-16
- 听众数
- 82
- 收听数
- 1
- 能力
- 120 分
- 体力
- 563408 点
- 威望
- 12 点
- 阅读权限
- 255
- 积分
- 174245
- 相册
- 1
- 日志
- 0
- 记录
- 0
- 帖子
- 5313
- 主题
- 5273
- 精华
- 3
- 分享
- 0
- 好友
- 163
TA的每日心情 | 开心 2021-8-11 17:59 |
|---|
签到天数: 17 天 [LV.4]偶尔看看III 网络挑战赛参赛者 网络挑战赛参赛者 - 自我介绍
- 本人女,毕业于内蒙古科技大学,担任文职专业,毕业专业英语。
 群组: 2018美赛大象算法课程 群组: 2018美赛护航培训课程 群组: 2019年 数学中国站长建 群组: 2019年数据分析师课程 群组: 2018年大象老师国赛优 |
基因组测序模拟2 R1 K% q$ Y% ]+ f7 b4 y
基因组测序模拟7 p9 |1 P# J# `* {
& j6 D# T# T. E* D4 k! O* `
一、摘要8 o' c- z0 [0 h6 O& K
- ?1 A. \5 R6 [" a: W
通过熟悉已有的基因组测序模拟和评估程序,加深全基因组鸟枪法测序原理的理解,并且能够编写程序模拟全基因组鸟枪法测序,理解覆盖度、测序深度、拷贝数等概念,设置测序相关参数,生成单端/双端测序结果文件
/ B6 }/ H$ `3 o' f$ s$ C9 K `3 Y
* y: U* w) B" ?$ \9 B二、材料和方法1 ^2 z3 w' b# }
+ ~ Z3 c' t& P3 x5 ? O% Q1、硬件平台
- } n% H! u* L1 L/ B+ O% K3 \- I9 `4 X0 @- G V/ g3 z
处理器:Intel(R) Core(TM)i7-4710MQ CPU @ 2.50GHz 0 G$ H0 s( |/ L" e$ b/ ^1 e
安装内存(RAM):16.0GB: ?/ |4 D% Q1 }+ M$ }
; L6 S" Z5 Y+ `7 j9 H% [6 U2、系统平台) W$ H8 j1 W! ~8 |
Windows 8.1,Ubuntu/ V9 r! o2 r7 I: i' N( d
* G! H7 {7 `0 |; R ?' T. N0 T3、软件平台
+ O" I) K; P9 G' y+ o$ f8 @( H i* Z% z" c U) |
art_454
1 U- j) B& ~& E/ P: w0 P: EGenomeABC http://crdd.osdd.net/raghava/genomeabc/9 ?- N/ M8 o# p6 G, c9 n9 C
Python3.5" n1 ?$ B+ u& X4 v _- X B; R
Biopython
4 u% f2 `5 k; J' w4、数据库资源' s! ~( I* e8 S/ R* C7 Y
/ k. R% [! `' D( Z) H4 W
NCBI数据库:https://www.ncbi.nlm.nih.gov/( ]2 ~" G! }& c; |* ^& j+ W
- t- l0 [. I' u( h4 ~
5、研究对象4 [) l! M# T$ c' x* U* B
3 W% \: w- j; Y5 t/ t2 I7 U酵母基因组Saccharomyces cerevisiae S288c (assembly R64)
& H* x ]. Q2 _; n! M( J5 Uftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz: a: o/ s* ?9 }% Z$ ?
2 ^9 n. \* M6 l' A: G% V6、方法 E8 }( [$ v& R% U, ~3 B& k/ d$ L
" m$ w2 w. r& p8 N. kart_454的使用
0 t' k$ S* {5 S: x: Z3 U ~. s, [首先至art系列软件的官网,下载软件,在ubuntu系统安装,然后阅读相关参数设置的帮助文档,运行程序。
' ?$ B& @, w/ K; rGenomeABC
; X: l' ^ B, h" Q6 u- a) Z0 {进入GenomeABC(http://crdd.osdd.net/raghava/genomeabc/),输入参数,获得模拟测序结果。
. c4 L" a9 x! l# R' ^编程模拟测序 ' d5 |, n D9 E' t( c. B, x; P Z: K
下载安装python,并且安装biopython扩展模块,编写程序,模拟单端/双端测序。
- ]5 E5 }9 i. \! J7 R1 ]- H三、结果
9 N" V* j0 x$ Q P9 C( ^3 ^* V
% X2 c; B- k- s* V) v5 l* w1、art_454的运行结果
5 f6 a% k8 u/ C- _- i0 B' Y6 x4 x3 z _8 @2 C
无参数art_454运行,阅读帮助文档 0 j" \& Z, T) u& y7 b2 w0 X2 H' R
6 K( s: j2 Q8 H
图表 1无参数art_454运行
1 I% I% u, I* L+ D, e对酵母基因组进行基因组单端测序模拟,FOLD_COVERAGE设为20,即覆盖度为20. 7 W G- v* |' ~2 N
下图为模拟单端测序,程序运行过程及结果
) P( j4 ?) T; F# Q0 v/ _9 J; i
0 l% x! Y3 ^0 c/ n+ q$ M/ s. e5 P图表 2 art454单端测序
/ r% j8 M& k; h3 v4 }! x
3 U, h9 ]2 _; b图表 3 art454单端模拟结果
/ P! v0 u; w0 X+ J" z& K2 A+ S6 |/ E1 [% v双端测序模拟,FOLD_COVERAGE设为20,即覆盖度为20;MEAN_FRAG_LEN设为1500,即平均片段长度为1500;STD_DEV设为20,即长度的标准差为20 , a; t1 E; ?! ?/ a E# h
下图为模拟双端测序,程序运行过程及结果 8 Z+ q' s% v) @2 i
' ^, I+ I0 D( [# {
图表 4 art454双端测序
* K2 y- q" _! D. }5 B+ o6 J; A A8 {. v3 @
图表 5 art454双端模拟结果 ! R- F' k' C8 g2 _) e
2、GenomeABC 8 z. H4 @! u- u; h2 Y
下图为设置参数页面 $ i7 p) y- |) F" l. k
% e. H" t) W$ o3 Y
下图为结果下载页面 G& v$ v; E! q: i8 k+ U b
5 H2 H2 u0 q) p3 t0 T6 L
图表 6 结果下载页面
% m7 J3 W2 A2 V& o2 t q9 G$ T6 F5 o3、编程模拟测序结果
; H8 s( t* E- Z! {* C拷贝数是这里的N值;覆盖度是m,测序深度是宏观的量,在这里与覆盖度意思相同,就是测序仪10X,20X。
! N: F: k7 i' D5 u. p单端测序 . f% z5 K! W }
- `5 Q- u7 G @" b" X$ d( T1 o
图表 7 程序模拟单端测序 6 P+ C, J! f3 X" b2 ` D4 m
双端测序 & P5 X5 [1 X1 t# z2 H) }/ s
- z5 J6 Q1 l0 c& n, D% s图表 8 程序模拟双端测序
. ~( ^- {$ q4 I! `* `' Y) F测序结果 . I8 ~" S" E7 n9 F
1 g9 ~( _( W h% @( G图表 9 结果文件7 K+ g: {2 T9 X3 K2 z+ w# C# u
& N! L! }0 |/ y" g# X因为期望片段长度是600bp,在片段长度区间200-1000bp内,所以大部分的片段都没有删除。 U5 u9 Q9 O; X9 n; w5 w0 X, Y
测序结果统计表
% A4 n' i9 \) h$ J- M! E9 [, j
! s2 f R4 _+ Y. Y测序方式 基因组大小(bp) 片段长度区间 (bp) N值 期望片段长度 克隆保留率 片段数量 Reads长度范围(bp) Reads总数量 Reads总长度 覆盖度(m值) 理论丢失率(e-m) 覆盖率(1-e-m)
0 {, |* Q3 S+ h5 K2 E& z) U& c9 Y% W单端 12157kb 200-1000 10 600 0.95 107378 50-100 101968 7645.541kb 0.62889 0.53318 0.46682$ `- ?3 K% W; X# _. _; \; I4 F
单端 12157kb 200-1000 20 600 0.95 213722 50-100 202996 15227.882kb 1.25259 0.28576 0.71424
% M; A8 d8 R9 O6 s双端 12157kb 200-1000 10 600 0.95 106704 50-100 202770 15212.662kb 1.25134 0.28612 0.71388
$ @! |& H8 u5 r+ G0 T双端 12157kb 200-1000 20 600 0.95 214212 50-100 407186 30534.265kb 2.51164 0.08114 0.91886
7 K9 m# e: ~# `: [) L- _8 X! ]四、讨论和结论$ i% m( Y5 J5 B% _4 y4 M0 {
& [/ [: r/ d/ c* b( ^% H5 S5 \+ m
程序运行方法4 L# |; l; w3 W9 ^, M7 \5 f' O
: \9 A! O4 A% a2 t9 n" A! Y- g
在类的构造方法init()中,调整参数。 3 o" [6 F: T3 B) @
Averagefragmentlength为片段平均的长度;
( e& h& k$ P3 xminfragmentlength和maxfragmentlength是保留片段的范围; 0 F& h. k! i+ G* D, D6 G2 x0 B
cloneRetainprobability是克隆的保留率;
) F' E9 ]# O8 Ominreadslength和maxreadslength是测序reads的长度范围
& T8 S7 {# Q( Z! s! N; [$ x0 h- W; s6 i% p
模拟测序的诸多方法都封装成了Sequencing类,只需要创建类,并调用singlereadsequencing()和pairreadsequencing()方法,传入文件名的参数即可。0 ]) z* t2 [+ U F; J, Y" _
9 O+ K1 o$ J4 c4 o
附录
/ @, B, z+ Z' Z3 C5 B; I: g7 H3 ~" w3 z' d# G0 a
from Bio import SeqIO
; f j4 q9 ?) K, K( q7 {( C; R% nfrom math import exp+ H$ v' w! I5 F5 H8 D+ `
import random" M9 \: k8 r' F- h: W
, I9 Z" o" M9 @# r5 G4 S- Hclass Sequencing:5 i; a" P; F5 H; y
# N代表拷贝份数3 l# _% ~, _% b& z1 t$ d0 i% r; H
def __init__(self)" }4 r9 J/ A7 U Y( L" y
self.fragmentList = []" o% w! A0 ]/ X D( j' {% O' F u
self.readsID = 1
/ \% T4 K: w9 g) R self.readsList = []( I' A; p8 z- ?& k
self.averagefragmentlength = 650
* Q- m" M- l( [ f% w! f self.minfragmentlength = 5006 J, B. [( \% E) }" O2 I4 N
self.maxfragmentlength = 800) F* m, a6 @" Z
self.cloneRetainprobability = 1
5 G" T$ A' Y5 N1 o+ G self.minreadslength = 501 N) L/ v! { j7 T* B# `6 b/ X
self.maxreadslength = 1509 F2 K# M& k; _
self.N = 10+ Y1 W3 w5 @* M5 ]
self.genomeLength = 0/ y9 C* b9 w$ v
self.allreadslength = 0
: @/ t' O% v7 w$ f9 Y* z7 t7 t" i1 |( l9 y1 l
# 生成断裂点
. c* n7 p! H0 K$ B def generatebreakpoint(self, seqlen, averageLength):
+ u) m* `. s: e& Q2 p# [; H" \0 j # 假设平均每500bp 产生一个断裂点(averageLength = 500),通过随机函数生成seqlen/500个随机断裂点(1到seqlen之间的随机整数)6 v' r& A7 O! a+ \
breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))]* Z8 F- ]+ L! Y" B
breakpoint.append(seqlen)
1 d' B4 j! F7 {' @) L5 D7 `4 R; Q breakpoint.append(0)2 o' S7 z! z# G- r9 E; |! A4 x2 `2 l
# 把随机断裂点从小到大排序
w5 A- E# |0 g$ K breakpoint.sort(). c, ?1 X1 Y% M9 [
return breakpoint
$ W' x0 H7 \' X) X) I1 ]$ ~9 a: K- V! X, v+ s l! G3 V" y
# 沿断裂点打断基因组,并删除不符合长度要求的序列片段,定义片段范围:200-1000bp9 N+ L- k$ ~* ?. \- A+ d# S
def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength):* o/ U/ w0 M" {- H0 O5 p7 r) F
for i in range(len(breakpoint) - 1):
4 \+ r8 Q# R- S: r fragment = seq[breakpoint:breakpoint[i + 1]]1 b- p M% j+ y8 l R
if maxfragmentlength > len(fragment) > minfragmentlength:
8 C8 Y0 g3 A2 p& `& O self.fragmentList.append(fragment)$ S0 y& J2 [3 L/ x6 f9 ~, `2 x
return self.fragmentList, ~6 E5 B: Q" h$ o
' p* ?7 d0 c! o9 b
# 模拟克隆时的随机丢失情况,random.random()生成0-1的保留概率# O" V: L: B4 N' C+ a
def clonefragment(self, fragmentList, cloneRetainprobability):* I/ ~. I) V! x9 ^* d
clonedfragmentList = []
, \8 R' g! E' L5 f& p' e5 B% t1 \ Lossprobability = [random.random() for _ in range(len(fragmentList))]4 E# q# U) ^: K- f9 c+ ]& E0 B" Z
for i in range(len(fragmentList)):
6 D' z$ f b( \5 D9 L8 ~. Q if Lossprobability <= cloneRetainprobability:
3 |9 |* Y& M9 m. F* ] clonedfragmentList.append(fragmentList)( r6 w0 g- T8 M$ \& I
return clonedfragmentList1 N" K: R1 ?, [9 q, q+ {
& O. Q! P8 Z8 M6 m0 }. L # 模拟单端测序,并修改reads的ID号
3 Z5 j0 m# } z2 E- F. _ def singleread(self, clonedfragmentList):
^0 ~2 P% S: A" V1 f5 V! N) y$ p for fragment in clonedfragmentList:5 k# ~0 y4 b1 t4 N. Y3 b& X G0 M- y
fragment.id = ""# m* q. q- Q2 D& c' @- x
fragment.name = "", K2 R5 E" _2 h. B" [
fragment.description = fragment.description[12:].split(",")[0]
' f Y E0 v2 g3 R8 O4 v fragment.description = str(self.readsID) + "." + fragment.description" @3 Q4 r+ ?$ j& C
self.readsID += 14 G' |+ p+ U4 Z$ ~4 r+ }. n7 M
readslength = random.randint(self.minreadslength, self.maxreadslength)
& m6 D. Z6 b" X self.allreadslength += readslength: \8 }$ w$ a' A' P3 ^
self.readsList.append(fragment[:readslength])8 }! F/ a' x; ^
) l7 i- a3 u) s. P# L* V/ F def singlereadsequencing(self, genomedata, sequencingResult):. B2 T5 |* {; s, N7 F
for seq_record in SeqIO.parse(genomedata, "fasta"):
/ |4 A4 \8 \; S8 k* R2 Y; l9 D seqlen = len(seq_record)0 N. F4 l9 p9 l
self.genomeLength += seqlen1 {* I) ]' d7 v( F4 F u+ c
for i in range(self.N):
/ I f8 y% ~- d. y0 U) q! i, E # 生成断裂点( g+ K* i' ]4 o( u
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
& [, Q3 I) `* s # 沿断裂点打断基因组) K f3 |0 T/ ]% l5 R' A2 Q
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)5 Y5 A) m' p# U$ B/ ?% H
# 模拟克隆时的随机丢失情况5 A& x, `! K( g+ k: \& `3 i
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
! I$ K8 y! E" @: K5 M6 F # 模拟单端测序
& i' s* h: H r# s- T' g% ` self.singleread(clonedfragmentList); b; r' M7 G* b3 W1 M
SeqIO.write(self.readsList, sequencingResult, "fasta")3 Q) s+ M7 x" z7 n
' p$ A& J' A1 b
def pairread(self, clonedfragmentList):' o, f& c6 h' H+ K" o
for fragment in clonedfragmentList:
( _" m, P# A3 s3 I% N) Q fragment.id = ""* o2 g- T3 X1 K* t- o, Y b" y
fragment.name = ""* n; t/ l) s& |/ k$ c
description = fragment.description[12:].split(",")[0]9 V, R. X6 N, ~' a
fragment.description = str(self.readsID) + "." + description
. W$ \8 ~. l2 D5 q7 K readslength = random.randint(self.minreadslength, self.maxreadslength)5 S7 q: ]* R& \0 P& n3 v
self.allreadslength += readslength% t+ Z3 k# J/ p* S) F, t
self.readsList.append(fragment[:readslength])
9 R6 {5 ]2 V/ L+ i: g/ `$ J) ~' q/ y5 Z9 X) \7 o) P" T! k
readslength = random.randint(self.minreadslength, self.maxreadslength)$ ]. r& c+ ]: \3 b) R4 F
self.allreadslength += readslength
" j7 I) i0 V& F0 n! l, r+ n" d; J. y5 H
fragmentcomplement = fragment.reverse_complement()
* Y* Y+ |8 H- s9 l fragmentcomplement.id = ""* ~5 `8 N3 j1 K
fragmentcomplement.name = ""
, u. J6 _& j0 a8 @2 ] n fragmentcomplement.description = str(self.readsID) + "." + description
8 _& F# ^4 X. y) m: T self.readsList.append(fragmentcomplement[:readslength])
* [+ E9 L' e: Y, \3 c& O1 I/ @
5 u2 j# O. K) N7 T+ X8 X2 v self.readsID += 1
- R l# q" V; o3 U8 e
% U; G. t: h6 C. A7 B' B7 B9 ] def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2):! w% s% ^# T" b4 \
for seq_record in SeqIO.parse(genomedata, "fasta"):7 U- w5 i: b5 r4 {8 W% M
seqlen = len(seq_record)2 E, Q, A0 v. d
self.genomeLength += seqlen& P' L' b2 N D
for i in range(self.N):3 t1 R# A6 _6 F4 ^% m
# 生成断裂点( y0 [% k" o8 |$ \
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
$ T% m) J+ }% { # 沿断裂点打断基因组$ C. C% \' `2 P% e, j# A" z
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
5 ?3 L7 ~1 @% A. z # 模拟克隆时的随机丢失情况/ W6 ]3 ^, d6 U7 ^* I0 O( J, i
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
9 A- @; o3 e2 A/ Z; w( A) @1 A # 模拟双端测序
4 O p' [2 l, [/ \ self.pairread(clonedfragmentList), r# i7 \1 S5 d. V
readsList_1 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 0] L" _: S- X; P5 c% P4 p
readsList_2 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 1]$ H+ p! F; W6 ~; W: N. R6 k$ }: f8 D
SeqIO.write(readsList_1, sequencingResult_1, "fasta")
% \5 {5 U# s" f* _7 j Q% Z SeqIO.write(readsList_2, sequencingResult_2, "fasta")
: k6 {- } y( w n. Y" r) M3 `
* A2 B8 k7 f: W6 H# d def resultsummary(self):
% y$ }( p* ?1 r print("基因组长度:" + str(self.genomeLength / 1000) + "kb")+ h* [8 e5 y& A" ^
print("片段长度区间:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength)); v3 E8 `* W# P9 }2 }7 `
print("N值:" + str(self.N))
+ {8 U9 R! n8 f, p8 y) `8 l; z% y print("期望片段长度:" + str(self.averagefragmentlength))2 C% n( l/ o* @7 Y/ \/ G0 ]" @
print("克隆保留率:" + str(self.cloneRetainprobability))6 A% m. N4 I& T8 ]
print("片段数量:" + str(len(self.fragmentList)))
6 ~3 E# Q: c1 h print("reads长度:" + str(self.minreadslength) + "-" + str(self.maxreadslength))9 ]! P' i' L9 O, z* y
print("reads总数量:" + str(len(self.readsList)))
; Y; K' a# H# q9 G print("reads总长度:" + str(self.allreadslength / 1000) + "kb")5 D8 ~0 m- N% z; o; h( U# f
m = self.allreadslength / self.genomeLength
9 ?3 t( e& g* z9 c# I+ T7 G print("覆盖度(m值):" + str(round(m, 5)))1 i9 |( A7 P3 x/ n8 }! E+ Z
print("理论丢失率(e^-m):" + str(round(exp(-m), 5)))) A, Y. e& O3 h$ v2 a. ~
print("覆盖率(1-e^-m):" + str(round(1 - exp(-m), 5)))
4 _! A& c! P8 G+ l6 ^, K. d/ D# -------------------------------------------主程序-------------------------------------------
/ @. a3 Y) M. ?6 e. d# B7 a; @# 模拟单端测序7 A0 K0 l5 i3 q( n: n2 ?
sequencingObj = Sequencing()
* M. l- Q( u; y i4 }9 KsequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")& Y4 g7 E' B# P6 w' @% @( v
sequencingObj.resultsummary()* b# q* X. R4 P: C
9 ` q2 Y( |" u6 d5 D- T# 模拟双端测序
$ O; U9 e* c, F! GsequencingObj = Sequencing()/ Y: k6 m- Q F$ W8 i ^6 k
sequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa")/ i; z( l* a' K( V
sequencingObj.resultsummary()
* F+ F% F: i% r2 y; I( ^from Bio import SeqIO4 e1 l4 y# H) ?, F+ U1 r
from math import exp
: l, |/ C% x4 l0 I5 `import random
0 ]) ~; s! G+ h8 J5 [! [: E; V6 H! m$ B8 ~. U
class Sequencing:% u$ z( D) q+ o5 _" z4 x/ S3 ]
# N代表拷贝份数3 W y& |1 a" S( k/ c" j! p
def __init__(self):
- X+ R! Q; f( p8 l) R" Y5 Z1 Y( | self.fragmentList = []: R+ g; z z% a2 |
self.readsID = 1/ h; H8 Z+ V5 _+ Q$ \
self.readsList = []
1 J4 N g' J- r$ T, i' X) ~ self.averagefragmentlength = 650; X& ?$ B& `6 |8 z4 {
self.minfragmentlength = 500
* ~3 S6 i$ A. R" Q. Z! ]1 x self.maxfragmentlength = 800
9 q9 x# S: F" v# o& M# y, T1 b self.cloneRetainprobability = 1; M8 r2 O: r* j2 M! D) [# R
self.minreadslength = 50+ G b! }0 c0 m5 P
self.maxreadslength = 150
. x O& O V" e self.N = 10) [" S: }. n5 k
self.genomeLength = 0
9 `$ J6 U8 X3 J ]! E! ^; U# e self.allreadslength = 0! B6 l- q w ^% a* _5 w! Q( g
) m4 v. m- {* U2 d R6 b6 H
# 生成断裂点
6 y- f) a. K; G8 w: `0 H9 {3 y def generatebreakpoint(self, seqlen, averageLength):
2 J% H# q3 x) `9 M* A # 假设平均每500bp 产生一个断裂点(averageLength = 500),通过随机函数生成seqlen/500个随机断裂点(1到seqlen之间的随机整数)
) k9 N- y1 q9 x- w breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))]
5 a+ x6 K* V) g breakpoint.append(seqlen)1 V/ m- X1 K7 H8 X! \% l
breakpoint.append(0)
! v1 _3 ]9 {" w6 @" u' U0 x0 Q # 把随机断裂点从小到大排序
6 i' J6 N6 d7 k$ ~ breakpoint.sort()% k+ ^" E, x$ }( i! g& t9 R
return breakpoint
, P+ k1 M1 b+ j+ b6 ?# z$ o% C& X+ K! A" P
# 沿断裂点打断基因组,并删除不符合长度要求的序列片段,定义片段范围:200-1000bp) R5 Q+ o I9 a& h+ t( s( z! V$ \: j8 K. v
def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength):
# J4 u' h, L+ g# L$ I+ ~ for i in range(len(breakpoint) - 1):0 c$ W4 t' }8 x0 w" e, e
fragment = seq[breakpoint:breakpoint[i + 1]]
1 D3 _7 Q- O( Y! D7 }2 z1 P2 s if maxfragmentlength > len(fragment) > minfragmentlength:
* v. j/ F2 e+ a( V: p self.fragmentList.append(fragment)
( ~7 D+ Q7 w$ T return self.fragmentList" q5 P" h: @, y6 F0 T! D# {
) s( o, A- u! P. z2 g1 K
# 模拟克隆时的随机丢失情况,random.random()生成0-1的保留概率7 m6 z& r: P7 K$ o3 l8 b6 \
def clonefragment(self, fragmentList, cloneRetainprobability):
0 W! Y' Z9 K8 D F: k clonedfragmentList = []" y$ h" ~) T+ j+ N: y' J L' ^
Lossprobability = [random.random() for _ in range(len(fragmentList))]
) E; G" Z: R( t; p for i in range(len(fragmentList)):
: R t! G4 L" |0 a! g' {9 x8 K* d if Lossprobability <= cloneRetainprobability:4 E' c. c; ]. _5 @6 t5 `8 L! s
clonedfragmentList.append(fragmentList)
: ^3 A/ z# F# { return clonedfragmentList5 [- }( @1 {, w4 V! T
/ O% b! o% F2 k6 D
# 模拟单端测序,并修改reads的ID号6 @6 f- e- }5 {- m. `' ]% y
def singleread(self, clonedfragmentList):- y. u# n5 e( L9 U! X$ W' y
for fragment in clonedfragmentList:0 I) ^$ F# ]# d1 s3 S
fragment.id = ""0 H) ?- V) F- O& W
fragment.name = ""
+ W( q& i1 v0 T) Z" B fragment.description = fragment.description[12:].split(",")[0]
3 z6 O9 ?# H8 X3 _7 H: P1 x fragment.description = str(self.readsID) + "." + fragment.description+ I0 C" X" d# {( _ G7 |* E6 {
self.readsID += 1# J- |9 `2 E5 ~ Z1 d
readslength = random.randint(self.minreadslength, self.maxreadslength)" ]9 u; H( |7 |8 T
self.allreadslength += readslength
6 A9 F- H! s" Q& `+ }: g9 N, X# r self.readsList.append(fragment[:readslength])
/ m* P$ d X; j6 a: \% l& q) |% F/ }; G* I8 q8 M
def singlereadsequencing(self, genomedata, sequencingResult):
* k3 N7 H/ E% z3 n# g8 | for seq_record in SeqIO.parse(genomedata, "fasta"): m6 M1 t' V) i, E
seqlen = len(seq_record)
2 _: O8 N) @; p8 }1 W+ V: Q+ o2 {7 q0 { self.genomeLength += seqlen
5 f t6 b4 u1 D* F for i in range(self.N):
( e7 o! `( g) h5 [5 ~ # 生成断裂点' R1 s5 g" \; {7 ^5 H
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)3 G3 D- p. t& r, G9 f2 L( x) N
# 沿断裂点打断基因组
8 A5 {' U( [: b) Z! C7 W$ e4 p: c self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
9 d% x; Q+ _) l0 R # 模拟克隆时的随机丢失情况0 e+ z: {; T' H$ R* V$ I% v: J
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
$ r2 [1 R& P! R; y" q3 b2 |" y # 模拟单端测序1 M4 S* i0 e+ X& [
self.singleread(clonedfragmentList)
7 ?, b$ Y1 A% T) r h. o( E' S* a SeqIO.write(self.readsList, sequencingResult, "fasta")
: W" _. s$ g; N9 F7 U6 }6 a) ~
6 q) `! x! u: Z; B$ r def pairread(self, clonedfragmentList):
- L) u) o1 U. }& m for fragment in clonedfragmentList:6 z: R6 M9 D2 p& k$ r( R3 Y; i; I6 Y& z* _
fragment.id = ""
0 m" u. l/ b. n0 j H" v fragment.name = ""2 P+ d* X, b$ g( d0 ]6 j
description = fragment.description[12:].split(",")[0]
% h& b) f9 }* K$ n' }) f& q, a g# r9 Q: ] fragment.description = str(self.readsID) + "." + description/ g1 p4 Z/ f9 O. H- N
readslength = random.randint(self.minreadslength, self.maxreadslength)4 y6 B9 L/ p8 W, e8 Y
self.allreadslength += readslength
! ]: Z' s) A6 i self.readsList.append(fragment[:readslength]): ]& k( ]! u% w7 c
E$ M; {3 n/ @+ Y
readslength = random.randint(self.minreadslength, self.maxreadslength)$ t( a0 {0 \: |6 e8 ]: b* Z6 b
self.allreadslength += readslength% g: h* s( ]5 U) L; ?
) I; U7 O. Y* Z% j' j+ r" i fragmentcomplement = fragment.reverse_complement()
* ^% b8 D7 c- e7 m8 M0 F fragmentcomplement.id = ""+ Q* h9 V. l0 m3 I
fragmentcomplement.name = "") \9 e) A5 g, I) a
fragmentcomplement.description = str(self.readsID) + "." + description5 I( ]: w& e6 { X: w! u: m
self.readsList.append(fragmentcomplement[:readslength])* f3 o; a- j* T& }9 E0 M# M+ F
|# n2 c5 |3 u: {: Q
self.readsID += 1
0 n3 U( W- O3 D7 N- k; F. Q$ R% S8 m5 `! N
def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2):
% @0 \$ l: Q& n. S4 v for seq_record in SeqIO.parse(genomedata, "fasta"):
, r/ T) l( v- y. i8 r seqlen = len(seq_record)
" Q1 {6 e' C; z1 R self.genomeLength += seqlen* ^2 S6 H5 Y* h2 s" T. I& o
for i in range(self.N):
/ q* U2 s K. i$ C- j* u7 B # 生成断裂点' x3 a' D- g2 e! J2 V& S5 r
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)$ r7 W* I J1 J! ~! Z
# 沿断裂点打断基因组' {" T' {5 a) O
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
! h. t5 ?5 W2 p S: O9 Q% k W # 模拟克隆时的随机丢失情况7 ?3 v1 n' K; B4 a, b+ h( {. f
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
& R% d" b/ T/ |6 t( m# T8 E # 模拟双端测序) W' q$ h# P$ o" c0 d% y7 I8 t$ q
self.pairread(clonedfragmentList)! g8 P& }- K! z( j* S$ M$ ^
readsList_1 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 0]
/ V! L. [9 Y! K9 n' I+ m readsList_2 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 1]
' o$ z: M/ H# q$ l# ~) J; `; A SeqIO.write(readsList_1, sequencingResult_1, "fasta")- ?* g6 }2 b* n- o
SeqIO.write(readsList_2, sequencingResult_2, "fasta")* K5 j" S# I1 J( Z7 ~+ P
% f5 i* E, m a [. J% \" n
def resultsummary(self):
2 C9 M) Z5 G3 A; [( R/ f% G print("基因组长度:" + str(self.genomeLength / 1000) + "kb")
; y2 p6 r W. g. ] print("片段长度区间:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength))
) U$ I$ f8 }1 F" p& [ print("N值:" + str(self.N))
, H0 q, A7 g) s print("期望片段长度:" + str(self.averagefragmentlength))
* Z4 x I W; L/ A8 F print("克隆保留率:" + str(self.cloneRetainprobability))
; K$ w- m% l+ F5 e* v! y& _ print("片段数量:" + str(len(self.fragmentList)))
9 f' F; F Y3 a0 c print("reads长度:" + str(self.minreadslength) + "-" + str(self.maxreadslength))
. @# C; `" C! I6 K1 z print("reads总数量:" + str(len(self.readsList)))' ?: r/ t, ]& Q& G+ S j& n
print("reads总长度:" + str(self.allreadslength / 1000) + "kb")5 g. N, P- ^. |
m = self.allreadslength / self.genomeLength
$ v7 s3 [( B0 Z6 ]4 i3 G5 T( N print("覆盖度(m值):" + str(round(m, 5))): ^; c& Q7 M7 \- ^
print("理论丢失率(e^-m):" + str(round(exp(-m), 5)))
9 b+ ?: `) k& C4 Z print("覆盖率(1-e^-m):" + str(round(1 - exp(-m), 5)))! l1 `, g- l3 ^4 a5 j) ?3 U5 u' Z- Y$ N
# -------------------------------------------主程序-------------------------------------------
; W, t% |% ^5 T1 z z# 模拟单端测序
) o) x/ q% k w+ @sequencingObj = Sequencing()
) S# P! _! y- v+ F* {+ |sequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")
1 Q+ q% ? O/ x$ U( EsequencingObj.resultsummary()
# l, Q3 @! ^7 g. ?2 P* D- m: l; u3 B
# 模拟双端测序
1 K7 ~2 J- j: usequencingObj = Sequencing()
/ J: ]* S0 M0 g4 l1 O s% ~sequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa")
$ k# `% [3 D7 Q9 jsequencingObj.resultsummary()
( K7 p5 H7 |! b4 e3 u" @( b. R+ w2 @! Z
$ i' s8 v' ]$ E w) o/ r! A
& F. K* K$ g! k+ f6 ^ w5 u$ K
" ^6 ?; m2 `1 R: T3 H/ e g( n |
zan
|