- 在线时间
- 1630 小时
- 最后登录
- 2024-1-29
- 注册时间
- 2017-5-16
- 听众数
- 82
- 收听数
- 1
- 能力
- 120 分
- 体力
- 564624 点
- 威望
- 12 点
- 阅读权限
- 255
- 积分
- 174610
- 相册
- 1
- 日志
- 0
- 记录
- 0
- 帖子
- 5313
- 主题
- 5273
- 精华
- 3
- 分享
- 0
- 好友
- 163
TA的每日心情 | 开心 2021-8-11 17:59 |
|---|
签到天数: 17 天 [LV.4]偶尔看看III 网络挑战赛参赛者 网络挑战赛参赛者 - 自我介绍
- 本人女,毕业于内蒙古科技大学,担任文职专业,毕业专业英语。
 群组: 2018美赛大象算法课程 群组: 2018美赛护航培训课程 群组: 2019年 数学中国站长建 群组: 2019年数据分析师课程 群组: 2018年大象老师国赛优 |
基因组测序模拟: C( o0 d" L, X2 D8 y% r5 i1 {
基因组测序模拟- Y: i6 b; v5 i- U8 u3 q# o
I. w! A* N- _
一、摘要
% z% x# Z' A2 S# w% a; [3 q4 w3 T9 F8 g
通过熟悉已有的基因组测序模拟和评估程序,加深全基因组鸟枪法测序原理的理解,并且能够编写程序模拟全基因组鸟枪法测序,理解覆盖度、测序深度、拷贝数等概念,设置测序相关参数,生成单端/双端测序结果文件7 _- }, l4 k- j. @; w( [" q$ o
5 c# e" l h. D1 @二、材料和方法
- s, O ?' y3 _, W7 `, p8 G$ o4 b8 k3 {8 i3 |. f1 H( f6 V; }
1、硬件平台
" q4 F0 A; T; Q/ B- [: m" F5 S. ] \9 U
处理器:Intel(R) Core(TM)i7-4710MQ CPU @ 2.50GHz / B8 m* z' y4 m: m6 v4 R& ?
安装内存(RAM):16.0GB
) C2 v- A+ k- u( K$ Y- N; M8 y2 V8 q# E B+ b( ~) }
2、系统平台
( V. x5 r- H0 ~0 N, l/ d( [) b8 BWindows 8.1,Ubuntu; S5 _: w! m4 \) o
' l a7 R9 h2 B' P
3、软件平台
5 O* n# B; }: N! |& x4 G3 g
; F* J7 F# s+ x' ]3 b8 s4 x3 [art_454
$ B0 s8 \+ ?' m/ S' P6 Z: WGenomeABC http://crdd.osdd.net/raghava/genomeabc/
6 m8 E: Q G% F' FPython3.5& X- @/ R7 v6 e g9 L8 V
Biopython
& H9 h# P7 a9 y6 t, u8 w& U4、数据库资源
$ O( }3 N; R q- m' _
' U( E1 x" H" f: y! ~$ }& VNCBI数据库:https://www.ncbi.nlm.nih.gov/
; [- p2 P( e! }
! n4 z' U* h4 l* K7 t( Q5、研究对象3 a( I4 a( g5 p: @& e5 ^* n9 R
5 F( [$ J9 I8 A# j8 j
酵母基因组Saccharomyces cerevisiae S288c (assembly R64) # D7 A' p5 w2 [! v' m
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz" N! Q* D2 k+ a9 V# V: W i
$ P7 l9 `, G& h$ V2 w6、方法
) J/ M3 w: K$ ]; k2 {
& u/ ^. {0 F% i1 A5 o& i) A- tart_454的使用 7 B7 F- S: v' Z
首先至art系列软件的官网,下载软件,在ubuntu系统安装,然后阅读相关参数设置的帮助文档,运行程序。" X& M+ }/ d. Z1 b' W
GenomeABC
+ b+ p$ @, Z( {# O/ r$ m( L进入GenomeABC(http://crdd.osdd.net/raghava/genomeabc/),输入参数,获得模拟测序结果。4 @# L4 q/ v9 R' _; D% _1 g% H
编程模拟测序 1 J- I ?5 a6 I
下载安装python,并且安装biopython扩展模块,编写程序,模拟单端/双端测序。
K" L) i8 U: G5 u三、结果/ H0 [9 b$ [! h+ |2 x# h
3 F' g1 w( E$ H8 G1、art_454的运行结果$ w! t8 e$ e$ N t; R$ g6 ~
9 q$ L/ k0 G& a+ P4 u) r
无参数art_454运行,阅读帮助文档 * q- h" |3 \1 u. ^
. R6 S4 h' O) r1 U) q4 T; C" \0 x图表 1无参数art_454运行 + D- A* U! u7 ^, G4 ?5 P x$ w% E) i
对酵母基因组进行基因组单端测序模拟,FOLD_COVERAGE设为20,即覆盖度为20.
1 E% Z( o) d3 T/ [, h. K下图为模拟单端测序,程序运行过程及结果
0 k! F8 F5 C4 H0 Z$ |+ p3 N, z+ h* a& e9 |
图表 2 art454单端测序 # p. ]' d) K0 Q9 ` {) |0 b
4 x2 F t% _& e( ]6 A" T0 c; m图表 3 art454单端模拟结果
" v ` A' p1 e! G% T! a+ I7 X双端测序模拟,FOLD_COVERAGE设为20,即覆盖度为20;MEAN_FRAG_LEN设为1500,即平均片段长度为1500;STD_DEV设为20,即长度的标准差为20
4 Y. v) E+ ~+ i( @ L下图为模拟双端测序,程序运行过程及结果 & V: P% _2 {; l: x! n: P: R$ u8 T) w
' N' @. F( s' Y9 p8 T
图表 4 art454双端测序 * \' @1 h( X; w3 P- `8 S
, ?' E5 N0 {6 w$ N9 L# r- {5 h
图表 5 art454双端模拟结果
" W' Q6 t: W' A6 g9 L- E/ g; ~2、GenomeABC
5 b9 Y1 I$ e" m" d X下图为设置参数页面
- G- ^# D0 K4 j6 O5 M/ T) L8 S7 ?0 {7 Z, a
下图为结果下载页面 4 W+ y0 V: q' |1 S( r4 c: e& A
6 X0 F- |7 W: Z' I) v n8 }图表 6 结果下载页面 / Y' P* C- }4 A2 f1 t3 d
3、编程模拟测序结果
( u l, D2 R- F: h' A拷贝数是这里的N值;覆盖度是m,测序深度是宏观的量,在这里与覆盖度意思相同,就是测序仪10X,20X。 & a" |. {- h: ^ A. E- k6 i6 b$ o
单端测序
( @3 F- z, u4 d2 I5 w7 J
' L) ?- t, S2 e4 n图表 7 程序模拟单端测序
% g" _" H' L A% ^7 [4 q N- g双端测序
& N$ p9 g& r3 ]/ u- E* f# n$ Z. k$ z8 I. v; |! `: @
图表 8 程序模拟双端测序 2 D, ]! d* Y" e8 U$ t
测序结果 / N* e$ G" z' D7 S6 L9 a
% M4 a c& }, C. k/ l" A图表 9 结果文件* \6 W# U, I5 i% G8 M
" ~# J( j. r2 a6 o, d( j$ p. @
因为期望片段长度是600bp,在片段长度区间200-1000bp内,所以大部分的片段都没有删除。
% T! e! q) ~9 g8 e$ E测序结果统计表& ?8 D4 i$ Z; l( ]% p
1 a- R6 p, W$ x( M5 N( Y- R" J% V
测序方式 基因组大小(bp) 片段长度区间 (bp) N值 期望片段长度 克隆保留率 片段数量 Reads长度范围(bp) Reads总数量 Reads总长度 覆盖度(m值) 理论丢失率(e-m) 覆盖率(1-e-m)
4 M) m7 h& K8 v |单端 12157kb 200-1000 10 600 0.95 107378 50-100 101968 7645.541kb 0.62889 0.53318 0.46682
/ L0 v/ c* d5 N4 t/ x单端 12157kb 200-1000 20 600 0.95 213722 50-100 202996 15227.882kb 1.25259 0.28576 0.71424
- ^, z; m0 o& q4 f双端 12157kb 200-1000 10 600 0.95 106704 50-100 202770 15212.662kb 1.25134 0.28612 0.71388
+ V1 Z" S( B8 K! f, L2 F) t8 x+ U# P双端 12157kb 200-1000 20 600 0.95 214212 50-100 407186 30534.265kb 2.51164 0.08114 0.91886
+ u0 x1 P" g! A5 {" ~6 @0 z3 x四、讨论和结论
0 B* o2 k+ X* d ~' d) k1 r' X+ i' `6 J3 {# p0 c0 q
程序运行方法( B# H4 m& q P/ S7 U
* y6 m6 q* ^8 v
在类的构造方法init()中,调整参数。 - E% D; x! L" H( A0 s
Averagefragmentlength为片段平均的长度; 6 C) |! x4 g$ U; J
minfragmentlength和maxfragmentlength是保留片段的范围; . J4 T1 j- W. \+ G" E W
cloneRetainprobability是克隆的保留率;
4 F7 |; Q# Y x: `7 A+ Zminreadslength和maxreadslength是测序reads的长度范围% n% [& u/ ?! o
% b# m/ p5 K; ?$ r
模拟测序的诸多方法都封装成了Sequencing类,只需要创建类,并调用singlereadsequencing()和pairreadsequencing()方法,传入文件名的参数即可。
0 d8 P' b$ ~( d, F
' I: Y$ q+ b- U3 k9 W附录9 ?8 s$ o4 v/ A! _4 a' |
- I+ |, @$ R" w2 z( Afrom Bio import SeqIO' h( h2 F* W0 q' T
from math import exp
+ ]1 X* U, e {" {7 w) zimport random
! o* e( \4 ~) {- K7 P+ h! w
1 F) U7 n+ U/ c9 ]; e, {) o+ Y; _class Sequencing:4 Y( L& B! `' W% o
# N代表拷贝份数
2 O# \& [$ W$ [: i def __init__(self)
" o3 J2 a# T' h+ v( j+ h self.fragmentList = []
; f. ? a- P" o* W l; R self.readsID = 1
4 M3 r. l; ~ z- C4 p' ? self.readsList = []
# `6 t4 }2 C) B! L% Q. r( \ self.averagefragmentlength = 650
! M/ M1 u5 Z0 y+ g7 v self.minfragmentlength = 500; h/ A$ z! V4 b( O
self.maxfragmentlength = 800# V0 o7 |& n! X
self.cloneRetainprobability = 1
0 W5 z7 F; j3 \" y) C5 z self.minreadslength = 50
" f! q4 v5 C; ]! _$ \! X8 \ self.maxreadslength = 150
_' w% L! B6 K self.N = 10
) e, g. n' k5 r. d: ~1 A self.genomeLength = 0
5 O+ ~9 I! x3 ], } self.allreadslength = 0
: C7 b7 j& k0 L% J3 X! [
6 w* L9 _, G% y5 V+ U. |* n # 生成断裂点
% m# A, D: G L$ f8 U4 U def generatebreakpoint(self, seqlen, averageLength):5 e m" F O4 X1 h
# 假设平均每500bp 产生一个断裂点(averageLength = 500),通过随机函数生成seqlen/500个随机断裂点(1到seqlen之间的随机整数)' N$ T0 e6 b5 G% V. c6 \
breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))]
" o0 A) M& @3 R, T, G breakpoint.append(seqlen)& B5 ]: r1 L4 U+ a8 F
breakpoint.append(0)
[4 x8 Z* B, x0 j # 把随机断裂点从小到大排序5 d4 H% _) L5 W: T# G# I
breakpoint.sort()" `# f* C, z! _! V; F
return breakpoint: g( \1 D% x& B( `8 T, j
2 L8 J; L$ N5 M. j# P
# 沿断裂点打断基因组,并删除不符合长度要求的序列片段,定义片段范围:200-1000bp
% K, {/ u) R% _; U% b def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength):
/ s, o8 {* H" Z, w for i in range(len(breakpoint) - 1):2 y8 N, c5 r% f' |( s
fragment = seq[breakpoint:breakpoint[i + 1]]7 \/ z7 y2 Y- Z/ x
if maxfragmentlength > len(fragment) > minfragmentlength: G, E- J7 C8 H
self.fragmentList.append(fragment)
# o; ?; D k7 Q6 ~& _: N: c return self.fragmentList" X7 w9 ~6 ?6 T# z
2 j- z3 X( `+ s+ B
# 模拟克隆时的随机丢失情况,random.random()生成0-1的保留概率8 B6 I2 w3 v/ K9 B! v0 A( l
def clonefragment(self, fragmentList, cloneRetainprobability):) Y4 s3 R# l7 m% k6 E- ~4 t) ?
clonedfragmentList = []
$ I# ]& `9 C, }; o% Y1 |; w Lossprobability = [random.random() for _ in range(len(fragmentList))]
% q) f7 h5 P1 p1 _ for i in range(len(fragmentList)):* t- c% N7 Q: `- o2 _
if Lossprobability <= cloneRetainprobability:
. P) y% ?# c. T5 a7 m/ d# v clonedfragmentList.append(fragmentList)
' X6 S* ^# Y, K% K3 G return clonedfragmentList% U8 D1 t# K# k# S: c1 a
7 D- M; M& g: w1 K+ C' J9 Y5 ^ # 模拟单端测序,并修改reads的ID号6 A& a9 L( Q; r* H
def singleread(self, clonedfragmentList):2 s1 v) s) Y% {( N& p$ u
for fragment in clonedfragmentList:
! {5 _- ?$ A$ X( q% w& Q3 Q fragment.id = ""
$ y, B% M! K( ^( h) ~0 | fragment.name = ""5 E' S0 n- R* o' O% T, Z! q
fragment.description = fragment.description[12:].split(",")[0]
" B. b0 b8 e7 T# Z fragment.description = str(self.readsID) + "." + fragment.description
7 o9 c5 z# ~0 G1 v( {) | self.readsID += 1
; [0 a9 ]9 p: M7 s+ b readslength = random.randint(self.minreadslength, self.maxreadslength)
{# `0 i1 w) }/ E self.allreadslength += readslength
; v0 j1 f1 Q* z& W self.readsList.append(fragment[:readslength])
! ?% g5 d; ~1 \) e( E; B: A4 Y; R+ P( q- m6 b. u, f0 k( d4 W+ n$ T; X
def singlereadsequencing(self, genomedata, sequencingResult):
9 b3 W+ T1 F* |( m% v [ E5 L for seq_record in SeqIO.parse(genomedata, "fasta"):0 F- a7 g+ h0 d- _; Z( A+ ]) t! P S
seqlen = len(seq_record)
, V" l# u. y5 J; T self.genomeLength += seqlen/ c5 Z: W1 {$ C2 R, a" R
for i in range(self.N):
$ L/ m V8 j/ t4 Y/ L # 生成断裂点( C7 p! \0 ^- |6 m5 A' n2 {$ G( n
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
0 K0 f+ |, W! Z, [% f # 沿断裂点打断基因组
0 P; ~* a" u, r2 } self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
2 h( v- k$ s( F2 K! v" [2 E# g7 K # 模拟克隆时的随机丢失情况9 Z; K' B2 H2 U) H& C+ F# @/ L, V
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability); }, ~$ v: f# l! [
# 模拟单端测序. [: H+ t: n7 z' G ^& q0 v6 i ?
self.singleread(clonedfragmentList)7 `. R- o% b% O7 r( G
SeqIO.write(self.readsList, sequencingResult, "fasta")4 b D- I) c0 T: @! U* ?' ?7 {5 Y
5 V2 M' B' X. v6 _) X. j, m' @2 B* n def pairread(self, clonedfragmentList):( R- `) D5 [; T- D0 @: C
for fragment in clonedfragmentList:+ Z0 m/ z# z6 v& @' y6 l
fragment.id = ""
) ]4 s4 }& c. P* k fragment.name = ""
( k% w5 S1 t5 n' c+ F6 g' M7 O description = fragment.description[12:].split(",")[0]
7 _( c0 y: {) X E0 k. B, @ fragment.description = str(self.readsID) + "." + description8 n6 Q" G6 X9 I" V
readslength = random.randint(self.minreadslength, self.maxreadslength)
# c- R# a" x! C self.allreadslength += readslength# h; t1 ~8 n1 }% k: c9 @) i
self.readsList.append(fragment[:readslength])
! v8 b. | m! A2 q | Z2 L; B4 T. ^* `9 E+ a) f5 O3 R& u L7 `
readslength = random.randint(self.minreadslength, self.maxreadslength)% t2 |/ i" d, E0 e
self.allreadslength += readslength( E' S1 w6 z. H5 M y# q4 }
2 K' t& s, j$ a) l, M+ z; g
fragmentcomplement = fragment.reverse_complement()9 a$ A. @5 C' }" e7 P
fragmentcomplement.id = ""& f: L* L. z' D; I6 W
fragmentcomplement.name = ""
7 q' @0 H8 F5 P* p fragmentcomplement.description = str(self.readsID) + "." + description
* r% M! ]' A2 C$ ?8 J$ J5 } self.readsList.append(fragmentcomplement[:readslength])6 G3 I1 y2 ]" u0 A- v
$ _: [6 _, Q" v' D& C* Q self.readsID += 1: H$ a- b( H. U% V, D& o
9 \) L8 S/ K% W
def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2):: W) u: t/ N% w& c5 e
for seq_record in SeqIO.parse(genomedata, "fasta"):
2 ^/ l' A& ^8 G4 t: {9 \ seqlen = len(seq_record)( ]* {2 F& I L% O
self.genomeLength += seqlen
$ k$ E5 o! |- E$ u for i in range(self.N):
2 \ B; Q( z/ G' c # 生成断裂点4 o: R. v, f+ }& n# r, x# I; F% @
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)4 J0 q% `; R8 G6 W9 O/ k
# 沿断裂点打断基因组7 {& R: m) [2 J* ^/ m! H# S* d
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
) U1 X+ D, M/ U # 模拟克隆时的随机丢失情况
" J+ D5 L1 J9 n4 E/ Q; x/ H5 Z clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)0 g& v0 R/ p) ]; ]8 `6 g
# 模拟双端测序7 o, p0 Y$ X C
self.pairread(clonedfragmentList)
8 [0 | r! w# F readsList_1 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 0]
; O. u0 t; T) b, Z readsList_2 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 1]( O9 e2 Q, x/ G7 |
SeqIO.write(readsList_1, sequencingResult_1, "fasta")
! ^& G/ I2 Z- \) j8 F1 T SeqIO.write(readsList_2, sequencingResult_2, "fasta")
. f4 M, l6 V& S- D' z
5 O' S ^7 T+ ]3 H, B def resultsummary(self):
3 f4 L9 J6 g+ p& d# @' F print("基因组长度:" + str(self.genomeLength / 1000) + "kb")
4 f3 t7 |- ^: l4 O! a print("片段长度区间:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength))% ~9 \- \' O$ u# V/ K
print("N值:" + str(self.N))8 E- _' c* q- a# _5 Q" M& R
print("期望片段长度:" + str(self.averagefragmentlength))! `# n" w* _6 w4 n0 C0 D
print("克隆保留率:" + str(self.cloneRetainprobability)): f, i/ T, Z1 w8 V% {
print("片段数量:" + str(len(self.fragmentList))). P/ F. C4 e ]. n7 X% c" }8 F. \/ d
print("reads长度:" + str(self.minreadslength) + "-" + str(self.maxreadslength))
" X3 [% |% n1 E; q2 a print("reads总数量:" + str(len(self.readsList)))
0 x" J w! s/ h, ?2 O4 h0 O print("reads总长度:" + str(self.allreadslength / 1000) + "kb")9 }2 \. m; r! K: S9 `
m = self.allreadslength / self.genomeLength$ u+ W- {4 S0 j
print("覆盖度(m值):" + str(round(m, 5)))5 r. F+ R Y0 \0 n' }
print("理论丢失率(e^-m):" + str(round(exp(-m), 5)))/ Z C- j4 S5 e' |7 C, p
print("覆盖率(1-e^-m):" + str(round(1 - exp(-m), 5)))9 ~. S7 ^4 {" Y2 v0 i/ Z
# -------------------------------------------主程序-------------------------------------------+ A$ P. A4 x) R9 U8 P# z8 R
# 模拟单端测序
& G& B) g& @0 osequencingObj = Sequencing()
% G) u l) D' h/ ]) \3 `7 S P: l' K. msequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")
4 L I7 ?7 l+ h, P3 l/ t& ?- isequencingObj.resultsummary()
- X% G+ ~ m0 A: T
% i( k( \* @3 F9 J3 H& a/ s# 模拟双端测序+ }5 a. `: D k0 X3 X
sequencingObj = Sequencing()
0 m/ z. z& S8 B: m. usequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa")
: g4 ?: `( P. K/ w; U3 M" OsequencingObj.resultsummary()
! u9 i& J# Q% F6 ^/ m8 mfrom Bio import SeqIO
- Y5 ?4 W& u& a, Sfrom math import exp6 n* p/ w1 k) l
import random
3 H2 _& Q$ |% n/ }3 M/ ^9 |% T5 O, @ J0 h' r, ]
class Sequencing:$ i, H6 W- V& O
# N代表拷贝份数
1 Q3 j% c# r- J. A6 R* i def __init__(self):; R; L% P: }3 h/ h+ c
self.fragmentList = []- _4 w" D3 C7 {; P; L
self.readsID = 14 V- m# ?; V$ ?) P9 {/ B2 d5 v2 n1 s
self.readsList = [] s" L4 c8 B3 D) j& ]& i
self.averagefragmentlength = 650
2 ^( c7 x& `0 G7 R Y self.minfragmentlength = 500$ K/ M8 c# a! n% b; Y$ N4 x S
self.maxfragmentlength = 8009 x; |% a: v1 h" M, t8 T
self.cloneRetainprobability = 1
: C y3 m' Z& o c2 \ self.minreadslength = 50* m: X$ x2 K% i/ p
self.maxreadslength = 150
# }% l0 D5 f% ] |8 ?( x% A self.N = 10
% q) D4 d; M5 o# U# j self.genomeLength = 0% R' r4 \/ b- O4 K" {$ K7 F
self.allreadslength = 09 e9 u' X5 b+ D [2 C
7 X6 E: L/ K. t" X
# 生成断裂点& i0 k4 h& c4 [2 a+ U* T! T8 a, }2 l6 \
def generatebreakpoint(self, seqlen, averageLength):
+ h# {' J, m" W! D$ E% \) p # 假设平均每500bp 产生一个断裂点(averageLength = 500),通过随机函数生成seqlen/500个随机断裂点(1到seqlen之间的随机整数)
. K( Q7 h: | _+ g9 s breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))]" ]$ [6 x5 P% T8 \) G& x
breakpoint.append(seqlen)
5 Z! o# D% a4 B, A1 U breakpoint.append(0), {% [7 W$ e5 x( ?
# 把随机断裂点从小到大排序
7 N, D0 E& d; r* B breakpoint.sort()- ~' i7 n" \* h8 J4 j( S/ U
return breakpoint
% ~! E! M* d0 c8 P. {2 A2 ]4 W9 u$ H
# 沿断裂点打断基因组,并删除不符合长度要求的序列片段,定义片段范围:200-1000bp
b. A( @; p* H' m6 G, ]9 d/ N def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength):
9 O% x+ `& ?1 e for i in range(len(breakpoint) - 1):
; l! e" X& m. b1 | fragment = seq[breakpoint:breakpoint[i + 1]]
1 k% ^3 L& S8 W2 S; J% T1 P if maxfragmentlength > len(fragment) > minfragmentlength:3 ]: ^! k) G+ G$ c8 q" P
self.fragmentList.append(fragment)
0 {! [! U# x' ]; \: ] return self.fragmentList
3 t" O5 d5 x# s/ x" g, n/ G9 w- a8 @
# 模拟克隆时的随机丢失情况,random.random()生成0-1的保留概率9 p! s1 U$ H/ ]. N
def clonefragment(self, fragmentList, cloneRetainprobability):3 F3 \3 `1 m1 N" N* t6 I# v
clonedfragmentList = [], w6 K: n7 \. ~! c+ x- I$ N
Lossprobability = [random.random() for _ in range(len(fragmentList))]
5 c8 K3 h: k; O% i" N for i in range(len(fragmentList)):
/ c. K+ c% W0 ?& | if Lossprobability <= cloneRetainprobability:; Q! r8 L# x0 m5 _4 f* N9 _
clonedfragmentList.append(fragmentList)
' m+ y9 ?* t d/ _9 ] return clonedfragmentList
" a$ u, t% H$ d* s
% p, e0 Y- E4 V: [ # 模拟单端测序,并修改reads的ID号, `! ^5 a4 {7 G
def singleread(self, clonedfragmentList):
' D; o% }$ A v3 J8 Q0 j0 ~ U! l for fragment in clonedfragmentList:
6 N* @& y+ W& J M6 x% d fragment.id = ""
) J/ z- Z6 o, } fragment.name = ""
3 i9 b1 Y+ }* n) |3 p0 g( n fragment.description = fragment.description[12:].split(",")[0]
9 J; G: c" A& l, Z$ z fragment.description = str(self.readsID) + "." + fragment.description
! Q% I1 h# Q: v$ s4 o( B+ K; r( R# ~ self.readsID += 11 s8 A, ^' v8 V/ R& `& Q
readslength = random.randint(self.minreadslength, self.maxreadslength)
$ O- c, u( B) g$ Z, v) Q self.allreadslength += readslength/ t3 Z5 X' i1 f5 r1 S1 J& m
self.readsList.append(fragment[:readslength])7 a- D& X9 A( J2 M+ R
, J- Y2 O) W- n( {
def singlereadsequencing(self, genomedata, sequencingResult):
' E# h6 E: U& r2 c2 i for seq_record in SeqIO.parse(genomedata, "fasta"):
. l/ s K2 e$ e) Z seqlen = len(seq_record)1 A' o+ b* I" Y+ I$ D) r
self.genomeLength += seqlen
2 a2 |; u3 `- H5 y3 k+ ^ for i in range(self.N):6 W% L( I- a5 J" L7 d4 H! D1 k8 m
# 生成断裂点- @) K$ c9 L8 O3 Q& A; q
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)9 L2 G% a' a* {9 y% e
# 沿断裂点打断基因组
8 e$ U9 R, q8 y, W- U self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)" j& G( H" F/ q9 L. d
# 模拟克隆时的随机丢失情况+ Q9 f2 }' ` i$ v6 h. m9 }
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
/ [) i9 _& O0 c5 K4 | # 模拟单端测序
* e, y2 F6 ~* W7 Q6 R* V; g R self.singleread(clonedfragmentList)
& r: t y/ c) ^ S SeqIO.write(self.readsList, sequencingResult, "fasta")% E0 X6 L7 ]7 D) o! `2 z# v
, o/ C9 h/ W5 s, ~
def pairread(self, clonedfragmentList):
& x. D' o. h, C+ m$ I7 g$ o for fragment in clonedfragmentList:2 [, J4 H( ?0 W3 i
fragment.id = ""
( C7 \, G* N8 g% n, N fragment.name = ""9 f o- t1 `- E8 N) X
description = fragment.description[12:].split(",")[0]6 y/ @4 T- `! z3 }7 c9 B/ |
fragment.description = str(self.readsID) + "." + description* J ]( [3 s- y+ S5 H4 y- g
readslength = random.randint(self.minreadslength, self.maxreadslength)1 U/ X' y# u' s7 g* d7 w/ {
self.allreadslength += readslength3 y* f; q9 N/ u) j
self.readsList.append(fragment[:readslength])
" Z$ U' o0 W" q! d! d. ]9 k3 V) n7 z+ Z) b7 o
readslength = random.randint(self.minreadslength, self.maxreadslength)
* w4 W5 K$ A8 z% l+ i9 V2 n self.allreadslength += readslength; f( E" P$ a6 q
, f( C& q& \) z: [4 W& d fragmentcomplement = fragment.reverse_complement()
1 u7 S# I, ]7 X* k9 M* [- i7 Y$ J fragmentcomplement.id = ""+ _+ @$ G* a! O4 h3 U
fragmentcomplement.name = "": D/ f A+ w% p& O4 E, a R
fragmentcomplement.description = str(self.readsID) + "." + description
( q) n, o$ c$ H8 o self.readsList.append(fragmentcomplement[:readslength]). c+ C7 E5 N6 p2 M0 s/ _
9 p7 }) F! a2 e7 `
self.readsID += 1
6 _) q! d9 f, K- Z
: e" t3 k- a, N9 Z( l def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2): Z" n8 }: A5 |' S7 r
for seq_record in SeqIO.parse(genomedata, "fasta"):
@( a( h {; S* v seqlen = len(seq_record)
7 T4 q% p% {) b/ R0 p, e# H self.genomeLength += seqlen' ]0 F3 d8 H O1 u+ B' c# m9 E' W
for i in range(self.N):. y$ {0 ]' v& Q# V6 V5 o
# 生成断裂点5 a) e) b0 E! m5 Q7 d
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength), L0 c7 k* {4 Y! {3 c& e
# 沿断裂点打断基因组2 ]/ }; d/ h# ]
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength); z( b }. Q. u; d7 B
# 模拟克隆时的随机丢失情况
7 p4 s$ o# }8 M/ i clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
) D: t& ^9 S# j # 模拟双端测序
6 d$ @ @2 h" [ self.pairread(clonedfragmentList)
, B0 K t! s: b, `; E readsList_1 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 0]
8 T" z& f0 L- ^2 ] readsList_2 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 1]; v/ E( d0 b% i: [
SeqIO.write(readsList_1, sequencingResult_1, "fasta")' y. ^# R" V" S2 r
SeqIO.write(readsList_2, sequencingResult_2, "fasta")9 M: ^5 ^% X+ E* C h5 v; Y
8 n5 X8 p: T+ a+ a4 [! O def resultsummary(self):
P0 }) @3 b7 B) `, M print("基因组长度:" + str(self.genomeLength / 1000) + "kb")8 S4 Y2 u& n0 H% i
print("片段长度区间:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength))
9 d. J, s* J$ Y2 _2 e print("N值:" + str(self.N))4 s* h! Z! k9 t/ Z
print("期望片段长度:" + str(self.averagefragmentlength))
* S! ~. y5 `/ {& p print("克隆保留率:" + str(self.cloneRetainprobability))1 s4 J4 u, q! u F! B# u5 H7 Z( p- _
print("片段数量:" + str(len(self.fragmentList)))/ Z+ b! o5 G8 ~
print("reads长度:" + str(self.minreadslength) + "-" + str(self.maxreadslength))8 i! C9 R; D& B4 z
print("reads总数量:" + str(len(self.readsList)))! V- L/ k1 s: V4 b8 k
print("reads总长度:" + str(self.allreadslength / 1000) + "kb")
/ L6 @' m) f y" J m = self.allreadslength / self.genomeLength
* S3 h. {3 Q; e2 f print("覆盖度(m值):" + str(round(m, 5))). a9 p+ g0 m2 m# w
print("理论丢失率(e^-m):" + str(round(exp(-m), 5)))
& ?& j* i% }# x# ^' J print("覆盖率(1-e^-m):" + str(round(1 - exp(-m), 5)))- n2 `' J" b" s, G0 j
# -------------------------------------------主程序-------------------------------------------; @2 Z3 t; P2 k
# 模拟单端测序
$ k" t8 `# _% ssequencingObj = Sequencing(). M; }9 d+ ] W( o( x
sequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")% U4 D9 C5 a0 A4 c. c5 g" I% O& x
sequencingObj.resultsummary()
* a3 U0 |. N$ I, ]' m" A: c& s% V( C# M; Y% R9 @
# 模拟双端测序
/ {7 U: X& r4 ?# j$ S( lsequencingObj = Sequencing()
% n% y2 o4 E' O4 t, vsequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa")
# j9 f/ |, c" ]7 l) tsequencingObj.resultsummary(), X' B2 D O E+ o3 H. \
1 I7 C* {. l: n$ [2 H2 N- T. Q
3 n- E5 S) ~$ u' b
. B; j2 h8 ]: v5 i% V& i% g 3 V' }5 \& j4 M
|
zan
|