- 在线时间
- 1630 小时
- 最后登录
- 2024-1-29
- 注册时间
- 2017-5-16
- 听众数
- 82
- 收听数
- 1
- 能力
- 120 分
- 体力
- 563415 点
- 威望
- 12 点
- 阅读权限
- 255
- 积分
- 174247
- 相册
- 1
- 日志
- 0
- 记录
- 0
- 帖子
- 5313
- 主题
- 5273
- 精华
- 3
- 分享
- 0
- 好友
- 163
TA的每日心情 | 开心 2021-8-11 17:59 |
|---|
签到天数: 17 天 [LV.4]偶尔看看III 网络挑战赛参赛者 网络挑战赛参赛者 - 自我介绍
- 本人女,毕业于内蒙古科技大学,担任文职专业,毕业专业英语。
 群组: 2018美赛大象算法课程 群组: 2018美赛护航培训课程 群组: 2019年 数学中国站长建 群组: 2019年数据分析师课程 群组: 2018年大象老师国赛优 |
基因组测序模拟
5 P+ [5 W* h! s/ o1 L# o' j# C基因组测序模拟3 j3 v" A% X1 C" w* x) ]! q
* c; \ [- |- y, U/ d3 X
一、摘要
1 O9 P+ F" g; H! ]# T$ f/ C; h( {
( @$ W2 p3 g. X- F; l4 ?5 ~9 s通过熟悉已有的基因组测序模拟和评估程序,加深全基因组鸟枪法测序原理的理解,并且能够编写程序模拟全基因组鸟枪法测序,理解覆盖度、测序深度、拷贝数等概念,设置测序相关参数,生成单端/双端测序结果文件
- o1 O; [ t, _4 c/ f( ^& z0 T/ l! @; Z
二、材料和方法6 u9 a! m7 z3 i, y/ j* n" }. @& R
' D1 S3 y$ C( c1、硬件平台9 \- M7 l0 u7 u2 a
- ` e5 \! o& J2 o% n, i
处理器:Intel(R) Core(TM)i7-4710MQ CPU @ 2.50GHz
; ~" I$ @/ x, d' r. O7 o) |安装内存(RAM):16.0GB+ K$ r8 x u" s0 M; x: E/ M
% N2 ~" ` w( V1 {) U# o; q
2、系统平台
; F# N) o3 n( v4 U4 k( XWindows 8.1,Ubuntu
2 K% t0 z, v' N' G# x
/ P4 U0 }- L* U0 k- ~3、软件平台8 ]5 h3 I' C% C+ s$ U2 ]2 h
6 z% e- e: k% }. [% A: uart_4541 w1 q% K, A& n9 v9 O0 j
GenomeABC http://crdd.osdd.net/raghava/genomeabc/3 N+ P' t2 E( S' B3 a
Python3.5
& p. c) U- `0 V6 @3 I6 t4 nBiopython% b: A5 S/ @- S
4、数据库资源' Y& O; p& W( F0 \- u: f0 \& n
: f! J: ]- S1 e. n' A, L5 V* qNCBI数据库:https://www.ncbi.nlm.nih.gov/
2 `. b- [* T# B: Z4 o
# T9 \6 u! _$ B4 _6 {5、研究对象
2 v# f; C6 B0 V% s0 T, ]% k6 ]& i; X' e7 s
酵母基因组Saccharomyces cerevisiae S288c (assembly R64) & ]* W& F* N$ x6 ~* s
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz
: p/ P# Y4 u- H) \7 r+ X9 @- z+ {: I# p b
6、方法
" G% }: }9 F0 L- x2 ?( b2 H" s% n3 b" O4 j% I, P4 C
art_454的使用 5 M% k. }3 r/ m- F* `
首先至art系列软件的官网,下载软件,在ubuntu系统安装,然后阅读相关参数设置的帮助文档,运行程序。' E# [5 L3 `+ ^2 u( d; _" G
GenomeABC # ^. Z* [3 e$ {) w G/ `! n, z) n5 |
进入GenomeABC(http://crdd.osdd.net/raghava/genomeabc/),输入参数,获得模拟测序结果。
: A3 ^/ p3 ~/ }( W/ c8 T/ g5 F! r$ T编程模拟测序 - l" C9 Q( [/ o$ ~% y3 ^8 j0 n( k) \. t
下载安装python,并且安装biopython扩展模块,编写程序,模拟单端/双端测序。
% T# O- m: ?$ b3 {7 Q' k三、结果) y9 l" i! r" s# d
0 j- r5 Q1 _( a0 w
1、art_454的运行结果: K$ J5 L6 {1 _8 b4 T
; y7 t4 n' `8 g; m3 l
无参数art_454运行,阅读帮助文档 2 U6 [& \9 U) A9 e2 v- B0 S( x
: M) @( M6 |& J4 F' b
图表 1无参数art_454运行 4 r) _5 J+ e; G
对酵母基因组进行基因组单端测序模拟,FOLD_COVERAGE设为20,即覆盖度为20. : ]% \5 w& J! W- L2 C( z2 M/ [1 I& r! L! i
下图为模拟单端测序,程序运行过程及结果 - G% E9 m0 J4 {2 n6 t& k
/ H2 P# D. v( e4 }
图表 2 art454单端测序
5 N3 h" D3 @8 s6 o9 A5 |+ A( v5 E5 f
图表 3 art454单端模拟结果 ) N- I; {# u: W' j3 c
双端测序模拟,FOLD_COVERAGE设为20,即覆盖度为20;MEAN_FRAG_LEN设为1500,即平均片段长度为1500;STD_DEV设为20,即长度的标准差为20 / v) m- u0 W$ S* U- {+ ^
下图为模拟双端测序,程序运行过程及结果
2 @, E' B6 R- R# c$ L& D
# A/ D) y4 g. ^" Q图表 4 art454双端测序 7 i- ?+ f. Y( I! m5 i
. t% W, v# n3 S8 E2 [0 \
图表 5 art454双端模拟结果
8 R8 J- j2 r7 j0 `; K4 u2、GenomeABC
; Q1 j7 ]6 S, D& o. d下图为设置参数页面 : E6 q8 J( i; ^, M" Z; ^
! |! X0 _1 A0 Q$ l. t6 @1 c
下图为结果下载页面
7 \5 O/ y9 Z; G/ W) I/ E- h7 {; C8 L0 d
图表 6 结果下载页面 - h y: V# B. X0 @6 h, o( D
3、编程模拟测序结果 * `0 _, x2 x& P5 o& U
拷贝数是这里的N值;覆盖度是m,测序深度是宏观的量,在这里与覆盖度意思相同,就是测序仪10X,20X。 ; M0 Q6 ], \6 y/ J& W% O
单端测序
" J7 D$ @% x% d% T2 \& h1 t* {3 k$ j, O
图表 7 程序模拟单端测序
% n' \. \9 _% V# U2 U双端测序 0 P6 e- @" y8 c. ?0 e6 l1 x
1 ]# B I2 O! } F$ ~图表 8 程序模拟双端测序 3 S, V P( E0 l" l
测序结果 ; j/ Z1 i1 h. J3 I4 b4 E
4 g$ h6 \' ^' H; f* x
图表 9 结果文件
" u: E) {/ N+ I A# Z; {: \, ?0 ? W/ O9 B0 H5 ^+ ?9 w+ r1 R6 K% K: _; d( _
因为期望片段长度是600bp,在片段长度区间200-1000bp内,所以大部分的片段都没有删除。
% U2 V# d; {) x$ V0 n, k1 i测序结果统计表
1 W/ R# p8 n2 e2 t& T# p+ m+ Z" E
$ S! i- a* g r c测序方式 基因组大小(bp) 片段长度区间 (bp) N值 期望片段长度 克隆保留率 片段数量 Reads长度范围(bp) Reads总数量 Reads总长度 覆盖度(m值) 理论丢失率(e-m) 覆盖率(1-e-m)
5 M5 ]# U( n* F: }; c6 }单端 12157kb 200-1000 10 600 0.95 107378 50-100 101968 7645.541kb 0.62889 0.53318 0.46682
+ q0 T$ t! }7 d; ]+ I; b& z单端 12157kb 200-1000 20 600 0.95 213722 50-100 202996 15227.882kb 1.25259 0.28576 0.71424
* j9 U9 o# I: A! E" Q9 y/ L双端 12157kb 200-1000 10 600 0.95 106704 50-100 202770 15212.662kb 1.25134 0.28612 0.71388
, z) k: T. m# l: V3 \$ F0 E% I双端 12157kb 200-1000 20 600 0.95 214212 50-100 407186 30534.265kb 2.51164 0.08114 0.918869 T$ Y7 [0 U/ E+ z
四、讨论和结论6 s4 X. J# n3 j/ o0 R; r
2 Y5 v/ A1 K' W* K
程序运行方法: a1 x- C) a8 P( H8 c0 M# _
5 }/ _! T7 U0 O; d( I& O |在类的构造方法init()中,调整参数。
& G& J) I% F: eAveragefragmentlength为片段平均的长度; * K- q* k, p1 X8 k' w
minfragmentlength和maxfragmentlength是保留片段的范围;
5 b1 b1 R/ k! b/ qcloneRetainprobability是克隆的保留率; 1 m* E9 q6 U$ w% a3 L( {4 l
minreadslength和maxreadslength是测序reads的长度范围
* X8 P A+ W' D) C ~- @: Y
% c/ {, D$ o( Y模拟测序的诸多方法都封装成了Sequencing类,只需要创建类,并调用singlereadsequencing()和pairreadsequencing()方法,传入文件名的参数即可。 B/ e7 O& L" L0 `5 _* o" w
: A' t6 l0 d0 b$ x E
附录6 G: [% j; s. U# g' \( A' ], Q( r
, V2 y5 y- T2 N3 w1 a: }from Bio import SeqIO
9 X+ {5 ^$ c' L- w2 J8 |2 Jfrom math import exp4 C. s3 |: m# T5 R7 V& e2 H! r$ y8 {
import random" W! k F8 }5 u0 u( Y
9 t2 H! n/ R( j6 u* i. `
class Sequencing:: |- L: q6 i& l) t: s9 l
# N代表拷贝份数
9 W1 e- K, L* ~7 C def __init__(self)/ o& i7 s) J5 }" x$ R& q3 e
self.fragmentList = []) @- j: K% ]' H: `, h
self.readsID = 1; Q) U0 E) E* y+ B
self.readsList = []. f/ K% u6 }+ U8 D
self.averagefragmentlength = 650$ L* }- R+ x1 h1 ]" O7 C0 s/ \8 j
self.minfragmentlength = 500
8 Z# O* g6 ~, k self.maxfragmentlength = 800& }, a+ Y5 {- n+ A
self.cloneRetainprobability = 1
2 R) `' G0 y7 J2 I$ p P self.minreadslength = 50
7 i: e6 H1 `# X3 u y self.maxreadslength = 150 B; T/ ]* L' t$ m7 A% z3 t+ N
self.N = 10
& S4 T% Z) b, a+ n% X self.genomeLength = 0
4 F7 D4 J/ C7 {0 K' I; r" M5 C self.allreadslength = 0
0 L/ p/ q5 V) a
& s) b! H1 P _# P # 生成断裂点
+ |. F1 \8 {! E( L def generatebreakpoint(self, seqlen, averageLength):
* G- D2 Q' J" c' y4 ]2 b. H$ Y, H* C # 假设平均每500bp 产生一个断裂点(averageLength = 500),通过随机函数生成seqlen/500个随机断裂点(1到seqlen之间的随机整数)$ }4 V; y$ k# R: p
breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))]$ D: m- l7 R" a
breakpoint.append(seqlen)7 y0 E* s/ y M3 D$ _# |# ?1 _+ a
breakpoint.append(0)3 W2 U2 `) I+ ~) f
# 把随机断裂点从小到大排序# P. V. i( |; o: R7 Y# T( v8 ]4 Q
breakpoint.sort()
* M) m7 Q9 s' A+ c/ D- u return breakpoint
2 @7 a S/ \- s& i, g. n7 m- A
# j7 }- A5 y2 o% R! O7 [# }+ i # 沿断裂点打断基因组,并删除不符合长度要求的序列片段,定义片段范围:200-1000bp; U0 \* F4 u- n& C$ q7 R
def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength):$ D& M0 v# o8 x! u
for i in range(len(breakpoint) - 1):$ P! t. a" Y2 o
fragment = seq[breakpoint:breakpoint[i + 1]]
g7 v% |' n6 L% L6 B if maxfragmentlength > len(fragment) > minfragmentlength:
& @. a' x! J* x$ F1 k w( { self.fragmentList.append(fragment)
3 a( @" Z& Z8 O2 T' ]. @7 p9 L4 L! e return self.fragmentList
3 A' F: Q( {$ b: x9 m: \% z
/ Y# T# [3 ]& `- @7 [: a # 模拟克隆时的随机丢失情况,random.random()生成0-1的保留概率 N& w7 i. m6 ]6 k5 D
def clonefragment(self, fragmentList, cloneRetainprobability):% R. ]' p% d3 H. K
clonedfragmentList = []
; Y! P a$ E) K# U8 f; ~ Lossprobability = [random.random() for _ in range(len(fragmentList))]
7 K; |0 e$ T5 ]7 [# N G# v$ h for i in range(len(fragmentList)):
2 D9 o7 l" G8 S if Lossprobability <= cloneRetainprobability:: {3 U1 Z V2 `0 h1 i" g- z
clonedfragmentList.append(fragmentList)
) O( o6 J* b, Y return clonedfragmentList
6 m+ x& _! p0 S4 s E: F' K& A; _; Z/ E1 Q, L6 R
# 模拟单端测序,并修改reads的ID号5 [# S1 R! O, K6 `8 Q7 u4 A
def singleread(self, clonedfragmentList):! p: H1 J+ z4 o- C# t) m! y
for fragment in clonedfragmentList:# o/ m. G5 `( g
fragment.id = ""- l3 o/ V2 g1 f" Z: r! |8 ^7 }1 W
fragment.name = "". Q& S: o" t2 x
fragment.description = fragment.description[12:].split(",")[0]7 {7 N' J9 p2 M9 |( H: J
fragment.description = str(self.readsID) + "." + fragment.description
% N1 X: m* J- D+ Q self.readsID += 1) m( X8 E; ~" O7 v* T: _. V
readslength = random.randint(self.minreadslength, self.maxreadslength)
) x! j$ g1 Y# b) y( b8 n self.allreadslength += readslength
/ p3 E! w& W4 V# N# c( J self.readsList.append(fragment[:readslength])
( r- X, n/ T" N/ S1 _ a' U* Z9 Q- R2 }4 r- W/ o D: V& ?
def singlereadsequencing(self, genomedata, sequencingResult):' R. p1 b7 u- u% M2 P& @# _9 z5 }
for seq_record in SeqIO.parse(genomedata, "fasta"):
/ W: U9 s- {5 }( R& L& ]( { seqlen = len(seq_record)
+ Z3 q/ U" K: H4 T self.genomeLength += seqlen! T! s3 m( c- m1 d3 |* ?; E
for i in range(self.N):1 o: Y$ D$ S+ e( O+ [1 l
# 生成断裂点
5 X7 h- L5 X: J8 c& m8 s% e! ^7 Y breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)3 |$ v% K# U! c* w: ^8 A
# 沿断裂点打断基因组, b% f" {; I4 w
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
3 F# A1 _1 r6 y! }6 b Y # 模拟克隆时的随机丢失情况
' e. m1 D' U8 w clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
' C: ^. B) a5 W8 ? # 模拟单端测序
% N0 C: j& l! u( t self.singleread(clonedfragmentList)8 q. h! r5 N# @$ z2 D
SeqIO.write(self.readsList, sequencingResult, "fasta"). K3 f" }% d( L% r; E+ u
% D# {3 J( ]& ^! J8 N
def pairread(self, clonedfragmentList):/ `2 E4 Y# d6 ], d
for fragment in clonedfragmentList:
B0 |4 i, `# C' X7 v! F, k fragment.id = ""
v, V6 w5 b' i/ @ fragment.name = ""
5 ?$ @. r8 S1 F description = fragment.description[12:].split(",")[0]. [6 y. `3 e9 C/ e! V
fragment.description = str(self.readsID) + "." + description
( T" r- C3 C/ n8 j, c/ r# t& v8 ]- V readslength = random.randint(self.minreadslength, self.maxreadslength)3 ?% i. d2 }, [1 u; _1 X+ u
self.allreadslength += readslength, w( @. I" `6 h8 s5 P0 a% {
self.readsList.append(fragment[:readslength])
+ V! S+ @' h4 O. r. A
! \( y1 p( S( q- N* r readslength = random.randint(self.minreadslength, self.maxreadslength): n7 W3 m7 Q8 I. v0 _' C* j/ c3 w; l
self.allreadslength += readslength
7 ~ E( E( t' L) a6 i {( {
/ p3 I6 I* h8 P+ A) \ fragmentcomplement = fragment.reverse_complement()9 T$ K( h2 x4 B5 A) f
fragmentcomplement.id = ""
# a8 g% k6 \ x8 m$ \' E: q fragmentcomplement.name = "" [/ V+ A0 G8 [% A8 c1 m
fragmentcomplement.description = str(self.readsID) + "." + description2 J6 b7 D L' H5 z
self.readsList.append(fragmentcomplement[:readslength]), {) I* U& C! {! |$ \' }+ N
- }+ h: H2 @& d; b5 C P
self.readsID += 1: s$ z- v# k- U
$ W: c# R3 v+ Y" f+ y def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2):& _# E; J6 T }8 _) K ^
for seq_record in SeqIO.parse(genomedata, "fasta"):5 n& o n2 F: L) M' {+ {( \
seqlen = len(seq_record)
, p; X/ y/ W6 r# S* h4 t% k! r self.genomeLength += seqlen' m5 B* Q- z5 ]/ O6 n
for i in range(self.N):
5 p- ~5 w" c+ b$ a) P # 生成断裂点, b& r: A3 r( C, f2 j( H
breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
, i* F3 i0 j% j5 k& M. z3 w # 沿断裂点打断基因组; s; K. H0 `9 ]% H2 r! B4 u
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)
1 l' I& Z6 s* X+ c9 T1 I # 模拟克隆时的随机丢失情况% S% _5 ]! V5 D7 J& y. M- q- x
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)7 P' M9 X3 u% B1 L
# 模拟双端测序. B1 Y, R- J5 g" M6 V# U
self.pairread(clonedfragmentList)
/ C8 Y" U6 L8 C) P1 v' R& ?# \ readsList_1 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 0]
s: ?( j, G$ l% W readsList_2 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 1], J+ ?0 X+ I, x* k6 \1 j4 Y
SeqIO.write(readsList_1, sequencingResult_1, "fasta")6 n6 Z' k3 P3 C# S
SeqIO.write(readsList_2, sequencingResult_2, "fasta")
2 J/ z, f6 p2 F6 E4 X/ H' F$ e8 _& n2 _' u
def resultsummary(self):
8 U" q- K0 @4 Y$ N4 T print("基因组长度:" + str(self.genomeLength / 1000) + "kb")/ d& |" \! J- c) {: S
print("片段长度区间:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength))5 p( p! `# ~4 |5 C2 w2 A& i
print("N值:" + str(self.N))
1 k0 {3 o, W3 c% k; i print("期望片段长度:" + str(self.averagefragmentlength))
6 K9 n4 F# O8 w5 d* f5 Z* t6 _ print("克隆保留率:" + str(self.cloneRetainprobability))
$ H' |5 e l, k* r print("片段数量:" + str(len(self.fragmentList)))/ z) W. `, z6 k) l" l
print("reads长度:" + str(self.minreadslength) + "-" + str(self.maxreadslength))3 ^# ]9 z6 q7 V9 d7 F7 F5 ?1 R- ?
print("reads总数量:" + str(len(self.readsList)))* L2 N8 r2 w* k7 {% o
print("reads总长度:" + str(self.allreadslength / 1000) + "kb")3 {4 {) `: f- m
m = self.allreadslength / self.genomeLength) Q0 Y3 K3 W- G8 T' J0 r
print("覆盖度(m值):" + str(round(m, 5))): F6 m, U I5 m% b8 q& ?7 k D- j
print("理论丢失率(e^-m):" + str(round(exp(-m), 5)))
L# E! N/ l- `8 b d9 W2 m print("覆盖率(1-e^-m):" + str(round(1 - exp(-m), 5)))+ ?/ [; Y% P4 y. h- ] @
# -------------------------------------------主程序-------------------------------------------/ ]! ^0 Z' \1 W2 V% u
# 模拟单端测序
3 L" R' z( D, l, Y2 s( n8 u9 TsequencingObj = Sequencing()/ U0 k5 g% M6 Y9 B
sequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")# @# [, A6 h9 Z+ z; [
sequencingObj.resultsummary()0 S+ n a& H( i9 `/ a0 ?; R) ]
" S5 M3 {! b9 b8 y
# 模拟双端测序
4 s5 r: h+ V, F' a1 ]) RsequencingObj = Sequencing()
) B5 W* d% u6 e1 k, h) e# esequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa")
4 G* ?2 {2 {" isequencingObj.resultsummary()
/ c* J/ K- i! z$ J6 ]# Pfrom Bio import SeqIO* x$ l7 E* F/ {& g1 s, W
from math import exp
. J: |: s- b0 S0 u2 r9 timport random" b! t. r& X$ B/ @9 |+ v' o; _
2 e6 z0 U$ u' F- e6 f$ @) \- zclass Sequencing:0 _! u3 J7 V% X! u5 U5 q5 Q
# N代表拷贝份数
- \$ n J1 X" L+ I* }: I0 E def __init__(self):& F0 k/ @3 W! q$ s6 T' @7 Y
self.fragmentList = []& z) ?5 d0 P/ U+ j
self.readsID = 1
( {. K% r( h0 M; S- Z self.readsList = []
) _; q B6 ~- g3 P9 Y self.averagefragmentlength = 650: }) H5 f1 E9 f* m8 g
self.minfragmentlength = 500$ k& i8 p2 W: |' F, D& u+ K
self.maxfragmentlength = 800
+ C" f/ ]: ]+ O9 K$ c0 A self.cloneRetainprobability = 1
$ z* B; @3 P- P! k2 r" H+ z/ A* I self.minreadslength = 50
! [+ ~: R! t" i. U2 S self.maxreadslength = 1504 y; j( a3 s9 a M) b) D- A
self.N = 10
- s f3 b8 E" R7 \- w+ p( q8 z$ Q1 H self.genomeLength = 0: o! j( z a, p# x% i2 s
self.allreadslength = 0
4 O0 x6 v9 V( C- s! {* _" o* K/ c$ P% s3 Q0 T8 M
# 生成断裂点3 V6 u' @( m$ ~0 [
def generatebreakpoint(self, seqlen, averageLength):
/ h* u( K' A4 M # 假设平均每500bp 产生一个断裂点(averageLength = 500),通过随机函数生成seqlen/500个随机断裂点(1到seqlen之间的随机整数)- @5 t: W3 g, \$ A, H; l7 a# L
breakpoint = [random.randint(0, seqlen) for _ in range(int(seqlen / averageLength))]0 j: Y% I! ~! Q6 w
breakpoint.append(seqlen)' J' p# E. c! K& p
breakpoint.append(0)
" f. Q" V" o7 ]; q& R, C # 把随机断裂点从小到大排序5 u# ]5 I/ L5 [8 S
breakpoint.sort()
8 [: e) v( b) W5 X( s: X return breakpoint
) L! B2 q$ Y) X
9 S; f+ m! l' e& \0 F+ y # 沿断裂点打断基因组,并删除不符合长度要求的序列片段,定义片段范围:200-1000bp' u1 M$ N- [0 }( Q- |' ?, T, ]
def breakgenome(self, seq, breakpoint, minfragmentlength, maxfragmentlength):
9 ^* v* G3 r a3 X for i in range(len(breakpoint) - 1):
/ j. _* {' r6 Z& G4 S; ^" M4 {# Y fragment = seq[breakpoint:breakpoint[i + 1]]
1 ? O& [4 ]( t7 } if maxfragmentlength > len(fragment) > minfragmentlength:. v8 M* @: A8 a2 `4 W J) R" |
self.fragmentList.append(fragment)+ A8 ?% d# J2 _
return self.fragmentList
( C. d: m1 i0 L" y9 z8 k% W6 z M9 p
# 模拟克隆时的随机丢失情况,random.random()生成0-1的保留概率
9 _0 l9 d% T3 P( u- ? def clonefragment(self, fragmentList, cloneRetainprobability):
: I( {' I! C+ E2 b6 u: C clonedfragmentList = [], {! b* X8 R2 ]9 e% c
Lossprobability = [random.random() for _ in range(len(fragmentList))], G3 o, T3 e+ b
for i in range(len(fragmentList)):
0 x# H$ S, t* `2 ~/ u& e if Lossprobability <= cloneRetainprobability:
( V m8 @- n" l1 f. N' r clonedfragmentList.append(fragmentList)
8 Q) d2 ^2 h* \ return clonedfragmentList1 O* Q5 D9 H7 Z) G* m `
2 f( }% b' \8 k8 y
# 模拟单端测序,并修改reads的ID号6 ~2 j% h1 C4 e* {( H) e
def singleread(self, clonedfragmentList):
4 P; M5 H d2 X o# S for fragment in clonedfragmentList:! b( y. A, u2 Q B
fragment.id = "" o: d9 ^& i5 H; }! p8 \
fragment.name = ""
9 R& y* w, M* q$ Y4 w! n fragment.description = fragment.description[12:].split(",")[0] Q. e) [' b9 H! m9 E2 \
fragment.description = str(self.readsID) + "." + fragment.description
' Q: d; k8 I0 K1 ~7 C) L/ a+ } self.readsID += 1
, t" H6 q! S# v& r7 F! B" Y readslength = random.randint(self.minreadslength, self.maxreadslength). O) r4 b6 H" l% H6 z" q5 R
self.allreadslength += readslength3 F0 ]3 M" L$ {5 d, s8 a
self.readsList.append(fragment[:readslength])6 E* M1 Q F1 |4 B3 O; @
: l4 H( F, A4 J e def singlereadsequencing(self, genomedata, sequencingResult):" U& k& K7 ~- }. [
for seq_record in SeqIO.parse(genomedata, "fasta"):
1 \1 J$ D+ N S; O seqlen = len(seq_record)
7 ~! T( Y! z# z4 ]7 K self.genomeLength += seqlen
! ~7 L1 U. W) C4 O0 Y/ v! f for i in range(self.N):
8 E9 c1 B' W& J! f! y N& ` # 生成断裂点
( {( _6 N; Q; }# O# t breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)
+ Q! o k3 G, k1 _6 Z # 沿断裂点打断基因组
# z$ S3 S. G2 b self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)% M& x5 @2 v! A+ W) b- p
# 模拟克隆时的随机丢失情况! z$ K& u# y d! ^4 _6 l. A* R& g
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)" S3 I# P1 o* w1 _1 B% s0 C
# 模拟单端测序
3 I. K# v; Y" \- Z7 X4 i8 \' h8 O self.singleread(clonedfragmentList)- P' U+ h2 w2 j3 h) H$ d
SeqIO.write(self.readsList, sequencingResult, "fasta")
) x& }5 h" {/ V, O4 w7 J: |8 U: _# O9 y7 K! K+ I& |
def pairread(self, clonedfragmentList):
3 q1 v; X3 [6 i( ]% B8 B for fragment in clonedfragmentList:# B/ d0 p& ]. N9 ?
fragment.id = ""
5 `4 f% p i* w3 m% l6 a1 [ fragment.name = ""
" E) }6 i: O0 M7 Q- r0 |5 l& v+ c description = fragment.description[12:].split(",")[0]
* t+ [" i3 t) M* z' G3 ?2 E fragment.description = str(self.readsID) + "." + description
/ }4 v) B2 t* v! ?! G5 O& j readslength = random.randint(self.minreadslength, self.maxreadslength)1 ^* ~, b6 L( u! O# V
self.allreadslength += readslength
# K5 q7 Y0 r. E. Y+ q9 x self.readsList.append(fragment[:readslength])3 [& j+ L( O4 m
, w" I1 K/ a% Z; y8 [$ K
readslength = random.randint(self.minreadslength, self.maxreadslength)
6 z5 c3 g8 z, y; `+ N% V, S self.allreadslength += readslength7 H. _3 j2 _+ W: ^# T, g
9 `3 O, A* M5 M! G7 \9 P7 r8 L fragmentcomplement = fragment.reverse_complement()# H4 F4 v. `' E; }7 l
fragmentcomplement.id = ""
+ M, O7 P9 P; ^' j, k fragmentcomplement.name = ""
9 ^$ l) K# O" B9 k* a fragmentcomplement.description = str(self.readsID) + "." + description
$ I+ q) y; ~5 r self.readsList.append(fragmentcomplement[:readslength])
" t6 F7 @) m2 p* D. q" j; h4 k6 k
. }* r4 Y: c( H) x/ U6 P2 r5 P3 R self.readsID += 18 M3 t' K2 e. Q# e; r) `
, F& ], g. y. P+ Y2 n; ~ def pairreadsequencing(self,genomedata, sequencingResult_1, sequencingResult_2):
) q9 L: k2 n: P$ w for seq_record in SeqIO.parse(genomedata, "fasta"):
1 L6 j6 {5 L. w& h: _9 @: L seqlen = len(seq_record)! j/ w: F, s) t
self.genomeLength += seqlen% w' @0 F4 g& J2 \/ w
for i in range(self.N):1 b9 y2 t; N$ G' a9 w$ W- t4 G
# 生成断裂点
- @* Q0 n6 I4 F) d! M, I; @ breakpoint = self.generatebreakpoint(seqlen, self.averagefragmentlength)' S. E C' b: a# W1 b+ J
# 沿断裂点打断基因组0 M! `* p7 m9 o* e
self.breakgenome(seq_record, breakpoint, self.minfragmentlength, self.maxfragmentlength)' g% ~- _; L+ E1 D/ S
# 模拟克隆时的随机丢失情况) R7 F. R% D% i2 N$ P
clonedfragmentList = self.clonefragment(self.fragmentList, self.cloneRetainprobability)
+ h& ?2 D! x- \; M # 模拟双端测序5 o: r5 V& e* q- \2 S9 O
self.pairread(clonedfragmentList)7 h7 Y- y9 ~/ E( l& t4 J
readsList_1 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 0]
( A8 @" C2 i! M' \; P; |5 P$ k0 D) i readsList_2 = [self.readsList for i in range(len(self.readsList)) if i % 2 == 1]- }. F* ]+ h T- ^+ }1 k
SeqIO.write(readsList_1, sequencingResult_1, "fasta")
# c0 }0 `1 u* M SeqIO.write(readsList_2, sequencingResult_2, "fasta")
0 b/ n! F+ @ O8 C+ h1 n( X7 x- [' ~5 d( g/ |5 Q. h
def resultsummary(self):, a' N; \1 q+ H, P" u' R2 A
print("基因组长度:" + str(self.genomeLength / 1000) + "kb")% a. I& R# o- ~: G. l" e
print("片段长度区间:" + str(self.minfragmentlength) + "-" + str(self.maxfragmentlength))
6 F* P0 [) P; k: D$ ]5 _* p print("N值:" + str(self.N))$ `7 H4 \# s* M
print("期望片段长度:" + str(self.averagefragmentlength))
' E Q% D- X# ^ p( D! \ print("克隆保留率:" + str(self.cloneRetainprobability))) h% D1 ]% R" o( \/ ]6 X, @ w# Y
print("片段数量:" + str(len(self.fragmentList)))) R. z4 @% ^/ t. {7 f1 T
print("reads长度:" + str(self.minreadslength) + "-" + str(self.maxreadslength))
! c( U1 r) Z' Z: O8 q print("reads总数量:" + str(len(self.readsList)))' M7 N* F7 f+ Q* F/ _9 k% @; }
print("reads总长度:" + str(self.allreadslength / 1000) + "kb")
+ h7 z* F, X0 q; n m = self.allreadslength / self.genomeLength. F3 S+ k! {! r) j& w
print("覆盖度(m值):" + str(round(m, 5)))0 H6 `7 ?2 D8 U, Y. }: _
print("理论丢失率(e^-m):" + str(round(exp(-m), 5)))& m8 [4 }8 h6 V7 |. |. g! m
print("覆盖率(1-e^-m):" + str(round(1 - exp(-m), 5)))$ h- w8 y+ p. L$ P& Y! l2 V
# -------------------------------------------主程序-------------------------------------------4 F, K3 b. [9 }# F9 ^1 f
# 模拟单端测序
' e; h& n+ e) Y; {, x, SsequencingObj = Sequencing()
% _! `- n1 D# I, j) GsequencingObj.singlereadsequencing("data/NC_025452.fasta", "result/virusSingleRead.fa")6 u$ i; J9 I* P U" _6 X
sequencingObj.resultsummary()1 ?# s7 a! C# O& y& l6 Q/ f- H( m. g1 n+ U
! A+ F/ T9 T8 s( a: W# 模拟双端测序9 F. w) ~+ a9 e/ v/ e. t) E
sequencingObj = Sequencing()- y4 D0 F6 b" n$ {7 N. {5 X0 H
sequencingObj.pairreadsequencing("data/GCF_000146045.2_R64_genomic.fna", "result/yeastPairRead_1.fa", "result/yeastPairRead_2.fa")
+ J+ R$ t: @, ~- m% t2 w, gsequencingObj.resultsummary()
5 o1 ?* h% W5 T# N& Y4 h6 [- d# P( |( Y% J% [
0 u( A' N* V7 ~$ E
7 f5 p$ c J# Z8 x
8 N1 z2 ~: }, C8 V4 \8 Y& J3 m( M! U9 ~ |
zan
|