Optimising ContigJoining

Introduction

In collaboration with Yann Bertrand from the Oxelman group.

We have used ContigJoining and blasted our transcriptomes against the NCBI "nt" database and SwissProt. This approach takes too much time, particularly the step of blasting against "nt". We therefore want to test different databases to see how much time could be saved while still getting an equally good result.

For this we use the "05_1269_1_7_RS.fasta" dataset, which contains 80312 sequences, of which 72587 have a BLAST hit (e-value cutoff 1e-9) against "nt".
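As a quick sanity check, the sequence count can be confirmed with Biopython (a minimal sketch; the parsing scripts further down use the same library):

# Count the entries in the query fasta file; this should print 80312.
from Bio import SeqIO

count = sum(1 for _ in SeqIO.parse("05_1269_1_7_RS.fasta", "fasta"))
print "05_1269_1_7_RS.fasta contains", count, "sequences"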

Comparisons will be done by blasting against TAIR10, arabi_db (i.e. all Anne's Arabidopsis transcriptomes), SwissProt, and a database combining TAIR10 and arabi_db; four databases in total.

First we formatted the BLAST databases (arabi_db, TAIR10 and the combination of the two):
[root@pc158250 arabi_db]# formatdb -i arabi_db.fasta -p F -n arabi_db -l formatdb.log
[mtop@pc158250 TAIR10_arabi_db]$ cat arabi_db.fasta >> TAIR10_seq_20101214.fas
[mtop@pc158250 TAIR10_arabi_db]$ formatdb -i TAIR10_seq_20101214.fas -p F -n TAIR10_arabi_db -l formatdb.log

... and created run-scripts.


[mtop@pc158250 optimisation]$ cat run.arabi_db.sh 
#!/bin/bash
time blastall -p blastn -m 7 -e 1e-9 -d /usr/local/db/yann/arabi_db/arabi_db -i 05_1269_1_7_RS.fasta -o 05_1269_1_7_RS_arabi_db.xml

[mtop@pc158250 optimisation]$ cat run.swissprot.sh 
#!/bin/bash
time blastall -p blastx -m 7 -e 1e-3 -d /usr/local/db/yann/swissprot -i 05_1269_1_7_RS.fasta -o 05_1269_1_7_RS_swissprot.xml

[mtop@pc158250 optimisation]$ cat run.TAIR10.sh 
#!/bin/bash
time blastall -p blastn -m 7 -e 1e-9 -d /usr/local/db/yann/TAIR10/TAIR10_seq_20101214 -i 05_1269_1_7_RS.fasta -o 05_1269_1_7_RS_TAIR10.xml

[mtop@pc158250 optimisation]$ cat run.TAIR10_arabi_db.sh 
#!/bin/bash
time blastall -p blastn -m 7 -e 1e-9 -d /usr/local/db/yann/TAIR10_arabi_db/TAIR10_arabi_db -i 05_1269_1_7_RS.fasta -o 05_1269_1_7_RS_TAIR10_arabi_db.xml


### This happens when you format a DNA fasta file with the "-p T" flag ###
[mtop@compute-0-0 optimisation]$ ./run.arabi_db.sh 
[blastall] WARNING: Unable to open arabi_db.nin
[blastall] WARNING: Unable to open arabi_db.nin
[blastall] WARNING: 05_1269_1_7_01343: Unable to open arabi_db.nin
[blastall] WARNING: 05_1269_1_7_01343: Unable to open arabi_db.nin
[blastall] WARNING: 05_1269_1_7_01343: Unable to open arabi_db.nin
[blastall] WARNING: 05_1269_1_7_01343: Unable to open arabi_db.nin
[blastall] FATAL ERROR: 05_1269_1_7_01343: Database /usr/local/db/yann/arabi_db/arabi_db was not found or does not exist
#################################################################
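The cause is that formatdb with "-p T" builds protein index files (.pin, .phr, .psq), whereas blastn looks for the nucleotide indices (.nin, .nhr, .nsq); DNA databases must be formatted with "-p F". A quick way to check which index files a database actually has (a sketch, using the arabi_db path from above):

# List which BLAST index files exist for a formatted database.
import os

db = "/usr/local/db/yann/arabi_db/arabi_db"
for ext in (".nin", ".nhr", ".nsq", ".pin", ".phr", ".psq"):
	print db + ext, "exists:", os.path.exists(db + ext)

If only the .p* files exist, the database was formatted as protein and blastn fails as above.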

Results

[mtop@compute-0-0 optimisation]$ ./run.arabi_db.sh

  • real 27m7.114s
  • user 26m47.355s
  • sys 0m6.778s

[mtop@compute-0-0 optimisation]$ ./run.TAIR10.sh

  • real 43m57.417s
  • user 43m38.437s
  • sys 0m9.908s

[mtop@compute-0-0 optimisation]$ ./run.TAIR10_arabi_db.sh

  • real 74m3.252s
  • user 73m39.388s
  • sys 0m13.976s
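The combined database takes roughly the sum of the two individual runs (27m + 44m vs 74m). In rough terms (wall-clock times above, all 80312 queries):

# Approximate queries per minute for each nucleotide database (times rounded).
queries = 80312
for name, minutes in [("arabi_db", 27.1), ("TAIR10", 44.0), ("TAIR10_arabi_db", 74.1)]:
	print "%s: %.0f queries/minute" % (name, queries / minutes)

which gives about 2960, 1830 and 1080 queries/minute, respectively.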

Reply from Yann

"Hi,

Against nt, we had 72587 BLAST hits for 80312 sequences.
Here is the output of what we have so far:

  • name of initial fasta file: 05_1269_1_7_RS.fasta
  • 3 BLAST files will be parsed
  • the original file has 80312 entries
  • the blast file 05_1269_1_7_RS_arabi_db.xml gives 40647 BLAST-hits and 39665 misses, record stored in: dict_0
  • the blast file 05_1269_1_7_RS_TAIR10.xml gives 59290 BLAST-hits and 21022 misses, record stored in: dict_1
  • the blast file 05_1269_1_7_RS_TAIR10_arabi_db.xml gives 62024 BLAST-hits and 18288 misses, record stored in: dict_2

It means that even with the largest db we are still about 10000 hits short. I hope we will get a better result once we have the output of SwissProt.

I have written two short scripts for the task: hit_blast.py, which generates the above output and keeps hit lists for each db for later use, and
hit_blast_comparisons.py, which lets you test different combinations in order to find the one that gives the highest yield.

Hope your writing is going well.

cheers"

hit_blast.py

Author: Yann Bertrand


#title: script for parsing blast xml and calculating the number of hits
#aim: comparing different blast_dbs in terms of hit yield and search speed
#project: contigjoining optimization
#date: 11/10/19
#usage: hit_blast.py <name of fasta file that was blasted>
#note: all .xml files should be placed together with the fasta file in the 
#same folder as this script. The folder cannot contain unrelated .xml files

import sys, os, pickle
from Bio import SeqIO
from Bio.Blast import NCBIXML


initial = sys.argv[1] #initial fasta file
SeqList = os.listdir(os.getcwd())
SeqList_xml = [x for x in SeqList if x.endswith('.xml')] #collect all .xml files
sys.stdout = open('results.txt', 'a') #all print output goes to results.txt
print "name of initial fasta file:", initial 
print "%s BLAST files will be parsed" % len(SeqList_xml)

# add initial fasta file to dictionary
handle = open(initial)
initial_dico = SeqIO.to_dict(SeqIO.parse(handle, "fasta"))
handle.close()
print "the initial file has", len(initial_dico), "entries"
pck = open("initial.pck", "w")
pickle.dump(initial_dico.keys(), pck)
pck.close()

for i, xml_file in enumerate(SeqList_xml):
	name = 'dict_' + str(i)
	hits = []
	for record in NCBIXML.parse(open(xml_file)):
		if record.alignments:
			hits.append(record.query.split()[0]) #query id = first word of the description line
	misses = set(initial_dico.keys()) - set(hits) #set difference, alternative wording: set(initial_dico.keys()).difference(set(hits))
	print "the blast file {0} gives {1} BLAST-hits and {2} misses, record stored in: {3}".format(xml_file, len(hits), len(misses), name)
	pck = open(name + ".pck", "w")
	pickle.dump(hits, pck)
	pck.close()
sys.stdout.close()
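The pickled lists can be reloaded later without re-parsing the XML, e.g. to pull out the misses for one database (a sketch, assuming the .pck files written by hit_blast.py are in the current folder):

# Reload the id lists pickled by hit_blast.py and recompute the misses.
import pickle

initial_keys = pickle.load(open("initial.pck"))
hits = pickle.load(open("dict_2.pck")) #here dict_2 = TAIR10_arabi_db, as in the run above
misses = set(initial_keys) - set(hits)
print len(misses), "queries without a hit in this database"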

hit_blast_comparisons.py

Author: Yann Bertrand



#title: script for comparing different db combinations in order to get the highest hit number
#aim: comparing different blast_dbs in terms of hit yield and search speed
#project: contigjoining optimization
#date: 11/10/19
#usage: hit_blast_comparisons.py 
#note: this script uses the output of hit_blast.py; type "stop" to quit

import pickle

initial = open("initial.pck", 'r')
initial_keys = pickle.load(initial)
initial.close()

while True:
	dict_1 = raw_input("Enter the first dictionary to load: ")
	if dict_1 == 'stop': break
	dict_2 = raw_input("Enter the second dictionary to load: ")
	file1 = open(dict_1+".pck", "r")
	file2 = open(dict_2+".pck", "r")
	L1 = pickle.load(file1)
	L2 = pickle.load(file2)
	file1.close()
	file2.close()
	total = set(L1+L2)
	misses = set(initial_keys) - total
	print "this combination gives a total of", len(total), "hits and", len(misses), "misses"

Next step

The SwissProt analysis is taking a very long time, since we are only running each test on a single CPU core. We have therefore taken the 18288 sequences from "05_1269_1_7_RS.fasta" that did not have a BLAST hit against the "TAIR10_arabi_db" database, and are now blasting only those against the SwissProt database, using the following script:


[mtop@pc158250 optimisation]$ cat run.swissprot_no_hits.sh
#!/bin/bash
time blastall -p blastx -m 7 -v 2 -b 2 -e 1e-3 -d /usr/local/db/yann/swissprot -i 05_1269_1_7_RS_no_hits.fasta -o 05_1269_1_7_RS_swissprot_no_hits.xml
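The no-hit subset itself can be written with Biopython using the pickles from hit_blast.py (a sketch; the filename 05_1269_1_7_RS_no_hits.fasta is our assumption and should match whatever the run script is given with -i):

# Write the 18288 queries without a TAIR10_arabi_db hit to their own fasta file.
import pickle
from Bio import SeqIO

hits = set(pickle.load(open("dict_2.pck"))) #dict_2 = TAIR10_arabi_db hit list
records = SeqIO.parse(open("05_1269_1_7_RS.fasta"), "fasta")
no_hits = (rec for rec in records if rec.id not in hits)
print SeqIO.write(no_hits, "05_1269_1_7_RS_no_hits.fasta", "fasta"), "sequences written"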

Results II

The two BLAST analyses against SwissProt have finished:

[mtop@compute-0-0 optimisation]$ ./run.swissprot.sh

  • real 3387m40.839s
  • user 3382m33.139s
  • sys 3m47.413s

[mtop@compute-0-0 optimisation]$ ./run.swissprot_no_hits.sh

  • real 633m11.475s
  • user 632m21.544s
  • sys 0m36.342s
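Per query the two runs are similar (assuming run.swissprot.sh queried the full 80312-sequence file, as in the script above), so the saving comes from blasting fewer sequences rather than from a faster search:

# Wall-clock seconds per query for the two SwissProt runs (times rounded).
for label, minutes, queries in [("all sequences", 3387.7, 80312), ("no-hit subset", 633.2, 18288)]:
	print "%s: %.1f s/query" % (label, minutes * 60.0 / queries)

i.e. about 2.5 s/query for the full file and 2.1 s/query for the 18288 no-hit sequences, a roughly five-fold reduction in total wall-clock time.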