Nuclear encoded chloroplast protein database

Material and Methods

Downloaded TAIR 10.


wget -r ftp://ftp.arabidopsis.org/home/tair/Sequences/blast_datasets/TAIR10_blastsets/TAIR10_pep_20101214

Extracted 3869 gene model numbers from "/Users/mats/project/Aronsson/chloroplast_sequences/Chloroplast_genome 2.xml" and saved them in "grep_file.3". Changed each entry using VIM:

  • :%s/AT/>AT/g

Fetched the full fasta headers from "TAIR10_pep_20101214" using:


cat ~/db/TAIR/TAIR10_pep_20101214 | grep -f grep_file.3 > headers.4.txt

This found 3833 fasta headers, and not 3869, as expected. I therefore separated the gene model numbers in "headers.4.txt" using VIM:

  • :%s/|.*//g

... and the concatenated this list of retrieved gene model numbers, with the list of desired gene model numbers in "grep_file.3", and saved this new list in "list_all.txt".


sort list_all.txt | uniq -u > missing.txt

... resulted in 36 hits. 35 from the chloroplast genome and one from the mitochondrial genome.

Used the following code to extract the actual sequences from "TAIR10_pep_20101214":


mats@Slartibartfasts:~/project/Aronsson/chloroplast_sequences$ cat something.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-

# Modified code from: http://biostar.stackexchange.com/questions/2827/extracting-multiple-fasta-sequences-at-a-time-from-a-file-containing-many-sequenc

import sys
from Bio import SeqIO

fasta_file = sys.argv[1]  # Input fasta file
number_file = sys.argv[2] # Input interesting numbers file, one per line
result_file = sys.argv[3] # Output fasta file

wanted = set()
#with open(number_file) as f:
f = open(number_file)
for line in f:
	line = line.strip()
	if line != "":			# If line in nuber file is not empty (?)
		wanted.add(line)
fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')
end = False
#with open(result_file, "w") as f:
o = open(result_file, "w")
for seq in fasta_sequences:
	if seq.id in wanted:
		SeqIO.write([seq], o, "fasta")

Input fasta file = ~/db/TAIR/TAIR10_pep_20101214
Input interesting numbers file = headers.4.txt
Output fasta file = TAIR10_cp.fst


./something.py ~/osx/db/TAIR/TAIR10_pep_20101214 headers.4.txt TAIR10_cp.fst

And here I realise that the step that retrieves the FULL fasta header is not necessary. So, after removing ">" and everything except the gene model number from "headers.4.txt", I end up with a file with the same content as "grep_file.3".

Result

The 3833 nuclear encoded (+ some chloroplast encoded!!!) sequences are stored in "TAIR10_cp.fst".
Next time I'll only need a list of gene model numbers in a file (one number per line) and the script "something.py"


./something.py SEQUENCE_FILE_IN_FASTA_FORMAT LIST_OF_GENE_MODEL_NUMBERS OUTPUT_FILE