Format I/O (FASTA, CSV, JSON, PDB)#
The biolmai.io module provides utilities for converting between file formats and BioLM API JSON structures. This guide covers common usage patterns and best practices.
FASTA Format#
Loading FASTA Files#
The load_fasta() function parses FASTA files and returns a list of dictionaries suitable for API requests:
from biolmai.io import load_fasta
# Load sequences from file
items = load_fasta("sequences.fasta")
# Each item contains:
# - "sequence": The sequence string
# - "id": Sequence identifier from header
# - "metadata": Additional metadata (if present)
print(items[0])
# {'sequence': 'ACDEFGHIKLMNPQRSTVWY', 'id': 'seq1', 'metadata': {}}
FASTA files support:
- Multi-line sequences (wrapped sequences are automatically concatenated)
- Headers with metadata (pipe-separated or space-separated)
- Multiple sequences in a single file
Example FASTA file:
>seq1|protein|test
ACDEFGHIKLMNPQRSTVWY
>seq2 description here
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKRQTLGQHDFSAGEGLYTHMKALRPDEDRLSPLHSVYVDQWDWERVMGDGERQFSTLKSTVEAIWAGIKATEAAVSEEFGLAPFLPDQIHFVHSQELLSRYPDLDAKGRERAIAKDLGAVFLVGIGGKLSDGHRHDVRAPDYDDWSTPSELGHAGLNGDILVWNPVLEDAFELSSMGIRVDADTLKHQLALTGDEDRLELEWHQALLRGEMPQTIGGGIGQSRLTMLLLQLPHIGQVQAGVWPAAVRESVPSLL
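To make the parsing rules concrete, here is a minimal stand-alone sketch of a FASTA reader that joins wrapped sequence lines and splits the header on the first pipe or space. It is not the biolmai implementation: the function name parse_fasta_text and the "description" key are hypothetical, and the real load_fasta() may organize header metadata differently.

```python
import io


def parse_fasta_text(handle):
    """Parse FASTA records from a text handle (illustrative sketch only)."""
    records = []
    current = None
    for raw_line in handle:
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            # Assumption: the identifier is the text before the first pipe
            # (or, if there is no pipe, the first space); the remainder is
            # kept as an unparsed description string.
            separator = "|" if "|" in header else " "
            seq_id, _, rest = header.partition(separator)
            current = {"id": seq_id, "description": rest, "chunks": []}
            records.append(current)
        elif current is not None:
            current["chunks"].append(line)
        else:
            raise ValueError("sequence data found before the first header")
    if not records:
        raise ValueError("no FASTA records found")
    # Join the wrapped lines of each record into one sequence string.
    return [
        {
            "id": r["id"],
            "description": r["description"],
            "sequence": "".join(r["chunks"]),
        }
        for r in records
    ]


example = io.StringIO(
    ">seq1|protein|test\nACDEFGHIKL\nMNPQRSTVWY\n>seq2 description here\nMKTAYIAK\n"
)
parsed = parse_fasta_text(example)
```

The first record's two wrapped lines come back as a single sequence, and each header is split into an identifier and the remaining text.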
Writing FASTA Files#
The to_fasta() function writes sequences to FASTA format:
from biolmai.io import to_fasta
# Data from API response
data = [
    {"sequence": "ACDEFGHIKLMNPQRSTVWY", "id": "seq1"},
    {"sequence": "MKTAYIAKQRQISFVKSHFSRQ", "id": "seq2"},
]
# Write to file
to_fasta(data, "output.fasta")
# With metadata
data_with_metadata = [
    {
        "sequence": "ACDEFGHIKLMNPQRSTVWY",
        "id": "seq1",
        "metadata": {"description": "Test sequence", "type": "protein"},
    }
]
to_fasta(data_with_metadata, "output.fasta")
You can also use a custom sequence key:
data = [{"seq": "ACDEFGHIKLMNPQRSTVWY", "id": "seq1"}]
to_fasta(data, "output.fasta", sequence_key="seq")
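This guide does not describe the exact text that to_fasta() produces, so the following is only a sketch of how records with a configurable sequence key could be serialized to FASTA. The function format_fasta, its width parameter, and the fallback identifier are assumptions made for illustration.

```python
def format_fasta(records, sequence_key="sequence", width=60):
    """Serialize records to FASTA text (illustrative sketch only)."""
    lines = []
    for index, record in enumerate(records):
        if sequence_key not in record:
            raise KeyError(f"record {index} has no {sequence_key!r} key")
        # Assumption: fall back to a positional identifier when "id" is absent.
        lines.append(">" + str(record.get("id", f"seq{index + 1}")))
        sequence = record[sequence_key]
        # Wrap the sequence into fixed-width lines.
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


fasta_text = format_fasta(
    [{"seq": "ACDEFGHIKLMNPQRSTVWY", "id": "seq1"}],
    sequence_key="seq",
    width=10,
)
```

With width=10 the 20-residue sequence is written as two lines under the >seq1 header.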
CSV Format#
Loading CSV Files#
The load_csv() function parses CSV files with headers:
from biolmai.io import load_csv
# Load CSV file
items = load_csv("data.csv")
# Each row becomes a dictionary with column headers as keys
print(items[0])
# {'sequence': 'ACDEFGHIKLMNPQRSTVWY', 'id': 'seq1', 'score': '0.95'}
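In the printed row the score appears as the string '0.95', which is also how Python's own csv module returns values. If numeric columns are needed downstream, they can be converted after loading. The helper convert_columns below is a hypothetical sketch and not part of biolmai.io.

```python
def convert_columns(rows, converters):
    """Return new rows with selected columns converted, e.g. {"score": float}."""
    converted = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows stay untouched
        for column, convert in converters.items():
            # Leave missing or empty values as they are.
            if column in new_row and new_row[column] != "":
                new_row[column] = convert(new_row[column])
        converted.append(new_row)
    return converted


rows = [{"sequence": "ACDEFGHIKLMNPQRSTVWY", "id": "seq1", "score": "0.95"}]
typed_rows = convert_columns(rows, {"score": float})
```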
You can validate that a specific column exists:
# Raises ValueError if "sequence" column is missing
items = load_csv("data.csv", sequence_key="sequence")
Example CSV file:
sequence,id,score,description
ACDEFGHIKLMNPQRSTVWY,seq1,0.95,Test sequence 1
MKTAYIAKQRQISFVKSHFSRQ,seq2,0.87,Test sequence 2
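The column check described above can be approximated with the standard library. This sketch uses csv.DictReader and raises ValueError when a required column is absent. The function read_csv_checked and its required_column parameter are hypothetical, and load_csv() may validate differently.

```python
import csv
import io


def read_csv_checked(handle, required_column=None):
    """Read CSV rows as dicts, optionally requiring one column (sketch)."""
    reader = csv.DictReader(handle)
    # fieldnames is None when the input has no header row at all.
    if reader.fieldnames is None:
        raise ValueError("CSV input is empty")
    if required_column is not None and required_column not in reader.fieldnames:
        raise ValueError(
            f"missing required column {required_column!r}; found {reader.fieldnames}"
        )
    return list(reader)


csv_source = (
    "sequence,id,score,description\n"
    "ACDEFGHIKLMNPQRSTVWY,seq1,0.95,Test sequence 1\n"
    "MKTAYIAKQRQISFVKSHFSRQ,seq2,0.87,Test sequence 2\n"
)
checked_rows = read_csv_checked(io.StringIO(csv_source), required_column="sequence")
```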
Writing CSV Files#
The to_csv() function writes data to CSV format:
from biolmai.io import to_csv
# Data from API response
data = [
    {"sequence": "ACDEFGHIKLMNPQRSTVWY", "id": "seq1", "score": 0.95},
    {"sequence": "MKTAYIAKQRQISFVKSHFSRQ", "id": "seq2", "score": 0.87},
]
# Write to file
to_csv(data, "output.csv")
# With custom fieldnames
to_csv(data, "output.csv", fieldnames=["sequence", "id"])
Missing keys are automatically filled with empty strings.
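The same fill-with-empty-string behavior can be reproduced with the standard library's csv.DictWriter through its restval argument. The sketch below illustrates the idea only and says nothing about how to_csv() is implemented internally.

```python
import csv
import io

rows = [
    {"sequence": "ACDEFGHIKLMNPQRSTVWY", "id": "seq1", "score": 0.95},
    {"sequence": "MKTAYIAKQRQISFVKSHFSRQ", "id": "seq2"},  # no "score" key
]

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["sequence", "id", "score"],
    restval="",  # value written for keys missing from a row
    lineterminator="\n",
)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

The second row ends with an empty field because it has no "score" key.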
PDB Format#
Loading PDB Files#
The load_pdb() function reads PDB structure files:
from biolmai.io import load_pdb
# Load single-model PDB
items = load_pdb("structure.pdb")
# Returns: [{"pdb": "HEADER TEST\nATOM 1 N MET A 1\n..."}]
# For multi-model PDBs, returns one item per model
items = load_pdb("multi_model.pdb")
# Returns: [{"pdb": "MODEL 1..."}, {"pdb": "MODEL 2..."}]
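To show what "one item per model" means in practice, this sketch splits PDB text on MODEL and ENDMDL records. The function split_pdb_models is hypothetical. Unlike a full parser, it drops any lines outside the MODEL blocks when models are present, which may differ from what load_pdb() does.

```python
def split_pdb_models(pdb_text):
    """Split PDB text into one string per MODEL ... ENDMDL block (sketch)."""
    models = []
    current = None
    for line in pdb_text.splitlines(keepends=True):
        record = line[:6].strip()  # PDB record names occupy columns 1-6
        if record == "MODEL":
            current = [line]
        elif record == "ENDMDL" and current is not None:
            current.append(line)
            models.append("".join(current))
            current = None
        elif current is not None:
            current.append(line)
    # No MODEL records: treat the whole text as a single structure.
    return models if models else [pdb_text]


multi = "MODEL 1\nATOM A\nENDMDL\nMODEL 2\nATOM B\nENDMDL\nEND\n"
items = [{"pdb": block} for block in split_pdb_models(multi)]
```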
Writing PDB Files#
The to_pdb() function writes PDB structures:
from biolmai.io import to_pdb
# Data from API response
data = [{"pdb": "HEADER TEST\nATOM 1 N MET A 1\nEND\n"}]
# Write to file
to_pdb(data, "output.pdb")
# Multiple structures are concatenated
data = [
    {"pdb": "MODEL 1\nATOM...\nENDMDL\n"},
    {"pdb": "MODEL 2\nATOM...\nENDMDL\n"},
]
to_pdb(data, "output.pdb")
Integration with Model Class#
The io loaders return lists of dictionaries that can be passed directly to the Model class as items, and the writers accept the results it returns:
from biolmai.io import load_fasta, to_csv
from biolmai import Model
# Load sequences from FASTA
items = load_fasta("sequences.fasta")
# Use with model
model = Model("esm2-8m")
results = model.encode(items=items)
# Export results to CSV
to_csv(results, "results.csv")
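When exporting, it can help to carry the input identifiers into the result rows. The sketch below assumes, hypothetically, that results is a list of dictionaries in the same order as the input items. This guide does not confirm that shape, so check the actual response before relying on it. The helper attach_ids and the sample values are made up for illustration.

```python
def attach_ids(items, results, id_key="id"):
    """Pair each result dict with the id of the input at the same position."""
    if len(items) != len(results):
        raise ValueError(
            f"expected {len(items)} results, got {len(results)}"
        )
    # The id comes first so it becomes the leading CSV column.
    return [
        {id_key: item.get(id_key, ""), **result}
        for item, result in zip(items, results)
    ]


# Made-up placeholder data, not real API output.
sample_items = [{"id": "seq1", "sequence": "ACDE"}, {"id": "seq2", "sequence": "FGHI"}]
sample_results = [{"value": 1}, {"value": 2}]
labeled = attach_ids(sample_items, sample_results)
```

The labeled rows could then be passed to to_csv() in place of the raw results.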
Complete workflow example:
from biolmai.io import load_fasta, to_csv
from biolmai import Model
# 1. Load input sequences
sequences = load_fasta("input.fasta")
# 2. Process with model
model = Model("esmfold")
structures = model.predict(items=sequences)
# 3. Export results
to_csv(structures, "output.csv")
File-like Objects#
All functions support both file paths and file-like objects:
import io
from biolmai.io import load_fasta, to_fasta
# Load from file-like object
file_obj = io.StringIO(">seq1\nACDEFGHIKLMNPQRSTVWY\n")
items = load_fasta(file_obj)
# Write to file-like object
output = io.StringIO()
to_fasta(items, output)
content = output.getvalue()
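A common way to accept both paths and file-like objects is to open paths and pass existing handles through unchanged. The following is a sketch of that pattern using contextlib.nullcontext. The helper open_text_source is hypothetical and is not how biolmai.io is known to do it.

```python
import io
import os
from contextlib import nullcontext


def open_text_source(source, mode="r"):
    """Return a context manager yielding a text handle for a path or file-like object."""
    if isinstance(source, (str, os.PathLike)):
        return open(source, mode, encoding="utf-8")
    # Already file-like: hand it back without closing it on exit.
    return nullcontext(source)


source = io.StringIO(">seq1\nACDEFGHIKLMNPQRSTVWY\n")
with open_text_source(source) as handle:
    first_line = handle.readline().strip()
still_open = not source.closed
```

Leaving caller-supplied handles open means the caller can still read a StringIO buffer with getvalue() afterwards.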
Error Handling#
The io module raises standard Python exceptions for common errors:
from biolmai.io import load_fasta
try:
    items = load_fasta("nonexistent.fasta")
except FileNotFoundError:
    print("File not found")
try:
    items = load_fasta("empty.fasta")
except ValueError as e:
    print(f"Invalid file: {e}")
Common error types:
- FileNotFoundError: File path doesn’t exist
- ValueError: File is empty, malformed, or missing required fields
- KeyError: Missing required keys in data dictionaries
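The three error types listed above can be handled in one place. This sketch wraps any loader callable and returns an error message instead of raising. Both load_or_report and fake_loader are hypothetical; fake_loader stands in for a real loader such as load_fasta so the example runs without any input files.

```python
def load_or_report(loader, source):
    """Call loader(source) and return (items, error_message)."""
    try:
        return loader(source), None
    except FileNotFoundError:
        return [], f"File not found: {source}"
    except ValueError as exc:
        return [], f"Invalid file {source}: {exc}"
    except KeyError as exc:
        return [], f"Missing key {exc} in {source}"


def fake_loader(path):
    # Hypothetical stand-in for a loader such as load_fasta.
    if path == "missing.fasta":
        raise FileNotFoundError(path)
    if path == "empty.fasta":
        raise ValueError("file is empty")
    return [{"id": "seq1", "sequence": "ACDE"}]


ok_items, ok_error = load_or_report(fake_loader, "sequences.fasta")
missing_items, missing_error = load_or_report(fake_loader, "missing.fasta")
empty_items, empty_error = load_or_report(fake_loader, "empty.fasta")
```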
Best Practices#
- Validate input files: Check that files exist and are readable before processing
- Handle errors gracefully: Use try/except blocks for file operations
- Preserve metadata: Include metadata in FASTA headers when exporting
- Use appropriate formats: FASTA for sequences, CSV for tabular data, PDB for structures
- Round-trip testing: Test that load → process → save → load works correctly
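A round-trip check can be written once and reused for any writer and reader pair that accepts file-like objects. The sketch below uses json.dump and json.load as stand-ins so it runs with the standard library alone. With biolmai, the pair would presumably be to_fasta and load_fasta, or to_csv and load_csv. The helper round_trip is hypothetical.

```python
import io
import json


def round_trip(items, dump, load):
    """Write items with dump(items, handle), then read them back with load(handle)."""
    buffer = io.StringIO()
    dump(items, buffer)
    buffer.seek(0)  # rewind so the reader starts at the beginning
    return load(buffer)


items = [{"sequence": "ACDEFGHIKLMNPQRSTVWY", "id": "seq1"}]
restored = round_trip(items, json.dump, json.load)
```

Comparing restored with items in a test catches fields that are lost or retyped on the way through a format. CSV, for example, returns numbers as strings.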
See Also#
- SDK API for biolmai.io API details
- Disk output in Usage for writing API results to disk