A Protein-Sequence Transformer for the LuxR Family

The challenge

A protein is a sequence of amino acids, and — like language — which residue belongs next depends on the residues around it. That analogy is why Transformer language models can be turned loose on protein sequences: trained on real proteins, they learn to propose new sequences that look like they belong to the same family. I was interested in the LuxR family of bacterial regulators — the proteins that run quorum sensing, the same biology I’d been modeling elsewhere. I wanted to build a generative sequence model for that family myself, in PyTorch, rather than call an existing API, because I wanted to actually understand autoregressive generation and see firsthand where a naive model breaks on biological data.

My contribution

I developed a Transformer-based autoregressive language model that generates amino-acid sequences resembling the LuxR regulatory family. On top of the standard architecture I added a learned pairwise interaction bias between sequence positions — a term that lets the model represent the idea that particular positions influence each other, which matters biologically because a folded protein brings residues that are far apart in the sequence into physical contact. I assembled the training data from bacterial regulatory-protein sequences retrieved from UniProt, with fallback datasets for when the primary query returned too little, and I built a heuristic ranking system to score the generated sequences so the plausible ones surfaced instead of drowning in noise.
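
To make "autoregressive" concrete, here is a minimal sketch of the decoding loop in plain Python. It is not the project's code: the start/end markers, the `generate` and `sample_from_logits` helpers, the length cap, and especially `toy_logits` (a stand-in for the trained model) are assumptions chosen for illustration.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
BOS, EOS = "<", ">"                   # hypothetical start/end markers
VOCAB = [BOS, EOS] + list(AMINO_ACIDS)
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}


def sample_from_logits(logits, temperature, rng):
    """Draw one token id from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(next_token_logits, max_len=250, temperature=1.0, seed=0):
    """Autoregressive decoding: each new residue is conditioned on the prefix."""
    rng = random.Random(seed)
    ids = [TOKEN_TO_ID[BOS]]
    for _ in range(max_len):
        logits = next_token_logits(ids)  # a trained model would be called here
        next_id = sample_from_logits(logits, temperature, rng)
        if next_id == TOKEN_TO_ID[EOS]:
            break
        ids.append(next_id)
    return "".join(VOCAB[i] for i in ids[1:])


def toy_logits(prefix_ids):
    """Stand-in for the model: uniform over residues, EOS likelier as length grows."""
    logits = [0.0] * len(VOCAB)
    logits[TOKEN_TO_ID[BOS]] = -1e9  # never emit a second start marker
    logits[TOKEN_TO_ID[EOS]] = len(prefix_ids) / 50.0 - 3.0
    return logits


sequence = generate(toy_logits, max_len=250, seed=1)
print(len(sequence), sequence[:40])
```

With a real model, `next_token_logits` would run the Transformer on the prefix and return the scores for the final position; everything else in the loop stays the same. Lowering `temperature` makes the samples more conservative, raising it makes them more diverse.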

Key decisions

The pairwise-bias term was the real design choice. A vanilla character-level Transformer treats a protein as a flat string of tokens and has to infer any relationship between positions from the data alone; adding an explicit position-pair bias was my attempt to inject a little biological structure without jumping to a full structure-prediction model I had no way to train. It’s a small idea, but it’s mine, and I can explain exactly what it does and why.
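
One way such a bias can be wired in is sketched below in PyTorch. This is an illustration under assumptions, not a transcription of the model: it uses a single attention head, adds one learned scalar per (query position, key position) pair to the attention scores before the softmax, and the class name `PairBiasSelfAttention` and the sizes are made up for the example.

```python
import math

import torch
import torch.nn as nn


class PairBiasSelfAttention(nn.Module):
    """Single-head causal self-attention with a learned position-pair bias."""

    def __init__(self, d_model: int, max_len: int):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # One learnable scalar per (query position, key position) pair.
        # Zero initialisation means the layer starts out as plain attention.
        self.pair_bias = nn.Parameter(torch.zeros(max_len, max_len))
        self.scale = 1.0 / math.sqrt(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, seq_len, d_model)
        seq_len = x.size(1)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) * self.scale  # (batch, seq_len, seq_len)
        # Content-independent term: depends only on the two positions.
        scores = scores + self.pair_bias[:seq_len, :seq_len]
        # Causal mask: position i may only attend to positions <= i.
        future = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device),
            diagonal=1,
        )
        scores = scores.masked_fill(future, float("-inf"))
        weights = torch.softmax(scores, dim=-1)
        return self.out(weights @ v)


torch.manual_seed(0)
layer = PairBiasSelfAttention(d_model=32, max_len=256)
x = torch.randn(2, 100, 32)
y = layer(x)
print(y.shape)  # torch.Size([2, 100, 32])
```

Because the bias is indexed by absolute position, it can only capture couplings that sit at roughly the same positions across the training sequences; that is a limitation of this particular sketch, and other parameterisations (for example, by relative offset) are possible.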

The second decision was to build a ranking heuristic rather than trust raw output. Generation is the easy half; filtering for plausibility is the hard half. I would rather hand a reader a ranked shortlist of candidates than a pile of random-looking strings and a claim that some of them are good.
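
To show the shape of such a filter, here is a minimal ranking sketch in plain Python. The features (closeness to the family's mean length, similarity of amino-acid composition, a cap on single-residue repeats), the weights, and the function names are all hypothetical choices for illustration and are not the features the project's heuristic actually used. The example strings are toy data, not real LuxR sequences.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Fraction of each of the 20 standard residues in seq."""
    counts = Counter(seq)
    total = max(len(seq), 1)
    return {aa: counts.get(aa, 0) / total for aa in AMINO_ACIDS}


def longest_run(seq):
    """Length of the longest stretch of one repeated residue."""
    best = run = 0
    prev = None
    for ch in seq:
        run = run + 1 if ch == prev else 1
        best = max(best, run)
        prev = ch
    return best


def score(seq, reference, target_len, max_run=6):
    """Higher is better; each term lies in [0, 1]. The weights are arbitrary."""
    if not seq or any(ch not in AMINO_ACIDS for ch in seq):
        return 0.0  # reject empty strings and non-standard residues outright
    length_term = max(0.0, 1.0 - abs(len(seq) - target_len) / target_len)
    comp = composition(seq)
    # Total variation distance between the two composition profiles.
    distance = 0.5 * sum(abs(comp[aa] - reference[aa]) for aa in AMINO_ACIDS)
    composition_term = 1.0 - distance
    run_term = 1.0 if longest_run(seq) <= max_run else 0.0
    return 0.4 * length_term + 0.4 * composition_term + 0.2 * run_term


def rank(candidates, family, top_k=5):
    """Sort candidates by score against statistics of the family sequences."""
    reference = composition("".join(family))
    target_len = sum(len(s) for s in family) / len(family)
    scored = [(score(c, reference, target_len), c) for c in candidates]
    return sorted(scored, reverse=True)[:top_k]


# Toy strings for illustration only; these are not real LuxR sequences.
family = ["MKTLEDAVRRLSG", "MKSLEEAIRKLTG"]
candidates = ["MKTLEEAVRKLSG", "GGGGGGGGGGGGG", "MKX"]
for s, c in rank(candidates, family):
    print(f"{s:.3f}  {c}")
```

A filter like this only checks that a candidate has family-like summary statistics. It says nothing about whether the sequence would fold or bind DNA, which is why motif presence or alignment to real family members would make a stronger filter.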

[Name the specific features your ranking heuristic used — sequence length, amino-acid composition, presence of known LuxR motifs, alignment score to real family members — so a reader can judge how strong the filter actually is.]

The outcome

The pipeline runs end to end: it retrieves the data, trains the model, generates LuxR-family-like sequences, and ranks them with the heuristic to produce scored output.

I’m clear-eyed about what this is: a learning-scale model, not a research tool competing with ESM or ProGen. Its value was forcing me to implement embeddings, autoregressive decoding, and a custom bias term in PyTorch with my own hands. That groundwork is what makes the current protein-language-model literature readable to me now: I can read a paper like the ESM one and recognize the machinery, because I built a small version of it myself.