API#
dscript.alphabets#
- class dscript.alphabets.Alphabet(chars, encoding=None, mask=False, missing=255)[source]#
Bases: object
From Bepler & Berger.
- Parameters:
chars (byte str) – List of characters in alphabet
encoding (np.ndarray) – Mapping of characters to numbers [default: None]
mask (bool) – Set encoding mask [default: False]
missing (int) – Number to use for a value outside the alphabet [default: 255]
- decode(x)[source]#
Decode numeric encoding to byte string of this alphabet
- Parameters:
x (np.ndarray) – Numeric encoding
- Returns:
Amino acid string
- Return type:
byte str
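A minimal, dependency-free sketch of the encode/decode round trip that an alphabet like this performs. The class name `SketchAlphabet` and its `encode` method are illustrative assumptions and not the dscript implementation (which is documented as working on np.ndarray); only `chars`, `missing`, and `decode` mirror names from the signature above.

```python
class SketchAlphabet:
    """Illustrative stand-in for an alphabet; not the dscript class."""

    def __init__(self, chars: bytes, missing: int = 255):
        self.chars = chars
        self.missing = missing
        # Map each byte value to its position in the alphabet.
        self._index = {c: i for i, c in enumerate(chars)}

    def encode(self, seq: bytes) -> list[int]:
        # Bytes outside the alphabet receive the `missing` code.
        return [self._index.get(c, self.missing) for c in seq]

    def decode(self, codes: list[int]) -> bytes:
        # Inverse lookup; assumes every code is a valid alphabet index.
        return bytes(self.chars[i] for i in codes)


alphabet = SketchAlphabet(b"ACGT")
codes = alphabet.encode(b"GATTACA")
roundtrip = alphabet.decode(codes)
```

Decoding in this sketch is only defined for codes that index into `chars`; how the real class treats the `missing` value on decode is not stated above.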
- class dscript.alphabets.SDM12(mask=False)[source]#
Bases: Alphabet
A D KER N TSQ YF LIVM C W H G P
See https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2732308/#B33 “Reduced amino acid alphabets exhibit an improved sensitivity and selectivity in fold assignment” Peterson et al. 2009. Bioinformatics.
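The twelve groups listed above can be turned into a residue-to-group lookup as follows. This is a sketch under the assumption that groups are numbered in the order they are written; the numeric codes that dscript's SDM12 class actually assigns are not stated here, and `reduce_sequence` is a hypothetical helper.

```python
# Groups copied from the class description above; one code per group.
SDM12_GROUPS = "A D KER N TSQ YF LIVM C W H G P".split()

# Map every residue letter to the index of the group that contains it.
residue_to_group = {
    residue: group_index
    for group_index, group in enumerate(SDM12_GROUPS)
    for residue in group
}


def reduce_sequence(seq: str) -> list[int]:
    """Translate a protein sequence into group indices (KeyError if unknown)."""
    return [residue_to_group[r] for r in seq]


reduced = reduce_sequence("KERLIV")
```

Residues in the same group (for example K, E, and R) collapse to the same code, which is the point of a reduced alphabet.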
- class dscript.alphabets.Uniprot21(mask=False)[source]#
Bases: Alphabet
Uniprot 21 Amino Acid Encoding.
From Bepler & Berger.
dscript.fasta#
- dscript.fasta.parse(f: str)[source]#
Parse a FASTA file and return a tuple of sequence names and sequences.
- Parameters:
f (file-like object) – file-like object representing the FASTA file to parse.
- Returns:
A tuple of two lists: the sequence names and the sequences.
- Return type:
(list of str, list of str)
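For orientation, a minimal FASTA reader with the same return shape (names, sequences) might look like the sketch below. The function `parse_fasta_sketch` is hypothetical and is not the dscript implementation; in particular, it keeps the whole header line after `>` as the name, which may differ from what dscript does.

```python
import io


def parse_fasta_sketch(handle):
    """Return (names, sequences) from a file-like object of FASTA text."""
    names, sequences, chunks = [], [], []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if names:
                # Close out the previous record before starting a new one.
                sequences.append("".join(chunks))
            names.append(line[1:])
            chunks = []
        else:
            chunks.append(line)
    if names:
        # Flush the final record.
        sequences.append("".join(chunks))
    return names, sequences


example = io.StringIO(">P1\nMKV\nLAA\n>P2\nGGH\n")
names, seqs = parse_fasta_sketch(example)
by_name = dict(zip(names, seqs))
```

Zipping the two lists into `by_name` gives the name-to-sequence dictionary shape described for the dictionary-returning parsers.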
- dscript.fasta.parse_dict(f: str)[source]#
Parse a FASTA file and return a dictionary of sequences.
- Parameters:
f (str) – The FASTA file to parse (file-like object).
- Returns:
A dictionary where keys are sequence names and values are sequences.
- Return type:
dict
- dscript.fasta.parse_directory(directory, extension='.seq')[source]#
Parse all files in a directory with a specific extension and return their names and sequences.
- Parameters:
directory (str) – Directory containing the files to parse.
extension (str) – File extension to filter files (default is “.seq”).
- Returns:
A tuple of two lists: the sequence names and the sequences.
- Return type:
(list of str, list of str)
- dscript.fasta.parse_from_list(f: str, names: list[str])[source]#
Parse a FASTA file and return a dictionary of sequences for specified names.
- Parameters:
f (str) – The FASTA file to parse (file-like object).
names (list of str) – List of sequence names to extract from the FASTA file.
- Returns:
A dictionary where keys are sequence names and values are sequences.
- Return type:
dict
dscript.foldseek#
- dscript.foldseek.get_3di_sequences(pdb_files: list[str])[source]#
Extract 3Di sequences from PDB/mmCIF files using biotite.structure.alphabet.to_3di(atoms). Returns a dict {basename: SeqRecord}.
At this time, this function extracts a 3Di sequence only for the first chain in each PDB file; extracting multiple chains requires modifying this function. This keeps naming consistent with the rest of the D-SCRIPT training and inference scripts, which currently require PDB file names to match FASTA header names.
dscript.glider#
- dscript.glider.compute_cw_score(p, q, edgedict, ndict, params=None)[source]#
Computes the common weighted score between p and q.
- Parameters:
p – A node of the graph
q – Another node in the graph
edgedict (dict) – A dictionary with key (p, q) and value w.
ndict (dict) – A dictionary with key p and the value a set {p1, p2, …}
params (None) – Should always be None here
- Returns:
A real value representing the score
- Return type:
float
- dscript.glider.compute_cw_score_normalized(p, q, edgedict, ndict, params=None)[source]#
Computes the common weighted normalized score between p and q.
- Parameters:
p – A node of the graph
q – Another node in the graph
edgedict (dict) – A dictionary with key (p, q) and value w.
ndict (dict) – A dictionary with key p and the value a set {p1, p2, …}
params (None) – Should always be None here
- Returns:
A real value representing the score
- Return type:
float
- dscript.glider.create_edge_dict(edgelist)[source]#
Creates an edge dictionary with the edge (p, q) as the key, and weight w as the value.
- Parameters:
edgelist (list) – list with elements of form (p, q, w)
- Returns:
A dictionary with key (p, q) and value w.
- Return type:
dict
- dscript.glider.create_neighborhood_dict(edgelist)[source]#
Creates a dictionary with each node as the key and the set of its neighboring nodes as the value.
- Parameters:
edgelist (list) – A list with elements of form (p, q, w)
- Returns:
A dictionary with key p and value a set of neighbors {p1, p2, p3, …}
- Return type:
dict
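The two dictionary shapes described above can be illustrated with a small sketch. The function names are hypothetical, and treating edges as undirected when collecting neighbors is an assumption; the entries above do not say whether dscript does so.

```python
from collections import defaultdict


def edge_dict_sketch(edgelist):
    """Map each edge (p, q) to its weight w."""
    return {(p, q): w for p, q, w in edgelist}


def neighborhood_dict_sketch(edgelist):
    """Map each node to the set of its neighbors (edges treated as undirected)."""
    neighbors = defaultdict(set)
    for p, q, _w in edgelist:
        neighbors[p].add(q)
        neighbors[q].add(p)
    return dict(neighbors)


edges = [("a", "b", 0.9), ("b", "c", 0.4)]
edge_weights = edge_dict_sketch(edges)
neighborhoods = neighborhood_dict_sketch(edges)
```

These are the `edgedict` and `ndict` shapes that the scoring functions above take as arguments.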
- dscript.glider.densify(edgelist, dim=None, directed=False)[source]#
Given an adjacency list for the graph, computes the adjacency matrix.
- Parameters:
edgelist (list) – Graph adjacency list
dim (int) – Number of nodes in the graph
directed (bool) – Whether the graph should be treated as directed
- Returns:
Graph as an adjacency matrix
- Return type:
np.ndarray
- dscript.glider.get_dim(edgelist)[source]#
Given an adjacency list for a graph, returns the number of nodes in the graph.
- Parameters:
edgelist (list) – Graph adjacency list
- Returns:
Number of nodes in the graph
- Return type:
int
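A sketch of how an edge list can be turned into a dense matrix, and how the node count can be inferred from it. It assumes nodes are integers numbered from 0 and edges are (p, q, w) triples; both function names are hypothetical, and the sketch returns nested lists rather than an np.ndarray to stay dependency-free.

```python
def get_dim_sketch(edgelist):
    """Number of nodes, assuming nodes are integers numbered from 0."""
    return max(max(p, q) for p, q, _w in edgelist) + 1


def densify_sketch(edgelist, dim=None, directed=False):
    """Build a dim x dim weight matrix (nested lists) from (p, q, w) triples."""
    if dim is None:
        dim = get_dim_sketch(edgelist)
    matrix = [[0.0] * dim for _ in range(dim)]
    for p, q, w in edgelist:
        matrix[p][q] = w
        if not directed:
            # Mirror the entry so the matrix is symmetric.
            matrix[q][p] = w
    return matrix


adjacency = densify_sketch([(0, 1, 1.0), (1, 2, 0.5)])
one_way = densify_sketch([(0, 1, 1.0)], directed=True)
```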
- dscript.glider.glide_compute_map(pos_df, thres_p=0.9, params={})[source]#
Return glide_mat and glide_map.
- Parameters:
pos_df (pd.DataFrame) – Dataframe of weighted edges
thres_p (float) – Threshold to treat an edge as positive
params (dict) – Parameters for GLIDE
- Returns:
glide_matrix and corresponding glide_map
- Return type:
tuple(np.ndarray, dict)
- dscript.glider.glide_predict_links(edgelist, X, params={}, thres_p=0.9)[source]#
Predicts the most likely links in a graph given an embedding X of a graph. Returns a ranked list of (edges, distances) sorted from closest to furthest.
- Parameters:
edgelist (list) – A list with elements of type (p, q, wt)
X (np.ndarray) – An \(n \times k\) embedding matrix
params (dict) –
A dictionary with entries:
alpha: real number
beta: real number
delta: real number
loc: String, can be cw for common weighted, l3 for l3 local scoring
To enable ctypes, the following entries should be there:
ctypes_on: True (This key should only be added if ctypes is on)
so_location: String location of the .so dynamic library
thres_p (float) – Threshold percentile value
- Returns:
Glide matrix
- Return type:
np.ndarray
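As a sketch of the `params` dictionary described above: the keys come from the parameter list, but every value shown is a placeholder chosen for illustration, not a recommended or default setting, and the library path is hypothetical.

```python
# All numeric values below are placeholders for illustration only.
glide_params = {
    "alpha": 1.0,  # real number
    "beta": 1.0,   # real number
    "delta": 1.0,  # real number
    "loc": "cw",   # "cw" (common weighted) or "l3" (L3 local scoring)
}

# The ctypes entries are added only when the compiled library is used.
use_ctypes = False
if use_ctypes:
    glide_params["ctypes_on"] = True
    glide_params["so_location"] = "/path/to/library.so"  # hypothetical path
```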
dscript.language_model#
- dscript.language_model.embed_from_directory(directory, outputPath, device=0, verbose=False, extension='.seq')[source]#
Embed all files in a directory in .fasta format using pre-trained language model from Bepler & Berger.
- Parameters:
directory (str) – Input directory (.fasta format)
outputPath (str) – Output embedding file (.h5 format)
device (int) – Compute device to use for embeddings [default: 0]
verbose (bool) – Print embedding progress
extension (str) – Extension of all files to read in
- dscript.language_model.embed_from_fasta(fastaPath, outputPath, device=0, verbose=False)[source]#
Embed sequences using pre-trained language model from Bepler & Berger.
- Parameters:
fastaPath (str) – Input sequence file (.fasta format)
outputPath (str) – Output embedding file (.h5 format)
device (int) – Compute device to use for embeddings [default: 0]
verbose (bool) – Print embedding progress
- dscript.language_model.lm_embed(sequence, use_cuda=False)[source]#
Embed a single sequence using pre-trained language model from Bepler & Berger.
- Parameters:
sequence (str) – Input sequence to be embedded
use_cuda (bool) – Whether to generate embeddings using GPU device [default: False]
- Returns:
Embedded sequence
- Return type:
torch.Tensor
dscript.load_worker#
dscript.loading#
dscript.pretrained#
- dscript.pretrained.get_pretrained(version='human_v2')[source]#
Get pre-trained model object.
See the documentation for the most up-to-date list.
lm_v1 - Language model from Bepler & Berger.
human_v1 - Human trained model from D-SCRIPT manuscript.
human_v2 - Human trained model from Topsy-Turvy manuscript.
human_tt3d - Human trained model with FoldSeek sequence inputs.
Default: human_v2
- Parameters:
version (str) – Version of pre-trained model to get
- Returns:
Pre-trained model
- Return type:
dscript.models.*
- dscript.pretrained.get_state_dict(version='human_v2', verbose=True)[source]#
Download a pre-trained model if it does not already exist on the local device.
- Parameters:
version (str) – Version of trained model to download [default: human_v2]
verbose (bool) – Print model download status on stdout [default: True]
- Returns:
Path to state dictionary for pre-trained language model
- Return type:
str
dscript.utils#
- class dscript.utils.PairedDataset(X0, X1, Y)[source]#
Bases: Dataset
Dataset to be used by the PyTorch data loader for pairs of sequences and their labels.
- Parameters:
X0 – List of first item in the pair
X1 – List of second item in the pair
Y – List of labels
- dscript.utils.RBF(D, sigma=None, pseudocount=1e-10)[source]#
Convert distance matrix into similarity matrix using Radial Basis Function (RBF) Kernel.
\(RBF(x,x') = \exp{\frac{-(x - x')^{2}}{2\sigma^{2}}}\)
- Parameters:
D (np.ndarray) – Distance matrix
sigma (float) – Bandwidth of RBF Kernel [default: \(\sqrt{\text{max}(D)}\)]
- Returns:
Similarity matrix
- Return type:
np.ndarray
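The kernel formula above, applied entry-wise to a distance matrix, can be sketched without NumPy as follows. `rbf_sketch` is a hypothetical name; it uses the documented default bandwidth but ignores `pseudocount`, whose exact role is not described above, and it does not handle an all-zero matrix.

```python
import math


def rbf_sketch(distances, sigma=None):
    """Apply exp(-d^2 / (2 * sigma^2)) to every entry of a distance matrix."""
    if sigma is None:
        # Documented default bandwidth: square root of the largest distance.
        sigma = math.sqrt(max(max(row) for row in distances))
    return [
        [math.exp(-(d ** 2) / (2 * sigma ** 2)) for d in row]
        for row in distances
    ]


D = [[0.0, 2.0], [2.0, 0.0]]
S = rbf_sketch(D)
```

Zero distance maps to similarity 1, and similarity decays toward 0 as distance grows relative to `sigma`.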
- dscript.utils.load_hdf5_parallel(file_path, keys, n_jobs=-1, return_dict=True)[source]#
Load keys from hdf5 file into memory
- Parameters:
file_path (str) – Path to hdf5 file
keys (iterable[str]) – List of keys to get
- Returns:
If return_dict is True, a mapping of keys (protein names) to pointers to embeddings; otherwise, a list of pointers in the same order as keys.
- Return type:
dict or list
- dscript.utils.log(m, file=None, timestamped=True, print_also=False)[source]#
Legacy log function that wraps loguru for backward compatibility.
- Parameters:
m (str) – Message to log
file (file handle or None) – File handle to write to (if None, uses stdout)
timestamped (bool) – Whether to include timestamp (handled by loguru)
print_also (bool) – Whether to also print to stdout when writing to file
dscript.models.embedding#
- class dscript.models.embedding.FullyConnectedEmbed(nin, nout, dropout=0.5, activation=ReLU())[source]#
Bases: Module
Protein Projection Module. Takes embedding from language model and outputs low-dimensional interaction-aware projection.
- Parameters:
nin (int) – Size of language model output
nout (int) – Dimension of projection
dropout (float) – Proportion of weights to drop out [default: 0.5]
activation (torch.nn.Module) – Activation for linear projection model
- class dscript.models.embedding.IdentityEmbed(*args, **kwargs)[source]#
Bases: Module
Does not reduce the dimension of the language model embeddings, just passes them through to the contact model.
- class dscript.models.embedding.SkipLSTM(nin, nout, hidden_dim, num_layers, dropout=0, bidirectional=True)[source]#
Bases: Module
Language model from Bepler & Berger.
Loaded with pre-trained weights in embedding function.
- Parameters:
nin (int) – Input dimension of amino acid one-hot [default: 21]
nout (int) – Output dimension of final layer [default: 100]
hidden_dim (int) – Size of hidden dimension [default: 1024]
num_layers (int) – Number of stacked LSTM models [default: 3]
dropout (float) – Proportion of weights to drop out [default: 0]
bidirectional (bool) – Whether to use biLSTM vs. LSTM
dscript.models.contact#
- class dscript.models.contact.ContactCNN(embed_dim, hidden_dim=50, width=7, activation=Sigmoid())[source]#
Bases: Module
Residue Contact Prediction Module. Takes embeddings from Projection module and produces contact map, output of Contact module.
- Parameters:
embed_dim (int) – Output dimension of dscript.models.embedding model \(d\) [default: 100]
hidden_dim (int) – Hidden dimension \(h\) [default: 50]
width (int) – Width of convolutional filter \(2w+1\) [default: 7]
activation (torch.nn.Module) – Activation function for final contact map [default: torch.nn.Sigmoid()]
- cmap(z0, z1)[source]#
Calls dscript.models.contact.FullyConnected.
- Parameters:
z0 (torch.Tensor) – Projection module embedding \((b \times N \times d)\)
z1 (torch.Tensor) – Projection module embedding \((b \times M \times d)\)
- Returns:
Predicted contact broadcast tensor \((b \times N \times M \times h)\)
- Return type:
torch.Tensor
- class dscript.models.contact.FullyConnected(embed_dim, hidden_dim, activation=ReLU())[source]#
Bases: Module
Performs part 1 of Contact Prediction Module. Takes embeddings from Projection module and produces broadcast tensor.
Input embeddings of dimension \(d\) are combined into a \(2d\) length MLP input \(z_{cat}\), where \(z_{cat} = [z_0 \ominus z_1 | z_0 \odot z_1]\)
- Parameters:
embed_dim (int) – Output dimension of dscript.models.embedding model \(d\) [default: 100]
hidden_dim (int) – Hidden dimension \(h\) [default: 50]
activation (torch.nn.Module) – Activation function for broadcast tensor [default: torch.nn.ReLU()]
- batchnorm#
self.proj = nn.Linear(121, 100)
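The pairwise combination \(z_{cat} = [z_0 \ominus z_1 | z_0 \odot z_1]\) described for FullyConnected can be sketched in plain Python. This assumes \(\ominus\) denotes the element-wise absolute difference and \(\odot\) the element-wise product, omits the batch dimension, and stops before the MLP that maps each \(2d\) vector to \(h\) channels; `broadcast_pair_features` is a hypothetical name, not part of dscript.

```python
def broadcast_pair_features(z0, z1):
    """For each residue pair (i, j), concatenate |z0[i] - z1[j]| and z0[i] * z1[j]."""
    features = []
    for a in z0:  # N residues, each a length-d vector
        row = []
        for b in z1:  # M residues, each a length-d vector
            abs_diff = [abs(x - y) for x, y in zip(a, b)]
            product = [x * y for x, y in zip(a, b)]
            row.append(abs_diff + product)  # length 2d
        features.append(row)
    return features  # shape N x M x 2d


z0 = [[1.0, 2.0], [0.0, -1.0]]  # N = 2, d = 2
z1 = [[3.0, 1.0]]               # M = 1, d = 2
z_cat = broadcast_pair_features(z0, z1)
```

Both halves of each feature vector are symmetric in the two inputs, so swapping the proteins only transposes the first two axes of the result.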
dscript.models.interaction#
- class dscript.models.interaction.DSCRIPTModel(*args, **kwargs)[source]#
Bases: ModelInteraction, PyTorchModelHubMixin
- class dscript.models.interaction.LogisticActivation(x0=0, k=1, train=False)[source]#
Bases: Module
Implementation of Generalized Sigmoid. Applies the element-wise function:
\(\sigma(x) = \frac{1}{1 + \exp(-k(x-x_0))}\)
- Parameters:
x0 (float) – The value of the sigmoid midpoint
k (float) – The slope of the sigmoid, \(k \geq 0\); trainable when train is True
train (bool) – Whether \(k\) is a trainable parameter
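A scalar sketch of the generalized sigmoid above, showing the roles of `x0` and `k`. The function name `logistic_activation` is illustrative; the real class is a torch module applied element-wise to tensors, with `k` optionally trainable.

```python
import math


def logistic_activation(x, x0=0.0, k=1.0):
    """Generalized sigmoid 1 / (1 + exp(-k * (x - x0))) for a scalar x."""
    return 1.0 / (1.0 + math.exp(-k * (x - x0)))


midpoint = logistic_activation(0.5, x0=0.5, k=4.0)
steep = logistic_activation(1.0, x0=0.5, k=20.0)
shallow = logistic_activation(1.0, x0=0.5, k=1.0)
```

At `x == x0` the output is 0.5 regardless of `k`; a larger `k` pushes values on either side of the midpoint closer to 0 or 1.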
- class dscript.models.interaction.ModelInteraction(embedding, contact, use_cuda, do_w=True, do_sigmoid=True, do_pool=False, pool_size=9, theta_init=1, lambda_init=0, gamma_init=0)[source]#
Bases: Module
- cpred(z0, z1, embed_foldseek=False, f0=None, f1=None)[source]#
Predict the contact map between two proteins from their language model embeddings.
- Parameters:
z0 (torch.Tensor) – Language model embedding \((b \times N \times d_0)\)
z1 (torch.Tensor) – Language model embedding \((b \times M \times d_0)\)
- Returns:
Predicted contact map \((b \times N \times M)\)
- Return type:
torch.Tensor
- embed(x)[source]#
Project down input language model embeddings into low dimension using projection module
- Parameters:
x (torch.Tensor) – Language model embedding \((b \times N \times d_0)\)
- Returns:
D-SCRIPT projection \((b \times N \times d)\)
- Return type:
torch.Tensor
- map_predict(z0, z1, embed_foldseek=False, f0=None, f1=None)[source]#
Predict the contact map and the probability of interaction for a pair of proteins from their language model embeddings.
- Parameters:
z0 (torch.Tensor) – Language model embedding \((b \times N \times d_0)\)
z1 (torch.Tensor) – Language model embedding \((b \times M \times d_0)\)
- Returns:
Predicted contact map, predicted probability of interaction \((b \times N \times M), (1)\)
- Return type:
torch.Tensor, torch.Tensor
- predict(z0, z1, embed_foldseek=False, f0=None, f1=None)[source]#
Predict the probability of interaction for a pair of proteins from their language model embeddings.
- Parameters:
z0 (torch.Tensor) – Language model embedding \((b \times N \times d_0)\)
z1 (torch.Tensor) – Language model embedding \((b \times M \times d_0)\)
- Returns:
Predicted probability of interaction
- Return type:
torch.Tensor