API#

dscript.alphabets#

class dscript.alphabets.Alphabet(chars, encoding=None, mask=False, missing=255)[source]#

Bases: object

From Bepler & Berger.

Parameters:

chars (byte str) – List of characters in alphabet
encoding (np.ndarray) – Mapping of characters to numbers [default: None]
mask (bool) – Set encoding mask [default: False]
missing (int) – Number to use for a value outside the alphabet [default: 255]

decode(x)[source]#

Decode numeric encoding to byte string of this alphabet

Parameters: x (np.ndarray) – Numeric encoding
Returns: Amino acid string
Return type: byte str

encode(x)[source]#

Encode a byte string into alphabet indices

Parameters: x (byte str) – Amino acid string
Returns: Numeric encoding
Return type: np.ndarray

get_kmer(h, k)[source]#: retrieve byte string of length k decoded from integer h

unpack(h, k)[source]#: unpack integer h into array of this alphabet with length k
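The lookup-table idea behind encode, decode, and unpack can be sketched in pure Python. This is an illustrative stand-in, not the dscript implementation: the helper names (build_table, encode_bytes, decode_codes, unpack_kmer) are hypothetical, and the digit order used in unpack_kmer (most significant digit first) is an assumption.

```python
def build_table(chars: bytes, missing: int = 255) -> list[int]:
    """Map each of the 256 byte values to an alphabet index, or to `missing`."""
    table = [missing] * 256
    for index, byte in enumerate(chars):
        table[byte] = index
    return table


def encode_bytes(seq: bytes, table: list[int]) -> list[int]:
    """Look up every byte of the sequence in the table."""
    return [table[byte] for byte in seq]


def decode_codes(codes: list[int], chars: bytes) -> bytes:
    """Invert the encoding by indexing back into the alphabet."""
    return bytes(chars[code] for code in codes)


def unpack_kmer(h: int, k: int, size: int) -> list[int]:
    """Write integer h as k digits in base `size` (assumed digit order)."""
    digits = [0] * k
    for position in reversed(range(k)):
        digits[position] = h % size
        h //= size
    return digits


chars = b"ACDE"  # toy four-letter alphabet, for illustration only
table = build_table(chars)
codes = encode_bytes(b"CAD", table)
```

Bytes outside the toy alphabet map to the missing value, which mirrors the role of the missing parameter described above.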

class dscript.alphabets.SDM12(mask=False)[source]#

Bases: Alphabet

A D KER N TSQ YF LIVM C W H G P

See Peterson et al. 2009, “Reduced amino acid alphabets exhibit an improved sensitivity and selectivity in fold assignment”, Bioinformatics: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2732308/#B33
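As a rough illustration of what the grouping above means, the sketch below maps each of the 20 amino acids to the index of its group. The names GROUPS, GROUP_OF, and reduce_sequence are hypothetical, and numbering the groups in the order they are listed is an assumption, not something the documentation states.

```python
# The 12 groups exactly as listed in the SDM12 docstring.
GROUPS = "A D KER N TSQ YF LIVM C W H G P".split()

# Assumption: groups are numbered in the order they are listed.
GROUP_OF = {
    residue: index
    for index, group in enumerate(GROUPS)
    for residue in group
}


def reduce_sequence(seq: str) -> list[int]:
    """Replace each residue by the index of its reduced-alphabet group."""
    return [GROUP_OF[residue] for residue in seq]


reduced = reduce_sequence("KELV")
```

Residues in the same group (K and E, or L and V) collapse to the same index, which is the point of a reduced alphabet.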

class dscript.alphabets.Uniprot21(mask=False)[source]#

Bases: Alphabet

Uniprot 21 Amino Acid Encoding.

From Bepler & Berger.

dscript.fasta#

dscript.fasta.parse(f: str)[source]#

Parse a FASTA file and return a tuple of sequence names and sequences.

Parameters: f (file-like object) – File-like object representing the FASTA file to parse.
Returns: A tuple containing:
- list of str: Sequence names.
- list of str: Sequences.
Return type: (list of str, list of str)
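To make the documented return shape concrete, here is a minimal FASTA reader written from scratch. It is a sketch under stated assumptions, not the code behind dscript.fasta.parse: the name parse_fasta is hypothetical, the input is assumed to be a text-mode file-like object, and the whole header line after ">" is used as the sequence name.

```python
import io


def parse_fasta(handle):
    """Return (names, sequences) from a FASTA-formatted text handle."""
    names, chunks = [], []
    for raw_line in handle:
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            names.append(line[1:])  # header text without the leading ">"
            chunks.append([])
        elif chunks:
            chunks[-1].append(line)  # a sequence may span several lines
    return names, ["".join(parts) for parts in chunks]


names, sequences = parse_fasta(io.StringIO(">p1\nMKT\nLLV\n>p2\nGAG\n"))

# Name-to-sequence mapping, the shape a dictionary-returning parser would give.
records = dict(zip(names, sequences))
```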

dscript.fasta.parse_dict(f: str)[source]#

Parse a FASTA file and return a dictionary of sequences.

Parameters: f (str) – The FASTA file to parse (file-like object).
Returns: A dictionary where keys are sequence names and values are sequences.
Return type: dict

dscript.fasta.parse_directory(directory, extension='.seq')[source]#

Parse all files in a directory with a specific extension and return their names and sequences.

Parameters:

directory (str) – Directory containing the files to parse.
extension (str) – File extension to filter files (default is “.seq”).

Returns:

A tuple containing:
- list of str: Sequence names.
- list of str: Sequences.

Return type:

(list of str, list of str)

dscript.fasta.parse_from_list(f: str, names: list[str])[source]#

Parse a FASTA file and return a dictionary of sequences for specified names.

Parameters:

f (str) – The FASTA file to parse (file-like object).
names (list of str) – List of sequence names to extract from the FASTA file.

Returns:

A dictionary where keys are sequence names and values are sequences.

Return type:

dict

dscript.fasta.write(nam, seq, f)[source]#

Write a set of sequences to a FASTA file.

Parameters:

nam (list of str) – A list of keys (sequence names).
seq (list of str) – A list of sequences.
f (file-like object) – The file to write to.
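The expected arguments can be illustrated with a small stand-in writer. The function write_fasta is hypothetical and the one-line-per-sequence layout is an assumption; the documentation only states that names and sequences are parallel lists and that f is a file-like object.

```python
import io


def write_fasta(names, sequences, handle):
    """Write parallel lists of names and sequences as FASTA records."""
    for name, sequence in zip(names, sequences):
        handle.write(f">{name}\n{sequence}\n")


buffer = io.StringIO()  # any writable text handle works, e.g. open(path, "w")
write_fasta(["p1", "p2"], ["MKTLLV", "GAG"], buffer)
text = buffer.getvalue()
```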

dscript.foldseek#

dscript.foldseek.get_3di_sequences(pdb_files: list[str])[source]#

Extract 3Di sequences from PDB/mmCIF files using biotite.structure.alphabet.to_3di(atoms). Returns a dict {basename: SeqRecord}.

Currently, this function extracts a 3Di sequence only for the first chain in each PDB file; extracting multiple chains requires modifying the function. This restriction keeps naming consistent with the rest of the D-SCRIPT training and inference scripts, which currently require PDB file names to match FASTA header names.

dscript.foldseek.get_foldseek_onehot(n0, size_n0, fold_record, fold_vocab)[source]#: fold_record is just a dictionary {ensembl_gene_name => foldseek_sequence}

dscript.glider#

dscript.glider.compute_X_normalized(A, D, t=-1, lm=1, is_normalized=True)[source]#

dscript.glider.compute_cw_score(p, q, edgedict, ndict, params=None)[source]#

Computes the common weighted score between p and q.

Parameters:

p – A node of the graph
q – Another node in the graph
edgedict (dict) – A dictionary with key (p, q) and value w.
ndict (dict) – A dictionary with key p and the value a set {p1, p2, …}
params (None) – Should always be None here

Returns:

A real value representing the score

Return type:

float
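The docstring does not spell out the scoring formula. One plausible reading, assumed here rather than taken from the source, is that the score sums the weights of the edges joining p and q to each neighbor they share. The function common_weighted_score and the toy graph are hypothetical.

```python
def common_weighted_score(p, q, edgedict, ndict):
    """Sum edge weights from p and q to every neighbor they have in common."""

    def weight(u, v):
        # Assumption: an undirected edge may be stored in either orientation.
        return edgedict.get((u, v), edgedict.get((v, u), 0.0))

    shared = ndict[p] & ndict[q]
    return sum(weight(p, node) + weight(q, node) for node in shared)


# Toy graph: a-c (0.5), b-c (0.25), a-d (1.0)
edgedict = {("a", "c"): 0.5, ("b", "c"): 0.25, ("a", "d"): 1.0}
ndict = {"a": {"c", "d"}, "b": {"c"}, "c": {"a", "b"}, "d": {"a"}}

score = common_weighted_score("a", "b", edgedict, ndict)
```

Here a and b share only the neighbor c, so the sketch returns the sum of the two edge weights into c.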

dscript.glider.compute_cw_score_normalized(p, q, edgedict, ndict, params=None)[source]#

Computes the common weighted normalized score between p and q.

Parameters:

p – A node of the graph
q – Another node in the graph
edgedict (dict) – A dictionary with key (p, q) and value w.
ndict (dict) – A dictionary with key p and the value a set {p1, p2, …}
params (None) – Should always be None here

Returns:

A real value representing the score

Return type:

float

dscript.glider.compute_degree_vec(edgelist)[source]#

dscript.glider.compute_l3_score_mat(p, q, edgedict, ndict, params=None)[source]#

dscript.glider.compute_l3_unweighted_mat(A)[source]#

dscript.glider.compute_l3_weighted_mat(A)[source]#

dscript.glider.compute_pinverse_diagonal(D)[source]#

dscript.glider.create_edge_dict(edgelist)[source]#

Creates an edge dictionary with the edge (p, q) as the key, and weight w as the value.

Parameters: edgelist (list) – A list with elements of form (p, q, w)
Returns: A dictionary with key (p, q) and value w.
Return type: dict

dscript.glider.create_neighborhood_dict(edgelist)[source]#

Creates a dictionary with each node as the key and the set of its neighboring nodes as the value.

Parameters: edgelist (list) – A list with elements of form (p, q, w)
Returns: A dictionary with key p and value a set {p1, p2, p3, …} of its neighbors
Return type: dict

dscript.glider.densify(edgelist, dim=None, directed=False)[source]#

Given an adjacency list for the graph, computes the adjacency matrix.

Parameters:

edgelist (list) – Graph adjacency list
dim (int) – Number of nodes in the graph
directed (bool) – Whether the graph should be treated as directed

Returns:

Graph as an adjacency matrix

Return type:

np.ndarray

dscript.glider.get_dim(edgelist)[source]#

Given an adjacency list for a graph, returns the number of nodes in the graph.

Parameters: edgelist (list) – Graph adjacency list
Returns: Number of nodes in the graph
Return type: int
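The relationship between an edge list, its node count, and its adjacency matrix can be sketched without NumPy. The helpers count_nodes and to_adjacency are hypothetical stand-ins; they assume edges have the form (p, q, w) and that nodes are already integer indices 0 … n-1, neither of which is stated for these two functions.

```python
def count_nodes(edgelist):
    """Number of distinct nodes that appear in any edge."""
    nodes = set()
    for p, q, _weight in edgelist:
        nodes.update((p, q))
    return len(nodes)


def to_adjacency(edgelist, dim=None, directed=False):
    """Build a dense dim x dim matrix (nested lists) from weighted edges."""
    if dim is None:
        dim = count_nodes(edgelist)
    matrix = [[0.0] * dim for _ in range(dim)]
    for p, q, weight in edgelist:
        matrix[p][q] = weight
        if not directed:
            matrix[q][p] = weight  # mirror the entry for undirected graphs
    return matrix


edges = [(0, 1, 0.5), (1, 2, 2.0)]
adjacency = to_adjacency(edges)
```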

dscript.glider.glide_compute_map(pos_df, thres_p=0.9, params={})[source]#

Return glide_mat and glide_map.

Parameters:

pos_df (pd.DataFrame) – Dataframe of weighted edges
thres_p (float) – Threshold to treat an edge as positive
params (dict) – Parameters for GLIDE

Returns:

glide_matrix and corresponding glide_map

Return type:

tuple(np.ndarray, dict)

dscript.glider.glide_predict_links(edgelist, X, params={}, thres_p=0.9)[source]#

Predicts the most likely links in a graph given an embedding X of a graph. Returns a ranked list of (edges, distances) sorted from closest to furthest.

Parameters:

edgelist (list) – A list with elements of type (p, q, wt)
X (np.ndarray) – An \(n \times k\) embedding matrix
params (dict) – A dictionary with entries:
- alpha: real number
- beta: real number
- delta: real number
- loc: string, either cw (common weighted) or l3 (L3 local scoring)
To enable ctypes, the following entries should also be present:
- ctypes_on: True (add this key only if ctypes is on)
- so_location: string, location of the .so dynamic library
thres_p (float) – Threshold percentile value

Returns:

Glide matrix

Return type:

np.ndarray
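The entries listed for params can be checked before calling the function. The validator below is a hypothetical helper, not part of dscript.glider: it treats the four listed entries as required, which is an assumption, and the numeric values in the example dictionary are placeholders rather than recommended settings.

```python
REQUIRED_KEYS = ("alpha", "beta", "delta", "loc")  # assumed to be required
LOCAL_SCORES = ("cw", "l3")  # the two values documented for "loc"


def check_glide_params(params: dict) -> list[str]:
    """Return a list of problems found in a GLIDE params dictionary."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in params:
            problems.append(f"missing key: {key}")
    if "loc" in params and params["loc"] not in LOCAL_SCORES:
        problems.append(f"unknown loc: {params['loc']}")
    if params.get("ctypes_on") and "so_location" not in params:
        problems.append("ctypes_on is set but so_location is missing")
    return problems


# Placeholder values for illustration only.
example = {"alpha": 1.0, "beta": 1.0, "delta": 1.0, "loc": "cw"}
problems = check_glide_params(example)
```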

dscript.glider.glider_score(p, q, glider_map, glider_mat)[source]#

dscript.language_model#

dscript.language_model.embed_from_directory(directory, outputPath, device=0, verbose=False, extension='.seq')[source]#

Embed all files in a directory in .fasta format using pre-trained language model from Bepler & Berger.

Parameters:

directory (str) – Input directory (.fasta format)
outputPath (str) – Output embedding file (.h5 format)
device (int) – Compute device to use for embeddings [default: 0]
verbose (bool) – Print embedding progress
extension (str) – Extension of all files to read in

dscript.language_model.embed_from_fasta(fastaPath, outputPath, device=0, verbose=False)[source]#

Embed sequences using pre-trained language model from Bepler & Berger.

Parameters:

fastaPath (str) – Input sequence file (.fasta format)
outputPath (str) – Output embedding file (.h5 format)
device (int) – Compute device to use for embeddings [default: 0]
verbose (bool) – Print embedding progress

dscript.language_model.lm_embed(sequence, use_cuda=False)[source]#

Embed a single sequence using pre-trained language model from Bepler & Berger.

Parameters:

sequence (str) – Input sequence to be embedded
use_cuda (bool) – Whether to generate embeddings using GPU device [default: False]

Returns:

Embedded sequence

Return type:

torch.Tensor

dscript.load_worker#

dscript.loading#

class dscript.loading.LoadingPool(file_path, n_jobs=-1, timeout=60)[source]#

Bases: object

load(keys, progress=False)[source]#

load_once(keys, progress=True)[source]#

shutdown()[source]#

dscript.pretrained#

dscript.pretrained.get_pretrained(version='human_v2')[source]#

Get pre-trained model object.

See the documentation for the most up-to-date list.

lm_v1 - Language model from Bepler & Berger.
human_v1 - Human trained model from D-SCRIPT manuscript.
human_v2 - Human trained model from Topsy-Turvy manuscript.
human_tt3d - Human trained model with FoldSeek sequence inputs

Default: human_v2

Parameters: version (str) – Version of pre-trained model to get
Returns: Pre-trained model
Return type: dscript.models.*

dscript.pretrained.get_state_dict(version='human_v2', verbose=True)[source]#

Download a pre-trained model if it does not already exist on the local device.

Parameters:

version (str) – Version of trained model to download [default: human_v2]
verbose (bool) – Print model download status on stdout [default: True]

Returns:

Path to state dictionary for pre-trained language model

Return type:

str

dscript.pretrained.get_state_dict_path(version: str) → str[source]#

dscript.pretrained.retry(retry_count: int)[source]#

dscript.utils#

class dscript.utils.PairedDataset(X0, X1, Y)[source]#

Bases: Dataset

Dataset to be used by the PyTorch data loader for pairs of sequences and their labels.

Parameters:

X0 – List of first item in the pair
X1 – List of second item in the pair
Y – List of labels

dscript.utils.RBF(D, sigma=None, pseudocount=1e-10)[source]#

Convert distance matrix into similarity matrix using Radial Basis Function (RBF) Kernel.

\(RBF(x,x') = \exp{\frac{-(x - x')^{2}}{2\sigma^{2}}}\)

Parameters:

D (np.ndarray) – Distance matrix
sigma (float) – Bandwidth of RBF Kernel [default: \(\sqrt{\text{max}(D)}\)]

Returns:

Similarity matrix

Return type:

np.ndarray
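The kernel formula can be worked through on a tiny distance matrix. This is a pure-Python sketch of the formula as documented, not the library function: rbf_similarity is a hypothetical name, each matrix entry is treated as the distance \(x - x'\), and the pseudocount argument is left out because its role is not described here.

```python
import math


def rbf_similarity(distances, sigma=None):
    """Apply exp(-d^2 / (2 * sigma^2)) to every entry of a distance matrix."""
    if sigma is None:
        # Documented default bandwidth: square root of the largest distance.
        sigma = math.sqrt(max(max(row) for row in distances))
    return [
        [math.exp(-(d ** 2) / (2 * sigma ** 2)) for d in row]
        for row in distances
    ]


D = [[0.0, 2.0], [2.0, 0.0]]
S = rbf_similarity(D)  # sigma defaults to sqrt(2), so off-diagonals are about exp(-1)
```

Zero distance maps to similarity 1, and similarity decays toward 0 as distance grows.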

dscript.utils.collate_paired_sequences(args)[source]#: Collate function for PyTorch data loader.

dscript.utils.load_hdf5_parallel(file_path, keys, n_jobs=-1, return_dict=True)[source]#

Load keys from hdf5 file into memory

Parameters:

file_path (str) – Path to hdf5 file
keys (iterable[str]) – List of keys to get

Returns:

If return_dict, a mapping of keys (protein names) to pointers to embeddings; otherwise, a list of pointers in the same order as keys

Return type:

dict if return_dict, otherwise list

dscript.utils.log(m, file=None, timestamped=True, print_also=False)[source]#

Legacy log function that wraps loguru for backward compatibility.

Parameters:

m (str) – Message to log
file (file handle or None) – File handle to write to (if None, uses stdout)
timestamped (bool) – Whether to include timestamp (handled by loguru)
print_also (bool) – Whether to also print to stdout when writing to file

dscript.utils.parse_device(device_arg, logFile)[source]#

dscript.utils.setup_logger(log_file=None, also_stdout=False)[source]#

Setup loguru logger for D-SCRIPT.

Parameters:

log_file (file handle, str, or None) – File handle or path to write logs to
also_stdout (bool) – Whether to also log to stdout

dscript.models.embedding#

class dscript.models.embedding.FullyConnectedEmbed(nin, nout, dropout=0.5, activation=ReLU())[source]#

Bases: Module

Protein Projection Module. Takes embedding from language model and outputs low-dimensional interaction aware projection.

Parameters:

nin (int) – Size of language model output
nout (int) – Dimension of projection
dropout (float) – Proportion of weights to drop out [default: 0.5]
activation (torch.nn.Module) – Activation for linear projection model

forward(x)[source]#

Parameters: x (torch.Tensor) – Input language model embedding \((b \times N \times d_0)\)
Returns: Low dimensional projection of embedding
Return type: torch.Tensor

class dscript.models.embedding.IdentityEmbed(*args, **kwargs)[source]#

Bases: Module

Does not reduce the dimension of the language model embeddings, just passes them through to the contact model.

forward(x)[source]#

Parameters: x (torch.Tensor) – Input language model embedding \((b \times N \times d_0)\)
Returns: Same embedding
Return type: torch.Tensor

class dscript.models.embedding.SkipLSTM(nin, nout, hidden_dim, num_layers, dropout=0, bidirectional=True)[source]#

Bases: Module

Language model from Bepler & Berger.

Loaded with pre-trained weights in embedding function.

Parameters:

nin (int) – Input dimension of amino acid one-hot [default: 21]
nout (int) – Output dimension of final layer [default: 100]
hidden_dim (int) – Size of hidden dimension [default: 1024]
num_layers (int) – Number of stacked LSTM models [default: 3]
dropout (float) – Proportion of weights to drop out [default: 0]
bidirectional (bool) – Whether to use biLSTM vs. LSTM

to_one_hot(x)[source]#

Transform numeric encoded amino acid vector to one-hot encoded vector

Parameters: x (torch.Tensor) – Input numeric amino acid encoding \((N)\)
Returns: One-hot encoding vector \((N \times n_{in})\)
Return type: torch.Tensor
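The shape change from \((N)\) to \((N \times n_{in})\) can be illustrated with plain lists. The function one_hot below is a hypothetical, torch-free stand-in for the idea and is not the method itself; it assumes every code lies in the range 0 … n_in-1.

```python
def one_hot(codes, n_in=21):
    """Turn a length-N list of integer codes into N rows of length n_in."""
    rows = []
    for code in codes:
        row = [0.0] * n_in
        row[code] = 1.0  # a single 1 at the position of the residue code
        rows.append(row)
    return rows


encoded = one_hot([0, 3, 20])
```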

transform(x)[source]#

Parameters: x (torch.Tensor) – Input numeric amino acid encoding \((N)\)
Returns: Concatenation of all hidden layers \((N \times (n_{in} + 2 \times \text{num_layers} \times \text{hidden_dim}))\)
Return type: torch.Tensor
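As a quick check of the output width formula, the sketch below plugs in the default values listed for SkipLSTM (21, 1024, and 3). The helper transform_width is hypothetical; the factor of 2 follows the formula above and corresponds to the bidirectional case.

```python
def transform_width(n_in=21, hidden_dim=1024, num_layers=3):
    """Per-residue feature width: one-hot input plus all biLSTM hidden states."""
    return n_in + 2 * num_layers * hidden_dim


width = transform_width()  # 21 + 2 * 3 * 1024 = 6165
```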

dscript.models.contact#

class dscript.models.contact.ContactCNN(embed_dim, hidden_dim=50, width=7, activation=Sigmoid())[source]#

Bases: Module

Residue Contact Prediction Module. Takes embeddings from Projection module and produces contact map, output of Contact module.

Parameters:

embed_dim (int) – Output dimension of dscript.models.embedding model \(d\) [default: 100]
hidden_dim (int) – Hidden dimension \(h\) [default: 50]
width (int) – Width of convolutional filter \(2w+1\) [default: 7]
activation (torch.nn.Module) – Activation function for final contact map [default: torch.nn.Sigmoid()]

cmap(z0, z1)[source]#

Calls dscript.models.contact.FullyConnected.

Parameters:

z0 (torch.Tensor) – Projection module embedding \((b \times N \times d)\)
z1 (torch.Tensor) – Projection module embedding \((b \times M \times d)\)

Returns:

Predicted contact broadcast tensor \((b \times N \times M \times h)\)

Return type:

torch.Tensor

forward(z0, z1)[source]#

Parameters:

z0 (torch.Tensor) – Projection module embedding \((b \times N \times d)\)
z1 (torch.Tensor) – Projection module embedding \((b \times M \times d)\)

Returns:

Predicted contact map \((b \times N \times M)\)

Return type:

torch.Tensor

predict(C)[source]#

Predict contact map from broadcast tensor.

Parameters: C (torch.Tensor) – Predicted contact broadcast \((b \times N \times M \times h)\)
Returns: Predicted contact map \((b \times N \times M)\)
Return type: torch.Tensor

class dscript.models.contact.FullyConnected(embed_dim, hidden_dim, activation=ReLU())[source]#

Bases: Module

Performs part 1 of Contact Prediction Module. Takes embeddings from Projection module and produces broadcast tensor.

Input embeddings of dimension \(d\) are combined into a \(2d\) length MLP input \(z_{cat}\), where \(z_{cat} = [z_0 \ominus z_1 | z_0 \odot z_1]\)

Parameters:

embed_dim (int) – Output dimension of dscript.models.embedding model \(d\) [default: 100]
hidden_dim (int) – Hidden dimension \(h\) [default: 50]
activation (torch.nn.Module) – Activation function for broadcast tensor [default: torch.nn.ReLU()]
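The construction of \(z_{cat}\) can be sketched for a single pair of proteins using nested lists. This is an illustration only, with the batch dimension dropped: pairwise_features is a hypothetical name, \(\odot\) is taken to be the element-wise product, and reading \(\ominus\) as the element-wise absolute difference is an assumption not stated in this docstring.

```python
def pairwise_features(z0, z1):
    """z0: N vectors of length d, z1: M vectors of length d -> N x M x 2d."""
    return [
        [
            # Assumed absolute difference, followed by the element-wise product.
            [abs(a - b) for a, b in zip(u, v)] + [a * b for a, b in zip(u, v)]
            for v in z1
        ]
        for u in z0
    ]


z0 = [[1.0, 2.0], [0.0, -1.0]]              # N = 2, d = 2
z1 = [[3.0, 2.0], [1.0, 1.0], [0.0, 4.0]]   # M = 3, d = 2
z_cat = pairwise_features(z0, z1)
```

Each of the \(N \times M\) residue pairs ends up with a feature vector of length \(2d\), which is what the MLP then maps to the hidden dimension \(h\).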

batchnorm#: self.proj = nn.Linear(121, 100)

forward(z0, z1)[source]#

Parameters:

z0 (torch.Tensor) – Projection module embedding \((b \times N \times d)\)
z1 (torch.Tensor) – Projection module embedding \((b \times M \times d)\)

Returns:

Predicted broadcast tensor \((b \times N \times M \times h)\)

Return type:

torch.Tensor

dscript.models.interaction#

class dscript.models.interaction.DSCRIPTModel(*args, **kwargs)[source]#: Bases: ModelInteraction, PyTorchModelHubMixin

class dscript.models.interaction.LogisticActivation(x0=0, k=1, train=False)[source]#

Bases: Module

Implementation of a generalized sigmoid. Applies the element-wise function:

\(\sigma(x) = \frac{1}{1 + \exp(-k(x-x_0))}\)

Parameters:

x0 (float) – The value of the sigmoid midpoint
k (float) – The slope of the sigmoid, \(k \geq 0\) (trainable if train is True)
train (bool) – Whether \(k\) is a trainable parameter

forward(x)[source]#

Applies the function to the input elementwise

Parameters: x (torch.Tensor) – \((N \times *)\) where \(*\) means any number of additional dimensions
Returns: \((N \times *)\), same shape as the input
Return type: torch.Tensor
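The generalized sigmoid formula can be evaluated for scalars with the standard library. The function logistic is a hypothetical scalar stand-in for the module and leaves out the trainable-parameter machinery.

```python
import math


def logistic(x, x0=0.0, k=1.0):
    """sigma(x) = 1 / (1 + exp(-k * (x - x0)))."""
    return 1.0 / (1.0 + math.exp(-k * (x - x0)))


midpoint_value = logistic(2.0, x0=2.0, k=5.0)  # at x = x0 the exponent is zero
```

At the midpoint \(x = x_0\) the output is 0.5 regardless of \(k\); larger \(k\) makes the transition around the midpoint steeper.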

class dscript.models.interaction.ModelInteraction(embedding, contact, use_cuda, do_w=True, do_sigmoid=True, do_pool=False, pool_size=9, theta_init=1, lambda_init=0, gamma_init=0)[source]#

Bases: Module

cpred(z0, z1, embed_foldseek=False, f0=None, f1=None)[source]#

Predict the contact map between two proteins from their input language model embeddings

Parameters:

z0 (torch.Tensor) – Language model embedding \((b \times N \times d_0)\)
z1 (torch.Tensor) – Language model embedding \((b \times M \times d_0)\)

Returns:

Predicted contact map \((b \times N \times M)\)

Return type:

torch.Tensor

embed(x)[source]#

Project down input language model embeddings into low dimension using projection module

Parameters: x (torch.Tensor) – Language model embedding \((b \times N \times d_0)\)
Returns: D-SCRIPT projection \((b \times N \times d)\)
Return type: torch.Tensor

map_predict(z0, z1, embed_foldseek=False, f0=None, f1=None)[source]#

Predict the contact map and the probability of interaction from input language model embeddings

Parameters:

z0 (torch.Tensor) – Language model embedding \((b \times N \times d_0)\)
z1 (torch.Tensor) – Language model embedding \((b \times M \times d_0)\)

Returns:

Predicted contact map, predicted probability of interaction \((b \times N \times M), (1)\)

Return type:

torch.Tensor, torch.Tensor

predict(z0, z1, embed_foldseek=False, f0=None, f1=None)[source]#

Predict the probability of interaction from input language model embeddings

Parameters:

z0 (torch.Tensor) – Language model embedding \((b \times N \times d_0)\)
z1 (torch.Tensor) – Language model embedding \((b \times M \times d_0)\)

Returns:

Predicted probability of interaction

Return type:

torch.Tensor
