API

dscript.alphabets

class dscript.alphabets.Alphabet(chars, encoding=None, mask=False, missing=255)[source]

Bases: object

From Bepler & Berger.

Parameters

chars (byte str) – List of characters in alphabet
encoding (np.ndarray) – Mapping of characters to numbers [default: None]
mask (bool) – Set encoding mask [default: False]
missing (int) – Number to use for a value outside the alphabet [default: 255]

decode(x)[source]

Decode numeric encoding to byte string of this alphabet

Parameters: x (np.ndarray) – Numeric encoding
Returns: Amino acid string
Return type: byte str

encode(x)[source]

Encode a byte string into alphabet indices

Parameters: x (byte str) – Amino acid string
Returns: Numeric encoding
Return type: np.ndarray
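
The encode/decode round trip described above can be illustrated with a small lookup table. This is a minimal sketch, not the dscript source: the helper names (make_encoding, encode, decode) and the four-letter toy alphabet are hypothetical, and it assumes that characters outside the alphabet map to the missing value (255 by default).

```python
import numpy as np

# Hypothetical stand-in for the lookup-table idea; not the dscript implementation.
def make_encoding(chars: bytes, missing: int = 255) -> np.ndarray:
    """Build a 256-entry table mapping byte values to alphabet indices."""
    table = np.full(256, missing, dtype=np.uint8)
    table[np.frombuffer(chars, dtype=np.uint8)] = np.arange(len(chars), dtype=np.uint8)
    return table

def encode(seq: bytes, table: np.ndarray) -> np.ndarray:
    """Map each byte of seq to its alphabet index (or to the missing value)."""
    return table[np.frombuffer(seq, dtype=np.uint8)]

def decode(idx: np.ndarray, chars: bytes) -> bytes:
    """Map alphabet indices back to the characters of the alphabet."""
    return np.frombuffer(chars, dtype=np.uint8)[idx].tobytes()

chars = b"ACDE"                        # toy alphabet, for illustration only
table = make_encoding(chars)
encoded = encode(b"CADX", table)       # "X" is outside the toy alphabet
decoded = decode(encoded[:3], chars)   # decode only the in-alphabet positions
```

Using a fixed 256-entry table keeps encoding a single vectorized index operation, at the cost of restricting the alphabet to single-byte characters.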

get_kmer(h, k)[source]: retrieve byte string of length k decoded from integer h

unpack(h, k)[source]: unpack integer h into array of this alphabet with length k
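
One way to read unpack and get_kmer is as a base-n expansion of the integer h, where n is the alphabet size. The sketch below assumes that reading and assumes the most significant digit comes first; both are assumptions, and the function and variable names are hypothetical.

```python
import numpy as np

def unpack_sketch(h: int, k: int, size: int) -> np.ndarray:
    """Expand integer h into k base-`size` digits, most significant first."""
    kmer = np.zeros(k, dtype=np.uint8)
    for i in reversed(range(k)):
        kmer[i] = h % size
        h //= size
    return kmer

def get_kmer_sketch(h: int, k: int, chars: bytes) -> bytes:
    """Decode the unpacked digits into a byte string over `chars`."""
    idx = unpack_sketch(h, k, len(chars))
    return np.frombuffer(chars, dtype=np.uint8)[idx].tobytes()

chars = b"ACDE"                             # toy alphabet of size 4
digits = unpack_sketch(27, 3, len(chars))   # 27 = 1*16 + 2*4 + 3
kmer = get_kmer_sketch(27, 3, chars)
```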

class dscript.alphabets.SDM12(mask=False)[source]

Bases: dscript.alphabets.Alphabet

A D KER N TSQ YF LIVM C W H G P

See https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2732308/#B33 “Reduced amino acid alphabets exhibit an improved sensitivity and selectivity in fold assignment” Peterson et al. 2009. Bioinformatics.
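
The grouping string above assigns the 20 standard amino acids to 12 classes. As a rough illustration, the sketch below builds a residue-to-group mapping from that string. The group indices (0 to 11 in the order listed) are an assumption; the indices used by the library may differ.

```python
# Group string copied from the class description above.
GROUPS = "A D KER N TSQ YF LIVM C W H G P".split()

# Assumption: groups are numbered in the order they are listed.
residue_to_group = {aa: i for i, group in enumerate(GROUPS) for aa in group}

def reduce_sequence(seq: str) -> list:
    """Replace each residue with the index of its reduced-alphabet group."""
    return [residue_to_group[aa] for aa in seq]

reduced = reduce_sequence("MKELV")   # M, L, V share a group; so do K and E
```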

class dscript.alphabets.Uniprot21(mask=False)[source]

Bases: dscript.alphabets.Alphabet

Uniprot 21 Amino Acid Encoding.

From Bepler & Berger.

dscript.fasta

dscript.fasta.parse(f: str)[source]

Parse a FASTA file and return a tuple of sequence names and sequences.

Parameters: f (file-like object) – file-like object representing the FASTA file to parse.
Returns: A tuple (names, sequences), where names is a list of sequence names and sequences is the list of corresponding sequences.
Return type: (list of str, list of str)
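
To make the (names, sequences) return shape concrete, here is a minimal, self-contained FASTA reader. It is an illustrative sketch rather than the library's parser: parse_fasta_sketch is a hypothetical name, and details such as comment handling or header trimming in dscript.fasta.parse are not covered.

```python
import io

def parse_fasta_sketch(handle):
    """Read FASTA records from a text handle into parallel lists."""
    names, sequences, chunks = [], [], []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if names:                       # close the previous record
                sequences.append("".join(chunks))
            names.append(line[1:])
            chunks = []
        elif names:                         # ignore text before the first header
            chunks.append(line)
    if names:                               # close the final record
        sequences.append("".join(chunks))
    return names, sequences

example = io.StringIO(">P1\nMKT\nLLV\n>P2\nGAV\n")
names, seqs = parse_fasta_sketch(example)
```

A dictionary keyed by name, as returned by parse_dict, could then be built with dict(zip(names, seqs)).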

dscript.fasta.parse_dict(f: str)[source]

Parse a FASTA file and return a dictionary of sequences.

Parameters: f (str) – The FASTA file to parse (file-like object).
Returns: A dictionary where keys are sequence names and values are sequences.
Return type: dict

dscript.fasta.parse_directory(directory, extension='.seq')[source]

Parse all files in a directory with a specific extension and return their names and sequences.

Parameters

directory (str) – Directory containing the files to parse.
extension (str) – File extension to filter files (default is “.seq”).

Returns

A tuple (names, sequences), where names is a list of sequence names and sequences is the list of corresponding sequences.

Return type

(list of str, list of str)

dscript.fasta.parse_from_list(f: str, names: list)[source]

Parse a FASTA file and return a dictionary of sequences for specified names.

Parameters

f (str) – The FASTA file to parse (file-like object).
names (list of str) – List of sequence names to extract from the FASTA file.

Returns

A dictionary where keys are sequence names and values are sequences.

Return type

dict

dscript.fasta.write(nam, seq, f)[source]

Write a set of sequences to a FASTA file.

Parameters

nam (list of str) – A list of keys (sequence names).
seq (list of str) – A list of sequences.
f (file-like object) – The file to write to.
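
The writer is the inverse of parsing: one header line per name followed by its sequence. The sketch below assumes one sequence per line with no line wrapping, which may not match the library's output format; write_fasta_sketch is a hypothetical name.

```python
import io

def write_fasta_sketch(names, sequences, handle):
    """Write paired names and sequences to a text handle in FASTA format."""
    for name, seq in zip(names, sequences):
        handle.write(f">{name}\n{seq}\n")

buffer = io.StringIO()
write_fasta_sketch(["P1", "P2"], ["MKTLLV", "GAV"], buffer)
text = buffer.getvalue()
```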

dscript.foldseek

dscript.foldseek.get_3di_sequences(pdb_files: list)[source]

Extract 3Di sequences from PDB/mmCIF files using biotite.structure.alphabet.to_3di(atoms). Returns a dict {basename: SeqRecord}.

At this time, this function extracts a 3Di sequence only for the first chain in each PDB file. If you need to extract multiple chains, you will need to modify this function. This keeps naming consistent with the rest of the D-SCRIPT training and inference scripts, which currently require that PDB file names match FASTA header names.

dscript.foldseek.get_foldseek_onehot(n0, size_n0, fold_record, fold_vocab)[source]: fold_record is just a dictionary {ensembl_gene_name => foldseek_sequence}

dscript.glider

dscript.glider.compute_X_normalized(A, D, t=-1, lm=1, is_normalized=True)[source]

dscript.glider.compute_cw_score(p, q, edgedict, ndict, params=None)[source]

Computes the common weighted score between p and q.

Parameters

p – A node of the graph
q – Another node in the graph
edgedict (dict) – A dictionary with key (p, q) and value w.
ndict (dict) – A dictionary with key p and the value a set {p1, p2, …}
params (None) – Should always be None here

Returns

A real value representing the score

Return type

float

dscript.glider.compute_cw_score_normalized(p, q, edgedict, ndict, params=None)[source]

Computes the common weighted normalized score between p and q.

Parameters

p – A node of the graph
q – Another node in the graph
edgedict (dict) – A dictionary with key (p, q) and value w.
ndict (dict) – A dictionary with key p and the value a set {p1, p2, …}
params (None) – Should always be None here

Returns

A real value representing the score

Return type

float

dscript.glider.compute_degree_vec(edgelist)[source]

dscript.glider.compute_l3_score_mat(p, q, edgedict, ndict, params=None)[source]

dscript.glider.compute_l3_unweighted_mat(A)[source]

dscript.glider.compute_l3_weighted_mat(A)[source]

dscript.glider.compute_pinverse_diagonal(D)[source]

dscript.glider.create_edge_dict(edgelist)[source]

Creates an edge dictionary with the edge (p, q) as the key, and weight w as the value.

Parameters: edgelist (list) – list with elements of form (p, q, w)
Returns: A dictionary with key (p, q) and value w.
Return type: dict

dscript.glider.create_neighborhood_dict(edgelist)[source]

Creates a dictionary with each node as the key and the set of its neighboring nodes as the value.

Parameters: edgelist (list) – A list with elements of form (p, q, w)
Returns: A dictionary with key p and, as its value, the set {p1, p2, p3, …} of neighbors of p
Return type: dict
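
The two dictionaries described above can be built from an edge list in a few lines. This is a sketch under stated assumptions: the function names are hypothetical, and the neighborhood dictionary treats the graph as undirected (each edge registers both endpoints as neighbors), which the documentation does not confirm.

```python
def edge_dict_sketch(edgelist):
    """Map each edge (p, q) to its weight w."""
    return {(p, q): w for p, q, w in edgelist}

def neighborhood_dict_sketch(edgelist):
    """Map each node to the set of nodes it shares an edge with."""
    ndict = {}
    for p, q, _ in edgelist:
        ndict.setdefault(p, set()).add(q)
        ndict.setdefault(q, set()).add(p)   # assumption: undirected graph
    return ndict

edges = [("a", "b", 0.9), ("b", "c", 0.4)]
edgedict = edge_dict_sketch(edges)
ndict = neighborhood_dict_sketch(edges)
```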

dscript.glider.densify(edgelist, dim=None, directed=False)[source]

Given an adjacency list for the graph, computes the adjacency matrix.

Parameters

edgelist (list) – Graph adjacency list
dim (int) – Number of nodes in the graph
directed (bool) – Whether the graph should be treated as directed

Returns

Graph as an adjacency matrix

Return type

np.ndarray

dscript.glider.get_dim(edgelist)[source]

Given an adjacency list for a graph, returns the number of nodes in the graph.

Parameters: edgelist (list) – Graph adjacency list
Returns: Number of nodes in the graph
Return type: int
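
As an illustration of densify and get_dim, the sketch below converts a weighted edge list into a NumPy adjacency matrix. It assumes nodes are labelled with integers 0 to n-1 so that they can be used directly as matrix indices; the function names are hypothetical and the library's handling of node labels may differ.

```python
import numpy as np

def count_nodes_sketch(edgelist):
    """Count the distinct nodes that appear in the edge list."""
    nodes = set()
    for p, q, _ in edgelist:
        nodes.update((p, q))
    return len(nodes)

def to_adjacency_sketch(edgelist, dim=None, directed=False):
    """Fill a dim x dim matrix with edge weights (symmetric unless directed)."""
    if dim is None:
        dim = count_nodes_sketch(edgelist)
    A = np.zeros((dim, dim), dtype=float)
    for p, q, w in edgelist:
        A[p, q] = w
        if not directed:
            A[q, p] = w
    return A

edges = [(0, 1, 0.5), (1, 2, 1.0)]   # assumption: integer node labels 0..n-1
A = to_adjacency_sketch(edges)
```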

dscript.glider.glide_compute_map(pos_df, thres_p=0.9, params={})[source]

Return glide_mat and glide_map.

Parameters

pos_df (pd.DataFrame) – Dataframe of weighted edges
thres_p (float) – Threshold to treat an edge as positive
params (dict) – Parameters for GLIDE

Returns

glide_matrix and corresponding glide_map

Return type

tuple(np.ndarray, dict)

dscript.glider.glide_predict_links(edgelist, X, params={}, thres_p=0.9)[source]

Predicts the most likely links in a graph given an embedding X of a graph. Returns a ranked list of (edges, distances) sorted from closest to furthest.

Parameters

edgelist (list) – A list with elements of type (p, q, wt)
X (np.ndarray) – An n × k embedding matrix
params (dict) –
A dictionary with entries:
- alpha: real number
- beta: real number
- delta: real number
- loc: String, can be cw for common weighted, l3 for l3 local scoring
To enable ctypes, the following entries should be there:
- ctypes_on: True (This key should only be added if ctypes is on)
- so_location: String location of the .so dynamic library
thres_p (float) – Threshold percentile value

Returns

Glide matrix

Return type

np.ndarray
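
The params dictionary described above might be assembled as follows. The key names come from the parameter list; the numeric values are placeholders chosen for illustration, not recommended settings, and the shared-library path is hypothetical.

```python
# Placeholder values for illustration only; they are not recommended settings.
params = {
    "alpha": 1.0,   # real number (placeholder)
    "beta": 1.0,    # real number (placeholder)
    "delta": 1.0,   # real number (placeholder)
    "loc": "cw",    # "cw" = common weighted, "l3" = L3 local scoring
}

# Optional ctypes entries, added only when ctypes is on.
# The .so path below is hypothetical.
ctypes_params = dict(params, ctypes_on=True, so_location="/path/to/library.so")
```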

dscript.glider.glider_score(p, q, glider_map, glider_mat)[source]

dscript.language_model

dscript.language_model.embed_from_directory(directory, outputPath, device=0, verbose=False, extension='.seq')[source]

Embed all .fasta-format files in a directory using the pre-trained language model from Bepler & Berger.

Parameters

directory (str) – Input directory (.fasta format)
outputPath (str) – Output embedding file (.h5 format)
device (int) – Compute device to use for embeddings [default: 0]
verbose (bool) – Print embedding progress
extension (str) – Extension of all files to read in

dscript.language_model.embed_from_fasta(fastaPath, outputPath, device=0, verbose=False)[source]

Embed sequences using pre-trained language model from Bepler & Berger.

Parameters

fastaPath (str) – Input sequence file (.fasta format)
outputPath (str) – Output embedding file (.h5 format)
device (int) – Compute device to use for embeddings [default: 0]
verbose (bool) – Print embedding progress

dscript.language_model.lm_embed(sequence, use_cuda=False)[source]

Embed a single sequence using pre-trained language model from Bepler & Berger.

Parameters

sequence (str) – Input sequence to be embedded
use_cuda (bool) – Whether to generate embeddings using GPU device [default: False]

Returns

Embedded sequence

Return type

torch.Tensor

dscript.load_worker

dscript.loading

class dscript.loading.LoadingPool(file_path, n_jobs=-1, timeout=60)[source]

Bases: object

load(keys, progress=False)[source]

load_once(keys, progress=True)[source]

shutdown()[source]

dscript.pretrained

dscript.pretrained.get_pretrained(version='human_v2')[source]

Get pre-trained model object.

See the documentation for the most up-to-date list.

lm_v1 - Language model from Bepler & Berger.
human_v1 - Human trained model from D-SCRIPT manuscript.
human_v2 - Human trained model from Topsy-Turvy manuscript.
human_tt3d - Human trained model with FoldSeek sequence inputs

Default: human_v2

Parameters: version (str) – Version of pre-trained model to get
Returns: Pre-trained model
Return type: dscript.models.*

dscript.pretrained.get_state_dict(version='human_v2', verbose=True)[source]

Download a pre-trained model if it does not already exist on the local device.

Parameters

version (str) – Version of trained model to download [default: human_v2]
verbose (bool) – Print model download status on stdout [default: True]

Returns

Path to state dictionary for pre-trained language model

Return type

str

dscript.pretrained.get_state_dict_path(version: str) → str[source]

dscript.pretrained.retry(retry_count: int)[source]

dscript.utils

class dscript.utils.PairedDataset(X0, X1, Y)[source]

Bases: Generic[torch.utils.data.dataset._T_co]

Dataset to be used by the PyTorch data loader for pairs of sequences and their labels.

Parameters

X0 – List of first item in the pair
X1 – List of second item in the pair
Y – List of labels

dscript.utils.RBF(D, sigma=None)[source]

Convert distance matrix into similarity matrix using Radial Basis Function (RBF) Kernel.

\(RBF(x,x') = \exp{\frac{-(x - x')^{2}}{2\sigma^{2}}}\)

Parameters

D (np.ndarray) – Distance matrix
sigma (float) – Bandwidth of RBF Kernel [default: \(\sqrt{\text{max}(D)}\)]

Returns

Similarity matrix

Return type

np.ndarray
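
The kernel formula above translates directly into NumPy. This sketch assumes each entry of D plays the role of the distance |x - x'| in the formula and uses the documented default bandwidth, the square root of the largest entry of D; rbf_sketch is a hypothetical name.

```python
import numpy as np

def rbf_sketch(D, sigma=None):
    """Apply exp(-D^2 / (2 * sigma^2)) elementwise to a distance matrix."""
    D = np.asarray(D, dtype=float)
    if sigma is None:
        sigma = np.sqrt(D.max())   # documented default bandwidth
    return np.exp(-np.square(D) / (2 * sigma**2))

D = np.array([[0.0, 2.0],
              [2.0, 0.0]])
S = rbf_sketch(D)   # here sigma^2 = 2, so off-diagonal entries equal exp(-1)
```

Zero distances map to a similarity of 1, and similarity decays toward 0 as distance grows.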

dscript.utils.collate_paired_sequences(args)[source]: Collate function for PyTorch data loader.
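
The paired dataset and its collate function follow the usual map-style dataset pattern: indexing returns one (first, second, label) triple, and collation regroups a batch of triples into three parallel collections. The sketch below shows that pattern in plain Python so it runs without PyTorch. The class and function names are hypothetical, and any tensor stacking the library performs on the labels is omitted.

```python
class PairedListSketch:
    """Plain-Python stand-in for a dataset of (x0, x1, y) triples."""

    def __init__(self, X0, X1, Y):
        if not (len(X0) == len(X1) == len(Y)):
            raise ValueError("X0, X1 and Y must have the same length")
        self.X0, self.X1, self.Y = X0, X1, Y

    def __len__(self):
        return len(self.X0)

    def __getitem__(self, i):
        return self.X0[i], self.X1[i], self.Y[i]

def collate_pairs_sketch(batch):
    """Regroup a list of (x0, x1, y) triples into three parallel lists."""
    x0 = [item[0] for item in batch]
    x1 = [item[1] for item in batch]
    y = [item[2] for item in batch]
    return x0, x1, y

ds = PairedListSketch(["A", "B"], ["C", "D"], [1, 0])
batch = collate_pairs_sketch([ds[i] for i in range(len(ds))])
```

Keeping the sequences as lists rather than stacking them is a natural choice when the items have different lengths, as protein embeddings typically do.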

dscript.utils.load_hdf5_parallel(file_path, keys, n_jobs=-1, return_dict=True)[source]

Load keys from hdf5 file into memory

Parameters

file_path (str) – Path to hdf5 file
keys (iterable[str]) – List of keys to get

Returns

If return_dict is True, a mapping of keys (protein names) to pointers to embeddings; otherwise, a list of pointers in the same order as keys.

Return type

dict or list

dscript.utils.log(m, file=None, timestamped=True, print_also=False)[source]

Legacy log function that wraps loguru for backward compatibility.

Parameters

m (str) – Message to log
file (file handle or None) – File handle to write to (if None, uses stdout)
timestamped (bool) – Whether to include timestamp (handled by loguru)
print_also (bool) – Whether to also print to stdout when writing to file

dscript.utils.parse_device(device_arg, logFile)[source]

dscript.utils.setup_logger(log_file=None, also_stdout=False)[source]

Setup loguru logger for D-SCRIPT.

Parameters

log_file (file handle, str, or None) – File handle or path to write logs to
also_stdout (bool) – Whether to also log to stdout