API
dscript.alphabets
-
class
dscript.alphabets.Alphabet(chars, encoding=None, mask=False, missing=255)[source] Bases: object
From Bepler & Berger.
- Parameters
chars (byte str) – List of characters in alphabet
encoding (np.ndarray) – Mapping of characters to numbers [default: None]
mask (bool) – Set encoding mask [default: False]
missing (int) – Number to use for a value outside the alphabet [default: 255]
-
decode(x)[source] Decode numeric encoding to byte string of this alphabet
- Parameters
x (np.ndarray) – Numeric encoding
- Returns
Amino acid string
- Return type
byte str
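The lookup-table idea behind such an alphabet can be sketched in plain Python. `SketchAlphabet` and its `encode` method are hypothetical stand-ins written for illustration and are not the `dscript.alphabets.Alphabet` implementation; only the names `chars`, `missing` and `decode` come from the documentation above, and lists are used in place of `np.ndarray`.

```python
class SketchAlphabet:
    """Hypothetical stand-in: maps bytes to integer codes and back."""

    def __init__(self, chars: bytes, missing: int = 255):
        self.chars = chars
        self.missing = missing
        # Assumption: with no explicit encoding, the i-th character gets code i.
        self.table = {c: i for i, c in enumerate(chars)}

    def encode(self, x: bytes) -> list:
        # Bytes outside the alphabet map to the `missing` value.
        return [self.table.get(c, self.missing) for c in x]

    def decode(self, codes) -> bytes:
        # Inverse lookup; assumes every code is a valid index into `chars`.
        return bytes(self.chars[i] for i in codes)


alphabet = SketchAlphabet(b"ACDE")
codes = alphabet.encode(b"CADZ")      # 'Z' is not in the alphabet
decoded = alphabet.decode(codes[:3])  # decode only the in-alphabet codes
```

In this sketch the out-of-alphabet byte receives the `missing` value, so it has to be dropped before decoding.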
-
class
dscript.alphabets.SDM12(mask=False)[source] Bases: dscript.alphabets.Alphabet
A D KER N TSQ YF LIVM C W H G P
See https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2732308/#B33 “Reduced amino acid alphabets exhibit an improved sensitivity and selectivity in fold assignment” Peterson et al. 2009. Bioinformatics.
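As a rough illustration of what a reduced alphabet does, the sketch below maps each residue to the index of its group in the list above. The index order (A is 0, P is 11), the `missing` fallback and the helper name `reduce_sequence` are assumptions made for this example and may not match how `SDM12` numbers its groups.

```python
# The twelve groups exactly as listed above, split on whitespace.
SDM12_GROUPS = "A D KER N TSQ YF LIVM C W H G P".split()

# Assumption: group indices follow the listed order (A -> 0, ..., P -> 11).
RESIDUE_TO_GROUP = {
    residue: index
    for index, group in enumerate(SDM12_GROUPS)
    for residue in group
}


def reduce_sequence(seq: str, missing: int = 255) -> list:
    """Map each residue to its group index; unknown residues map to `missing`."""
    return [RESIDUE_TO_GROUP.get(residue, missing) for residue in seq]


reduced = reduce_sequence("KERLX")  # K, E and R share a group; X is unknown
```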
-
class
dscript.alphabets.Uniprot21(mask=False)[source] Bases: dscript.alphabets.Alphabet
Uniprot 21 Amino Acid Encoding.
From Bepler & Berger.
dscript.fasta
-
dscript.fasta.parse(f: str)[source] Parse a FASTA file and return a tuple of sequence names and sequences.
- Parameters
f (file-like object) – file-like object representing the FASTA file to parse.
- Returns
A tuple of two lists: the sequence names and the sequences.
- Return type
(list of str, list of str)
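The documented return shape can be illustrated with a small stand-alone parser. `parse_fasta_sketch` is a hypothetical helper, not the `dscript.fasta.parse` implementation, and it assumes the whole header line after `>` is the sequence name.

```python
import io


def parse_fasta_sketch(handle):
    """Return (names, sequences) from a file-like object of FASTA text."""
    names, sequences = [], []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            names.append(line[1:])  # assumption: the whole header is the name
            sequences.append("")
        elif names:
            sequences[-1] += line   # multi-line sequences are concatenated
    return names, sequences


fasta_text = ">protA\nMKT\nAYI\n>protB\nGSH\n"
names, sequences = parse_fasta_sketch(io.StringIO(fasta_text))
```

The two lists are index-aligned, so `zip(names, sequences)` yields one (name, sequence) pair per record.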
-
dscript.fasta.parse_dict(f: str)[source] Parse a FASTA file and return a dictionary of sequences.
- Parameters
f (str) – The FASTA file to parse (file-like object).
- Returns
A dictionary where keys are sequence names and values are sequences.
- Return type
dict
-
dscript.fasta.parse_directory(directory, extension='.seq')[source] Parse all files in a directory with a specific extension and return their names and sequences.
- Parameters
directory (str) – Directory containing the files to parse.
extension (str) – File extension to filter files (default is “.seq”).
- Returns
A tuple of two lists: the sequence names and the sequences.
- Return type
(list of str, list of str)
-
dscript.fasta.parse_from_list(f: str, names: list)[source] Parse a FASTA file and return a dictionary of sequences for specified names.
- Parameters
f (str) – The FASTA file to parse (file-like object).
names (list of str) – List of sequence names to extract from the FASTA file.
- Returns
A dictionary where keys are sequence names and values are sequences.
- Return type
dict
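Selecting records by name might look like the sketch below. `parse_selected_sketch` is a hypothetical helper written for illustration; it assumes exact matching on the full header line and silently ignores requested names that are absent from the file, neither of which is confirmed by the documentation above.

```python
import io


def parse_selected_sketch(handle, names):
    """Return {name: sequence} for headers that appear in `names`."""
    wanted = set(names)
    selected = {}
    current = None  # name of the record being read, or None if it is skipped
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            header = line[1:]
            current = header if header in wanted else None
            if current is not None:
                selected[current] = ""
        elif current is not None and line:
            selected[current] += line
    return selected


fasta_text = ">p1\nAAA\n>p2\nCCC\nDDD\n>p3\nEEE\n"
subset = parse_selected_sketch(io.StringIO(fasta_text), ["p2", "p3", "p9"])
```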
dscript.foldseek
-
dscript.foldseek.get_3di_sequences(pdb_files: list)[source] Extract 3Di sequences from PDB/mmCIF files using biotite.structure.alphabet.to_3di(atoms). Returns a dict {basename: SeqRecord}.
At this time, this function extracts a 3Di sequence only for the first chain in each PDB file. To extract multiple chains, you will need to modify this function. This keeps naming consistent with the rest of the D-SCRIPT training and inference scripts, which currently require PDB file names to match FASTA header names.
dscript.glider
-
dscript.glider.compute_cw_score(p, q, edgedict, ndict, params=None)[source] Computes the common weighted score between p and q.
- Parameters
p – A node of the graph
q – Another node in the graph
edgedict (dict) – A dictionary with key (p, q) and value w.
ndict (dict) – A dictionary with key p and the value a set {p1, p2, …}
params (None) – Should always be None here
- Returns
A real value representing the score
- Return type
float
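The documentation above does not state the scoring formula. One plausible reading of "common weighted" is to sum, over every neighbor shared by p and q, the weights of the two edges that connect it to p and to q. The sketch below implements that assumed rule; `cw_score_sketch` and `edge_weight` are hypothetical names and may not match what `compute_cw_score` actually computes.

```python
def edge_weight(edgedict, a, b):
    """Look up an undirected edge stored under either (a, b) or (b, a)."""
    return edgedict[(a, b)] if (a, b) in edgedict else edgedict[(b, a)]


def cw_score_sketch(p, q, edgedict, ndict):
    # Assumed scoring rule: add both edge weights for every shared neighbor.
    common = ndict[p] & ndict[q]
    return sum(
        edge_weight(edgedict, p, r) + edge_weight(edgedict, q, r) for r in common
    )


edgedict = {("a", "c"): 0.5, ("b", "c"): 0.25, ("a", "d"): 1.0}
ndict = {"a": {"c", "d"}, "b": {"c"}, "c": {"a", "b"}, "d": {"a"}}
score = cw_score_sketch("a", "b", edgedict, ndict)  # only "c" is shared
```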
-
dscript.glider.compute_cw_score_normalized(p, q, edgedict, ndict, params=None)[source] Computes the common weighted normalized score between p and q.
- Parameters
p – A node of the graph
q – Another node in the graph
edgedict (dict) – A dictionary with key (p, q) and value w.
ndict (dict) – A dictionary with key p and the value a set {p1, p2, …}
params (None) – Should always be None here
- Returns
A real value representing the score
- Return type
float
-
dscript.glider.create_edge_dict(edgelist)[source] Creates an edge dictionary with the edge (p, q) as the key, and weight w as the value.
- Parameters
edgelist (list) – list with elements of form (p, q, w)
- Returns
A dictionary with key (p, q) and value w.
- Return type
dict
-
dscript.glider.create_neighborhood_dict(edgelist)[source] Creates a dictionary with each node as a key and the set of its neighboring nodes as the value.
- Parameters
edgelist (list) – A list with elements of form (p, q, w)
- Returns
A dictionary with key p and value a set {p1, p2, p3, …}
- Return type
dict
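The two dictionaries described above can be built from an edge list in a few lines. This is a minimal sketch with hypothetical function names; it assumes the graph is undirected, so each edge adds both endpoints to each other's neighbor set.

```python
def edge_dict_sketch(edgelist):
    """Map each (p, q) pair to its weight w."""
    return {(p, q): w for p, q, w in edgelist}


def neighborhood_dict_sketch(edgelist):
    """Map each node to the set of its neighbors (assumes an undirected graph)."""
    neighbors = {}
    for p, q, _ in edgelist:
        neighbors.setdefault(p, set()).add(q)
        neighbors.setdefault(q, set()).add(p)
    return neighbors


edges = [("a", "b", 0.9), ("b", "c", 0.4)]
edge_dict = edge_dict_sketch(edges)
neighbors = neighborhood_dict_sketch(edges)
```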
-
dscript.glider.densify(edgelist, dim=None, directed=False)[source] Given an adjacency list for the graph, computes the adjacency matrix.
- Parameters
edgelist (list) – Graph adjacency list
dim (int) – Number of nodes in the graph
directed (bool) – Whether the graph should be treated as directed
- Returns
Graph as an adjacency matrix
- Return type
np.ndarray
-
dscript.glider.get_dim(edgelist)[source] Given an adjacency list for a graph, returns the number of nodes in the graph.
- Parameters
edgelist (list) – Graph adjacency list
- Returns
Number of nodes in the graph
- Return type
int
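Counting nodes and building a dense adjacency matrix from an edge list can be sketched as follows. The function names are hypothetical, nested lists stand in for `np.ndarray`, and the sketch assumes nodes are integers in `range(dim)` and edges are `(p, q, w)` triples.

```python
def get_dim_sketch(edgelist):
    """Count the distinct nodes that appear in any edge."""
    return len({node for p, q, _ in edgelist for node in (p, q)})


def densify_sketch(edgelist, dim=None, directed=False):
    """Build a dim x dim adjacency matrix as nested lists.

    Assumption: nodes are integers in range(dim).
    """
    dim = get_dim_sketch(edgelist) if dim is None else dim
    matrix = [[0.0] * dim for _ in range(dim)]
    for p, q, w in edgelist:
        matrix[p][q] = w
        if not directed:
            matrix[q][p] = w  # mirror the edge for undirected graphs
    return matrix


edges = [(0, 1, 0.5), (1, 2, 2.0)]
adjacency = densify_sketch(edges)
```

With `directed=True` only the `matrix[p][q]` entry is set, so the result is no longer symmetric.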
-
dscript.glider.glide_compute_map(pos_df, thres_p=0.9, params={})[source] Return glide_mat and glide_map.
- Parameters
pos_df (pd.DataFrame) – Dataframe of weighted edges
thres_p (float) – Threshold to treat an edge as positive
params (dict) – Parameters for GLIDE
- Returns
glide_matrix and corresponding glide_map
- Return type
tuple(np.ndarray, dict)
-
dscript.glider.glide_predict_links(edgelist, X, params={}, thres_p=0.9)[source] Predicts the most likely links in a graph given an embedding X of a graph. Returns a ranked list of (edges, distances) sorted from closest to furthest.
- Parameters
edgelist (list) – A list with elements of type (p, q, wt)
X (np.ndarray) – An n×k embedding matrix
params (dict) –
A dictionary with entries:
alpha: real number
beta: real number
delta: real number
loc: String, either cw (common weighted) or l3 (L3 local scoring)
To enable ctypes, the following entries should be there:
ctypes_on: True (This key should only be added if ctypes is on)
so_location: String location of the .so dynamic library
thres_p (float) – Threshold percentile value
- Returns
Glide matrix
- Return type
np.ndarray
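A `params` dictionary with the entries listed above might be assembled as in the sketch below. The numeric values and the `.so` path are placeholders chosen for illustration, not documented defaults, and `check_glide_params` is a hypothetical helper that treats the four listed entries as required, which is an assumption.

```python
# Placeholder values chosen for illustration only; not documented defaults.
glide_params = {
    "alpha": 0.1,
    "beta": 1000,
    "delta": 1,
    "loc": "cw",  # "cw" (common weighted) or "l3" (L3 local scoring)
}

# Optional ctypes entries; the library path is a hypothetical placeholder.
ctypes_params = dict(glide_params, ctypes_on=True, so_location="/path/to/glide.so")


def check_glide_params(params):
    """Sanity-check the keys described above (hypothetical, not part of dscript)."""
    missing = {"alpha", "beta", "delta", "loc"} - set(params)
    if missing:
        raise KeyError(f"missing GLIDE parameters: {sorted(missing)}")
    if params["loc"] not in {"cw", "l3"}:
        raise ValueError("loc must be 'cw' or 'l3'")
    if params.get("ctypes_on") and "so_location" not in params:
        raise KeyError("so_location is required when ctypes_on is set")
    return True
```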
dscript.language_model
-
dscript.language_model.embed_from_directory(directory, outputPath, device=0, verbose=False, extension='.seq')[source] Embed all files in a directory in .fasta format using pre-trained language model from Bepler & Berger.
- Parameters
directory (str) – Input directory (.fasta format)
outputPath (str) – Output embedding file (.h5 format)
device (int) – Compute device to use for embeddings [default: 0]
verbose (bool) – Print embedding progress
extension (str) – Extension of all files to read in
-
dscript.language_model.embed_from_fasta(fastaPath, outputPath, device=0, verbose=False)[source] Embed sequences using pre-trained language model from Bepler & Berger.
- Parameters
fastaPath (str) – Input sequence file (.fasta format)
outputPath (str) – Output embedding file (.h5 format)
device (int) – Compute device to use for embeddings [default: 0]
verbose (bool) – Print embedding progress
-
dscript.language_model.lm_embed(sequence, use_cuda=False)[source] Embed a single sequence using pre-trained language model from Bepler & Berger.
- Parameters
sequence (str) – Input sequence to be embedded
use_cuda (bool) – Whether to generate embeddings using GPU device [default: False]
- Returns
Embedded sequence
- Return type
torch.Tensor
dscript.load_worker
dscript.loading
dscript.pretrained
-
dscript.pretrained.get_pretrained(version='human_v2')[source] Get pre-trained model object.
See the documentation for most up-to-date list.
lm_v1 - Language model from Bepler & Berger.
human_v1 - Human trained model from D-SCRIPT manuscript.
human_v2 - Human trained model from Topsy-Turvy manuscript.
human_tt3d - Human trained model with FoldSeek sequence inputs.
Default: human_v2
- Parameters
version (str) – Version of pre-trained model to get
- Returns
Pre-trained model
- Return type
dscript.models.*
-
dscript.pretrained.get_state_dict(version='human_v2', verbose=True)[source] Download a pre-trained model if it does not already exist on the local device.
- Parameters
version (str) – Version of trained model to download [default: human_v2]
verbose (bool) – Print model download status on stdout [default: True]
- Returns
Path to state dictionary for pre-trained language model
- Return type
str
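The download-only-if-missing behavior described above follows a common caching pattern, sketched here without any network access. `fetch_if_missing` and `fake_fetch` are hypothetical helpers, and the file name is a placeholder; where `get_state_dict` stores its files and how it downloads them is not stated in this documentation.

```python
import os
import tempfile


def fetch_if_missing(path, fetch, verbose=True):
    """Call `fetch(path)` only when `path` does not exist yet; return the path."""
    if not os.path.exists(path):
        if verbose:
            print(f"Downloading model to {path}")
        fetch(path)
    return path


calls = []


def fake_fetch(path):
    # Stand-in for a real download: write a placeholder file and record the call.
    calls.append(path)
    with open(path, "wb") as handle:
        handle.write(b"placeholder weights")


with tempfile.TemporaryDirectory() as cache_dir:
    target = os.path.join(cache_dir, "human_v2.sav")  # hypothetical file name
    first = fetch_if_missing(target, fake_fetch, verbose=False)
    second = fetch_if_missing(target, fake_fetch, verbose=False)  # cache hit
```

The second call finds the file already on disk, so the fetch function runs only once.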
dscript.utils
-
class
dscript.utils.PairedDataset(X0, X1, Y)[source] Bases: Generic[torch.utils.data.dataset._T_co]
Dataset to be used by the PyTorch data loader for pairs of sequences and their labels.
- Parameters
X0 – List of first item in the pair
X1 – List of second item in the pair
Y – List of labels
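A data loader needs only `__len__` and `__getitem__` from a map-style dataset. The sketch below shows that protocol for paired items without importing PyTorch; `PairedDatasetSketch` is a hypothetical stand-in, and both the length check and the `(x0, x1, y)` return order are assumptions rather than documented behavior of `PairedDataset`.

```python
class PairedDatasetSketch:
    """Torch-free stand-in exposing the __len__/__getitem__ protocol."""

    def __init__(self, X0, X1, Y):
        if not (len(X0) == len(X1) == len(Y)):
            raise ValueError("X0, X1 and Y must have the same length")
        self.X0, self.X1, self.Y = X0, X1, Y

    def __len__(self):
        return len(self.X0)

    def __getitem__(self, i):
        # One sample is the i-th pair together with its label.
        return self.X0[i], self.X1[i], self.Y[i]


pairs = PairedDatasetSketch(["protA", "protB"], ["protC", "protD"], [1, 0])
sample = pairs[1]
```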
-
dscript.utils.RBF(D, sigma=None)[source] Convert distance matrix into similarity matrix using Radial Basis Function (RBF) Kernel.
\(RBF(x,x') = \exp{\frac{-(x - x')^{2}}{2\sigma^{2}}}\)
- Parameters
D (np.ndarray) – Distance matrix
sigma (float) – Bandwidth of RBF Kernel [default: \(\sqrt{\text{max}(D)}\)]
- Returns
Similarity matrix
- Return type
np.ndarray
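The kernel above can be applied entry by entry to a distance matrix, with each distance taking the place of \(x - x'\). The sketch below does this on nested lists instead of `np.ndarray`; `rbf_sketch` is a hypothetical name, and it assumes the matrix contains at least one non-zero distance when `sigma` is left at its default.

```python
import math


def rbf_sketch(D, sigma=None):
    """Apply exp(-d^2 / (2 * sigma^2)) to every entry d of a nested-list matrix."""
    if sigma is None:
        # Documented default: square root of the largest distance.
        sigma = math.sqrt(max(max(row) for row in D))
    return [[math.exp(-(d ** 2) / (2 * sigma ** 2)) for d in row] for row in D]


distances = [[0.0, 2.0], [2.0, 0.0]]
similarity = rbf_sketch(distances, sigma=1.0)
```

Zero distance maps to a similarity of 1, and larger distances decay toward 0 at a rate set by `sigma`.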
-
dscript.utils.load_hdf5_parallel(file_path, keys, n_jobs=-1, return_dict=True)[source] Load keys from hdf5 file into memory
- Parameters
file_path (str) – Path to hdf5 file
keys (iterable[str]) – List of keys to get
- Returns
If return_dict, a mapping of keys (protein names) to pointers to embeddings; otherwise, a list of pointers in the same order as keys
- Return type
dict or list
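The two return shapes controlled by `return_dict` can be illustrated with a thread pool reading from any mapping. This is a sketch under stated assumptions: `load_keys_sketch` is a hypothetical helper, a plain dict stands in for an open HDF5 file, and the use of threads is a guess, since the documentation above does not say how the real function parallelizes. Unlike the documented default, this sketch requires a positive `n_jobs`.

```python
from concurrent.futures import ThreadPoolExecutor


def load_keys_sketch(store, keys, n_jobs=4, return_dict=True):
    """Fetch `keys` from a mapping-like `store` using a thread pool."""
    keys = list(keys)
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # pool.map preserves the order of `keys` in its results.
        values = list(pool.map(store.__getitem__, keys))
    return dict(zip(keys, values)) if return_dict else values


# A plain dict stands in for an open HDF5 file here.
fake_store = {"protA": [0.1, 0.2], "protB": [0.3], "protC": [0.4, 0.5]}
as_dict = load_keys_sketch(fake_store, ["protC", "protA"])
as_list = load_keys_sketch(fake_store, ["protC", "protA"], return_dict=False)
```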
-
dscript.utils.log(m, file=None, timestamped=True, print_also=False)[source] Legacy log function that wraps loguru for backward compatibility.
- Parameters
m (str) – Message to log
file (file handle or None) – File handle to write to (if None, uses stdout)
timestamped (bool) – Whether to include timestamp (handled by loguru)
print_also (bool) – Whether to also print to stdout when writing to file