API

dscript.alphabets

class dscript.alphabets.Alphabet(chars, encoding=None, mask=False, missing=255)[source]

Bases: object

From Bepler & Berger.

Parameters
  • chars (byte str) – List of characters in alphabet

  • encoding (np.ndarray) – Mapping of characters to numbers [default: None]

  • mask (bool) – Set encoding mask [default: False]

  • missing (int) – Number to use for a value outside the alphabet [default: 255]

decode(x)[source]

Decode numeric encoding to byte string of this alphabet

Parameters

x (np.ndarray) – Numeric encoding

Returns

Amino acid string

Return type

byte str

encode(x)[source]

Encode a byte string into alphabet indices

Parameters

x (byte str) – Amino acid string

Returns

Numeric encoding

Return type

np.ndarray

get_kmer(h, k)[source]

Retrieve the byte string of length k decoded from integer h

unpack(h, k)[source]

Unpack integer h into an array of this alphabet with length k
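The encode/decode round trip above can be sketched in a few lines. This is an illustrative pure-Python reimplementation, not the D-SCRIPT source (the real class stores the encoding as a NumPy array); `SimpleAlphabet` is a hypothetical name:

```python
# Minimal sketch of the encode/decode behavior of dscript.alphabets.Alphabet.
class SimpleAlphabet:
    def __init__(self, chars: bytes, missing: int = 255):
        self.chars = chars
        # Map all 256 byte values to `missing`, then overwrite positions
        # that belong to the alphabet with their indices.
        self.encoding = [missing] * 256
        for i, c in enumerate(chars):
            self.encoding[c] = i

    def encode(self, x: bytes) -> list:
        """Byte string -> list of alphabet indices."""
        return [self.encoding[c] for c in x]

    def decode(self, idx) -> bytes:
        """List of alphabet indices -> byte string."""
        return bytes(self.chars[i] for i in idx)

aa = SimpleAlphabet(b"ARNDCQEGHILKMFPSTWYVX")
idx = aa.encode(b"MSKGE")
assert aa.decode(idx) == b"MSKGE"
assert aa.encode(b"B") == [255]  # 'B' is outside the alphabet -> `missing`
```

A character outside the alphabet maps to the `missing` value rather than raising, mirroring the `missing=255` parameter documented above.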

class dscript.alphabets.SDM12(mask=False)[source]

Bases: dscript.alphabets.Alphabet

A D KER N TSQ YF LIVM C W H G P

See https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2732308/#B33 “Reduced amino acid alphabets exhibit an improved sensitivity and selectivity in fold assignment” Peterson et al. 2009. Bioinformatics.

class dscript.alphabets.Uniprot21(mask=False)[source]

Bases: dscript.alphabets.Alphabet

Uniprot 21 Amino Acid Encoding.

From Bepler & Berger.

dscript.fasta

dscript.fasta.count_bins(array, bins)[source]
dscript.fasta.parse(f, comment='#')[source]
dscript.fasta.parse_directory(directory, extension='.seq')[source]
dscript.fasta.write(nam, seq, f)[source]
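A hedged sketch of what `parse` does: read a FASTA stream and return parallel lists of names and sequences. The exact return types of the real function (byte vs. unicode strings, comment handling) should be checked against the source; `parse_fasta` here is an illustrative stand-in:

```python
import io

# Illustrative FASTA parser mirroring the shape of dscript.fasta.parse.
def parse_fasta(handle, comment="#"):
    names, sequences = [], []
    seq_parts = []
    for line in handle:
        line = line.strip()
        if not line or line.startswith(comment):
            continue  # skip blanks and comment lines
        if line.startswith(">"):
            # New record: flush the previous sequence, if any.
            if seq_parts:
                sequences.append("".join(seq_parts))
                seq_parts = []
            names.append(line[1:])
        else:
            seq_parts.append(line)  # sequences may span multiple lines
    if seq_parts:
        sequences.append("".join(seq_parts))
    return names, sequences

names, seqs = parse_fasta(io.StringIO(">prot1\nMSKGE\nELFT\n>prot2\nARND\n"))
assert names == ["prot1", "prot2"]
assert seqs == ["MSKGEELFT", "ARND"]
```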

dscript.language_model

dscript.language_model.embed_from_directory(directory, outputPath, device=0, verbose=False, extension='.seq')[source]

Embed all .fasta-format files in a directory using the pre-trained language model from Bepler & Berger.

Parameters
  • directory (str) – Input directory (.fasta format)

  • outputPath (str) – Output embedding file (.h5 format)

  • device (int) – Compute device to use for embeddings [default: 0]

  • verbose (bool) – Print embedding progress

  • extension (str) – Extension of all files to read in

dscript.language_model.embed_from_fasta(fastaPath, outputPath, device=0, verbose=False)[source]

Embed sequences using pre-trained language model from Bepler & Berger.

Parameters
  • fastaPath (str) – Input sequence file (.fasta format)

  • outputPath (str) – Output embedding file (.h5 format)

  • device (int) – Compute device to use for embeddings [default: 0]

  • verbose (bool) – Print embedding progress

dscript.language_model.lm_embed(sequence, use_cuda=False)[source]

Embed a single sequence using pre-trained language model from Bepler & Berger.

Parameters
  • sequence (str) – Input sequence to be embedded

  • use_cuda (bool) – Whether to generate embeddings using GPU device [default: False]

Returns

Embedded sequence

Return type

torch.Tensor

dscript.pretrained

dscript.pretrained.get_pretrained(version='human_v2')[source]

Get pre-trained model object.

See the documentation for the most up-to-date list.

  • lm_v1 - Language model from Bepler & Berger.

  • human_v1 - Human trained model from D-SCRIPT manuscript.

  • human_v2 - Human trained model from Topsy-Turvy manuscript.

Default: human_v2

Parameters

version (str) – Version of pre-trained model to get

Returns

Pre-trained model

Return type

dscript.models.*

dscript.pretrained.get_state_dict(version='human_v2', verbose=True)[source]

Download a pre-trained model if it does not already exist on the local device.

Parameters
  • version (str) – Version of trained model to download [default: human_v2]

  • verbose (bool) – Print model download status on stdout [default: True]

Returns

Path to state dictionary for pre-trained language model

Return type

str

dscript.pretrained.get_state_dict_path(version: str) → str[source]
dscript.pretrained.retry(retry_count: int)[source]

dscript.glider

dscript.glider.compute_X_normalized(A, D, t=-1, lm=1, is_normalized=True)[source]
dscript.glider.compute_cw_score(p, q, edgedict, ndict, params=None)[source]

Computes the common weighted score between p and q.

Parameters
  • p – A node of the graph

  • q – Another node in the graph

  • edgedict (dict) – A dictionary with key (p, q) and value w.

  • ndict (dict) – A dictionary with key p and value a set {p1, p2, …} of the neighbors of p

  • params (None) – Should always be None here

Returns

A real value representing the score

Return type

float
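A hedged sketch of a common-weighted score on the `edgedict`/`ndict` structures described above. The exact formula D-SCRIPT uses may differ; here the score is assumed to sum the edge weights from p and q to each common neighbor, and `cw_score` is an illustrative name:

```python
# Sketch of a common-weighted score between nodes p and q.
def cw_score(p, q, edgedict, ndict):
    def w(a, b):
        # Edges are undirected, so look the weight up under either key order.
        return edgedict.get((a, b), edgedict.get((b, a), 0.0))
    common = ndict[p] & ndict[q]  # neighbors shared by p and q
    return sum(w(p, r) + w(q, r) for r in common)

edgedict = {("a", "c"): 1.0, ("b", "c"): 2.0, ("a", "d"): 0.5}
ndict = {"a": {"c", "d"}, "b": {"c"}, "c": {"a", "b"}, "d": {"a"}}
assert cw_score("a", "b", edgedict, ndict) == 3.0  # only common neighbor is c
```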

dscript.glider.compute_cw_score_normalized(p, q, edgedict, ndict, params=None)[source]

Computes the common weighted normalized score between p and q.

Parameters
  • p – A node of the graph

  • q – Another node in the graph

  • edgedict (dict) – A dictionary with key (p, q) and value w.

  • ndict (dict) – A dictionary with key p and value a set {p1, p2, …} of the neighbors of p

  • params (None) – Should always be None here

Returns

A real value representing the score

Return type

float

dscript.glider.compute_degree_vec(edgelist)[source]
dscript.glider.compute_l3_score_mat(p, q, edgedict, ndict, params=None)[source]
dscript.glider.compute_l3_unweighted_mat(A)[source]
dscript.glider.compute_l3_weighted_mat(A)[source]
dscript.glider.compute_pinverse_diagonal(D)[source]
dscript.glider.create_edge_dict(edgelist)[source]

Creates an edge dictionary with the edge (p, q) as the key, and weight w as the value.

Parameters

edgelist (list) – A list with elements of form (p, q, w)

Returns

A dictionary with key (p, q) and value w.

Return type

dict

dscript.glider.create_neighborhood_dict(edgelist)[source]

Create a dictionary with each node as the key and the set of its neighbors as the value

Parameters

edgelist (list) – A list with elements of form (p, q, w)

Returns

A dictionary with key p and value a set {p1, p2, p3, …} of the neighbors of p

Return type

dict
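The two helpers above can be sketched on a tiny edge list. This is an illustrative reimplementation of their documented behavior, not the D-SCRIPT source:

```python
# Build the (p, q) -> w edge dictionary described for create_edge_dict.
def create_edge_dict(edgelist):
    return {(p, q): w for p, q, w in edgelist}

# Build the node -> set-of-neighbors dictionary described for
# create_neighborhood_dict; edges are treated as undirected.
def create_neighborhood_dict(edgelist):
    ndict = {}
    for p, q, _ in edgelist:
        ndict.setdefault(p, set()).add(q)
        ndict.setdefault(q, set()).add(p)
    return ndict

edges = [("a", "b", 1.0), ("b", "c", 0.5)]
assert create_edge_dict(edges) == {("a", "b"): 1.0, ("b", "c"): 0.5}
assert create_neighborhood_dict(edges)["b"] == {"a", "c"}
```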

dscript.glider.densify(edgelist, dim=None, directed=False)[source]

Given an adjacency list for the graph, computes the adjacency matrix.

Parameters
  • edgelist (list) – Graph adjacency list

  • dim (int) – Number of nodes in the graph

  • directed (bool) – Whether the graph should be treated as directed

Returns

Graph as an adjacency matrix

Return type

np.ndarray

dscript.glider.get_dim(edgelist)[source]

Given an adjacency list for a graph, returns the number of nodes in the graph.

Parameters

edgelist (list) – Graph adjacency list

Returns

Number of nodes in the graph

Return type

int
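`get_dim` and `densify` can be sketched together, assuming nodes are 0-based integer indices. The real `densify` returns an np.ndarray; plain nested lists are used here so the sketch is self-contained:

```python
# Number of nodes = largest node index in the edge list, plus one.
def get_dim(edgelist):
    return max(max(p, q) for p, q, _ in edgelist) + 1

# Turn the adjacency list into a dense adjacency matrix.
def densify(edgelist, dim=None, directed=False):
    n = dim if dim is not None else get_dim(edgelist)
    A = [[0.0] * n for _ in range(n)]
    for p, q, w in edgelist:
        A[p][q] = w
        if not directed:
            A[q][p] = w  # mirror the edge for undirected graphs
    return A

edges = [(0, 1, 1.0), (1, 2, 0.5)]
assert get_dim(edges) == 3
A = densify(edges)
assert A[1][0] == 1.0 and A[2][1] == 0.5
```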

dscript.glider.glide_compute_map(pos_df, thres_p=0.9, params={})[source]

Return glide_mat and glide_map.

Parameters
  • pos_df (pd.DataFrame) – DataFrame of weighted edges

  • thres_p (float) – Threshold to treat an edge as positive

  • params (dict) – Parameters for GLIDE

Returns

glide_matrix and corresponding glide_map

Return type

tuple(np.ndarray, dict)

dscript.glider.glide_predict_links(edgelist, X, params={})[source]

Predicts the most likely links in a graph, given an embedding X of the graph. Returns a ranked list of (edge, distance) pairs, sorted from closest to furthest.

Parameters
  • edgelist – A list with elements of type (p, q, wt)

  • X – A nxk embedding matrix

  • params – A dictionary with entries:

{
    alpha => real number
    beta => real number
    delta => real number
    loc => String, can be cw for common weighted or l3 for L3 local scoring

    ### To enable ctypes, the following entries should be present ###
    ctypes_on => True (add this key only if ctypes is enabled; omit it otherwise)
    so_location => String, location of the .so dynamic library
}
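As a Python literal, a params dictionary following the entries described above might look like this. The numeric values are placeholder assumptions, not recommended defaults:

```python
# Illustrative GLIDE params dict; check the D-SCRIPT source for sensible values.
params = {
    "alpha": 0.1,    # real number (placeholder)
    "beta": 1000.0,  # real number (placeholder)
    "delta": 1.0,    # real number (placeholder)
    "loc": "cw",     # "cw" = common weighted, "l3" = L3 local scoring
    # To enable ctypes, additionally set (omit both keys otherwise):
    # "ctypes_on": True,
    # "so_location": "/path/to/library.so",
}
assert params["loc"] in {"cw", "l3"}
```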

dscript.glider.glider_score(p, q, glider_map, glider_mat)[source]

dscript.utils

class dscript.utils.PairedDataset(X0, X1, Y)[source]

Bases: torch.utils.data.dataset.Dataset

Dataset to be used by the PyTorch data loader for pairs of sequences and their labels.

Parameters
  • X0 – List of first item in the pair

  • X1 – List of second item in the pair

  • Y – List of labels

dscript.utils.RBF(D, sigma=None)[source]

Convert a distance matrix into a similarity matrix using the Radial Basis Function (RBF) kernel.

\(\mathrm{RBF}(x, x') = \exp\left(\frac{-(x - x')^{2}}{2\sigma^{2}}\right)\)

Parameters
  • D (np.ndarray) – Distance matrix

  • sigma (float) – Bandwidth of the RBF kernel [default: \(\sqrt{\text{max}(D)}\)]

Returns

Similarity matrix

Return type

np.ndarray
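The conversion above can be sketched in pure Python (the real function operates on an np.ndarray), with sigma defaulting to the square root of the largest distance, as documented; `rbf` is an illustrative stand-in:

```python
import math

# Apply RBF(x, x') = exp(-d^2 / (2 * sigma^2)) entrywise to a distance matrix D,
# where each entry d of D already equals (x - x').
def rbf(D, sigma=None):
    if sigma is None:
        sigma = math.sqrt(max(max(row) for row in D))
    return [[math.exp(-(d ** 2) / (2 * sigma ** 2)) for d in row] for row in D]

D = [[0.0, 4.0], [4.0, 0.0]]
S = rbf(D)
assert S[0][0] == 1.0           # zero distance -> similarity 1
assert S[0][1] == math.exp(-2)  # sigma = sqrt(4) = 2, so exp(-16 / 8)
```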

dscript.utils.collate_paired_sequences(args)[source]

Collate function for PyTorch data loader.

dscript.utils.load_hdf5_parallel(file_path, keys, n_jobs=-1)[source]

Load keys from an HDF5 file into memory

Parameters
  • file_path (str) – Path to hdf5 file

  • keys (list[str]) – List of keys to get

  • n_jobs (int) – Number of parallel jobs [default: -1]

Returns

Dictionary with keys and records in memory

Return type

dict

dscript.utils.log(m, file=None, timestamped=True, print_also=False)[source]

Write message m to file and/or standard output, optionally prefixed with a timestamp.