API¶

dscript.alphabets¶

class dscript.alphabets.Alphabet(chars, encoding=None, mask=False, missing=255)[source]¶

Bases: object

From Bepler & Berger.

Parameters

chars (byte str) – List of characters in alphabet
encoding (np.ndarray) – Mapping of characters to numbers [default: None]
mask (bool) – Set encoding mask [default: False]
missing (int) – Number to use for a value outside the alphabet [default: 255]

decode(x)[source]¶

Decode numeric encoding to byte string of this alphabet

Parameters: x (np.ndarray) – Numeric encoding
Returns: Amino acid string
Return type: byte str

encode(x)[source]¶

Encode a byte string into alphabet indices

Parameters: x (byte str) – Amino acid string
Returns: Numeric encoding
Return type: np.ndarray

get_kmer(h, k)[source]¶

Retrieve the byte string of length k decoded from integer h

unpack(h, k)[source]¶

Unpack integer h into an array of this alphabet with length k
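The encode, decode, unpack and get_kmer methods above can be read as a lookup table plus a base-n digit expansion. The following is a minimal standalone sketch of that idea, not the dscript implementation: the class name ToyAlphabet is hypothetical, plain lists stand in for np.ndarray to avoid dependencies, and the most-significant-digit-first order in unpack is an assumption.

```python
class ToyAlphabet:
    """Standalone illustration of an alphabet encoder; not dscript's class."""

    def __init__(self, chars, missing=255):
        self.chars = bytes(chars)
        self.size = len(self.chars)
        # 256-entry lookup table: byte value -> alphabet index,
        # with `missing` for every byte outside the alphabet.
        self.table = [missing] * 256
        for index, byte in enumerate(self.chars):
            self.table[byte] = index

    def encode(self, x):
        """Map each byte of the byte string x to its alphabet index."""
        return [self.table[byte] for byte in x]

    def decode(self, indices):
        """Map valid alphabet indices back to a byte string."""
        return bytes(self.chars[i] for i in indices)

    def unpack(self, h, k):
        """Read h as a k-digit number in base `size`.

        Assumption: the most significant digit comes first.
        """
        digits = [0] * k
        for position in reversed(range(k)):
            h, digits[position] = divmod(h, self.size)
        return digits

    def get_kmer(self, h, k):
        """Decode the k digits of h into a byte string of length k."""
        return self.decode(self.unpack(h, k))


alphabet = ToyAlphabet(b"ACGT")
encoded = alphabet.encode(b"GATTACA?")  # '?' is outside the alphabet
kmer = alphabet.get_kmer(27, 3)         # 27 = 1*16 + 2*4 + 3
```

In this sketch the out-of-alphabet byte is encoded as the missing value 255, which is why decode should only be given indices below the alphabet size.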

class dscript.alphabets.Uniprot21(mask=False)[source]¶

Bases: dscript.alphabets.Alphabet

Uniprot 21 Amino Acid Encoding.

From Bepler & Berger.

dscript.fasta¶

dscript.fasta.count_bins(array, bins)[source]¶

dscript.fasta.parse(f, comment='#')[source]¶

dscript.fasta.parse_directory(directory, extension='.seq')[source]¶

dscript.fasta.write(nam, seq, f)[source]¶

dscript.language_model¶

dscript.language_model.embed_from_fasta(fastaPath, outputPath, device=0, verbose=False)[source]¶

Embed sequences using pre-trained language model from Bepler & Berger.

Parameters

fastaPath (str) – Input sequence file (.fasta format)
outputPath (str) – Output embedding file (.h5 format)
device (int) – Compute device to use for embeddings [default: 0]
verbose (bool) – Print embedding progress

dscript.language_model.lm_embed(sequence, use_cuda=False, verbose=True)[source]¶

Embed a single sequence using pre-trained language model from Bepler & Berger.

Parameters

sequence (str) – Input sequence to be embedded
use_cuda (bool) – Whether to generate embeddings using GPU device [default: False]

Returns

Embedded sequence

Return type

torch.Tensor

dscript.pretrained¶

dscript.pretrained.get_pretrained(version='human_v1', verbose=True)[source]¶

Get pre-trained model object.

See the documentation for the most up-to-date list.

lm_v1 - Language model from Bepler & Berger.
human_v1 - Human trained model from D-SCRIPT manuscript.

Default: human_v1

Parameters: version (str) – Version of pre-trained model to get
Returns: Pre-trained model
Return type: dscript.models.*

dscript.pretrained.get_state_dict(version='human_v1', verbose=True)[source]¶

Download a pre-trained model if it does not already exist on the local device.

Parameters

version (str) – Version of pre-trained model to download [default: human_v1]
verbose (bool) – Print model download status on stdout [default: True]

Returns

Path to state dictionary for the pre-trained model

Return type

str

dscript.glider¶

dscript.glider.compute_X_normalized(A, D, t=-1, lm=1, is_normalized=True)[source]¶

dscript.glider.compute_cw_score(p, q, edgedict, ndict, params=None)[source]¶

Computes the common weighted score between p and q.

Parameters

p – A node of the graph
q – Another node in the graph
edgedict (dict) – A dictionary with key (p, q) and value w.
ndict (dict) – A dictionary with key p and the value a set {p1, p2, …}
params (None) – Should always be None here

Returns

A real value representing the score

Return type

float

dscript.glider.compute_cw_score_normalized(p, q, edgedict, ndict, params=None)[source]¶

Computes the common weighted normalized score between p and q.

Parameters

p – A node of the graph
q – Another node in the graph
edgedict (dict) – A dictionary with key (p, q) and value w.
ndict (dict) – A dictionary with key p and the value a set {p1, p2, …}
params (None) – Should always be None here

Returns

A real value representing the score

Return type

float

dscript.glider.compute_degree_vec(edgelist)[source]¶

dscript.glider.compute_l3_score_mat(p, q, edgedict, ndict, params=None)[source]¶

dscript.glider.compute_l3_unweighted_mat(A)[source]¶

dscript.glider.compute_l3_weighted_mat(A)[source]¶

dscript.glider.compute_pinverse_diagonal(D)[source]¶

dscript.glider.create_edge_dict(edgelist)[source]¶

Creates an edge dictionary with the edge (p, q) as the key, and weight w as the value.

Parameters: edgelist (list) – list with elements of form (p, q, w)
Returns: A dictionary with key (p, q) and value w.
Return type: dict

dscript.glider.create_neighborhood_dict(edgelist)[source]¶

Creates a dictionary with each node as the key and the set of its neighboring nodes as the value.

Parameters: edgelist (list) – A list with elements of form (p, q, w)
Returns: A dictionary with key p and, as value, the set {p1, p2, p3, …} of its neighbors
Return type: dict
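The two dictionaries described above can be built in a single pass over the edge list. This is a minimal sketch based only on the descriptions on this page, not the library code: the sketch_ function names are hypothetical, and recording each edge in both directions in the neighborhood dictionary is an assumption (an undirected graph).

```python
def sketch_edge_dict(edgelist):
    """Map each edge (p, q) to its weight w."""
    return {(p, q): w for p, q, w in edgelist}


def sketch_neighborhood_dict(edgelist):
    """Map each node to the set of nodes it shares an edge with."""
    neighbors = {}
    for p, q, _w in edgelist:
        neighbors.setdefault(p, set()).add(q)
        # Assumption: the graph is undirected, so q also lists p.
        neighbors.setdefault(q, set()).add(p)
    return neighbors


edges = [("A", "B", 0.9), ("B", "C", 0.4)]
edge_dict = sketch_edge_dict(edges)
neighborhood = sketch_neighborhood_dict(edges)
```

Dictionaries of this shape match the edgedict and ndict arguments described for compute_cw_score and compute_cw_score_normalized.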

dscript.glider.densify(edgelist, dim=None, directed=False)[source]¶

Given an adjacency list for the graph, computes the adjacency matrix.

Parameters

edgelist (list) – Graph adjacency list
dim (int) – Number of nodes in the graph
directed (bool) – Whether the graph should be treated as directed

Returns

Graph as an adjacency matrix

Return type

np.ndarray

dscript.glider.get_dim(edgelist)[source]¶

Given an adjacency list for a graph, returns the number of nodes in the graph.

Parameters: edgelist (list) – Graph adjacency list
Returns: Number of nodes in the graph
Return type: int
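As an illustration of how an edge list relates to the node count and the adjacency matrix, here is a minimal dependency-free sketch. It is not the dscript code: the function names are hypothetical, nested lists stand in for np.ndarray, and it assumes that edges have the form (p, q, w) with nodes numbered 0 to n − 1.

```python
def count_nodes(edgelist):
    """Number of distinct nodes that appear in any edge."""
    nodes = set()
    for p, q, _w in edgelist:
        nodes.update((p, q))
    return len(nodes)


def to_adjacency(edgelist, dim=None, directed=False):
    """Build a dim x dim weight matrix from (p, q, w) edges.

    Assumption: nodes are integer indices in range(dim).
    """
    if dim is None:
        dim = count_nodes(edgelist)
    matrix = [[0.0] * dim for _ in range(dim)]
    for p, q, w in edgelist:
        matrix[p][q] = w
        if not directed:
            # Undirected graphs get a symmetric matrix.
            matrix[q][p] = w
    return matrix


edges = [(0, 1, 0.5), (1, 2, 2.0)]
n_nodes = count_nodes(edges)
undirected = to_adjacency(edges)
directed = to_adjacency(edges, directed=True)
```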

dscript.glider.glide_compute_map(pos_df, thres_p=0.9, params={})[source]¶

Return glide_mat and glide_map.

Parameters

pos_df (pd.DataFrame) – Dataframe of weighted edges
thres_p (float) – Threshold to treat an edge as positive
params (dict) – Parameters for GLIDE

Returns

glide_matrix and corresponding glide_map

Return type

tuple(np.ndarray, dict)

dscript.glider.glide_predict_links(edgelist, X, params={}, thres_p=0.9)[source]¶

Predicts the most likely links in a graph given an embedding X of a graph. Returns a ranked list of (edges, distances) sorted from closest to furthest.

Parameters

edgelist – A list with elements of type (p, q, wt)
X – An n × k embedding matrix
params – A dictionary with the following entries:

alpha – Real number
beta – Real number
delta – Real number
loc – String selecting the local score: cw for common weighted, l3 for L3 local scoring
ctypes_on – True; add this key only if ctypes is enabled, and omit it otherwise
so_location – String location of the .so dynamic library; used when ctypes is enabled
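A params dictionary following the entries above might look like the sketch below. The key names come from this page; the numeric values and the library path are placeholders chosen for illustration only, not recommended or default settings.

```python
# Placeholder values for illustration; not defaults taken from dscript.
params = {
    "alpha": 0.1,   # real number
    "beta": 1000.0, # real number
    "delta": 1.0,   # real number
    "loc": "cw",    # "cw" (common weighted) or "l3" (L3 local scoring)
}

# Only when ctypes is enabled: add both ctypes keys.
# The .so path below is a hypothetical location.
ctypes_params = dict(params, ctypes_on=True, so_location="/path/to/glide.so")
```

Keeping the ctypes keys in a separate dictionary makes it harder to leave ctypes_on in place by accident when ctypes is not in use.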

dscript.glider.glider_score(p, q, glider_map, glider_mat)[source]¶

dscript.utils¶

dscript.utils.RBF(D, sigma=None)[source]¶

Convert distance matrix into similarity matrix using Radial Basis Function (RBF) Kernel.

\(RBF(x,x') = \exp{\frac{-(x - x')^{2}}{2\sigma^{2}}}\)

Parameters

D (np.ndarray) – Distance matrix
sigma (float) – Bandwidth of RBF Kernel [default: \(\sqrt{\text{max}(D)}\)]

Returns

Similarity matrix

Return type

np.ndarray
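The kernel formula can be applied elementwise to a distance matrix. The sketch below is a dependency-free illustration, not the dscript function: rbf_similarity is a hypothetical name, nested lists stand in for np.ndarray, and each entry of D is assumed to play the role of the distance (x − x') in the formula.

```python
import math


def rbf_similarity(distances, sigma=None):
    """Apply exp(-d^2 / (2 * sigma^2)) to every entry d of the matrix."""
    if sigma is None:
        # Documented default: the square root of the largest distance.
        sigma = math.sqrt(max(max(row) for row in distances))
    return [
        [math.exp(-(d ** 2) / (2 * sigma ** 2)) for d in row]
        for row in distances
    ]


D = [[0.0, 2.0], [2.0, 0.0]]
S = rbf_similarity(D)
```

A distance of zero maps to a similarity of exactly 1, and larger distances decay toward 0; a larger sigma makes that decay slower.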

dscript.utils.augment_data(df)[source]¶

For all pairs (A B), also add pairs (B A).

Parameters: df (pd.DataFrame) – Data frame with 3 columns: pair1, pair2, label
Returns: Augmented data frame
Return type: pd.DataFrame
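The augmentation amounts to appending a copy of every row with the two pair columns swapped. The sketch below shows the idea on plain tuples rather than a pd.DataFrame; augment_pairs is a hypothetical name, and whether the library removes duplicate rows afterwards is not stated here, so this sketch keeps them.

```python
def augment_pairs(rows):
    """Given (pair1, pair2, label) rows, also add (pair2, pair1, label)."""
    swapped = [(b, a, label) for a, b, label in rows]
    # Assumption: the swapped rows are appended after the originals.
    return list(rows) + swapped


rows = [("P1", "P2", 1), ("P3", "P4", 0)]
augmented = augment_pairs(rows)
```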

dscript.utils.config_logger(file, fmt, level=2, use_stdout=True)[source]¶

dscript.utils.get_local_or_download(destination: str, source: Optional[str] = None)[source]¶

Return file path destination, and if it does not exist download from source.

Parameters

destination (str) – Destination path for downloaded file
source (str) – URL to download file from

Returns

Path of local file

Return type

str
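The behaviour described above is a check-then-fetch pattern. The following is a minimal standard-library sketch of that pattern, not the dscript implementation: local_or_download is a hypothetical name, and raising ValueError when the file is missing and no source is given is an assumption.

```python
import os
import urllib.request


def local_or_download(destination, source=None):
    """Return destination, downloading it from source first if it is missing."""
    if not os.path.exists(destination):
        if source is None:
            # Assumption: a missing file with no URL is treated as an error.
            raise ValueError(f"{destination} does not exist and no source was given")
        urllib.request.urlretrieve(source, destination)
    return destination
```

Because the existence check comes first, repeated calls with the same destination touch the network at most once.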

dscript.utils.load_hdf5_parallel(file_path, keys, n_jobs=-1)[source]¶

Load keys from hdf5 file into memory

Parameters

file_path (str) – Path to hdf5 file
keys (list[str]) – List of keys to get

Returns

Dictionary with keys and records in memory

Return type

dict

dscript.utils.plot_eval_predictions(labels, predictions, path='figure')[source]¶

Plot histogram of positive and negative predictions, precision-recall curve, and receiver operating characteristic curve.

Parameters

labels (np.ndarray) – Labels
predictions (np.ndarray) – Predicted probabilities
path (str) – File prefix for plots to be saved to [default: figure]