RDKit
Open-source cheminformatics and machine learning.
#include <MHFP.h>
Public Member Functions | |
MHFPEncoder (unsigned int n_permutations=2048, unsigned int seed=42) | |
Constructor. | |
std::vector< uint32_t > | FromStringArray (const std::vector< std::string > &vec) |
Creates a MinHash from a vector of strings. | |
std::vector< uint32_t > | FromArray (const std::vector< uint32_t > &vec) |
Creates a MinHash from a list of unsigned integers. | |
std::vector< std::string > | CreateShingling (const ROMol &mol, unsigned char radius=3, bool rings=true, bool isomeric=false, bool kekulize=false, unsigned char min_radius=1) |
Creates a molecular shingling based on circular substructures. | |
std::vector< std::string > | CreateShingling (const std::string &smiles, unsigned char radius=3, bool rings=true, bool isomeric=false, bool kekulize=false, unsigned char min_radius=1) |
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts. | |
std::vector< uint32_t > | Encode (ROMol &mol, unsigned char radius=3, bool rings=true, bool isomeric=false, bool kekulize=false, unsigned char min_radius=1) |
Creates a MinHash vector from a molecule. | |
std::vector< std::vector< uint32_t > > | Encode (std::vector< ROMol > &mols, unsigned char radius=3, bool rings=true, bool isomeric=false, bool kekulize=false, unsigned char min_radius=1) |
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts. | |
std::vector< uint32_t > | Encode (std::string &smiles, unsigned char radius=3, bool rings=true, bool isomeric=false, bool kekulize=false, unsigned char min_radius=1) |
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts. | |
std::vector< std::vector< uint32_t > > | Encode (std::vector< std::string > &smiles, unsigned char radius=3, bool rings=true, bool isomeric=false, bool kekulize=false, unsigned char min_radius=1) |
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts. | |
ExplicitBitVect | EncodeSECFP (ROMol &mol, unsigned char radius=3, bool rings=true, bool isomeric=false, bool kekulize=false, unsigned char min_radius=1, size_t length=2048) |
Creates a binary fingerprint based on circular sub-SMILES. | |
std::vector< ExplicitBitVect > | EncodeSECFP (std::vector< ROMol > &mols, unsigned char radius=3, bool rings=true, bool isomeric=false, bool kekulize=false, unsigned char min_radius=1, size_t length=2048) |
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts. | |
ExplicitBitVect | EncodeSECFP (std::string &smiles, unsigned char radius=3, bool rings=true, bool isomeric=false, bool kekulize=false, unsigned char min_radius=1, size_t length=2048) |
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts. | |
std::vector< ExplicitBitVect > | EncodeSECFP (std::vector< std::string > &smiles, unsigned char radius=3, bool rings=true, bool isomeric=false, bool kekulize=false, unsigned char min_radius=1, size_t length=2048) |
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts. | |
Static Public Member Functions | |
static double | Distance (const std::vector< uint32_t > &a, const std::vector< uint32_t > &b) |
Calculates the Hamming distance between two MHFP fingerprints. | |
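As a rough illustration of a Hamming-style distance between two MinHash vectors of equal length, the sketch below counts the positions where the vectors differ and divides by the vector length. The function name `sketchDistance` and the normalization are assumptions made for this example; this is not the library's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper for illustration only (not part of RDKit).
// Returns the fraction of positions at which the two vectors differ.
double sketchDistance(const std::vector<std::uint32_t> &a,
                      const std::vector<std::uint32_t> &b) {
  // Vectors built with different numbers of permutations are not comparable.
  assert(a.size() == b.size() && !a.empty());
  std::size_t mismatches = 0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    if (a[i] != b[i]) {
      ++mismatches;
    }
  }
  return static_cast<double>(mismatches) / static_cast<double>(a.size());
}
```

With this normalization, identical vectors give 0.0 and vectors that differ at every position give 1.0.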
RDKit::MHFPFingerprints::MHFPEncoder::MHFPEncoder (unsigned int n_permutations=2048, unsigned int seed=42)
Constructor.
Constructs an MHFPEncoder.
The MHFPEncoder class is instantiated with a given number of permutations and a seed. Fingerprints (MinHashes) created with a different number of permutations or a different seed are not compatible and cannot be meaningfully compared with each other.
n_permutations | the number of permutations used to create hash functions. This will be the dimensionality of the resulting vector. Default: 2048 . |
seed | a random seed. Default: 42 . |
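To see why the number of permutations and the seed must match, consider the minimal MinHash sketch below. It assumes a hash family of the form (a * x + b) mod p with a and b drawn from a seeded generator; the class name `SketchMinHasher`, the prime, and the generator are assumptions made for illustration and are not taken from the library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <random>
#include <vector>

// Assumed modulus for the illustrative hash family (a Mersenne prime).
constexpr std::uint64_t kSketchPrime = (1ULL << 61) - 1;

// Hypothetical stand-in for illustration only (not RDKit's MHFPEncoder).
class SketchMinHasher {
 public:
  explicit SketchMinHasher(unsigned int n_permutations = 2048,
                           unsigned int seed = 42) {
    assert(n_permutations > 0);
    std::mt19937 rng(seed);  // the seed fixes every (a, b) pair
    for (unsigned int i = 0; i < n_permutations; ++i) {
      a_.push_back(static_cast<std::uint64_t>(rng()) | 1u);  // keep a nonzero
      b_.push_back(static_cast<std::uint64_t>(rng()));
    }
  }

  // Keeps one minimum per hash function, so the result has
  // n_permutations entries regardless of the input size.
  std::vector<std::uint32_t> fromArray(
      const std::vector<std::uint32_t> &vec) const {
    std::vector<std::uint32_t> out(
        a_.size(), std::numeric_limits<std::uint32_t>::max());
    for (std::uint32_t x : vec) {
      for (std::size_t i = 0; i < a_.size(); ++i) {
        // a, b and x are all below 2^32, so a * x + b fits in 64 bits.
        const std::uint64_t h = (a_[i] * x + b_[i]) % kSketchPrime;
        const std::uint32_t h32 = static_cast<std::uint32_t>(h & 0xFFFFFFFFu);
        if (h32 < out[i]) {
          out[i] = h32;
        }
      }
    }
    return out;
  }

 private:
  std::vector<std::uint64_t> a_;
  std::vector<std::uint64_t> b_;
};
```

Two encoders built with the same arguments produce identical vectors for the same input. Changing the seed changes every hash function, so position i of one vector no longer corresponds to position i of the other.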
std::vector< std::string > RDKit::MHFPFingerprints::MHFPEncoder::CreateShingling (const ROMol &mol, unsigned char radius=3, bool rings=true, bool isomeric=false, bool kekulize=false, unsigned char min_radius=1)
Creates a molecular shingling based on circular substructures.
A molecular shingling is a vector of SMILES that were extracted from and represent a molecule. This method extracts substructures centered at each atom of the molecule with different radii. A molecule with 10 atoms will generate 10 * 3 shingles when a radius of 3 is chosen.
radius | the maximum radius of the substructure that is generated at each atom. Default: 3 . |
rings | whether the rings (SSSR) are extracted from the molecule and added to the shingling. Given the molecule "C1CCCCCC1C(=O)C" , "C1CCCCCC1" would be added to the shingling. Default: true . |
isomeric | whether the SMILES added to the shingling are isomeric. Default: false . |
kekulize | whether the SMILES added to the shingling are kekulized. Default: false . NOTE that this will throw an exception if the molecule cannot be kekulized. |
min_radius | the minimum radius that is used to extract n-grams. Default: 1 . |
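The shingle count quoted above can be worked out as a small calculation. The sketch assumes one shingle per atom for every radius from min_radius to radius inclusive, plus one shingle per extracted ring; the function name `nominalShingleCount` and the `n_rings` argument are hypothetical. The result is a nominal count, since the actual number may differ for a particular molecule.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper for illustration only (not part of RDKit).
// Nominal count: one shingle per atom and radius, plus one per ring.
std::size_t nominalShingleCount(std::size_t n_atoms,
                                unsigned char radius = 3,
                                unsigned char min_radius = 1,
                                std::size_t n_rings = 0) {
  assert(min_radius <= radius);
  // Radii min_radius, ..., radius are each used once per atom.
  const std::size_t radii_per_atom =
      static_cast<std::size_t>(radius - min_radius) + 1;
  return n_atoms * radii_per_atom + n_rings;
}
```

For 10 atoms with the default radius of 3 and min_radius of 1, this gives 10 * 3 = 30, in line with the example in the description.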
std::vector< std::string > RDKit::MHFPFingerprints::MHFPEncoder::CreateShingling (const std::string &smiles, unsigned char radius=3, bool rings=true, bool isomeric=false, bool kekulize=false, unsigned char min_radius=1)
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
std::vector< uint32_t > RDKit::MHFPFingerprints::MHFPEncoder::Encode (ROMol &mol, unsigned char radius=3, bool rings=true, bool isomeric=false, bool kekulize=false, unsigned char min_radius=1)
Creates a MinHash vector from a molecule.
This method is a wrapper around MHFPEncoder::CreateShingling and MHFPEncoder::FromStringArray. When a vector of molecules or SMILES is passed and RDKit was compiled with OpenMP, the encoding is parallelized; the speedup can approach the number of available cores.
radius | the maximum radius of the substructure that is generated at each atom. Default: 3 . |
rings | whether the rings (SSSR) are extracted from the molecule and added to the shingling. Given the molecule "C1CCCCCC1C(=O)C" , "C1CCCCCC1" would be added to the shingling. Default: true . |
isomeric | whether the SMILES added to the shingling are isomeric. Default: false . |
kekulize | whether the SMILES added to the shingling are kekulized. Default: false . NOTE that this will throw an exception if the molecule cannot be kekulized. |
min_radius | the minimum radius that is used to extract n-grams. Default: 1 . |
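The batch behaviour described above follows a common pattern: each input is encoded independently, so the loop over inputs can be split across threads. The sketch below shows that pattern with a placeholder per-item function. Both `encodeOne` and `encodeBatch` are hypothetical names, and the placeholder does not compute a real fingerprint. The pragma is guarded so the code compiles and runs serially when OpenMP is not enabled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Placeholder for a single-item encoding step (illustration only).
// It simply maps each character to an integer.
std::vector<std::uint32_t> encodeOne(const std::string &smiles) {
  std::vector<std::uint32_t> out;
  out.reserve(smiles.size());
  for (unsigned char c : smiles) {
    out.push_back(static_cast<std::uint32_t>(c));
  }
  return out;
}

// Encodes every input independently. Each iteration writes only to its
// own slot of `results`, so iterations can safely run in parallel.
std::vector<std::vector<std::uint32_t>> encodeBatch(
    const std::vector<std::string> &smiles) {
  std::vector<std::vector<std::uint32_t>> results(smiles.size());
  const long long n = static_cast<long long>(smiles.size());
#ifdef _OPENMP
#pragma omp parallel for
#endif
  for (long long i = 0; i < n; ++i) {
    const std::size_t idx = static_cast<std::size_t>(i);
    results[idx] = encodeOne(smiles[idx]);
  }
  return results;
}
```

Because the output vector is sized up front and indexed by position, the results keep the input order whether or not the loop runs in parallel.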
std::vector< uint32_t > RDKit::MHFPFingerprints::MHFPEncoder::Encode (std::string &smiles, unsigned char radius=3, bool rings=true, bool isomeric=false, bool kekulize=false, unsigned char min_radius=1)
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
std::vector< std::vector< uint32_t > > RDKit::MHFPFingerprints::MHFPEncoder::Encode (std::vector< ROMol > &mols, unsigned char radius=3, bool rings=true, bool isomeric=false, bool kekulize=false, unsigned char min_radius=1)
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
std::vector< std::vector< uint32_t > > RDKit::MHFPFingerprints::MHFPEncoder::Encode (std::vector< std::string > &smiles, unsigned char radius=3, bool rings=true, bool isomeric=false, bool kekulize=false, unsigned char min_radius=1)
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
ExplicitBitVect RDKit::MHFPFingerprints::MHFPEncoder::EncodeSECFP (ROMol &mol, unsigned char radius=3, bool rings=true, bool isomeric=false, bool kekulize=false, unsigned char min_radius=1, size_t length=2048)
Creates a binary fingerprint based on circular sub-SMILES.
Creates a binary fingerprint similar to ECFP. However, instead of using a Morgan-style hashing, circular n-grams (sub-SMILES) are created, hashed directly and folded.
radius | the maximum radius of the substructure that is generated at each atom. Default: 3 . |
rings | whether the rings (SSSR) are extracted from the molecule and added to the shingling. Given the molecule "C1CCCCCC1C(=O)C" , "C1CCCCCC1" would be added to the shingling. Default: true . |
isomeric | whether the SMILES added to the shingling are isomeric. Default: false . |
kekulize | whether the SMILES added to the shingling are kekulized. Default: false . NOTE that this will throw an exception if the molecule cannot be kekulized. |
min_radius | the minimum radius that is used to extract n-grams. Default: 1 . |
length | the length into which the fingerprint is folded. Default: 2048 . |
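The "hashed directly and folded" step can be sketched as follows: each sub-SMILES string is hashed to a 32-bit integer, and the hash modulo length selects the bit to set. The choice of FNV-1a as the string hash, the use of `std::vector<bool>` in place of ExplicitBitVect, and the function names are all assumptions made for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// 32-bit FNV-1a string hash (assumed here; any stable hash would do).
std::uint32_t fnv1a32(const std::string &s) {
  std::uint32_t h = 2166136261u;  // FNV offset basis
  for (unsigned char c : s) {
    h ^= c;
    h *= 16777619u;  // FNV prime; wraps modulo 2^32
  }
  return h;
}

// Hypothetical helper for illustration only (not part of RDKit).
// Folds a shingling into a fixed-length bit vector.
std::vector<bool> foldShingling(const std::vector<std::string> &shingling,
                                std::size_t length = 2048) {
  assert(length > 0);
  std::vector<bool> bits(length, false);
  for (const std::string &s : shingling) {
    bits[fnv1a32(s) % length] = true;  // collisions share a bit
  }
  return bits;
}
```

A smaller length makes collisions between distinct sub-SMILES more likely, which is the usual trade-off of a folded fingerprint.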
ExplicitBitVect RDKit::MHFPFingerprints::MHFPEncoder::EncodeSECFP (std::string &smiles, unsigned char radius=3, bool rings=true, bool isomeric=false, bool kekulize=false, unsigned char min_radius=1, size_t length=2048)
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
std::vector< ExplicitBitVect > RDKit::MHFPFingerprints::MHFPEncoder::EncodeSECFP (std::vector< ROMol > &mols, unsigned char radius=3, bool rings=true, bool isomeric=false, bool kekulize=false, unsigned char min_radius=1, size_t length=2048)
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
std::vector< ExplicitBitVect > RDKit::MHFPFingerprints::MHFPEncoder::EncodeSECFP (std::vector< std::string > &smiles, unsigned char radius=3, bool rings=true, bool isomeric=false, bool kekulize=false, unsigned char min_radius=1, size_t length=2048)
This is an overloaded member function, provided for convenience. It differs from the above function only in what argument(s) it accepts.
std::vector< uint32_t > RDKit::MHFPFingerprints::MHFPEncoder::FromArray (const std::vector< uint32_t > &vec)
Creates a MinHash from a list of unsigned integers.
This method is exposed in order to enable advanced usage of this MHFP implementation such as MinHashing a sparse array generated by another fingerprint (e.g. Morgan / ECFP).
vec | a vector containing unsigned integers. |
std::vector< uint32_t > RDKit::MHFPFingerprints::MHFPEncoder::FromStringArray (const std::vector< std::string > &vec)
Creates a MinHash from a vector of strings.
This method is exposed in order to enable advanced usage of this MHFP implementation such as customizing the properties that are hashed in order to create an MHFP instance. In theory, any number of values that can be represented as strings can be minhashed. This method is called by MHFPEncoder::Encode.
vec | a vector containing strings (e.g. the SMILES shingling of a molecule). |