1. Input-output operations

chython.files subpackage contains file readers and writers classes.

1.1. MDL RDF reader

RDFRead class can be used for RDF files reading. Instance of this class is file-like object which support iteration, has a method read() for parsing all data and context manager.

1.1.1. Read file from disk

[1]:
from chython.files import * # import all available readers and writers

with RDFRead('example.rdf') as f:
    first = next(f)  # get first reaction using generator
    data = f.read()  # read remaining reactions to list of ReactionContainers

data = []
with RDFRead('example.rdf') as f:
    for r in f:  # looping is supported. Useful for large files.
        data.append(r)

OOP-stype Pathlib supported

[2]:
from pathlib import Path

with RDFRead(Path('example.rdf')) as r: # OOP style call
    r = next(r)

opened files supported

RDF file should be opened in text mode

[3]:
with open('example.rdf') as f, RDFRead(f) as r:
    r = next(r) # OOP style application

1.1.2. Transparent loading from archives and network

Readers designed transparently support any type of data sources.

Data sources should be file-like objects.

[4]:
from requests import get
from io import StringIO

# get function return requested URL which has attribute text.
# in example this text is whole RDF stored in single string.
# RDFread does not support parsing of strings, but one can emulate files with data
# instead of strings by using io.StringIO
with StringIO(get('https://github.com/chython/chython/raw/master/doc/tutorial/example.rdf').text) as f, RDFRead(f) as r:
    r = next(r)

# python support gzipped data. This example shows how to work with compressed
# data directly without decompressing them to disk.
from gzip import open as gzip_open
with gzip_open('example.rdf.gz', 'rt') as f, RDFRead(f) as r:
    r = next(r)

# zip-files also supported out of the box
# zipped files can be opened only in binary mode. io.TextIOWrapper can be used for transparent decoding them into text
from zipfile import ZipFile
from io import TextIOWrapper
with ZipFile('example.zip') as z, z.open('example.rdf') as c:
    with TextIOWrapper(c) as f, RDFRead(f) as r:
        r = next(r)

# tar-file reading example
from tarfile import open as tar_open
from io import TextIOWrapper
with tar_open('example.tar.gz') as t:
    c = t.extractfile('example.rdf')
    with TextIOWrapper(c) as f, RDFRead(f) as r:
        r = next(r)

1.2. Other Readers

  • SDFRead - MOL, SDF files reader (versions v2000, v3000 are supported)

  • MRVRead - ChemAxon MRV files reader (lxml parser is used)

  • SMILESRead - SMILES strings files reader (coho backend used). Every row should start with new SMILES

  • INCHIRead - INCHI strings files reader (INCHI trust backend used). Every row should start with new InChI

  • XYZRead - xyz files reader (only structures with explicit hydrogens supported)

  • PDBRead - PDB files parser (only structures with explicit hydrogens supported)

All files except MRV should be opened in text-mode
MRV requires binary mode open('/path/to/data.mrv', 'rb')
[5]:
with MRVRead(open('example.mrv', 'rb')) as f:
    m = next(f)

1.3. File writers

Export in following file formats is supported:

  • RDFWrite (v2000) - molecules and reactions export in RDF format

  • SDFWrite (v2000) - molecules export in SDF format

  • ERDFWrite (v3000) - molecules and reactions export in RDF format

  • ESDFWrite (v3000) - molecules export in SDF format

  • MRVWrite - molecules and reactions export in MRV format

Writers have the same API as readers. All writers work with text-files Writers have write method which accepts as argument single reaction or molecule object

[6]:
with RDFWrite('out.rdf') as f: # context manager supported
    for r in data:
        f.write(r)
# file out.rdf will be overriden
[7]:
f = RDFWrite('out.rdf') # ongoing writing into a single file
for r in data:
    f.write(r)

f.write(r)
f.close() # close file. Flushes Python writer buffers.

1.4. Pickle support

Chython containers fully support pickle dumping and loading.

Pickle dumps are more fast than common files and could be used as temporal storage.

[8]:
from pickle import loads, dumps
[9]:
loads(dumps(r)) # load reaction from Pickle dump
[9]:
../_images/tutorial_notebook_15_0.svg

1.5. Chython binary format (chython pack)

Chython introduce new effective format for molecules and reactions, which combine benefits from MDL and SMILES formats. Molecules store 2d-coordinates; tetrahedron, allene and cis-trans stereo; explicit bonds, implicit hydrogen count, atom numbers, radical mark, charge, isotope.

Size only 1.5-2 times larger than SMILES. Parsing speed is faster than pickle. Full specification described in source code.

[10]:
from chython import MoleculeContainer, ReactionContainer

b = r.pack()
r = ReactionContainer.unpack(b)

# same for molecules
# MoleculeContainer.unpack(m.pack())

1.6. Metadata access

RDF, SDF, etc - files have metadata which stored in molecules and reactions objects

[11]:
r = next(RDFRead('example.rdf'))
r.meta # dictionary for molecule/reaction properties storage. For example, DTYPE/DATUM fields of RDF file are read into this dictionary
[11]:
{'CdId': '1872',
 'solvent': '3',
 'temperature': '129.5',
 'tabulated_constant': '-6.87'}
[12]:
r.name  # string with reaction title from RDF
[12]:
'reaction title'
[13]:
r.reactants[0].name # string with reactant molecule title from MOL
[13]:
'molecule title'

1.7. Depiction into SVG

[14]:
r.depict()[:100] # show only part of string
[14]:
'<svg width="18.32cm" height="3.39cm" viewBox="-0.62 -1.70 18.32 3.39" xmlns="http://www.w3.org/2000/'
[15]:
r # Notebooks supported!
[15]:
../_images/tutorial_notebook_24_0.svg
[16]:
from chython import depict_settings

depict_settings(aam=False)  # configure depiction
[17]:
r.flush_cache()  # drop cached depiction
r
[17]:
../_images/tutorial_notebook_26_0.svg
[18]:
depict_settings()  # restore defaults

1.8. String parsers

[19]:
from chython import smiles, smarts, mdl_mol, xyz, inchi
[20]:
m = smiles('CCO')
m.clean2d()
m
[20]:
../_images/tutorial_notebook_30_0.svg
[21]:
m = inchi('InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3')
m.clean2d()
m
[21]:
../_images/tutorial_notebook_31_0.svg
[22]:
m = mdl_mol('''
  Mrv2115 04202210182D

  3  2  0  0  0  0            999 V2000
    1.2375   -0.7145    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9520   -1.1270    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.6664   -0.7145    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  2  3  1  0  0  0  0
M  END''')
m
[22]:
../_images/tutorial_notebook_32_0.svg
[23]:
m = xyz((('O', 0., 0., 0.), ('H', 1., 0., 0.), ('H', 0., 1., 0.)))
m.clean2d()
m
[23]:
../_images/tutorial_notebook_33_0.svg

1.8.1. SMARTS

Only limited features list supported.

  • stereo ignored.

  • only D, a, h, r and !R atom primitives supported.

  • bond order list and not bond supported.

  • [not]ring bond supported only in combination with explicit bonds, not bonds and bonds orders lists.

  • mapping, charge and isotopes supported.

  • list of elements supported.

  • A - treats as any element. A-primitive (aliphatic) ignored.

  • M - treats as any metal..

  • &-logic operator unsupported.

  • ;-logic operator is mandatory except for charge, isotope, stereo marks. however preferable.

  • CXSMARTS radicals supported.

  • hybridization and heteroatoms count in CXSMARTS atomProp notation as and keys supported.

For example:

``[C;r5,r6;a]-;!@[C;h0,h1] |^1:1,atomProp:1.hyb.32:1.het.0|`` - aromatic C member of 5 or 6 atoms ring connected with non-ring single bond to SP3 or SP2 radical C with 0 or 1 hydrogen and no heteroatom neighbors.
[24]:
q = smarts('[C;r5,r6;a]-;!@[C;h0,h1] |^1:1,atomProp:1.hyb.32:1.het.0|')
print(q)  # canonic atoms order!
[C;r5,r6;a]-;!@[C;h0,h1] |^1:1,atomProp:1.hyb.32:1.het.0|

2. Signatures and duplicates selection

2.1. Molecule Signatures

MoleculeContainer has methods for unique molecule signature generation. Signature is SMILES string with canonical atoms ordering. Order of atoms calculated by Morgan-like algorithm.

For signature generation one need to call str function on MoleculeContainer object. Fixed length hash of signature could be retrieved by calling bytes function on molecule (correspond to SHA 512 bitstring).

Next string formatting keys supported:

a - Generate asymmetric closures. !s - Disable stereo marks. A - Use aromatic bonds instead aromatic atoms. m - Set atom mapping. r - Generate random-ordered smiles. h - Show implicit hydrogens. !b - Disable bonds tokens.

[25]:
from chython import smiles  # smiles string parser

m = smiles('c1ccccc1C=2C=CC=CC=2[C@H](O)C')
str(m) # signature
bytes(m) # cryptographic signature hash
hash(m) # runtime-dependent signature hash. See Python str hash behavior

print(m)
print(f'f string {m}')  # use signature in string formatting
print('C-style string %s' % m)
print('format method {}'.format(m))
print(f'{m:A}')
print(f'{m:h!b}') # combination supported
c1ccccc1C=2C=CC=CC=2[C@H](O)C
f string c1ccccc1C=2C=CC=CC=2[C@H](O)C
C-style string c1ccccc1C=2C=CC=CC=2[C@H](O)C
format method c1ccccc1C=2C=CC=CC=2[C@H](O)C
C:1:C:C:C:C:C:1C=2C=CC=CC=2[C@H](O)C
[c]1[c][c][c][c][c]1[C]2[CH][CH][CH][CH][C]2[C@H]([OH])[CH3]

Molecules comparable and hashable

Comparison of MoleculeContainer is based on its signatures. Moreover, since strings in Python are hashable, MoleculeContaier also hashable.

NOTE: MoleculeContainer can be changed. This can lead to unobvious behavior of the sets and dictionaries in which these molecules were placed before the change. Avoid changing molecules (standardize, aromatize, hydrogens and atoms/bonds changes) placed inside sets and dictionaries.

[26]:
m != smiles('c1ccccc1')
[26]:
True
[27]:
# Simplest way to exclude duplicated structures
len({m, m, smiles('c1ccccc1')}) == 2 # create set of unique molecules. Only 2 of them were different.
[27]:
True

2.2. Reaction signatures

ReactionContainer have its signature. Signature is SMIRKS string in which molecules of reactants, reagents, products presented in canonical order.

API is the same as for molecules

Next extra formatting keys supported:

!c - Keep nested containers order !C - skip cxsmiles fragments contract

[28]:
print(r)
C(C(=O)O)(=O)O.C(C)O.C(C)O>>CCOC(=O)C(OCC)=O
[29]:
format(r, '!c')
[29]:
'C(C)O.C(C(=O)O)(=O)O.C(C)O>>CCOC(=O)C(OCC)=O'

3. Structure standardization

3.1. Molecules

MoleculeContainer has standardize, kekule, thiele, neutralize, implicify_hydrogens, explicify_hidrogens and canonicalize methods.

Method thiele transforms Kekule representation of rings into aromatized. Method standardize applies functional group standardization rules to molecules (more than 80 rules).

Method canonicalize apply set of methods: neutralize, standardize, kekule, implicify_hydrogens, thiele

[30]:
m = smiles('c1ccccc1N(=O)=O')
m.clean2d()  # calculate 2d layout
m #  depict
[30]:
../_images/tutorial_notebook_45_0.svg
[31]:
m.kekule()  # transform to kekule form
m
[31]:
../_images/tutorial_notebook_46_0.svg
[32]:
m.standardize() # fix groups
m
[32]:
../_images/tutorial_notebook_47_0.svg
[33]:
m.thiele() # transform to aromatized form
m
[33]:
../_images/tutorial_notebook_48_0.svg
[34]:
m = smiles('[NH3+]CC(=O)[O-]')
m.clean2d()
m.neutralize() # fix zwitter-ions
m
[34]:
../_images/tutorial_notebook_49_0.svg

Molecules has explicify_hydrogens and implicify_hydrogens methods to handle hydrogens.

This methods is used to add or remove hydrogens in molecule.

Note implicify_hydrogens working for aromatic rings only in kekule form. explicify_hydrogens for aromatized forms required kekule and optionally thiele procedures applied before.

[35]:
print(m.explicify_hydrogens()) # return number of added hydrogens
m.clean2d()
m
5
[35]:
../_images/tutorial_notebook_51_1.svg
[36]:
m.implicify_hydrogens()
m
[36]:
../_images/tutorial_notebook_52_0.svg

To use GPU for AAM calculations, specify device:

import chython
chython.torch_device = 'cuda'

Note: reset_mapping loads torch neural network once. So, it is impossible to change device on the fly. Do it before first call of reset_mapping! To parallelize AAM with multiprocessing, call reset_mapping only in workers, to avoid bottleneck with single GPU model.

[37]:
m = smiles('C=N=Cc1ccccc1')
print('errors:', m.check_valence()) # atoms with valence problems. aromatic rings should be kekulized (canonicaqlized) to check problems
m.canonicalize()
print('errors:', m.check_valence())
m.clean2d()
m
errors: [2, 4, 5, 6, 7, 8, 9]
errors: [2]
[37]:
../_images/tutorial_notebook_54_1.svg

3.2. Reactions

ReactionContainer has same methods as molecules. In this case they are applied to all molecules in reaction.

explicify_hydrogen method try to keep atom-to-atom mapping.

Reaction specific methods:

  • remove_reagents - move reactants to reagents. based on atom-to-atom mapping.

  • contract_ions - merge ions in single multicomponent molecule.

  • reset_mapping - perfom atom-to-atom mapping. Required chytorch-rxnmap package.

  • fix_mapping - rule based atom-to-atom mapping fix for known mistakes.

[38]:
r = smiles('[Na+:1].[OH-:2].[CH3:7][O:5][C:4]([CH3:3])=[O:6]>>[CH3:3][C:4]([OH:8])=[O:6]') # mapping required
r.clean2d()
r
[38]:
../_images/tutorial_notebook_56_0.svg
[39]:
r.contract_ions()
r
[39]:
../_images/tutorial_notebook_57_0.svg
[40]:
r.remove_reagents(keep_reagents=True)
r
[40]:
../_images/tutorial_notebook_58_0.svg
[41]:
r.explicify_hydrogens()
r.clean2d()
r
[41]:
../_images/tutorial_notebook_59_0.svg
[42]:
r = smiles('OC(=O)C(=C)C=C.C=CC#N>>OC(=O)C1=CCCC(C1)C#N')
r.clean2d()
r.reset_mapping()
r
[42]:
../_images/tutorial_notebook_60_0.svg

4. Isomorphism

4.1. Molecules Isomorphism

Chython has simple substructure/structure isomorphism API.

Note, that atoms are matched in subgraph isomorphism only if they have same charge/multiplicity and isotope options.

[43]:
benzene = smiles('c1ccccc1')
toluene = smiles('c1ccccc1C')
# isomorphism operations
print(benzene < toluene)  # benzene is substructure of toluene
print(benzene > toluene)  # benzene is not superstructure of toluene
print(benzene <= toluene) # benzene is substructure/or same structure of toluene
print(benzene >= toluene) # benzene is not superstructure/or same structure of toluene
print(benzene < benzene) # benzene is not substructure of benzene. it's equal
print(benzene <= benzene)
True
False
True
False
False
True

Mappings of substructure or structure to structure can be returned using substructure.get_mapping(structure) method. Method acts as generator.

[44]:
next(benzene.get_mapping(toluene))
[44]:
{1: 6, 6: 1, 5: 2, 4: 3, 3: 4, 2: 5}
[45]:
for m in benzene.get_mapping(toluene, automorphism_filter=False):  # iterate over all possible substructure mappings
    print(m)
{1: 6, 6: 1, 5: 2, 4: 3, 3: 4, 2: 5}
{1: 6, 6: 5, 5: 4, 4: 3, 3: 2, 2: 1}
{1: 5, 6: 6, 5: 1, 4: 2, 3: 3, 2: 4}
{1: 5, 6: 4, 5: 3, 4: 2, 3: 1, 2: 6}
{1: 4, 6: 5, 5: 6, 4: 1, 3: 2, 2: 3}
{1: 4, 6: 3, 5: 2, 4: 1, 3: 6, 2: 5}
{1: 3, 6: 4, 5: 5, 4: 6, 3: 1, 2: 2}
{1: 3, 6: 2, 5: 1, 4: 6, 3: 5, 2: 4}
{1: 2, 6: 3, 5: 4, 4: 5, 3: 6, 2: 1}
{1: 2, 6: 1, 5: 6, 4: 5, 3: 4, 2: 3}
{1: 1, 6: 6, 5: 5, 4: 4, 3: 3, 2: 2}
{1: 1, 6: 2, 5: 3, 4: 4, 3: 5, 2: 6}

4.2. Queries

Queries (QueryContainer) is special objects which additionally takes into account neighbors, hybridization, hydrogen count, ring size and heteroatom neighbors count state of atoms and bond in ring state.

Queries can be generated from molecules by substructure method with as_query argument.

Few special arguments for controlling atom state available.

[46]:
m = smiles('NCC(=O)O')
carboxy = m.substructure([3, 4, 5], as_query=True, skip_neighbors_marks=False, skip_hybridizations_marks=False, skip_hydrogens_marks=False, skip_rings_sizes_marks=False)
print(carboxy)
[O;h1;D1]-[C;h0;D3]=[O;h0;D1] |atomProp:0.hyb.3:1.hyb.2:2.hyb.2|
[47]:
carboxy < m
[47]:
True
[48]:
carboxy < smiles('NCC(=O)OC') # not acid!
[48]:
False

4.2.1. Query building API

It is possible to build query and molecule objects in programming way

[49]:
from chython import QueryContainer, MoleculeContainer
from chython.containers.bonds import QueryBond
from chython.periodictable import ListElement
[50]:
q = QueryContainer() # create empty query
q.add_atom('C', neighbors=3, hybridization=2, heteroatoms=1, rings_sizes=0, hydrogens=0)
q.add_atom(ListElement(['O', 'S']), n=3) # oxygen or sulphur, with atom number 3
q.add_bond(1, 3, 2) # atoms enumerated from 1. connect first and 3rd atom by bouble bond.
print(q) # match only acyclic [tia] ketones
[C;h0;D3;!R]=[O,S] |atomProp:0.hyb.2:0.het.1|
[51]:
q < smiles('CC(=O)O')
[51]:
False
[52]:
q < smiles('CC(=S)C')
[52]:
True
[53]:
q < smiles('CC=O')
[53]:
False
[54]:
q < smiles('C1CC(=O)CC1')
[54]:
False
[55]:
q = QueryContainer()
q.add_atom('C', rings_sizes=6, hybridization=4)
q.add_atom('C', rings_sizes=6, hybridization=4)
q.add_bond(1, 2, QueryBond(1, False))  # QueryBond(order, in_ring)
print(q)  # match ring-ring linker
[C;r6;a]-;!@[C;r6;a]
[56]:
q < smiles('C1Cc2ccccc2-c2ccccc12')
[56]:
False
[57]:
q < smiles('c1ccc(cc1)-c1ccccc1')
[57]:
True
[58]:
q < smiles('C1CC(=O)CC1')
[58]:
False
[59]:
q = QueryContainer()
q.add_atom('C', rings_sizes=6, hybridization=4)
q.add_atom('C', rings_sizes=6, hybridization=4)
q.add_bond(1, 2, QueryBond(1, False))  # QueryBond(order, in_ring)
print(q)  # match ring-ring linker
[C;r6;a]-;!@[C;r6;a]
[60]:
q < smiles('C1Cc2ccccc2-c2ccccc12')
[60]:
False
[61]:
q < smiles('c1ccc(cc1)-c1ccccc1')
[61]:
True

Molecules construction API the same, except extra query attributes

[62]:
m = MoleculeContainer()
m.add_atom('C')
m.add_atom('C')
m.add_atom('O')
m.add_bond(1, 2, 1)
m.add_bond(2, 3, 2)
m.clean2d()
m
[62]:
../_images/tutorial_notebook_85_0.svg

5. Reactor

Reactor works similar to ChemAxon Reactions enumeration.

Example here presents application of it to create esters from acids and alcoholes.

First we need to construct carboxy group and alcohole matcher queries. Then, ether group need to be specified.

Atom numbers in query and patch should be mapped to each other. The same atoms should have same numbers.

[63]:
acid = QueryContainer()
acid.add_atom('C')
acid.add_atom('O', neighbors=1)
acid.add_atom('O')
acid.add_bond(1, 2, 1)
acid.add_bond(1, 3, 2)
print(acid)
[O]=[C]-[O;D1]
[64]:
alco = QueryContainer()
alco.add_atom('C', n=4, heteroatoms=1) # set atom number manually
alco.add_atom('O', 5, neighbors=1)
alco.add_bond(4, 5, 1)
print(alco)
[O;D1]-[C] |atomProp:1.het.1|
[65]:
ether = QueryContainer()
ether.add_atom('C')
ether.add_atom('O', 3)
ether.add_atom('C')
ether.add_atom('O')
ether.add_bond(1, 3, 2)
ether.add_bond(1, 5, 1)
ether.add_bond(4, 5, 1)
print(ether)
[C](=[O])-[O]-[C]
[66]:
from chython import Reactor
from chython.utils import grid_depict
from ipywidgets import HTML
[67]:
rxn = Reactor([acid, alco], [ether], delete_atoms=True, one_shot=False)
# delete atoms not presented in product query
# do multiple reactions if possible
[68]:
alcohols = [smiles('CO'), smiles('CCO'), smiles('CC(C)O')]
acids = [smiles('C(=O)O'), smiles('CC(=O)O'), smiles('OC(=O)C(=O)O')]
for x in alcohols + acids:
    x.clean2d()
[69]:
HTML(grid_depict(alcohols + acids))
[70]:
from itertools import product

products = []
for x in product(acids, alcohols):
    for p in rxn(x): # apply transformation on given list of reactants
        p.clean2d()
        products.append(p)
len(products)
[70]:
12
[71]:
products[0]
[71]:
../_images/tutorial_notebook_95_0.svg
[72]:
products[-3]
[72]:
../_images/tutorial_notebook_96_0.svg
[73]:
products[-4]
[73]:
../_images/tutorial_notebook_97_0.svg

6. Molecules and Reactions API

There are explanation of some methods

[74]:
anion, cation = smiles('[Cl-].[Na+]').split() # disconnected components can be split
print(anion, cation)
salt = anion | cation # molecules can be merged
salt = anion.union(cation, remap=True) # fix mapping overlap
salt.clean2d()
salt
[Cl-] [Na+]
[74]:
../_images/tutorial_notebook_99_1.svg
[75]:
m = toluene.substructure([1, 2, 3, 4, 5, 6]) # extraction of substructure
# set recalculate_hydrogens=False to save hydrogen count info. useful for full component extraction.
m.clean2d()
print(m.atom(1).atomic_symbol, m.atom(1).implicit_hydrogens) # aromatic structures require kekule>thiele procedure to fix hydrogens count
m.kekule() and m.thiele()
print(m.atom(1).atomic_symbol, m.atom(1).implicit_hydrogens)
m
C None
C 1
[75]:
../_images/tutorial_notebook_100_1.svg
[76]:
remapped = m.remap({1: 7}, copy=True) # change atom numbers
remapped
[76]:
../_images/tutorial_notebook_101_0.svg