Standardization¶

Canonicalization, functional group normalization, aromaticity, tautomers, and deduplication.

Canonicalize¶

canonicalize() applies the full normalization pipeline: neutralize, standardize functional groups, Kekule normalization, implicit hydrogen cleanup, aromatization, and charge standardization.

from chython import smiles

mol = smiles('C(=O)(O)c1ccccc1')

mol.canonicalize()
print(str(mol))  # canonical SMILES

# Options
mol.canonicalize(
    fix_tautomers=True,    # canonical tautomer form (default)
    keep_kekule=False,     # return Kekule instead of aromatic
    logging=False,         # return list of changes made
    ignore=True,           # skip standardization bugs
)

# With logging: returns list of (atoms, rule_id, description) tuples
log = mol.canonicalize(logging=True)
for atoms, rule_id, description in log:
    print(f'{description} at atoms {atoms}')

Functional Group Standardization¶

standardize() normalizes functional groups (nitro, sulfoxide, etc.) without changing aromaticity or tautomers. Over 80 rules applied:

mol = smiles('c1ccccc1N(=O)=O')  # nitro
mol.standardize()                 # normalizes to [N+]([O-])=O form

# With logging
log = mol.standardize(logging=True)

# Charge normalization (zwitterions)
mol.standardize_charges()

Neutralize¶

mol = smiles('[NH3+]CC(=O)[O-]')  # zwitterion
mol.neutralize()                   # removes zwitterionic charges

Aromaticity¶

mol = smiles('c1ccccc1')

# Convert aromatic (Thiele) to Kekule form
mol.kekule()
print(str(mol))  # C1=CC=CC=C1

# Convert back to aromatic form
mol.thiele()
print(str(mol))  # c1ccccc1

# Enumerate all Kekule structures
for kekule_form in mol.enumerate_kekule():
    print(str(kekule_form))

Implicit / Explicit Hydrogens¶

mol = smiles('CCO')

# Add explicit hydrogens
added = mol.explicify_hydrogens()  # returns count of added H
mol.clean2d()  # recalculate layout after adding atoms

# Remove explicit hydrogens (make implicit)
mol.implicify_hydrogens()

# Fix implicit hydrogen counts
mol.fix_structure()

implicify_hydrogens works for aromatic rings only in Kekule form. explicify_hydrogens for aromatized forms requires kekule() then optionally thiele() afterward.

Tautomers¶

mol = smiles('Oc1ccncc1')  # 4-pyridinol

# Enumerate tautomers
for tautomer in mol.enumerate_tautomers(limit=100):
    print(str(tautomer))

# Include charge-shifted forms
for tautomer in mol.enumerate_charged_tautomers(limit=100):
    print(str(tautomer))

Valence Checking¶

mol = smiles('C=N=Cc1ccccc1')

# Check for valence problems (returns list of atom numbers with issues)
errors = mol.check_valence()
print('errors:', errors)

# Aromatic rings must be kekulized first for accurate checking
mol.canonicalize()
errors = mol.check_valence()

Deduplication¶

Using Sets¶

Molecules are hashable (based on canonical SMILES), so sets remove duplicates:

mols = [smiles('CCO'), smiles('OCC'), smiles('C(O)C')]

for m in mols:
    m.canonicalize()

unique = set(mols)  # 1 molecule (all three are ethanol)

Using Canonical SMILES¶

seen = set()
unique = []
for mol in mols:
    mol.canonicalize()
    s = str(mol)
    if s not in seen:
        seen.add(s)
        unique.append(mol)

Graph Equality¶

is_equal() compares molecular graphs including all atom/bond properties:

mol1 = smiles('CCO')
mol2 = smiles('OCC')

mol1.is_equal(mol2)  # True