Validate, standardize & annotate

We’ll walk you through the following flow:

  1. define validation criteria

  2. validate & standardize metadata

  3. save validated & annotated artifacts

How do we validate metadata?

Registries in your database define the “truth” for metadata.

For instance, if “Experiment 1” has been registered as the name of a ULabel record, it is a validated value for field ULabel.name.
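
As a rough mental model, validation can be pictured as an exact membership test against the values stored in a registry field. The sketch below uses a plain Python set as a hypothetical stand-in for ULabel.name. It is not lamindb's implementation, and the names registered_ulabel_names and is_validated are invented for illustration.

```python
# Hypothetical stand-in for a registry field: the set of names already registered.
registered_ulabel_names = {"Experiment 1"}


def is_validated(value: str, registry: set) -> bool:
    """Return True if the value exactly matches a registered name."""
    return value in registry


print(is_validated("Experiment 1", registered_ulabel_names))  # True
print(is_validated("experiment 1", registered_ulabel_names))  # False: this sketch matches exactly
```

In this picture, a value counts as validated only once a matching record exists in the registry.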

!lamin init --storage ./test-annotate --schema bionty
💡 connected lamindb: testuser1/test-annotate
import lamindb as ln
import bionty as bt
import pandas as pd
import anndata as ad

ln.settings.verbosity = "hint"
💡 connected lamindb: testuser1/test-annotate

Let’s start with a DataFrame that we’d like to validate:

df = pd.DataFrame({
    "temperature": [37.2, 36.3, 38.2],
    "cell_type": ["cerebral pyramidal neuron", "astrocyte", "oligodendrocyte"],
    "assay_ontology_id": ["EFO:0008913", "EFO:0008913", "EFO:0008913"],
    "donor": ["D0001", "D0002", "DOOO3"],
})
df
   temperature                  cell_type  assay_ontology_id  donor
0         37.2  cerebral pyramidal neuron        EFO:0008913  D0001
1         36.3                  astrocyte        EFO:0008913  D0002
2         38.2            oligodendrocyte        EFO:0008913  DOOO3

Validate and standardize metadata

# define validation criteria for the categoricals
categoricals = {
    "cell_type": bt.CellType.name,
    "assay_ontology_id": bt.ExperimentalFactor.ontology_id,
    "donor": ln.ULabel.name,
}
# create an object to guide validation and annotation
annotate = ln.Annotate.from_df(df, categoricals=categoricals)
# validate
validated = annotate.validate()
validated
✅ added 3 records with Feature.name for columns: 'cell_type', 'assay_ontology_id', 'donor'
1 non-validated categories are not saved in Feature.name: ['temperature']!
      → to lookup categories, use lookup().columns
      → to save, run add_new_from_columns
💡 mapping cell_type on CellType.name
❗    found 2 terms validated terms: ['astrocyte', 'oligodendrocyte']
      → save terms via .add_validated_from('cell_type')
1 terms is not validated: 'cerebral pyramidal neuron'
      → save terms via .add_new_from('cell_type')
💡 mapping assay_ontology_id on ExperimentalFactor.ontology_id
❗    found 1 terms validated terms: ['EFO:0008913']
      → save terms via .add_validated_from('assay_ontology_id')
✅ assay_ontology_id is validated against ExperimentalFactor.ontology_id
💡 mapping donor on ULabel.name
3 terms are not validated: 'D0001', 'D0002', 'DOOO3'
      → save terms via .add_new_from('donor')
False
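
To make the logic behind the log above easier to follow, here is a minimal sketch of per-column validation in plain Python. The registry contents are hypothetical and chosen to mirror the result above. The sketch glosses over the distinction lamindb draws between terms found in a public ontology and terms already saved in the instance, and the function find_non_validated is not part of lamindb.

```python
# Categorical columns of the example DataFrame, written out as plain lists.
columns = {
    "cell_type": ["cerebral pyramidal neuron", "astrocyte", "oligodendrocyte"],
    "assay_ontology_id": ["EFO:0008913", "EFO:0008913", "EFO:0008913"],
    "donor": ["D0001", "D0002", "DOOO3"],
}

# Hypothetical registry contents, one set of accepted values per column.
registries = {
    "cell_type": {"astrocyte", "oligodendrocyte"},
    "assay_ontology_id": {"EFO:0008913"},
    "donor": set(),  # no donors registered yet
}


def find_non_validated(columns, registries):
    """Map each column name to its unique values that are missing from the registry."""
    result = {}
    for name, values in columns.items():
        allowed = registries[name]
        # dict.fromkeys removes duplicates while preserving order
        missing = [v for v in dict.fromkeys(values) if v not in allowed]
        if missing:
            result[name] = missing
    return result


non_validated = find_non_validated(columns, registries)
validated = not non_validated  # overall result is True only if nothing is missing

print(non_validated)
# {'cell_type': ['cerebral pyramidal neuron'], 'donor': ['D0001', 'D0002', 'DOOO3']}
print(validated)  # False
```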

Validate using registries in another instance

Sometimes you want to validate against registries that already exist in another instance, for example laminlabs/cellxgene.

This allows us to directly transfer values that are currently missing in our registries from the cellxgene instance.

annotate = ln.Annotate.from_df(
    df, 
    categoricals=categoricals,
    using="laminlabs/cellxgene",  # pass the instance slug
)
annotate.validate()
1 non-validated categories are not saved in Feature.name: ['temperature']!
      → to lookup categories, use lookup().columns
      → to save, run add_new_from_columns
💡 mapping cell_type on CellType.name
❗    found 2 terms validated terms: ['astrocyte', 'oligodendrocyte']
      → save terms via .add_validated_from('cell_type')
1 terms is not validated: 'cerebral pyramidal neuron'
      → save terms via .add_new_from('cell_type')
💡 mapping assay_ontology_id on ExperimentalFactor.ontology_id
❗    found 1 terms validated terms: ['EFO:0008913']
      → save terms via .add_validated_from('assay_ontology_id')
✅ assay_ontology_id is validated against ExperimentalFactor.ontology_id
💡 mapping donor on ULabel.name
3 terms are not validated: 'D0001', 'D0002', 'DOOO3'
      → save terms via .add_new_from('donor')
False
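
The idea of transferring missing values from another instance can be pictured with two sets: a local registry and a reference registry. The following sketch is only an illustration under that assumption. The registry contents and the helper transfer_missing are hypothetical and do not reflect how lamindb moves records between instances.

```python
# Our own instance starts out empty.
local_registry = set()

# Hypothetical contents of the reference instance's cell type registry.
reference_registry = {
    "astrocyte",
    "oligodendrocyte",
    "cerebral cortex pyramidal neuron",
}

values = ["cerebral pyramidal neuron", "astrocyte", "oligodendrocyte"]


def transfer_missing(values, local, reference):
    """Copy values known to the reference into the local registry.

    Returns (transferred, unknown): values copied over, and values that
    neither registry knows about.
    """
    transferred, unknown = [], []
    for value in dict.fromkeys(values):  # unique values, original order
        if value in local:
            continue  # already registered locally, nothing to do
        if value in reference:
            local.add(value)
            transferred.append(value)
        else:
            unknown.append(value)
    return transferred, unknown


transferred, unknown = transfer_missing(values, local_registry, reference_registry)
print(transferred)  # ['astrocyte', 'oligodendrocyte']
print(unknown)      # ['cerebral pyramidal neuron']
```

Values that remain unknown are either typos to fix or genuinely new labels to register.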

Register new metadata labels

Our current database instance is empty. Once you have populated its registries, you will rarely need to save new labels. You’ll mostly use your lamindb instance to validate and annotate incoming data.

annotate.add_validated_from(df.cell_type.name)
1 non-validated categories are not saved in CellType.name: ['cerebral pyramidal neuron']!
      → to lookup categories, use lookup().cell_type
      → to save, run .add_new_from('cell_type')
✅ added 2 records from laminlabs/cellxgene with CellType.name for cell_type: 'astrocyte', 'oligodendrocyte'
# use a lookup object to get the correct spelling of categories
# here it queries the laminlabs/cellxgene instance; pass "public" to use the public reference instead
lookup = annotate.lookup()
lookup
Lookup objects from the laminlabs/cellxgene:
 .cell_type
 .assay_ontology_id
 .donor
 .columns
 

Example:
    → categories = validator.lookup().cell_type
    → categories.alveolar_type_1_fibroblast_cell
cell_types = lookup[df.cell_type.name]
cell_types.cerebral_cortex_pyramidal_neuron
CellType(updated_at=2023-11-28 22:37:06 UTC, uid='2sgq6sE7', name='cerebral cortex pyramidal neuron', ontology_id='CL:4023111', description='A Pyramidal Neuron With Soma Located In The Cerebral Cortex.', created_by_id=1, public_source_id=48)
# fix the typo
df.cell_type = df.cell_type.replace({"cerebral pyramidal neuron": cell_types.cerebral_cortex_pyramidal_neuron.name})

annotate.add_validated_from(df.cell_type.name)
✅ added 1 record from laminlabs/cellxgene with CellType.name for cell_type: 'cerebral cortex pyramidal neuron'
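
The lookup object above relies on browsing the registry by hand. As a complementary idea, candidate spellings can also be proposed automatically with fuzzy string matching. The sketch below uses difflib from the Python standard library. It is an illustration only, not something lamindb does here, and the list registered_names, the helper suggest, and the cutoff of 0.8 are assumptions.

```python
import difflib

# Hypothetical list of registered cell type names to search.
registered_names = [
    "astrocyte",
    "oligodendrocyte",
    "cerebral cortex pyramidal neuron",
]


def suggest(term, candidates, cutoff=0.8):
    """Return the most similar candidate, or None if none reaches the cutoff."""
    matches = difflib.get_close_matches(term, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


suggestion = suggest("cerebral pyramidal neuron", registered_names)
print(suggestion)  # cerebral cortex pyramidal neuron
```

A suggestion found this way should still be reviewed before replacing values, since string similarity says nothing about biological equivalence.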
# register non-validated terms
annotate.add_new_from(df.donor.name)
✅ added 3 records with ULabel.name for donor: 'D0001', 'D0002', 'DOOO3'
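
Registering new labels skips any comparison against existing records, so a format check beforehand can catch slips that validation against an empty registry cannot. Here is a minimal sketch, assuming (the tutorial does not state this) that donor IDs consist of the letter D followed by four digits. The pattern and the helper find_malformed are hypothetical.

```python
import re

# Assumed ID format (not stated in the tutorial): "D" followed by exactly four digits.
DONOR_ID_PATTERN = re.compile(r"D\d{4}")


def find_malformed(ids):
    """Return the IDs that do not fully match the assumed pattern."""
    return [i for i in ids if DONOR_ID_PATTERN.fullmatch(i) is None]


donors = ["D0001", "D0002", "DOOO3"]
malformed = find_malformed(donors)
print(malformed)  # ['DOOO3']
```

Under this assumed pattern, 'DOOO3' is flagged because it contains the letter O where digits are expected.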
# validate again
validated = annotate.validate()
validated
✅ cell_type is validated against CellType.name
💡 mapping assay_ontology_id on ExperimentalFactor.ontology_id
❗    found 1 terms validated terms: ['EFO:0008913']
      → save terms via .add_validated_from('assay_ontology_id')
✅ assay_ontology_id is validated against ExperimentalFactor.ontology_id
✅ donor is validated against ULabel.name
True

Validate an AnnData object

Here we specify which field to validate the variable index against (var_index) and which fields to validate the obs columns against (categoricals).

df.index = ["obs1", "obs2", "obs3"]

X = pd.DataFrame({"TCF7": [1, 2, 3], "PDCD1": [4, 5, 6], "CD3E": [7, 8, 9], "CD4": [10, 11, 12], "CD8A": [13, 14, 15]}, index=["obs1", "obs2", "obs3"])

adata = ad.AnnData(X=X, obs=df)
adata
AnnData object with n_obs × n_vars = 3 × 5
    obs: 'temperature', 'cell_type', 'assay_ontology_id', 'donor'
annotate = ln.Annotate.from_anndata(
    adata, 
    var_index=bt.Gene.symbol,
    categoricals=categoricals, 
    organism="human",
)
1 non-validated categories are not saved in Feature.name: ['temperature']!
      → to lookup categories, use lookup().columns
      → to save, run add_new_from_columns
✅ added 6 records from public with Gene.symbol for var_index: 'TCF7', 'PDCD1', 'CD3E', 'CD4', 'CD8A'
annotate.validate()
✅ var_index is validated against Gene.symbol
✅ cell_type is validated against CellType.name
💡 mapping assay_ontology_id on ExperimentalFactor.ontology_id
❗    found 1 terms validated terms: ['EFO:0008913']
      → save terms via .add_validated_from('assay_ontology_id')
✅ assay_ontology_id is validated against ExperimentalFactor.ontology_id
✅ donor is validated against ULabel.name
True
annotate.add_validated_from("all")
💡 saving labels for 'cell_type'
💡 saving labels for 'assay_ontology_id'
✅ added 1 record from public with ExperimentalFactor.ontology_id for assay_ontology_id: 'EFO:0008913'
💡 saving labels for 'donor'
annotate.validate()
✅ var_index is validated against Gene.symbol
✅ cell_type is validated against CellType.name
✅ assay_ontology_id is validated against ExperimentalFactor.ontology_id
✅ donor is validated against ULabel.name
True

Save an artifact

The validated object can be subsequently saved as an Artifact:

artifact = annotate.save_artifact(description="test AnnData")
❗ no run & transform get linked, consider calling ln.track()
💡 path content will be copied to default storage upon `save()` with key `None` ('.lamindb/U7AmpEKkbVlUjG9QVN7s.h5ad')
✅ storing artifact 'U7AmpEKkbVlUjG9QVN7s' at '/home/runner/work/lamindb/lamindb/docs/test-annotate/.lamindb/U7AmpEKkbVlUjG9QVN7s.h5ad'
💡 you can auto-track these data as a run input by calling `ln.track()`
💡 parsing feature names of X stored in slot 'var'
5 terms (100.00%) are validated for symbol
✅    linked: FeatureSet(uid='7y4ERtwxI0vIXqnEgJ9j', n=6, dtype='int', registry='bionty.Gene', hash='12Mh3I-mUBuOvj1q6wNn', created_by_id=1)
💡 parsing feature names of slot 'obs'
3 terms (75.00%) are validated for name
1 term (25.00%) is not validated for name: temperature
✅    linked: FeatureSet(uid='QhoHpOgPk8C2JzDBzPwx', n=3, registry='Feature', hash='n3vV5hsGwsLgWijVeNwe', created_by_id=1)
✅ saved 2 feature sets for slots: 'var','obs'
✅ linked feature 'cell_type' to registry 'bionty.CellType'
✅ linked feature 'assay_ontology_id' to registry 'bionty.ExperimentalFactor'
✅ linked feature 'donor' to registry 'ULabel'
artifact.describe()
Artifact(updated_at=2024-05-23 10:58:22 UTC, uid='U7AmpEKkbVlUjG9QVN7s', suffix='.h5ad', accessor='AnnData', description='test AnnData', size=20336, hash='wozXf_B6VsK6QXH81skJ8A', hash_type='md5', n_observations=3, visibility=1, key_is_virtual=True)

Provenance:
  📎 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1')
  📎 storage: uid='HVlitn2uYQXR', root='/home/runner/work/lamindb/lamindb/docs/test-annotate', type='local', instance_uid='3kW5y8h7c8wG')
Features:
  var: FeatureSet(uid='7y4ERtwxI0vIXqnEgJ9j', n=6, dtype='int', registry='bionty.Gene')
    'TCF7', 'PDCD1', 'CD3E', 'CD4', 'CD8A'
  obs: FeatureSet(uid='QhoHpOgPk8C2JzDBzPwx', n=3, registry='Feature')
    🔗 cell_type (3, cat[bionty.CellType]): 'astrocyte', 'oligodendrocyte', 'cerebral cortex pyramidal neuron'
    🔗 assay_ontology_id (3, cat[bionty.ExperimentalFactor]): 'single-cell RNA sequencing'
    🔗 donor (3, cat[ULabel]): 'D0001', 'D0002', 'DOOO3'
Labels:
  📎 cell_types (3, bionty.CellType): 'astrocyte', 'oligodendrocyte', 'cerebral cortex pyramidal neuron'
  📎 experimental_factors (1, bionty.ExperimentalFactor): 'single-cell RNA sequencing'
  📎 ulabels (3, ULabel): 'D0001', 'D0002', 'DOOO3'

Save a collection

Register a new collection for the registered artifact:

# register a new collection
collection = annotate.save_collection(
    artifact,  # the artifact registered above; a list of artifacts also works
    name="Experiment X in brain",  # title of the publication
    description="10.1126/science.xxxxx",  # DOI of the publication
    reference="E-MTAB-xxxxx",  # accession number (e.g. GSE#, E-MTAB#, etc.)
    reference_type="ArrayExpress",  # source type (e.g. GEO, ArrayExpress, SRA, etc.)
)
❗ no run & transform get linked, consider calling ln.track()
✅ loaded: FeatureSet(uid='7y4ERtwxI0vIXqnEgJ9j', n=6, dtype='int', registry='bionty.Gene', hash='12Mh3I-mUBuOvj1q6wNn', created_by_id=1)
✅ loaded: FeatureSet(uid='QhoHpOgPk8C2JzDBzPwx', n=3, registry='Feature', hash='n3vV5hsGwsLgWijVeNwe', created_by_id=1)
💡 you can auto-track these data as a run input by calling `ln.track()`
collection.artifacts.df()
    version                        created_at  created_by_id  \
id
1      None  2024-05-23 10:58:21.985252+00:00              1

                          updated_at                   uid  storage_id   key  \
id
1   2024-05-23 10:58:22.030237+00:00  U7AmpEKkbVlUjG9QVN7s           1  None

    suffix  accessor   description   size                    hash  hash_type  \
id
1    .h5ad   AnnData  test AnnData  20336  wozXf_B6VsK6QXH81skJ8A        md5

    n_objects  n_observations  transform_id  run_id  visibility  key_is_virtual
id
1        None               3          None    None           1            True