thor.fineST
- class thor.fineST(image_path, name, spot_adata_path=None, st_dir=None, cell_features_list=None, cell_features_csv_path=None, genes_path=None, save_dir=None, recipe='gene', **kwargs)
Bases: object
Class for in silico cell gene expression inference.
- Parameters:
image_path (str) – Path to the HE staining image or an image of other types which is aligned to the spatial transcriptome spots (full resolution).
name (str) – Name of the sample.
spot_adata_path (str, optional) – Path to the processed spot adata (e.g., from the Visium sequencing data). The counts/expression array (.X) and the spot coordinates (.obsm["spatial"]) are required, and adata.X is expected to be log-normalized. Either spot_adata_path or st_dir is required. If spot_adata_path is provided, st_dir is ignored.
st_dir (str, optional) – Directory to the SpaceRanger output directory, where the count matrix and spatial directory can be found.
cell_features_csv_path (str, optional) – Path to the CSV file that stores the cell features. The first two columns must be exactly the nuclei positions “x” and “y”.
cell_features_list (list or None, optional) – List of features to be used for generating the cell-cell graph. The first two must be exactly the nuclei positions “x” and “y”. Features are read from the cell_features_csv_path CSV file and the list is used to select columns. By default, if no external features are provided, the following features are used: [“x”, “y”, “mean_gray”, “std_gray”, “entropy_img”, “mean_r”, “mean_g”, “mean_b”, “std_r”, “std_g”, “std_b”].
genes_path (str, optional) – Path to a headerless, single-column file listing the genes (in the same format as adata.var_names) that must always be included.
save_dir (str or None, optional) – Path to the directory of saving fineST prediction results.
recipe (str, optional) –
- Specifies the mode for predicting the gene expression. Valid choices are: “gene”, “reduced”, “mix”.
“gene”: use the user-set genes for prediction with gene mode. This includes the genes from the used_for_prediction key in adata.var.
“reduced”: use the reduced genes from the VAE model with reduced mode. The genes from the used_for_prediction key in adata.var are ignored.
“mix”: use the reduced genes from the VAE model with reduced mode and the rest of the user-set genes for prediction with gene mode.
**kwargs (dict, optional) – Keyword arguments for any additional attributes to be set for the class. This allows future loading of the saved json file to create a new instance of the class.
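The precedence between spot_adata_path and st_dir described above can be sketched as follows. This is only an illustration of the documented rule; resolve_spot_source is a hypothetical helper and is not part of thor.

```python
from typing import Optional, Tuple


def resolve_spot_source(
    spot_adata_path: Optional[str] = None,
    st_dir: Optional[str] = None,
) -> Tuple[str, str]:
    """Hypothetical helper (not thor's API): choose the spot-level input.

    Mirrors the documented rule: one of the two paths is required, and
    spot_adata_path takes precedence when both are given.
    """
    if spot_adata_path is not None:
        return ("adata", spot_adata_path)
    if st_dir is not None:
        return ("spaceranger", st_dir)
    raise ValueError("Provide either spot_adata_path or st_dir.")


# Both are given, so st_dir is ignored.
choice = resolve_spot_source(spot_adata_path="sample.h5ad", st_dir="outs/")
print(choice)
```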
Methods
copy
get_reduced_genes – Obtain the genes whose expression will be estimated through diffusion of latent variables collectively.
load – Load the saved json file to create a new instance of the class.
load_genes – Load the user-input genes to be used for prediction.
load_params – Load the parameters from the json file.
load_result – Load the predicted gene expression into adata.
load_vae_model – Load the trained VAE model.
predict_gene_expression – Predict the gene expression for the cells.
prepare_input – Prepare the input for the fineST estimation.
prepare_recipe – Prepare the running modes for the fineST estimation of gene expression.
sanity_check – Whether the required attributes are set before running the prediction.
save – Save the attributes of the class to json file.
set_cell_features_csv_path – Set the path to the cell features csv file.
set_cell_features_list – Set the cell features to be used for the cell-cell graph construction.
set_genes_for_prediction – Set the genes to be used for prediction.
set_params – Set the parameters for the fineST estimation.
vae_training – Train a VAE model for the spot-level transcriptome data.
visualize_cell_network – Visualize the cell graph.
write_adata – Write the gene expression into adata.
write_params – Write the parameters to json files.
- get_reduced_genes(keep=0.9, min_mean_expression=0.5)
Obtain the genes whose expression will be estimated collectively through diffusion of latent variables. Not all genes used to train the VAE are reconstructed faithfully, so only the genes with high reconstruction quality (measured by cosine similarity with the input gene expression) are used.
- Parameters:
keep (float, optional) – The proportion of the genes to be kept for the reduced mode in the VAE model. The genes are ranked according to the VAE reconstruction quality. Default is 0.9.
min_mean_expression (float, optional) – Threshold of the mean expression for the genes to be used in reduced mode. Default is 0.5.
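A minimal sketch of the selection idea described above, assuming genes are first filtered by mean expression and then ranked by cosine similarity between input and reconstruction. The helper names, the order of the two steps, and the use of ">=" for the threshold are assumptions for illustration, not thor's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length expression vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def select_reduced_genes(original, reconstructed, keep=0.9, min_mean_expression=0.5):
    """Hypothetical helper: original/reconstructed map gene -> per-spot values."""
    # Assumed step 1: drop genes whose mean input expression is below the threshold.
    expressed = [
        gene for gene, values in original.items()
        if sum(values) / len(values) >= min_mean_expression
    ]
    # Assumed step 2: rank the remaining genes by reconstruction quality.
    ranked = sorted(
        expressed,
        key=lambda gene: cosine_similarity(original[gene], reconstructed[gene]),
        reverse=True,
    )
    # Keep the top fraction of the ranked genes.
    return ranked[: int(len(ranked) * keep)]


original = {"A": [1, 2, 3], "B": [1, 1, 1], "C": [0.1, 0.1, 0.1], "D": [2, 0, 2]}
reconstructed = {"A": [1, 2, 3], "B": [1, 0, 0], "C": [0.1, 0.1, 0.1], "D": [2, 0.1, 2]}
selected = select_reduced_genes(original, reconstructed, keep=0.7)
print(selected)
```

Here gene C is removed by the mean-expression filter even though its reconstruction is perfect, and gene B is dropped because it ranks last on reconstruction quality.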
- classmethod load(json_path)
Load the saved json file to create a new instance of the class.
- Parameters:
json_path (str) – Path to the saved json file.
- load_genes(genes_file_path)
Load the user-input genes to be used for prediction.
- Parameters:
genes_file_path (str) – Path to the csv file that contains the genes to be used for prediction. The genes should be in the first column of the csv file. Genes should match the var_names used in adata.
- Returns:
Update the genes attribute of the class.
- Return type:
None
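For reference, a file in the expected layout (genes in the first column, no header) can be read with the standard library as sketched below. read_gene_list is a hypothetical helper used only to illustrate the file format; the gene symbols are placeholders.

```python
import csv
import os
import tempfile


def read_gene_list(genes_file_path):
    """Hypothetical helper: read gene names from the first column of a headerless CSV."""
    genes = []
    with open(genes_file_path, newline="") as handle:
        for row in csv.reader(handle):
            # Skip blank lines and empty first cells.
            if row and row[0].strip():
                genes.append(row[0].strip())
    return genes


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "genes.csv")
    with open(path, "w", newline="") as handle:
        handle.write("CD3E\nMS4A1\n\nEPCAM\n")
    genes = read_gene_list(path)

print(genes)
```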
- load_params(json_path)
Load the parameters from the json file. The parameters are matched against, and used to update, self.run_params and self.graph_params, which are used for the prediction.
- Parameters:
json_path (str) – Path to the json file that contains the parameters.
- load_result(file_name, layer_name=None)
Load the predicted gene expression into adata.
- Parameters:
file_name (str) – File name of the predicted gene expression. The file should be in save_dir; the path is taken relative to save_dir.
layer_name (str, optional) – Layer name of the predicted gene expression. If None, the .X will be used.
- Returns:
adata – Anndata object with the predicted gene expression for the cells.
- Return type:
anndata.AnnData
- load_vae_model(model_path=None)
Load the trained VAE model.
- Parameters:
model_path (tuple) – Folder where the encoder and decoder models (.h5 files) are located. The filenames should be {self.name}_VAE_encoder.h5 and {self.name}_VAE_decoder.h5.
- Returns:
Update the generate attribute of the class.
- Return type:
None
- predict_gene_expression(**kwargs)
Predict the gene expression for the cells. Internally calls the estimate_expression_markov_graph_diffusion() function for the fineST estimation. The keyword parameters specified here will overwrite existing settings.
- Parameters:
kwargs (dict) – Parameters for the self.run_params and self.graph_params.
- Returns:
adata – Anndata object with the predicted gene expression for the cells.
- Return type:
anndata.AnnData
- prepare_input(mapping_margin=10, spot_identifier='spot_barcodes')
Prepare the input for the fineST estimation.
First, generate the cell-wise adata from the cell features and spot adata. In this step, the segmented cells will be read from the self.cell_features_csv_path and the outliers from the segmentation will be removed according to the distance between a cell and its nearest neighbor. Second, the spot gene expression is mapped to aligned nearest cells. Lastly, the spot heterogeneity will be computed using the image features for future construction of the cell-cell graph and the transition matrix.
- Parameters:
mapping_margin (numeric, optional) – Margin for mapping the spot gene expression to the cells. Default is 10, which attempts to map cells within a 10-spot radius of any spot (so almost all identified cells are mapped to their nearest spots). Decrease this number if you would like to eliminate isolated cells.
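The nearest-spot mapping step can be pictured with the sketch below. It assumes that the margin is expressed in units of the spot radius, so the distance cutoff is mapping_margin * spot_radius; that interpretation, the spot_radius argument, and the helper name are assumptions for illustration rather than thor's code.

```python
import math


def map_cells_to_spots(cells, spots, spot_radius, mapping_margin=10):
    """Hypothetical helper: assign each cell to its nearest spot within a margin.

    cells: dict cell_id -> (x, y); spots: dict barcode -> (x, y).
    Returns dict cell_id -> barcode for the cells that could be mapped.
    """
    # Assumption: the margin is measured in multiples of the spot radius.
    max_distance = mapping_margin * spot_radius
    mapping = {}
    for cell_id, (cx, cy) in cells.items():
        barcode, distance = min(
            ((b, math.hypot(cx - sx, cy - sy)) for b, (sx, sy) in spots.items()),
            key=lambda pair: pair[1],
        )
        # Cells farther than the cutoff from every spot stay unmapped.
        if distance <= max_distance:
            mapping[cell_id] = barcode
    return mapping


spots = {"s1": (0, 0), "s2": (100, 0)}
cells = {"c1": (10, 0), "c2": (90, 5), "c3": (1000, 1000)}
mapping = map_cells_to_spots(cells, spots, spot_radius=5, mapping_margin=10)
print(mapping)
```

Lowering mapping_margin shrinks the cutoff, which is how isolated cells such as c3 end up excluded.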
- prepare_recipe()
Prepare the running modes for the fineST estimation of gene expression.
- Supported recipe: “gene”, “reduced”, “mix”.
“gene”: use the user-set genes for prediction with gene mode. This includes the genes from the used_for_prediction key in adata.var.
“reduced”: use the reduced genes from the VAE model with reduced mode. Ignoring the genes from the used_for_prediction key in adata.var.
“mix”: use the reduced genes from the VAE model using reduced mode and the rest of the user-set genes for prediction with gene mode.
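The three recipes amount to splitting the genes between the two prediction modes. The sketch below shows one way to express that split; split_genes_by_recipe is a hypothetical helper, and the gene names are placeholders.

```python
def split_genes_by_recipe(recipe, user_genes, reduced_genes):
    """Hypothetical helper: return (genes for gene mode, genes for reduced mode)."""
    user_genes = list(user_genes)
    reduced_genes = list(reduced_genes)
    if recipe == "gene":
        # Only the user-set genes are predicted, all in gene mode.
        return user_genes, []
    if recipe == "reduced":
        # Only the VAE-reduced genes are predicted; user-set genes are ignored.
        return [], reduced_genes
    if recipe == "mix":
        # Reduced genes use reduced mode; the remaining user-set genes use gene mode.
        reduced_set = set(reduced_genes)
        rest = [gene for gene in user_genes if gene not in reduced_set]
        return rest, reduced_genes
    raise ValueError(f"Unknown recipe: {recipe!r}")


user = ["G1", "G2", "G3"]
reduced = ["G2", "G4"]
print(split_genes_by_recipe("mix", user, reduced))
```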
- sanity_check()
Whether the required attributes are set before running the prediction.
- Returns:
True if all the required attributes are set, False otherwise.
- Return type:
Boolean
- save(exclude=['generate', 'adata', 'conn_csr_matrix'])
Save the attributes of the class to json file. The saved json file can be directly loaded to create a new instance of the class.
- Parameters:
exclude (list, optional) – List of attributes to be excluded from saving. By default, [“generate”, “adata”, “conn_csr_matrix”] are excluded.
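The save/load round trip can be illustrated with a toy class. This is a sketch of the general pattern (serialize attributes to JSON, skip the excluded ones, rebuild the instance from keyword arguments); the Sample class and the "{name}.json" file name are assumptions and do not reflect thor's internals.

```python
import json
import os
import tempfile


class Sample:
    """Toy stand-in for a class with JSON-serializable attributes (not thor.fineST)."""

    def __init__(self, name, save_dir, **kwargs):
        self.name = name
        self.save_dir = save_dir
        # Extra keyword arguments become attributes, so a saved file can be reloaded.
        for key, value in kwargs.items():
            setattr(self, key, value)

    def save(self, exclude=("generate", "adata", "conn_csr_matrix")):
        # Heavy or non-serializable attributes are left out of the JSON file.
        state = {k: v for k, v in vars(self).items() if k not in exclude}
        json_path = os.path.join(self.save_dir, f"{self.name}.json")  # assumed file name
        with open(json_path, "w") as handle:
            json.dump(state, handle, indent=2)
        return json_path

    @classmethod
    def load(cls, json_path):
        with open(json_path) as handle:
            return cls(**json.load(handle))


with tempfile.TemporaryDirectory() as tmp:
    sample = Sample("demo", tmp, recipe="gene", adata=object())
    restored = Sample.load(sample.save())

print(restored.name, restored.recipe, hasattr(restored, "adata"))
```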
- set_cell_features_csv_path(cell_features_csv_path=None)
Set the path to the cell features csv file. If cell_features_csv_path is None, the cell features csv file will be generated from the HE staining image. Otherwise, the cell_features_csv_path will be used for the cell features.
- set_cell_features_list(cell_features_list=None)
Set the cell features to be used for the cell-cell graph construction. If cell_features_list is None, all the columns in the self.cell_features_csv_path will be used. Otherwise, the cell_features_list will be used for selecting the columns.
- set_genes_for_prediction(genes_selection_key='highly_variable')
Set the genes to be used for prediction. This will update the used_for_prediction column in adata.var.
- Parameters:
genes_selection_key (str, optional) – Key in adata.var used to select the genes. By default, “highly_variable” is used. We also recommend using spatially variable genes (with SPARK-X). If the key is not present in adata.var, you are responsible for adding it before running this function. None and “all” are also supported: None uses only the user-supplied genes, and “all” uses all the genes (this is almost never recommended).
- Returns:
Update the used_for_prediction column in adata.var.
- Return type:
None
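The selection logic can be sketched with plain dictionaries standing in for adata.var. The sketch assumes that user-supplied genes are always added on top of the key-based selection; that union, the helper name, and the data layout are assumptions for illustration.

```python
def mark_genes_for_prediction(var, genes_selection_key="highly_variable", user_genes=()):
    """Hypothetical helper: compute a used_for_prediction flag per gene.

    var: dict gene -> dict of boolean annotations (a stand-in for adata.var).
    """
    user = set(user_genes)
    use_key = genes_selection_key not in (None, "all")
    if use_key and any(genes_selection_key not in ann for ann in var.values()):
        raise KeyError(f"{genes_selection_key!r} must be added to var first.")
    flags = {}
    for gene, annotations in var.items():
        if genes_selection_key == "all":
            selected = True
        elif genes_selection_key is None:
            selected = False
        else:
            selected = bool(annotations[genes_selection_key])
        # Assumption: user-supplied genes are always included.
        flags[gene] = selected or gene in user
    return flags


var = {
    "G1": {"highly_variable": True},
    "G2": {"highly_variable": False},
    "G3": {"highly_variable": False},
}
print(mark_genes_for_prediction(var, user_genes=["G3"]))
```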
- set_params(**kwargs)
Set the parameters for the fineST estimation.
The keyword parameters specified here will overwrite existing settings. The complete list of the graph_params and run_params can be found in the graph_params and run_params attributes.
The graph_params are the parameters for the cell-cell graph construction and the transition matrix estimation. Behaviors of some important parameters are listed below.
- Graph params:
- n_neighbors: int
Number of neighbors for the cell-cell graph construction. Default is 5. Increasing this number will increase the connectivity of the cell-cell graph.
- geom_morph_ratio: float
The ratio of the geometric distance and the morphological distance for the cell-cell graph construction. Default is 10. Increasing this number will lead to more local connections.
- adjust_cell_network_by_transcriptome_scale: int or float
The scale of the transcriptome heterogeneity used to adjust the cell-cell graph, relative to the morphological distance. Default is 0. Increasing this number will increase the contribution of the transcriptome to the cell-cell graph.
- snn_threshold: float
The threshold on the proportion of shared neighbors required for a connection. Default is 0.1. Increasing this number makes the criterion stricter and leads to a sparser cell-cell graph.
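To make the interplay of n_neighbors and snn_threshold concrete, here is a small shared-nearest-neighbor sketch on 2D points. It assumes Euclidean distance and defines the proportion as the number of shared neighbors divided by n_neighbors; both choices and the function names are assumptions, and thor's actual graph construction (which also uses morphological features) may differ.

```python
import math


def knn(points, n_neighbors):
    """Return, for each point index, the set of its n_neighbors nearest indices."""
    neighbors = {}
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        neighbors[i] = set(others[:n_neighbors])
    return neighbors


def snn_edges(points, n_neighbors=5, snn_threshold=0.1):
    """Keep a kNN edge only if the two cells share enough of their neighbors."""
    neighbors = knn(points, n_neighbors)
    edges = set()
    for i, nbrs in neighbors.items():
        for j in nbrs:
            # Assumed definition: shared neighbors as a fraction of n_neighbors.
            proportion = len(nbrs & neighbors[j]) / n_neighbors
            if proportion >= snn_threshold:
                edges.add((min(i, j), max(i, j)))
    return edges


# Two well-separated clusters of four cells each.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10), (11, 10), (10, 11), (11, 11)]
loose = snn_edges(points, n_neighbors=3, snn_threshold=0.1)
strict = snn_edges(points, n_neighbors=3, snn_threshold=0.9)
print(len(loose), len(strict))
```

With the looser threshold every within-cluster pair is connected, while the stricter threshold removes all edges, matching the described effect of raising snn_threshold.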
A user-supplied cell-cell graph can also be used: the conn_csr_matrix parameter accepts the cell-cell graph as a scipy csr matrix. Some other parameters are also important for the transition matrix estimation.
- Transition matrix params:
- preferential_flow: bool
Whether to use preferential flow when estimating the transition matrix. When enabled, information is directed to flow from high-quality cells to low-quality cells, where quality is measured by the heterogeneity of the cells in the morphological space. Default is True. Setting this to False will lead to a symmetric transition.
- weigh_cells: bool
Whether to weigh the cells by the transcriptome heterogeneity for the transition matrix estimation. Default is True.
- smoothing_scale: float
The scale of the self-weight in the Lambda matrix. Default is 0.5. Increasing this number will preserve more original gene expression.
- inflation_percentage: float
How much to inflate the gene expression space after the transition matrix estimation. Default is None. Reasonable values range from 0 to 10. If this is None or 0, the gene expression space will not be inflated. Increasing this number inflates the gene expression space to preserve the feature space during the Markov graph diffusion. For background, see the paper on Taubin smoothing.
The run_params are the parameters for the Markov graph diffusion. Behaviors of some important parameters are listed below.
- Markov diffusion params:
- n_iter: int
Number of iterations for the Markov graph diffusion. Default is 10. Usually the estimation converges within 20 iterations.
- initialize: bool
Whether to initialize the cell-cell graph and the transition matrix. Default is True. Setting this to False will use the supplied/precomputed transition matrix (in adata.obsp[“transition_matrix”]) for the Markov graph diffusion.
- conn_key: str
The key in adata.obsp for the cell-cell graph. Default is “snn”.
- reduced_dimension_transcriptome_obsm_key: str
The key in adata.obsm for the cell features. Default is “X_pca”. If reduced_dimension_transcriptome_obsm_key is not in adata.obsm, the cell features will be used for the cell-cell graph construction.
- layer: str
The layer of the gene expression to be used for the Markov graph diffusion. Default is None. If this is None, the .X will be used.
- is_rawCount: bool
Whether the gene expression is raw count. Default is True. If this is True, the output gene expression will be in raw counts as well. Otherwise, the output gene expression will be in log-normalized counts.
- stochastic_expression_neighbors_level: str
The level of the neighbors to be used for the stochastic expression. Default is “spot”. Valid values are “spot” and “cell”. “spot” means the cells enclosed by neighboring spots will be used for the stochastic expression. “cell” means the neighbors of the cells will be used for the stochastic expression.
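As a rough intuition for n_iter and smoothing_scale, the sketch below runs a generic lazy diffusion on a row-stochastic transition matrix: at each step a cell keeps a fraction smoothing_scale of its own value and takes the rest from its neighbors. This update rule is an assumed simplification for illustration only; it is not the formula implemented in estimate_expression_markov_graph_diffusion().

```python
def diffuse(expression, transition, n_iter=10, smoothing_scale=0.5):
    """Generic lazy Markov diffusion (illustrative, not thor's implementation).

    expression: per-cell values for one gene.
    transition: row-stochastic matrix as a list of rows.
    """
    current = list(expression)
    for _ in range(n_iter):
        # Neighbor average under the transition matrix.
        mixed = [sum(t * v for t, v in zip(row, current)) for row in transition]
        # Assumed role of the self-weight: blend own value with the neighbor average.
        current = [
            smoothing_scale * own + (1 - smoothing_scale) * nbr
            for own, nbr in zip(current, mixed)
        ]
    return current


# Two cells that only see each other.
transition = [[0.0, 1.0], [1.0, 0.0]]
half = diffuse([1.0, 0.0], transition, n_iter=1, smoothing_scale=0.5)
sticky = diffuse([1.0, 0.0], transition, n_iter=1, smoothing_scale=0.9)
print(half, sticky)
```

In this toy setting a larger smoothing_scale leaves each cell closer to its starting value after the same number of iterations.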
- vae_training(vae_genes_set=None, min_mean_expression=0.1, **kwargs)
Train a VAE model for the spot-level transcriptome data.
- Parameters:
vae_genes_set (set, optional) – Set of genes to be used for VAE training. If None, all the genes (adata.var.used_for_prediction, which are specified in the prepare_input function) with mean expression > min_mean_expression will be used.
min_mean_expression (float, optional) – Minimum mean expression for the genes to be used for VAE training.
kwargs (dict) – Keyword arguments for the VAE.train_vae() function.
- Return type:
None
Notes
This function internally calls the VAE.train_vae() function for training the VAE using the preprocessed transcriptomic data.
- visualize_cell_network(**kwargs)
Visualize the cell graph. Internally calls the plot_cell_graph() function.
- write_adata(file_name, ad)
Write the gene expression into adata.
- Parameters:
file_name (str) – File name of the gene expression. The file will be saved in save_dir; the path is taken relative to save_dir.
- write_params(exclude=['conn_csr_matrix'])
Write the parameters to json files. This includes the run_params and graph_params.