IVaps.aps

APS estimation functions

Functions

estimate_aps_onnx(onnx[, X_c, X_d, data, C, …])

Estimate APS for given dataset and ONNX model

estimate_aps_user_defined(ml[, X_c, X_d, …])

Estimate APS for given dataset and user defined ML function

IVaps.aps.estimate_aps_onnx(onnx: str, X_c=None, X_d=None, data=None, C: Optional[Sequence] = None, D: Optional[Sequence] = None, L: Optional[Dict[int, Set]] = None, S: int = 100, delta: float = 0.8, seed: Optional[int] = None, types: Tuple[numpy.dtype, numpy.dtype] = (None, None), input_type: int = 1, input_names: Tuple[str, str] = ('c_inputs', 'd_inputs'), fcn=None, vectorized: bool = False, cpu: bool = False, iobound: bool = False, parallel: bool = False, nprocesses: Optional[int] = None, ntasks: int = 1, **kwargs)

Estimate APS for given dataset and ONNX model

Approximate propensity score (APS) estimation takes draws \(X_c^1, \ldots, X_c^S\) from the uniform distribution on \(N(X_{ci}, \delta)\), where \(N(X_{ci}, \delta)\) is the \(p_c\)-dimensional ball centered at \(X_{ci}\) with radius \(\delta\).

The draws \(X_c^1, \ldots, X_c^S\) are destandardized before being passed to the model for inference. The estimation equation is \(p^s(X_i;\delta) = \frac{1}{S} \sum_{s=1}^{S} ML(X_c^s, X_{di})\).
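The estimation equation can be illustrated with a minimal NumPy sketch. This is not the IVaps implementation: the names `sample_uniform_ball` and `aps_sketch` are hypothetical, the sketch assumes standardization uses each column's sample mean and standard deviation, and it assumes the model accepts a single 2D array with the continuous columns first.

```python
import numpy as np


def sample_uniform_ball(center, delta, S, rng):
    """Draw S points uniformly from the ball of radius delta around center."""
    p = center.shape[0]
    # Normalized Gaussian vectors are uniformly distributed on the unit sphere.
    directions = rng.standard_normal((S, p))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # Scaling radii by U**(1/p) makes the draws uniform in volume.
    radii = delta * rng.random(S) ** (1.0 / p)
    return center + directions * radii[:, None]


def aps_sketch(ml, X_c, X_d, S=100, delta=0.8, seed=None):
    """Monte Carlo APS estimate per row (illustrative only)."""
    rng = np.random.default_rng(seed)
    X_c = np.asarray(X_c, dtype=float)  # shape (n, p_c)
    X_d = np.asarray(X_d)               # shape (n, p_d)
    # Assumption: standardize with the per-column sample mean and std.
    mu, sigma = X_c.mean(axis=0), X_c.std(axis=0)
    Z = (X_c - mu) / sigma
    aps = np.empty(len(X_c))
    for i in range(len(X_c)):
        draws = sample_uniform_ball(Z[i], delta, S, rng)
        draws = draws * sigma + mu  # destandardize before inference
        discrete = np.repeat(X_d[i][None, :], S, axis=0)  # X_di is held fixed
        # Average the model output over the S draws.
        aps[i] = np.mean(ml(np.hstack([draws, discrete])))
    return aps
```

A model that returns a constant yields that constant as its APS, which is a quick sanity check for any implementation of the average.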

Parameters
  • onnx (str) – String path to ONNX model

  • X_c (array-like, default: None) – 1D/2D vector of continuous input variables

  • X_d (array-like, default: None) – 1D/2D vector of discrete input variables

  • data (array-like, default: None) – Dataset containing ML input variables

  • C (array-like, default: None) – Integer column indices for continuous variables

  • D (array-like, default: None) – Integer column indices for discrete variables

  • L (Dict[int, Set]) – Dictionary with keys as indices of X_c and values as sets of discrete values

  • S (int, default: 100) – Number of draws for each APS estimation

  • delta (float/list, default: 0.8) – Radius of sampling ball. If list, then APS is recomputed for each delta in list.

  • seed (int, default: None) – Seed for sampling

  • types (Tuple[np.dtype, np.dtype], default: (None, None)) – Numpy dtypes for continuous and discrete data; by default types are inferred

  • input_type (int, default: 1) – Whether the model takes continuous/discrete inputs together (1) or separately (2)

  • input_names (Tuple[str, str], default: ('c_inputs', 'd_inputs')) – Names of input nodes of ONNX model

  • fcn (Object, default: None) – Decision function to apply to ML output

  • vectorized (bool, default: False) – Indicator for whether decision function is already vectorized

  • cpu (bool, default: False) – Run inference on CPU; defaults to GPU if available

  • parallel (bool, default: False) – Whether to parallelize the APS estimation

  • nprocesses (int, default: None) – Number of processes to parallelize. Defaults to number of processors on machine.

  • ntasks (int, default: 1) – Number of tasks to send to each worker process.

Returns

Array of estimated APS for each observation in sample

Return type

np.ndarray

Notes

X_c, X_d, and data should never have any overlapping columns. The function cannot check this itself, so verify it before passing in the inputs.
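When the inputs come from a single dataset, one way to guard against overlap on the caller's side is to split the data by index and reject any index listed as both continuous and discrete. The helper below is a hypothetical sketch, not part of IVaps, and it only catches overlapping indices, not duplicated columns stored under different indices.

```python
import numpy as np


def split_inputs(data, C, D):
    """Return (X_c, X_d) from data, refusing overlapping column indices."""
    data = np.asarray(data)
    overlap = set(C) & set(D)
    if overlap:
        raise ValueError(
            f"columns listed as both continuous and discrete: {sorted(overlap)}"
        )
    # Fancy indexing copies the selected columns in the order given.
    return data[:, list(C)], data[:, list(D)]
```

The resulting arrays could then be passed as X_c and X_d, with data left as None.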

IVaps.aps.estimate_aps_user_defined(ml, X_c=None, X_d=None, data=None, C: Optional[Sequence] = None, D: Optional[Sequence] = None, L: Optional[Dict[int, Set]] = None, S: int = 100, delta: float = 0.8, seed: Optional[int] = None, pandas: bool = False, pandas_cols: Optional[Sequence] = None, keep_order: bool = False, reorder: Optional[Sequence] = None, parallel: bool = False, nprocesses: Optional[int] = None, ntasks: int = 1, **kwargs)

Estimate APS for given dataset and user defined ML function

Approximate propensity score (APS) estimation takes draws \(X_c^1, \ldots, X_c^S\) from the uniform distribution on \(N(X_{ci}, \delta)\), where \(N(X_{ci}, \delta)\) is the \(p_c\)-dimensional ball centered at \(X_{ci}\) with radius \(\delta\).

The draws \(X_c^1, \ldots, X_c^S\) are destandardized before being passed to the model for inference. The estimation equation is \(p^s(X_i;\delta) = \frac{1}{S} \sum_{s=1}^{S} ML(X_c^s, X_{di})\).

Parameters
  • ml (Object) – User-defined ML function

  • X_c (array-like, default: None) – 1D/2D vector of continuous input variables

  • X_d (array-like, default: None) – 1D/2D vector of discrete input variables

  • data (array-like, default: None) – Dataset containing ML input variables

  • C (array-like, default: None) – Integer column indices for continuous variables

  • D (array-like, default: None) – Integer column indices for discrete variables

  • L (Dict[int, Set]) – Dictionary with keys as indices of X_c and values as sets of discrete values

  • S (int, default: 100) – Number of draws for each APS estimation

  • delta (float, default: 0.8) – Radius of sampling ball

  • seed (int, default: None) – Seed for sampling

  • pandas (bool, default: False) – Whether to cast inputs into pandas dataframe

  • pandas_cols (Sequence, default: None) – Column names for dataframe input

  • keep_order (bool, default: False) – Whether to maintain the column order if data passed as a single 2D array

  • reorder (Sequence, default: None) – Indices to reorder the data assuming original order [X_c, X_d]

  • parallel (bool, default: False) – Whether to parallelize the APS estimation

  • nprocesses (int, default: None) – Number of processes to parallelize. Defaults to number of processors on machine.

  • ntasks (int, default: 1) – Number of tasks to send to each worker process.

  • **kwargs – Keyword arguments to pass into the user function

Returns

Array of estimated APS for each observation in sample

Return type

np.ndarray

Notes

X_c, X_d, and data should never have any overlapping variables. The function cannot check this itself, so verify it before passing in the inputs.

The arguments keep_order, reorder, and pandas_cols are applied sequentially, in that order. If keep_order is set, reorder is applied to the columns in their original order in data. pandas_cols then supplies the names for the columns of the resulting reordered dataset.

The default ordering of inputs is [X_c, X_d], where the continuous variables and the discrete variables each stay in their original order regardless of how the inputs are passed. If reorder is given without keep_order, the reordering is applied to this default ordering.
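The sequence of keep_order, reorder, and pandas_cols can be sketched as follows. This is one reading of the notes above, not the library's code: `arrange_columns` is a hypothetical helper, and it assumes "original order" means ascending column index in data, both within each group for the default [X_c, X_d] layout and across all columns when keep_order is set.

```python
import numpy as np


def arrange_columns(data, C, D, keep_order=False, reorder=None, pandas_cols=None):
    """Return (arranged data, source column indices, column names)."""
    data = np.asarray(data)
    if keep_order:
        # Step 1a: keep the columns as they appear in data.
        cols = sorted(list(C) + list(D))
    else:
        # Step 1b: default layout, continuous block first, then discrete.
        cols = sorted(C) + sorted(D)
    # Step 2: reorder indexes into whatever layout step 1 produced.
    if reorder is not None:
        cols = [cols[j] for j in reorder]
    # Step 3: names label the final layout, position by position.
    names = list(pandas_cols) if pandas_cols is not None else None
    return data[:, cols], cols, names
```

Under these assumptions, with C = [2, 0] and D = [1], the default layout uses columns [0, 2, 1], while keep_order=True gives [0, 1, 2]; applying reorder=[2, 0, 1] to each yields [1, 0, 2] and [2, 0, 1] respectively.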

Parallelization uses the Pool module from pathos, which will NOT be able to deal with execution on GPU. If the user function enables inference on GPU, then it is recommended to implement parallelization within the user function as well.

The optimal settings for nprocesses and ntasks are specific to each machine, and it is highly recommended that the user set these arguments explicitly to get the full performance benefit. One Stack Overflow thread recommends setting nchunks to 14 times the number of workers for optimal performance.
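To see how the two settings might interact, the sketch below splits the observations into roughly nprocesses * ntasks chunks and maps an estimation function over them. This is an assumption about how "tasks per worker" translates into chunks, not the IVaps implementation; `make_chunks`, `run_chunked`, and `map_fn` are hypothetical names, and the built-in `map` stands in for a pool's map method.

```python
import numpy as np


def make_chunks(n_obs, nprocesses, ntasks):
    """Split row indices into about nprocesses * ntasks contiguous chunks."""
    # Never ask for more chunks than rows, and always for at least one.
    n_chunks = max(1, min(n_obs, nprocesses * ntasks))
    return np.array_split(np.arange(n_obs), n_chunks)


def run_chunked(estimate_rows, X, nprocesses=2, ntasks=1, map_fn=map):
    """Apply estimate_rows to each chunk and stitch results back in row order."""
    chunks = make_chunks(len(X), nprocesses, ntasks)
    # map_fn preserves chunk order, so concatenation restores the row order.
    parts = map_fn(estimate_rows, [X[idx] for idx in chunks])
    return np.concatenate(list(parts))
```

More chunks per worker tend to balance uneven workloads at the cost of more scheduling overhead, which is why the best values depend on the machine and the model.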