IVaps.aps
APS estimation functions
Functions
estimate_aps_onnx – Estimate APS for given dataset and ONNX model
estimate_aps_user_defined – Estimate APS for given dataset and user defined ML function
- IVaps.aps.estimate_aps_onnx(onnx: str, X_c=None, X_d=None, data=None, C: Optional[Sequence] = None, D: Optional[Sequence] = None, L: Optional[Dict[int, Set]] = None, S: int = 100, delta: float = 0.8, seed: Optional[int] = None, types: Tuple[numpy.dtype, numpy.dtype] = (None, None), input_type: int = 1, input_names: Tuple[str, str] = ('c_inputs', 'd_inputs'), fcn=None, vectorized: bool = False, cpu: bool = False, iobound: bool = False, parallel: bool = False, nprocesses: Optional[int] = None, ntasks: int = 1, **kwargs)
Estimate APS for given dataset and ONNX model
Approximate propensity score estimation involves taking draws \(X_c^1, \ldots, X_c^S\) from the uniform distribution on \(N(X_{ci}, \delta)\), where \(N(X_{ci}, \delta)\) is the \(p_c\)-dimensional ball centered at \(X_{ci}\) with radius \(\delta\).
The draws \(X_c^1, \ldots, X_c^S\) are destandardized before being passed to the model for ML inference. The estimation equation is \(p^s(X_i;\delta) = \frac{1}{S} \sum_{s=1}^{S} ML(X_c^s, X_{di})\).
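The sampling and averaging step described above can be sketched in plain Python. This is a minimal illustration, not the IVaps implementation: `draw_from_ball`, `aps_sketch`, and `toy_ml` are hypothetical names, the toy model is invented for the example, and the standardization step is left out.

```python
import math
import random


def draw_from_ball(center, delta, rng):
    """Draw one point uniformly from the ball of radius delta around center."""
    p = len(center)
    # A standard normal vector, once normalized, gives a uniform direction.
    direction = [rng.gauss(0.0, 1.0) for _ in range(p)]
    norm = math.sqrt(sum(g * g for g in direction))
    # Scaling the radius by U**(1/p) makes the draw uniform in volume.
    radius = delta * rng.random() ** (1.0 / p)
    return [c + radius * g / norm for c, g in zip(center, direction)]


def aps_sketch(ml, x_c, x_d, S=100, delta=0.8, seed=None):
    """Average ml over S perturbed copies of x_c, holding x_d fixed."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(S):
        total += ml(draw_from_ball(x_c, delta, rng), x_d)
    return total / S


def toy_ml(x_c, x_d):
    # Hypothetical deterministic rule: treat when the first feature is positive.
    return 1.0 if x_c[0] > 0 else 0.0
```

With this toy rule, an observation whose first continuous feature is more than delta away from zero gets an estimate of exactly 0.0 or 1.0, while an observation near the cutoff gets a value strictly in between, which is the variation APS is meant to capture.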
- Parameters
onnx (str) – String path to ONNX model
X_c (array-like, default: None) – 1D/2D vector of continuous input variables
X_d (array-like, default: None) – 1D/2D vector of discrete input variables
data (array-like, default: None) – Dataset containing ML input variables
C (array-like, default: None) – Integer column indices for continuous variables
D (array-like, default: None) – Integer column indices for discrete variables
L (Dict[int, Set], default: None) – Dictionary with keys as indices of X_c and values as sets of discrete values
S (int, default: 100) – Number of draws for each APS estimation
delta (float/list, default: 0.8) – Radius of sampling ball. If list, then APS is recomputed for each delta in list.
seed (int, default: None) – Seed for sampling
types (Tuple[np.dtype, np.dtype], default: (None, None)) – Numpy dtypes for continuous and discrete data; by default types are inferred
input_type (int, default: 1) – Whether the model takes continuous/discrete inputs together (1) or separately (2)
input_names (Tuple[str, str], default: ('c_inputs', 'd_inputs')) – Names of input nodes of ONNX model
fcn (Object, default: None) – Decision function to apply to ML output
vectorized (bool, default: False) – Indicator for whether decision function is already vectorized
cpu (bool, default: False) – Run inference on CPU; defaults to GPU if available
parallel (bool, default: False) – Whether to parallelize the APS estimation
nprocesses (int, default: None) – Number of worker processes to use for parallelization. Defaults to the number of processors on the machine.
ntasks (int, default: 1) – Number of tasks to send to each worker process.
- Returns
Array of estimated APS for each observation in sample
- Return type
np.ndarray
Notes
X_c, X_d, and data should never have any overlapping columns. The code cannot check this, so verify it before passing in the inputs.
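Since the library does not check for overlap, a small pre-check on the caller's side can help. The sketch below assumes the inputs are passed as one row-major dataset plus the index lists C and D. `split_columns` is a hypothetical helper, not part of IVaps, and it only catches an index listed in both C and D. It cannot detect the same variable supplied twice through different arrays.

```python
def split_columns(data, C, D):
    """Split rows of data into continuous and discrete parts by column index."""
    overlap = set(C) & set(D)
    if overlap:
        raise ValueError(f"columns listed in both C and D: {sorted(overlap)}")
    n_cols = len(data[0])
    bad = [j for j in list(C) + list(D) if not 0 <= j < n_cols]
    if bad:
        raise ValueError(f"column indices out of range: {bad}")
    # Keep the columns in the order given by C and D.
    x_c = [[row[j] for j in C] for row in data]
    x_d = [[row[j] for j in D] for row in data]
    return x_c, x_d


rows = [[1.5, 0, 2.5], [3.5, 1, 4.5]]
x_c, x_d = split_columns(rows, C=[0, 2], D=[1])
```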
- IVaps.aps.estimate_aps_user_defined(ml, X_c=None, X_d=None, data=None, C: Optional[Sequence] = None, D: Optional[Sequence] = None, L: Optional[Dict[int, Set]] = None, S: int = 100, delta: float = 0.8, seed: Optional[int] = None, pandas: bool = False, pandas_cols: Optional[Sequence] = None, keep_order: bool = False, reorder: Optional[Sequence] = None, parallel: bool = False, nprocesses: Optional[int] = None, ntasks: int = 1, **kwargs)
Estimate APS for given dataset and user defined ML function
Approximate propensity score estimation involves taking draws \(X_c^1, \ldots, X_c^S\) from the uniform distribution on \(N(X_{ci}, \delta)\), where \(N(X_{ci}, \delta)\) is the \(p_c\)-dimensional ball centered at \(X_{ci}\) with radius \(\delta\).
The draws \(X_c^1, \ldots, X_c^S\) are destandardized before being passed to the model for ML inference. The estimation equation is \(p^s(X_i;\delta) = \frac{1}{S} \sum_{s=1}^{S} ML(X_c^s, X_{di})\).
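The destandardization mentioned above implies that draws are made on a standardized scale and mapped back to the original units before inference. The sketch below shows one way that round trip could look for a single continuous column. It assumes a z-score with the population standard deviation, and the exact convention IVaps uses is not stated here. `standardize` and `destandardize` are hypothetical helper names.

```python
import statistics


def standardize(column):
    """Return (z-scores, mean, std) for one continuous column."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # a constant column (std == 0) is not handled
    return [(v - mean) / std for v in column], mean, std


def destandardize(z_values, mean, std):
    """Map standardized values back to the original units."""
    return [z * std + mean for z in z_values]


incomes = [20.0, 30.0, 40.0, 50.0, 60.0]
z, mean, std = standardize(incomes)
# Shift the first observation by 0.8 on the standardized scale, then map back.
shifted = destandardize([z[0] + 0.8], mean, std)[0]
```

Under this assumption, a shift of 0.8 along one standardized coordinate corresponds to 0.8 standard deviations of that variable in its original units, so the same delta remains meaningful across variables measured on different scales.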
- Parameters
ml (Object) – User-defined ML function
X_c (array-like, default: None) – 1D/2D vector of continuous input variables
X_d (array-like, default: None) – 1D/2D vector of discrete input variables
data (array-like, default: None) – Dataset containing ML input variables
C (array-like, default: None) – Integer column indices for continuous variables
D (array-like, default: None) – Integer column indices for discrete variables
L (Dict[int, Set], default: None) – Dictionary with keys as indices of X_c and values as sets of discrete values
S (int, default: 100) – Number of draws for each APS estimation
delta (float, default: 0.8) – Radius of sampling ball
seed (int, default: None) – Seed for sampling
pandas (bool, default: False) – Whether to cast inputs into pandas dataframe
pandas_cols (Sequence, default: None) – Column names for dataframe input
keep_order (bool, default: False) – Whether to maintain the column order if data is passed as a single 2D array
reorder (Sequence, default: None) – Indices to reorder the data assuming original order [X_c, X_d]
parallel (bool, default: False) – Whether to parallelize the APS estimation
nprocesses (int, default: None) – Number of worker processes to use for parallelization. Defaults to the number of processors on the machine.
ntasks (int, default: 1) – Number of tasks to send to each worker process.
**kwargs – Keyword arguments to pass into user function
- Returns
Array of estimated APS for each observation in sample
- Return type
np.ndarray
Notes
X_c, X_d, and data should never have any overlapping variables. The code cannot check this, so verify it before passing in the inputs.
The arguments keep_order, reorder, and pandas_cols are applied sequentially, in that order. If keep_order is set, reorder then permutes the columns starting from the original column order of data. pandas_cols then supplies the names for the columns of the reordered dataset.
The default ordering of inputs is [X_c, X_d], where the continuous variables and discrete variables will be in the original order regardless of how their input is passed. If reorder is passed without keep_order, then the reordering will be performed on this default ordering.
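The interaction of keep_order, reorder, and pandas_cols is easier to see on a small example. The sketch below is one reading of the rules in these notes, not the library's code: `resolve_columns` is a hypothetical helper that tracks which original column of data lands in each position, and it assumes C and D are listed in ascending order and together cover every column of data.

```python
def resolve_columns(C, D, keep_order=False, reorder=None, pandas_cols=None):
    """Return (order, names): original column indices in model-input order."""
    order = list(C) + list(D)  # default layout: [X_c, X_d]
    if keep_order:
        order = sorted(order)  # restore the column order of data
    if reorder is not None:
        order = [order[i] for i in reorder]  # permute the current layout
    names = None
    if pandas_cols is not None:
        # Names attach positionally to the final layout.
        names = dict(zip(pandas_cols, order))
    return order, names


# data has columns 0 and 2 continuous, column 1 discrete.
default_order, _ = resolve_columns(C=[0, 2], D=[1])
kept_order, _ = resolve_columns(C=[0, 2], D=[1], keep_order=True)
reordered, names = resolve_columns(
    C=[0, 2], D=[1], reorder=[2, 0, 1], pandas_cols=["a", "b", "c"]
)
```

Here `default_order` is `[0, 2, 1]`, `kept_order` is `[0, 1, 2]`, and `reordered` is `[1, 0, 2]` because reorder acts on the default [X_c, X_d] layout when keep_order is not set.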
Parallelization uses the Pool module from pathos, which will NOT be able to deal with execution on GPU. If the user function enables inference on GPU, then it is recommended to implement parallelization within the user function as well.
The optimal settings for nprocesses and ntasks are specific to each machine, and it is highly recommended that the user pass these arguments to maximize the performance boost. A Stack Overflow thread recommends setting the number of chunks (nchunks) to 14 * (number of workers) for optimal performance.
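To make the rule of thumb above concrete, the sketch below splits a set of observations into contiguous, near-equal chunks. It is only an arithmetic illustration: `chunk_bounds` is a hypothetical helper, and how IVaps actually divides work among worker processes is not described here.

```python
def chunk_bounds(n_obs, n_chunks):
    """Split range(n_obs) into contiguous, near-equal (start, stop) pairs."""
    n_chunks = max(1, min(n_chunks, n_obs))  # never more chunks than rows
    base, extra = divmod(n_obs, n_chunks)
    bounds, start = [], 0
    for k in range(n_chunks):
        # The first `extra` chunks take one additional observation.
        stop = start + base + (1 if k < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


n_workers = 4
bounds = chunk_bounds(n_obs=1000, n_chunks=14 * n_workers)
```

With 4 workers and 1000 observations this gives 56 chunks of 17 or 18 rows each. Many small chunks let idle workers pick up remaining work, at the cost of more scheduling overhead per chunk.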