catenets.datasets.dataset_twins module

Twins dataset

Load real-world datasets for individualized treatment effect estimation.

Reference: http://data.nber.org/data/linked-birth-infant-death-data-vital-statistics-data.html

load(data_path: pathlib.Path, train_ratio: float = 0.8, treatment_type: str = 'rand', seed: int = 42, treat_prop: float = 0.5) → Tuple

Twins dataset dataloader.

The loader performs these steps in order:

1. Download the dataset if needed.
2. Load the dataset.
3. Preprocess the data.
4. Return the train/test split.

Parameters

data_path (Path) – Path to the CSV. If it is missing, it will be downloaded.
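train_ratio (float) – Fraction of the data assigned to the training set. Defaults to 0.8.
treatment_type (str) – Treatment generation strategy. Defaults to 'rand'.
seed (int) – Random seed. Defaults to 42.
treat_prop (float) – Treatment proportion. Defaults to 0.5.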

Returns

train_x (array or pd.DataFrame) – Features in training data.
train_t (array or pd.DataFrame) – Treatments in training data.
train_y (array or pd.DataFrame) – Observed outcomes in training data.
train_potential_y (array or pd.DataFrame) – Potential outcomes in training data.
test_x (array or pd.DataFrame) – Features in testing data.
test_potential_y (array or pd.DataFrame) – Potential outcomes in testing data.
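A minimal usage sketch for `load`, assuming the returned Tuple is flat and holds the six values in the order documented above. The helper names `unpack_twins_split` and `load_twins`, and the path `data/twins.csv` in the usage note, are hypothetical and not part of catenets.

```python
from pathlib import Path

# Names of the six returned values, in the order the documentation lists them.
# Assumption: load() returns them as one flat tuple in exactly this order.
RETURN_NAMES = (
    "train_x",
    "train_t",
    "train_y",
    "train_potential_y",
    "test_x",
    "test_potential_y",
)


def unpack_twins_split(split):
    """Map the tuple returned by load() to a dict keyed by documented name."""
    if len(split) != len(RETURN_NAMES):
        raise ValueError(
            f"expected {len(RETURN_NAMES)} values, got {len(split)}"
        )
    return dict(zip(RETURN_NAMES, split))


def load_twins(data_path: Path, seed: int = 42):
    """Hypothetical wrapper: call load() with its documented defaults."""
    # Imported lazily so the helper above can be used without catenets installed.
    from catenets.datasets.dataset_twins import load

    split = load(
        data_path,
        train_ratio=0.8,
        treatment_type="rand",
        seed=seed,
        treat_prop=0.5,
    )
    return unpack_twins_split(split)
```

A call such as `load_twins(Path("data/twins.csv"))["train_x"]` would then give the training features. Note that the documented return values include no treatments or observed outcomes for the test set, only `test_x` and `test_potential_y`.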

preprocess(fn_csv: pathlib.Path, train_ratio: float = 0.8, treatment_type: str = 'rand', seed: int = 42, treat_prop: float = 0.5) → Tuple

Helper for preprocessing the Twins dataset.

Parameters

fn_csv (Path) – Dataset CSV file path.
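train_ratio (float) – Fraction of the data assigned to the training set. Defaults to 0.8.
treatment_type (str) – The treatment selection strategy. Defaults to 'rand'.
seed (int) – Random seed. Defaults to 42.
treat_prop (float) – Treatment proportion. Defaults to 0.5.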

Returns

train_x (array or pd.DataFrame) – Features in training data.
train_t (array or pd.DataFrame) – Treatments in training data.
train_y (array or pd.DataFrame) – Observed outcomes in training data.
train_potential_y (array or pd.DataFrame) – Potential outcomes in training data.
test_x (array or pd.DataFrame) – Features in testing data.
test_potential_y (array or pd.DataFrame) – Potential outcomes in testing data.
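The sketch below illustrates how `train_ratio`, `seed`, and `treat_prop` could relate to the returned values. It is not the library's implementation. It assumes that 'rand' means each unit is treated independently with probability `treat_prop`, and that the observed outcome is the potential outcome under the assigned treatment. The function `simulate_split` is hypothetical and uses only the standard library.

```python
import random


def simulate_split(x, potential_y, train_ratio=0.8, seed=42, treat_prop=0.5):
    """Toy version of the preprocessing step on plain Python lists.

    x           : list of feature rows
    potential_y : list of (y0, y1) pairs, one per unit
    """
    rng = random.Random(seed)  # a seeded generator makes the result reproducible
    n = len(x)

    # Assumed meaning of 'rand': Bernoulli(treat_prop) assignment per unit.
    t = [1 if rng.random() < treat_prop else 0 for _ in range(n)]

    # Observed outcome: the potential outcome under the assigned treatment.
    y = [potential_y[i][t[i]] for i in range(n)]

    # Shuffle the indices, then cut at train_ratio.
    idx = list(range(n))
    rng.shuffle(idx)
    n_train = int(train_ratio * n)
    train_idx, test_idx = idx[:n_train], idx[n_train:]

    return (
        [x[i] for i in train_idx],            # train_x
        [t[i] for i in train_idx],            # train_t
        [y[i] for i in train_idx],            # train_y
        [potential_y[i] for i in train_idx],  # train_potential_y
        [x[i] for i in test_idx],             # test_x
        [potential_y[i] for i in test_idx],   # test_potential_y
    )
```

With 10 units and `train_ratio=0.8`, the first four values have 8 entries each and the last two have 2 each. Calling the function twice with the same `seed` gives identical splits.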