Introduction...............................................................2
Statistical Inference......................................................12
Checking the PH Assumption.................................................24
Innovation and Improvements................................................26
Conclusion.................................................................30
References.................................................................30

1. Introduction

1.1 Abstract

Colon cancer is the third most common cancer in the United States, with diagnosis rates on a steady rise for those under the age of 50. Since the 1980s, overall rates have decreased for the elderly; however, the underlying risk for the general population has increased. Given this alarming information, scientists and medical professionals need to assess how colon cancer affects different groups and which treatments prove effective against the disease. [1]

1.2 Objective

The objective of this project is to see if the treatment has a significant effect in treating colon cancer.

Aim of the Analysis

Our project seeks to answer a few questions:

Is there a significant difference in survival times between the control group and the treatment group for the colon dataset?
Will survival rates vary significantly between different groups (male vs. female, young vs. old, extent of differentiation of tumor)?
What are the covariates from the dataset that should be included in the model? (a.k.a which variabels should be adjusted for and included into the model)

1.3 Exploring and Cleaning the Dataset

1.3.1 Data Structure

data(colon)
str(colon)

## 'data.frame':    1858 obs. of  16 variables:
##  $ id      : num  1 1 2 2 3 3 4 4 5 5 ...
##  $ study   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ rx      : Factor w/ 3 levels "Obs","Lev","Lev+5FU": 3 3 3 3 1 1 3 3 1 1 ...
##  $ sex     : num  1 1 1 1 0 0 0 0 1 1 ...
##  $ age     : num  43 43 63 63 71 71 66 66 69 69 ...
##  $ obstruct: num  0 0 0 0 0 0 1 1 0 0 ...
##  $ perfor  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ adhere  : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ nodes   : num  5 5 1 1 7 7 6 6 22 22 ...
##  $ status  : num  1 1 0 0 1 1 1 1 1 1 ...
##  $ differ  : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ extent  : num  3 3 3 3 2 2 3 3 3 3 ...
##  $ surg    : num  0 0 0 0 0 0 1 1 1 1 ...
##  $ node4   : num  1 1 0 0 1 1 1 1 1 1 ...
##  $ time    : num  1521 968 3087 3087 963 ...
##  $ etype   : num  2 1 2 1 2 1 2 1 2 1 ...

The dataset has 1858 observations and 16 variables.
Since some of the categorical variables are being recognized as numerical variables.
Using a factor() function we will change the types of the categorical variables into factors.

colon$sex <- factor(colon$sex, levels=c("0","1"), labels=c("Female","Male"))
colon$adhere <- factor(colon$adhere, levels=c("0","1"))
colon$perfor <- factor(colon$perfor)
colon$differ <- factor(colon$differ, levels=c("1","2", "3"), labels=c("well","moderate", "poor"))
colon$extent <- factor(colon$extent, levels=c("1","2","3","4"), labels=c("submucosa","muscle", "serosa", "contiguous_str"))
colon$surg <- factor(colon$surg, levels=c("0","1"), labels=c("short","long"))
colon$obstruct <- factor(colon$obstruct)
str(colon)

## 'data.frame':    1858 obs. of  16 variables:
##  $ id      : num  1 1 2 2 3 3 4 4 5 5 ...
##  $ study   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ rx      : Factor w/ 3 levels "Obs","Lev","Lev+5FU": 3 3 3 3 1 1 3 3 1 1 ...
##  $ sex     : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 1 2 2 ...
##  $ age     : num  43 43 63 63 71 71 66 66 69 69 ...
##  $ obstruct: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 2 1 1 ...
##  $ perfor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ adhere  : Factor w/ 2 levels "0","1": 1 1 1 1 2 2 1 1 1 1 ...
##  $ nodes   : num  5 5 1 1 7 7 6 6 22 22 ...
##  $ status  : num  1 1 0 0 1 1 1 1 1 1 ...
##  $ differ  : Factor w/ 3 levels "well","moderate",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ extent  : Factor w/ 4 levels "submucosa","muscle",..: 3 3 3 3 2 2 3 3 3 3 ...
##  $ surg    : Factor w/ 2 levels "short","long": 1 1 1 1 1 1 2 2 2 2 ...
##  $ node4   : num  1 1 0 0 1 1 1 1 1 1 ...
##  $ time    : num  1521 968 3087 3087 963 ...
##  $ etype   : num  2 1 2 1 2 1 2 1 2 1 ...

#factor model
#code format from lab3

1.3.2 Data Description

head(colon, 10)

([2] the explanations regarding the columns of each dataset was found from Therneau's documentation of Package 'survival') This data illustrates one of the first trials of chemotherapy for colon cancer. Every patient is assigned an index value (id) along with a value 1 for the study column. Each patient is assigned two rows of record, one for recurrence in and another for death.

Death and Recurrence?

The two rows are distinguished by the values of column ‘etype’.
Value 1 in column ‘etype’ stands for recurrence while value 2 in column ‘etype’ stands for death.

With the head() function we were able to print out the first ten rows of the dataset. Every two rows in the colon dataset has the same id values. The distinguishment for each row is with the column called 'etype' The column distinguishes a single patient into two different segments signifying recurrence, etype = 1, and death, etype = 2.
We will be subsetting the dataset to seek the answers regarding our analysis question regarding death.

#subsetting the colon dataset
recur <- subset(colon, colon$etype == 1 )
death<- subset(colon, colon$etype == 2)

#data of death (etype == 2)
head(death, 10)

Now we have two subsets of the original dataset 'colon' that are called 'recur' for the subject's recurrence data and 'death' for the subject's data regarding death

Treatment Variable

This is our key variable of interest that we will be testing and building a model around.

levels(death$rx)

## [1] "Obs"     "Lev"     "Lev+5FU"

Column ‘rx’ contains the three treatment levels : 'Obs', 'Lev', and 'Lev + 5-FU'. ‘Obs’ is observation, also known as the control group, which is a segment of patients that did not receive treatment.
‘Lev’ is stands for Levamisole, which is a low toxicity compound. ‘Lev + 5-FU’ is a treatment of both Levamisole and -FU, which is a moderately toxic chemotherapy agent. Before explaining the other variables, we want to make sure that the segment size is not too different from one group to another when the subjects are counted in groups of the treatments received for colon cancer. 315 subjects were in the control group, 310 subjects treated with Levamisole only, and 304 subjects were treated with both Levamisole and fluorouracil(5-FU)[3] together.

table(death$rx)

## 
##     Obs     Lev Lev+5FU 
##     315     310     304

barplot(table(death$rx), col = "skyblue")

# can be found here in chapter 4 sections 1 and 2 https://daviddalpiaz.github.io/appliedstats/summarizing-data.html#categorical

Figure 1: Boxplot of Treatments

[4] can be found here in chapter 4 sections 1 and 2 https://daviddalpiaz.github.io/appliedstats/summarizing-data.html#categorical

Other Variables

Column ‘time’ is the days until event or censoring.
The censoring status is indicated in column ‘status’ with either a 1 or a 0.
For column ‘sex’, value 0 is assigned to female patients and value 1 for male patients. Column ‘age’ is in the unit of years.
The column ‘obstruct’ indicates if the colon was obstructed by a tumour.
The column ‘perfor’ indicates the perforation of the colon.
The column ‘adhere’ indicates the adherence to nearby organs. Each of these three columns are indicated with either 1 or 0.
The ‘nodes’ column lists the number of lymph nodes with detectable cancer with integer values.
The ‘differ’ column lists the differentiation of tumour with 1=well, 2=moderate, and 3=poor.
The ‘extent’ column indicates the extent of local spread with 1=submucosa, 2=muscle, 3=serosa, 4=contiguous structures.
The ‘surg’ column shows the time from surgery to registration with 1=long and 0=short.

Excluded Variable

The 'node4' column indicates more than 4 positive lymph nodes with values of 1 or 0.
However, in the documentation for the colon dataset it is mentioned that "Peter Higgins has pointed out a data inconsistency, revealed by table(colon\(nodes,colon\)node4). We don’t know which of the two variables is actually correct so have elected not to ’fix’ it." [1]

Column 'nodes' is a number of lymph nodes with detectable cancer.[1] Column 'node4' is a categorical variable that verifies if the subject had more than 4 lymph nodes with cancer or did not.[1] If the columns were to be consistent with one another, all the values for 'node4' where values in 'nodes' is less than 4 should be 0.

filter(colon, nodes == '1' & node4 == 1)

# function documentation found here
# https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/filter

[5] filter function documentation found here https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/filter

We see ourselves that the inconsistency exists in the dataset, by filtering for subjects who, according to the dataset, have 1 lymp node with detectable cancer but also have more than 4 lymph nodes with cancer.

For accurate results, we wish to remove the inconsistency.
We decided to remove node4 from the steps of analysis with the assumption that nodes is the accurate data, because we view the column node4 to have been collected as a succeeding column of nodes in the experiment process

2. Statistical Inference

2.1 Plotting the Data

# simple KM curve
death <- subset(colon, colon$etype==2)
colon.dsurv <- Surv(death$time, death$status)
colon.dfit <-survfit(colon.dsurv~1, data = death)
plot(colon.dfit,
     main = "Kaplan-Meier Curves \n for Colon Cancer Patients", 
     xlab = "Time (in Days)", 
     ylab = "Survival Probability", mark = 18)

# https://gauchospace.ucsb.edu/courses/pluginfile.php/11095265/mod_resource/content/1/175%20Section%201%20-%20Kyle%20Guan.pdf
#code from Kyle Guan section 1 notes

Figure 2: Kaplan-Meier Plot

This is a Kaplan-Meier curve generated using a plot function (from package graphics, version 3.6.2, License: Part of R 3.6.2 by R-core R-core@R-project.org).
The curve illustrates the different survival probabilities of colon cancer patients and the dots on the curve indicate censored subjects who left the study before an event occurred.
Here we see that the survival probability curve smoothes out into slope closer to 0 as the subject's time in the study passes 2500 days. We see that most of the censoring of patients in the study happened as the survival probability drops from 0.6 and as time in the study passes 1750 days. The amount of censoring shown in this KM curve might be an explanation as to why the survival probability does not drop to a 0.

#KM Curve by Treatment
rxdeath.fit <- survfit(Surv(time,status)~rx,data=death)
ggsurvplot(rxdeath.fit, 
           title = "KM Curve by Treatments",
           xlab = "Time (in Days)", 
           ylab = "Survival Probability",
           pval = TRUE)

Figure 3: Kaplan-Meier Plot by Treatment

To get a better visualization on the relationship between the survival probabilities of subjects in our interest with the treatment, we generated a Kaplan-Meier curve that is grouped by treatment a ggsurvplot function (from package survminer, version 0.4.8, License: GPL-2 by Alboukadel Kassambara) from the death subset.
There are two interesting trends outlined here. First, the two curves indicating the observation group and the "Lev" treatment group (subject group that was treated only with Levamisole) seem to follow a similar trend and intersect numerous times. Visually, the two curves do not seem to have a significant difference, meaning treating only Levamisole on colon cancer patients do not seem to have a significant difference on survival probabilities. This project will explore this further in detail in the next chapter. Second, the censored data we mentioned from the previous graph seem to show its effect on the treatment grouped Kaplan-Meier curve in this graph. Like the previous KM curve graph, the curves for each treatment group smooth out below survival probability of 0.50. Also, the confidence intervals at each time, visualized as vertical lines, in each treatment group's curve line becomes more vivid(longer) as subject's time in the study increases.

2.2 Hypothesis Testing

Apart from checking with visualization, we want to test for a statistical significant difference between the control and the treatment group. We performed a log-Rank test and a log-likelihood test.

log-Rank test

We performed a log-Rank test to if the survival curves between the observation group(rx = "Obs") and the treatment groups (rx = "Lev" and rx = "Lev + 5FU") \(H_0\) : \(S_1(t)\) = \(S_2(t)\) = \(S_3(t)\); There is no significant difference in survival curves between the control group(rx = "Obs") vs treatment group (rx = "Lev" or rx = "Lev + 5FU") \(H_1\) : There is a significant difference in survival curves between the control group vs treatment group(rx = "Lev" or rx = "Lev + 5FU")

survdiff(Surv(time, status) ~ rx, data = death)

## Call:
## survdiff(formula = Surv(time, status) ~ rx, data = death)
## 
##              N Observed Expected (O-E)^2/E (O-E)^2/V
## rx=Obs     315      168      148      2.58      3.85
## rx=Lev     310      161      146      1.52      2.25
## rx=Lev+5FU 304      123      157      7.55     11.62
## 
##  Chisq= 11.7  on 2 degrees of freedom, p= 0.003

The p-value 0.03 and is smaller than 0.05. We successfully reject the null hypothesis and claim that there is a significant difference in survival curves between the control group vs treatment group(rx = "Lev" or rx = "Lev + 5FU")

likelihood ratio test

We also performed a likelihood ratio test to see how the hazard rates of treatment groups (rx = "Lev" and rx = "Lev + 5FU") are different from the observation group. \(H_0\) : \(B_1(t)\) = \(B_2(t)\) = \(B_3(t)\); There is no significant difference in survival rates between the control group(rx = "Obs") vs treatment group (rx = "Lev" or rx = "Lev + 5FU") \(H_1\) : There is a significant difference in survival rates between the control group vs treatment group(rx = "Lev" or rx = "Lev + 5FU")

colondeath.cox <- coxph(colon.dsurv~rx, data = death)
colondeath.cox

## Call:
## coxph(formula = colon.dsurv ~ rx, data = death)
## 
##               coef exp(coef) se(coef)      z       p
## rxLev     -0.02664   0.97371  0.11030 -0.241 0.80917
## rxLev+5FU -0.37171   0.68955  0.11875 -3.130 0.00175
## 
## Likelihood ratio test=12.15  on 2 df, p=0.002302
## n= 929, number of events= 452

#section5 file from Kyle Guan

The p-value is 0.002302 which is less than 0.05, thus we reject the null hypothesis and claim that there exists a significant difference between the survival rates of subjects that don't use treatment("Obs") vs those who do("Lev", "Lev + 5FU") Based on the negative sign of the coefficients, it is shown that both treatments' hazard rates are both lower than that of the hazard rate of observation group. We see that the coefficient The hazard rate for patients who are on rx of Lev is approximately 97.37 % of the hazard rate for patients in the control ("Obs") The hazard rate for patients who are on rx of Lev + 5FU is approximately 68.96 % of the hazard rate for patients in the control ("Obs"). Thus we see treatment using "Lev + 5FU" lessened hazard rates much more than than using just "Lev" did. A similar narrative is given from the 95% confidence intervals of the hazard ratios of each treatment level to the observation group.

#hazard ratio confidence interval
#rxLev to Obs
exp(-0.02664 - 1.96 * 0.11030) # lower

## [1] 0.7844064

exp(-0.02664 + 1.96 * 0.11030) # upper

## [1] 1.208703

#http://www.sthda.com/english/wiki/cox-proportional-hazards-model

The p-value of treatment Lev is 0.809 and the 95% confidence interval of the hazard ratio is 0.7844 to 1.209. The confidence interval includes 1. This indicates that we cannot claim that treatment with only Levamisole had a signficant effect in the hazard rates of the aptients in the study.

#hazard ratio confidence interval
#rxLev + 5FU to Obs
exp(-0.37171 - 1.96 * 0.11875) # lower

## [1] 0.5463694

exp(-0.37171 + 1.96 * 0.11875) # upper

## [1] 0.8702628

#http://www.sthda.com/english/wiki/cox-proportional-hazards-model

The p-value of treatment Lev is 0.00175 and the 95% confidence interval of the hazard ratio is 0.5463 to 0.8702. The confidence interval is less than 1, meaning it possessed statistical significance in reducing the hazard rates of the patients in the study.

2.3 BIC Model Fitting with Forward Selection

In this section, we want to find the "correct" model with a set of covariates that predict survival the with the highest accuracy. We want to find the covariates that work the best with our treatment variable "rx".
We will find the best model without the treatment variable "rx" and compare that to the selected model that includes "rx". We used a BIC model fitting forward selection process to identify important covariates we need to consider with treatment in our Cox PH model when testing for the significance of treatments given to colon cancer patients.

#first covariate
death.colon.model1 <- coxph(Surv(time, status) ~ sex, data = death)
death.colon.model2 <- coxph(Surv(time, status) ~ age, data = death)
death.colon.model3 <- coxph(Surv(time, status) ~ obstruct, data = death)
death.colon.model4 <- coxph(Surv(time, status) ~ perfor, data = death)
death.colon.model5 <- coxph(Surv(time, status) ~ adhere, data = death)
death.colon.model6 <- coxph(Surv(time, status) ~ nodes, data = death)
death.colon.model7 <- coxph(Surv(time, status) ~ differ, data = death)
death.colon.model8 <- coxph(Surv(time, status) ~ extent, data = death)
death.colon.model9 <- coxph(Surv(time, status) ~ surg, data = death)
BIC(death.colon.model1, death.colon.model2, death.colon.model3, 
    death.colon.model4, death.colon.model5, death.colon.model6, 
    death.colon.model7, death.colon.model8, death.colon.model9
    )

## Warning in BIC.default(death.colon.model1, death.colon.model2,
## death.colon.model3, : models are not all fitted to the same number of
## observations

the BIC is the smallest for death.colon.model7 which contains the 'differ' variable.

#second covariate
death.colon.model7.1 <- coxph(Surv(time, status) ~ differ + sex, data = death)
death.colon.model7.2 <- coxph(Surv(time, status) ~ differ + age, data = death)
death.colon.model7.3 <- coxph(Surv(time, status) ~ differ + obstruct, data = death)
death.colon.model7.4 <- coxph(Surv(time, status) ~ differ + perfor, data = death)
death.colon.model7.5 <- coxph(Surv(time, status) ~ differ + adhere, data = death)
death.colon.model7.6 <- coxph(Surv(time, status) ~ differ + nodes, data = death)
death.colon.model7.8 <- coxph(Surv(time, status) ~ differ + extent, data = death)
death.colon.model7.9 <- coxph(Surv(time, status) ~ differ + surg, data = death)
BIC(death.colon.model7.1, death.colon.model7.2, death.colon.model7.3, 
    death.colon.model7.4, death.colon.model7.5, death.colon.model7.6, 
    death.colon.model7.8, death.colon.model7.9)

## Warning in BIC.default(death.colon.model7.1, death.colon.model7.2,
## death.colon.model7.3, : models are not all fitted to the same number of
## observations

death.colon.model7.6 has the smallest BIC score of 5560.410 and is smaller than the BIC score of death.colon.model7. We include the covariate 'nodes' in the model selection process.

#third covariate
death.colon.model7.6.1 <- coxph(Surv(time, status) ~ differ + nodes +  sex, data = death)
death.colon.model7.6.2 <- coxph(Surv(time, status) ~ differ + nodes +  age, data = death)
death.colon.model7.6.3 <- coxph(Surv(time, status) ~ differ + nodes +  obstruct, data = death)
death.colon.model7.6.4 <- coxph(Surv(time, status) ~ differ + nodes +  perfor, data = death)
death.colon.model7.6.5 <- coxph(Surv(time, status) ~ differ + nodes +  adhere, data = death)
death.colon.model7.6.8 <- coxph(Surv(time, status) ~ differ + nodes +  extent, data = death)
death.colon.model7.6.9 <- coxph(Surv(time, status) ~ differ + nodes +  surg, data = death)
BIC(death.colon.model7.6.1, death.colon.model7.6.2, 
    death.colon.model7.6.3, death.colon.model7.6.4, 
    death.colon.model7.6.5, death.colon.model7.6.8, 
    death.colon.model7.6.9)

death.colon.model7.6.8 has the smallest BIC score of 5560.361 and is smaller than the BIC score of death.colon.model 7.6 We keep the covariate 'extent in the model selection process.

#fourth covariate
death.colon.model7.6.8.1 <- coxph(Surv(time, status) ~differ + nodes + 
                                     extent +  sex, data = death)
death.colon.model7.6.8.2 <- coxph(Surv(time, status) ~ differ + nodes + 
                                     extent +  age, data = death)
death.colon.model7.6.8.3 <- coxph(Surv(time, status) ~ differ + nodes + 
                                     extent +  obstruct, data = death)
death.colon.model7.6.8.4 <- coxph(Surv(time, status) ~ differ + nodes +  
                                     extent +  perfor, data = death)
death.colon.model7.6.8.5 <- coxph(Surv(time, status) ~ differ + nodes +  
                                     extent +  adhere, data = death)
death.colon.model7.6.8.9 <- coxph(Surv(time, status) ~differ + nodes + 
                                     extent +  surg, data = death)
BIC(death.colon.model7.6.8.1, death.colon.model7.6.8.2, death.colon.model7.6.8.3, 
    death.colon.model7.6.8.4, death.colon.model7.6.8.5, 
    death.colon.model7.6.8.9)

All the BIC scores in this table are greater than the BIC score of death.colon.model7.6.8, which was 5560.361
We do not include a new covariate into our model and stop the model selection process here. We are left with a model that contains the following covariates: nodes, differ, extent

withoutTreatment <- (coxph(Surv(time, status) ~ nodes + differ + 
                             extent, data = death)) #death.colon.model7.6.8
withTreatment <- (coxph(Surv(time, status) ~ rx + nodes + differ +  
                          extent, data = death))
anova(withoutTreatment, withTreatment)

With the deviance table that compares the selected model and the selected model with the treatment, we get a p-value of 0.0032. We claim that for the selected models there is a significant difference when including the treatment variable and when the treatment variable is not included. The cox proportional hazards model that we select from the BIC process includes variables "nodes", "differ", "extent", and the treatment variable "rx".

# we plut in the treatment variable into the model we have fitted
death.colon.trt.select <- coxph(Surv(time, status) ~ rx + nodes + 
                                  differ +  extent, data = death)
death.colon.trt.select

## Call:
## coxph(formula = Surv(time, status) ~ rx + nodes + differ + extent, 
##     data = death)
## 
##                           coef exp(coef)  se(coef)      z       p
## rxLev                -0.108109  0.897530  0.114235 -0.946 0.34396
## rxLev+5FU            -0.397996  0.671664  0.121474 -3.276 0.00105
## nodes                 0.084065  1.087700  0.009339  9.002 < 2e-16
## differmoderate       -0.102575  0.902511  0.167740 -0.612 0.54086
## differpoor            0.194893  1.215181  0.196682  0.991 0.32173
## extentmuscle          0.730410  2.075932  0.602647  1.212 0.22551
## extentserosa          1.254786  3.507086  0.581172  2.159 0.03085
## extentcontiguous_str  1.611147  5.008552  0.615033  2.620 0.00880
## 
## Likelihood ratio test=107.7  on 8 df, p=< 2.2e-16
## n= 888, number of events= 430 
##    (41 observations deleted due to missingness)

With the new coxPH model chosen from the forward selection process we performed a log-likelhood test again. The p-value is 0.002302 which is less than 0.05, thus we reject the null hypothesis and claim that there exists a significant difference between the survival rates of subjects that don't use treatment("Obs") vs those who do("Lev", "Lev + 5FU") The hazard rate for patients who are on rx of Lev is approximately 97.37 % of the hazard rate for patients in the control ("Obs") The hazard rate for patients who are on rx of Lev + 5FU is approximately 68.96 % of the hazard rate for patients in the control ("Obs")

It does not seem like the survival rates of subjects that are under placebo are significantly different from those who are only using Levamisole.
Using levamisole by itself doesn't appear to be have a significant effect in minimizing deaths.

3. coxPH Assumption checking

A cox proportional hazards model is built upon an assumption that "the HR(hazard ratio) is constant over time." (D.Kleinbaum, M.Klein, Survival Analysis, A Self‐Learning Text, Third Edition, pg.124) To ensure the validity of the cox PH regression model, we need to test whether the model we have selected from the BIC process does not violate the PH assumption.

Using the cox.zph function (from package survival, version 3.2-7, License: LGPL (>= 2) by Terry Therneau), we tested the proportional hazards assumption for our selected cox regression model fit.

cox.zph(death.colon.trt.select)

##          chisq df       p
## rx      2.7363  2 0.25458
## nodes   0.0138  1 0.90652
## differ 15.8014  2 0.00037
## extent  6.7199  3 0.08138
## GLOBAL 24.7727  8 0.00170

The p-value from the cox.zph goodness of fit test for the coxPH assumption is 0.00170 and is smaller than 0.05.
We see that the model we have selected violates the pH assumption, which means that in our model we have a covariate that should not be assumed to be independent of time. (D.Kleinbaum, M.Klein, Survival Analysis, A Self‐Learning Text, Third Edition, pg.164)

From the goodness of fit test we conducted with the cox.zph function, we see that the covariate "differ" also had a p-value of 0.00037. To see if this specific covariate is the cause of the PH assumption the model, we visualized a log-log plot to check for linearity.

cloglog <- function(x){log(-log(x))}
ggsurvplot(survfit(Surv(death$time, death$status) ~ death$differ, data = death)
           , fun = cloglog, title = "Log-Log Plot (by differentiation)", 
           xlab = "Time (in Days)", ylab = "log-log(S(t))")

Figure 4: Log-Log Plot by differentation

Based on our Log-Log Plot of the curves that are grouped by the column 'differ', we see that the curves are generally parralel in their shape. We see that the blue curve(differ=well) and the red curve(differ=moderate) cross over each other during their paths. We also notice that their is a lot of noise in the end of the graph as all of the curves converage. These two things are noticeable, but we still may assume our Cox Proportional Hazards Assumptions are met. We see that all of the curves are generally parralel, and we remember that this is very noisy data, so we still accept this plot. The paths of the curves is parralel, and it may also be pointed out that the reason the Red and Blue curve may cross is because they are much more similar than the Green curve. The Green curve(differ=poor) is much different than the Blue curve and Red curve which is mainly what we are looking for. There may be not many differences between moderate and well differentiation of tumours, but there is a difference between the poor differentiation patients. Thus, we may assume our Cox Proportional Hazards Assumptions are met when grouped by differ.

The log-log plot of the curves grouped by the column 'differ' intersect and converge heavily towards the end of the graph. We decide to make changes to the covariate 'differ' to improve our Cox PH analysis.

4. Improvements and Innovations

Stratification

To improve our Cox PH analysis, we stratified on the covariate "differ" so that our model would not apply PH assumption to "differ" covariate.

# we plut in the treatment variable into the model we have fitted
death.colon.trt.select.strata <- coxph(Surv(time, status) ~ rx + nodes + 
                                  strata(differ) +  extent, data = death)
death.colon.trt.select.strata

## Call:
## coxph(formula = Surv(time, status) ~ rx + nodes + strata(differ) + 
##     extent, data = death)
## 
##                           coef exp(coef)  se(coef)      z       p
## rxLev                -0.102229  0.902822  0.114362 -0.894 0.37137
## rxLev+5FU            -0.381147  0.683077  0.121441 -3.139 0.00170
## nodes                 0.082256  1.085734  0.009382  8.768 < 2e-16
## extentmuscle          0.749324  2.115570  0.602595  1.243 0.21369
## extentserosa          1.272497  3.569754  0.581146  2.190 0.02855
## extentcontiguous_str  1.668490  5.304153  0.614989  2.713 0.00667
## 
## Likelihood ratio test=92.86  on 6 df, p=< 2.2e-16
## n= 888, number of events= 430 
##    (41 observations deleted due to missingness)

A likelihood ratio test on our newly stratified model provides a p-value that is much smaller than 0.05. Once again, with the new stratified model, we reject the null hypothesis and claim that there exists a significant difference between the survival rates of subjects that don't use treatment("Obs") vs those who do("Lev", "Lev + 5FU")

To see if our new model still violate the PH assumption, we run a cox.zph test on our new stratified model.

cox.zph(death.colon.trt.select.strata)

##        chisq df    p
## rx     2.288  2 0.32
## nodes  0.692  1 0.41
## extent 5.803  3 0.12
## GLOBAL 9.431  6 0.15

The p-value from the cox.zph goodness of fit test for the coxPH assumption is 0.15 and is greater than 0.05.
We conclude that the model we have selected & stratified does not violate the pH assumption any longer. We see that have taken the correct steps to ensure the validity in our cox PH model with stratification.

Innovation

survtrend <- function(formula, data, print.table = TRUE) {
  diff.fit <- survdiff(formula, data = data)
  df <- length(diff.fit$n) - 1
  score <- coxph(formula, data = data)$score
  if (print.table) {
    diff.tumor <- cbind(diff.fit$n, diff.fit$obs, diff.fit$exp)
    colnames(diff.tumor) <- c("N", "Observed", "Expected")
    print(diff.tumor)
  }
  cat("\nLogrank Test : Chi(", df, ") = ", diff.fit$chisq, ", p-value = ", 1 -
  pchisq(diff.fit$chisq, df), sep = "")
  cat("\nTarone Test Trend : Chi(1) = ", score, ", p-value = ", 1 - pchisq(score,
  1), sep = "")
}

death2 <- with(death, Surv(time, status))
#rx2 <-survfit(death2 ~ rx, data = death, conf.type = "log-log")

differ1 <-survfit(death2 ~ differ, data = death, conf.type = "log-log")
colors <- c("cornflowerblue", "deeppink", "forestgreen")
plot(differ1, lty = c(1:3), main = "Survival Curves Based on Differentiation of Tumor",
col = colors, lwd = 2,
xlab = "Days", ylab = "Survival Probability")
legend("bottomleft", lty = c(1:3),
col = colors, lwd = 2,
legend = c("Well", "Moderately", "Poorly"), bty = "n")

Figure 5: Survival Curve Plot

The graph indicates subjects with poorly differentiated tumor cells have a lower chance of survival;
however, as time increases, it converges more closely to the "well" and "moderately" lines.
In this example, we use the Tarone-Ware test, which is a variant of the log-rank test.

survtrend(death2 ~ differ, data = death)  #citation

##                   N Observed Expected
## differ=well      93       42  47.5287
## differ=moderate 663      311 334.9173
## differ=poor     150       88  58.5540
## 
## Logrank Test : Chi(2) = 17.18909, p-value = 0.0001851124
## Tarone Test Trend : Chi(1) = 17.18517, p-value = 3.390732e-05

From our test we get a P-value that is lower than our alpha level of 0.05. Thus, we reject the Null hypothesis and accept the alternative hypothesis. Thus, we conclude that there is a statistically significant difference between the survival probabilities of differing differentiation of the tumours.

"Tarone and Ware (1977) test -- with weights equal to the square root of the number of subjects in the risk pool at time \(t_i\)

Like Wilcoxon’s test, this test is appropriate when hazard functions are thought to vary in ways other than proportionally and when censoring patterns are similar across groups. The test statistic is constructed by weighting the contribution of each failure time to the overall test statistic by the square root of the number of subjects at risk. Thus, like the Wilcoxon test, it gives heavier weights, although not as large, to earlier failure times. Although less susceptible to the failure and censoring pattern in the data than Wilcoxon’s test, this could remain a problem if large differences in these patterns exist between groups."

found here: https://www.stata.com/manuals13/stststest.pdf

5. Conclusion

Overall

[Write which model we chose, and why/how that explains our original objective. State what we found through analysis and why it's important.]

Because each subject has recurrence data and death data, our first task was to subset etype==2, indicating “death.” We then plotted Kaplan-Meier curves for the colon cancer patients, and our next step was plotting KaplanMeier curves for patients by treatments (“Observation,” “Levamisole,” and “Levamisole and Fluorouracil”). We observed that Lev+5FU seems to improve survival rates by a significant amount, so we implemented the log-rank test and likelihood ratio test in order to get a deeper view of the treatments’ effects. From these results, we conclude that using both Levamisole and Fluorouracil significantly improves survival rates. On the other hand, it does not appear that Levamisole by itself does not have a significant effect on survival rates. We also checked the hazard ratio confidence interval which validated our conclusions. Next, we wanted to find the most accurate model that predicts survival times. Our method was BIC model fitting with forward selection. Since our focus is on the effect of the treatments, our goal was to determine which covariates work best with the variable “rx.” Our first model that we chose violated our coxPH assumption because the covariate “differ” did not pass the check for linearity. Thus, we stratified on this covariate, and our improved model no longer violates the proportional hazards assumption. The model we have chosen is as follows: death.colon.trt.select.strata <- coxph(Surv(time, status) ~ rx + nodes + strata(differ) + extent, data = death).

limitations

assumption that nodes column was the correct one instead of node4
not taking measures to fix the insignificance of the Lev treatment level

6.References

[1] https://www.cancer.gov/types/common-cancers#:~:text=The%20most%20common%20type%20of,are%20combined%20for%20the%20list.

https://www.npr.org/sections/health-shots/2017/02/28/517563769/why-are-more-young-americans-getting-colon-cancer

[2] Therneau TM, Lumley T, Elizabeth A, Cynthia C. Package 'survival': Survival Analysis 23-4 available at https://cran.r-project.org/web/packages/survival/survival.pdf

[3] https://www.cancernetwork.com/view/why-levamisole-appears-improve-efficacy-5-fu

[4] https://daviddalpiaz.github.io/appliedstats/summarizing-data.html#categorical

[5] https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/filter

[Also need to find links that helped us with code.]

citation("survival")

## 
## Therneau T (2015). _A Package for Survival Analysis in S_. version
## 2.38, <URL: https://CRAN.R-project.org/package=survival>.
## 
## Terry M. Therneau, Patricia M. Grambsch (2000). _Modeling Survival
## Data: Extending the Cox Model_. Springer, New York. ISBN 0-387-98784-3.
## 
## To see these entries in BibTeX format, use 'print(<citation>,
## bibtex=TRUE)', 'toBibtex(.)', or set
## 'options(citation.bibtex.max=999)'.

citation("ggplot2")

## 
## To cite ggplot2 in publications, please use:
## 
##   H. Wickham. ggplot2: Elegant Graphics for Data Analysis.
##   Springer-Verlag New York, 2016.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Book{,
##     author = {Hadley Wickham},
##     title = {ggplot2: Elegant Graphics for Data Analysis},
##     publisher = {Springer-Verlag New York},
##     year = {2016},
##     isbn = {978-3-319-24277-4},
##     url = {https://ggplot2.tidyverse.org},
##   }

citation("ggpubr")

## 
## To cite package 'ggpubr' in publications use:
## 
##   Alboukadel Kassambara (2020). ggpubr: 'ggplot2' Based Publication
##   Ready Plots. R package version 0.4.0.
##   https://CRAN.R-project.org/package=ggpubr
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {ggpubr: 'ggplot2' Based Publication Ready Plots},
##     author = {Alboukadel Kassambara},
##     year = {2020},
##     note = {R package version 0.4.0},
##     url = {https://CRAN.R-project.org/package=ggpubr},
##   }

citation("dplyr")

## 
## To cite package 'dplyr' in publications use:
## 
##   Hadley Wickham, Romain François, Lionel Henry and Kirill Müller
##   (2020). dplyr: A Grammar of Data Manipulation. R package version
##   1.0.0. https://CRAN.R-project.org/package=dplyr
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {dplyr: A Grammar of Data Manipulation},
##     author = {Hadley Wickham and Romain François and Lionel Henry and Kirill Müller},
##     year = {2020},
##     note = {R package version 1.0.0},
##     url = {https://CRAN.R-project.org/package=dplyr},
##   }

citation("survminer")

## 
## To cite package 'survminer' in publications use:
## 
##   Alboukadel Kassambara, Marcin Kosinski and Przemyslaw Biecek (2020).
##   survminer: Drawing Survival Curves using 'ggplot2'. R package version
##   0.4.8. https://CRAN.R-project.org/package=survminer
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {survminer: Drawing Survival Curves using 'ggplot2'},
##     author = {Alboukadel Kassambara and Marcin Kosinski and Przemyslaw Biecek},
##     year = {2020},
##     note = {R package version 0.4.8},
##     url = {https://CRAN.R-project.org/package=survminer},
##   }

Survival Analysis on Colon Caner Patients

Alexander Yoon, Julia Ho, Sam Kozlowski

12/18/2020

Table of Contents