User George Savva - Cross Validated

Comment by George Savva on Best plot in R for count data with a broad range...

January 24, 2024, 2:07 am

@NickCox I completely agree. This is an interesting question so if I get time I might think about how to make the waffle plot with direct labelling.

View Article

Comment by George Savva on Example of sample size calculation for a...

January 25, 2024, 6:25 am

@usεr11852 I've added comments to the code.

View Article

Answer by George Savva for Is a mixed-model really required?

December 16, 2022, 3:37 am

Depending on your exact research question and the reasons for multiple procedures, consider removing the individuals with multiple procedures from the dataset, or only include their first or a randomly...

View Article

Answer by George Savva for Expected Unique Toys

January 18, 2023, 7:20 am

Your approach was nearly right but I would think about the expectation of having collected the $j$th toy after each draw, and then summing the expectation over the toys. The expected number of unique...

View Article
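The excerpt above is cut off, so the question's exact setup is not visible. As a minimal sketch, assume `n_toys` equally likely toy types and `n_draws` independent draws with replacement (both assumptions, not taken from the post). Under that assumption the chance that toy $j$ has been collected is $1 - (1 - 1/n)^k$, and summing this over the toys gives the expected number of unique toys. The function names are hypothetical.

```python
import random


def expected_unique(n_toys: int, n_draws: int) -> float:
    """Expected number of distinct toys after n_draws draws.

    Assumes equally likely toys drawn independently with replacement
    (an illustrative assumption; the original question is truncated).
    """
    p_seen = 1 - (1 - 1 / n_toys) ** n_draws  # P(toy j collected at least once)
    return n_toys * p_seen  # sum the same probability over all toys


def simulate_unique(n_toys: int, n_draws: int, n_sims: int = 20000, seed: int = 1) -> float:
    """Monte Carlo check of expected_unique under the same assumptions."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_sims):
        # A set keeps only the distinct toys seen in this run of draws.
        total += len({rng.randrange(n_toys) for _ in range(n_draws)})
    return total / n_sims
```

Comparing `expected_unique(10, 10)` with `simulate_unique(10, 10)` is a quick way to check that the sum of per-toy expectations agrees with simulation.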

Answer by George Savva for Should I treat these data points as outliers?

January 22, 2023, 4:26 am

It should be obvious that 50% of your data will fall above the 75th or below the 25th centile, and that 2% is either above the 99th or below the 1st. There is no data-driven answer to 'what is an...

View Article

Answer by George Savva for Confidence interval for logistic regression in R

January 26, 2023, 8:01 am

It looks like gtsummary is using the Wald test based on a normal approximation to get the p-value (the same as summary.glm would use), while it's getting the confidence interval from confint which uses...

View Article

Answer by George Savva for Dichotomising vs keeping categories in regression

February 2, 2023, 5:23 am

This is a great example of why it is so difficult to 'hypothesize after the results are known' and why we should make analysis plans before we start, even for observational data. Every analysis we try...

View Article

Answer by George Savva for Is it possible to use variable importance to make...

February 3, 2023, 2:31 am

The term 'Risk factor' is vague; it's not clear whether you're referring to a modifiable factor that if we were to alter would affect your outcome, or a predictive factor whose value is correlated with...

View Article

Answer by George Savva for Why Is It Important To Simulate Data From A...

February 6, 2023, 4:48 am

Parametric bootstrapping relies on simulating data from an estimated model. If it is difficult to work out confidence intervals analytically, then simulating datasets from your estimated model can give...

View Article
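To make the parametric bootstrap idea in the excerpt above concrete, here is a minimal sketch. It assumes an exponential model for the data and a percentile interval for the mean; both choices are illustrative assumptions and are not taken from the original answer. The function name is hypothetical.

```python
import random
import statistics


def parametric_bootstrap_ci(data, n_boot=2000, level=0.95, seed=42):
    """Percentile CI for the mean, simulating from a fitted exponential model."""
    rng = random.Random(seed)
    n = len(data)
    rate_hat = 1 / statistics.fmean(data)  # maximum likelihood estimate of the rate

    boot_means = []
    for _ in range(n_boot):
        # Simulate a new dataset of the same size from the *estimated* model.
        sim = [rng.expovariate(rate_hat) for _ in range(n)]
        boot_means.append(statistics.fmean(sim))

    boot_means.sort()
    alpha = 1 - level
    lower = boot_means[int((alpha / 2) * n_boot)]
    upper = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

The same pattern applies to any statistic: fit the model, simulate many datasets from the fit, recompute the statistic on each, and read the interval from the simulated distribution.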

Answer by George Savva for Are there any simpler alternatives to boxplots and...

February 10, 2023, 5:38 am

I agree with you that a box plot alone doesn't describe the data very well. But a box plot overlaid with the data points might do. See for example the graph below. I've switched the axis around so it's...

View Article

Experimental design for a 96 well plate

February 10, 2023, 7:55 am

I am designing an experiment that will include an assay on a 96 well plate and could use some advice on the optimal layout. In our experiment we have 12 biological replicates, and each replicate has...

View Article

Answer by George Savva for Linear regression with categorical variable: why...

March 8, 2023, 8:35 am

The default linear model implied by lm assumes that the residual standard deviation in sepal length is the same for all plants. Under this assumption, the residual standard deviation for any plant is...

View Article

Answer by George Savva for Similarity measure for two discrete distributions...

March 11, 2023, 4:05 pm

This is one minus the Bray-Curtis dissimilarity between two compositions expressed as proportions. See: https://en.wikipedia.org/wiki/Bray%E2%80%93Curtis_dissimilarity This is a widely used measure in...

View Article
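The measure proposed in the question is not shown in the excerpt above, so the following is only a sketch of the quantity the answer names. Bray-Curtis dissimilarity is $\sum_i |p_i - q_i| / \sum_i (p_i + q_i)$; when both vectors are proportions summing to one, one minus this value equals $\sum_i \min(p_i, q_i)$. The function name is hypothetical.

```python
def bray_curtis_similarity(p, q):
    """One minus the Bray-Curtis dissimilarity between two compositions."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = sum(p) + sum(q)  # equals 2 when both are proportions
    dissimilarity = sum(abs(a - b) for a, b in zip(p, q)) / total
    return 1 - dissimilarity


def overlap(p, q):
    """Sum of element-wise minima; matches the similarity for proportions."""
    return sum(min(a, b) for a, b in zip(p, q))
```

For example, `bray_curtis_similarity([0.5, 0.3, 0.2], [0.2, 0.3, 0.5])` and `overlap` on the same inputs agree up to floating-point error.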

Answer by George Savva for Is there any downside to scaling a dataset?

March 31, 2023, 3:43 am

Your updated question references application of 'StandardScaler' to everything. From the documentation by default this centers (by subtracting the mean) and then rescales every feature to a unit...

View Article
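The centering and rescaling described in the excerpt above can be sketched for a single feature using only the standard library. This is not scikit-learn's implementation; it assumes the population standard deviation (divisor $n$), which I believe matches the `StandardScaler` default, and the function name is hypothetical.

```python
import statistics


def standardize(column):
    """Center a feature on its mean and rescale it to unit variance."""
    mean = statistics.fmean(column)
    sd = statistics.pstdev(column)  # population SD (divisor n), an assumption
    if sd == 0:
        # A constant feature carries no spread to rescale; return zeros.
        return [0.0 for _ in column]
    return [(x - mean) / sd for x in column]
```

After this transformation every feature has mean 0 and variance 1, so the original units are lost and model coefficients are expressed per standard deviation rather than per original unit.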

Answer by George Savva for Should adjusted models produce narrower CIs than...

May 18, 2023, 6:23 am

One of the most important reasons to add covariates into a regression model is to explain residual variation in the outcome, and so increase precision in parameter estimates from the model. So if the...

View Article

Answer by George Savva for Inconsistent results of a 2-way anova from aov()...

July 3, 2023, 4:37 am

The different type-3 tables from car and rstatix are using different contrasts for the main effects. From the documentation in rstatix::anova_test() (emphasis mine): By default, R uses treatment...

View Article

Answer by George Savva for Paired t-test for 2 patients?

September 21, 2023, 2:12 am

You should probably resist the temptation (or the pressure) to conduct a hypothesis test on feasibility study data, particularly with N=2. A feasibility study has met its objectives if it tests the...

View Article

Answer by George Savva for Differences in Regression model for Dummy Coding...

September 22, 2023, 5:29 am

There's a mistake in your coding.

data_WPA$UppCl <- recode(data_WPA$cot_class, "1=0; 2=1; 3=0; 4=0; 5=1")

should be:

data_WPA$UppCl <- recode(data_WPA$cot_class, "1=0; 2=0; 3=0; 4=0; 5=1")

This...

View Article

Answer by George Savva for Calculating statistical significance of gender...

September 27, 2023, 6:11 am

As well as the issues already raised in the comments and the answer - if you were motivated to conduct this test just because you noticed the gender difference then you are HARKing (hypothesizing after...

View Article

Answer by George Savva for Can you use survival analysis on subjects spanning...

October 10, 2023, 5:27 am

You have a 'time varying covariate' in that the 'period' variable changes over time and individual contracts can change within individuals. The typical way to handle this is to split individuals into...

View Article

Answer by George Savva for Mixed models of lme4 and nlme giving too much...

November 7, 2023, 2:43 am

I think there are a couple of different questions here: first, what should happen to the cluster mean estimates when a mixed effects model is used, and second, what should happen to the estimates for...

View Article

What is the statistical rationale for a phase 2 clinical trial?

November 18, 2023, 1:23 pm

There is a lot of contradictory information about the purpose of phase 2 clinical trials. Many sources claim that the aim is to test whether a treatment works, and sample size calculators exist to...

View Article

Is there a good reason for a lab to repeat experiments instead of conducting...

December 18, 2023, 4:10 am

It is not uncommon for biologists to repeat experiments to confirm initial results. Intuitively this makes some sense but to me seems inefficient and potentially problematic. (For example, one client...

View Article

Answer by George Savva for Correlations Fixed Effect

February 6, 2024, 2:41 am

Your assumption of multi-collinearity is correct, but it's not necessary for multi-collinearity that a*c and a*b are correlated in themselves. Multi-collinearity occurs if any linear combination of...

View Article

Answer by George Savva for A and B are independent. Does P(A ∩ B|C) = P(A|C)...

February 13, 2024, 4:04 am

No, this is not in general true, as you can see from a simple counterexample:

Toss two independent coins.

Event $A$ is coin 1 head. $P(A)=0.5$

Event $B$ is coin 2 head. $P(B)=0.5$

Event $C$ is either coin...

View Article
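The definition of event $C$ is cut off in the excerpt above, so the following enumeration uses an assumed $C$, "at least one coin shows heads", and assumes the claim under test is $P(A \cap B \mid C) = P(A \mid C)\,P(B \mid C)$. Both are labelled assumptions and may differ from the original answer; the point is only to show how such a counterexample can be checked exactly.

```python
from fractions import Fraction
from itertools import product

# The four equally likely outcomes of two fair, independent coins.
outcomes = list(product("HT", repeat=2))


def prob(event, given=lambda o: True):
    """Exact P(event | given) by counting equally likely outcomes."""
    conditioned = [o for o in outcomes if given(o)]
    return Fraction(sum(1 for o in conditioned if event(o)), len(conditioned))


def A(o):
    return o[0] == "H"  # coin 1 shows heads


def B(o):
    return o[1] == "H"  # coin 2 shows heads


def C(o):
    return "H" in o  # ASSUMED: at least one head (the source text is truncated)


def A_and_B(o):
    return A(o) and B(o)


p_ab_given_c = prob(A_and_B, C)
p_a_given_c = prob(A, C)
p_b_given_c = prob(B, C)

print(p_ab_given_c, p_a_given_c * p_b_given_c)
```

Unconditionally $A$ and $B$ are independent, but once we condition on this $C$ the two printed values differ, so the conditional factorisation fails.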

Comment by George Savva on Power of two-sample z-test

February 16, 2024, 3:15 am

I think @Ggjj11 means a two sample vs one sample test. The formula is for a one-sample test (or a paired test), not a test of two independent samples.

View Article

Answer by George Savva for Power of two-sample z-test

February 16, 2024, 6:34 am

Your formula is for a one-sample test, that is for testing the null hypothesis that the mean of a population has a given value given a single sample from it. You can check the 'one sample' calculation...

View Article
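The distinction drawn in the answer above can be illustrated with a short sketch of two-sided z-test power for a known standard deviation. The formula in the question is not shown in the excerpt, so this is a generic textbook calculation and not a reconstruction of it; the function name and parameters are hypothetical. The only difference between the two cases is the standard error: $\sigma/\sqrt{n}$ for one sample, $\sigma\sqrt{2/n}$ for two independent samples of size $n$ each.

```python
from statistics import NormalDist


def z_test_power(delta, sigma, n, alpha=0.05, two_sample=False):
    """Power of a two-sided z-test with known sigma.

    n is the sample size (per group when two_sample=True).
    """
    std_normal = NormalDist()
    if two_sample:
        se = sigma * (2 / n) ** 0.5  # SE of a difference between two means
    else:
        se = sigma / n ** 0.5  # SE of a single mean
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = delta / se
    # Probability of rejecting in either tail when the true effect is delta.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)
```

For the same `delta`, `sigma` and `n`, the two-sample design has lower power, and it needs twice as many observations per group to match the one-sample power.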

Comment by George Savva on Why is an offset needed in Poisson Regression?

February 28, 2024, 7:48 am

What do you mean by 'modelling rates instead of a count'?

View Article

Comment by George Savva on Why is an offset needed in Poisson Regression?

February 28, 2024, 8:34 am

@pq44pq Consider 100 pieces of gum in 100 square meters of pavement vs 1 piece of gum on 1 square meter. These are the same rate but do not carry the same amount of information.

View Article
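The gum example in the comment above can be put in numbers. For a Poisson count the variance equals the mean, so a rough standard error for an estimated rate is $\sqrt{\text{count}}/\text{area}$. The sketch below uses that approximation; the function name is hypothetical.

```python
import math


def rate_with_se(count, area):
    """Estimated rate per unit area and its approximate Poisson standard error."""
    rate = count / area
    se = math.sqrt(count) / area  # Var(count) is estimated by the count itself
    return rate, se


large_survey = rate_with_se(100, 100)  # 100 pieces of gum in 100 square meters
small_survey = rate_with_se(1, 1)      # 1 piece of gum in 1 square meter
```

Both surveys estimate the same rate, but the standard error from the larger survey is ten times smaller. Modelling the count with `log(area)` as an offset keeps that information, whereas modelling the pre-computed rate discards it.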

Comment by George Savva on How to incorporate p-value information for...

March 1, 2024, 12:57 am

What does a p-value for a distance mean? A p-value corresponds to a hypothesis test, what is the hypothesis test here?

View Article

Comment by George Savva on Is there a good reason for a lab to repeat...

March 5, 2024, 12:42 am

Thanks for the considered answer. I agree these would be the main arguments in favour of repeating experiments. But I'm not sure any of these scenarios addresses the question of why a single blocked...

View Article

Answer by George Savva for Are these valid histograms?

March 8, 2024, 4:56 am

Histograms can be used to represent frequency (as in the examples you showed) or density. Like you I was taught density histograms only, but frequency histograms are in very common use (and are the...

View Article
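The difference between the two kinds of histogram mentioned above comes down to the bar heights: a frequency histogram plots the count in each bin, while a density histogram plots count divided by (total count times bin width), so that the bar areas sum to one. A minimal sketch, with a hypothetical function name and made-up data:

```python
def histogram(values, edges):
    """Return (counts, densities) for bins [edges[i], edges[i+1]).

    The last bin also includes its right edge. Values outside the
    edges are ignored.
    """
    n_bins = len(edges) - 1
    counts = [0] * n_bins
    for v in values:
        for i in range(n_bins):
            is_last = i == n_bins - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    n = sum(counts)
    densities = [
        c / (n * (edges[i + 1] - edges[i]))  # height such that area = proportion
        for i, c in enumerate(counts)
    ]
    return counts, densities


counts, densities = histogram([1, 2, 2, 3, 3, 3, 4, 4, 5, 9], [0, 2, 4, 10])
```

With equal bin widths the two versions have the same shape; with unequal widths, as here, only the density version gives bars whose areas are comparable.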

Comment by George Savva on What are some good books on how to avoid, as a...

March 11, 2024, 1:44 am

Not writing as an answer because I haven't read it, but 'Bullshit Jobs' by David Graeber might fit the bill. There was an article in Significance a few years ago on how it might relate to...

View Article

Answer by George Savva for Increasing the power by dropping points: can I do it?

March 12, 2024, 7:23 am

Your question might relate to 'independent filtering' of data that is performed by commonly used bioinformatics pipelines before any FDR adjustment takes place. The typical situation this is applied is...

View Article

Comment by George Savva on How does plotting QQ plot on ggplot work?

March 14, 2024, 4:02 am

There are two different stat_qq_line functions, one in ggplot2 and another in qqplotr and they do different things. Can you specify which you are using?

View Article

Comment by George Savva on Unexpected p-value distribution of Mann-Whitney U...

March 18, 2024, 3:04 am

I can't reproduce this in R, I get a perfectly uniform distribution.

View Article

Comment by George Savva on Using McNemar’s test to compare binary classifiers

March 21, 2024, 2:44 am

I would guess that the categories you want on your table margins are not 'yes' and 'no' but 'correct' and 'incorrect' classification. The McNemar test is then a test of whether one model classifies more...

View Article

Comment by George Savva on emmeans differences in logistic regression give...

March 22, 2024, 7:22 am

You could look into what the marginaleffects package does. With type="response" it will provide you with a 'significant' p-value with a standard error and confidence interval. I'm not adding as an...

View Article

Comment by George Savva on Two-sided t-tests: Why do we need to test...

April 16, 2024, 5:17 am

+1 for the quote at the end. We need more of this!

View Article

Answer by George Savva for If we know that there were no type 1 errors in...

April 16, 2024, 4:09 am

Yes, you would still need to make some correction to control the probability of a Type 1 error from your experimental procedure as a whole. If you reset alpha to (say) 0.05 whenever the interim analysis...

View Article

Comment by George Savva on phacking R package. Is it possible to estimate the...

April 19, 2024, 5:47 am

From the documentation it doesn't look like the phacking package does what you think it does. What is your understanding of what the phacking_meta function is for? My reading is that it is for...

View Article

Answer by George Savva for Applying the law of total variance

April 19, 2024, 6:36 am

The first term is the variance caused by $Y$ varying, the second term is the variance caused by $X$ varying. To calculate the second term, find the expectation of $Y$ for each value of $X$, and then...

View Article
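The decomposition described above, $\operatorname{Var}(Y) = E[\operatorname{Var}(Y \mid X)] + \operatorname{Var}(E[Y \mid X])$, can be checked exactly on a small example. The joint distribution below is made up purely for illustration and is not from the original question; exact fractions are used so the identity holds without rounding.

```python
from fractions import Fraction as F

# Hypothetical joint distribution P(X = x, Y = y), for illustration only.
joint = {(0, 0): F(1, 4), (0, 2): F(1, 4), (1, 1): F(1, 8), (1, 5): F(3, 8)}


def mean(dist):
    return sum(value * p for value, p in dist.items())


def variance(dist):
    m = mean(dist)
    return sum((value - m) ** 2 * p for value, p in dist.items())


# Marginal distributions of X and Y.
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0) + p
    p_y[y] = p_y.get(y, 0) + p


def y_given(x):
    """Conditional distribution of Y given X = x."""
    return {y: p / p_x[x] for (xx, y), p in joint.items() if xx == x}


# First term: average over X of the variance of Y within each value of X.
within = sum(p_x[x] * variance(y_given(x)) for x in p_x)

# Second term: find E[Y | X = x] for each x, then take the variance of
# those conditional means, weighting by P(X = x).
conditional_means = {x: mean(y_given(x)) for x in p_x}
overall_mean = sum(p_x[x] * m for x, m in conditional_means.items())
between = sum(p_x[x] * (m - overall_mean) ** 2 for x, m in conditional_means.items())

total = variance(p_y)
```

Here `within + between` equals `total` exactly, which is the law of total variance for this distribution.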

Comment by George Savva on How to test specific contrasts about levels of...

April 22, 2024, 1:06 am

I can't add an answer while the question is closed, but if you modify the dataset such that Group only takes one value for treatment level B (effectively adding the constraint corresponding to your...

View Article