
A2: Data Analysis & Research Design Evaluation 1/090012 Fundamentals of Biostatistics/11521 Programming for Data Science G/MAT5212 /ECON1066


Submission Deadline: Before 1300 hours (1pm) WAST Monday September 13

Q1: (22 marks). This question is designed to assess your understanding of the scientific research process and how it is used to answer research questions, as explained in the relevant ilecture. You are given a broad area (listed below) and are asked to come up with a research question and plan a hypothetical research study to answer it. You will do this by answering a few questions in the table below. You can come up with any research question as long as it fulfils three conditions:
⦁ Your research question is relevant somehow to the given broad area of COVID Lockdowns and First Year University Students.
⦁ Your research question is answerable by collecting and analysing some real data (although you are not actually collecting data for this or for any other question).
⦁ The hypothetical study you are planning to answer your research question is ‘doable’ i.e. it is possible practically and humanely.
What is your Research Question?
(1 mark)

What is the aim of your proposed study? What benefit/s will it have?
(2 marks)

Which study design do you choose for your study and why?
(2 marks)

What will be the inclusion criteria for your study participants?
Which sampling method will you choose, and why?
How will you recruit participants? (3.5 marks)

What are your independent and dependent variables?

What are three other factors/variables you will collect information about and why? (3.5 marks)
Provide the scale of measurement & units for each variable you described in the part above (e.g. Pulse rate; ratio variable; number of beats/min). (5 marks)

For each of your five variables, provide one suitable descriptive statistic & one suitable graph (5 marks)

Q2: (4 marks) A study investigated the link between prostate cancer and Dietary Inflammatory Index Score (DIIS), measured as High or Low, using a prospective cohort design. A total of 12000 men, all free of prostate cancer at baseline, were recruited for this study. At the end of a four-year follow-up, 52 men were diagnosed with prostate cancer among the 3502 who regularly consumed a diet with a high inflammatory index, while there were 17 cases among those who consumed a diet with a low inflammatory index.
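The task for this question is not stated in the text above. One measure commonly reported for a prospective cohort design is the relative risk, and the sketch below shows how it could be computed from the figures given. It assumes that every man not in the high-DIIS group belongs to the low-DIIS group (12000 - 3502); the text does not say this explicitly. The function and variable names are illustrative only.

```python
def cumulative_incidence(cases, at_risk):
    """Proportion of an initially disease-free group that became cases."""
    return cases / at_risk


total_men = 12000
high_n, high_cases = 3502, 52

# Assumption: all remaining participants consumed the low-DIIS diet.
low_n = total_men - high_n
low_cases = 17

risk_high = cumulative_incidence(high_cases, high_n)
risk_low = cumulative_incidence(low_cases, low_n)

# Relative risk: incidence in the exposed group divided by incidence
# in the unexposed group.
relative_risk = risk_high / risk_low

print(f"Incidence (high DIIS): {risk_high:.4f}")
print(f"Incidence (low DIIS):  {risk_low:.4f}")
print(f"Relative risk:         {relative_risk:.2f}")
```

Under that assumption, a relative risk above 1 would indicate a higher four-year incidence in the high-DIIS group.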

For Q3 to Q6 you will use the SPSS data file you have created using information from Documents A & B.

Q3. (9 marks). A few parts of this question will require the use of the Special Functions you learned in lab 2.
⦁ Create a suitable graph to display the frequency distribution for those who spend more than ten hours per week on social media and describe your graph briefly (4 lines maximum). (2 marks)
⦁ Choose those who prefer a dog as a pet and obtain a frequency table showing mean, standard deviation, minimum and maximum for their time spent on Online Shopping. (1 mark)
⦁ Create two groups (or categories) using variable Degree: (2 marks)
Group 1: Less than High School + High School + TAFE Qualification
Group 2: Undergraduate Degree + Post Graduate Degree
Now produce a table that shows mean, standard deviation, minimum and maximum Work Hours for both groups.
⦁ Create a new variable ‘CollectiveAveragetime’ (which is the overall average based on TV hours, Socialmedia and OnlineShopping). Provide a suitable graph for the new variable (CollectiveAveragetime) and describe main features of your graph (4 lines maximum) (2 marks).
⦁ Obtain and describe a suitable graph that shows the relationship between Work Hours and Online Shopping. (2 marks)

Q4. (9 marks) Choose a variable (yes, it is your choice of a suitable variable) and carry out all the checks to assess whether it shows a roughly normal distribution in the population represented by this sample of 32 participants. Provide a brief description for each check under the relevant output. At the end, provide an overall conclusion. Log transformation is not required. (Useful tip: lab 3 is a good guide, but explanations are in the relevant ilecture.)

Q5. (3 marks) Choose any two variables from your dataset that are suitable for investigating if there is a possible Association between them. Perform and report on the statistical significance of your findings (assessment of practical significance is not required). You do not need to test or report on any assumption.

Q6. (3 marks) Choose any two variables from your dataset that are suitable for investigating a possible Correlation. You do not need to test or report on any assumption.
Perform the analysis and briefly describe the strength, direction and statistical significance of the relationship.
Assess and report on the Practical significance.



Q1: (22 marks). This question is designed to assess your understanding of the scientific research process, how it is used to plan and conduct a research study to answer a research question (as explained in the relevant ilecture). You are given a broad topic area (listed below) and you will formulate a relevant research question and then plan a hypothetical research study to answer this research question. You will do this by answering questions provided in the table below. You can come up with any research question as long as it fulfils three conditions:
Your research question is relevant to the topic/area Financial Stress & First Year Curtin Students.
Your research question is answerable by collecting and analysing some real data (though you are not actually going to collect data for this or for any other question).
The hypothetical study you are planning to answer your research question is ‘doable’ i.e. it is possible practically and humanely.
Formatting requirements & Table: (for Q1 only): Times New Roman size 12 font with single spacing. Blank space provided in the table for each answer is the maximum space. Marker will only read and mark what is in the allowed space. Please stop typing when you cannot see what you are typing (as this means your answer has reached the maximum space). Apart from Q1, there are no formatting requirements for any other question in this assignment.
CAUTION: DO NOT change any dimensions or change the size of the table, rows or columns. Write your answers below & do not copy & paste this table to another document. Otherwise Penalties will apply.
Financial Stress & First Year Curtin Students
What is your Research Question?
(1 mark)

What is the aim of your proposed study? What benefit/s will it have?
(2 marks)

Which study design do you choose for your study and why?
(2 marks)

What will be the inclusion criteria for your study participants?
Which sampling method will you choose, and why?
How will you recruit participants? (3.5 marks)

What are your independent and dependent variables?
What are three other factors/variables you will collect information about and why? (3.5 marks)
Note: Must be a mix of both data types.
Provide the measurement scale & units/categories (whichever applies) for each variable you described in the part above (e.g. Pulse rate; ratio & no of beats/min). (5 marks)

For each of your five variables, provide one suitable descriptive statistic & one suitable graph (5 marks)

NOTE: You must include relevant SPSS output with your answer for the following questions; No output = No mark
For the questions below, use the SPSS data file you have created from Document B.

Q2. (8 marks). Most parts of this question are based on the use of the special functions you learned in the second lab.
Provide a suitable graph that displays the frequency distribution for those who have a statistics anxiety score of more than 25 and provide a description of your graph (4 lines maximum). (2 marks)
Provide a suitable graph that shows the relationship between IQ scores and time spent on social media. Provide a description of your graph (4 lines maximum). (2 marks)
Provide a suitable graph of pet ownership for those who answered ‘yes’ to the Evaluate question. (1 mark). A description is not required.
(2 marks). First, recode Work Status into a new variable that has only two categories (as below).
Category 1: Full time
Category 2: Other (Part time + Casual Jobs + Do not work) (1 mark)

Now obtain mean, median, mode, standard deviation, minimum and maximum on Social media time displayed separately for each group. (1 mark)

Create a new variable ‘Score’ by combining Statistics anxiety score and IQ scores (You are combining or adding the scores, and not calculating an average). Provide a table showing mean, median, mode, standard deviation, minimum and maximum for ‘Score’. (1 mark)

Q3. (9 marks) Choose a variable from your dataset/data file (yes, it is your choice of a suitable variable) and carry out all the checks to assess whether it shows a roughly normal distribution in the population represented by the provided sample. Provide a brief description for each check under the relevant SPSS output. At the end, provide an overall conclusion. Log transformation is not required. (Useful tip: lab 3 is a good guide, but explanations are in the relevant ilecture.)

Q4. (3 marks) Choose any two variables from your dataset/data file that are suitable for investigating if there is a possible Association between them. Perform and report on the statistical significance of your findings (assessment of practical significance is not required). You do not need to test or report on any assumption.

Q5. (3 marks) Choose any two variables from your dataset/data file that are suitable for investigating a possible Correlation. You do not need to test or report on any assumption.
Perform the analysis and briefly describe the strength, direction and statistical significance of the relationship.
Assess and report on the Practical significance.

====================================================================================================

090012 – Fundamentals of Biostatistics

The dataset you will analyse for this assessment comes from a cross-sectional survey on substance
use and mental health in Sydney University students and was conducted in November 2019. The
dataset comes with 74 variables and 1433 participants – you can find the codebook for the dataset
on Canvas. Please read all instructions and have a look at the codebook before you start! If you have
any questions, have a look at the FAQ provided on Canvas. If you have any further questions please
ask them during the Zoom Meetings or use the discussion board on Canvas.

Structure: There is no specific structure for this assessment; please just use the Tasks as headings.
There is no need for an introduction or a summary, but be sure to state all information that is needed for
each of the tasks. Be sure to copy and paste the full SPSS Output at the end of the assessment.

Note: The assessment is divided into three separate tasks. Task 1 and 2 will assist and guide you to
ensure the dataset is ready and you have provided sufficient information on the assessment itself –
this is needed to be able to answer the questions in Task 3.

Task 1: Data cleaning and Preparation
1. Import the data into SPSS and code the data as per codebook provided on Canvas.
2. Check the data and clean according to the specific instructions provided in the codebook
depending on the variable. Make sure to provide us with information on the cleaning
process stating the participant ID.
3. New variables are needed for upcoming tasks. Create the following variables:
a. Two new binary variables showing the use of any illicit substance in a participant’s
life and in the last year (Yes/No). Note: You may need two steps to create these
variables.
b. Create a summary score variable for: (I) Self Efficacy, (II) Coping, and (III) Mental
Health. Analyse and comment on the internal consistency of the variable. Adjust the
summary as necessary based on your analysis. If you undertake adjustments,
describe the steps you have undertaken and comment on the internal consistency of
the final summary score variable.
c. Create a summary score and a categorical variable for the AUDIT Scores. You do not
need to analyse the internal consistency of this scale.
d. In the variable ‘Area_live’ combine ‘Inner Regional’ and ‘Outer Regional’ into one
value titled ‘Regional’; combine ‘Remote’ and ‘Very Remote’ into one value titled
‘Remote’.
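Item 3b asks for an internal-consistency analysis of each summary score. The text does not name a statistic; Cronbach's alpha is one common choice (it is what the SPSS Reliability Analysis procedure reports), and the sketch below shows how it is computed. The response values are made up for illustration and are not taken from the survey dataset; the function name is hypothetical.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a scale.

    rows: list of tuples, one tuple per participant, one item score
    per position (all tuples the same length, no missing values).
    """
    k = len(rows[0])  # number of items in the scale
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Illustrative responses only: 4 participants, 3 items.
toy_responses = [(3, 4, 3), (2, 2, 3), (5, 4, 5), (1, 2, 1)]
alpha = cronbach_alpha(toy_responses)
print(f"Cronbach's alpha: {alpha:.3f}")
```

Values closer to 1 indicate that the items vary together. Recomputing alpha after dropping an item is one way to judge whether the summary score should be adjusted.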

Task 2: Describing the sample
Describe the sample in the dataset. You need to report on the following variables: Age, Ethnicity,
Living Area, Gender, Sexual Orientation, Relationship Status, Lifetime and Past Year Use of Illicit
Substances, the summary scores for Self-Efficacy, Coping, and Mental Health as well as the summary
scores and categories for the AUDIT-C. Assess the normality of all continuous variables using Option
1 discussed in Module 3 (3.5.1 – Normality Test). Visualise the data for two categorical variables using
appropriate graphs.
Note: Since you analysed the normality here, you can just briefly comment under upcoming analysis
and refer back to this point. You do not need to analyse the normality of a variable more than once
as the data stays the same.

Task 3: Analysis
Analyse the data using the most appropriate statistical methods to answer the following questions:
1. Is there a difference in
a. coping scores between living areas?
b. mental health scores between living areas?
2. Is there a difference in AUDIT-C scores between participants who used illicit substances in
the past year and those who did not?
3. Is there a difference in lifetime illicit substance use by
a. Gender?
b. Living Area?
c. Sexual Orientation?
d. Relationship Status?
4. What is the predicted mental health continuum score for a hypothetical person with a
coping score of 70?
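
Question 4 asks for a prediction, which suggests a simple linear regression of the mental health score on the coping score. The sketch below shows the mechanics with ordinary least squares. The data are invented toy values chosen to lie exactly on a line, not values from the survey, and the function name is hypothetical.

```python
def fit_simple_regression(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Toy data only (lies exactly on y = 10 + 0.5 * x).
coping = [40, 50, 60, 80]
mental_health = [30, 35, 40, 50]

intercept, slope = fit_simple_regression(coping, mental_health)

# Predicted mental health score for a coping score of 70.
predicted = intercept + slope * 70
print(f"Predicted score: {predicted:.1f}")
```

With the real dataset, the intercept and slope would come from the SPSS regression output, and the prediction is formed in the same way.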

For all questions in Task 3, you must report the following:
- hypotheses,
- assumption checks,
- the test statistic and other relevant results,
- a comment on the clinical significance of the findings.
Use this as the structure for this part of the assessment. You do not need to state a research
question.

=============================================================================================

11521 Programming for Data Science G

23:59 Sunday 18/09/2022 (Week 7)

[6 marks] Question 1: Implement a Python program for Nearest Neighbour Classifier that can
classify an unknown data sample to one of the given classes.
For example, there are 2 classes Red and Blue, and x is an unknown data sample (i.e., we do not
know whether x is red or blue). After calculating all distances between x and all data samples in the 2 classes,
we find that a data sample in the Red class has the shortest distance to x, so x is classified as a red data
sample.
Requirements: Your program reads data samples from 2 text files for 2 classes and unknown data
samples from another text file, runs the Nearest Neighbour Classifier algorithm as demonstrated in
the screenshots below, and outputs all unknown data samples and their classified label to screen
and to another text file. Your program should work with any data dimension D > 1 and any number
of unknown data samples > 0. For Python programming, use a tuple to store a data sample, a list to
store all data samples, and modules to store functions. The main program includes only function
calls and does not include any function implementations. Please do not use other versions of
Nearest Neighbour Classifier you can find on websites or research articles, and do not import any
external packages (except tkinter) to this project.
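
The classification rule described above can be sketched as follows. This is a minimal illustration, not a full solution: it omits the file reading, file output and module layout that the requirements ask for. It assumes Euclidean distance, which the text does not specify, and it uses a dictionary to map class labels to lists of samples, which is a choice made here for brevity. All names and sample values are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Distance between two samples stored as tuples of equal length."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def classify(sample, classes):
    """Return the label of the class holding the sample nearest to `sample`.

    classes: dict mapping a label to a list of tuples (the known samples).
    """
    best_label, best_distance = None, math.inf
    for label, samples in classes.items():
        for known in samples:
            d = euclidean_distance(sample, known)
            if d < best_distance:
                best_label, best_distance = label, d
    return best_label


# Made-up 2-D samples; the same code works for any dimension D > 1.
classes = {
    "Red": [(1.0, 2.0), (2.0, 1.0)],
    "Blue": [(8.0, 9.0), (9.0, 8.0)],
}
label = classify((1.5, 1.5), classes)
print(label)
```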

[14 marks] Question 2: Implement a Python program for K-Means Clustering that can group data
samples to clusters.
For example, you are given a set of data samples to group into 2 clusters. The K-means
clustering algorithm generates 2 cluster centres at random, groups the data samples that are nearest to
the first cluster centre to form a cluster, then does the same with the second one to form another
cluster. The algorithm then generates new cluster centres by averaging the data samples in each
cluster. If the difference between the 2 old cluster centres and the 2 new cluster centres is not
significant, the algorithm stops; otherwise it discards the old cluster centres and re-groups the data
samples around the new cluster centres as described above to form new clusters. The process repeats until
the difference between the old and new cluster centres is not significant.
Requirements: Your program reads data samples from a text file, runs K-means Clustering algorithm
as demonstrated in the screenshots below, and outputs all data samples with cluster centres to
screen as below. Your program should work with any data dimension D > 1 and any number of
clusters K > 1. For Python programming, use tkinter to display data samples and cluster centres on a
canvas, a tuple to store a data sample or a cluster centre, a list to store all data samples or all cluster
centres, and modules to store functions. The main program includes only function calls and does not
include any function implementations. Please do not use other versions of K-Means Clustering that
you can find on websites or research articles to implement this project. Please do not import any
external packages (except tkinter) to this project.
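
The loop described above can be sketched as follows. This is a minimal illustration only: it leaves out the file reading, the tkinter display and the module layout that the requirements ask for. Several choices are assumptions not stated in the text: Euclidean distance, initial centres drawn at random from the data samples, "not significant" interpreted as the largest centre movement falling below a small tolerance, and an empty cluster keeping its old centre. All names and sample values are hypothetical.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two tuples of equal length."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def assign(samples, centres):
    """Group each sample with its nearest centre; returns a list of lists."""
    clusters = [[] for _ in centres]
    for s in samples:
        nearest = min(range(len(centres)), key=lambda i: distance(s, centres[i]))
        clusters[nearest].append(s)
    return clusters


def mean_point(points):
    """Coordinate-wise average of a non-empty list of tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(samples, k, tolerance=1e-6, seed=0):
    rng = random.Random(seed)
    centres = rng.sample(samples, k)  # assumption: start from k data samples
    while True:
        clusters = assign(samples, centres)
        # An empty cluster keeps its previous centre (assumption).
        new_centres = [
            mean_point(c) if c else centres[i] for i, c in enumerate(clusters)
        ]
        shift = max(distance(old, new) for old, new in zip(centres, new_centres))
        centres = new_centres
        if shift < tolerance:
            return centres, clusters


# Made-up 2-D samples forming two well-separated groups.
samples = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centres, clusters = k_means(samples, 2)
print(centres)
```

Because the final clusters are recomputed from the same centres that passed the tolerance check, the returned centres and clusters are consistent with each other.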
The screenshots below explain how K-means Clustering algorithm works.

============================================================================================

MAT5212

Question 1
D-Dimer is a small protein fragment found in blood that is indicative of blood clotting.
A D-dimer concentration above a threshold of 500 ng/ml is used to diagnose venous
thromboembolism (VTE). It is suspected that in healthy pregnancy D-Dimer levels
naturally increase above this threshold, thereby leading to false positive VTE test
results. To assess how D-Dimer changes at a population level during healthy
pregnancy, 90 healthy pregnant women, from various trimester stages (30 women at
trimester 1; 30 women at trimester 2; 30 women at trimester 3), were recruited and
their D-Dimer concentrations were measured. The aim of the study is to determine if
the mean D-Dimer concentrations differ between the three trimesters of pregnancy.
The data for this question can be found in the ‘D-Dimer’ sheet of your data file. Use the
5% level of significance to answer this question.
 
Question 2
VO2 max is a measure of cardiovascular health. It is the maximum oxygen volume a
person can consume while performing intense exercise. For sedentary individuals, it
takes approximately 4-6 weeks of training to noticeably increase VO2 max. High-
intensity interval training (HIIT) programs are thought to increase VO2 max more
quickly. 
To determine whether HIIT was capable of increasing VO2 max after only 3 weeks, 25
sedentary males aged 40-49 years were recruited for the study. Their VO2 max was
measured at baseline and again after 3 weeks of the training program. The
measurements (ml/kg/min) can be found in the ‘VO2’ sheet of your data file. Use alpha
= 0.01 to answer this question.
 
Question 3
Nitrate is a naturally occurring compound that is found in rivers, lakes, and
groundwater. Excess nitrate can enter waterways from sources of human activities
such as landfills, urban drainage, and run-off from fertilised soil. High levels of nitrate
in drinking water can lead to health problems. 
There is evidence to show that surface water nitrate levels fluctuate seasonally during
wet and dry seasons; however, the effects of seasonality on ground water remain
unclear. To determine whether ground water nitrate levels change seasonally,
quarterly water samples were taken from 20 wells and the nitrate levels were
measured (mg/L). Use the data in the ‘Nitrate’ sheet of your data file and alpha = 0.05
to answer this question.

========================================================================================

ECON1066

QUESTION 1)
Use the dataset: WDI_2250.RData
1) Present descriptive statistics for the GDPpc variable (min, max, median, mean, 25th
and 75th percentile) and describe each statistic in full sentences. Identify which are the
countries with the lowest and highest GDPpc.

3 marks
2) Present a histogram of the GDPpc variable, and comment on the shape of the
distribution.

2 marks
3) Calculate a 95 % confidence interval around the mean of the GDPpc variable
(sample), and interpret the interval estimates in full sentences. Show your manual
calculation and the associated confidence interval formulas!

4 marks
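
For reference, the standard t-based confidence interval for a mean, which is presumably the formula the manual calculation is meant to show (the brief does not say whether a t or z critical value is expected), can be written as:

```latex
\[
\bar{x} \;\pm\; t_{n-1,\,0.975}\,\frac{s}{\sqrt{n}},
\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}
\]
```

Here $\bar{x}$ is the sample mean of GDPpc, $s$ the sample standard deviation, $n$ the number of countries with a non-missing value, and $t_{n-1,\,0.975}$ the 97.5th percentile of the t distribution with $n-1$ degrees of freedom. For large $n$ this critical value is close to 1.96.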
Subtotal: 9 marks

QUESTION 2)
Use the dataset: WDI_2250.RData
Use R to run a cross sectional regression on GDP per capita for the listed countries as
follows:
$\ln(\text{GDPpc}) = \beta_0 + \beta_1 \ln(\text{Conspc}) + \beta_2\,\text{Trade} + \beta_3\,\text{HCI} + \beta_4\,\text{Hightech} + u$
The variables are defined as follows:
GDPpc = GDP per capita, PPP (current international $)
Conspc= Households and NPISHs final consumption expenditure per capita (constant 2015
US$) [NE.CON.PRVT.PC.KD]
Trade=Trade (% of GDP) [NE.TRD.GNFS.ZS]
HCI=Human capital index (HCI) (scale 0-1) [HD.HCI.OVRL]
Hightech=Medium and high-tech manufacturing value added (% manufacturing value added)
[NV.MNF.TECH.ZS.UN]
You will have to take the natural log of GDPpc and Consumption per capita yourself using
R!

1) Present your regression results in a table below (R output):

5 marks

2) Interpret the constant (2.5 marks) and its p-value (1.5 marks).

4 marks

3) Interpret the coefficient on household and NPISH consumption and its p-value (1.5 marks
each).

3 marks

4) Interpret the coefficient on trade and its p-value (1.5 marks each).

3 marks
5) Interpret the coefficient on Human Capital Index and carry out (meaning: calculate with
the official formula) a t-test to determine the significance of the coefficient (1.5 marks
each). Hint: Use a lower than unit scale for HCI, such as a “0.1 scale change” for
example.

3 marks
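
As a reminder of the quantities involved, writing $\beta_3$ for the coefficient on HCI, the usual t statistic for the null hypothesis $H_0\colon \beta_3 = 0$ and the log-level interpretation of a 0.1 change in HCI are:

```latex
\[
t = \frac{\hat{\beta}_3 - 0}{\operatorname{se}(\hat{\beta}_3)},
\qquad
\text{df} = n - k - 1 \quad (k = 4 \text{ regressors})
\]
\[
\%\Delta\,\text{GDPpc} \;\approx\; 100 \cdot \hat{\beta}_3 \cdot \Delta\text{HCI}
= 10\,\hat{\beta}_3 \ \text{ for } \Delta\text{HCI} = 0.1,
\qquad
\text{exact: } 100\left(e^{0.1\,\hat{\beta}_3} - 1\right)
\]
```

The estimate and its standard error are read from the R output; the statistic is then compared with the critical value of the t distribution at the chosen significance level.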


6) Interpret the $R^2$ of the regression.

2 marks
7) Run the following regression and present the regression results (Q2.7) below (2 marks):
$\ln(\text{GDPpc}) = \beta_0 + \beta_1 \ln(\text{Conspc}) + u$
• Comment on how the coefficient on ln(Consumption pc) differs from that of Question
2.1! (1 mark)
• Why do you observe this difference and what does it mean for the (un)biasedness of
the coefficient in 2.7? (1 mark)
• What is the direction of the bias and why? (1 mark)
• Can you directly compare Eq. 2.1 and 2.7 (hint: look at the sample size of the
estimates)? (1 mark)

6 marks
8) Describe each of the Gauss Markov Assumptions (2.5 marks) and specifically explain if
they are likely to hold for the regression in Q 2.1 or not (2.5 marks).

5 marks
