Fortgeschrittene quantitative Methoden
Wintersemester 2024-2025
Bitte vervollständige den Code, sodass die Summe 10 ergibt.
Tipp: Rechne: 10-1-2-3
Pflichtlektüre: Hinz, T. (2018). Methoden der Arbeitsmarktforschung (S. 479-524). In: Abraham, M., Hinz, T. (eds) Arbeitsmarktsoziologie. Springer VS, Wiesbaden.
“eine systematische, transparente sowie dem jeweiligen Forschungsproblem angemessene Vorgehensweise, um zu einer belastbaren Antwort auf die Forschungsfrage zu gelangen” (S.480).
Forschungsfragen:
Beschreibend
Erklärend
Beispiele?
Zufallsstichprobe - Wofür?
= nicht für die sozialwissenschaftliche Analyse generierte Daten
## Packages
pkgs <- c(
"stevedata",
"tidyverse",
"broom.helpers",
"ggplot2"
)
## Install uninstalled packages
lapply(pkgs[!(pkgs %in% installed.packages())], install.packages)list()
Warning: package 'tidyverse' was built under R version 4.2.3
Warning: package 'ggplot2' was built under R version 4.2.3
Warning: package 'tibble' was built under R version 4.2.3
Warning: package 'tidyr' was built under R version 4.2.3
Warning: package 'readr' was built under R version 4.2.3
Warning: package 'purrr' was built under R version 4.2.3
Warning: package 'dplyr' was built under R version 4.2.3
Warning: package 'stringr' was built under R version 4.2.3
Warning: package 'forcats' was built under R version 4.2.3
Warning: package 'lubridate' was built under R version 4.2.3
[[1]]
[1] "stevedata" "stats" "graphics" "grDevices" "utils" "datasets"
[7] "methods" "base"
[[2]]
[1] "lubridate" "forcats" "stringr" "dplyr" "purrr" "readr"
[7] "tidyr" "tibble" "ggplot2" "tidyverse" "stevedata" "stats"
[13] "graphics" "grDevices" "utils" "datasets" "methods" "base"
[[3]]
[1] "broom.helpers" "lubridate" "forcats" "stringr"
[5] "dplyr" "purrr" "readr" "tidyr"
[9] "tibble" "ggplot2" "tidyverse" "stevedata"
[13] "stats" "graphics" "grDevices" "utils"
[17] "datasets" "methods" "base"
[[4]]
[1] "broom.helpers" "lubridate" "forcats" "stringr"
[5] "dplyr" "purrr" "readr" "tidyr"
[9] "tibble" "ggplot2" "tidyverse" "stevedata"
[13] "stats" "graphics" "grDevices" "utils"
[17] "datasets" "methods" "base"
df <- stevedata::gss_wages %>% filter(year==2018) %>%
filter(!is.na(age)) %>% filter(!is.na(realrinc))
glimpse(df)Rows: 1,358
Columns: 11
$ year <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018,…
$ realrinc <dbl> 45400.0, 54480.0, 8512.5, 17025.0, 908.0, 45400.0, 54480.0,…
$ age <dbl> 42, 63, 59, 43, 62, 55, 59, 34, 44, 75, 55, 40, 34, 40, 37,…
$ occ10 <dbl> 1106, 3320, 3600, 5610, 4600, 6700, 10, 9120, 1720, 50, 325…
$ occrecode <chr> "Professional", "Professional", "Service", "Office and Admi…
$ prestg10 <dbl> 61, 59, 48, 35, 35, 39, 72, 35, 72, 53, 64, 64, 38, 64, 35,…
$ childs <dbl> 2, 2, 6, 0, 4, 2, 2, 3, 2, 4, 0, 2, 1, 1, 0, 3, 3, 0, 0, 5,…
$ wrkstat <chr> "Full-Time", "Full-Time", "Full-Time", "Full-Time", "Full-T…
$ gender <chr> "Male", "Female", "Female", "Male", "Female", "Male", "Male…
$ educcat <chr> "Bachelor", "Bachelor", "High School", "High School", "Less…
$ maritalcat <chr> "Married", "Married", "Divorced", "Never Married", "Widowed…
… um die Beziehung zwischen Alter und Einkommen zu beschreiben
p <- ggplot(data = df,
mapping = aes(x = age, y = log(realrinc))) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", color = "purple", se = FALSE) +
labs(
x = "Age" ,
y = "Log(income)"
)
p`geom_smooth()` using formula = 'y ~ x'
Ein Regressionsmodel ist eine Funktion, die den Zusammenhang zwischen dem Outcome, \(Y\), und dem Prädiktor, \(X\), beschreibt.
\[\begin{aligned} Y &= \color{black}{\textbf{Model}} + \text{Error} \\[8pt] &= \color{black}{\mathbf{f(X)}} + \epsilon \\[8pt] &= \color{black}{\boldsymbol{\mu_{Y|X}}} + \epsilon \end{aligned}\]
\[ \begin{aligned} Y &= \color{purple}{\textbf{Model}} + \text{Error} \\[8pt] &= \color{purple}{\mathbf{f(X)}} + \epsilon \\[8pt] &= \color{purple}{\boldsymbol{\mu_{Y|X}}} + \epsilon \end{aligned} \]
m <- lm(log(realrinc) ~ age, data = df)
df$predicted <- predict(m)
ggplot(data = df,
mapping = aes(x = age, y = log(realrinc))) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", color = "purple", se = FALSE) +
labs(x = "X", y = "Y") +
theme_minimal() +
theme(
axis.text = element_blank(),
axis.ticks.x = element_blank(),
axis.ticks.y = element_blank()
)`geom_smooth()` using formula = 'y ~ x'

\[\begin{aligned} Y &= \color{purple}{\textbf{Model}} + \color{blue}{\textbf{Error}} \\[8pt] &= \color{purple}{\mathbf{f(X)}} + \color{blue}{\boldsymbol{\epsilon}} \\[8pt] &= \color{purple}{\boldsymbol{\mu_{Y|X}}} + \color{blue}{\boldsymbol{\epsilon}} \\[8pt] \end{aligned}\]
`geom_smooth()` using formula = 'y ~ x'

\[\Large{Y = \beta_0 + \beta_1 X + \epsilon}\]
\[\Large{\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X}\]
ggplot(data = df, mapping = aes(x = age, y = log(realrinc))) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", color = "purple", se = FALSE) +
geom_segment(aes(x = age, xend = age, y = log(realrinc), yend = predict(m)), color = "steel blue") +
labs(x = "Alter", y = "Einkommen") +
theme(legend.position = "none")
ggplot(data = df, mapping = aes(x = age, y = log(realrinc))) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", color = "purple", se = FALSE) +
geom_segment(aes(x = age, xend = age, y = log(realrinc), yend = predict(m)), color = "steel blue") +
labs(x = "Alter", y = "Einkommen") +
theme(legend.position = "none")`geom_smooth()` using formula = 'y ~ x'

\[\text{Residuen} = \text{Beobachted} - \text{Vorhergesagt} = y - \hat{y}\]
\[e_i = \text{beobachtet} - \text{vorhergesagt} = y_i - \hat{y}_i\]
\[e^2_1 + e^2_2 + \dots + e^2_n\]
\[\widehat{\text{log(Einkommen)}} = 8.91 + 0.01 \times \text{Alter}\]
Call:
lm(formula = log(realrinc) ~ age, data = df)
Coefficients:
(Intercept) age
8.91235 0.01444
\[\widehat{\text{log(Einkommen)}} = 8.91 + 0.01 \times \text{Alter}\]
Jemand ist 28 Jahre alt. Wie hoch ist nach diesem Modell das Einkommen?
\[ \begin{aligned} \widehat{\text{log(income)}} &= 8.91 + 0.01 \times \text{age} \\ &= 8.91 + 0.01 \times 28 \\ &= 9.19 \end{aligned} \]
# Create a new data frame for age = 28
new_data <- data.frame(age = 28)
# Predict log(realrinc) for age = 28
predicted_log_income <- predict(m_income, newdata = new_data)
# Optionally reverse the log transformation
predicted_real_income <- exp(predicted_log_income)
# View results
predicted_log_income # log(realrinc) 1
9.316802
1
11123.35
Bitte ermittle anhand einer Regression den durchschnittlichen Geschlechterspezifischen Lohnunterschied.
Lösung:z.B.
library("stevedata")
df <- stevedata::gss_wages %>% filter(year==2018) %>%
filter(!is.na(age)) %>% filter(!is.na(realrinc))
# Lineare Regression
model <- lm(realrinc ~ gender, data = df)
# Zusammenfassung der Regressionsergebnisse
summary(model)
# Ergebnisse übersichtlich darstellen
broom::tidy(model)
library("stevedata")
df <- stevedata::gss_wages %>% filter(year==2018) %>%
filter(!is.na(age)) %>% filter(!is.na(realrinc))
# Lineare Regression
model <- lm(realrinc ~ gender, data = df)
# Zusammenfassung der Regressionsergebnisse
summary(model)
# Ergebnisse übersichtlich darstellen
broom::tidy(model)Bitte stelle Geschlechterunterschiede im Einkommen Graphisch dar.
The gesuchte Variable lautet: gender
Lösung:
Ein Beispiel:
df <- stevedata::gss_wages %>% filter(year==2018) %>%
filter(!is.na(age)) %>% filter(!is.na(realrinc))
ggplot(df, aes(x = gender, y = realrinc, fill = gender)) +
geom_violin(trim = FALSE) +
labs(
title = "Geschlechterunterschiede beim Einkommen (2018)",
x = "Geschlecht",
y = "Reales Einkommen",
fill = "Geschlecht"
) +
theme_minimal()
df <- stevedata::gss_wages %>% filter(year==2018) %>%
filter(!is.na(age)) %>% filter(!is.na(realrinc))
ggplot(df, aes(x = gender, y = realrinc, fill = gender)) +
geom_violin(trim = FALSE) +
labs(
title = "Geschlechterunterschiede beim Einkommen (2018)",
x = "Geschlecht",
y = "Reales Einkommen",
fill = "Geschlecht"
) +
theme_minimal()