Title: | K-Fold Cross Validation for Factor Analysis |
---|---|
Description: | Provides functions to identify plausible and replicable factor structures for a set of variables via k-fold cross validation. The process combines the exploratory and confirmatory factor analytic approach to scale development (Flora & Flake, 2017) <doi:10.1037/cbs0000069> with a cross validation technique that maximizes the available data (Hastie, Tibshirani, & Friedman, 2009) <isbn:978-0-387-21606-5>. Also available are functions to determine k by drawing on power analytic techniques for covariance structures (MacCallum, Browne, & Sugawara, 1996) <doi:10.1037/1082-989X.1.2.130>, generate model syntax, and summarize results in a report. |
Authors: | Kyle Nickodem [aut, cre] and Peter Halpin [aut] |
Maintainer: | Kyle Nickodem <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.2.2 |
Built: | 2024-11-21 04:06:33 UTC |
Source: | https://github.com/knickodem/kfa |
The factor correlations aggregated over k-folds
agg_cors(models, flag = 0.9, type = "factor")
models |
An object returned from |
flag |
threshold above which a factor correlation will be flagged |
type |
currently ignored; |
data.frame of mean factor correlations for each factor model and vector with count of folds with a flagged correlation
data(example.kfa)
agg_cors(example.kfa)
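The aggregation described above (mean correlation across folds plus a count of folds with a flagged correlation) can be sketched in base R. This is only an illustration under stated assumptions: the fold matrices are made-up values, the object names are hypothetical, and it is not the package's own implementation.

```r
# Hypothetical factor correlation matrices from 3 folds of a 2-factor model
fnames <- c("f1", "f2")
fold_cors <- list(
  matrix(c(1, 0.85, 0.85, 1), 2, dimnames = list(fnames, fnames)),
  matrix(c(1, 0.93, 0.93, 1), 2, dimnames = list(fnames, fnames)),
  matrix(c(1, 0.95, 0.95, 1), 2, dimnames = list(fnames, fnames))
)
flag <- 0.9

# Element-wise mean of the correlation matrices across folds
mean_cors <- Reduce(`+`, fold_cors) / length(fold_cors)

# For each factor, mark the folds in which any off-diagonal correlation exceeds the flag
flagged <- sapply(fold_cors, function(m) {
  diag(m) <- 0                     # ignore the unit diagonal
  apply(m > flag, 1, any)
})

# Number of folds with a flagged correlation, per factor
flag_count <- rowSums(flagged)

print(round(mean_cors, 2))
print(flag_count)
```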
The factor loadings aggregated over k-folds
agg_loadings(models, flag = 0.3, digits = 2)
models |
An object returned from |
flag |
threshold below which loading will be flagged |
digits |
integer; number of decimal places to display in the report. |
data.frame of mean factor loadings for each factor model and vector with count of folds with a flagged loading
data(example.kfa)
agg_loadings(example.kfa)
Summary table of model fit aggregated over k-folds
agg_model_fit(kfits, index = "all", digits = 2)
kfits |
an object returned from |
index |
character; one or more fit indices to summarize. Indices
must be present in the |
digits |
integer; number of decimal places to display in the report |
data.frame of aggregated model fit statistics
data(example.kfa)
fits <- k_model_fit(example.kfa, by.fold = TRUE)
agg_model_fit(fits)
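To make the idea of a summary table aggregated over folds concrete, here is a minimal base R sketch. The fit values are invented, the choice of mean, minimum, and maximum as summaries is an assumption, and the column layout is hypothetical rather than the format agg_model_fit returns.

```r
# Hypothetical fit indices for one factor model across 4 folds
fold_fit <- data.frame(
  fold  = 1:4,
  cfi   = c(0.95, 0.93, 0.96, 0.92),
  rmsea = c(0.05, 0.06, 0.04, 0.05)
)
digits  <- 2
indices <- c("cfi", "rmsea")

# Summarize each index over the folds and round for display
summarize <- function(f) unname(round(sapply(fold_fit[indices], f), digits))
agg <- data.frame(
  index = indices,
  mean  = summarize(mean),
  min   = summarize(min),
  max   = summarize(max)
)
print(agg)
```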
The factor reliabilities aggregated over k-folds
agg_rels(models, flag = 0.6, digits = 2)
models |
An object returned from |
flag |
threshold below which reliability will be flagged |
digits |
integer; number of decimal places to display in the report. |
data.frame of mean factor (scale) reliabilities for each factor model and vector with count of folds with a flagged reliability
data(example.kfa)
agg_rels(example.kfa)
Uses the factor loadings matrix, presumably from an exploratory factor analysis, to generate lavaan-compatible confirmatory factor analysis syntax.
efa_cfa_syntax( loadings, simple = TRUE, min.loading = NA, single.item = c("keep", "drop", "none"), identified = TRUE, constrain0 = FALSE )
loadings |
matrix of factor loadings |
simple |
logical; Should the perfect simple structure be returned (default) when converting EFA results to CFA syntax?
If |
min.loading |
numeric between 0 and 1 indicating the minimum (absolute) value of the loading for a variable on a factor
when converting EFA results to CFA syntax. Must be specified when |
single.item |
character indicating how single-item factors should be treated.
Use |
identified |
logical; Should identification check for rotational uniqueness a la Millsap (2001) be performed?
If the model is not identified |
constrain0 |
logical; Should variable(s) with all loadings below |
Millsap, R. E. (2001). When trivial constraints are not trivial: The choice of uniqueness constraints in confirmatory factor analysis. Structural Equation Modeling, 8(1), 1-17. doi:10.1207/S15328007SEM0801_1
loadings <- matrix(c(rep(.2, 3), rep(.6, 3), rep(.8, 3), rep(.3, 3)), ncol = 2)
# simple structure
efa_cfa_syntax(loadings)
# allow cross-loadings and check if model is identified
efa_cfa_syntax(loadings, simple = FALSE, min.loading = .25)
# allow cross-loadings and ignore identification check
efa_cfa_syntax(loadings, simple = FALSE, min.loading = .25, identified = FALSE)
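As a rough illustration of what converting a loadings matrix to simple-structure CFA syntax involves, the base R sketch below assigns each variable to the factor with its largest absolute loading and writes one lavaan-style line per factor. The row and column names (x1 to x6, f1, f2) are assumptions added for the example, and the code is not the logic efa_cfa_syntax itself uses.

```r
loadings <- matrix(c(rep(.2, 3), rep(.6, 3), rep(.8, 3), rep(.3, 3)), ncol = 2)
rownames(loadings) <- paste0("x", 1:6)   # hypothetical variable names
colnames(loadings) <- paste0("f", 1:2)   # hypothetical factor names

# Simple structure: each variable goes to the factor with its largest absolute loading
primary <- colnames(loadings)[apply(abs(loadings), 1, which.max)]

# Group variables by factor and drop factors that received no variables
items <- split(rownames(loadings), factor(primary, levels = colnames(loadings)))
items <- items[lengths(items) > 0]

# One "factor =~ item + item" line per factor
syntax <- paste(
  sprintf("%s =~ %s", names(items), sapply(items, paste, collapse = " + ")),
  collapse = "\n"
)
cat(syntax, "\n")
```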
Simulated responses for 900 observations on 20 variables loading onto a 3-factor structure (see example in kfa documentation for model).
The simulated data were run through kfa with the call kfa(sim.data, k = 2, m = 3), which tested 1-, 2-, and 3-factor structures over 2 folds.
data(example.kfa)
An object of class "kfa", which is a four-element list:
cfas | lavaan CFA objects for each k fold |
cfa.syntax | syntax used to produce CFA objects |
model.names | vector of names for CFA objects |
efa.structures | all factor structures identified in the EFA |
data(example.kfa)
agg_cors(example.kfa)
This function is specifically for determining k in the context of factor analysis using change in RMSEA as the criterion for identifying the optimal factor model.
find_k( variables, n, p, m = NULL, est.pars = NULL, max.k = 10, min.nk = 200, rmsea0 = 0.05, rmseaA = 0.08, ... )
variables |
a |
n |
integer; number of observations. Ignored if |
p |
integer; number of variables to factor analyze. Ignored if |
m |
integer; maximum number of factors expected to be extracted from |
est.pars |
integer; number of estimated model parameters. Default is 2 |
max.k |
integer; maximum number of folds. Default is 10. |
min.nk |
integer; minimum sample size per fold. Default is 200 based on simulations from Curran et al. (2003). |
rmsea0 |
numeric; RMSEA under the null hypothesis. |
rmseaA |
numeric; RMSEA under the alternative hypothesis. |
... |
other arguments passed to |
named vector with the number of folds (k), the sample size suggested for each fold by the power analysis (power.nk), the degrees of freedom used for the power analysis, and the sample size for each fold used for determining k (nk), which is the higher of power.nk and min.nk.
Curran, P. J., Bollen, K. A., Chen, F., Paxton, P., & Kirby, J. B. (2003). Finite sampling properties of the point estimates and confidence intervals of the RMSEA. Sociological Methods & Research, 32(2), 208-252. doi:10.1177/0049124103256130
MacCallum, R. C., Browne, M. W., & Sugawara, H. M. (1996). Power analysis and determination of sample size for covariance structure modeling. Psychological Methods, 1(2), 130–149. doi:10.1037/1082-989X.1.2.130
find_k(n = 900, p = 11, m = 3)
# adjust precision
find_k(n = 900, p = 11, m = 3, rmsea0 = .03, rmseaA = .10)
# adjust number of estimated parameters (e.g., constrain all factor loadings to be equal)
find_k(n = 900, p = 11, m = 3, est.pars = 15)
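The RMSEA-based power analysis of MacCallum et al. (1996) that underlies this kind of calculation can be sketched in base R with noncentral chi-square distributions. Everything below is an illustration under assumptions: the helper names, the degrees of freedom value, the 0.80 power target, the 0.05 alpha level, and the final step that turns a per-fold sample size into a fold count are all hypothetical and may differ from what find_k does internally. The formula as written applies when rmsea0 is smaller than rmseaA.

```r
# Power to reject H0: RMSEA = rmsea0 when the true RMSEA is rmseaA (rmsea0 < rmseaA)
rmsea_power <- function(n, df, rmsea0 = 0.05, rmseaA = 0.08, alpha = 0.05) {
  ncp0 <- (n - 1) * df * rmsea0^2          # noncentrality under the null
  ncpA <- (n - 1) * df * rmseaA^2          # noncentrality under the alternative
  crit <- qchisq(1 - alpha, df, ncp = ncp0)
  1 - pchisq(crit, df, ncp = ncpA)
}

# Smallest sample size that reaches the target power (simple linear search)
min_n_for_power <- function(df, power = 0.80, ...) {
  n <- 10
  while (rmsea_power(n, df, ...) < power) n <- n + 1
  n
}

df       <- 41                             # hypothetical degrees of freedom
n_total  <- 900
min.nk   <- 200
max.k    <- 10

power.nk <- min_n_for_power(df)            # per-fold n suggested by the power analysis
nk       <- max(power.nk, min.nk)          # the higher of power.nk and min.nk
k        <- min(floor(n_total / nk), max.k) # assumed rule for converting nk into folds

print(c(k = k, power.nk = power.nk, df = df, nk = nk))
```

Raising rmseaA or lowering rmsea0 widens the gap between the two hypotheses, which lowers the sample size needed per fold and can therefore increase the number of folds.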
Extract standardized factor loadings from lavaan object
get_std_loadings(object, type = "std.all", df = FALSE)
object |
a |
type |
standardize on the latent variables ( |
df |
should loadings be returned as a |
A matrix or data.frame of factor loadings
data(HolzingerSwineford1939, package = "lavaan")
HS.model <- ' visual =~ x1 + x2 + x3
              textual =~ x4 + x5 + x6
              speed =~ x7 + x8 + x9 '
fit <- lavaan::cfa(HS.model, data = HolzingerSwineford1939)
get_std_loadings(fit)
Shows the fit indices available from a kfa object to report in kfa_report
index_available(models)
models |
an object returned from |
character vector of index names
data(example.kfa)
index_available(example.kfa)
Model fit indices extracted from k-folds
k_model_fit(models, index = "default", by.fold = TRUE)
models |
an object returned from |
index |
character; one or more fit indices to summarize in the report. Use |
by.fold |
Should each element in the returned lists be a fold (default) or a factor model? |
list of data.frames with average model fit for each factor model
data(example.kfa)
# customize fit indices to report
k_model_fit(example.kfa, index = c("chisq", "cfi", "rmsea", "srmr"))
# organize results by factor model rather than by fold
k_model_fit(example.kfa, by.fold = FALSE)
The function splits the data into k folds where each fold contains training data and test data.
For each fold, exploratory factor analyses (EFAs) are run on the training data. The structure for each model
is transformed into lavaan-compatible confirmatory factor analysis (CFA) syntax.
The CFAs are then run on the test data.
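The train/test split described above can be pictured with a small base R sketch. It assumes a plain random assignment of rows to folds and uses hypothetical object names; it is not the splitting routine kfa uses, and the EFA and CFA steps appear only as comments.

```r
set.seed(101)
n <- 900   # number of observations
k <- 3     # number of folds

# Randomly assign each row to one of k (nearly) equal-sized folds
fold_id <- sample(rep(seq_len(k), length.out = n))

# For fold f, the held-out rows are the test data and the rest are training data
folds <- lapply(seq_len(k), function(f) {
  list(train = which(fold_id != f),   # rows on which the EFAs would be run
       test  = which(fold_id == f))   # rows on which the CFAs would be run
})

# Sizes of the training and test sets in each fold
print(sapply(folds, lengths))
```

Because every row is a test case in exactly one fold, each observation contributes to both the exploratory and the confirmatory stage without being used for both within the same fold.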
kfa( data, variables = names(data), k = NULL, m = floor(length(variables)/4), seed = 101, cores = NULL, custom.cfas = NULL, power.args = list(rmsea0 = 0.05, rmseaA = 0.08), rotation = "oblimin", simple = TRUE, min.loading = NA, single.item = "none", ordered = FALSE, estimator = NULL, missing = "listwise", ... )
data |
a |
variables |
character vector of column names in |
k |
number of folds in which to split the data. Default is |
m |
integer; maximum number of factors to extract. Default is 4 items per factor. |
seed |
integer passed to |
cores |
integer; number of CPU cores to use for parallel processing. Default is |
custom.cfas |
a single object or named |
power.args |
named |
rotation |
character (case-sensitive); any rotation method listed in
|
simple |
logical; Should the perfect simple structure be returned (default) when converting EFA results to CFA syntax?
If |
min.loading |
numeric between 0 and 1 indicating the minimum (absolute) value of the loading for a variable on a factor
when converting EFA results to CFA syntax. Must be specified when |
single.item |
character indicating how single-item factors should be treated.
Use |
ordered |
logical; Should items be treated as ordinal and the
polychoric correlations used in the factor analysis? When |
estimator |
if |
missing |
default is "listwise". See |
... |
other arguments passed to |
In order for custom.cfas to be tested along with the EFA-identified structures, each model supplied in custom.cfas must include all variables in lavaan-compatible syntax.
Deciding on an appropriate m can be difficult, but is consequential for the possible factor structures to examine, the power analysis to determine k, and overall computation time. The n_factors function in the parameters package can assist with this decision.
When converting EFA results to CFA syntax (via efa_cfa_syntax), the simple structure is defined as each variable loading onto a single factor. This is determined using the largest factor loading for each variable. When simple = FALSE, variables are allowed to cross-load on multiple factors. In this case, all pathways with loadings above the min.loading are retained. However, allowing cross-loading variables can result in model under-identification. The efa_cfa_syntax function conducts an identification check (i.e., identified = TRUE) and under-identified models are not run in the CFA portion of the analysis.
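The cross-loading case can be illustrated with a short base R sketch that keeps every pathway whose absolute loading exceeds min.loading. The loadings matrix, its row and column names, and the use of a strict "greater than" comparison are assumptions for the example; the identification check is not reproduced here.

```r
# Hypothetical rotated loadings: 6 variables on 2 factors
loadings <- matrix(c(rep(.2, 3), rep(.6, 3), rep(.8, 3), rep(.3, 3)), ncol = 2,
                   dimnames = list(paste0("x", 1:6), c("f1", "f2")))
min.loading <- 0.25

# Retain every variable-factor pathway with an absolute loading above the threshold
keep <- abs(loadings) > min.loading

# Write one lavaan-style line per factor that retained at least one variable
used <- colnames(keep)[colSums(keep) > 0]
syntax <- paste(
  sapply(used, function(f) {
    paste(f, "=~", paste(rownames(keep)[keep[, f]], collapse = " + "))
  }),
  collapse = "\n"
)
cat(syntax, "\n")

# Variables that load on more than one factor under this threshold
cross_loading <- rownames(keep)[rowSums(keep) > 1]
print(cross_loading)
```

Here x4 to x6 appear under both factors, which is the kind of structure that can leave a model under-identified and is why an identification check matters when simple = FALSE.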
An object of class "kfa", which is a four-element list:
cfas | lavaan CFA objects for each k fold |
cfa.syntax | syntax used to produce CFA objects |
model.names | vector of names for CFA objects |
efa.structures | all factor structures identified in the EFA |
# simulate data based on a 3-factor model with standardized loadings
sim.mod <- "f1 =~ .7*x1 + .8*x2 + .3*x3 + .7*x4 + .6*x5 + .8*x6 + .4*x7
            f2 =~ .8*x8 + .7*x9 + .6*x10 + .5*x11 + .5*x12 + .7*x13 + .6*x14
            f3 =~ .6*x15 + .5*x16 + .9*x17 + .4*x18 + .7*x19 + .5*x20
            f1 ~~ .2*f2
            f2 ~~ .2*f3
            f1 ~~ .2*f3
            x9 ~~ .2*x10"
set.seed(1161)
sim.data <- simstandard::sim_standardized(sim.mod, n = 900,
                                          latent = FALSE,
                                          errors = FALSE)[c(2:9,1,10:20)]
# include a custom 2-factor model
custom2f <- paste0("f1 =~ ", paste(colnames(sim.data)[1:10], collapse = " + "),
                   "\nf2 =~ ", paste(colnames(sim.data)[11:20], collapse = " + "))
mods <- kfa(data = sim.data,
            k = NULL, # prompts power analysis to determine number of folds
            cores = 2,
            custom.cfas = custom2f)
Generates a report summarizing the factor analytic results over k-folds.
kfa_report( models, file.name, report.title = file.name, path = NULL, report.format = "html_document", word.template = NULL, index = "default", plots = TRUE, load.flag = 0.3, cor.flag = 0.9, rel.flag = 0.6, digits = 2 )
models |
an object returned from |
file.name |
character; file name to create on disk. |
report.title |
character; title of the report |
path |
character; path of the directory where summary report will be saved. Default is working directory. |
report.format |
character; file format of the report. Default is HTML ("html_document"). See |
word.template |
character; file path to word document to use as a formatting template when |
index |
character; one or more fit indices to summarize in the report. Use |
plots |
logical; should plots of the factor models be included in the report? Default is |
load.flag |
numeric; factor loadings of variables below this value will be flagged. Default is .30 |
cor.flag |
numeric; factor correlations above this value will be flagged. Default is .90 |
rel.flag |
numeric; factor (scale) reliabilities below this value will be flagged. Default is .60. |
digits |
integer; number of decimal places to display in the report. |
A summary report of factor structures and model fit within and between folds.
# simulate data based on a 3-factor model with standardized loadings
sim.mod <- "f1 =~ .7*x1 + .8*x2 + .3*x3 + .7*x4 + .6*x5 + .8*x6 + .4*x7
            f2 =~ .8*x8 + .7*x9 + .6*x10 + .5*x11 + .5*x12 + .7*x13 + .6*x14
            f3 =~ .6*x15 + .5*x16 + .9*x17 + .4*x18 + .7*x19 + .5*x20
            f1 ~~ .2*f2
            f2 ~~ .2*f3
            f1 ~~ .2*f3
            x9 ~~ .2*x10"
set.seed(1161)
sim.data <- simstandard::sim_standardized(sim.mod, n = 900,
                                          latent = FALSE,
                                          errors = FALSE)[c(2:9,1,10:20)]
# include a custom 2-factor model
custom2f <- paste0("f1 =~ ", paste(colnames(sim.data)[1:10], collapse = " + "),
                   "\nf2 =~ ", paste(colnames(sim.data)[11:20], collapse = " + "))
mods <- kfa(data = sim.data,
            k = NULL, # prompts power analysis to determine number of folds
            cores = 2,
            custom.cfas = custom2f)
## Not run:
kfa_report(mods, file.name = "example_sim_kfa_report",
           report.format = "html_document",
           report.title = "K-fold Factor Analysis - Example Sim")
## End(Not run)
Extract unique factor structures across the k-folds
model_structure(models)
models |
An object returned from |
data.frame with the number of folds in which each unique factor structure was tested for each factor model.
data(example.kfa)
model_structure(example.kfa)
This function is intended for use on independent samples rather than integrated with k-fold cross-validation.
run_efa( data, variables = names(data), m = floor(ncol(data)/4), rotation = "oblimin", simple = TRUE, min.loading = NA, single.item = c("keep", "drop", "none"), identified = TRUE, constrain0 = FALSE, ordered = FALSE, estimator = NULL, missing = "listwise", ... )
data |
a |
variables |
character vector of column names in |
m |
integer; maximum number of factors to extract. Default is 4 items per factor. |
rotation |
character (case-sensitive); any rotation method listed in
|
simple |
logical; Should the perfect simple structure be returned (default) when converting EFA results to CFA syntax?
If |
min.loading |
numeric between 0 and 1 indicating the minimum (absolute) value of the loading for a variable on a factor
when converting EFA results to CFA syntax. Must be specified when |
single.item |
character indicating how single-item factors should be treated.
Use |
identified |
logical; Should identification check for rotational uniqueness a la Millsap (2001) be performed?
If the model is not identified |
constrain0 |
logical; Should variable(s) with all loadings below |
ordered |
logical; Should items be treated as ordinal and the
polychoric correlations used in the factor analysis? When |
estimator |
if |
missing |
default is "listwise". See |
... |
other arguments passed to |
When converting EFA results to CFA syntax (via efa_cfa_syntax), the simple structure is defined as each variable loading onto a single factor. This is determined using the largest factor loading for each variable. When simple = FALSE, variables are allowed to cross-load on multiple factors. In this case, all pathways with loadings above the min.loading are retained. However, allowing cross-loading variables can result in model under-identification. An identification check is run by default, but can be turned off by setting identified = FALSE.
A three-element list:
efas | lavaan object for each m model |
loadings | (rotated) factor loading matrix for each m model |
cfa.syntax | CFA syntax generated from loadings |
Millsap, R. E. (2001). When trivial constraints are not trivial: The choice of uniqueness constraints in confirmatory factor analysis. Structural Equation Modeling, 8(1), 1-17. doi:10.1207/S15328007SEM0801_1
# simulate data based on a 3-factor model with standardized loadings
sim.mod <- "f1 =~ .7*x1 + .8*x2 + .3*x3 + .7*x4 + .6*x5 + .8*x6 + .4*x7
            f2 =~ .8*x8 + .7*x9 + .6*x10 + .5*x11 + .5*x12 + .7*x13 + .6*x14
            f3 =~ .6*x15 + .5*x16 + .9*x17 + .4*x18 + .7*x19 + .5*x20
            f1 ~~ .2*f2
            f2 ~~ .2*f3
            f1 ~~ .2*f3
            x9 ~~ .2*x10"
set.seed(1161)
sim.data <- simstandard::sim_standardized(sim.mod, n = 900,
                                          latent = FALSE,
                                          errors = FALSE)[c(2:9,1,10:20)]
# Run 1-, 2-, and 3-factor models
efas <- run_efa(sim.data, m = 3)
Converts variable names to lavaan-compatible exploratory factor analysis syntax
write_efa(nf, vnames)
nf |
integer; number of factors |
vnames |
character vector; names of variables to include in the efa |
character. Use cat() to best examine the returned syntax.
# paste0() avoids the space that paste() would insert ("x 1"), which is not a valid variable name in model syntax
vnames <- paste0("x", 1:10)
syntax <- write_efa(nf = 2, vnames = vnames)
cat(syntax)
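For readers unfamiliar with lavaan's EFA block notation, the base R sketch below builds a string in that style, where each factor is prefixed with an efa("...") modifier and all factors share one set of indicators. The helper name efa_block_syntax and the block label "efa1" are hypothetical, and the exact string write_efa returns may be formatted differently.

```r
# Build lavaan-style EFA block syntax for nf factors measured by vnames
efa_block_syntax <- function(nf, vnames, block = "efa1") {
  # Left-hand side: one efa("block")*factor term per factor
  lhs <- paste0('efa("', block, '")*f', seq_len(nf), collapse = " +\n")
  # Right-hand side: every variable loads on every factor in the block
  paste0(lhs, " =~ ", paste(vnames, collapse = " + "))
}

syntax <- efa_block_syntax(nf = 2, vnames = paste0("x", 1:4))
cat(syntax, "\n")
```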