n-fold cross-validation classification with LDA classifier
For CoSMoMVPA's copyright information and license terms, see the COPYING file distributed with CoSMoMVPA.
Define data
config = cosmo_config();
data_path = fullfile(config.tutorial_data_path, 'ak6', 's01');

% Load the dataset with VT mask
ds = cosmo_fmri_dataset([data_path '/glm_T_stats_perrun.nii'], ...
                        'mask', [data_path '/vt_mask.nii']);

% remove constant features
ds = cosmo_remove_useless_data(ds);
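To get a feel for what was just loaded, a quick inspection can help. The following is a small optional sketch (not part of the exercise); it only uses standard size() and the CoSMoMVPA helper cosmo_disp:

% optional sketch: inspect the loaded and cleaned dataset
fprintf('dataset has %d samples and %d features\n', ...
        size(ds.samples, 1), size(ds.samples, 2));

% show a compact overview of the dataset structure
cosmo_disp(ds);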
Set sample attributes
ds.sa.targets = repmat((1:6)', 10, 1);
ds.sa.chunks = floor(((1:60) - 1) / 6)' + 1;

% Add labels as sample attributes
classes = {'monkey', 'lemur', 'mallard', 'warbler', 'ladybug', 'lunamoth'};
ds.sa.labels = repmat(classes, 1, 10)';

% this is good practice after setting attributes manually
cosmo_check_dataset(ds);
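As an optional check (a sketch, not part of the exercise), the first chunk's samples can be listed to confirm that targets, chunks and labels line up as intended:

% optional sketch: list the first 6 samples (the first chunk / run)
for k = 1:6
    fprintf('sample %2d: target=%d chunk=%d label=%s\n', ...
            k, ds.sa.targets(k), ds.sa.chunks(k), ds.sa.labels{k});
end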
Part 1: 'manual' cross-validation
nsamples = size(ds.samples, 1); % should be 60 for this dataset

% allocate space for predictions for all 60 samples
all_pred = zeros(nsamples, 1);

% safety check:
% to simplify this exercise, the code below assumes that .sa.chunks is in
% the range 1..10; if that is not the case, the code may not work properly.
% Therefore an 'assert' statement is used to verify that the chunks are as
% required for the remainder of this exercise.
assert(isequal(ds.sa.chunks, floor(((1:60) - 1) / 6)' + 1));

nfolds = numel(unique(ds.sa.chunks)); % should be 10

% run n-fold cross-validation
% in the k-th fold (k ranges from 1 to 10), test the LDA classifier on
% samples with chunks==k after training on all other samples.
%
% (in this exercise this is done manually, but easier solutions involve
% using cosmo_nfold_partitioner and cosmo_crossvalidation_measure)
for fold = 1:nfolds
    % make a logical mask (of size 60x1) for the test set. It should have
    % the value true where ds.sa.chunks has the same value as 'fold', and
    % the value false everywhere else. Assign this to the variable
    % 'test_msk'.
    % >@@>
    test_msk = ds.sa.chunks == fold;
    % <@@<

    % slice the input dataset 'ds' across samples using 'test_msk' so that
    % it has only samples in the 'fold'-th chunk. Assign the result to the
    % variable 'ds_test'.
    % >@@>
    ds_test = cosmo_slice(ds, test_msk);
    % <@@<

    % now make another logical mask (of size 60x1) for the training set.
    % It should have the value true where ds.sa.chunks has a different
    % value than 'fold', and the value false everywhere else. Assign this
    % to the variable 'train_msk'.
    % >@@>
    train_msk = ds.sa.chunks ~= fold; % (alternative: train_msk=~test_msk)
    % <@@<

    % slice the input dataset again using 'train_msk', and assign to the
    % variable 'ds_train'.
    % >@@>
    ds_train = cosmo_slice(ds, train_msk);
    % <@@<

    % Use cosmo_classify_lda to get predicted targets for the
    % samples in 'ds_test'. To do so, use the samples and targets
    % from 'ds_train' for training (as first and second argument for
    % cosmo_classify_lda), and the samples from 'ds_test' for testing
    % (third argument for cosmo_classify_lda).
    % Assign the result to the variable 'fold_pred', which should be a
    % 6x1 vector.
    % >@@>
    fold_pred = cosmo_classify_lda(ds_train.samples, ds_train.sa.targets, ...
                                   ds_test.samples);
    % <@@<

    % store the predictions from 'fold_pred' in the 'all_pred' vector,
    % at the positions masked by 'test_msk'.
    % >@@>
    all_pred(test_msk) = fold_pred;
    % <@@<
end

% safety check:
% for this exercise, the following code tests whether the predicted classes
% are as they should be (i.e. the correct answer); if not, an error is
% raised. Each row of 'expected_pred' holds the predictions for one target
% class across the ten chunks, so that expected_pred(:) is in sample order.
expected_pred = [1 1 1 2 1 1 2 1 1 1
                 2 1 2 2 2 2 2 2 2 2
                 4 3 1 3 3 4 3 3 3 3
                 4 4 2 4 4 4 4 4 3 2
                 5 5 5 5 5 6 5 5 5 5
                 6 6 6 6 6 6 6 6 6 6];

% check that the output is as expected
if ~isequal(expected_pred(:), all_pred)
    error('expected predictions to be [%s]', ...
            sprintf('%d ', expected_pred(:)));
end

% Compute classification accuracy of all_pred compared to the targets in
% the input dataset 'ds', and assign to a variable 'accuracy'
% >@@>
accuracy = mean(all_pred == ds.sa.targets);
% <@@<

% print the accuracy
fprintf('\nLDA all categories n-fold: accuracy %.3f\n', accuracy);

% Visualize confusion matrix
% the cosmo_confusion_matrix convenience function is used to compute the
% confusion matrix
[confusion_matrix, classes] = cosmo_confusion_matrix(ds.sa.targets, all_pred);
nclasses = numel(classes);

% print confusion matrix to terminal window
fprintf('\nLDA n-fold cross-validation confusion matrix:\n');
disp(confusion_matrix);

% make a pretty figure
figure;
imagesc(confusion_matrix, [0 10]);
title('confusion matrix');
set(gca, 'XTick', 1:nclasses, 'XTickLabel', classes);
set(gca, 'YTick', 1:nclasses, 'YTickLabel', classes);
ylabel('target');
xlabel('predicted');
colorbar;
LDA all categories n-fold: accuracy 0.833

LDA n-fold cross-validation confusion matrix:
     8     2     0     0     0     0
     1     9     0     0     0     0
     1     0     7     2     0     0
     0     2     1     7     0     0
     0     0     0     0     9     1
     0     0     0     0     0    10
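The reported accuracy can also be read off the confusion matrix, since correct predictions lie on its diagonal. The following is a small sketch (not part of the exercise) that assumes the 'confusion_matrix' variable computed above:

% sketch: recompute accuracy from the confusion matrix
n_correct = sum(diag(confusion_matrix));   % 8+9+7+7+9+10 = 50 correct
n_total   = sum(confusion_matrix(:));      % 60 samples in total
fprintf('accuracy from confusion matrix: %.3f\n', n_correct / n_total);  % 0.833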

Part 2: use cosmo_nfold_partitioner
% This exercise replicates the analysis done in Part 1, but now
% the 'cosmo_nfold_partitioner' function is used to create a struct
% that defines the cross-validation scheme (it contains the indices
% for the train and test samples in each fold)
partitions = cosmo_nfold_partitioner(ds);

% Show partitions
fprintf('\nPartitions for n-fold cross-validation:\n');
cosmo_disp(partitions);

% Count how many folds there are in 'partitions', and assign to the
% variable 'nfolds'
% >@@>
nfolds = numel(partitions.train_indices); % should be 10
% <@@<

% allocate space for predictions of each sample (pattern) in 'ds'
all_pred = zeros(nsamples, 1);

% As in part 1 (above), perform n-fold cross-validation using the LDA
% classifier
for fold = 1:nfolds
    % implement cross-validation and store predicted labels in 'all_pred';
    % use the contents of 'partitions' to slice the dataset in train and
    % test sets for each fold

    % from the 'partitions' struct, get the train indices for the
    % 'fold'-th fold and assign to a variable 'train_idxs'
    % >@@>
    train_idxs = partitions.train_indices{fold};
    % <@@<

    % do the same for the test indices, and assign to a variable
    % 'test_idxs'
    % >@@>
    test_idxs = partitions.test_indices{fold};
    % <@@<

    % slice the dataset twice:
    % - once using 'train_idxs'; assign the result to 'ds_train'
    % - once using 'test_idxs' ; assign the result to 'ds_test'
    % >@@>
    ds_train = cosmo_slice(ds, train_idxs);
    ds_test = cosmo_slice(ds, test_idxs);
    % <@@<

    % compute predictions for the samples in 'ds_test' after training
    % using the samples and targets in 'ds_train'
    % >@@>
    fold_pred = cosmo_classify_lda(ds_train.samples, ds_train.sa.targets, ...
                                   ds_test.samples);
    % <@@<

    % store the predictions from 'fold_pred' in the 'all_pred' vector,
    % at the positions indexed by 'test_idxs'.
    % >@@>
    all_pred(test_idxs) = fold_pred;
    % <@@<
end

% Compute classification accuracy of all_pred compared to the targets in
% the input dataset 'ds', and assign to a variable 'accuracy'
% >@@>
accuracy = mean(all_pred == ds.sa.targets);
% <@@<

fprintf(['\nLDA all categories n-fold (with partitioner): '...
         'accuracy %.3f\n'], accuracy);

% Note: cosmo_crossvalidation_measure can perform the above operations as
% well (and in an easier way), but using that function is not part of
% this exercise; a brief sketch is shown after the output below.
Partitions for n-fold cross-validation:
.train_indices
  { [  7        [  1        [  1    ...  [  1        [  1        [  1
       8           2           2           2           2           2
       9           3           3           3           3           3
       :           :           :           :           :           :
      58          58          58          58          58          52
      59          59          59          59          59          53
      60 ]@54x1   60 ]@54x1   60 ]@54x1   60 ]@54x1   60 ]@54x1   54 ]@54x1  }@1x10
.test_indices
  { [  1    [  7    [ 13   ...  [ 43    [ 49    [ 55
       2       8      14          44      50      56
       3       9      15          45      51      57
       4      10      16          46      52      58
       5      11      17          47      53      59
       6 ]    12 ]    18 ]        48 ]    54 ]    60 ]  }@1x10

LDA all categories n-fold (with partitioner): accuracy 0.833
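As mentioned in the note at the end of Part 2, cosmo_crossvalidation_measure can perform the same analysis in a few lines. A minimal sketch of that approach (not part of this exercise; it assumes the dataset 'ds' defined above) could look like this:

% sketch: cross-validation with cosmo_crossvalidation_measure
args = struct();
args.classifier = @cosmo_classify_lda;            % classifier function handle
args.partitions = cosmo_nfold_partitioner(ds);    % same n-fold scheme as above

% the measure returns a dataset struct whose .samples holds the accuracy
result = cosmo_crossvalidation_measure(ds, args);
fprintf('LDA n-fold (crossvalidation measure): accuracy %.3f\n', result.samples);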