Alejandro Romero

Data Is Neither 2D Nor 3D but 4D... or Higher

In prior posts, the price-change data of several securities in Europe, the USA, and Latin America was analyzed. According to it, when the variable is treated as non-Euclidean, Sklearn estimators such as SVM or Decision Trees give more reliable outputs, with higher f1 scores and recall. Consequently, the higher dimensionality of the data makes it impossible to grasp directly, as our 3D reality limits us.


However, what one can do is to project it onto a surface or volume that can be analyzed and understood. For practicality, since Sklearn is the most widely used library, a surface is chosen: its DecisionBoundaryDisplay class is the best suited in terms of straightforwardness. Consequently, as the variable has ten features, five classes and five probabilities of occurrence of each class (more details here), two of them have to be chosen to graph, as sketched below.
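To make that projection concrete, here is a minimal sketch of sklearn's DecisionBoundaryDisplay applied to two chosen columns of a 10-feature matrix. The placeholder data, the SVC estimator, and the chosen column indices are illustrative assumptions; the real matrix and models come from the post's source code.

# A minimal sketch, assuming a placeholder 10-feature matrix and an SVC;
# the actual 'X' and the fitted models live in the post's source code.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.inspection import DecisionBoundaryDisplay

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))          # placeholder for the real 10-feature matrix
y = rng.integers(0, 5, size=500)        # placeholder for the 5 classes

cols = [0, 3]                           # the two features chosen for the projection (assumed)
X2 = X[:, cols]

clf = SVC(kernel="rbf").fit(X2, y)      # any estimator fitted on the 2 chosen features works

disp = DecisionBoundaryDisplay.from_estimator(
    clf, X2, response_method="predict", alpha=0.5
)
disp.ax_.scatter(X2[:, 0], X2[:, 1], c=y, edgecolor="k", s=15)
plt.show()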


Therefore, which features should be chosen to graph? The decision can be suggested, but there is no actual procedure for it, which is why, after testing several hundred pairs of them, the conclusion was the obvious one: the features that represent the probability of ending up in the two most important classes. The reader who has no access to the source code may be wondering, or trying too hard to visualize, what has been described so far, so:


[Image: Section of the 'Non-Euclidean' Distances f1 Scores Code]

The matrix 'X' of 'n' instances and 10 features comes from the 'distances' and 'freq values' matrices. The former is calculated with a novel method that can be checked in detail in the link at the end of the second paragraph, while the latter comes from the historical frequency distribution of each security. As there are 'm' securities, the frequency matrix is repeated 's' times so that its row count equals 'n' (in the future, frequency tables that change with each projection may be tested). A rough sketch of this assembly is shown below.
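The following sketch only illustrates the tiling just described: a per-security frequency table repeated 's' times and stacked next to the distances block. The variable names, shapes, and random placeholders are assumptions; the actual construction is in the source code linked above.

# A minimal sketch, assuming placeholder shapes; not the post's actual construction.
import numpy as np

m = 5                                   # number of securities (assumed)
s = 100                                 # repetitions so that m * s == n (assumed)
n = m * s

distances = np.random.rand(n, 5)        # placeholder: 5 distance-based features per instance
freq_values = np.random.rand(m, 5)      # placeholder: historical frequency per security

# Repeat the per-security frequency table 's' times so its rows align with 'distances'
freq_tiled = np.tile(freq_values, (s, 1))

X = np.hstack([distances, freq_tiled])  # n instances x 10 features
print(X.shape)                          # (500, 10)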


Additionally, as the data is not balanced:


[Image: Number of instances of each class before balancing]

it should be fixed; otherwise, the results won't be consistent. That target is reached with another novel method that can be checked in the source code. The final outcome is as below:


[Image: Number of instances of each class after balancing]

And after finding the best parameters through Grid Search:


[Image: Grid Search Process for Decision Trees]
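For reference, a Grid Search over a Decision Tree in sklearn typically looks like the sketch below. The parameter grid and the f1-based scoring are illustrative assumptions; the actual grid used in the post is in its source code.

# A minimal sketch, assuming a balanced X_bal, y_bal; the real grid is in the source code.
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 5, 10],
    "criterion": ["gini", "entropy"],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="f1_macro",                 # f1 is the metric discussed throughout the post
    cv=5,
)
# search.fit(X_bal, y_bal)
# print(search.best_params_, search.best_score_)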

The boundaries displayed for Euclidean, Manhattan, +k, and -k are:


The decision about which method is best suited is relative; e.g., to explain this process, the current post considers two classes to be crucial: '0' and '3'. The former because it represents the biggest weekly loss, and the latter because it is historically the most frequent one; thus, if a prediction model can't foresee it accurately, it should be discarded.


Following these criteria, the Euclidean metric is discarded due to a poor class '3' f1 score, as are Manhattan and -k; consequently, +k is best suited. One strong argument supporting the fact that this data is not Euclidean is the big area corresponding to class '0' (purple). Finally, it is difficult to deduce further outputs from a 2D projection.
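For completeness, the per-class f1 scores behind this kind of comparison can be read off with sklearn.metrics, as in the hedged sketch below. The data, split, and estimator are placeholders; the real Euclidean, Manhattan, +k, and -k pipelines are in the source code.

# A minimal sketch with placeholder data; the real pipelines live in the source code.
import numpy as np
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))           # placeholder 2-feature projection
y = rng.integers(0, 5, size=500)        # placeholder 5-class target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

print(classification_report(y_te, y_pred, digits=3))      # f1 and recall per class
print(f1_score(y_te, y_pred, labels=[3], average=None))   # score of class '3' alone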






