*Remember that when an interactive chart is cited in the post, clicking on it shows the source code. To view it properly, download the file as HTML and open it in your browser.*
The prior post showed that a stock quote, being a stochastic variable, cannot be predicted using regression. The reason is the mistaken assumption that all information about a stock's value is reflected in its price. In reality, the price is the outcome of the contest between those who decided to buy and those who decided to sell, and at that point noise enters from different sources, e.g. technological means, the speculators' backgrounds and moods, etc. This is why focusing on price changes, rather than on the price itself, removes that noise.
Have a look at the following data:
Most of the data is grouped around the third and fourth intervals, showing a clear tendency towards a central value of zero, the kind of bell-shaped clustering one would expect from the Central Limit Theorem (CLT). These tables therefore yield two main takeaways:
The central value of zero supports the view that prices are purely stochastic, as the probability of a change above zero is expected to equal that of a change below zero, a property associated with market efficiency.
Some ETFs have their frequency tables shifted to the left or to the right, i.e. a skewed distribution. For example, the top-right table shows over 64% of the data accumulated in the fourth interval, while the bottom-right one is well centered, with 60% of it located in the third interval.
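As a rough illustration of how such a frequency table can be built, the sketch below bins period-over-period price changes into intervals that are symmetric around zero. The function name `frequency_table`, the choice of six intervals, and the synthetic random-walk prices are assumptions made for this example; they are not taken from the MGM code or from the ETFs discussed here.

```python
import numpy as np

def frequency_table(prices, n_bins=6):
    """Bin price changes into n_bins intervals symmetric around zero.

    Returns the interval edges and the relative frequency of each interval.
    """
    prices = np.asarray(prices, dtype=float)
    changes = np.diff(prices)                  # change between consecutive quotes
    limit = np.abs(changes).max()              # widest move sets the table range
    edges = np.linspace(-limit, limit, n_bins + 1)
    counts, _ = np.histogram(changes, bins=edges)
    return edges, counts / counts.sum()

# Synthetic random walk standing in for a real price series (assumption).
rng = np.random.default_rng(0)
prices = 100 + np.cumsum(rng.normal(0.0, 1.0, size=1000))

edges, freq = frequency_table(prices)
for low, high, share in zip(edges[:-1], edges[1:], freq):
    print(f"[{low:+.2f}, {high:+.2f}): {share:.1%}")
```

With an even number of intervals, zero falls on the edge shared by the two middle ones, so in a six-interval table the third and fourth intervals are the ones that sit next to zero.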
Now, one would be tempted to state that the top-right option is better than the bottom-right one, but there are other features to take into account, and the following distribution graphs give a better hint:
The Markov chain states (green color scale) move towards the historic values (pink dots), and a manually constructed spline gives the probability distribution for the absolute returns of a given asset. Consequently, two new features can be derived from here:
Probability of a return above zero (the probability of earning money).
Return with maximum likelihood (two values here: the return and its likelihood).
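To make these two features concrete, here is a minimal sketch that reads them off an estimated density. It does not reproduce the Markov chain and spline used in this post: a plain Gaussian kernel density estimate with a rule-of-thumb bandwidth is swapped in as a stand-in, and the function name `density_features` and the synthetic returns are hypothetical.

```python
import numpy as np

def density_features(returns, grid_size=512):
    """Return (P(return > 0), most likely return, density at that return)."""
    returns = np.asarray(returns, dtype=float)
    n = returns.size
    # Normal-reference rule-of-thumb bandwidth (an assumption, not the post's spline).
    bandwidth = 1.06 * returns.std(ddof=1) * n ** (-1 / 5)
    pad = 3 * bandwidth
    grid = np.linspace(returns.min() - pad, returns.max() + pad, grid_size)
    # Gaussian kernel density estimate evaluated on the grid.
    z = (grid[:, None] - returns[None, :]) / bandwidth
    density = np.exp(-0.5 * z**2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))
    step = grid[1] - grid[0]
    density /= density.sum() * step            # renormalise so the grid integrates to 1
    p_up = density[grid > 0].sum() * step      # area under the curve to the right of zero
    peak = density.argmax()                    # position of the density maximum
    return p_up, grid[peak], density[peak]

# Synthetic returns with a small positive drift (assumption).
rng = np.random.default_rng(1)
sample_returns = rng.normal(0.5, 1.0, size=500)

p_up, best_return, best_density = density_features(sample_returns)
print(f"P(return > 0)      = {p_up:.3f}")
print(f"most likely return = {best_return:+.3f} (density {best_density:.3f})")
```

The first value is the area of the estimated distribution to the right of zero; the other two are the location and the height of its peak.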
These, together with the estimated return calculated from the spline, its Jacobian, and other features computed by the MGM code, allow the user to classify price changes for the next periods accurately, depending on the timeframe used.
However, as the title suggests, mastering the maths involved is of utmost importance in order to adjust the standard models to the problem being solved. It is well known that sklearn provides methodologies for a wide range of problems; nevertheless, they are not enough for certain complex cases. Although it allows one to set parameters such as penalties, learning rates, and others, the accuracy (F1 score) it reaches has proven to be lower than that of the MGM code.
Of the standard models, only two manage to get an F1 score above 0.3, which is considered low. Their confusion matrices are shown below:
As expected from the low F1 score, both methodologies performed quite poorly. An underfit model suggests that a more complex method is necessary; hence, after combining features and changing parameters for the two loss functions that performed best, the maximum score obtained was 0.39, an improvement of about 30% over 0.3 but still deficient:
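A search of this kind over loss functions and penalties might look like the sketch below. It is only an illustration under stated assumptions: the estimator is assumed to be scikit-learn's `SGDClassifier`, the data is synthetic (generated with `make_classification`), the three classes are hypothetical stand-ins for classified price changes, and only a subset of losses and penalties is tried. It will not reproduce the scores quoted in this post.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class problem standing in for classified price changes (assumption).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# A subset of the available options, chosen for the example.
losses = ["hinge", "modified_huber", "squared_hinge", "perceptron"]
penalties = ["l2", "l1", "elasticnet"]

results = []
for loss, penalty in product(losses, penalties):
    model = make_pipeline(
        StandardScaler(),
        SGDClassifier(loss=loss, penalty=penalty, max_iter=2000, random_state=0),
    )
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    score = f1_score(y_test, predictions, average="macro")
    results.append((score, loss, penalty, model))

# Keep the combination with the highest macro-averaged F1 score.
best_score, best_loss, best_penalty, best_model = max(results, key=lambda r: r[0])
best_matrix = confusion_matrix(y_test, best_model.predict(X_test))

print(f"best F1 = {best_score:.2f} with loss={best_loss}, penalty={best_penalty}")
print(best_matrix)
```

Macro averaging weights every class equally, which matters when one class of price change is much rarer than the others; whether that matches the averaging behind the scores reported here is not stated in the post.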
The key point is that one is constrained by what sklearn offers: nine loss functions are available, along with penalties (three, plus no penalty) and other parameters. Mastering the math behind the libraries is important in order to identify the parts that do not fit your case and change them accordingly. In this regard, a good place to start is the following book. Several acquaintances have asked me to teach them to code and trade and, depending on their backgrounds, I always recommend good literature to start with.
All these calculations are based on probabilities, which can sometimes fail; however, the algorithm developed to reach those numbers has been designed to reduce such failures to a minimum.