Breaking the Curse of Small Sample Conditions in Machine Learning
Synopsis
In recent years, the big-data era has produced vast and complex datasets that demand faster and better decision-making. However, it is not always possible to collect data samples over an extended period before applying a forecasting or decision-making model. For example, the output of a large-scale photovoltaic (PV) station must be forecast within days or months of its first operation, rather than years, so that the station can be optimized. Furthermore, in many cases the complex internal dependencies of a small dataset make it difficult to apply existing techniques. For example, a smart watering decision system for a park depends on many interrelated variables whose complicated relationships influence watering.
General machine learning approaches are based on the Probably Approximately Correct (PAC) learning framework, which assumes that the training data and the test data are drawn from the same stable, time-invariant distribution. Under this condition, a hypothesis learned from the training dataset will, with high probability, perform nearly as well on unseen instances. If the two distributions differ, however, the model's reliability is significantly affected. The need to collect long-term data to obtain adequate information about the target site limits the practical value of these approaches. If the dependency on data volume can be significantly reduced, the practicality of the solution can be enhanced.
To overcome the small-sample problem, techniques such as modified loss functions, synthetic data generation, ensemble methods, transfer learning, meta-learning, and up- or down-sampling can be used to improve model accuracy. However, the applicability of these techniques has not been widely tested on small samples with complex dependency relationships between data parameters and with time-dependent variability. A robust prediction model that delivers satisfactory results under low-sample conditions will increase the practicality of machine learning.
In this project, you will:
- Review the related literature to define the best methodology for approaching and framing the problem.
- Develop an improved machine learning technique to handle small sample data with complex relations for prediction/forecasting.
- Use small-sample live IoT data of different kinds to test how well the improved algorithm generalizes.
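As an illustration of one of the small-sample techniques named in the synopsis (ensemble methods), the following is a minimal sketch of a bootstrap-aggregated (bagged) ridge regressor. It is not the method this project will develop. The function names, the choice of ridge regression, the hyperparameters, and the synthetic 20-sample dataset are all assumptions made for the example; only NumPy is used.

```python
import numpy as np


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression; returns weights with the bias last."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    penalty = alpha * np.eye(Xb.shape[1])
    penalty[-1, -1] = 0.0  # leave the bias term unpenalized
    return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)


def predict_ridge(w, X):
    """Apply fitted ridge weights (bias last) to new inputs."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ w


def fit_bagged_ridge(X, y, n_models=50, alpha=1.0, seed=0):
    """Fit one ridge model per bootstrap resample of the small sample."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)  # sample rows with replacement
        models.append(fit_ridge(X[idx], y[idx], alpha))
    return models


def predict_bagged(models, X):
    """Return the ensemble mean and the spread across ensemble members."""
    preds = np.stack([predict_ridge(w, X) for w in models])
    return preds.mean(axis=0), preds.std(axis=0)


# Hypothetical small-sample setting: 20 synthetic training points, 3 features.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(20, 3))
true_w = np.array([1.5, -2.0, 0.5])  # assumed ground truth for the demo
y_train = X_train @ true_w + 0.1 * rng.normal(size=20)
X_test = rng.normal(size=(5, 3))

models = fit_bagged_ridge(X_train, y_train)
mean_pred, spread = predict_bagged(models, X_test)
print("ensemble mean:", mean_pred)
print("member spread:", spread)
```

The spread across ensemble members gives a rough indication of how sensitive the predictions are to which few samples happened to be observed. Note that this sketch still assumes the training and test data share one distribution, so it does not address the time-dependent variability discussed above.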