ML Design Patterns

Table of contents

Slides of a recent session on this topic

ML_Patterns_Principles_with_Reza.pdf

Writing a proposal

1) Structuring an ML project/thesis proposal

I have organized my proposal in the following order:

a) Intro
b) Background and related works
c) Hypothesis
d) Approach
e) Dataset
f) Timeline 

There are two important points here. First, use hedging when you write your approach; it will save you from some unwanted problems. Second, dataset generation has many challenges. If you are not planning to curate a dataset, make sure that a dataset for your experiment is readily available. I thought I could generate a dataset in 2 hours, but unexpected issues made me spend 2 full months generating another dataset for my work.

2) LaTeX and Overleaf

Watch my YouTube video on how to use LaTeX and Overleaf. If you have any questions, please comment below the video on YouTube.

3) Research Papers

Most computer science studies, especially in machine learning, fall into the category of Experimental Studies rather than Theoretical Studies.

In a theoretical study, like proving a statement in math, we use deductive or inductive reasoning to derive the conclusion by applying a set of rules to our limited axioms, or to the statements that have already been derived from those axioms with the same rules. Setting aside how useful the result is, a theoretical result is either correct or incorrect, and in most cases we can determine which by inspecting the argument, at low cost and in little time.

In an experimental setup, on the other hand, we have a hypothesis that we want to test. This is usually done with two groups: an experimental group and a control group. We then measure the effect, and if the expected outcome is not achieved, we reject the hypothesis. But bear in mind that there is no rigid correctness or incorrectness in experimental setups. For example, if we measure the effect of a vaccine on two groups and reach an outcome, we can only say that this outcome is most likely (at your chosen confidence level) to happen under exactly the conditions in which we ran the experiment; we cannot generalize the result to all circumstances.

Another problem that besets experiments is that we cannot replicate them at low cost and in little time to check their correctness. In most cases, replicating an experiment takes a lot of time and resources. And it does not end there: we have to make sure that there is no confounding variable, which requires researchers to do everything they possibly can to address this legitimate skepticism. One confounding variable in ML research papers is the GPU!

Project Management

1) Documentation (GitHub + code comments) 

Believe it or not, documentation is the hardest part. You have to make sure that you follow a CI/CD methodology, of which daily documentation is a part. This becomes twice as important when you work remotely on your research. If somebody tells you that you did not work during a week for which you were paid, how are you gonna prove otherwise? By showing your local documents? How are they supposed to know that you did not create those documents at some other time?

Using a private GitHub repository gives you a clear record of your work, both for yourself and as proof for your product owner, or whoever is paying you.

2) Methodology (Agile / Waterfall)

Waterfall and Agile are two different approaches to software development. You can also use these approaches to manage your research paper or even your life in general!

I personally find the Agile methodology more effective, but academia usually follows the Waterfall approach. After you set the requirements and expectations and articulate them in your proposal, you are supposed to read more papers, complete the code sequentially, document your results, write everything up, and then defend your work. This fully sequential approach can fall short when you hit a blocker in one part of the work, and it can cause boredom because there is no flexibility or variety in the type of tasks.

Agile development, on the other hand, is an iterative and flexible approach. In my bachelor's thesis, which was on multi-domain, multi-modal chatbots using GPT and CLIP, we used agile development. We were supposed to do the work, assess our approach, document, do a bit of writing, and present our work and progress bi-weekly in front of all 20 other lab members so that they could critique it. So, at my defense, nothing came up that I had not already been asked about, because we had worked iteratively on all parts of the project.

3) Communications (Expectations, Reports, ...) 

Have you heard the slogan "The customer is always right"? It makes the customer's satisfaction of paramount importance. It does not matter whether you have done the best work in the world if your customer is not happy with it. The only thing that matters in a successful business, or in research, which is a sort of business, is the customer's satisfaction. The only way to ensure this satisfaction is to sit down with the product owner and the customer and ask one question:

"What is your definition of done for this project?"

Then you have to go into detail on this question so that no unspoken assumption is left. Continuously reporting progress helps both sides stay on track, and if there has been a misunderstanding on either side, both sides can catch it and address it promptly, following the procedures that were previously agreed upon.

4) ML project workflow/pipeline (we can modularize/containerize the components of the pipeline) 

I have brought the pipeline and skeleton of my project along with its code, with some personal or private parts removed.


The parts are:
1) Libraries and packages 

2) Organizing args, hyper-parameters, and global variables

3) Utilities 

4) Handling data

5) Modeling and your custom model-related components (such as custom loss function)

I have not included my thesis's neural network model, as it is an open case and I have not officially finished my two-year program.

6) Train / Validation loops

7) Evaluation

8) Driver Function
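To make the eight parts above concrete, here is a minimal sketch of how they could sit in a single file. It is only an illustration, not the project code from the session: every name in it (parse_args, set_seed, load_data, train, evaluate, main) is a hypothetical placeholder, the data is synthetic, and the "model" is a single weight so that the skeleton runs without any ML library.

```python
# 1) Libraries and packages
import argparse
import random


# 2) Organizing args, hyper-parameters, and global variables
def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Toy pipeline skeleton")
    parser.add_argument("--lr", type=float, default=0.05)
    parser.add_argument("--epochs", type=int, default=50)
    parser.add_argument("--seed", type=int, default=0)
    # parse_known_args ignores unrelated arguments (e.g. in a notebook).
    args, _unknown = parser.parse_known_args(argv)
    return args


# 3) Utilities
def set_seed(seed):
    random.seed(seed)


# 4) Handling data (assumption: synthetic data, y = 3x plus small noise)
def load_data(n=40):
    xs = [random.uniform(-1, 1) for _ in range(n)]
    data = [(x, 3.0 * x + random.gauss(0, 0.01)) for x in xs]
    split = int(0.8 * n)
    return data[:split], data[split:]


# 5) Modeling and custom model-related components (here: a custom loss)
def predict(w, x):
    return w * x


def mse(w, data):
    return sum((predict(w, x) - y) ** 2 for x, y in data) / len(data)


# 6) Train / validation loops
def train(args, train_set, val_set):
    w = 0.0
    for _ in range(args.epochs):
        for x, y in train_set:
            grad = 2 * (predict(w, x) - y) * x  # d(loss)/dw for one sample
            w -= args.lr * grad
        _val_loss = mse(w, val_set)  # monitored once per epoch
    return w


# 7) Evaluation
def evaluate(w, data):
    return mse(w, data)


# 8) Driver function
def main(argv=None):
    args = parse_args(argv)
    set_seed(args.seed)
    train_set, val_set = load_data()
    w = train(args, train_set, val_set)
    return w, evaluate(w, val_set)


if __name__ == "__main__":
    print(main())
```

Keeping each part behind its own function is what makes it possible to later move the parts into separate modules or containers without touching the driver.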

Computation Resources

1) Local Servers

Having a dedicated local server helped me a lot in running my prototypes with no queue time (unlike Compute Canada) and no timeout (unlike Google Colab). But it also had its own challenges. I had to maintain the software and sometimes the hardware! In the picture below, there was a failure on one of the physical ports of the server, for which there was no YouTube video or Stack Overflow page on how to fix it. :(


2) Cloud services (AWS, Compute Canada, Colab, …) 

In any case, please do not train on your laptop, since it might end up as toast, like my previous PC :/

Remote Connection

1) Preliminary concepts and definitions (SSH, SLURM, Bash, …)

If you are not comfortable with basic Linux commands or with using Google Colab's Linux kernel, you can watch a video that I made earlier. The concepts and syntax of SSH and SLURM are covered in the videos that I made for Parts 2 and 3 of this section.

2) Connecting to a local server 

3) Connecting to a cloud server 

Navigating ML repositories

The SimpleTOD GitHub repository, from Salesforce, is a good example to navigate.

If you are choosing a research paper as your baseline, make sure that its repository is in good standing in the "Issues" section.

Libraries and Packages

Libraries and packages can contain out-of-date or buggy code. Don't forget that in Python you can pop a library open, look into its code, and possibly debug it.
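One way to "pop a library open" is to ask Python where the installed code lives and to print a piece of it. The sketch below uses only the standard inspect and importlib modules; the helper names locate_source and show_source are made up for this example, and the standard json module simply stands in for whatever package is giving you trouble.

```python
import importlib
import inspect
import json


def locate_source(module_name):
    """Return the path of the .py file that defines an importable module."""
    module = importlib.import_module(module_name)
    return inspect.getsourcefile(module)


def show_source(obj, max_lines=15):
    """Return the first lines of the source code of a function or class."""
    lines = inspect.getsource(obj).splitlines()
    return "\n".join(lines[:max_lines])


# Where is the package installed, and what does one of its functions look like?
print(locate_source("json"))
print(show_source(json.dumps, max_lines=5))
```

This only works for code written in Python; functions implemented in C extensions have no Python source to show, and inspect.getsource raises an error for them.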

The following picture is from the Python API of a push notification service that we were using for our Software Engineering project in my bachelor's program. We were getting multiple errors with their API, and after many tickets we realized that they were not going to address the issue in time. So I popped their package open and found that part of their Python code was written for an older version of Python and was not compatible with the newer version. I fixed the code and it worked :)

In another case, I was getting an error from a library, so I went to their code, read the logic behind it, and implemented the same logic myself to resolve the issue 😅🤣

Arguments and Hyperparameters Organization

Watch my YouTube video, in which I cover:

+ Where do the args and hyper-parameters show up?

+ The necessity of factoring them out of the code and organizing them under an args object

+ The ArgumentParser object, defining args, and the final args object

+ Passing values to our Python code (exporting them into os.environ or passing them when executing the Python file)

+ Some notes about running your code in Google Colab cells

+ Touching on logging and log files
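As a rough illustration of the points above (this is not the code from the video), the sketch below builds one args object with argparse and lets environment variables supply the defaults. The argument names (--lr, --batch-size, --epochs) and the environment variable names (LR, BATCH_SIZE) are assumptions chosen for the example.

```python
import argparse
import os


def build_parser():
    parser = argparse.ArgumentParser(description="Hypothetical training script")
    # An exported environment variable sets the default;
    # a command-line flag overrides it.
    parser.add_argument("--lr", type=float,
                        default=float(os.environ.get("LR", "1e-3")))
    parser.add_argument("--batch-size", type=int,
                        default=int(os.environ.get("BATCH_SIZE", "32")))
    parser.add_argument("--epochs", type=int, default=10)
    return parser


def get_args(argv=None):
    # parse_known_args skips arguments it does not recognize, which helps in
    # notebook cells where the kernel adds its own entries to sys.argv.
    args, _unknown = build_parser().parse_known_args(argv)
    return args


args = get_args(["--lr", "0.01"])
print(args.lr, args.batch_size, args.epochs)
```

With a script named, say, train.py (a hypothetical name), the same value can be passed either as `python train.py --lr 0.01` or by exporting LR=0.01 before running it. In a Colab cell, calling get_args([]) and then editing the fields of the returned object is one simple way to stay with the same args object.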


Utilities

+ This part is discussed in the slides.

Handling data

+ This part is discussed in the slides.

1) Caching and excluding static operations 

2) Batching

3) Padding and Masking

4) Imbalanced data 
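The four items above are covered in the slides; as a small companion, here is a minimal sketch of each in plain Python. All function names are hypothetical, the tokenizer is a toy whitespace split, and inverse-frequency class weights are just one common way to handle imbalance, not necessarily the one used in the session.

```python
from collections import Counter
from functools import lru_cache


@lru_cache(maxsize=None)
def tokenize(text):
    """Caching: a deterministic (static) operation is computed once per input."""
    return tuple(text.lower().split())


def batches(items, batch_size):
    """Batching: yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def pad_and_mask(sequences, pad_id=0):
    """Padding and masking: pad to the longest sequence, mark real tokens with 1."""
    max_len = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_id] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)
    return padded, mask


def inverse_frequency_weights(labels):
    """Imbalanced data: rarer classes get proportionally larger weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}


sequences = [[5, 6, 7], [8], [9, 10]]
for batch in batches(sequences, batch_size=2):
    print(pad_and_mask(batch))
print(inverse_frequency_weights(["a", "a", "a", "b"]))
```

Padding per batch rather than over the whole dataset keeps short batches short, and the mask is what lets a loss or attention layer ignore the padded positions.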

Modeling: Parameters & Hyperparameters

+ This part is discussed in the slides.

1) Your model/network (different architecture/algorithms)

2) Learning rate scheduler 

3) Regularization

4) Loss function

5) Vectorization 

6) Parallelization
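As one example for the learning rate scheduler item, the sketch below computes a learning rate with linear warmup followed by cosine decay. This particular schedule is an assumption picked for illustration (it is a common choice, not necessarily the one in the slides), and the function name and default values are hypothetical.

```python
import math


def lr_at_step(step, base_lr=1e-3, warmup_steps=100, total_steps=1000):
    """Learning rate at a given optimizer step (steps are counted from 0)."""
    if step < warmup_steps:
        # Linear warmup: ramp up to base_lr over the first warmup_steps steps.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


for step in (0, 99, 100, 550, 1000):
    print(step, lr_at_step(step))
```

Writing the schedule as a pure function of the step makes it easy to plot or unit-test before wiring it into an optimizer.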

Train/Validation loops

+ This part is discussed in the slides.

1) Train Loop 

2) Validation Loop 
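To show how the two loops fit together, here is a minimal sketch on a toy linear model in plain Python. The train loop updates the parameters, the validation loop only measures the loss, and the outer loop keeps the best parameters and stops early. The function names, the synthetic data (y = 2x + 1), and the patience and min_delta settings are all assumptions made for this example.

```python
import random


def train_epoch(params, train_set, lr):
    """Train loop: one pass over the data with per-sample gradient updates."""
    w, b = params
    for x, y in train_set:
        err = (w * x + b) - y
        w -= lr * 2 * err * x
        b -= lr * 2 * err
    return w, b


def validate(params, val_set):
    """Validation loop: mean squared error only, no parameter updates."""
    w, b = params
    return sum(((w * x + b) - y) ** 2 for x, y in val_set) / len(val_set)


def fit(train_set, val_set, epochs=100, lr=0.05, patience=5, min_delta=1e-12):
    params = (0.0, 0.0)
    best_params, best_loss = params, float("inf")
    bad_epochs, history = 0, []
    for _ in range(epochs):
        params = train_epoch(params, train_set, lr)
        val_loss = validate(params, val_set)
        history.append(val_loss)
        if val_loss < best_loss - min_delta:
            # New best: remember it (a real project would save a checkpoint).
            best_params, best_loss, bad_epochs = params, val_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # early stopping
    return best_params, best_loss, history


random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(50)]
data = [(x, 2.0 * x + 1.0) for x in xs]
train_set, val_set = data[:40], data[40:]
(w, b), best_loss, history = fit(train_set, val_set)
print(round(w, 3), round(b, 3), len(history))
```

In a deep learning framework the same structure holds; the validation pass would additionally switch the model to evaluation mode and disable gradient tracking.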

Evaluation

+ This part is discussed in the slides.

Driver Function

+ This part is discussed in the slides.

Deployment

+ This part is discussed in the slides.

Miscellaneous

1) Efficient use of RAM: gc.collect(), … 

2) Efficient use of GPU: torch.cuda.empty_cache(), … 

3) Efficient use of Hard Drive: Managing checkpoints, lower precision tensors, … 

4) Improving Run time: Profiling to find the bottlenecks, … 
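A minimal sketch of items 1, 2, and 4 follows. It assumes PyTorch may or may not be installed, so the GPU part is guarded; the helper names free_memory and profile_call and the toy slow_sum function are hypothetical. Note that torch.cuda.empty_cache() only releases cached memory that is no longer in use; tensors that are still referenced stay on the GPU.

```python
import cProfile
import gc
import io
import pstats


def free_memory():
    """Collect unreachable Python objects, then release cached GPU memory if possible."""
    collected = gc.collect()
    try:
        import torch  # optional dependency (assumption: it may not be installed)
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass
    return collected


def profile_call(func, *args, top=5, **kwargs):
    """Run func under cProfile and return (result, text report of the hot spots)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


def slow_sum(n):
    return sum(i * i for i in range(n))


result, report = profile_call(slow_sum, 100_000)
print(report)
print("objects collected:", free_memory())
```

Sorting by cumulative time usually points at the function to look into first; deleting large intermediate variables before calling free_memory() is what actually makes the memory reclaimable.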

References and external links

[1] Mathematics of Machine Learning course by Dr. Erfan Salavati @ Tehran Polytechnic University

[2] Deep Learning Specialization by Andrew Ng @ Coursera

[3] Bobby Miraftab's webpage - Compute Canada (bobby-miraftab.com). I initially used this webpage to understand how to create a virtual environment on Compute Canada.

And credits to many other people.

Some Q&A questions: