Week 9 - First Success in Sklearn

This week, we are honored to be able to listen to a talk by Jim Hall, founder of the FreeDOS project. I was very curious about how open source projects are initally started, and how they can flourish, and Mr. Hall resolved my doubts. Meanwhile, my teammates and I have started taking issues of scikit-learn, and I have already made a succesful pull request, which is encouraging.

Reflections on Jim Hall’s Talk

There is a question that has been troubling me for so long, that is, how can open source projects flourish? People tend to contribute to large open source projects, so large open source projects can get developed rapidly and become even larger and attract more contributions. However, if an open source project has just been started, it’s hard to imagine that the public will notice it and contribute to it. Therefore, it seems that there is no way for it to become a success. Though I did not directly ask Mr. Hall this question, his talk about his early experience with FreeDos even before the “GitHub age” resolved by doubts. The point is, people or enterprises have to see the value in getting involved in the project. As an illustration, Jim Hall created his FreeDOS project in response to Microsoft’s decision to discontinue support for MS-DOS, while Windows has provided only limited backward support. People thus were able to see the value of having such a disk operating system, and there would naturally be people contributing to it.

Moreover, to make an open source project flourish, one can actively promote the project through social media, online forums, and other channels where developers and enthusiasts gather. Clear documentation, well-organized code, and a welcoming community can also add to the appeal of the project.

My First Contribution to scikit-learn

Last week, I have been searching for open issues in the scikit-learn issue tracker. There are so many of them, so I added a few restrictions in my search. First, I ignored the issues that have already been assigned to somebody else. Secondly, I temporarily ignored the issues with the “needs triage” label since this would mean they cannot be reproduced on the main branch or maintainers may need further discussion. Thirdly, I did not look at the issues with completely no response. Without some maintainer responding to the issue, it’s hard to say whether making a PR for it will ever be merged. Finally, I found this issue about IsolationForest raising spurious feature names warning. Fortunately, one of the maintainers have mentioned a similar issue that has been fixed before, but for a different class. I took a similar approach: I added a private method that does not validate the data, and replaced the call to the original public method in IsolationForest.fit. Finally, I modified the original public method to validate the data and then call this newly implemented private method, in order to avoid repetitive code.

The pull request I made for this issue was merged shortly, and I learned a lot from this experience. There could have been many approaches to solve an issue, but one would have to pick the best approach for the project. For instance, another possible approach to solve the issue above is to split the feature names and the data, then pass the feature names along with the data into the scoring funtion. That would work, but involves changing the signature of a public function. For a large project like scikit-learn with many users using various different versions of it, changing the function signature like that would be user-unfriendly and possibly backward-incompatible. Therefore, thinking twice or talking with maintainers before deciding the solution is very important. Maintainers are much more familar with the overall structure of the project and have a lot more experience of solving similar issues, so they can usually give useful suggestions.

Summary

In this blog post, I have talked about my reflections on Jim Hall’s talk. I have also shared my experience of searching for issues to work on, getting suggestions from the maintainers, and making a first successful pull request in scikit-learn. Finally, thank you very much for your meticulous reading and hope you enjoy it.

Written before or on March 26, 2023