Week 7 - Starting with Sklearn

This week, we were assigned groups for the final open source project. After some discussion with the other group members, we have decided to work on scikit-learn (a.k.a. sklearn), a free and open source machine learning library for Python. In this blog post, I’m going to discuss how we chose our project, as well as my hopes and worries about it.

Picking the Project

In the project evaluations, my teammate Romee and I both chose pandas, a data analysis library for Python. Our other teammate Jiawei chose scikit-learn, so we decided to pick one of these two. Since they are both open source Python libraries, none of us had a preference for one over the other, so we looked at their repositories to compare them.

Both of them provide very detailed contributing instructions, from preparing the development environment to taking issues and making pull requests. However, we found that scikit-learn is actually the friendlier community of the two. Looking at the issues and pull requests, the maintainers of pandas tend to be very strict about which issues they accept. For instance, there was a feature request that came with a reasonable solution, but the maintainers said they needed to have a meeting to see whether it would affect future maintenance, and thus the issue was delayed. In contrast, scikit-learn is much friendlier and more open: most issues are either resolved or accepted as long as they are reasonable, and few of them are left pending.

Moreover, I was previously worried about whether we would be able to get issues assigned to us, given that some organizations assign issues only to their staff. However, this does not seem to be a problem for scikit-learn, since the documentation says that by commenting /take on an issue, we would be instantly assigned the issue by the bot.

For the reasons above, we finally picked scikit-learn to work on. That being said, if we fail to make progress contributing to scikit-learn, we may still turn to pandas.

Some Hopes

I’m really excited to (possibly) be able to contribute to a project that many people actually use. Given that I’m enrolled in a graduate machine learning course this semester, I believe contributing to scikit-learn will also help me gain deeper insights into machine learning (and will probably help with that course as well).

Through this project, I’m looking forward to improving my programming skills. This includes not only “programming” itself, but also writing meaningful comments and documentation so that users and collaborators understand what I am doing. Moreover, I think this is a great opportunity to make use of my double major and combine my math and computer science skills. Though I have taken courses such as Numerical Analysis that stand on the boundary between my two majors, this would actually be the first time that I can put my skills in both subjects into practice.

(Of course, I’m also looking forward to adding this experience to my CV, so I will definitely do my best.)

As for group work, Jiawei is actually a machine learning expert. I’m really looking forward to working with and learning from him. Our other teammate Romee is actually my girlfriend and I’ve worked with her multiple times, so there’s not much more to say about that.

Some Worries

Having talked about all those hopes, I do have worries about this project. Fixing some bugs may be relatively easy, but contributing new features would definitely be difficult from the very beginning, starting with choosing a feature to work on. The easiest way would be to pick from the issues, but sometimes while you are still thinking about a feasible solution, someone else may have already taken that issue. Therefore, we must have the ability to quickly figure out which features are “contributable”, as well as the courage to take some issues even if we are not 100% sure that we can come up with a quick solution. After all, making significant contributions for a final project will be time-consuming, and I’m prepared to devote a lot of time to this.

As for group work, though the groups were assigned so that our schedules would be compatible, the fact that we all take different courses and have different lifestyles will still affect the efficiency of our communication. Moreover, I’ve been used to doing individual projects and have tended to make everything (format, structure, etc.) the way I want. Such a bad habit clearly needs to be fixed in group work.

Summary

In this blog post, I have talked about our final project selection, as well as my hopes and worries about the project and the group work. Finally, thank you very much for reading so carefully, and I hope you enjoyed it.

Written before or on March 12, 2023