What I Have Learned From a Senior Data Scientist

Intro

I joined the AI team at Dataminr last fall (2019) for an internship before I graduate from my Ph.D. program. I was lucky to work with an extraordinary group of people at Dataminr, and I learned a lot on the AI team. Here I will briefly list the useful tips and lessons I learned during this internship from my mentor, a senior data scientist in the group. The order does not denote any priority.

Project Documentation

Documentation is an important part of the project itself. We usually kept it in a running Google Slides file that covered the initial ideas of the project and the methods we had looked up, and then tied those methods back to the literature in a separate literature review file. It records not only the parts of the project that work but also the sample points where the methodology does not. Documentation, in a sense, is the tracking log of the project. In a commercial setting, every decision needs a logical reason behind it, and the documents are the place to store the foundations of these ideas and decisions.

More importantly, when communicating the ideas or different aspects of the project to co-workers or stakeholders in the company, a well-documented file comes in handy for delivering convincing reasons while illustrating project progress.

For example, the project I worked on had its own progress document, literature review, idea board, and weekly report deck. We put not only the good results into the documents but also the ugly results from different methods. Later on, when we summarized and explained the project to our manager, we could refer back to the documentation to show the specific reasons for picking one technique or pipeline over another.

Daily Schedules

Here I mean something different from your Google Calendar: the “real” schedule, which is made by no one but you and which you want to hold yourself to.

After the first week of the internship, my mentor sent me another Google Doc with the following pattern:

[Screenshot: schedule]

The screenshot is just a replica, but the idea is similar: there is a to-do list for the whole week, which should be decided beforehand, either in the previous meeting or at the beginning of the week, followed by existing issues that need to be solved within the week. I list the items under each day every morning around 8:30, or on the previous day before I leave work.

This is similar to the project documentation, but applied to ourselves. In a self-driven environment, I found these predefined schedules extremely helpful for keeping track of issues and delivering progress rigorously and on time.

Effective Communication

It is widely known that communication and presentation skills are essential for a data scientist, but only after working alongside good data scientists in the industry did I learn that this is easier said than done. Here, “effective” refers mostly to communication with higher-level colleagues, such as a supervisor or manager. These people are always on busy schedules, and using their limited time to get your ideas and opinions across clearly is quite a challenge sometimes, at least for me.

At the end of my internship, I asked my mentor what the essential quality was for working as a data scientist in a business setting, and his answer was “effective communication” with the stakeholders and management of the organization. The previous tips, including project documentation and persuasive presentations, all contribute to this point.

One piece of personal advice: when communicating with your supervisor or manager, try not to overthink their words. At the beginning of the internship, I kept trying to figure out what my manager meant by “meet you again later this week.” I worried that he was not happy with my proposal, or that he wanted me to come up with something different the next time we met. My mentor told me, “do not spend your time and energy on stuff like this, just keep working at your own pace, and he was just tempted to talk to you again.”

Of course, I do not mean that being considerate is bad, but there should be a clear boundary between being considerate and being too sensitive. Communication is a critical soft skill, and I still need to keep working on it.

Meeting Notes

Some people always bring their MacBooks into a meeting room and never stop typing during the meeting, while others walk into the room empty-handed and do nothing except listen. Is there one appropriate way to behave during a meeting? I guess the answer is no, so here I describe the approach I learned from my mentor and found most useful.

Unless I am presenting during the meeting, I do NOT bring either my laptop or my cell phone with me. The reason is that if someone is presenting to me, I should be focusing on the projection instead of my laptop or cell phone. The only things I take with me are a notebook and a pen. Taking notes during the meeting is always a good habit: write down the points you are confused about, the good points that inspire you, and the core ideas of the talk. Knowing how to summarize and extract essential information from a lecture or discussion is crucial to a data scientist.

I realized that I needed to improve the way I take part in meetings after the first reporting meeting of my internship. We presented the initial version of the project to the whole group and asked for feedback. After the meeting, my mentor asked me, “how do you feel about the meeting, did you get something useful from the audience?”, and I replied, “almost nothing.” My mentor said, “well, that is not true, we got at least three pieces of useful feedback.” Those were points I had heard during the meeting but had not realized I should write down, and by the time we had the discussion, I had almost forgotten them. My mentor saved the day because he had written them down clearly.

Remember the daily schedule file where each day has its own section? These summarized meeting notes would go there.

Command Documentation

As a data scientist, mastering multiple languages is a must, including programming languages such as Python, R, C/C++, Java, or JavaScript, scripting tools such as Bash scripts and the IPython shell, and even web languages such as HTML and CSS. Each language has its own tricks and commands that are useful and worth noting. Hardly anyone can memorize everything at a glance and never forget it. Therefore it is practical to keep a document that records all the useful and specific commands or tricks we have used in the past.

The advantage of keeping such a document became obvious when I started to work with Docker. For different services or images, the commands to run the container can be quite different. For instance, some containers need a volume from the host machine mounted, while others need different ports published. Unless I am working with Docker daily, I would probably forget how to deploy a specific service after a month or so. But if I have a document that I can easily refer back to, I can grab the necessary information immediately. Of course, some of this information about Docker is usually documented on Git, but some might not be.
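To make the Docker example concrete, here is a minimal sketch of what such a command document could look like if it were kept as a small Python file. It only builds and prints `docker run` commands; it does not execute them. The helper name `docker_run_command`, the service names, the image names, the paths, and the port numbers are all made up for illustration. Only the `-d`, `--name`, `-v` (mount a host volume), and `-p` (publish a port) flags are real Docker options.

```python
import shlex


def docker_run_command(image, name=None, volumes=None, ports=None, detach=True):
    """Build a `docker run` argument list without executing it.

    volumes maps host paths to container paths (-v host:container).
    ports maps host ports to container ports (-p host:container).
    """
    cmd = ["docker", "run"]
    if detach:
        cmd.append("-d")
    if name:
        cmd += ["--name", name]
    for host_path, container_path in (volumes or {}).items():
        cmd += ["-v", f"{host_path}:{container_path}"]
    for host_port, container_port in (ports or {}).items():
        cmd += ["-p", f"{host_port}:{container_port}"]
    cmd.append(image)
    return cmd


# Hypothetical entries: one service needs a host volume, the other only a port.
COMMANDS = {
    "notebook": docker_run_command(
        "example/notebook:latest",
        name="notebook",
        volumes={"/data/projects": "/workspace"},
        ports={8888: 8888},
    ),
    "api": docker_run_command(
        "example/api:latest",
        name="api",
        ports={8080: 80},
    ),
}

if __name__ == "__main__":
    # Print each saved command so it can be copied into a terminal.
    for service, cmd in COMMANDS.items():
        print(f"{service}: {shlex.join(cmd)}")
```

A plain text file with the same commands works just as well. The point is only that the volume and port details for each service are written down in one place.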

As another example, I wrote down in my most recent command document the terminal command for using ffmpeg to convert an mp4 file into a gif file with the best quality, so I can look it up directly the next time I need to convert something.
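I will not reproduce my exact command here, but the following is a hedged sketch of one commonly used approach: a two-pass conversion that first generates a color palette with ffmpeg's `palettegen` filter and then renders the gif with `paletteuse`. The function names, file names, frame rate, and width are assumptions chosen for illustration, and `convert` requires ffmpeg to be installed.

```python
import subprocess


def gif_commands(src, dst, fps=15, width=640, palette="palette.png"):
    """Return the two ffmpeg commands for a palette-based mp4-to-gif conversion."""
    # Shared filter chain: resample the frame rate and resize, keeping aspect ratio.
    filters = f"fps={fps},scale={width}:-1:flags=lanczos"
    # Pass 1: analyze the video and write a palette image.
    make_palette = [
        "ffmpeg", "-y", "-i", src,
        "-vf", f"{filters},palettegen",
        palette,
    ]
    # Pass 2: render the gif using the palette from pass 1 as the second input.
    render_gif = [
        "ffmpeg", "-y", "-i", src, "-i", palette,
        "-filter_complex", f"{filters}[x];[x][1:v]paletteuse",
        dst,
    ]
    return [make_palette, render_gif]


def convert(src, dst, **options):
    """Run both passes in order; raises CalledProcessError if ffmpeg fails."""
    for cmd in gif_commands(src, dst, **options):
        subprocess.run(cmd, check=True)
```

Calling `convert("demo.mp4", "demo.gif")` would run both passes in order. Lowering `fps` or `width` trades quality for a smaller file.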

Yet Another Communication

During one of the in-person interviews with a research scientist at Dataminr, I asked the interviewer the same question that I had asked my mentor: what is the essential skill for a scientist at Dataminr, or in the industry in general? He gave me the same answer: communication. However, his explanation of the word was different from the one I had heard before.

The communication here is more about a working attitude: actively taking part in team conversations and getting to know the whole team with a leadership mentality.

Specifically, being “involved” in the team means expressing yourself more and letting the team hear your voice, whether your point is closely related to the current work or totally off topic. “Thinking outside the box” is essential and should not be suppressed.

On the other hand, “leadership mentality” does not mean that we should be overly ambitious about becoming the manager of the group from the beginning. Instead, we should take part in every aspect of the team as if we were the leaders. Getting to know what every group member is working on is vital for collaboration.

Chatting with colleagues about their work may not seem like an efficient use of time, but it is an essential part of the team-player mentality.

The Ability to Learn

Two weeks ago, I had an interview with another senior data scientist in the industry. At the end of the interview, I asked the same question and got a different answer: he thinks many skills are essential to a data scientist, but if he had to pick one, he would rank the ability to learn at the top.

He sketched the history of machine learning and deep learning over the past few years. He pointed out that we data scientists are working right in the golden era of this industry, where major improvements in technology arrive every couple of weeks. If we work as research-oriented data scientists, we need to keep up with the pace of technological advancement.

The room for understanding and better utilizing artificial intelligence is still huge. What we have seen and learned is just the tip of the iceberg. With numerous SOTA models setting higher and higher scores, being able to adapt to new and better algorithms and take advantage of their better performance is crucial.