Basic data science tools to get started

Kirill Bondarenko
6 min read · May 16, 2019

Which tools you need to become effective in data science quickly

Introduction

Hello everyone!

In this article I want to tell you a short story about how I got into data science and what I did along the way.

The main idea is not to explain what ML and DS are, but to show some useful tools that make your work more efficient.

It was a long journey, and I sometimes made wrong decisions and assumptions about data science and machine learning. But in the end I understood what to do and what this field is about.

If you are an intern or a student who wants to discover data science or machine learning, and you have a basic understanding of the mathematics behind it, this article is for you.

My beginning

My programming career began with Java and Android application development. It was interesting to me, especially after working as an economist and sales manager. My first encounter with AI was accidental, but it captured my attention completely. I gave up Android and started learning the mathematics of artificial neural networks. Now for the main topic: what to do to work more efficiently in data science.

Basic: Python

Why Python?

Like any programming language, it has advantages and disadvantages. I only want to cover the points that really matter for data science.

Advantages:

  • First of all, it is very simple. Coming from Java, I needed about a month to learn it to a decent level. There are many good books and articles about it, and I list some useful sources at the end of the article.
  • It is very powerful. It has all the functions, data types, and data structures you need. Python is also a dynamic language, which means you can write something close to pseudocode and it will work (a joke with a large grain of truth).
  • In addition, there is a huge pool of libraries written for it (such as NumPy, pandas, etc.).

Disadvantages:

At the beginning you probably will not notice it, but it is there.

  • Python is slow. Why?

Because it is a dynamic language. Every variable you create refers to an object that carries extra information, such as its type, which the interpreter needs to work out what the code should do at run time.

For the same reason, Python code “eats” more system memory than equivalent C code.
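To make the memory point concrete, here is a minimal sketch that uses only the standard library. It assumes CPython (the usual interpreter), and the exact byte counts differ between versions and platforms, so the code prints them rather than promising particular numbers.

```python
import sys
from array import array

# A Python int is a full object: besides the value it stores a type pointer
# and a reference count, so it is larger than a raw C integer.
py_int_size = sys.getsizeof(12345)

# The standard-library array module stores raw C "long long" values instead.
raw = array("q", range(1000))
raw_item_size = raw.itemsize  # bytes per stored element

# A list holds pointers to separate int objects, so count both parts.
numbers = list(range(1000))
list_total = sys.getsizeof(numbers) + sum(sys.getsizeof(n) for n in numbers)
raw_total = raw.itemsize * len(raw)

print(f"one Python int: {py_int_size} bytes, one raw int: {raw_item_size} bytes")
print(f"list of 1000 ints: {list_total} bytes, raw array data: {raw_total} bytes")
```

Packing raw values tightly together is the same idea that NumPy arrays are built on.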

But these pros and cons are all about plain Python. In data science you will need many more tools.

NumPy

NumPy stands for “numerical Python”. It is one of the most popular Python packages. Once you understand the Python basics, start learning NumPy.

NumPy is very powerful. Its core is written in C, and it has two main advantages: power and speed.

The number of NumPy functions is huge: random arrays of any shape, element-wise multiplication of tensors, scalar multiplication, etc.

Figure: example of a NumPy array vs. a classic Python list

It is also faster than plain Python, and it “eats” less memory. Sometimes it trades a bit of speed to save memory, but the difference is easy to see for yourself: use a for loop and the standard append to fill a Python list with 100 thousand lists of 100 random integers each, then build the same data with NumPy and compare.
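The experiment described above can be sketched roughly like this. The function names and the value range 0 to 99 are my own choices for the illustration, and the row count is reduced from 100 thousand to 10 thousand so the script finishes quickly. Timings depend on your machine, so they are printed rather than promised.

```python
import random
import sys
import time

import numpy as np

# Smaller than the 100 thousand rows mentioned in the text, to keep the run short.
N_ROWS, N_COLS = 10_000, 100


def build_with_lists(n_rows, n_cols):
    """Plain Python: a for loop that appends one list of random ints per row."""
    rows = []
    for _ in range(n_rows):
        rows.append([random.randint(0, 99) for _ in range(n_cols)])
    return rows


def build_with_numpy(n_rows, n_cols):
    """NumPy: ask for the whole 2-D array of random ints in one call."""
    rng = np.random.default_rng()
    return rng.integers(0, 100, size=(n_rows, n_cols))


start = time.perf_counter()
as_lists = build_with_lists(N_ROWS, N_COLS)
list_seconds = time.perf_counter() - start

start = time.perf_counter()
as_array = build_with_numpy(N_ROWS, N_COLS)
numpy_seconds = time.perf_counter() - start

# sys.getsizeof on a list counts only its own pointer storage,
# so this is a lower bound for the list-of-lists version.
list_bytes = sys.getsizeof(as_lists) + sum(sys.getsizeof(row) for row in as_lists)

print(f"lists: {list_seconds:.3f} s, at least {list_bytes} bytes")
print(f"numpy: {numpy_seconds:.3f} s, {as_array.nbytes} bytes of array data")
```

Set N_ROWS to 100_000 to repeat the experiment at the size from the text.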

Pandas

pandas is usually the next package to install after NumPy. It is built on top of NumPy and other packages to make your work with tables more efficient.

Here is a standard task: you have an Excel or Google spreadsheet and need to load it into your code in one step, without long loops and readers. With pandas you can do it easily.
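A minimal sketch of loading a table in one call. I assume the spreadsheet has been exported as CSV, and the table itself (city, year, sales) is made up for the illustration. For a real .xlsx file, pandas also offers pd.read_excel, which needs an extra engine package such as openpyxl.

```python
import io

import pandas as pd

# Stand-in for a spreadsheet exported as CSV; the columns are invented for this sketch.
csv_text = """city,year,sales
Kyiv,2018,120
Kyiv,2019,150
Lviv,2018,80
Lviv,2019,95
"""

# One call parses the whole table, header included, into a DataFrame.
# With a real file you would pass its path instead of the in-memory buffer.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())
print(df.dtypes)
```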

You have a pandas data frame and need a quick look at the histogram of the values in column X? Call your_data_frame['X'].plot(kind='hist') and that is all. Short example:

Example of grouping data and visualizing it
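A minimal sketch of grouping could look like this. The city and sales columns are made up for the illustration, and the plotting call is left as a comment because it needs matplotlib installed.

```python
import pandas as pd

# A small invented table: one row per sale.
df = pd.DataFrame({
    "city": ["Kyiv", "Kyiv", "Lviv", "Lviv", "Odesa"],
    "sales": [120, 150, 80, 95, 60],
})

# Group the rows by city and add up the sales inside each group.
sales_by_city = df.groupby("city")["sales"].sum()
print(sales_by_city)

# With matplotlib installed, one more call draws the result as a bar chart:
# sales_by_city.plot(kind="bar")
```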

Learn pandas and become more powerful.

Matplotlib

This package is very popular. Its main job is to visualize your data with line plots, histograms, charts, scatter plots, etc. It even supports 3D data visualization.

Different plots in matplotlib

Matplotlib is quite simple to understand once you set yourself the goal of learning it.
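To give a feel for it, here is a small sketch that draws three common plot types side by side. The data is random and the titles are my own. I select the off-screen "Agg" backend so the script also runs on a machine without a display. In a notebook you would normally skip that line.

```python
import io

import matplotlib

matplotlib.use("Agg")  # draw off-screen so the sketch also runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 100)
samples = rng.normal(size=500)

# One figure with three panels: a line plot, a histogram, and a scatter plot.
fig, (ax_line, ax_hist, ax_dots) = plt.subplots(1, 3, figsize=(12, 3))
ax_line.plot(x, np.sin(x))
ax_line.set_title("Line plot")
ax_hist.hist(samples, bins=20)
ax_hist.set_title("Histogram")
ax_dots.scatter(samples[:100], samples[100:200])
ax_dots.set_title("Scatter plot")
fig.tight_layout()

# Save to an in-memory buffer; pass a file name such as "plots.png" to write a file.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```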

Scikit-learn

Scikit-learn is a very popular package for really starting your big ML/DS journey. It consists of a huge number of algorithms and tools for the main tasks, such as classification, regression, clustering, and dimensionality reduction.

It also has plenty of useful material to start learning and to deepen your knowledge.
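As a taste of how a typical scikit-learn workflow reads, here is a minimal classification sketch. The dataset (the built-in iris data) and the model (logistic regression) are my own picks for the illustration. Most other scikit-learn estimators follow the same fit and score pattern.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# A small built-in dataset: 150 flowers, 4 measurements each, 3 classes.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the data to check the model on examples it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Train the classifier and measure accuracy on the held-out part.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

print(f"test accuracy: {accuracy:.2f}")
```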

Keras

We have smoothly arrived at one more part of data science and machine learning: deep learning. It is a broad field that deserves tons of dedicated articles.

Keras is simple, relatively speaking. It is easier to learn than TensorFlow (Keras is a high-level API on top of TensorFlow that simplifies your life).

Keras helps you build deep artificial neural networks for complex tasks such as image classification, weather forecasting, speech transcription, etc.

My experience

I have described the right order in which to learn these tools.

Now I will tell you how I actually did it.

I started with … Keras. Once I understood that Java was not a good fit for ML tasks, I started using Keras with Python without any Python knowledge. Eventually I realized that I needed a deeper understanding of Python and learned it. At that time I did not know about the tools listed here, except NumPy, and it scared me a lot.

I thought that ML, DS, and DL were only neural networks and some statistics. But then I understood one important thing:

If you can solve a task using just statistics or math, don’t build a neural network only because it is popular and everyone talks about it.

What’s next?

When you think your data science knowledge is great after learning these packages, just wait a while and you will see that you know almost nothing about data science.

I am not trying to discourage you. I just want you to know.

It is rare to work with clean, good data, or to have all the data in one table. In most cases you will face messy tasks that have no clear solution. You will need to spend a lot of time working with the data to make it good and usable for complex models.
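Here is a tiny sketch of what that work can look like in pandas. Both tables, their columns, and the specific problems (a duplicate row, untidy text, a missing value, an unparsable number) are invented for the illustration. Real data is usually much messier.

```python
import pandas as pd

# Two made-up tables that describe the same customers.
customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "city": ["Kyiv", " lviv", " lviv", None],
})
orders = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "amount": ["100", "250", "n/a", "75"],
})

# 1. Drop exact duplicate rows, tidy up the text column, and label missing cities.
customers = customers.drop_duplicates()
customers["city"] = customers["city"].str.strip().str.title().fillna("Unknown")

# 2. Turn amounts into numbers; anything unparsable becomes NaN and is dropped.
orders["amount"] = pd.to_numeric(orders["amount"], errors="coerce")
orders = orders.dropna(subset=["amount"])

# 3. Combine everything into the single table a model would need.
clean = orders.merge(customers, on="customer_id", how="left")
print(clean)
```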

All these packages are only the tip of the iceberg, and you must be able to write code with them blindfolded: quickly and correctly. And remember:

If you spend less than 50% of your project time working with data, you are doing it wrong and nothing good will come of it.

Resources to learn

These two books are must-haves, worth reading and then reading again.

If you want to start right now, there are a lot of free resources and documentation:

I hope you found something useful here and that it will help you.

Good luck!

Bondarenko K., machine learning engineer :)
