How to Import Kaggle Datasets Directly into Google Colab?
Introduction:
Almost every aspiring data scientist uses Kaggle. It hosts datasets across a wide range of domains, from medicine and e-commerce to astrophysics, and users build and demonstrate their data science and machine learning skills by practicing on them.
Kaggle datasets come in a variety of sizes, from less than 1 MB to around 100 GB. Additionally, some deep learning techniques need GPU support to keep training time manageable. Google Colab can help newcomers here by letting them run their programs in a cloud environment.
In this tutorial, we will learn how to import Kaggle datasets into Google Colab notebooks.
Choose a Dataset from Kaggle
Your first step is to select a dataset on Kaggle. You can also choose datasets from competitions. I've selected two datasets for this tutorial: one picked at random and one from a current competition.
Download API Credentials
You must be authenticated with Kaggle to download data from it, and for this you need an API token. You can quickly generate this token from your Kaggle account's profile page. Go to your Kaggle profile, and from there:
- Choose the Account tab and scroll down to the API section.
- Use the option there to create a new API token. Your username and API key will be downloaded in a file called "kaggle.json".
You just need to do this step once; you don't need to create credentials each time you download a dataset.
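Before uploading the file anywhere, it can help to confirm that the download is intact. The sketch below assumes that "kaggle.json" is a JSON object with the fields "username" and "key"; the function name load_kaggle_token is a hypothetical helper, not part of the Kaggle library.

```python
import json
from pathlib import Path


def load_kaggle_token(path="kaggle.json"):
    """Read a kaggle.json file and return (username, key).

    Assumes the file holds a JSON object with "username" and "key" fields.
    Raises ValueError if the content is not an object or a field is empty.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        raise ValueError("kaggle.json should contain a JSON object")
    missing = [field for field in ("username", "key") if not data.get(field)]
    if missing:
        raise ValueError("kaggle.json is missing: " + ", ".join(missing))
    return data["username"], data["key"]
```

Calling load_kaggle_token("kaggle.json") in a notebook cell either returns the two values or fails early with a readable message, which is easier to diagnose than an authentication error later on.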
Configure the Colab Notebook
Start Google Colab and connect the notebook to a cloud instance (in other words, start the notebook interface). Then upload the "kaggle.json" file that you just obtained from Kaggle.
You can now run the commands required to load the dataset, as listed in the following steps.
Note: Colab instances are Linux-based, so Linux commands, including installation commands, can be run in the code cells by prefixing them with "!".
1. Install the Kaggle library first.
! pip install kaggle
2. Create the ".kaggle" directory in your home folder.
! mkdir ~/.kaggle
3. Copy "kaggle.json" into the newly created directory.
! cp kaggle.json ~/.kaggle/
4. Restrict the file's permissions so that only your user can read and write it.
! chmod 600 ~/.kaggle/kaggle.json
The Colab notebook is now ready to download Kaggle datasets.
The complete set of commands for configuring the Colab notebook
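Steps 2 to 4 above can also be done from Python instead of shell commands. The following is a minimal sketch of that idea, not an official Kaggle routine; the function name install_kaggle_credentials and its home parameter are assumptions introduced here so that the target folder can be overridden.

```python
import shutil
import stat
from pathlib import Path


def install_kaggle_credentials(source="kaggle.json", home=None):
    """Copy the token to <home>/.kaggle/kaggle.json and set mode 600."""
    home = Path(home) if home is not None else Path.home()
    target_dir = home / ".kaggle"
    # Same effect as "mkdir ~/.kaggle", but safe to run more than once.
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "kaggle.json"
    # Same effect as "cp kaggle.json ~/.kaggle/".
    shutil.copyfile(source, target)
    # Same effect as "chmod 600": read and write for the owner only.
    target.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return target
```

Because the directory is created with exist_ok=True, the cell can be re-run in the same session without failing on an existing ".kaggle" folder.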
Download Datasets
Kaggle hosts both competitions and datasets. The download process differs only slightly between the two.
Obtaining the Competitions Dataset
The name to use in the command is not the bold title shown over the competition's banner. It is the slug that follows "/c/" in the competition link. Take a look at our example link:
www.kaggle.com/c/google-smartphone-decimeter-challenge
The competition name to pass to the Kaggle command is therefore "google-smartphone-decimeter-challenge". Running the following command starts downloading the data into the instance's allotted storage:
! kaggle competitions download google-smartphone-decimeter-challenge
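Picking the slug out of a competition link can be sketched in a few lines of Python. The helper name competition_slug is hypothetical, and accepting "/competitions/" in addition to "/c/" is an assumption about how such links may also be written.

```python
from urllib.parse import urlparse


def competition_slug(link):
    """Return the slug that follows /c/ (or /competitions/) in a Kaggle link."""
    if "://" not in link:
        link = "https://" + link  # allow links written without a scheme
    parts = [part for part in urlparse(link).path.split("/") if part]
    # "competitions" is an assumed alternative to the "/c/" form in the text.
    for marker in ("c", "competitions"):
        if marker in parts:
            position = parts.index(marker) + 1
            if position < len(parts):
                return parts[position]
    raise ValueError(f"no competition slug found in {link!r}")
```

For the competition used in this tutorial, the function returns the same name that is passed to the Kaggle command.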
Getting Datasets:
These datasets are not part of any competition. They can be downloaded with:
! kaggle datasets download <name-of-dataset>
Here, the name of the dataset is its "user-name/dataset-name" identifier. Simply copy the part of the link that comes after "www.kaggle.com/". Consequently, in our case,
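The same idea works for dataset links. This is only a sketch: dataset_name is a hypothetical helper, the links in the usage note are placeholders rather than real datasets, and skipping a leading "datasets" segment is an assumption about how some dataset links are written.

```python
from urllib.parse import urlparse


def dataset_name(link):
    """Return "user-name/dataset-name" from a Kaggle dataset link."""
    if "://" not in link:
        link = "https://" + link  # allow links written without a scheme
    parts = [part for part in urlparse(link).path.split("/") if part]
    # Assumed variant: some links put "datasets" before the user name.
    if parts and parts[0] == "datasets":
        parts = parts[1:]
    if len(parts) < 2:
        raise ValueError(f"no user-name/dataset-name found in {link!r}")
    return f"{parts[0]}/{parts[1]}"
```

For a placeholder link such as "www.kaggle.com/some-user/some-dataset", the function returns "some-user/some-dataset", which is the value to pass to the datasets download command.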
Bonus Advice 1: Download Particular Files
You just learned how to download datasets from Kaggle in Google Colab. Often you are only interested in one particular file. In that case, pass the filename with the "-f" flag, and only that file will be downloaded. Both the competitions and the datasets commands support the "-f" flag.
!kaggle competitions download google-smartphone-decimeter-challenge -f baseline_locations_train.csv
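If you prefer to start downloads from Python code rather than from a "!" line, the command can be assembled as an argument list. The sketch below uses hypothetical helper names (kaggle_download_args and download), and it assumes the Kaggle library is installed and the credentials are already in place.

```python
import subprocess


def kaggle_download_args(kind, name, file_name=None):
    """Build the arguments for: kaggle <kind> download <name> [-f <file>]."""
    if kind not in ("competitions", "datasets"):
        raise ValueError("kind must be 'competitions' or 'datasets'")
    args = ["kaggle", kind, "download", name]
    if file_name:
        # Restrict the download to a single file.
        args += ["-f", file_name]
    return args


def download(kind, name, file_name=None):
    """Run the command; raises CalledProcessError if the kaggle CLI fails."""
    return subprocess.run(kaggle_download_args(kind, name, file_name), check=True)
```

Calling download("competitions", "google-smartphone-decimeter-challenge", "baseline_locations_train.csv") runs the same command as the line above. Passing the arguments as a list avoids shell quoting problems with unusual filenames.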
Bonus Advice 2: Import Kaggle Credentials from Google Drive
You uploaded the "kaggle.json" file when you configured the notebook in step 3 of the process. Files uploaded to the instance's storage while a session is active are not retained once the session ends.
This means that the JSON file has to be uploaded again each time the notebook is reloaded or restarted. To avoid this manual work:
- Add the "kaggle.json" file to your Google Drive. Place it in the root folder rather than in a subfolder so that the path stays simple.
- Next, mount the Drive in your notebook:
from google.colab import drive
drive.mount('/content/drive')
- As before, install the Kaggle library and create the ".kaggle" directory:
! pip install kaggle
! mkdir ~/.kaggle
- The "kaggle.json" file must now be copied from the mounted Google Drive to the active instance's storage. The Drive is mounted under "/content/drive/MyDrive". Run the Linux copy command as follows:
!cp /content/drive/MyDrive/kaggle.json ~/.kaggle/kaggle.json
Using the Kaggle competitions and datasets command, you can now rapidly download the datasets. This approach also has the benefit of not requiring the credential file to be uploaded with each restart of the notebook.
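A notebook's first cell can also check whether the credential file is actually on the Drive before trying to copy it. The sketch below uses hypothetical helper names. It assumes the google.colab module is only importable inside Colab, so outside Colab the mount step simply reports False.

```python
from pathlib import Path


def mount_drive_if_colab(mount_point="/content/drive"):
    """Mount Google Drive when running inside Colab; return True if mounted."""
    try:
        from google.colab import drive  # assumed importable only inside Colab
    except ImportError:
        return False
    drive.mount(mount_point)
    return True


def find_drive_token(drive_root="/content/drive/MyDrive", name="kaggle.json"):
    """Return the path of the token on the mounted Drive, or None if absent."""
    candidate = Path(drive_root) / name
    return candidate if candidate.is_file() else None
```

If find_drive_token() returns None, the file is not in the Drive's root folder (or the Drive is not mounted), and you can fall back to uploading "kaggle.json" by hand. Otherwise, the returned path is the one to copy into "~/.kaggle" with the copy command shown above.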
Advantages of Google Colab:
Google Colab is an excellent tool for practicing data science problems. One of its main benefits is free GPU support, which helps aspiring data scientists who are initially limited in computing resources. Because Colab notebooks run on Linux instances, you can easily interact with the kernel and run all the standard Linux commands.
For practice datasets, the RAM and disk space are more than sufficient, but if your work calls for more processing power, you can opt for the paid "Colab Pro" plan.
Disadvantages of Google Colab:
Google Colab's managed environment has some drawbacks. Anyone can write and run arbitrary Python code in a browser using Google Colab. It is still a somewhat closed environment, though: practitioners start from the Python packages preinstalled on Colab, and anything added on top does not persist beyond the session. As a result, the platform offers the basic tools but is less suited to specialised use.
Some of the major disadvantages of Google Colab are:
- Repetitive Tasks: Repeating the same set of actions over and over is time-consuming and tedious. In Google Colab, a programmer must install every library that is not part of the preinstalled Python environment again in each new session.
- No Live-Editing: You can collaborate by writing a piece of code and then sharing it with a team or a partner. However, Google Colab does not offer live editing, which limits the number of concurrent code writers or editors to two and leads to a lot of back-and-forth sharing. Rivals such as CoCalc do offer this feature.
- Saving & Storage Issues: Google Colab does not offer permanent storage, so uploaded files are deleted when the session ends, and losing data this way can be a nightmare for many. A downloaded file that will be needed later must be saved to Google storage before the session expires. Additionally, because all Colaboratory notebooks are saved in Google Drive, a user must always be logged into their Google account.
- Limited Time & Space: The Google Colab platform stores files in Google Drive, which has a free space allowance of 15 GB. Working on larger datasets takes more space than that, which limits how much of the more demanding work can be stored and run.
- Users of Google Colab can use a runtime for a maximum of 12 hours at a stretch; if they need to work for longer than that, they must purchase the premium Colab Pro version, which enables connectivity for 24 hours. The platform's inability to run code or function effectively on a mobile device is the last and least discussed disadvantage of the system.
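The repetitive installation step can be reduced to a single cell that installs only what is missing. This is a rough sketch with hypothetical helper names. It assumes that each module's import name is the same as its pip package name, which is not true for every library.

```python
import importlib.util
import subprocess
import sys


def missing_modules(module_names):
    """Return the top-level module names that cannot be imported right now."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


def install_missing(module_names):
    """pip-install the missing modules and return the list that was installed.

    Assumes import name == package name; adjust the list where they differ.
    """
    todo = missing_modules(module_names)
    if todo:
        subprocess.run([sys.executable, "-m", "pip", "install", *todo], check=True)
    return todo
```

Running install_missing(["kaggle"]) at the top of a notebook skips the installation when the library is already present in the session.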
Conclusion
Google Colab entered the market with the sole purpose of giving machine learning practitioners a platform and tools to enhance their machine learning capabilities. Over time, however, as data volume, intensity, and quality evolved, so did the demands placed on ML practitioners to find solutions to challenging problems. It is simple to release a commercial version, but if the machine learning ecosystem is to develop as a whole, the platform must be upgraded and made freely available to everyone.
In this tutorial, we discussed how to import Kaggle datasets into Google Colab.
We also discussed the advantages and disadvantages of Google Colab, so that readers know what to expect from it.