Predictive Hacks

How to get Data from Different Sources in R


The data that we want to get could be in different places and in different formats. We will provide some examples of how you can get data from different sources.

Get Data from SQL

It is very common for the data to be stored in an SQL database. We have provided an extensive example of how you can connect R with SQL.
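As a minimal sketch of the idea, the snippet below uses the DBI package with an in-memory SQLite database. The RSQLite driver and the built-in mtcars table are assumptions made for illustration. For another database you would swap in the matching DBI driver and your own connection details.

```r
library(DBI)

# Assumption: an in-memory SQLite database stands in for a real SQL server
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Write a small built-in data set as a table so there is something to query
dbWriteTable(con, "mtcars", mtcars)

# Run a query and get the result back as a data frame
heavy_cars <- dbGetQuery(con, "SELECT mpg, cyl, wt FROM mtcars WHERE wt > 5")

# Close the connection when we are done
dbDisconnect(con)

heavy_cars
```

The same dbConnect / dbGetQuery / dbDisconnect pattern applies to other DBI back ends. Only the driver and the connection arguments change.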

Get csv/text Data from HTTP(s) URL

We can easily read structured data, such as csv or txt files, that are available at an HTTP(S) URL. I have created a public S3 bucket that holds a dummy data set called movie_metadata.csv. Let’s see how we can read it.

myURL<-"https://gpipisbucket.s3.amazonaws.com/movie_metadata.csv"

df<-read.csv(url(myURL))
 

Get/Download Data

If the data are in other formats, such as .jpg, .png, .pdf or .xlsx, it is usually better to download them to a file. Let’s see how we can do it with the download.file command.

myURL<-"https://gpipisbucket.s3.amazonaws.com/movie_metadata.csv"
download.file(myURL, destfile = "movie_metadata.csv")
 

Now, we have created a file called “movie_metadata.csv” in our working directory.
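For binary formats such as images or xlsx files, it usually helps to pass mode = "wb" so that the bytes are written unchanged (this matters mainly on Windows). The sketch below wraps this in a small helper that skips the download when the file already exists. The name download_if_missing is hypothetical and not part of any package.

```r
# Hypothetical helper: download a file only when it is not already on disk
download_if_missing <- function(url, destfile) {
  if (!file.exists(destfile)) {
    # mode = "wb" writes the bytes as-is, which is what binary files need
    download.file(url, destfile = destfile, mode = "wb")
  }
  # Return the local path invisibly so it can be passed on to a reader function
  invisible(destfile)
}

# Usage with the URL from above (commented out so that nothing is downloaded by accident):
# download_if_missing("https://gpipisbucket.s3.amazonaws.com/movie_metadata.csv",
#                     "movie_metadata.csv")
```

Returning the path makes it easy to chain the call, for example read.csv(download_if_missing(myURL, "movie_metadata.csv")).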


Get Data from JSON

Many web APIs return their data in JSON format. Let’s see how we can get them. We need the httr library.

library(httr)
# Get the url
url <- "http://www.omdbapi.com/?apikey=72bc447a&t=Annie+Hall&y=&plot=short&r=json"
resp <- GET(url)

# Store it to myresults
myresults<-content(resp)

myresults
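Instead of typing the query string by hand, the request URL can also be assembled from a named list, which takes care of escaping special characters. The following is a minimal sketch using modify_url from httr. The key "YOUR_API_KEY" is a placeholder, not a working key.

```r
library(httr)

# Placeholder key: replace with your own OMDb API key
my_key <- "YOUR_API_KEY"

# Build the request URL from a named list; httr escapes the values for us
url <- modify_url(
  "http://www.omdbapi.com/",
  query = list(apikey = my_key, t = "Annie Hall", plot = "short", r = "json")
)

url
```

GET accepts the same kind of list through its query argument, so GET("http://www.omdbapi.com/", query = list(...)) is an equivalent way to send the request.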
 

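Notice that in the content function you can choose the output form with the as argument ("raw", "text" or "parsed") and override the MIME type with the type argument, e.g. "application/json".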
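To see what the parsing step does, you can retrieve the response as text, e.g. content(resp, as = "text"), and parse the string yourself with the jsonlite package. The sketch below does this offline. The JSON string is made up for illustration and is not a real API response.

```r
library(jsonlite)

# Illustrative JSON text, not a real API response
json_text <- '{"Title": "Example Movie", "Year": "2000", "Genres": ["Comedy", "Drama"]}'

# fromJSON turns the JSON text into an R list
parsed <- fromJSON(json_text)

# Access single fields by name
parsed$Title
parsed$Genres
```

Working with the text form is handy when you want to store the raw JSON or control how arrays are simplified into vectors and data frames.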


Get Data from S3 to R

You can also get data from S3 provided that you know the access_key_id and the secret_access_key. You will need to work with the aws.s3 library:

library(aws.s3)
Sys.setenv("AWS_ACCESS_KEY_ID" = "xxxxxxx",
           "AWS_SECRET_ACCESS_KEY" = "xxxxxxx")
 
 
# you need your path and your bucket
obj <- get_object("path", bucket = "my_bucket")
 
 
df=read.csv(text = rawToChar(obj), sep=",", header = FALSE)
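The last line works because get_object returns the file as a raw vector, which rawToChar turns into one string that read.csv can parse through its text argument. The sketch below reproduces this step without an S3 connection. The raw vector and its content are made up for illustration, and header = TRUE is used because this stand-in has a header row.

```r
# Stand-in for the raw vector returned by get_object(); the content is made up
obj <- charToRaw("title,year\nMovie A,2001\nMovie B,2005\n")

# Convert the raw bytes to a single string, then parse it as csv text
csv_text <- rawToChar(obj)
df <- read.csv(text = csv_text, header = TRUE)

str(df)
```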
 

Get Data from Hive to R

Assume that your data are stored in Hive under Hadoop. You need to install the RJDBC and rJava packages.

Then you can follow these steps:

# set the maximum memory (must happen before the JVM is started)
options(java.parameters = "-Xmx8000m")

library(RJDBC)
library(rJava)

# start the JVM
.jinit()

# add classpath
for(l in list.files('/opt/hivejdbc/')){ .jaddClassPath(paste("/opt/hivejdbc/",l,sep=""))}

#load driver
drv <- JDBC("com.cloudera.hive.jdbc4.HS2Driver","/opt/hivejdbc/HiveJDBC4.jar",
            identifier.quote="`")


conn <- dbConnect(drv, "jdbc:hive2://path/my_data_base", "username", "password")

# show_databases <- dbGetQuery(conn, "show databases")
 

 
my_table <- dbGetQuery(conn, "select * from  my_data_base.my_table")
  
