Hadoop Installation:
- First, log in to Amazon Web Services and create a new virtual machine with the EC2 option. Select Ubuntu (free-tier eligible) as the OS, choose "t2.medium" as the instance type, and complete the remaining steps such as configuration details, security groups and key pairs.
- Locate your private key file (key_pair_bigdata.pem). The connect wizard automatically detects the key you used to launch the instance. (Use FileZilla to transfer files between Windows and the Ubuntu instance in either direction, and PuTTY for SSH access.)
- Your key must not be publicly viewable for SSH to work. Use this command if needed: chmod 400 key_pair_bigdata.pem
- Connect to your instance using its Public DNS:
ssh -i "key_pair_bigdata.pem" ubuntu@instancename
- Install base packages (java 8)
# sudo add-apt-repository ppa:webupd8team/java
# sudo apt-get update
# sudo apt-get install oracle-java8-installer
# java -version
- Using FileZilla, transfer hadoop-version.tar.gz to the instance, then extract it:
# tar -xf hadoop-version.tar.gz (extract the file)
- Add the Hadoop related environment variables in your bash file.
#nano ~/.bashrc
Eg: export HADOOP_HOME=$HOME/hadoop-2.8.0
export HADOOP_CONF_DIR=$HOME/hadoop-2.8.0/etc/hadoop
export HADOOP_MAPRED_HOME=$HOME/hadoop-2.8.0
export HADOOP_COMMON_HOME=$HOME/hadoop-2.8.0
export HADOOP_HDFS_HOME=$HOME/hadoop-2.8.0
export YARN_HOME=$HOME/hadoop-2.8.0
export PATH=$PATH:$HOME/hadoop-2.8.0/bin
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export PATH=/usr/lib/jvm/java-8-oracle/bin:$PATH
- Save and exit and use this command to refresh the bash settings.
# source ~/.bashrc
- Set up the Hadoop environment for passwordless SSH access. Passwordless SSH configuration is a mandatory installation requirement; it is even more useful in a distributed environment.
# ssh-keygen -t rsa -P ""
# cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
-> Modify the config file
# sudo vim /etc/ssh/sshd_config
-> Find the line starting with PasswordAuthentication and make sure it is set to yes:
PasswordAuthentication yes
-> Save the config file, then restart the ssh service for the change to take effect.
# sudo service ssh restart
-> check password less ssh access to localhost
# ssh localhost
-> exit from inner localhost shell
# exit
- Check the Hadoop version
# hadoop version
- Set the hadoop config files. We need to set the below files in order for hadoop to function properly.
• core-site.xml
• hadoop-env.sh
• yarn-site.xml
• hdfs-site.xml
• mapred-site.xml
-> Go to the directory where all the config files are present (cd /home/ubuntu/hadoop-2.8.0/etc/hadoop)
• Copy and paste the below configurations in core-site.xml
# nano core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/ubuntu/hadoop/hadooptmp/hadoop-${user.name}</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://ec2-52-59-253-110.eu-central-1.compute.amazonaws.com:8020</value>
</property>
<property>
<name>hadoop.proxyuser.ubuntu.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.ubuntu.hosts</name>
<value>*</value>
</property>
</configuration>
• Copy and paste the below configurations in mapred-site.xml
# cp mapred-site.xml.template mapred-site.xml
# nano mapred-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>ec2-54-93-105-1.eu-central-1.compute.amazonaws.com:9001</value>
</property>
</configuration>
• Copy and paste the below configurations in yarn-site.xml
<?xml version="1.0">
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
• Copy and paste the below configurations in hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions.enabled</name>
<value>false</value>
</property>
</configuration>
-> Need to set JAVA_HOME in hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
- Format the HDFS file system via the NameNode (after installing Hadoop, the HDFS file system must be formatted once, before first use).
Eg: # mkdir hadoop
# cd hadoop
# mkdir hadoopdata
# mkdir hadooptmp
# cd ~/hadoop-2.8.0/bin
# hadoop namenode -format
- To start Hadoop issue the following Commands:
Eg: # cd hadoop-2.8.0/sbin
# ./start-all.sh (or)
# ./start-dfs.sh && ./start-yarn.sh
- Check the Hadoop processes/daemons running with the Java Virtual Machine Process Status Tool (jps).
# jps
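If all daemons started correctly, the output should list one line per daemon, roughly like this (process IDs will differ):
Eg: 2481 NameNode
2678 DataNode
2890 SecondaryNameNode
3051 ResourceManager
3360 NodeManager
3584 Jps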
Hive Installation:
- Using FileZilla, transfer apache-hive-version-bin.tar.gz to the instance, then extract it:
# tar -xf apache-hive-version-bin.tar.gz (extract the file)
- Edit the ".bashrc" file to update the environment variables.
Also make sure that the Hadoop path is set (as above).
export HIVE_HOME=$HOME/apache-hive-2.1.0-bin
export PATH=$PATH:$HOME/apache-hive-2.1.0-bin/bin
- Run the below command to make the changes work in same terminal
# source ~/.bashrc
- Check Hive version
# hive --version
- Create Hive directories within HDFS. The directory ‘warehouse’ is the location to store the table or data related to hive.
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -mkdir /tmp
- Set read/write permissions for the tables; we have to give write permission to the group:
# hdfs dfs -chmod g+w /user/hive/warehouse
# hdfs dfs -chmod g+w /tmp
- Set the Hadoop path in hive-env.sh (created from the template in the conf directory); a sample entry is shown below.
# cd apache-hive-2.1.0-bin/conf
# cp hive-env.sh.template hive-env.sh
# nano hive-env.sh
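Eg: (assuming Hadoop is installed under $HOME/hadoop-2.8.0 as above)
export HADOOP_HOME=$HOME/hadoop-2.8.0
export HIVE_CONF_DIR=$HOME/apache-hive-2.1.0-bin/conf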
- Edit hive-site.xml (create it in the conf directory if it does not exist); a minimal example follows.
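Eg: a minimal hive-site.xml for the default embedded Derby metastore might look like this (adjust values for your setup):
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby:;databaseName=metastore_db;create=true</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>
</configuration>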
- By default, Hive uses the Derby database; initialize it with schematool. If the metastore is set to MySQL, PostgreSQL or another database rather than Derby, pass that database type instead:
# cd apache-hive-2.1.0-bin/bin
# ./schematool -initSchema -dbType derby (or mysql, postgres, etc.)
- Start the hive metastore server and hive server as follows:
# hive --service metastore &
# hive --service hiveserver2 --hiveconf hive.server2.thrift.port=10000 --hiveconf hive.root.logger=INFO,console &
- If we want to connect to Hive through Beeline, proceed as follows:
# cd apache-hive-2.1.0-bin/bin
# beeline
# !connect jdbc:hive2://ec2-18-195-20-124.eu-central-1.compute.amazonaws.com:10000
Whenever we want to start Hive, we first have to format the NameNode and start the DataNode, Secondary NameNode and Resource Manager.
Issue the commands as follows:
$ clear
$ cd hadoop
$ sudo rm -rf hadoopdata
$ sudo rm -rf hadooptmp
$ cd ~/hadoop-2.8.0/bin
$ hadoop namenode -format
$ cd ~/hadoop-2.8.0/sbin
$ ./start-all.sh
$ cd ~/apache-hive-2.1.0-bin/bin
$ hive --service metastore &
$ hive --service hiveserver2 --hiveconf hive.server2.thrift.port=10000 --hiveconf hive.root.logger=INFO,console &
Spark Installation
- First check the Java, Hadoop and Hive versions.
- Use FileZilla, transfer the spark-version-bin-hadoopversion.tar.gz i.e., pre built for hadoop
- Then extract the tar.gz file with tar -xf spark-version-bin-hadoopversion.tar.gz
- Then change the ~/.bashrc file with the following properties
export SPARK_HOME=$HOME/spark-version-bin-hadoopversion
export PATH=$PATH:$HOME/spark-version-bin-hadoopversion/bin
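- Refresh the bash settings and verify that Spark is on the PATH (the version strings below are whatever you downloaded):
# source ~/.bashrc
# spark-submit --version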
- Then start the Hadoop NameNode, DataNode, Secondary NameNode and Resource Manager.
- Start the spark nodes as follows
# cd spark-2.2.1-bin-hadoop2.7/sbin
# ./start-master.sh
# ./start-slaves.sh spark://ip-172-31-45-215.eu-central-1.compute.internal:7077
- Now we have to start the metastore of hive with
# hive --service metastore &
- Then start the Hive Thrift server with a Thrift port other than the default '10000', i.e. '10001':
# hive --service hiveserver2 --hiveconf hive.server2.thrift.port=10001 --hiveconf hive.root.logger=INFO,console
- Now start the spark thrift server as follows(in sbin folder)
# ./start-thriftserver.sh --master spark://ip-172-31-45-215.eu-central-1.compute.internal:7077 \
--hiveconf hive.server2.thrift.port=10001 --hiveconf hive.root.logger=INFO,console
- Now the Spark Thrift Server is running at port number 10001.
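- To verify, you can connect Beeline to the Spark Thrift Server on that port (the hostname below is just an example; use your instance's address):
# cd apache-hive-2.1.0-bin/bin
# beeline
# !connect jdbc:hive2://localhost:10001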
Postgres Installation:
CREATE ROLE biplus_user WITH LOGIN CREATEDB ENCRYPTED PASSWORD 'Gokulam@123';
postgres-# \q
CREATE DATABASE biplus;
create user biplus_user with password 'password';
CREATE USER biplus_user WITH LOGIN SUPERUSER CREATEDB CREATEROLE INHERIT NOREPLICATION PASSWORD 'Gokulam@123';
GRANT ALL PRIVILEGES ON DATABASE postgres TO biplus_user;
grant usage on schema <schema_name> to biplus_user;
grant select on all tables in schema <schema_name> to biplus_user;
- The first command grants the user access to the schema (required before any object in it can be used).
- The second command grants SELECT on all existing tables in that schema to the user.
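For example, for the 'pragmatic' schema used further below (the schema name is only an illustration):
grant usage on schema pragmatic to biplus_user;
grant select on all tables in schema pragmatic to biplus_user;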
sudo service postgresql restart
sudo -i -u postgres (or) sudo su - postgres
psql -U postgres biplus
psql
show search_path;
set search_path='pragmatic';
CREATE TABLE schema.table_new AS TABLE schema.table_existing;
SELECT * FROM pg_stat_activity WHERE state != 'idle';
SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '16 minutes' and state != 'idle' and query like '%pragmatic.%';
SELECT pg_database_size('postgres');
SELECT pg_size_pretty(pg_database_size('postgres'));
SELECT pg_size_pretty(pg_total_relation_size('pragmatic_group.daily_revenue_summary'));
SELECT pg_size_pretty(pg_relation_size('pragmatic_group.daily_revenue_summary')); -- main relation only, excludes indexes
Mysql Installation:
select user, host from mysql.user;
1. drop USER 'root'@'localhost';
flush privileges;
CREATE USER 'root'@'localhost' IDENTIFIED BY 'Gokulam@123';
grant select on my_database.* to 'bi_user'@'%';
GRANT ALL PRIVILEGES ON *.* TO 'root'@'localhost';
2. drop USER 'root'@'%';
flush privileges;
CREATE USER 'root'@'%' IDENTIFIED BY 'Gokulam@123';
GRANT ALL PRIVILEGES ON *.* TO 'root'@'%';
flush privileges;
SELECT user,authentication_string,plugin,host FROM mysql.user;
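For reference, the read-only 'bi_user' referenced above could be created like this (user, password and database name are placeholders):
CREATE USER 'bi_user'@'%' IDENTIFIED BY 'password';
GRANT SELECT ON my_database.* TO 'bi_user'@'%';
FLUSH PRIVILEGES;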
Vertica on Amazon EC2 Installation
- Check the java version
# java -version
- Use FileZilla for transferring the vertica_version_amd64.deb
- Issue the following commands
# sudo apt-get install dialog
# sudo dpkg -i vertica_9.3.0-0_amd64.deb
# sudo nano /etc/debian_version
- Change the version to stretch
# sudo /opt/vertica/sbin/install_vertica --hosts <instance-IP> -u dbadmin -password-disabled --ssh-identity ~/key_pair.pem --failure-threshold NONE
For QA (192.168.32.20)
sudo /opt/vertica/sbin/install_vertica --hosts 192.168.32.20 -u dbadmin -password-disabled --failure-threshold NONE
pragmaticplay:
sudo /opt/vertica/sbin/install_vertica --hosts 195.201.187.148 -u dbadmin -password-disabled --failure-threshold NONE
Troubleshoot:
ulimit -n (check whether the value is 65536 or not)
sudo vi /etc/security/limits.conf (Add below lines)
* soft nofile 65536
* hard nofile 65536
Log out and log back in to the terminal
sysctl -p
- Log in as ‘dbadmin’
sudo -su dbadmin (log in as dbadmin)
Example o/p: dbadmin@firstpin-Server:~$
- /opt/vertica/bin/admintools
- Follow the screen
- Run Vertica by creating a New Database.
- The database must be started every time before querying, i.e. starting the database is compulsory in order to run Vertica queries.
CREATE SCHEMA pp_vertica;
CREATE USER biuser IDENTIFIED BY 'Gokulam@123';
GRANT ALL ON SCHEMA pp_vertica to biuser;
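- To verify the new user and schema, connect with vsql (assuming the database is named 'vertica', as in the commands below):
/opt/vertica/bin/vsql -h localhost -U biuser -w 'Gokulam@123' -d vertica
\dn (list schemas; pp_vertica should appear)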
COMMANDS
SELECT GET_COMPLIANCE_STATUS();
Connect to verticadb
/opt/vertica/bin/admintools -t connect_db -d vertica
CONNECT TO VERTICA vertica USER dbadmin PASSWORD 'password' ON 'host', port;
DDLs export from verticadb
SELECT EXPORT_OBJECTS( '/home/dbadmin/sql_objects_all.sql', '', 'true');
Execute the ddls to verticadb
\i '/home/dbadmin/sql_objects_all.sql'
COPY vertica.pp_vertica.test FROM vertica vertica.pp_vertica.test DIRECT;
CREATE TABLE pp_vertica.am_assignments_backup AS SELECT * FROM pp_vertica.am_assignments;
SELECT COPY_TABLE('pp_vertica.test', 'pp_vertica.test_copy');
source to destination data copy
./vsql -U dbadmin -w password -h sourceHost -d vertica -At -c "SELECT * from schema.table" \
| ./vsql -U dbadmin -w password -d vertica -c "COPY schema.table FROM STDIN DELIMITER '|';"
admintools -t list_allnodes
Cloudera Impala on Amazon Ec2 Installation (Redhat Linux)
To install Cloudera Impala on Red Hat (version 7), follow this link:
The detailed information and list of commands for installing and configuring Cloudera Impala are as follows:
- First we have to create an EC2 instance of type t2.medium or larger (not less than t2.medium) and complete the usual creation steps.
- Next, follow the same steps as in the link:
- Now we have to install the JDK, i.e. Java (OpenJDK), using the command
root@~] # yum install java-1.6.0-openjdk* (run this as the root user) or
ec2-user@ip172…]$ sudo yum install java-1.6.0-openjdk*
- Next we have to run the following commands:
#cd /etc/yum.repos.d/
#nano cdh.repo (this is the Cloudera Hadoop repository used to download Hadoop, Hive and Impala; paste into this file the repository definition given on the Cloudera site for your Red Hat or CentOS version)
#yum install hadoop-hdfs-namenode
#yum install hadoop-hdfs-datanode
#cd /etc/hadoop/conf
#nano core-site.xml (and likewise hdfs-site.xml, yarn-site.xml and mapred-site.xml)
Hdfs-site.xml and Mapred-site.xml need the cluster-specific settings; minimal examples are sketched below.
yarn-site.xml need not be changed if it is already configured.
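Eg: minimal single-node settings (these values are only a sketch; adjust them to your cluster):
core-site.xml:
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:8020</value>
</property>
hdfs-site.xml:
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
mapred-site.xml:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>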
#sudo -u hdfs hadoop namenode -format
# /etc/init.d/hadoop-hdfs-namenode start / status (to check whether it is running or not )
# /etc/init.d/hadoop-hdfs-datanode start
# sudo -u hdfs hadoop fs -mkdir /user/
# sudo -u hdfs hadoop fs -mkdir /user/ec2-user
# sudo -u hdfs hadoop fs -chown ec2-user /user/ec2-user
# yum install hadoop-yarn-resourcemanager
# yum install hadoop-yarn-nodemanager
# yum install hadoop-mapreduce-historyserver
# yum install hadoop-yarn-proxyserver
# yum install hadoop-client
# yum install hadoop-mapreduce
# /etc/init.d/hadoop-yarn-nodemanager start / status (to check whether it is running or not)
# /etc/init.d/hadoop-yarn-resourcemanager start
- Run the jps command to check whether all the Hadoop daemons are running or not.
# yum install hive
# yum install hive-server2 hive-metastore
#nano hive-site.xml
Hive-site.xml needs the metastore settings; a minimal example is sketched below.
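Eg: at a minimum, Impala needs hive.metastore.uris pointing at the Hive metastore (the host and port below are the usual defaults; adjust as needed):
<property>
<name>hive.metastore.uris</name>
<value>thrift://localhost:9083</value>
</property>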
# /etc/init.d/hive-metastore start / status (to check whether it is running or not)
# /etc/init.d/hive-server2 start
# sudo -u hdfs hadoop fs -mkdir -p /user/hive/warehouse
# sudo -u hdfs hadoop fs -mkdir -p /tmp
# sudo -u hdfs hadoop fs -chmod -R 777 /user/hive/warehouse
# sudo -u hdfs hadoop fs -chmod -R 777 /tmp
# yum install impala-server
# yum install impala-state-store
# yum install impala-catalog
# ln -s /etc/hadoop/conf/core-site.xml /etc/impala/conf/core-site.xml
# ln -s /etc/hadoop/conf/hdfs-site.xml /etc/impala/conf/hdfs-site.xml
# ln -s /etc/hive/conf/hive-site.xml /etc/impala/conf/hive-site.xml
#yum install impala-shell
# /etc/init.d/impala-state-store start / status (to check whether it is running or not)
# /etc/init.d/impala-catalog start
#/etc/init.d/impala-server start
#impala-shell
The shell then connects to the local Impala daemon, confirming that the Impala server is running.
PrestoDB on Amazon EC2 Installation
- First we have to select the connector that PrestoDB will run against; the Hive connector is the most commonly used.
- For the Hive connector we have to install Hadoop and Hive on the instance. If the connector is not Hive, there is no need to install them.
- Install python3.6 using the commands as follows:
# sudo add-apt-repository ppa:jonathonf/python-3.6
# sudo apt-get update
# sudo apt-get install python3.6
- Check the python version
# python3.6 -V
- Now unzip the presto-server-0.194.tar.gz or newer version file using
# tar -xf presto-server-0.194.tar.gz
The following URL shows how to configure PrestoDB; a minimal configuration sketch (with assumed values) is also given below:
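Eg: the standard layout is an etc/ directory inside the Presto install containing these files (the values shown are only an illustration for a single-node setup):
etc/node.properties:
node.environment=production
node.id=presto-node-1
node.data-dir=/home/ubuntu/presto-data
etc/jvm.config:
-server
-Xmx4G
etc/config.properties:
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://localhost:8080
etc/catalog/hive.properties:
connector.name=hive-hadoop2
hive.metastore.uri=thrift://localhost:9083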
Run the presto with command:
# sudo bin/launcher start
Then follow the procedure in the link to connect to Presto from the command-line interface: