Hadoop Installation:
- First, log in to Amazon Web Services and create a new virtual machine with the EC2 option. Select Ubuntu (free-tier eligible) as the OS, choose "t2.medium" as the instance type, and complete the remaining steps such as configuration details, security groups and key pairs.
- Locate your private key file (key_pair_bigdata.pem). The connect wizard automatically detects the key you used to launch the instance. (Use FileZilla to transfer files between Windows and the Ubuntu instance in either direction, and PuTTY for SSH access.)
- Your key must not be publicly viewable for SSH to work. Use this command if needed: chmod 400 key_pair_bigdata.pem
- Connect to your instance using its Public DNS:
ssh -i "key_pair_bigdata.pem" ubuntu@instancename
- Install base packages (java 8)
# sudo add-apt-repository ppa:webupd8team/java
# sudo apt-get update
# sudo apt-get install oracle-java8-installer
# java -version
- Using FileZilla, transfer hadoop-version.tar.gz to the instance, then extract it:
# tar -xf hadoop-version.tar.gz (extract the file)
- Add the Hadoop related environment variables in your bash file.
#nano ~/.bashrc
Eg: export HADOOP_HOME=$HOME/hadoop-2.8.0
export HADOOP_CONF_DIR=$HOME/hadoop-2.8.0/etc/hadoop
export HADOOP_MAPRED_HOME=$HOME/hadoop-2.8.0
export HADOOP_COMMON_HOME=$HOME/hadoop-2.8.0
export HADOOP_HDFS_HOME=$HOME/hadoop-2.8.0
export YARN_HOME=$HOME/hadoop-2.8.0
export PATH=$PATH:$HOME/hadoop-2.8.0/bin
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export PATH=/usr/lib/jvm/java-8-oracle/bin:$PATH
- Save and exit and use this command to refresh the bash settings.
# source ~/.bashrc
- Set up the Hadoop environment for passwordless SSH access. Passwordless SSH configuration is a mandatory installation requirement; it is even more useful in a distributed environment.
# ssh-keygen -t rsa -P ""
# cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
-> Modify the config file
# sudo vim /etc/ssh/sshd_config
-> Find the line starting with PasswordAuthentication and make sure it is set to yes:
PasswordAuthentication yes
-> Save the config file, then restart the ssh service for the change to take effect.
# sudo service ssh restart
-> check password less ssh access to localhost
# ssh localhost
-> exit from inner localhost shell
# exit
- Check the Hadoop version
# hadoop version
- Set the hadoop config files. We need to set the below files in order for hadoop to function properly.
• core-site.xml
• hadoop-env.sh
• yarn-site.xml
• hdfs-site.xml
• mapred-site.xml
-> Go to the directory where all the config files are present (cd /home/ubuntu/hadoop-2.8.0/etc/hadoop)
• Copy and paste the below configurations in core-site.xml
# nano core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/ubuntu/hadoop/hadooptmp/hadoop-${user.name}</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://ec2-52-59-253-110.eu-central-1.compute.amazonaws.com:8020</value>
</property>
<property>
<name>hadoop.proxyuser.ubuntu.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.ubuntu.hosts</name>
<value>*</value>
</property>
</configuration>
• Copy and paste the below configurations in mapred-site.xml
# cp mapred-site.xml.template mapred-site.xml
# nano mapred-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>ec2-54-93-105-1.eu-central-1.compute.amazonaws.com:9001</value>
</property>
</configuration>
• Copy and paste the below configurations in yarn-site.xml
<?xml version="1.0">
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
• Copy and paste the below configurations in hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions.enabled</name>
<value>false</value>
</property>
</configuration>
-> Need to set JAVA_HOME in hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
- Format the HDFS file system via the NameNode (after installing Hadoop, the HDFS file system must be formatted once, before first use).
Eg: # mkdir hadoop
# cd hadoop
# mkdir hadoopdata
# mkdir hadooptmp
# cd ~/hadoop-2.8.0/bin
# hadoop namenode -format
- To start Hadoop issue the following Commands:
Eg: # cd hadoop-2.8.0/sbin
# ./start-all.sh (or)
# ./start-dfs.sh && ./start-yarn.sh
- Check the Hadoop processes/daemons running with the Java Virtual Machine Process Status Tool (jps).
# jps
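If all daemons started correctly, the output should list one line per daemon, roughly like this (process IDs will differ):
Eg: 2481 NameNode
2678 DataNode
2890 SecondaryNameNode
3051 ResourceManager
3360 NodeManager
3584 Jps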
Hive Installation:
- Using FileZilla, transfer apache-hive-version-bin.tar.gz to the instance, then extract it:
# tar -xf apache-hive-version-bin.tar.gz (extract the file)
- Edit the ".bashrc" file to update the environment variables.
Also make sure that the Hadoop path is set (as above).
export HIVE_HOME=$HOME/apache-hive-2.1.0-bin
export PATH=$PATH:$HOME/apache-hive-2.1.0-bin/bin
- Run the below command to make the changes work in same terminal
# source ~/.bashrc
- Check Hive version
# hive --version
- Create Hive directories within HDFS. The directory ‘warehouse’ is the location to store the table or data related to hive.
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -mkdir /tmp
- Set read/write permissions for the tables; we have to give write permission to the group:
# hdfs dfs -chmod g+w /user/hive/warehouse
# hdfs dfs -chmod g+w /tmp
- Set the Hadoop path in hive-env.sh (created from the template in the conf directory); a sample entry is shown below.
# cd apache-hive-2.1.0-bin/conf
# cp hive-env.sh.template hive-env.sh
# nano hive-env.sh
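Eg: (assuming Hadoop is installed under $HOME/hadoop-2.8.0 as above)
export HADOOP_HOME=$HOME/hadoop-2.8.0
export HIVE_CONF_DIR=$HOME/apache-hive-2.1.0-bin/conf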
- Edit hive-site.xml (create it in the conf directory if it does not exist); a minimal example follows.
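Eg: a minimal hive-site.xml for the default embedded Derby metastore might look like this (adjust values for your setup):
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby:;databaseName=metastore_db;create=true</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>
</configuration>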
- By default, Hive uses the Derby database; initialize it with schematool. If the metastore is set to MySQL, PostgreSQL or another database rather than Derby, pass that database type instead:
# cd apache-hive-2.1.0-bin/bin
# ./schematool -initSchema -dbType derby (or mysql, postgres, etc.)
- Start the hive metastore server and hive server as follows:
# hive --service metastore &
# hive --service hiveserver2 --hiveconf hive.server2.thrift.port=10000 --hiveconf hive.root.logger=INFO,console &
- If we want to connect to Hive through Beeline, proceed as follows:
# cd apache-hive-2.1.0-bin/bin
# beeline
# !connect jdbc:hive2://ec2-18-195-20-124.eu-central-1.compute.amazonaws.com:10000
Whenever we want to start Hive, we first have to format the NameNode and start the DataNode, Secondary NameNode and Resource Manager.
Issue the commands as follows:
$ clear
$ cd hadoop
$ sudo rm -rf hadoopdata
$ sudo rm -rf hadooptmp
$ cd ~/hadoop-2.8.0/bin
$ hadoop namenode -format
$ cd ~/hadoop-2.8.0/sbin
$ ./start-all.sh
$ cd ~/apache-hive-2.1.0-bin/bin
$ hive --service metastore &
$ hive --service hiveserver2 --hiveconf hive.server2.thrift.port=10000 --hiveconf hive.root.logger=INFO,console &
Spark Installation
- First check the Java, Hadoop and Hive versions.
- Use FileZilla, transfer the spark-version-bin-hadoopversion.tar.gz i.e., pre built for hadoop
- Then extract the tar.gz file with tar -xf spark-version-bin-hadoopversion.tar.gz
- Then change the ~/.bashrc file with the following properties
export SPARK_HOME=$HOME/spark-version-bin-hadoopversion
export PATH=$PATH:$HOME/spark-version-bin-hadoopversion/bin
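- Refresh the bash settings and verify that Spark is on the PATH (the version strings below are whatever you downloaded):
# source ~/.bashrc
# spark-submit --version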
- Then start the Hadoop NameNode, DataNode, Secondary NameNode and Resource Manager.
- Start the spark nodes as follows
# cd spark-2.2.1-bin-hadoop2.7/sbin
# ./start-master.sh
# ./start-slaves.sh spark://ip-172-31-45-215.eu-central-1.compute.internal:7077
- Now we have to start the metastore of hive with
# hive --service metastore &
- Then start the Hive Thrift server with a Thrift port other than the default '10000', i.e. '10001':
# hive --service hiveserver2 --hiveconf hive.server2.thrift.port=10001 --hiveconf hive.root.logger=INFO,console
- Now start the spark thrift server as follows(in sbin folder)
# ./start-thriftserver.sh --master spark://ip-172-31-45-215.eu-central-1.compute.internal:7077 \
--hiveconf hive.server2.thrift.port=10001 --hiveconf hive.root.logger=INFO,console
- Now the Spark Thrift Server is running at port number 10001.
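- To verify, you can connect Beeline to the Spark Thrift Server on that port (the hostname below is just an example; use your instance's address):
# cd apache-hive-2.1.0-bin/bin
# beeline
# !connect jdbc:hive2://localhost:10001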
Postgres Installation:
CREATE ROLE biplus_user WITH LOGIN CREATEDB ENCRYPTED PASSWORD 'Gokulam@123';
postgres-# \q
CREATE DATABASE biplus;
create user biplus_user with password 'password';
CREATE USER biplus_user WITH LOGIN SUPERUSER CREATEDB CREATEROLE INHERIT NOREPLICATION PASSWORD 'Gokulam@123';
GRANT ALL PRIVILEGES ON DATABASE postgres TO biplus_user;
grant usage on schema <schema_name> to biplus_user;
grant select on all tables in schema <schema_name> to biplus_user;
- The first command grants the user access to the schema (required before any object in it can be used).
- The second command grants SELECT on all existing tables in that schema to the user.
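For example, for the 'pragmatic' schema used further below (the schema name is only an illustration):
grant usage on schema pragmatic to biplus_user;
grant select on all tables in schema pragmatic to biplus_user;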
sudo service postgresql restart
sudo -i -u postgres (or) sudo su - postgres
psql -U postgres biplus
psql
show search_path;
set search_path='pragmatic';
CREATE TABLE schema.table_new AS TABLE schema.table_existing;
SELECT * FROM pg_stat_activity WHERE state != 'idle';
SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '16 minutes' and state != 'idle' and query like '%pragmatic.%';
SELECT pg_database_size('postgres');
SELECT pg_size_pretty(pg_database_size('postgres'));
SELECT pg_size_pretty(pg_total_relation_size('pragmatic_group.daily_revenue_summary'));
SELECT pg_size_pretty(pg_relation_size('pragmatic_group.daily_revenue_summary')); -- main relation only, excludes indexes
Mysql Installation:
select user, host from mysql.user;
1. drop USER 'root'@'localhost';
flush privileges;
CREATE USER 'root'@'localhost' IDENTIFIED BY 'Gokulam@123';
grant select on my_database.* to 'bi_user'@'%';
GRANT ALL PRIVILEGES ON *.* TO 'root'@'localhost';
2. drop USER 'root'@'%';
flush privileges;
CREATE USER 'root'@'%' IDENTIFIED BY 'Gokulam@123';
GRANT ALL PRIVILEGES ON *.* TO 'root'@'%';
flush privileges;
SELECT user,authentication_string,plugin,host FROM mysql.user;
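For reference, the read-only 'bi_user' referenced above could be created like this (user, password and database name are placeholders):
CREATE USER 'bi_user'@'%' IDENTIFIED BY 'password';
GRANT SELECT ON my_database.* TO 'bi_user'@'%';
FLUSH PRIVILEGES;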
Vertica on Amazon EC2 Installation
- Check the java version
# java -version
- Use FileZilla for transferring the vertica_version_amd64.deb
- Issue the following commands
# sudo apt-get install dialog
# sudo dpkg -i vertica_9.3.0-0_amd64.deb
# sudo nano /etc/debian_version
- Change the version to stretch
# sudo /opt/vertica/sbin/install_vertica --hosts <instance-IP> -u dbadmin -password-disabled --ssh-identity ~/key_pair.pem --failure-threshold NONE
For QA (192.168.32.20)
sudo /opt/vertica/sbin/install_vertica --hosts 192.168.32.20 -u dbadmin -password-disabled --failure-threshold NONE
pragmaticplay:
sudo /opt/vertica/sbin/install_vertica --hosts 195.201.187.148 -u dbadmin -password-disabled --failure-threshold NONE
Troubleshoot:
ulimit -n (check whether the value is 65536 or not)
sudo vi /etc/security/limits.conf (Add below lines)
* soft nofile 65536
* hard nofile 65536
Log out and log back in to the terminal
sysctl -p
- Log in as ‘dbadmin’
sudo -su dbadmin (log in as dbadmin)
Example o/p: dbadmin@firstpin-Server:~$
- /opt/vertica/bin/admintools
- Follow the screen
- Run Vertica by creating a New Database.
- The database must be started every time before querying, i.e. starting the database is compulsory in order to run Vertica queries.
CREATE SCHEMA pp_vertica;
CREATE USER biuser IDENTIFIED BY 'Gokulam@123';
GRANT ALL ON SCHEMA pp_vertica to biuser;
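- To verify the new user and schema, connect with vsql (assuming the database is named 'vertica', as in the commands below):
/opt/vertica/bin/vsql -h localhost -U biuser -w 'Gokulam@123' -d vertica
\dn (list schemas; pp_vertica should appear)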
COMMANDS
SELECT GET_COMPLIANCE_STATUS();
Connect to verticadb
/opt/vertica/bin/admintools -t connect_db -d vertica
CONNECT TO VERTICA vertica USER dbadmin PASSWORD 'password' ON 'host', port;
DDLs export from verticadb
SELECT EXPORT_OBJECTS( '/home/dbadmin/sql_objects_all.sql', '', 'true');
Execute the ddls to verticadb
\i '/home/dbadmin/sql_objects_all.sql'
COPY vertica.pp_vertica.test FROM vertica vertica.pp_vertica.test DIRECT;
CREATE TABLE pp_vertica.am_assignments_backup AS SELECT * FROM pp_vertica.am_assignments;
SELECT COPY_TABLE('pp_vertica.test', 'pp_vertica.test_copy');
source to destination data copy
./vsql -U dbadmin -w password -h sourceHost -d vertica -At -c "SELECT * from schema.table" \
| ./vsql -U dbadmin -w password -d vertica -c "COPY schema.table FROM STDIN DELIMITER '|';"
admintools -t list_allnodes
Cloudera Impala on Amazon Ec2 Installation (Redhat Linux)
To install Cloudera Impala on Red Hat (version 7), follow this link:
The detailed information and list of commands for installing and configuring Cloudera Impala are as follows:
- First we have to create an EC2 instance of type t2.medium or larger (not less than t2.medium) and complete the usual creation steps.
- Next, follow the same steps as in the link:
- Now we have to install the JDK, i.e. Java (OpenJDK), using the command
root@~] # yum install java-1.6.0-openjdk* (run this as the root user) or
ec2-user@ip172…]$ sudo yum install java-1.6.0-openjdk*
- Next we have to run the following commands:
#cd /etc/yum.repos.d/
#nano cdh.repo (this is the Cloudera Hadoop repository used to download Hadoop, Hive and Impala; paste into this file the repository definition given on the Cloudera site for your Red Hat or CentOS version)
#yum install hadoop-hdfs-namenode
#yum install hadoop-hdfs-datanode
#cd /etc/hadoop/conf
#nano core-site.xml (and likewise hdfs-site.xml, yarn-site.xml and mapred-site.xml)
Hdfs-site.xml and Mapred-site.xml need the cluster-specific settings; minimal examples are sketched below.
yarn-site.xml need not be changed if it is already configured.
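Eg: minimal single-node settings (these values are only a sketch; adjust them to your cluster):
core-site.xml:
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:8020</value>
</property>
hdfs-site.xml:
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
mapred-site.xml:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>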
#sudo -u hdfs hadoop namenode -format
# /etc/init.d/hadoop-hdfs-namenode start / status (to check whether it is running or not )
# /etc/init.d/hadoop-hdfs-datanode start
# sudo -u hdfs hadoop fs -mkdir /user/
# sudo -u hdfs hadoop fs -mkdir /user/ec2-user
# sudo -u hdfs hadoop fs -chown ec2-user /user/ec2-user
# yum install hadoop-yarn-resourcemanager
# yum install hadoop-yarn-nodemanager
# yum install hadoop-mapreduce-historyserver
# yum install hadoop-yarn-proxyserver
# yum install hadoop-client
# yum install hadoop-mapreduce
# /etc/init.d/hadoop-yarn-nodemanager start / status (to check whether it is running or not)
# /etc/init.d/hadoop-yarn-resourcemanager start
- Run the jps command to check whether all the Hadoop daemons are running or not.
# yum install hive
# yum install hive-server2 hive-metastore
#nano hive-site.xml
Hive-site.xml needs the metastore settings; a minimal example is sketched below.
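Eg: at a minimum, Impala needs hive.metastore.uris pointing at the Hive metastore (the host and port below are the usual defaults; adjust as needed):
<property>
<name>hive.metastore.uris</name>
<value>thrift://localhost:9083</value>
</property>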
# /etc/init.d/hive-metastore start / status (to check whether it is running or not)
# /etc/init.d/hive-server2 start
# sudo -u hdfs hadoop fs -mkdir -p /user/hive/warehouse
# sudo -u hdfs hadoop fs -mkdir -p /tmp
# sudo -u hdfs hadoop fs -chmod -R 777 /user/hive/warehouse
# sudo -u hdfs hadoop fs -chmod -R 777 /tmp
# yum install impala-server
# yum install impala-state-store
# yum install impala-catalog
# ln -s /etc/hadoop/conf/core-site.xml /etc/impala/conf/core-site.xml
# ln -s /etc/hadoop/conf/hdfs-site.xml /etc/impala/conf/hdfs-site.xml
# ln -s /etc/hive/conf/hive-site.xml /etc/impala/conf/hive-site.xml
#yum install impala-shell
# /etc/init.d/impala-state-store start / status (to check whether it is running or not)
# /etc/init.d/impala-catalog start
#/etc/init.d/impala-server start
#impala-shell
The shell then connects to the local Impala daemon, confirming that the Impala server is running.
PrestoDB on Amazon EC2 Installation
- First we have to select the connector that PrestoDB will run against; the Hive connector is the most commonly used.
- For the Hive connector we have to install Hadoop and Hive on the instance. If the connector is not Hive, there is no need to install them.
- Install python3.6 using the commands as follows:
# sudo add-apt-repository ppa:jonathonf/python-3.6
# sudo apt-get update
# sudo apt-get install python3.6
- Check the python version
# python3.6 -V
- Now unzip the presto-server-0.194.tar.gz or newer version file using
# tar -xf presto-server-0.194.tar.gz
The following URL shows how to configure PrestoDB; a minimal configuration sketch (with assumed values) is also given below:
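Eg: the standard layout is an etc/ directory inside the Presto install containing these files (the values shown are only an illustration for a single-node setup):
etc/node.properties:
node.environment=production
node.id=presto-node-1
node.data-dir=/home/ubuntu/presto-data
etc/jvm.config:
-server
-Xmx4G
etc/config.properties:
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://localhost:8080
etc/catalog/hive.properties:
connector.name=hive-hadoop2
hive.metastore.uri=thrift://localhost:9083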
Run the presto with command:
# sudo bin/launcher start
Then follow the procedure in the link to connect to Presto from the command-line interface: