
Python, R, and SQL are often cited as the most-used languages for processing, modeling, and exploring data. While that may be true, there is no reason that other languages cannot be, or are not already being, used for that work.
The Bash shell is a Unix and Unix-like operating system shell, along with the commands and programming language that go with it. Bash scripts are programs written in this Bash shell scripting language. The scripts are executed sequentially by the Bash interpreter and can include all of the constructs typically found in other programming languages, including conditional statements, loops, and variables.
Common uses for Bash scripts include:
automating system administration tasks
performing backups and maintenance
parsing log files and other data
creating command-line tools and utilities
Bash scripting is also used to orchestrate the deployment and management of complex distributed systems, making it an incredibly useful skill in the arenas of data engineering, cloud computing environments, and DevOps.
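As a quick illustration of those constructs, here is a minimal sketch (the "ERROR" marker and the log file names are placeholders rather than anything from a real system) that uses variables, a loop, and a conditional to count error lines across a set of log files:
# Count the lines containing "ERROR" in each log file passed as an argument
for log_file in "$@"; do
    if [ -f "$log_file" ]; then
        error_count=$(grep -c "ERROR" "$log_file")
        echo "$log_file: $error_count error lines"
    else
        echo "Skipping $log_file: not a regular file" >&2
    fi
done
Saved as, say, count_errors.sh, it could be run as bash count_errors.sh app.log worker.log.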
In this article, we will take a look at five different data science-related, scripting-friendly tasks, where we should see just how flexible and useful Bash can be.
Clean and Format Raw Data
Here is an example Bash script for cleaning and formatting raw data files:
# Set the input and output file paths
input_file="raw_data.csv"
output_file="clean_data.csv"
# Remove any leading or trailing whitespace from each line
sed 's/^[ \t]*//;s/[ \t]*$//' "$input_file" > "$output_file"
# Temporarily replace the quoted field separators (",") with a placeholder
sed -i 's/","/__SEP__/g' "$output_file"
# Remove the quotes around each field
sed -i 's/"//g' "$output_file"
# Replace the placeholder with a plain comma separator
sed -i 's/__SEP__/,/g' "$output_file"
echo "Data cleaning and formatting complete. Output file: $output_file"
This script:
assumes that your raw data is in a CSV file called raw_data.csv
saves the cleaned data as clean_data.csv
uses the sed command to:
remove leading and trailing whitespace from each line
temporarily replace the quoted field separators (",") with a placeholder
remove the quotes around each field
replace the placeholder with a plain comma separator
prints a message indicating that the data cleaning and formatting is complete, along with the location of the output file
Automate Data Visualization
Here is an example Bash script for automating data visualization tasks:
# Set the input file path
input_file="data.csv"
# Create a line chart of column 1 vs column 2
gnuplot -e "set datafile separator ','; set term png; set output 'line_chart.png'; plot '$input_file' using 1:2 with lines"
# Create a bar chart of column 3
gnuplot -e "set datafile separator ','; set term png; set output 'bar_chart.png'; plot '$input_file' using 3:xtic(1) with boxes"
# Create a scatter plot of column 4 vs column 5
gnuplot -e "set datafile separator ','; set term png; set output 'scatter_plot.png'; plot '$input_file' using 4:5 with points"
echo "Data visualization complete. Output files: line_chart.png, bar_chart.png, scatter_plot.png"
The above script:
assumes that your data is in a CSV file called data.csv
uses the gnuplot command to create three different types of plots:
a line chart of column 1 vs column 2
a bar chart of column 3
a scatter plot of column 4 vs column 5
outputs the plots in PNG format, saving them as line_chart.png, bar_chart.png, and scatter_plot.png, respectively
prints a message indicating that the data visualization is complete and the location of the output files
Please note that for this script to work, you will need to adjust the column numbers and chart types to match your data and requirements.
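One way to make those adjustments without editing the script each time is to pass the input file and column numbers in as arguments. Here is a minimal sketch along those lines, assuming the first argument is the CSV file and the next two are the x and y column indices (the script name is just a placeholder):
# Usage: bash plot_columns.sh data.csv 1 2
input_file="$1"
x_col="$2"
y_col="$3"
# Plot the two chosen columns as a line chart
gnuplot -e "set datafile separator ','; set term png; set output 'line_chart.png'; plot '$input_file' using $x_col:$y_col with lines"
echo "Wrote line_chart.png from columns $x_col and $y_col of $input_file"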
Statistical Analysis
Here is an example Bash script for performing statistical analysis on a dataset:
# Set the input file path
input_file="data.csv"
# Set the output file path
output_file="statistics.txt"
# Use awk to calculate the mean of column 1
mean=$(awk -F',' '{sum+=$1} END {print sum/NR}' "$input_file")
# Use awk to calculate the standard deviation of column 1
stddev=$(awk -F',' '{sum+=$1; sumsq+=$1*$1} END {print sqrt(sumsq/NR - (sum/NR)^2)}' "$input_file")
# Append the results to the output file
echo "Mean of column 1: $mean" >> "$output_file"
echo "Standard deviation of column 1: $stddev" >> "$output_file"
# Use awk to calculate the mean of column 2
mean=$(awk -F',' '{sum+=$2} END {print sum/NR}' "$input_file")
# Use awk to calculate the standard deviation of column 2
stddev=$(awk -F',' '{sum+=$2; sumsq+=$2*$2} END {print sqrt(sumsq/NR - (sum/NR)^2)}' "$input_file")
# Append the results to the output file
echo "Mean of column 2: $mean" >> "$output_file"
echo "Standard deviation of column 2: $stddev" >> "$output_file"
echo "Statistical analysis complete. Output file: $output_file"
This script:
assumes that your data is in a CSV file called data.csv
uses the awk command to calculate the mean and standard deviation of two columns
treats the comma as the field separator
saves the results to a text file called statistics.txt
prints a message indicating that the statistical analysis is complete and the location of the output file
Note that you can add further awk commands to calculate other statistical values, or repeat the same calculations for additional columns.
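For example, a couple of extra awk one-liners in the same style (a sketch, not part of the script above) could append the minimum and maximum of column 1 to the same report:
# Use awk to find the minimum and maximum of column 1
min=$(awk -F',' 'NR==1 || $1+0 < min {min=$1+0} END {print min}' "$input_file")
max=$(awk -F',' 'NR==1 || $1+0 > max {max=$1+0} END {print max}' "$input_file")
echo "Minimum of column 1: $min" >> "$output_file"
echo "Maximum of column 1: $max" >> "$output_file"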
Manage Python Package Dependencies
Here is an example Bash script for managing and updating the dependencies and packages required for data science projects:
# Set the path of the virtual environment
venv_path="venv"
# Activate the virtual environment
source "$venv_path/bin/activate"
# Update pip
pip install --upgrade pip
# Install the required packages from requirements.txt
pip install -r requirements.txt
# Deactivate the virtual environment
deactivate
echo "Dependency and package management complete."
This script:
assumes that you have a virtual environment set up, along with a file named requirements.txt containing the package names and versions that you want to install
uses the source command to activate the virtual environment specified by the path venv_path
uses pip to upgrade pip to the latest version
installs the packages specified in the requirements.txt file
uses the deactivate command to deactivate the virtual environment after the packages are installed
prints a message indicating that the dependency and package management is complete
This script should be run every time you want to update your dependencies or install new packages for a data science project.
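For instance, saved under a hypothetical name such as update_env.sh, it can be run by hand whenever needed, or scheduled with cron (the paths and schedule below are placeholders):
# Run the update script manually
bash update_env.sh
# Or schedule it with cron (crontab -e), e.g. every Monday at 09:00
# 0 9 * * 1 bash /path/to/update_env.sh >> /path/to/update_env.log 2>&1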
Manage Jupyter Notebook Execution
Here is an example Bash script for automating the execution of Jupyter Notebook or other interactive data science environments:
# Set the path of the notebook file
notebook_file="analysis.ipynb"
# Set the path of the virtual environment
venv_path="venv"
# Activate the virtual environment
source "$venv_path/bin/activate"
# Start Jupyter Notebook and open the specified notebook
jupyter-notebook "$notebook_file"
# Deactivate the virtual environment once the notebook server is shut down
deactivate
echo "Jupyter Notebook execution complete."
The above script:
assumes that you have a virtual environment set up with Jupyter Notebook installed in it
uses the source command to activate the virtual environment specified by the path venv_path
uses the jupyter-notebook command to start Jupyter Notebook and open the specified notebook_file
uses the deactivate command to deactivate the virtual environment after the Jupyter Notebook server is shut down
prints a message indicating that the Jupyter Notebook execution is complete
This script should be run whenever you want to execute a Jupyter Notebook or another interactive data science environment.
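If you want the notebook to run from top to bottom without opening the browser interface at all, one option (not used in the script above) is Jupyter's nbconvert tool, which can execute a notebook non-interactively and save the executed copy, for example:
# Activate the environment, execute the notebook headlessly, and save the result
source venv/bin/activate
jupyter nbconvert --to notebook --execute analysis.ipynb --output analysis_executed.ipynb
deactivate
echo "Executed copy saved as analysis_executed.ipynb"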
I hope these simple scripts have been enough to show you the simplicity and power of scripting with Bash. It may not be your go-to solution for every situation, but it certainly has its place. Best of luck in your scripting.
Matthew Mayo (@mattmayo13) is a Data Scientist and the Editor-in-Chief of KDnuggets, the seminal online Data Science and Machine Learning resource. His interests lie in natural language processing, algorithm design and optimization, unsupervised learning, neural networks, and automated approaches to machine learning. Matthew holds a Master's degree in computer science and a graduate diploma in data mining. He can be reached at editor1 at kdnuggets[dot]com.