Parallel Processing Large File in Python

February 22, 2023


Image by Author
 

For parallel processing, we divide our job into sub-units. It increases the number of jobs processed by the program and reduces the overall processing time.

For example, if you are working with a large CSV file and want to modify a single column, we will feed the data as an array to the function, and it will process multiple values in parallel based on the number of available workers. These workers are based on the number of cores in your processor.
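As a minimal sketch of the idea (using a hypothetical square function on a list of numbers, not part of this tutorial), a worker pool splits the array across one process per CPU core:

import multiprocessing as mp

def square(x):
    # Stand-in for any per-value transformation, e.g. cleaning one cell of a column
    return x * x

if __name__ == "__main__":
    values = list(range(1_000_000))
    # One worker per CPU core; each worker handles a chunk of the array
    with mp.Pool(mp.cpu_count()) as pool:
        results = pool.map(square, values)
    print(results[:5])  # [0, 1, 4, 9, 16]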

Note: using parallel processing on a smaller dataset will not improve processing time.

 

In this blog, we will learn how to reduce processing time on large files using the multiprocessing, joblib, and tqdm Python packages. It is a simple tutorial that can apply to any file, database, image, video, and audio.

Observe: we’re utilizing the Kaggle pocket book for the experiments. The processing time can differ from machine to machine.  

 

 

We will be using the US Accidents (2016 – 2021) dataset from Kaggle, which consists of 2.8 million records and 47 columns.

We will import multiprocessing, joblib, and tqdm for parallel processing, pandas for data ingestion, and re, nltk, and string for text processing.

# Parallel Computing
import multiprocessing as mp
from joblib import Parallel, delayed
from tqdm.notebook import tqdm

# Data Ingestion
import pandas as pd

# Text Processing
import re
from nltk.corpus import stopwords
import string

Before we jump right in, let's set n_workers by doubling cpu_count(). As you can see, we have 8 workers.

n_workers = 2 * mp.cpu_count()

print(f"{n_workers} workers are available")

>>> 8 workers are available

In the next step, we will ingest the large CSV file using the pandas read_csv function. Then, we will print out the shape of the dataframe, the names of the columns, and the processing time.

 

Note: Jupyter's magic function %%time can display CPU times and wall time at the end of the process.
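%%time only works inside Jupyter/IPython. If you run the same steps in a plain Python script, a rough equivalent (a small sketch, not part of the original tutorial) is to measure wall time with time.perf_counter:

import time

start = time.perf_counter()
# ... the step you want to time, e.g. df = pd.read_csv(file_name)
elapsed = time.perf_counter() - start
print(f"Wall time: {elapsed:.1f} s")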

 

%%time
file_name = "../input/us-accidents/US_Accidents_Dec21_updated.csv"
df = pd.read_csv(file_name)

print(f"Shape:{df.shape}\n\nColumn Names:\n{df.columns}\n")

Output

Shape:(2845342, 47)

Column Names:

Index(['ID', 'Severity', 'Start_Time', 'End_Time', 'Start_Lat', 'Start_Lng',
'End_Lat', 'End_Lng', 'Distance(mi)', 'Description', 'Number', 'Street',
'Side', 'City', 'County', 'State', 'Zipcode', 'Country', 'Timezone',
'Airport_Code', 'Weather_Timestamp', 'Temperature(F)', 'Wind_Chill(F)',
'Humidity(%)', 'Pressure(in)', 'Visibility(mi)', 'Wind_Direction',
'Wind_Speed(mph)', 'Precipitation(in)', 'Weather_Condition', 'Amenity',
'Bump', 'Crossing', 'Give_Way', 'Junction', 'No_Exit', 'Railway',
'Roundabout', 'Station', 'Stop', 'Traffic_Calming', 'Traffic_Signal',
'Turning_Loop', 'Sunrise_Sunset', 'Civil_Twilight', 'Nautical_Twilight',
'Astronomical_Twilight'],
dtype='object')

CPU times: user 33.9 s, sys: 3.93 s, total: 37.9 s
Wall time: 46.9 s

 

clean_text is a simple function for processing and cleaning the text. We will get the English stop words using nltk.corpus and use them to filter out stop words from the text line. After that, we will remove special characters and extra spaces from the sentence. It will be the baseline function to determine the processing time for serial, parallel, and batch processing.

def clean_text(text):
    # Remove stop words
    stops = stopwords.words("english")
    text = " ".join([word for word in text.split() if word not in stops])
    # Remove special characters
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove the extra spaces
    text = re.sub(' +', ' ', text)
    return text

 

For serial processing, we can use the pandas .apply() function, but if you want to see the progress bar, you have to activate tqdm for pandas and then use the .progress_apply() function.

We are going to process the 2.8 million records and save the result back to the “Description” column.

%%time
tqdm.pandas()

df['Description'] = df['Description'].progress_apply(clean_text)

Output

It took the high-end processor 9 minutes and 5 seconds to serially process 2.8 million rows.

100% 2845342/2845342 [09:05<00:00, 5724.25it/s]

CPU times: user 8min 14s, sys: 53.6 s, total: 9min 7s
Wall time: 9min 5s

 

There are various ways to process the file in parallel, and we are going to learn about all of them. multiprocessing is a built-in Python package that is commonly used for parallel processing of large files.

We will create a multiprocessing Pool with 8 workers and use the map function to initiate the process. To display progress bars, we are using tqdm.

The map function consists of two parts. The first requires the function, and the second requires an argument or list of arguments.

Learn more by reading the documentation.

%%time
p = mp.Pool(n_workers)

df['Description'] = p.map(clean_text, tqdm(df['Description']))

Output

We have cut the processing time by more than half: it dropped from 9 minutes 5 seconds to 3 minutes 51 seconds.

100% 2845342/2845342 [02:58<00:00, 135646.12it/s]

CPU times: user 5.68 s, sys: 1.56 s, total: 7.23 s
Wall time: 3min 51s

 

We will now learn about another Python package for parallel processing. In this section, we will use joblib's Parallel and delayed to replicate the map function.

  • Parallel requires two arguments: n_jobs = 8 and backend = multiprocessing.
  • Then, we will add clean_text to the delayed function.
  • Create a loop to feed a single value at a time.

The process below is quite generic, and you can modify your function and array according to your needs. I have used it to process thousands of audio and video files without any issue.

Recommended: add exception handling using try: and except:, for example as sketched below.
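A minimal sketch of that wrapper (returning the input unchanged on failure is an assumption here, not the article's choice):

def clean_text_safe(text):
    try:
        return clean_text(text)
    except Exception:
        # If a value cannot be processed (e.g. a non-string value such as NaN), keep it as-is
        return text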

def text_parallel_clean(array):
    result = Parallel(n_jobs=n_workers, backend="multiprocessing")(
        delayed(clean_text)(text)
        for text in tqdm(array)
    )
    return result

Add the “Description” column to text_parallel_clean(). 

%%time
df['Description'] = text_parallel_clean(df['Description'])

Output

It took our function 13 seconds longer than the multiprocessing Pool. Even then, Parallel is about 5 minutes faster than serial processing.

100% 2845342/2845342 [04:03<00:00, 10514.98it/s]

CPU times: user 44.2 s, sys: 2.92 s, total: 47.1 s
Wall time: 4min 4s

 

There is a better way to process large files: splitting them into batches and processing them in parallel. Let's start by creating a batch function that will run clean_text on a single batch of values.

 

Batch Processing Function

 

def proc_batch(batch):
    return [
        clean_text(text)
        for text in batch
    ]

Splitting the File into Batches

 

The function below will split the file into multiple batches based on the number of workers. In our case, we get 8 batches.

def batch_file(array, n_workers):
    file_len = len(array)
    batch_size = round(file_len / n_workers)
    batches = [
        array[ix:ix+batch_size]
        for ix in tqdm(range(0, file_len, batch_size))
    ]
    return batches

batches = batch_file(df['Description'], n_workers)

>>> 100% 8/8 [00:00<00:00, 280.01it/s]

Running Parallel Batch Processing

 

Finally, we will use Parallel and delayed to process the batches.

 

Note: To get back a single array of values, we have to run a list comprehension as shown below.

 

%%time
batch_output = Parallel(n_jobs=n_workers, backend="multiprocessing")(
    delayed(proc_batch)(batch)
    for batch in tqdm(batches)
)

df['Description'] = [j for i in batch_output for j in i]

Output

We have improved the processing time. This technique is well known for processing complex data and training deep learning models.

100% 8/8 [00:00<00:00, 2.19it/s]

CPU times: user 3.39 s, sys: 1.42 s, total: 4.81 s
Wall time: 3min 56s

 

tqdm takes multiprocessing to the next level. It is simple and powerful. I recommend it to every data scientist.

Check out the documentation to learn more about multiprocessing with tqdm.

The process_map requires:

  • Function name
  • Dataframe column
  • max_workers
  • chunksize, which is similar to batch size. We will calculate the batch size using the number of workers, or you can set the number based on your preference.

%%time
from tqdm.contrib.concurrent import process_map
batch = round(len(df) / n_workers)

df["Description"] = process_map(
    clean_text, df["Description"], max_workers=n_workers, chunksize=batch
)

Output

With a single line of code, we get the best result.

100% 2845342/2845342 [03:48<00:00, 1426320.93it/s]

CPU times: user 7.32 s, sys: 1.97 s, total: 9.29 s
Wall time: 3min 51s

 

You need to find a balance and select the technique that works best for your case. It can be serial processing, parallel processing, or batch processing. Parallel processing can backfire if you are working with a smaller, less complex dataset.
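To see why small inputs can backfire, here is a toy comparison (a hypothetical benchmark, not from the tutorial; exact numbers depend on your machine). Pool start-up and pickling overhead can easily dominate a cheap workload:

import time
import multiprocessing as mp

def square(x):
    return x * x

if __name__ == "__main__":
    data = list(range(10_000))  # deliberately small

    start = time.perf_counter()
    serial = [square(x) for x in data]
    print(f"serial:   {time.perf_counter() - start:.4f} s")

    start = time.perf_counter()
    with mp.Pool(mp.cpu_count()) as pool:
        parallel = pool.map(square, data)
    print(f"parallel: {time.perf_counter() - start:.4f} s")  # often slower on small inputs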

In this mini-tutorial, we have learned about various Python packages and techniques that allow us to process our data functions in parallel.

If you are only working with a tabular dataset and want to improve your processing performance, then I suggest you try Dask, datatable, and RAPIDS.
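As a rough illustration (assuming Dask is installed; the blocksize here is an arbitrary choice, not from the article), the same cleaning step could look like this in Dask:

import dask.dataframe as dd

# Read the CSV lazily and split it into partitions that can be processed in parallel
ddf = dd.read_csv(file_name, blocksize="64MB")

# Apply clean_text to each value of the column; meta declares the output name and dtype
ddf["Description"] = ddf["Description"].map(clean_text, meta=("Description", "object"))

# Trigger the actual computation
df = ddf.compute()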

 


 

Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in Technology Management and a bachelor's degree in Telecommunication Engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.


