By David Wendt and Gregory Kimball
Efficient processing of string data is vital for many data science applications. To extract valuable information from string data, RAPIDS libcudf provides powerful tools for accelerating string data transformations. libcudf is a C++ GPU DataFrame library used for loading, joining, aggregating, and filtering data.
In data science, string data represents speech, text, genetic sequences, logging, and many other kinds of information. When working with string data for machine learning and feature engineering, the data must frequently be normalized and transformed before it can be applied to specific use cases. libcudf provides both general purpose APIs as well as device-side utilities to enable a wide range of custom string operations.
This post demonstrates how to skillfully transform strings columns with the libcudf general purpose API. You’ll gain new knowledge on how to unlock peak performance using custom kernels and libcudf device-side utilities. This post also walks you through examples of how to best manage GPU memory and efficiently construct libcudf columns to speed up your string transformations.
libcudf stores string data in device memory using Arrow format, which represents strings columns as two child columns: chars and offsets (Figure 1).
The chars column holds the string data as UTF-8 encoded character bytes that are stored contiguously in memory.
The offsets column contains an increasing sequence of integers which are byte positions identifying the start of each individual string within the chars data array. The final offset element is the total number of bytes in the chars column. This means the size of an individual string at row i is defined as (offsets[i+1]-offsets[i]).
Figure 1. Schematic showing how Arrow format represents strings columns with chars and offsets child columns
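To make the layout concrete, here is a minimal host-side sketch (not a libcudf API; it assumes the offsets have already been copied to the host) that computes per-row string sizes using the formula above:
#include <cstdint>
#include <vector>

// size of the string at row i is offsets[i+1] - offsets[i]
std::vector<int32_t> string_sizes(std::vector<int32_t> const& offsets)
{
  std::vector<int32_t> sizes(offsets.size() - 1);
  for (std::size_t i = 0; i + 1 < offsets.size(); ++i) {
    sizes[i] = offsets[i + 1] - offsets[i];
  }
  return sizes;
}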
To illustrate an example string transformation, consider a function that receives two input strings columns and produces one redacted output strings column.
The input data has the following form: a “names” column containing first and last names separated by a space and a “visibilities” column containing the status of “public” or “private.”
We propose the “redact” function that operates on the input data to produce output data consisting of the first initial of the last name followed by a space and the entire first name. However, if the corresponding visibility column is “private” then the output string should be fully redacted as “X X.”
Table 1. Example of a “redact” string transformation that receives names and visibilities strings columns as input and partially or fully redacted data as output
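For instance, under this rule an input row of (“John Smith”, “public”) would produce “S John”, while (“Jane Doe”, “private”) would produce “X X” (these example names are illustrative, not taken from the benchmark data).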
First, string transformation can be accomplished using the libcudf strings API. The general purpose API is a great starting point and baseline for comparing performance.
The API functions operate on an entire strings column, launching at least one kernel per function and assigning one thread per string. Each thread handles a single row of data in parallel across the GPU and outputs a single row as part of a new output column.
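The custom kernels later in this post follow the same one-thread-per-row pattern. A typical launch configuration looks like the following sketch, where block_size and blocks are illustrative names and names is the input column:
// one thread per row: ceiling division so every row is covered
auto constexpr block_size = 256;
auto const blocks = (names.size() + block_size - 1) / block_size;
// my_kernel<<<blocks, block_size, 0, stream.value()>>>(...);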
To complete the redact example function using the general purpose API, follow these steps:
Convert the “visibilities” strings column into a Boolean column using contains
Create a new strings column from the names column by copying “X X” whenever the corresponding row entry in the boolean column is “false”
Split the “redacted” column into first name and last name columns
Slice the first character of the last names as the last name initials
Build the output column by concatenating the last initials column and the first names column with space (” “) separator.
// convert the visibility label into a boolean
auto const visible = cudf::string_scalar(std::string("public"));
auto const allowed = cudf::strings::contains(visibilities, visible);
// redact names
auto const redaction = cudf::string_scalar(std::string("X X"));
auto const redacted  = cudf::copy_if_else(names, redaction, allowed->view());
// split the first name and last initial into two columns
auto const sv         = cudf::strings_column_view(redacted->view());
auto const first_last = cudf::strings::split(sv);
auto const first      = first_last->view().column(0);
auto const last       = first_last->view().column(1);
auto const last_initial = cudf::strings::slice_strings(last, 0, 1);
// assemble a result column
auto const tv = cudf::table_view({last_initial->view(), first});
auto result   = cudf::strings::concatenate(tv, std::string(" "));
This approach takes about 3.5 ms on an A6000 with 600K rows of data. This example uses contains, copy_if_else, split, slice_strings and concatenate to accomplish a custom string transformation. A profiling analysis with Nsight Systems shows that the split function takes the longest amount of time, followed by slice_strings and concatenate.
Figure 2 shows profiling data from Nsight Systems of the redact example, showing end-to-end string processing at up to ~600 million elements per second. The regions correspond to NVTX ranges associated with each function. Light blue ranges correspond to periods where CUDA kernels are running.
Figure 2. Profiling data from Nsight Systems of the redact example
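To produce named ranges like these in your own profiles, a minimal sketch using the NVTX 3 C++ API (assuming the nvtx3.hpp header is available on your include path) looks like this:
#include <nvtx3/nvtx3.hpp>

void redact_example()
{
  nvtx3::scoped_range range{"redact"};  // shows up as a named range in Nsight Systems
  // ... call the strings APIs shown above ...
}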
The libcudf strings API is a fast and efficient toolkit for transforming strings, but sometimes performance-critical functions need to run even faster. A key source of extra work in the libcudf strings API is the creation of at least one new strings column in global device memory for each API call, opening up the opportunity to combine multiple API calls into a custom kernel.
Performance limitations in kernel malloc calls
First, we’ll build a custom kernel to implement the redact example transformation. When designing this kernel, we must keep in mind that libcudf strings columns are immutable.
Strings columns cannot be modified in place because the character bytes are stored contiguously, and any changes to the length of a string would invalidate the offsets data. Therefore the redact_kernel custom kernel generates a new strings column by using a libcudf column factory to build both offsets and chars child columns.
In this first approach, the output string for each row is created in dynamic device memory using a malloc call inside the kernel. The custom kernel output is a vector of device pointers to each row output, and this vector serves as input to a strings column factory.
The custom kernel accepts a cudf::column_device_view to access the strings column data and uses the element method to return a cudf::string_view representing the string data at the specified row index. The kernel output is a vector of type cudf::string_view that holds pointers to the device memory containing the output string and the size of that string in bytes.
The cudf::string_view class is similar to the std::string_view class but is implemented specifically for libcudf and wraps a fixed length of character data in device memory encoded as UTF-8. It has many of the same features (find and substr functions, for example) and limitations (no null terminator) as the std counterpart. A cudf::string_view represents a character sequence stored in device memory and so we can use it here to record the malloc’d memory for an output vector.
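As a quick illustration of the std::string_view parallel, the following device-side sketch uses find and substr (extract_first is a hypothetical helper, not a libcudf function):
// returns a non-owning view of the characters before the first space
__device__ cudf::string_view extract_first(cudf::string_view name)
{
  auto const space_idx = name.find(' ');  // byte position of the first space
  return name.substr(0, space_idx);       // no copy; views the same device memory
}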
Malloc kernel
__global__ void redact_kernel(cudf::column_device_view const d_names,
                              cudf::column_device_view const d_visibilities,
                              cudf::string_view redaction,
                              cudf::string_view* d_output)
{
  // get index for this thread
  auto index = threadIdx.x + blockIdx.x * blockDim.x;
  if (index >= d_names.size()) return;

  auto const visible = cudf::string_view("public", 6);

  auto const name = d_names.element<cudf::string_view>(index);
  auto const vis  = d_visibilities.element<cudf::string_view>(index);
  if (vis == visible) {
    auto const space_idx    = name.find(' ');
    auto const first        = name.substr(0, space_idx);
    auto const last_initial = name.substr(space_idx + 1, 1);
    auto const output_size  = first.size_bytes() + last_initial.size_bytes() + 1;

    char* output_ptr = static_cast<char*>(malloc(output_size));
    // build output string
    d_output[index] = cudf::string_view{output_ptr, output_size};
    memcpy(output_ptr, last_initial.data(), last_initial.size_bytes());
    output_ptr += last_initial.size_bytes();
    *output_ptr++ = ' ';
    memcpy(output_ptr, first.data(), first.size_bytes());
  } else {
    d_output[index] = cudf::string_view{redaction.data(), redaction.size_bytes()};
  }
}

__global__ void free_kernel(cudf::string_view redaction, cudf::string_view* d_output, int count)
{
  auto index = threadIdx.x + blockIdx.x * blockDim.x;
  if (index >= count) return;

  auto ptr = const_cast<char*>(d_output[index].data());
  if (ptr != redaction.data()) free(ptr); // free everything that does not match the redaction string
}
This might seem like a reasonable approach, until the kernel performance is measured. This approach takes about 108 ms on an A6000 with 600K rows of data, more than 30x slower than the solution provided above using the libcudf strings API.
redact_kernel        ~62ms
free_kernel          45.5ms
make_strings_column   0.5ms
The main bottleneck is the malloc/free calls within the two kernels here. CUDA dynamic device memory requires malloc/free calls in a kernel to be synchronized, causing parallel execution to degenerate into sequential execution.
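As an aside, in-kernel malloc draws from a separate device heap that defaults to only 8 MB, so this pattern usually also requires enlarging the heap before launching the kernel. A sketch, with an illustrative size:
// enlarge the device heap used by in-kernel malloc/free (size is illustrative)
cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128u * 1024 * 1024);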
Pre-allocating working memory to eliminate bottlenecks
Eliminate the malloc/free bottleneck by replacing the malloc/free calls in the kernel with pre-allocated working memory before launching the kernel.
For the redact example, the output size of each string should be no larger than the input string itself, since the logic only removes characters. Therefore, a single device memory buffer can be used with the same size as the input buffer. Use the input offsets to locate each row position.
Accessing the strings column’s offsets involves wrapping the cudf::column_view with a cudf::strings_column_view and calling its offsets_begin method. The size of the chars child column can also be accessed using the chars_size method. Then an rmm::device_uvector is pre-allocated before calling the kernel to store the character output data.
auto const scv      = cudf::strings_column_view(names);
auto const offsets  = scv.offsets_begin();
auto working_memory = rmm::device_uvector<char>(scv.chars_size(), stream);
Pre-allocated kernel
__global__ void redact_kernel(cudf::column_device_view const d_names,
                              cudf::column_device_view const d_visibilities,
                              cudf::string_view redaction,
                              char* working_memory,
                              cudf::offset_type const* d_offsets,
                              cudf::string_view* d_output)
{
  auto index = threadIdx.x + blockIdx.x * blockDim.x;
  if (index >= d_names.size()) return;

  auto const visible = cudf::string_view("public", 6);

  auto const name = d_names.element<cudf::string_view>(index);
  auto const vis  = d_visibilities.element<cudf::string_view>(index);
  if (vis == visible) {
    auto const space_idx    = name.find(' ');
    auto const first        = name.substr(0, space_idx);
    auto const last_initial = name.substr(space_idx + 1, 1);
    auto const output_size  = first.size_bytes() + last_initial.size_bytes() + 1;

    // resolve output string location
    char* output_ptr = working_memory + d_offsets[index];
    d_output[index]  = cudf::string_view{output_ptr, output_size};

    // build output string into output_ptr
    memcpy(output_ptr, last_initial.data(), last_initial.size_bytes());
    output_ptr += last_initial.size_bytes();
    *output_ptr++ = ' ';
    memcpy(output_ptr, first.data(), first.size_bytes());
  } else {
    d_output[index] = cudf::string_view{redaction.data(), redaction.size_bytes()};
  }
}
The kernel outputs a vector of cudf::string_view objects which is passed to the cudf::make_strings_column factory function. The second parameter to this function is used for identifying null entries in the output column. The examples in this post do not have null entries, so a nullptr placeholder cudf::string_view{nullptr,0} is used.
auto str_ptrs = rmm::device_uvector<cudf::string_view>(names.size(), stream);

redact_kernel<<<blocks, block_size, 0, stream.value()>>>(*d_names,
                                                         *d_visibilities,
                                                         d_redaction.value(),
                                                         working_memory.data(),
                                                         offsets,
                                                         str_ptrs.data());

auto result = cudf::make_strings_column(str_ptrs, cudf::string_view{nullptr, 0}, stream);
This approach takes about 1.1 ms on an A6000 with 600K rows of data and therefore beats the baseline by more than 2x. The approximate breakdown is shown below:
redact_kernel         66us
make_strings_column  400us
The remaining time is spent in cudaMalloc, cudaFree, cudaMemcpy, which is typical of the overhead for managing temporary instances of rmm::device_uvector. This method works well if all of the output strings are guaranteed to be the same size or smaller than the input strings.
Overall, switching to a bulk working memory allocation with RAPIDS RMM is a significant improvement and a solid solution for a custom strings function.
Optimizing column creation for faster compute times
Is there a way to improve this even further? The bottleneck is now the cudf::make_strings_column factory function, which builds the two strings column components, offsets and chars, from the vector of cudf::string_view objects.
In libcudf, many factory functions are included for building strings columns. The factory function used in the previous examples takes a cudf::device_span of cudf::string_view objects and then constructs the column by performing a gather on the underlying character data to build the offsets and character child columns. An rmm::device_uvector is automatically convertible to a cudf::device_span without copying any data.
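A minimal sketch of that conversion, with no data copied:
// an rmm::device_uvector converts to a non-owning cudf::device_span view
rmm::device_uvector<cudf::string_view> str_ptrs(names.size(), stream);
cudf::device_span<cudf::string_view const> span = str_ptrs;  // no copy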
However, if the vector of characters and the vector of offsets are built directly, then a different factory function can be used, which simply creates the strings column without requiring a gather to copy the data.
The sizes_kernel makes a first pass over the input data to compute the exact output size of each output row:
Optimized kernel: Part 1
__global__ void sizes_kernel(cudf::column_device_view const d_names,
                             cudf::column_device_view const d_visibilities,
                             cudf::size_type* d_sizes)
{
  auto index = threadIdx.x + blockIdx.x * blockDim.x;
  if (index >= d_names.size()) return;

  auto const visible   = cudf::string_view("public", 6);
  auto const redaction = cudf::string_view("X X", 3);

  auto const name = d_names.element<cudf::string_view>(index);
  auto const vis  = d_visibilities.element<cudf::string_view>(index);

  cudf::size_type result = redaction.size_bytes(); // init to redaction size
  if (vis == visible) {
    auto const space_idx    = name.find(' ');
    auto const first        = name.substr(0, space_idx);
    auto const last_initial = name.substr(space_idx + 1, 1);
    result = first.size_bytes() + last_initial.size_bytes() + 1;
  }
  d_sizes[index] = result;
}
The output sizes are then converted to offsets by performing an in-place exclusive_scan. Note that the offsets vector was created with names.size()+1 elements. The last entry will be the total number of bytes (all the sizes added together) while the first entry will be 0. These are both handled by the exclusive_scan call. The size of the chars column is retrieved from the last entry of the offsets column to build the chars vector.
auto offsets = rmm::device_uvector<cudf::size_type>(names.size() + 1, stream);

// compute output sizes
sizes_kernel<<<blocks, block_size, 0, stream.value()>>>(
  *d_names, *d_visibilities, offsets.data());

thrust::exclusive_scan(rmm::exec_policy(stream), offsets.begin(), offsets.end(), offsets.begin());
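The total output size can then be copied back to the host from the last offset, for example with rmm::device_uvector’s back_element (a sketch):
// the last offset is the total number of output bytes
auto const output_size = offsets.back_element(stream);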
The redact_kernel logic is still very much the same, except that it accepts the output d_offsets vector to resolve each row’s output location:
Optimized kernel: Part 2
__global__ void redact_kernel(cudf::column_device_view const d_names,
                              cudf::column_device_view const d_visibilities,
                              cudf::size_type const* d_offsets,
                              char* d_chars)
{
  auto index = threadIdx.x + blockIdx.x * blockDim.x;
  if (index >= d_names.size()) return;

  auto const visible   = cudf::string_view("public", 6);
  auto const redaction = cudf::string_view("X X", 3);

  // resolve output_ptr using the offsets vector
  char* output_ptr = d_chars + d_offsets[index];

  auto const name = d_names.element<cudf::string_view>(index);
  auto const vis  = d_visibilities.element<cudf::string_view>(index);
  if (vis == visible) {
    auto const space_idx    = name.find(' ');
    auto const first        = name.substr(0, space_idx);
    auto const last_initial = name.substr(space_idx + 1, 1);
    auto const output_size  = first.size_bytes() + last_initial.size_bytes() + 1;
    // build output string
    memcpy(output_ptr, last_initial.data(), last_initial.size_bytes());
    output_ptr += last_initial.size_bytes();
    *output_ptr++ = ' ';
    memcpy(output_ptr, first.data(), first.size_bytes());
  } else {
    memcpy(output_ptr, redaction.data(), redaction.size_bytes());
  }
}
The size of the output d_chars column is retrieved from the last entry of the d_offsets column to allocate the chars vector. The kernel launches with the pre-computed offsets vector and returns the populated chars vector. Finally, the libcudf strings column factory creates the output strings column.
This cudf::make_strings_column factory function builds the strings column without making a copy of the data. The offsets data and chars data are already in the correct, expected format and this factory simply moves the data from each vector and creates the column structure around it. Once completed, the rmm::device_uvectors for offsets and chars are empty, their data having been moved into the output column.
auto chars = rmm::device_uvector<char>(output_size, stream);

redact_kernel<<<blocks, block_size, 0, stream.value()>>>(
  *d_names, *d_visibilities, offsets.data(), chars.data());

// from pre-assembled offsets and character buffers
auto result = cudf::make_strings_column(names.size(), std::move(offsets), std::move(chars));
This approach takes about 300 us (0.3 ms) on an A6000 with 600K rows of data and improves over the previous approach by more than 2x. You might notice that sizes_kernel and redact_kernel share much of the same logic: once to measure the size of the output and then again to populate the output.
From a code quality perspective, it is helpful to refactor the transformation as a device function called by both the sizes and redact kernels. From a performance perspective, you might be surprised to see the computational cost of the transformation being paid twice.
The benefits for memory management and more efficient column creation often outweigh the computation cost of performing the transformation twice.
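A sketch of that refactor is shown below; redact_one is a hypothetical helper, not part of libcudf. One device function both measures a row and, when given an output pointer, writes it, so sizes_kernel can call redact_one(name, vis, nullptr) and redact_kernel can call redact_one(name, vis, d_chars + d_offsets[index]):
// measure one output row; if output is not nullptr, also write the row
__device__ cudf::size_type redact_one(cudf::string_view name,
                                      cudf::string_view vis,
                                      char* output)
{
  auto const visible   = cudf::string_view("public", 6);
  auto const redaction = cudf::string_view("X X", 3);
  if (vis == visible) {
    auto const space_idx    = name.find(' ');
    auto const first        = name.substr(0, space_idx);
    auto const last_initial = name.substr(space_idx + 1, 1);
    if (output != nullptr) {
      memcpy(output, last_initial.data(), last_initial.size_bytes());
      output += last_initial.size_bytes();
      *output++ = ' ';
      memcpy(output, first.data(), first.size_bytes());
    }
    return first.size_bytes() + last_initial.size_bytes() + 1;
  }
  if (output != nullptr) { memcpy(output, redaction.data(), redaction.size_bytes()); }
  return redaction.size_bytes();
}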
Table 2 shows the compute time, kernel count, and bytes processed for the four solutions discussed in this post. “Total kernel launches” reflects the total number of kernels launched, including both compute and helper kernels. “Total bytes processed” is the cumulative DRAM read plus write bytes and “minimum bytes processed” is an average of 37.9 bytes per row for our test inputs and outputs. The ideal “memory bandwidth limited” case assumes 768 GB/s bandwidth, the theoretical peak throughput of the A6000.
Table 2. Compute time, kernel count, and bytes processed for the four solutions discussed in this post
“Optimized Kernel” provides the highest throughput due to the reduced number of kernel launches and the fewer total bytes processed. With efficient custom kernels, the total kernel launches drop from 31 to 4 and the total bytes processed drops from 12.6x to 1.75x of the input plus output size.
As a result, the custom kernel achieves >10x higher throughput than the general purpose strings API for the redact transformation.
The pool memory resource in RAPIDS Memory Manager (RMM) is another tool you can use to increase performance. The examples above use the default “CUDA memory resource” for allocating and freeing global device memory. However, the time needed to allocate working memory adds significant latency in between steps of the string transformations. The “pool memory resource” in RMM reduces latency by allocating a large pool of memory up front, and assigning suballocations as needed during processing.
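A minimal sketch of installing a pool resource (the initial pool size here is illustrative):
#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>

// build a pool on top of the default CUDA memory resource;
// subsequent libcudf/RMM allocations are suballocated from the pool
rmm::mr::cuda_memory_resource cuda_mr;
rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource> pool_mr(
  &cuda_mr, 1024u * 1024 * 1024);  // 1 GiB initial pool
rmm::mr::set_current_device_resource(&pool_mr);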
With the CUDA memory resource, “Optimized Kernel” shows a 10x-15x speedup that begins to drop off at higher row counts due to the increasing allocation size (Figure 3). Using the pool memory resource mitigates this effect and maintains 15x-25x speedups over the libcudf strings API approach.
Figure 3. Speedup from the custom kernels “Pre-Allocated Kernel” and “Optimized Kernel” with the default CUDA memory resource (solid) and the pool memory resource (dashed), versus the libcudf strings API using the default CUDA memory resource
With the pool memory resource, an end-to-end memory throughput approaching the theoretical limit for a two-pass algorithm is demonstrated. “Optimized Kernel” reaches 320-340 GB/s throughput, measured using the size of inputs plus the size of outputs and the compute time (Figure 4).
The two-pass approach first measures the sizes of the output elements, allocates memory, and then fills the memory with the outputs. Given a two-pass processing algorithm, the implementation in “Optimized Kernel” performs close to the memory bandwidth limit. “End-to-end memory throughput” is defined as the input plus output size in GB divided by the compute time. *RTX A6000 memory bandwidth (768 GB/s).
Figure 4. Memory throughput for “Optimized Kernel,” “Pre-Allocated Kernel,” and “libcudf strings API” as a function of input/output row count
This post demonstrates two approaches for writing efficient string data transformations in libcudf. The libcudf general purpose API is fast and straightforward for developers, and delivers good performance. libcudf also provides device-side utilities designed for use with custom kernels, in this example unlocking >10x faster performance.
Apply your knowledge
To get started with RAPIDS cuDF, visit the rapidsai/cudf GitHub repo. If you have not yet tried cuDF and libcudf in your string processing workloads, we encourage you to test the latest release. Docker containers are provided for releases as well as nightly builds. Conda packages are also available to make testing and deployment easier. If you’re already using cuDF, we encourage you to run the new strings transformation example by visiting rapidsai/cudf/tree/HEAD/cpp/examples/strings on GitHub.
David Wendt is a senior systems software engineer at NVIDIA developing C++/CUDA code for RAPIDS. David holds a master’s degree in electrical engineering from Johns Hopkins University.
Gregory Kimball is a software engineering manager at NVIDIA working on the RAPIDS team. Gregory leads development for libcudf, the CUDA/C++ library for columnar data processing that powers RAPIDS cuDF. Gregory holds a PhD in applied physics from the California Institute of Technology.
Original. Reposted with permission.