Monday 4 March 2024

General programming tips for dealing with large datasets


The tricks that work in a general programming context still apply to data science. Some may be worded slightly differently, but the principles are the same for all programmers. This section recapitulates the tricks that matter in a data science context.
The general tricks fall into three categories, as the mind map in the figure shows:

  1. Don’t reinvent the wheel. Use tools and libraries developed by others.
  2. Get the most out of your hardware. Your machine is rarely used to its full potential; simple adaptations can make it work harder.
  3. Reduce the computing need. Slim down your memory and processing needs as much as possible.


1. Avoid duplicating existing efforts / Don’t reinvent the wheel

“Don’t repeat anyone” is arguably even better advice than “don’t repeat yourself.” Spend your effort on work that adds value: revisiting a problem that has already been solved is wasteful. As a data scientist, two fundamental principles in particular can boost your productivity when working with enormous datasets:

  • Exploit the power of databases. When dealing with huge datasets, most data scientists first build their analytical base tables inside a database. This strategy works well for preparing straightforward features. When more advanced modeling is needed during this preparation, find out whether user-defined functions and stored procedures can be used. The last example in this chapter demonstrates how to integrate a database into your workflow.
  • Use optimized libraries. Developing libraries such as Mahout, Weka, and other machine-learning implementations takes time and expertise. The resulting products are highly optimized and embody best practices and state-of-the-art technologies. Focus your attention on getting things done, not on duplicating the work of others, unless your goal is to understand how things work.
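As a minimal sketch of the database tip above, the snippet below prepares a small per-customer feature table inside SQLite rather than in Python, so only the aggregated result crosses into Python memory. The table and column names are invented for illustration:

```python
import sqlite3

# Hypothetical example: aggregate raw rows into features inside the
# database instead of pulling them all into Python first.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, 10.0), (1, 5.0), (2, 7.5)],
)

# The database does the heavy lifting; only the small feature table
# is fetched into Python.
features = conn.execute(
    "SELECT customer_id, COUNT(*) AS n_orders, SUM(amount) AS total "
    "FROM sales GROUP BY customer_id ORDER BY customer_id"
).fetchall()
print(features)  # [(1, 2, 15.0), (2, 1, 7.5)]
conn.close()
```

On a real dataset the `sales` table would live in a persistent database, and the same `GROUP BY` query would scale to millions of rows without touching Python memory.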

Next, you must take your hardware constraints into account.


2. Get the most out of your hardware

Over-utilizing a resource can slow programs down or make them fail outright. A few techniques let you shift work from an overtaxed resource to an underutilized one:

  1. Feed the CPU compressed data: shift work from the hard disk to the CPU so the CPU isn’t starved while waiting on slow disk reads.
  2. Use the GPU: move parallelizable computations to the GPU, which offers much higher throughput for such workloads.
  3. Use CUDA packages: packages such as PyCUDA expose GPU parallelization from Python.
  4. Use multiple threads: parallelize computations on the CPU using normal Python threads.
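The first technique above can be sketched with the standard-library `gzip` module: storing data compressed means the fast CPU decompresses on the fly instead of the slow disk streaming the full uncompressed bytes. The data here is a made-up repetitive payload standing in for a large file:

```python
import gzip
import io

# Stand-in for a large, repetitive on-disk dataset.
raw = b"some,large,csv,data\n" * 1000
compressed = gzip.compress(raw)

# Stream-decompress: the CPU does the decompression work, and far
# fewer bytes have to come off the (slow) storage layer.
with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as f:
    restored = f.read()

print(len(compressed), "compressed bytes vs", len(raw), "raw bytes")
```

In practice you would read a `.gz` file from disk with `gzip.open(path)`; the trade is deliberate: more CPU cycles spent decompressing in exchange for far less disk I/O.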

3. Reduce the computing need

  • Utilize a profiler to identify and remediate slow code parts.
  • Use compiled code, especially when loops are involved, and functions from packages optimized for numerical computations.
  • If a package is not available, compile the code yourself.
  • Use computational libraries like LAPACK, BLAS, Intel MKL, and ATLAS for high performance.
  • Avoid pulling data into memory when working with data that doesn't fit in memory.
  • Use generators to avoid intermediate data storage by returning data per observation instead of in batches.
  • Use as little data as possible if no large-scale algorithm is available.
  • Use your math skills to simplify calculations wherever possible.
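The generator tip above can be sketched as follows: yield one parsed observation at a time so only a single line is ever in memory, instead of materializing the whole dataset as a list. The one-record-per-line CSV layout is a hypothetical example:

```python
import os
import tempfile

def read_observations(path):
    """Yield one parsed record at a time; only one line is in memory."""
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n").split(",")

# Tiny demo file standing in for a dataset too big to load at once.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("a,1\nb,2\nc,3\n")
    path = tmp.name

# Aggregate without ever holding the whole file in memory.
total = sum(int(row[1]) for row in read_observations(path))
print(total)  # 6
os.remove(path)
```

Because `read_observations` is a generator, `sum` pulls records through it one at a time; no intermediate list of all rows is ever built.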

