08-21-2020 05:49 AM
I understand the point about the work being useful. I believe the code is fairly well optimized at the moment. But in any case, even if some of the work is wasted, shouldn't the processing speed increase if more cores were available? Am I correct?
08-21-2020 10:29 AM
@Serge1 wrote:
I believe the code is fairly well optimized at the moment.
"Belief" belongs to religion and is insufficient for code optimization..
@Serge1 wrote:
But in any case, even if some of the work is wasted, shouldn't the processing speed increase if more cores were available? Am I correct?
No, that strongly depends on the problem. Certain tasks cannot be parallelized, for example the communication with the instrument, or cases where all calculations depend on each other. Do you know the CPU overhead of the communication? How do you know that your 100% CPU use does real work instead of mostly just pumping bits around? Splitting the problem into parallelizable chunks and reassembling the result can cause overhead that can make things worse. How much CPU does your program use if you leave out the analysis and just keep the raw data?
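To make the chunking overhead concrete, here is a rough Python/numpy sketch (not your LabVIEW code; the array size and the trivial "analysis" are placeholders). Even before any cores get involved, the splitting, per-chunk calls, and reassembly add work on top of the useful calculation:

# Illustrative only: splitting work into chunks and reassembling the result is not free.
import time
import numpy as np

data = np.random.randint(0, 256, size=1_000_000, dtype=np.uint8)

def analyse(chunk):
    # stand-in for the real per-sample analysis: count set least-significant bits
    return int(np.count_nonzero(chunk & 1))

t0 = time.perf_counter()
serial_result = analyse(data)                 # one pass over the whole block
t_serial = time.perf_counter() - t0

t0 = time.perf_counter()
chunks = np.array_split(data, 8)              # split for 8 hypothetical workers
partials = [analyse(c) for c in chunks]       # imagine these running on 8 cores
chunked_result = sum(partials)                # reassembly step
t_chunked = time.perf_counter() - t0

print(t_serial, t_chunked, serial_result == chunked_result)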
What makes you think your code is optimized at all? Did you run extensive benchmarks of your inner code with synthetic data using a reliable test harness? Did you compare at least 5 different implementations? The LabVIEW compiler is fantastic and can often optimize poorly written code, but sometimes better code can be orders of magnitude faster. A skilled programmer is aware of where data copies are made and will optimize for inplaceness. Sometimes simply rearranging code can make better use of SSE instructions.
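As an illustration of such a harness (Python/numpy with made-up synthetic data, not LabVIEW): the same structure, with a fixed dataset, several candidate implementations, a sanity check, and timing, is what you would build around your subVI as well.

# Minimal benchmark harness: compare implementations on the same synthetic data.
import timeit
import numpy as np

data = np.random.randint(0, 256, size=100_000, dtype=np.uint8)

def bitwise_and(a):
    return a & 1                      # vectorized AND

def modulo(a):
    return a % 2                      # same result via modulo

def python_loop(a):
    return np.array([x & 1 for x in a], dtype=np.uint8)   # element-by-element

# Sanity check: all implementations must agree before their speed is compared.
assert np.array_equal(bitwise_and(data), modulo(data))
assert np.array_equal(bitwise_and(data), python_loop(data))

for f in (bitwise_and, modulo, python_loop):
    t = timeit.timeit(lambda: f(data), number=10)
    print(f"{f.__name__:12s}: {t / 10 * 1e3:.3f} ms per call")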
For some ideas, have a look at our NI Week talk from a few years ago.
08-21-2020 11:57 AM
>> When I need to extract 3 digital channels (2 for the trigger) and perform a search there, I got problems.
Sorry, I read your message as saying the slowdown occurred when you switched to collecting data from 3 channels rather than 1. I know the 3000A scope shares its sampling rate across channels, so it does appear to "slow down" when you use multiple channels.
A few questions...
=============
- What does your calculation do? I don't have the PicoScope drivers installed, so it's a bit hard to follow your code. If you give us a description of the calculation, perhaps we can suggest alternatives.
- Does your experiment require such high-resolution data? Could you sample less finely and still get good results? (Fewer points to crunch per loop.) People often get hung up on using the full resolution of instruments. I don't know enough about your signals to comment here, but personally I try to get away with the lowest resolution (least data) I can.
Code optimizing..
===============
- I think your code can use some optimizing. You have Altenbach (>40k posts!) interested - take advantage of his advice! Clean up your code to follow best practices as he suggested. Comment it also and perhaps you can get more help. (With all those feedback nodes I personally find it hard to determine what you are doing.)
- Why not add some timing to benchmark each loop and show us the results? (Then you would know where to focus your optimizing time; there's a rough sketch below.)
- One thing I would try is to check the effect of saving the data in your loop. Why not move that save to the end of the experiment, or move to a producer/consumer scheme (where data saving happens in the consumer and thus outside the data collection/calculation loop), or just save a dataset every 1/N times (sketched below).
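A rough sketch of both ideas (per-loop timing plus a queue-based producer/consumer), in Python rather than LabVIEW and with placeholder acquisition, analysis, and save steps; in LabVIEW the same structure would be two loops connected by a queue:

# Sketch: producer/consumer so saving never blocks acquisition, plus per-loop timing.
import queue
import threading
import time
import numpy as np

save_queue = queue.Queue()

def consumer():
    # Runs in its own thread: pulls finished datasets off the queue and "saves" them.
    while True:
        item = save_queue.get()
        if item is None:            # sentinel: shut down
            break
        time.sleep(0.01)            # placeholder for the real file write

threading.Thread(target=consumer, daemon=True).start()

for i in range(20):                 # stand-in for the acquisition loop
    t0 = time.perf_counter()
    data = np.random.randint(0, 256, size=1_000_000, dtype=np.uint8)  # fake "acquired" block
    result = np.count_nonzero(data & 1)                               # fake analysis
    save_queue.put(data)            # hand off to the consumer instead of saving here
    print(f"loop {i}: {(time.perf_counter() - t0) * 1e3:.2f} ms")

save_queue.put(None)                # tell the consumer to finish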
Other notes..
===========
- There's also this issue with PicoScope's drivers for LabVIEW - https://forums.ni.com/t5/LabVIEW/When-is-dataflow-not-data-flow-Updating-LabVIEW-Arrays-through/td-p...
- Not sure how recent your driver is, but it looks like Pico updated some things as recently as 16 days ago - https://github.com/picotech/picosdk-ni-labview-examples
Craig
08-21-2020 12:51 PM
>"Belief" belongs to religion and is insufficient for code optimization..
In complex code it is almost impossible to prove that everything is optimized, and it is almost always possible to get a bit of further improvement. So, although I have some arguments for it, I can only say that I believe the code is close to the optimum. So, with higher complexity we are closer to God :).
I say that I believe my code is close to the optimum because I have already performed some optimization. The code now runs about 100 times faster than the original version, and for almost a week I have not been able to get any further speed improvement.
I know that a lot of time is lost in extracting the necessary bit from each byte. I am doing it with the "AND" function, which proved to be the fastest way I found. Do you know any faster way? A lot of time is also lost when I search for the positions of the triggers in the incoming array. Here again, a simple array-search function turned out to be the fastest approach I found.
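For comparison only, here is how the trigger-position search described above might look as vectorized array operations in Python/numpy (a sketch, not the LabVIEW code; the trigger channel is assumed to live in bit 1 and to be detected on its rising edge):

# Illustrative only: extract one digital channel with AND, then find trigger edges.
import numpy as np

raw = np.random.randint(0, 256, size=1_000_000, dtype=np.uint8)     # fake streamed block

trigger = ((raw >> 1) & 1).astype(np.int8)           # assumed: trigger channel is bit 1
edges = np.flatnonzero(np.diff(trigger) == 1) + 1    # sample indices of rising edges

print(len(edges), edges[:5])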
I just thought that the number of operations needed to run any code is independent of the number of processor cores. Based on this assumption, I decided that if my 8 cores are 100% busy, increasing the number of cores should speed up my calculations roughly linearly.
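For reference, Amdahl's law gives a rough upper bound on the speedup when only part of the work can actually be split across cores; a quick back-of-the-envelope calculation (the parallel fractions below are made up for illustration):

# Amdahl's law: speedup = 1 / ((1 - p) + p / n), where p is the parallel fraction.
def speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

for p in (0.0, 0.5, 0.95):          # assumed parallel fractions, for illustration only
    print(p, [round(speedup(p, n), 2) for n in (8, 16, 64)])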
08-21-2020 01:02 PM
Thank you for the help.
Yes, I really need 8 ns sampling. I am collecting data from a TOF mass spectrometer. Collecting fewer data points would reduce the resolution.
Code optimizing..
===============
Sure, I will keep working in this direction. In the previous post, I said what takes a lot of time in my code.
>>- One thing I would try is to check the effect of saving the data in your loop. Why not move that save to the end of the experiment, or move to a producer/consumer scheme (where data saving happens in the consumer and thus outside the data collection/calculation loop), or just save a dataset every 1/N times.
The data are saved only when the control is activated (I press the button); otherwise I do not save them.
I will check the new driver. However, I know that most of the time is lost in the data processing, not during the data collection.
08-21-2020 01:42 PM - edited 08-21-2020 02:16 PM
@Serge1 wrote:
I know that a lot of time is lost in extracting the necessary bit from each byte. I am doing it with the "AND" function, which proved to be the fastest way I found. Do you know any faster way? A lot of time is also lost when I search for the positions of the triggers in the incoming array. Here again, a simple array-search function turned out to be the fastest approach I found.
I don't see any bitwise operations on blue wires. Does that happen in a subVI that you have not attached (e.g. the "digitazing5.vi" (sic) with the coercion dot)? What does the data look like? Can you show an image of the data and of what you are searching for? How are the subVIs set up (debugging disabled? front panel closed? inlined?)? Your subVI "digitazing5.vi" always operates on the same data, so unless it keeps internal state or relies on global data, it belongs outside the top-level loop, right? The code is sloppy and suspect because of all these unnecessary coercion dots (only the one at the "+" is really needed). You are squeezing simple integer arrays into dynamic data, greatly inflating the data structures, while all the typical metadata (x0, dx, etc.) is missing anyway. What a waste!
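To put a rough number on that inflation: dynamic data typically carries its samples as DBL waveforms, so a numpy stand-in (illustrative only, not a measurement of the actual LabVIEW overhead) shows the extra bytes that get shuffled every iteration when a plain U8 array is widened:

# Rough analogy: widening a 1M-sample U8 array to DBL multiplies the bytes moved by 8.
import numpy as np

raw = np.zeros(1_000_000, dtype=np.uint8)
as_dbl = raw.astype(np.float64)     # roughly what an implicit conversion to DBL costs

print(raw.nbytes, as_dbl.nbytes)    # 1,000,000 vs 8,000,000 bytes per iteration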
None of your inner code can run in parallel, so if you see 100% usage on all cores, something else is wrong.
You really need to find the code that consumes all CPU. What we are seeing here is not it.
There are also potential race conditions. For example, the "reset device index" executes in parallel with your main code, and there is no way to tell which happens first. Can you risk arbitrary execution order?
08-21-2020 01:55 PM
It looks like you are taking 1M pts/trigger at 8ns. There's an option to downsample the data afterwards. You would still have 8ns resolution, but fewer data points total. Just a thought.
True, your data is saved only when the button is pressed. BUT the data is transformed to dynamic data & merged on every iteration regardless of the button state. Not sure what the overhead penalty for that is.
Since I'm stuck listening to a Zoom meeting, I cleaned up your code a bit, added timing, and added the ability to test with/without the data-save bits.
You said what takes time, but not how much. And who knows whether your algorithm could be changed to save 1 s per loop; you haven't explained it.
Craig
08-23-2020 07:14 AM
This extraction of a bit from each byte is indeed performed in the subVI. This is one of the most time-consuming operations. The figure is attached. Originally, I extracted 3 digital channels (the first 3 bits) from the data, but it took too long. So now I extract only one digital channel. This is a compromise to reach higher speed. Ideally, I need to extract the first 3 bits. If you know any faster way, it would help a lot.
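For reference, extracting all three digital channels is still just a handful of whole-array shift/AND operations; a Python/numpy sketch of the idea (illustrative only, and the digital channels are assumed to sit in bits 0-2):

# Illustrative: peel the three lowest bits out of each byte of the block.
import numpy as np

raw = np.random.randint(0, 256, size=1_000_000, dtype=np.uint8)   # fake 1 MB block

ch0 = raw & 1                       # digital channel assumed in bit 0
ch1 = (raw >> 1) & 1                # digital channel assumed in bit 1
ch2 = (raw >> 2) & 1                # digital channel assumed in bit 2

print(ch0[:8], ch1[:8], ch2[:8])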
It does not work with the same data. The PicoScope streams the data continuously. So, on each loop iteration, I have a new array of one million bytes, which has to be processed in less than 8 ms.
Thank you to you and to Cstorey. I will move the saving one loop further out. However, I do not think it takes that much time; originally I tested without saving data, and I did not notice much of a speed change after I added the saving.
08-23-2020 09:11 AM
Thank you very much. Unfortunately, I cannot open this, as your version of LabVIEW is newer. I am using LabVIEW 2015 SP1, Version 15.0.1f1.
Next week, I will install LabVIEW 2020, so I will be able to read your code.
08-23-2020 10:46 AM - edited 08-23-2020 10:47 AM
Hi Serge,
@Serge1 wrote:
This extraction of a bit from each byte is indeed performed in the subVI. This is one of the most time-consuming operations.
It looks like this:
Have you tried to do the same WITHOUT any coercion dots?
Right now there are two type conversions: one at the AND function, and one more at the indicator! Both conversions involve arrays, which can take time (due to the extra memory shuffling)…
Try to benchmark this subVI after using U16 datatypes everywhere (for the "1" constant and the indicator).
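The effect is easy to demonstrate in any array language; a Python/numpy stand-in (illustrative only, not a measurement of the LabVIEW coercion itself) compares an AND where the datatypes already match against one that pushes the whole array through a conversion first:

# Illustrative: matched datatypes vs. a whole-array conversion before the AND.
import timeit
import numpy as np

data = np.random.randint(0, 65536, size=1_000_000, dtype=np.uint16)

def matched():
    return data & np.uint16(1)          # datatypes already match, no conversion

def coerced():
    return data.astype(np.int64) & 1    # widen the whole array first, then AND

for name, f in (("matched", matched), ("coerced", coerced)):
    t = timeit.timeit(f, number=50)
    print(f"{name}: {t / 50 * 1e3:.3f} ms per call")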