Looking for recommendations on profiling tools.


Working on a custom engine and have noticed an issue where intermittently my frametime will spike to the point that the screen appears to pause for 250-500 ms or so. It's very strange. If I'm running something for 5 minutes, it might happen once or twice. Very inconsistent. I'm unsure if it's something within my code (although I cannot think of anything that I'm doing inconsistently, like no random garbage collection or anything like that), or possibly a bug within the windowing library I'm using (GLFW, which I probably need to update…will try that).
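In the meantime, the crudest thing I can think of is logging outlier frames myself, something like this sketch (the threshold and names are arbitrary):

// Crude hitch logger, called once per frame. Uses std::chrono so it's
// windowing-library agnostic; the 100 ms threshold is arbitrary.
#include <chrono>
#include <cstdio>

void LogFrameHitch()
{
    using Clock = std::chrono::steady_clock;
    static Clock::time_point last = Clock::now();

    const Clock::time_point now = Clock::now();
    const double ms = std::chrono::duration<double, std::milli>(now - last).count();
    last = now;

    if (ms > 100.0) // far above any normal frame
        std::printf("Hitch: %.1f ms\n", ms);
}

That would at least timestamp the hitches so I can correlate them with whatever else is going on, but it won't tell me where the time goes.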

But, all of the above leads me to my question. How do you guys troubleshoot issues like this?

Are there any profiling libraries that you would recommend that could aid in sniffing out this intermittent issue? After a quick google, I found these two on github:

Does anyone have any experience with either of these or know of anything else?


I had the need for this a long time ago when I started to write my custom engine (which has since become a team project). I'm used to working with Unity's profiler as well as the PlayStation 4 SDK, so my requirements for a profiling solution have been based on those.

You should first understand that involving a profiler will slow down your code, so the first and foremost feature a profiler needs to offer is a way to opt in/out that won't impact your performance too much, especially when you don't need it. I don't know the libraries you mentioned, but checking them against this criterion may help you decide whether to use them or build your own profiler.

Spoiler: I however wrote my own.

I have two profiler systems which can be opted into the build by setting a compiler flag. My LiveProfiler can be toggled at runtime (and may have a slight performance impact) while the StaticProfiler is always on. When enabled, each system defines a pair of macros:

FRM_BEGIN_PROFILE(Severity, VerbosityMask, NameTag)

to enter a profiling frame and

FRM_END_PROFILE(Severity, VerbosityMask, NameTag)

to exit the frame. Those macros redirect to the corresponding profiler function

void SetMarker(VerbosityFlags::TYPE verbosity, const byte* extra, uint8 extraSize, const char* tag = 0)

and its overloads. One overload, for example, is an inline function which involves a severity evaluator, but I'll explain that later.

template<typename SeverityEvaluator> inline void SetMarker(VerbosityFlags::TYPE verbosity, const byte* extra, uint8 extraSize, const char* tag = 0)
{
    // Only record the marker if the severity check passes; for the static
    // evaluator this is a compile-time constant the compiler can fold away.
    if (SeverityEvaluator::Process())
        SetMarker(verbosity, extra, extraSize, tag);
}

Markers work like this: I set a marker, for example, at the beginning of a function I want to profile. The marker will have the verbosity flag set to VerbosityFlags::Start to indicate the beginning of a block. I may also add some extra data, for example a function pointer, to be written into the data stream. When I leave the function or the desired profiling section, I set a new marker with VerbosityFlags::End and the same data.
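In code that looks roughly like this (the function, tag and flag values are placeholders, not the engine's real ones):

// Sketch of marker usage around a block worth profiling.
void UpdatePhysics(float dt)
{
    FRM_BEGIN_PROFILE(SeverityFlags::Default, /*VerbosityMask*/ 0, "UpdatePhysics");

    // ... the work being measured ...

    FRM_END_PROFILE(SeverityFlags::Default, /*VerbosityMask*/ 0, "UpdatePhysics");
}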

The verbosity flag can also be used to filter the profiling data; for example, when a team works on the project, the verbosity may reflect different team members, and a team member may turn off the profiling display for markers from other team members. This is different from the severity flag, which is the profiling level.
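As a sketch of what I mean by filtering (the bit layout here is made up for illustration):

#include <cstdint>

// Example layout only: low bits mark Start/End, higher bits identify an
// owner (one bit per team member or subsystem). Not the engine's real values.
namespace VerbosityFlags
{
    enum TYPE : std::uint32_t
    {
        Start    = 0x01,
        End      = 0x02,
        Renderer = 0x10,
        Physics  = 0x20
    };
}

// A viewer keeps a marker only if its owner bits intersect the active mask.
inline bool PassesFilter(std::uint32_t markerVerbosity, std::uint32_t activeOwnerMask)
{
    return (markerVerbosity & activeOwnerMask) != 0;
}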

The overall interface is quite simple, regardless of which profiler you use.

namespace Profiler
{
    typedef FRM_CALLBACK<void (ProfilerData&)> ProfileToDeviceCallback;

    /**
 Sets the callback that pushes the profiling buffers to the output target
    */
    api void ProfileCallback(ProfileToDeviceCallback callback);

    /**
     Gets the level of severity for profiling
    */
    api uint32 Severity();

    /**
     Sets the level of severity for profiling
    */
    api void Severity(uint32 value);

    /**
     Creates a new profiling marker of given verbosity and tag
    */
    api void SetMarker(VerbosityFlags::TYPE verbosity, const char* tag = 0);

    /**
     Creates a new profiling marker of given verbosity, tag and extra data
    */
    api void SetMarker(VerbosityFlags::TYPE verbosity, const byte* extra, uint8 extraSize, const char* tag = 0);
}

Coming to the severity: there are a couple of profiling levels which can be turned on/off in code and control the amount of output. I wrote a template meta-evaluator for this so the compiler can optimize away calls that hit a severity which isn't currently activated. (This only works for the StaticProfiler.)

// Helper that bakes a compile-time boolean into an evaluator type.
#define Constant(V) enum { Value = (V) }; \
static inline bool Process() { return Value; }

namespace Profiler
{
    namespace SeverityFlags {
    enum TYPE
    {
        Default = 0x1,
        Deep = 0x2,
        System = 0x3
    }; }

    uint32 Severity();

    template <SeverityFlags::TYPE Flag> struct DynamicSeverityEvaluator { static inline bool Process() { return (Flag <= Profiler::Severity()); } };

    template <SeverityFlags::TYPE Flag, int Level> struct StaticSeverityEvaluator { Constant(Flag <= Level); };
    template <SeverityFlags::TYPE Flag> struct StaticSeverityEvaluator<Flag, 0> { Constant(false); };
}
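To tie that back to the macros: in a sketch, FRM_BEGIN_PROFILE might expand to the templated SetMarker with the right evaluator baked in. The expansion below is my guess at the wiring (the build flags and PROFILER_LEVEL are made up), not the verbatim macro:

// Hypothetical expansion: the StaticProfiler bakes the severity check into
// the template so disabled levels compile away entirely, while the
// LiveProfiler checks Profiler::Severity() at runtime. With neither system
// compiled in, the macro is a no-op.
#if defined(FRM_STATIC_PROFILER)
    #define FRM_BEGIN_PROFILE(Severity, VerbosityMask, NameTag) \
        Profiler::SetMarker<Profiler::StaticSeverityEvaluator<Severity, PROFILER_LEVEL>>( \
            (VerbosityMask) | Profiler::VerbosityFlags::Start, 0, 0, NameTag)
#elif defined(FRM_LIVE_PROFILER)
    #define FRM_BEGIN_PROFILE(Severity, VerbosityMask, NameTag) \
        Profiler::SetMarker<Profiler::DynamicSeverityEvaluator<Severity>>( \
            (VerbosityMask) | Profiler::VerbosityFlags::Start, 0, 0, NameTag)
#else
    #define FRM_BEGIN_PROFILE(Severity, VerbosityMask, NameTag) ((void)0)
#endif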

And finally, getting data out of the engine into a nice looking UI or log file: I spawn a worker (e.g. a task in our task system) which pushes the profiling data to the destination target. In order to store the data packages, some buffers need to be allocated first.

ProfilerBuffer buffers[PROFILER_MAX_THREADS];

Each buffer is a round-robin queue of a fixed size, so profiling never needs to allocate. Those buffers are defined in static memory for the number of threads which should be profiled. For example, if we're running a build which limits the number of thread-pool threads (task worker threads) to 32, I'm allocating 34 buffers (in order to have some extra space left). The profiling data is quite small unless you also want a stack trace added. Since every thread calls the same method and we don't want the profiler to block (and impact the game too much), it maps the thread ID to the right buffer index and accesses the data in a lock-free fashion.
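A minimal single-producer/single-consumer ring along those lines could look like this (a sketch with made-up names; the real ProfilerBuffer stores ProfilerData and exposes CanWrite/CanRead plus raw pointer accessors instead):

#include <atomic>
#include <cstdint>

// One ring per thread: the owning thread writes, the profiler worker reads,
// so single-producer/single-consumer atomics suffice and nothing ever blocks.
// Capacity must be a power of two for the index mask to work.
template <typename T, std::uint32_t Capacity>
class SpscRing
{
    static_assert((Capacity & (Capacity - 1)) == 0, "Capacity must be a power of two");

    T items[Capacity];
    std::atomic<std::uint32_t> head { 0 };  // next slot to read
    std::atomic<std::uint32_t> tail { 0 };  // next slot to write

public:
    bool CanWrite(std::uint32_t count) const
    {
        return (tail.load(std::memory_order_relaxed) -
                head.load(std::memory_order_acquire)) + count <= Capacity;
    }

    bool CanRead() const
    {
        return head.load(std::memory_order_relaxed) !=
               tail.load(std::memory_order_acquire);
    }

    void Write(const T& value)  // producer side only
    {
        const std::uint32_t t = tail.load(std::memory_order_relaxed);
        items[t & (Capacity - 1)] = value;
        tail.store(t + 1, std::memory_order_release);  // publish
    }

    T Read()  // consumer side only
    {
        const std::uint32_t h = head.load(std::memory_order_relaxed);
        T value = items[h & (Capacity - 1)];
        head.store(h + 1, std::memory_order_release);  // free the slot
        return value;
    }
};

The marker-writing path then looks like this: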

const uint32 threadId = Async::GetThreadLocalId();
ProfilerBuffer& buffer = buffers[threadId];

// Either spin until the worker has drained a slot, or drop the package.
#if defined(PROFILER_WAIT_ON_OVERFLOW) && PROFILER_WAIT_ON_OVERFLOW == 1
while(!buffer.CanWrite(1));
#else
if(!buffer.CanWrite(1))
    return;
#endif

// Fill the next free slot in this thread's buffer.
ProfilerData& data = *(buffer.Buffer() + buffer.WritePointer());
data.Timestamp = Atomic::Rdtsc();
data.Verbosity = static_cast<uint8>(verbosity);
data.NameTag = tag;
data.ThreadId = threadId;

#if PROFILER_STACK_SIZE > 0
// Optionally capture a stack trace, zero-terminated if it fits.
const uint16 size = StackTrace(data.Trace);
if(size < PROFILER_STACK_SIZE - 1)
    data.Trace[size] = 0;
#endif

data.Extra = extra;
data.ExtraSize = extraSize;

// Publish the package by advancing the write pointer (mirrors the
// ReadPointer(1) advance on the consumer side).
buffer.WritePointer(1);

What you can see here is some optional behavior: the thread can either wait for the buffer to be flushed when there are no free slots left, or ignore the overflow, in which case the package is discarded. Adding the stack trace of the current function is also possible; this can help the tool you later view the data in to better differentiate between call sites.

Everything else is straightforward. I'm using the RDTSC CPU instruction to get more reliable timing, since QueryPerformanceCounter is somewhat slower than RDTSC and we want precise data. The thread ID helps to identify and display the section in the proper row in our profiling viewer.
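One caveat with RDTSC: it counts ticks, not seconds, so the viewer needs a calibration step. A rough sketch for Windows, assuming an invariant TSC (a constant rate across power states, which modern x86 CPUs provide):

#include <windows.h>
#include <intrin.h>

// Estimate TSC ticks per second by sampling RDTSC against
// QueryPerformanceCounter once at startup; the viewer can then convert
// timestamp deltas into milliseconds.
static double MeasureTscFrequency()
{
    LARGE_INTEGER qpcFreq, qpcStart, qpcEnd;
    QueryPerformanceFrequency(&qpcFreq);

    QueryPerformanceCounter(&qpcStart);
    const unsigned long long tscStart = __rdtsc();
    Sleep(100);  // sample window; longer means more accurate
    const unsigned long long tscEnd = __rdtsc();
    QueryPerformanceCounter(&qpcEnd);

    const double seconds =
        double(qpcEnd.QuadPart - qpcStart.QuadPart) / double(qpcFreq.QuadPart);
    return double(tscEnd - tscStart) / seconds;  // ticks per second
}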

All that's left is to send the data to the right output channel. You may have recognized the profile callback which can be set on the profiler itself. This callback function is fed from the worker thread in order to push the data packages to the right output channels.

void Profiler::Iterate()
{
    if(callback)
    {
        // Drain at most one package per buffer, then re-enqueue ourselves.
        for(uint8 i = 0; i < PROFILER_MAX_THREADS; i++)
        {
            ProfilerBuffer& buffer = buffers[i];
            if(buffer.CanRead())
            {
                callback(*(buffer.Buffer() + buffer.ReadPointer()));
                buffer.ReadPointer(1);
            }
        }
    }
    ThreadPool::Enqueue(THREADPOOL_JOBCONTEXT_TYPE<void ()>::Create<&Iterate>());
}
void Profiler::Iterate(ThreadWorker* worker)
{
    Async::SetThreadLocalId(ThreadPool::GetWorkerId(worker));
    bool loop;
    do
    {
        loop = false;
        Async::Suspend(PROFILER_MAX_THREADS);
        if(callback)
        {
            for(uint8 i = 0; i < PROFILER_MAX_THREADS; i++)
            {
                ProfilerBuffer& buffer = buffers[i];
                if(buffer.CanRead())
                {
                    callback(*(buffer.Buffer() + buffer.ReadPointer()));
                    buffer.ReadPointer(1);
                    // Keep looping while any buffer still has data to drain.
                    loop |= buffer.CanRead();
                }
            }
        }
    }
    while(!ThreadPool::Signaled() || loop);
}

I decided on a callback solution because what I wanted was to send the packages via UDP to a destination machine and port. On that port, our C# profiling tool listened for the packages and displayed them in a nice graph that we could move around and zoom in/out of.
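The callback itself is then just a thin send; something like this sketch (WinSock, error handling and socket setup omitted; this is not our exact wire format):

#include <winsock2.h>
#pragma comment(lib, "ws2_32.lib")

// Sketch of a ProfileToDeviceCallback target: fire one ProfilerData package
// at the listening tool. Socket and address are created once elsewhere.
static SOCKET udpSocket;
static sockaddr_in toolAddress;  // the tool's IP and port

void SendToProfilerTool(ProfilerData& data)
{
    // For a sketch we send the struct as-is; a real protocol should pack
    // fields explicitly (endianness, pointer-sized members, the tag string).
    sendto(udpSocket, reinterpret_cast<const char*>(&data), sizeof(data), 0,
           reinterpret_cast<const sockaddr*>(&toolAddress), sizeof(toolAddress));
}

This would be registered via Profiler::ProfileCallback, so the worker thread drains the buffers straight onto the wire.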

@shaarigan Looks nice, I think. Have you made any special tools to inspect the output, and how did you present it? You mention a nice graph.

I find that depending on what the inspection concerns, and depending on the project, ideal visualisations tend to change a lot.
It's often nice to indicate some sort of nesting though; that one frame is part of another etc. It also makes sense to be able to compare profiles and frames side-by-side…

SuperVGA said:
Have you made any special tools to inspect the output, and how did you present it?

Some people use Chrome Browser to visualize profiling data. (random search: https://www.gamasutra.com/view/news/176420/Indepth_Using_Chrometracing_to_view_your_inline_profiling_data.php )
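The format it reads is just a JSON list of events; "ph" is the phase ("B" begins a block, "E" ends it) and "ts" is a timestamp in microseconds. A begin/end pair looks like this:

[
  { "name": "UpdatePhysics", "ph": "B", "pid": 1, "tid": 1, "ts": 1000 },
  { "name": "UpdatePhysics", "ph": "E", "pid": 1, "tid": 1, "ts": 1450 }
]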

Though, I prefer some bars on screen, to see the fluctuations at runtime.

JoeJ said:

SuperVGA said:
Have you made any special tools to inspect the output, and how did you present it?

Some people use Chrome Browser to visualize profiling data. (random search: https://www.gamasutra.com/view/news/176420/Indepth_Using_Chrometracing_to_view_your_inline_profiling_data.php )

That looks nice. At first I actually thought you meant using web stuff to visualize, which is something I actually do a lot; it's so easy to output some divs, a little svg and some text to convey something. But Chrome tracing seems interesting for sure…

Though, I prefer some bars on screen, to see the fluctuations at runtime.

That definitely has a lot of utility, and it also serves to indicate exactly where issues occur - if you manage to trigger them. But it can be difficult to use that to get an overview of, and to compare, various scenes, target machines and the general timeline of running a particular use case.

JoeJ said:
Some people use Chrome Browser to visualize profiling data

Sounds like something which fits into this talk?

https://deplinenoise.files.wordpress.com/2017/03/webtoolspostmortem.pdf

SuperVGA said:
Have you made any special tools to inspect the output, and how did you present it?

Yes, I wrote a quick (and mostly dirty) WinForms C# application which received the data packages via UDP from the running engine, added them to the thread graph they belong to, and ensured that every begin marker also had a corresponding end marker. The UI is a timeline which has a row for each thread we were profiling and looked somewhat similar to this (per row):

(This image is taken from Google as I don't have a proper screenshot of our tool yet.)

The entire timeline could be moved back and forth and zoomed in and out (increasing the time steps to be able to show smaller time frames as well), and it was live-updating whenever a complete section (begin and end marker present) arrived via UDP.

I haven't added a frame marker yet (which wouldn't be a problem to do), but I did add JSON export, so you can save profiles and later load them again for direct side-by-side comparison.

@shaarigan That looks very useful. I've never done flame diagrams, but I have done sequence stuff and multiple tracks illustrating parallel work. I'll consider your approach on my own project!

@shaarigan Thank you for sharing this! I'm sure I'll come back to it once I require a more advanced profiler, but for the time being I was able to identify my bottleneck via the micro-profiler package linked to above. However, I am stumped as to what could be causing it. Created a new thread here with more details so as not to stray from the primary topic of this one.

