Gentoo Forums
Issue with incorrect temperature thresholds for AMD r9 380
ZenoOfElea
n00b


Joined: 20 Jan 2017
Posts: 4

Posted: Sat Jan 21, 2017 12:24 am    Post subject: Issue with incorrect temperature thresholds for AMD r9 380

I recently installed an AMD R9 380 (Tonga Pro) graphics card and upgraded to the 4.8.17 gentoo-sources kernel. After some tinkering I was able to get lm-sensors / sysfs to expose the hwmon information for the card. However, I think I did not configure something correctly, as the threshold temperatures are populated as 0.0°C.

The only error message logged to dmesg is the following:
Code:
amdgpu 0000:02:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff

My concern is twofold. First, my card uses "passive cool" technology with a thermal trip that starts the fans once the GPU reaches a certain temperature, and I am concerned that if the threshold temperatures are invalid the card's fans might not trip. Second, I want to use sysfs to set the minimum fan level higher than 0%, and I want to make sure any script I write has the right hardware information.

I was not sure whether I had failed to build the modules for the I2C chips, so I compiled all of the AMD ones, but that did not help. I do not really know where to go from here.
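
A minimal sketch for checking what the driver actually exposes, assuming the card's hwmon directory is the one used later in this thread (/sys/class/drm/card0/device/hwmon/hwmon3; the hwmonN index differs between systems). The helper name show_temps is made up for illustration. hwmon reports temperatures in millidegrees Celsius, so the function divides by 1000. It only lists temp1_* files that exist, which helps tell a missing threshold from one that is present but reads 0.

```shell
#!/bin/sh
# show_temps DIR
# Print every readable temp1_* attribute in a hwmon directory.
# Numeric values are millidegrees Celsius and are shown in degrees C;
# anything else (for example a label) is printed unchanged.
# show_temps is a hypothetical helper name, not part of lm-sensors.
show_temps() {
    for st_f in "$1"/temp1_*; do
        [ -r "$st_f" ] || continue          # glob matched nothing, or unreadable
        st_raw=$(cat "$st_f")
        case $st_raw in
            ''|*[!0-9-]*)                   # not a plain integer
                printf '%s: %s\n' "${st_f##*/}" "$st_raw" ;;
            *)
                awk -v n="${st_f##*/}" -v r="$st_raw" \
                    'BEGIN { printf "%s: %.1f C (raw %s)\n", n, r / 1000, r }' ;;
        esac
    done
}
```

Usage would look like: show_temps /sys/class/drm/card0/device/hwmon/hwmon3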
Roman_Gruber
Advocate


Joined: 03 Oct 2006
Posts: 3806
Location: Austro Bavaria

Posted: Sat Jan 21, 2017 2:25 pm

Please try a more recent kernel, i.e. the current stable version from kernel.org. Thank you.

Those readout values may or may not be trustworthy.

Quote:
My concern is twofold. First, my card uses "passive cool" technology with a thermal trip that starts the fans once the GPU reaches a certain temperature, and I am concerned that if the threshold temperatures are invalid the card's fans might not trip. Second, I want to use sysfs to set the minimum fan level higher than 0%, and I want to make sure any script I write has the right hardware information.


Sounds like you played with the card's firmware, which you should not have done in the first place.

When the card overheats, it may throttle (again, that is handled by the device's firmware).

You could mount some additional fans and undervolt them so they are not as noisy.

Proper airflow takes some thought and decent preparation. A good case with sensible component placement helps too.

--

Such things are usually handled by the firmware.
The GPU BIOS the card shipped with should normally handle everything.

--

If you want a proper temperature reading, you can mount a temperature sensor with a data logger.
Hu
Moderator


Joined: 06 Mar 2007
Posts: 13607

Posted: Sat Jan 21, 2017 11:15 pm

Zeno: bogus sensor data reported to the user is unfortunately not that uncommon. This could be an issue with the kernel driver mishandling the data, or that the sensor's firmware is non-standard and requires a kernel quirk to compensate for the vendor's ideas of how to express the data.

Assuming that the pieces are all from the same vendor and that the issue is that the vendor expressed the data in a non-standard way, then the vendor's firmware is probably able to do the right thing regarding starting the fans when needed. If the fans are configured correctly by the vendor, they ought to start in plenty of time to prevent hardware damage, especially if you terminate the test quickly.

You will need an independent way to validate the actual temperature, since we cannot currently trust any of the data reported in software. I suggest checking the documented temperature tolerances before designing any tests. Find the temperature at which the fans are supposed to engage and, if possible, find the temperature at which the hardware could begin to suffer permanent damage.

Once you know how large a margin exists between those temperatures, we can design a test to see if the fans activate when they should. That test will likely involve generating GPU load in order to generate heat. We need to know how hot to make the card to activate the fans and we ought to know how long the card can safely withstand operating under load without fans.
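
The software side of such a test could be sketched as follows, assuming the card's hwmon directory exposes the standard temp1_input file (millidegrees Celsius). The function name log_temps and its arguments are hypothetical. It only records what the driver reports at fixed intervals, so that the moment the fans visibly start can be matched against the log and against an independent thermometer; it does not validate the readings by itself.

```shell
#!/bin/sh
# log_temps DIR COUNT INTERVAL
# Print COUNT samples of DIR/temp1_input, one every INTERVAL seconds, as
#   <nominal elapsed seconds> <raw millidegrees C>
# INTERVAL must be a whole number of seconds.
# log_temps is a hypothetical helper name used for illustration only.
log_temps() {
    lt_dir=$1
    lt_count=$2
    lt_interval=$3
    lt_i=0
    while [ "$lt_i" -lt "$lt_count" ]; do
        printf '%s %s\n' "$((lt_i * lt_interval))" "$(cat "$lt_dir/temp1_input")"
        lt_i=$((lt_i + 1))
        # Do not sleep after the final sample.
        if [ "$lt_i" -lt "$lt_count" ]; then
            sleep "$lt_interval"
        fi
    done
}
```

For example, log_temps /path/to/hwmonN 120 5 would take one sample every five seconds for roughly ten minutes while a GPU load is running; the path is a placeholder.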
ZenoOfElea
n00b


Joined: 20 Jan 2017
Posts: 4

Posted: Sun Jan 29, 2017 6:26 am

After doing some digging online, I found that the average temperature is around 65°C and that the Windows-based card management program sets the threshold at which the fans kick in to 66°C. After some stress benchmarking, going by lm_sensors and the raw temp1_input value, the fans did indeed kick on at 66°C. I am thinking that ASUS, the card's manufacturer, did not build a way to expose the temp1_crit and temp1_crit_hyst values into their card.

Since the fans kick on at roughly the moment the 66°C threshold is hit and cut off when the temperature drops to 65°C, I will go on the assumption that the temp1_input value is valid. However, if I leave my machine on with the card idling, the temperature still creeps up to 66°C, which seems way too high for an idle card.

Looking around the hwmon directory for my card, I found four other files (pwm1, pwm1_enable, pwm1_max, pwm1_min) and a directory of interest. Using https://www.kernel.org/doc/Documentation/hwmon/sysfs-interface as a guide to what the files mean, I have made some observations that I do not fully understand.

pwm1_max and pwm1_min are not listed in the kernel sysfs docs and are read-only, like temp1_input, so I am assuming these two are the static minimum and maximum values for pwm1. The kernel docs seem to agree, as the values match the minimum and maximum that pwm1 can have.

pwm1_enable and pwm1 are both read-write. The default value of pwm1_enable on my system seems to be 1 (manual control, according to the guide), with 0 meaning no fan control and fan at max, and values greater than 1 meaning automatic fan control. However, despite being a read-write file, pwm1_enable will not change from its default value of 1.

pwm1 is where I start to become confused, because the documentation says the file has a minimum value of 0 and a maximum value of 255, and that the value of the file is equal to the percentage of the fan speed (scaled so that 255 is 100%). The default value is 61 (~24%) despite the fans being off. I can change the value with the echo command, but the value I echo into the file is not always the value the file changes to. I have not played with it enough to be sure, but changing the file's value does affect the fan speed: setting a value above 140 has a noticeable impact on the sensor reading, and on visual inspection I can tell the fans are on.
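
To make the 0 to 255 scale concrete, here is a small conversion sketch. It assumes the range given in the kernel hwmon document linked above; the function name pwm_to_percent is invented for this example. The result is the requested duty cycle as a percentage, not a measured fan speed.

```shell
#!/bin/sh
# pwm_to_percent VALUE [MAX]
# Convert a raw pwm value to a duty-cycle percentage. MAX defaults to 255.
# pwm_to_percent is a hypothetical helper name.
pwm_to_percent() {
    awk -v v="$1" -v max="${2:-255}" 'BEGIN { printf "%.1f\n", v * 100 / max }'
}

pwm_to_percent 61    # prints 23.9
pwm_to_percent 145   # prints 56.9
pwm_to_percent 255   # prints 100.0
```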
Code:
sh -c 'echo 150 >  /sys/class/drm/card0/device/hwmon/hwmon3/pwm1'
seems to produce a value of 145 for my card but
Code:
sh -c 'echo 255 >  /sys/class/drm/card0/device/hwmon/hwmon3/pwm1'
produces a value of 255. I do not understand the logic.
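
One way to explore the mismatch more systematically is to write a value and immediately read it back. The sketch below assumes the same pwm1, pwm1_min and pwm1_max files described above; the function name set_pwm is made up for illustration. It refuses values outside pwm1_min..pwm1_max (falling back to 0..255 if those files are missing), writes the value, and prints "requested -> actual". On an ordinary file the two numbers always match; on the card the second number may differ, as with 150 and 145 above. Writing to the real sysfs file needs root.

```shell
#!/bin/sh
# set_pwm DIR VALUE
# Write VALUE to DIR/pwm1 and print "<requested> -> <value read back>".
# Returns 2 for invalid input and 1 if the write fails.
# set_pwm is a hypothetical helper name, not an existing tool.
set_pwm() {
    sp_dir=$1
    sp_want=$2
    sp_min=$(cat "$sp_dir/pwm1_min" 2>/dev/null || echo 0)
    sp_max=$(cat "$sp_dir/pwm1_max" 2>/dev/null || echo 255)
    case $sp_want in
        ''|*[!0-9]*)
            echo "set_pwm: not a non-negative integer: $sp_want" >&2
            return 2 ;;
    esac
    if [ "$sp_want" -lt "$sp_min" ] || [ "$sp_want" -gt "$sp_max" ]; then
        echo "set_pwm: $sp_want is outside $sp_min..$sp_max" >&2
        return 2
    fi
    echo "$sp_want" > "$sp_dir/pwm1" || return 1
    printf '%s -> %s\n' "$sp_want" "$(cat "$sp_dir/pwm1")"
}
```

Stepping through a few values, for example 100, 125, 150, 175 and 200, and noting each read-back would show whether the driver snaps to particular steps.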

The only other thing of interest in the hwmon directory is a sub-directory named power, containing five files that are not listed in the guide I linked above. When examined, they return the following:

autosuspend_delay_ms: cat: /sys/class/drm/card0/device/hwmon/hwmon3/power/runtime_status_time: No such file or directory
control: auto
runtime_active_time: 0
runtime_status: unsupported
runtime_suspended_time: 0

These might be vendor-specific, and since I do not know what they do, I am not messing with them.
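
The error message above names runtime_status_time, which is not one of the five files listed, so it may simply come from a mistyped path. Looping over the directory avoids typing each name by hand. This is a minimal sketch with an invented function name, dump_attrs. As far as I can tell, the five names match the kernel's generic runtime power-management attributes that appear under a device's power directory, which would mean they are not specific to this card.

```shell
#!/bin/sh
# dump_attrs DIR
# Print "<file name>: <contents>" for every regular file in DIR.
# Files that cannot be read are reported instead of aborting the loop.
# dump_attrs is a hypothetical helper name.
dump_attrs() {
    for da_f in "$1"/*; do
        [ -f "$da_f" ] || continue
        if da_val=$(cat "$da_f" 2>/dev/null); then
            printf '%s: %s\n' "${da_f##*/}" "$da_val"
        else
            printf '%s: <read error>\n' "${da_f##*/}"
        fi
    done
}
```

Usage would look like: dump_attrs /sys/class/drm/card0/device/hwmon/hwmon3/power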

Assuming I am correct that pwm1 is the value I need to manipulate to change my fan speed, are there any safety concerns for my card if I use echo to change the value of pwm1 after boot?
Roman_Gruber
Advocate


Joined: 03 Oct 2006
Posts: 3806
Location: Austro Bavaria

Posted: Sun Jan 29, 2017 12:39 pm

Quote:
Since the fans kick on at roughly the moment the 66°C threshold is hit and cut off when the temperature drops to 65°C, I will go on the assumption that the temp1_input value is valid. However, if I leave my machine on with the card idling, the temperature still creeps up to 66°C, which seems way too high for an idle card.


Nope.

As long as the "processor" does not throttle, it's fine. I am referring to any silicon in this case.

The only issue may be that some parts will degrade faster and need to be replaced earlier. Since we are talking about GPUs, the useful lifespan of the card is much shorter than the lifespan of its components, so it should not matter.

--

If you are worried, you can hardwire the fan or improve the cooling system, e.g. replace it with a better thermal component or improve the airflow. As mentioned, a decent case can improve things a lot.

--

Quote:
Looking around the hwmon directory for my card, I found four other files (pwm1, pwm1_enable, pwm1_max, pwm1_min) and a directory of interest. Using https://www.kernel.org/doc/Documentation/hwmon/sysfs-interface as a guide to what the files mean, I have made some observations that I do not fully understand. pwm1_max and pwm1_min are not listed in the kernel sysfs docs and are read-only, like temp1_input, so I am assuming these two are the static minimum and maximum values for pwm1. The kernel docs seem to agree, as the values match the minimum and maximum that pwm1 can have. pwm1_enable and pwm1 are both read-write. The default value of pwm1_enable on my system seems to be 1, with 0 meaning no fan control and fan at max, and values greater than 1 meaning automatic fan control. However, despite being a read-write file, pwm1_enable will not change from its default value of 1. pwm1 is where I start to become confused, because the documentation says the file has a minimum value of 0 and a maximum value of 255, and that the value of the file is equal to the percentage of the fan speed. The default value is 61 (~24%) despite the fans being off. I can change the value with the echo command, but the value I echo into the file is not always the value the file changes to. I have not played with it enough to be sure, but changing the file's value does affect the fan speed: setting a value above 140 has a noticeable impact on the sensor reading, and on visual inspection I can tell the fans are on.


You are looking at hardware registers.
Maybe they are wrongly coded and are actually read-only registers!

It depends on the firmware of every component involved and on the schematics, meaning every path the electrons flow through!
Those are the manufacturer's intellectual property and usually not revealed, so you won't be able to see whether these can be changed, or whether it is even possible.

Quote:
equal to the percentage of the fan speed


Nope, the fan's response to PWM is certainly not linear. Now we are in the analog world, not the digital one.

Quote:
Assuming I am correct that pwm1 is the value I need to manipulate to change my fan speed, are there any safety concerns for my card if I use echo to change the value of pwm1 after boot?


From an electronic perspective, nope.

You may overheat your card if you ruin the cooling; the processor (again, see above) may then throttle down to get back into the safe zone. These days marketing speak talks about a power target and a thermal target (budget), which just names things differently to make them sound fancy and complicated!

If you want to be on the safe side, hardwire the fan to max, or pull out the PWM, which usually makes the fan spin at max.

PWM should only be used when there is enough headroom. PWM is not as efficient as a hardwired fan.

Some people replace the stock cooling because it is limited.
ZenoOfElea
n00b


Joined: 20 Jan 2017
Posts: 4

Posted: Sun Jan 29, 2017 3:05 pm

Quote:

If you want to be on the safe side, hardwire the fan to max, or pull out the PWM, which usually makes the fan spin at max.

I am not exactly sure how to hardwire the fan to max. When I installed the card I did not see any jumpers or switches that would indicate a way to do so, and the cooling system is rather exotic, not a reference design.

Quote:

Sounds like you played with the card's firmware, which you should not have done in the first place.

Actually, I did not flash the card's firmware at all.

Quote:
You could mount some additional fans and undervolt them so they are not as noisy.

Proper airflow takes some thought and decent preparation. A good case with sensible component placement helps too.

As for case airflow, that really is not an issue: my CPU idles at around 28°C and maxes out at around 48°C. If I set the pwm register controlling the fan speed to a higher value than the default, my card idles at 34°C and goes up to about 55-60°C under full load.

Quote:

As long as the "processor" does not throttle, it's fine. I am referring to any silicon in this case.

If by throttle you mean the fan itself, it does if I do nothing after a fresh reboot, starting when the card hits 66°C and stopping at 64°C. However, if by throttle you mean the clock speed, I do not know why it would; I keep it at stock speed.