Stage4 for Daniel from NVidia

Lulu's picture

I won't write much about Gentoo Linux. It is very customizable in everything: compile flags, which features of each package get built, which versions you run. You can mix old and new software when you wish to live on the bleeding edge with some things but never want to change others (for your own reasons).
(This is a targeted article, but it may be useful to somebody else as well.)

Sometimes you can run into problems, not only because everything is compiled on the machine with many possible configurations, but because of GCC optimization flags as well.
On the x86 architecture, most distro maintainers are bound by incompatibilities between processors: AMD and Intel, different generations, different instruction sets. So you find i486, i586 and i686 flavours, where i686 uses CMOV but none of the SSE features, and math in a binary distro is always done on the 387 FPU, not in SSE as on amd64. Another thing distro maintainers have to respect is debugging: some users want it, though most do not. On x86 one general-purpose register (EBP) is reserved to hold the frame pointer, and the code can be a lot faster when that register is free for runtime use, hence the flag -fomit-frame-pointer. (A CAVEAT: it will be enabled by default in GCC 4.6, because it is considered stable, so distro people may want to put -fno-omit-frame-pointer into their default flags, or change the GCC defaults before compiling GCC itself.)
Obviously, omitting the frame pointer in libraries makes debugging harder; most backtraces become almost useless:

Program terminated with signal 8, Arithmetic exception.
#0 0xb7773caf in ?? () from /lib/ld-linux.so.2
(gdb) bt
#0 0xb7773caf in ?? () from /lib/ld-linux.so.2
#1 0xb77741ea in ?? () from /lib/ld-linux.so.2
#2 0xb7775f4f in ?? () from /lib/ld-linux.so.2
#3 0xb777e526 in ?? () from /lib/ld-linux.so.2
#4 0xb7779910 in ?? () from /lib/ld-linux.so.2
#5 0xb6803b53 in ?? () from /lib/libdl.so.2

where normally they reveal at least the function in which the code failed, or even the source line if compiled with -g: great for debugging, useless for normal system use.

Nvidia ships its drivers in binary form; the source is not available. There is an open driver, nouveau, but it is still far from perfect, so the choice is the "blob" driver. It works well on most distributions, but the 260.xx.xx series (and the current 270 beta) introduced a very specific problem with... Gentoo. It happens often: it happened to me and to at least 2 other people on Gentoo x86. All of us use the "blob" driver and have decent CFLAGS in make.conf. Software configurations and kernel versions vary: kernels from 2.6.32 (LTS) to 2.6.37 (current), KDE and GNOME desktops, Glibc 2.11.2 to 2.13-r1. (I prefer eglibc 2.11.2 with the patchset from Debian sid over the main Glibc branch, which is often released unstable.)

The problem is believed to occur between the Glibc (or eGlibc) dynamic loader ld-linux.so.2 (as seen in the backtrace) and Nvidia's libGL.so.1 drop-in library. Anything using Mesa works fine (one of the workarounds is LD_LIBRARY_PATH=/usr/lib/opengl/xorg-x11/lib ./gimp ).
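The Mesa workaround can be wrapped in a small helper so it does not have to be retyped for every program. This is only a sketch: the function name run_with_mesa and the MESA_LIB_DIR variable are made up for illustration, and the library path is the one quoted above.

```shell
# Mesa libGL directory, as quoted in the article.
MESA_LIB_DIR=/usr/lib/opengl/xorg-x11/lib

# Hypothetical wrapper: run one command with Mesa's libGL.so.1 found first,
# keeping any LD_LIBRARY_PATH entries that were already set.
run_with_mesa() {
    env LD_LIBRARY_PATH="${MESA_LIB_DIR}${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" "$@"
}

# Example: run_with_mesa gimp
```

Using env keeps the change local to the launched program, so the rest of the session still uses the Nvidia libGL.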

The problem has been fixed in the 270.29 drivers.

Problem description:

The program doesn't start; the error message says Floating Point Exception.
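For checking programs from a script rather than by eye, the crash can be detected from the exit status. A minimal sketch, assuming a shell such as bash or dash that reports death by signal N as status 128 + N; the helper name died_of_fpe is invented here. Signal 8 (SIGFPE) then shows up as 136.

```shell
# Hypothetical helper (not from the article): run a command and succeed
# only if it was killed by SIGFPE, i.e. signal 8, exit status 128 + 8.
died_of_fpe() {
    "$@" >/dev/null 2>&1
    [ "$?" -eq 136 ]
}

# Example: died_of_fpe gimp && echo "gimp hit the FPE"
```

Note that a program that starts correctly keeps running until it is closed, so the helper only returns quickly for the crashing case.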

Known not-working programs:

1. GIMP 2.6 and 2.7 series
2. ccsm (Compiz control panel)
3. gnome-panel (duh! for GNOME users)
4. gajim (sometimes happens too...)

As you can see, no Qt programs are listed; all of them use Gtk+, directly or via PyGTK (which dlopen()s modules too).

Conditions for problem:

1. a non-generic, self-compiled Linux x86 system, notably Gentoo, but I believe it will happen with LFS too, or with anything else that has enough libraries recompiled

2. nvidia-drivers 260 series or later

3. CFLAGS set to at least -O2 -fomit-frame-pointer -mfpmath=sse, with an appropriate -march that allows SSE math, such as -march=pentium4 -msse2

4. GCC 4.5 series; 4.5.2 is the current stable release of this branch. I have not tested 4.4 and earlier, nor the newest 4.6, which is in its regression-fixing stage. GCC 4.5 has -mstackrealign turned on, and -fexcess-precision=fast is useful to revert to the old math behavior

5. Glibc/eGlibc compiled with the --enable-omitfp switch (on Gentoo, the USE flag glibc-omitfp)
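Conditions 3 and 5 translate into roughly the following Gentoo settings. This is a sketch built only from the flags named in the list; the actual make.conf on an affected machine will contain more than this.

```shell
# /etc/make.conf fragment (the file uses shell variable syntax).
# Only the flags named in condition 3; a real file may add others.
CFLAGS="-O2 -march=pentium4 -msse2 -mfpmath=sse -fomit-frame-pointer"
CXXFLAGS="${CFLAGS}"

# Condition 5 goes into /etc/portage/package.use as a single line:
#   sys-libs/glibc glibc-omitfp
```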

Workarounds:

1. a permanent solution: don't use omitfp for Glibc. Having just ld-linux.so.2 built without it is enough; the rest can still be omitfp

2. use the older nvidia-drivers 256.53; they do fine with the LTS kernel

3. set LD_LIBRARY_PATH or LD_PRELOAD to pick up libGL.so.1 from Mesa

4. start the program under a debugger or tracer: gdb, strace ...
When you try to debug the program, it works fine. How ironic...
If you get a core dump with ulimit -c unlimited and a normal program start, the backtrace only reveals a location somewhere in ld-linux.so and the FPE
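Workaround 1 comes down to switching the USE flag off for glibc and rebuilding it. A minimal sketch, assuming /etc/portage/package.use is a single file (on some systems it is a directory); the helper name disable_glibc_omitfp is invented for this example.

```shell
# Hypothetical helper: make sure package.use disables glibc-omitfp for glibc.
# The target file can be passed as the first argument; the default assumes
# package.use is a plain file. The line is added only if not already present.
disable_glibc_omitfp() {
    file=${1:-/etc/portage/package.use}
    line='sys-libs/glibc -glibc-omitfp'
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Afterwards glibc has to be rebuilt, for example:
#   emerge --oneshot sys-libs/glibc
```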

Hello Daniel!
Since you weren't able to reproduce the problem with an omitfp glibc, I assume the userland matters. Debian is just -march=i486, so here is a fat stage4 compiled for at least a Pentium 4 with SSE2.
I have an Acer Aspire with an Nvidia 9600 GS card (OEM); the motherboard is an MCP73 with the GeForce 7100/630i chipset
(02:00.0 VGA compatible controller: nVidia Corporation G94 [GeForce 9600 GS] (rev a1))

I'm not sure how to reproduce the problem in a chrooted environment, so I just boot that stage4 from grub with my usual kernel, which has the disk drivers (AHCI module) compiled in (I hate initrds) and ext3 fs
(00:0e.0 RAID bus controller: nVidia Corporation MCP73 SATA RAID Controller (rev a2))

You can use any kernel that works on your machine; just make sure you configure the boot loader and put the modules into lib/modules, including the nvidia.ko module, version 260.19.36

The stage4 has only the base services: dbus, DHCP networking for eth0, kdm
I've added an unprivileged user, nvidia, with password nvidia, since KDM won't let root log in.
The root password is nvidia
Available sessions: fluxbox (basic, can run a terminal, firefox and whatever else is compiled... the menu is spartan)
GNOME - goes into an infinite loading loop, because gnome-panel is FPE'ing
gajim - works
gimp - FPE
ccsm - FPE
gnome-panel - FPE

Compile settings (CFLAGS and USE) are in /etc/make.conf; I would recommend looking into /etc/portage/package.use as well. Removing the ffmpeg USE flag from gegl is known to sometimes (not always) cure the FPE when starting gimp. Good luck debugging the drivers!

Link for download (has been deleted to spare server traffic, archive has reached the recipient).

Additional notes:
The content of /usr/portage has been deleted; you can restore it by getting portage-latest.tar.lzma from any Gentoo mirror

Netster's picture

To me the easiest is to install Ubuntu, hahahaha, that's for the people who like it simple.

So lucky to be able to use so many different OSes in my lifetime, heheheh

Lulu's picture

Ubuntu x86 wasn't affected; it is not compiled with SSE math and omitfp. Only Gentoo can do that the easy way, or you can recompile any distro, even Ubuntu, with your own CFLAGS (the harder way)

Actually, this article was written for Daniel from NVidia Linux OEM support )
But because the 260 series of the driver will not be fixed, I'd better leave it available for SEO )
