This post will evolve over time as more info is found:
Latest Updates
6/24 Custom PIT repartition workaround posted by Jaymoon.
You lose 8GB (some of which might be recoverable with extra work), and restoring with the original PIT will lock up your phone again (a scenario that could happen if you brought your phone back to Sprint for some unrelated problem). So if you have the opportunity to get your phone replaced at little or no cost, IMO that should be your primary option.
http://forum.xda-developers.com/showthread.php?p=27852689#post27852689
E4GT specific PIT file here (theoretically instead of losing 8GB, you'll only lose 2GB):
http://forum.xda-developers.com/showpost.php?p=28070569&postcount=654
6/8 Update for other platforms waiting for fix
Codeworkx's contact with Samsung got the following response [discussion]
Update 14:56 CEST:
Patches will be out in form of new official ROMs and also sourcecode releases after testing, which might take some time.
6/7 Update
Test plan posted - see bottom of post for results so far (esoteric68, krazy_smokezalot report success)
BIG THANKS to Esoteric68 (and robertm2011 before her) who took the plunge to benefit everyone else. She has completed the test plan and more. 6 flashes of CM9, 3 flashes of AOKP, 3 wipe data/factory resets, and 3 nandroid restores, 1 stock FF02 flash, all successful. We are ready to have more testers try out the test ROM installs. We are getting more confident the code analysis was correct.
6/2 Update
Less technical summary and preparation for new round of testing
5/31 Lots of discussion on the code path detailing how the problem occurs and where to put the workaround, select posts below
Call trace for CWM Recovery - wipe data/factory reset
Call trace for CWM Recovery - restore
Section of update-binary afflicted by same issue as wipe data/factory reset
Recap of where workarounds can be placed
MD5s of various update-binary executables
Pros/Cons of placing workaround in kernel vs libext4_utils.a
Are ICS nandroid backup/restores safe?
Are ICS recoveries safe?
Why do CM9/AOKP installs often brick in ICS but not in GB?
5/24 Update pretty much ties up all the loose ends - Thanks Mr. Sumrall, Garwynn, Entropy, and everyone else who pitched in!
http://forum.xda-developers.com/showthread.php?p=26521643#post26521643
Potentially very GOOD NEWS
It appears Sprint/Samsung tested the EMMC brick issue, confirmed the problem, and tested a fix that appears to resolve the problem:
http://forum.xda-developers.com/showthread.php?p=26465085#post26465085
thirdcoastraised said:
To clarify this...in testing done over the weekend, there was a small "subtest" group which consisted of 20 devices. This group was put together STRICTLY for the purpose of testing the emmc bug and fix. The devices were all programmed with the data known to have caused bricks when wiping. Of those 20, all but 6 also had the code patch to resolve that issue, so there was a possibility for 6 hard bricks, only 4 actually bricked, therefore, on the build currently being tested, the "emmc break issue" has been deemed "resolved".
We now have an update on why this bug is happening and which PRV/fwrevs are affected. PRV/fwrev 0x19 is susceptible to the EMMC /data corruption issue (which should now be referred to as the EMMC lockup issue). PRV/fwrev 0x25 has the fix for the lockup issue but has a separate data corruption issue (32KB of zeros inserted into storage), which is being patched in the kernel (our kernels don't have that patch). All these problems are in the EMMC firmware. The firmware can potentially be updated, but nothing is publicly available. The EMMC lockup issue is triggered when erasing the EMMC. The only piece we have not been able to explain is why GB-based kernels seem immune to the EMMC lockup problem whereas ICS seems more susceptible. Presumably both are doing ERASE commands, but possibly in slightly different ways. See these posts for more details [#1 / #2]
To get your PRV/fwrev, you can use this (if you have busybox installed):
$ su
# cd /sys/class/block/mmcblk0/device
# cat cid | cut -b 19,20
19
If you don't have busybox installed, just visually parse the line: find the serial # (0xd3f24fe6 - example only - yours will be different) within the cid, and look at the 2 hex digits immediately before the serial #.
$ su
# cd /sys/class/block/mmcblk0/device
# cat serial cid
0xd3f24fe6
1501004d414734464119d3f24fe68e8b
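If you would rather not count characters by hand, here is a minimal Python sketch of the same lookup. It assumes the 32-character cid string follows the usual eMMC CID byte layout (manufacturer ID, two ID bytes, 6-character product name, revision byte, 4-byte serial, date byte, checksum) and that the date byte uses a 1997 base year, which is consistent with the example values in this post. The names `EmmcCid` and `parse_cid` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class EmmcCid:
    manfid: int   # manufacturer ID (first byte)
    name: str     # 6-character product name
    prv: int      # revision byte, the "PRV/fwrev" discussed above
    serial: int   # 32-bit serial number
    month: int    # manufacturing month
    year: int     # manufacturing year (assumed base year 1997)


def parse_cid(cid_hex: str) -> EmmcCid:
    """Split a 32-hex-character cid string into its fields (assumed layout)."""
    cid_hex = cid_hex.strip()
    if len(cid_hex) != 32:
        raise ValueError("expected 32 hex characters")
    raw = bytes.fromhex(cid_hex)
    mdt = raw[14]  # date byte: high nibble = month, low nibble = year offset
    return EmmcCid(
        manfid=raw[0],
        name=raw[3:9].decode("ascii"),
        prv=raw[9],  # same two characters that `cut -b 19,20` prints
        serial=int.from_bytes(raw[10:14], "big"),
        month=mdt >> 4,
        year=1997 + (mdt & 0x0F),
    )


if __name__ == "__main__":
    cid = parse_cid("1501004d414734464119d3f24fe68e8b")
    print(cid.name, hex(cid.prv), hex(cid.serial), f"{cid.month:02d}/{cid.year}")
```

For the example cid above this gives MAG4FA, 0x19, 0xd3f24fe6 and 08/2011, which lines up with the name, serial and date the kernel reports for the same device.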
It appears after looking at the code more closely and examining the results of the card info dumps, we do not have this fix in our kernel. It isn't clear whether the fix would resolve our /data EMMC brick issues, but the point is moot right now because we don't have the fix.
Possible BRICK here. Please do NOT do any more testing until further notice. Please do NOT use Wipe Data/Factory Reset. It is the main difference between the first and second rounds of testing and is the current suspect.
FE10 repacks added to Resource section
Esoteric68, azyouthinkeyeiz, and robertm2011 are testing flashing different ROMs with FE07/FE10 repacked with unlocked recovery. We all owe them our thanks for risking their phones to help the community (taking one for the team). No bricks so far.
Separately we are still discussing whether the fix Samsung checked in will get applied to our phone. No firm conclusions yet. Even if it doesn't apply, the hope is the data we get from testing will help us produce more flexible "safe" flashing practices.
Please do NOT test CWM Touch for now. We want to isolate just the FE07 kernel and unlocked stock recovery before introducing new variables.
Executive Summary
Garwynn has found a recent checkin from Samsung in the kernel code handling EMMC memory that fixes a data corruption problem. It is possible this might fix the /data EMMC corruption we have been seeing, but we aren't sure if it is fixing the same problem. The first release to include that checked in code is FE07. There has been some communication with the developers in charge of that area to gather further info.
This thread's purpose is to foster discussion on the issue and to determine if the potential fix actually does fix our issue. Even if the fix doesn't address the issue, it is hoped in the process we are able to gather more info into specific "safe" and "unsafe" scenarios.
Please do NOT jump ahead and think it is fixed. It is TOO EARLY to make that claim.
Background
As many of you are aware since ICS has come out, there has been a nagging issue where in some situations flashing ROMs with an ICS-based kernel and custom recovery has left the phone with EMMC corruption. This EMMC corruption is so far non-recoverable, even with JTAG bit blasting, which should bypass all but hardware issues.
This problem is NOT limited to the Epic 4G Touch. Other GS2 models as well as the Galaxy Note are experiencing the same thing, as can be seen in this Public Service Announcement in the Galaxy Note section.
The problem first cropped up when people used ROM Manager to temporarily "fake" flash CWM Touch onto an ICS-based kernel for their flashing needs. In particular, wipe data/factory reset often seemed to trigger the /data EMMC corruption. However, we later found it wasn't limited to CWM Touch and temporary flashing, as CWM repacks with the ICS-based kernel also exhibited that behavior, albeit not as often.
Even more frustrating is that this bug is not always deterministic, in that you could do some operation 3 times and have it work fine, then on the 4th, trigger the /data EMMC corruption.
Complicating the testing/debugging is the issue that once the problem is triggered, your phone is basically not recoverable. You can try and ODIN a stock ROM on top which will basically work for all the components except the /data partition. Once it reaches the /data partition, ODIN will hang. Similarly if you try and wipe data/factory reset, it will hang or timeout after a while. Attempts to repartition and reformat using ODIN have not changed this behavior. Attempts to edit the partition info manually have not been successful. JTAG bit blasting has not been successful.
You can read about the past experiences in the Stuck at "Data.img" thru odin thread. By the time you get to ODIN, the damage to /data EMMC is already done. ODIN is NOT causing the damage. ODIN is hanging on data.img because the hardware won't let it write successfully to that area of EMMC.
This has led to many custom ROMs giving special procedures to go back to a GB-based kernel repacked with CWM recovery to do all your flashing (EL26+CWM). It is also the motivation for the How Not To Brick Your E4GT thread.
Details
The code checkin that has piqued our interest concerns data corruption caused by a problem in the wear-leveling firmware code of the emmc. This is low-level code that runs on a processor in the emmc module. It basically tries to spread out the data writes evenly so that no one section of emmc memory gets worn out prematurely. This code apparently can corrupt data by writing 32KB of incorrect data in some situations.
https://bitbucket.org/franciscofranco/android-tuna-omap/changeset/cea631bdac53
The code appears to restrict the firmware fix to only certain "affected" emmc modules. Also it is not able to persistently/permanently patch the firmware so this code must run at each startup. The following modules were identified in the code:
Name: VYL00M
HwRev: 0x0
FwRev: 0x25
Name: KYL00M
HwRev: 0x0
FwRev: 0x25
Name: MAG4FA
HwRev: 0x0
FwRev: 0x25
Unfortunately during ad-hoc polling we have found a case of an EMMC /data bricked phone with fwrev 0x0, so either we are not understanding what Samsung's fix is doing or they may not have addressed the full scope of the problem. Do NOT assume if your fwrev is 0x0 you are safe.
At this point, this does NOT mean the fix is not applicable. We might be looking at the wrong data. The kernel might not be exporting the data to us. The fix might need to be expanded to more modules. The fix could be for something else entirely but we might be able to avoid the bug anyway using stock recovery.
To determine what version you have (keep in mind we are at the preliminary stage, so this info might not be the right info to collect or could be meaningless for the /data EMMC corruption issue):
$ su
# cd /sys/class/block/mmcblk0/device
# cat name hwrev fwrev manfid oemid date type serial cid
MAG4FA
0x0
0x0
0x000015
0x0100
08/2011
MMC
0xd3f24fe6
1501004d414734464119d3f24fe68e8b
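For anyone who wants to collect the same nine values in one go, here is a rough Python sketch. It assumes Python is available wherever the files can be read (for example, after copying them off the phone into a folder), and it copes with attributes the kernel does not export. The helper names `read_emmc_info` and `format_report` are hypothetical, not part of any tool mentioned in this thread.

```python
from pathlib import Path

# Attribute file names, in the same order as the `cat` command above.
FIELDS = ["name", "hwrev", "fwrev", "manfid", "oemid", "date", "type", "serial", "cid"]


def read_emmc_info(device_dir="/sys/class/block/mmcblk0/device"):
    """Return {attribute: value}, with None for attributes that cannot be read."""
    info = {}
    for field in FIELDS:
        try:
            info[field] = (Path(device_dir) / field).read_text().strip()
        except OSError:
            # Missing directory, missing attribute, or no permission.
            info[field] = None
    return info


def format_report(info):
    """Render the values as labeled lines that are easy to paste into a post."""
    return "\n".join(
        f"{key}: {value if value is not None else '(not exported)'}"
        for key, value in info.items()
    )


if __name__ == "__main__":
    print(format_report(read_emmc_info()))
```

Labeling each value makes it obvious when a line is missing from a report, which is easy to overlook with a bare `cat` listing.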
The comments for the code checkin give the following info:
/*
* There is a bug in some Samsung emmc chips where the wear leveling
* code can insert 32 Kbytes of zeros into the storage. We can patch
* the firmware in such chips each time they are powered on to prevent
* the bug from occurring. Only apply this patch to a particular
* revision of the firmware of the specified chips. Date doesn't
* matter, so include all possible dates in min and max fields.
*/
The critical piece of code appears to be the following:
Code:
/* set value 0x000000FF : It's hidden data
 * When in vendor command mode, the erase command is used to
 * patch the firmware in the internal sram.
 */
err = mmc_movi_erase_cmd(card, 0x0004DD9C, 0x000000FF);
if (err) {
    pr_err("Fail to Set WL value1\n");
    goto err_set_wl;
}
/* set value 0xD20228FF : It's hidden data */
err = mmc_movi_erase_cmd(card, 0x000379A4, 0xD20228FF);
if (err) {
    pr_err("Fail to Set WL value2\n");
    goto err_set_wl;
}
Action items
At this point we would like to
1) gather more info on which emmc modules folks have and see if we can detect any patterns. If you can, please post your EMMC info and optionally mention whether you are able to do testing (presumably because you have a way to replace your phone if it is damaged)
2) solicit one volunteer to try different flashing scenarios using the unlocked stock recovery and FE07 kernel repack. (bigpeng indicated earlier he would be willing to do this for the community, but that was before the fwrev info, so he might have had a false sense of security. No pressure on him if he has changed his mind.)
If we find that the volunteer does not see any corruption despite trying to do so, then we can expand testing to a few more people and also work on getting CWM repacks.
If the volunteer hits the bug, then we will know the issue is still there even with stock recovery and FE07 kernel.
Keep in mind, at some point someone will need to take one for the team or we will be forever in fear of bricking our phones using ICS-based kernels.
Resources
1) FE07-based repacks
Unlocked Recovery Only [update.zip / tar]
Plus (unlocked recovery, init.d, adb-root) [update.zip / tar]
2) FE10-based repacks
Unlocked Recovery Only [update.zip / tar]
Plus (unlocked recovery, init.d, adb-root) [update.zip / tar]
3) JEDEC eMMC documentation
Related threads
Galaxy Note CID investigation thread
Good job sfhub. I am learning new stuff everyday
Sent from my SPH-D710 using xda premium
This is mine. I can try to help but not till weekend when my other phone gets here.
MAG4FA
0x0
0x0
0x000015
0x0100
11/2011
MMC
Sent from my SPH-D710 using Tapatalk 2
Sorry I don't have anything majorly different than the normal, but I have this on my phone:
MAG4FA
0x0
0x0
0x000015
0x0100
08/2011
MMC
Mine is also the same,
MAG4FA
0x0
0x0
0x000015
0x0100
08/2011
MMC
MAG4FA
0x0
0x0
0x000015
0x0100
08/2011
MMC
I am not available to test. Sorry.
MAG4FA
0x0
0x0
0x000015
0x0100
08/2011
MMC
Also unable to risk the brick. Good luck guys.
MAG4FA
0x0
0x0
0x000015
0x0100
10/2011
MMC
I'll take one for the team if needed. I've been eyeballing the 720p Evo.
Full readings from my /data bricked device. Let me know if you want me to check anything else out:
MAG4FA
0x0
0x0
0x000015
0x0100
08/2011
MMC
I can risk a brick to save future ones. I can help test.
/*
MAG4FA
0x0
0x000015
0x0100
12/2011
MMC*/
Sorry for the delay on giving more details. I've got info that I'll be passing along soon, just want to read up on something a little more before I post it out here.
Regardless whether this fix turns out to be related to the issue we've been looking at, I want to throw this in now before getting into the weeds:
Big thanks to Ken Sumrall from the Android team as he's been good enough to share info with us about this bugfix. As the person who signed off on the change commit he's one of the best resources on this issue. Also want to give credit to Mr. Min of Samsung who developed the fix in question.
I'll be posting more shortly. Those with dev experience, particularly with C++ and/or Assembly may be able to help us as well.
MAG4FA
0x0
0x0
0x000015
0x0100
11/2011
MMC
If need be, will test.
Source Notes - Part 1
(Please feel free to skip if you're not interested in the programming)
This section is just to document what models and versions are affected.
I'm posting this in rather lengthy detail in part for peer review and also as I misread this the first time.
If you want to see the results of this documentation you can skip to the bottom.
Again, you'll need the link to the change:
https://bitbucket.org/franciscofranco/android-tuna-omap/changeset/cea631bdac53
If I look at this part of code alone:
Code:
cid_rev(0, 0x25, 1997, 1)
...this tells me to look for a definition of cid_rev. So I do and get here:
Code:
#define cid_rev(hwrev, fwrev, year, month) \
(((u64) hwrev) << 40 | \
((u64) fwrev) << 32 | \
((u64) year) << 16 | \
((u64) month))
It's not included in this change as this was introduced previously.
But you can see the definition here:
https://bitbucket.org/franciscofranco/android-tuna-omap/src/388ae9aa9b26/include/linux/mmc/card.h
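To make the bit packing easier to see, here is a small Python mirror of the macro. It is only a sketch of the arithmetic (Python integers stand in for the u64 casts) and not kernel code. The sample date of 8/2011 with fwrev 0x25 is a made-up card used purely for illustration.

```python
def cid_rev(hwrev: int, fwrev: int, year: int, month: int) -> int:
    """Pack the four fields the same way the cid_rev() macro does."""
    return (hwrev << 40) | (fwrev << 32) | (year << 16) | month


# The two values produced by the calls that appear in the change.
low = cid_rev(0, 0x25, 1997, 1)
high = cid_rev(0, 0x25, 2012, 12)

# Hypothetical card: same hwrev/fwrev, manufactured 8/2011.
sample = cid_rev(0, 0x25, 2011, 8)

if __name__ == "__main__":
    print(hex(low), hex(high))
    # hwrev and fwrev sit above year and month, so with hwrev/fwrev held
    # equal only the date decides where a value falls between low and high.
    print(low <= sample <= high)
```

Because year and month occupy the lowest bits, two packed values that share hwrev and fwrev differ only in their date portion.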
OK, so I should look for HW revision 0x0, FW revision 0x25, right?
Nope. This was nested in another function and I didn't look at that right:
Code:
MMC_FIXUP_REV("VYL00M", 0x15, CID_OEMID_ANY,
              cid_rev(0, 0x25, 1997, 1), cid_rev(0, 0x25, 2012, 12),
              add_quirk_mmc, MMC_QUIRK_SAMSUNG_WL_PATCH),
MMC_FIXUP_REV("KYL00M", 0x15, CID_OEMID_ANY,
              cid_rev(0, 0x25, 1997, 1), cid_rev(0, 0x25, 2012, 12),
              add_quirk_mmc, MMC_QUIRK_SAMSUNG_WL_PATCH),
MMC_FIXUP_REV("MAG4FA", 0x15, CID_OEMID_ANY,
              cid_rev(0, 0x25, 1997, 1), cid_rev(0, 0x25, 2012, 12),
              add_quirk_mmc, MMC_QUIRK_SAMSUNG_WL_PATCH),
OK, so back to the card.h file to get my definition of MMC_FIXUP_REV:
Code:
#define MMC_FIXUP_REV(_name, _manfid, _oemid, _rev_start, _rev_end, \
_fixup, _data) \
_FIXUP_EXT(_name, _manfid, \
_oemid, _rev_start, _rev_end, \
SDIO_ANY_ID, SDIO_ANY_ID, \
_fixup, _data) \
I also want to look at _FIXUP_EXT next:
Code:
#define _FIXUP_EXT(_name, _manfid, _oemid, _rev_start, _rev_end, \
_cis_vendor, _cis_device, \
_fixup, _data) \
{ \
.name = (_name), \
.manfid = (_manfid), \
.oemid = (_oemid), \
.rev_start = (_rev_start), \
.rev_end = (_rev_end), \
.cis_vendor = (_cis_vendor), \
.cis_device = (_cis_device), \
.vendor_fixup = (_fixup), \
.data = (_data), \
}
So to properly identify what is affected:
Model Name (.name) -> VYL00M, KYL00M or MAG4FA
Manufacturer ID (.manfid) -> 0x15
OEM ID: Any ID (easy extrapolation of CID_OEMID_ANY)
Revision Start (Range): The result of cid_rev(0, 0x25, 1997, 1). The date indicates a low limit value.
Revision End (Range): The result of cid_rev(0, 0x25, 2012, 12). The date indicates a high limit value.
Fixup: Function to call to add the fixup - in this case, add_quirk_mmc (card.h linked above)
Data: MMC_QUIRK_SAMSUNG_WL_PATCH (Not 100% sure, but it looks like a label to me. I can't find a definition in the change.)
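The field list above can be sketched as a small lookup in Python. Big caveat: the matching rule used here (name and manufacturer ID must be equal, OEM ID may be a wildcard, and the packed revision must fall between rev_start and rev_end) is my assumption about how such a table is consumed. The excerpts quoted in this post do not show the matching loop itself. `Fixup`, `Card` and `fixup_applies` are illustrative names, and the sample cards are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

CID_OEMID_ANY = None  # stand-in for the kernel's "any OEM ID" wildcard


def cid_rev(hwrev: int, fwrev: int, year: int, month: int) -> int:
    """Same packing as the cid_rev() macro quoted earlier."""
    return (hwrev << 40) | (fwrev << 32) | (year << 16) | month


@dataclass
class Fixup:
    name: str
    manfid: int
    oemid: Optional[int]
    rev_start: int
    rev_end: int


@dataclass
class Card:
    name: str
    manfid: int
    oemid: int
    hwrev: int
    fwrev: int
    year: int
    month: int


# The three table entries from the change, expressed as data.
FIXUPS = [
    Fixup(name, 0x15, CID_OEMID_ANY,
          cid_rev(0, 0x25, 1997, 1), cid_rev(0, 0x25, 2012, 12))
    for name in ("VYL00M", "KYL00M", "MAG4FA")
]


def fixup_applies(fixup: Fixup, card: Card) -> bool:
    """Assumed rule: identity fields match and the revision is in range."""
    rev = cid_rev(card.hwrev, card.fwrev, card.year, card.month)
    return (
        fixup.name == card.name
        and fixup.manfid == card.manfid
        and (fixup.oemid is CID_OEMID_ANY or fixup.oemid == card.oemid)
        and fixup.rev_start <= rev <= fixup.rev_end
    )


if __name__ == "__main__":
    # Hypothetical card whose fields all line up with the MAG4FA entry.
    card = Card("MAG4FA", 0x15, 0x0100, 0, 0x25, 2011, 8)
    print(any(fixup_applies(f, card) for f in FIXUPS))
```

Under this assumed rule hwrev and fwrev take part in the range comparison along with the date, so it is worth plugging in the values your own card reports rather than relying on the model name alone.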
Note:
The comment right above the affected models says: "Date doesn't matter, so include all possible dates in min and max fields." I misread how they were getting the low and high limits of the range.
The corrected list of affected eMMCs is VYL00M, KYL00M or MAG4FA with Manufacturer ID 0x15.
I would like to apologize for providing inaccurate info the first time; after going through the code another time I'm fairly certain the correction to the affected model list is accurate.
This also confirms that those who have posted are in the affected list, which we knew but couldn't confirm until now.
So does this mean the new code in the kernel is to help this problem and what would be the steps to testing it?
Sorry if dumb questions, just trying to learn.
So theoretically, those of us who have posted are able to make full use of the ability to flash, backup, etc. with the proper modification to our kernels? (Hope I got this right.)
Azrael.arach said:
So does this mean the new code in the kernel is to help this problem and what would be the steps to testing it?
Sorry if dumb questions, just trying to learn.
I'm of the thought that 99% of all questions are never dumb. And that 1% is extremely rare.
This thread is a kind of peer review and discussion of what we've found, so that as a community we can possibly confirm whether this is the bug causing the bricks (and, by doing so, confirm that this is the fix).
robertm2011 said:
So theoretically, those of us who have posted are able to make full use of the ability to flash, backup, etc. with the proper modification to our kernels? (Hope I got this right.)
It means that the fix applies to those devices. What still isn't closed is whether the bug that this squishes is *the* bug (causing the ICS based bricks). I'm going to be posting more about that part here shortly for feedback and discussion.
Very interesting but so far over my head....
/* Missed the first paragraph of the details section. Low level corruption would do what I was questioning. Nvm
Discussion with Android Team
OK, now on to the bug that this fixes. This post will only contain the discussions between myself and Mr. Sumrall of the Android team.
Initial inquiry to Mr. Sumrall:
Garwynn said:
1) Was the bug that this patched causing the eMMC failures on Samsung devices using a 3.0+ kernel?
2) If #1 is yes, is it known whether this corrects the I/O errors already experienced? Or is this perhaps preventative in nature?
Initial Response:
Ken Sumrall - Android Team said:
The bug was in the emmc firmware which ran on a small microprocessor inside the emmc chip, and it didn't matter what kernel was running on the device to which it was attached. However, it may be the case that a particular kernel version was more likely to trigger the bug.
With this patch, the bug is worked around, and the emmc chip should no longer corrupt data.
We also knew this from the code:
Code:
* There is a bug in some Samsung emmc chips where the wear leveling
* code can insert 32 Kbytes of zeros into the storage. We can patch
* the firmware in such chips each time they are powered on to prevent
* the bug from occurring.
Note: Snipped last part of comment as it is already covered.
OK, so it's potentially putting zeros in the storage, but it doesn't give us any clues as to where in storage this happens or how it could corrupt the filesystem. So I sent some follow-up questions and got the following responses. (Each numbered question is mine; Mr. Sumrall's response follows it.)
1) Can this release fix a device where the bug has already been triggered (resulting in I/O error)?
No. The 32 Kbytes of zeros have already been inserted into the filesystem (usually in a particularly bad place, like the inode or block bitmaps, or the inode table) and the filesystem is now corrupt.
2) What would happen if a device is rolled back to a previous kernel - one without this fix? Would it be exposed again to the bug?
Yes, the corruption could happen on older kernels. The fix doesn't permanently fix the firmware, it patches the firmware every time the device is powered on (initial power-on, wakeup from sleep).
The next question was one that has been bugging me so I figured it wouldn't hurt to learn more about this bug. Sorry if this rubs some people the wrong way.
3) The explanation in the bugfix mentions 32 KB of zeros being added to storage. But I can't see this causing an I/O error unless it was doing this in the storage containing the instruction set. Was this somehow corrupting the I/O instruction set contained within the firmware? I have spent several weeks defending the opinion that this was not a hardware failure but software-based.
When the ext4 filesystem detects an error, and the filesystem is set to panic or re-mount read-only on error, the function ext4_handle_error() will record an EIO in the journal:
Code:
static void ext4_handle_error(struct super_block *sb)
{
    if (sb->s_flags & MS_RDONLY)
        return;
    if (!test_opt(sb, ERRORS_CONT)) {
        journal_t *journal = EXT4_SB(sb)->s_journal;
        EXT4_SB(sb)->s_mount_flags |= EXT4_MF_FS_ABORTED;
        if (journal)
            jbd2_journal_abort(journal, -EIO);
    }
    .
    .
    .
    .
}
So it doesn't have to be an actual low-level IO error to cause EIO to be recorded in the journal.
As for hardware vs. software failure, it is a bug in the firmware of the emmc chip, and this kernel patch enables a work-around to prevent the problem from happening.
Thanks again to Mr. Sumrall for this information. More soon.
Here's what I have - hope it helps. I would help test, but not until after the weekend. Thanks for all the work and info.
MAG4FA
0x0
0x0
0x000015
0x0100
10/2011
MMC