Pages in topic: [1 2] > | How to have Trados concentrate on relevant text rather than tag material for finding matches Thread poster: Alexandre Oberlin
Hi all, Apparently tags are considered just like any text in Trados matching algorithms. When translating heavily tagged (DTP) documents, Trados will give a better score to TUs having similar tags, while under-scoring or even discarding the ones where the relevant text is the same but the tags differ more. A particular case is when you change a translation in a new project using the same TM. You might not be able to retrieve your changes later when the exact same phrase comes up with even slightly different tags. Actually you might well have a 100% match showing your older option, which you wanted to override everywhere in the new projects. If this is still fresh in your memory, you will remember that you changed it and do a concordance search to find the new phrasing. If you don't, or if you are not the person who decided to change the translation, you won't be able to change the phrasing consistently. I find this very annoying, but I did not find how to change that behavior. The penalty settings do not seem to have much effect on this, though the project attributes are typically different. Some other translation tools show the fuzzy matches even when a 100% match is found, which does help, but Trados seems to consider that if a 100% match is found then all issues are solved altogether... There *must* be someone who already experienced this! Cheers, Alexandre Oberlin
[Edited at 2010-10-27 19:39 GMT]

Big Trados swindle... | Oct 27, 2010
Alexandre Oberlin wrote: Apparently tags are considered just like any text in Trados matching algorithms.

You're wrong. The tag weight is TWO times more important than the weight of a "human" word. It's a well-known (?) Trados swindle. In some cases (especially in short sentences) you may receive "matches" with no matching words... It was discussed here, just google it.

Alexandre Oberlin wrote: I find this very annoying but I did not find how to change that behavior.

You can't do it. It's hardcoded in the algorithm.

Alexandre Oberlin wrote: There *must* be someone who already experienced this!

Yep. That's why I switched to DV many years ago. Cheers GG

Can we look at this sensibly? | Oct 27, 2010
Hi Grzegorz, A strong choice of words, but I think it might be fairer to look at some specific examples so we can try to explain the logic. Tags trigger penalties, while words are counted relative to the segment length. In short segments the tags may (relatively) outweigh the word-based score reductions, but specific examples would help. I don't think we'd call it a swindle! Regards Paul
The root of all evil... | Oct 27, 2010 |
Hello Grzegorz, Thank you for today's sermon, it's always a pleasure. I'll gather up your examples, they are really useful, and will use them to help shape a discussion internally on the matching algorithms. We have a single algorithm which is optimized to deliver appropriate scores in most situations, and this is more likely the reason why most users don't complain about it. It avoids the too-low scores of "naïve scoring", and avoids the too-high scores of "direct dice scoring", but still takes (the amount of) differences in punctuation, tags, and whitespace into account. The common recommendation, as you know, is that if users feel they get too few matches, or "TM silence", they can lower the minscore. If they get too much noise, they can increase the minscore. Obviously, lowering the minscore may lead to more noise (i.e. reduced precision), while increasing it may lead to silence (i.e. reduced recall). This is an inherent trade-off in all information retrieval systems. Like any scoring algorithm, users will sometimes "feel" that it's too high, and sometimes they will "feel" it to be too low. This is an "emotional quality", though, which is difficult to capture in an algorithm, particularly as it also depends on the case at hand. There are other times when you might find particular examples (not made-up ones to prove a point) where it clearly doesn't seem useful at all.
Practically speaking it may not be worthy of too much attention, as many of the links you quote attempt to suggest (in between your lengthy sermons) that overall productivity is the main criterion. But I do take your point, and will use the information in these posts as a discussion point on what could be improved without degrading the overall perception. Regards Paul

Just a simple test | Oct 28, 2010
I wanted to take a closer look at this as mentioned below, so first of all took a simple example to see how short and long sentences are handled and how simple tags are handled in various tools. I created a Word document like this: the tags in segments #3 and #4 are simple formatting tags, and the tags in segments #5 and #6 are bookmarks. Then I just made up some text to extend the sentences for #7 through to #12. I then opened the document in one of the tools I tested, translated segments #1 and #7, and confirmed them to the Translation Memory. Then I simply looked at the matching to see, out of interest, how a few of the desktop tools we see mentioned in this forum performed. The results were these: I'll leave you to draw your own conclusions, but I think it's clear from this example that Studio is not performing as you stated and only applies simple penalties for tags, in the same way as all the rest. So for example in segment #3 we see a pair of formatting tags. This is a matching pair, so we apply a single penalty point. In segment #5 the bookmark tags are two different tags (start and end), so we apply two penalty points. I'm also aware this is a very simplified example, as are so many of your examples, and we could probably dream up more to make any tool look bad in a particular situation. But I think the message should be that everything is explainable once we have clear examples of what the material is, and we are happy to take off-forum any examples that are cause for real concern. Then we can make a reasoned decision on whether we should be changing anything or not.
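The penalty-point arithmetic described above (one point for a differing matched formatting pair, one point each for distinct start/end tags such as bookmarks) can be sketched roughly as follows. To be clear, this is a hypothetical illustration of the behaviour described in this thread, not SDL's actual implementation; the tag representation and function names are my own assumptions.

```python
from collections import Counter

def tag_penalty(tm_tags, seg_tags):
    """Count tag units present in one segment but not the other.

    A matched formatting pair is passed as a single unit (e.g. "<b></b>"),
    while distinct start/end tags such as bookmarks are separate units.
    """
    tm, seg = Counter(tm_tags), Counter(seg_tags)
    # Multiset symmetric difference: tags missing on either side
    return sum(((tm - seg) + (seg - tm)).values())

def score_with_tags(text_score, tm_tags, seg_tags):
    """Subtract one percentage point per differing tag unit."""
    return max(0, text_score - tag_penalty(tm_tags, seg_tags))

# Segment #3: TM match has no tags, segment adds one formatting pair -> 99
seg3 = score_with_tags(100, [], ["<b></b>"])
# Segment #5: bookmark start and end are two different tags -> 98
seg5 = score_with_tags(100, [], ["<bkmk>", "</bkmk>"])
```

Under these assumptions the flat per-tag penalty ignores segment length entirely, which is exactly the point of contention later in the thread.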
Unless it's a frequent occurrence, it could be more costly to investigate and fix safely than it would be to complete the correct translation and move on. I think changes in this area carry a lot of risk and need a lot of testing to ensure that we don't fix the few at the expense of the many. In the meantime, we will look at the real examples we have from these threads as promised. On a final note, I thought I'd post the editor screenshots for interest so you can see I'm not making them up (apart from our own legacy products, as you've probably seen these before). For memoQ and DVX I had to take the scores from a different window as they didn't show up in the same pane as I worked, so you can't see them in here. Maybe an expert user would make a better job of that, but I'm sure of the matching. [Screenshots: Studio, memoQ, Wordfast, DVX] Regards Paul

Levenshtein distance | Oct 28, 2010
SDL Support wrote: I'll leave you to draw your own conclusions from this, but I think it's clear from this example that Studio is not performing as you stated and only applies simple penalties for tags, in the same way as all the rest.

You selected the simplest formatting tags instead of numbers or more complex placeables. For these tags the analysis swindle is different.

SDL Support wrote: So for example in segment #3 we see a pair of formatting tags. This is a matching pair so we apply a single penalty point. In segment #5 the bookmark tags are two different tags (start and end) so we apply two penalty points.

This is another face of the flaw in the Trados algorithms. You use absolute values instead of weighted ones. In this way, once again you underestimate the translator's work necessary to make the changes in the segment. If you compare the results of your test to the well-known Levenshtein distance algorithm, i.e. the number of strokes (edits) needed to transform the source version proposed by the TM into the source version spotted in the segment, you'll see Trados 2009 always proposes scores that are too high, i.e. the translator is not paid as he should be if discounts apply. E.g.:
- for segment #3, you have 99% instead of approx. 94% (16/17);
- for segment #5, you have 98% instead of approx. 88% (16/18).
I counted one tag (or paired tag) as one stroke, just like you. Of course, the results for the longer sentences will be closer to reality. Although the Levenshtein algorithm is the best way to evaluate the difference between strings, it needs a damn lot of calculations, which is why it is not used in the real CAT world. Nonetheless, it should be used as a reference for the simplified/approximate algorithms.

SDL Support wrote: I'm also aware this is a very simplified example (...)

The simple examples are beautiful.

PS I hope I didn't make a mistake when counting letters, I'm really poor above 3. Cheers GG

You can't beat good logic | Oct 29, 2010
Grzegorz: Your solid logic is unbeatable. Paul: Bad logic. Conclusion: "Swindle" stands.
Hi GG, I'm completely ignorant as regards maths, so I thought I would ask you for an explanation. I entered two sentences in a Levenshtein distance calculator: "The cat is black." and "The acata is black." The two "a's" are here to substitute for tags from Paul's example. The Levenshtein distance is 2, i.e. a 98% match. Why do you think it should be 88%? Dividing the number of characters in the first sentence by the number of characters in the second sentence seems to me... uhm, different from the calculation of the Levenshtein distance. I'm not arguing that it is wrong, but it's different. It may well only be my ignorance that leads me to my conclusion, but currently I simply think you're mixing apples with pears. I'd be interested to hear why you suggest/prefer the division method over the Levenshtein algorithm.

Weighted value | Oct 29, 2010
Stanislav Pokorny wrote: I entered two sentences in a Levenshtein distance calculator: "The cat is black." and "The acata is black." The two "a's" are here to substitute for tags from Paul's example. The Levenshtein distance is 2, i.e. a 98% match. Why do you think it should be 88%?

You should use a formula like: match level = (target length - Levenshtein distance) / target length. If you have 19 letters in the target sentence, 2 letters are not 2%.

Stanislav Pokorny wrote: Dividing the number of characters in the first sentence by the number of characters in the second sentence seems to me... uhm, different from the calculation of the Levenshtein distance.

It's a weighted value. The problem with "pure" Levenshtein distance is that, let's say, the distance between "Cat" and "Cat(tag)" is 1, just as it is between two 100-word sentences with only one different tag. A hundred "Cat/Cat(tag)"-like sentences give you 100 tags to change/add/delete, while one 100-word sentence needs only one stroke. So you must consider the sentence/segment length to make them comparable.

Stanislav Pokorny wrote: I'm not arguing that it is wrong, but it's different. It may well only be my ignorance that leads me to my conclusion, but currently I simply think you're mixing apples with pears. I'd be interested to hear why you suggest/prefer the division method over the Levenshtein algorithm.

As above. I'm not very precise, but I suppose the main line is clear. Of course, the results for long sentences are rather OK, but the short-sentence matching level is obviously wrong. As you can see in Paul's table, it's not only a Trados problem. BTW, the classic Trados wordcount was rather OK many years ago, but today it's clearly obsolete as it doesn't take tags etc. into account, at least in a transparent, explicit way. That's why standards like GMX-V are being proposed: http://www.lisa.org/fileadmin/standards/GMX-V.html But almost nobody cares. Nobody wants to pay us per tag.
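GG's formula, match level = (target length - Levenshtein distance) / target length, can be sketched in a few lines of Python. A rough illustration only: counting each word or tag as a single edit unit is my own assumption (GG counted letters), so the percentages are illustrative rather than a reproduction of any tool's scoring.

```python
def levenshtein(a, b):
    """Edit distance between two sequences of units (words, tags, letters...).

    Classic dynamic-programming version: O(len(a) * len(b)) cells, which is
    the "damn lot of calculations" GG mentions for long segments.
    """
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete everything
    for j in range(n + 1):
        d[0][j] = j  # insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def match_level(tm_units, seg_units):
    """GG's weighted score: (target length - distance) / target length."""
    dist = levenshtein(tm_units, seg_units)
    return (len(seg_units) - dist) / len(seg_units)

# One extra tag in a short segment costs a lot...
short = match_level(["The", "cat", "is", "black", "."],
                    ["The", "<tag>", "cat", "is", "black", "."])  # 5/6, about 83%
# ...while the same single-tag edit in a 100-word sentence barely registers.
long_src = ["word%d" % i for i in range(100)]
long = match_level(long_src, long_src + ["<tag>"])                # 100/101, about 99%
```

This is the normalisation point GG makes: a raw distance of 1 represents very different amounts of editing work depending on segment length, so the distance has to be divided by the target length before it can serve as a match percentage.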
Probably we should start to deliver translations without tags, saying "as it doesn't cost a penny, DIY". Cheers GG
Hi GG, thank you for your explanation; now it makes more sense (even to me).

But tags are accounted for, aren't they? | Oct 30, 2010
Grzegorz Gryc wrote: As you can see in Paul's table, it's not only a Trados problem. BTW, the classic Trados wordcount was rather OK many years ago, but today it's clearly obsolete as it doesn't take tags etc. into account, at least in a transparent, explicit way. That's why standards like GMX-V are being proposed: http://www.lisa.org/fileadmin/standards/GMX-V.html But almost nobody cares. Nobody wants to pay us per tag. Probably we should start to deliver translations without tags, saying "as it doesn't cost a penny, DIY".

Very interesting discussion. Two things, though. In Trados 8 and previous versions, you could allocate a higher penalty to placeables. This made tag-loaded files more expensive to translate than clean files. Another nice feature of the old Trados analysis is that, together with the word and sentence counts, you got a count of placeables. You could very easily calculate the average number of tags per sentence and take it into account when preparing your invoice. The technology was there. Whether people used it to make realistic effort estimates is another matter. Aren't these two features present in Studio? Daniel
But those penalties can indeed work! | Oct 30, 2010 |
Hi again, Congratulations on your very interesting tests and developments. As a general rule, documents full of short phrases need significantly more work and are underestimated with a standard word rate. At the least, this behavior of Trados with tags does not seem to alleviate that bias. Concerning my particular problem of changing small phrases in related projects, I finally figured out something very basic: the filter setting can help! I think I had forgotten the true use of those filters somewhere along the way (Trados is not my preferred tool) and had naively come to think that applying filters would just filter out the TUs that have different fields. This of course was not what I wanted, so I did not activate the filters. Today I read the fine manual and realized that the filters are (please tell me if I'm wrong) just a precondition for the text/attribute penalties to be effective. So now at least I can trust the 100% matches, even if there are far fewer of them... AO

Increasing Penalties | Oct 31, 2010
Daniel García wrote: Very interesting discussion. Two things, though. In Trados 8 and previous versions, you could allocate a higher penalty to placeables. This made tag-loaded files more expensive to translate than clean files. Another nice feature of the old Trados analysis is that, together with the word and sentence counts, you got a count of placeables. You could very easily calculate the average number of tags per sentence and take it into account when preparing your invoice. The technology was there. Whether people used it to make realistic effort estimates is another matter. Aren't these two features present in Studio? Daniel

Hi Daniel, Yes, all these features are there in Studio. This part of the discussion is of course based on the defaults, so it is perfectly possible to change them to reflect whatever you think is more appropriate. You can also report separately on placeables and tags in the analysis (this is all calculated by default). The problem is always whether you agree with the weightings applied by others. Regards Paul
Grzegorz Gryc wrote: You selected the simplest formatting tags instead of numbers or more complex placeables. For these tags the analysis swindle is different. Cheers GG

Hi Grzegorz, As I'm waiting for a flight, I thought I'd take another look at some of these posts. This quote is interesting because numbers, dates etc. are all autolocalised, so they shouldn't really be a problem at all. You can of course apply an increased penalty for autolocalisation, or even switch it off, so the options are probably there for this too if you feel the default settings just aren't adequate for your needs on a particular project with lots of tags. I wanted to see if I could create a few examples to look at this, but I'm struggling to see why I would complain about the defaults when they can be changed. The more I look at this, the more I don't see a swindle. Rather, a need to understand specific cases and how best to use the software to suit your needs, as you can only cater for the majority of situations with the default settings. Regards Paul